Artificial Pretraining of Masked Language Models
According to Chinchilla scaling laws for Transformers [1], data may soon become a bottleneck for training very large English language models. The vast majority of languages already lack enough data to train even moderately sized networks. So, over the last few months, we experimented with the following question: what if we could pretrain Masked Language Models (MLMs, think BERT or RoBERTa) on nothing but artificially created data, and then continue pretraining on languages with very few resources?...
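To make the two-stage recipe concrete, here is a minimal sketch of what "continue pretraining" could look like with Hugging Face Transformers. The checkpoint path, corpus file, and hyperparameters are placeholders for illustration, not the actual artifacts or settings from our experiments.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Stage 1 output: an MLM pretrained purely on artificial text (placeholder path).
checkpoint = "path/to/mlm-pretrained-on-artificial-data"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Stage 2: continue masked-language-model pretraining on a small real corpus
# from the low-resource language (placeholder file).
raw = load_dataset("text", data_files={"train": "low_resource_corpus.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Standard 15% dynamic masking, as in BERT/RoBERTa.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="continued-pretraining", num_train_epochs=3),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```

The only thing that changes between the two stages is the corpus fed to the same MLM objective, which is what makes the question above cheap to test once a synthetically pretrained checkpoint exists.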