This paper was accepted at the workshop I Can't Believe It's Not Better! (ICBINB) at NeurIPS 2023.
Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to language modeling. However, these methods have yet to leverage pre-trained language models, despite their adaptability to various downstream tasks. In this work, we explore this gap and find that pre-trained language models offer limited help in auto-regressive text-to-image generation. We provide a two-fold explanation by analyzing tokens from each modality. First, we demonstrate that image tokens possess significantly different semantics compared to text tokens, rendering pre-trained language models no more effective in modeling them than randomly initialized ones. Second, the text tokens in image-text datasets are too simple compared to typical language model pre-training data, so even a small randomly initialized language model achieves the same perplexity as larger pre-trained ones, which causes catastrophic degradation of the language models' capability.
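The perplexity comparison in the second point can be made concrete with a minimal sketch: perplexity is the exponential of the average negative log-likelihood a model assigns to each token. The per-token log-probabilities below are purely illustrative (not from the paper's experiments); they show how a small and a large model can yield near-identical perplexity on simple caption-style text.

```python
import math

def perplexity(token_log_probs):
    # Perplexity = exp of the average negative log-likelihood per token.
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Hypothetical per-token log-probabilities on a short, simple caption.
# On such easy text, a small randomly initialized model and a large
# pre-trained one may assign similar likelihoods.
small_model_lp = [-1.2, -0.9, -1.1, -1.0]
large_model_lp = [-1.1, -1.0, -1.2, -0.9]

print(round(perplexity(small_model_lp), 3))  # ≈ 2.858
print(round(perplexity(large_model_lp), 3))  # ≈ 2.858 (same average NLL)
```

Because both lists have the same average negative log-likelihood (1.05), their perplexities coincide, which is the abstract's point: on overly simple text, model scale stops mattering for this metric.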