Transformers have demonstrated impressive performance on class-conditional ImageNet benchmarks, achieving state-of-the-art FID scores. However, their computational complexity increases with transformer depth/width or the number of input tokens, and they require patchy approximations to operate on even latent input sequences. In this paper, we address these issues by presenting a novel approach to improve the efficiency and scalability of image generation models, incorporating state space models (SSMs) as the core component and deviating from the widely adopted transformer-based and U-Net architectures. We introduce a class of SSM-based models that significantly reduce forward-pass complexity while maintaining comparable performance, and that operate on exact input sequences without patchy approximations. Through extensive experiments and rigorous evaluation, we demonstrate that our proposed approach reduces the Gflops used by the model without sacrificing the quality of generated images. Our findings suggest that state space models can be an effective alternative to attention mechanisms in transformer-based architectures, offering a more efficient solution for large-scale image generation tasks.
![Figure 2](https://mlr.cdn-apple.com/media/figure2_3659d18279.png)
![Figure 3](https://mlr.cdn-apple.com/media/figure3_3977eb63a3.jpg)
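To make the architectural idea concrete, the following is a minimal sketch, under stated assumptions, of a transformer-style block in which the self-attention sublayer is swapped for a simple diagonal state space (linear-recurrence) layer that runs in time linear in sequence length and consumes the full latent sequence without patchification. The names `SimpleSSM` and `SSMBlock` are hypothetical and this is not the paper's implementation; it only illustrates the kind of attention-free core component described above.

```python
# Minimal sketch (assumption, not the paper's architecture): a transformer-style
# block where self-attention is replaced by a diagonal linear state space layer.
import torch
import torch.nn as nn


class SimpleSSM(nn.Module):
    """Diagonal linear state space layer: O(L) in sequence length,
    versus the O(L^2) cost of self-attention."""

    def __init__(self, dim: int, state_dim: int = 16):
        super().__init__()
        # Learnable per-channel decay (kept in (0, 1) via sigmoid) and I/O maps.
        self.log_decay = nn.Parameter(torch.randn(dim, state_dim))
        self.in_proj = nn.Linear(dim, dim * state_dim)
        self.out_proj = nn.Linear(dim * state_dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, dim) -- the full latent sequence, no patchification.
        b, l, d = x.shape
        decay = torch.sigmoid(self.log_decay)           # (dim, state_dim)
        u = self.in_proj(x).view(b, l, d, -1)           # (b, l, dim, state_dim)
        h = torch.zeros(b, d, u.shape[-1], device=x.device, dtype=x.dtype)
        outs = []
        for t in range(l):                              # linear-time recurrence
            h = decay * h + u[:, t]
            outs.append(h.reshape(b, -1))
        y = torch.stack(outs, dim=1)                    # (b, l, dim*state_dim)
        return self.out_proj(y)


class SSMBlock(nn.Module):
    """Residual block with the attention sublayer swapped for an SSM layer."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.ssm = SimpleSSM(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.ssm(self.norm1(x))
        return x + self.mlp(self.norm2(x))


if __name__ == "__main__":
    block = SSMBlock(dim=64)
    latents = torch.randn(2, 1024, 64)  # e.g. a flattened 32x32 latent grid
    print(block(latents).shape)         # torch.Size([2, 1024, 64])
```

Because the recurrence touches each token once, the cost of this sublayer grows linearly with the number of latent tokens, which is the property that motivates replacing attention for long, unpatchified latent sequences.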