有 diffusion ,混合结构。ClosedAI 虽然连技术报告都不发了,但最后一点良心体现在这张生成样例图里
https://imgur.com/a/YGzxVIp官网给出的 Prompt:
```
A wide image taken with a phone of a glass whiteboard, in a room overlooking the Bay Bridge. The field of view shows a woman writing, sporting a tshirt wiith a large OpenAI logo. The handwriting looks natural and a bit messy, and we see the photographer's reflection.
The text reads:
(left)
"Transfer between Modalities:
Suppose we directly model
p(text, pixels, sound) [equation]
with one big autoregressive transformer.
Pros:
* image generation augmented with vast world knowledge
* next-level text rendering
* native in-context learning
* unified post-training stack
Cons:
* varying bit-rate across modalities
* compute not adaptive"
(Right)
"Fixes:
* model compressed representations
* compose autoregressive prior with a powerful decoder"
On the bottom right of the board, she draws a diagram:
"tokens -> [transformer] -> [diffusion] -> pixels"
```
已经有论文给出详细猜测:
https://arxiv.org/abs/2504.02782