How we ran the benchmark
We wrote one prompt, a barista with freckles making coffee in a sunlit cafe, and ran it on four models without changing a word. Then we viewed every output at full size and noted cost, resolution, and where each model was strongest. We did not cherry-pick across multiple tries.
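
For readers who want to repeat the procedure, here is a minimal sketch of the same protocol: one fixed prompt, one attempt per model, and a record for each result. Everything named here is an assumption for illustration. The model names, `placeholder_generator`, and the `Result` fields are hypothetical, the prompt string is a paraphrase of the description above, and the generator is a stand-in that would need to be swapped for a real image API call.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional

# Paraphrase of the prompt described above; the exact wording is an assumption.
PROMPT = "a barista with freckles making coffee in a sunlit cafe"


@dataclass
class Result:
    model: str
    prompt: str
    width: int
    height: int
    cost_usd: Optional[float] = None  # filled in by hand from provider pricing
    strengths: str = ""               # filled in after reviewing at full size


def run_benchmark(
    generators: dict[str, Callable[[str], tuple[int, int]]],
    prompt: str = PROMPT,
) -> list[Result]:
    """Call each generator exactly once with the identical prompt."""
    results = []
    for name, generate in generators.items():
        width, height = generate(prompt)  # one attempt per model, no retries
        results.append(Result(name, prompt, width, height))
    return results


def placeholder_generator(prompt: str) -> tuple[int, int]:
    """Stand-in for a real image API call; returns a made-up image size."""
    return (1024, 1024)


# Hypothetical model names, not the four models from the benchmark.
MODELS = {f"model-{c}": placeholder_generator for c in "abcd"}
results = run_benchmark(MODELS)
```

Keeping the loop to a single call per model is what rules out cherry-picking: there is no second output to choose from.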
