Methodology
How we evaluate AI models on real-world creative work, shaped by Contra’s 1M+ global members.
Overview
Creative Arena compares multiple image generation models on tasks that mirror real paid projects commissioned on Contra. We convert anonymized deliverables into prompts, run controlled tournaments with four models at a time, and update overall and per-category Elo ratings after every battle.
Real-world grounding
Prompts originate from anonymized client projects
Controlled bracket
Fixed six-battle mini-tournaments yield a full 1st–4th ordering
Continuous ratings
Elo starts at 1500 with K=32; updates occur every battle
Categories
We currently evaluate models across the following practical design categories:
Landing Page
Ad
UI Component
Data Visualization
Moodboard
Logo
Video
Categories reflect typical client deliverables on Contra. Results are reported both overall and per category.
Data sourcing & prompt generation
1
Collect deliverables
We sample deliverables from real, completed Contra projects.
2
Anonymize & sanitize
We remove personally identifiable information (PII), trademarks, and any client-specific terms that would reveal identity or confidential details.
3
Category classification
Deliverables are run through a classifier (LLM-assisted) to map to one of the Arena categories.
4
Prompt drafting
From the anonymized deliverable, we generate a prompt that captures the intent, constraints, and style of the original request while remaining generic and safe.
5
Image generation
Each active model generates an image for the resulting prompt.
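The five steps above can be sketched end to end. Everything in this sketch is a toy stand-in: the regex covers only one kind of PII, the keyword rule replaces the LLM-assisted classifier, and the prompt template is illustrative, not the production format.

```python
import re

def sanitize(text: str) -> str:
    """Step 2 (toy stand-in): redact e-mail addresses as one example of PII.
    Production sanitization also covers trademarks and client-specific terms."""
    return re.sub(r"\S+@\S+", "[REDACTED]", text)

def classify(text: str) -> str:
    """Step 3 (toy stand-in): a keyword rule in place of the LLM-assisted classifier."""
    return "Logo" if "logo" in text.lower() else "Landing Page"

def draft_prompt(text: str, category: str) -> str:
    """Step 4: a generic prompt that keeps intent and style but no identifying details."""
    return f"[{category}] {text}"

deliverable = "Minimalist logo for a coffee brand, contact jane@example.com"
prompt = draft_prompt(sanitize(deliverable), classify(deliverable))
# Step 5 would then send `prompt` to every active model for image generation.
```

The key property is that sanitization happens before prompt drafting, so no client detail can leak into what the models see.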
Example tournament format (4 models, 6 battles)
Category selection
A user selects a category (or one is randomly selected).
Prompt selection
A pre-generated category prompt is selected at random.
Model sampling
Four distinct models are chosen from the active pool.
Initial battles
Battle 1
Model A
vs
Model B
Battle 2
Model C
vs
Model D
Winner & loser brackets
Battle 3
Winners
Model A
vs
Model D
Battle 4
Losers
Model B
vs
Model C
Tiebreaker
Battle 5
1 win each
Model B
vs
Model D
Battle 6
2 wins each
Model A
vs
Model B
Final ranking
1st
Model A

2nd
Model B
3rd
Model D
4th
Model C
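The bracket above can be sketched as a single function. Here `beats(x, y)` is a stand-in for the human judge's vote, and we assume one reading of the bracket: the loser of Battle 4 places 4th and the loser of Battle 5 places 3rd.

```python
def run_tournament(models, beats):
    """Run the six-battle mini-tournament for four models and return
    the full 1st-4th ordering. `beats(x, y)` returns the winner of a
    single battle (a stand-in for the human judge's vote)."""
    a, b, c, d = models
    w1, l1 = (a, b) if beats(a, b) == a else (b, a)         # Battle 1
    w2, l2 = (c, d) if beats(c, d) == c else (d, c)         # Battle 2
    w3, l3 = (w1, w2) if beats(w1, w2) == w1 else (w2, w1)  # Battle 3: winners
    w4, l4 = (l1, l2) if beats(l1, l2) == l1 else (l2, l1)  # Battle 4: losers
    w5, l5 = (w4, l3) if beats(w4, l3) == w4 else (l3, w4)  # Battle 5: tiebreaker
    w6, l6 = (w3, w5) if beats(w3, w5) == w3 else (w5, w3)  # Battle 6: final
    return [w6, l6, l5, l4]  # 1st, 2nd, 3rd, 4th

order = ["A", "B", "C", "D"]  # true strength, strongest first
def stronger(x, y):
    return x if order.index(x) < order.index(y) else y

ranking = run_tournament(["B", "D", "A", "C"], stronger)  # -> ['A', 'B', 'C', 'D']
```

With a transitive "stronger always wins" judge, the format recovers the true ordering regardless of how the four models are seeded into the opening battles.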
Fairness & bias controls
Left/Right randomization
Each battle randomizes side assignment.
Blind judging
No model names, vendors, prompts, or metadata are shown to judges.
Prompt hygiene
Prompts are anonymized, policy-compliant, and category-consistent.
Balanced exposure
Scheduler ensures broad coverage across models and pairings over time.
Audit sampling
A subset of matches is reviewed by humans for quality control.
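The first two controls, side randomization and blind judging, can be illustrated in a few lines. The field names here are illustrative; the real record schema is not public.

```python
import random

def blind_pair(image_a: dict, image_b: dict) -> list[dict]:
    """Prepare two generations for a judge: randomize side assignment,
    then strip everything except the image itself (hypothetical fields)."""
    pair = [image_a, image_b]
    random.shuffle(pair)  # Left/Right randomization
    # Blind judging: drop model name, vendor, prompt, and other metadata.
    return [{"image_url": img["image_url"]} for img in pair]

shown = blind_pair(
    {"image_url": "a.png", "model": "model-x", "vendor": "acme"},
    {"image_url": "b.png", "model": "model-y", "vendor": "beta"},
)
```

Whatever order `shown` ends up in, the judge sees only the two images, so neither position nor provenance can bias the vote.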
Ratings (Elo)
We maintain an overall Elo rating plus a per-category Elo rating for each model. All models start at 1500, and after every battle we apply a standard Elo update with K = 32.
1500
Starting Elo rating
K = 32
K-factor applied to every battle's rating update
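With these parameters, each battle applies the standard Elo update. A minimal Python sketch (the function names are ours):

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(winner: float, loser: float, k: float = 32.0) -> tuple[float, float]:
    """Apply one battle's update to the winner's and loser's ratings."""
    e = expected_score(winner, loser)
    return winner + k * (1 - e), loser - k * (1 - e)

# Two fresh models at 1500: the winner gains exactly K/2 = 16 points.
w, l = update_elo(1500.0, 1500.0)  # -> (1516.0, 1484.0)
```

Each of the six battles in a mini-tournament triggers this update for both the overall rating and the rating in the battle's category.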
Base prompts
A base prompt is defined for each category and combined with the custom or pre-generated prompt.
Landing Page
Ad
UI component
Data visualization
Moodboard
Logo
Video
You are an expert web developer tasked with building a website. Follow these requirements:
Generate a complete and valid HTML document with DOCTYPE and meta tags.
Return raw HTML that can be used directly without any additional processing.
Use inline vanilla CSS and JavaScript where possible.
When an external dependency is needed, use UNPKG.
Use semantic HTML elements (nav, main, section, article, etc.).
Be accessible, with professional design and good contrast.
Generate mobile-first responsive design using modern CSS techniques (e.g., Grid/Flexbox).
Write clean, readable code with proper spacing.
Your only output should be a markdown code block.
Example output:
```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Page Title</title>
  <script src="https://unpkg.com/chart.js"></script>
  <style>
    body {
      font-family: sans-serif;
      margin: 0;
    }
  </style>
</head>
<body>
  <h1>Hello, World!</h1>
  <p>This is an HTML page.</p>
</body>
</html>
```