Methodology

How we evaluate AI models on real-world creative work, shaped by Contra’s 1M+ global members.

All categories

Design

Writing

Marketing

Social Media

Engineering

Video & Animation

Music & Audio

Top skills

Graphic Designer: 14.84%

Web Developer: 10.55%

Web Designer: 10.1%

UI Designer: 9.04%

Brand Designer: 8.46%

Video Editor: 8.29%

Top tools

Adobe Suite: 48.222%

Figma: 27.034%

Canva: 18.253%

WordPress: 11.748%

React: 9.298%

JavaScript: 8.22%

Overview

Creative Arena compares multiple image generation models on tasks that mirror real paid projects commissioned on Contra. We convert anonymized deliverables into prompts, run controlled tournaments with four models at a time, and update overall and per-category Elo ratings after every battle.

Real-world grounding

Prompts originate from anonymized client projects

Controlled bracket

Fixed six-battle mini-tournaments yield a full 1st–4th ordering

Continuous ratings

Elo starts at 1500 with K=32; updates occur every battle

Categories

We currently evaluate models across the following practical design categories:

Landing Page

Ad

UI Component

Data Visualization

Moodboard

Logo

Video

Categories reflect typical client deliverables on Contra. Results are reported both overall and per category.

Data sourcing & prompt generation

1

Collect deliverables

We sample deliverables from real, completed Contra projects.

2

Anonymize & sanitize

We remove personally identifiable information (PII), trademarks, and any client-specific terms that would reveal identity or confidential details.

3

Category classification

An LLM-assisted classifier maps each deliverable to one of the Arena categories.

4

Prompt drafting

From the anonymized deliverable, we generate a prompt that captures the intent, constraints, and style of the original request while remaining generic and safe.

5

Image generation

An image is generated for the given prompt for each active model.
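
As a rough illustration of the anonymize-and-sanitize step, the sketch below redacts a few obvious identifiers from a project brief before a prompt is drafted from it. It is not Contra's actual pipeline: the function name `sanitize_brief`, the two regular expressions, the `blocked_terms` argument, and the placeholder tokens are all assumptions made for this example, and real PII or trademark detection would need considerably more than this.

```python
import re

# Hypothetical patterns for illustration; real PII detection needs far more.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"https?://\S+")


def sanitize_brief(text: str, blocked_terms: list[str]) -> str:
    """Redact emails, URLs, and caller-supplied client-specific terms."""
    text = EMAIL_RE.sub("[email]", text)
    text = URL_RE.sub("[url]", text)
    for term in blocked_terms:
        # Whole-word, case-insensitive match so "Acme" also hits "ACME".
        pattern = rf"\b{re.escape(term)}\b"
        text = re.sub(pattern, "[client]", text, flags=re.IGNORECASE)
    return text
```

For instance, `sanitize_brief("Email jo@acme.io about the Acme launch", ["Acme"])` would replace the address with `[email]` and the brand name with `[client]`, leaving the rest of the brief intact.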

Example tournament format (4 models, 6 battles)

Category selection

A user selects a category (or one is randomly selected).

Prompt selection

A pre-generated category prompt is selected at random.

Model sampling

Four distinct models are chosen from the active pool.
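
The three selection steps above could be sketched as follows. This is a minimal illustration under stated assumptions: `pick_tournament`, the `prompts_by_category` mapping, and the optional `rng` argument are hypothetical names, and uniform random choice is assumed wherever the text says "at random".

```python
import random


def pick_tournament(prompts_by_category, active_models, category=None, rng=None):
    """Pick a category, one of its prompts, and four distinct models."""
    rng = rng or random.Random()
    if category is None:
        # No user choice: fall back to a random category.
        category = rng.choice(sorted(prompts_by_category))
    prompt = rng.choice(prompts_by_category[category])
    # sample() draws without replacement, so the four models are distinct.
    models = rng.sample(active_models, 4)
    return category, prompt, models
```

Passing a seeded `random.Random` makes a draw reproducible, which can help when auditing a specific tournament. `sample()` raises `ValueError` if fewer than four models are active.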

Initial battles

Battle 1

Model A

vs

Model B

Battle 2

Model C

vs

Model D

Winner & loser brackets

Battle 3

Winners

Model A

vs

Model D

Battle 4

Losers

Model B

vs

Model C

Tiebreaker

Battle 5

1 win each

Model B

vs

Model D

Battle 6

2 wins each

Model A

vs

Model B

Final ranking

1st

Model A

2nd

Model B

3rd

Model C

4th

Model D
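
The six-battle pairing order can be sketched in a few lines. The pairings follow the format shown above (two initial battles, a winners and a losers battle, then two tiebreakers). The placement rule in the last step is an assumption of this sketch, since the page does not state how the final ordering is derived; `run_bracket` and the `judge` callback are hypothetical names.

```python
from collections import Counter
from typing import Callable


def run_bracket(models: list[str], judge: Callable[[str, str], str]):
    """Play the six-battle format; judge(x, y) returns the winning model."""
    a, b, c, d = models
    wins: Counter = Counter()
    battles = []

    def play(x: str, y: str):
        winner = judge(x, y)
        loser = y if winner == x else x
        wins[winner] += 1
        battles.append((x, y, winner))
        return winner, loser

    w1, l1 = play(a, b)      # Battle 1
    w2, l2 = play(c, d)      # Battle 2
    w3, l3 = play(w1, w2)    # Battle 3: winners bracket
    w4, l4 = play(l1, l2)    # Battle 4: losers bracket
    w5, l5 = play(w4, l3)    # Battle 5: the two models with one win each
    w6, l6 = play(w3, w5)    # Battle 6: the two models with two wins each
    # Assumed placement: 1st/2nd from Battle 6, 3rd from Battle 5, 4th from Battle 4.
    return [w6, l6, l5, l4], battles, wins
```

Each model plays at least two and at most three battles, and every battle after the first two pairs models with the same number of wins so far.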

Fairness & bias controls

Left/Right randomization

Each battle randomizes side assignment.

Blind judging

No model names, vendors, prompts, or metadata are shown to judges.

Prompt hygiene

Prompts are anonymized, policy-compliant, and category-consistent.

Balanced exposure

The scheduler ensures broad coverage across models and pairings over time.

Audit sampling

A subset of matches is reviewed by humans for quality control.
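
Two of the controls above lend themselves to a short sketch: side randomization with a blind, judge-facing payload, and a simple exposure-balancing pick. Neither is described in enough detail on this page to reproduce exactly, so treat the names `build_ballot` and `least_exposed`, the payload fields, and the fewest-appearances policy as assumptions.

```python
import random


def build_ballot(battle_id, model_x, model_y, images, rng=None):
    """Return (what the judge sees, private key mapping sides to models)."""
    rng = rng or random.Random()
    pair = [model_x, model_y]
    rng.shuffle(pair)  # randomize which model lands on the left
    left, right = pair
    # Judge-facing payload: images only, no model, vendor, or prompt fields.
    ballot = {"battle_id": battle_id, "left": images[left], "right": images[right]}
    # Kept private so a vote for "left" or "right" can be resolved later.
    key = {"left": left, "right": right}
    return ballot, key


def least_exposed(models, appearances, k=4, rng=None):
    """Pick k models, favouring those with the fewest appearances so far."""
    rng = rng or random.Random()
    shuffled = list(models)
    rng.shuffle(shuffled)  # random tie-break among equally exposed models
    # sorted() is stable, so ties keep their shuffled order.
    return sorted(shuffled, key=lambda m: appearances.get(m, 0))[:k]
```

Keeping the side-to-model key separate from the ballot means the judging view never has to carry model identity at all.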

Ratings (Elo)

We maintain two Elo ratings per model: an overall Elo and a per-category Elo. All models start at 1500. After every battle, we apply an update with K = 32.

1500

Starting Elo rating

K = 32

Update factor after each battle
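
A minimal sketch of such an update is shown below, using the starting rating and K value stated above. The page does not give the expected-score formula, so the standard logistic Elo curve with a 400-point scale is assumed here; the `Ratings` class and function names are hypothetical.

```python
from collections import defaultdict

START_ELO = 1500.0
K = 32.0


def expected_score(r_a: float, r_b: float) -> float:
    # Standard logistic Elo curve; the 400-point scale is an assumption.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update(r_winner: float, r_loser: float) -> tuple[float, float]:
    """Return the new (winner, loser) ratings after one battle."""
    delta = K * (1.0 - expected_score(r_winner, r_loser))
    return r_winner + delta, r_loser - delta


class Ratings:
    """Tracks an overall rating plus one rating per category for each model."""

    def __init__(self):
        self.overall = defaultdict(lambda: START_ELO)
        self.by_category = defaultdict(lambda: defaultdict(lambda: START_ELO))

    def record(self, winner: str, loser: str, category: str) -> None:
        # Apply the same battle result to both rating tables.
        for table in (self.overall, self.by_category[category]):
            table[winner], table[loser] = update(table[winner], table[loser])
```

Under these assumptions, when two models that both sit at 1500 meet, the expected score is 0.5, so the winner moves to 1516 and the loser to 1484. An upset against a higher-rated model moves both ratings by more than 16 points, and an expected win by less.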

Base prompts

Base prompts are defined for each category and combined with the custom or pre-generated prompts.

Landing Page

Ad

UI Component

Data Visualization

Moodboard

Logo

Video

You are an expert web developer tasked with building a website. Follow these requirements:

  • Generate a complete and valid HTML document with DOCTYPE and meta tags.

  • Return raw HTML that can be used directly without any additional processing.

  • Use inline vanilla CSS and JavaScript where possible.

  • When an external dependency is needed, use UNPKG.

  • Use semantic HTML elements (nav, main, section, article, etc.).

  • Be accessible, with professional design and good contrast.

  • Generate mobile-first responsive design using modern CSS techniques (e.g., Grid/Flexbox).

  • Write clean, readable code with proper spacing.

  • Your only output should be a markdown code block.

Example output:

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Page Title</title>
  <script src="https://unpkg.com/chart.js"></script>
  <style>
    body {
      font-family: sans-serif;
      margin: 0;
    }
  </style>
</head>
<body>
  <h1>Hello, World!</h1>
  <p>This is an HTML page.</p>
</body>
</html>
```
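
One plausible way to use a base prompt like this is sketched below: join it with the task prompt before the model call, then pull the HTML back out of the fenced block the model returns. The page does not say how the two prompts are combined or how responses are parsed, so plain concatenation and first-fence extraction are assumptions, as are the names `compose_prompt` and `extract_html`.

```python
import re

FENCE = "`" * 3  # a markdown code fence, built this way to keep the sketch readable
# Matches the first fenced block, with or without an "html" language tag.
FENCE_RE = re.compile(FENCE + r"(?:html)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def compose_prompt(base_prompt: str, task_prompt: str) -> str:
    """Join the category base prompt with the task-specific prompt."""
    return f"{base_prompt.strip()}\n\n{task_prompt.strip()}"


def extract_html(response: str) -> str:
    """Return the code inside the first fenced block, else the raw response."""
    match = FENCE_RE.search(response)
    return (match.group(1) if match else response).strip()
```

Falling back to the raw response covers models that ignore the code-block instruction and return bare HTML.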

FAQs

What makes Contra's Creative Arena different?

Prompts start from anonymized deliverables of real, paid client projects on Contra, keeping tasks practical and grounded.

How are models selected for tournaments?

Four distinct models are sampled from the active pool. Side assignment (left/right) is randomized every battle.

Do ratings change after every battle?

Yes. Elo ratings update after each individual battle. We maintain both overall and per-category ratings.

Questions about this methodology?
