English Benchmarks Are Not Enough

Generated from batch: maa-preview-001

AAA.win tested 6 AI agents on 12 multilingual business tasks across 4 languages. This preview report is generated from structured run data and should be edited before publication.

Executive Summary

Overall Leaderboard

Rank | Agent | Score | Pass Rate | Critical Failure Rate | Cost Tier
1 | Claude Main | 87 | 94% | 8% | premium
2 | OpenAI Main | 86 | 94% | 8% | premium
3 | Qwen Main | 84 | 92% | 11% | standard
4 | Gemini Main | 80 | 86% | 8% | standard
5 | DeepSeek Main | 79 | 67% | 8% | low
6 | Grok Main | 75 | 42% | 33% | standard
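
The report does not say how the leaderboard columns are derived from the structured run data. The sketch below shows one plausible aggregation: mean score, share of passing runs, and share of runs with a critical failure, per agent. The `Run` record, its field names, and the sample values are all hypothetical and are not taken from batch maa-preview-001.

```python
from dataclasses import dataclass


@dataclass
class Run:
    # Hypothetical record shape; the real run schema is not shown in this report.
    agent: str
    score: float
    passed: bool
    critical_failure: bool


def summarize(runs):
    """Aggregate per-run records into ranked leaderboard rows."""
    by_agent = {}
    for run in runs:
        by_agent.setdefault(run.agent, []).append(run)

    rows = []
    for agent, agent_runs in by_agent.items():
        n = len(agent_runs)
        rows.append({
            "agent": agent,
            "score": round(sum(r.score for r in agent_runs) / n),
            "pass_rate": round(100 * sum(r.passed for r in agent_runs) / n),
            "critical_failure_rate": round(
                100 * sum(r.critical_failure for r in agent_runs) / n
            ),
        })

    # Rank by score, highest first (assumed ordering rule).
    rows.sort(key=lambda row: row["score"], reverse=True)
    for rank, row in enumerate(rows, start=1):
        row["rank"] = rank
    return rows


# Illustrative data only, not real benchmark results.
runs = [
    Run("agent_a", 90, True, False),
    Run("agent_a", 80, True, False),
    Run("agent_b", 70, True, False),
    Run("agent_b", 40, False, True),
]
leaderboard = summarize(runs)
```

Ranking on score alone is an assumption. Pass rate and critical failure rate tell a different story for some agents in the table above, so a published methodology should state which column drives the rank.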

Language Winners

Language | Winner | Score | Critical Failure Rate
Chinese | Qwen Main | 89 | 0%
English | OpenAI Main | 93 | 0%
Japanese | Claude Main | 87 | 11%
Spanish | Claude Main | 89 | 0%

Task Type Winners

Task Type | Winner | Score | Critical Failure Rate
Support | Claude Main | 90 | 17%
Writing / Localization | OpenAI Main | 89 | 0%
Extraction / Analysis | Qwen Main | 88 | 8%
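
Both winner tables above (by language and by task type) amount to picking the top-scoring agent within each group. A minimal sketch of that selection follows, assuming already-aggregated rows with hypothetical keys `agent`, `language`, `task_type`, and `score`. The sample values are invented for illustration.

```python
def winners_by(rows, key):
    """Return the highest-scoring row for each distinct value of `key`.

    Ties keep the first row seen; the report does not state a tie-break rule.
    """
    best = {}
    for row in rows:
        group = row[key]
        if group not in best or row["score"] > best[group]["score"]:
            best[group] = row
    return best


# Illustrative rows only, not real benchmark results.
rows = [
    {"agent": "agent_a", "language": "English", "task_type": "Support", "score": 91},
    {"agent": "agent_b", "language": "English", "task_type": "Support", "score": 88},
    {"agent": "agent_a", "language": "Spanish", "task_type": "Support", "score": 70},
    {"agent": "agent_b", "language": "Spanish", "task_type": "Support", "score": 84},
]

language_winners = winners_by(rows, "language")
for language, row in language_winners.items():
    print(language, row["agent"], row["score"])
```

The same helper works for task types by passing `"task_type"` as the key, provided the rows were aggregated per agent and task type first.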

Failure Modes

Failure Tag | Count
unsafe_refund_promise | 17
literal_translation | 16
weak_cta | 14
unsupported_claim | 10
invalid_json | 7
wrong_intent | 5
hallucinated_signing_date | 4
missing_field | 4
missed_dependency | 3
too_verbose | 3
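
A count table like the one above can be produced by tallying the failure tags attached to each run. The sketch below assumes each run is a dict with an optional `failure_tags` list. That field name and the sample records are hypothetical, since the run schema is not shown in this report.

```python
from collections import Counter


def tally_failure_tags(runs):
    """Count failure tags across runs, most frequent first."""
    counts = Counter()
    for run in runs:
        # Runs without the (assumed) "failure_tags" field contribute nothing.
        counts.update(run.get("failure_tags", []))
    return counts.most_common()


# Illustrative records only, not real benchmark results.
runs = [
    {"failure_tags": ["weak_cta", "invalid_json"]},
    {"failure_tags": ["weak_cta"]},
    {"failure_tags": []},
    {},
]

tag_counts = tally_failure_tags(runs)
for tag, count in tag_counts:
    print(tag, count)
```

Under this approach a run carrying several tags is counted once per tag, so the counts can sum to more than the number of failed runs.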

Task Results

Task | Language | Type | Winner | Score | Primary Risk
Chinese Customer Complaint Triage | Chinese | Support | Qwen Main | 85 | unsafe_refund_promise
Chinese App Review Pain Point Summary | Chinese | Writing / Localization | OpenAI Main | 89 | hallucinated_issue
Chinese Contract Field Extraction | Chinese | Extraction / Analysis | Qwen Main | 96 | hallucinated_signing_date
SaaS Landing Page Hero Rewrite | English | Writing / Localization | OpenAI Main | 93 | generic_ai_copy
Meeting Notes Action Item Extraction | English | Extraction / Analysis | OpenAI Main | 89 | discussion_as_action
Refund Policy Boundary Reply | English | Support | OpenAI Main | 96 | unsafe_refund_promise
Japanese Business Email Politeness Rewrite | Japanese | Writing / Localization | OpenAI Main | 85 | unnatural_japanese
Japanese Appointment Intent Classification | Japanese | Support | Claude Main | 92 | wrong_intent
Japanese Product Specification Extraction | Japanese | Extraction / Analysis | Qwen Main | 91 | hallucinated_material
Spanish Support Reply for Wrong Item | Spanish | Support | Claude Main | 89 | unsafe_refund_promise
Spanish Ad Headline Localization | Spanish | Writing / Localization | Claude Main | 92 | literal_translation
Spanish Order Confirmation Extraction | Spanish | Extraction / Analysis | Claude Main | 85 | wrong_date_format

Methodology Snapshot

Publication Notes