Archer

A Human-Labeled Text-to-SQL Dataset with Arithmetic, Commonsense and Hypothetical Reasoning

About Archer

Archer is a challenging bilingual text-to-SQL dataset focused on complex reasoning, covering arithmetic, commonsense, and hypothetical reasoning. It contains 1,042 English questions and 1,042 Chinese questions, paired with 521 unique SQL queries over 20 English databases spanning 20 domains. This leaderboard uses a different data split from the original paper for better evaluation: we further select 8 databases from the original train set to serve as test data. As a result, the train set now contains 8 databases, the dev set contains 2 databases, and the blind test set contains 10 databases.

Paper (Zheng et al., 2024)

Data Examples

Arithmetic Reasoning Example

How much higher is the maximum power of a BMW car than the maximum power of a Fiat car?

宝马汽车的最高功率比飞雅特汽车的最高功率高多少?

SELECT MAX(horsepower) - (SELECT MAX(horsepower) FROM cars_data A JOIN car_names B ON A.id=B.makeid WHERE B.model="fiat") AS diff FROM cars_data A JOIN car_names B ON A.id=B.makeid WHERE B.model="bmw"


Commonsense Reasoning Example

Which 4-cylinder car needs the most fuel to drive 300 miles? List how many gallons it needs, and its make and model.

开300英里耗油最多的四缸车的品牌和型号分别是什么,它需要多少加仑的油?

Commonsense Knowledge: Fuel used is calculated by dividing the distance driven by the fuel efficiency (miles per gallon).

SELECT B.Make, B.Model, 1.0 * 300 / mpg AS n_gallon FROM cars_data A JOIN car_names B ON A.Id=B.MakeId WHERE cylinders="4" ORDER BY mpg ASC LIMIT 1
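The commonsense step above is plain arithmetic: gallons needed equal distance divided by miles-per-gallon. A minimal sketch in Python, with the mpg value chosen purely for illustration:

```python
def gallons_needed(distance_miles: float, mpg: float) -> float:
    """Fuel used = distance driven / fuel efficiency (miles per gallon)."""
    return distance_miles / mpg

# A hypothetical 4-cylinder car rated at 15 mpg driving 300 miles:
print(gallons_needed(300, 15))  # → 20.0
```

Because the distance (300 miles) is fixed, the car with the lowest mpg needs the most fuel, which is why the SQL orders by mpg ascending and takes the first row.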


Hypothetical Reasoning Example

If all cars produced by the Daimler Benz company have 4 cylinders, then among all 4-cylinder cars, which one needs the most fuel to drive 300 miles? Please list how many gallons it needs, along with its make and model.

假如生产自奔驰公司的车都是四缸,开300英里耗油最多的四缸车的品牌和型号分别是什么,它需要多少加仑的油?

SELECT B.Make, B.Model, 1.0 * 300 / mpg AS n_gallon FROM cars_data A JOIN car_names B ON A.id=B.makeid JOIN model_list C ON B.model=C.model JOIN car_makers D ON C.maker=D.id WHERE D.fullname="Daimler Benz" OR A.cylinders="4" ORDER BY mpg ASC LIMIT 1

Submission

For submission, please follow the guidance here.

Leaderboard

The Archer leaderboard is shown below. The evaluation metric is execution accuracy (EX) of the predicted SQL, and rankings are based on EX results on the blind test set.
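Execution accuracy compares the result of running the predicted SQL against the result of running the gold SQL on the same database. A minimal sketch of that comparison, assuming a SQLite database file; the official Archer evaluator may differ in details such as result normalization and error handling:

```python
import sqlite3

def execution_accuracy(pred_sql: str, gold_sql: str, db_path: str) -> bool:
    """Return True if predicted and gold SQL yield the same result set."""
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False  # a prediction that fails to execute scores 0
    finally:
        conn.close()
    # Compare as sorted multisets so row order does not matter.
    return sorted(pred_rows) == sorted(gold_rows)
```

The corpus-level EX score is then the fraction of questions for which this check passes.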

English

Rank | Date | Model | Team | Size | Dev | Test
1 | Sep 10, 2024 | GPT-4o + zpoint-embedding | KnowDee | UNK | 22.12 | 42.18
2 | Sep 10, 2024 | GPT-4o + Deepseek-Coder-33b | Harbin Institute of Technology | UNK | 34.62 | 39.12
2 | Sep 10, 2024 | GPT-4o | HITSZ-GDDW Tech | UNK | 31.73 | 39.12
4 | Sep 5, 2024 | GPT-4o + deepseek | IDMG (Beijing University of Posts and Telecommunications) | UNK | 31.73 | 31.87
5 | Sep 10, 2024 | deepseek-chat | JD-5Star | UNK | 24.04 | 31.11
6 | Sep 10, 2024 | GPT-4o | MI&TLab (Harbin Institute of Technology) | UNK | 32.69 | 30.73
6 | Sep 10, 2024 | GPT-4o + all-MiniLM-L6-v2 | NUDT | UNK | 38.46 | 30.73
8 | Sep 10, 2024 | GPT-4o | Foshan University | UNK | 22.12 | 25.62
9 | Mar 15, 2024 | GPT-3.5 + CT-3 | baseline | UNK | 10.57 | 15.84
10 | Mar 15, 2024 | GPT-3.5 + CT-3 + COT | baseline | UNK | 13.46 | 15.27
11 | Mar 15, 2024 | GPT-3.5 + API Doc | baseline | UNK | 14.42 | 11.83
12 | Mar 15, 2024 | T5-3b | baseline | 3B | 0 | 0
12 | Mar 15, 2024 | T5-large | baseline | 0.8B | 0 | 0
12 | Mar 15, 2024 | T5-base | baseline | 0.2B | 0 | 0

Chinese

Rank | Date | Model | Team | Size | Dev | Test
1 | Sep 10, 2024 | GPT-4o + zpoint-embedding | KnowDee | UNK | 25.96 | 42.94
2 | Sep 10, 2024 | GPT-4o + Deepseek-Coder-33b | Harbin Institute of Technology | UNK | 23.08 | 39.89
3 | Sep 10, 2024 | GPT-4o | HITSZ-GDDW Tech | UNK | 24.04 | 37.79
4 | Sep 5, 2024 | GPT-4o + deepseek | IDMG (Beijing University of Posts and Telecommunications) | UNK | 24.04 | 29.39
5 | Sep 10, 2024 | GPT-4o | MI&TLab (Harbin Institute of Technology) | UNK | 24.04 | 28.63
6 | Sep 10, 2024 | GPT-4o + all-MiniLM-L6-v2 | NUDT | UNK | 25.96 | 27.10
7 | Sep 10, 2024 | deepseek-chat | JD-5Star | UNK | 23.08 | 25.00
8 | Sep 10, 2024 | GPT-4o | Foshan University | UNK | 17.14 | 22.90
9 | Mar 15, 2024 | GPT-3.5 + CT-3 + COT | baseline | UNK | 12.50 | 15.49
10 | Mar 15, 2024 | GPT-3.5 + CT-3 | baseline | UNK | 10.58 | 12.21
11 | Mar 15, 2024 | GPT-3.5 + API Doc | baseline | UNK | 10.58 | 10.31
12 | Mar 15, 2024 | mT5-xl | baseline | 3.7B | 0 | 0
12 | Mar 15, 2024 | mT5-large | baseline | 1.2B | 0 | 0
12 | Mar 15, 2024 | mT5-base | baseline | 0.6B | 0 | 0