Archer

A Human-Labeled Text-to-SQL Dataset with Arithmetic, Commonsense and Hypothetical Reasoning

About Archer

Archer is a challenging bilingual text-to-SQL dataset focused on complex reasoning, covering arithmetic, commonsense, and hypothetical reasoning. It contains 1,042 English questions and 1,042 Chinese questions, paired with 521 unique SQL queries over 20 English databases spanning 20 domains. This leaderboard uses a different data split from the original paper for better evaluation: we further select 8 databases from the original train set to serve as test data. As a result, the train set now contains 8 databases, the dev set contains 2 databases, and the blind test set contains 10 databases.

Paper (Zheng et al., 2024)

Data Examples

Arithmetic Reasoning Example

How much higher is the maximum power of a BMW car than the maximum power of a Fiat car?

宝马汽车的最高功率比飞雅特汽车的最高功率高多少?

SELECT MAX(horsepower) - (SELECT MAX(horsepower) FROM cars_data A JOIN car_names B ON A.id=B.makeid WHERE B.model="fiat") AS diff FROM cars_data A JOIN car_names B ON A.id=B.makeid WHERE B.model="bmw"


Commonsense Reasoning Example

Which 4-cylinder car needs the most fuel to drive 300 miles? List how many gallons it needs, and its make and model.

开300英里耗油最多的四缸车的品牌和型号分别是什么,它需要多少加仑的油?

Commonsense Knowledge: Fuel used is calculated by dividing the distance driven by the fuel efficiency (miles per gallon).

SELECT B.Make, B.Model, 1.0 * 300 / mpg AS n_gallon FROM cars_data A JOIN car_names B ON A.Id=B.MakeId WHERE cylinders="4" ORDER BY mpg ASC LIMIT 1
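The commonsense step above is plain arithmetic: gallons needed equal distance divided by miles-per-gallon. A minimal sketch in Python, with the mpg value chosen purely for illustration:

```python
def gallons_needed(distance_miles: float, mpg: float) -> float:
    """Fuel used = distance driven / fuel efficiency (miles per gallon)."""
    return distance_miles / mpg

# A hypothetical 4-cylinder car rated at 15 mpg driving 300 miles:
print(gallons_needed(300, 15))  # → 20.0
```

Because the distance (300 miles) is fixed, the car with the lowest mpg needs the most fuel, which is why the SQL orders by mpg ascending and takes the first row.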


Hypothetical Reasoning Example

If all cars produced by the Daimler Benz company have 4 cylinders, then among all 4-cylinder cars, which one needs the most fuel to drive 300 miles? Please list how many gallons it needs, along with its make and model.

假如生产自奔驰公司的车都是四缸,开300英里耗油最多的四缸车的品牌和型号分别是什么,它需要多少加仑的油?

SELECT B.Make, B.Model, 1.0 * 300 / mpg AS n_gallon FROM cars_data A JOIN car_names B ON A.id=B.makeid JOIN model_list C ON B.model=C.model JOIN car_makers D ON C.maker=D.id WHERE D.fullname="Daimler Benz" OR A.cylinders="4" ORDER BY mpg ASC LIMIT 1

Submission

For submission, please follow the guidance here.

Leaderboard

The Archer leaderboard is shown below. The evaluation metric is execution accuracy (EX) of the predicted SQL, and rankings are based on EX results on the blind test set.
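Execution accuracy compares the result of running the predicted SQL against the result of running the gold SQL on the same database. A minimal sketch of that comparison, assuming a SQLite database file; the official Archer evaluator may differ in details such as result normalization and error handling:

```python
import sqlite3

def execution_accuracy(pred_sql: str, gold_sql: str, db_path: str) -> bool:
    """Return True if predicted and gold SQL yield the same result set."""
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False  # a prediction that fails to execute scores 0
    finally:
        conn.close()
    # Compare as sorted multisets so row order does not matter.
    return sorted(pred_rows) == sorted(gold_rows)
```

The corpus-level EX score is then the fraction of questions for which this check passes.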

English

Rank | Date | Model | Team | Size | Dev | Test
1 | Sep 10, 2024 | GPT-4o + zpoint-embedding | KnowDee | UNK | 22.12 | 42.18
2 | Sep 10, 2024 | GPT-4o + Deepseek-Coder-33b | Harbin Institute of Technology | UNK | 34.62 | 39.12
2 | Sep 10, 2024 | GPT-4o | HITSZ-GDDW Tech | UNK | 31.73 | 39.12
4 | Sep 5, 2024 | GPT-4o + deepseek | IDMG (Beijing University of Posts and Telecommunications) | UNK | 31.73 | 31.87
5 | Sep 10, 2024 | deepseek-chat | JD-5Star | UNK | 24.04 | 31.11
6 | Sep 10, 2024 | GPT-4o | MI&TLab (Harbin Institute of Technology) | UNK | 32.69 | 30.73
6 | Sep 10, 2024 | GPT-4o + all-MiniLM-L6-v2 | NUDT | UNK | 38.46 | 30.73
8 | Sep 10, 2024 | GPT-4o | Foshan University | UNK | 22.12 | 25.62
9 | Mar 15, 2024 | GPT-3.5 + CT-3 | baseline | UNK | 10.57 | 15.84
10 | Mar 15, 2024 | GPT-3.5 + CT-3 + COT | baseline | UNK | 13.46 | 15.27
11 | Mar 15, 2024 | GPT-3.5 + API Doc | baseline | UNK | 14.42 | 11.83
12 | Mar 15, 2024 | T5-3b | baseline | 3B | 0 | 0
12 | Mar 15, 2024 | T5-large | baseline | 0.8B | 0 | 0
12 | Mar 15, 2024 | T5-base | baseline | 0.2B | 0 | 0

Chinese

Rank | Date | Model | Team | Size | Dev | Test
1 | Sep 10, 2024 | GPT-4o + zpoint-embedding | KnowDee | UNK | 25.96 | 42.94
2 | Sep 10, 2024 | GPT-4o + Deepseek-Coder-33b | Harbin Institute of Technology | UNK | 23.08 | 39.89
3 | Sep 10, 2024 | GPT-4o | HITSZ-GDDW Tech | UNK | 24.04 | 37.79
4 | Sep 5, 2024 | GPT-4o + deepseek | IDMG (Beijing University of Posts and Telecommunications) | UNK | 24.04 | 29.39
5 | Sep 10, 2024 | GPT-4o | MI&TLab (Harbin Institute of Technology) | UNK | 24.04 | 28.63
6 | Sep 10, 2024 | GPT-4o + all-MiniLM-L6-v2 | NUDT | UNK | 25.96 | 27.10
7 | Sep 10, 2024 | deepseek-chat | JD-5Star | UNK | 23.08 | 25.00
8 | Sep 10, 2024 | GPT-4o | Foshan University | UNK | 17.14 | 22.90
9 | Mar 15, 2024 | GPT-3.5 + CT-3 + COT | baseline | UNK | 12.50 | 15.49
10 | Mar 15, 2024 | GPT-3.5 + CT-3 | baseline | UNK | 10.58 | 12.21
11 | Mar 15, 2024 | GPT-3.5 + API Doc | baseline | UNK | 10.58 | 10.31
12 | Mar 15, 2024 | mT5-xl | baseline | 3.7B | 0 | 0
12 | Mar 15, 2024 | mT5-large | baseline | 1.2B | 0 | 0
12 | Mar 15, 2024 | mT5-base | baseline | 0.6B | 0 | 0