
TableBench

A Comprehensive and Complex Benchmark for Table QA

About TableBench

TableBench is a comprehensive and complex benchmark designed to evaluate Table Question Answering (TableQA) capabilities, aligning closely with the "Reasoning Complexity of Questions" dimension of real-world Table QA scenarios. It covers 18 fine-grained question subcategories across 4 major categories (Fact Checking, Numerical Reasoning, Data Analysis, and Visualization), with 886 carefully curated test cases. TableBench substantially pushes the boundaries of large language models in complex TableQA scenarios.
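For quick exploration, here is a minimal sketch of loading the benchmark with the Hugging Face `datasets` library. The dataset id and the `qtype` field name below are assumptions based on the public release and may differ from the actual schema.

```python
# Minimal sketch: browse TableBench with the Hugging Face `datasets` library.
# The dataset id and the "qtype" field name are assumptions and may differ
# from the actual release.
from collections import Counter

from datasets import load_dataset

ds = load_dataset("Multilingual-Multimodal-NLP/TableBench", split="test")

# Tally test cases per question category (assumed field name: "qtype").
counts = Counter(example["qtype"] for example in ds)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```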

News

  • Apr. 18, 2025:

    🔥🔥 We have released a cleaner version of TableBench; please re-download the updated dataset. We thoroughly reviewed all test-set cases and fixed every error we found.

    🚀🚀 Brand new Leaderboard: We've included the performance of many newly released models in our latest leaderboard and will continue to keep it up to date. Submissions are welcome! For submission guidelines, please refer to the Submission section.

    🔍🔍 Refined Evaluation Metrics: In response to community feedback and in-depth discussions, we've updated the evaluation metrics for Fact Checking, Numerical Reasoning, and Data Analysis. You can find the detailed specifications of these new metrics and the evaluation tools in our GitHub repo; a toy illustration of this style of answer-matching metric appears after this news section.

    Thank you for all the feedback!!!


  • Jan. 21, 2025: We are thrilled to share that our paper has been accepted to AAAI 2025! We sincerely thank our co-authors, the anonymous reviewers, and all the researchers and users whose valuable feedback on GitHub and via email has greatly contributed to this work.

  • Aug. 29, 2024: We officially released the TableBench and TableInstruct datasets on Hugging Face! TableBench focuses on the reasoning complexity of questions, covering a wide range of reasoning types across 4 major categories (Fact Checking, Numerical Reasoning, Data Analysis, and Visualization) and 18 fine-grained subcategories. The test set consists of 886 carefully constructed instances.
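As a toy illustration of the answer-matching style of metric mentioned above (this is not the official evaluation code; see the GitHub repo for the real specifications), a normalized exact-match scorer for short-form answers might look like the sketch below. All normalization rules here are assumptions for demonstration.

```python
# Toy illustration of a normalized exact-match scorer for short-form
# TableQA answers. This is NOT the official TableBench metric; the
# normalization rules are assumptions for demonstration only.
import re

def normalize(answer: str) -> str:
    """Lowercase, strip common symbols and extra whitespace, canonicalize numbers."""
    text = answer.strip().lower()
    text = re.sub(r"[%,$]", "", text)   # drop common unit symbols
    text = re.sub(r"\s+", " ", text)
    try:
        return str(float(text))          # so "25.0" matches "25"
    except ValueError:
        return text

def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)

print(exact_match("25.0%", "25"))        # True
print(exact_match("Paris", "London"))    # False
```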

Challenges from TableBench

1. Multi-hop fact checking requires multiple reasoning steps to establish relationships between facts drawn from different parts of the table.

2. Multi-hop numerical reasoning requires computing intermediate results over multiple steps from the table data in order to reach the final answer.

3. Trend forecasting involves estimating future data trends based on historical data analysis.

4. Chart generation requires producing and executing program code to create charts (see the sketch following this list).
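To make the chart-generation challenge concrete: the model is expected to emit executable plotting code for a given table. Below is a minimal sketch assuming a pandas/matplotlib execution environment; the toy table and question are invented for illustration.

```python
# Minimal sketch of the chart-generation setting: given a table and a
# question such as "Plot revenue over time as a bar chart", the model
# must produce executable plotting code. Toy data invented for illustration.
import pandas as pd
import matplotlib.pyplot as plt

table = pd.DataFrame({
    "year": [2020, 2021, 2022, 2023],
    "revenue": [1.2, 1.8, 2.5, 3.1],
})

# Code a model might generate in response to the question:
fig, ax = plt.subplots()
ax.bar(table["year"], table["revenue"])
ax.set_xlabel("year")
ax.set_ylabel("revenue (hypothetical units)")
ax.set_title("Revenue by year")
fig.savefig("chart.png")
```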

[Figure: introductory example case]

Submission

🤗🤗 We warmly welcome submissions to our leaderboards, whether showcasing your own methods or reporting the performance of newly released models! TableBench maintains two separate leaderboards. Please refer to the Submission Guidelines below for details and send your results to tablebench2025@gmail.com.

Citation

@inproceedings{wu2025tablebench,
  title={TableBench: A Comprehensive and Complex Benchmark for Table Question Answering},
  author={Wu, Xianjie and Yang, Jian and Chai, Linzheng and Zhang, Ge and Liu, Jiaheng and Du, Xeron and Liang, Di and Shu, Daixin and Cheng, Xianfu and Sun, Tianzhen and others},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={39},
  number={24},
  pages={25497--25506},
  year={2025}
}
⚙ This website is an improved version built on the original source code of Bird-Bench!
Leaderboard - Methodology

Tasks: FC = Fact Checking, NR = Numerical Reasoning, DA = Data Analysis, VIZ = Visualization. Methods: DP = Direct Prompting, TCoT = Textual Chain-of-Thought, SCoT = Symbolic Chain-of-Thought, PoT = Program-of-Thought.

| Date         | Model + Method                             | Source          | FC    | NR    | DA    | VIZ  | Overall |
|--------------|--------------------------------------------|-----------------|-------|-------|-------|------|---------|
| -            | Human Performance                          | -               | -     | -     | -     | -    | 85.91   |
| Jan 31, 2025 | o3-mini-2025-01-31 + DP                    | baseline        | 86.46 | 82.07 | 35.56 | 32.0 | 59.9    |
| Jan 20, 2025 | Deepseek-R1 + DP                           | baseline        | 82.29 | 75.51 | 35.06 | 20.0 | 56.31   |
| Apr 05, 2025 | Llama-4-Maverick-17B-128E-Instruct + TCoT  | baseline        | 80.21 | 73.3  | 28.84 | 54.0 | 52.73   |
| May 13, 2024 | GPT-4o + TCoT                              | baseline        | 81.25 | 68.51 | 32.09 | 56.0 | 51.96   |
| Apr 09, 2024 | GPT-4-Turbo + TCoT                         | baseline        | 86.46 | 66.25 | 32.07 | 62.0 | 51.5    |
| Mar 25, 2025 | Deepseek-Chat-V3 + TCoT                    | baseline        | 82.29 | 69.27 | 27.32 | 48.0 | 50.56   |
| Jul 23, 2024 | Llama-3.1-405B-Instruct + TCoT             | baseline        | 78.12 | 64.99 | 29.11 | 32.0 | 48.87   |
| Sep 19, 2024 | Qwen2.5-72B-Instruct + TCoT                | baseline        | 80.21 | 64.48 | 28.89 | 34.0 | 48.79   |
| Apr 05, 2025 | Llama-4-Scout-17B-16E-Instruct + TCoT      | baseline        | 81.25 | 64.23 | 23.07 | 30.0 | 46.53   |
| Nov 12, 2024 | Qwen2.5-Coder-32B-Instruct + TCoT          | baseline        | 84.38 | 57.68 | 27.14 | 32.0 | 45.51   |
| Mar 06, 2025 | QWQ-32B + DP                               | baseline        | 81.25 | 59.7  | 21.45 | 12.0 | 43.87   |
| Jul 23, 2024 | Llama3.1-70B-Instruct + TCoT               | baseline        | 70.83 | 53.15 | 24.66 | 26.0 | 41.05   |
| Nov 4, 2024  | TableGPT2-7B + TCoT                        | [Su et al. '24] | 75.0  | 51.39 | 25.53 | 28.0 | 41.05   |
| Apr 13, 2024 | Llama3-70B-Chat + TCoT                     | baseline        | 70.83 | 43.58 | 29.66 | 4.0  | 38.68   |
| Jan 25, 2024 | GPT-3.5-Turbo + PoT                        | baseline        | 58.33 | 47.86 | 24.19 | 38.0 | 37.15   |
| Nov 12, 2024 | Qwen2.5-Coder-7B-Instruct + TCoT           | baseline        | 71.88 | 40.05 | 24.23 | 6.0  | 35.12   |
| Aug 13, 2024 | TableLLM-Qwen2-7B + TCoT                   | [Wu et al. '24] | 60.42 | 36.52 | 23.19 | 24.0 | 31.9    |
| Aug 13, 2024 | TableLLM-Llama3.1-8B + TCoT                | [Wu et al. '24] | 69.79 | 30.73 | 24.34 | 24.0 | 30.77   |
| Aug 13, 2024 | TableLLM-DeepseekCoder-7B + TCoT           | [Wu et al. '24] | 63.54 | 34.26 | 21.33 | 36.0 | 30.51   |
| Aug 13, 2024 | TableLLM-Llama3-8B + TCoT                  | [Wu et al. '24] | 64.58 | 30.48 | 23.57 | 26.0 | 29.8    |
| Aug 13, 2024 | TableLLM-CodeQwen-7B + TCoT                | [Wu et al. '24] | 57.29 | 22.17 | 22.34 | 36.0 | 24.81   |
| Apr 13, 2024 | Llama3-8B-Chat + SCoT                      | baseline        | 65.62 | 14.61 | 22.06 | 0.0  | 22.2    |
| Sep 19, 2024 | Qwen2.5-7B-Instruct + TCoT                 | baseline        | 60.42 | 16.88 | 20.73 | 6.0  | 22.14   |
| Dec 11, 2023 | Mixtral-8x7B-Instruct + PoT                | baseline        | 31.25 | 30.48 | 12.01 | 6.0  | 21.7    |
| Jul 23, 2024 | Llama3.1-8B-Instruct + DP                  | baseline        | 54.17 | 7.3   | 16.2  | 2.0  | 15.42   |
| Sep 27, 2023 | Mistral-7B-Instruct + SCoT                 | baseline        | 36.46 | 4.53  | 12.89 | 0.0  | 10.97   |
Leaderboard - Large Language Models

🤔 marks reasoning models; UNK = undisclosed size. Methods (DP, TCoT, SCoT, PoT) as defined above.

| Date         | Model                              | Source / Org    | Size | DP    | TCoT  | SCoT  | PoT   |
|--------------|------------------------------------|-----------------|------|-------|-------|-------|-------|
| Jan 31, 2025 | o3-mini 🤔 (2025-01-31)            | OpenAI          | UNK  | 59.9  | -     | -     | -     |
| Jan 20, 2025 | Deepseek-R1 🤔 (2025-01-20)        | Deepseek        | 685B | 56.31 | -     | -     | -     |
| Apr 05, 2025 | Llama-4-Maverick-17B-128E-Instruct | Meta AI         | 402B | 49.49 | 52.73 | 41.92 | 34.3  |
| May 13, 2024 | GPT-4o (2024-05-13)                | OpenAI          | UNK  | 40.91 | 51.96 | 41.43 | 45.71 |
| Apr 09, 2024 | GPT-4-Turbo (2024-04-09)           | OpenAI          | UNK  | 38.74 | 51.5  | 42.27 | 48.49 |
| Mar 25, 2025 | Deepseek-Chat-V3 (2025-03-24)      | Deepseek        | 685B | 36.99 | 50.56 | 45.99 | 45.54 |
| Jul 23, 2024 | Llama-3.1-405B-Instruct            | Meta AI         | 405B | 36.04 | 48.87 | 39.85 | 26.21 |
| Sep 19, 2024 | Qwen2.5-72B-Instruct               | Alibaba         | 72B  | 27.5  | 48.79 | 35.99 | 42.4  |
| Apr 05, 2025 | Llama-4-Scout-17B-16E-Instruct     | Meta AI         | 109B | 44.93 | 46.53 | 35.52 | 17.42 |
| Nov 12, 2024 | Qwen2.5-Coder-32B-Instruct         | Alibaba         | 32B  | 26.98 | 45.51 | 30.17 | 38.72 |
| Mar 06, 2025 | QWQ-32B 🤔                         | Alibaba         | 32B  | 43.87 | 43.48 | 37.06 | 31.58 |
| Jul 23, 2024 | Llama3.1-70B-Instruct              | Meta AI         | 70B  | 30.02 | 41.05 | 38.12 | 30.1  |
| Nov 4, 2024  | TableGPT2-7B                       | [Su et al. '24] | 7B   | 27.95 | 41.05 | 31.4  | 38.67 |
| Apr 13, 2024 | Llama3-70B-Instruct                | Meta AI         | 70B  | 28.5  | 38.68 | 31.12 | 34.77 |
| Jan 25, 2024 | GPT-3.5-Turbo (2024-01-25)         | OpenAI          | UNK  | 24.05 | 26.83 | 26.8  | 37.15 |
| Nov 12, 2024 | Qwen2.5-Coder-7B-Instruct          | Alibaba         | 7B   | 21.6  | 35.12 | 22.12 | 16.46 |
| Aug 13, 2024 | TableLLM-Qwen2-7B                  | [Wu et al. '24] | 7B   | 22.29 | 31.9  | 23.62 | 12.87 |
| Aug 13, 2024 | TableLLM-Llama3.1-8B               | [Wu et al. '24] | 8B   | 22.3  | 30.77 | 21.92 | 27.17 |
| Aug 13, 2024 | TableLLM-DeepseekCoder-7B          | [Wu et al. '24] | 7B   | 23.15 | 30.51 | 23.56 | 18.74 |
| Aug 13, 2024 | TableLLM-Llama3-8B                 | [Wu et al. '24] | 8B   | 20.78 | 29.8  | 20.78 | 14.75 |
| Aug 13, 2024 | TableLLM-CodeQwen-7B               | [Wu et al. '24] | 7B   | 20.15 | 24.81 | 20.55 | 15.14 |
| Apr 13, 2024 | Llama3-8B-Instruct                 | Meta AI         | 8B   | 21.32 | 19.9  | 22.2  | 14.51 |
| Sep 19, 2024 | Qwen2.5-7B-Instruct                | Alibaba         | 7B   | 17.63 | 22.14 | 21.32 | 21.86 |
| Dec 11, 2023 | Mixtral-8x7B-Instruct-v0.1         | Mistral AI      | 56B  | 17.38 | 16.56 | 18.04 | 21.7  |
| Jul 23, 2024 | Llama3.1-8B-Instruct               | Meta AI         | 8B   | 15.42 | 12.78 | 13.53 | 14.18 |
| Sep 27, 2023 | Mistral-7B-Instruct-v0.2           | Mistral AI      | 7B   | 10.8  | 10.77 | 10.97 | 2.53  |