TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

1CCSE, Beihang University 2University of Waterloo 3Fudan University 4Beijing Information Science and Technology University

Introduction

Recent advancements in Large Language Models (LLMs) have markedly enhanced the interpretation and processing of tabular data, introducing previously unimaginable capabilities. Despite these achievements, LLMs still encounter significant challenges when applied in industrial scenarios, particularly due to the increased reasoning complexity of real-world tabular data, underscoring a notable disparity between academic benchmarks and practical applications. To address this discrepancy, we conduct a detailed investigation into the application of tabular data in industrial scenarios and propose a comprehensive and complex benchmark, TableBench, covering 18 fields within four major categories of table question answering (TableQA) capabilities. Furthermore, we introduce TableLLM, trained on our meticulously constructed training set TableInstruct, achieving performance comparable to GPT-3.5. Extensive experiments conducted on TableBench indicate that both open-source and proprietary LLMs still have significant room for improvement to meet real-world demands: even the most advanced model, GPT-4, achieves only a modest score compared to humans.

Dataset Construction

To bridge the gap between academic benchmarks and industrial scenarios, we comprehensively analyze tabular data applications in real-world contexts, categorizing these problems into four major categories and 18 specific subcategories. We define the complexity of these tasks by the reasoning steps required for problem-solving and provide detailed guidelines for defining and decomposing these steps, which are rigorously followed during the annotation process. Additionally, we introduce an annotation framework that combines manual and automated methods to enhance annotation efficiency, as illustrated in Figure. Finally, we propose two high-quality corpora: TableBench, a comprehensive and complex benchmark consisting of 886 samples, and TableInstruct, a massive instruction corpus (20K samples in total) designed to instruct LLMs in various reasoning methods.

Dataset Statistics

Figures: overview of TableBench, reasoning-steps analysis, and dataset topics.

Question Categories

Drawing from real-world scenarios and user demands for tabular data, we devise four primary question categories (fact-checking, numerical reasoning, data analysis, and visualization) encompassing 18 subcategories, which thoroughly illustrate the challenges encountered in TableQA scenarios. Compared with existing datasets, TableBench covers a broader spectrum of question categories, as presented in Table, with a particular emphasis on data analysis and chart generation capabilities that are notably scarce in prior datasets.

Reasoning Steps

We define the complexity of the dataset by the number of reasoning steps required to solve each problem. Figure demonstrates that the overall complexity of TableBench is significantly higher than that of existing datasets, particularly for data analysis and visualization questions.
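To make this complexity measure concrete, the sketch below shows one way to compute the average number of annotated reasoning steps per question category; the sample schema (the `category` and `reasoning_steps` fields) is a hypothetical illustration, not TableBench's actual data format.

```python
from collections import defaultdict


def average_reasoning_steps(samples):
    """Average annotated reasoning-step count per question category.

    Each sample is assumed to carry a `category` label and a list of
    annotated `reasoning_steps` (hypothetical field names).
    """
    totals, counts = defaultdict(int), defaultdict(int)
    for sample in samples:
        totals[sample["category"]] += len(sample["reasoning_steps"])
        counts[sample["category"]] += 1
    return {cat: totals[cat] / counts[cat] for cat in totals}
```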

Overall Performance


We evaluate 30+ models with sizes ranging from 7B to 110B parameters, including general and code LLMs, open-source and proprietary models, and SFT models. We explore three distinct reasoning methodologies to augment reasoning capabilities over tabular data: Textual Chain of Thought (TCoT), Symbolic Chain of Thought (SCoT), and Program of Thought (PoT). TCoT follows a textual reasoning approach, employing a series of inferential steps to deduce the final answer. SCoT adopts symbolic reasoning steps, leveraging programming-language commands to iteratively simulate and refine results through a 'Think then Code' process. Finally, PoT generates executable code, using lines of code as reasoning steps and running them in a programming environment to compute the result.
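To make the PoT setting concrete, the following is a minimal sketch of a PoT-style evaluation step, assuming a hypothetical `call_llm` function that returns the model's completion and a prompt format of our own invention; the actual TableBench prompts and execution harness may differ.

```python
import pandas as pd

# Illustrative PoT prompt template; the actual TableBench prompts may differ.
POT_PROMPT = """You are given a table as a pandas DataFrame named `df`.
Write Python code that computes the answer to the question and stores it
in a variable named `answer`.

Table (first rows):
{table_preview}

Question: {question}

Code:"""


def answer_with_pot(call_llm, df: pd.DataFrame, question: str):
    """Program of Thought: ask the model for executable code, then run it.

    `call_llm` is a hypothetical callable mapping a prompt string to the
    model's text completion.
    """
    prompt = POT_PROMPT.format(table_preview=df.head().to_string(), question=question)
    code = call_llm(prompt)

    # Execute the generated code in a restricted namespace; in practice this
    # should run in a sandbox with a timeout.
    namespace = {"df": df, "pd": pd}
    try:
        exec(code, namespace)
        return namespace.get("answer")
    except Exception:
        # Failed executions count as unparsable/unanswered.
        return None
```

In contrast, TCoT and SCoT keep the reasoning in the model's textual output, and only the final answer is parsed from the response.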

Further Analysis

Figures: effect of parsing ratio; data efficiency of TableInstruct.

Effect of Parsing Ratio

Comparing the DP (Direct Prompting), SCoT, and TCoT methods in the figure above, the data points on the left side of the quadratic curve show that at low parsing ratios, the overall score increases as the parsing ratio decreases, suggesting that certain models (e.g., StructLLM) possess strong table understanding capabilities but exhibit weaker instruction-following abilities. This may be attributed to differences between the instruction format used during their instruction tuning and the format we employ. The right side of the quadratic curve reveals that despite the strong instruction-following performance of the DP method, this non-reasoning method faces a clear performance ceiling, whereas reasoning-based methods show significant potential for improvement. The PoT curve highlights its substantial potential to raise the overall score by increasing the parsing ratio.
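As a hedged illustration of the quantities discussed here, the sketch below shows one plausible way to compute a model's parsing ratio (the fraction of responses from which a final answer can be extracted) and an overall score that treats unparsable responses as zero; the `Final Answer:` marker and `score_fn` are assumptions for illustration, not the benchmark's official evaluation code.

```python
import re

# Hypothetical answer marker; the models' actual output format may differ.
ANSWER_PATTERN = re.compile(r"Final Answer:\s*(.+)", re.IGNORECASE)


def parse_answer(response: str):
    """Extract the final answer from a model response, or None if unparsable."""
    match = ANSWER_PATTERN.search(response)
    return match.group(1).strip() if match else None


def parsing_ratio_and_score(responses, references, score_fn):
    """Return (parsing_ratio, overall_score) for one model/method pair.

    Unparsable responses score 0, so a low parsing ratio directly caps the
    overall score a method can reach.
    """
    parsed = [parse_answer(r) for r in responses]
    ratio = sum(p is not None for p in parsed) / len(parsed)
    overall = sum(score_fn(p, ref) if p is not None else 0.0
                  for p, ref in zip(parsed, references)) / len(references)
    return ratio, overall
```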

Data Efficiency of TableInstruct

We construct datasets of varying sizes by sampling from TableInstruct at rates ranging from 0.2 to 0.6. The figure above depicts the relative performance at different sampling rates. Surprisingly, with only 60% of the samples, the model retains over 90% of the performance achieved with the complete dataset. The full data provides the highest knowledge coverage, enabling the model to reach its best overall performance, comparable to GPT-3.5, at only a fraction of the inference cost, indicating the high efficiency of TableInstruct.
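A minimal sketch of this sampling setup, assuming TableInstruct is stored as a JSON-lines file with one instruction sample per line (the file path and seed are illustrative):

```python
import json
import random


def sample_tableinstruct(path: str, rate: float, seed: int = 42):
    """Uniformly subsample a JSON-lines instruction corpus at the given rate."""
    with open(path, "r", encoding="utf-8") as f:
        samples = [json.loads(line) for line in f]
    random.seed(seed)
    return random.sample(samples, int(len(samples) * rate))


# Build the training subsets used in the analysis (sampling rates 0.2 to 0.6).
subsets = {rate: sample_tableinstruct("tableinstruct.jsonl", rate)
           for rate in (0.2, 0.3, 0.4, 0.5, 0.6)}
```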