Translating natural language questions into SQL makes data easier to access. This allows anyone, even without technical skills, to work directly with databases.
We used our text-to-SQL benchmark methodology on eight large language models (LLMs) to assess their performance in SQL generation.
Benchmark methodology for text-to-SQL
To evaluate the text-to-SQL capabilities of the eight LLMs, we utilized the following methodology:
- Dataset Selection: The evaluation used the BIRD-SQL dataset,1 a benchmark widely recognized in academic research for complex text-to-SQL tasks. Each entry pairs a natural language question with the target database schema and the ground truth SQL query that answers it.
- Task Definition: For each test case derived from the dataset, the LLM was presented with the natural language question and the relevant database schema (including table names, column names, and primary/foreign key relationships).
- LLM Task Execution: The core task for each LLM was to generate a syntactically valid and semantically correct SQL query based solely on the provided question and schema context.
- Evaluation Metric: The primary metric was semantic correctness, often measured via execution accuracy. This involves executing the LLM-generated SQL against the appropriate database and comparing its results to the results produced by the ground truth SQL query from the BIRD-SQL dataset.
- Semantic Comparison: To ensure fair evaluation focused on logical correctness rather than minor syntactic differences, SQL normalization techniques were applied before comparison. This typically includes steps like lowercasing keywords and identifiers, removing non-essential aliases, and standardizing whitespace to compare the core query logic effectively.
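The evaluation loop described above can be sketched in a few lines. This is a minimal illustration using Python's built-in sqlite3, with a toy schema standing in for the BIRD-SQL databases; the normalization shown is deliberately simplified compared with what a full benchmark harness would do.

```python
import sqlite3

def normalize(sql: str) -> str:
    """Crude normalization: lowercase and collapse whitespace.
    Real pipelines also strip aliases and standardize quoting."""
    return " ".join(sql.lower().split())

def execution_match(db: sqlite3.Connection, predicted: str, gold: str) -> bool:
    """Execution accuracy: run both queries and compare result sets,
    ignoring row order."""
    try:
        pred_rows = sorted(db.execute(predicted).fetchall())
    except sqlite3.Error:
        return False  # SQL that fails to execute counts as a miss
    gold_rows = sorted(db.execute(gold).fetchall())
    return pred_rows == gold_rows

# Toy database standing in for a benchmark schema
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE schools (name TEXT, zip TEXT, county TEXT)")
db.executemany("INSERT INTO schools VALUES (?, ?, ?)",
               [("A", "93701", "Fresno"), ("B", "94501", "Alameda")])

gold = "SELECT zip FROM schools WHERE county = 'Fresno'"
predicted = "SELECT  ZIP  FROM schools WHERE county = 'Fresno'"
print(normalize(predicted) == normalize(gold))  # True after normalization
print(execution_match(db, predicted, gold))     # True: same result set
```

Execution accuracy deliberately sidesteps surface differences: two queries that look different but return identical rows are counted as a match.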
Common errors in LLM-generated SQL
Four error types recurred across models: faulty joins, aggregation mistakes, missing or wrong filters, and syntax errors.
Incorrect join logic
Models often struggled to identify and implement the necessary `JOIN` operations between tables, sometimes omitting them entirely or substituting less suitable subqueries.
Question:
What is the zip code of all the charter schools in Fresno County Office of Education?
Correct SQL:
SELECT T2.Zip
FROM frpm AS T1 INNER JOIN schools AS T2 ON T1.CDSCode = T2.CDSCode
WHERE T1."District Name" = 'Fresno County Office of Education' AND T1."Charter School (Y/N)" = 1
Predicted SQL:
SELECT Zip
FROM schools
WHERE Charter = 1.0 AND County = 'Fresno'
The LLM failed to join the `frpm` and `schools` tables on `CDSCode`. It also hallucinated a column name (`Charter`) and a filter value (`County = 'Fresno'`).
Errors in join logic fundamentally break the relational aspect of the query, leading to incomplete or incorrect data retrieval when multiple tables are involved.
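One cheap guardrail against hallucinated identifiers is to check every name in the generated query against the schema before executing it. The sketch below uses a naive token scan rather than a real SQL parser, and the schema dictionary is a simplified stand-in for the example above, so treat it as an illustration only.

```python
import re

# Hypothetical, simplified schema for the example above
SCHEMA = {
    "frpm": {"CDSCode", "District Name", "Charter School (Y/N)"},
    "schools": {"CDSCode", "Zip", "County"},
}

def unknown_identifiers(sql: str) -> set[str]:
    """Flag bare identifiers that match no table or column in the schema.
    A real implementation would use a proper SQL parser."""
    sql = re.sub(r"'[^']*'", "", sql)  # drop string literals first
    known = {name.lower() for name in SCHEMA}
    for cols in SCHEMA.values():
        known |= {c.lower() for c in cols}
    keywords = {"select", "from", "where", "and", "or", "inner",
                "join", "on", "as", "not", "null"}
    tokens = re.findall(r"[A-Za-z_][A-Za-z_0-9]*", sql)
    return {t for t in tokens if t.lower() not in known | keywords}

bad_sql = "SELECT Zip FROM schools WHERE Charter = 1.0 AND County = 'Fresno'"
print(unknown_identifiers(bad_sql))  # {'Charter'}: a hallucinated column
```

A check like this cannot catch a wrong join or a wrong filter value, but it rejects queries that reference columns the database simply does not have.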
Aggregation and grouping errors
Applying aggregate functions (like `MAX`, `AVG`, `COUNT`, `SUM`) or `GROUP BY` clauses incorrectly was another common failure point, leading to results that didn’t semantically match the user’s intent.
Question:
What is the highest eligible free rate for K-12 students in the schools in Alameda County?
Correct SQL:
SELECT "Free Meal Count (K-12)" / "Enrollment (K-12)"
FROM frpm
WHERE "County Name" = 'Alameda'
ORDER BY (CAST("Free Meal Count (K-12)" AS REAL) / "Enrollment (K-12)") DESC LIMIT 1
Predicted SQL:
SELECT MAX(CAST("Free Meal Count (K-12)" AS FLOAT) / "Enrollment (K-12)") AS highest_eligible_free_rate
FROM frpm
WHERE "County Name" = 'Alameda'
The LLM applied `MAX()` directly to the calculated ratio. This does return the maximum value, but the ground truth instead computes the rate per row, orders by it (`ORDER BY … DESC`), and takes the top result (`LIMIT 1`), which identifies the record associated with the highest rate rather than only the scalar.
This highlights the difficulty LLMs face in translating nuanced requests for rankings or specific aggregate calculations into the correct SQL operational steps.
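The distinction matters most when the question asks for the record behind the extreme value rather than the value alone. A small sqlite3 illustration with toy data and hypothetical column names:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE frpm (school TEXT, free_k12 INTEGER, enroll_k12 INTEGER)")
db.executemany("INSERT INTO frpm VALUES (?, ?, ?)",
               [("A", 50, 100), ("B", 90, 100), ("C", 10, 100)])

# MAX() collapses all rows to one scalar: you get the rate, not the school.
rate, = db.execute(
    "SELECT MAX(CAST(free_k12 AS REAL) / enroll_k12) FROM frpm").fetchone()

# ORDER BY ... LIMIT 1 keeps the whole row, so the associated school
# (or its phone number, after a join) can be selected as well.
school, top_rate = db.execute(
    "SELECT school, CAST(free_k12 AS REAL) / enroll_k12 AS r "
    "FROM frpm ORDER BY r DESC LIMIT 1").fetchone()

print(rate, school, top_rate)
```

Note also the `CAST`: without it, SQLite performs integer division on the two `INTEGER` columns and every rate collapses to zero.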
Missing or wrong filters
Models sometimes omitted necessary `WHERE` clauses or selected the wrong columns in the `SELECT` list, failing to address all the constraints or the information explicitly requested in the prompt.
Question:
What is the phone number of the school that has the highest number of test takers with an SAT score of over 1500?
Correct SQL:
SELECT T2.Phone
FROM satscores AS T1 INNER JOIN schools AS T2 ON T1.cds = T2.CDSCode
ORDER BY T1.NumGE1500 DESC LIMIT 1
Predicted SQL:
SELECT sname, NumGE1500
FROM satscores
ORDER BY NumGE1500 DESC LIMIT 1
The LLM correctly identified the logic to find the school (`ORDER BY NumGE1500 DESC LIMIT 1`) but failed to select the requested `Phone` number and omitted the necessary join to the `schools` table to retrieve it.
These errors often stem from incomplete parsing of the user’s request or failure to map all parts of the request to the final SQL query components.
Syntax errors
Beyond semantic errors, outright syntax mistakes occurred, such as using incorrect table aliases or producing incomplete SQL statements, which prevent the query from executing.
Question:
How many accounts have an owner disposition and request for a statement to be generated upon a transaction?
Correct SQL:
SELECT COUNT(T1.account_id)
FROM account AS T1 INNER JOIN disp AS T2 ON T1.account_id = T2.account_id
WHERE T2.type = 'OWNER' AND T1.frequency = 'POPLATEK PO OBRATU'
Predicted SQL:
SELECT COUNT(DISTINCT a.account_id)
FROM accounts a JOIN disp d ON a.account_id = d.account_id
WHERE d.type = 'OWNER' AND a.frequency = 'POPLATEK PO OBRATU...'
The LLM referenced a nonexistent table (`accounts` instead of `account`) and produced an incomplete string literal (`'POPLATEK PO OBRATU...'`), so the query cannot execute as intended.
These syntax issues highlight challenges in generating code that strictly adheres to SQL grammar and database-specific conventions.
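A practical mitigation is to dry-run generated SQL before trusting it. In SQLite, prefixing a statement with `EXPLAIN` compiles it without running the underlying query, which catches unknown tables and malformed statements up front. A minimal sketch against a toy schema (not the benchmark database):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE account (account_id INTEGER, frequency TEXT)")
db.execute("CREATE TABLE disp (account_id INTEGER, type TEXT)")

def validates(sql: str) -> bool:
    """Compile the statement without executing the query itself;
    EXPLAIN fails fast on unknown tables/columns and bad syntax."""
    try:
        db.execute("EXPLAIN " + sql)
        return True
    except sqlite3.Error:
        return False

good = ("SELECT COUNT(a.account_id) FROM account a "
        "JOIN disp d ON a.account_id = d.account_id")
bad = ("SELECT COUNT(a.account_id) FROM accounts a "   # no such table
       "JOIN disp d ON a.account_id = d.account_id")

print(validates(good), validates(bad))  # True False
```

A dry run of this kind catches the entire syntax-error class, though it says nothing about whether the query is semantically correct.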
Why some LLMs are better at SQL
Several factors determine how well an LLM can turn a plain-English question into a correct SQL query.
1. Model size and training data
- Size and design: Larger models or those built with specific structures may handle complex tasks, such as SQL generation, more effectively.
- What it learned: The data used to train the LLM is essential. If it sees many examples of questions linked to SQL answers, especially those involving complex operations such as joins or calculations (SUM, AVG), it will likely perform better.
2. Fine-tuning for SQL tasks
- Models can be given extra training specifically focused on text-to-SQL tasks. This “fine-tuning” helps them understand database structures and SQL rules more effectively than models trained only on general text. Training on specific instructions also helps.
3. Reasoning and schema mapping abilities
- Reasoning: How well can the LLM figure out the exact steps needed from a sometimes vague question? Creating SQL often requires logical steps.
- Understanding the database map (Schema): Some LLMs are better at connecting concepts in the question (such as “customers” or “total sales”) to the actual table and column names in the database, even if the names aren’t immediately obvious.
What is text-to-SQL?
Text-to-SQL is a natural language processing (NLP) technology that converts plain language text queries into structured SQL (Structured Query Language) commands, which can be executed on databases. This technology bridges the gap between human communication and database query languages, allowing users without technical SQL knowledge to interact with databases effectively.
At its core, text-to-SQL uses sophisticated algorithms and machine learning models to understand the intent behind a user’s natural language question and translate it into the appropriate SQL syntax. For example, when a user types “Show me all customers from New York who made purchases last month,” a text-to-SQL system interprets this request and converts it into a precise SQL query with the correct SELECT, FROM, WHERE, and JOIN statements needed to retrieve the requested information.
The technology typically works through several processing layers:
- Natural language understanding (NLU): Analyzes the semantic structure of the text input to identify entities, relationships, and the query intent.
- Schema mapping: Maps the identified entities to the corresponding database tables and fields.
- Query construction: Builds the appropriate SQL query structure with the necessary joins, conditions, and aggregations.
- Validation and optimization: Ensures the generated SQL is valid and optimized for the specific database system.
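These layers can be sketched as a simple pipeline. Everything below is schematic: the stage functions are stubs with hypothetical names, intended only to show how the stages compose, not how any production system implements them.

```python
from dataclasses import dataclass

@dataclass
class Intent:
    entity: str    # e.g. "customers"
    filters: dict  # e.g. {"city": "New York"}

def understand(text: str) -> Intent:
    # NLU stage (stub): a real system uses a trained model here.
    return Intent(entity="customers", filters={"city": "New York"})

def map_schema(intent: Intent) -> tuple[str, dict]:
    # Schema mapping (stub): question concepts -> actual tables/columns.
    table_for = {"customers": "customer"}
    column_for = {"city": "city"}
    filters = {column_for[k]: v for k, v in intent.filters.items()}
    return table_for[intent.entity], filters

def construct(table: str, filters: dict) -> tuple[str, tuple]:
    # Query construction: parameterized to keep values out of the SQL text.
    where = " AND ".join(f"{col} = ?" for col in filters)
    return f"SELECT * FROM {table} WHERE {where}", tuple(filters.values())

sql, params = construct(*map_schema(understand("customers from New York")))
print(sql, params)  # SELECT * FROM customer WHERE city = ? ('New York',)
```

The validation stage from the list above would then dry-run `sql` against the target database before returning it to the user.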
Modern text-to-SQL systems employ advanced neural networks and transformer-based models (similar to BERT or GPT) that have been trained on vast datasets of text-to-SQL pairs, enabling them to handle increasingly complex queries and adapt to different database schemas.
Limitations and practical risks
The benchmark results, with top performers such as Grok-3 reaching 80% accuracy and others, like Deepseek-R1 and GPT-4o, exceeding 70%, indicate the growing potential of LLMs as tools for database interaction, particularly for tasks such as:
- Generating draft queries: Assisting developers or analysts by producing initial SQL queries that can be reviewed and refined.
- Exploratory data analysis: Enabling users with some SQL knowledge to quickly formulate simpler queries without writing them from scratch.
- Democratizing simple data access: Allowing less technical users to ask basic questions of databases, provided the queries aren’t overly complex.
However, the results simultaneously underscore significant limitations. No model achieved near-perfect accuracy, with even the best model producing incorrect SQL for 1 out of every 5 complex queries. Furthermore, a substantial number of evaluated models performed significantly lower, clustering in the 57-66% accuracy range. This leads to several practical concerns:
- Reliability for critical tasks: An error rate of 20% (for the best) to over 40% (for lower-scoring models) is generally unacceptable for mission-critical applications, automated reporting, or any scenario where accuracy is paramount and human oversight is limited. Misleading results generated from subtly incorrect SQL (due to flawed joins, aggregations, or filters, as detailed in our error analysis) pose a significant risk.
- Dependence on human oversight: Given the observed error rates across all models, generated SQL must be validated by a knowledgeable human user before being trusted for important decisions or operations, especially for queries beyond basic lookups.
- Complexity ceiling: The benchmark (BIRD-SQL) emphasizes complex queries. The performance drop-off for many models suggests they struggle significantly as query complexity (multiple joins, nested logic, specific calculations) increases.
- Model selection: The wide performance variance (from 57% to 80%) highlights that not all LLMs are equally adept at text-to-SQL. Selecting a model suitable for the required complexity and accuracy is vital.
Current LLMs are powerful assistants for text-to-SQL, capable of augmenting human workflows. However, they are not yet reliable enough to function as fully autonomous, trustworthy query generators for demanding, high-stakes database interaction, primarily due to the persistent challenge of consistently generating correct SQL for complex requests.
FAQ
Can I fully trust the SQL query generated by LLMs?
Based on our findings, you should not fully trust complex queries generated by current LLMs without validation. While useful for drafting and simple requests, even top-performing models have significant error rates (up to 20% on complex tasks). Always review and verify the generated SQL, especially for critical applications.
Can LLMs help with more than SELECT statements?
Yes, many LLMs have capabilities beyond simple SELECT generation. They can often assist in understanding and suggesting modifications to existing SQL code or even generating DDL (Data Definition Language) like CREATE TABLE statements based on descriptions, although accuracy for these tasks also requires verification.
How can I get better results from an LLM?
Providing clear context is key. Ensure the LLM has access to the database schema (table names, column names, relationships). Clearly stating the desired outcome and potentially providing a few relevant example queries (few-shot prompting) for the LLM to learn from can significantly improve its ability to select the correct tables and construct accurate queries.
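A few-shot prompt along these lines might be assembled as below. The schema text and example pairs are placeholders, and the exact format that works best varies by model, so treat this as one reasonable layout rather than a recommended template.

```python
def build_prompt(schema: str, examples: list[tuple[str, str]], question: str) -> str:
    """Assemble a few-shot text-to-SQL prompt: schema first,
    then worked examples, then the new question."""
    parts = ["Given this schema:\n" + schema, "Examples:"]
    for q, sql in examples:
        parts.append(f"Q: {q}\nSQL: {sql}")
    parts.append(f"Q: {question}\nSQL:")
    return "\n\n".join(parts)

schema = 'schools(CDSCode, Zip, County)\nfrpm(CDSCode, "District Name")'
examples = [("How many schools are in Alameda County?",
             "SELECT COUNT(*) FROM schools WHERE County = 'Alameda'")]
prompt = build_prompt(schema, examples,
                      "List zip codes of Fresno County schools.")
print(prompt)
```

Ending the prompt with a bare `SQL:` nudges the model to complete the pattern established by the examples rather than answer in prose.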
Does using an LLM smooth over database dialect and version differences?
While LLMs may abstract away some minor syntax differences between database dialects, they don’t entirely resolve database type version compatibility issues. They may still generate SQL specific to one dialect (e.g., PostgreSQL vs. MySQL) or fail to use functions compatible with older versions unless explicitly guided or trained to do so. Validation against the target database remains important.
External Links
- 1. BIRD-bench.