I have been relying on SQL for data analysis for 18 years, beginning with my days as a consultant. Translating natural-language questions into SQL makes data more accessible, allowing anyone, even those without technical skills, to work directly with databases.
We used our text-to-SQL benchmark methodology on 24 large language models (LLMs) to assess their performance in SQL command generation.
Common errors in LLM-generated SQL
In our benchmark, LLM-generated SQL most often failed in four ways: faulty joins, aggregation mistakes, missing filters, and syntax errors.
Incorrect join logic
Models often struggled to correctly identify and implement the necessary `JOIN` operations between tables, sometimes omitting them entirely or substituting less suitable subqueries.
Question:
What is the zip code of all the charter schools in Fresno County Office of Education?
Correct SQL:
SELECT T2.Zip
FROM frpm AS T1 INNER JOIN schools AS T2 ON T1.CDSCode = T2.CDSCode
WHERE T1."District Name" = 'Fresno County Office of Education' AND T1."Charter School (Y/N)" = 1
Predicted SQL:
SELECT Zip
FROM schools
WHERE Charter = 1.0 AND County = 'Fresno'
The LLM failed to join the `frpm` and `schools` tables on `CDSCode`. It also hallucinated a column name (`Charter`) and a filter value (`County = 'Fresno'`).
Errors in join logic fundamentally break the relational aspect of the query, leading to incomplete or incorrect data retrieval when multiple tables are involved.
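This failure mode is easy to reproduce. The sketch below builds simplified, in-memory versions of the two tables (the column sets are assumed, reduced from the real BIRD-SQL schemas) and runs the correct join with Python's built-in sqlite3:

```python
import sqlite3

# In-memory demo with simplified versions of the BIRD-SQL tables (assumed columns).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE schools (CDSCode TEXT PRIMARY KEY, Zip TEXT);
CREATE TABLE frpm (
    CDSCode TEXT,
    "District Name" TEXT,
    "Charter School (Y/N)" INTEGER
);
INSERT INTO schools VALUES ('01', '93721'), ('02', '90001');
INSERT INTO frpm VALUES
    ('01', 'Fresno County Office of Education', 1),
    ('02', 'Los Angeles Unified', 1);
""")

# The correct query joins frpm and schools on CDSCode, so the district and
# charter filters apply to the right rows before the zip code is read.
rows = conn.execute("""
    SELECT T2.Zip
    FROM frpm AS T1 INNER JOIN schools AS T2 ON T1.CDSCode = T2.CDSCode
    WHERE T1."District Name" = 'Fresno County Office of Education'
      AND T1."Charter School (Y/N)" = 1
""").fetchall()
print(rows)  # [('93721',)]
```

Without the join, the filters would have to run against columns that only exist in one table, which is exactly what pushed the model toward hallucinated column names.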
Aggregation and grouping errors
Applying aggregate functions (like `MAX`, `AVG`, `COUNT`, `SUM`) or `GROUP BY` clauses incorrectly was another common failure point, leading to results that didn’t semantically match the user’s intent.
Question:
Which active district has the highest average score in Reading?
Correct SQL:
SELECT
T1.District
FROM schools AS T1
INNER JOIN satscores AS T2
ON T1.CDSCode = T2.cds
WHERE
T1.StatusType = 'Active'
GROUP BY
T1.District
ORDER BY
AVG(T2.AvgScrRead) DESC
LIMIT 1;
Predicted SQL:
SELECT dname, AVG(AvgScrRead) AS avg_reading_score
FROM satscores
GROUP BY dname
ORDER BY avg_reading_score DESC
LIMIT 1
The LLM correctly identified that the phrase “highest average score” requires grouping the data by district (`GROUP BY dname`) and using an aggregate function (`AVG(AvgScrRead)`). This part of the logic is correct.
However, the LLM failed to incorporate a critical filter from the question: the word “active”. To fulfill this requirement, the query needed to join the `satscores` table with the `schools` table and then filter the results with a `WHERE T1.StatusType = 'Active'` clause.
This highlights a common LLM failure: correctly executing a primary, obvious instruction (calculating an average) while missing a secondary but equally important condition (filtering by status). This shows a weakness in synthesizing multiple constraints into a single, correct query.
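A tiny in-memory example (with made-up data) shows how dropping the status filter changes the answer even when the grouping logic is right:

```python
import sqlite3

# Simplified, assumed schemas and invented data: the closed district has the
# higher reading average, so the missing filter flips the result.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE schools (CDSCode TEXT, District TEXT, StatusType TEXT);
CREATE TABLE satscores (cds TEXT, AvgScrRead REAL);
INSERT INTO schools VALUES ('01', 'Closed District', 'Closed'),
                           ('02', 'Active District', 'Active');
INSERT INTO satscores VALUES ('01', 650), ('02', 600);
""")

# Without the StatusType filter, the closed district wins.
unfiltered = conn.execute("""
    SELECT T1.District FROM schools AS T1
    JOIN satscores AS T2 ON T1.CDSCode = T2.cds
    GROUP BY T1.District ORDER BY AVG(T2.AvgScrRead) DESC LIMIT 1
""").fetchone()[0]

# With the filter, only active districts are ranked.
filtered = conn.execute("""
    SELECT T1.District FROM schools AS T1
    JOIN satscores AS T2 ON T1.CDSCode = T2.cds
    WHERE T1.StatusType = 'Active'
    GROUP BY T1.District ORDER BY AVG(T2.AvgScrRead) DESC LIMIT 1
""").fetchone()[0]

print(unfiltered, filtered)  # Closed District Active District
```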
Missing or wrong filters
Models sometimes failed to include necessary `WHERE` clauses or select the wrong columns in the `SELECT` statement, not fully addressing the constraints or the information explicitly requested in the prompt.
Question:
What is the phone number of the school that has the highest number of test takers with an SAT score of over 1500?
Correct SQL:
SELECT T2.Phone
FROM satscores AS T1 INNER JOIN schools AS T2 ON T1.cds = T2.CDSCode
ORDER BY T1.NumGE1500 DESC LIMIT 1
Predicted SQL:
SELECT sname, NumGE1500
FROM satscores
ORDER BY NumGE1500 DESC LIMIT 1
The LLM correctly identified the logic to find the school (`ORDER BY NumGE1500 DESC LIMIT 1`) but failed to select the requested `Phone` number and omitted the necessary join to the `schools` table to retrieve it.
These errors often stem from incomplete parsing of the user’s request or failure to map all parts of the request to the final SQL query components.
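The pattern can be reproduced with simplified, assumed schemas: the predicted query picks the right school but returns the wrong columns, while the correct query joins out to fetch the phone number that was actually requested:

```python
import sqlite3

# Tiny reproduction of the missing-column failure (assumed schemas, made-up data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE schools (CDSCode TEXT, Phone TEXT);
CREATE TABLE satscores (cds TEXT, sname TEXT, NumGE1500 INTEGER);
INSERT INTO schools VALUES ('01', '555-0100'), ('02', '555-0200');
INSERT INTO satscores VALUES ('01', 'Alpha High', 120), ('02', 'Beta High', 90);
""")

# The predicted query finds the right school but returns the wrong columns.
predicted = conn.execute(
    "SELECT sname, NumGE1500 FROM satscores ORDER BY NumGE1500 DESC LIMIT 1"
).fetchone()

# The correct query joins to schools so it can return the requested phone number.
correct = conn.execute("""
    SELECT T2.Phone
    FROM satscores AS T1 JOIN schools AS T2 ON T1.cds = T2.CDSCode
    ORDER BY T1.NumGE1500 DESC LIMIT 1
""").fetchone()

print(predicted, correct)  # ('Alpha High', 120) ('555-0100',)
```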
Syntax errors
Beyond semantic errors, outright syntax mistakes occurred, such as using incorrect table aliases or producing incomplete SQL statements, which prevent the query from executing.
Question:
How many accounts have an owner disposition and request for a statement to be generated upon a transaction?
Correct SQL:
SELECT COUNT(T1.account_id)
FROM account AS T1 INNER JOIN disp AS T2 ON T1.account_id = T2.account_id
WHERE T2.type = 'OWNER' AND T1.frequency = 'POPLATEK PO OBRATU'
Predicted SQL:
SELECT COUNT(DISTINCT a.account_id)
FROM accounts a JOIN disp d ON a.account_id = d.account_id
WHERE d.type = 'OWNER' AND a.frequency = 'POPLATEK PO OBRATU...'
The LLM referenced a nonexistent table (`accounts` instead of `account`) and emitted an incomplete string literal (`'POPLATEK PO OBRATU...'`), producing a query that fails to execute.
These syntax issues highlight challenges in generating code that strictly adheres to SQL grammar and database-specific conventions.
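One common mitigation, which is not part of the benchmark itself, is a pre-flight check that asks the database engine to compile the statement without running it. A minimal sketch with SQLite (the table definitions are assumed); note that preparing the statement also catches references to missing tables, not just grammar errors:

```python
import sqlite3

def is_valid_sqlite(sql: str) -> bool:
    """Check a statement by asking SQLite to plan (EXPLAIN), not run, it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE account (account_id INTEGER, frequency TEXT);
        CREATE TABLE disp (account_id INTEGER, type TEXT);
    """)
    try:
        # EXPLAIN forces SQLite to parse and resolve names without executing.
        conn.execute("EXPLAIN " + sql)
        return True
    except sqlite3.Error:
        return False

good = "SELECT COUNT(account_id) FROM account"
bad = "SELECT COUNT( FROM accounts a JOIN"  # truncated, invalid
print(is_valid_sqlite(good), is_valid_sqlite(bad))  # True False
```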
Why some LLMs are better at SQL
Several factors determine how well an LLM can turn a plain-English question into a correct SQL query.
1. Model size and training data
- Size and design: Larger models, or those with architectures suited to structured output, tend to handle complex tasks such as SQL generation more effectively.
- Training data: The data used to train the LLM is essential. If the model has seen many examples of questions paired with SQL answers, especially ones involving complex operations such as joins or aggregate calculations (SUM, AVG), it will likely perform better.
2. Fine-Tuning for SQL tasks
- Models can be given extra training focused specifically on text-to-SQL tasks. This “fine-tuning” helps them understand database structures and SQL rules more effectively than models trained only on general text. Instruction tuning on task-specific prompts helps further.
3. Reasoning and schema mapping abilities
- Reasoning: How well can the LLM figure out the exact steps needed from a sometimes vague question? Creating SQL often requires logical steps.
- Understanding the database map (Schema): Some LLMs are better at connecting concepts in the question (such as “customers” or “total sales”) to the actual table and column names in the database, even if the names aren’t immediately obvious.
How LLMs generate SQL: A step-by-step look
To see the factors like “reasoning” and “schema mapping” in action, let’s walk through the step-by-step process a model follows to generate a query. This entire workflow is powered by a technique called Retrieval-Augmented Generation (RAG).
Stage 1: Initial analysis and database selection
When presented with a question, the LLM first analyzes the user’s intent to select the most relevant database tool.
- Question: “How many accounts have an owner disposition and request for a statement to be generated upon a transaction?”
- LLM’s Action: The model identifies keywords like “accounts,” “disposition,” and “transaction.” It concludes that the `financial` database tool is the correct choice over others like `california_schools` or `superhero`.
Stage 2: Retrieving the schema via RAG
Once the model chooses a tool, it needs the database’s “map”: the schema. It doesn’t have this information memorized. Instead, the RAG system retrieves it in real time.
- Retrieval: The user’s question is used to search a vector database that stores schema information. The search finds and retrieves the most relevant schema details, such as the definitions for the `account` and `disp` tables.
- Augmentation: This retrieved schema text is automatically inserted into the prompt alongside the original question.
- Generation: The LLM now has all the context it needs to proceed.
This RAG process ensures the model receives only the necessary schema information, making its task more focused and efficient.
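A toy stand-in for this retrieval step, using crude word overlap instead of embeddings (the real pipeline searches a ChromaDB vector store; the schema snippets below are assumptions):

```python
# Rank schema snippets by overlap with the question. A real system embeds both
# and uses vector similarity; substring matching is only a stand-in here.

SCHEMA_DOCS = {
    "account": "account: account_id district_id frequency date",
    "disp": "disp: disp_id client_id account_id type",
    "schools": "schools: CDSCode County District Zip Phone",
}

def retrieve(question: str, k: int = 2) -> list[str]:
    # Keep only substantive words to avoid short-word noise like "an".
    q_words = [w.strip("?,.").lower() for w in question.split()]
    q_words = [w for w in q_words if len(w) >= 4]

    def score(doc: str) -> int:
        tokens = doc.replace(":", " ").lower().split()
        # Substring match in either direction: "disp" ~ "disposition".
        return sum(1 for t in tokens for w in q_words if t in w or w in t)

    ranked = sorted(SCHEMA_DOCS, key=lambda name: -score(SCHEMA_DOCS[name]))
    return ranked[:k]

print(retrieve("How many accounts have an owner disposition?"))  # ['account', 'disp']
```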
Stage 3: Reasoning and query construction
With the question and the RAG-provided schema, the model maps concepts from the user’s request to the specific table and column names it just received.
LLM’s internal monologue:
- Goal: The user wants a count, so I’ll start with `SELECT COUNT(...)`.
- Conditions:
  - “…owner disposition…” -> The `disp` table schema has a `type` column. I need a `WHERE` clause for `type = 'OWNER'`.
  - “…statement to be generated upon a transaction…” -> The `account` table schema has a `frequency` column. The filter should be `frequency = 'POPLATEK PO OBRATU'`.
- Joins: The information is split across the `account` and `disp` tables. The schema shows they are linked by `account_id`, so I need to `JOIN` them.
Stage 4: Generating the final SQL
Finally, the model assembles these logical pieces into a syntactically correct SQL query. The quality of this output depends on:
- Reasoning capability: The model’s ability to logically connect the user’s request to the provided schema.
- SQL knowledge from training: The model’s core understanding of SQL syntax and functions.
This process explains why errors occur. If the retrieved schema is ambiguous or a term in the question doesn’t map cleanly, the LLM must make an educated guess, which can lead to the errors we analyzed earlier.
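For illustration, the augmented prompt handed to the model could be assembled like this. The template wording is an assumption; the article only states that the retrieved schema is inserted alongside the question:

```python
# Hypothetical prompt assembly for the generation stage. The exact template
# used in the benchmark is not published; this shows the general shape.

def build_prompt(question: str, schema_snippets: list[str]) -> str:
    schema_block = "\n".join(schema_snippets)
    return (
        "You are a text-to-SQL assistant.\n"
        "Relevant schema:\n"
        f"{schema_block}\n\n"
        f"Question: {question}\n"
        "Respond with a single SQL query."
    )

prompt = build_prompt(
    "How many accounts have an owner disposition?",
    ["account(account_id, frequency)", "disp(account_id, type)"],
)
print(prompt)
```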
What is text-to-SQL?
Text-to-SQL is a natural language processing (NLP) technology that converts plain language text queries into structured SQL (Structured Query Language) commands, which can be executed on databases. This technology bridges the gap between human communication and database query languages, allowing users without technical SQL knowledge to interact with databases effectively.
At its core, text-to-SQL uses sophisticated algorithms and machine learning models to understand the intent behind a user’s natural language question and translate it into the appropriate SQL syntax. For example, when a user types “Show me all customers from New York who made purchases last month,” a text-to-SQL system interprets this request and converts it into a precise SQL query with the correct SELECT, FROM, WHERE, and JOIN statements needed to retrieve the requested information.
The technology typically works through several processing layers:
- Natural language understanding (NLU): Analyzes the semantic structure of the text input to identify entities, relationships, and the query intent.
- Schema mapping: Maps the identified entities to the corresponding database tables and fields.
- Query construction: Builds the appropriate SQL query structure with the necessary joins, conditions, and aggregations.
- Validation and optimization: Ensures the generated SQL is valid and optimized for the specific database system.
Modern text-to-SQL systems employ advanced neural networks and transformer-based models (similar to BERT or GPT) that have been trained on vast datasets of text-to-SQL pairs, enabling them to handle increasingly complex queries and adapt to different database schemas.
Limitations and practical risks
The benchmark results, particularly those of the highest performers, indicate the growing potential of LLMs as tools for database interaction, for tasks such as:
- Generating draft queries: Assisting developers or analysts by producing initial SQL queries that can be reviewed and refined.
- Exploratory data analysis: Enabling users with some SQL knowledge to quickly formulate simpler queries without writing them from scratch.
- Democratizing simple data access: Allowing less technical users to ask basic questions of databases, provided the queries aren’t overly complex.
However, the results simultaneously underscore significant limitations. No model achieved near-perfect accuracy; even the best model produced incorrect SQL for 1 out of every 5 complex queries. This leads to several practical concerns:
- Reliability for critical tasks: An error rate of 20% (for the best) to over 40% (for lower-scoring models) is generally unacceptable for mission-critical applications, automated reporting, or any scenario where accuracy is paramount and human oversight is limited. Misleading results generated from subtly incorrect SQL (due to flawed joins, aggregations, or filters, as detailed in our error analysis) pose a significant risk.
- Dependence on human oversight: Given the observed error rates across all models, generated SQL must be validated by a knowledgeable human user before being trusted for important decisions or operations, especially for queries beyond basic lookups.
- Complexity ceiling: The benchmark (BIRD-SQL) emphasizes complex queries. The performance drop-off for many models suggests they struggle significantly as query complexity (multiple joins, nested logic, specific calculations) increases.
- Model selection: The wide performance variance (from 57% to 80%) highlights that not all LLMs are equally adept at text-to-SQL. Selecting a model suitable for the required complexity and accuracy is vital.
Current LLMs are powerful assistants for text-to-SQL, capable of augmenting human workflows. However, they are not yet reliable enough to function as fully autonomous, trustworthy query generators for demanding, high-stakes database interaction, primarily due to the persistent challenge of consistently generating correct SQL for complex requests.
Benchmark methodology for text-to-SQL
To evaluate the text-to-SQL capabilities of the eight LLMs, we utilized a methodology centered on an agentic retrieval-augmented generation (RAG) framework. This approach mirrors real-world applications where an AI agent must first identify the correct data source and then retrieve the necessary context to complete a task.
1. Dataset and ground truth: We used 500 questions from the “moderate” difficulty level of the BIRD-SQL dataset, a standard academic benchmark renowned for its complex, multi-table queries that mirror real-world business intelligence tasks. Each question is paired with its target database and a ground truth SQL query, providing a reliable basis for evaluation. The schemas are intentionally diverse and non-homogeneous to test the models’ adaptability.
2. Execution framework: An agentic RAG approach
Each LLM was configured to operate within an agentic framework built with LangChain, following a two-stage process for every question:
- Stage 1: Database selection (Routing): The agent was first presented with the natural language question and a list of available database “tools” (e.g., financial, california_schools). Its initial task was to act as a “router,” analyzing the user’s intent to determine which database was the most relevant to answer the question.
- Stage 2: Schema retrieval and query generation (RAG): Once the agent selected a database tool, it used the user’s question to retrieve the most relevant schema components (table definitions, column descriptions) from a ChromaDB vector store. This retrieved schema was then automatically inserted into a new prompt, providing the LLM with the precise context needed to generate the final SQL query.
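The Stage 1 router can be caricatured as keyword matching. The real agent uses the LLM itself via LangChain to choose a tool; the keyword lists below are illustrative assumptions:

```python
# Toy router: pick a database "tool" by keyword overlap with the question.
# The benchmark's agent makes this choice with an LLM; this is only a sketch.

DB_KEYWORDS = {
    "financial": {"account", "accounts", "disposition", "transaction", "loan"},
    "california_schools": {"school", "schools", "district", "charter", "sat"},
    "superhero": {"superhero", "power", "publisher"},
}

def route(question: str) -> str:
    words = {w.strip("?,.").lower() for w in question.split()}
    # Return the database whose keyword set overlaps the question most.
    return max(DB_KEYWORDS, key=lambda db: len(DB_KEYWORDS[db] & words))

q = "How many accounts have an owner disposition and request for a statement?"
print(route(q))  # financial
```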
3. Evaluation framework: A two-stage graded approach
We moved beyond simple execution accuracy to a more nuanced, two-stage evaluation focused on semantic correctness, using a powerful LLM as an expert grader.
- Stage 1: Database selection accuracy: The first metric was routing accuracy. If an LLM failed to select the correct database for a given question, it received a score of 0 for that task, as any subsequent SQL would be fundamentally incorrect. Therefore, SQL quality was only graded if a model passed this initial routing stage.
- Stage 2: SQL quality grading (LLM-as-judge): For queries where the correct database was selected, we used Claude 4 Sonnet as an expert “judge” to evaluate the predicted SQL against the ground truth. The judge assigned a score based on a detailed 5-point rubric measuring semantic equivalence: whether the two queries would produce the exact same result set.
- Score 5 (Perfect match): Semantically identical to the ground truth.
- Score 4 (Minor error): Correct core logic (joins, aggregations) but a minor typo in a name or value.
- Score 3 (Logic error): A significant error in a JOIN condition, aggregation, or WHERE clause.
- Score 2 (Wrong tables): A fundamental failure to understand the schema, using incorrect tables or columns.
- Score 1 (Invalid SQL): The output is not valid SQL or is a hallucination.
The final “Correct SQL Command Generation Rate” shown in our primary chart represents the percentage of total questions for which the LLM both selected the correct database and generated a query that earned a perfect score of 5. This rigorous, two-stage process ensures our results reflect the model’s ability to perform the end-to-end task accurately and reliably.
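The headline metric then reduces to a simple filter-and-count. The sample records below are invented for illustration:

```python
# "Correct SQL Command Generation Rate": a question counts as correct only
# when routing succeeded AND the judge awarded a perfect 5.

def generation_rate(records: list[dict]) -> float:
    correct = sum(
        1 for r in records
        if r["routed_correctly"] and r.get("judge_score") == 5
    )
    return correct / len(records)

records = [
    {"routed_correctly": True, "judge_score": 5},
    {"routed_correctly": True, "judge_score": 3},   # logic error: not counted
    {"routed_correctly": False},                    # wrong database: scored 0
    {"routed_correctly": True, "judge_score": 5},
]
print(generation_rate(records))  # 0.5
```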
Further reading
Explore other RAG benchmarks, such as:
- Embedding Models: OpenAI vs Gemini vs Cohere
- Top Vector Database for RAG: Qdrant vs Weaviate vs Pinecone
- Hybrid RAG: Boosting RAG Accuracy
- Agentic RAG benchmark: multi-database routing and query generation
Cem's work has been cited by leading global publications including Business Insider, Forbes, Washington Post, global firms like Deloitte, HPE and NGOs like World Economic Forum and supranational organizations like European Commission. You can see more reputable companies and resources that referenced AIMultiple.
Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.
He led technology strategy and procurement of a telco while reporting to the CEO. He has also led commercial growth of deep tech company Hypatos that reached a 7 digit annual recurring revenue and a 9 digit valuation from 0 within 2 years. Cem's work in Hypatos was covered by leading technology publications like TechCrunch and Business Insider.
Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.
