I have relied on SQL for data analysis for 18 years, beginning in my days as a consultant. Translating natural-language questions into SQL makes data more accessible, allowing anyone, even those without technical skills, to work directly with databases.
We used our text-to-SQL benchmark methodology on 24 large language models (LLMs) to assess their performance in SQL command generation.
Common errors in LLM-generated SQL
LLM-generated SQL most often exhibits four error types: faulty joins, aggregation mistakes, missing filters, and syntax errors.
Incorrect join logic
Models often struggled to correctly identify and implement the necessary `JOIN` operations between tables, sometimes omitting them entirely or replacing them with less suitable subqueries.
The LLM failed to correctly join the `frpm` and `schools` tables using the `CDSCode`. It also hallucinated column names (`Charter`) and filter values (`County = 'Fresno'`).
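A simplified sketch of this failure pattern is below. The `frpm` and `schools` table names and the `CDSCode` key come from the benchmark database; the specific columns and filter values are assumptions for illustration only.

```sql
-- What the model produced (simplified): no join to schools, a column that
-- does not exist in the table being queried, and a filter value that may
-- not match the stored data
SELECT COUNT(*)
FROM frpm
WHERE Charter = 1
  AND County = 'Fresno';

-- The required shape (illustrative): join frpm to schools on CDSCode and
-- take each filter from the table that actually holds that column
SELECT COUNT(*)
FROM frpm AS f
JOIN schools AS s ON f.CDSCode = s.CDSCode
WHERE s.Charter = 1
  AND s.County = 'Fresno';
```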
Errors in join logic fundamentally break the relational aspect of the query, leading to incomplete or incorrect data retrieval when multiple tables are involved.
Aggregation and grouping errors
Applying aggregate functions (like `MAX`, `AVG`, `COUNT`, `SUM`) or `GROUP BY` clauses incorrectly was another common failure point, leading to results that didn’t semantically match the user’s intent.
The LLM correctly identified that the phrase “highest average score” requires grouping the data by district (`GROUP BY dname`) and using an aggregate function (`AVG(AvgScrRead)`). This part of the logic is correct.
However, the LLM failed to incorporate a critical filter from the question: the word “active”. To fulfill this requirement, the query needed to `JOIN` the `satscores` table with the `schools` table and then filter the results with a `WHERE T1.StatusType = 'Active'` clause.
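A sketch of the intended query is below; the `cds`/`CDSCode` join key and the final `ORDER BY`/`LIMIT` step are assumptions based on the benchmark schema and the phrase “highest average score”.

```sql
-- Sketch of the query the question required (join key assumed)
SELECT T2.dname, AVG(T2.AvgScrRead) AS avg_read
FROM schools AS T1
JOIN satscores AS T2 ON T1.CDSCode = T2.cds
WHERE T1.StatusType = 'Active'   -- the filter the model omitted
GROUP BY T2.dname                -- the grouping the model got right
ORDER BY avg_read DESC
LIMIT 1;                         -- "highest average score"
```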
This highlights a common LLM failure: correctly executing a primary, obvious instruction (calculating an average) while missing a secondary but equally important condition (filtering by status). This shows a weakness in synthesizing multiple constraints into a single, correct query.
Missing or wrong filters
Models sometimes failed to include necessary `WHERE` clauses or selected the wrong columns in the `SELECT` statement, not fully addressing the constraints or the information explicitly requested in the prompt.
The LLM correctly identified the logic to find the school (`ORDER BY NumGE1500 DESC LIMIT 1`) but failed to select the requested `Phone` number and omitted the necessary join to the `schools` table to retrieve it.
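A sketch of the corrected query is below; the table holding `NumGE1500` and the `cds`/`CDSCode` join key are assumptions for illustration.

```sql
-- Sketch of the intended query
SELECT T1.Phone                                 -- the column the model failed to select
FROM schools AS T1
JOIN satscores AS T2 ON T1.CDSCode = T2.cds     -- the join the model omitted
ORDER BY T2.NumGE1500 DESC
LIMIT 1;
```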
These errors often stem from incomplete parsing of the user’s request or failure to map all parts of the request to the final SQL query components.
Syntax errors
Beyond semantic errors, outright syntax mistakes occurred, such as using incorrect table aliases or producing incomplete SQL statements, which prevent the query from executing.
The LLM used incorrect aliases (`accounts` instead of `account`) and included an incomplete string literal (`'POPLATEK PO OBRATU…'`), resulting in invalid SQL syntax.
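The shape of the failure can be sketched as follows; this is an annotated, simplified reconstruction of the invalid output, not runnable SQL (the correctly assembled query appears in the walkthrough later in this article).

```sql
-- Annotated sketch of the invalid output (simplified)
SELECT COUNT(*)
FROM accounts                              -- error: the table is named account, not accounts
WHERE frequency = 'POPLATEK PO OBRATU…     -- error: truncated, unterminated string literal
```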
These syntax issues highlight challenges in generating code that strictly adheres to SQL grammar and database-specific conventions.
Why some LLMs are better at SQL
Several factors determine how well an LLM can turn a plain-English question into a correct SQL query.
1. Model size and training data
- Size and design: Larger models or those built with specific structures may handle complex tasks, such as SQL generation, more effectively.
- What it learned: The data used to train the LLM is essential. If it has seen many examples of questions paired with SQL answers, especially ones involving complex operations such as joins or aggregate calculations (SUM, AVG), it will likely perform better.
2. Fine-Tuning for SQL tasks
- Models can be given extra training specifically focused on text-to-SQL tasks. This “fine-tuning” helps them understand database structures and SQL rules more effectively than models trained only on general text. Training on specific instructions also helps.
3. Reasoning and schema mapping abilities
- Reasoning: How well can the LLM figure out the exact steps needed from a sometimes vague question? Creating SQL often requires logical steps.
- Understanding the database map (Schema): Some LLMs are better at connecting concepts in the question (such as “customers” or “total sales”) to the actual table and column names in the database, even if the names aren’t immediately obvious.
How LLMs generate SQL: A step-by-step look
To see the factors like “reasoning” and “schema mapping” in action, let’s walk through the step-by-step process a model follows to generate a query. This entire workflow is powered by a technique called Retrieval-Augmented Generation (RAG).
Stage 1: Initial analysis and database selection
When presented with a question, the LLM first analyzes the user’s intent to select the most relevant database tool.
- Question: “How many accounts have an owner disposition and request for a statement to be generated upon a transaction?”
- LLM’s Action: The model identifies keywords like “accounts,” “disposition,” and “transaction.” It concludes that the `financial` database tool is the correct choice over others like `california_schools` or `superhero`.
Stage 2: Retrieving the schema via RAG
Once the model chooses a tool, it needs the database’s “map”: the schema. It doesn’t have this information memorized. Instead, the RAG system retrieves it in real time.
- Retrieval: The user’s question is used to search a vector database that stores schema information. The search finds and retrieves the most relevant schema details, such as the definitions for the `account` and `disp` tables.
- Augmentation: This retrieved schema text is automatically inserted into the prompt alongside the original question.
- Generation: The LLM now has all the context it needs to proceed.
This RAG process ensures the model receives only the necessary schema information, making its task more focused and efficient.
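For the financial question above, the retrieved schema context might look roughly like the excerpt below. The column lists are trimmed and the exact definitions and descriptions in the benchmark differ; treat this as an illustration of what the model receives, not the actual retrieved text.

```sql
-- Illustrative excerpt of retrieved schema context (columns trimmed)
CREATE TABLE account (
    account_id  INTEGER PRIMARY KEY,
    district_id INTEGER,
    frequency   TEXT,    -- e.g. 'POPLATEK PO OBRATU' = statement issued upon a transaction
    "date"      DATE
);

CREATE TABLE disp (
    disp_id     INTEGER PRIMARY KEY,
    client_id   INTEGER,
    account_id  INTEGER, -- links disp back to account
    type        TEXT     -- e.g. 'OWNER', 'DISPONENT'
);
```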
Stage 3: Reasoning and query construction
With the question and the RAG-provided schema, the model maps concepts from the user’s request to the specific table and column names it just received.
LLM’s internal monologue:
- Goal: The user wants a count, so I’ll start with `SELECT COUNT(...)`.
- Conditions:
  - “…owner disposition…” -> The `disp` table schema has a `type` column. I need a `WHERE` clause for `type = 'OWNER'`.
  - “…statement to be generated upon a transaction…” -> The `account` table schema has a `frequency` column. The filter should be `frequency = 'POPLATEK PO OBRATU'`.
- Joins: The information is split across the `account` and `disp` tables. The schema shows they are linked by `account_id`, so I need to `JOIN` them.
Stage 4: Generating the final SQL
Finally, the model assembles these logical pieces into a syntactically correct SQL query. The quality of this output depends on:
- Reasoning capability: The model’s ability to logically connect the user’s request to the provided schema.
- SQL knowledge from training: The model’s core understanding of SQL syntax and functions.
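Assembled from the reasoning steps above, the final query might look like the sketch below; aliases and exact formatting vary by model.

```sql
-- Final query assembled from the walkthrough (sketch)
SELECT COUNT(T1.account_id)
FROM account AS T1
JOIN disp AS T2 ON T1.account_id = T2.account_id
WHERE T2.type = 'OWNER'                        -- "owner disposition"
  AND T1.frequency = 'POPLATEK PO OBRATU';     -- "statement generated upon a transaction"
```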
This process explains why errors occur. If the retrieved schema is ambiguous or a term in the question doesn’t map cleanly, the LLM must make an educated guess, which can lead to the errors we analyzed earlier.
What is text-to-SQL?
Text-to-SQL is a natural language processing technology that converts everyday language into a query written in Structured Query Language (SQL). Instead of manually writing SQL code, a user asks a question in natural language, and the system generates a SQL statement that can be executed on a database.
The main purpose of text-to-SQL is to reduce the gap between how people think about data and how databases require queries to be written. This is especially relevant for non-technical users and data analysts who understand the business context but may not be comfortable writing SQL syntax from scratch.
At a basic level, when a user asks a question such as:
- “Show all customers from New York who made purchases last month.”
The system translates that request into a generated SQL query that selects the correct columns, filters rows using date and location constraints, and joins the necessary database tables. The quality of the output depends on whether the system can generate accurate queries that reflect both the user intent and the database schema.
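For a hypothetical schema with `customers` and `orders` tables, the generated query might resemble the sketch below; the table names, columns, and date logic are assumptions for illustration (SQLite-style date functions shown).

```sql
-- Hypothetical translation of the request (schema and date logic assumed)
SELECT c.customer_id, c.name, o.order_id, o.order_date
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.customer_id
WHERE c.city = 'New York'
  AND o.order_date >= DATE('now', 'start of month', '-1 month')  -- first day of last month
  AND o.order_date <  DATE('now', 'start of month');             -- first day of this month
```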
Where text-to-SQL is useful today
Text-to-SQL works reasonably well for:
- Generating draft queries that data analysts can review and adjust.
- Supporting exploratory data analysis where speed matters more than precision.
- Allowing non-technical users to access simple data through predefined schemas.
- Assisting SQL users by reducing the need to write repetitive queries.
In these cases, text-to-SQL functions as an assistive AI tool rather than an autonomous system. Human review remains part of the workflow, especially when correctness matters.
How does text-to-SQL work?
Modern text-to-SQL systems rely on large language models trained on pairs of natural language questions and SQL queries. These models learn patterns that connect everyday language to SQL structures, table names, columns, and relationships. The process typically follows a sequence of steps:
Natural language understanding
The system first analyzes the user input to determine intent, constraints, and entities. This step involves:
- Identifying what the user is asking for (for example, totals, filters, comparisons)
- Extracting relevant conditions such as time ranges, locations, or categories
- Interpreting ambiguous phrases that may require business context
Errors at this stage often lead to a correct-looking SQL query that answers the wrong question.
Schema mapping
Next, the system maps terms from the question to the database schema. This includes:
- Matching concepts in the question to table names and columns
- Understanding relationships between tables
- Respecting data types, such as dates, numeric fields, or categories
Schema mapping becomes more challenging as the number of tables grows or when column names do not closely match how users describe data in natural-language questions.
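The vocabulary gap can be illustrated with a small hypothetical table: none of the column names match the words a user would naturally use.

```sql
-- Hypothetical table showing why schema mapping is hard: the user's words
-- rarely match stored names or values exactly
CREATE TABLE cust_master (
    cust_id  INTEGER PRIMARY KEY,
    cust_nm  TEXT,   -- the "customer name" mentioned in a question
    st_cd    TEXT,   -- "New York" in a question may be stored as 'NY'
    seg_cd   TEXT    -- a category a user might call "customer type"
);
```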
SQL query construction
Once intent and schema elements are identified, the system constructs the SQL query. This may involve:
- Selecting the correct tables and columns
- Adding joins across all the tables needed
- Applying filters, aggregations, and grouping logic
- Producing syntactically valid SQL code for systems like MySQL or PostgreSQL
At this stage, the system can easily produce valid but logically incorrect SQL, for example, by using the wrong join condition or aggregation.
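The hypothetical pair below illustrates this: both statements are syntactically valid against an assumed `customers`/`orders` schema, but only the second answers “how many customers have placed an order”.

```sql
-- Valid but logically wrong: joins on the wrong key, so unrelated rows are
-- matched, and COUNT(*) counts join rows rather than customers
SELECT COUNT(*)
FROM customers AS c
JOIN orders AS o ON o.order_id = c.customer_id;

-- Intended logic: join on the customer key and count each customer once
SELECT COUNT(DISTINCT c.customer_id)
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.customer_id;
```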
Validation and execution
Some systems include validation layers that verify that the generated SQL query can execute and return results. More advanced tools may attempt limited optimization or ask follow-up questions when the query is ambiguous.
However, validation rarely guarantees a correct answer. A query can execute successfully and still be incorrect in subtle ways.
Limitations and practical risks
Despite strong benchmark scores, real-world use exposes several limitations that cannot be ignored.
Reliability and correctness
Even top-performing models fail to produce correct SQL for a significant share of complex queries. An error rate of 20% or higher means:
- One in five generated queries may return misleading results
- Errors are often semantic rather than syntactic
- Incorrect joins, filters, or aggregations may go unnoticed
This is particularly risky in reporting, forecasting, or decision-support systems, where users assume the output is correct.
Dependence on human oversight
Given the current performance, the generated SQL must be reviewed by someone who understands SQL and the database. Without this oversight:
- Users may trust an incorrect query because it executes successfully
- Errors may propagate into dashboards, reports, or downstream systems
- Accountability becomes unclear when decisions rely on AI-generated outputs
Text-to-SQL does not remove the need for SQL expertise; it shifts where that expertise is applied.
Complexity ceiling
As query complexity increases, performance drops sharply. Models struggle with:
- Multiple joins across many tables
- Nested logic and subqueries
- Domain-specific calculations
- Queries that require deep knowledge of the database schema
Benchmarks such as BIRD-SQL highlight that complex queries remain the primary failure point, even for advanced models.
Model variability
Performance differences between models are significant. Some language models perform reasonably well, while others fail frequently on the same dataset. This means:
- Model selection has a direct impact on accuracy
- Fine-tuning and training data matter
- General-purpose models may not work well without domain adaptation
There is no universal solution that works equally well across databases and use cases.
Data governance and privacy
Text-to-SQL systems introduce additional access risks:
- Users may query sensitive tables without understanding implications
- Generated SQL can expose metadata about the database schema
- Data privacy controls must be enforced outside the language model
Without strong access controls, text-to-SQL can weaken existing governance practices.
Benchmark methodology for text-to-SQL
To evaluate the text-to-SQL capabilities of the eight LLMs, we utilized a methodology centered on an agentic retrieval-augmented generation (RAG) framework. This approach mirrors real-world applications where an AI agent must first identify the correct data source and then retrieve the necessary context to complete a task.
1. Dataset and ground truth: We used 500 questions from the “moderate” difficulty level of the BIRD-SQL dataset, a standard academic benchmark renowned for its complex, multi-table queries that mirror real-world business intelligence tasks. Each question is paired with its target database and a ground truth SQL query, providing a reliable basis for evaluation. The schemas are intentionally diverse and non-homogeneous to test the models’ adaptability.
2. Execution framework: An agentic RAG approach
Each LLM was configured to operate within an agentic framework built with LangChain, following a two-stage process for every question:
- Stage 1: Database selection (Routing): The agent was first presented with the natural language question and a list of available database “tools” (e.g., financial, california_schools). Its initial task was to act as a “router,” analyzing the user’s intent to determine which database was the most relevant to answer the question.
- Stage 2: Schema retrieval and query generation (RAG): Once the agent selected a database tool, it used the user’s question to retrieve the most relevant schema components (table definitions, column descriptions) from a ChromaDB vector store. This retrieved schema was then automatically inserted into a new prompt, providing the LLM with the precise context needed to generate the final SQL query.
3. Evaluation Framework: A two-stage graded approach
We moved beyond simple execution accuracy to a more nuanced, two-stage evaluation focused on semantic correctness, using a powerful LLM as an expert grader.
- Stage 1: Database selection accuracy: The first metric was routing accuracy. If an LLM failed to select the correct database for a given question, it received a score of 0 for that task, as any subsequent SQL would be fundamentally incorrect. Therefore, SQL quality was only graded if a model passed this initial routing stage.
- Stage 2: SQL quality grading (LLM-as-judge): For queries where the correct database was selected, we used Claude 4 Sonnet as an expert “judge” to evaluate the predicted SQL against the ground truth. The judge assigned a score based on a detailed 5-point rubric measuring semantic equivalence: whether the two queries would produce the exact same result set.
- Score 5 (Perfect match): Semantically identical to the ground truth.
- Score 4 (Minor error): Correct core logic (joins, aggregations) but a minor typo in a name or value.
- Score 3 (Logic error): A significant error in a JOIN condition, aggregation, or WHERE clause.
- Score 2 (Wrong tables): A fundamental failure to understand the schema, using incorrect tables or columns.
- Score 1 (Invalid SQL): The output is not valid SQL or is a hallucination.
The final “Correct SQL Command Generation Rate” shown in our primary chart represents the percentage of total questions for which the LLM both selected the correct database and generated a query that earned a perfect score of 5. This rigorous, two-stage process ensures our results reflect the model’s ability to perform the end-to-end task accurately and reliably.
Further reading
Explore other RAG benchmarks, such as:
- Embedding Models: OpenAI vs Gemini vs Cohere
- Top Vector Database for RAG: Qdrant vs Weaviate vs Pinecone
- Hybrid RAG: Boosting RAG Accuracy
- Agentic RAG benchmark: multi-database routing and query generation