Tabular Models Benchmark: Performance Across 19 Datasets 2026

Cem Dilmegani
updated on Jan 26, 2026

We benchmarked 7 widely used tabular learning models across 19 real-world datasets, covering ~260,000 samples and over 250 total features, with dataset sizes ranging from 435 to nearly 49,000 rows.

Our goal was to identify the top-performing model families for datasets of different sizes and structures (e.g., numeric vs. categorical) that make up a typical enterprise data architecture.

Tabular learning models benchmark results

In the chart, the winning model receives 1 point. In case of a draw, the point is shared equally among the tied models. Win rate measures how often a model finishes first within a given regime, providing a stricter view of dominance than average rank.
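
As a minimal sketch, the point-sharing and win-rate rules described above can be computed as follows (dataset names and scores are illustrative, not the benchmark's actual numbers):

```python
from collections import defaultdict

# Illustrative per-dataset scores: {dataset: {model: score}} (made-up numbers)
scores = {
    "dataset_a": {"XGBoost": 0.91, "TabPFN": 0.93, "TabICL": 0.93},
    "dataset_b": {"XGBoost": 0.88, "TabPFN": 0.86, "TabICL": 0.85},
}

points = defaultdict(float)
for model_scores in scores.values():
    best = max(model_scores.values())
    winners = [m for m, s in model_scores.items() if s == best]
    for m in winners:
        points[m] += 1 / len(winners)  # the winner gets 1 point; ties share it

# Win rate: how often a model finishes first (ties counted fractionally)
win_rate = {m: p / len(scores) for m, p in points.items()}
print(win_rate)  # e.g. {'TabPFN': 0.25, 'TabICL': 0.25, 'XGBoost': 0.5}
```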

Different models win under different structural conditions, and the success rate varies with dataset size and feature composition.

In particular:

  • Foundation models are the most successful when the data is limited
  • XGBoost is the sole consistent winner on large + numeric datasets
  • On large + hybrid datasets:
    • Wins are distributed across TabICL, LightGBM, and Logistic Regression
    • Hybrid data at scale is the most ambiguous regime, where multiple approaches remain viable

Disclaimer: Feature types are categorized as numeric or hybrid based on the dominant input representation after preprocessing.

How to interpret the dataset mix:

  • Size buckets range from small datasets with fewer than 1,000 rows to large datasets with more than 40,000 rows.
  • Task types include binary classification, multiclass classification, and regression.
  • Feature types reflect practical enterprise data:
    • Numeric: primarily continuous or ordinal variables
    • Hybrid: a mix of numeric and categorical features

This variation makes the benchmark well-suited for understanding which model families perform reliably under different conditions.

You can see our methodology below.

High-level results by dataset size and feature type

This section looks at how models behave across dataset size buckets and feature types, rather than at individual dataset scores.

For each dataset size bucket, the chart reports the average ROC-AUC achieved by each model, separately for numeric and hybrid datasets.
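
As a rough sketch of the aggregation behind such a chart (column names and values are assumptions, not the benchmark's actual data):

```python
import pandas as pd

# Hypothetical long-format results: one row per (dataset, model) pair
results = pd.DataFrame({
    "model":        ["XGBoost", "TabPFN", "XGBoost", "TabPFN"],
    "size_bucket":  ["small",   "small",  "large",   "large"],
    "feature_type": ["numeric", "numeric", "hybrid",  "hybrid"],
    "roc_auc":      [0.88, 0.94, 0.92, 0.90],
})

# Average ROC-AUC per model, split by dataset size bucket and feature type
summary = (
    results
    .groupby(["size_bucket", "feature_type", "model"])["roc_auc"]
    .mean()
    .unstack("model")
)
print(summary)
```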

Small datasets (<1K rows)

On small datasets, foundation-style tabular models are the most successful.

  • TabPFN and TabICL achieve the strongest performance on both numeric and hybrid datasets.
  • The performance gap is especially pronounced on hybrid datasets
  • Logistic regression performs competitively on numeric data, but degrades sharply on hybrid data

When data is scarce, models with strong inductive bias outperform both boosting and neural baselines. In this regime, prior knowledge and learned feature interactions matter more than model capacity.

Medium datasets (1K–10K rows)

On medium-sized datasets, overall performance improves, but structural differences remain.

  • All models perform strongly on numeric datasets (often exceeding 97% ROC-AUC)
  • Hybrid datasets remain more challenging.
  • TabPFN and TabICL continue to lead, but the gap is closer.

Medium-sized datasets represent a transition regime: signal density increases, but inductive bias still provides a measurable advantage, particularly on mixed feature types.

Large datasets (>10K rows)

At scale, performance patterns shift.

  • On large numeric datasets, XGBoost and TabICL perform better than other models.
  • On large + hybrid datasets, performance converges:
    • Differences are smaller, and model choice becomes less obvious

At scale, classic gradient boosting fully exploits the numeric signal. For hybrid data, robustness and categorical handling matter more than raw model complexity.

Average rank by regime

Models are ranked within each regime (dataset size × feature type).
Ranks are normalized so that higher values indicate stronger relative performance, making cross-regime comparisons easier.
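
One common way to implement this kind of normalization is sketched below (illustrative scores; not necessarily the exact scheme used here):

```python
import pandas as pd

# Hypothetical scores for one regime (dataset size x feature type)
regime = pd.DataFrame({
    "dataset": ["d1", "d1", "d1", "d2", "d2", "d2"],
    "model":   ["XGBoost", "TabPFN", "LogReg"] * 2,
    "roc_auc": [0.90, 0.93, 0.85, 0.88, 0.91, 0.84],
})

# Rank models within each dataset (1 = best), then rescale to [0, 1]
# so that higher values mean stronger relative performance.
regime["rank"] = regime.groupby("dataset")["roc_auc"].rank(ascending=False)
n_models = regime.groupby("dataset")["model"].transform("count")
regime["normalized_rank"] = (n_models - regime["rank"]) / (n_models - 1)

print(regime.groupby("model")["normalized_rank"].mean())
```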

Small datasets

On small datasets, foundation-style models dominate the rankings.

  • TabPFN and TabICL rank first on both the numeric and hybrid datasets
  • Gradient boosting models consistently rank near the bottom
  • The gap between foundation models and boosting is larger on hybrid data

Average rank highlights the same pattern observed in raw performance:
When data is scarce, learned priors and inductive bias outweigh scale-driven optimization.

Medium datasets

On medium-sized datasets, rankings begin to shift.

  • TabPFN and TabICL remain top-ranked across both feature types
  • CatBoost emerges as a strong third option on hybrid datasets
  • Boosting models improve their relative position compared to the small-data regime

This regime reflects a balance point. Data volume increases, but feature interactions still reward models with stronger inductive bias.

Large datasets

On large datasets, dominance becomes regime-specific.

  • Large + numeric:
    • XGBoost ranks first by a small margin, with TabICL close behind.
  • Large + hybrid:
    • No single model dominates
    • TabICL, LightGBM, CatBoost, and TabPFN all achieve similar average ranks

Average rank confirms that model superiority is conditional, not universal.
Strong overall rankings often mask sharp performance differences across regimes.

Model-specific observations

This section summarizes where each model class performs well and where it struggles, based on the full set of results.

Foundation-style models: TabPFN and TabICL

Strengths

  • Consistently top-performing on small and medium datasets
  • Particularly strong on hybrid datasets, where categorical structure matters
  • High win rates on small datasets

Limitations

  • Less dominant on large + numeric datasets
  • Practical constraints (feature limits, task support) affect applicability

Foundation-style models are best suited for data-scarce or mixed-feature problems, especially when rapid performance without extensive tuning is required.

Gradient boosting models: XGBoost and LightGBM

Strengths

  • Competitive on large datasets
  • Strong and stable performance as data volume increases
  • Remain competitive on hybrid data at scale

Limitations

  • Underperform foundation models on smaller datasets
  • Require careful preprocessing and tuning for categorical-heavy data

Gradient boosting remains the default choice for large numeric tables, and a strong baseline even in mixed-feature regimes.

CatBoost

Strengths

  • Most robust model on hybrid datasets, particularly at larger scales
  • Native categorical handling provides consistent gains
  • Rarely performs poorly across regimes

Limitations

  • Rarely the top performer
  • Less dominant on purely numeric datasets

CatBoost is the safest choice when categorical features dominate, especially in medium-to-large datasets.

RealMLP

Observations

  • Rarely wins across regimes
  • Often ranks near the bottom, except on a small number of datasets

Generic neural MLPs struggle on tabular data without strong inductive bias, reinforcing a long-standing lesson in applied machine learning.1

Logistic regression (baseline)

Observations

  • Competitive on numeric datasets, even at scale
  • Occasionally wins or ranks highly on hybrid datasets
  • Performance degrades sharply when feature interactions dominate

Despite its simplicity, logistic regression remains a meaningful baseline and should not be skipped in tabular benchmarks.

Key takeaways of the tabular learning models benchmark

Across 19 real-world datasets, tabular model performance is driven primarily by feature structure, not by model complexity or dataset size alone.

Rather than asking:

“Which tabular model is best?”

A more actionable question is:

“Given my dataset size and feature composition, which class of models is likely to work?”

That perspective offers greater practical value than leaderboard-style rankings and better aligns with real-world enterprise decision-making.

Conceptual foundations of foundation-style tabular models

Foundation-style tabular models aim to generalize across diverse tabular datasets by learning strong priors over table structure, feature interactions, and task behavior, rather than optimizing for a single dataset.

Unlike traditional tabular models, which are trained independently for each dataset, foundation-style approaches are pretrained on large collections of tabular problems and then applied to new datasets through inference-time adaptation.

In this benchmark, TabPFN and TabICL represent two prominent approaches within this paradigm.

Key capabilities of foundation-style tabular models

Foundation-style tabular models typically exhibit the following capabilities:

  • Strong inductive bias: By learning common patterns across many tabular datasets, these models encode assumptions about feature interactions, target distributions, and noise characteristics that generalize well to unseen problems.
  • Unified handling of feature types: Numeric and categorical features are embedded into a shared representation space, allowing the model to reason over mixed-feature tables without extensive manual preprocessing.
  • Inference-time adaptation: Rather than retraining, these models adapt to new datasets using context examples or dataset-level statistics, enabling strong performance under data scarcity.
  • Transfer across tasks: A single pretrained model can perform classification or regression on previously unseen datasets, often with minimal configuration.

These properties explain why foundation-style models perform particularly well on small and medium datasets, where classical methods lack sufficient data to fully estimate complex feature interactions.

TabPFN: Prior-data fitting for tabular prediction

TabPFN (Tabular Prior-Data Fitted Network) reframes tabular learning as a Bayesian inference problem.

Instead of learning parameters for a single dataset, TabPFN is trained on millions of synthetic tabular tasks sampled from a distribution of data-generating processes. During inference, the model effectively performs amortized Bayesian inference, conditioning on the observed dataset to produce predictions.

Key characteristics of TabPFN include:

  • A transformer architecture that processes entire datasets as context.
  • Training on a wide distribution of synthetic tasks to encode general-purpose priors.
  • Strong performance in low-data regimes without hyperparameter tuning.2

In practice, this design enables TabPFN to outperform traditional boosting methods on small and medium hybrid datasets, as observed in the benchmark.

However, because the model relies on learned priors rather than scale-driven optimization, its advantage diminishes as the dataset size increases.
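
In code, TabPFN is typically used through a scikit-learn-style interface. The sketch below assumes the tabpfn package's TabPFNClassifier and a public sklearn dataset; it is an illustration of the paradigm, not the benchmark's exact setup:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier  # assumes the `tabpfn` package is installed

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# No hyperparameter tuning: the pretrained prior does the heavy lifting.
clf = TabPFNClassifier()
clf.fit(X_train, y_train)            # "fitting" mostly stores the context
proba = clf.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, proba))
```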

TabICL: In-context learning for tabular data

TabICL extends the idea of in-context learning to tabular prediction.

Instead of fitting model parameters, TabICL conditions on examples from the dataset provided directly in the input context. The model learns to infer decision rules from these examples, similar to how large language models perform few-shot learning.

Key aspects of TabICL include:

  • Dataset rows encoded as structured tokens
  • Task adaptation through context examples rather than gradient-based training
  • A single pretrained model capable of handling diverse tabular tasks3

This approach allows TabICL to achieve strong performance on hybrid datasets, especially when feature interactions are complex and labeled data is limited.

As with TabPFN, performance gains are strongest under data scarcity and become less pronounced on large numeric datasets, where traditional boosting fully exploits the available signal.
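
If the tabicl package exposes a scikit-learn-compatible TabICLClassifier (an assumption on our part, not verified against the version used in this benchmark), usage mirrors TabPFN's fit/predict pattern:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from tabicl import TabICLClassifier  # assumption: sklearn-compatible wrapper

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# "Fitting" stores the labeled rows as in-context examples;
# prediction conditions the pretrained transformer on that context.
clf = TabICLClassifier()
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```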

Why foundation-style models lose dominance at scale

The benchmark results highlight an important limitation of foundation-style tabular models.

On large numeric datasets, models such as XGBoost outperform foundation approaches. This reflects a fundamental trade-off:

  • Foundation models rely on learned priors and generalization across tasks.
  • Gradient boosting exploits dataset-specific signal through iterative optimization.4

When sufficient data is available, scale-driven methods can fully learn feature interactions directly from the dataset, reducing the relative value of pretrained priors.

This explains why foundation-style models excel under data scarcity, while classic boosting dominates at scale.

Methodology of tabular learning models benchmark

We benchmark 7 ML models on 19 tabular datasets using 5-fold stratified cross-validation.

Models:

  • LogisticRegression – Linear baseline
  • XGBoost – Gradient boosting
  • LightGBM – Gradient boosting
  • CatBoost – Gradient boosting with native categorical support
  • RealMLP – Deep learning (MLP)
  • TabPFN – Transformer-based prior-fitted network
  • TabICL – Transformer-based in-context learning

19 datasets from OpenML:

  • Binary classification: 14 datasets
  • Multiclass classification: 1 dataset
  • Regression: 4 datasets
  • Dataset sizes range from ~600 to ~45,000 samples.

Evaluation

Cross-Validation

  • 5-fold stratified CV for classification
  • 5-fold CV for regression
  • Same random seed (42) across all experiments
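
A minimal sketch of this evaluation protocol in scikit-learn (the regression scorer is an assumption, since the regression metric is not specified above):

```python
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

SEED = 42  # same random seed across all experiments

def evaluate(model, X, y, task):
    """Run the 5-fold protocol for one model on one dataset."""
    if task == "classification":
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
        scoring = "roc_auc"  # the score reported throughout this article
    else:
        cv = KFold(n_splits=5, shuffle=True, random_state=SEED)
        scoring = "neg_root_mean_squared_error"  # assumption: metric not stated
    return cross_val_score(model, X, y, cv=cv, scoring=scoring).mean()
```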

Metrics

  • Classification: ROC-AUC (the score reported throughout this benchmark)

Preprocessing

  • Numerical features: StandardScaler
  • Categorical features: One-hot encoding (except CatBoost, which handles them natively)
  • Missing values: Median imputation (numerical), mode imputation (categorical)
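
These preprocessing steps map naturally onto a scikit-learn ColumnTransformer; the sketch below shows the non-CatBoost path:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def make_preprocessor(numeric_cols, categorical_cols):
    """Median/mode imputation, scaling, and one-hot encoding as listed above."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),  # unseen categories ignored
    ])
    return ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])
```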

Limitations

  • TabPFN: Limited to datasets with ≤500 features after preprocessing
  • TabICL: Classification tasks only (no regression support)
  • Sample size: TabPFN uses a maximum of 10,000 training samples

Reproducibility

All experiments use:

  • Fixed random seed: 42
  • Same train/test splits across models
  • Default hyperparameters (no tuning)
