We benchmarked SAP-RPT-1-OSS against gradient boosting (LightGBM, CatBoost) on 17 tabular datasets spanning the full semantic-numerical spectrum: small high-semantic tables, mixed business datasets, and large low-semantic numerical datasets.
Our goal is to measure where a relational LLM’s pretrained semantic priors may provide advantages over traditional tree models and where they face challenges under scale or low-semantic structure.
SAP-RPT-1-OSS vs. Gradient Boosting: Benchmark results
- Success Rate: Represents the average normalized score (0.0 to 1.0). A higher bar indicates the model is consistently closer to the best possible performance for datasets in that category.
- 100 – 500 rows (3 Datasets):
- Included: wine (178), sonar (208), vote (435).
- Result: SAP performs best on 2 of 3 datasets. It achieves the highest scores on wine and sonar, suggesting that LLM priors may be beneficial when training data is scarce. However, CatBoost secured a narrow victory on the vote dataset (within 0.1%), indicating that tree models remain highly competitive even at small scales.
- 501 – 1,000 rows (3 Datasets):
- Included: cylinder_bands (540), breast_cancer (569), credit_g (1,000).
- Result: SAP performs best on all 3 datasets. On cylinder_bands, SAP outperformed LightGBM by a 5.5% margin, potentially due to better handling of semantic descriptions of industrial defects, though further ablation studies would be needed to confirm this mechanism.
- 1,000 – 10,000 rows (5 Datasets):
- Included: titanic (1.3K), car_evaluation (1.7K), spambase (4.6K), compas (5.2K), employee_salaries (9.2K).
- Result: SAP achieves best results on 4 out of 5 datasets, performing particularly well on text-heavy tasks like spambase and titanic. However, CatBoost significantly outperforms SAP on compas by 10.4%, indicating dataset-specific characteristics that favor tree models even in this size range.
- 10,000+ rows (6 Datasets):
- Included: california_housing (20K), house_sales (21K), default_credit (30K), adult_income (48K), diamonds (53K), higgs_100k (98K).
- Result: As data volume grows, the potential “prior knowledge” advantage of the LLM diminishes. LightGBM and CatBoost achieve best results on 5 out of 6 datasets, offering better accuracy at a fraction of the computational cost. The one exception, california_housing, shows only a modest 1.7% advantage for SAP.
1. Benchmark results: Per-dataset table
Below is the complete breakdown of model performance across all 17 datasets.
2. Cost & efficiency analysis
We calculated the direct computational cost for each model based on the RunPod H200 instance pricing of $3.59/hour.
SAP-RPT-1-OSS incurs significantly higher costs due to the time required for text embedding preprocessing and the LLM architecture’s heavy memory overhead. In contrast, LightGBM and CatBoost complete tasks almost instantly on this hardware. The costs below reflect the total wall-clock time (preprocessing + training) for a 3-fold cross-validation run.
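As a concrete illustration of how these costs are derived, the short sketch below converts wall-clock time into dollars at the $3.59/hour rate; the 12-minute timing in the example is a placeholder matching the adult_income case discussed below, not a general measurement.

```python
# Convert measured wall-clock time into dollar cost at the RunPod H200 rate.
HOURLY_RATE_USD = 3.59  # RunPod H200 price per hour (≈ $0.06 per minute)

def run_cost(wall_clock_seconds: float) -> float:
    """Total cost of one run (preprocessing + training + inference) in USD."""
    return wall_clock_seconds / 3600.0 * HOURLY_RATE_USD

# Example: a 12-minute, 3-fold run (placeholder timing)
print(f"${run_cost(12 * 60):.2f}")  # -> $0.72
```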
Average cost per dataset (across all 17 datasets)
Cost breakdown by dataset size
- Small Datasets (<1K rows): SAP is relatively cheap (≈ $0.03 per run). The high win rate here makes the cost negligible.
- Large Datasets (>20K rows): SAP becomes expensive.
- Example: Training on adult_income (48k rows) takes ≈12 minutes of wall-clock time for 3 folds.
- Cost: 12 min × $0.06/min ≈ $0.72 per experiment.
- Comparison: LightGBM finishes the same task for $0.01.
Conclusion: While $0.22 per dataset is not expensive in absolute terms, SAP is 22x more expensive than the baseline. This cost differential may be justified for small, semantic-rich datasets where SAP shows meaningful accuracy improvements (e.g., cylinder_bands with +5.5% lift), but becomes harder to justify for large datasets where tree models achieve equal or better performance at a fraction of the cost.
3. Analysis framework: The Semantic Spectrum
To interpret these results, it is crucial to understand how we selected the data. We did not choose datasets at random; we curated a suite of 17 datasets specifically chosen to span the Semantic-Numerical Spectrum.
Our core hypothesis was that SAP (being LLM-based) would excel where data has linguistic meaning, while Tree models would dominate in raw numerical calculation. We categorized our datasets into three distinct clusters:
Cluster A: High-semantic datasets (6 datasets)
Characteristics: Features contain rich text descriptions, categorical labels with real-world meaning (e.g., “physician fee freeze”), or domain-specific terminology.
- Datasets:
- cylinder_bands: Industrial printing defects.
- titanic: Passenger names and titles.
- vote: Congressional voting records (Categorical “Yes/No” on policies).
- breast_cancer: Medical tumor descriptions.
- spambase: Email word frequencies.
- wine: Chemical composition indicating origin.
Cluster B: Mixed business data (6 datasets)
Characteristics: The standard tabular format found in most enterprise databases, mixing numeric values (salary, age) with categorical strings (job title, race, department).
- Datasets:
- employee_salaries: Job titles vs. salary.
- compas: Criminal history and demographics (Sensitive attributes).
- adult_income: Census demographics.
- credit_g: German credit risk profiles.
- default_credit: Taiwan credit default data.
- car_evaluation: Vehicle buying parameters.
Cluster C: Low-semantic/pure numerical data (5 datasets)
Characteristics: Features are abstract measurements, sensor readings, or physics coordinates. The column names often don’t matter; only the mathematical relationships do.
- Datasets:
- higgs_100k: Physics particle kinematics.
- diamonds: Physical dimensions and price.
- sonar: Frequency energy bounces.
- california_housing: Lat/Long coordinates and census stats.
- house_sales: King County real estate (mostly numerical features).
4. Deep dive: Where SAP wins vs. fails
Applying the analysis framework to our results reveals four distinct performance patterns. The table below summarizes exactly where SAP excels and where it breaks down.
Foundation models for relational and structured data
Foundation models have changed the way machine learning systems are designed and trained. Models such as GPT for language and CLIP for vision are trained on large, diverse datasets and can generalize across tasks with little or no fine-tuning. However, this progress has not yet reached relational databases, which remain the primary source of structured data in enterprise systems.
Relational data is stored in multiple tables, connected via primary and foreign keys. These tables contain heterogeneous data, including text, numbers, and timestamps. Despite their importance, most predictive systems in enterprise data warehouses still rely on manual feature engineering and task-specific models. As organizations seek predictive insights directly from their relational schemas, the need for relational foundation models becomes clear.
A relational foundation model should work across different databases and schemas, adapt to new datasets, and answer predictive questions such as “Which customers are likely to churn?” or “What products will sell next month?” without requiring manual data restructuring.
Conceptual foundations of relational foundation models
The main goal of a relational foundation model is to make accurate predictions and perform diverse tasks over structured tables. These models must understand how information is represented across different tables, how entities are linked through relationships, and how temporal information influences outcomes.
Key capabilities of such models include:
- Schema generalization: The ability to adapt to new relational schemas without retraining from scratch.
- Unified input representation: Handling different column types such as numerical, categorical, and textual features.
- Integration of temporal and structural context: Capturing dependencies across time and between entities linked by primary and foreign keys.
- Transferability: Performing predictive tasks on new datasets through pre-training and zero-shot learning.
Griffin
Griffin is one of the first large-scale attempts to build a unified relational foundation model. It represents relational data as a temporal, heterogeneous graph, where each row becomes a node and edges correspond to foreign-key relationships. Key features include:
Unified feature encoder
- Categorical and text features are encoded with a pre-trained text encoder, while numerical values use a learned float encoder.
- Metadata such as table names, column names, and edge types are embedded to help the model recognize the relational schema.
- Task embeddings enable a single model to perform regression and classification tasks with shared decoders.
Message passing and attention
Griffin integrates message passing neural networks with a cross-attention module. The message-passing component aggregates information within and across relations, while cross-attention focuses on relevant cells within each row. This design helps the model handle diverse data and maintain context between connected entities.
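The minimal PyTorch sketch below illustrates this design in a highly simplified form: mean-aggregated message passing over foreign-key edges, followed by cross-attention from a row summary over its own cell embeddings. It is a toy reconstruction of the idea, not Griffin’s actual implementation; all dimensions, module choices, and names are assumptions.

```python
import torch
import torch.nn as nn

class ToyRelationalLayer(nn.Module):
    """Simplified Griffin-style layer: message passing + cell cross-attention.
    Dimensions, aggregation scheme, and module choices are illustrative assumptions."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.cell_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, node_feats, edge_index, cell_feats):
        # node_feats: [N, dim] row embeddings; edge_index: [2, E] foreign-key edges
        # cell_feats: [N, C, dim] per-cell embeddings of each row
        src, dst = edge_index
        msgs = self.msg(node_feats[src])                        # messages along FK edges
        agg = torch.zeros_like(node_feats).index_add_(0, dst, msgs)
        deg = torch.zeros(node_feats.size(0), 1).index_add_(
            0, dst, torch.ones(dst.size(0), 1)).clamp(min=1)
        node_feats = node_feats + agg / deg                     # mean aggregation over neighbors

        # Cross-attention: each row summary attends over its own cells
        q = node_feats.unsqueeze(1)                             # [N, 1, dim]
        out, _ = self.cell_attn(q, cell_feats, cell_feats)
        return node_feats + out.squeeze(1)

# Usage: 5 rows, 3 foreign-key edges, 4 cells per row
layer = ToyRelationalLayer()
x = torch.randn(5, 64)
edges = torch.tensor([[0, 1, 2], [3, 3, 4]])
cells = torch.randn(5, 4, 64)
print(layer(x, edges, cells).shape)  # torch.Size([5, 64])
```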
Pre-training and fine-tuning
The model is pre-trained on single-table datasets via a masked-cell completion task and then fine-tuned on relational databases for specific tasks. Experiments on large relational benchmarks show that Griffin outperforms traditional GNN baselines and single-table models in both accuracy and transfer learning efficiency.
Figure 1: The Griffin model framework.1
Relational transformer
While Griffin focuses on graph aggregation, the Relational Transformer (RT) applies transformer architectures directly to relational databases. It treats every cell as a token enriched with its value, column name, and table name.
Input representation
Each token combines:
- A value embedding that depends on its datatype (numerical, text, or datetime).
- A schema embedding generated from the table and column text.
- A mask token used when the value is hidden during pre-training.
This structure enables RT to process relational databases with different schemas while maintaining a consistent input format.
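To make the token construction concrete, the toy sketch below sums a datatype-dependent value embedding with a schema embedding and swaps in a learned mask token when the value is hidden. It is a simplified reconstruction of the description above, not the RT codebase; the `table.column` vocabulary lookup (instead of a text encoder) and the numeric-only value path are assumptions.

```python
import torch
import torch.nn as nn

class ToyCellToken(nn.Module):
    """Toy RT-style cell token: value embedding + schema embedding (+ mask token)."""
    def __init__(self, dim: int, schema_vocab: dict[str, int]):
        super().__init__()
        self.schema_vocab = schema_vocab
        self.schema_emb = nn.Embedding(len(schema_vocab), dim)  # from table/column names
        self.value_proj = nn.Linear(1, dim)                     # numeric value embedding
        self.mask_token = nn.Parameter(torch.zeros(dim))        # used when the value is hidden

    def forward(self, table: str, column: str, value: float, masked: bool = False):
        sid = torch.tensor([self.schema_vocab[f"{table}.{column}"]])
        schema = self.schema_emb(sid).squeeze(0)
        if masked:
            return self.mask_token + schema
        return self.value_proj(torch.tensor([[value]])).squeeze(0) + schema

vocab = {"orders.amount": 0, "orders.customer_id": 1}
tok = ToyCellToken(dim=32, schema_vocab=vocab)
print(tok("orders", "amount", 49.99).shape)             # torch.Size([32])
print(tok("orders", "amount", 0.0, masked=True).shape)  # masked variant
```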
Relational attention
RT introduces a relational attention mechanism that operates at the cell level. It includes:
- Column attention for learning value distributions within columns.
- Feature attention for combining attributes within the same row or linked parent rows.
- Neighbor attention for aggregating information from connected child rows.
Together, these attention layers form a relational graph transformer that models dependencies across rows, columns, and tables.
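One rough way to picture these three scopes is as boolean attention masks over cell tokens, as in the sketch below; the flat metadata layout and the exact mask definitions are illustrative assumptions, not details taken from the RT paper.

```python
import numpy as np

def relational_attention_masks(row_of_cell, col_of_cell, parent_of_row):
    """Toy boolean masks for column / feature / neighbor attention over cell tokens.
    row_of_cell[i], col_of_cell[i]: row and column id of cell i.
    parent_of_row[r]: id of the parent row linked by foreign key, or -1 if none."""
    rows = np.asarray(row_of_cell)
    cols = np.asarray(col_of_cell)
    parents = np.asarray([parent_of_row[r] for r in rows])

    column_attn = np.equal.outer(cols, cols)        # same column, across rows
    same_row = np.equal.outer(rows, rows)           # same row
    to_parent = np.equal.outer(parents, rows)       # key cell lives in the query row's parent
    feature_attn = same_row | to_parent             # own row + linked parent row
    neighbor_attn = np.equal.outer(rows, parents)   # key cell lives in a child row of the query row
    return column_attn, feature_attn, neighbor_attn

col_m, feat_m, nbr_m = relational_attention_masks(
    row_of_cell=[0, 0, 1, 1], col_of_cell=[0, 1, 0, 1], parent_of_row={0: -1, 1: 0})
```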
Training and transfer results
RT is pretrained on relational databases from RelBench, covering domains such as eCommerce, advertising, and social networks. In experiments, the pretrained model achieved up to 94% of the performance of fully supervised models in zero-shot settings. It also learned faster during fine-tuning, requiring fewer training steps to reach high accuracy.2
This approach suggests that relational databases share transferable patterns across domains and that cell-level tokenization provides a practical foundation for predictive tasks on structured data.
VIEIRA
VIEIRA takes a different approach by focusing on programming with foundation models rather than building a single predictive engine. It extends the Scallop probabilistic logic compiler with a declarative language that integrates large language models, vision models, and other pretrained components as foreign predicates.3
Relational paradigm
In VIEIRA, foundation models are treated as stateless functions with relational inputs and outputs. This enables composing models such as GPT, CLIP, or SAM according to logical rules. For example:
- A program can use GPT to extract knowledge from text and store it as structured relations.
- CLIP can classify images and link them to textual labels in a table.
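Conceptually, this treats a foundation model like any other relation-producing function. The Python sketch below mimics that contract with a stub in place of a real model call; `fake_gpt_extract` and the tuple schema are hypothetical placeholders, and VIEIRA itself expresses such compositions in its own declarative language rather than in Python.

```python
# Conceptual sketch: a foundation model as a stateless function with relational output.
from typing import Iterable

def fake_gpt_extract(sentence: str) -> Iterable[tuple[str, str, str]]:
    """Hypothetical stub standing in for an LLM that emits (subject, relation, object) facts."""
    if "Alice" in sentence and "Bob" in sentence:
        yield ("alice", "parent_of", "bob")

def kinship_facts(sentences: list[str]) -> set[tuple[str, str, str]]:
    # Relational composition: every extracted tuple becomes a row we can join or query.
    return {fact for s in sentences for fact in fake_gpt_extract(s)}

print(kinship_facts(["Alice is Bob's mother."]))
# {('alice', 'parent_of', 'bob')}
```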
Applications
The framework supports a wide range of tasks:
- Date and math reasoning using GPT.
- Kinship reasoning using text extraction and logical inference.
- Question answering that combines retrieval and reasoning.
- Visual question answering and image editing through multimodal composition.
By unifying symbolic logic and neural inference, VIEIRA allows data analysts and developers to build interpretable systems that use pretrained foundation models to answer predictive queries over structured data and images.
Training paradigms and evaluation
All three studies emphasize large-scale pre-training followed by fine-tuning. Their datasets include benchmarks such as RelBench, 4DBInfer, and TPBerta, which represent realistic enterprise databases with multiple tables and temporal information.
Typical predictive tasks include:
- User churn prediction.
- Sales forecasting and lifetime value estimation.
- Missing value imputation and column completion.
Evaluation metrics such as AUROC, R², and mean absolute error measure the accuracy of predictions.
The experiments demonstrate that pre-trained models can perform predictive tasks with limited supervision, often surpassing Gradient Boosting baselines (such as LightGBM and CatBoost) that rely on manual feature engineering.
Case studies
SAP’s sap-rpt-1
Before sap-rpt-1, forecasting demand, identifying late payments, and spotting sales opportunities each required a different model, extensive training, and ongoing maintenance. This approach slowed decision-making and increased operational complexity.
sap-rpt-1 changes this process by introducing a single relational foundation model that performs a wide range of predictive tasks through in-context learning. Instead of retraining a new model for each use case, users provide a few examples of their target pattern, such as “customers who paid on time” and “customers who paid late.” The model then recognizes the pattern and immediately produces accurate predictions for new data.
The model is designed with a two-dimensional attention mechanism that captures relationships across rows and columns, while also embedding metadata, such as table and column names, into vector embeddings. This design allows it to understand the semantics of relational schemas and the temporal information within business tables.
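The in-context workflow can be pictured as assembling a small context table of labeled examples alongside the rows to be scored, as in the pandas sketch below; the column names are illustrative placeholders and the snippet does not use the sap-rpt-1 API.

```python
import pandas as pd

# Labeled examples of the target pattern (in-context "demonstrations").
examples = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "invoice_amount": [1200.0, 540.0, 80.0],
    "days_overdue": [0, 45, 3],
    "paid_late": [False, True, False],   # the pattern we want recognized
})

# New rows to score: same schema, but the label column is left empty.
queries = pd.DataFrame({
    "customer_id": [201, 202],
    "invoice_amount": [900.0, 60.0],
    "days_overdue": [30, 1],
    "paid_late": [None, None],
})

# A relational foundation model consumes both tables in one context window and
# fills in `paid_late` for the query rows -- no retraining or feature engineering.
context = pd.concat([examples, queries], ignore_index=True)
print(context)
```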
SAP’s approach brings several advantages for data analysts and business users:
- A single model that works across multiple tables and domains
- No need for repeated fine-tuning or custom development
- Access to predictive insights in minutes rather than weeks
- Integration with existing data warehouses and SAP systems
By embedding sap-rpt-1 within the SAP ecosystem, business experts can interact with their own data directly and receive predictions through intuitive interfaces. The result is a faster path from structured data to actionable decisions without manual feature engineering.
Figure 2: Error-reduction factor of sap-rpt-1-large versus narrow-AI baselines across SAP domains.
Kumo.AI’s KumoRFM: a relational graph transformer for predictive analytics
Kumo.AI, founded by Stanford professor Jure Leskovec, created KumoRFM, a relational foundation model that uses a relational graph transformer to analyze relational databases and data warehouses. It represents relational data as a temporal heterogeneous graph, where each entity is a node and primary keys and foreign keys form the edges between tables.
This graph-based approach enables KumoRFM to learn from multiple tables simultaneously and adapt to new relational schemas. The model is pre-trained on diverse data sources and can generalize to new datasets without building separate models for each predictive task.
KumoRFM can be used through different interfaces depending on user expertise:
- PQL (Predictive Query Language): A specialized query language for defining predictive queries on structured data.
- Natural language interface: For non-technical users, natural language inputs are automatically translated into PQL queries.
- Python SDK: Allows developers to integrate the model into enterprise AI pipelines and applications.
The KumoRFM architecture dynamically samples the database to create context subgraphs and prediction subgraphs. These subgraphs are processed by the relational graph transformer, which captures dependencies and temporal information across related entities. Through in-context learning, the model provides accurate predictions and can explain its reasoning process.
Kumo offers two deployment options suited to enterprise environments:
- SaaS platform: A cloud-based service built on Apache Spark for easy access and scaling
- Data warehouse native: Allows organizations to use their own data in Snowflake or Databricks without moving it outside their secure environment
Unlike traditional knowledge graphs that require manual schema definition, KumoRFM automatically constructs its relational graph from structured sources. This makes it well-suited for eCommerce, finance, and healthcare, where relationships, temporal patterns, and evolving context are essential for reliable predictions.
Key capabilities of KumoRFM include:
- Flexibility across different tables and schema structures
- Compatibility with a variety of column types and custom identifiers
- Adaptation to specific tasks during inference time
- High accuracy and interpretability in predictive tasks
Figure 3: How relational foundation models (RFMs) operate across domains such as eCommerce, finance, and healthcare to make predictions, provide explanations, and evaluate outcomes.4
Emerging directions
Research on relational foundation models is expanding toward several promising directions:
- Unified relational reasoning: Combining transformer architectures with symbolic logic frameworks like VIEIRA to handle both numeric and semantic reasoning.
- Multi-modal relational data: Integrating images, text, and structured tables into a single relational schema for comprehensive analysis.
- Scalable pre-training: Building data warehouses of relational and temporal graphs that support cross-domain generalization.
- Interpretability and causality: Developing methods to trace how predictions arise from relationships among entities and columns.
- Practical enterprise AI systems: Using these models in chat interfaces that allow users to ask predictive questions directly against their own data in databases or knowledge graphs.
Benchmark methodology
Benchmark setup & environment
To ensure fair comparisons between CPU-bound trees and GPU-accelerated models, we utilized a high-performance environment capable of handling both efficiently.
- Hardware: RunPod instance with an NVIDIA H200 140GB GPU.
- Software: Python 3.12 with pinned libraries for reproducibility:
- scikit-learn 1.5.2, lightgbm 4.5.0, catboost 1.2.7
- torch 2.5.1, pandas 2.2.3, numpy 2.1.3
- sap-rpt-oss (Source: Official GitHub)
- Reproducibility: random_state=42 was used consistently across all splits, initializations, and models.
Datasets: The semantic spectrum
We evaluated the models on 17 supervised learning datasets sourced from OpenML and scikit-learn. Rather than selecting at random, we curated this suite to span the “Semantic-Numerical Spectrum,” testing the hypothesis that LLMs excel where features carry linguistic meaning rather than just raw statistics.
The inventory:
- Small & semantic (<1K rows):
- wine (178), sonar (208), vote (435), cylinder_bands (540), breast_cancer (569).
- Medium/mixed (1K – 10K rows):
- credit_g (1K), titanic (1.3K), car_evaluation (1.7K), spambase (4.6K), compas (5.2K), employee_salaries (9.2K).
- Large/numerical (10K+ rows):
- california_housing (20K), house_sales (21K), default_credit (30K), adult_income (48K), diamonds (53K), higgs (sampled to 100K).
Tasks covered:
- 11 Binary Classification tasks
- 2 Multiclass Classification tasks
- 4 Regression tasks
Model configurations & preprocessing
We aimed for a realistic “practitioner’s comparison,” using strong defaults rather than exhaustive hyperparameter tuning.
LightGBM & CatBoost
To ensure a fair comparison against the computationally heavy SAP model, we strengthened the library defaults, primarily by increasing the number of estimators.
- LightGBM: n_estimators=500, learning_rate=0.05, num_leaves=31. Runs on CPU (n_jobs=-1).
- CatBoost: iterations=500, learning_rate=0.05, depth=6. Runs on GPU (task_type="GPU").
- Preprocessing: Simple Label Encoding for categoricals; no scaling for numerics; median/mode imputation for missing values.
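For reference, here is a minimal sketch of this baseline pipeline with the settings listed above (label encoding, median/mode imputation, and the fixed hyperparameters); dataset loading is omitted and the helper is a simplification of our actual harness.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Label-encode categoricals, median/mode-impute missing values, no scaling."""
    df = df.copy()
    for col in df.columns:
        if df[col].dtype == "object":
            df[col] = df[col].fillna(df[col].mode().iloc[0])
            df[col] = LabelEncoder().fit_transform(df[col].astype(str))
        else:
            df[col] = df[col].fillna(df[col].median())
    return df

# Fixed configurations used across all datasets (no per-dataset tuning).
lgbm = LGBMClassifier(n_estimators=500, learning_rate=0.05, num_leaves=31,
                      n_jobs=-1, random_state=42)          # CPU
cat = CatBoostClassifier(iterations=500, learning_rate=0.05, depth=6,
                         task_type="GPU", random_state=42, verbose=0)  # GPU
```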
SAP-RPT-1-OSS
We configured SAP to balance performance and cost based on our preliminary configuration experiments.
- Configuration: max_context_size=4096, bagging=4.
- Note:
- Context: Testing on adult_income showed that increasing context from 4096 to 8192 tripled runtime (4 min to 12 min) with no accuracy gain (ROC-AUC 0.917 in both cases).
- Bagging: Increasing bagging from 4 to 8 (SAP’s default setting used in the article5) offered diminishing returns.
- Preprocessing: None. The raw pandas DataFrame is passed directly. The model encodes using text embeddings (sentence-transformers/all-MiniLM-L6-v2).
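To show where much of SAP’s preprocessing time goes, the sketch below embeds a table’s textual columns with the same sentence-transformers model; it illustrates the embedding step only, is not the sap-rpt-oss internals or API, and uses a placeholder DataFrame.

```python
import pandas as pd
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def embed_text_columns(df: pd.DataFrame):
    """Embed every textual/categorical cell; this step dominates SAP's wall-clock time."""
    text_cols = df.select_dtypes(include="object").columns
    return {col: encoder.encode(df[col].astype(str).tolist(),
                                batch_size=256, show_progress_bar=False)
            for col in text_cols}

df = pd.DataFrame({"job_title": ["Data Analyst", "Nurse"], "salary": [70000, 65000]})
embeddings = embed_text_columns(df)   # {'job_title': array of shape (2, 384)}
```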
Evaluation protocol
Cross-validation strategy
We utilized 3-Fold Cross-Validation with shuffling.
- We reduced the standard 5-fold to 3-fold to accommodate SAP’s slow inference times (40% time saving) while maintaining statistical validity.
- Splitting: StratifiedKFold for classification; Standard K-Fold for regression.
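A minimal sketch of the splitting logic as described, with the shared random_state:

```python
from sklearn.model_selection import StratifiedKFold, KFold

def make_splitter(task: str):
    """3-fold CV with shuffling: stratified for classification, plain K-fold for regression."""
    if task == "regression":
        return KFold(n_splits=3, shuffle=True, random_state=42)
    return StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

splitter = make_splitter("binary")
# for train_idx, test_idx in splitter.split(X, y): ...
```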
Metrics & diagnostics
We moved beyond simple accuracy to capture a holistic view of model performance:
- Primary ranking metrics: ROC-AUC (Binary), Balanced Accuracy (Multiclass), R² (Regression).
- Secondary diagnostics: We tracked Matthews correlation coefficient (MCC) and log loss to ensure wins were not artifacts of class imbalance, and MAPE for regression error calibration.
- Cost calculation: Based on the total wall-clock time (preprocessing + training + inference) on the RunPod H200 instance ($3.59/hr).
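These diagnostics map directly onto scikit-learn metrics; the sketch below shows that mapping, with predictions left as placeholders.

```python
from sklearn.metrics import (roc_auc_score, balanced_accuracy_score, r2_score,
                             matthews_corrcoef, log_loss,
                             mean_absolute_percentage_error)

def binary_report(y_true, y_prob, y_pred):
    return {
        "roc_auc": roc_auc_score(y_true, y_prob),   # primary ranking metric (binary)
        "mcc": matthews_corrcoef(y_true, y_pred),   # imbalance-robust diagnostic
        "log_loss": log_loss(y_true, y_prob),       # probability calibration
    }

def multiclass_report(y_true, y_pred):
    return {"balanced_accuracy": balanced_accuracy_score(y_true, y_pred)}  # primary (multiclass)

def regression_report(y_true, y_pred):
    return {
        "r2": r2_score(y_true, y_pred),                                  # primary (regression)
        "mape": mean_absolute_percentage_error(y_true, y_pred),          # error calibration
    }
```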
Statistical significance
We applied a Wilcoxon signed-rank test (p<0.05) to pairwise model comparisons to determine if performance differences were statistically significant or random noise.
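A sketch of this significance check on paired per-dataset scores using scipy; the score arrays shown are placeholders, not our benchmark results.

```python
from scipy.stats import wilcoxon

# Paired per-dataset scores for two models (placeholder values, not our results).
sap_scores  = [0.91, 0.84, 0.88, 0.79, 0.93]
lgbm_scores = [0.89, 0.86, 0.87, 0.82, 0.90]

stat, p_value = wilcoxon(sap_scores, lgbm_scores)
print(f"p = {p_value:.3f} -> {'significant' if p_value < 0.05 else 'not significant'} at p<0.05")
```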
Limitations & internal validity
We explicitly acknowledge the following constraints in our methodology:
- Standardized configurations vs tuning: We utilized fixed, strong default configurations for all models rather than performing exhaustive hyperparameter optimization (e.g., nested CV or Optuna sweeps). While this ensures a consistent baseline, it is worth noting that Tree models often see performance gains with dataset-specific tuning, which could narrow the margins in the “Competitive” cluster.
- Data scale boundaries: Our analysis focused on datasets under 100k rows to simulate typical mid-sized enterprise scenarios. We observed the LLM’s advantage diminishing as data volume grew, but we did not extend testing to million-row scales where inference latency and cost would likely become the primary constraints.
- Infrastructure uniformity: To maintain a consistent testing environment, we executed all models on the same NVIDIA H200 hardware. LightGBM and CatBoost are highly optimized for commodity CPUs; therefore, in a production environment dedicated solely to Tree models, the cost differential would likely be wider.
- Generalization beyond semantics: Our “Semantic Spectrum” hypothesis successfully predicted many outcomes, but the LLM’s strong performance on abstract datasets like sonar and california_housing suggests capabilities beyond linguistic understanding. This indicates that the model may also be leveraging high-dimensional regularization patterns, a phenomenon that warrants further investigation beyond the scope of this initial study.
Reference Links




