👨‍💻 Benchmarking Small Language Models in the Real World
on April 21, 2026
On Saturday, April 18th, I took part in a hackathon that brings machine learning engineers, data engineers, and researchers together in Paris for an in-depth look at the evaluation of small language models (SLMs).
This hackathon, organized by AI Tinkerers Paris, addresses a very concrete problem:
testing the actual ability of language models to produce production-ready, executable code.
The theme - "Benchmarking Small Language Models in the Real World" - sets a clear framework: moving beyond impressive demos to confront models with real-world constraints (execution, performance, resources).
🎯 Objective
The challenge is to automatically generate Polars queries from natural language, with a strong requirement:
- produce correct code
- ensure that it is executable
- optimize execution time and memory consumption
The scoring reflects this reality:
Score = N / (T × VRAM^0.1 × RAM^0.01)
👉 In other words: accuracy alone is not enough; system efficiency becomes a central constraint.
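As a quick sanity check, here is the formula as a small Python function. The variable names and the example numbers are illustrative assumptions, not values from the brief:

```python
def score(n: int, t_seconds: float, vram_gb: float, ram_gb: float) -> float:
    """Score = N / (T × VRAM^0.1 × RAM^0.01).

    The tiny exponents dampen the resource terms: using 10x more VRAM
    only divides the score by 10^0.1 ≈ 1.26, while time hits linearly.
    """
    return n / (t_seconds * vram_gb**0.1 * ram_gb**0.01)

# Illustrative run: 40 correct queries, 120 s total, 6 GB VRAM, 16 GB RAM
print(score(40, 120.0, 6.0, 16.0))  # ≈ 0.27
```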
⚙️ Technical Context
Each team works in a standardized environment:
- execution under Docker
- use of Polars for data processing
- GPU/memory constraints
- evaluation dataset provided
The goal is not to craft "the best prompt", but to build a system capable of:
- withstanding real data
- producing robust code
- running an automated benchmark
🧩 Positioning
This hackathon stands out because of its approach:
- no marketing demo
- no "impressive but fragile" generation
- focus on what really works
👉 This is an environment that directly exposes the current limitations of LLMs:
- hallucinations
- syntax errors
- misunderstanding of the data schema
And it forces you to build around them:
- structured prompt engineering
- systematic validation
- fast iteration loops
📖 Reading
This is not a "creative" hackathon, but an engineering benchmark.
The final deliverable is not an idea, but:
a measurable, reproducible, and comparable system.
🗓️ Day's Organization
Morning - framing and initialization
The morning is dedicated to setting up the technical and organizational framework:
- team formation via the event portal
- presentation of the problem by the organizers and mentors
- clarification of expectations (generation of executable Polars code + scoring)
- initial setup of the work environment
The Darkwood team is formed around:
The objective of this phase is to reduce uncertainty and align everyone on an executable pipeline from the very first hours.
Midday - Forced Break
- lunch provided on site
- informal exchanges between teams
- consultation of the message center (general questions, clarifications from the jury)
👉 A short phase, without really switching off: the project keeps iterating.
Afternoon - Implementation and Iterations
The afternoon is entirely production-oriented:
- implementation of the benchmark (polars-bench)
- integration of the model via ai-harness
- experimentation on prompts (structure, format, constraints)
- adjusting model behavior via dataset and prompt engineering
- progressive validation through actual execution
The iterations focus on three areas:
- hallucination reduction (schema-aware prompting)
- improved rate of executable code
- score optimization (time / memory / accuracy)
👉 The logic is not to add features, but to tighten the system around real constraints.
⚙️ Benchmark Architecture
The system relies on a clear separation of responsibilities:
- ai-harness → model orchestration layer
- polars-bench → execution and evaluation engine
This breakdown allows us to isolate the generation (LLM) from the execution reality (runtime), which is precisely the objective of the benchmark.
🤝 Sponsors, Mentors & Organizers, Teams
This hackathon is made possible thanks to an ecosystem of complementary players: sponsors, mentors and organizers, each playing a key role in the overall experience.
🏢 Organizers
The event is organized by the AI Tinkerers Paris community, a collective active in experimenting with and sharing AI technologies.
- 🌐 Official website: https://paris.aitinkerers.org
- 🔗 Hackathon page: https://paris.aitinkerers.org/hackathons/h_sj1ca_J4Hdk
- 👥 Organizers: https://paris.aitinkerers.org/hackathons/h_sj1ca_J4Hdk/organizers
Their positioning is clear: to promote concrete experimentation around AI models, with a focus on real engineering rather than demonstration.
💼 Sponsors
Sponsors support the event by providing:
- technical resources (GPU, infrastructure, tools)
- funding
- visibility
👉 Their role is critical to enabling a realistic environment (compute constraints, Docker execution, etc.).
- Mistral
  - Sponsor page: Hackathon Sponsors
  - Representative: Matthieu Dinot, AI Scientist at Mistral
- Alpic
  - Sponsor page: Hackathon Sponsors
  - Representative: Nikolay Rodionov, COO at Alpic
- Cloudflare
  - Sponsor page: Hackathon Sponsors
  - Representative: Nans Cyril Bouissou, Account Executive at Cloudflare
- Fold
  - Sponsor page: Hackathon Sponsors
  - Representative: Raouf Chebri, Developer Relations Engineer at Replit
- Microsoft
  - Sponsor page: Hackathon Sponsors
  - Representative: Julien Bichon, Developer Experience | GTM Manager at Microsoft
- ESGI (venue partner mentioned alongside sponsors)
  - Sponsor / venue page: Hackathon Sponsors
  - Representative: Astrid Beaucourt, Communications Officer at ESGI
🧑‍🏫 Mentors
Mentors support participants throughout the hackathon:
- assistance in structuring approaches
- feedback on models and prompts
- tips on performance and optimization
👉 They help avoid common dead ends:
- over-optimization of the prompt
- lack of validation
- errors in data interpretation
- Arthur Mensch
  - Mentor page: Hackathon Mentors
  - Role: Co-founder and CEO of Mistral AI
- Leo Arsenin
  - Mentor page: Hackathon Mentors
  - LinkedIn: Léo Arsenin
  - Role: Solutions Engineer at Cloudflare
- Matthieu Dinot
  - Mentor page: Hackathon Mentors
  - LinkedIn: Matthieu Dinot
  - Role: AI Scientist at Mistral
- Preetham Kaukuntla
  - Mentor page: Hackathon Mentors
  - LinkedIn: Preetham Kaukuntla
  - Role: Staff Data Scientist at Glassdoor
- Mentors message board
👥 Teams
- 1st Place
  - Entry: /hackathons/h_sj1ca_J4Hdk/entries/ht_qFVxlREukLQ
  - Team: /hackathons/h_sj1ca_J4Hdk/teams/ht_qFVxlREukLQ
  - Demo video: YouTube
- Best Startup
  - Entry: /hackathons/h_sj1ca_J4Hdk/entries/ht_-p8xvdGl-oA
  - Team: /hackathons/h_sj1ca_J4Hdk/teams/ht_-p8xvdGl-oA
- 10x Data Scientist
  - Entry: /hackathons/h_sj1ca_J4Hdk/entries/ht_ub4tDzlrft0
  - Team: /hackathons/h_sj1ca_J4Hdk/teams/ht_ub4tDzlrft0
- Finalist
  - Entry: /hackathons/h_sj1ca_J4Hdk/entries/ht_y4_Yz6P5BZE
  - Team: /hackathons/h_sj1ca_J4Hdk/teams/ht_y4_Yz6P5BZE
- Finalist
  - Entry: /hackathons/h_sj1ca_J4Hdk/entries/ht_njEtwEBAmUE
  - Team: /hackathons/h_sj1ca_J4Hdk/teams/ht_njEtwEBAmUE
- Submitted
  - Entry: /hackathons/h_sj1ca_J4Hdk/entries/ht_wRj-nEG24zE
  - Team: /hackathons/h_sj1ca_J4Hdk/teams/ht_wRj-nEG24zE
- Submitted
  - Entry: /hackathons/h_sj1ca_J4Hdk/entries/ht_RFBTGYYwbm4
  - Team: /hackathons/h_sj1ca_J4Hdk/teams/ht_RFBTGYYwbm4
- Submitted
  - Entry: /hackathons/h_sj1ca_J4Hdk/entries/ht_Gc4SneBV2D4
  - Team: /hackathons/h_sj1ca_J4Hdk/teams/ht_Gc4SneBV2D4
🔗 Useful Links
- 🏠 Home: https://paris.aitinkerers.org/hackathons/h_sj1ca_J4Hdk
- 👥 Teams: https://paris.aitinkerers.org/hackathons/h_sj1ca_J4Hdk/teams
- 📩 Message Center: https://paris.aitinkerers.org/message_center?board_key=meetup_mu_eZJ5tCXlA2A
- 📝 Submissions: https://paris.aitinkerers.org/hackathons/h_sj1ca_J4Hdk/entries
- 🏆 Results: https://paris.aitinkerers.org/hackathons/h_sj1ca_J4Hdk/results
🧱 Technical Stack
The architecture is intentionally simple, but constrained by real-world conditions:
- Python for execution
- Polars as the query engine
- Qwen 2.5 as the generation model
- FastAPI for the interface
- local sandbox (CPU) with memory constraints
The core of the system is an API that turns a natural language request into directly executable Polars code.
🔁 Execution Pipeline
The benchmark requires a complete chain, without shortcuts:
```
User query (NL)
      ↓
Enriched prompt (schema + rules)
      ↓
LLM
      ↓
Generated Polars code
      ↓
Real execution
      ↓
Validation
      ↓
Scoring
```
👉 The model is not evaluated on what it "says", but on what its code actually produces.
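In code, the chain can be summarized like this. It is a minimal sketch: `call_llm` is a hypothetical stand-in for the model client, and the final check is the bare minimum before scoring:

```python
import polars as pl

def call_llm(prompt: str) -> str:
    # Placeholder for the actual model call (wired through ai-harness in the real system).
    raise NotImplementedError

def run_case(nl_query: str, df: pl.DataFrame) -> bool:
    prompt = f"Schema: {df.columns}\nQuery: {nl_query}"  # enriched prompt (schema + rules)
    code = call_llm(prompt)                              # LLM
    try:
        result = eval(code, {"pl": pl, "df": df})        # real execution, no shortcuts
    except Exception:
        return False                                     # non-executable code scores zero
    return isinstance(result, pl.DataFrame)              # minimal validation before scoring
```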
🧠 Generation Strategy
The key choice is simple:
❌ train the model → constrain its behavior
Rather than heavyweight fine-tuning, the system relies on:
- a structured prompt
- a dataset of targeted examples
Required format:

```
System:
- Polars rules
- strict constraints
User:
- schema
- query
Assistant:
- code only
```
👉 The model learns an execution pattern, not general knowledge.
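A minimal sketch of that message layout, assuming a chat-style API; the exact rules text is abbreviated here:

```python
SYSTEM_RULES = (
    "You generate Polars code.\n"
    "- Use only the Polars API (the DataFrame is available as `df`).\n"
    "- Respect the provided schema: never invent column names.\n"
    "- Output code only: no prose, no markdown fences."
)

def build_messages(schema: list[str], query: str) -> list[dict]:
    # System carries the Polars rules and strict constraints;
    # user carries the schema and the request; assistant must answer with code only.
    return [
        {"role": "system", "content": SYSTEM_RULES},
        {"role": "user", "content": f"Schema: {schema}\nQuery: {query}"},
    ]
```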
🗂️ Dataset
The dataset is not massive; it is intentional:
- 15 to 500 examples
- focused on critical patterns
Covered patterns:
- selection
- filters
- aggregations
- joins
- window functions
- nulls
- edge cases
👉 Coverage is more important than volume.
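One illustrative example pair, invented here to show the shape of an entry (the actual dataset contents are not reproduced in this post):

```python
# Invented example pair covering "aggregations" and "nulls":
example = {
    "schema": ["card_name", "mana_cost", "frequency"],
    "query": "average mana cost per card, ignoring nulls",
    "code": (
        "df.drop_nulls('mana_cost')"
        ".group_by('card_name')"
        ".agg(pl.col('mana_cost').mean().alias('avg_mana_cost'))"
    ),
}
```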
📋 Schema injection
The model doesn't guess anything. The schema is injected systematically:
```json
{
  "columns": ["card_name", "mana_cost", "frequency"]
}
```
Direct effects:
- fewer hallucinations
- valid queries on the first try
- strong dependence on the provided context
👉 Without the schema, the system collapses.
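A sketch of how such a block can be produced from the real data rather than written by hand. This variant also includes dtypes, which is an extra assumption on top of the column list shown above:

```python
import json
import polars as pl

def schema_block(df: pl.DataFrame) -> str:
    # Serialize the actual schema so the model never has to guess column names.
    return json.dumps(
        {"columns": {name: str(dtype) for name, dtype in df.schema.items()}},
        indent=2,
    )

df = pl.DataFrame({"card_name": ["Bolt"], "mana_cost": [1], "frequency": [3]})
print(schema_block(df))
```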
⚡ Inference Constraints
- ~6GB model
- CPU only
- high latency
Consequence:
- every call counts
- retries are expensive
- the prompt must be precise from the outset
👉 The cost of error is built into the score.
🧪 Validation & scoring
The benchmark validates a complete behavior, not a raw output.
Levels
- Valid code
- Successful execution
- Correct result
- Acceptable performance
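The first three levels can be checked mechanically. A sketch, assuming the model returns a single Polars expression over `df` and that a recent Polars version (with `DataFrame.equals`) is available; the fourth level, performance, is measured at runtime:

```python
import polars as pl

def classify(code: str, df: pl.DataFrame, expected: pl.DataFrame) -> str:
    try:
        compiled = compile(code, "<generated>", "eval")  # level 1: valid code
    except SyntaxError:
        return "invalid_code"
    try:
        result = eval(compiled, {"pl": pl, "df": df})    # level 2: successful execution
    except Exception:
        return "runtime_error"
    if isinstance(result, pl.DataFrame) and result.equals(expected):
        return "correct"                                 # level 3: correct result
    return "wrong_result"
```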
Score
Score = N / (T × VRAM^0.1 × RAM^0.01)
👉 We measure a constrained system, not an abstract model.
🏗 Architectural choices
Three options:
- fine-tuning → too heavy
- multi-model → too complex
- prompt + dataset → chosen
👉 Decision: encode the behavior in the data.
🧩 Implementation
The API exposes a simple flow:
- input:
  - query
  - metadata
- output:
  - Polars code
The service:
- builds the prompt
- calls the model
- returns the code
👉 Deliberate simplicity: the real difficulty lies elsewhere.
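A minimal FastAPI sketch of that flow. The route name, payload fields, and the `call_llm` stub are assumptions for illustration, not the project's actual API:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    query: str            # natural language request
    columns: list[str]    # schema metadata

class GenerateResponse(BaseModel):
    code: str             # generated Polars code

def call_llm(schema: list[str], query: str) -> str:
    # Placeholder: builds the prompt and calls the model (see the sketches above).
    raise NotImplementedError

@app.post("/generate", response_model=GenerateResponse)
def generate(req: GenerateRequest) -> GenerateResponse:
    return GenerateResponse(code=call_llm(req.columns, req.query))
```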
📦 Positioning
The project is not presented as an LLM wrapper, but as:
a secure analytical copilot for tabular data
With:
- guided prompts
- structured context
- verified execution
👉 The product is defined by its constraints, not by the model.
🧭 System reading
This benchmark shows one thing:
The performance of a small model is a property of the system, not of the model alone.
The real levers:
- structure of the prompt
- dataset
- context injection
- runtime validation
👉 We move from an ML problem to an engineering problem.
🧱 Components
Harness
- abstraction of models
- standardization of inputs/outputs
Executor
- isolated execution
- error capture
- runtime metrics (sketched below)
Scoring
- validity
- performance
- stability
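As an illustration of the Executor, here is a sketch using a plain child process where the real benchmark uses Docker; the runner script, metric names, and parquet data path are assumptions (and the `resource` module is Unix-only):

```python
import json
import subprocess
import sys

RUNNER = """
import json, sys, time, resource
import polars as pl
df = pl.read_parquet(sys.argv[1])
start = time.perf_counter()
result = eval(sys.stdin.read(), {"pl": pl, "df": df})
print(json.dumps({
    "seconds": time.perf_counter() - start,
    "peak_ram_kb": resource.getrusage(resource.RUSAGE_SELF).ru_maxrss,  # Linux: KB
}))
"""

def execute_isolated(code: str, data_path: str, timeout: float = 30.0) -> dict:
    # Isolated execution: crashes and hangs stay contained in the child process.
    try:
        proc = subprocess.run(
            [sys.executable, "-c", RUNNER, data_path],
            input=code, capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "error": "timeout"}
    if proc.returncode != 0:
        return {"ok": False, "error": proc.stderr}   # error capture
    return {"ok": True, **json.loads(proc.stdout)}   # runtime metrics
```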
📊 What is measured
- Execution → is it running?
- Correctness → is the result right?
- Polars quality → correct use of the engine
- Performance → actual cost
🔍 Real-life case
Input:
"Find the top 10 most frequent cards with mana cost โค 3"
Expected:
- correct filter
- correct aggregation
- sorting + limit
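For reference, a query the benchmark could accept for this request. Whether "frequency" must be summed per card or simply sorted on depends on the dataset's semantics, so the aggregation here is an assumption:

```python
import polars as pl

# df: the benchmark DataFrame with the schema shown earlier
top_cards = (
    df.filter(pl.col("mana_cost") <= 3)                        # correct filter
      .group_by("card_name")                                   # correct aggregation
      .agg(pl.col("frequency").sum().alias("total_frequency"))
      .sort("total_frequency", descending=True)                # sorting
      .head(10)                                                # + limit
)
```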
Observed:
- valid but incorrect code
- Python fallback
- silent errors
👉 The benchmark reveals the gap between generation and execution.
⚠️ Real stakes
An LLM generates plausible code, not reliable code
So, in production:
- sandbox required
- systematic validation
- controlled retries
- monitoring
👉 Without this, the system is unstable by default.
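A sketch of a controlled retry loop tying these pieces together, reusing the hypothetical `call_llm` and `execute_isolated` helpers from the sketches above; the attempt budget stays small because each inference call is slow:

```python
def generate_with_retries(query: str, columns: list[str], data_path: str,
                          max_attempts: int = 2) -> str | None:
    error = None
    for _ in range(max_attempts):
        # Controlled retry: the previous error is fed back into the request.
        prompt_query = query if error is None else f"{query}\nPrevious attempt failed with:\n{error}"
        code = call_llm(columns, prompt_query)
        outcome = execute_isolated(code, data_path)  # sandboxed, systematically validated
        if outcome["ok"]:
            return code           # it actually ran
        error = outcome["error"]  # would also be logged for monitoring
    return None                   # give up rather than ship unvalidated code
```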
🧠 Project Contribution
This benchmark introduces:
- an end-to-end executable pipeline
- a multi-criteria assessment
- a reproducible approach
- a production-friendly environment
What remains:
- ai-harness → orchestration
- polars-bench → evaluation
👉 This is no longer a model test; it's a complete system test.
📹 Demo
Submission video:
📦 Hackathon Deliverables
- Full benchmark code
- Test set
- Reproducible pipeline
- Comparative results
Example of a provided structure:
- submission_example
  - main benchmark code
  - datasets
  - logs
🧭 Conclusion
This hackathon is not intended to demonstrate that LLMs work.
It shows where they break.
And most importantly:
A useful benchmark does not measure generation. It measures performance.
🔮 Darkwood Perspective
This type of benchmark fits directly into a broader vision:
- orchestration of AI pipelines
- systematic validation
- generation/execution separation
- observability
👉 This is not a demo tool. It's a building block for constructing reliable systems.