👨‍💻 Benchmarking Small Language Models in the Real World
on April 21, 2026
On Saturday, April 18th, I took part in a hackathon that brings machine learning engineers, data engineers, and researchers together in Paris for an in-depth look at the evaluation of small language models (SLMs).
This hackathon, organized by AI Tinkerers Paris, addresses a very concrete problem:
testing the actual ability of language models to produce production-ready, executable code.
The theme - "Benchmarking Small Language Models in the Real World" - sets a clear framework: moving beyond impressive demos to confront models with real-world constraints (execution, performance, resources).
🎯 Objective
The challenge is to automatically generate Polars queries from natural language, with a strong requirement:
- produce correct code
- ensure that it is executable
- optimize execution time and memory consumption
The scoring reflects this reality:
Score = N / (T × VRAM^0.1 × RAM^0.01)
👉 In other words: accuracy alone is not enough; system efficiency becomes a central constraint.
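As a quick sanity check, here is the formula as a small Python function. The variable names and the example numbers are illustrative assumptions, not values from the brief:

```python
def score(n: int, t_seconds: float, vram_gb: float, ram_gb: float) -> float:
    """Score = N / (T × VRAM^0.1 × RAM^0.01).

    The tiny exponents dampen the resource terms: using 10x more VRAM
    only divides the score by 10^0.1 ≈ 1.26, while time hits linearly.
    """
    return n / (t_seconds * vram_gb**0.1 * ram_gb**0.01)

# Illustrative run: 40 correct queries, 120 s total, 6 GB VRAM, 16 GB RAM
print(score(40, 120.0, 6.0, 16.0))  # ≈ 0.27
```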
⚙️ Technical Context
Each team works in a standardized environment:
- execution under Docker
- use of Polars for data processing
- GPU/memory constraints
- evaluation dataset provided
The goal is not to craft "the best prompt", but to build a system capable of:
- withstanding real data
- producing robust code
- running an automated benchmark
🧩 Positioning
This hackathon stands out because of its approach:
- no marketing demo
- no "impressive but fragile" generation
- focus on what really works
👉 This is an environment that directly exposes the current limitations of LLMs:
- hallucinations
- syntax errors
- misunderstanding of the data schema
And it forces you to build around them:
- structured prompt engineering
- systematic validation
- fast iteration loops
📖 Reading
This is not a "creative" hackathon, but an engineering benchmark.
The final deliverable is not an idea, but:
a measurable, reproducible, and comparable system.
🗓️ Day's Organization
Morning - framing and initialization
The morning is dedicated to setting up the technical and organizational framework:
- team formation via the event portal
- presentation of the problem by the organizers and mentors
- clarification of expectations (generation of executable Polars code + scoring)
- initial setup of the work environment
The Darkwood team is formed around:
The objective of this phase is to reduce uncertainty and align everyone on an executable pipeline from the very first hours.
Midday - Forced Break
- lunch provided on site
- informal exchanges between teams
- consultation of the message center (general questions, clarifications from the jury)
👉 A short phase, without really switching off: the project keeps iterating.
Afternoon - Implementation and Iterations
The afternoon is entirely production-oriented:
- implementation of the benchmark (polars-bench)
- integration of the model via ai-harness
- experimentation on prompts (structure, format, constraints)
- adjusting model behavior via dataset and prompt engineering
- progressive validation through actual execution
The iterations focus on three areas:
- hallucination reduction (schema-aware prompting)
- improved rate of executable code
- score optimization (time / memory / accuracy)
👉 The logic is not to add features, but to tighten the system around real constraints.
⚙️ Benchmark Architecture
The system relies on a clear separation of responsibilities:
- ai-harness → model orchestration layer
- polars-bench → execution and evaluation engine
This breakdown allows us to isolate the generation (LLM) from the execution reality (runtime), which is precisely the objective of the benchmark.
🤝 Sponsors, Mentors & Organizers, Teams
This hackathon is made possible thanks to an ecosystem of complementary players: sponsors, mentors and organizers, each playing a key role in the overall experience.
🏢 Organizers
The event is organized by the AI Tinkerers Paris community, a collective active in experimenting with and sharing AI technologies.
- 🌐 Official website: https://paris.aitinkerers.org
- 🔗 Hackathon page: https://paris.aitinkerers.org/hackathons/h_sj1ca_J4Hdk
- 👥 Organizers: https://paris.aitinkerers.org/hackathons/h_sj1ca_J4Hdk/organizers
Their positioning is clear: to promote concrete experimentation around AI models, with a focus on real engineering rather than demonstration.
💼 Sponsors
Sponsors support the event by providing:
- technical resources (GPU, infrastructure, tools)
- funding
- visibility
👉 Their role is critical to enabling a realistic environment (compute constraints, Docker execution, etc.).
- Mistral
  - Sponsor page: Hackathon Sponsors
  - Representative: Matthieu Dinot, AI Scientist at Mistral
- Alpic
  - Sponsor page: Hackathon Sponsors
  - Representative: Nikolay Rodionov, COO at Alpic
- Cloudflare
  - Sponsor page: Hackathon Sponsors
  - Representative: Nans Cyril Bouissou, Account Executive at Cloudflare
- Fold
  - Sponsor page: Hackathon Sponsors
  - Representative: Raouf Chebri, Developer Relations Engineer at Replit
- Microsoft
  - Sponsor page: Hackathon Sponsors
  - Representative: Julien Bichon, Developer Experience | GTM Manager at Microsoft
- ESGI (venue partner mentioned alongside sponsors)
  - Sponsor / venue page: Hackathon Sponsors
  - Representative: Astrid Beaucourt, Communications Officer at ESGI
🧑‍🏫 Mentors
Mentors support participants throughout the hackathon:
- assistance in structuring approaches
- feedback on models and prompts
- tips on performance and optimization
👉 They help avoid common dead ends:
- over-optimization of the prompt
- lack of validation
- errors in data interpretation
- Arthur Mensch
  - Mentor page: Hackathon Mentors
  - Role: Co-founder and CEO of Mistral AI
- Leo Arsenin
  - Mentor page: Hackathon Mentors
  - LinkedIn: Léo Arsenin
  - Role: Solutions Engineer at Cloudflare
- Matthieu Dinot
  - Mentor page: Hackathon Mentors
  - LinkedIn: Matthieu Dinot
  - Role: AI Scientist at Mistral
- Preetham Kaukuntla
  - Mentor page: Hackathon Mentors
  - LinkedIn: Preetham Kaukuntla
  - Role: Staff Data Scientist at Glassdoor
- Mentors message board
👥 Teams
- 1st Place
  - Entry: /hackathons/h_sj1ca_J4Hdk/entries/ht_qFVxlREukLQ
  - Team: /hackathons/h_sj1ca_J4Hdk/teams/ht_qFVxlREukLQ
  - Demo video: YouTube
- Best Startup
  - Entry: /hackathons/h_sj1ca_J4Hdk/entries/ht_-p8xvdGl-oA
  - Team: /hackathons/h_sj1ca_J4Hdk/teams/ht_-p8xvdGl-oA
- 10x Data Scientist
  - Entry: /hackathons/h_sj1ca_J4Hdk/entries/ht_ub4tDzlrft0
  - Team: /hackathons/h_sj1ca_J4Hdk/teams/ht_ub4tDzlrft0
- Finalist
  - Entry: /hackathons/h_sj1ca_J4Hdk/entries/ht_y4_Yz6P5BZE
  - Team: /hackathons/h_sj1ca_J4Hdk/teams/ht_y4_Yz6P5BZE
- Finalist
  - Entry: /hackathons/h_sj1ca_J4Hdk/entries/ht_njEtwEBAmUE
  - Team: /hackathons/h_sj1ca_J4Hdk/teams/ht_njEtwEBAmUE
- Submitted
  - Entry: /hackathons/h_sj1ca_J4Hdk/entries/ht_wRj-nEG24zE
  - Team: /hackathons/h_sj1ca_J4Hdk/teams/ht_wRj-nEG24zE
- Submitted
  - Entry: /hackathons/h_sj1ca_J4Hdk/entries/ht_RFBTGYYwbm4
  - Team: /hackathons/h_sj1ca_J4Hdk/teams/ht_RFBTGYYwbm4
- Submitted
  - Entry: /hackathons/h_sj1ca_J4Hdk/entries/ht_Gc4SneBV2D4
  - Team: /hackathons/h_sj1ca_J4Hdk/teams/ht_Gc4SneBV2D4
🔗 Useful Links
- 🏠 Home: https://paris.aitinkerers.org/hackathons/h_sj1ca_J4Hdk
- 👥 Teams: https://paris.aitinkerers.org/hackathons/h_sj1ca_J4Hdk/teams
- 📩 Message Center: https://paris.aitinkerers.org/message_center?board_key=meetup_mu_eZJ5tCXlA2A
- 📝 Submissions: https://paris.aitinkerers.org/hackathons/h_sj1ca_J4Hdk/entries
- 🏆 Results: https://paris.aitinkerers.org/hackathons/h_sj1ca_J4Hdk/results
🧱 Technical Stack
The architecture is intentionally simple, but constrained by real-world conditions:
- Python for execution
- Polars as the query engine
- Qwen 2.5 as the generation model
- FastAPI for the interface
- local sandbox (CPU) with memory constraints
The core of the system is an API that turns a natural language request into directly executable Polars code.
🔁 Execution Pipeline
The benchmark requires a complete chain, without shortcuts:
```
User query (NL)
      ↓
Enriched prompt (schema + rules)
      ↓
LLM
      ↓
Generated Polars code
      ↓
Real execution
      ↓
Validation
      ↓
Scoring
```
👉 The model is not evaluated on what it "says", but on what its code actually produces.
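In code, the chain can be summarized like this. It is a minimal sketch: `call_llm` is a hypothetical stand-in for the model client, and the final check is the bare minimum before scoring:

```python
import polars as pl

def call_llm(prompt: str) -> str:
    # Placeholder for the actual model call (wired through ai-harness in the real system).
    raise NotImplementedError

def run_case(nl_query: str, df: pl.DataFrame) -> bool:
    prompt = f"Schema: {df.columns}\nQuery: {nl_query}"  # enriched prompt (schema + rules)
    code = call_llm(prompt)                              # LLM
    try:
        result = eval(code, {"pl": pl, "df": df})        # real execution, no shortcuts
    except Exception:
        return False                                     # non-executable code scores zero
    return isinstance(result, pl.DataFrame)              # minimal validation before scoring
```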
🧠 Generation Strategy
The key choice is simple:
❌ train the model → constrain its behavior
Rather than heavyweight fine-tuning, the system relies on:
- a structured prompt
- a dataset of targeted examples
Required format:

```
System:
- Polars rules
- strict constraints
User:
- schema
- query
Assistant:
- code only
```
👉 The model learns an execution pattern, not general knowledge.
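A minimal sketch of that message layout, assuming a chat-style API; the exact rules text is abbreviated here:

```python
SYSTEM_RULES = (
    "You generate Polars code.\n"
    "- Use only the Polars API (the DataFrame is available as `df`).\n"
    "- Respect the provided schema: never invent column names.\n"
    "- Output code only: no prose, no markdown fences."
)

def build_messages(schema: list[str], query: str) -> list[dict]:
    # System carries the Polars rules and strict constraints;
    # user carries the schema and the request; assistant must answer with code only.
    return [
        {"role": "system", "content": SYSTEM_RULES},
        {"role": "user", "content": f"Schema: {schema}\nQuery: {query}"},
    ]
```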
🗂️ Dataset
The dataset is not massive; it is intentional:
- 15 to 500 examples
- focused on critical patterns
Covered patterns:
- selection
- filters
- aggregations
- joins
- window functions
- nulls
- edge cases
👉 Coverage is more important than volume.
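One illustrative example pair, invented here to show the shape of an entry (the actual dataset contents are not reproduced in this post):

```python
# Invented example pair covering "aggregations" and "nulls":
example = {
    "schema": ["card_name", "mana_cost", "frequency"],
    "query": "average mana cost per card, ignoring nulls",
    "code": (
        "df.drop_nulls('mana_cost')"
        ".group_by('card_name')"
        ".agg(pl.col('mana_cost').mean().alias('avg_mana_cost'))"
    ),
}
```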
📋 Schema injection
The model doesn't guess anything. The schema is injected systematically:
```json
{
  "columns": ["card_name", "mana_cost", "frequency"]
}
```
Direct effects:
- fewer hallucinations
- valid queries on the first try
- strong dependence on the provided context
👉 Without the schema, the system collapses.
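A sketch of how such a block can be produced from the real data rather than written by hand. This variant also includes dtypes, which is an extra assumption on top of the column list shown above:

```python
import json
import polars as pl

def schema_block(df: pl.DataFrame) -> str:
    # Serialize the actual schema so the model never has to guess column names.
    return json.dumps(
        {"columns": {name: str(dtype) for name, dtype in df.schema.items()}},
        indent=2,
    )

df = pl.DataFrame({"card_name": ["Bolt"], "mana_cost": [1], "frequency": [3]})
print(schema_block(df))
```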
⚡ Inference Constraints
- ~6GB model
- CPU only
- high latency
Consequence:
- every call counts
- retries are expensive
- the prompt must be precise from the outset
👉 The cost of error is built into the score.
🧪 Validation & scoring
The benchmark validates a complete behavior, not a raw output.
Levels
- Valid code
- Successful execution
- Correct result
- Acceptable performance
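The first three levels can be checked mechanically. A sketch, assuming the model returns a single Polars expression over `df` and that a recent Polars version (with `DataFrame.equals`) is available; the fourth level, performance, is measured at runtime:

```python
import polars as pl

def classify(code: str, df: pl.DataFrame, expected: pl.DataFrame) -> str:
    try:
        compiled = compile(code, "<generated>", "eval")  # level 1: valid code
    except SyntaxError:
        return "invalid_code"
    try:
        result = eval(compiled, {"pl": pl, "df": df})    # level 2: successful execution
    except Exception:
        return "runtime_error"
    if isinstance(result, pl.DataFrame) and result.equals(expected):
        return "correct"                                 # level 3: correct result
    return "wrong_result"
```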
Score
Score = N / (T × VRAM^0.1 × RAM^0.01)
👉 We measure a constrained system, not an abstract model.
🏗 Architectural choices
Three options:
- fine-tuning → too heavy
- multi-model → too complex
- prompt + dataset → chosen
👉 Decision: encode the behavior in the data.
🧩 Implementation
The API exposes a simple flow:
- input:
  - query
  - metadata
- output:
  - Polars code
The service:
- builds the prompt
- calls the model
- returns the code
👉 Deliberate simplicity: the real difficulty lies elsewhere.
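A minimal FastAPI sketch of that flow. The route name, payload fields, and the `call_llm` stub are assumptions for illustration, not the project's actual API:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    query: str            # natural language request
    columns: list[str]    # schema metadata

class GenerateResponse(BaseModel):
    code: str             # generated Polars code

def call_llm(schema: list[str], query: str) -> str:
    # Placeholder: builds the prompt and calls the model (see the sketches above).
    raise NotImplementedError

@app.post("/generate", response_model=GenerateResponse)
def generate(req: GenerateRequest) -> GenerateResponse:
    return GenerateResponse(code=call_llm(req.columns, req.query))
```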
📦 Positioning
The project is not presented as an LLM wrapper, but as:
a secure analytical copilot for tabular data
With:
- guided prompts
- structured context
- verified execution
👉 The product is defined by its constraints, not by the model.
🧭 System reading
This benchmark shows one thing:
The performance of a small model is a property of the system, not of the model alone.
The real levers:
- structure of the prompt
- dataset
- context injection
- runtime validation
👉 We move from an ML problem to an engineering problem.
🧱 Components
Harness
- abstraction of models
- standardization of inputs/outputs
Executor
- isolated execution
- error capture
- runtime metrics (sketched below)
Scoring
- validity
- performance
- stability
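As an illustration of the Executor, here is a sketch using a plain child process where the real benchmark uses Docker; the runner script, metric names, and parquet data path are assumptions (and the `resource` module is Unix-only):

```python
import json
import subprocess
import sys

RUNNER = """
import json, sys, time, resource
import polars as pl
df = pl.read_parquet(sys.argv[1])
start = time.perf_counter()
result = eval(sys.stdin.read(), {"pl": pl, "df": df})
print(json.dumps({
    "seconds": time.perf_counter() - start,
    "peak_ram_kb": resource.getrusage(resource.RUSAGE_SELF).ru_maxrss,  # Linux: KB
}))
"""

def execute_isolated(code: str, data_path: str, timeout: float = 30.0) -> dict:
    # Isolated execution: crashes and hangs stay contained in the child process.
    try:
        proc = subprocess.run(
            [sys.executable, "-c", RUNNER, data_path],
            input=code, capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "error": "timeout"}
    if proc.returncode != 0:
        return {"ok": False, "error": proc.stderr}   # error capture
    return {"ok": True, **json.loads(proc.stdout)}   # runtime metrics
```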
📊 What is measured
- Execution → is it running?
- Correctness → is the result right?
- Polars quality → correct use of the engine
- Performance → actual cost
🔍 Real-life case
Input:
"Find the top 10 most frequent cards with mana cost โค 3"
Expected:
- correct filter
- correct aggregation
- sorting + limit
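For reference, a query the benchmark could accept for this request. Whether "frequency" must be summed per card or simply sorted on depends on the dataset's semantics, so the aggregation here is an assumption:

```python
import polars as pl

# df: the benchmark DataFrame with the schema shown earlier
top_cards = (
    df.filter(pl.col("mana_cost") <= 3)                        # correct filter
      .group_by("card_name")                                   # correct aggregation
      .agg(pl.col("frequency").sum().alias("total_frequency"))
      .sort("total_frequency", descending=True)                # sorting
      .head(10)                                                # + limit
)
```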
Observed:
- valid but incorrect code
- Python fallback
- silent errors
👉 The benchmark reveals the gap between generation and execution.
⚠️ Real stakes
An LLM generates plausible code, not reliable code
So, in production:
- sandbox required
- systematic validation
- controlled retries
- monitoring
👉 Without this, the system is unstable by default.
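A sketch of a controlled retry loop tying these pieces together, reusing the hypothetical `call_llm` and `execute_isolated` helpers from the sketches above; the attempt budget stays small because each inference call is slow:

```python
def generate_with_retries(query: str, columns: list[str], data_path: str,
                          max_attempts: int = 2) -> str | None:
    error = None
    for _ in range(max_attempts):
        # Controlled retry: the previous error is fed back into the request.
        prompt_query = query if error is None else f"{query}\nPrevious attempt failed with:\n{error}"
        code = call_llm(columns, prompt_query)
        outcome = execute_isolated(code, data_path)  # sandboxed, systematically validated
        if outcome["ok"]:
            return code           # it actually ran
        error = outcome["error"]  # would also be logged for monitoring
    return None                   # give up rather than ship unvalidated code
```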
🧠 Project Contribution
This benchmark introduces:
- an end-to-end executable pipeline
- a multi-criteria assessment
- a reproducible approach
- a production-friendly environment
What remains:
- ai-harness → orchestration
- polars-bench → evaluation
👉 This is no longer a model test; it's a complete system test.
📹 Demo
Submission video:
📦 Hackathon Deliverables
- Full benchmark code
- Test set
- Reproducible pipeline
- Comparative results
Example of a provided structure:
- submission_example
  - main benchmark code
  - datasets
  - logs
🧭 Conclusion
This hackathon is not intended to demonstrate that LLMs work.
It shows where they break.
And most importantly:
A useful benchmark does not measure generation. It measures performance.
🔮 Darkwood Perspective
This type of benchmark fits directly into a broader vision:
- orchestration of AI pipelines
- systematic validation
- generation/execution separation
- observability
👉 This is not a demo tool. It's a building block for constructing reliable systems.