
๐Ÿ‘จโ€๐Ÿ’ป Benchmarking Small Language Models in the Real World

on April 21, 2026

On Saturday, April 18th, I participated in a hackathon that brings machine learning engineers, data engineers, and researchers together in Paris for an in-depth look at evaluating small language models (SLMs).

This hackathon, organized by AI Tinkerers Paris, addresses a very concrete problem:

testing the actual ability of language models to produce production-ready, executable code.

The theme - "Benchmarking Small Language Models in the Real World" - sets a clear framework: moving beyond impressive demos to confront models with real-world constraints (execution, performance, resources).

🎯 Objective

The challenge is to automatically generate Polars queries from natural language, with a strong requirement:

  • produce correct code
  • ensure that it is executable
  • optimize execution time and memory consumption

The scoring reflects this reality:

Score = N / (T × VRAM^0.1 × RAM^0.01)

👉 In other words: accuracy alone is not enough - system efficiency becomes a central constraint.
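As a rough sketch of the trade-off the formula encodes (the brief does not specify units, so seconds for T and GB for VRAM/RAM are assumptions here):

```python
def score(n_correct: int, seconds: float, vram_gb: float, ram_gb: float) -> float:
    """Score = N / (T * VRAM^0.1 * RAM^0.01).

    Units are hypothetical (seconds, GB); this only illustrates how the
    exponents weight the three resource dimensions.
    """
    return n_correct / (seconds * vram_gb ** 0.1 * ram_gb ** 0.01)

# Halving runtime doubles the score; halving VRAM barely moves it (~7%).
fast = score(n_correct=50, seconds=10.0, vram_gb=8.0, ram_gb=16.0)
slow = score(n_correct=50, seconds=20.0, vram_gb=8.0, ram_gb=16.0)
lean = score(n_correct=50, seconds=10.0, vram_gb=4.0, ram_gb=16.0)
```

The tiny exponents on VRAM and RAM mean latency dominates: optimizing the prompt to avoid retries pays off far more than shaving memory.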

⚙️ Technical Context

Each team works in a standardized environment:

  • execution under Docker
  • use of Polars for data processing
  • GPU/memory constraints
  • evaluation dataset provided

The goal is not to craft "the best prompt", but to build a system capable of:

  • withstanding real data
  • producing robust code
  • running an automated benchmark

🧩 Positioning

This hackathon stands out because of its approach:

  • no marketing demo
  • no "impressive but fragile" generation
  • focus on what really works

👉 This is an environment that directly exposes the current limitations of LLMs:

  • hallucinations
  • syntax errors
  • misunderstanding of the data schema

And it forces you to engineer around them:

  • structured prompt engineering
  • systematic validation
  • a fast iteration loop

📊 Takeaway

This is not a "creative" hackathon, but an engineering benchmark.

The final deliverable is not an idea, but:

a measurable, reproducible, and comparable system.

๐Ÿ—๏ธ Day's Organization

Morning - framing and initialization

The morning is dedicated to setting up the technical and organizational framework:

  • Team formation via the event portal
  • Presentation of the problem by the organizers and mentors
  • Clarification of expectations (generation of executable Polars code + scoring)
  • Initial setup of the work environment

The Darkwood team is formed around:

  • Mathieu Ledru
  • Victor-eliejah Garnier
  • Mirza Marotsaha

The objective of this phase is to reduce uncertainty and align everyone on an executable pipeline from the very first hours.

Midday - Forced Break

  • lunch provided on site
  • informal exchanges between teams
  • consultation of the message center (general questions, clarifications from the jury)

👉 A short phase with no real disconnect: the project keeps iterating.

Afternoon - Implementation and Iterations

The afternoon is entirely production-oriented:

  • implementation of the benchmark (polars-bench)
  • integration of the model via ai-harness
  • experimentation on prompts (structure, format, constraints)
  • adjusting model behavior via dataset and prompt engineering
  • progressive validation via actual execution

The iterations focus on three areas:

  • hallucination reduction (schema-aware prompting)
  • improved executable-code rate
  • score optimization (time / memory / accuracy)

👉 The logic is not to add features, but to tighten the system around real constraints.

⚙️ Benchmark Architecture

The system relies on a clear separation of responsibilities:

  • ai-harness → model orchestration layer
  • polars-bench → execution and evaluation engine

This split isolates generation (the LLM) from execution reality (the runtime), which is precisely the objective of the benchmark.
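A minimal sketch of that separation, with hypothetical interfaces (the post does not show the actual ai-harness / polars-bench APIs, so names and signatures here are illustrative):

```python
from dataclasses import dataclass
from typing import Protocol


class Generator(Protocol):
    """Harness side: turns a natural-language query into code."""

    def generate(self, query: str, schema: list[str]) -> str: ...


@dataclass
class RunResult:
    ok: bool
    output: object = None
    error: str = ""


def run(code: str, env: dict) -> RunResult:
    """Bench side: executes generated code and captures failures.

    The generated snippet is assumed to bind its answer to `result`;
    the real benchmark would also sandbox and time this call.
    """
    try:
        exec(code, env)
        return RunResult(ok=True, output=env.get("result"))
    except Exception as e:  # syntax or runtime failure
        return RunResult(ok=False, error=str(e))


# The two halves only meet through strings and run results:
res = run("result = 1 + 1", {})
bad = run("result = undefined_name", {})
```

Keeping the boundary this narrow is what lets the model be swapped out without touching the evaluation engine.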

🤝 Sponsors, Mentors, Organizers & Teams

This hackathon is made possible thanks to an ecosystem of complementary players: sponsors, mentors and organizers, each playing a key role in the overall experience.

๐Ÿข Organizers

The event is organized by the AI โ€‹โ€‹Tinkerers Paris community, a collective active in experimenting with and sharing AI technologies.

  • ๐ŸŒ Official website: https://paris.aitinkerers.org
  • ๐Ÿ“… Hackathon page: https://paris.aitinkerers.org/hackathons/h_sj1ca_J4Hdk
  • ๐Ÿ‘ฅ Organizers: https://paris.aitinkerers.org/hackathons/h_sj1ca_J4Hdk/organizers

Their positioning is clear: to promote concrete experimentation around AI models, with a focus on real engineering rather than demonstration.

💼 Sponsors

Sponsors support the event by providing:

  • technical resources (GPU, infrastructure, tools)
  • funding
  • visibility

👉 Their role is critical to enabling a realistic environment (compute constraints, Docker execution, etc.).

  • Mistral

    • Sponsor page: Hackathon Sponsors
    • Representative: Matthieu Dinot — AI Scientist at Mistral
  • Alpic

    • Sponsor page: Hackathon Sponsors
    • Representative: Nikolay Rodionov — COO at Alpic
    • Other links:
      • X / Twitter
      • GitHub
  • Cloudflare

    • Sponsor page: Hackathon Sponsors
    • Representative: Nans Cyril Bouissou — Account Executive at Cloudflare
  • Fold

    • Sponsor page: Hackathon Sponsors
    • Representative: Raouf Chebri — Developer Relations Engineer at Replit
  • Microsoft

    • Sponsor page: Hackathon Sponsors
    • Representative: Julien Bichon — Developer Experience | GTM Manager at Microsoft
  • ESGI (venue partner mentioned alongside sponsors)

    • Sponsor / venue page: Hackathon Sponsors
    • Representative: Astrid Beaucourt — Communications Officer at ESGI
    • School links:
      • LinkedIn
      • X/Twitter

🧑‍🏫 Mentors

Mentors support participants throughout the hackathon:

  • assistance in structuring approaches
  • feedback on models and prompts
  • tips on performance and optimization

👉 They help teams avoid common dead ends:

  • over-optimization of the prompt
  • lack of validation
  • errors in data interpretation

The mentors:

  • Arthur Mensch

    • Mentor page: Hackathon Mentors
    • Role: Co-founder and CEO of Mistral AI
  • Leo Arsenin

    • Mentor page: Hackathon Mentors
    • LinkedIn: Léo Arsenin
    • Role: Solutions Engineer at Cloudflare
  • Matthieu Dinot

    • Mentor page: Hackathon Mentors
    • LinkedIn: Matthieu Dinot
    • Role: AI Scientist at Mistral
  • Preetham Kaukuntla

    • Mentor page: Hackathon Mentors
    • LinkedIn: Preetham Kaukuntla
    • Role: Staff Data Scientist at Glassdoor

👥 Teams

  • Darkwood

    • Status: 1st Place
    • Entry: /hackathons/h_sj1ca_J4Hdk/entries/ht_qFVxlREukLQ
    • Team: /hackathons/h_sj1ca_J4Hdk/teams/ht_qFVxlREukLQ
    • Demo video: YouTube
    • Members:
      • Mathieu Ledru
      • Mirza Marotsaha
      • Victor-eliejah GARNIER
  • bluebull

    • Status: Best Startup
    • Entry: /hackathons/h_sj1ca_J4Hdk/entries/ht_-p8xvdGl-oA
    • Team: /hackathons/h_sj1ca_J4Hdk/teams/ht_-p8xvdGl-oA
    • Members:
      • Vasiliki Doropoulou
  • Muon

    • Status: 10x Data Scientist
    • Entry: /hackathons/h_sj1ca_J4Hdk/entries/ht_ub4tDzlrft0
    • Team: /hackathons/h_sj1ca_J4Hdk/teams/ht_ub4tDzlrft0
    • Members:
      • Imane Momayiz
  • Polaris

    • Status: Finalist
    • Entry: /hackathons/h_sj1ca_J4Hdk/entries/ht_y4_Yz6P5BZE
    • Team: /hackathons/h_sj1ca_J4Hdk/teams/ht_y4_Yz6P5BZE
    • Members:
      • Hippolyte Dupont
      • Ghaith ABDESSALEM
      • Jacques Dumora
  • Training Expert

    • Status: Finalist
    • Entry: /hackathons/h_sj1ca_J4Hdk/entries/ht_njEtwEBAmUE
    • Team: /hackathons/h_sj1ca_J4Hdk/teams/ht_njEtwEBAmUE
    • Members:
      • Eva Useros Marugan
      • Paul-Louis Fouesnant
  • CodeMind

    • Status: Submitted
    • Entry: /hackathons/h_sj1ca_J4Hdk/entries/ht_wRj-nEG24zE
    • Team: /hackathons/h_sj1ca_J4Hdk/teams/ht_wRj-nEG24zE
    • Members:
      • Amelie Smith
      • Damien Frechou
      • Anis Kaci
      • Choutri Adel Djalil
  • Coffe is life

    • Status: Submitted
    • Entry: /hackathons/h_sj1ca_J4Hdk/entries/ht_RFBTGYYwbm4
    • Team: /hackathons/h_sj1ca_J4Hdk/teams/ht_RFBTGYYwbm4
    • Members:
      • Pierre Lepagnol
      • Filipp Trigub
  • PiLLM

    • Status: Submitted
    • Entry: /hackathons/h_sj1ca_J4Hdk/entries/ht_Gc4SneBV2D4
    • Team: /hackathons/h_sj1ca_J4Hdk/teams/ht_Gc4SneBV2D4
    • Members:
      • Din Sokheng
      • Ahmed Abdelaziz Mokeddem

🔗 Useful Links

  • 🏠 Home: https://paris.aitinkerers.org/hackathons/h_sj1ca_J4Hdk
  • 👥 Teams: https://paris.aitinkerers.org/hackathons/h_sj1ca_J4Hdk/teams
  • 📩 Message Center: https://paris.aitinkerers.org/message_center?board_key=meetup_mu_eZJ5tCXlA2A
  • 🏆 Submissions: https://paris.aitinkerers.org/hackathons/h_sj1ca_J4Hdk/entries
  • 📊 Results: https://paris.aitinkerers.org/hackathons/h_sj1ca_J4Hdk/results

🧱 Technical Stack

The architecture is intentionally simple, but constrained by real-world conditions:

  • Python for execution
  • Polars as the query engine
  • Qwen 2.5 as the generation model
  • FastAPI for the interface
  • local sandbox (CPU) with memory constraints

The core of the system is an API that turns a natural-language request into directly executable Polars code.
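Stripped of the FastAPI layer, that core transformation can be sketched like this (the model call is stubbed out, and the prompt wording is illustrative, not the team's actual prompt):

```python
def build_prompt(query: str, columns: list[str]) -> str:
    """Assemble a schema-aware prompt for the code-generation model."""
    return (
        "You generate Polars code only. No prose, no markdown.\n"
        f"Available columns: {', '.join(columns)}\n"
        f"Query: {query}\n"
    )


def generate_polars(query: str, columns: list[str], llm) -> str:
    """End-to-end: natural-language request in, Polars code out."""
    return llm(build_prompt(query, columns)).strip()


def fake_llm(prompt: str) -> str:
    # Stand-in for the real model call; a FastAPI endpoint would wrap
    # generate_polars and forward the request to the inference server.
    return 'df.filter(pl.col("mana_cost") <= 3)'


code = generate_polars("cards costing at most 3", ["card_name", "mana_cost"], fake_llm)
```

Passing the model as a callable keeps the endpoint testable without a GPU in the loop.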

🔄 Execution Pipeline

The benchmark requires a complete chain, with no shortcuts:

User query (NL)
    ↓
Enriched prompt (schema + rules)
    ↓
LLM
    ↓
Generated Polars code
    ↓
Actual execution
    ↓
Validation
    ↓
Scoring

👉 The model is not evaluated on what it "says", but on what its code actually produces.

🧠 Generation Strategy

The key choice is simple:

❌ train the model → ✅ constrain its behavior

Rather than heavy-handed fine-tuning, the system is based on:

  • a structured prompt
  • a dataset of targeted examples

Required Format

System:
- Polars rules
- strict constraints

User:
- schema
- query

Assistant:
- code only

👉 The model learns an execution pattern, not general knowledge.
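Rendered as chat messages, the required format above might look like this (field contents are placeholder examples, not the team's actual prompt):

```python
# Hypothetical rendering of the System / User / Assistant contract.
messages = [
    {"role": "system", "content": (
        "Polars rules: use pl.col expressions, prefer lazy-friendly code.\n"
        "Strict constraints: output code only, bind the answer to `result`."
    )},
    {"role": "user", "content": (
        'Schema: {"columns": ["card_name", "mana_cost", "frequency"]}\n'
        "Query: average mana cost per card name"
    )},
    # Few-shot target: the assistant turn contains nothing but code.
    {"role": "assistant", "content":
        'result = df.group_by("card_name").agg(pl.col("mana_cost").mean())'},
]

roles = [m["role"] for m in messages]
```

Every dataset example repeats this exact shape, so the model internalizes "assistant turn = code only" rather than general chat behavior.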

🗂️ Dataset

The dataset is not massive; it is intentional:

  • 15 to 500 examples
  • focused on critical patterns

Patterns covered:

  • selection
  • filters
  • aggregations
  • joins
  • window functions
  • nulls
  • edge cases

👉 Coverage is more important than volume.

📊 Schema injection

The model doesn't guess anything. The schema is injected systematically:

{
  "columns": ["card_name", "mana_cost", "frequency"]
}

Direct effects:

  • fewer hallucinations
  • valid queries on the first try
  • strong dependence on the provided context

👉 Without the schema, the system collapses.

⚡ Inference Constraints

  • ~6GB model
  • CPU only
  • high latency

Consequences:

  • every call counts
  • retries are expensive
  • the prompt must be precise from the outset

👉 The cost of error is included in the score.

🧪 Validation & scoring

The benchmark validates a complete behavior, not a raw output.

Levels

  1. Valid code
  2. Successful execution
  3. Correct result
  4. Acceptable performance
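The levels can be checked incrementally. A minimal sketch, assuming the generated snippet binds its answer to `result` (the real benchmark runs inside Docker; level 4 would additionally time the run against a budget):

```python
import ast


def validation_level(code: str, env: dict, expected) -> int:
    """Return the highest level reached: 1 valid, 2 runs, 3 correct.

    Level 4 (acceptable performance) is omitted here; it would wrap
    the exec call with timing and compare against a latency budget.
    """
    try:
        ast.parse(code)                      # level 1: syntactically valid
    except SyntaxError:
        return 0
    try:
        exec(code, env)                      # level 2: executes without error
    except Exception:
        return 1
    # level 3: produced the expected result
    return 3 if env.get("result") == expected else 2


ok = validation_level("result = sum([1, 2, 3])", {}, 6)
wrong = validation_level("result = 1", {}, 6)
broken = validation_level("result = 1 /", {}, 6)
```

Ordering the checks this way also explains the benchmark's failure taxonomy: syntax errors, runtime crashes, and silently wrong results are distinct buckets.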

Score

Score = N / (T × VRAM^0.1 × RAM^0.01)

👉 We measure a constrained system, not an abstract model.

🔍 Architectural choices

Three options:

  • fine-tuning → too heavy
  • multi-model setups → too complex
  • prompt + dataset → chosen

👉 Decision: encode the behavior in the data.

๐Ÿงฉ Implementation

The API exposes a simple flow:

  • input:

    • request
    • metadata
  • output:

    • Polars code

The service:

  • builds the prompt
  • calls the model
  • returns the code

👉 Deliberate simplicity: the real difficulty lies elsewhere.

📦 Positioning

The project is not presented as an LLM wrapper, but as:

a secure analytical copilot for tabular data

With:

  • guided prompts
  • structured context
  • verified execution

👉 The product is defined by its constraints, not by the model.

🧭 System-level takeaway

This benchmark shows one thing:

The performance of a small model is a property of the system, not of the model alone.

The real levers:

  • structure of the prompt
  • dataset
  • context injection
  • runtime validation

👉 We move from an ML problem to an engineering problem.

🧱 Components

Harness

  • abstraction of models
  • standardization of inputs/outputs

Executor

  • isolated execution
  • error capture
  • runtime metrics

Scoring

  • validity
  • performance
  • stability

📊 What is measured

  • Execution → does it run?
  • Correctness → is the result right?
  • Polars quality → proper use of the engine
  • Performance → actual cost

๐Ÿ” Real-life case

Input:

"Find the top 10 most frequent cards with mana cost โ‰ค 3"

Expected :

  • filter correct
  • correct aggregation
  • sorting + limit
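For reference, a correct answer has roughly this shape, sketched here with a plain-Python equivalent (the Polars one-liner appears as a comment; column names follow the schema example earlier, and a precomputed `frequency` column is not assumed):

```python
from collections import Counter

# What the model should emit, roughly, if `frequency` is precomputed:
# result = (df.filter(pl.col("mana_cost") <= 3)
#             .sort("frequency", descending=True)
#             .head(10))
# The plain-Python version below recounts occurrences instead, to make
# the filter / aggregate / sort / limit steps explicit.


def top_cards(rows: list[dict], max_cost: int = 3, limit: int = 10) -> list[str]:
    cheap = [r["card_name"] for r in rows if r["mana_cost"] <= max_cost]  # filter
    counts = Counter(cheap)                                               # aggregate
    return [name for name, _ in counts.most_common(limit)]                # sort + limit


rows = (
    [{"card_name": "Bolt", "mana_cost": 1}] * 3
    + [{"card_name": "Giant", "mana_cost": 6}] * 5
    + [{"card_name": "Elf", "mana_cost": 1}] * 2
)
top = top_cards(rows)
```

Note that "Giant" is the most frequent card overall but must be excluded by the filter; missing that ordering of steps is exactly the kind of plausible-but-wrong output the benchmark catches.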

Observed:

  • valid but incorrect code
  • Python fallback
  • silent errors

👉 The benchmark reveals the gap between generation and execution.

⚠️ Real stakes

An LLM generates plausible code, not reliable code.

So, in production:

  • sandbox required
  • systematic validation
  • controlled retries
  • monitoring

👉 Without this, the system is unstable by default.
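A minimal sandbox along those lines (process-level isolation with a timeout only; the hackathon setup adds Docker on top, and a production deployment would also cap memory):

```python
import subprocess
import sys


def run_sandboxed(code: str, timeout_s: float = 5.0) -> tuple[bool, str]:
    """Execute untrusted generated code in a separate interpreter.

    A crash, exception, or infinite loop in the generated snippet
    cannot take down the benchmark process itself.
    """
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout_s,  # raises TimeoutExpired if the code hangs
    )
    return proc.returncode == 0, proc.stderr.strip()


ok, _ = run_sandboxed("print(1 + 1)")
failed, err = run_sandboxed("raise ValueError('bad generated code')")
```

Capturing stderr also feeds the monitoring loop: failure messages can be logged, classified, and used to decide whether a controlled retry is worth its latency cost.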

🧠 Project Contribution

This benchmark introduces:

  • an end-to-end executable pipeline
  • a multi-criteria assessment
  • a reproducible approach
  • a production-friendly environment

The two building blocks:

  • ai-harness → orchestration
  • polars-bench → evaluation

👉 This is no longer a model test; it's a test of the complete system.

📹 Demo

Submission video:

📦 Hackathon Deliverables

  • Full benchmark code
  • Test set
  • Reproducible pipeline
  • Comparative results

Example of a provided structure:

  • submission_example
  • main benchmark code
  • datasets
  • logs

🧭 Conclusion

This hackathon is not intended to demonstrate that LLMs work.

It shows where they break.

And most importantly:

A useful benchmark does not measure generation. It measures performance.

🔮 Darkwood Perspective

This type of benchmark fits directly into a broader vision:

  • orchestration of AI pipelines
  • systematic validation
  • generation/execution separation
  • observability

👉 This is not a demo tool. It's a building block for constructing reliable systems.
