🚀 I'm building a dictation engine in PHP (Flow + Symfony + Whisper.cpp)

on February 22, 2026

Building a dictation engine in 2026 is trivial.

Building a clean architecture around a dictation engine is more interesting.

This article presents Flowvox, an MVP of an audio transcription engine developed in PHP, based on:

Symfony
Symfony Messenger
Flow: in-house orchestrator
ffmpeg
whisper.cpp

The source code is available as open source: 👉 https://github.com/darkwood-com/flowvox

The goal wasn't simply to use Whisper. The goal was to properly structure the pipeline.

The problem: transcription is only one step

A minimal voice engine can be summarized as follows:

Audio → Text

But in a real-world system, several constraints emerge:

Start/stop activation
Finalizing the audio file
Recorder state management
Orchestration of the stages
Extension to post-processing (summary, LLM, analysis)

The question then becomes:

How do you model a clean, scalable, and controlled audio pipeline?

Technical Stack

The MVP is based on:

PHP 8+
Symfony
Symfony Messenger
Flow (orchestrator)
ffmpeg (local audio capture)
whisper.cpp (local open source transcription)

No remote API. No cloud service. 100% local transcription.

General Architecture

The architecture is organized into three flows:

InputProvider → Recorder → Transcribe

Each step is isolated and responsible for a specific role.
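To make the chaining concrete, here is a minimal sketch in plain PHP. It does not use Flow's actual API: the `pipeline()` helper, the three closures, and the `/tmp/session.wav` path are hypothetical stand-ins that only illustrate how each stage consumes the output of the previous one.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-ins for the three flows; none of this is Flow's real API.

/** @param list<callable> $stages each stage receives the previous stage's output */
function pipeline(array $stages): callable
{
    return static function (mixed $input) use ($stages): mixed {
        foreach ($stages as $stage) {
            $input = $stage($input);
        }
        return $input;
    };
}

// InputProvider: turns a CLI command into an event.
$inputProvider = static fn (string $command): array => ['action' => $command];

// Recorder: only a "stop" event yields a finalized file in this sketch.
$recorder = static fn (array $event): string => $event['action'] === 'stop'
    ? '/tmp/session.wav' // placeholder path
    : throw new RuntimeException('Only "stop" yields a file in this sketch');

// Transcribe: placeholder for the whisper.cpp call.
$transcribe = static fn (string $wavPath): string => "transcript of {$wavPath}";

$run = pipeline([$inputProvider, $recorder, $transcribe]);
echo $run('stop'), PHP_EOL;
```

Because each stage only knows its own input and output, any of them can be replaced without touching the other two.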

InputProviderFlow

Responsibility:

Listen for the voice:start and voice:stop commands
Emit a VoiceControlEvent

CLI commands trigger messages via Symfony Messenger.

The worker, in the background, receives these events and injects them into Flow.

This decoupling allows:

Granular control
Multi-session management
A clear separation of responsibilities
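The decoupling described above can be sketched without Symfony installed. In the example below, `InMemoryQueue` is a hypothetical stand-in for a Messenger transport, and the fields of `VoiceControlEvent` (`action`, `sessionId`) are assumptions; the real message class may look different. The sketch assumes PHP 8.1 or later for enums and readonly properties.

```php
<?php
declare(strict_types=1);

enum VoiceAction: string
{
    case Start = 'start';
    case Stop = 'stop';
}

// Hypothetical message shape; the real VoiceControlEvent may carry other fields.
final class VoiceControlEvent
{
    public function __construct(
        public readonly VoiceAction $action,
        public readonly string $sessionId,
    ) {}
}

// In-memory stand-in for a Messenger transport:
// the CLI side dispatches, the worker side receives.
final class InMemoryQueue
{
    /** @var list<VoiceControlEvent> */
    private array $messages = [];

    public function dispatch(VoiceControlEvent $event): void
    {
        $this->messages[] = $event;
    }

    public function receive(): ?VoiceControlEvent
    {
        return array_shift($this->messages); // null when the queue is empty
    }
}

// CLI side: voice:start, then voice:stop.
$queue = new InMemoryQueue();
$queue->dispatch(new VoiceControlEvent(VoiceAction::Start, 'session-1'));
$queue->dispatch(new VoiceControlEvent(VoiceAction::Stop, 'session-1'));

// Worker side: drain the queue and hand each event to the pipeline.
while (($event = $queue->receive()) !== null) {
    echo "{$event->sessionId}: {$event->action->value}", PHP_EOL;
}
```

Carrying a session identifier in the message is one possible way to support several sessions at once; it is shown here only as an illustration.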

RecorderFlow

Responsibility:

Control a VoiceRecorder instance
Manage the lifecycle of an ffmpeg process

The VoiceRecorder encapsulates a system process launched via:

Symfony\Component\Process\Process
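As an illustration, the capture command handed to the process could be built as an argument list like the one below. This is a sketch under assumptions: the `buildCaptureCommand()` helper, the device names (`:0` for avfoundation on macOS, `default` for ALSA on Linux), and the output path are not taken from the Flowvox source. The 16 kHz, mono, 16-bit PCM settings match the WAV format that whisper.cpp expects.

```php
<?php
declare(strict_types=1);

/**
 * Builds an ffmpeg capture command as an argument list.
 * Device names are assumptions and vary between machines.
 *
 * @return list<string>
 */
function buildCaptureCommand(string $outputWav, string $osFamily = PHP_OS_FAMILY): array
{
    [$format, $device] = match ($osFamily) {
        'Darwin' => ['avfoundation', ':0'], // no video, audio device 0
        'Linux' => ['alsa', 'default'],
        default => throw new InvalidArgumentException("No capture preset for {$osFamily}"),
    };

    return [
        'ffmpeg', '-y',              // overwrite an existing file
        '-f', $format, '-i', $device,
        '-ar', '16000',              // 16 kHz sample rate
        '-ac', '1',                  // mono
        '-c:a', 'pcm_s16le',         // 16-bit PCM
        $outputWav,
    ];
}

echo implode(' ', buildCaptureCommand('/tmp/session.wav', 'Linux')), PHP_EOL;
```

An argument list of this shape can be passed directly to the `Process` constructor, which avoids manual shell escaping.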

Central problem:

How can I properly manage start/stop without corrupting the audio file?

Three states are explicitly modeled:

idle
recording
stopping

During a stop, a SIGINT is sent to ffmpeg in order to properly finalize the WAV header.

The stopping state prevents:

Double starts
Race conditions between start and stop
Incomplete files

The process is controlled, not endured.
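The three states can be sketched as a small state machine. The `SketchRecorder` class below is hypothetical and is not the actual VoiceRecorder: the two closures stand in for launching ffmpeg and for interrupting it. With Symfony Process, the interrupt step would plausibly be `$process->signal(SIGINT)` followed by `$process->wait()`, assuming the `SIGINT` constant is available through the pcntl extension.

```php
<?php
declare(strict_types=1);

enum RecorderState: string
{
    case Idle = 'idle';
    case Recording = 'recording';
    case Stopping = 'stopping';
}

// Hypothetical recorder; the closures stand in for the real ffmpeg process.
final class SketchRecorder
{
    private RecorderState $state = RecorderState::Idle;

    public function __construct(
        private readonly Closure $launch,    // starts the capture process
        private readonly Closure $interrupt, // sends SIGINT and waits for exit
    ) {}

    public function start(): bool
    {
        if ($this->state !== RecorderState::Idle) {
            return false; // rejects a double start and a start while stopping
        }
        ($this->launch)();
        $this->state = RecorderState::Recording;
        return true;
    }

    public function stop(): bool
    {
        if ($this->state !== RecorderState::Recording) {
            return false; // nothing to stop, or a stop is already in progress
        }
        $this->state = RecorderState::Stopping;
        ($this->interrupt)(); // the file is finalized once this returns
        $this->state = RecorderState::Idle;
        return true;
    }

    public function state(): RecorderState
    {
        return $this->state;
    }
}

$log = [];
$recorder = new SketchRecorder(
    function () use (&$log): void { $log[] = 'launch'; },
    function () use (&$log): void { $log[] = 'SIGINT'; },
);
$recorder->start();
$recorder->start(); // ignored: already recording
$recorder->stop();
echo implode(', ', $log), PHP_EOL; // prints "launch, SIGINT"
```

Returning `false` instead of throwing is a design choice of this sketch; an implementation could just as well raise an exception on an invalid transition.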

TranscribeFlow

Responsibility:

Receive a finalized WAV file
Launch whisper.cpp
Produce a transcribed text

Whisper is run locally via CLI.

The MVP remains intentionally simple:

No streaming
No real-time chunking
A synchronous transcription

The goal is to validate the integration and orchestration.
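A synchronous CLI call of this kind can be sketched as follows. The two helper functions and every path are assumptions for illustration. The flags are whisper.cpp CLI options: `-m` for the model, `-f` for the input file, `-nt` to omit timestamps, `-otxt` to write a text file, and `-of` for the output path without extension.

```php
<?php
declare(strict_types=1);

/**
 * Builds a whisper.cpp command as an argument list.
 * The binary and model locations are assumptions; adjust them to the local build.
 *
 * @return list<string>
 */
function buildWhisperCommand(string $binary, string $modelPath, string $wavPath, string $outputBase): array
{
    return [
        $binary,
        '-m', $modelPath,
        '-f', $wavPath,
        '-nt',             // no timestamps in the output
        '-otxt',           // write the transcript to a .txt file
        '-of', $outputBase // output path without extension
    ];
}

/** Reads the transcript written by whisper.cpp at "<outputBase>.txt". */
function readTranscript(string $outputBase): string
{
    $text = @file_get_contents($outputBase . '.txt');
    if ($text === false) {
        throw new RuntimeException("No transcript found at {$outputBase}.txt");
    }
    return trim($text);
}

$command = buildWhisperCommand(
    './build/bin/whisper-cli',  // assumed binary location
    'models/ggml-base.en.bin',  // assumed model file
    '/tmp/session.wav',
    '/tmp/session',
);
echo implode(' ', array_map('escapeshellarg', $command)), PHP_EOL;
```

Once the command has run to completion, `readTranscript('/tmp/session')` would return the text, which is the whole contract of the synchronous step.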

Worker and orchestration

The engine operates via a Symfony worker:

php bin/console voice:worker

This worker:

Instantiates Flow
Registers the flows
Listens to Symfony Messenger
Orchestrates the execution of the steps

Available commands:

voice:start
voice:stop
voice:worker-list

The complete flow becomes:

voice:start
→ Recorder starts
→ voice:stop
→ Recorder finalizes the file
→ TranscribeFlow runs
→ Text is produced

All without any external global state.

Why Flow?

Flow allows:

A pipeline-oriented architecture
Input Processing strategies (IP Strategy)
Explicit event management
A clear separation between orchestration and business logic

The system is not coupled to Whisper.

Whisper is an implementation. Flow is the structure.
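One way to express this separation in code is to hide the engine behind an interface. The names below (`Transcriber`, `FakeTranscriber`, `TranscribeStep`) are hypothetical and are not taken from the Flowvox source; a whisper.cpp implementation would be one more class implementing the same interface.

```php
<?php
declare(strict_types=1);

// Hypothetical contract: the pipeline only depends on this interface.
interface Transcriber
{
    public function transcribe(string $wavPath): string;
}

// Test double; a whisper.cpp-backed class would implement the same method.
final class FakeTranscriber implements Transcriber
{
    public function transcribe(string $wavPath): string
    {
        return 'fake transcript for ' . basename($wavPath);
    }
}

// The pipeline step knows nothing about the engine behind the interface.
final class TranscribeStep
{
    public function __construct(private readonly Transcriber $transcriber) {}

    public function __invoke(string $wavPath): string
    {
        return $this->transcriber->transcribe($wavPath);
    }
}

$step = new TranscribeStep(new FakeTranscriber());
echo $step('/tmp/session.wav'), PHP_EOL;
```

Swapping the engine then means changing a single constructor argument, while the orchestration stays untouched.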

What the MVP validates

Proper management of a system process
Explicit modeling of states
Event orchestration
Pipeline scalability

This is not a product.

It's an architectural foundation.

Possible Developments

The next natural iterations:

Chunked audio streaming
Parallel transcription
LLM post-processing
Desktop integration
Mobile support
Multi-model batching

But these changes do not alter the core:

A clear architecture. A controlled orchestration. An extensible pipeline.

Source code

The open source repository is available here:

👉 https://github.com/darkwood-com/flowvox

Contributions, suggestions and feedback are welcome.

Sources

This article is based on a blog post from guillaume.id.

Conclusion

Building a voice engine in PHP is simple.

Building a clean architecture around a voice engine is more interesting.

Flowvox validates a principle:

Transcription is only one component. The orchestration is the true structure.
