I'm building a dictation engine in PHP (Flow + Symfony + Whisper.cpp)
on February 22, 2026
Building a dictation engine in 2026 is trivial.
Building a clean architecture around a dictation engine is more interesting.
This article presents Flowvox, an MVP of an audio transcription engine developed in PHP, based on:
- Symfony
- Symfony Messenger
- Flow: in-house orchestrator
- ffmpeg
- whisper.cpp
The source code is available as open source: https://github.com/darkwood-com/flowvox
The goal wasn't simply to use Whisper. The goal was to properly structure the pipeline.
The problem: transcription is only one step
A minimal voice engine can be summarized as follows:
Audio → Text
But in a real-world system, several constraints emerge:
- Start/stop activation
- Finalizing the audio file
- Recorder state management
- Orchestration of the stages
- Extension to post-processing (summary, LLM, analysis)
The question then becomes:
How do you model a clean, scalable, and controlled audio pipeline?
Technical Stack
The MVP is based on:
- PHP 8+
- Symfony
- Symfony Messenger
- Flow (orchestrator)
- ffmpeg (local audio capture)
- whisper.cpp (local open source transcription)
No remote API. No cloud service. 100% local transcription.
General Architecture
The architecture is organized into three flows:
InputProvider → Recorder → Transcribe
Each step is isolated and responsible for a specific role.
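To make the idea of three isolated steps concrete, here is a minimal sketch in plain PHP. It does not use Flow's API, and every name in it (runPipeline, the three closures, the file path) is hypothetical; it only shows how each stage can take the previous stage's output without knowing anything else about it.

```php
<?php
// Hypothetical sketch: three isolated stages chained into one pipeline.
// None of these names come from Flowvox or Flow; they only illustrate the shape.

/** Runs each stage in order, feeding the output of one into the next. */
function runPipeline(array $stages, mixed $input): mixed
{
    foreach ($stages as $stage) {
        $input = $stage($input);
    }

    return $input;
}

// Stage 1: turn a raw command into a control event.
$inputProvider = fn (string $command): array => ['action' => $command];

// Stage 2: pretend to record and return the path of a finalized WAV file.
$recorder = fn (array $event): string => '/tmp/session-' . $event['action'] . '.wav';

// Stage 3: pretend to transcribe the file.
$transcribe = fn (string $wavPath): string => 'transcript of ' . basename($wavPath);

$result = runPipeline([$inputProvider, $recorder, $transcribe], 'stop');
echo $result, PHP_EOL;
```

Because each stage is just a callable, any one of them could be swapped out (a different recorder, a different transcription backend) without touching the other two.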
InputProviderFlow
Responsibilities:
- Listen for the commands voice:start and voice:stop
- Issue a VoiceControlEvent
CLI commands trigger messages via Symfony Messenger.
The worker, in the background, receives these events and injects them into Flow.
This decoupling allows:
- Granular control
- Multi-session management
- A clear separation of responsibilities
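The decoupling between the CLI commands and the worker could look roughly like the sketch below. The article names VoiceControlEvent but not its fields, so the action and sessionId properties are assumptions, and InMemoryQueue is a hypothetical stand-in for the Symfony Messenger transport so the example runs on its own (PHP 8.1+ for readonly properties).

```php
<?php
// Hypothetical message: the article names VoiceControlEvent but not its fields.
final class VoiceControlEvent
{
    public function __construct(
        public readonly string $action,    // 'start' or 'stop' (assumed values)
        public readonly string $sessionId, // assumed: lets several sessions coexist
    ) {
        if (!in_array($action, ['start', 'stop'], true)) {
            throw new InvalidArgumentException("Unknown action: $action");
        }
    }
}

// In-memory stand-in for the Messenger transport between CLI and worker.
final class InMemoryQueue
{
    /** @var VoiceControlEvent[] */
    private array $messages = [];

    public function dispatch(VoiceControlEvent $event): void
    {
        $this->messages[] = $event;
    }

    /** Returns the oldest pending event, or null when the queue is empty. */
    public function receive(): ?VoiceControlEvent
    {
        return array_shift($this->messages);
    }
}

$queue = new InMemoryQueue();

// What the CLI commands would do: publish an event and return immediately.
$queue->dispatch(new VoiceControlEvent('start', 'session-1'));
$queue->dispatch(new VoiceControlEvent('stop', 'session-1'));

// What the worker would do: consume events in order.
$seen = [];
while (($event = $queue->receive()) !== null) {
    $seen[] = $event->sessionId . ':' . $event->action;
}
echo implode(', ', $seen), PHP_EOL;
```

Carrying a session identifier on the event is one possible way to support several sessions at once; the real project may do this differently.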
RecorderFlow
Responsibilities:
- Control a VoiceRecorder instance
- Manage the lifecycle of an ffmpeg process
The VoiceRecorder encapsulates a system process launched via:
Symfony\Component\Process\Process
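As an illustration, the command handed to that process might be assembled as follows. This is a sketch, not the project's actual command: buildFfmpegCommand is a hypothetical helper, and the capture backend and device are platform-specific assumptions (for example avfoundation with ":0" on macOS, or pulse with "default" on Linux).

```php
<?php
// Hypothetical helper: builds the argument list for an ffmpeg capture process.
// The input format and device depend on the platform and are assumptions here.
function buildFfmpegCommand(string $inputFormat, string $device, string $outputWav): array
{
    return [
        'ffmpeg',
        '-y',               // overwrite the output file if it already exists
        '-f', $inputFormat, // capture backend
        '-i', $device,      // capture device
        '-ac', '1',         // mono
        '-ar', '16000',     // 16 kHz, the sample rate whisper.cpp works with
        $outputWav,
    ];
}

$command = buildFfmpegCommand('avfoundation', ':0', '/tmp/session-1.wav');
echo implode(' ', $command), PHP_EOL;
```

The array form matters: the Process constructor takes the command as an array of arguments, which avoids shell quoting issues with device names and paths.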
Central problem:
How can I properly manage start/stop without corrupting the audio file?
Three states are explicitly modeled:
- idle
- recording
- stopping
On stop, a SIGINT is sent to ffmpeg so that it can exit cleanly and finalize the WAV header.
The stopping state prevents:
- The double-start
- Concurrency conflicts
- Incomplete files
The process is controlled, not endured.
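The three states and their guarded transitions can be sketched as a small state machine. The class and method names below are hypothetical, and the ffmpeg process itself is left out so the example stays runnable; in a real recorder, start() would launch the process and stop() would presumably call something like Process::signal(SIGINT) and then wait for it to exit. Enums require PHP 8.1.

```php
<?php
// Hypothetical sketch of the three recorder states and their guarded transitions.
enum RecorderState: string
{
    case Idle = 'idle';
    case Recording = 'recording';
    case Stopping = 'stopping';
}

final class RecorderStateMachine
{
    private RecorderState $state = RecorderState::Idle;

    public function state(): RecorderState
    {
        return $this->state;
    }

    /** idle -> recording. Refused in any other state, which blocks a double start. */
    public function start(): void
    {
        $this->transition(RecorderState::Idle, RecorderState::Recording);
    }

    /** recording -> stopping. A real recorder would signal ffmpeg here. */
    public function stop(): void
    {
        $this->transition(RecorderState::Recording, RecorderState::Stopping);
    }

    /** stopping -> idle, once the process has exited and the WAV file is complete. */
    public function finalized(): void
    {
        $this->transition(RecorderState::Stopping, RecorderState::Idle);
    }

    private function transition(RecorderState $from, RecorderState $to): void
    {
        if ($this->state !== $from) {
            throw new LogicException(sprintf(
                'Cannot go to "%s" from "%s"',
                $to->value,
                $this->state->value
            ));
        }
        $this->state = $to;
    }
}

$recorder = new RecorderStateMachine();
$recorder->start();
$recorder->stop();

// A start request during finalization is rejected instead of racing with ffmpeg.
try {
    $recorder->start();
    $rejected = false;
} catch (LogicException $e) {
    $rejected = true;
}

$recorder->finalized();
echo $recorder->state()->value, PHP_EOL;
```

The stopping state is what makes the difference: without it, a new start arriving while ffmpeg is still writing the header would look legal.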
TranscribeFlow
Responsibilities:
- Receive a finalized WAV file
- Launch whisper.cpp
- Produce a transcribed text
Whisper is run locally via CLI.
The MVP remains intentionally simple:
- No streaming
- No real-time chunking
- Synchronous transcription
The goal is to validate the integration and orchestration.
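A synchronous CLI call of this kind could be prepared as in the sketch below. Both helpers are hypothetical, as are the binary and model paths; the flags used (-m for the model, -f for the input file, -nt to suppress timestamps) are the ones I understand the whisper.cpp CLI to accept, and should be checked against the local build.

```php
<?php
// Hypothetical helper: builds the argument list for a whisper.cpp CLI call.
// The binary and model paths are assumptions; adjust them to the local build.
function buildWhisperCommand(string $binary, string $modelPath, string $wavPath): array
{
    return [
        $binary,
        '-m', $modelPath, // model file
        '-f', $wavPath,   // input WAV file
        '-nt',            // do not print timestamps
    ];
}

/** Collapses the whitespace and line breaks the CLI prints around segments. */
function normalizeTranscript(string $stdout): string
{
    return trim(preg_replace('/\s+/', ' ', $stdout) ?? '');
}

$command = buildWhisperCommand('./whisper-cli', 'models/ggml-base.bin', '/tmp/session-1.wav');
echo implode(' ', $command), PHP_EOL;

// Example with made-up CLI output, to show the cleanup step only.
$text = normalizeTranscript("\n Hello world.\n This is a test.\n");
echo $text, PHP_EOL;
```

Keeping command construction and output cleanup as pure functions makes this step easy to test without a model file or an audio device.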
Worker and orchestration
The engine operates via a Symfony worker:
php bin/console voice:worker
This worker:
- Instantiates Flow
- Registers the flows
- Listens to Symfony Messenger
- Orders the execution of the steps
Available commands:
voice:start
voice:stop
voice:worker-list
The complete flow becomes:
voice:start
→ Recorder starts
→ voice:stop
→ Recorder finalizes
→ TranscribeFlow runs
→ Text produced
Without an external global state.
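The sequence above can be sketched as a small worker loop in which handlers are registered per event type and may emit follow-up events. This is not Flow's API: the Worker class, the event type strings, and the array-based events are all hypothetical, chosen only to show how a stop event can lead to transcription without any shared global state.

```php
<?php
// Hypothetical worker loop: handlers are registered per event type, then each
// queued event is routed to its handler. This is not Flow's API.
final class Worker
{
    /** @var array<string, callable> */
    private array $handlers = [];

    public function register(string $type, callable $handler): void
    {
        $this->handlers[$type] = $handler;
    }

    /** Processes events in order; a handler may return follow-up events. */
    public function run(array $queue): void
    {
        while ($queue !== []) {
            $event = array_shift($queue);
            $handler = $this->handlers[$event['type']] ?? null;
            if ($handler === null) {
                continue; // unknown events are ignored in this sketch
            }
            foreach ($handler($event) as $next) {
                $queue[] = $next;
            }
        }
    }
}

$log = [];
$worker = new Worker();

$worker->register('voice.start', function (array $event) use (&$log): array {
    $log[] = 'recorder started';
    return [];
});

$worker->register('voice.stop', function (array $event) use (&$log): array {
    $log[] = 'recorder finalized';
    // The finalized file is passed on as a new event, not stored globally.
    return [['type' => 'audio.ready', 'path' => '/tmp/session-1.wav']];
});

$worker->register('audio.ready', function (array $event) use (&$log): array {
    $log[] = 'transcribed ' . $event['path'];
    return [];
});

$worker->run([['type' => 'voice.start'], ['type' => 'voice.stop']]);
echo implode(' -> ', $log), PHP_EOL;
```

Everything a step needs travels inside the event it receives, which is what lets the steps stay independent of one another.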
Why Flow?
Flow allows:
- A pipeline-oriented architecture
- Input Processing strategies (IP Strategy)
- Explicit event management
- A clear separation between orchestration and business logic
The system is not coupled with Whisper.
Whisper is an implementation. Flow is the structure.
What the MVP validates
- Proper management of a system process
- Explicit modeling of states
- Event orchestration
- Pipeline scalability
This is not a product.
It's an architectural foundation.
Possible Developments
The next natural iterations:
- Streaming by audio chunk
- Parallel transcription
- LLM post-processing
- Desktop integration
- Mobile support
- Multi-model batching
But these changes do not alter the core:
A clear architecture. A controlled orchestration. An expandable pipeline.
Source code
The open source repository is available here:
https://github.com/darkwood-com/flowvox
Contributions, suggestions and feedback are welcome.
Conclusion
Building a voice engine in PHP is simple.
Building a clean architecture around a voice engine is more interesting.
Flowvox validates a principle:
Transcription is only one component. The orchestration is the true structure.