How FactoryOS Listens and Speaks
Voice is a first-class layer in FactoryOS, not a feature attached to one part of the product. Speech-to-text and text-to-speech both run on the system itself, available from the personal assistant chat, from knowledge chat, and from any workflow that wants to use them. Voice has its own settings tab and its own engines underneath.
Voice Is Not a Bolt-On
Most AI products treat voice as a wrapper over a cloud transcription service. FactoryOS treats voice as an in-house capability. Whatever you say to the system never leaves it, and whatever the system says back was generated on it.
That distinction matters more in some industries than others. For a law firm dictating client notes, a clinic recording patient encounters, or a finance team talking through M&A drafts, voice is the moment most likely to leak the most sensitive material. Keeping the whole path local closes that door.
Speech In and Speech Out
Speech-to-text runs on Whisper, the widely used open-source transcription model. Several local builds are available, and the system picks one based on what's installed and how fast transcription needs to be. Text-to-speech runs on two neural engines: Supertonic, which has its own catalog of voices, and Piper, which uses downloadable voice model files. Lightweight fallbacks such as espeak are available for low-resource setups.
The voice layer isn't tied to any one of these. It detects what's installed, picks the best available, and lists the rest as options. New STT and TTS engines can be added as better local models become available, and the system swaps to them once configured. The piece on top — how you talk to the assistant and how it talks back — stays the same regardless of which engine is doing the work.
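The detect, pick, and list behavior described above can be sketched in a few lines. This is only an illustration under stated assumptions, not FactoryOS's actual implementation: the `Engine` record, the `priority` field, the `select_engine` function, and the build names in the example catalog are all hypothetical, and the install probe simply checks whether a binary is on the `PATH`.

```python
import shutil
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Engine:
    """Hypothetical record describing one STT or TTS engine."""
    name: str
    kind: str                        # "stt" or "tts"
    priority: int                    # lower number = preferred
    is_installed: Callable[[], bool]


def on_path(binary: str) -> Callable[[], bool]:
    """Build a probe that reports whether a binary is on the PATH."""
    return lambda: shutil.which(binary) is not None


def select_engine(
    engines: List[Engine], kind: str
) -> Tuple[Optional[Engine], List[Engine]]:
    """Return (best installed engine, remaining installed engines) for a kind."""
    available = sorted(
        (e for e in engines if e.kind == kind and e.is_installed()),
        key=lambda e: e.priority,
    )
    if not available:
        return None, []
    return available[0], available[1:]


# Example catalog. The names and priorities are assumptions for illustration.
CATALOG = [
    Engine("piper", "tts", 1, on_path("piper")),
    Engine("espeak", "tts", 9, on_path("espeak")),
]

best_tts, other_tts = select_engine(CATALOG, "tts")
```

Because the caller only ever sees the result of `select_engine`, adding a new engine means adding one catalog entry; the code that talks to the assistant does not change.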
Two Ways It Listens
When voice is on, it runs in one of two modes:
- Push-to-talk. The microphone opens only while you hold a key, the way a radio works. Best for shared offices and meeting rooms.
- Open conversation. The microphone stays listening throughout, so you can speak freely without pressing anything. Best for private offices or working alone.
You pick the mode from the voice settings, and you can change it any time. Different surfaces can run different modes — your personal assistant might be open conversation at your desk while a focused chat in a shared space stays push-to-talk.
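One way to picture per-surface modes is as a small lookup plus a gate that decides whether the microphone is open at a given moment. The sketch below is an assumption-laden illustration: the `ListenMode` enum, the surface names, and the `mic_is_open` function are hypothetical and do not come from FactoryOS.

```python
from enum import Enum
from typing import Dict


class ListenMode(Enum):
    PUSH_TO_TALK = "push_to_talk"
    OPEN_CONVERSATION = "open_conversation"


# Hypothetical surface names, each with its own mode.
SURFACE_MODES: Dict[str, ListenMode] = {
    "personal_assistant": ListenMode.OPEN_CONVERSATION,
    "knowledge_chat": ListenMode.PUSH_TO_TALK,
}


def mic_is_open(
    surface: str,
    voice_enabled: bool,
    key_held: bool,
    modes: Dict[str, ListenMode] = SURFACE_MODES,
) -> bool:
    """Decide whether the microphone should be capturing right now."""
    if not voice_enabled:
        return False          # voice is off: never capture
    mode = modes.get(surface)
    if mode is None:
        return False          # surface has no voice mode configured
    if mode is ListenMode.PUSH_TO_TALK:
        return key_held       # open only while the key is down
    return True               # open conversation: stays open
```

Under these assumptions, `mic_is_open("knowledge_chat", True, False)` is `False` until the key is held, while the personal assistant stays open as long as voice is enabled.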
Open conversation is sometimes called "always listening," a phrase that makes some people uneasy because it suggests a cloud service hoarding audio. Here the microphone is open locally and only locally: the audio never leaves the box, and the "Always Local, Always Optional" section below points to the full account of where every byte of it ends up.
Voices You Pick or Train
The text-to-speech side comes with multiple voices, and the set is expandable. Each TTS engine ships with its own catalog — Supertonic has styles like M1 and F1; Piper draws from a library of downloadable voice models — and new voices can be added to either engine. The system catalogs them all in the same picker.
If you want to go further, FactoryOS can train a voice on a sample of your own speech. Training needs about thirty minutes of recorded audio and a substantial amount of GPU time, but once it's done, the trained voice slots in like any other option. Some people enjoy the novelty of hearing their own voice come back from the assistant; others find it uncanny and stick to the stock voices. Either is fine.
Changing the voice doesn't require restarting anything. Pick a new one and the next utterance uses it.
Admins Set the Menu
Voice has its own permissions, separate from how individual users configure it. An admin can turn voice off entirely for everyone on the box, allow it for some roles and not others, or choose which STT and TTS engines are available system-wide. Those settings define the menu users get to see.
Inside that menu, individual users pick what they prefer — which engine, which voice, which mode — but only from the options the admin has enabled. A company comfortable with Whisper but wanting to standardize on Piper for output, for instance, can simply not turn Supertonic on; users still have voice, just with the engine the org chose for them.
That two-tier shape matches how the rest of FactoryOS works — defaults set on top, individual control below, both resolved through the same [permissions cascade](how-factoryos-decides-who-sees-what).
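The two-tier arrangement, with the admin defining the menu and the user choosing within it, can be sketched as a small resolution function. Everything named here (`AdminPolicy`, its fields, `resolve_tts_engine`, and the fallback to the first enabled engine) is a hypothetical illustration of the idea, not the actual permissions cascade.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class AdminPolicy:
    """Hypothetical admin-level voice settings for one box."""
    voice_enabled: bool = True
    allowed_roles: FrozenSet[str] = frozenset()
    tts_engines: Tuple[str, ...] = ()   # engines the admin has enabled


def resolve_tts_engine(
    policy: AdminPolicy, role: str, preferred: Optional[str] = None
) -> Optional[str]:
    """Return the TTS engine a user gets, or None if voice is unavailable."""
    if not policy.voice_enabled or role not in policy.allowed_roles:
        return None                      # admin has voice off for this user
    if preferred in policy.tts_engines:
        return preferred                 # user's choice is on the menu
    # Assumption: fall back to the first engine the admin enabled.
    return policy.tts_engines[0] if policy.tts_engines else None


# An org that standardizes on Piper simply leaves Supertonic off the menu.
policy = AdminPolicy(
    voice_enabled=True,
    allowed_roles=frozenset({"staff", "admin"}),
    tts_engines=("piper",),
)
```

With this policy, a staff user who asks for Supertonic still gets voice, but through Piper, and a role that is not on the allowed list gets no voice at all.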
Where Voice Shows Up
Voice is wired into multiple places. The personal assistant chat in the top bar is the most common surface — that's where the morning briefing might be read aloud, or where you might ask a question hands-free. The knowledge chat that's scoped to a project or channel has its own voice toggle. Workflows can include speech-to-text and text-to-speech as steps, letting an automated flow take audio in or deliver audio out without anyone writing code.
Each surface decides independently whether it accepts voice. Some default to text and let you flip voice on when it suits the moment; others might stay text-only depending on context.
Always Local, Always Optional
The full voice path runs on the box. Microphone audio is transcribed locally; generated speech is synthesized locally; nothing crosses a network boundary unless you've explicitly set up an integration that does. Open-conversation mode keeps the microphone open, but the audio it picks up has nowhere else to go — there is no upload, no remote transcription, no cloud index of what you said yesterday. The deeper privacy story — what touches disk, what's logged, what an auditor would see — is covered in [where your voice data actually goes](where-your-voice-data-actually-goes).
Voice is also opt-in. The capability sits dormant until you flip a toggle, and even when on, microphones aren't recording in the background — they're listening only inside the modes that explicitly require it. Someone who prefers typing can use FactoryOS for years and never trigger the voice layer, and someone who likes talking can use voice everywhere it's wired in. The choice belongs to the person at the keyboard — inside whatever menu the admin has set.