How Many Staff One AI Box Supports
The honest answer to "how many people can one FactoryOS box serve" is a range, not a number. A single on-premise box comfortably handles ten to twenty active staff before queue time becomes noticeable. The raw throughput math points higher, closer to a hundred-plus named users on the same hardware, and the gap between those two numbers is what this article is about.
The Memory Bandwidth Ceiling
The ceiling on a local AI box is set by memory bandwidth, not compute. LLM token decode streams the model's weights from memory on every step, and a DGX Spark moves 273 GB/s across a 128 GB unified pool¹. Divide that bandwidth by the model's per-token active footprint and you get the box's theoretical token-rate ceiling.
A dense 32-billion-parameter model at 4-bit quantization carries roughly 16 GB of active weight per decode step (32 billion parameters × 0.5 bytes each). 273 GB/s ÷ 16 GB ≈ 17 tokens per second as a single-stream ceiling. A Mixture-of-Experts model with around 3B active parameters at NVFP4 is much lighter per token and sits closer to a 20–80 tokens-per-second envelope on the same box. Both numbers are calculated from bandwidth and weight footprint, not quoted from a benchmark.
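The dense-model arithmetic above can be sketched in a few lines of Python. The function name and parameters are illustrative, not part of FactoryOS, and the model deliberately ignores KV-cache reads and runtime overhead, so it gives an upper bound rather than a prediction.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         active_params_billions: float,
                         bits_per_param: float) -> float:
    """Upper bound on single-stream decode rate.

    Assumes every decode step reads all active weights exactly once
    and nothing else (no KV cache, no overhead).
    """
    # billions of parameters x bytes per parameter = gigabytes per token
    active_gb = active_params_billions * bits_per_param / 8
    return bandwidth_gb_s / active_gb


# Dense 32B model at 4-bit on a 273 GB/s box, as in the text.
dense_32b = decode_ceiling_tok_s(273, 32, 4)
print(round(dense_32b, 1))  # 17.1
```

Halving the bits per parameter or the active parameter count doubles the ceiling, which is why quantization and sparse activation matter more here than raw compute.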
Batching Multiplies the Ceiling
A single weight read serves several chats at once when the inference server batches them. LMSYS measured a DGX Spark running SGLang at batch sizes from 1 to 32 across Qwen 3 32B, Llama 3.1 8B/70B, Gemma, and DeepSeek; Llama 3.1 8B moved from about 20 tokens-per-second solo to 360+ tokens-per-second aggregate when batched. EAGLE3 speculative decoding added another ~2× end-to-end on top of that².
So the same hardware that gives one user 20 tok/s can give a dozen users a steady eight to ten tok/s — comfortably faster than they read.
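As a rough check on those figures, the sketch below splits an aggregate token rate evenly across concurrent chats. It assumes the 360 tok/s aggregate corresponds to the largest measured batch (32) and that the batcher shares throughput equally; both are simplifications, and the helper name is hypothetical.

```python
def per_chat_tok_s(aggregate_tok_s: float, concurrent_chats: int) -> float:
    """Per-chat decode rate if aggregate throughput is shared evenly."""
    if concurrent_chats < 1:
        raise ValueError("need at least one chat")
    return aggregate_tok_s / concurrent_chats


solo_tok_s = 20.0      # Llama 3.1 8B, batch size 1 (from the text)
batched_tok_s = 360.0  # aggregate when batched (from the text)

print(batched_tok_s / solo_tok_s)         # 18.0  (aggregate speedup)
print(per_chat_tok_s(batched_tok_s, 32))  # 11.25 (per chat, assuming batch 32)
```

Each chat gets slower than the solo rate, but the box as a whole produces far more tokens per second from the same weight reads.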
Search Is Half the Work
A FactoryOS chat is not pure generation. Each grounded answer first runs retrieval across the brain — BM25, vector search, reranking, the temporal knowledge graph — and often a live web fetch on top of that. All of that work shares the same GPU and the same memory bus that the decoder needs.
A box that could theoretically batch thirty pure-generation chats serves fewer when every chat first runs a hybrid search, a rerank pass, and a long-context decode against the retrieved evidence. The bandwidth ceiling assumes the GPU spends all its cycles decoding; real chats spend a meaningful share on the retrieval pipeline, and that share grows with how well-grounded the answer needs to be.
Active Users vs. Named Users
Named-user headcount and active-decoding load are different numbers, and duty cycle separates them. A typical office worker fires a query, reads the answer for ten to twenty seconds, and goes back to their actual job; their duty cycle on the GPU is five to ten percent.
So forty named users at that duty cycle translate to two to four actively decoding chats at any given moment. The "hundred-plus named users on one box" headline falls out of the same arithmetic at light, bursty usage: a hundred users at five to ten percent is five to ten active chats. It's an arithmetic ceiling, not a planning target. The planning target is the active number.
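The duty-cycle conversion is simple enough to write down directly. This is a minimal sketch using the five-to-ten-percent range from the text; the function name is illustrative, and real usage is burstier than a flat average suggests.

```python
def active_chats(named_users: int, duty_cycle_pct: float) -> float:
    """Expected number of chats decoding at once, given an average duty cycle."""
    return named_users * duty_cycle_pct / 100


for users in (40, 100):
    low = active_chats(users, 5)
    high = active_chats(users, 10)
    print(f"{users} named users -> {low:g} to {high:g} active chats")
# 40 named users -> 2 to 4 active chats
# 100 named users -> 5 to 10 active chats
```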
The Queue Keeps Peaks Soft
The GPU is a serialized resource with a priority queue in front of it. When concurrent demand exceeds in-flight capacity, requests wait their turn — they don't fail, and the box doesn't tank.
Humans-in-chat outrank background work by design: indexing, batch summarization, persona renders, and the agent runtime all yield to live chat traffic, so a busy morning doesn't starve the people doing the asking. Peak load shows up as a few extra seconds of latency, never as a crashed appliance. The mechanism is the subject of [why heavy load means delays, not crashes](why-heavy-load-means-delays-not-crashes).
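The ordering behavior described above can be illustrated with a small priority queue. This is a toy sketch, not the FactoryOS scheduler: the class, priority constants, and job names are all hypothetical, and it models only ordering, not batching or preemption.

```python
import heapq
import itertools

# Hypothetical priority classes: lower value is served first.
CHAT, BACKGROUND = 0, 1


class GpuQueue:
    """Toy priority queue: chat jobs always run before background jobs."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order within a class

    def submit(self, priority: int, name: str) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), name))

    def next_job(self):
        """Return the highest-priority waiting job, or None when idle."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


q = GpuQueue()
q.submit(BACKGROUND, "nightly-index")
q.submit(CHAT, "chat-alice")
q.submit(BACKGROUND, "batch-summary")
q.submit(CHAT, "chat-bob")

order = [q.next_job() for _ in range(4)]
print(order)  # ['chat-alice', 'chat-bob', 'nightly-index', 'batch-summary']
```

Nothing is rejected when the queue grows; background jobs simply wait longer, which is the "delays, not crashes" behavior in miniature.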
What Moves the Range
A handful of factors slide the comfort number along the ten-to-twenty (or beyond) range:
- Model size and quantization. A dense 70B model on one box serves fewer concurrent users than a 32B MoE; FP8, NVFP4, and 4-bit quantization buy more headroom per token.
- Context length. Long retrieved contexts inflate the per-token KV cache and slow the batcher down.
- Retrieval depth. Heavy use of the [knowledge graph and codebase RAG](how-factoryos-retrieves-the-right-context) eats GPU cycles before generation begins.
- Background load. Scheduled flows on the Manifold, web-agent runs, and image generation share the box; their priority is lower than chat, but they exist.
- Usage shape. Twenty people each asking a one-shot question per minute is light; five people running ten-thousand-token research threads is heavy.
A box that lives at the high end on one of these factors lands toward the low end of the headcount range. A box that lives at the low end on all of them lands toward the high end (or beyond).
Sizing Before You Talk to Us
The useful estimate before a consult is active users at peak, not just org headcount. Count the staff most likely to use AI inside any given fifteen-minute window — usually a fraction of total headcount — and assume each one is decoding maybe ten percent of that time.
A team of forty knowledge workers typically lands at two to four simultaneous active chats. A team of a hundred lands at five to ten. A team of two hundred starts to brush the ceiling on a single Spark and is worth a conversation about either a larger GPU or a second box.
If the estimate sits comfortably under twenty active, one box is the starting hardware. Over twenty and the sizing conversation widens to a bigger GPU, horizontal scaling across multiple Sparks, or workload shaping — retrieval caching, smaller models for routine tasks, background-only nightly runs. Those are the questions the first call is for.
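For a back-of-envelope version of this estimate, the sketch below turns a peak-window user count into a rough recommendation. The ten percent duty cycle and the twenty-chat threshold come from the text; the function name, the return strings, and the choice to treat exactly twenty as "over" are assumptions made for illustration.

```python
def size_estimate(peak_window_users: int,
                  duty_cycle_pct: float = 10,
                  single_box_limit: float = 20) -> tuple[float, str]:
    """Estimate simultaneous active chats and a rough starting point.

    peak_window_users: staff likely to use AI within the same
    fifteen-minute window (not total org headcount).
    """
    active = peak_window_users * duty_cycle_pct / 100
    if active < single_box_limit:
        return active, "one box"
    return active, "sizing conversation"


print(size_estimate(40))   # (4.0, 'one box')
print(size_estimate(100))  # (10.0, 'one box')
print(size_estimate(200))  # (20.0, 'sizing conversation')
```

The output is only as good as the peak-window count going in; heavy retrieval or long-context usage would lower the effective limit well before twenty active chats.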