How Many Staff One AI Box Supports

The honest answer to "how many people can one FactoryOS box serve" is a range, not a number. A single on-premise box comfortably handles ten to twenty staff who are actively using it before queue time becomes noticeable. The raw throughput math points higher, closer to a hundred-plus named users on the same hardware, and the gap between those two numbers is what this article is about.

The Memory Bandwidth Ceiling

The ceiling on a local AI box is set by memory bandwidth, not compute. LLM token decode streams the model's weights from memory on every step, and a DGX Spark moves 273 GB/s across a 128 GB unified pool¹. Divide that bandwidth by the model's per-token active footprint and you get the box's theoretical token-rate ceiling.

A dense 32-billion-parameter model at 4-bit quantization carries roughly 16 GB of active weight per decode step. 273 ÷ 16 ≈ 17 tokens per second as a single-stream ceiling. A Mixture-of-Experts model with around 3B active parameters at NVFP4 — much lighter per token — sits closer to a 20–80 tokens-per-second envelope on the same box. Both numbers are calculated from bandwidth and weight footprint, not quoted from a benchmark.
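The dense-model arithmetic above can be written out as a short sketch. The function name and parameters are illustrative, and the result is an upper bound only: it assumes every decode step streams the full active weights once and ignores KV-cache reads and other overhead, so real throughput lands below it.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         active_params_billion: float,
                         bits_per_weight: float) -> float:
    """Theoretical single-stream decode ceiling: bandwidth / weight bytes read per token."""
    # Billions of parameters * bits per weight / 8 = gigabytes streamed per decode step.
    active_gb = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / active_gb


# Dense 32B model at 4-bit on a 273 GB/s memory bus (figures from the text above).
dense_32b = decode_ceiling_tok_s(273, 32, 4)
print(f"dense 32B @ 4-bit: {dense_32b:.1f} tok/s ceiling")
```

The same function makes it easy to see why quantization and active-parameter count matter: halving the bytes read per token doubles the ceiling.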

Batching Multiplies the Ceiling

A single weight read serves several chats at once when the inference server batches them. LMSYS measured a DGX Spark running SGLang at batch sizes from 1 to 32 across Qwen 3 32B, Llama 3.1 8B/70B, Gemma, and DeepSeek; Llama 3.1 8B moved from about 20 tokens per second solo to 360+ tokens per second in aggregate when batched. EAGLE3 speculative decoding added another ~2× end-to-end on top of that².

So the same hardware that gives one user 20 tok/s can give a dozen users a steady eight to ten tok/s — comfortably faster than they read.
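As a rough illustration of how aggregate throughput turns into a per-person rate, the sketch below splits the batched figure evenly across streams. It assumes the 360 tok/s aggregate corresponds to the largest measured batch size (32) and that every chat gets an equal share; neither assumption comes from the benchmark itself.

```python
def per_stream_tok_s(aggregate_tok_s: float, concurrent_chats: int) -> float:
    """Even split of aggregate decode throughput across concurrent chats."""
    if concurrent_chats < 1:
        raise ValueError("need at least one chat")
    return aggregate_tok_s / concurrent_chats


solo = per_stream_tok_s(20, 1)        # one user, unbatched
batched = per_stream_tok_s(360, 32)   # assumed: 360 tok/s aggregate at batch 32
print(f"solo: {solo:.2f} tok/s, batched share: {batched:.2f} tok/s per chat")
```

Each user sees a slower stream than the solo case, but the box as a whole does far more work per second.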

Search Is Half the Work

A FactoryOS chat is not pure generation. Each grounded answer first runs retrieval across the brain — BM25, vector search, reranking, the temporal knowledge graph — and often a live web fetch on top of that. All of that work shares the same GPU and the same memory bus that the decoder needs.

A box that could theoretically batch thirty pure-generation chats serves fewer when every chat first runs a hybrid search, a rerank pass, and a long-context decode against the retrieved evidence. The bandwidth ceiling assumes the GPU spends all its cycles decoding; real chats spend a meaningful share on the retrieval pipeline, and that share grows with how well-grounded the answer needs to be.

Active Users vs. Named Users

Named-user headcount and active-decoding load are different numbers, and duty cycle separates them. A typical office worker fires a query, reads the answer for ten to twenty seconds, and goes back to their actual job; their duty cycle on the GPU is five to ten percent.

So forty named users at that duty cycle translate to two to four actively decoding chats at any given moment. The "hundred-plus named users on one box" headline falls out of the same arithmetic at light, bursty usage: it is an arithmetic ceiling, not a planning target. The planning target is the active number.
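The conversion from named users to active chats is a single multiplication. A minimal sketch, using the five-to-ten-percent duty cycle from the text and an illustrative helper name:

```python
def active_chats(named_users: int, duty_cycle: float) -> float:
    """Expected number of chats decoding at the same moment."""
    return named_users * duty_cycle


LOW_DUTY, HIGH_DUTY = 0.05, 0.10  # duty-cycle range quoted in the text

estimates = {}
for users in (40, 100):
    estimates[users] = (active_chats(users, LOW_DUTY), active_chats(users, HIGH_DUTY))
    low, high = estimates[users]
    print(f"{users} named users: {low:.0f} to {high:.0f} active chats")
```

This is an average, not a peak: bursts above the expected value are what the queue absorbs.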

The Queue Keeps Peaks Soft

The GPU is a serialized resource with a priority queue in front of it. When concurrent demand exceeds in-flight capacity, requests wait their turn — they don't fail, and the box doesn't tank.

Humans-in-chat outrank background work by design: indexing, batch summarization, persona renders, and the agent runtime all yield to live chat traffic, so a busy morning doesn't starve the people doing the asking. Peak load shows up as a few extra seconds of latency, never as a crashed appliance. The mechanism is the subject of [why heavy load means delays, not crashes](why-heavy-load-means-delays-not-crashes).
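A priority queue of this kind can be sketched in a few lines. This is a generic illustration, not FactoryOS's actual scheduler: the class name, priority levels, and job labels are all hypothetical.

```python
import heapq
import itertools

CHAT, BACKGROUND = 0, 1  # hypothetical levels; the lower value is served first


class GpuQueue:
    """Requests wait their turn; live chat is always dequeued before background work."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order within a level

    def submit(self, priority, job):
        heapq.heappush(self._heap, (priority, next(self._order), job))

    def next_job(self):
        """Return the next job to run, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = GpuQueue()
queue.submit(BACKGROUND, "nightly indexing")
queue.submit(CHAT, "chat: user A")
queue.submit(BACKGROUND, "batch summarization")
queue.submit(CHAT, "chat: user B")

served = []
while (job := queue.next_job()) is not None:
    served.append(job)
print(served)
```

Both chat requests are served before either background job, even though a background job arrived first. Nothing is dropped; lower-priority work simply waits.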

What Moves the Range

A handful of factors slide the comfort number along the ten-to-twenty (or beyond) range:

- Model size and quantization. A dense 70B model on one box serves fewer concurrent users than a 32B MoE; FP8, NVFP4, and 4-bit quantization buy more headroom per token.
- Context length. Long retrieved contexts inflate the per-token KV cache and slow the batcher down.
- Retrieval depth. Heavy use of the [knowledge graph and codebase RAG](how-factoryos-retrieves-the-right-context) eats GPU cycles before generation begins.
- Background load. Scheduled flows on the Manifold, web-agent runs, and image generation share the box; their priority is lower than chat, but they exist.
- Usage shape. Twenty people each asking a one-shot question per minute is light; five people running ten-thousand-token research threads is heavy.

A box that lives at the high end on one of these factors lands toward the low end of the headcount range. A box that lives at the low end on all of them lands toward the high end (or beyond).

Sizing Before You Talk to Us

The useful estimate before a consult is active users at peak, not just org headcount. Count the staff most likely to use AI inside any given fifteen-minute window — usually a fraction of total headcount — and assume each one is decoding maybe ten percent of that time.

A team of forty knowledge workers typically lands at two to four simultaneous active chats. A team of a hundred lands at five to ten. A team of two hundred starts to brush the ceiling on a single Spark and is worth a conversation about either a larger GPU or a second box.

If the estimate sits comfortably under twenty active, one box is the starting hardware. Over twenty and the sizing conversation widens to a bigger GPU, horizontal scaling across multiple Sparks, or workload shaping — retrieval caching, smaller models for routine tasks, background-only nightly runs. Those are the questions the first call is for.
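The pre-consult estimate can be roughed out as follows. This is a sketch under stated assumptions: the function names are illustrative, the ten percent decode share and the twenty-active threshold come from the text, and treating exactly twenty as "over" is a judgment call.

```python
def peak_active_chats(staff_in_window: int, decode_share: float = 0.10) -> float:
    """Staff likely to use AI in a fifteen-minute window, times their decode share."""
    return staff_in_window * decode_share


def starting_hardware(active: float, single_box_limit: int = 20) -> str:
    """Map an active-chat estimate onto the two outcomes described in the text."""
    if active < single_box_limit:
        return "one box"
    return "wider sizing conversation"


small_team = peak_active_chats(60)    # hypothetical: 60 staff in the peak window
large_team = peak_active_chats(250)   # hypothetical: 250 staff in the peak window
print(f"{small_team:.0f} active -> {starting_hardware(small_team)}")
print(f"{large_team:.0f} active -> {starting_hardware(large_team)}")
```

The input that matters is the count of staff in the peak window, not org headcount, so the estimate is only as good as that count.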
