I built a model serving API from scratch. Not because the world needs another inference server, but because I wanted to understand what happens between β€œsend prompt” and β€œreceive tokens.” The things ML system design interviews ask about: batching, backpressure, streaming, graceful degradation. I wanted hands-on experience so I could talk about them from building, not reading.

The result: a FastAPI server wrapping Ollama with a bounded request queue, SSE streaming, naive batching, 11 custom Prometheus metrics, and structured logging. It runs on a $7/month ARM server. I ran 8 structured experiments against it. The data revealed things I didn’t expect.

Source: github.com/brianhliou/model-serving-api

The system

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Hetzner CAX21 (ARM64, 4 vCPU, 8GB RAM, $7/month)      β”‚
β”‚                                                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚  Caddy   β”‚β†’ β”‚ FastAPI  β”‚β†’ β”‚ Ollama β”‚  β”‚  Grafana  β”‚ β”‚
β”‚  β”‚ (TLS,    β”‚  β”‚ (queue,  β”‚  β”‚(llama  β”‚  β”‚  Alloy    β”‚ β”‚
β”‚  β”‚  proxy)  β”‚  β”‚  batch,  β”‚  β”‚ 3.2)   β”‚  β”‚(telemetry)β”‚ β”‚
β”‚  β”‚          β”‚  β”‚  metrics)β”‚  β”‚        β”‚  β”‚           β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚       :443          :8000       :11434                   β”‚
β”‚                                                         β”‚
β”‚  Docker Compose, bridge network, internal DNS            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Four containers on a single machine. Caddy terminates TLS with automatic Let’s Encrypt certificates (three lines of config). FastAPI handles the serving logic: bounded request queue, batch dispatcher, OpenAI-compatible API, Prometheus metrics. Ollama wraps llama.cpp and runs the model. Grafana Alloy scrapes metrics every 15 seconds and ships them to Grafana Cloud.

The containers communicate over a Docker bridge network using service names as hostnames. Caddy resolves api, FastAPI resolves ollama, Alloy resolves api. Sub-millisecond latency between containers because the traffic never leaves the host. Only Caddy exposes ports to the internet (80, 443). The FastAPI port binds to 127.0.0.1 only.

The core idea: a bounded request queue sits between clients and the model. When the queue is full, clients get an instant 503 with Retry-After instead of waiting indefinitely. This is backpressure: the serving layer’s most important job.

The API is OpenAI-compatible (/v1/chat/completions), supports streaming via SSE, and exposes 11 custom Prometheus metrics (TTFT, tokens/sec, queue depth, error rates by type, backend latency).
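For context, the SSE wire format is simple: each chunk is a `data:` line terminated by a blank line, with a `[DONE]` sentinel at the end. A minimal sketch of the framing (the JSON shape follows the OpenAI streaming convention; it's illustrative, not my exact payload):

```python
import json

def sse_format(chunks):
    """Frame token chunks as Server-Sent Events: one 'data:' line per
    chunk, blank-line terminated, with a [DONE] sentinel at the end."""
    for text in chunks:
        payload = {"choices": [{"delta": {"content": text}}]}
        yield f"data: {json.dumps(payload)}\n\n"
    yield "data: [DONE]\n\n"
```

A FastAPI endpoint can hand a generator like this to a streaming response and the client sees tokens as they arrive rather than one blob at the end.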

What the model actually does

The model running on this server is Llama 3.2 3B Instruct: 3.21 billion parameters, 28 transformer layers, and a 128K-token context window. It wasn't trained from scratch. Meta pruned the Llama 3.1 8B architecture down to 3B parameters, then used knowledge distillation to train the smaller model to match the output distributions of the larger 8B and 70B models. This is why a 3B model performs competitively with many 7B models.

How a single token is generated

When you send β€œWhat is 2+2?” to the API, the model processes it through 28 identical transformer layers. Each layer does two things:

  1. Attention: The model decides which parts of the input matter for predicting the next word. For each position, it computes query, key, and value vectors, calculates attention scores between all positions, and produces a weighted sum. Llama 3.2 uses Grouped Query Attention (GQA): 24 query heads share 8 key-value heads (a 3:1 ratio). This cuts the memory needed for cached attention data by 3x.

  2. Feed-forward network: Each token’s representation passes through a gated network (SwiGLU) with three weight matrices: gate, up, and down projections. The gate controls information flow through element-wise multiplication. Each FFN layer has ~75 million parameters.

After all 28 layers, the model produces a probability distribution over 128,256 possible tokens. Temperature scaling adjusts how β€œrandom” the selection is (lower = more deterministic), and top-p sampling filters the candidate set. One token is drawn from this distribution.

This entire process repeats for every single output token, one at a time.
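The sampling step at the end of each forward pass can be sketched in a few lines of plain Python (illustrative only; real implementations vectorize this over the full 128,256-token vocabulary):

```python
import math
import random

def sample_next_token(logits, temperature=0.8, top_p=0.9, rng=None):
    """Turn raw logits into one sampled token id: temperature scaling,
    softmax, nucleus (top-p) filtering, then a draw. Pure-Python sketch."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch deterministic
    # Temperature scaling: lower temperature sharpens the distribution.
    scaled = [l / temperature for l in logits]
    # Numerically stable softmax.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-p: keep the smallest set of tokens whose cumulative probability
    # reaches top_p, scanning from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalize over the kept set and draw one token.
    z = sum(probs[i] for i in kept)
    r = rng.random() * z
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

With a very low temperature the distribution collapses onto the top token; with temperature 1.0 and top_p 1.0 it is an unmodified draw from the softmax.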

Two-phase inference

Token generation has two distinct phases with very different performance characteristics:

Prefill (prompt processing): All input tokens are processed in parallel through the transformer. This is compute-bound: lots of matrix multiplications that can be parallelized across CPU cores. Speed: 50-150+ tokens/second.

Decode (generation): Each output token is generated sequentially. The model must read its entire 2 GB of weights from memory to produce one token. This is memory-bandwidth-bound: the CPU can compute faster than it can load data. Speed: 7-8 tokens/second.

The decode bottleneck explains a key number in my experiments. To generate one token, the CPU reads ~2 GB of model weights from RAM. With the server’s DDR4 memory bandwidth, the theoretical ceiling is roughly 15-30 tokens/second. After overhead from the KV cache, dequantization, and non-sequential memory access, the practical rate is ~7.5 tok/s. This rate is nearly identical whether the system is idle or under heavy load.
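The back-of-envelope math behind that ceiling (the bandwidth figure is an assumption for illustration, not a measured number):

```python
# Decode ceiling = usable memory bandwidth / bytes read per token.
weights_gb = 2.0        # Q4_K_M Llama 3.2 3B weights read per token
bandwidth_gbs = 30.0    # assumed effective DDR4 bandwidth on the CAX21
ceiling_tok_s = bandwidth_gbs / weights_gb
print(f"theoretical ceiling: ~{ceiling_tok_s:.0f} tok/s")  # 15 tok/s
```

Halve the assumed bandwidth and the ceiling halves too, which is why quantizing weights smaller translates almost directly into decode speed on CPU.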

How a 3B model fits in 8GB RAM

The raw model weights in 16-bit precision would be 6.4 GB. That doesn’t fit. Ollama uses Q4_K_M quantization: weights are compressed from 16 bits to ~4.5 bits per parameter by clustering weight values into discrete bins using k-means. Sensitive layers (attention output, FFN down projection) get 5-6 bits; less sensitive layers get 4 bits.

The memory budget on this server:

Component RAM
Model weights (Q4_K_M) ~2.0 GB
KV cache (inference state) ~0.5-1.0 GB
FastAPI + Python runtime ~100 MB
Caddy + Alloy + Docker ~200 MB
OS + kernel ~300 MB
Total active ~3.1-3.6 GB
Page cache (remaining) ~4.4-4.9 GB

Comfortable margin. The KV cache stores attention keys and values from all previous tokens so the model doesn’t recompute them. Each token in the cache costs 112 KB across all 28 layers and 8 KV heads. At 2K context, that’s ~224 MB. At 8K, ~900 MB. At the model’s full 128K context, the KV cache alone would need ~14 GB, which is why CPU inference practically limits context length.
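The per-token figure checks out from the architecture. A head dimension of 128 and fp16 cache entries are assumptions, chosen to be consistent with the numbers above:

```python
# Per-token KV cache cost for Llama 3.2 3B: keys AND values for every
# layer and every KV head.
layers, kv_heads, head_dim, bytes_per_elem = 28, 8, 128, 2
per_token = layers * kv_heads * head_dim * 2 * bytes_per_elem
print(f"per token:    {per_token / 1024:.0f} KB")            # 112 KB
print(f"2K context:   {2048 * per_token / 2**20:.0f} MB")    # 224 MB
print(f"128K context: {131072 * per_token / 2**30:.1f} GB")  # 14.0 GB
```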

Backpressure works, but the math is brutal

I sent 100 simultaneous requests against a queue of 50:

Outcome Count
Rejected instantly (503) 50
Accepted, then timed out (504) 50
Successful (200) 0

Zero successful completions. Not one.

Ollama processes requests sequentially. Each request takes 2-3 seconds. 50 queued requests need 100-150 seconds to drain. The request timeout is 60 seconds. So by the time the server gets to request #26, the deadline has already passed.

The queue protects the system from crashing. Rejected clients get an instant response and can retry. But the queue doesn’t make the system faster. A queue of 50 with a sequential backend and 60s timeout means accepting work you can’t finish.

The correct formula: max_queue_size = (timeout / avg_request_duration) * backend_concurrency. For this system: 60s / 2.5s * 1 = 24. My queue of 50 is too large.
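As a sanity check (the helper name is mine, not from the repo):

```python
def max_queue_size(timeout_s: float, avg_request_s: float,
                   backend_concurrency: int) -> int:
    """Largest queue where the last admitted request can still finish
    before its deadline."""
    return int(timeout_s / avg_request_s * backend_concurrency)

print(max_queue_size(60, 2.5, 1))  # 24 for this system
```

The formula also shows the two levers: a backend with concurrency 4 at the same request duration could justify a queue of ~96.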

The β€œqueue” isn’t really a queue

Looking deeper at the implementation, QueueManager is not a FIFO queue. It’s a counter. There’s no asyncio.Queue, no waiting, no ordering. When acquire() is called, it checks if active >= max_size. If yes, it immediately raises QueueFullError. If no, it increments the counter. That’s it. No mutex needed because asyncio is single-threaded.

This is actually a load shedder, not a queue. Requests are either admitted instantly or rejected instantly. The name β€œqueue” is misleading. In the backpressure flood experiment, asyncio task scheduling, not arrival order, determined which requests got admitted. Request #0 (the first to arrive) was rejected while request #1 got in.
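The whole mechanism fits in a dozen lines. A sketch of the pattern (names are illustrative, not the repo's actual code):

```python
class LoadShedder:
    """Counter-based admission control: admit or reject instantly,
    no waiting, no ordering."""
    def __init__(self, max_size: int):
        self.max_size = max_size
        self.active = 0   # no lock needed: asyncio runs on one thread

    def acquire(self) -> None:
        if self.active >= self.max_size:
            raise RuntimeError("queue full")  # maps to an instant 503
        self.active += 1

    def release(self) -> None:
        self.active -= 1
```

A real FIFO queue would use `asyncio.Queue` and make admitted requests wait their turn; this version only bounds how many are in flight at once.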

503 rejection isn’t fast enough

The 50 rejected requests averaged 0.87 seconds to get their 503 response. That’s nearly a full second to say β€œno.” For a fast-fail mechanism, that’s too slow.

The latency comes from the network stack: TLS handshake to the server, HTTP request parsing, response propagation back through Caddy. Under extreme load (100 simultaneous requests), the server’s event loop is contended. At concurrency 60 in another experiment, 503 rejections took only 0.73 seconds. The 140ms difference reflects the server being less overloaded.

Latency doesn’t just increase. It cliffs.

I swept concurrency from 1 to 60:

Concurrency Avg Latency Success Rate
1 7.4s 100%
2 4.3s 100%
5 10.0s 100%
10 18.7s 100%
20 42.9s 100%
30 20.9s 27%
50 43.7s 16%
60 39.2s 20%

The jump from 20 to 30 is the interesting part. Latency drops from 42.9s to 20.9s, but success rate craters from 100% to 27%.

At concurrency 20, all requests fit in the queue and all eventually complete, with the last ones barely making the 60s timeout. At 30, the 73% of requests that time out are removed from the average, leaving only the fast early ones that Ollama processed first. The average looks better, but the system is failing.

Averages lie at the boundary. When requests start timing out, the surviving β€œsuccessful” requests look artificially fast because they were the lucky ones processed first. You need success rate alongside latency, not one or the other.

Batch tiers are visible in the data

At concurrency 10, latencies form a clear bimodal distribution: 4 requests complete at ~10.6s, 6 at ~24.1s. These are two batch rounds. The batch dispatcher collects requests for up to 100ms or 8 requests, then fires them all concurrently via asyncio.gather. But Ollama processes them sequentially, so the first batch finishes, then the second batch starts.

At concurrency 15: trimodal (6 at ~14.6s, 8 at ~32.7s, 1 at ~35.0s). Three batch rounds. At concurrency 20: four tiers. The batch_size=8 configuration creates predictable staircase patterns in the latency distribution.
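For reference, the collect-then-dispatch pattern behind those tiers looks roughly like this (a simplified sketch, not the repo's dispatcher; shutdown handling omitted):

```python
import asyncio

async def batch_dispatcher(queue: asyncio.Queue, handle_batch,
                           max_batch: int = 8, window_s: float = 0.1):
    """Collect up to max_batch requests or wait window_s, whichever
    comes first, then dispatch the whole batch at once."""
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]        # block until at least one request
        deadline = loop.time() + window_s
        while len(batch) < max_batch:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        await handle_batch(batch)          # e.g. asyncio.gather over the batch
```

With a sequential backend, `handle_batch` fires all requests concurrently but they still drain one at a time, which is exactly what produces the staircase.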

Concurrency 2 is faster than concurrency 1

This was unexpected. The single request at concurrency 1 took 7.4s. At concurrency 2, the mean was 4.3s, with the faster request completing in 3.1s.

The explanation: concurrency 1 included a cold-start penalty (model loading, KV cache warmup). At concurrency 2, both requests arrive together, get batched, and share the warmup cost. Compare to later experiments where warm-model sequential requests took 2-3s. The 7.4s single request was paying a one-time tax.

Streaming is faster than non-streaming under load

I expected streaming to add overhead from more HTTP chunks and I/O. Under no contention, that’s true: streaming (3.03s) is slightly slower than non-streaming (2.85s). The SSE framing and chunk processing add about 6% overhead.

At concurrency 5, the picture reverses:

Mode Avg Latency
Non-streaming 12.29s
Streaming 8.35s

Streaming is 32% faster. The reason is in my implementation: non-streaming requests go through a batch dispatcher that collects requests for 100ms before dispatching as a group. Streaming requests bypass the batcher entirely, going directly to backend.stream().

This was an honest finding about my own code. The batch dispatcher adds more latency than it saves because Ollama processes requests sequentially regardless. Batching only helps when the backend can exploit parallelism (like a GPU with continuous batching). With a sequential backend, it’s pure overhead.

The 100ms batch window is the problem. A lone request waits up to 100ms for more requests that never arrive. At high concurrency, the window fills quickly, but the backend can't parallelize the batch anyway.

Time to first token degrades 10x under contention

The most dramatic finding. I measured TTFT (time to first token) for streaming requests:

Condition Mean TTFT Min Max
No contention 0.87s 0.53s 0.93s
Concurrency 5 9.02s 1.08s 10.83s

A 10x degradation from just 5 concurrent users.

TTFT measures how long until the client sees the first token. This maps directly to the two-phase inference described above. The 0.87s baseline TTFT is the prefill time: the model processes the prompt tokens through all 28 layers before it can start generating output. Under contention, requests queue behind each other at Ollama.

The concurrent TTFTs show a clear staircase pattern: 0.86s, 3.47s, 6.01s, 8.60s, 11.01s. Each step is approximately 2.5s apart, the time for Ollama to finish one request’s prefill and generation before starting the next. TTFT under sequential processing is essentially queue_position * avg_request_duration.
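That staircase reduces to a one-line model (constants taken from the measurements above):

```python
def expected_ttft(queue_position: int, avg_request_s: float = 2.5,
                  prefill_s: float = 0.87) -> float:
    """Sequential-backend TTFT: wait out everyone ahead of you,
    then pay your own prefill."""
    return queue_position * avg_request_s + prefill_s

for pos in range(5):
    print(f"position {pos}: {expected_ttft(pos):.2f}s")
```

Positions 0 through 4 predict 0.87s, 3.37s, 5.87s, 8.37s, 10.87s, close to the measured 0.86s, 3.47s, 6.01s, 8.60s, 11.01s staircase.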

The sequential TTFT distribution (20 samples) is roughly Gaussian, centered on 0.886s with a standard deviation of just 15ms. Extremely consistent. The one outlier was the first request, at 0.53s, faster because the model was already warm from a prior experiment.

TTFT is the metric that matters most for user experience. A user staring at a blank screen for 9 seconds will close the tab. This is why production systems use continuous batching: it allows the model to interleave generation across requests, keeping TTFT low even under load.

Token generation rate is rock-solid

Five sequential streaming requests, 100 tokens each:

Run Tokens/sec Mean Inter-Token Interval
1 8.0 125ms
2 7.6 131ms
3 7.7 129ms
4 7.9 126ms
5 8.0 126ms

No degradation as output gets longer. Once Ollama starts generating, it produces tokens at a steady ~7.8 tok/s on ARM64.

Why 7.5 tok/s?

The Hetzner CAX21 uses Ampere Altra processors (ARM Neoverse N1 cores) with DDR4 memory. Token generation is memory-bandwidth-bound: each token requires reading the entire model weights (~2 GB for Q4_K_M) from RAM. The arithmetic intensity is only ~3.2 FLOPs per byte of memory accessed, which puts decode squarely in the memory-bound regime of the roofline model.

llama.cpp (which Ollama wraps) uses ARM NEON SIMD instructions for the core computation: 128-bit wide vector operations that process 4 floats or 16 int8 values simultaneously. Hand-written kernels for each quantization format handle dequantization and multiply-accumulate in fused operations.

Inter-token timing isn’t perfectly constant

Looking at the raw chunk timestamps across 100 tokens, the inter-token interval ranges from 109ms to 163ms with a coefficient of variation of 11.2%. There are periodic spikes every 5-7 tokens where the interval jumps by 20-30ms, possibly from KV cache extension operations. One request showed a 206ms gap followed by a compensating 54ms interval, which looks like a garbage collection pause or memory operation.

Sustained throughput is stable

A 2-minute sustained load test at concurrency 5: 56 requests, 990 tokens, 7.6 tok/s, stable the entire time. No memory leaks, no thermal throttling. The per-window latency (10s buckets) varied by only 0.55s standard deviation across the full run. The aggregate token rate was 96.5% of the isolated single-stream rate.

The bottleneck isn’t generation speed. It’s sequential processing. The model generates tokens fast enough; it just can’t serve multiple users at once.

Prompt length matters more than expected

Prompt Avg Latency
Short (5 tokens) 2.0s
Long (~50 tokens) 4.9s
5-turn conversation 5.9s
10-turn conversation 7.8s

A 10-turn conversation takes nearly 4x longer than a short prompt, even with the same max_tokens=30 output limit. The extra time is almost entirely prompt processing (the prefill phase). The model needs to process all input tokens through 28 layers of attention before generating the first output.

The KV cache explains everything

During prefill, the model computes attention keys and values for every input token and stores them in the KV cache. For subsequent output tokens during decode, it only computes attention for the new token against the cached keys and values. This is why prefill is compute-bound (matrix-matrix multiplication across all input tokens) while decode is memory-bandwidth-bound (matrix-vector for one token, but must read all cached KV entries).

Prefill attention complexity is O(n^2) where n is the prompt length. A 10-turn conversation with ~200 tokens of context requires 4x the prefill computation of a 5-turn conversation with ~100 tokens. Once the prefill is done, decode speed is nearly identical regardless of prompt length.
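The quadratic term is easy to state (an illustrative helper that considers only the attention term, ignoring the linear FFN cost):

```python
def prefill_cost_ratio(tokens_a: int, tokens_b: int) -> float:
    """Relative cost of the quadratic attention term in prefill
    for two prompt lengths."""
    return (tokens_b / tokens_a) ** 2

print(prefill_cost_ratio(100, 200))  # 4.0: double the context, 4x the attention
```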

For chat applications, this means every request gets slower as conversations grow. Production systems deal with this through KV cache reuse: storing the cached attention state between turns so only the new user message needs prefill processing. Ollama doesn’t expose this across requests, so every request pays the full prefill cost from scratch.

What the data hid

Beyond the headline findings, the raw experiment data revealed patterns I didn’t expect:

Cold-start tax is 1.5-4.3x. The first request to each experiment was consistently slower. For short prompts: 2.88s first vs 1.6s warm (1.8x). For 10-turn prompts: 15.3s first vs 3.5s warm (4.3x). The penalty scales with prompt complexity because the initial request pays both model loading overhead and the full prefill cost without any cached state.

Zero completions in the 100-request flood. Despite 50 queue slots, not a single request completed. The queue accepted 50 requests, but the serial backend couldn’t process any of them within the 60s timeout. The queue protects the system from crashing, but it accepted work that was mathematically impossible to finish.

Only 2 out of 50 succeeded in the degradation test. Request #0 (8.98s) and request #40 (31.88s). The 22.9s gap between them aligns almost exactly with 2 batch processing rounds. The remaining 48 requests all timed out at ~60.7s.

Token generation rate is identical across all modes. Solo streaming: 7.8 tok/s. Concurrent non-streaming: 7.6 tok/s aggregate. Solo sequential: ~7.7 tok/s. The Ollama backend generates tokens at a fixed rate regardless of how many requests are queued. All latency differences come from queuing and batching, not token generation.

What I’d do differently

Queue size: Set it to 20-25, not 50. With a sequential backend and 60s timeout, a queue of 50 means accepting requests you’ll never finish. The formula: (timeout / request_duration) * concurrency = (60 / 2.5) * 1 = 24.

Batching: Skip it entirely for a sequential backend. The 100ms collection window adds latency with no benefit. Only enable it when the backend supports parallel processing.

TTFT alerting: Set a Grafana alert on p95 TTFT > 5s. That metric tells you users are having a bad experience earlier than total latency does.

503 latency: Investigate why rejection takes 870ms. For a load shedder, the rejection path should be sub-10ms. The current latency is dominated by network overhead, but with connection pooling and HTTP keep-alive, it could be much faster.

The backend: The most impactful improvement would be swapping Ollama for llama-cpp-python with continuous batching. That allows multiple requests to share the model simultaneously, keeping TTFT low under load. The InferenceBackend Protocol abstraction makes this a clean swap: implement generate(), stream(), and health(), and the serving logic stays unchanged.
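A Protocol with those three methods might look like this (the signatures are my guess from the prose, not the repo's actual interface):

```python
from typing import AsyncIterator, Protocol, runtime_checkable

@runtime_checkable
class InferenceBackend(Protocol):
    """Structural interface for a swappable backend: any class with
    these three methods satisfies it, no inheritance required."""
    async def generate(self, prompt: str, max_tokens: int) -> str: ...
    def stream(self, prompt: str, max_tokens: int) -> AsyncIterator[str]: ...
    async def health(self) -> bool: ...
```

Because the check is structural, an Ollama-backed class and a llama-cpp-python-backed class both satisfy the Protocol as long as they expose these methods, so the queue, metrics, and routing code never need to change.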

Key takeaways

  • Backpressure protects the system, but queue size must match (timeout / request_duration) * concurrency

  • Latency averages lie at the boundary: when requests start timing out, the survivors look artificially fast

  • Batching is not universally good: with a sequential backend, it’s pure overhead that adds 100ms to every request

  • TTFT is the metric that matters most for UX, and it degrades linearly with queue position

  • Token generation on ARM64 is memory-bandwidth-bound at ~7.5 tok/s, consistent across all load conditions

  • Prompt length affects latency as much as output length: prefill is O(n^2) and grows with every conversation turn

  • A 3B model fits comfortably on an 8GB server via 4-bit quantization (6.4 GB compressed to 2 GB)

  • The most impactful improvement isn’t in the serving layer: it’s swapping a sequential backend for one with continuous batching
