Why Your Binary Protocol Should Care About CPU Cache Lines

If you've ever designed a custom binary protocol for a hot path (a game server, a market-data feed, an internal RPC), you've probably obsessed over byte layout, alignment, and zero-copy parsing. There's one detail most tutorials skip that can quietly cost you 2-5x throughput: cache line alignment.

The 64-byte secret

Modern CPUs don't read memory one byte at a time. They read in chunks called cache lines, typically 64 bytes on x86_64 and most ARM cores. Every load that misses L1 pulls in a full cache line. Every store that has to be visible to other cores invalidates that cache line in those cores' caches.

If your protocol's "hot fields" (the bits the receiver reads first and most often) straddle the boundary between two cache lines, every read of them touches two lines instead of one, and you've just doubled your memory traffic for nothing.

A worked example

Picture a naive market-data tick struct: a uint8_t type tag, a uint64_t timestamp, a uint32_t symbol id, an 8-byte price, an 8-byte sequence number, and an 8-bit flags field.…
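To make the layout concrete, here is a minimal C sketch of the struct described above. The field names, the choice of `int64_t` for the price, the 64-byte `CACHE_LINE` constant, and the helpers `straddles_line` and `packed_timestamp_offset` are all assumptions made for illustration. The offsets in the comments assume a typical 64-bit ABI where `uint64_t` is 8-byte aligned, and the 30-byte wire size assumes the fields are sent back to back with no padding.

```c
#include <assert.h>  /* static_assert (C11) */
#include <stddef.h>  /* offsetof, size_t */
#include <stdint.h>

#define CACHE_LINE 64u  /* assumed line size; check your target CPU */

/* Hypothetical field names; the order follows the description above. */
struct tick_naive {
    uint8_t  type;       /* offset 0, followed by 7 bytes of padding  */
    uint64_t timestamp;  /* offset 8                                   */
    uint32_t symbol_id;  /* offset 16, followed by 4 bytes of padding */
    int64_t  price;      /* offset 24 (assumed fixed-point integer)   */
    uint64_t sequence;   /* offset 32                                  */
    uint8_t  flags;      /* offset 40, followed by 7 bytes of padding */
};

/* 30 bytes of payload become 48 bytes in memory on typical 64-bit ABIs. */
static_assert(offsetof(struct tick_naive, timestamp) == 8, "unexpected padding");
static_assert(sizeof(struct tick_naive) == 48, "unexpected struct size");

/* Size of one tick and offset of its timestamp if the same fields are
 * written to the wire with no padding: 1 + 8 + 4 + 8 + 8 + 1 bytes. */
enum { TICK_WIRE_SIZE = 30, TICK_TS_OFFSET = 1 };

/* Returns 1 if the byte range [start, start + len) touches more than
 * one cache line, where start is measured from a line-aligned base. */
static int straddles_line(size_t start, size_t len)
{
    return len != 0 &&
           (start / CACHE_LINE) != ((start + len - 1) / CACHE_LINE);
}

/* Offset of message k's timestamp in a buffer of packed ticks stored
 * back to back, starting at a cache-line-aligned address. */
static size_t packed_timestamp_offset(size_t k)
{
    return k * TICK_WIRE_SIZE + TICK_TS_OFFSET;
}
```

Under these assumptions, the third packed message (index 2) starts at byte 60, so its timestamp occupies bytes 61 through 68 and `straddles_line(packed_timestamp_offset(2), 8)` returns 1. The padded in-memory struct has a different problem: in an array of 48-byte elements, the element at index 1 spans bytes 48 through 95 and crosses the line boundary at byte 64.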