
We rotated our JWKS without overlap. Here is the 4-minute window that broke prod.

DEV Community·Blue Hills·29 days ago
#jwt #jwks #cache #rotation #issuer #overlap

The on-call alert at 02:14 said auth_5xx_rate had spiked from 0.01% to 31.4%. Not a deploy window. Not a traffic spike. Just thirty-one percent of authenticated requests failing for about four minutes, then back to baseline.

The cause was a JWKS rotation on the issuer side. New keys came in. Old keys went out. Caches in our service didn't refresh fast enough. Tokens signed with the new key were rejected because the verifier still held the old JWKS. Tokens signed with the old key were rejected because the issuer had stopped publishing that key. We had a key-overlap gap of roughly four minutes between when the issuer stopped issuing tokens with the old key and when our verifier's cache picked up the new one.

This is a class of bug that does not show up in any of the tests we run. Unit tests use a fixture JWKS that never rotates. Integration tests use a mocked issuer. Synthetic monitoring hits the live issuer but uses tokens minted within the same minute, so cache freshness is irrelevant.…
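To make the failure mode concrete, here is a minimal simulation of a verifier-side JWKS cache facing a rotation with no overlap. It is a sketch under stated assumptions, not the service's actual code: the `JwksCache` class, the `simulate` helper, the 300-second TTL, and the rotation at t=60 are all invented for illustration. It tracks only key IDs (`kid`), not real keys or signatures. The `refresh_on_unknown_kid` flag shows one common mitigation (refetching, with a throttle, when a token names a `kid` the cache has never seen); whether the incident was fixed that way is not stated in the text above.

```python
from dataclasses import dataclass, field
from typing import Callable, Set


@dataclass
class JwksCache:
    """Hypothetical verifier-side cache of the key IDs an issuer publishes."""
    fetch: Callable[[], Set[str]]        # stand-in for fetching the issuer's JWKS
    clock: Callable[[], float]           # injectable clock, in seconds
    ttl: float = 300.0                   # assumed cache lifetime
    min_refetch_interval: float = 10.0   # throttle for forced refreshes
    refresh_on_unknown_kid: bool = False
    _kids: Set[str] = field(default_factory=set)
    _fetched_at: float = float("-inf")   # forces a fetch on first use

    def _refresh(self) -> None:
        self._kids = set(self.fetch())
        self._fetched_at = self.clock()

    def knows(self, kid: str) -> bool:
        now = self.clock()
        # Normal path: refresh only when the TTL has expired.
        if now - self._fetched_at >= self.ttl:
            self._refresh()
        # Mitigation: an unknown kid triggers a throttled early refresh.
        if (
            kid not in self._kids
            and self.refresh_on_unknown_kid
            and now - self._fetched_at >= self.min_refetch_interval
        ):
            self._refresh()
        return kid in self._kids


def simulate(refresh_on_unknown_kid: bool) -> int:
    """Count the seconds in which tokens signed with the new key are rejected."""
    now = [0.0]
    published = {"old"}  # what the simulated issuer currently serves
    cache = JwksCache(
        fetch=lambda: set(published),
        clock=lambda: now[0],
        refresh_on_unknown_kid=refresh_on_unknown_kid,
    )
    cache.knows("old")  # warm the cache at t=0

    rejected_seconds = 0
    for t in range(60, 361):
        now[0] = float(t)
        if t == 60:
            # Rotation with no overlap: the old key disappears as the new one appears.
            published.clear()
            published.add("new")
        if not cache.knows("new"):
            rejected_seconds += 1
    return rejected_seconds


if __name__ == "__main__":
    print("TTL only:", simulate(False), "seconds of rejected tokens")
    print("refresh on unknown kid:", simulate(True), "seconds of rejected tokens")
```

With these made-up numbers the TTL-only cache rejects new-key tokens from the rotation until its next scheduled refresh, which is the shape of the gap described above. The throttle on forced refreshes matters in practice: without it, a stream of tokens carrying bogus `kid` values could make every request hit the issuer.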
