Let's Encrypt has written up the evolution of their rate limit implementation: "Scaling Our Rate Limits to Prepare for a Billion Active Certificates".
You can see that they started out doing it in a database (MariaDB), obviously the intuitive & convenient choice, and then ran into performance problems:
In 2015, we introduced our first rate limiting system, built on MariaDB. It evolved alongside our rapidly growing service but eventually revealed its limits: straining database servers, forcing long reset times on subscribers, and slowing down every request.
An RDBMS (MariaDB here) is hard to scale out mindlessly; the usual first move is to see whether replicas can take the load, and that is indeed what came next, spinning up replicas to hold the line. But rate limiting is a service where every access implies a write, so the benefit should be limited:
In late 2021, we updated our control plane and Boulder—our in-house CA software—to route most API reads, including rate limit checks, to database replicas. This reduced the load on the primary database and improved its overall health. At the same time, however, latency of rate limit checks during peak hours continued to rise, highlighting the limitations of scaling reads alone.
Two problems come to mind right away: one is that the relational capabilities of an RDBMS go unused, and the other is that storing data that can tolerate loss in an RDBMS wastes quite a few resources (the D, durability, in ACID).
Next comes the part about moving this out of MariaDB into Redis, where they mention the ephemeral nature of the data and Redis's per-key TTL design:
By moving this data from MariaDB to Redis, we could eliminate the need for ever-expanding, purpose-built tables and indexes, significantly reducing read and write pressure. Redis’s feature set made it a perfect fit for the task. Most rate limit data is ephemeral—after a few days, or sometimes just minutes, it becomes irrelevant unless the subscriber calls us again. Redis’s per-key Time-To-Live would allow us to expire this data the moment it was no longer needed.
Logs can be streamed out through other channels (there will be audit requirements), while the rate limit data itself can sit in in-memory storage; if it gets lost, just treat it as a giveaway. This design lets you push Redis performance to the limit.
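To make the TTL idea concrete, here is a minimal sketch of a fixed-window limiter on top of Redis in Go (Boulder itself is written in Go). The key scheme, limit, and window below are made up for illustration; this is not Boulder's actual limiter, just the shape of the idea: the first INCR creates the counter, and the TTL armed right after lets Redis throw the key away the moment it is no longer needed.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// Allow implements a fixed-window counter per subscriber. INCR creates
// the key on first use; the TTL set right after makes Redis expire the
// counter on its own once the window passes, so no cleanup job or
// ever-growing table is needed.
func Allow(ctx context.Context, rdb *redis.Client, subscriber string, limit int64, window time.Duration) (bool, error) {
	key := "ratelimit:" + subscriber // hypothetical key scheme

	count, err := rdb.Incr(ctx, key).Result()
	if err != nil {
		return false, err
	}
	if count == 1 {
		// First hit in this window: arm the TTL so the counter
		// disappears the moment it is no longer relevant.
		if err := rdb.Expire(ctx, key, window).Err(); err != nil {
			return false, err
		}
	}
	return count <= limit, nil
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	ctx := context.Background()

	for i := 0; i < 5; i++ {
		ok, err := Allow(ctx, rdb, "example.com", 3, time.Minute)
		fmt.Println(ok, err) // first 3 calls true, then false
	}
}
```

One caveat: INCR followed by EXPIRE is not atomic, so a crash between the two calls can leave a counter without a TTL; a production version would typically wrap both in a Lua script. And since the data is in-memory, a Redis restart wipes the counters, which matches the "treat lost data as a giveaway" stance above.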
Still, at Let's Encrypt's volume they will sooner or later have to bring out sharding, and rate limiting is a service that splits apart very cleanly. Let's check back in a few years for a follow-up post... everyone ends up going down this road, right?
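For what that sharding could look like: since each rate limit decision only ever touches its own key, keys can be spread across independent Redis instances with no cross-shard coordination. A minimal sketch, using plain modulo hashing for clarity; a real deployment would more likely use Redis Cluster's hash slots or consistent hashing, since modulo reshuffles nearly every key when the shard count changes. The shard addresses are hypothetical:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickShard hashes a rate limit key and maps it onto one of N Redis
// instances. Every lookup and increment for a given key lands on the
// same shard, which is all a per-key counter workload needs.
func pickShard(key string, shards []string) string {
	h := fnv.New32a()
	h.Write([]byte(key))
	return shards[h.Sum32()%uint32(len(shards))]
}

func main() {
	shards := []string{"redis-1:6379", "redis-2:6379", "redis-3:6379"} // hypothetical addresses
	fmt.Println(pickShard("ratelimit:example.com", shards))
}
```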