On Hacker News
Cell-based architecture for resilient payment systems
Read the full article on americanexpress.io
123 points, 48 comments, 1 notable voice
The 5-second version
- Cell-based architecture groups independent instances of microservices, databases, and components into self-contained cells that function without cross-cell dependencies to contain failure blast radius.
- Each cell is a single failure domain with local data and processing, never spanning multiple regions, and can be removed from rotation without impacting other cells or requiring coordination.
- Static and semi-static reference data is replicated to every cell ahead of time to avoid synchronous lookups during transaction processing and preserve critical-path isolation.
- Dynamic transaction data uses deterministic routing through the Global Transaction Router to direct transactions to the cell where the required data already resides.
- The added management overhead and architectural complexity are outweighed by a smaller blast radius, improved resiliency, lower latency from fewer network hops, and easier horizontal scaling via independent cells.
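The summary does not say how the Global Transaction Router picks a cell, only that routing is deterministic. As a minimal sketch of one way to get that property, the example below uses rendezvous (highest-random-weight) hashing. The function names, the `account-1234` routing key, and the cell IDs are all hypothetical and not taken from the article.

```python
import hashlib


def _score(routing_key: str, cell_id: str) -> int:
    """Stable score for a (key, cell) pair; SHA-256 gives the same result in every process."""
    digest = hashlib.sha256(f"{cell_id}:{routing_key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def route_to_cell(routing_key: str, cell_ids: list[str]) -> str:
    """Pick the cell with the highest score for this key (rendezvous hashing)."""
    if not cell_ids:
        raise ValueError("at least one cell is required")
    return max(cell_ids, key=lambda cell_id: _score(routing_key, cell_id))


# Hypothetical cell names; the same key always lands on the same cell,
# regardless of the order in which the cells are listed.
cells = ["cell-a", "cell-b", "cell-c"]
chosen = route_to_cell("account-1234", cells)
assert chosen == route_to_cell("account-1234", list(reversed(cells)))
```

With this scheme, taking a cell out of the list only changes the result for keys that were routed to that cell. A real router would still need a policy for those keys, since their data lives in the removed cell.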
Top voices
Verbatim comments from the thread's most notable / highest-karma participants.
Some of it sounds like it reinvented Erlang supervision trees https://learnyousomeerlang.com/supervisors. As a joke there we’re calling gen_severs “nanoservices”. Granted, that was mostly when microservices were the hot new thing.
GLBs aren’t SPOFs. They are typically deployed around the world redundantly, often using Anycast IPs or using DNS geographic and failover records, and are stateless. Think AWS Global Accelerator and Route 53 as an example. The architecture diagram is a high level simplification.
mcintyre1994 (6k karma)
A few years ago someone kept signing up for loads of bank accounts/credit cards in my name, with my address. I’m not sure what the point of it was. But while everyone else happily sent cards and stacks of welcome paperwork to me, Amex were the only one that contacted me and told me they’d detected something weird in the signup. They gave me some helpful advice to resolve that situation too.
Backups in such a system are quite pointless; if losing 10 seconds of data means you lost 4000 transactions then periodic backups are invalid if not instantly than close to instantly. The system I work on has such a property and the only real infra style approach is sync replication before responding to a caller and a delayed replica for delete/drop protections (say with a 2hr or more window). Should also defend for this in your code (be able to reply from your initiation systems also etc)
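The approach in the last comment (acknowledge the caller only after synchronous replication, and keep a delayed replica as protection against deletes and drops) can be sketched roughly as follows. This is an in-memory illustration under stated assumptions, not the commenter's system: the `Replica`, `DelayedReplica`, and `commit` names are hypothetical, and time is passed in explicitly so the delay window is easy to follow.

```python
class Replica:
    """In-memory stand-in for a synchronous replica (hypothetical)."""

    def __init__(self, healthy: bool = True):
        self.healthy = healthy
        self.log = []

    def append(self, record) -> bool:
        if not self.healthy:
            return False
        self.log.append(record)
        return True


class DelayedReplica:
    """Applies writes only after `delay_seconds`, leaving a window to stop a bad delete or drop."""

    def __init__(self, delay_seconds: float):
        self.delay_seconds = delay_seconds
        self.pending = []  # (apply_at, record) pairs in arrival order
        self.log = []

    def enqueue(self, record, now: float) -> None:
        self.pending.append((now + self.delay_seconds, record))

    def apply_due(self, now: float) -> None:
        # Apply every queued record whose delay has elapsed.
        while self.pending and self.pending[0][0] <= now:
            self.log.append(self.pending.pop(0)[1])


def commit(record, sync_replicas, delayed, now: float) -> bool:
    """Acknowledge the caller only if every synchronous replica stored the record."""
    acks = [replica.append(record) for replica in sync_replicas]
    if not all(acks):
        # No acknowledgement; a real system would also undo any partial writes.
        return False
    delayed.enqueue(record, now)
    return True


replica_a, replica_b = Replica(), Replica()
delayed = DelayedReplica(delay_seconds=2 * 60 * 60)  # the 2hr window from the comment
acknowledged = commit("tx-1", [replica_a, replica_b], delayed, now=0.0)
```

In this sketch the caller receives `True` only once both synchronous replicas hold the record, while the delayed replica does not apply it until `apply_due` is called with a time at least two hours later.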