The purpose of this post is to explore the design options for a ticket booking service, and see how they align with what TicketMaster (TM) does. The hypothesis this post argues for is that 1) allocating 2–4M tickets among 20M-odd users is not a particularly hard problem, and 2) TM’s architecture is entirely reasonable, and should not buckle under load.
Three caveats — 1) This is not meant to be a “defence” of TM (esp wrt monopolies and scalping and secondary markets etc.), 2) I do not mean to suggest that tractable = trivial, or that implementing these architectures and making sure they run properly is not challenging, and 3) I have never designed or built a web service that runs at such scales, and would be happy to incorporate comments/feedback from folks with more experience.
With that, let’s dive into the design space for something like this, before we try analyzing what TM did, and what failed there.
Designing A Ticket Purchase Platform
2M tickets, 20M users vying for those, and all this action unfolding over a couple of hours. (The 20M number is unverified, but I’m going with it for simplicity.) We’ll 1) shard the problem, 2) handle concurrency within the shards, and 3) make the concurrency work. One fundamental assumption here is that there is a seat map, and people get to hand-pick the seats they want (vs getting a random allocation, which makes it a much easier problem).
If we can break the problem into independent shards, it suddenly becomes much more tractable. You can have one replica of your entire service stack for each shard, and let the load balancer route each request to the right one. Or more realistically, some of your services understand shards (such as a distributed database), and self-optimize accordingly, while some others could be fast enough to handle multiple shards per instance. In any case, sharding = wins.
The key (pun intended) here is realizing that all shows (let’s call one show a seat map) are independent. At any point a user is booking tickets within one seat map. Assuming 50 shows, our “concurrency domain” is suddenly reduced to 40k tickets, and 400k users. We could make it even smaller by asking users to input a price range, and have them wander around within that price range — but it’s more work for diminishing returns.
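To make the sharding arithmetic concrete, here is a minimal sketch. The function names and the `(user_id, show_id)` request shape are made up for illustration, and the even split across 50 shows is the same simplifying assumption as in the text.

```python
from collections import defaultdict

TOTAL_TICKETS = 2_000_000
TOTAL_USERS = 20_000_000   # unverified number, as noted above
NUM_SHOWS = 50             # assumed number of shows (seat maps)


def per_show_load(total_tickets, total_users, num_shows):
    """Split the global load evenly across independent seat maps."""
    return total_tickets // num_shows, total_users // num_shows


def route_by_show(requests):
    """Group (user_id, show_id) requests so each show is handled on its own.

    The show id is the shard key: requests for different shows never
    touch the same seat map, so the groups need no coordination.
    """
    shards = defaultdict(list)
    for user_id, show_id in requests:
        shards[show_id].append(user_id)
    return dict(shards)


tickets_per_show, users_per_show = per_show_load(TOTAL_TICKETS, TOTAL_USERS, NUM_SHOWS)
print(tickets_per_show, users_per_show)  # 40000 400000
```

In a real deployment the routing would live in a load balancer or a shard-aware database rather than in a dictionary, but the key idea is the same: the show is the unit of independence.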
Now let’s zoom into one seat map. We have N (= 400k) users, trying to book seats. The seat map is a shared resource, so we need to use some form of concurrency control. Let’s explore 3 options.
- We let one user enter, explore the seat map, book and pay, and then let the next user come in. This works, but at 2 minutes per user it’s extremely slow: 400k users would take 800k minutes, which is well over a year. It’s a coarse-grained lock at the level of a seat map.
- To speed things up, we make our locks finer-grained, and allow more concurrency. We introduce a per-seat lock. All users look at the seat map and make selections. If you click on a seat it turns yellow while a lock is requested from the backend, and if you acquire the lock it turns green. Once you’ve selected your seats and acquired all the locks, you pay, get your tickets, and brag on Mastodon. Something like this could plausibly work.
- Optimistic concurrency control (OCC): everyone sees the seat map, and makes selections. Your selection is locked only when you proceed to pay. If your selection conflicts with someone else’s, you’re made to go back and choose another set of seats. This could also work.
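Options 2 and 3 can be sketched on the same toy seat map. This is an in-memory illustration under my own assumptions, not how any real ticketing backend does it: the `SeatMap` class and its method names are hypothetical, and things a real system needs (lock expiry, payment, persistence) are left out.

```python
import threading


class SeatMap:
    """Toy seat map supporting per-seat locks and an OCC-style commit."""

    def __init__(self, seat_ids):
        # A short-lived internal guard keeps each operation atomic.
        # In a real system this would be a database transaction or similar.
        self._guard = threading.Lock()
        self._holder = {seat: None for seat in seat_ids}  # seat -> user or None

    def holder(self, seat):
        with self._guard:
            return self._holder[seat]

    def try_hold(self, user, seat):
        """Option 2: per-seat lock, requested the moment a seat is clicked."""
        with self._guard:
            if self._holder[seat] is None:
                self._holder[seat] = user
                return True
            return self._holder[seat] == user  # re-clicking your own seat is fine

    def commit(self, user, seats):
        """Option 3: OCC. Nothing is locked while browsing; at checkout the
        whole selection is claimed atomically, or rejected if any seat is gone."""
        with self._guard:
            if any(self._holder[s] not in (None, user) for s in seats):
                return False  # conflict: the user goes back and picks again
            for s in seats:
                self._holder[s] = user
            return True


seat_map = SeatMap(["A1", "A2", "A3"])
print(seat_map.try_hold("alice", "A1"))       # True: alice's seat turns green
print(seat_map.try_hold("bob", "A1"))         # False: already held by alice
print(seat_map.commit("bob", ["A1", "A2"]))   # False: conflicts on A1, A2 stays free
print(seat_map.commit("bob", ["A2", "A3"]))   # True
```

The difference is when the conflict surfaces: with per-seat locks you find out at click time, with OCC you find out at checkout.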
Both options 2 and 3 from the previous section are viable, but only for small values of concurrency. Coordinating locks across 400K users is complicated and wasteful. OCC among 400K users will just lead to too many conflicts.
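A back-of-the-envelope estimate shows how quickly conflicts grow with concurrency. This is a deliberately crude model of my own: each of k users picks a single seat uniformly at random and independently among the free seats. Real users cluster on the good seats, so actual conflict rates would likely be higher.

```python
def conflict_probability(k, free_seats):
    """Chance that one user's seat is also picked by at least one of the
    other k - 1 users, assuming uniform, independent, single-seat picks."""
    return 1.0 - (1.0 - 1.0 / free_seats) ** (k - 1)


# 40k seats per seat map, as in the sharding estimate above.
for k in (200, 4_000, 400_000):
    print(k, conflict_probability(k, 40_000))
```

Under this model a user's chance of a conflict is about half a percent with 200 concurrent users, and close to certain with all 400k users on the seat map at once.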
So we limit the concurrency window. We still have N (= 400k) users, but only k of them enter the seat map selection page and do their thing. Both 2) and 3) are valid design decisions, but the optimal k for each of them may be different.
We can do this by arranging our N users into a fully ordered queue. We let k users into the seat map section at a time, and admit people from the head of the queue as users in the seat map section complete their booking. As long as this queue service is robust and functional, we can set k to whatever concurrency the rest of our system can support.
We need one queue per seat map. This queue is first-come, first-served, and is mostly fair. As long as this queue works, the rest of the system is tractable.
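The queue with a concurrency window can be sketched in a few lines. Again, this is a single-process toy under my own assumptions: `AdmissionQueue` and its methods are hypothetical names, and a real queue service would also need persistence, timeouts for idle users, and protection against dropped requests.

```python
from collections import deque


class AdmissionQueue:
    """FIFO waiting room that keeps at most k users on the seat map."""

    def __init__(self, k):
        self.k = k
        self.waiting = deque()  # users in arrival order
        self.active = set()     # users currently picking seats

    def join(self, user):
        self.waiting.append(user)
        self._admit()

    def leave(self, user):
        """Called when an admitted user finishes booking or gives up."""
        self.active.discard(user)
        self._admit()

    def position(self, user):
        """Number of people ahead of a waiting user, or None if not waiting."""
        try:
            return self.waiting.index(user)
        except ValueError:
            return None

    def _admit(self):
        # Fill free slots from the head of the queue (first come, first served).
        while self.waiting and len(self.active) < self.k:
            self.active.add(self.waiting.popleft())


queue = AdmissionQueue(k=2)
for user in ["ann", "ben", "cat", "dan"]:
    queue.join(user)
print(sorted(queue.active), queue.position("dan"))  # ['ann', 'ben'] 1
queue.leave("ann")
print(sorted(queue.active), queue.position("dan"))  # ['ben', 'cat'] 0
```

Because k is just a parameter, it can be tuned (or lowered over time) without touching the seat selection logic behind it.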
What TicketMaster Does
This is exactly what TicketMaster does. They serialize users in a queue, and for some “concurrency window”, they let them choose seats and do optimistic concurrency control while locking seats before payment. If we assume that a user spends 2 minutes picking seats, and take my estimate of the queue throughput from 10am to 10.30am EST of about 100 users/minute, that implies k = 100 × 2 = 200 concurrent users wandering around the seat map. (This k should go down as seats fill up and conflict probability increases.)
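The k estimate is just Little's law (average number in the system = arrival rate × average time in the system). A tiny sketch, where both inputs are my own rough estimates from above rather than measured values:

```python
def concurrency_window(throughput_per_min, minutes_per_user):
    """Little's law: users on the seat map = admission rate x time spent there."""
    return throughput_per_min * minutes_per_user


# Assumed inputs: ~100 users/minute leaving the queue, ~2 minutes per user.
print(concurrency_window(100, 2))  # 200
```

If users actually spend longer on the seat map, the same observed throughput implies a proportionally larger k.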
TicketMaster Devs Mostly Got It Right
- Their queue service worked. That’s the lynchpin for handling this scale, and it worked fine. I do not get why they chose to show “2000+ people” when they had the exact number in the API response — it led folks to feel like they weren’t actually moving along, but they were!
- At 10.50am EST-ish, their queue was deliberately paused. An explicit message from the servers indicating a pause was sent out to all clients. Something somewhere went wrong (I read reports of presale codes not working), which to me suggests a consistency issue in a database somewhere. But it seems that they got the hard bits right and messed up the easier bits somewhere.
- A minor source of unfairness was joining the queue at 9.30am EST. My “join queue” clicks just spun and did nothing the first 3 times, and only worked the 4th time. I assume that the initial spike in handshakes was overwhelming and led to a lot of requests being arbitrarily dropped. A minor failure case in the queuing service.
- Between OCC and per-seat locking, OCC is much simpler to implement, although you could argue that a per-seat lock results in a marginally better experience.
Things That They Did Wrong (IMO)
Swifties are obviously MAD. Ultimately, what matters is whether there was a clear, predictable set of rules for tickets, and whether folks who followed those rules were able to get tickets.
- Sending presale invites to 70% of those who signed up. Why!? This situation was entirely predictable once that had happened.
- Allowing folks to join queues without entering presale codes. Why not filter out ineligible folks there?
- Not being transparent with pricing beforehand. There is no reason that seat maps, prices etc. could not have been announced beforehand, even if only earlier that morning. That would mean folks spend less time in the “critical section” exploring their options, and are able to budget and plan before queuing up. If dynamic pricing is a factor, the only non-evil move is to explain the algorithm, or at least how that will affect the user.
- Some technical failures did occur. It doesn’t matter that you were able to get the hard bits right if the end-to-end experience is still compromised.