Tuesday, February 24, 2015

Async IO: Latency and Consistency

The degree of control I have over the threads in Playform is lacking. It seems that when the server should be flushing messages to the client, it's loading terrain instead. That's not always a bad call, it's just not great if the client is waiting on player updates (low latency is kind of key in realtime games). We could potentially say "process all incoming events before doing anything else" using a condition variable, but that doesn't enforce any ordering among the "anything else"s; world updates need to be considered too. Trying to organize the relationships between these threads using condition variables is going to become a headache quickly, even with only a few of them.

I could put a whole lot of things back into a single thread, but there's a reason we separated them in the first place: if they can run concurrently, they should (e.g. generating more terrain in the background). And putting everything into a single loop tends to impose a strict sequence, e.g. incoming events -> world updates -> generate terrain, although of course you could write more code to let them run in a more dynamic order.

Which brings me to the new approach I'm considering: implement threading via a priority queue of closures. A fixed-size ThreadPool constantly executes the highest-priority closures on the queue. Currently, our threads are all initialization blocks for shared state, followed by a message-processing loop. The loop can become queued closures by scheduling a single iteration and having it schedule itself again when done. The shared state becomes a parameter to the closure (which it then passes to itself as part of its recursive scheduling). What about the priority? The next iteration's deadline works as a base priority for the loops; it can become more complicated as needed.
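To make that concrete, here's a minimal single-threaded sketch of the idea (all type and function names here are mine, not Playform's): a task is a closure plus a deadline used as its priority, a "loop" is a closure that re-schedules itself carrying its own state, and the runner always pops the most urgent task.

```rust
use std::cell::RefCell;
use std::cmp::Ordering;
use std::collections::BinaryHeap;
use std::rc::Rc;

// A schedulable unit: a closure plus a priority. Here the priority is the
// next-iteration deadline (smaller = more urgent).
struct Task {
    deadline: u64,
    run: Box<dyn FnOnce(&mut Scheduler)>,
}

impl PartialEq for Task {
    fn eq(&self, other: &Self) -> bool {
        self.deadline == other.deadline
    }
}
impl Eq for Task {}
impl PartialOrd for Task {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for Task {
    // BinaryHeap is a max-heap, so compare deadlines in reverse: the
    // earliest deadline pops first.
    fn cmp(&self, other: &Self) -> Ordering {
        other.deadline.cmp(&self.deadline)
    }
}

struct Scheduler {
    queue: BinaryHeap<Task>,
}

impl Scheduler {
    fn new() -> Scheduler {
        Scheduler { queue: BinaryHeap::new() }
    }

    fn schedule(&mut self, deadline: u64, run: Box<dyn FnOnce(&mut Scheduler)>) {
        self.queue.push(Task { deadline, run });
    }

    // Single-threaded stand-in for the pool: pop and run the most urgent
    // task until the queue drains. A real fixed-size pool would have N
    // workers doing this behind a Mutex, parking on a Condvar when empty.
    fn run_to_completion(&mut self) {
        while let Some(task) = self.queue.pop() {
            (task.run)(self);
        }
    }
}

// One iteration of a "message loop": do some work (here, just log the
// deadline), then schedule the next iteration with the loop's state.
fn loop_iter(deadline: u64, period: u64, remaining: u32,
             log: Rc<RefCell<Vec<u64>>>, sched: &mut Scheduler) {
    log.borrow_mut().push(deadline);
    if remaining > 0 {
        let next = deadline + period;
        sched.schedule(next, Box::new(move |s| {
            loop_iter(next, period, remaining - 1, log, s)
        }));
    }
}

fn run_demo() -> Vec<u64> {
    let mut sched = Scheduler::new();
    let log: Rc<RefCell<Vec<u64>>> = Rc::new(RefCell::new(Vec::new()));

    // A fast loop (period 5) and a slow one (period 20): their iterations
    // interleave strictly by deadline, not by which thread woke up first.
    let l1 = log.clone();
    sched.schedule(0, Box::new(move |s| loop_iter(0, 5, 3, l1, s)));
    let l2 = log.clone();
    sched.schedule(0, Box::new(move |s| loop_iter(0, 20, 1, l2, s)));

    sched.run_to_completion();
    let result = log.borrow().clone();
    result
}

fn main() {
    println!("{:?}", run_demo());
}
```

The nice property is that "which work happens next" is decided in one place, by the priority, instead of falling out of whichever thread the OS happened to resume.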

Now this is a switch from interrupt-based to polling-based IO, so instead of blocking on message queue reads, each loop body executes at a regular interval and checks whether messages are available. If these loops are dealing with incoming events, the average latency for handling an event is directly proportional to the interval at which the closure is scheduled. If we have events that require low-latency, low-throughput processing, we'll be doing more spinning than we should be. I think this is a fair trade to make for a game though: most events are high throughput, and a check for "are there messages available" is pretty damn cheap (even including the context switch) compared to some of the other loop bodies.
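A single polled iteration might look like the sketch below (using std::sync::mpsc; the function name and message type are made up for illustration): try_recv drains whatever is queued and returns immediately when nothing is waiting, so the "are there messages available" check is just one non-blocking channel read.

```rust
use std::sync::mpsc::{channel, Receiver, TryRecvError};

// One polled iteration of a message loop: drain everything that's queued,
// then return so the scheduler can run other tasks and re-queue us later.
// A blocking loop would call recv() and park the thread; this never blocks.
fn process_pending(rx: &Receiver<String>, handled: &mut Vec<String>) {
    loop {
        match rx.try_recv() {
            Ok(msg) => handled.push(msg),             // handle one message
            Err(TryRecvError::Empty) => break,        // nothing waiting: the cheap check
            Err(TryRecvError::Disconnected) => break, // sender gone: loop should shut down
        }
    }
}

fn main() {
    let (tx, rx) = channel();
    let mut handled = Vec::new();

    // An iteration with nothing queued costs roughly one check.
    process_pending(&rx, &mut handled);
    assert!(handled.is_empty());

    // Messages that arrive between iterations get drained by the next one.
    tx.send("player_moved".to_string()).unwrap();
    tx.send("player_jumped".to_string()).unwrap();
    process_pending(&rx, &mut handled);
    println!("{:?}", handled);
}
```

The latency cost is visible here: a message that arrives just after an iteration returns sits in the channel until the next scheduled iteration.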

If anybody has thoughts on this, let me know. Async IO is hard and I would love for someone to have already dealt with my use case.
