Saturday, February 28, 2015

High-Level Client/Server Design

The server has to be able to send things like player updates at a relatively high rate, with relatively low latency. Ideally, the "reaction time" between a keypress and movement is <10ms (a nice round number for human-noticeable latency), not counting network latency. If we can hit that, singleplayer could just be a local one-player server, since there'd be no network latency to add on top.

The idea of tiered priorities to organize async IO is becoming more and more appealing. Each tier holds a priority queue of closures, ordered by completion deadline. Lower-tier actions are never processed while a higher tier still has work pending. (Note that actions may schedule other actions.)

Warnings should be added for when a queue is growing too quickly, or when too many actions are being processed past their deadlines.

We should also audit these actions for things that can be processed greedily (i.e. before their actual deadline). Interval-based actions (e.g. rendering) can be responsible for scheduling their next iterations, relative to the current deadline rather than the time they actually ran, so that the overall rate of an action is unaffected by greedy processing.
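
To make the tiers concrete, here's a minimal sketch of the structure I have in mind (not Playform code; all the names are invented):
use std::cmp::Ordering;
use std::collections::BinaryHeap;

type Deadline = u64; // e.g. milliseconds since startup; smaller = sooner

// One queued action: a closure plus the deadline it should complete by.
struct Action {
  deadline: Deadline,
  run: Box<dyn FnOnce() + Send>,
}

// Order actions by deadline only, reversed, because BinaryHeap is a
// max-heap and we want to pop the earliest deadline first.
impl PartialEq for Action {
  fn eq(&self, other: &Action) -> bool { self.deadline == other.deadline }
}
impl Eq for Action {}
impl PartialOrd for Action {
  fn partial_cmp(&self, other: &Action) -> Option<Ordering> { Some(self.cmp(other)) }
}
impl Ord for Action {
  fn cmp(&self, other: &Action) -> Ordering { other.deadline.cmp(&self.deadline) }
}

// One deadline-ordered queue per tier.
struct TieredQueue {
  tiers: Vec<BinaryHeap<Action>>,
}

impl TieredQueue {
  // Never touch a lower tier while a higher one still has work pending.
  fn next(&mut self) -> Option<Action> {
    self.tiers.iter_mut().find_map(|tier| tier.pop())
  }
}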

Client
  • Priority 0
    • Receives blocks from the server.
      • Queues blocks for loading. 
    • Receives world updates.
      • Moves the client.
        • Removes obsolete surroundings from render state.
        • Requests blocks from the server.
        • Updates the render state.
    • Receives entity updates.
      • Updates the render state.
    • Receives user input events.
      • Sends the server player updates.
      • Updates the render state.
  • Priority 1
    • Renders at a regular interval.
  • Priority 2
    • Loads queued blocks.
Server
  • Priority 0
    • Acks new clients.
      • Registers the client's listen addresses.
      • Opens a new connection.
      • Sends the client the initial world state.
    • Queues terrain block requests.
    • Applies player updates from clients.
  • Priority 1
    • Updates the world at a regular interval.
      • Sends world updates to clients.
      • Unloads blocks from the octree.
      • Requests terrain blocks.
  • Priority 2
    • Loads/generates queued blocks.
      • Sends blocks to the client.
      • Loads blocks into the octree.

Tuesday, February 24, 2015

Async IO: Latency and Consistency

The degree of control I have over the threads in Playform is lacking. It seems that when the server should be flushing messages to the client, it's loading terrain instead. That's not always a bad call; it's just not great if the client is waiting on player updates (low latency is kind of key in realtime games). We could potentially just say "process all incoming events before doing anything else" using a condition variable, but that doesn't enforce any ordering among the "anything else"s, and world updates need to be considered too. Trying to organize the relationships between these threads using condition variables is going to become a headache quickly, even with only a few of them.

I could put a whole lot of things back into a single thread, but there's a reason we separated them in the first place: if they can be running concurrently, they should be (e.g. generating more terrain in the background). And putting everything into a single loop means you tend to have a strict sequence, e.g. incoming events -> world updates -> generate terrain, although of course you could write more code to let them run with a more dynamic ordering.

Which brings me to the new approach I'm considering: implement threading via a priority queue of closures. A fixed-size ThreadPool constantly executes the highest-priority closures on the queue. Currently, our threads are all initialization blocks for shared state, followed by a message-processing loop. The loop can become queued closures by scheduling a single iteration and having it schedule itself again when done. The shared state becomes a parameter to the closure (which it then passes to itself as part of its recursive scheduling). What about the priority? The next iteration's deadline works as a base priority for the loops; it can become more complicated as needed.
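
As a single-threaded sketch of the pattern (the real thing would hand closures to the ThreadPool; all the names here are invented):
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};

// Closures aren't comparable, so the heap orders (deadline, id) pairs
// and the closure bodies live in a side table.
struct Scheduler {
  heap: BinaryHeap<(Reverse<u64>, u64)>,
  tasks: HashMap<u64, Box<dyn FnOnce(&mut Scheduler, u64)>>,
  next_id: u64,
}

impl Scheduler {
  fn schedule<F>(&mut self, deadline: u64, f: F)
    where F: FnOnce(&mut Scheduler, u64) + 'static,
  {
    let id = self.next_id;
    self.next_id += 1;
    self.heap.push((Reverse(deadline), id));
    self.tasks.insert(id, Box::new(f));
  }

  // Repeatedly pop and run the closure with the earliest deadline.
  fn run(&mut self) {
    while let Some((Reverse(deadline), id)) = self.heap.pop() {
      let task = self.tasks.remove(&id).unwrap();
      task(self, deadline);
    }
  }
}

// One iteration of a former message-processing loop: the shared state is
// a parameter, and each iteration schedules the next one at
// deadline-plus-interval (not now-plus-interval), keeping the rate steady.
fn schedule_iteration(scheduler: &mut Scheduler, state: u32, deadline: u64) {
  scheduler.schedule(deadline, move |scheduler, deadline| {
    // ... poll for messages, do one iteration of work on `state` ...
    if deadline < 1000 { // toy cutoff so the example terminates
      schedule_iteration(scheduler, state + 1, deadline + 16);
    }
  });
}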

Now this is a switch from interrupt-based to polling-based IO, so instead of blocking on message queue reads, each loop body executes at a regular interval and checks whether messages are available. If these loops are dealing with incoming events, the average latency for handling an event is proportional to the interval at which the closure is scheduled (about half the interval, for events arriving at random times). If we have events that require low-latency, low-throughput processing, we'll be doing more spinning than we should be. I think this is a fair trade to make for a game though: most events are high throughput, and a check for "are there messages available" is pretty damn cheap (even including the context switch) compared to some of the other loop bodies.
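
Concretely, the blocking reads become non-blocking ones; a minimal sketch:
use std::sync::mpsc::Receiver;

// Drain whatever has arrived since the last iteration, without blocking;
// try_recv returns immediately when the queue is empty.
fn poll_incoming<T, F: FnMut(T)>(incoming: &Receiver<T>, mut handle: F) {
  while let Ok(msg) = incoming.try_recv() {
    handle(msg);
  }
}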

If anybody has thoughts on this, let me know. Async IO is hard and I would love for someone to have already dealt with my use case.

Monday, February 23, 2015

Async IO

Today, I added SendSocket<T> and ReceiveSocket<T> structs to Playform. These replace and subsume the spark_socket_sender and spark_socket_receiver functions that were there before, which would spawn a new thread to pump messages between a socket and an async queue. I briefly debated adding support for a kill signal, sent from the parent, so the threads could cleanly kill themselves. But since the socket threads can be blocked indefinitely (e.g. waiting on a disconnected client), some mechanism for the parent to kill them externally needs to exist anyway, so we might as well use that all the time. So far, I've been returning the socket endpoints to the caller of spark_socket_{sender,receiver}, and the caller can shut them down, which causes the sparked thread to die. But that's far from clean, and the resources that were moved into the sparked threads get dealt with kind of messily.

Now we have SendSocket::spawn and ReceiveSocket::spawn. They do the same thing, but return a SendSocket/ReceiveSocket, which holds all the state created. This requires that the spawned threads have a more constrained lifetime (so they can be put inside a data structure), which turns out to actually be a good thing: it means the compiler now lets those threads make use of values that aren't around for as long, e.g. sockets created at runtime.
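
The shape is roughly this (a sketch of the idea, not the actual Playform code; the socket write itself is abstracted behind a send_to_socket closure here):
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

// The returned struct owns everything spawn created, including the pump
// thread's handle, so the caller can shut the whole thing down.
pub struct SendSocket<T> {
  pub sender: Sender<T>,
  thread: JoinHandle<()>,
}

impl<T: Send + 'static> SendSocket<T> {
  pub fn spawn<F>(mut send_to_socket: F) -> SendSocket<T>
    where F: FnMut(T) + Send + 'static,
  {
    let (tx, rx) = channel();
    let thread = thread::spawn(move || {
      // Pump messages from the async queue out to the socket until
      // every Sender has hung up.
      while let Ok(msg) = rx.recv() {
        send_to_socket(msg);
      }
    });
    SendSocket { sender: tx, thread: thread }
  }

  // Hang up the queue so the pump loop exits, then reap the thread. (A
  // thread blocked inside the socket write itself still needs the
  // external-kill mechanism described above.)
  pub fn close(self) {
    let SendSocket { sender, thread } = self;
    drop(sender);
    let _ = thread.join();
  }
}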

Adding these types meant changing a lot of signatures, though. A lot of Playform's threads look something like:
/// Keep the terrain around the client loaded.
pub fn surroundings_thread(
  client: &Client,
  ups_to_view: &Mutex<Sender<ClientToView>>,
  ups_to_server: &Mutex<Sender<ClientToServer>>,
) {
  ...
}
Because the high-level design is still evolving, these types change a lot. Sometimes references get added or removed, sometimes Mutexes get added or removed, and here, I'm changing from a Sender to a SendSocket - a different type entirely. This is annoying.

My solution was to let the threads just say "I want to fire off an event of type X", without caring much about exactly how. I replaced the offending signatures with things like:
/// Keep the terrain around the client loaded.
pub fn surroundings_thread<UpdateView, UpdateServer>(
  client: &Client,
  update_view: &mut UpdateView,
  update_server: &mut UpdateServer,
) where
  UpdateView: FnMut(ClientToView),
  UpdateServer: FnMut(ClientToServer),
{
  ...
}
It's longer and clunkier, but it means I don't have to change these signatures nearly as much. What this does is say that surroundings_thread can be run with any two functions that accept ClientToView and ClientToServer respectively. The exact nature of these functions isn't important (e.g. whether it's a plain-ol' Rust function, a closure on the stack, or something else that just kinda works like a function), which is why the types of the functions themselves are parameters (<UpdateView, UpdateServer>).

At use time, we specify what exactly it means to apply our updates:
thread::spawn(move || {
  surroundings_thread::surroundings_thread(
    client,
    &mut |msg| { client_to_view_send.lock().unwrap().send(msg).unwrap() },
    &mut |msg| { ups_to_server.lock().unwrap().send(msg).unwrap() },
  );
})

Saturday, February 21, 2015

Terrain loading design

We can't keep an entire world loaded, but we need to make sure that the useful parts of the world (i.e. the parts near players) are loaded. How do we do this?

First, the world is divided into a regular cubic grid. Each cell on the grid is loaded and unloaded as a single unit. Each cell is generated at several levels of detail (LODs), and the appropriate version is selected depending on distance from the viewer. This is called a geometry clipmap.

A SurroundingsIter struct iterates through all the cells in an NxNxN block, from the center outward in cubic "layers", so that terrain loads from near to far around a player.
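
The iteration order looks like this sketch (not the real SurroundingsIter; max_r is the outermost layer):
// Visit cells in cubic layers of increasing Chebyshev distance from the
// center: layer r is the surface of a (2r+1)^3 cube.
fn surroundings(max_r: i32) -> Vec<(i32, i32, i32)> {
  let mut cells = Vec::new();
  for r in 0..max_r + 1 {
    for x in -r..r + 1 {
      for y in -r..r + 1 {
        for z in -r..r + 1 {
          // Skip interior cells; earlier layers already emitted them.
          if x.abs() == r || y.abs() == r || z.abs() == r {
            cells.push((x, y, z));
          }
        }
      }
    }
  }
  cells
}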

This SurroundingsIter is wrapped up in a SurroundingsLoader struct, which is responsible for keeping the cubic area around some (moving) point loaded. Its update function takes in a new central point and produces Load and Unload events, which designate a cell to be loaded at a specific LOD or unloaded. These events are handled differently in different places: the client responds by requesting blocks from the server (for Loads) or by removing the block data from VRAM (for Unloads); the server responds to Loads by inserting the physical mesh data into the world for collision detection. (Note also that because the server doesn't care about visuals, it doesn't need to keep more blocks loaded than just the ones that players can touch.)
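
A sketch of that update step (invented names; the LOD here is just a made-up distance bucket standing in for the real clipmap selection):
use std::collections::HashSet;

type CellId = (i32, i32, i32);

enum Event {
  Load(CellId, u32), // cell plus LOD, coarser with distance from the viewer
  Unload(CellId),
}

// Pick an LOD bucket from the Chebyshev distance to the viewer.
fn lod_for(center: CellId, cell: CellId) -> u32 {
  let d = (cell.0 - center.0).abs()
    .max((cell.1 - center.1).abs())
    .max((cell.2 - center.2).abs());
  d as u32 / 4 // arbitrary bucket width for the sketch
}

// Diff the currently loaded set against the set now in range.
fn update(
  center: CellId,
  loaded: &HashSet<CellId>,
  in_range: &HashSet<CellId>,
) -> Vec<Event> {
  let mut events = Vec::new();
  for &cell in in_range.difference(loaded) {
    events.push(Event::Load(cell, lod_for(center, cell)));
  }
  for &cell in loaded.difference(in_range) {
    events.push(Event::Unload(cell));
  }
  events
}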

Now, what if generating terrain takes too long? Should the server block? Probably not. Other updates and clients should be kept alive and well, regardless of terrain gen. Just load it when it's ready? What if the terrain is under the player? They'll fall through the world if it takes too long to load.

My solution so far has been to make unloaded areas solid near players; you simply can't move into a cell until it's loaded. This has some interesting side effects, though - what if one block isn't loaded, but the empty block above it is? Then I can use the solid unloaded block as a step to reach that chest I couldn't reach before (hypothetically).

The solidifying logic is surprisingly code-costly too: in the original client, blocks were either loaded both physically AND graphically (at some LOD), or not at all. The concept of "solid as a placeholder until terrain gen is done" had to be threaded through all the code and data structures that dealt with loading, unloading, and levels of detail. In the end, the "cleanest" thing ended up being to add Placeholder as a somewhat special minimal LOD. This still exists in the server code, but thankfully the client doesn't worry about physics, and doesn't have to deal with this particular quirk.
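
In sketch form, the idea is just an extra variant that orders below every real LOD (not the actual server type):
// "Solid, but not actually generated yet" sorts below every real level
// of detail, so existing LOD comparisons keep working.
#[derive(PartialEq, Eq, PartialOrd, Ord)]
enum Lod {
  Placeholder,
  Detail(u32), // 0 = coarsest real LOD; larger = finer
}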

I'd like to remove this logic, although there's no ideal replacement - if you stop anything from moving until the terrain around it is loaded, that can be exploited to, say, fire at an opponent from above. Minecraft just kind of pretends that unloaded areas are air until proven otherwise, and resolves inconsistencies by suffocating players that end up inside something. Although the idea of directly punishing players for slow servers tickles me, I feel like there's a better solution. Maybe the best solution is just to work hard at keeping players from encountering unloaded areas, and then it becomes less important what happens when they do.

The server code for keeping track of loaded blocks is overcomplicated too; since it doesn't care about graphics, it really only loads blocks at one LOD (max) for collision logic. The data structure(s) it uses are left over from the old client, though, and they can reason about many different owners requesting blocks at different LODs. Although right now Placeholder is still a LOD, so this functionality is kindasorta used, it really can get a lot simpler (which almost definitely means faster, too).