Sunday, March 15, 2015

Cap'n Proto Rust Macros

I've found that Rust does a good job of designing powerful tools for the common use case while still smoothly supporting uncommon ones. unsafe is a great example: Rust provides tons of tools to keep you from shooting yourself in the foot in the ways that are common in C++, like null pointers and pointers to expired data. But if what Rust provides doesn't work for you, for whatever reason (e.g. your data ownership is complicated enough that you can't explain it to the compiler), then you can opt out of these things using unsafe. You, as the programmer, accept responsibility for the guarantees you've given up.
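
As a tiny, standalone illustration of what opting out looks like (this isn't from any of my code, just the language feature): creating a raw pointer is ordinary safe Rust, but dereferencing one requires an unsafe block.

  fn main() {
    let x = 5u32;
    let p = &x as *const u32; // making a raw pointer is safe
    let y = unsafe { *p };    // dereferencing it is on you
    assert_eq!(y, 5);
  }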

The macro system is another good example. Macros are pretty effective for a lot of the macro-y use cases: you can use them as inline functions, you can use them to parse varargs, and you can use them to create tree-style DSLs, like this example taken from the Rust guide:

  let mut out = String::new();

  write_html!(&mut out,
      html[
          head[title["Macros guide"]]
          body[h1["Macros are the best!"]]
      ]);

  assert_eq!(out,
      "<html><head><title>Macros guide</title></head>\
       <body><h1>Macros are the best!</h1></body></html>");

I used macros to make a DSL like that for Cap'n Proto message construction code, and I'm pretty happy with the results. This is the example from capnpc-rust:

let mut message = MallocMessageBuilder::new_default();
{
    let address_book = message.init_root::<address_book::Builder>();

    let mut people = address_book.init_people(2);

    {
        let mut alice = people.borrow().get(0);
        alice.set_id(123);
        alice.set_name("Alice");
        alice.set_email("alice@example.com");
        {
            let mut alice_phones = alice.borrow().init_phones(1);
            alice_phones.borrow().get(0).set_number("555-1212");
            alice_phones.borrow().get(0).set_type(person::phone_number::Type::Mobile);
        }
        alice.get_employment().set_school("MIT");
    }

    {
        let mut bob = people.get(1);
        bob.set_id(456);
        bob.set_name("Bob");
        bob.set_email("bob@example.com");
        {
            let mut bob_phones = bob.borrow().init_phones(2);
            bob_phones.borrow().get(0).set_number("555-4567");
            bob_phones.borrow().get(0).set_type(person::phone_number::Type::Home);
            bob_phones.borrow().get(1).set_number("555-7654");
            bob_phones.borrow().get(1).set_type(person::phone_number::Type::Work);
        }
        bob.get_employment().set_unemployed(());
    }
}

And here's how it looks using capnpc-macros:

let mut message =
    capnpc_new!(
        address_book::Builder =>
        [array init_people 2 =>
            [
                [set_id 123]
                [set_name "Alice"]
                [set_email "alice@example.com"]
                [array init_phones 1 =>
                  [
                      [set_number "555-1212"]
                      [set_type {person::phone_number::Type::Mobile}]
                  ]
                ]
                [init_employment => [set_school "MIT"]]
            ]

            [
                [set_id 456]
                [set_name "Bob"]
                [set_email "bob@example.com"]
                [array init_phones 2 =>
                    [
                        [set_number "555-4567"]
                        [set_type {person::phone_number::Type::Home}]
                    ]
                    [
                        [set_number "555-7654"]
                        [set_type {person::phone_number::Type::Work}]
                    ]
                ]
                [init_employment => [set_unemployed ()]]
            ]
        ]
    );
I think that's pretty cool.
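
For anyone curious how a bracket-y DSL like that gets built, here's a toy sketch of the general technique - this is not the actual capnpc-macros implementation, just a much-simplified illustration of a macro_rules! macro that expands bracketed [method args] forms into method calls on a builder:

  macro_rules! apply_setters {
    ( $builder:expr, $( [ $method:ident $( $arg:expr ),* ] )* ) => {{
      let mut b = $builder;
      $( b.$method( $( $arg ),* ); )*
      b
    }};
  }

  struct Person { id: u64, name: String }

  impl Person {
    fn set_id(&mut self, id: u64) { self.id = id; }
    fn set_name(&mut self, name: &str) { self.name = name.to_string(); }
  }

  fn main() {
    let person = apply_setters!(
      Person { id: 0, name: String::new() },
      [set_id 123]
      [set_name "Alice"]
    );
    assert_eq!(person.id, 123);
    assert_eq!(person.name, "Alice");
  }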

Update: Client/Server Refactoring

The client/server refactoring is done for now! It took about a month, but Playform has been split into separate server and client binaries that communicate over sockets. Hypothetically, this should work over LAN too, but I haven't tested it.

Async IO Design
A huge portion of the work went into creating a predictable, performant async IO design that lets you explicitly prioritize different kinds of events (e.g. prioritizing world updates and client inputs over terrain loading).

The Journey: For a while, I tried using an "intuitive" approach to threading, where each conceptually independent line of work (e.g. terrain loading, client update processing, world updating) ran in a separate thread. But it was hard to control which thread ran at which time, so terrain loading would interrupt world updating at unhelpful times. It's not really reasonable or helpful for all those independent lines of work to be treated equally. It's not at all impossible to coordinate threads with one another (e.g. by using condition variables to have threads signal each other), but it just seemed like too much work. I'd be trying to describe the relationships between threads by defining individual actions between them; I imagine that's like trying to draw a picture in Conway's game of life: doable, but unnecessarily hard. If you want to describe a picture, just describe the picture.

The Result: There's that old rule of thumb about multithreading (basically Amdahl's law): if any nontrivial portion of the work you do is serial, then the speedup you get from multithreading will only be incremental. That idea lingered in my head throughout the whole redesign. Rather than jumping straight to an inherently multithreaded design, I felt it might work better to get a design that works well in a single thread but can have more threads added to it, i.e. a design where the work being done is independent of the threads doing it. I went back to having a single (main) thread each for the client and server, plus message-handoff threads that look like:
  // Message-handoff thread: block waiting on the socket, then forward each
  // message to the main thread's queue.
  let mut listen_socket =
    ReceiveSocket::new(
      listen_url.clone().as_slice(),
      Some(Duration::seconds(30)),
    );
  loop {
    let msg = listen_socket.read();
    server_recv_thread_send.send(msg).unwrap();
  }

These threads are basically always blocked, since they're either waiting for an incoming message (so blocked on interrupts) or sending a message (so probably blocked on a memcpy). They don't spend much time actually doing anything, so I pretty much ignore them (let me know if that'll get me into trouble).

The server's main thread looks like:
  in_series!(
    // Priority 1: update the world whenever a tick is due.
    {
      let updates =
        server.update_timer.lock().unwrap().update(
          time::precise_time_ns()
        );
      if updates > 0 {
        update_world(
          ...
        );
        true
      } else {
        false
      }
    },
    // Priority 2: apply pending updates from clients.
    {
      listen_thread_recv.lock().unwrap().try_recv_opt()
        .map_to_bool(|up| {
          let up = binary::decode(up.as_slice()).unwrap();
          apply_client_update(
            ...
          );
        })
    },
    // Priority 3: process terrain load requests (the gaia thread).
    {
      gaia_thread_recv.lock().unwrap().try_recv_opt()
        .map_to_bool(|up| {
          update_gaia(
            ...
          )
        })
    },
  );

The in_series macro runs its parameter "actions" in top-down order until one of them succeeds, then returns to the top of the list. If none of the actions succeeds, it explicitly yields the thread. The net effect is that it runs the top action until that fails, then, for as long as that keeps failing, runs the next action until it fails, and so on. So the server's main thread updates the world at a regular interval (e.g. update players and mobs, move the sun, etc.). If there's time available, it processes events from the listen socket. With any remaining time, it processes terrain load requests. Terrain requests are kept separate from world updates because they're less critical and cost more CPU time. That said, the world updates (top priority) decide what terrain to load, and load solid placeholder blocks until the real terrain arrives, so, for instance, the player won't fall through the world if terrain generation is too slow.
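
A simplified sketch of a macro with those semantics might look like the following (this isn't necessarily how in_series! is actually implemented, and the real thing presumably also has a way to exit the loop). Each "action" is an expression that returns true if it found work to do:

  macro_rules! in_series {
    ( $( $action:expr ),+ $(,)? ) => {
      loop {
        // Try the actions top-down; stop at the first one that did work.
        let did_work = loop {
          $(
            if $action {
              break true;
            }
          )+
          break false;
        };
        // Nothing had work to do: yield so other threads can run, then retry.
        if !did_work {
          ::std::thread::yield_now();
        }
      }
    };
  }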

The client has a similar main thread. If you wanted to process two of the actions above at the same priority, you could just merge the message queues into a single queue. So far, that hasn't been particularly necessary: realistically, we should be "running out" of world updates and client updates on a really frequent basis (i.e. several times a second), so none of the actions get starved. If we assume that's happening, then the exact ordering doesn't matter anyway.

This design is also easy to scale to multiple threads: I can just add a new thread running the same in_series-based code. The message queues and server data are Mutex-wrapped to be thread-safe, so the threads can run pretty independently (I still want to put more work into the locking side of things, so threads are less likely to subtly/unintentionally block one another). The one major wrinkle is that some things need to stay single-threaded, like the OpenGL calls in the client (I've been operating under the assumption that the OpenCL calls for terrain generation work the same way). In that case, you can just remove those actions from the in_series invocations of all but one thread.
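
As a rough, self-contained illustration of that idea (the names and types here are stand-ins, not Playform's): the shared state and queues live behind Arc<Mutex<...>>, and every worker runs the same loop.

  use std::collections::VecDeque;
  use std::sync::{Arc, Mutex};
  use std::thread;

  fn main() {
    // Stand-ins for the server state and a message queue.
    let world = Arc::new(Mutex::new(0u64));
    let queue = Arc::new(Mutex::new(VecDeque::new()));
    for msg in 0..100u64 {
      queue.lock().unwrap().push_back(msg);
    }

    // Any number of workers can run the same loop; adding throughput is just
    // spawning another one.
    let workers: Vec<_> = (0..4).map(|_| {
      let world = world.clone();
      let queue = queue.clone();
      thread::spawn(move || loop {
        // Lock just long enough to pop one message.
        let msg = queue.lock().unwrap().pop_front();
        match msg {
          Some(m) => *world.lock().unwrap() += m,
          None => break, // the real loop would yield and retry instead
        }
      })
    }).collect();

    for w in workers {
      w.join().unwrap();
    }
    assert_eq!(*world.lock().unwrap(), (0..100u64).sum::<u64>());
  }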

The last 20%
The async IO redesign was a big chunk of the client/server refactoring, but a few more things needed to happen to get performance back to where it was before - I wanted to be able to run a local server and local client and have the same smoothness I had when they were bundled together in the same process.

Message-passing: A big bottleneck that came up was the message-passing between client and server. They communicate over sockets (since, ideally, I want this to also work over at least LAN), so where before we could pass data around by just handing over a pointer, now we actually need to send the data, byte by byte, in a serialized format (no pointers for things like dynamic arrays!). I was using JSON originally, just because it's easy and built into the standard library, but it was too verbose and too slow. I tried to use Cap'n Proto, but the refactoring to make it work was getting to be too much (I did get a cool new repo out of it, though). I ended up just writing my own "Flatten" trait to turn data types into "flat" memory blocks, and using that.
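
The idea is roughly the following (a simplified sketch, not the exact trait from Playform): types know how to write themselves into a flat byte buffer and read themselves back out, and dynamic arrays get a length prefix instead of a pointer.

  trait Flatten: Sized {
    // Append this value's bytes to `buf`.
    fn emit(&self, buf: &mut Vec<u8>);
    // Read a value back out of `buf`, advancing `offset`.
    fn read(buf: &[u8], offset: &mut usize) -> Self;
  }

  impl Flatten for u32 {
    fn emit(&self, buf: &mut Vec<u8>) {
      buf.extend_from_slice(&self.to_le_bytes());
    }
    fn read(buf: &[u8], offset: &mut usize) -> u32 {
      let mut bytes = [0u8; 4];
      bytes.copy_from_slice(&buf[*offset .. *offset + 4]);
      *offset += 4;
      u32::from_le_bytes(bytes)
    }
  }

  // Dynamic arrays get a length prefix instead of a pointer.
  impl<T: Flatten> Flatten for Vec<T> {
    fn emit(&self, buf: &mut Vec<u8>) {
      (self.len() as u32).emit(buf);
      for x in self {
        x.emit(buf);
      }
    }
    fn read(buf: &[u8], offset: &mut usize) -> Vec<T> {
      let len = u32::read(buf, offset) as usize;
      let mut out = Vec::with_capacity(len);
      for _ in 0..len {
        out.push(T::read(buf, offset));
      }
      out
    }
  }

Sending a message is then just an emit on one side and a read on the other; the buffer contains no pointers, so it can go straight over the socket.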

Rate limiting: The client asks the server to send it all the blocks of terrain it wants to load. The end goal for the client is "load all blocks within some radius R". We could just send one "gimme all blocks within radius R" packet to the server, but then if we move a little to the right, we don't want to be sent all those blocks again. Ideally, we could ask for just the new "slice" of blocks, but we don't have any guarantee that all the previous ones finished loading, or what order they loaded in (the server can really send them back in any order, regardless of the requested order - we want to load blocks as soon as they're available, don't we?). So although it might be possible to avoid individually requesting blocks all the time, the work we'd have to do to request in bigger "chunks" doesn't seem worth it.

Instead, let's just handle that huge stream of requests better. More compact, efficient serialization was a big part of it. Handling terrain loading as a lower-priority action was another big step, since handling client terrain requests as part of the client update action would lag all the other event processing from the client (e.g. you'd press W and the player wouldn't move, because the server was still busy dealing with your client's terrain requests). But there were still just too many requests: the client updates were still being slowed down by all the terrain traffic. The obvious, simple fix that actually worked was to rate-limit the client. Ideally, this would be done on the server's end (since we'll have to do that eventually for security reasons anyway), but for now, rate limiting on the client's end is good enough - just don't be as eager about sending requests.
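
The rate limiting itself can be as simple as capping the number of requests in flight. A sketch of what that might look like (all names here are illustrative, not Playform's actual types):

  use std::collections::VecDeque;

  // Stand-in for whatever the real block coordinate type is.
  #[derive(Clone, Copy)]
  struct BlockPosition([i32; 3]);

  struct TerrainRequester {
    in_flight: usize,
    max_in_flight: usize,
    pending: VecDeque<BlockPosition>,
  }

  impl TerrainRequester {
    fn new(max_in_flight: usize) -> TerrainRequester {
      TerrainRequester {
        in_flight: 0,
        max_in_flight: max_in_flight,
        pending: VecDeque::new(),
      }
    }

    // Queue up a block we'd like the server to load.
    fn enqueue(&mut self, block: BlockPosition) {
      self.pending.push_back(block);
    }

    // Called every client tick: send only as many requests as the cap allows.
    fn pump<F: FnMut(BlockPosition)>(&mut self, mut send: F) {
      while self.in_flight < self.max_in_flight {
        match self.pending.pop_front() {
          Some(block) => {
            send(block);
            self.in_flight += 1;
          }
          None => break,
        }
      }
    }

    // Called whenever the server delivers a loaded block.
    fn on_block_loaded(&mut self) {
      if self.in_flight > 0 {
        self.in_flight -= 1;
      }
    }
  }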

Don't be an idiot: The last optimization that made a big difference was about the lowest-hanging fruit there is: at some point during the switch (or maybe it had been this way the whole time), I accidentally removed the code that filtered out requests for blocks that were already loaded. When the client moved a few steps in any direction, we'd re-request roughly render_distance^3 blocks instead of roughly render_distance^2. Fixing that made a pretty noticeable difference, and gave the performance that last push back to smoothness!
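
The filter itself is conceptually tiny - something like this sketch (illustrative names): track which blocks have already been requested, and only send requests for new ones.

  use std::collections::HashSet;

  fn request_new_blocks<I, F>(
    wanted: I,
    already_requested: &mut HashSet<[i32; 3]>,
    mut send: F,
  ) where
    I: Iterator<Item = [i32; 3]>,
    F: FnMut([i32; 3]),
  {
    for block in wanted {
      // `insert` returns true only if the block wasn't in the set yet.
      if already_requested.insert(block) {
        send(block);
      }
    }
  }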

Monday, March 2, 2015

Stripping Down the Scheduler

In my last post, I laid out a high-level client/server design based around a custom scheduler enqueuing and prioritizing closures. The basic idea is still sound, but somewhat surprisingly (at least to me) the "closure" part of this design is the least relevant one.

All the queued closures would originate from somewhere: either from incoming messages or from timed signals. If we allocate a closure for every single event, that means a whole lot of small, repetitive heap allocations. We can cut down on the churn a little by batching up messages, but then we sacrifice some flexibility and latency.

But more importantly, why use closures at all? The set of things that can cause a closure to be enqueued is small and well-defined - we can keep the priority- and deadline-based approach, but implement it on top of run-of-the-mill message queues instead. We have to add a few enums to describe the messages, but that's okay. Then a fixed-size task pool can pop and process the prioritized events just like before.
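
In other words, something along these lines (the message variants here are placeholders, not the actual set):

  enum ServerEvent {
    UpdateWorld { now_ns: u64 },
    ApplyClientUpdate { client_id: u32, payload: Vec<u8> },
    LoadTerrain { block: [i32; 3] },
  }

  fn handle(event: ServerEvent) {
    match event {
      ServerEvent::UpdateWorld { .. } => {
        // step the simulation
      }
      ServerEvent::ApplyClientUpdate { .. } => {
        // decode and apply the client's input
      }
      ServerEvent::LoadTerrain { .. } => {
        // hand the block off to terrain generation
      }
    }
  }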

Once again, I find myself bitten by trying to think too generically.