RPC for Twisted Python servers

Lesson learned: if you need an RPC protocol for Python/Twisted, don’t use AMP, use HTTP (with urllib if necessary for synchronous calls). AMP’s 65k message limit is really annoying to work within, and if you strace it you’ll see it doing tons of 1-byte reads/writes. HTTP overhead sucks, but I’d rather have that than the limitations of AMP.

edit: after a few more months of experience, this turns out to be wrong. AMP is actually not too bad in terms of the socket calls it makes, and you can work around the 65k message limit (at the cost of extra round-trips) by using a chunked result protocol. urllib, on the other hand, is pretty awful for synchronous HTTP. It is the source of those 1-byte reads/writes, because it doesn’t have a buffering mechanism between the socket and the Python file interface it is trying to implement.
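For the record, the chunking idea is simple; here is a rough Python sketch of just the splitting and reassembly logic (the actual AMP/RPC plumbing is elided, and the exact size limit is a guess kept safely under AMP's 65535-byte cap):

```python
# Sketch of the "chunked result protocol" idea: a reply too large for
# AMP's value limit is split into numbered chunks and fetched with
# extra round-trips. Function names here are made up for illustration.

CHUNK_LIMIT = 65000  # stay safely under AMP's 65535-byte value limit

def split_result(payload: bytes):
    """Server side: break one large result into AMP-sized chunks."""
    return [payload[i:i + CHUNK_LIMIT]
            for i in range(0, len(payload), CHUNK_LIMIT)] or [b""]

def join_result(chunks):
    """Client side: reassemble after fetching each chunk in turn."""
    return b"".join(chunks)

payload = b"x" * 200_000
chunks = split_result(payload)
assert all(len(c) <= CHUNK_LIMIT for c in chunks)
assert join_result(chunks) == payload
```

Each extra chunk costs a round-trip, which is the trade-off mentioned above.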

John Carmack on data storage for graphics

John Carmack’s annual QuakeCon addresses are gold mines of insight on 3D graphics and game development (2011 link).

One interesting point in this year’s talk was about how to structure files and I/O for data-intensive graphics. John came out very strongly in favor of what you might call an “mmap-in-place” approach. Today his code still uses heavily-compressed “pack” files that are read serially, but that’s mostly due to storage space limitations on game consoles. In the future he is favoring binary formats that are very close to the final in-memory representation of the data, so that you just mmap() the file and use it in place rather than running it through some kind of parser.

This surprised me because most of my experiments with mmap have not shown big wins relative to conventional serial file I/O. (e.g. I once hacked an MP3 player to use mmap() rather than read(), and was surprised to find that it performed poorly and trashed the buffer cache). Modern operating systems are very good at loading and caching big linear files on disk, but not as good at handling lots of random paging on memory-mapped files. I couldn’t figure out why John thought mmap-in-place was the ideal solution, until it occurred to me that his use cases are qualitatively different from mine.
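To make the contrast concrete, here is a toy mmap-in-place reader in Python (the record format is invented for the example). Once the on-disk layout matches the in-memory layout, "loading" a record is just indexing into the mapping, with no parsing pass:

```python
# Minimal illustration of "mmap-in-place": the file's layout matches the
# in-memory layout, so reading a record is pointer arithmetic rather than
# deserialization. The vertex record format here is made up.
import mmap, os, struct, tempfile

RECORD = struct.Struct("<3f")  # one vertex: x, y, z as little-endian floats

path = os.path.join(tempfile.mkdtemp(), "verts.bin")
with open(path, "wb") as f:
    for i in range(1000):
        f.write(RECORD.pack(float(i), 0.0, 0.0))

with open(path, "rb") as f:
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # "Use in place": index straight into the mapping; the OS pages in
    # only the 4KB region actually touched.
    x, y, z = RECORD.unpack_from(m, 500 * RECORD.size)

assert x == 500.0
```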

Let’s contrast the two I/O approaches in a few ways. First, does a given access actually need to hit the disk? If all of one’s data fits into RAM, then neither I/O system will require disk access. mmap() will be more efficient because there is only one copy of the data in memory rather than two, and program code can access it directly. This is actually a very important consideration for future-proofing. Any code that does complicated things to get around memory limitations should have a good “fast path” that kicks in once Moore’s Law makes those limitations obsolete. For example, a Linux kernel developer once remarked that any database that uses unbuffered disk I/O should include an option to fall back to regular buffered I/O, or else it will perform very poorly in cases where RAM is actually big enough to hold the entire working set. Note that in John’s game-engine world, game levels are specifically designed to always fit into the available memory on target platforms, so the engine is always on this “fast path.” In the offline rendering world, by contrast, I can’t guarantee that my datasets will always fit in RAM, so mmap-in-place may end up causing more I/O than reading everything serially.

Second, consider the issues of disk access locality and latency. At first glance, it seems that serial I/O on a well-designed, compressed file format is ideal, because disk reads are large and linear, whereas mmap() I/O is inefficient because the access pattern is random. However, I believe John makes an unstated assumption that most of the bulky data consists of graphical details that can be loaded asynchronously, like high-resolution textures and models, and not “core” data structures that must be present in order for the engine to run. In this case, I/O latency and locality are less important. Also, I think John assumes the use of a clever virtual-memory scheme as in his MegaTexture system, which improves locality of access.

So, in a game engine where the working set usually fits into available RAM, and where data can be paged in asynchronously, mmap-in-place does make a lot of sense as a data storage architecture. But for offline applications where you don’t have enough RAM for everything, and where reads have to be synchronous, mmap may not be the ideal approach.

All of this has got me thinking in more detail about what the true disk/memory storage needs are for high-end offline rendering. We spend a lot of time developing clever tricks to minimize memory needs, like on-demand geometry tessellation (REYES/procedurals), mip-mapping, and brickmaps. Most of my rendering optimizations boil down to trying very hard to minimize how much geometry needs to be kept in RAM. It’s interesting to take a step back and think about how much of this work is really necessary. After all, RAM is getting ridiculously cheap. Optimizations to squeeze a scene into 4GB might be useless or even counterproductive when you’ve got 16GB. Is there some point at which we can just dump everything into a naive ray-tracer and forget about all of this annoying optimization work?

Mip-mapping and brickmaps have more or less completely solved the problem of texture memory access. By selecting mip-map tiles using screen-space metrics, we’ve gotten pretty close to optimal in terms of I/O and memory needs for 2D and 3D textures. The remaining problem is geometry. Smart culling and REYES do a fantastic job on camera-visible geometry; the real difficulty is ray-visible geometry. You can only fit so many million tessellated micropolygons in RAM, and given the poor locality and wide scope of ray paths, there isn’t a clear upper bound on what might need to be tessellated as there is with straight camera-visible geometry.

You’ve also got the problem of modifications to geometry – clever ray-tracing data structures usually aren’t designed for cases where major pieces of the scene are transforming or deforming every frame. This is why ray tracing hasn’t completely taken over from REYES in production. Ray tracing is theoretically O(log N), but that’s only after you have built an acceleration data structure. In practice it’s more like O(N) because you still need to stream all that source geometry into the system to get it ready to be traced. As of today, this means storing your models on disk, then serially reading those files and translating them into a ray-tracer-friendly data structure in memory. For my current project, which isn’t all that geometry-heavy, this involves going through 100-200MB of data every frame. If we are ever going to do high-quality rendering at interactive frame rates, this will need to change. John’s talk suggests the interesting approach of encoding models into some kind of virtually-paged ray acceleration structure. Perhaps we could run a pre-pass on baked models and animations, converting them into some kind of special binary format that the renderer can mmap on demand.

Animating Mars Rovers

The Mars Rover’s six-wheeled “rocker-bogie” suspension system is an ingenious mechanical design that allows the vehicle to crawl nimbly over rocky terrain. Visualizing how this system moves has been one of my biggest challenges.

Let’s start with how the rover works in reality. Each of the six wheels has its own electric motor. When the rover wants to go somewhere, it uses those motors to turn the wheels forward. The many joints in the suspension and the rover body are pulled along by friction forces on the wheels, transmitted through the suspension structure. Disregarding side-to-side movement, turning, and slippage, there seem to be five degrees of freedom: two bogie angles, the rocker angle, plus pitch and bank on the rover body. (there is only one degree of freedom in the rocker because the left and right rockers are linked through a differential that ensures they move equally far in opposite directions). The motion of all five joints is determined entirely by the geometry of the wheels contacting the ground.
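To illustrate the differential constraint in isolation (this is my reading of the linkage, not actual JPL kinematics): the body pitch ends up at the average of the two rocker angles measured against the ground, so the rockers’ deflections relative to the body are equal and opposite.

```python
# Toy model of the rocker differential: given each rocker's angle against
# the ground, the differential forces equal-and-opposite deflections
# relative to the body, which puts the body pitch at their average.

def solve_rocker_differential(left_ground_angle, right_ground_angle):
    body_pitch = 0.5 * (left_ground_angle + right_ground_angle)
    left_deflection = left_ground_angle - body_pitch
    right_deflection = right_ground_angle - body_pitch
    # The differential constraint: deflections cancel exactly.
    assert abs(left_deflection + right_deflection) < 1e-9
    return body_pitch, left_deflection, right_deflection

pitch, dl, dr = solve_rocker_differential(10.0, 4.0)
# pitch == 7.0, dl == 3.0, dr == -3.0
```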

Unfortunately, current off-the-shelf animation software is pretty bad at modeling this system. I have considered two basic approaches: first, what I call a “static” approach, where you specify the XZ position and heading of the rover body, and then attempt to determine the rest of the degrees of freedom based on this. The static approach is not “path-dependent” – it solves for the joint positions on each frame independently, without considering forces or inertia. Second, the “dynamic” approach would actually simulate the full physical system evolving over time.

I tried the “dynamic” approach first, but ran into serious problems. Maya’s rigid body dynamics solver appears to use forward integration and therefore has trouble dealing with the very stiff forces necessary to hold the suspension system together (e.g. the differential bar). Furthermore, its collision model does not come close to generating realistic interaction with the ground. In order to model wheels, it is important to be able to specify friction independently on the lateral and longitudinal axes, but Maya only offers a single isotropic friction control. The new nCloth solver seemed like it would give better results, but nCloth does not handle rigid bodies yet (and using very stiff soft bodies makes the joints seem like they are made of Jell-O).

The “static” approach is somewhat more animator-friendly, because you can easily scrub through time and “puppeteer” the rover motion rather than having to run a full simulation first. (and Maya’s simulation baking feature is horribly broken, but that’s a rant for another day). Disregarding forces and inertia, and just solving for the constrained joints seems like a fairly simple geometry problem. Given the rover body position, you compute the XZ coordinates of the wheels, sample the terrain there to determine the wheel heights, then derive all of the suspension joint angles and rover body pitch/bank from the mechanical constraints. Unfortunately, this seems to be a nonlinear problem, because changing the joint angles also changes the wheel positions, which affects the height of the terrain underneath them. I think a fully “correct” solver would have to use an iterative approach: sample the wheels at zero joint deflection, compute the joint angles, then re-sample the terrain at the updated wheel positions, and continue iterating until the system settles down in a stable configuration.
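Collapsed down to one wheel and one joint angle, the iteration would look something like this (terrain() and the linkage geometry here are toy stand-ins, not the real rover rig):

```python
# Sketch of the iterative "static" solver: sample the terrain under the
# wheel, derive the joint angle, re-sample the terrain at the moved wheel
# position, and repeat until the system settles into a fixed point.
import math

def terrain(x):
    return 0.1 * math.sin(x)          # toy height field

ARM = 2.0                              # toy suspension arm length

def solve_wheel(body_x, iterations=20):
    angle = 0.0                        # start at zero joint deflection
    for _ in range(iterations):
        wheel_x = body_x + ARM * math.cos(angle)   # joint angle moves the wheel
        height = terrain(wheel_x)                  # re-sample terrain there
        angle = math.asin(height / ARM)            # re-derive the joint angle
    return angle

# At convergence, the wheel rests exactly on the terrain surface.
a = solve_wheel(1.0)
```

For small deflections the update is a strong contraction, which is why the one-iteration rig described below already works well in practice.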

At this point I have built a rig based on Maya expressions that performs just one iteration. Essentially it makes the approximation that the terrain height underneath each wheel does not change as the joints rotate, which is true in the limit of small joint deflections. This is actually good enough to use for animation shots, with some additional manual keyframe controls to fudge things where the rig fails.

Thinking about this some more, I bet I could develop a “second-order” rig that uses the basic rig as input, but then re-samples the terrain height at the wheel positions determined by the first-order rig. It’s like a poor-man’s iterative solver, done manually with Maya expressions…

Edit: Of course, I did go and read a bunch of papers from the robotics literature about how engineers actually simulate rocker-bogie systems. However, it seems like in all of these cases, they assume you already know the joint angles and wheel contact geometry, and are just solving for forces, or trying to figure out how to rotate the wheels to drive in a given direction. I could not find a paper that gives any guidance about how to simulate driving “from scratch” over arbitrarily complex terrain.

On one hand

I remember back when I needed more than one hand to count render time in minutes. Now if an HD frame takes more than 2 minutes I get antsy.

How do I keep render times short? 1) Use an absolute minimum amount of ray tracing. I cache ambient occlusion in point clouds for non-deforming objects (and also for many deforming objects – it really doesn’t look bad). 2) Stream all input to the renderer in a single pass. No writing RIB or other big intermediate files to disk.

Maya 2012 bugs

I do admire Autodesk’s work in porting Maya from its old (XForms-based?) interface to Qt. The transition is almost seamless, except for a few annoying UI bugs. These are minor bugs requiring a keystroke or two to work around, but many are in the “critical path” for working animators, so they seriously hurt usability.

  • the script editor button does nothing
  • focus does not return to original window after pressing “Enter” on numeric keyframe time/value fields
  • same issue for channel box fields
  • A/F “focus selected”/”focus all” keys randomly fail to work
  • import “resolve only clashing nodes” option does not work
  • play/stop toggle key occasionally fails to work, or waits several frames before taking effect
  • pop-up dialogs cannot be dismissed with “Enter” key unless first clicked to set focus (despite looking like they are focused)

Organizing scenes 2

I implemented the new scene layout convention. It feels awkward, but it works. As a benefit, I can now in theory load more than one scene into interp at once, without any interference. This is now the third time I’ve implemented something akin to a module/package system in interp. It’s definitely crying out for one. I’m pretty sure I have all the code I need to implement it. I just need to make sure whatever solution I come up with works smoothly in the presence of parameter state, and the toplevel syntax has to be logical and clean.

Organizing Scenes

With terrain mostly under control now, I spent some time trying to re-organize how scenes are put together and fed to interp.

Historically, I have built each scene as a single, large .mcp file that pulls in a few common definitions from a project-specific .mcp library. The common part is rather small, including only a handful of things like file locations and shader definitions. The rest of the scene consists of a large blob of interp code in the scene-specific .mcp file. This approach has the following features:

  • Advantage: scenes are largely independent of each other. Assets can be tweaked from scene to scene easily without interference.
  • Disadvantage: a large amount of interp code is duplicated across scenes. (e.g., instructions for how to render the main element, shadow passes, and atmosphere, and how to comp it all together). Updates to common assets are not propagated across the duplicates.
  • Disadvantage: scene files are hard to read. The ratio of “important” code to boilerplate is low.

For my current animation project, I’m trying a different approach, putting as much of the boilerplate code as possible into common files. Scenes are much smaller now, down from ~500 lines to ~200 lines or so. Common elements reference the same sources. However, accomplishing scene-specific tweaks while keeping everything well-organized is a challenge.

The problem has to do with interp’s stateless design. I have always believed very strongly that a practical scene processor must be stateless, in order to allow parallelism and caching. Symbol definitions can only reference previously-declared symbols (or themselves), so everything is built in a bottom-up manner. This works fine with abstract math and algorithms, but production scenes are another matter. They contain lots of mutually-referencing definitions, some of which stay constant for an entire project, and some of which must vary on a scene-by-scene basis (ideally with a default value that can be picked up without writing any code). For example, the outer structure of the comp tree and main scene graph are usually constant for a project. However, these reference “inner” symbols, like the list of active objects, and some lighting parameters, which can vary by scene. This prevents a straightforward bottom-up structure for the .mcp files, because the common “headers” at the top of a scene need to reference varying things declared in the main body of the file. (old-fashioned monolithic scene files are built bottom-up, but they must duplicate a ton of boilerplate because the scene-specific tweaks start very close to the “bottom”).

I’m also a strong believer in referential transparency and early binding. This means I won’t use preprocessor tricks that operate at the textual level (like C #defines). Definitions should not be updated “behind the back” of the interpreter. Interp does provide a way for specific symbols to “break” statelessness. These are called “parameters” and are used for changing the evaluation environment in a stack-like manner. The classic cases are the symbols for “time”, “width,” and “height” of the current rendered image, since those can change throughout a comp tree evaluation, and it’s very ugly to pass them as explicit parameters to every single function. However, I insist on keeping the use of parameters to an absolute minimum, since they interfere with caching and slow down the interpreter (due to the symbols being bound late). In fact, I consider any use of parameters beyond time and region-of-interest information to be highly questionable. There is no way I’d make every single scene-varying data item a parameter.
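For illustration, a parameter behaves roughly like a dynamically-scoped variable with stack discipline. In Python terms (this is an analogy for explanation, not interp’s actual implementation):

```python
# A parameter is a symbol whose binding can be pushed for the dynamic
# extent of an evaluation and popped afterwards, so values like "time"
# don't have to be threaded explicitly through every function call.
from contextlib import contextmanager

_params = {"time": [0.0], "width": [1920], "height": [1080]}

@contextmanager
def bind(name, value):
    _params[name].append(value)     # push on entry
    try:
        yield
    finally:
        _params[name].pop()         # pop on exit, restoring the outer binding

def param(name):
    return _params[name][-1]        # late binding: read the innermost value

with bind("time", 42.0):
    inner = param("time")           # 42.0 inside the dynamic extent
outer = param("time")               # back to 0.0 outside
```

The late lookup is exactly what makes parameters costly: every reference goes through the stack instead of being resolved once, and cached results become conditional on the current bindings.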

Another complication is my long-term plan that involves interp running as an interactive server process. Most scene description systems have serious trouble with interactive usage because they are designed to operate start-to-end as a batch process, like a C compiler. My long-term vision is for interp to read the library files and scenes for an entire project, keeping them all “live” in memory — or in some sort of object-oriented, version-controlled database, where today’s .mcp files are just a snapshot of the database contents — while artists manipulate symbol values interactively. Text-level preprocessor tricks will never work in this case.

I’m currently thinking that a solution might involve managing a dictionary of scene-specific “properties” (which are given default values in a project-wide parent dictionary, then optionally overridden one by one in a scene-specific child dictionary). The dictionary would, unfortunately, have to be a parameter in order to allow library functions to reference things defined below. Something like this:

library.mcp says:
       (def default-values (dict "prop1" 123.0 "prop2" 456.0))
       (defparam scene-values)
       (def scene-values default-values)
then library functions say:
       (def myfunc (+ (get-scene-value "prop1") (get-scene-value "prop2")))
and the scene looks like this:
       (execfile "/project/library.mcp")
       (set scene-values (dict default-values "prop2" 789.0))
       (def foo (myfunc))
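For comparison, Python’s collections.ChainMap captures the same parent/child override pattern I’m describing (minus the parameter machinery):

```python
# The project-wide defaults sit in a parent dict; a scene overrides
# individual properties in a child dict, and everything else falls
# through to the defaults.
from collections import ChainMap

default_values = {"prop1": 123.0, "prop2": 456.0}          # library-level
scene_values = ChainMap({"prop2": 789.0}, default_values)  # scene-level

def myfunc(values):
    return values["prop1"] + values["prop2"]

# myfunc sees the scene's override for prop2 and the default for prop1.
result = myfunc(scene_values)   # 123.0 + 789.0 == 912.0
```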

Tiled image input

I went ahead and added tiled image input to interp, in order to make the ROAM terrain generator more efficient. It was a little easier than I thought – about a full day’s work, although a substantial portion of that was me adding tiling support to my custom EXR loader, which I ended up not using in favor of good old LibTIFF. (mainly because EXR does not support 16-bit or 8-bit integer channels, which I need for storing terrain heights and colors efficiently). Luckily I didn’t need to write a tiling utility, because PRMan ships with one called “tiffcopy.”

Combined with parallel processing, terrain generation times have gone down significantly, from about 60 seconds to 15 seconds for a decent-quality full-res shot (shadingrate = 5). The tiled image input did not speed things up much by itself, but it greatly reduced memory usage, which allowed me to run more concurrent threads. My laptop has only 4GB of RAM and could only run 6 threads before; now it’s up to 8. That’s where most of the speed-up came from.

I realized that my ROAM generator has very bad locality in its texture lookups. This is hindering further texture-related efficiency gains. Unlike a REYES renderer, which marches across a finely-tessellated surface, the ROAM generator jumps all over the place. It starts with planet-scale triangles and alternates between them as it subdivides the world more and more finely. Delaying vertex evaluation and sorting batches in lat/lon space has helped, and it’s probably worth looking more in this direction in the future.

I also haven’t gotten as far as estimating filter sizes and performing actual mip-mapping yet. Reducing the total number of texels touched is likely to give more speed-ups.

Terrain Rendering Optimization

Lately terrain generation has been taking up a majority of my render times, so I spent a day working on some optimizations to the terrain generator. This is an interpreter function that takes as input a terrain function (mapping latitude/longitude to height/color) and camera parameters, then uses the ROAM algorithm to generate a high-resolution view-dependent mesh of the planetary terrain (roughly 1 polygon/pixel), which it outputs as RIB.

The output and rendering side of the terrain generator is already well-optimized; it sorts the mesh into screen-space-coherent chunks and feeds them to RenderMan in a way that renders very efficiently. The main bottleneck is in evaluating the terrain function. Simple math operations run quickly because the interpreter already uses a PRMan-like SIMD scheme that processes large batches of points in a cache-friendly manner. However, about 90% of terrain generation time is spent sampling large image maps.

A typical surface-level shot requires taking hundreds of thousands of bicubically-interpolated samples from ten or so image maps (some storing heights and some storing colors) totaling 1GB or so of LZW-compressed TIFF files (~5GB if uncompressed). The sampler is not very efficient: it does keep an LRU cache of uncompressed scanlines in memory to avoid going to disk for each sample, but that’s about the only optimization. It is a long way from a PRMan-style texture() function. The main “pain point” is that the sampler loads in a lot more texture data than is really necessary to render a frame. Textures on disk are just scanline-oriented TIFF files, so the sampler needs to read and decompress many scanlines in order to satisfy a single sample query.
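The scanline cache itself is nothing fancy; in Python terms it amounts to something like this, with decode_scanline() standing in for the LibTIFF read-and-decompress step:

```python
# A minimal version of the sampler's scanline LRU cache: decoded
# scanlines live in an OrderedDict keyed by (image, row), and the least
# recently used row is evicted once the cache is full.
from collections import OrderedDict

class ScanlineCache:
    def __init__(self, capacity, decode_scanline):
        self.capacity = capacity
        self.decode = decode_scanline
        self.cache = OrderedDict()
        self.misses = 0

    def row(self, image, y):
        key = (image, y)
        if key in self.cache:
            self.cache.move_to_end(key)      # mark as recently used
            return self.cache[key]
        self.misses += 1
        data = self.decode(image, y)         # slow path: read + decompress
        self.cache[key] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict least recently used
        return data

cache = ScanlineCache(64, lambda image, y: [y] * 4)
cache.row("heights.tif", 10)
cache.row("heights.tif", 10)     # second access hits the cache
assert cache.misses == 1
```

The weakness is exactly what the text describes: with a scanline-oriented format, one cold sample query can force many such decode calls before the four texels needed for a bicubic tap are even in memory.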

I saw basically two approaches to optimizing this part of the system: first, reduce the average amount of texture data loaded for each sample, and second, make the loading process go faster. The first approach is more powerful, and can follow the design of tile-caching texture() functions used by most renderers. However, it involves writing a LOT of new code: breaking planar textures into tiles and pyramids, changing the sampler to use an LRU tile cache, and most importantly, figuring out how deep into the pyramid a given sample access must go. This is a non-trivial problem for a terrain generator. Unlike an ordinary texture lookup, the results of the image sampling will affect the geometry of the terrain and also the ROAM tessellation process. In theory it’s not possible to know what detail level is appropriate for a given sample, because the results affect where the geometry lands in 3D space.

However, in practical situations, I can imagine dividing sample lookups into two classes: a high-level “lay of the land” class that only samples the highest available map resolution (this would be used for planet-scale maps), and a low-level “detail” class that is allowed to “cheat” by stopping higher in the image pyramid for samples far from the camera. (note: interp re-tessellates the terrain each frame, so frame-to-frame coherence is not a problem; this would be an issue if the terrain height for a given lat/lon position could change depending on where the camera is). Alternatively, just using a tile-based texture format rather than pyramids would probably make the system run faster than it currently does, although this wouldn’t be as efficient as a detail-based lookup. I will keep both of these options in mind for future optimization.
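The “cheat” for the detail class could be as simple as clamping a pyramid level derived from the projected texel footprint. A sketch, with all the constants invented for illustration:

```python
# Pick a mip level for a "detail" sample: estimate how many texels one
# screen pixel covers at this distance, take log2, and clamp to the
# pyramid depth. Distant samples stop higher in the pyramid.
import math

def mip_level(distance, texels_per_pixel_at_1m, max_level):
    footprint = distance * texels_per_pixel_at_1m  # texels per pixel here
    return min(max_level, int(math.log2(max(footprint, 1.0))))

near = mip_level(1.0, 0.1, 8)     # close to camera: finest level (0)
far = mip_level(5000.0, 0.1, 8)   # far away: coarsest level (8)
```

The “lay of the land” class would simply bypass this and always read the finest resolution, since planet-scale height maps shape geometry that detail samples later depend on.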

The second approach to optimizing the sampler is simply to make the sampling process go faster. At some point this is bottlenecked on disk speed. However, most of the compressed texture data fits into memory, so the real bottleneck is more in copying the compressed data from the kernel’s buffer cache into the sampler’s memory, and then performing LZW decompression there. I got an easy “quick win” by re-writing the TIFF file textures to store each scanline as a separate compression unit. (by default most TIFF writers group about 16 scanlines into a single unit; this interacts poorly with the scanline cache inside the sampler). I checked the speed of TIFF’s other compression method, Deflate, but it was much slower than LZW. (and I’m not going to store textures uncompressed on disk). I also got a small speed increase from sorting each SIMD batch of vertex evaluations by latitude – this improves locality of texture accesses. The big speed gain, however, came from distributing vertex evaluation across many processor cores.

Again, it seems counter-intuitive that a “dumb” sampler would benefit from multicore processing, since it is more I/O bound than CPU bound. However, note as above that the compressed texture data fits into RAM, so really we are memcpy-and-decompress bound rather than read-from-disk bound.

The interp system is based on a stateless functional language specifically so that its computations can be parallelized easily. In many cases it’s obvious where and how to split a large function up so that each CPU computes a portion of the result independently (sort of a mini map-reduce). I had already experimented earlier with breaking PRMan renders into screen-space chunks and computing each one in parallel. This actually works great, provided that the edges can be kept seamless. However, since ROAM outputs a unified mesh for the whole planet, this approach entails a lot of redundant computation, and it also interacts poorly with PRMan’s own multithreading (PRMan really wants the entire machine to itself while rendering)*. So, I decided to try splitting on the vertex evaluations. The original ROAM system evaluates vertices one at a time, but as part of my SIMD optimization I modified the algorithm to batch up position evaluations as much as possible. The ROAM algorithm doesn’t actually need to know the position of a new vertex immediately; it can keep working on other splittable triangles and then come back later to see if the new triangle it made still needs to be split further. In this way I’m able to batch up terrain evaluations into groups of hundreds or thousands of vertices, each of which takes a non-negligible amount of time to compute, and all of which are entirely independent.

In order to reduce coding time and complexity, my usual approach to distributed processing is to fork() the main process a few times, open pipes to each child process, then sit in a poll() loop sending batches of work to each child and sleeping while they are busy. I find this far easier than traditional multithreaded programming, because there is no shared state and inter-task communication is 100% explicit through the poll() loop and pipes. For RIB handling and external rendering I pass the results over pipes or sockets, but here for efficiency I use shared memory: the parent process mmap()s a buffer shared to all child processes, where the parent writes input lat/lon positions and the children write position/color outputs in designated places.
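Stripped to its essentials, the scheme looks like this in Python (the “terrain evaluation” is just squaring the input, and each worker gets a fixed slice instead of the real dynamic batch dispatch):

```python
# fork() workers sharing an anonymous mmap buffer: the parent writes
# inputs, each child evaluates its slice in place and reports completion
# over a pipe, and the parent sleeps in poll() until all slices are done.
import mmap, os, select, struct

N, WORKERS = 8, 2
buf = mmap.mmap(-1, N * 8)            # anonymous, MAP_SHARED by default
for i in range(N):
    struct.pack_into("d", buf, i * 8, float(i))   # parent writes inputs

done_pipes = []
for w in range(WORKERS):
    r, wfd = os.pipe()
    if os.fork() == 0:                # child: evaluate its slice in place
        lo, hi = w * N // WORKERS, (w + 1) * N // WORKERS
        for i in range(lo, hi):
            (x,) = struct.unpack_from("d", buf, i * 8)
            struct.pack_into("d", buf, i * 8, x * x)
        os.write(wfd, b"done")
        os._exit(0)
    os.close(wfd)
    done_pipes.append(r)

poller = select.poll()                # parent waits for workers to report
for r in done_pipes:
    poller.register(r, select.POLLIN)
remaining = WORKERS
while remaining:
    for fd, _ in poller.poll():
        os.read(fd, 4)
        poller.unregister(fd)
        remaining -= 1
for _ in range(WORKERS):
    os.wait()                         # reap the children

results = [struct.unpack_from("d", buf, i * 8)[0] for i in range(N)]
assert results == [float(i) ** 2 for i in range(N)]
```

Because results land directly in the shared buffer, nothing bulky ever crosses the pipes; they carry only control messages.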

Here I discovered an unanticipated complication: the fork()ed children also share file descriptor state with each other. This caused libtiff to step all over itself because file seeks from different child processes would interfere with each other. In general interp is not designed for this situation (and it is definitely not thread-safe). However, I was able to get around it by plugging in custom I/O functions to libtiff that use pread()/pwrite() instead of read()/write(), which provide an explicit file offset to each call (tracked manually by following libtiff’s read, write, and seek calls). Yes, I know that Linux has some magic clone() flags that can turn off file descriptor sharing, but I don’t want interp to become too Linux-specific. (though of course fork() is a lost cause on Windows – my general philosophy is that “core” non-GUI/non-rendering features must work on all of Linux, OSX, and Windows, and everything must work across Linux and OSX. It’d probably work on any BSD too, but I haven’t tried).
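The pread() trick in miniature (Python rather than the libtiff C hooks, but the same idea): each logical handle tracks its own offset and never touches the kernel’s shared file position.

```python
# Two "handles" sharing one inherited file descriptor: because reads go
# through os.pread() with an explicit offset, one handle's activity
# cannot disturb the other's position, unlike lseek()+read().
import os, tempfile

class PReadHandle:
    def __init__(self, fd):
        self.fd = fd
        self.offset = 0          # tracked manually, mirroring seek/read calls

    def seek(self, pos):
        self.offset = pos        # no lseek(): kernel offset stays untouched

    def read(self, n):
        data = os.pread(self.fd, n, self.offset)
        self.offset += len(data)
        return data

path = os.path.join(tempfile.mkdtemp(), "scratch.bin")
with open(path, "wb") as f:
    f.write(b"ABCDEFGH")

fd = os.open(path, os.O_RDONLY)
a, b = PReadHandle(fd), PReadHandle(fd)   # share one descriptor
a.seek(0); b.seek(4)
assert a.read(4) == b"ABCD"               # b's position is unaffected
assert b.read(4) == b"EFGH"
```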

Back to the main story: distributing the vertex computations. I got a massive speed-up! Together with the sorting optimization and scanline-oriented TIFF files, terrain computation got about 2x to 3x faster on an 8-core machine (using about 6 worker processes – using more reduced performance due to batches becoming too small). I bet I could get another 2x or so from making the sampler smarter as described above, but that would likely require many days of coding, and so isn’t worth it for my current project. Also, terrain generation has gone from a majority to a minority of render time. Now the dominant factor is shading of secondary rays within PRMan. I’ve already added a few tweaks to my main shader that reduce the computation done on non-camera rays (using larger texture filters, not doing ambient occlusion lookups, etc). Still, I am down to about 2 minutes per frame with the quality knobs turned all the way up, which is certainly comfortable for this production.

* splitting a frame into screen-space chunks is still worth pursuing as an optimization for high-speed interactive rendering, where seamlessness isn’t a huge worry. I can’t wait to try running a whole bunch of AWS instances in parallel on one frame.