Terrain Rendering Optimization

Lately terrain generation has been taking up a majority of my render times, so I spent a day working on some optimizations to the terrain generator. This is an interpreter function that takes as input a terrain function (mapping latitude/longitude to height/color) and camera parameters, then uses the ROAM algorithm to generate a high-resolution view-dependent mesh of the planetary terrain (roughly 1 polygon/pixel), which it outputs as RIB.

The output and rendering side of the terrain generator is already well-optimized; it sorts the mesh into screen-space-coherent chunks and feeds them to RenderMan in a way that renders very efficiently. The main bottleneck is in evaluating the terrain function. Simple math operations run quickly because the interpreter already uses a PRMan-like SIMD scheme that processes large batches of points in a cache-friendly manner. However, about 90% of terrain generation time is spent sampling large image maps.
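
For concreteness, here is what that batching scheme looks like in miniature (a hypothetical C sketch with made-up names, not interp's actual internals): data is laid out as parallel arrays, and each opcode runs as one tight loop over the whole batch instead of being re-dispatched per point.

    /* Hypothetical sketch of batched opcode evaluation (names are
     * illustrative, not interp's real code). */
    typedef struct {
        int     n;           /* number of points in the batch */
        double *lat, *lon;   /* inputs, one entry per point   */
        double *height;      /* outputs, one entry per point  */
    } Batch;

    /* One opcode = one tight loop over the whole batch, so dispatch
     * overhead is amortized and the arrays stay cache-resident. */
    static void op_mul(double *dst, const double *a, const double *b, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = a[i] * b[i];
    }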

A typical surface-level shot requires taking hundreds of thousands of bicubically-interpolated samples from ten or so image maps (some storing heights and some storing colors) totaling 1GB or so of LZW-compressed TIFF files (~5GB if uncompressed). The sampler is not very efficient: it does keep an LRU cache of uncompressed scanlines in memory to avoid going to disk for each sample, but that’s about the only optimization. It is a long way from a PRMan-style texture() function. The main “pain point” is that the sampler loads in a lot more texture data than is really necessary to render a frame. Textures on disk are just scanline-oriented TIFF files, so the sampler needs to read and decompress many scanlines in order to satisfy a single sample query.
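
For reference, the scanline cache is conceptually something like the following (a rough C sketch against plain libtiff; the structure, the linear scan, the cache size, and the assumption of float samples are all illustrative rather than the sampler's real code). A bicubic lookup pulls four adjacent rows through this path, so a miss on any of them means decompressing.

    #include <stdlib.h>
    #include <tiffio.h>

    #define CACHE_LINES 4096              /* illustrative capacity */

    typedef struct {
        TIFF    *tif;                     /* which map the row came from */
        uint32_t row;
        uint64_t last_used;               /* timestamp for LRU eviction  */
        float   *pixels;                  /* decompressed scanline       */
    } CachedLine;

    static CachedLine cache[CACHE_LINES];
    static uint64_t   tick;

    /* Return a decompressed scanline, decoding it only on a cache miss.
     * (A real cache would use a hash table, not a linear scan.) */
    static const float *get_scanline(TIFF *tif, uint32_t row, uint32_t width)
    {
        int victim = 0;
        for (int i = 0; i < CACHE_LINES; i++) {
            if (cache[i].tif == tif && cache[i].row == row) {
                cache[i].last_used = ++tick;
                return cache[i].pixels;               /* hit          */
            }
            if (cache[i].last_used < cache[victim].last_used)
                victim = i;                           /* remember LRU */
        }
        CachedLine *c = &cache[victim];               /* miss: evict  */
        free(c->pixels);
        c->pixels = malloc(width * sizeof(float));
        TIFFReadScanline(tif, c->pixels, row, 0);     /* decompress   */
        c->tif = tif;
        c->row = row;
        c->last_used = ++tick;
        return c->pixels;
    }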

I saw basically two approaches to optimizing this part of the system: first, reduce the average amount of texture data loaded for each sample, and second, make the loading process go faster. The first approach is more powerful, and can follow the design of the tile-caching texture() functions used by most renderers. However, it involves writing a LOT of new code: breaking planar textures into tiles and pyramids, changing the sampler to use an LRU tile cache, and most importantly, figuring out how deep into the pyramid a given sample access must go. That last part is a non-trivial problem for a terrain generator. Unlike an ordinary texture lookup, the results of the image sampling affect the geometry of the terrain, and therefore the ROAM tessellation process itself; in theory it isn't possible to know the appropriate detail level for a sample, because the sample's result determines where the geometry lands in 3D space.

However, in practical situations, I can imagine dividing sample lookups into two classes: a high-level “lay of the land” class that always samples the highest available map resolution (this would be used for planet-scale maps), and a low-level “detail” class that is allowed to “cheat” by stopping higher in the image pyramid for samples far from the camera. (Note: interp re-tessellates the terrain from scratch each frame, so frame-to-frame coherence is not a problem; this would only become an issue if the terrain height at a given lat/lon position could change depending on where the camera is.) Alternatively, just using a tile-based texture format rather than full pyramids would probably make the system run faster than it currently does, although this wouldn’t be as efficient as a detail-based lookup. I will keep both of these options in mind for future optimization.
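
If I ever build the pyramid path, the level selection for the “detail” class would presumably look like the standard mip-level computation, driven by the sample's distance from the camera. None of this exists in the sampler today; the function and its parameters are purely hypothetical:

    #include <math.h>

    /* Hypothetical: pick a pyramid level for a "detail"-class sample.
     * Level 0 is full resolution; higher levels are coarser.  The
     * "lay of the land" class would always use level 0. */
    static int pyramid_level(double texels_per_meter, /* finest map resolution */
                             double dist_to_camera,   /* meters                */
                             double pixel_angle,      /* radians per pixel     */
                             int    num_levels)
    {
        /* Ground distance covered by one screen pixel at this range... */
        double meters_per_pixel = dist_to_camera * pixel_angle;
        /* ...and how many finest-level texels fall under that pixel.   */
        double texels_per_pixel = texels_per_meter * meters_per_pixel;
        int level = (texels_per_pixel > 1.0)
                        ? (int)floor(log2(texels_per_pixel)) : 0;
        if (level >= num_levels)
            level = num_levels - 1;
        return level;
    }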

The second approach to optimizing the sampler is simply to make the sampling process go faster. At some point this is bottlenecked on disk speed. However, most of the compressed texture data fits into memory, so the real bottleneck is more in copying the compressed data from the kernel’s buffer cache into the sampler’s memory and then performing LZW decompression there. I got an easy “quick win” by re-writing the TIFF file textures to store each scanline as a separate compression unit (by default, most TIFF writers group about 16 scanlines into a single unit, which interacts poorly with the scanline cache inside the sampler). I also checked the speed of TIFF’s Deflate compression, but it was much slower than LZW (and I’m not going to store textures uncompressed on disk). I got a further small speed increase from sorting each SIMD batch of vertex evaluations by latitude, which improves the locality of texture accesses. The big speed gain, however, came from distributing vertex evaluation across many processor cores.
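
The scanline-per-strip repacking is straightforward with libtiff; something along these lines is enough (a stripped-down sketch that only carries over the handful of tags that matter here; real code would copy more metadata):

    #include <tiffio.h>

    /* Rewrite a TIFF so each scanline is its own LZW compression unit. */
    static int repack_tiff(const char *src_path, const char *dst_path)
    {
        TIFF *in  = TIFFOpen(src_path, "r");
        TIFF *out = TIFFOpen(dst_path, "w");
        if (!in || !out)
            return -1;

        uint32_t width, height;
        uint16_t bps, spp, photometric, sampleformat;
        TIFFGetField(in, TIFFTAG_IMAGEWIDTH,      &width);
        TIFFGetField(in, TIFFTAG_IMAGELENGTH,     &height);
        TIFFGetField(in, TIFFTAG_BITSPERSAMPLE,   &bps);
        TIFFGetField(in, TIFFTAG_SAMPLESPERPIXEL, &spp);
        TIFFGetField(in, TIFFTAG_PHOTOMETRIC,     &photometric);
        TIFFGetFieldDefaulted(in, TIFFTAG_SAMPLEFORMAT, &sampleformat);

        TIFFSetField(out, TIFFTAG_IMAGEWIDTH,      width);
        TIFFSetField(out, TIFFTAG_IMAGELENGTH,     height);
        TIFFSetField(out, TIFFTAG_BITSPERSAMPLE,   bps);
        TIFFSetField(out, TIFFTAG_SAMPLESPERPIXEL, spp);
        TIFFSetField(out, TIFFTAG_PHOTOMETRIC,     photometric);
        TIFFSetField(out, TIFFTAG_SAMPLEFORMAT,    sampleformat);
        TIFFSetField(out, TIFFTAG_PLANARCONFIG,    PLANARCONFIG_CONTIG);
        TIFFSetField(out, TIFFTAG_COMPRESSION,     COMPRESSION_LZW);
        TIFFSetField(out, TIFFTAG_ROWSPERSTRIP,    (uint32_t)1);  /* the key change */

        void *buf = _TIFFmalloc(TIFFScanlineSize(in));
        for (uint32_t row = 0; row < height; row++) {
            TIFFReadScanline(in, buf, row, 0);
            TIFFWriteScanline(out, buf, row, 0);
        }
        _TIFFfree(buf);
        TIFFClose(in);
        TIFFClose(out);
        return 0;
    }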

Again, it seems counter-intuitive that a “dumb” sampler would benefit from multicore processing, since it is more I/O bound than CPU bound. However, as noted above, the compressed texture data fits into RAM, so we are really memcpy-and-decompress bound rather than read-from-disk bound.

The interp system is based on a stateless functional language specifically so that its computations can be parallelized easily. In many cases it’s obvious where and how to split a large function up so that each CPU computes a portion of the result independently (sort of a mini map-reduce). I had already experimented earlier with breaking PRMan renders into screen-space chunks and computing each one in parallel. This actually works great, provided that the edges can be kept seamless. However, since ROAM outputs a unified mesh for the whole planet, this approach entails a lot of redundant computation, and it also interacts poorly with PRMan’s own multithreading (PRMan really wants the entire machine to itself while rendering)*. So, I decided to try splitting on the vertex evaluations. The original ROAM system evaluates vertices one at a time, but as part of my SIMD optimization I modified the algorithm to batch up position evaluations as much as possible. The ROAM algorithm doesn’t actually need to know the position of a new vertex immediately; it can keep working on other splittable triangles and then come back later to see if the new triangle it made still needs to be split further. In this way I’m able to batch up terrain evaluations into groups of hundreds or thousands of vertices, each of which takes a non-negligible amount of time to compute, and all of which are entirely independent.
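
Roughly, the deferral works like this (an illustrative sketch rather than the actual ROAM code; the triangle fields, names, and batch size are invented):

    #define BATCH_SIZE 1024

    typedef struct Tri Tri;
    struct Tri {
        double mid_lat, mid_lon;  /* where the split vertex will land    */
        int    pending;           /* vertex queued but not yet evaluated */
        /* ... ROAM connectivity, split priority, children, etc. ...     */
    };

    typedef struct {
        int    count;
        Tri   *tri[BATCH_SIZE];       /* triangles waiting on a new vertex */
        double lat[BATCH_SIZE], lon[BATCH_SIZE];
        double height[BATCH_SIZE];    /* filled in when the batch is run   */
    } VertexBatch;

    /* Queue the midpoint vertex of a split instead of evaluating it now. */
    static void queue_split_vertex(VertexBatch *b, Tri *t)
    {
        b->tri[b->count] = t;
        b->lat[b->count] = t->mid_lat;
        b->lon[b->count] = t->mid_lon;
        t->pending = 1;
        b->count++;
    }

When the batch fills (or no splittable triangles remain that aren't waiting on a vertex), the whole batch is evaluated at once and the pending triangles are revisited; those batch evaluations are exactly the work units that get farmed out to the worker processes described next.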

In order to reduce coding time and complexity, my usual technique for distributed processing is to fork() the main process a few times, open pipes to each child process, then sit in a poll() loop sending batches of work to each child and sleeping while they are busy. I find this far easier than traditional multithreaded programming, because there is no shared state, and inter-task communication is 100% explicit through the poll() loop and pipes. For RIB handling and external rendering I pass the results over pipes or sockets, but here, for efficiency, I use shared memory: the parent process mmap()s a buffer shared with all child processes, where the parent writes input lat/lon positions and the children write position/color outputs in designated places.
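
A heavily simplified sketch of that scheme, with the real terrain evaluation, error handling, and batch bookkeeping stripped out (names and sizes are illustrative):

    #include <poll.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define NWORKERS  6
    #define MAX_VERTS 4096

    typedef struct {            /* one slot per worker, in shared memory */
        double lat[MAX_VERTS], lon[MAX_VERTS];   /* written by parent    */
        double height[MAX_VERTS];                /* written by child     */
        int    count;
    } WorkSlot;

    int main(void)
    {
        /* Anonymous shared mapping, visible to parent and all children. */
        WorkSlot *slots = mmap(NULL, NWORKERS * sizeof(WorkSlot),
                               PROT_READ | PROT_WRITE,
                               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        int to_child[NWORKERS][2], from_child[NWORKERS][2];

        for (int w = 0; w < NWORKERS; w++) {
            pipe(to_child[w]);
            pipe(from_child[w]);
            if (fork() == 0) {                        /* ---- child ---- */
                char cmd;
                while (read(to_child[w][0], &cmd, 1) == 1) {
                    WorkSlot *s = &slots[w];
                    for (int i = 0; i < s->count; i++)
                        s->height[i] = 0.0;           /* evaluate terrain here */
                    write(from_child[w][1], &cmd, 1); /* signal "done"   */
                }
                _exit(0);
            }
        }

        /* ---- parent: hand out a batch per worker, sleep in poll() ---- */
        struct pollfd pfd[NWORKERS];
        for (int w = 0; w < NWORKERS; w++) {
            slots[w].count = MAX_VERTS;               /* fill lat/lon here */
            write(to_child[w][1], "g", 1);            /* kick off batch    */
            pfd[w].fd     = from_child[w][0];
            pfd[w].events = POLLIN;
        }
        for (int done = 0; done < NWORKERS; ) {
            poll(pfd, NWORKERS, -1);
            for (int w = 0; w < NWORKERS; w++) {
                if (pfd[w].revents & POLLIN) {
                    char c;
                    read(pfd[w].fd, &c, 1);       /* results now in slots[w]   */
                    pfd[w].events = 0;            /* no more work in this demo */
                    done++;
                }
            }
        }
        /* Real code keeps looping, refilling slots, and waits on the
         * children at shutdown; this demo just exits. */
        return 0;
    }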

Here I discovered an unanticipated complication: the fork()ed children also share the file offsets of descriptors inherited from the parent, so seeks issued by one child would interfere with reads in another, and libtiff stepped all over itself. In general interp is not designed for this situation (and it is definitely not thread-safe). However, I was able to get around it by plugging custom I/O functions into libtiff that use pread()/pwrite() instead of read()/write(); these take an explicit file offset with each call, which I track manually by following libtiff’s read, write, and seek calls. Yes, I know that Linux has some magic clone() flags that can turn off file descriptor sharing, but I don’t want interp to become too Linux-specific. (Though of course fork() is a lost cause on Windows; my general philosophy is that “core” non-GUI/non-rendering features must work on all of Linux, OSX, and Windows, and everything must work across Linux and OSX. It’d probably work on any BSD too, but I haven’t tried.)
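
The workaround looks roughly like this (a sketch of the read path only, since the textures are read-only at render time; the PreadHandle struct and function names are mine, and the callback signatures match libtiff 4.x, which differ slightly in older versions):

    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <tiffio.h>

    typedef struct {
        int   fd;
        off_t offset;     /* our private "current position" */
    } PreadHandle;

    static tmsize_t pread_read(thandle_t h, void *buf, tmsize_t size)
    {
        PreadHandle *ph = (PreadHandle *)h;
        ssize_t n = pread(ph->fd, buf, (size_t)size, ph->offset);
        if (n > 0)
            ph->offset += n;              /* advance only our own offset */
        return (tmsize_t)n;
    }

    static toff_t pread_seek(thandle_t h, toff_t off, int whence)
    {
        PreadHandle *ph = (PreadHandle *)h;
        struct stat st;
        switch (whence) {                 /* never touches the shared fd offset */
        case SEEK_SET: ph->offset = (off_t)off;              break;
        case SEEK_CUR: ph->offset += (off_t)off;             break;
        case SEEK_END: fstat(ph->fd, &st);
                       ph->offset = st.st_size + (off_t)off; break;
        }
        return (toff_t)ph->offset;
    }

    /* Read-only use: writes are not expected at render time. */
    static tmsize_t pread_write(thandle_t h, void *buf, tmsize_t size)
    { (void)h; (void)buf; (void)size; return 0; }

    static int    pread_close(thandle_t h) { return close(((PreadHandle *)h)->fd); }
    static toff_t pread_size(thandle_t h)
    { struct stat st; fstat(((PreadHandle *)h)->fd, &st); return (toff_t)st.st_size; }

    static int  pread_map(thandle_t h, void **base, toff_t *size)
    { (void)h; (void)base; (void)size; return 0; }          /* no mmap */
    static void pread_unmap(thandle_t h, void *base, toff_t size)
    { (void)h; (void)base; (void)size; }

    /* Open a texture through the custom callbacks instead of TIFFOpen(). */
    static TIFF *open_texture(PreadHandle *ph, const char *path)
    {
        ph->fd     = open(path, O_RDONLY);
        ph->offset = 0;
        return TIFFClientOpen(path, "r", (thandle_t)ph,
                              pread_read, pread_write, pread_seek,
                              pread_close, pread_size,
                              pread_map, pread_unmap);
    }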

Back to the main story: distributing the vertex computations. I got a massive speed-up! Together with the sorting optimization and the scanline-oriented TIFF files, terrain computation got about 2x to 3x faster on an 8-core machine (using about 6 worker processes; using more reduced performance because the batches became too small). I bet I could get another 2x or so from making the sampler smarter as described above, but that would likely require many days of coding, and so isn’t worth it for my current project. Also, terrain generation has gone from a majority to a minority of render time; the dominant factor is now shading of secondary rays within PRMan. I’ve already added a few tweaks to my main shader that reduce the computation done on non-camera rays (using larger texture filters, skipping ambient occlusion lookups, etc.). Still, I am down to about 2 minutes per frame with the quality knobs turned all the way up, which is certainly comfortable for this production.

* splitting a frame into screen-space chunks is still worth pursuing as an optimization for high-speed interactive rendering, where seamlessness isn’t a huge worry. I can’t wait to try running a whole bunch of AWS instances in parallel on one frame.
