July 2011 Archives

Organizing scenes 2

I implemented the new scene layout convention. It feels awkward, but it works. As a benefit, I can now in theory load more than one scene into interp at once, without any interference. This is now the third time I’ve implemented something akin to a module/package system in interp. It’s definitely crying out for one. I’m pretty sure I have all the code I need to implement it. I just need to make sure whatever solution I come up with works smoothly in the presence of parameter state, and the toplevel syntax has to be logical and clean.

Organizing Scenes

With terrain mostly under control now, I spent some time trying to re-organize how scenes are put together and fed to interp.

Historically, I have built each scene as a single, large .mcp file that pulls in a few common definitions from a project-specific .mcp library. The common part is rather small, including only a handful of things like file locations and shader definitions. The rest of the scene consists of a large blob of interp code in the scene-specific .mcp file. This approach has the following features:

Advantage: scenes are largely independent of each other. Assets can be tweaked from scene to scene easily without interference.
Disadvantage: a large amount of interp code is duplicated across scenes. (e.g., instructions for how to render the main element, shadow passes, and atmosphere, and how to comp it all together). Updates to common assets are not propagated across the duplicates.
Disadvantage: scene files are hard to read. The ratio of “important” code to boilerplate is low.

For my current animation project, I’m trying a different approach, putting as much of the boilerplate code as possible into common files. Scenes are much smaller now, down from ~500 lines to ~200 lines or so. Common elements reference the same sources. However, accomplishing scene-specific tweaks while keeping everything well-organized is a challenge.

The problem has to do with interp’s stateless design. I have always believed very strongly that a practical scene processor must be stateless, in order to allow parallelism and caching. Symbol definitions can only reference previously-declared symbols (or themselves), so everything is built in a bottom-up manner. This works fine with abstract math and algorithms, but production scenes are another matter. They contain lots of mutually-referencing definitions, some of which stay constant for an entire project, and some of which must vary on a scene-by-scene basis (ideally with a default value that can be picked up without writing any code). For example, the outer structure of the comp tree and main scene graph are usually constant for a project. However, these reference “inner” symbols, like the list of active objects, and some lighting parameters, which can vary by scene. This prevents a straightforward bottom-up structure for the .mcp files, because the common “headers” at the top of a scene need to reference varying things declared in the main body of the file. (old-fashioned monolithic scene files are built bottom-up, but they must duplicate a ton of boilerplate because the scene-specific tweaks start very close to the “bottom”).

I’m also a strong believer in referential transparency and early binding. This means I won’t use preprocessor tricks that operate at the textual level (like C #defines). Definitions should not be updated “behind the back” of the interpreter. Interp does provide a way for specific symbols to “break” statelessness. These are called “parameters” and are used for changing the evaluation environment in a stack-like manner. The classic cases are the symbols for “time”, “width,” and “height” of the current rendered image, since those can change throughout a comp tree evaluation, and it’s very ugly to pass them as explicit parameters to every single function. However, I insist on keeping the use of parameters to an absolute minimum, since they interfere with caching and slow down the interpreter (due to the symbols being bound late). In fact, I consider any use of parameters beyond time and region-of-interest information to be highly questionable. There is no way I’d make every single scene-varying data item a parameter.
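To make the "stack-like" behavior of parameters concrete, here is a minimal Python sketch. It is not interp's implementation; `Param`, `bind`, `time_param`, and `frame_label` are hypothetical names chosen for illustration.

```python
from contextlib import contextmanager


class Param:
    """A dynamically-bound value with stack-like overrides (hypothetical)."""

    def __init__(self, default):
        self._stack = [default]

    @property
    def value(self):
        # The innermost binding wins; it is looked up late, at call time.
        return self._stack[-1]

    @contextmanager
    def bind(self, new_value):
        # Push a binding for the duration of the block, then restore it.
        self._stack.append(new_value)
        try:
            yield
        finally:
            self._stack.pop()


time_param = Param(0.0)


def frame_label():
    # Reads "time" from the environment instead of taking it as an argument.
    return f"frame at t={time_param.value}"


outer = frame_label()
with time_param.bind(2.5):
    inner = frame_label()
restored = frame_label()
```

Note that `frame_label()` returns different results for identical (empty) argument lists, which is exactly why a value like this cannot be cached on arguments alone.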

Another complication is my long-term plan that involves interp running as an interactive server process. Most scene description systems have serious trouble with interactive usage because they are designed to operate start-to-end as a batch process, like a C compiler. My long-term vision is for interp to read the library files and scenes for an entire project, keeping them all “live” in memory — or in some sort of object-oriented, version-controlled database, where today’s .mcp files are just a snapshot of the database contents — while artists manipulate symbol values interactively. Text-level preprocessor tricks will never work in this case.

I’m currently thinking that a solution might involve managing a dictionary of scene-specific “properties” (which are given default values in a project-wide parent dictionary, then optionally overridden one by one in a scene-specific child dictionary). The dictionary would, unfortunately, have to be a parameter in order to allow library functions to reference things defined below. Something like this:

library.mcp says:
       (def default-values (dict "prop1" 123.0 "prop2" 456.0))
       (defparam scene-values)
       (def scene-values default-values)
then library functions say:
       (def myfunc (+ (get-scene-value "prop1") (get-scene-value "prop2")))
and the scene looks like this:
       (execfile "/project/library.mcp")
       (set scene-values (dict default-values "prop2" 789.0))
       (def foo (myfunc))
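The parent/child dictionary idea can also be sketched in Python with `collections.ChainMap`, which looks up keys in a child mapping first and falls back to a parent. This is only an analogy under stated assumptions: `make_scene_values` is a hypothetical helper, and the dictionary is passed explicitly here rather than being an interp parameter.

```python
from collections import ChainMap

# Project-wide defaults (the role of default-values in library.mcp).
default_values = {"prop1": 123.0, "prop2": 456.0}


def make_scene_values(**overrides):
    # The child dictionary holds only the overrides; misses fall through
    # to the project-wide defaults.
    return ChainMap(overrides, default_values)


def myfunc(scene_values):
    # Library code reads properties without knowing which scene set them.
    return scene_values["prop1"] + scene_values["prop2"]


# The scene overrides prop2 and picks up prop1 without writing any code.
scene_values = make_scene_values(prop2=789.0)
foo = myfunc(scene_values)
```

The defaults are never modified, so several scenes can share one `default_values` dictionary while each carries its own overrides.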

Tiled image input

I went ahead and added tiled image input to interp, in order to make the ROAM terrain generator more efficient. It was a little easier than I thought – about a full day’s work, although a substantial portion of that was me adding tiling support to my custom EXR loader, which I ended up not using in favor of good old LibTIFF. (mainly because EXR does not support 16-bit or 8-bit integer channels, which I need for storing terrain heights and colors efficiently). Luckily I didn’t need to write a tiling utility, because PRMan ships with one called “tiffcopy.”

Combined with parallel processing, terrain generation times have gone down significantly, from about 60 seconds to 15 seconds for a decent-quality full-res shot now (shadingrate = 5). The tiled image input did not speed things up much by itself, but it greatly reduced memory usage, which allowed me to run more concurrent threads. My laptop has only 4GB of RAM and could only run 6 threads before; now it’s up to 8. That’s where most of the speed-up came from.

I realized that my ROAM generator has very bad locality in its texture lookups. This is hindering further texture-related efficiency gains. Unlike a REYES renderer, which marches across a finely-tessellated surface, the ROAM generator jumps all over the place. It starts with planet-scale triangles and alternates between them as it subdivides the world more and more finely. Delaying vertex evaluation and sorting batches in lat/lon space has helped, and it’s probably worth looking more in this direction in the future.

I also haven’t gotten as far as estimating filter sizes and performing actual mip-mapping yet. Reducing the total number of texels touched is likely to give more speed-ups.

My friend gives me the J. J. Abrams treatment

I think this is my favorite photo of myself, like, ever. Look at the special effects!

Terrain Rendering Optimization

Lately terrain generation has been taking up a majority of my render times, so I spent a day working on some optimizations to the terrain generator. This is an interpreter function that takes as input a terrain function (mapping latitude/longitude to height/color) and camera parameters, then uses the ROAM algorithm to generate a high-resolution view-dependent mesh of the planetary terrain (roughly 1 polygon/pixel), which it outputs as RIB.

The output and rendering side of the terrain generator is already well-optimized; it sorts the mesh into screen-space-coherent chunks and feeds them to RenderMan in a way that renders very efficiently. The main bottleneck is in evaluating the terrain function. Simple math operations run quickly because the interpreter already uses a PRMan-like SIMD scheme that processes large batches of points in a cache-friendly manner. However, about 90% of terrain generation time is spent sampling large image maps.

A typical surface-level shot requires taking hundreds of thousands of bicubically-interpolated samples from ten or so image maps (some storing heights and some storing colors) totaling 1GB or so of LZW-compressed TIFF files (~5GB if uncompressed). The sampler is not very efficient: it does keep an LRU cache of uncompressed scanlines in memory to avoid going to disk for each sample, but that’s about the only optimization. It is a long way from a PRMan-style texture() function. The main “pain point” is that the sampler loads in a lot more texture data than is really necessary to render a frame. Textures on disk are just scanline-oriented TIFF files, so the sampler needs to read and decompress many scanlines in order to satisfy a single sample query.
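For readers unfamiliar with the scanline cache described above, here is a minimal Python sketch of an LRU cache of decoded scanlines. It is an illustration only, not the sampler's actual code; `ScanlineCache` and `fake_load_row` are hypothetical names, and the loader is a stand-in for the real read-and-decompress step.

```python
from collections import OrderedDict


class ScanlineCache:
    """LRU cache of decoded scanlines keyed by (file, row) (hypothetical)."""

    def __init__(self, load_row, capacity):
        self._load_row = load_row      # callable(file, row) -> decoded row
        self._capacity = capacity
        self._rows = OrderedDict()
        self.misses = 0

    def get(self, file, row):
        key = (file, row)
        if key in self._rows:
            self._rows.move_to_end(key)     # mark as most recently used
            return self._rows[key]
        self.misses += 1
        data = self._load_row(file, row)    # the expensive read + decompress
        self._rows[key] = data
        if len(self._rows) > self._capacity:
            self._rows.popitem(last=False)  # evict the least recently used row
        return data


def fake_load_row(file, row):
    # Stand-in for reading and decompressing one TIFF scanline.
    return [row] * 4


cache = ScanlineCache(fake_load_row, capacity=2)
cache.get("height.tif", 0)
cache.get("height.tif", 1)
cache.get("height.tif", 0)   # hit
cache.get("height.tif", 2)   # miss; evicts row 1
cache.get("height.tif", 1)   # miss again, because row 1 was evicted
```

A bicubic sample touches several adjacent rows, so a cache like this only pays off when consecutive samples land near each other in the image.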

I saw basically two approaches to optimizing this part of the system: first, reduce the average amount of texture data loaded for each sample, and second, make the loading process go faster. The first approach is more powerful, and can follow the design of tile-caching texture() functions used by most renderers. However, it involves writing a LOT of new code: breaking planar textures into tiles and pyramids, changing the sampler to use an LRU tile cache, and most importantly, figuring out how deep into the pyramid a given sample access must go. This is a non-trivial problem for a terrain generator. Unlike an ordinary texture lookup, the results of the image sampling will affect the geometry of the terrain and also the ROAM tessellation process. In theory it’s not possible to know what detail level is appropriate for a given sample, because the results affect where the geometry lands in 3D space.

However, in practical situations, I can imagine dividing sample lookups into two classes: a high-level “lay of the land” class that only samples the highest available map resolution (this would be used for planet-scale maps), and a low-level “detail” class that is allowed to “cheat” by stopping higher in the image pyramid for samples far from the camera. (note: interp re-tessellates the terrain each frame, so frame-to-frame coherence is not a problem; this would be an issue if the terrain height for a given lat/lon position could change depending on where the camera is). Alternatively, just using a tile-based texture format rather than pyramids would probably make the system run faster than it currently does, although this wouldn’t be as efficient as a detail-based lookup. I will keep both of these options in mind for future optimization.

The second approach to optimizing the sampler is simply to make the sampling process go faster. At some point this is bottlenecked on disk speed. However, most of the compressed texture data fits into memory, so the real bottleneck is more in copying the compressed data from the kernel’s buffer cache into the sampler’s memory, and then performing LZW decompression there. I got an easy “quick win” by re-writing the TIFF file textures to store each scanline as a separate compression unit. (by default most TIFF writers group about 16 scanlines into a single unit; this interacts poorly with the scanline cache inside the sampler). I checked the speed of TIFF’s other compression method, Deflate, but it was much slower than LZW. (and I’m not going to store textures uncompressed on disk). I also got a small speed increase from sorting each SIMD batch of vertex evaluations by latitude – this improves locality of texture accesses. The big speed gain, however, came from distributing vertex evaluation across many processor cores.
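The latitude-sorting trick mentioned above can be sketched as follows. This is a Python illustration under assumptions, not the interp code: `evaluate_sorted_by_latitude` and `fake_terrain` are hypothetical names, and the terrain function is a stand-in that records the order in which latitudes are visited.

```python
def evaluate_sorted_by_latitude(points, evaluate_batch):
    """Evaluate (lat, lon) points in latitude order; return results in input order."""
    # Remember each point's original slot so results can be scattered back.
    order = sorted(range(len(points)), key=lambda i: points[i][0])
    sorted_points = [points[i] for i in order]
    sorted_results = evaluate_batch(sorted_points)
    results = [None] * len(points)
    for slot, value in zip(order, sorted_results):
        results[slot] = value
    return results


visited = []


def fake_terrain(batch):
    # Stand-in terrain function; records the order in which rows are touched.
    visited.extend(lat for lat, _ in batch)
    return [lat * 10 + lon for lat, lon in batch]


points = [(30, 1), (-45, 2), (10, 3)]
heights = evaluate_sorted_by_latitude(points, fake_terrain)
```

With scanline-oriented images, neighboring latitudes map to neighboring rows, so visiting points in latitude order keeps consecutive lookups close together in the file.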

Again, it seems counter-intuitive that a “dumb” sampler would benefit from multicore processing, since it is more I/O bound than CPU bound. However, note as above that the compressed texture data fits into RAM, so really we are memcpy-and-decompress bound rather than read-from-disk bound.

The interp system is based on a stateless functional language specifically so that its computations can be parallelized easily. In many cases it’s obvious where and how to split a large function up so that each CPU computes a portion of the result independently (sort of a mini map-reduce). I had already experimented earlier with breaking PRMan renders into screen-space chunks and computing each one in parallel. This actually works great, provided that the edges can be kept seamless. However, since ROAM outputs a unified mesh for the whole planet, this approach entails a lot of redundant computation, and it also interacts poorly with PRMan’s own multithreading (PRMan really wants the entire machine to itself while rendering)*. So, I decided to try splitting on the vertex evaluations. The original ROAM system evaluates vertices one at a time, but as part of my SIMD optimization I modified the algorithm to batch up position evaluations as much as possible. The ROAM algorithm doesn’t actually need to know the position of a new vertex immediately; it can keep working on other splittable triangles and then come back later to see if the new triangle it made still needs to be split further. In this way I’m able to batch up terrain evaluations into groups of hundreds or thousands of vertices, each of which takes a non-negligible amount of time to compute, and all of which are entirely independent.
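The deferred-evaluation idea, where the tessellator queues vertex requests and picks up the answers later, might look roughly like this in Python. `DeferredEvaluator` is a hypothetical name and the batch function is a trivial stand-in; the real ROAM bookkeeping is considerably more involved.

```python
class DeferredEvaluator:
    """Queues vertex requests and evaluates them in batches (hypothetical)."""

    def __init__(self, evaluate_batch, batch_size):
        self._evaluate_batch = evaluate_batch
        self._batch_size = batch_size
        self._pending = []     # (vertex_id, lat, lon) not yet evaluated
        self.positions = {}    # vertex_id -> evaluated result
        self.batches_run = 0

    def request(self, vertex_id, lat, lon):
        # The caller does not wait; it moves on to other triangles.
        self._pending.append((vertex_id, lat, lon))
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self):
        if not self._pending:
            return
        ids = [v for v, _, _ in self._pending]
        coords = [(lat, lon) for _, lat, lon in self._pending]
        for vertex_id, result in zip(ids, self._evaluate_batch(coords)):
            self.positions[vertex_id] = result
        self._pending.clear()
        self.batches_run += 1


evaluator = DeferredEvaluator(
    lambda pts: [lat + lon for lat, lon in pts], batch_size=3
)
for vid in range(7):
    evaluator.request(vid, float(vid), 1.0)
evaluator.flush()   # evaluate the leftover partial batch
```

Because every entry in a batch is independent of the others, the call to `evaluate_batch` is the natural place to split work across processes.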

In order to reduce coding time and complexity, my usual approach to distributed processing is to fork() the main process a few times, open pipes to each child process, then sit in a poll() loop sending batches of work to each child and sleeping while they are busy. I find this far easier than traditional multithreaded programming, because there is no shared state and inter-task communication is 100% explicit through the poll() loop and pipes. For RIB handling and external rendering I pass the results over pipes or sockets, but here for efficiency I use shared memory: the parent process mmap()s a buffer shared to all child processes, where the parent writes input lat/lon positions and the children write position/color outputs in designated places.
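The fork()/pipe/poll() pattern with a shared mmap() buffer can be sketched in Python on a Unix system. This is a simplified illustration, not the interp code: the "work" is just squaring doubles, and `square_in_children`, the `(start, count)` command format, and the "count 0 means quit" convention are all assumptions made for the example.

```python
import mmap
import os
import select
import struct

NUM_WORKERS = 2
BATCH = 2                      # items per batch
ITEM = struct.calcsize("d")    # one double per input and per output


def square_in_children(inputs):
    n = len(inputs)
    # Anonymous shared mapping: inputs in the first half, outputs in the second.
    shm = mmap.mmap(-1, max(1, 2 * n * ITEM))
    struct.pack_into(f"{n}d", shm, 0, *inputs)

    workers = {}  # result-pipe read fd -> (command-pipe write fd, child pid)
    for _ in range(NUM_WORKERS):
        cmd_r, cmd_w = os.pipe()
        res_r, res_w = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: serve (start, count) commands; count 0 means quit.
            os.close(cmd_w)
            os.close(res_r)
            while True:
                msg = os.read(cmd_r, 8)
                start, count = struct.unpack("ii", msg) if len(msg) == 8 else (0, 0)
                if count == 0:
                    os._exit(0)
                for i in range(start, start + count):
                    (x,) = struct.unpack_from("d", shm, i * ITEM)
                    struct.pack_into("d", shm, (n + i) * ITEM, x * x)
                os.write(res_w, b"k")  # tell the parent this batch is done
        os.close(cmd_r)
        os.close(res_w)
        workers[res_r] = (cmd_w, pid)

    batches = [(s, min(BATCH, n - s)) for s in range(0, n, BATCH)]
    poller = select.poll()
    for fd in workers:
        poller.register(fd, select.POLLIN)
    idle = list(workers)       # all workers start idle
    outstanding = 0
    while batches or outstanding:
        # Hand a batch to every idle worker.
        while batches and idle:
            fd = idle.pop()
            os.write(workers[fd][0], struct.pack("ii", *batches.pop()))
            outstanding += 1
        # Sleep until at least one worker reports completion.
        for fd, _ in poller.poll():
            os.read(fd, 1)
            idle.append(fd)
            outstanding -= 1

    for res_r, (cmd_w, pid) in workers.items():
        os.write(cmd_w, struct.pack("ii", 0, 0))
        os.waitpid(pid, 0)
        os.close(cmd_w)
        os.close(res_r)
    return list(struct.unpack_from(f"{n}d", shm, n * ITEM))


squares = square_in_children([1.0, 2.0, 3.0, 4.0, 5.0])
```

Each child writes only to the output slots named in its command, so no locking is needed; the pipes carry nothing but small control messages.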

Here I discovered an unanticipated complication: the fork()ed children also share file descriptor state with each other. This caused libtiff to step all over itself because file seeks from different child processes would interfere with each other. In general interp is not designed for this situation (and it is definitely not thread-safe). However, I was able to get around it by plugging in custom I/O functions to libtiff that use pread()/pwrite() instead of read()/write(), which provide an explicit file offset to each call (tracked manually by following libtiff’s read, write, and seek calls). Yes, I know that Linux has some magic clone() flags that can turn off file descriptor sharing, but I don’t want interp to become too Linux-specific. (though of course fork() is a lost cause on Windows – my general philosophy is that “core” non-GUI/non-rendering features must work on all of Linux, OSX, and Windows, and everything must work across Linux and OSX. It’d probably work on any BSD too, but I haven’t tried).
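The pread() workaround can be illustrated with Python's `os.pread` (Unix only). This sketch does not touch libtiff; `PositionalReader` is a hypothetical class that mimics the idea of tracking the offset manually instead of relying on the descriptor's shared file position.

```python
import os
import tempfile


class PositionalReader:
    """File reader that tracks its own offset and uses os.pread (hypothetical)."""

    def __init__(self, fd):
        self._fd = fd
        self._offset = 0   # private to this object, not shared via the descriptor

    def seek(self, offset):
        self._offset = offset

    def read(self, size):
        # pread takes the offset explicitly and leaves the descriptor's
        # shared file position untouched.
        data = os.pread(self._fd, size, self._offset)
        self._offset += len(data)
        return data


with tempfile.TemporaryFile() as f:
    f.write(b"0123456789")
    f.flush()
    fd = f.fileno()
    a = PositionalReader(fd)
    b = PositionalReader(fd)   # same descriptor, as in fork()ed children
    a.seek(2)
    first = a.read(3)
    b.seek(7)
    other = b.read(3)
    second = a.read(3)         # unaffected by b's seek
```

Two readers on the same descriptor stay independent because neither ever moves the descriptor's own position.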

Back to the main story: distributing the vertex computations. I got a massive speed-up! Together with the sorting optimization and scanline-oriented TIFF files, terrain computation got about 2x to 3x faster on an 8-core machine (using about 6 worker processes – using more reduced performance due to batches becoming too small). I bet I could get another 2x or so from making the sampler smarter as described above, but that would likely require many days of coding, and so isn’t worth it for my current project. Also, terrain generation has gone from a majority to a minority of render time. Now the dominant factor is shading of secondary rays within PRMan. I’ve already added a few tweaks to my main shader that reduce the computation done on non-camera rays (using larger texture filters, not doing ambient occlusion lookups, etc). Still, I am down to about 2 minutes per frame with the quality knobs turned all the way up, which is certainly comfortable for this production.

* splitting a frame into screen-space chunks is still worth pursuing as an optimization for high-speed interactive rendering, where seamlessness isn’t a huge worry. I can’t wait to try running a whole bunch of AWS instances in parallel on one frame.

Bug fix

I spent about two hours finding and fixing a nasty bug in my ROAM terrain tessellator. The symptom was the RIB generator complaining about negative polygon vertex indices. This was very strange because the ROAM system should never produce invalid geometry. The problem only occurred when the bucket-sorting stage was enabled. This is an optional post-processing step that runs after terrain generation; it breaks the monolithic terrain mesh into a number of smaller spatially-coherent meshes, making it easier for PRMan to “digest.” The sorter creates new vertex and index arrays for each output mesh, including only the polygons and vertices referenced by each mesh. In order to maintain the same sharing of indices with the big input mesh, the sorter uses a temporary array to track the new index it gives to each vertex that is pulled from the input mesh. The temporary array is looked up by index, so it must be large enough to contain the greatest-valued index as a valid element. This is where I found the bug. The sorter chooses the size of the array to be the same as the size of the index array on the input mesh. This makes a hidden assumption that the input mesh is not sparse, i.e., it does not contain any index values greater than or equal to the size of the index array itself. This assumption no longer holds because I recently added an optimization to the ROAM tessellator that culls polygons outside the camera frustum. It deletes the polygons but leaves their vertices in the vertex array. The resulting meshes are “sparse,” so this was fouling up the bucket sorter. The fix was simply to size the temporary index mapping array from the greatest index value seen in the input mesh (that value plus one, so that the greatest index is itself a valid element).
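The remapping step and the sizing fix can be sketched in Python. This is an illustration of the technique described above, not the tessellator's code; `extract_submesh` is a hypothetical name and the mesh is a toy example.

```python
def extract_submesh(vertices, polygons):
    """Copy the given polygons into a compact mesh with its own vertex array."""
    # Size the remap table from the largest index actually referenced, plus
    # one, so that index is itself a valid slot even when the mesh is sparse.
    max_index = max((i for poly in polygons for i in poly), default=-1)
    remap = [-1] * (max_index + 1)   # old index -> new index, -1 = unseen

    new_vertices = []
    new_polygons = []
    for poly in polygons:
        new_poly = []
        for old in poly:
            if remap[old] < 0:
                # First time this vertex is seen: give it the next new index.
                remap[old] = len(new_vertices)
                new_vertices.append(vertices[old])
            new_poly.append(remap[old])
        new_polygons.append(tuple(new_poly))
    return new_vertices, new_polygons


# A "sparse" input: culling removed the polygons that used vertices 0..3,
# so the one remaining triangle references indices up to 6.
vertices = ["v0", "v1", "v2", "v3", "v4", "v5", "v6"]
polygons = [(4, 6, 5)]
new_vertices, new_polygons = extract_submesh(vertices, polygons)
```

Sizing `remap` from the number of indices in the polygon list (three here) would leave index 6 out of range, which is the failure mode described above.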

Some ideas

Graphics-related ideas I’m hoping to pursue in the future (recorded here in public to establish priority):

A standard interface for shader reflectance models. This will operate at a higher level than shading languages. Rather than being a full programming language, it will be a simple “API” containing a handful of parameters (diffuse/specular intensities and colors). The purpose of this is to allow a single library of surfaces to operate across very different renderers, especially between RenderMan-style offline renderers and GPU-based real-time renderers. The idea is to cleanly separate surface reflectance functions from the overall light transport framework. This way artists only need to design surfaces once, and they will look roughly the same regardless of the particular light transport method used (e.g., OpenGL local lighting vs. ray casting vs. full global illumination).
A “quick turnaround” 3D animation pipeline. The idea is to intentionally sacrifice a small amount of visual quality in return for massive gains in production speed (and thus significantly lower costs). This could be accomplished by using motion capture for character animation, coupled to a GPU-based rendering system, likely based on an existing game engine. (e.g., the Unreal engine, CryTek engine, or id’s Rage engine).
A unified lighting and compositing system that uses micropolygon grids as the medium for exchanging data, rather than deep framebuffers. Pushing final shading and rasterization of micropolygon grids out of the renderer and into the compositor will allow some interesting tricks, like greater shader flexibility and quicker re-rendering at the compositing stage.
A distributed digital media production system based on a commodity cloud-computing platform like Amazon’s EC2. This would include a set of tools to synchronize files efficiently over low-bandwidth links between computers in the cloud and artists’ workstations. Ideally this would tie in to a production asset management/scheduling system. It would also include tools to control render nodes in the cloud, and collect and display computed work products.

The Magic of EVE Online

“In Eve you are allowed to make a difference, even if you do it the wrong way, even if it isn’t in the script. One turncoat was able to disband one of the oldest and most powerful alliance in the game, putting them into disarray long enough for their enemies to come and take everything they had built — and the developers let it happen. One guy can pull off an amazing heist worth thousands of real life dollars — and the developers will let it happen. One guy can forget to put money in the right wallet before he goes on his honeymoon leading to one of the most powerful alliances losing their key systems and throwing them into disarray long enough for their enemies to come and take everything they had built — and the developers will let it happen.

As long as you don’t hack other players’ accounts, almost anything goes. You probably won’t ever do anything particularly noteworthy, but you might, and if you do it will be allowed to happen. You will be allowed to change the world. You can’t get that anywhere else. People get all mad at CCP and go looking for other games, not even space games or science fiction games, just a persistent world to play in, and nothing else comes close. Eve is a terrible game. Every other game is worse.”

— Post by “Angela Christine” on SomethingAwful.com

Transformers: Dark of the Moon

Reactions:

The writing was significantly better than the last Transformers film, especially throughout the first half – good balance and pacing between action and set-up of the various characters.
However, the transition into the Chicago finale was very awkward. I think it was just too hard to align all the different subplots that needed to be tied up. Maybe one could have been jettisoned, like the unlikable Dylan character?
Fantastic VFX, especially considering the gigantic scale of the film. Highlights:

Beautiful, readable lighting – it was much easier to follow what was going on in most action shots as compared to the earlier films, thanks to well-choreographed timing and composition. And it looked great too, especially the Cybertron scenes.
Slomo-actors-flying-through-the-air shots seemed to be the main “innovation,” and were done well.
Very little “animaticitis” – big things moved at the right speed for the most part, like the big dropships over Chicago.
Some comps seemed a little off, probably due to time constraints. e.g. the “worm” Decepticon plowing through the long factory building in the intro – not enough interaction with the outer walls. Some lens flares and lens dirt elements went over the top.
Volumetric fire/smoke elements didn’t always comp over the backgrounds realistically. I wonder if this is a consequence of stereo 3D making comp tricks harder to pull off?

Fun cameos by Buzz Aldrin, Kennedy Space Center, and tons of other NASA and military sites. (shut down your logical brain and just enjoy the spectacle)
“This is the episode where Spock goes insane.” – wow, I should have caught that!

I saw the film on celluloid in 2D. Will catch a digital 3D screening when I get a chance to compare. I have to say though I don’t think I was missing much in 2D.
