Terrain Rendering Optimization

Lately terrain generation has been taking up a majority of my render times, so I spent a day working on some optimizations to the terrain generator. This is an interpreter function that takes as input a terrain function (mapping latitude/longitude to height/color) and camera parameters, then uses the ROAM algorithm to generate a high-resolution view-dependent mesh of the planetary terrain (roughly 1 polygon/pixel), which it outputs as RIB.

The output and rendering side of the terrain generator is already well-optimized; it sorts the mesh into screen-space-coherent chunks and feeds them to RenderMan in a way that renders very efficiently. The main bottleneck is in evaluating the terrain function. Simple math operations run quickly because the interpreter already uses a PRMan-like SIMD scheme that processes large batches of points in a cache-friendly manner. However, about 90% of terrain generation time is spent sampling large image maps.
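For illustration, the batched evaluation scheme looks roughly like this in spirit — interp is really an interpreter over a functional language, so these names (TerrainBatch, op_add_height) are invented, not its actual internals:

```c
/* Illustrative only: just the shape of the batching idea, not interp's
 * real data structures. TerrainBatch and op_add_height are hypothetical. */
#include <stddef.h>

enum { BATCH_SIZE = 1024 };

typedef struct {
    double lat[BATCH_SIZE];      /* inputs: latitude per point   */
    double lon[BATCH_SIZE];      /* inputs: longitude per point  */
    double height[BATCH_SIZE];   /* outputs: terrain height      */
    float  color[BATCH_SIZE][3]; /* outputs: terrain color       */
    size_t count;
} TerrainBatch;

/* Each primitive operation runs across the whole batch, so one operation's
 * data stays hot in cache instead of interleaving many operations per point
 * (the same idea PRMan uses when shading grids of points). */
static void op_add_height(TerrainBatch *b, const double *offsets)
{
    for (size_t i = 0; i < b->count; ++i)
        b->height[i] += offsets[i];
}
```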

A typical surface-level shot requires taking hundreds of thousands of bicubically-interpolated samples from ten or so image maps (some storing heights and some storing colors) totaling 1GB or so of LZW-compressed TIFF files (~5GB if uncompressed). The sampler is not very efficient: it does keep an LRU cache of uncompressed scanlines in memory to avoid going to disk for each sample, but that’s about the only optimization. It is a long way from a PRMan-style texture() function. The main “pain point” is that the sampler loads in a lot more texture data than is really necessary to render a frame. Textures on disk are just scanline-oriented TIFF files, so the sampler needs to read and decompress many scanlines in order to satisfy a single sample query.
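Roughly what that scanline cache amounts to, as a minimal sketch — the names are invented, and the real sampler also has to deal with bicubic footprints spanning several scanlines, multiple maps, and so on:

```c
/* Minimal sketch of an LRU scanline cache along the lines described above.
 * Eviction is least-recently-used via an access counter; lookup is a linear
 * scan (fine for a small cache; a real one would want a hash table). */
#include <stdint.h>
#include <stdlib.h>

#define CACHE_SLOTS 256

typedef struct {
    int       map_id;    /* which image map            */
    uint32_t  row;       /* scanline index             */
    float    *pixels;    /* decompressed scanline data */
    uint64_t  last_used; /* for LRU eviction           */
} CacheSlot;

typedef struct {
    CacheSlot slots[CACHE_SLOTS];  /* zero-initialized = all slots empty */
    uint64_t  clock;
} ScanlineCache;

/* Provided elsewhere: reads and decompresses one scanline from the TIFF. */
extern float *load_scanline_from_disk(int map_id, uint32_t row);

static float *get_scanline(ScanlineCache *c, int map_id, uint32_t row)
{
    int lru = 0;
    for (int i = 0; i < CACHE_SLOTS; ++i) {
        if (c->slots[i].pixels && c->slots[i].map_id == map_id &&
            c->slots[i].row == row) {
            c->slots[i].last_used = ++c->clock;   /* cache hit */
            return c->slots[i].pixels;
        }
        if (c->slots[i].last_used < c->slots[lru].last_used)
            lru = i;                              /* remember LRU slot */
    }
    /* Miss: evict the least-recently-used slot and load from disk. */
    free(c->slots[lru].pixels);
    c->slots[lru].map_id    = map_id;
    c->slots[lru].row       = row;
    c->slots[lru].pixels    = load_scanline_from_disk(map_id, row);
    c->slots[lru].last_used = ++c->clock;
    return c->slots[lru].pixels;
}
```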

I saw basically two approaches to optimizing this part of the system: first, reduce the average amount of texture data loaded for each sample, and second, make the loading process go faster. The first approach is more powerful, and can follow the design of tile-caching texture() functions used by most renderers. However, it involves writing a LOT of new code: breaking planar textures into tiles and pyramids, changing the sampler to use an LRU tile cache, and most importantly, figuring out how deep into the pyramid a given sample access must go. This is a non-trivial problem for a terrain generator. Unlike an ordinary texture lookup, the results of the image sampling will affect the geometry of the terrain and also the ROAM tessellation process. In theory it’s not possible to know what detail level is appropriate for a given sample, because the sample’s own result determines where the geometry lands in 3D space (and therefore how far it ends up from the camera).

However, in practical situations, I can imagine dividing sample lookups into two classes: a high-level “lay of the land” class that only samples the highest available map resolution (this would be used for planet-scale maps), and a low-level “detail” class that is allowed to “cheat” by stopping higher in the image pyramid for samples far from the camera. (note: interp re-tessellates the terrain each frame, so frame-to-frame coherence is not a problem; this would be an issue if the terrain height for a given lat/lon position could change depending on where the camera is). Alternatively, just using a tile-based texture format rather than pyramids would probably make the system run faster than it currently does, although this wouldn’t be as efficient as a detail-based lookup. I will keep both of these options in mind for future optimization.

The second approach to optimizing the sampler is simply to make the sampling process go faster. At some point this is bottlenecked on disk speed. However, most of the compressed texture data fits into memory, so the real bottleneck is more in copying the compressed data from the kernel’s buffer cache into the sampler’s memory, and then performing LZW decompression there. I got an easy “quick win” by rewriting the TIFF file textures to store each scanline as a separate compression unit. (by default most TIFF writers group about 16 scanlines into a single unit; this interacts poorly with the scanline cache inside the sampler). I checked the speed of TIFF’s other compression method, Deflate, but it was much slower than LZW. (and I’m not going to store textures uncompressed on disk). I also got a small speed increase from sorting each SIMD batch of vertex evaluations by latitude – this improves locality of texture accesses. The big speed gain, however, came from distributing vertex evaluation across many processor cores.
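The TIFF rewrite is essentially this kind of one-off libtiff conversion. This is a sketch, not my actual tool: error handling and copying of the remaining tags are omitted, and it assumes strip-based, single-plane (contiguous) input:

```c
/* Rewrite a scanline TIFF so each scanline is its own LZW compression unit
 * (RowsPerStrip = 1). Sketch only; many tags (e.g. SAMPLEFORMAT for float
 * height maps) would also need to be carried over. */
#include <stdint.h>
#include <tiffio.h>

int rewrite_one_row_per_strip(const char *src_path, const char *dst_path)
{
    TIFF *in  = TIFFOpen(src_path, "r");
    TIFF *out = TIFFOpen(dst_path, "w");
    if (!in || !out)
        return -1;

    uint32_t width = 0, height = 0;
    uint16_t bits = 8, spp = 1, photo = PHOTOMETRIC_MINISBLACK;
    TIFFGetField(in, TIFFTAG_IMAGEWIDTH, &width);
    TIFFGetField(in, TIFFTAG_IMAGELENGTH, &height);
    TIFFGetField(in, TIFFTAG_BITSPERSAMPLE, &bits);
    TIFFGetField(in, TIFFTAG_SAMPLESPERPIXEL, &spp);
    TIFFGetField(in, TIFFTAG_PHOTOMETRIC, &photo);

    TIFFSetField(out, TIFFTAG_IMAGEWIDTH, width);
    TIFFSetField(out, TIFFTAG_IMAGELENGTH, height);
    TIFFSetField(out, TIFFTAG_BITSPERSAMPLE, bits);
    TIFFSetField(out, TIFFTAG_SAMPLESPERPIXEL, spp);
    TIFFSetField(out, TIFFTAG_PHOTOMETRIC, photo);
    TIFFSetField(out, TIFFTAG_PLANARCONFIG, PLANARCONFIG_CONTIG);
    TIFFSetField(out, TIFFTAG_COMPRESSION, COMPRESSION_LZW);
    TIFFSetField(out, TIFFTAG_ROWSPERSTRIP, 1);   /* the key change */

    tdata_t buf = _TIFFmalloc(TIFFScanlineSize(in));
    for (uint32_t row = 0; row < height; ++row) {
        TIFFReadScanline(in, buf, row, 0);
        TIFFWriteScanline(out, buf, row, 0);
    }
    _TIFFfree(buf);
    TIFFClose(out);
    TIFFClose(in);
    return 0;
}
```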

It may seem counter-intuitive that a “dumb” sampler would benefit from multicore processing, since it looks more I/O-bound than CPU-bound. However, as noted above, the compressed texture data fits into RAM, so we are really memcpy-and-decompress bound rather than read-from-disk bound.

The interp system is based on a stateless functional language specifically so that its computations can be parallelized easily. In many cases it’s obvious where and how to split a large function up so that each CPU computes a portion of the result independently (sort of a mini map-reduce). I had already experimented earlier with breaking PRMan renders into screen-space chunks and computing each one in parallel. This actually works great, provided that the edges can be kept seamless. However, since ROAM outputs a unified mesh for the whole planet, this approach entails a lot of redundant computation, and it also interacts poorly with PRMan’s own multithreading (PRMan really wants the entire machine to itself while rendering)*. So, I decided to try splitting on the vertex evaluations. The original ROAM system evaluates vertices one at a time, but as part of my SIMD optimization I modified the algorithm to batch up position evaluations as much as possible. The ROAM algorithm doesn’t actually need to know the position of a new vertex immediately; it can keep working on other splittable triangles and then come back later to see if the new triangle it made still needs to be split further. In this way I’m able to batch up terrain evaluations into groups of hundreds or thousands of vertices, each of which takes a non-negligible amount of time to compute, and all of which are entirely independent.
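Conceptually, the deferred evaluation looks something like this (hypothetical names, not the real ROAM code; colors are handled the same way as heights):

```c
/* Instead of evaluating each new vertex immediately when a triangle is
 * split, queue the (lat,lon) midpoint and move on; once the queue is full,
 * evaluate it as one batch and revisit the affected triangles. */
#include <stddef.h>

typedef struct { double lat, lon; } LatLon;
typedef struct Triangle Triangle;

#define MAX_PENDING 4096

typedef struct {
    LatLon    inputs[MAX_PENDING];   /* midpoints awaiting evaluation      */
    Triangle *owners[MAX_PENDING];   /* triangle that will use each result */
    size_t    count;
} PendingBatch;

/* Provided elsewhere: evaluates a whole batch of lat/lon positions
 * (this is the part that gets farmed out to the worker processes). */
extern void evaluate_positions(const LatLon *in, size_t n, double *heights);

/* Provided elsewhere: decides whether the newly positioned triangle still
 * violates the screen-space error bound and re-queues it if so. */
extern void reconsider_split(Triangle *t, double midpoint_height);

static void flush_pending(PendingBatch *b)
{
    double heights[MAX_PENDING];
    evaluate_positions(b->inputs, b->count, heights);
    for (size_t i = 0; i < b->count; ++i)
        reconsider_split(b->owners[i], heights[i]);
    b->count = 0;
}
```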

In order to reduce coding time and complexity, my usual technique for distributed processing is to fork() the main process a few times, open pipes to each child process, then sit in a poll() loop sending batches of work to each child and sleeping while they are busy. I find this far easier than traditional multithreaded programming, because there is no shared state and inter-task communication is 100% explicit through the poll() loop and pipes. For RIB handling and external rendering I pass the results over pipes or sockets, but here, for efficiency, I use shared memory: the parent process mmap()s a buffer shared with all child processes, where the parent writes input lat/lon positions and the children write position/color outputs at designated offsets.
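A stripped-down skeleton of that pattern, with hypothetical names (evaluate_batch and friends) standing in for interp’s evaluator — real code needs error handling, partial-read handling, batch refilling, and clean shutdown, none of which are shown:

```c
#define _DEFAULT_SOURCE            /* for MAP_ANONYMOUS on glibc */
#include <poll.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define NWORKERS  6
#define BATCH_MAX 4096

typedef struct { double lat, lon; } LatLon;
typedef struct { double height; float color[3]; } Sample;

typedef struct {
    LatLon inputs[NWORKERS][BATCH_MAX];   /* parent writes here  */
    Sample outputs[NWORKERS][BATCH_MAX];  /* children write here */
    int    counts[NWORKERS];
} SharedArea;

/* Stand-in for the real vertex evaluation (terrain function + texture sampling). */
static void evaluate_batch(const LatLon *in, int n, Sample *out)
{
    for (int i = 0; i < n; ++i) {
        (void)in[i];
        out[i].height = 0.0;              /* ... real work goes here ... */
        out[i].color[0] = out[i].color[1] = out[i].color[2] = 0.5f;
    }
}

int main(void)
{
    /* Anonymous shared mapping, visible to all children after fork(). */
    SharedArea *shm = mmap(NULL, sizeof *shm, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    int to_child[NWORKERS][2], from_child[NWORKERS][2];
    for (int w = 0; w < NWORKERS; ++w) {
        pipe(to_child[w]);
        pipe(from_child[w]);
        if (fork() == 0) {                /* ---- worker process ---- */
            char token;
            while (read(to_child[w][0], &token, 1) == 1) {
                evaluate_batch(shm->inputs[w], shm->counts[w], shm->outputs[w]);
                write(from_child[w][1], &token, 1);   /* signal "done" */
            }
            _exit(0);
        }
    }

    /* ---- parent: hand out one batch per worker, wait with poll() ---- */
    struct pollfd pfd[NWORKERS];
    for (int w = 0; w < NWORKERS; ++w) {
        shm->counts[w] = 1;                       /* fill inputs[w] here */
        shm->inputs[w][0] = (LatLon){ 25.0, 121.5 };
        write(to_child[w][1], "x", 1);
        pfd[w] = (struct pollfd){ .fd = from_child[w][0], .events = POLLIN };
    }
    for (int done = 0; done < NWORKERS; ) {
        poll(pfd, NWORKERS, -1);
        for (int w = 0; w < NWORKERS; ++w) {
            char token;
            if ((pfd[w].revents & POLLIN) && read(pfd[w].fd, &token, 1) == 1) {
                printf("worker %d: height %.2f\n", w, shm->outputs[w][0].height);
                done++;
                pfd[w].fd = -1;           /* stop polling this worker */
            }
        }
    }
    for (int w = 0; w < NWORKERS; ++w)
        close(to_child[w][1]);            /* EOF -> workers exit */
    while (wait(NULL) > 0) { }
    return 0;
}
```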

Here I discovered an unanticipated complication: the fork()ed children share open-file state, including the current seek offset, with each other. This caused libtiff to step all over itself because file seeks from different child processes would interfere with each other. In general interp is not designed for this situation (and it is definitely not thread-safe). However, I was able to get around it by plugging custom I/O functions into libtiff that use pread()/pwrite() instead of read()/write(); these take an explicit file offset on each call (tracked manually by following libtiff’s read, write, and seek calls). Yes, I know that Linux has some magic clone() flags that can turn off file descriptor sharing, but I don’t want interp to become too Linux-specific. (though of course fork() is a lost cause on Windows – my general philosophy is that “core” non-GUI/non-rendering features must work on all of Linux, OSX, and Windows, and everything must work across Linux and OSX. It’d probably work on any BSD too, but I haven’t tried).
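The hook-up looks roughly like this, sketched against libtiff 4.x’s TIFFClientOpen() (error handling is omitted, the write hook is stubbed out since the sampler only reads, and names like PreadState are mine, not interp’s):

```c
/* Each process tracks its own offset in PreadState, so concurrent children
 * never disturb the shared kernel file offset. */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>
#include <tiffio.h>

typedef struct {
    int   fd;
    off_t offset;   /* our private "current position" */
} PreadState;

static tmsize_t pread_read(thandle_t h, void *buf, tmsize_t size)
{
    PreadState *st = (PreadState *)h;
    ssize_t n = pread(st->fd, buf, (size_t)size, st->offset);
    if (n > 0)
        st->offset += n;      /* advance our private offset, not the fd's */
    return (tmsize_t)n;
}

static tmsize_t pread_write(thandle_t h, void *buf, tmsize_t size)
{
    (void)h; (void)buf; (void)size;
    return 0;                 /* read-only usage in this sketch */
}

static toff_t pread_seek(thandle_t h, toff_t off, int whence)
{
    PreadState *st = (PreadState *)h;
    struct stat sb;
    switch (whence) {
    case SEEK_SET: st->offset = (off_t)off; break;
    case SEEK_CUR: st->offset += (off_t)off; break;
    case SEEK_END: fstat(st->fd, &sb); st->offset = sb.st_size + (off_t)off; break;
    }
    return (toff_t)st->offset;
}

static int pread_close(thandle_t h)
{
    PreadState *st = (PreadState *)h;
    int r = close(st->fd);
    free(st);
    return r;
}

static toff_t pread_size(thandle_t h)
{
    struct stat sb;
    fstat(((PreadState *)h)->fd, &sb);
    return (toff_t)sb.st_size;
}

TIFF *open_tiff_pread(const char *path)
{
    PreadState *st = malloc(sizeof *st);
    st->fd = open(path, O_RDONLY);
    st->offset = 0;
    return TIFFClientOpen(path, "r", (thandle_t)st,
                          pread_read, pread_write, pread_seek,
                          pread_close, pread_size,
                          NULL, NULL);   /* no mmap hooks */
}
```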

Back to the main story: distributing the vertex computations. I got a massive speed-up! Together with the sorting optimization and scanline-oriented TIFF files, terrain computation got about 2x to 3x faster on an 8-core machine (with about 6 worker processes – adding more actually reduced performance because the batches became too small). I bet I could get another 2x or so from making the sampler smarter as described above, but that would likely require many days of coding, and so isn’t worth it for my current project. Also, terrain generation has gone from a majority to a minority of render time. Now the dominant factor is shading of secondary rays within PRMan. I’ve already added a few tweaks to my main shader that reduce the computation done on non-camera rays (using larger texture filters, not doing ambient occlusion lookups, etc). Still, I am down to about 2 minutes per frame with the quality knobs turned all the way up, which is certainly comfortable for this production.

* splitting a frame into screen-space chunks is still worth pursuing as an optimization for high-speed interactive rendering, where seamlessness isn’t a huge worry. I can’t wait to try running a whole bunch of AWS instances in parallel on one frame.

Bug fix

I spent about two hours finding and fixing a nasty bug in my ROAM terrain tessellator. The symptom was the RIB generator complaining about negative polygon vertex indices. This was very strange because the ROAM system should never produce invalid geometry. The problem only occurred when the bucket-sorting stage was enabled. This is an optional post-processing step that runs after terrain generation; it breaks the monolithic terrain mesh into a number of smaller spatially-coherent meshes, making it easier for PRMan to “digest.” The sorter creates new vertex and index arrays for each output mesh, including only the polygons and vertices referenced by each mesh. In order to preserve the same vertex sharing as in the big input mesh, the sorter uses a temporary array to track the new index it gives to each vertex pulled from the input mesh. The temporary array is looked up by vertex index, so it must be large enough to contain the greatest-valued index as a valid element. This is where I found the bug. The sorter chose the size of the array to be the same as the size of the index array on the input mesh. This makes a hidden assumption that the input mesh is not sparse, i.e., that it contains no index values greater than or equal to the length of the index array itself. This assumption no longer holds, because I recently added an optimization to the ROAM tessellator that culls polygons outside the camera frustum. It deletes the polygons but leaves their vertices in the vertex array. The resulting meshes are “sparse,” and this was fouling up the bucket sorter. The fix was simply to size the temporary index-mapping array to be one larger than the greatest index value seen in the input mesh.
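In code form, the remapping step with the fix applied looks roughly like this (illustrative names, not the actual sorter):

```c
/* Build a table mapping old vertex indices to new, compacted ones; entries
 * left at -1 were never referenced by the surviving polygons. The table is
 * sized from the largest index actually referenced, plus one, so sparse
 * meshes (polygons culled, vertices left behind) no longer overflow it. */
#include <stdint.h>
#include <stdlib.h>

static int32_t *build_vertex_remap(const int32_t *indices, size_t index_count,
                                   size_t *out_new_vertex_count)
{
    int32_t max_index = -1;
    for (size_t i = 0; i < index_count; ++i)
        if (indices[i] > max_index)
            max_index = indices[i];

    /* The old code sized this as index_count; the fix is max_index + 1. */
    size_t table_size = (size_t)max_index + 1;
    int32_t *remap = malloc(table_size * sizeof *remap);
    for (size_t i = 0; i < table_size; ++i)
        remap[i] = -1;

    int32_t next = 0;
    for (size_t i = 0; i < index_count; ++i)
        if (remap[indices[i]] < 0)
            remap[indices[i]] = next++;   /* first time we see this vertex */

    *out_new_vertex_count = (size_t)next;
    return remap;
}
```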

Some ideas

Graphics-related ideas I’m hoping to pursue in the future (recorded here in public to establish priority):

  • A standard interface for shader reflectance models. This will operate at a higher level than shading languages. Rather than being a full programming language, it will be a simple “API” containing a handful of parameters (diffuse/specular intensities and colors). The purpose of this is to allow a single library of surfaces to operate across very different renderers, especially between RenderMan-style offline renderers and GPU-based real-time renderers. The idea is to cleanly separate surface reflectance functions from the overall light transport framework. This way artists only need to design surfaces once, and they will look roughly the same regardless of the particular light transport method used (e.g., OpenGL local lighting vs. ray casting vs. full global illumination).
  • A “quick turnaround” 3D animation pipeline. The idea is to intentionally sacrifice a small amount of visual quality in return for massive gains in production speed (and thus significantly lower costs). This could be accomplished by using motion capture for character animation, coupled to a GPU-based rendering system, likely based on an existing game engine (e.g., the Unreal engine, Crytek’s CryEngine, or id’s Rage engine).
  • A unified lighting and compositing system that uses micropolygon grids as the medium for exchanging data, rather than deep framebuffers. Pushing final shading and rasterization of micropolygon grids out of the renderer and into the compositor will allow some interesting tricks, like greater shader flexibility and quicker re-rendering at the compositing stage.
  • A distributed digital media production system based on a commodity cloud-computing platform like Amazon’s EC2. This would include a set of tools to synchronize files efficiently over low-bandwidth links between computers in the cloud and artists’ workstations. Ideally this would tie in to a production asset management/scheduling system. It would also include tools to control render nodes in the cloud, and collect and display computed work products.

The Magic of EVE Online

“In Eve you are allowed to make a difference, even if you do it the wrong way, even if it isn’t in the script. One turncoat was able to disband one of the oldest and most powerful alliance in the game, putting them into disarray long enough for their enemies to come and take everything they had built — and the developers let it happen. One guy can pull off an amazing heist worth thousands of real life dollars — and the developers will let it happen. One guy can forget to put money in the right wallet before he goes on his honeymoon leading to one of the most powerful alliances losing their key systems and throwing them into disarray long enough for their enemies to come and take everything they had built — and the developers will let it happen.

As long as you don’t hack other players’ accounts, almost anything goes. You probably won’t ever do anything particularly noteworthy, but you might, and if you do it will be allowed to happen. You will be allowed to change the world. You can’t get that anywhere else. People get all mad at CCP and go looking for other games, not even space games or science fiction games, just a persistent world to play in, and nothing else comes close. Eve is a terrible game. Every other game is worse.”

— Post by “Angela Christine” on SomethingAwful.com

Transformers: Dark of the Moon

Reactions:
  • The writing was significantly better than in the last Transformers film, especially throughout the first half – good balance and pacing between the action and the set-up of the various characters.
  • However, the transition into the Chicago finale was very awkward. I think it was just too hard to align all the different subplots that needed to be tied up. Maybe one could have been jettisoned, like the unlikable Dylan character?
  • Fantastic VFX, especially considering the gigantic scale of the film. Highlights:
    • Beautiful, readable lighting – it was much easier to follow what was going on in most action shots as compared to the earlier films, thanks to well-choreographed timing and composition. And it looked great too, especially the Cybertron scenes.
    • Slomo-actors-flying-through-the-air shots seemed to be the main “innovation,” and were done well.
    • Very little “animaticitis” – big things moved at the right speed for the most part, like the big dropships over Chicago.
    • Some comps seemed a little off, probably due to time constraints – e.g., the “worm” Decepticon plowing through the long factory building in the intro didn’t interact enough with the outer walls. Some lens flares and lens dirt elements also went over the top.
    • Volumetric fire/smoke elements didn’t always comp over the backgrounds realistically. I wonder if this is a consequence of stereo 3D making comp tricks harder to pull off?
  • Fun cameos by Buzz Aldrin, Kennedy Space Center, and tons of other NASA and military sites. (shut down your logical brain and just enjoy the spectacle)
  • “This is the episode where Spock goes insane.” — wow, I should have caught that!

I saw the film on celluloid in 2D; I’ll catch a digital 3D screening to compare when I get a chance. I have to say, though, that I don’t think I was missing much in 2D.

Why did I get an MBA? Read this article

An interesting article in New York magazine discusses the “bamboo ceiling” Asian-Americans face in U.S. workplaces. I think the same discussion applies to academic hotshots of any color. This is a pretty accurate explanation of why I chose to get an MBA instead of furthering my technical skills with a science degree.


“In order to succeed, you have to understand which rules you’re supposed to break. If you break the wrong rules, you’re finished. And so the easiest thing to do is follow all the rules. But then you consign yourself to a lower status. The real trick is understanding what rules are not meant for you.”


An important point is that while lower-level jobs tend to be fairly meritocratic, many other factors start entering the picture as you rise into management – IQ becomes less relevant and EQ becomes essential. But, paying attention in school and doing what you’re told won’t prepare you for this, especially if you aren’t raised in an environment that emphasizes leadership and assertiveness.

Having the go-home option

In maintaining my digital systems, one principle that has served me well is making sure there’s a quick way to work around any potential point of failure. For example, by hosting your domain’s DNS records at a separate company from your web host, you have the option to cut over to a different provider in the event that something goes wrong (and do it immediately, rather than waiting hours or days for your failing web host to give up control of your DNS settings).

Here are some ways to avoid single points of failure:

  • Separate DNS hosting is really useful. I use EasyDNS. It costs slightly more than letting your web hosting company handle DNS, but having the option to cut the web host out instantly if something goes wrong is essential.
  • Be careful of cloud-based services that don’t provide a true “export” option that dumps ALL of your data in a non-proprietary format. For example, while Google Calendar allows you to dump all of your appointments to an iCal file, there is no way to extract all of your emails from GMail in a common format. Nor is there a way to batch-download comments or photos from Facebook. Clearly this is to the benefit of the providers because it causes lock-in. But as a user, be aware that your data can be held hostage at any time. (this is why I’m starting to use this blog more instead of posting updates on Facebook – I’m getting more worried about the lack of backup and search options over there).
  • When performing maintenance on any computer system, always ALWAYS make sure you’ve got a working, up-to-date backup that you can cut over to in case something goes wrong. Fortunately, virtualization and cloud hosting are making this much easier to accomplish than it used to be.
  • Anyone who’s worked with Maya extensively has horror stories about corrupt .mb files. Despite the waste of space that saving in ASCII format entails, it’s much safer to have your production data in a form that you can hack or fix with a text editor.
  • On that subject, ALWAYS keep backup copies of Final Cut Pro files. I’ve had a few bad experiences where FCP decided to “eat” one of my .fcp files, leaving it in an unreadable state. (this is why FCP gets an asterisk in my list of trusted tools)

Moving data to the cloud

I’ve already moved my public internet servers over to Amazon’s EC2 cloud, and am now planning how to use it for backing up my important data. Later I might even move my primary file server into AWS.

Let’s consider how I might set up a data backup system. I have about 100GB of “core” files I need to protect. This includes things like email, documents, 3D production data, source files for websites, custom software, and Linux system images. The biggest uses of space are texture maps and some large datasets for my 3D work. The 100GB figure does not include some massive things I won’t move to the cloud yet, like finished animations (~200GB in compressed video; ~2TB of raw frames) or my personal media collection (a few tens of GB).

Amazon gives you two ways to store data in their cloud: first, there’s S3, the classic heavy-duty data repository. S3 is supposed to be extremely reliable, but you can’t use it like a filesystem. It’s designed for uploading and downloading large “chunks” of data. Second, you can store files on an EBS volume, which is somewhat like a physical hard drive in one of Amazon’s data centers. You can put a regular filesystem on EBS, but it’s not designed to be as reliable as S3, and it can only be accessed or served out through a running EC2 instance.

I think S3 makes most sense for backup purposes, although I don’t want to use any of the hacks that make it appear like a filesystem. I also don’t want to upload all 100GB as a monolithic glob. I think I will divide my core data into smaller chunks, say around 10GB each, of related data (e.g., system software, email, per-project 3D source files, etc). Then I’ll upload each chunk to S3 as a single archive. This makes partial backups/restores easier, and seems to fit best with how S3 is designed to operate.

Long-term, I will probably end up using an EC2 instance as my main file server, so I’ll have to store things in a regular filesystem on EBS. In this case I’ll still use S3 for “offline” storage, while keeping a smaller set of “online” data in EBS. This is just like an old-fashioned system of on-line/near-line storage areas.

One drawback to this arrangement is that Amazon will end up double- or triple-charging you for data storage: once for the S3 backup, once for the EBS copy, and again for any EBS snapshot images. So I’ll probably end up paying more like $0.30-$0.50/GB per month for this setup. Still, the cost is quite reasonable compared to the depreciation on local hardware, plus the headaches of maintaining the system myself.

Sony: Not the Greatest Brand

Campaign magazine just published a survey that indicated Sony was the top brand in Asia (http://www.bbc.co.uk/news/business-14009880). My perspective on them is quite different: to me, it’s a negative brand. When I hear “Sony” I think:

  • Too many proprietary interfaces and file formats, in everything from consumer to pro video equipment
  • Flimsy VAIO laptops and PCs loaded with shovelware
  • The PlayStation 3, a monument to complacency and hubris
  • Poorly-managed online games (Star Wars NGE, anybody?)
  • edit: how could I forget about the DRM rootkit on Sony audio CDs?

I will never buy Sony video hardware. There are just too many instances of “Super XDCAM PRO II EX” formats that are “really just motion JPEG, but changed just enough so that none of your existing software tools work with it.”

Engineer: “Listen, I know this sounds crazy, but wouldn’t it be cool if our awesome new HD camera actually made, you know, Quicktime movies?”

Boss: “A camera that records in a non-proprietary format? Come on now, you’ve got to be kidding. Go back to your desk and design us a proprietary variant of MPEG-4, then hire some fly-by-night sweatshop off elance.com to develop us a crummy Windows driver for it. That’s The Sony Way!”

Thank goodness somebody at Canon had the INSANE idea of making a camera that actually records in a directly editable format. They rightfully deserve to own the video hardware market.

Now, how about some great Taiwanese brands? ASUS and Giant Bicycles are known world-wide, 85 Degrees is getting big in China now, and domestically Uni-President and Eslite go a long way…