Moving data to the cloud

I’ve already moved my public internet servers over to Amazon’s EC2 cloud, and I’m now planning how to use AWS for backing up my important data. Later I might even move my primary file server there.

Let’s consider how I might set up a data backup system. I have about 100GB of “core” files I need to protect. This includes things like email, documents, 3D production data, source files for websites, custom software, and Linux system images. The biggest uses of space are texture maps and some large datasets for my 3D work. The 100GB figure does not include some massive things I won’t move to the cloud yet, like finished animations (~200GB in compressed video; ~2TB of raw frames) or my personal media collection (a few tens of GB).

Amazon gives you two ways to store data in their cloud. First, there’s S3, the classic heavy-duty data repository. S3 is supposed to be extremely reliable, but you can’t use it like a filesystem: it’s designed for uploading and downloading large “chunks” of data as whole objects. Second, you can store files on an EBS volume, which is somewhat like a physical hard drive in one of Amazon’s data centers. You can put a regular filesystem on EBS, but it’s not designed to be as reliable as S3, and it can only be accessed or served out while attached to a running EC2 instance.

I think S3 makes the most sense for backup purposes, although I don’t want to use any of the hacks that make it appear like a filesystem. I also don’t want to upload all 100GB as a monolithic blob. Instead, I think I will divide my core data into smaller chunks of related data, say around 10GB each (e.g., system software, email, per-project 3D source files). Then I’ll upload each chunk to S3 as a single archive. This makes partial backups and restores easier, and seems to fit best with how S3 is designed to operate.
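
Here is a rough sketch of the "one archive per chunk" idea in Python, using only the standard library. The chunk names and paths in CHUNKS are made-up placeholders, not my real layout, and the helper names (build_archive, build_all) are just for illustration. The upload step is left out on purpose.

```python
import tarfile
from pathlib import Path

# Hypothetical grouping of related data into chunks.
# Names and paths are placeholders for illustration only.
CHUNKS = {
    "email": ["~/Mail"],
    "websites": ["~/src/websites"],
    "project-a-3d": ["~/3d/project-a/source"],
}


def build_archive(name, sources, out_dir):
    """Pack the given files/directories into out_dir/<name>.tar.gz.

    Returns the path of the archive that was written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    archive_path = out_dir / f"{name}.tar.gz"
    with tarfile.open(str(archive_path), "w:gz") as tar:
        for src in sources:
            src = Path(src).expanduser()
            # Store each source under its own top-level name in the archive,
            # so a restore does not recreate the full absolute path.
            tar.add(str(src), arcname=src.name)
    return archive_path


def build_all(chunks, out_dir):
    """Build one archive per chunk; returns {chunk name: archive path}."""
    return {
        name: build_archive(name, sources, out_dir)
        for name, sources in chunks.items()
    }
```

Calling build_all(CHUNKS, "/tmp/backup-staging") would leave one .tar.gz per chunk in the staging directory, and each of those can then be pushed to S3 as a single object with whatever upload tool you prefer. Restoring one project means fetching and unpacking just that one archive.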

Long-term, I will probably end up using an EC2 instance as my main file server, so I’ll have to store things in a regular filesystem on EBS. In this case I’ll still use S3 for “offline” storage, while keeping a smaller set of “online” data in EBS. This is just like an old-fashioned split between online and near-line storage areas.

One drawback to this arrangement is that Amazon will end up double- or triple-charging you for data storage: once for the S3 backup, once for the EBS copy, and again for any EBS snapshot images. So I’ll probably end up paying more like $0.30 to $0.50 per GB per month for this setup, rather than the price of a single copy. Still, the cost is quite reasonable compared to the depreciation on local hardware, plus the headaches of maintaining the system myself.
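
To make the double- and triple-charging concrete, here is a small back-of-the-envelope calculator. The per-GB rates in RATES are assumed placeholders that I picked to land inside the range above; they are not quoted Amazon prices, so plug in the current numbers before trusting the result. The function names are likewise just for illustration.

```python
# Assumed monthly storage rates in USD per GB.
# These are placeholders for illustration, not quoted AWS prices.
RATES = {
    "s3_backup": 0.15,
    "ebs_volume": 0.10,
    "ebs_snapshot": 0.15,
}


def monthly_cost(gb_by_tier, rates=RATES):
    """Total monthly storage bill, given GB stored in each tier."""
    return sum(gb * rates[tier] for tier, gb in gb_by_tier.items())


def effective_rate_per_gb(gb_by_tier, unique_gb, rates=RATES):
    """Monthly cost divided by the amount of *unique* data protected."""
    return monthly_cost(gb_by_tier, rates) / unique_gb


# Example: the same 100GB kept in all three places.
triple_copy = {"s3_backup": 100, "ebs_volume": 100, "ebs_snapshot": 100}
```

With these assumed rates, effective_rate_per_gb(triple_copy, unique_gb=100) works out to roughly $0.40 per GB per month, because every unique gigabyte is billed three times. Dropping the snapshot copy, or keeping only a subset of the data online in EBS, pulls the figure back down.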
