22 May, 2008
Derek Gottfrid and his colleagues at the New York Times have obviously been having a lot of fun with Amazon EC2.
Their latest offering is the TimesMachine. Print subscribers can access any issue of the New York Times, dating back to Volume 1, Number 1 in 1851. Non-subscribers can take a peek at 6 different (and historically significant) issues, including the inaugural edition, the end of World War I, and the sinking of the Titanic.
As they explained in their blog post, they used EC2, Hadoop, and some of their own code to convert 405,000 large TIFF images, 3.3 million SGML files, and 405,000 XML files to 810,000 PNG images and 405,000 JavaScript files. This didn't take all that long:
"By leveraging the power of AWS and Hadoop, we were able to utilize hundreds of machines concurrently and process all the data in less than 36 hours."
The content itself is really interesting, but I also enjoyed being able to see the articles in the context of the other issues of the day. The advertising is interesting as well.
Robert Scoble has more coverage, including a video interview with Derek.
-- Jeff;
21 May, 2008
I will be participating in a webinar on Thursday, May 22nd at 11 AM PST. Hosted by AWS user Vertica, the webinar will cover Vertica's cloud-based approach to analytic data management. The webinar is free, but you will have to register in advance if you would like to attend.
The Vertica Analytic Database runs on Amazon EC2 and S3 and is hosted completely within the Amazon cloud. Using this approach, they are able to smoothly scale to meet large and complex workloads, while also supporting automatic replication, failover, and recovery.
Unlike a traditional database install where you would have to pay for a data center, hardware, software, and administrators before you can store a single row, the Vertica solution is priced on a per-month, per-node basis. New nodes are available 30 minutes after receipt of order! In true cloud-based fashion, payment is handled through the Amazon Flexible Payments Service.
A traditional relational database takes a row-oriented approach to data storage: a fixed or variable block of contiguous space is allocated to each row. Vertica, by contrast, takes a column-oriented approach (hence the company name), grouping the data by column instead of by row. This opens the door to many types of optimizations. Processing a single column of a table that has a large number of rows becomes very efficient, as does compression. Benchmarks indicate that this approach can be 30 to 200 times faster.
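To make the row-versus-column distinction a little more concrete, here is a minimal Python sketch. It is only an illustration of the general idea, not Vertica's actual storage format or API. The sample table, the column names, and the helper functions `to_columns` and `run_length_encode` are all made up for this example, and run-length encoding stands in for whatever compression a real column store might use.

```python
from itertools import groupby

# Hypothetical sample table, stored row by row: (order_id, region, amount).
rows = [
    (1, "east", 10.0),
    (2, "east", 12.5),
    (3, "east", 7.5),
    (4, "west", 20.0),
    (5, "west", 5.0),
]


def to_columns(rows, names):
    """Regroup row tuples into one list per column (a toy column layout)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Collapse consecutive repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows, ["order_id", "region", "amount"])

# Row-oriented scan: every full row is visited just to read one field.
total_from_rows = sum(row[2] for row in rows)

# Column-oriented scan: only the "amount" column is read.
total_from_columns = sum(columns["amount"])

# Values of one column sit next to each other, so repeats compress well.
encoded_region = run_length_encode(columns["region"])

print(total_from_rows, total_from_columns)
print(encoded_region)
```

Both scans produce the same total, but the column-oriented one never looks at `order_id` or `region`. In this toy data the five `region` values shrink to two (value, count) pairs, which hints at why storing similar values together tends to help compression.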
I will be speaking about EC2 and S3 and about some general cloud computing concepts. I hope that you can join in.
-- Jeff;