April 4, 2009 1:53 PM PDT

Internal cloud's big test: Amazon vs. Cloudera

by James Urquhart

The debate about the validity of internal cloud implementations has raged for some time now, with some claiming that cloud computing and wholly owned infrastructure don't mix, and others pointing out that applying "on demand," "at scale," and "multitenant" to enterprise IT data centers offers unique advantages to those who have already made that investment. It has been difficult, however, to make an objective comparison of the two approaches--until now.

The announcement on Thursday of Amazon's new Hadoop-based Elastic MapReduce service, combined with the introduction of a commercial Hadoop distribution from start-up Cloudera, means that we finally have a reasonable way to watch which direction enterprise IT prefers. Let me explain.

Amazon's service is a simplified, prepackaged Hadoop implementation that can be leveraged by anyone with an Amazon account. The Amazon Web Services blog describes it as follows:

Today we are rolling out Amazon Elastic MapReduce. Using Elastic MapReduce, you can create, run, monitor, and control Hadoop jobs with point-and-click ease.

You don't have to go out and buy scads of hardware. You don't have to rack it, network it, or administer it. You don't have to worry about running out of resources or sharing them with other members of your organization. You don't have to monitor it, tune it, or spend time upgrading the system or application software on it.

You can run world-scale jobs anytime you would like, while remaining focused on your results. Note that I said jobs (plural), not job. Subject to the number of EC2 (Elastic Compute Cloud) instances you are allowed to run, you can start up any number of MapReduce jobs in parallel. You can always request an additional allocation of EC2 instances here.

Processing in Elastic MapReduce is centered around the concept of a Job Flow. Each Job Flow can contain one or more steps. Each step inhales a bunch of data from Amazon S3, distributes it to a specified number of EC2 instances running Hadoop (spinning up the instances if necessary), does all of the work, and then writes the results back to S3.

Each step must reference application-specific "mapper" and/or "reducer" code (Java JARs or scripting code for use via the Streaming model). We've also included the Aggregate Package with built-in support for a number of common operations such as Sum, Min, Max, Histogram, and Count. You can get a lot done before you even start to write code!
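To make the Streaming model mentioned in that description more concrete, here is a minimal sketch of a "mapper" and "reducer" pair in Python. It assumes a simple word-count job; the function names map_lines and reduce_records are hypothetical, and a local sort stands in for the shuffle that Hadoop performs between the two phases. It is an illustration of the tab-separated record convention used by Hadoop Streaming, not Amazon's or Cloudera's own sample code.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one "word<TAB>1" record for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_records(records):
    """Reducer: sum the counts for each key.

    Assumes records arrive sorted by key, which is what the framework
    guarantees to a streaming reducer.
    """
    parsed = (record.rstrip("\n").split("\t", 1) for record in records)
    for key, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


def run_local(lines):
    """Simulate map -> sort (shuffle) -> reduce on one machine."""
    return list(reduce_records(sorted(map_lines(lines))))
```

In a real streaming job, each function would typically live in its own script that loops over standard input and prints its records to standard output; run_local exists only so the logic can be checked on a laptop before any instances are started.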

Cloudera, on the other hand, provides a Hadoop build that you can deploy wherever you wish:

Cloudera's Distribution for Hadoop is based on the most recent stable version of Apache Hadoop. It includes some useful patches back-ported from future releases, as well as improvements we have developed for our support customers.

Cloudera's Distribution includes everything you need to configure and deploy Hadoop using standard Linux system administration tools.

Here's what I'm thinking: enterprise IT is looking at an entirely new class of applications that take advantage of MapReduce to process very large sets of both structured and unstructured data for things like predictive analysis, sorting/sequencing, and data mining. Both commercial Hadoop offerings meet the demand for a platform to simplify the development and operation of these applications. The primary difference is the where, not so much the what.

That is exactly what will make the competition between the two offerings so compelling to watch. Let me break it down for you:

  1. Will the requirement to own and operate hardware work against Cloudera? What makes the Amazon offering so groundbreaking (and it will prove to be historic, in my opinion) is that it is now possible for anyone with a need to analyze large data sets to do so simply for the cost of data storage plus processing time. (Note that Elastic MapReduce adds a nominal charge on top of the cost of the EC2 instances that run the jobs.)

    Where "grid computing" was once the playground of large enterprises and academic institutions that could afford the hardware and justify the cost of building out a grid, Amazon makes it possible for even individuals to run such jobs for a few tens or hundreds of dollars.

    Cloudera, on the other hand, requires that the hardware be available to install it on. That either means existing server capacity, new hardware (which greatly adds to the cost, and can only be justified for continuous Hadoop use), or leased capacity. The latter starts to look a lot like Amazon's service.

  2. Will Amazon's requirement to use S3 work against it? There are three reasons why I think it might:

    • The commonly cited concern about data security outside of corporate firewalls. (Even if the perception is wrong, the perception exists.)
    • The cost of data transfer to and from the S3 service--currently as high as 17 cents per gigabyte.
    • The cost of storage of both the raw data and the aggregate results--currently as high as 15 cents per gigabyte a month.

    It should be noted that if you already rely on S3 to store the data sets to be processed, this is a great deal. However, if you have to upload terabytes or even petabytes of data to be combed through by MapReduce, then the transfer alone could get quite pricey, and existing infrastructure might serve the purpose well. If you are going to leave the data up there permanently--and update it regularly--the cost of Amazon's service should be weighed against the cost of owning and operating that storage yourself in your existing facilities.

  3. Will the so-called "barrier of exit" stand up? I'm not even arguing that the choice will be based solely on the comparative costs to the business. In fact, what I am interested in is the extent to which business units and departments will simply bypass IT altogether to build and run their own jobs in Amazon Elastic MapReduce.

    If IT maintains a valuable service using existing facilities and computing investments, then Cloudera will likely do fine. If not, then Amazon stands to be the overwhelmingly dominant commercial Hadoop implementation.
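The S3 cost concern in point 2 can be made concrete with a little arithmetic. The sketch below is a rough estimate only: it uses the "as high as" rates quoted above (17 cents per gigabyte transferred, 15 cents per gigabyte per month stored), assumes flat pricing with no volume tiers, and leaves out EC2 and Elastic MapReduce charges entirely. The function name s3_cost_estimate is hypothetical.

```python
# Rates quoted in the article; real pricing is tiered and changes over time.
TRANSFER_USD_PER_GB = 0.17        # assumed flat rate per gigabyte moved
STORAGE_USD_PER_GB_MONTH = 0.15   # assumed flat rate per gigabyte per month


def s3_cost_estimate(transferred_gb, stored_gb, months):
    """Return a rough S3 cost breakdown in US dollars.

    transferred_gb: total gigabytes moved to or from S3
    stored_gb:      gigabytes kept in S3 (raw data plus results)
    months:         how long the data stays in S3
    """
    transfer = transferred_gb * TRANSFER_USD_PER_GB
    storage = stored_gb * STORAGE_USD_PER_GB_MONTH * months
    return {
        "transfer": round(transfer, 2),
        "storage": round(storage, 2),
        "total": round(transfer + storage, 2),
    }


if __name__ == "__main__":
    # Example: move 1,000 GB once and keep it in S3 for a year.
    print(s3_cost_estimate(transferred_gb=1000, stored_gb=1000, months=12))
```

Under these assumptions, a one-time transfer is a small part of the bill compared with a year of storage, which is why the comparison with storage you already own matters most for data that stays in S3 permanently.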

I should also note that running a Hadoop instance is not the same thing as cloud computing in and of itself. An internal Cloudera implementation is not necessarily an internal cloud, though if operated "on demand," "at scale," and with multitenancy, it certainly qualifies as a cloud.

I will be watching this space closely for the next year or two. I have a feeling that Amazon will do fine, regardless, as there are many possible implementations that would benefit from a completely public cloud implementation. The real test is probably how much opportunity Cloudera finds within enterprise data centers.

Cloudera also has much more competition from the free downloads of Hadoop than Amazon has, in my opinion, as it faces a more traditional open-source competitive landscape.

Is your company looking at MapReduce for a new generation of data-mining applications? If so, what will you choose: the public, external cloud implementation of Hadoop from Amazon Web Services, or the wholly owned, internal implementation of the same from Cloudera?

James Urquhart is a seasoned field technologist with almost 20 years of experience in distributed systems development and deployment, focusing on service-oriented architectures, cloud computing, and virtualization. James is currently market manager for the Data Center 3.0 strategy at Cisco Systems, though the opinions expressed here are strictly his own. He is a member of the CNET Blog Network and is not an employee of CNET.
6 Comments
by BlitzBoy1120 April 4, 2009 3:19 PM PDT
Is it me, or is Amazon starting to do everything?
by marvin25 April 4, 2009 6:37 PM PDT
My one question is how we can do cloud computing when we don't have enough capacity on the Internet right now. We are short of capacity, and this is why the economy is going downhill. These added requirements can't be handled with the current capacity on the Internet, so we are talking about something for the future, when that capacity exists. The data streams required can't be handled with current capacity at all, or is it your goal to take down the Internet completely? All the major nodes are working at maximum capacity, and additional bandwidth is absorbed as fast as any one ISP adds it. Come back in a year for an update on capacity. This is really an idea that can't take place under current conditions.
by jhammerb April 4, 2009 7:11 PM PDT
Hey James,

Thanks for the analysis. One addition: you don't need to own any hardware to run Cloudera's Distribution for Hadoop. We have an AMI for our distribution and a set of supporting scripts to get moving with our distribution on EC2, documented at http://www.cloudera.com/hadoop-ec2.

In addition, one of Cloudera's engineers, Tom White, wrote the code that enabled Hadoop clusters to be deployed on EC2 and read their data from S3 (http://issues.apache.org/jira/browse/HADOOP-930). This code underlies Amazon's current implementation of Elastic MapReduce.

We're pretty excited to see Amazon lowering the barrier for adoption of Hadoop even further!

Later,
Jeff
by jamesurquhart April 4, 2009 8:00 PM PDT
Fair enough. I should have added that Cloudera will do business on EC2 as well as in internal clouds. However, that would in many ways be an extension of the public cloud approach, not an alternative to Amazon's implementation. That being said, it's good to see Cloudera leverage as many opportunities as possible and hedge its bets.

Congrats to Tom White for his possibly historically important contribution.
by cloudment April 5, 2009 3:08 AM PDT
Just want to add that Amazon slashed S3 transfer prices to 3 cents per GB/month. This makes it even more attractive to a vast audience, and I hope it will also help grow the user base of the CloudBerry Explorer for S3 freeware. http://cloudberrylab.com/
by asmann25 April 5, 2009 9:14 AM PDT
This is a great discussion. I would like to add that Aster Data Systems provides an enterprise-class implementation of MapReduce. It runs inside Aster's massively parallel relational database for large-scale analytics. Read here for more details:

http://www.asterdata.com/blog/index.php/2009/04/02/enterprise-class-mapreduce/