The Wisdom of Clouds

April 4, 2009 1:53 PM PDT

Internal cloud's big test: Amazon vs. Cloudera

by James Urquhart

The debate about the validity of internal cloud implementations has raged on for some time now, with some claiming that cloud computing and wholly owned infrastructure don't mix, and others pointing out that applying "on demand," "at scale," and "multitenant" to enterprise IT data centers offers unique advantages to those who have already made that investment. It has been difficult, however, to do an objective comparison of the two approaches--until now.

The announcement on Thursday of Amazon's new Hadoop-based Elastic MapReduce service, combined with the introduction of a commercial Hadoop distribution from start-up Cloudera, means that we finally have a reasonable means of watching which direction enterprise IT prefers. Let me explain.

Amazon's service is a simplified, prepackaged Hadoop implementation that can be leveraged by anyone with an Amazon account. The Amazon Web Services blog describes it as follows:

Today we are rolling out Amazon Elastic MapReduce. Using Elastic MapReduce, you can create, run, monitor, and control Hadoop jobs with point-and-click ease.

You don't have to go out and buy scads of hardware. You don't have to rack it, network it, or administer it. You don't have to worry about running out of resources or sharing them with other members of your organization. You don't have to monitor it, tune it, or spend time upgrading the system or application software on it.

You can run world-scale jobs anytime you would like, while remaining focused on your results. Note that I said jobs (plural), not job. Subject to the number of EC2 (Elastic Compute Cloud) instances you are allowed to run, you can start up any number of MapReduce jobs in parallel. You can always request an additional allocation of EC2 instances here.

Processing in Elastic MapReduce is centered around the concept of a Job Flow. Each Job Flow can contain one or more steps. Each step inhales a bunch of data from Amazon S3, distributes it to a specified number of EC2 instances running Hadoop (spinning up the instances if necessary), does all of the work, and then writes the results back to S3.

Each step must reference application-specific "mapper" and/or "reducer" code (Java JARs or scripting code for use via the Streaming model). We've also included the Aggregate Package with built-in support for a number of common operations such as Sum, Min, Max, Histogram, and Count. You can get a lot done before you even start to write code!
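
To make the "mapper" and "reducer" idea in Amazon's description a little more concrete, here is a minimal Python sketch of a word count written in the style of the Streaming model. It is not Amazon's code: the function names are hypothetical, and it assumes the tab-separated key/value lines that Hadoop Streaming conventionally passes between stages. In a real job the mapper and reducer would be separate scripts reading standard input; here a small local helper stands in for the framework's sort-and-shuffle step.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper (hypothetical name): emit one "word<TAB>1" line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_pairs(sorted_lines):
    """Reducer (hypothetical name): sum counts per key.

    Assumes the input is already sorted by key, as the framework
    would normally guarantee between the map and reduce stages.
    """
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Local stand-in for the framework: map, sort by key, then reduce."""
    return list(reduce_pairs(sorted(map_lines(lines))))


if __name__ == "__main__":
    for result in run_locally(["the cloud", "The internal cloud"]):
        print(result)
```

The point of the split is that neither function knows how many machines are involved; the service decides how to distribute the input and collect the output.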

Cloudera, on the other hand, provides a Hadoop build that you can deploy wherever you wish:

Cloudera's Distribution for Hadoop is based on the most recent stable version of Apache Hadoop. It includes some useful patches back-ported from future releases, as well as improvements we have developed for our support customers.

Cloudera's Distribution includes everything you need to configure and deploy Hadoop using standard Linux system administration tools.

Here's what I'm thinking: enterprise IT is looking at an entirely new class of applications that take advantage of MapReduce to process very large sets of both structured and unstructured data for things like predictive analysis, sorting/sequencing, and data mining. Both commercial Hadoop offerings meet the demand for a platform to simplify the development and operation of these applications. The primary difference is the where, not so much the what.

That is exactly what will make the competition between the two offerings so compelling to watch. Let me break it down for you:

  1. Will the requirement to own and operate hardware work against Cloudera? What makes the Amazon offering so groundbreaking (and it will prove to be historic, in my opinion) is that it is now possible for anyone with a need to analyze large data sets to do so simply for the cost of data storage plus processing time. (Note that the use of Elastic MapReduce adds a nominal cost to the server instances that host the jobs.)

    Where "grid computing" was once the playground of large enterprises and academic institutions that could afford the hardware and justify the cost of building it out, Amazon makes it possible for even individuals to run such jobs for a few tens or hundreds of dollars.

    Cloudera, on the other hand, requires that the hardware be available to install it on. That either means existing server capacity, new hardware (which greatly adds to the cost, and can only be justified for continuous Hadoop use), or leased capacity. The latter starts to look a lot like Amazon's service.

  2. Will Amazon's requirement to use S3 work against it? There are three reasons why I think it might:

    • The commonly cited concern about data security outside of corporate firewalls. (Even if the perception is wrong, the perception exists.)
    • The cost of data transfer to and from the S3 service--currently as high as 17 cents per gigabyte a month.
    • The cost of storage of both the raw data and the aggregate results--currently as high as 15 cents per gigabyte a month.

    It should be rightly noted that if you already rely on S3 to store your data sets to be processed, this is a great deal. However, if you have to upload terabytes or even petabytes of data to be combed through by MapReduce, then this could get quite pricey on its own, and existing infrastructure might serve the purpose well. If you are going to leave the data up there permanently--and update it regularly--the cost of Amazon's service should be weighed against the cost of owning and operating that storage yourself in your existing facilities.

  3. Will the so-called "barrier of exit" stand up? I'm not even arguing that the choice will be based solely on the comparative costs to the business. In fact, what I am interested in is the extent to which business units and departments will simply bypass IT altogether to build and run their own jobs in Amazon Elastic MapReduce.

    If IT maintains a valuable service using existing facilities and computing investments, then Cloudera will likely do fine. If not, then Amazon stands to be the overwhelmingly dominant commercial Hadoop implementation.
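
As a rough illustration of the S3 cost question raised in the second point, here is a back-of-the-envelope Python sketch. It uses only the upper-bound figures quoted above (17 cents per gigabyte for transfer, 15 cents per gigabyte a month for storage) and makes several simplifying assumptions: the data is uploaded once, the transfer charge is applied once per gigabyte, rates are flat with no volume tiers, and EC2 and Elastic MapReduce charges are ignored. The function and constant names are hypothetical.

```python
# Upper-bound rates quoted in the article, kept in whole cents so the
# arithmetic stays exact. Real pricing is tiered; flat rates are an assumption.
TRANSFER_CENTS_PER_GB = 17
STORAGE_CENTS_PER_GB_MONTH = 15


def estimate_s3_cost(data_gb, months):
    """Estimate dollars spent moving data_gb into S3 once and keeping it
    there for the given number of months (compute charges are ignored)."""
    transfer_cents = data_gb * TRANSFER_CENTS_PER_GB
    storage_cents = data_gb * STORAGE_CENTS_PER_GB_MONTH * months
    return {
        "transfer": transfer_cents / 100,
        "storage": storage_cents / 100,
        "total": (transfer_cents + storage_cents) / 100,
    }


if __name__ == "__main__":
    # 10 TB, counted here as 10,000 GB, stored for one year.
    print(estimate_s3_cost(10_000, 12))
```

Under these assumptions, 10 terabytes kept for a year comes to $1,700 in transfer plus $18,000 in storage, or $19,700 before any processing is paid for. That is the number to weigh against storage you already own.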

I should also note that running a Hadoop instance is not the same thing as cloud computing in and of itself. An internal Cloudera implementation is not necessarily an internal cloud, though if operated "on demand," "at scale," and with multitenancy, it certainly qualifies as a cloud.

I will be watching this space closely for the next year or two. I have a feeling that Amazon will do fine, regardless, as there are many possible implementations that would benefit from a completely public cloud implementation. The real test is probably how much opportunity Cloudera finds within enterprise data centers.

Cloudera also has much more competition from the free downloads of Hadoop than Amazon has, in my opinion, as it faces a more traditional open-source competitive landscape.

Is your company looking at MapReduce for a new generation of data-mining applications? If so, what will you choose: the public, external cloud implementation of Hadoop from Amazon Web Services, or the wholly owned, internal implementation of the same from Cloudera?

March 16, 2009 4:00 AM PDT

The three routes to cloud computing's future

by James Urquhart

Ten years after the creation of Salesforce.com, the future of cloud computing is not in doubt; it is just being heavily debated. Two opposing views of how cloud computing will play out--especially enterprise cloud computing--are making the rounds among thought leaders and customer decision makers alike. Interestingly, there is enough to question about both approaches that a third option may, in fact, gain importance.

What all sides agree on, however, is that some form of cloud computing is coming your way. As always, the devil is in the details.

Marc Benioff, Salesforce.com's "pull no punches" supreme leader, represents one of the debate's extremes. At "Whose Cloud is it Anyway?"--a cloud-computing roundtable put on by TechCrunch recently--Benioff stated (the emphasis is mine):

(Microsoft was) a company that...had a lock on the entire industry in terms of innovation, and was able to hold it through a monopoly. So, that is really broken down through a new, next generation paradigm, which is cloud computing; which is no software, no hardware, don't hire anyone, just sign up to these various cloud platforms and pick the flavor that is appropriate for your application.

In other words, it's not cloud computing to Benioff unless the IT department doesn't have to directly handle any form of technology beyond a browser or perhaps an SSH terminal application. This is the very definition one would expect from the leader of possibly the world's biggest software-as-a-service provider.

It is a call to jettison traditional IT altogether, and focus efforts on leveraging the work of professional providers of IT applications, platforms, infrastructure, and services. By this definition, it is indeed a complete change in IT paradigm.

This view is echoed by the current Wikipedia page for cloud computing, as originally authored by Sam Johnston:

Cloud computing is Internet ("cloud") based development and use of computer technology ("computing"). It is a style of computing in which dynamically scalable and often virtualised resources are provided as a service over the Internet. Users need not have knowledge of, expertise in, or control over the technology infrastructure "in the cloud" that supports them.

The 'internal cloud'
At the other end of the spectrum are those who believe the road to cloud computing begins at home. The starting point for any enterprise with existing IT infrastructure investment, according to this camp, is an "internal cloud." An internal cloud applies the concepts of cloud computing (on-demand resources, pay-as-you-go pricing, and the appearance of infinite scalability) to resources wholly owned by the enterprise consuming the service.

There is no doubt that it is a view expressed by much of the traditional IT industry, but there are other voices out there as well pointing out the value of providing multitenant, on-demand, at-scale architectures to internal customers. Internal clouds are appealing to IT departments at many levels, though obviously they are not going to provide the economies of scale that public clouds will offer over time. (For a really good explanation of why large public clouds will dominate the next generation of IT, see the University of California at Berkeley paper titled "Above the Clouds: A Berkeley View of Cloud Computing".)

The strength of the "own nothing" argument is difficult to miss. Benioff put it very well. Don't spend money up front on things that aren't core to your business. Get them as "on-demand" services, instead, and pay for them only as you consume them.

The benefits of internal clouds, however, are a little more subtle. Most proponents will point to the inability of most public clouds to support legacy applications, while internal clouds can be built to handle old and new applications alike. Perhaps the most pervasive argument, however, is that internal clouds allow you to maintain control over security, service levels and regulatory compliance in a way that public clouds are not yet able to offer.

So, what is an enterprise to do? Choosing an "own nothing" approach, like any other paradigm shift, is extremely disruptive and requires a major overhaul or outright replacement of existing IT software assets.

On the other hand, choosing an "internal cloud" approach really doesn't gain the full benefits of public cloud computing offerings. With much smaller scale, the economics are not in the internal cloud's favor. As this year and the next progress, I expect it to become less and less justifiable to rely solely on an internal cloud.

The 'private cloud'
The term "private cloud" is becoming associated with a third option--an option that has fundamental implications for the way in which enterprise customers will approach cloud computing:

A private cloud consists of IT resources under the control of the enterprise consuming them. Those resources may be owned by the enterprise, consumed from a public cloud provider, or some combination of the two. The only requirement is that the resources be under the direct control of the customer under a unified management system, as opposed to each separately consumed offering being individually managed through the interfaces provided by their respective owners.

Many of you may be thinking "hey, that's just the definition of a hybrid cloud", but there is an important, though subtle, distinction to understand.

  • A hybrid cloud is the use of both public and internal cloud capabilities to meet the needs of an application system.
  • A private cloud meets the needs of an application system by any combination of public and internal cloud resources--and that combination can change moment by moment.

Private clouds, by this definition, overcome the "rewrite everything" effect of "own nothing" cloud computing. On the other hand, they provide the degree of trust that enterprises were seeking from internal clouds, including the ability to change the mix of cloud services consumed completely at their own discretion.

In the end, I think the debate will evolve away from "own nothing" vs. "internal clouds", with the latter being replaced by "private clouds." Then, over time, supporters of the "own nothing" vision will come to realize that private clouds give them a direct route to migrating all application workloads from wholly owned infrastructure to public clouds, achieving their vision.

Meanwhile, the enterprise continues to operate with the perception that everything is running in its own data centers, under its complete control. In the end, I think that is the factor that will make private clouds the winning enterprise cloud computing model in the years to come.

So, which is it for you? Will you take Benioff's advice and cease to directly purchase software and hardware? Will you play it conservative and insist on turning your own resources into a cloud before venturing out in force to the public cloud?

Will you leverage both approaches as makes sense, a la David Linthicum's frequent advice? Will you push the boundaries of what you call your IT resources to include third-party services, yet tie it all together within one "trust boundary"? Where do you fall in the great cloud computing debate 10 years after the creation of one of its bellwethers, Salesforce.com?

See also:
Salesforce.com: Pondering the next 10 years
Cloud computing: How we got here



About The Wisdom of Clouds

The Wisdom of Clouds, a CNET Tech blog by James Urquhart, covers cloud computing, virtualization, SaaS, data centers, and much more.
