Software, Interrupted

Read all 'Hadoop' posts in Software, Interrupted
December 2, 2009 4:01 AM PST

Survey: IT's key role in global economic recovery

by Dave Rosenberg

Information technology is expected to play an important part in the global economic recovery, according to a new survey released Wednesday.

Some 72 percent of business and information technology executives say their "organizations place greater value on the IT function today than they did before the economic crisis" and that they "view IT as an important part of their economic recovery efforts," according to Accenture's Global Survey on IT Investments.

This is a familiar sentiment, and one we've heard from United States CIO Vivek Kundra as he has attempted to use IT to kick-start a variety of federal programs intended to set the pace for innovative new uses of technology across the globe.

The results of the Accenture survey are similar to the cautiously optimistic results of last week's Goldman Sachs survey, which suggested that IT spending would trend upward in 2010 and return to pre-recession levels. The majority of countries represented plan to increase investment selectively next year.

2010 IT spending

(Credit: Accenture)

November 9, 2009 8:42 AM PST

Open-source Hadoop powers Tennessee smart grid

by Dave Rosenberg

The Tennessee Valley Authority is the nation's largest public power provider, serving approximately 9 million consumers in seven southeastern states. The organization also happens to be a big supporter of open-source projects, including Hadoop, a tool designed for deep analysis and transformation of very large data sets.

Earlier this year, the Tennessee Valley Authority (TVA) announced that it had open-sourced the system it uses to collect data from smart grid devices called phasor measurement units (PMUs). The data collection system is known in the industry as a Super Phasor Data Concentrator (SuperPDC), and its data can be used to determine the health of a power grid.

The open-source version of the SuperPDC is now called the "OpenPDC." I spoke to both Ritchie Carroll (RC), the project's creator, and Josh Patterson (JP), the person responsible for introducing Hadoop to the project, to discuss what the OpenPDC is and why TVA turned to Hadoop in building the system.

What sort of data volumes are you working with?
RC: Currently there is around 20 TB of archived data. We expect this to grow quickly as a result of the SmartGrid stimulus funding, which includes the addition of 850 phasor measurement devices. This may well grow the archive to half a petabyte within the next few years.

How is this data currently captured and managed? Is any data discarded?
JP: Data is collected directly from field devices at 30 times per second. This data is then time-aligned and processed in real-time--all data gets captured into a binary data file as time-series data for mass processing by Hadoop.

RC: No data is currently discarded. If we get to the point of needing to discard data because of cost, this will be a decision based on the weighed importance of the collected data. It is likely the data around major events will never be deleted because it will always be valuable for future student researchers. There is also value in being able to go back in time and look for newly discovered event signatures to see how long they might have been occurring.
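The time-alignment step described above can be pictured with a short Python sketch. This is only an illustration, not the OpenPDC's actual code: the device names, the sample values, and the `time_align` helper are hypothetical, and the only figure taken from the interview is the rate of 30 samples per second.

```python
from collections import defaultdict

FRAMES_PER_SECOND = 30  # sampling rate quoted in the interview


def frame_index(timestamp_s: float) -> int:
    """Map a timestamp in seconds to the index of the nearest 1/30 s frame."""
    return round(timestamp_s * FRAMES_PER_SECOND)


def time_align(samples):
    """Group (device_id, timestamp_s, value) samples by frame index.

    Samples from different devices that fall into the same frame end up
    side by side, even if they arrived a few milliseconds apart.
    """
    frames = defaultdict(dict)
    for device_id, timestamp_s, value in samples:
        frames[frame_index(timestamp_s)][device_id] = value
    return dict(sorted(frames.items()))


# Hypothetical readings from two made-up devices.
samples = [
    ("pmu_a", 0.0000, 59.98),
    ("pmu_b", 0.0012, 60.01),  # slightly late, but still frame 0
    ("pmu_a", 0.0333, 59.99),
    ("pmu_b", 0.0340, 60.00),
]
aligned = time_align(samples)
```

Each entry in `aligned` is one frame holding the readings of every device that reported in that 1/30-second slot, which is the shape a time-series archive would need before batch processing.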

September 16, 2009 10:35 PM PDT

Want to analyze big data? Check your log files

by Dave Rosenberg

More than a few technology sectors seem to be turning up the volume on "big data" and the enormous challenges and opportunities that enterprises face in managing and analyzing their data and system resources.

There are a number of hip technologies and frameworks like Apache Hadoop, which is used to store, process, and analyze massive data sets, enabling applications to work with thousands of nodes and petabytes of data.

Log management

(Credit: LogLogic)

One area that provides never-ending data analysis fodder is log files. For those not aware, log files are usually created and updated automatically whenever a machine or machine user does something. Logs have often been put under the "dark matter" umbrella, signifying the challenge of mining useful information from raw data.

But operating system and application logs are a goldmine of vital information about the health and well-being of an organization's computer infrastructure. Plus, they can record the day-to-day activity of system users as well as capture evidence of malicious activity.
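As a rough illustration of how evidence of malicious activity can be pulled out of raw logs, here is a minimal Python sketch. The log lines, their format (loosely modeled on SSH authentication messages), and the threshold of three failures are all assumptions made for the example; they do not come from LogLogic or any particular product.

```python
import re
from collections import Counter

# Hypothetical log lines in an assumed format.
LOG_LINES = [
    "2009-09-16T10:00:01 host1 sshd: Failed password for root from 10.0.0.5",
    "2009-09-16T10:00:03 host1 sshd: Failed password for admin from 10.0.0.5",
    "2009-09-16T10:00:09 host2 sshd: Accepted password for dave from 10.0.0.7",
    "2009-09-16T10:00:12 host1 sshd: Failed password for root from 10.0.0.5",
]

FAILED_LOGIN = re.compile(r"Failed password for (?P<user>\S+) from (?P<ip>\S+)")


def failed_logins_by_ip(lines):
    """Count failed login attempts per source address."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group("ip")] += 1
    return counts


def suspicious_ips(lines, threshold=3):
    """Return addresses with at least `threshold` failed attempts."""
    return [ip for ip, n in failed_logins_by_ip(lines).items() if n >= threshold]
```

Calling `suspicious_ips(LOG_LINES)` flags the one address that failed repeatedly; a real deployment would read from files or a collector rather than an in-memory list.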

I spoke with Dimitri McKay, security architect for LogLogic, and asked him to provide a few examples of real-world use cases demonstrating how logs are used for business analytics and intelligence purposes in the enterprise.

Example 1--A global retail company uses log analysis to meet the requirements of the PCI DSS compliance standards. Comprehensive reporting capabilities and secure long-term storage capacity are critical requirements, and in order to support forensic analysis, all data must not only be stored but also encrypted.

Example 2--A customer in the telecom industry was overwhelmed with the sheer amount of log data they were forced to consume for both forensics and operations. They were doubling their amount of log data storage every nine months, and the home-grown solution they had been using was just not keeping up.
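To get a feel for what doubling every nine months means, the growth can be worked out in a few lines of Python. Only the nine-month doubling period comes from the example; the 10 TB starting size used in the usage note is a made-up figure.

```python
DOUBLING_PERIOD_MONTHS = 9  # from the telecom example above


def growth_factor(months: float) -> float:
    """How many times larger storage becomes after `months` months."""
    return 2 ** (months / DOUBLING_PERIOD_MONTHS)


def projected_storage(current_tb: float, months: float) -> float:
    """Projected storage in TB, assuming the doubling rate holds steady."""
    return current_tb * growth_factor(months)
```

At this rate storage grows roughly 2.5 times per year and 16 times over three years, so a hypothetical 10 TB archive would reach `projected_storage(10, 36)`, or 160 TB.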

This company wanted the ability to track a session from start to finish across its entire infrastructure for forensics and operations. With that, it could see where sessions were failing, reduce downtime, and improve the user experience.
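Tracking a session across systems comes down to joining log events on a shared session identifier and ordering them in time. The sketch below assumes every system logs such an identifier; the event tuples, system names, and status values are hypothetical.

```python
from collections import defaultdict

# Hypothetical events: (timestamp, system, session_id, status)
EVENTS = [
    (3, "billing", "s1", "ok"),
    (1, "gateway", "s1", "ok"),
    (2, "auth", "s1", "ok"),
    (1, "gateway", "s2", "ok"),
    (2, "auth", "s2", "error"),
]


def trace_sessions(events):
    """Group events by session and order each session's steps by time."""
    sessions = defaultdict(list)
    for timestamp, system, session_id, status in events:
        sessions[session_id].append((timestamp, system, status))
    return {sid: sorted(steps) for sid, steps in sessions.items()}


def first_failure(steps):
    """Return the first system that reported a non-ok status, or None."""
    for _, system, status in steps:
        if status != "ok":
            return system
    return None


traces = trace_sessions(EVENTS)
```

Here `first_failure(traces["s2"])` points at the `auth` system, while session `s1` completes without a failure.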

Example 3--A global financial firm analyzes its log data to improve network performance and to support intrusion detection, vulnerability scanning, and vulnerability assessment across its infrastructure.

Keep an eye on the buzz meter to see how vendors address the impending data explosion by providing solutions that help enterprises take advantage of these massive data sets.

Follow me on Twitter @daveofdoom.

September 4, 2009 3:33 PM PDT

Hadoop buzz continues to excite the cloud

by Dave Rosenberg

Hadoop is the popular open-source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of very large data sets. It enables you to explore complex data, using custom analyses tailored to your information and questions. It's also one of the most buzz-worthy, talked-about open-source projects around.
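For readers who have not seen the MapReduce model, the classic word-count example gives the flavor. The sketch below runs in a single Python process and does not use Hadoop's API; the function names are illustrative only. In Hadoop, the map and reduce steps would be distributed across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

The appeal of the model is that each document can be mapped independently and each key can be reduced independently, which is what makes the work easy to spread over a cluster.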

Hadoop World

(Credit: Hadoop World)

I spoke with Christophe Bisciglia, Hadoop World organizer and founder of Cloudera, to ask some questions about this inaugural event. And by the way, if you're interested in attending, click on the link in the answer to question No. 5. (My readers get a 25 percent discount if you register before September 15.)

Q: How can you explain the buzz around Hadoop? It's deafening.

June 1, 2009 5:03 PM PDT

Big data and Cloudera: Follow the money

by Dave Rosenberg

I recently asked Cloudera CEO Mike Olson how a commercial open-source company balances community and commerce.

When it comes to open source, this isn't Olson's first rodeo; in his past life he served as CEO of the open-source database company Sleepycat, which was acquired by Oracle in 2006. Olson understands the fragile balance that exists in open source; he's a firm believer that good community relations are critical for open-source companies. Case in point--since we last spoke, Cloudera launched the industry's first certification program for Hadoop and MapReduce, open-source projects that support data-intensive distributed applications.

Cloudera on Tuesday is expected to formally announce the closing of a $6 million series B funding round led by Greylock (whose past investment successes include Red Hat, among many others).

Olson reports that fast growth in the business and rapid adoption of Hadoop/MapReduce drove heavy interest from investors. For Cloudera, apparently it's a buyer's market, so it decided to secure funding now to allow it to expand the business rapidly on all fronts.

So, with $11 million in the bank from top-tier VCs (Accel led the A round and participated in the B) along with individual investments from Diane Greene (former CEO of VMware), Marten Mickos (former CEO of MySQL), and Jeff Weiner (president of LinkedIn), Cloudera has successfully raised the smart money to complement the big data all-star founding team from Google, Facebook, and Yahoo.

For a brief overview of Hadoop and Cloudera check out the video below.

May 23, 2009 4:28 AM PDT

Balancing open-source community and commerce

by Dave Rosenberg

The tech media recently started taking serious notice of Hadoop, an open-source project developed to process huge amounts of data, and the coverage is growing every day. According to ITDatabase, 161 stories have been written about Hadoop in the last three months alone, including a veritable "coming out party" in The New York Times.

Hadoop is interesting because it is proven in use at large Web shops, cloud-oriented, and open source, and because it solves two major computing problems: handling large amounts of data and writing parallel programs for large numbers of computers. Hadoop clusters can scale to handle tens or hundreds of terabytes of data, or even petabytes.
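The two problems mentioned above usually get solved together: the data is split into blocks, each block is processed independently, and the partial results are merged. The Python sketch below shows that pattern on one machine with a thread pool standing in for a cluster; it is an analogy only, and the block size, worker count, and function names are arbitrary choices for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(records, block_size):
    """Cut the input into fixed-size blocks, as a cluster would shard a file."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def summarize_block(block):
    """Work done independently per block: a partial count and a partial sum."""
    return len(block), sum(block)


def parallel_mean(records, block_size=4, workers=3):
    """Compute a mean by merging per-block partial results."""
    blocks = split_into_blocks(records, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(summarize_block, blocks))
    count = sum(n for n, _ in partials)
    total = sum(s for _, s in partials)
    return total / count
```

Because no block depends on any other, adding workers (or machines) lets the same program handle more data without changing its logic.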

But adoption doesn't always equal commercial success. I've written in the past about Cloudera, a company formed to support Hadoop, and recently sat down with CEO Mike Olson to get his thoughts on the burgeoning Hadoop ecosystem and how the company intends to balance community and commerce.

My initial question for Olson: How does the company succeed when users are happy with the open-source project?

Olson answered with several key points. Cloudera sees "big data" -- terabytes at least -- becoming a common problem for all kinds of companies. The early adopters of Hadoop were all Web 2.0 companies generating logs and mining them for user behavior data. But data processing at this scale is also an enterprise problem. Enterprises aren't always early adopters, and they often require software to be supported by a vendor, not just a community.

Most enterprise buyers are very different from Facebook and Yahoo. They employ much smaller development and IT staff. They need strong SLAs and a quick response to problems from a vendor with deep expertise. Cloudera aims to solve those problems in ways that community support, mailing lists, and online forums can't.

This is typical of open-source projects that become more like products, and the challenge is ensuring that the project lives on and the commercialization efforts are balanced with good citizenship to non-customers.

The open-source community around Hadoop thus far appears to be pretty happy with Cloudera. The company has made its Cloudera Distribution for Hadoop available for free download, put a large amount of free training material on its Web site, and contributes to the open-source project with new features.

Good community relations are critical for open-source companies; getting this right is important for Cloudera.

Olson tells me that customers are running Hadoop in-house and, increasingly, in the cloud. A few weeks ago, Amazon even announced a hosted Hadoop offering called "Elastic MapReduce" -- more evidence that Hadoop has gone mainstream. From Olson's perspective, more Hadoop in the world means more demand for enterprise-grade services and support, and that creates a great opportunity for Cloudera to make life better for commercial users of the open-source project.

This is the key to maintaining the balance between commerce and community, and others will certainly pay attention to how Cloudera interacts with the Hadoop community to learn what works and what doesn't.

May 15, 2009 8:56 PM PDT

Hadoop breaks data-sorting world records

by Dave Rosenberg

Hadoop

(Credit: Hadoop)

Yahoo's grid-computing team announced that Apache Hadoop set world records in the Gray and Minute sorts of the annual GraySort contest, both in the general-purpose (Daytona) category.

Hadoop is the only open-source software to ever win the GraySort competition, adding another notch to last year's win at the Terasort competition, where Hadoop sorted 1 terabyte of data in 209 seconds. That beat the previous record of 297 seconds in the terabyte sort benchmark.

From the team's announcement: "Within the rules for the 2009 Gray sort, our 500 GB sort set a new record for the minute sort and the 100 TB sort set a new record of 0.578 TB/minute. The 1 PB sort ran after the 2009 deadline, but improves the speed to 1.03 TB/minute. The 62-second terabyte sort would have set a new record, but the terabyte benchmark that we won last year has been retired."
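The figures above mix seconds per terabyte with terabytes per minute, so a quick conversion helps when comparing them. The Python sketch below uses only the numbers quoted in this post; treating 1 PB as 1,000 TB is an assumption, and the elapsed times it derives are estimates from the quoted rates, not official contest results.

```python
def tb_per_minute(terabytes: float, seconds: float) -> float:
    """Convert a sort of `terabytes` done in `seconds` to a TB/minute rate."""
    return terabytes / (seconds / 60)


def minutes_to_sort(terabytes: float, rate_tb_per_minute: float) -> float:
    """Elapsed minutes implied by a data size and a TB/minute rate."""
    return terabytes / rate_tb_per_minute


previous_record = tb_per_minute(1, 297)   # roughly 0.20 TB/minute
last_year = tb_per_minute(1, 209)         # roughly 0.29 TB/minute
this_year_tb = tb_per_minute(1, 62)       # roughly 0.97 TB/minute

hundred_tb_minutes = minutes_to_sort(100, 0.578)  # roughly 173 minutes
petabyte_minutes = minutes_to_sort(1000, 1.03)    # roughly 971 minutes, assuming 1 PB = 1,000 TB
```

Put on the same scale, the 62-second terabyte sort is more than three times faster than last year's 209-second run, and the petabyte sort sustains a similar rate over roughly 16 hours.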

If you want to learn more about Hadoop, the Cloudera blog has a great post titled 5 Common Questions About Hadoop that explains things pretty well.

March 11, 2009 7:59 AM PDT

Understanding MapReduce and Hadoop (Video)

by Dave Rosenberg

For those of you interested in just how cloud computing (and I do mean computing) works, check out this video from a recent AWSome Atlanta Cloud Computing User's Group. Twitpay's Don Brown explains how MapReduce and Hadoop are used to process enormous amounts of data at Google and other large websites.

For more on MapReduce, check out these articles by Eugene Ciurana. For more on Hadoop (including support) check out Cloudera.

Via John M. Willis

December 24, 2008 9:36 AM PST

Cloud platforms of the future: Hadoop and Eucalyptus

by Dave Rosenberg

Without a doubt, the cloud and all its forms and meanings were big news in 2008. Besides the huge growth of Amazon EC2 and Google App Engine, we saw Salesforce launch Force.com, a true platform-as-a-service.

My picks for the most interesting software of 2008 are Hadoop and Eucalyptus.

Hadoop is an Apache project, the "open source implementation of MapReduce, a powerful tool designed for the detailed analysis and transformation of very large data sets," which basically means you can process a ton of data on commodity hardware.

Hadoop is going commercial through Cloudera, and while details are not publicly available, let's just say there are some very important and interesting foundations being laid for the way that people deal with computing and processing power.

About Software, Interrupted

In "Software, Interrupted," Dave Rosenberg discusses disruption in the software market, as well as the products and services that keep business technology norms in perpetual flux.

With nearly 15 years of technology and marketing experience spanning from Bell Labs to multiple start-up IPOs, Dave co-founded open-source software company MuleSource and now serves as general manager of Hardy Way. He also happens to be a U.S. patent holder and a workaholic. Technology is his best friend and mortal enemy.
