Hadoop

How GE uses Hadoop to analyze big data

Hadoop, one of the most talked-about open-source projects, gets its second annual Hadoop World conference next month in New York. On the heels of a successful inaugural event, the 2010 conference promises more than 25 presentations from the likes of Bank of America, eBay, HP, Orbitz, Twitter, Facebook, and Yahoo (full agenda here). Also, for the second year running, here is a code for my readers to get a 20 percent registration discount: CNETHW2010.

To provide a small taste of what the event will offer, I corresponded with Hadoop World speaker Linden Hillenbrand, product manager of Hadoop Technologies at General Electric, … Read more

Open-source 'R' gets Hadoop integration

Lately, you can't talk about business without talking about "big data," which, incidentally, is the focus of the latest package from Revolution Analytics. Revolution Analytics, which commercialized the open-source R statistics language, emphasizes expanding the use of R beyond its academic roots to business.

On Tuesday, Revolution is expected to add big data analysis to its Revolution R Enterprise software. The new add-on package, called RevoScaleR, provides a framework for fast and efficient multicore processing of large data sets.

According to the company, the new package will allow users to process, … Read more

Big data in context

A few weeks back I attended venture firm Accel Partners' New Data Workshop event and learned quite a bit about the state of what we are now commonly referring to as "big data" and the challenges that await the vendors trying to target this new way of slicing and dicing vast amounts of information.

One of the big takeaways for me was the realization that even with all of the processing power available nowadays, the amount of data is growing at such a rapid pace that people are simply looking to cope with the problem, rather than facing it head on.

The issue of processing large amounts of data is not necessarily new--most developers and IT staff can tell you about having too much information to deal with--but the big difference is that there are new approaches, tools, and technologies that can help alleviate the difficulty in processing.

Over the course of the last 30 years or so, the way that machines process transactions has changed, and so has the amount of data being processed and collected, now with an eye toward real-time analysis of information.

This has led to the advent of a number of technologies that allow for data processing to be offloaded and managed in both structured and unstructured ways--examples include open-source projects like Memcached and Hadoop as well as NoSQL data storage mechanisms like Cassandra.… Read more

Adobe releasing Puppet code for managing Hadoop

Puppet Labs announced on Thursday that Adobe Systems is publishing code for managing Hadoop on the Puppet Forge community development site. (Disclosure: I am an adviser to Puppet Labs.)

Puppet is an open-source data center automation and configuration management framework that aims to give system administrators a platform for consistent, transparent, and flexible systems management.

The necessity of data center automation and management tools (often grouped into the DevOps category) is becoming ever more apparent, as cloud principles and large-scale systems that process data in a parallel manner continue to emerge.

Case in point: Hadoop is an open-source platform powering hugely … Read more

Cloudera teams up to connect Oracle and Hadoop

This week Cloudera, a provider of software and services for the Apache Hadoop project, is set to announce a partnership with Quest Software to develop, support, and distribute an Oracle connector for Hadoop.

Hadoop is the popular open-source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of very large data sets. It enables its users to explore complex data, using custom analyses tailored to users' information and questions.
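To make the MapReduce idea concrete, here is a minimal word-count sketch in plain Python. It runs in memory in a single process and does not use any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are my own illustrative choices, not anything from Hadoop or Cloudera. It only shows the shape of the model: a map step emits key/value pairs, the pairs are grouped by key, and a reduce step combines each group.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group emitted values by key, the step a framework runs between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce over an iterable of text records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    # Hypothetical sample input, for illustration only.
    sample = ["big data is big", "Hadoop processes big data"]
    print(word_count(sample))
```

In a real cluster, the map and reduce steps would run in parallel on many machines over data far too large for one process, which is the part this sketch leaves out.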

Code-named "Ora-Oop," the connector will provide connectivity between Cloudera's Hadoop distribution and Oracle through an interface that allows for bidirectional, scalable, and functional data transfer … Read more

IBM chooses Hadoop to analyze big data

IBM on Wednesday is set to announce a new portfolio of solutions and services to help enterprises analyze large volumes of data. IBM InfoSphere BigInsights is based on Apache Hadoop, an open-source technology designed for analysis of big volumes of data.

IBM InfoSphere BigInsights is made up of three parts: a package of Hadoop software and services; BigSheets, a beta product designed to help business professionals extract, annotate, and visually uncover insights from vast amounts of information quickly and easily through a Web browser; and industry-specific frameworks to help clients get started.

IBM has been aggressive in consuming and repackaging open-source projects … Read more

IBM BigSheets to preserve fleeting Web data

IBM announced Thursday that it is working with the British Library on a project that will preserve and analyze terabytes of information on the Web before it is lost forever.

Recent research estimates the average life expectancy of a Web site is 44 to 75 days. Every six months, for example, roughly 10 percent of Web pages on the U.K. domain are lost.

In most cases of personal sites, this is no big loss. But in the case of organizations attempting to archive and chronicle elections, news, media, and video, this data leakage presents massive challenges. And even if … Read more

Survey: IT's key role in global economic recovery

Information technology is expected to play an important part in the global economic recovery, according to a new survey released Wednesday.

Some 72 percent of business and information technology executives say their "organizations place greater value on the IT function today than they did before the economic crisis" and that they "view IT as an important part of their economic recovery efforts," according to Accenture's Global Survey on IT Investments.

This is not an unfamiliar sentiment. We've heard it from United States CIO Vivek Kundra as he's attempted to use IT to kick-start a variety of programs on the federal level that will set the pace for innovative new uses of technology across the globe.

The results of the Accenture survey are similar to last week's cautiously optimistic survey results from Goldman Sachs, which suggested that IT spending would trend upward in 2010 and normalize to pre-recession levels, with the majority of countries represented planning to increase investment selectively next year.

Open-source Hadoop powers Tennessee smart grid

The Tennessee Valley Authority is the nation's largest public power provider, serving approximately 9 million consumers in seven southeastern states. The organization also happens to be a big supporter of open-source projects, including Hadoop, a tool designed for deep analysis and transformation of very large data sets.

Earlier this year, the Tennessee Valley Authority (TVA) announced that it had open-sourced the system it uses to collect data from smart grid devices called phasor measurement units (PMUs). The data collection system is known in the industry as a Super Phasor Data Concentrator (SuperPDC), and its data can be used to determine the health of a power grid.

The open-source version of the SuperPDC is now called the "OpenPDC." I spoke to both Ritchie Carroll (RC), the project's creator, and Josh Patterson (JP), the person responsible for introducing Hadoop to the project, to discuss what the OpenPDC is and why TVA turned to Hadoop in building the system.

What sort of data volumes are you working with?

RC: Currently there is around 20 TB of archived data. We expect this to grow quickly as a result of the SmartGrid stimulus funding, which includes the addition of 850 phasor measurement devices. This may well grow the archive to half a petabyte within the next few years.

How is this data currently captured and managed? Is any data discarded?

JP: Data is collected directly from field devices at 30 times per second. This data is then time-aligned and processed in real time--all data gets captured into a binary data file as time-series data for mass processing by Hadoop.

RC: No data is currently discarded. If we get to the point of needing to discard data because of cost, this will be a decision based on the weighed importance of collected data. It is likely the data around major events will never be deleted because it will always be valuable for future student researchers. There is also value in being able to go back in time and look for newly discovered event signatures to see how long they might have been occurring. … Read more

How Yahoo is betting its cloud will pay off

There was a day when information technology personnel toiled behind the scenes to make their corporate computing infrastructure work.

But in the Internet era, those experts increasingly are getting starring roles in corporate computing leadership rather than being supporting cast members. Such is the case for Shelton Shugar, Yahoo's senior vice president of cloud computing.

"It becomes more a topic at cocktail parties," he said of his present job, which he took shortly after Yahoo formed the group in June 2008. "I was at a wine tasting, and an acquaintance said, 'I did a search on … Read more