Earlier this year, EMC surprised the storage community with its acquisition of Greenplum, a small producer of sophisticated software that can be used to both scale out and accelerate data warehousing and business analytics applications. Its core technology is based on a convergence of Google's MapReduce process and SQL. The result is a business analytics engine that is now being used to process very large data sets from a variety of online and traditional database sources. EMC created a new Data Computing Division around Greenplum and has recently released a Data Computing Appliance to compete with a number of accelerated business analytics platforms already available from Oracle and Teradata.
Shortly thereafter, IBM acquired Netezza, another start-up with technology similar to Greenplum's and a similar appliance. For IBM, however, this acquisition was much less of a surprise. IBM has a well-established customer base and product portfolio in both traditional data warehousing and emerging business analytics opportunities. Indeed, IBM sees business analytics as a $208 billion opportunity between 2011 and 2015 ("cloud" was rated at $181 billion). Or, as an IBM executive quipped during a recent analyst event, "It is impossible to do Smarter Planet without analytics."
There is no question in my mind that traditional data warehousing will be transformed into "big data" analysis applications. Some vendors are already responding to the burgeoning need for IT processes that converge traditional database data with data from multiple disparate sources, including online, wireless, and sensor sources (hence the term "big data"). This need is being seen across industry segments, but is currently most prominent in health care, government, and retail. What storage professionals need to understand now is how the big-data wave will affect data storage.
Understanding how the new business analytics applications will affect storage first requires differentiating the multiple processes they use from traditional data warehousing processes. The standard extract/transform/load processes common to traditional data warehousing applications are no longer scalable or fast enough. A purpose-built appliance approach that integrates servers, storage, and networking is increasingly the answer.
For example, Greenplum uses Scatter Gather Streaming to simultaneously ingest multiple data streams from multiple sources. The Scatter Gather approach also means that the sources for that data do not need to be on premises. These sources could reside on other Web sites or within other organizations connected to the user. Customers can build processing infrastructure around the Greenplum engine or they can buy it prebuilt as an appliance. In the case of the appliance, the storage decisions are mostly already made. Customers wanting to build the infrastructure, however, will need to familiarize themselves with the storage requirements.
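To make the idea of parallel ingest concrete, here is a minimal Python sketch of a generic scatter/gather pattern. It is not Greenplum's implementation, whose internals are not described here. The function names, the in-memory "sources," and the hash-based distribution across segments are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32


def read_source(rows):
    # Hypothetical stand-in for pulling (key, value) rows from one source,
    # which could be a local database or a remote site.
    return list(rows)


def scatter_gather_ingest(sources, num_segments):
    """Read all sources in parallel, then distribute rows across segments."""
    segments = [[] for _ in range(num_segments)]
    workers = max(1, len(sources))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Gather: every source is read concurrently.
        for batch in pool.map(read_source, sources):
            # Scatter: a stable hash of the key picks the target segment,
            # so rows with the same key always land together.
            for key, value in batch:
                index = crc32(key.encode("utf-8")) % num_segments
                segments[index].append((key, value))
    return segments
```

In a sketch like this, the storage behind each segment has to absorb writes from every source at once, which is one reason scalability and parallel throughput matter for customers who build their own infrastructure.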
Performance will top the list of requirements. For this reason, solid-state drives (SSDs) will figure prominently in external storage arrays that support the new business analytics applications. Scalability and parallelization will also be required. For these reasons, IBM positions its Scale-out NAS (SONAS) in business analytics and EMC's new Isilon acquisition lands in the big-data space.
When real-time analysis is required, storage as we know it may still be too slow. In this case, database-in-memory technologies may be preferred. Perhaps even more extreme from a storage perspective are applications like StreamBase that completely avoid using storage by processing data streams as they flow through the StreamBase system.
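The general idea of analyzing data in flight, without ever writing it to storage, can be illustrated with a short Python sketch. This is not StreamBase's API. The function name and the rolling-average calculation are assumptions chosen only to show that a result can be produced for each arriving value while holding just a small window in memory.

```python
from collections import deque


def rolling_average(stream, window):
    """Yield the average of the most recent `window` values as each arrives."""
    recent = deque(maxlen=window)  # only the current window is kept in memory
    total = 0.0
    for value in stream:
        if len(recent) == window:
            # The oldest value is about to be evicted, so drop it from the sum.
            total -= recent[0]
        recent.append(value)
        total += value
        yield total / len(recent)


# Example: values are consumed one at a time and never persisted.
print(list(rolling_average([1, 2, 3, 4], window=2)))  # [1.0, 1.5, 2.5, 3.5]
```

Memory use here depends on the window size, not on how much data has passed through, which is why this style of processing sidesteps storage altogether.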
While it's a bit early for next-year prognostications, I expect to see the number of storage devices aimed at analytics applications blossom in 2011 with more storage vendors pursuing the opportunity.