August 5, 2009 9:37 AM PDT

The new databases

by Gordon Haff

"Database" has come to be largely synonymous with a relational database management system (RDBMS) or, more specifically, a relational database that is accessed using the SQL query language. Some simpler products run on desktops, but if you are talking about products used for serious business computing on a server, SQL it is. The widespread adoption of open-source products such as MySQL and PostgreSQL only cemented SQL's dominance by making it available to a broad audience that couldn't afford licensing fees for products from Oracle and other large database vendors.

An RDBMS stores data in multiple tables that are related to each other by keys, each of which uniquely identifies a row within its table. The term "relational database" was coined and defined by IBM's Edgar Codd in a 1970 paper. Products based on this model largely replaced hierarchical databases and other earlier approaches. Although the relational model could perform worse than the alternatives, it tended to offer more flexibility in how data could be laid out, added, and accessed.
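To make the idea of tables related by keys concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, columns, and sample rows are invented for illustration and are not taken from any product discussed here.

```python
import sqlite3

# In-memory database; the schema and rows below are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,   -- unique key for each customer row
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total_cents INTEGER NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 2500), (11, 1, 4000), (12, 2, 1500);
""")

def totals_by_customer(conn):
    """Join the two tables on the key and sum each customer's orders."""
    rows = conn.execute("""
        SELECT c.name, SUM(o.total_cents)
        FROM customers AS c
        JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.id
        ORDER BY c.name
    """)
    return rows.fetchall()
```

The flexibility mentioned above shows up in the query: the relationship between customers and orders lives in the key, so new questions can be asked with a different join rather than a different storage layout.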

As computer systems got faster (and SQL RDBMSs were enhanced in many ways), concerns about the performance of the basic approach largely receded into the background. Efforts to displace RDBMSs--object databases, for example--have generally attracted a lot of hype for a time but remained niche products.

However, with the advent of truly massive-scale distributed computing infrastructures, we're starting to see significant adoption of technologies that don't necessarily replace RDBMSs, but certainly complement them.

The basic issue is that RDBMSs are architected to process and store all transactions with absolute reliability. (ACID--atomicity, consistency, isolation, and durability--is a set of properties commonly used to describe the requirements.) This is a good thing when we're talking about, say, financial transactions. A bank balance has to immediately reflect a withdrawal; the system has to prevent multiple withdrawals of the same balance from happening simultaneously.
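The bank withdrawal case can be sketched with a single transaction. This is a minimal illustration using Python's sqlite3 module, not a description of how any particular banking system works; the accounts table and the withdraw and balance helpers are hypothetical names.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode so that
# the transaction boundaries below are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance_cents INTEGER NOT NULL)"
)
conn.execute("INSERT INTO accounts VALUES (1, 10000)")

def withdraw(conn, account_id, amount_cents):
    """Debit the account only if the funds are there; all or nothing."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before checking the balance
    try:
        cur = conn.execute(
            "UPDATE accounts SET balance_cents = balance_cents - ? "
            "WHERE id = ? AND balance_cents >= ?",
            (amount_cents, account_id, amount_cents),
        )
        ok = cur.rowcount == 1  # zero rows updated means insufficient funds
    except Exception:
        conn.execute("ROLLBACK")
        raise
    conn.execute("COMMIT" if ok else "ROLLBACK")
    return ok

def balance(conn, account_id):
    row = conn.execute(
        "SELECT balance_cents FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()
    return row[0]
```

Because the balance check and the debit happen in one statement inside one transaction, two withdrawals cannot both spend the same money: the second one sees the reduced balance and is refused.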

RDBMSs and their associated infrastructure also tend to reflect the assumption that data will be retained for a significant period. Again, this makes a lot of sense in the context of the traditional role of databases. A business doesn't just want to keep transaction records for at least several years--in many cases, it's legally required to do so.

However, we're seeing the increased use of alternative approaches in large distributed systems that don't have as stringent consistency requirements or that generate lots of intermediate results that don't need to be stored permanently. In exchange, they can use replication for maximum performance and availability.

One form this takes is "eventual consistency," which Amazon CTO Werner Vogels describes as tolerating inconsistency for "improving read and write performance under highly concurrent conditions and handling partition cases where a majority model would render part of the system unavailable even though the nodes are up and running." You can read a paper Vogels wrote on the topic here.

Amazon SimpleDB implements such a model. It "keeps multiple copies of each domain. When data is written or updated (using PutAttributes, DeleteAttributes, CreateDomain or DeleteDomain) and Success is returned, all copies of the data are updated. However, it takes time for the update to propagate to all storage locations. The data will eventually be consistent, but an immediate read might not show the change."
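The behavior described in that quote can be illustrated with a toy model. The sketch below is not SimpleDB's implementation or API; the EventuallyConsistentStore class and its methods are hypothetical, and propagation is triggered by hand so the stale read is easy to see.

```python
from collections import deque

class EventuallyConsistentStore:
    """Toy model: a write lands on one replica and reaches the others later."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = deque()  # (replica_index, key, value) updates not yet applied

    def put(self, key, value):
        # Acknowledge once the first replica has the value; queue the rest.
        self.replicas[0][key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def get(self, key, replica=0):
        # A read may hit any replica, including one that has not caught up.
        return self.replicas[replica].get(key)

    def propagate(self):
        # Apply every queued update, after which all replicas agree.
        while self.pending:
            index, key, value = self.pending.popleft()
            self.replicas[index][key] = value

store = EventuallyConsistentStore()
store.put("color", "blue")
stale = store.get("color", replica=2)   # read before propagation
store.propagate()
fresh = store.get("color", replica=2)   # read after propagation
```

Here the first read from the lagging replica returns nothing while the second returns the written value, which is the "immediate read might not show the change" case.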

We're also seeing products that essentially augment RDBMSs by reducing the volume of data that they need to store. Terracotta is a commercial product that provides distributed caching for Java applications. An example could be a travel reservation application where the actual "books" need to go into an RDBMS but many of the transactions associated with "looks" can be handled in a distributed way without touching the database every time. Terracotta says that they can frequently offload 40 percent to 60 percent of transactions.
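The looks-versus-books split can be sketched as a simple cache in front of a database. This is a single-process illustration in Python, not Terracotta's API (which targets Java); the ReservationService class, its dictionary "database," and the db_reads counter are all assumptions made for the example.

```python
class ReservationService:
    """Serve 'looks' from a cache; send 'books' to the system of record."""

    def __init__(self, seats):
        self.db = dict(seats)  # stand-in for the RDBMS
        self.cache = {}        # stand-in for the distributed cache
        self.db_reads = 0      # how many looks actually reached the database

    def look(self, flight):
        # Only a cache miss touches the database.
        if flight not in self.cache:
            self.db_reads += 1
            self.cache[flight] = self.db[flight]
        return self.cache[flight]

    def book(self, flight):
        # Bookings always go to the database, then invalidate the cached value.
        if self.db[flight] <= 0:
            return False
        self.db[flight] -= 1
        self.cache.pop(flight, None)
        return True

svc = ReservationService({"BOS-SFO": 2})
looks = [svc.look("BOS-SFO") for _ in range(5)]
reads_before_booking = svc.db_reads
booked = svc.book("BOS-SFO")
seats_after = svc.look("BOS-SFO")
```

Five looks cost one database read in this run; the booking invalidates the cached count, so the next look goes back to the database once and sees the updated number.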

Memcached, an open-source distributed memory caching system, is conceptually similar. It distributes data (together with an associated structure to look up that data) across multiple systems to reduce accesses to external data stores. It is widely used at large Web sites such as Twitter, YouTube, and Wikimedia.
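One way to picture data spread across multiple systems is a client that hashes each key to pick a node. The sketch below is a simplified assumption, not memcached's client code: ShardedCache and its node names are hypothetical, plain dictionaries stand in for cache servers, and it uses simple modulo hashing where production clients often use consistent hashing.

```python
import hashlib

class ShardedCache:
    """Toy client that spreads keys across several in-memory 'nodes'."""

    def __init__(self, node_names):
        self.nodes = {name: {} for name in node_names}  # each dict plays one server
        self.names = sorted(self.nodes)

    def node_for(self, key):
        # A stable hash of the key decides which node owns it.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % len(self.names)
        return self.names[index]

    def set(self, key, value):
        self.nodes[self.node_for(key)][key] = value

    def get(self, key):
        # Returns None on a miss, at which point a caller would fall
        # back to the external data store.
        return self.nodes[self.node_for(key)].get(key)

cache = ShardedCache(["cache-a", "cache-b", "cache-c"])
cache.set("user:42", "cached profile")
```

Every client that hashes the same way finds the same node for a given key, so no central directory is needed. The cost of the modulo shortcut is that adding or removing a node remaps most keys.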

These techniques and technologies don't replace RDBMSs in the way that RDBMSs replaced older technologies such as hierarchical databases. Rather, they trade off characteristics that have been considered non-negotiable must-haves in the realm of database design such as full consistency. As a result, they can't be used instead of RDBMSs for the situations where those characteristics truly are requirements.

However, a lot of software is more asynchronous and read-intensive than traditional business applications. It doesn't have the same constraints, but it does need to scale performance massively across many systems. For the organizations implementing that software, pairing RDBMSs with distributed data stores of various forms isn't just the right architectural approach; it may be the only way they can get to the scale levels they need at a price point that makes business sense.

Gordon Haff is a principal IT adviser at Illuminata and has more than 20 years of IT industry experience. He writes about what's happening with enterprise servers and data centers, "Yotta-scale" computing, and related software and device trends as part of the CNET Blog Network. Disclosure.
4 Comments
by fazalmajid August 5, 2009 4:06 PM PDT
I am a software architect working on a new app that requires a scalable OLTP database with stringent latency requirements (queries served under 30 milliseconds at the 95th percentile). We considered some of the trendy in-memory databases or clustered key-value stores and found them lacking in one way or the other. The best solution turned out to be a standard RDBMS (PostgreSQL, although Oracle could also do the job) running on SSDs.
by ghaff August 5, 2009 6:47 PM PDT
Generally speaking, if you have stringent requirements in the traditional database mold (consistency, latency, etc.), you're going to want a more-or-less traditional RDBMS. I don't see that as making other approaches "trendy," just suited to a different set of needs.
by yahooBS August 6, 2009 5:54 AM PDT
This was a rather underwhelming article; didn't learn much at all (I already knew that SimpleDB wasn't an RDBMS).
1-what other systems are you seeing (other than SimpleDB)?
2-how do they work?
3-how would I take advantage of them and why?
by ghaff August 6, 2009 8:25 AM PDT
Getting into the details of caching systems and other technologies is rather beyond the scope of a general blog post. Beyond what Amazon is doing (for which I recommend Werner's linked article), the distributed caching systems that I mention are probably the other hottest area. There's been a lot written on memcached; Gear6 has a fair bit of info on their site.
About The Pervasive Data Center

This blog takes a deep (and often skeptical) look at trends big and small in the world of enterprise servers, data centers, and "Yotta-scale" computing. This means also taking into account the myriad of software, networks, and devices that are driving change in (or being driven by) these back-end systems. Stories posted to this blog may also appear on Illuminata's site.

Gordon Haff is a principal IT adviser for Illuminata of Nashua, N.H. Before becoming an IT industry analyst, Gordon held a variety of product-marketing positions at Data General, spanning more than a decade. He's programmed for DOS, Windows, and Linux; builds his own PCs; and holds engineering degrees from MIT and Dartmouth, with an MBA from Cornell. He is a member of the CNET Blog Network and is not an employee of CNET. Disclosure.
