A few weeks ago I talked with Jonathan Heiliger, vice president of technical operations at Facebook, about the challenge of innovating quickly and building stable infrastructure while 250,000 new members are added to the social network every day. Check out the video on ZDNet.
Jonathan Heiliger
(Credit: CNET News)
Q: You've been at Facebook, I think, for about a year and it's been quite a ride I guess, scaling up from zero in 2004 to over 80 million today. How do you keep up with that hyper growth?
Heiliger: You're absolutely right--we've had a lot of growth. We add over 250,000 users every day, and that means a lot of infrastructure, a lot of servers, and constantly looking at new processes and looking at how we're doing things and ensuring that we're doing things the most efficient way possible, not just for delivering all the content to our users but to stay on top of what it costs to run the site.
How do you stay on top of the cost in terms of the kind of equipment you buy and how you work with the vendors? How do you prioritize those things?
Heiliger: One of the things we recently did was we ran an RFP process for the servers we buy from vendors and essentially did a bake-off with a number of different people looking at building servers on our own. What we concluded from that process was to continue to buy servers from a couple of major OEMs (original equipment manufacturers), but through that process we were able to lock in prices today and carry those prices forward as all the commodity components costs drop.
When you're buying those servers, and I assume you're doing just a huge scale out of commodity servers, what do they look like? How are they configured?
Heiliger: We're pretty lucky in that we run a wide variety of applications, literally tens of applications on our own and hundreds of applications for our platform developers that use Facebook as a distribution mechanism, as a way of interacting with their users. But one of the reasons we're very lucky is our engineering team has selected to use PHP as the primary development language. That allows us to use a fairly generic server type. So we, with a couple of exceptions, have three main server types and run a fairly homogeneous environment, which allows us to then consolidate our buying power.
You're different from Google in the kinds of applications that you run. They are mostly running search queries, and you're running all kinds of queries and bringing back all kinds of data from the social graph. How is it different in terms of the way you build out your data center from the inside?
Heiliger: Google has a tremendous amount of information that they index and archive and present to users, but fundamentally if you go to Google and type in a search for a "tiger" and I go to Google and type in a search for a "tiger" we're going to see generally the same results, so they're presenting that same information to both of us. Facebook is a little different in that the context for our data is all social. When you look at your friends and their status updates and their photos and the notes they may have written, you're going to see one set of data versus if I look at my friends and their photos and their notes and status updates, and those tend to be non-intersecting sets of data.
So it's much more dynamic?
Heiliger: Much more dynamic data set--and what that means is it's caused us to do a bunch of different things relative to caching and relative to federating all of that data up amongst thousands of different databases so that as a user requests all of that information we're not using one particular server every time for different data.
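To make the idea of federating data across many databases concrete, here is a minimal sketch of hash-based partitioning. It is not Facebook's actual scheme, which Heiliger did not describe. The shard count, the function names, and the choice of hashing on user ID are all assumptions made for illustration.

```python
import hashlib

# Hypothetical shard count; the interview says only "thousands of different databases".
NUM_SHARDS = 4000


def shard_for_user(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a shard index using a stable hash (illustrative only)."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then reduce to a shard index.
    return int.from_bytes(digest[:8], "big") % num_shards


def group_friends_by_shard(friend_ids, num_shards: int = NUM_SHARDS):
    """Group friend IDs by the shard that would hold each friend's data."""
    groups = {}
    for friend_id in friend_ids:
        groups.setdefault(shard_for_user(friend_id, num_shards), []).append(friend_id)
    return groups
```

Because each user's friends hash to different shards, rendering one page fans out across many databases rather than hitting the same server every time, which is the property Heiliger describes.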
You recently introduced a chat application on Facebook, and it seems like it took a lot of time to test it to make sure it could scale having all those simultaneous conversations going on. Could you give us a little background and color on how that came to be?
Heiliger: Chat is actually one of our most recent launches. It started as a hack-a-thon project, which is one of the things we do about every other month. People get together and work all night and pick a project they don't have time to do necessarily during the day. From the time it really germinated as an idea to the time it launched and was available for our entire user base, it became a more formal development project. One of the things we did as part of that was actually built a new back-end service to be able to deal with all of the millions of simultaneous connections that we persist for users.
One other thing I was reading up on some of the work you've been doing--you say that clouds don't solve single points of failure in your stack. What are those single points of failure?
Heiliger: Interesting question, and the notion you are referring to there was part of the talk I give in regards to cloud computing is just a panacea, and for a start-up or even a more mature start-up like Facebook, isn't the answer to solving failure points in an application. By that I mean the underlying infrastructure that powers an application is typically the result of, or the outcome of, how the application is originally designed and how users interact with that application. If an application is poorly designed or designed to constantly reference a single set of data, the underlying infrastructure is going to be the victim of that. Guys like myself in the infrastructure world have to figure out how to best make that work.
As someone who is in operations how much impact do you have on the application development to make sure that once it gets into the data center that it can work properly and scale and not have the kind of failures we're seeing with some of the new applications?
Heiliger: I think it's a constant challenge in any organization, particularly a fast-moving one like Facebook, where we want to iterate quickly and get product out in our customers' hands so we can get feedback on that product and continue to tweak and enhance it over time. We have one force that's moving in that direction, and we have another force that says we want to keep the site up, we want the site to be reliable, and we want the site to be fast.
So there's a fine balancing act, where everyone in management and everyone in both the engineering and operations department constantly just sort of works, interacts, and goes back and forth, figures out just how to make those trade-offs. Sometimes we err too aggressively on the side of innovation and iteration, and put things out on the site in perhaps a small quantity that may break the site or cause the site to slow temporarily. Other times we err on the side of conservatism, of not releasing new functionality or new features, and that then delays the sort of user gratification of having that feature or fixing that bug.
What are the challenges that you see--let's say you're at 80 million unique users per month, 250,000 being added per day and 50,000 transactions per second. What happens when you get to 500 million or a billion if you ever get there?
Heiliger: Hopefully, tremendous things. I think we can only look forward to those days.
But what are some of the bottlenecks or barriers you have to overcome to get to that kind of scale?
Heiliger: Some of the bottlenecks we're facing are how we scale this extremely distributed set of data. One of the challenges we have is figuring out how to make that replicated such that it can exist in multiple places around the world and we don't also have to bring users back to the U.S. or back to one of our data centers. I think it's a challenge that most Web sites tend to face as they scale, which is you start in one location with a single database and then you have to figure out how to grow from there, primarily driven by the amount of latency or the amount of time it takes to reach the site and interact with the site. Being able to replicate the data across multiple data centers and across multiple geographies allows users to not just read their data from a local version but write that data as well. That is one of our key challenges over the next 12 months.
As you learn more about building up this very large scale infrastructure do you ever see the possibility that a Facebook could be a service provider?
Heiliger: What do you mean by service provider?
In the sense that right now you're just running the Facebook application but what if a developer or user wanted to do something similar to what Amazon is doing, using your infrastructure to run their applications in the cloud?
Heiliger: Gotcha. So one of the values of Facebook is the Facebook platform. We have over 100,000 developers and several hundred applications that have over a million users using them. We've talked about perhaps opening up or further opening up the platform by offering compute power for those application developers. One of the steps we've already taken to improve that development environment and improve the experience for our developers is just to open-source our platform, which we announced just a couple of weeks ago as well.
After attending GigaOM's Structure 08, I came away with a cloud-computing hangover. Just trying to define cloud computing is daunting given all the hype and companies thunderclapping.
Today the research firm Gartner has jumped on the cloud computing bandwagon, proclaiming that it "heralds an evolution of business that is no less influential than e-business," and defining it as massively scalable IT-related capabilities provided as a service using Internet technologies to multiple external customers.
Yahoo just announced a Cloud Computing & Data Infrastructure Group, which will develop computing infrastructure that balances scalability with cost effectiveness. What was Yahoo doing before it created this group?
I prefer the way Sun Chairman Scott McNealy talks about cloud computing. Ten years ago he was calling it the "big freakin' Webtone switch." Following is how he described it in December 2001:
That is the server, the storage, the operating system, the monitoring software, the clustering, the alternate pathing, multiple domaining, dynamic reconfiguration--and then it has a mail tone, a calendar tone, a news tone, an app server tone, and a directory tone. It has all of the different features of a big freaking WebTone switch and allows you to create this big jukebox. You can buy that all complete. Or you have one throat to choke and you can buy it all through a service provider that is SunTone certified. Or you can do what many IT directors do and they go out and buy the telephone switch by buying the chip from Intel, the operating system from Microsoft, the disk drive from EMC, the Compaq power supply, the Oracle database, the Novell directory, the BEA app server, the SAP, ERP, and CRM from here, blah-blah-blah, this, that, and the other thing, a SoundBlaster card from somebody else, the anti-virus uninstaller from Norton, and then go bring in IBM Global Services to try to make the whole thing work. Buy the big freaking WebTone switch.
At that time McNealy was talking about how enterprises provision their data centers and user services. Now we are seeing Amazon, Google and others take their data center expertise and make it available to developers and companies. Enterprises will be slower to move to the cloud, but they will eventually get there. Software-as-a-service providers are flourishing, and increasingly enterprises are considering off-premises, hosted solutions.
In essence, we are at the beginning of the age of planetary computing. Billions of people will be wirelessly interconnected, and the only way to achieve that kind of massive scale usage is by massive scale, brutally efficient cloud-based infrastructure.
SAN FRANCISCO--Speaking at the Structure 08 conference here, Sun Microsystems CTO Greg Papadopoulos predicted that by the beginning of 2010 the majority of systems sold would be for Web, high performance computing and software-as-a-service applications. "We are going through this phase change in computing in a big way," he said. He made a similar prediction last year.
Papadopoulos also advocated a free market in which all interfaces and formats are based on open standards; customers own their data, relationships, and metadata; and customers can extract, synchronize or purge their data unilaterally. This echoes recent efforts to promote openness and data portability.
Papadopoulos acknowledged that the nirvana of every customer or user in charge of their own data that lives in the cloud has challenges. Today, users cede control of their data to service providers like Google, Facebook, Microsoft, Yahoo, and others. It's not as easy for users to manage and move their data as it should be, which means users are generally stuck with the user experience and monetization schemes of the host sites. "It's proprietary systems all over again," Papadopoulos said. Over the last several years Sun has differentiated itself from proprietary vendors, focusing on free open-source software and open standards.
Sun CTO Greg Papadopoulos
(Credit: Dan Farber)
Further out into the future, Papadopoulos expects that the technology infrastructure industry will be similar to the energy industry. In past presentations, he has called this transition the Red Shift.
Papadopoulos has predicted a "neutron star collapse of data centers," meaning at some juncture it won't make sense for businesses to build their own data centers. Instead they will contract for computing resources from hosting providers who bring "brutal efficiency" for utilization, power, security, service levels, and idea-to-deploy time.
There will be a grid of a half dozen very large cloud infrastructure providers and a hundred or so regional providers, Papadopoulos said. It will also look more like the banking world, he continued, with customers willing to trust the service providers with their private data as they do banks with their money. It's a question of when, not if, this scenario will occur.
Papadopoulos also laid out a map (see below) of the current universe of cloud computing in terms of increasing virtualization and consolidation across various categories: processor, operating system, language, and application services. Over time, the categories will fill out, especially as more languages and application services or platforms rise up. Papadopoulos pointed to two Sun projects, Dark Star and Project Caroline. Dark Star is software infrastructure designed to simplify the creation of massively scalable online games, virtual worlds, and social networking applications. Project Caroline is a hosting platform for developing and delivering Internet-based services. It's not clear why the Sun research projects are positioned at the far right on the chart, and players such as Google, Joyent, and Rackable are missing.
Higher up in the stack, developers have more targets and more freedom to innovate below it, Papadopoulos said.
(Credit: Sun)
Click here to see more stories from the Structure 08 conference and on cloud computing generally.
Speaking at Structure 08, Debra Chrapaty, corporate vice president of Global Foundation Services at Microsoft, shed some light on the cloud-based infrastructure supporting Microsoft's online services.
Despite characterizations that Microsoft is stuck in the client/server world, the company is spending billions to apply the cloud, or server/client, model, where most of the computing happens in the cloud and some small amount on the client (offline support for applications). But until Microsoft Office and other applications are built for the cloud, the laggard characterization will continue to stick to the company's forehead.
Debra Chrapaty, corporate vice president of Global Foundation Services at Microsoft.
(Credit: Dan Farber)
Microsoft has one of the biggest collections of Web sites, with 550 million users, 2 billion search queries, and 10 billion page views per month, as well as 8 billion messages on Microsoft Messenger per day. The company deploys 10,000 new servers per month on average to keep up with demand, Chrapaty said. She broke down Microsoft's model for building infrastructure into a three-letter acronym.
The cloud is all about GET--Growth, Efficiency, and Trust, Chrapaty said. In terms of growth, data centers are a $300 million to $500 million investment. "You have to make every kilowatt count," she said, noting that Microsoft has 35 criteria, such as network egress, power, and available staff, to determine locations for data centers.
Efficiency involves tools for manageability, operability, and sustainability, which translate into cost savings. "It's nice to go to Steve (Ballmer) and say you can save millions of billions of dollars," she said. Trust is having the security, reliability, availability, performance, and familiarity with the local languages and markets, Chrapaty explained.
Trust is also the user community feeling that privacy will be respected as people live their lives online. That is a challenge that every large site will have to grapple with long after technology issues are resolved.
SAN FRANCISCO--During a panel discussion at the Structure conference here Wednesday, various representatives from the cloud-computing world offered their views. Panelists included:
- Christophe Bisciglia, senior software engineer, Google
- Jason Hoffman, founder and chief technology officer, Joyent
- Tony Lucas, CEO, XCalibre Communications
- Lew Moorman, senior vice president of strategy and corporate development, Rackspace
- Geva Perry, chief marketing officer, GigaSpaces
- Joe Weinman, VP of Strategic Solutions at AT&T
The panelists agreed that there will be open and proprietary, as well as specialized, cloud platforms. The discussion got a little heated between Google's Bisciglia and Joyent's Hoffman on the subject of open platforms and Google's BigTable software for distributed data storage.
"The question is, is it about selling your soul? You can't leave," Hoffman said during the panel, referring to Google's App Engine and cloud-computing platform. "There's been a lot published on what an open, loving cloud should do. We should give people real assurances that the cloud is a good place to be."
During the panel, Bisciglia said people can build a better mouse trap and compete with what Google offers. "When we publish something on BigTable, it is not to say that it is a lock-in, but it's our attempt to say that this is something that worked for us," he said.
"If your data is in Google's BigTable, you can't pull it out. You can't install it on your own hardware or leave. You have big brother telling you everything will be OK," Hoffman told me after the panel concluded. "One solution is that Google should provide nice export tools, but that doesn't solve the problem of where you run it. If I were a big enterprise company, I might want to run BigTable on my own hardware. If Oracle had the equivalent of a Google App Engine, a customer could run it on their own or someone else's hardware. What if Facebook started on Google App Engine? They would be stuck on Google."
Joyent is a David facing at least one Goliath, and its livelihood depends on an open-infrastructure approach. It doesn't have the market power to create its own standards. The company is doing 5 billion page views a month, which includes about 25 percent of third-party Facebook application pages, according to CEO David Young.
Joyent is working on a cloud-computing standards initiative called Cloud 9.
"We want to make it easy for people to leave," Hoffman said, adding that application programming interfaces should not hard-code server provider names into APIs.
"We need to interoperate just like the electrical grid," Young said. "Google's BigTable and Amazon's SimpleDB are not pushing standards, which are needed to move things forward."
In the early morning at Structure 08, AMR Research's Jonathan Yarmis described various tech trends around cloud computing. Mendel Rosenblum, a founder and technical lead behind VMware, outlined the role of virtualization in data centers.
Amazon CTO Werner Vogels
(Credit: Dan Farber)
Now Werner Vogels, vice president and CTO at Amazon.com, is talking about why Amazon is in the cloud computing business, how it got there, and why customers should want it. Instead of every company or developer doing the heavy lifting, dealing with the "muck" as Amazon CEO Jeff Bezos likes to say, Amazon opened up its software-as-a-service stack (Amazon Web Services) and infrastructure (Elastic Compute Cloud, S3, and SimpleDB) to external parties.
I've heard the Amazon story many times, but Vogels offered a few new tidbits, such as the fact that S3 is storing 18 billion objects and how Amazon thinks about building its 1,000 services.
"Amazon built these services internally as tools, not as a framework. Each team can use whatever development tools they need. Infrastructure services need to be very generic and people can switch to competing services internally," Vogels said. For example, users could work with Amazon EC2 and a different storage service than S3.
Vogels outlined the core objectives and principles that cloud computing must meet to be successful:
Vogels noted that cloud computing is in its infancy, but it's not difficult to see the broad outline of how it will evolve. Nick Carr's book The Big Switch tells the story.
Forget about flashy Web 2.0 applications. The real, geeky coolness of the Web is the growing acreage of data centers that deliver bits to billions of devices. At GigaOM's Structure 08 conference in San Francisco on Wednesday, infrastructure--"clouds" of servers, storage and networks--was the headliner.
Conference host Om Malik kicked off the event, which is centered on the massive build out of infrastructure to power the wired planet.
(Credit: Dan Farber)
Jonathan Yarmis, vice president of advanced, emerging and disruptive technologies at AMR Research, said changes in the next five years will make the past Internet revolution feel like child's play. He didn't explain exactly how the next five years will be more revolutionary than evolutionary, but outlined the convergence of several technology trends.
Jonathan Yarmis
(Credit: Dan Farber)
Social networking, mobility, alternative business models (advertising and different license and revenue models), and cloud and stream computing are mutually reinforcing trends that are driving innovation. The average life of a cell phone is 21 months, which allows users to take advantage of improvements in infrastructure.
"Cloud computing is not just for software as a service, but EaaS--Everything as a Service. Many things as discrete products become cloud-based offerings. It offers us an independence of device and location that is profoundly important," Yarmis said. Spoken like a true analyst--come up with another way to market a concept that is also known as on demand, cloud, SaaS, or utility computing.
One of the infrastructure challenges is not just storing and analyzing the growing body of data but reading, reacting, and responding in real time to disposable streams of data, Yarmis explained. The network and software need to get much smarter and faster to enable real-time filtering and streaming for every user.
"We've reached a tipping point. All of the waves of disruptive tech are coming together at the same time," Yarmis said. He predicted that the economic downturn will help spur the adoption of cloud computing. Given the lower cost model and technological advances pioneered by companies like Amazon, Google, and Salesforce.com in cloud computing, that's a sure bet.
In this video interview, Jonathan Heiliger, vice president of technical operations at Facebook, talks about managing Facebook's hypergrowth. Heiliger is a rock star infrastructure geek. He was the CTO of Global Crossing at age 23, worked at Marc Andreessen's Loudcloud and spent time as the head of Web engineering at Walmart.com.
During the interview, Heiliger said that Facebook has more than 10,000 servers and leverages mostly open-source software across a distributed architecture, with thousands of MySQL instances. "It's almost a new challenge every day," Heiliger said regarding the challenges of keeping up with the growth in users--about 250,000 new users per day. He said that Facebook is considering building its own data centers, but for now is renting.
In the blogosphere of early and ardent technology adopters, sites like Twitter and Seesmic have justifiably gained the attention and buzz. Twitter has had a series of well-documented outages, and this weekend Seesmic seized up when videos of movie celebrities, such as Steven Spielberg and Harrison Ford, were posted to the video sharing site.
It also caused problems at partner sites, like TechCrunch, that embed Seesmic video comments (vomments) on their pages.
These recurring problems once again demonstrate that the much loved Web 2.0, consisting of many start-ups lacking adequate infrastructure and stable code, is unreliable. The larger start-ups and established sites have the funding to deal with traffic spikes, but they are not invulnerable to outages. Google, Yahoo, Microsoft, Salesforce.com, and others have delivered blank pages. I grappled with some brief outages caused by server overloads as we were testing new pages on my own site recently.
All the unreliability and hiccups simply prove that Web 2.0 is like Swiss cheese, full of holes that lead to 404s. These are growing pains that these fledgling companies will survive if they can continue to innovate, attract more users, and increase uptime.
As the user base grows for these start-ups, there will be proportionally increased outrage associated with downtime, even if they are free services. That's why Facebook borrowed $100 million recently to provide funding to expand its server farms and associated infrastructure.
In his blog, Seesmic founder Loic Le Meur said the company had 99.99 percent uptime until the recent problems. The downtime was exacerbated by a lack of communication with users by Seesmic, which the company plans to address.
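For a sense of scale, here is a rough calculation of how little downtime a 99.99 figure leaves room for. It assumes the number is an uptime percentage measured over a full year. Le Meur's post, as quoted here, does not state the measurement period, so the function name and the default period are illustrative assumptions.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: float = 365.0) -> float:
    """Return the minutes of downtime permitted by an uptime percentage over a period."""
    total_minutes = period_days * 24 * 60
    # The downtime budget is the fraction of the period not covered by uptime.
    return total_minutes * (1 - uptime_percent / 100)


yearly_budget = allowed_downtime_minutes(99.99)        # budget over 365 days
monthly_budget = allowed_downtime_minutes(99.99, 30)   # budget over 30 days
```

Under those assumptions the budget works out to roughly 52.6 minutes a year, or a little over 4 minutes in a 30-day month, so a single weekend seizure can consume it many times over.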
Following is a Seesmic video I did on the issue:
See also: A business model for Twitter: Pay up