Business Tech

February 24, 2009 6:35 AM PST

Google apologizes for Gmail outage

by Stephen Shankland

Updated 6:44 a.m. PST to reflect that Gmail service was restored.

Business and personal users of Gmail suffered an outage starting about 1:30 a.m. PST Tuesday, but Google said it's fixed the problem.

"If you've tried to access your Gmail account today, you are probably aware by now that we're having some problems. Shortly after 9:30am GMT our monitoring systems alerted us that Gmail consumer and businesses accounts worldwide could not get access to their email," said Acacio Cruz, Google's Gmail site reliability manager, in a blog posting Tuesday. "We're working very hard to solve the problem and we're really sorry for the inconvenience."

"The problem is now resolved and users have had access restored," Google said on its Gmail status page. "Many" users were affected, Google said.

Google promises that customers paying for the Google Apps service will have access to Gmail at least 99.9 percent of the time each month or Google has to pay a penalty. So far Google hasn't dipped below that, the company said last year.
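To put that 99.9 percent figure in concrete terms, the arithmetic below converts a monthly availability target into a downtime allowance. It is only a sketch: the function name and the 30-day default month are assumptions made for illustration, not terms taken from Google's agreement.

```python
from fractions import Fraction


def downtime_budget_minutes(availability_percent: str, days_in_month: int = 30) -> Fraction:
    """Minutes of downtime a month can absorb at the given availability.

    Hypothetical helper: the 30-day default is an assumption, and exact
    fractions are used to avoid floating-point rounding.
    """
    total_minutes = days_in_month * 24 * 60
    unavailable_share = 1 - Fraction(availability_percent) / 100
    return total_minutes * unavailable_share


budget = downtime_budget_minutes("99.9")
print(f"99.9% monthly availability allows about {float(budget)} minutes of downtime")
```

Under those assumptions, a 30-day month leaves roughly 43 minutes of downtime before the target is missed.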

The company took advantage of the problem to tout the new Gmail Labs feature that permits offline access to Gmail for customers in the U.S. and U.K. With it, people can read, search, label, and archive their e-mail and compose new messages, but of course messages aren't sent or received until network access is restored.

Outages pose problems for Google as it tries to persuade companies to buy into its cloud-computing vision, in which applications are hosted on the Internet rather than on corporate computers. But Google argues its service availability is competitive with most organizations' abilities to run their own e-mail servers.

Originally posted at Webware
January 8, 2009 7:45 AM PST

Salesforce.com outage hits thousands of businesses

by Tim Ferguson

Thousands of businesses were left without access to their applications Tuesday after Salesforce.com's servers suffered a service disruption.

The problem affected all of the software-as-a-service vendor's data centers for at least 40 minutes.

According to a Salesforce.com status page, the problem occurred at 12:40 p.m. PST Tuesday when a core network device failed, stopping all data from being processed in Japan, Europe, and North America.

When the system failed to trigger a failover to redundant systems, Salesforce.com staff had to carry out a manual recovery.

Most of the services were restored in about 40 minutes, according to Salesforce.com, and all services were back online about two and a half hours later.

"While we are confident the root cause has been addressed by the work-around," the company said, "the Salesforce.com technology team will continue to work with hardware vendors to fully detail the root cause and identify if further patching or fixes will be needed."

Freeform Dynamics senior analyst Tony Lock said that "having a service interruption like this one is certainly noticeable when you have a vendor like Salesforce.com that has been delivering pretty good service over the course of the last five or six years."

Lock added that as long as software-as-a-service vendors continue to deliver good service levels and availability, the occasional interruption is acceptable since "nobody expects IT to be perfect."

"It will not have a major impact on organizations' plans for the adoption of software as a service. I think that software as a service will continue to grow as it has been doing over the course of the last few years," he said.

Tim Ferguson of Silicon.com reported from London.

December 4, 2008 3:12 PM PST

Google weasels out of uptime promise? Not so fast

by Stephen Shankland

Correction, 4:05 p.m. PST: The name of the senior product manager for Google Apps was misspelled. It is Rajen Sheth. Also, Pingdom had an incorrect number for total downtime in its "more likely" scenario. It is 55 minutes.

Google's SLA loophole?

Pingdom argues Google can get away with more outages because smaller ones fall between the service level agreement gaps.

(Credit: Pingdom)

Pingdom, a company that monitors Web site availability, has concluded that Google gives itself a lot of wiggle room in its service level agreement for its Google Apps service.

The service level agreement (SLA) gives credit to paying customers if the service falls short of promised availability--99.9 percent measured monthly for Google Apps. Pingdom points out that because Google only counts downtime periods that last at least 10 minutes, the company could get away with intermittent problems that are shorter.

"What if Google Apps was down for 9 minutes, up for 1 minute, down 9 minutes, etc.? That would mean 54 minutes of downtime each hour, but Google still wouldn't count it because none of the individual downtimes lasted 10 minutes (or) more," according to a blog entry Thursday. In a "more likely" scenario with outages lasting 3, 8, 12, 5, 9, 14, and 4 minutes, the total of 55 minutes of actual downtime would only be counted as 26 minutes for purposes of the SLA.
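Pingdom's arithmetic can be reproduced in a few lines. The sketch below assumes the rule exactly as the article describes it, namely that only downtime periods lasting at least 10 minutes count toward the SLA; the function and variable names are illustrative, not part of any Google or Pingdom tool.

```python
THRESHOLD_MINUTES = 10  # per the article: shorter periods are not counted


def sla_counted_downtime(outages_minutes, threshold=THRESHOLD_MINUTES):
    """Sum only the outages long enough to count under the SLA rule."""
    return sum(minutes for minutes in outages_minutes if minutes >= threshold)


# Pingdom's "more likely" scenario, in minutes per outage.
more_likely = [3, 8, 12, 5, 9, 14, 4]
actual_total = sum(more_likely)
counted_total = sla_counted_downtime(more_likely)

# Pingdom's extreme case: 9 minutes down, 1 minute up, repeated for an hour.
extreme_hour = [9] * 6
extreme_actual = sum(extreme_hour)
extreme_counted = sla_counted_downtime(extreme_hour)

print(f"more likely: {actual_total} actual, {counted_total} counted")
print(f"extreme hour: {extreme_actual} actual, {extreme_counted} counted")
```

Only the 12- and 14-minute outages clear the threshold in the first scenario, which is where the 26-minute figure comes from.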

Google, while concerned about uptime, isn't as concerned about the SLA terms or what it called Pingdom's "hypothetical scenario," though.

"If you look at our SLA and compare to others' in the industry, it's identical," said Rajen Sheth, senior product manager for Google Apps, pointing as an example to Microsoft's hosted Exchange service. Service providers need to set a threshold somewhere "to distinguish between a real outage and intermittent errors," he said, and Google is trying to be transparent about where it sets its own.

That may sound like dodging the question about an accumulation of small outages, but the company does have a point that a blip probably shouldn't count as much as a catastrophe. Realistically, shortening the interval would probably squeeze Google on the other end to lower its 99.9 percent uptime commitment or perhaps raise its $50 per user per year price. There's no free lunch here for customers.

And after all, although SLAs are important, customers will rapidly abandon ship if a service breaks, credit or no credit.

Notably, Google monitors not only each customer account's uptime, but also each user of that account. It also gives credits even if only part of the service goes down while other parts are available, Sheth said. And though only some customers were affected by a significant Gmail outage in August, Google offered SLA credits to all Google Apps customers.

Google has promised a better dashboard to inform customers about outages. "During the times when we've seen outages, the No. 1 thing we need to do is communicate with our customers," Sheth said.

October 23, 2008 7:10 AM PDT

Amazon's Linux cloud computing out of beta, joined by Windows

by Stephen Shankland

A central part of Amazon's online computing foundation is growing up.

The Elastic Compute Cloud, a service that gives customers on-demand access to Linux servers, is now out of beta testing, said Jeff Barr, evangelist for the collection of online options collectively called Amazon Web Services.

"Amazon EC2 is now in full production," Barr said in a blog post Thursday. And as promised, EC2 now offers Windows in a beta test, joining Sun Microsystems' OpenSolaris and Solaris Express Community Edition.

Along with those moves, EC2 now comes with a service level agreement, a formal commitment that the service will be available at least 99.95 percent of the time. This type of agreement makes it easier for businesses to place faith in the service. Previously, the only AWS component with a service level agreement was the Simple Storage Service (S3), which provides online data storage.

Customers pay for AWS according to how much they need: more servers, more storage space, and more network capacity means more charges. But unlike with computing infrastructure built in-house, when customers don't need it anymore, they can stop paying for it. AWS has had outages, but it continues to gain in popularity, and Amazon has been lowering some AWS prices.

Amazon collects multiple gigabits of monitoring data each second for its Elastic Compute Cloud service.

(Credit: Amazon.com)

Barr also described features planned for 2009 that signal growing sophistication for AWS overall and should make it easier to administer AWS--either manually or by letting it run itself better. Barr listed four areas:

• Management Console: The management console will simplify the process of configuring and operating your applications in the AWS cloud. You'll be able to get a global picture of your cloud computing environment using a point-and-click web interface.

• Load Balancing: The load-balancing service will allow you to balance incoming requests and traffic across multiple EC2 instances.

• Automatic Scaling: The auto-scaling service will allow you to grow and shrink your usage of EC2 capacity on demand based on application requirements.

• Cloud Monitoring: The cloud-monitoring service will provide real time, multidimensional monitoring of host resources across any number of EC2 instances, with the ability to aggregate operational metrics across instances, Availability Zones, and time slots.

In a separate blog post, Amazon Chief Technology Officer Werner Vogels described some of Amazon's work in ensuring reliability and efficiency.

"We relentlessly measure every possible resource usage parameter, every application counter, and every customer's experience. Many gigabits per second of monitoring data flows continuously through the Amazon networks to make sure that our customers are getting serviced at the levels they can expect and at an efficiency level the business desires," Vogels said.

Among the customers using the Windows version of EC2 are Autodesk, RenderRocket, and Eli Lilly, Amazon said.

"This is a huge step forward in maximizing our results relative to IT spend, and now that Amazon EC2 runs Windows and SQL Server, we have even greater flexibility in the kinds of applications we can build in the AWS cloud," Dave Powers, an Eli Lilly associate information consultant who uses the service to process research data, gushed in a statement.

Autodesk uses EC2 for back-end data processing tasks, said Mike Haley, a senior architect of search engineering, and RenderRocket uses the service for 3D graphics work for TV and movies, Amazon said.

Originally posted at Webware
July 21, 2008 10:47 PM PDT

Amazon offers automatic credit for S3 outage

by Stephen Shankland

Customers affected by Sunday's outage of Amazon's Simple Storage Service, an online data storage plan, won't have to do anything to get credit for the hours-long glitch.

Some Amazon Web Services were down for hours on July 20.

(Credit: Amazon)

"We'll be announcing on the developer forum momentarily that we'll be waiving our standard SLA (service-level agreement) process and applying the appropriate service credit to all affected customers for the July billing period," the company said Monday evening in a statement about the S3 outage. "Customers will not need to send us an e-mail to request their credits, as these will be automatically applied. This transaction will be reflected in our customers' August billing statements."

S3 provides an online mechanism where customers can pay to store data according to the amount they need stored. It's one of a host of Amazon Web Services, but it's the only one so far covered by a service-level agreement that promises high reliability.

Amazon's S3 and the Elastic Compute Cloud (EC2) are two prominent examples of the concept of cloud computing, in which specialists offer online services on which others can base their own applications. Another variety of cloud computing offers more specific services such as online e-mail or office suites from Zoho, Google, Adobe, and Yahoo.

July 21, 2008 4:55 PM PDT

Amazon S3: For now at least, sometimes you have to reboot the cloud

by Stephen Shankland

Amazon.com's Simple Storage Service, S3, hit a big pothole Sunday on the road to the glorious cloud computing future, with an outage taking the storage system offline for several hours. Should we be surprised?

No. In short, the computing industry is making up what's called cloud computing as it goes along, often with a server and networking architecture that's one part improvisation to two parts proven best practice. Frankly, it's notable to me that some services are as reliable as they are.

Some Amazon Web Services were down for hours on July 20.

(Credit: Amazon)

Computing practices tend to gravitate toward one of two poles. One is tight control, higher prices, and high reliability. The other is openness, lower cost, but some degree of flakiness. High-end mainframes and Unix servers can handle transaction loads that would crush most machines using Intel or AMD x86 processors, but they cost more and are less adaptable. Most of the cutting-edge, large-scale action in the Internet--including various cloud computing efforts--is happening with the more free-wheeling technology.

One company operating at colossal scale, Google, has concluded it's better to buy cheap x86 servers and write software that automatically paves over hardware failures. The bigger problem comes when a large system composed of many interacting components loses track of its self-conception, and rebooting a single system or swapping out a hard drive isn't sufficient.

Essentially, Amazon had to reboot S3. Here's how the company described its S3 problem in a statement:

"As a distributed system, the different components of S3 need to be aware of the state of each other. For example, this awareness makes it possible for the system to decide which redundant physical storage server to route a request to. We experienced a problem with those internal system communications, leaving the components unable to interact properly, and customers unable to successfully process requests. After exploring several alternatives, the team determined it had to take the service offline to restore proper communication and then bring service online again. These are sophisticated systems and it generally takes a while to get to root cause in such a situation," Amazon said. "We will be providing our customers with more information when we've fully investigated the incident."

Afterward, Om Malik called cloud computing frail: "The S3 outage points to a bigger (and a larger) issue: the cloud has many points of failure--routers crashing, cable getting accidentally cut, load balancers getting misconfigured, or simply bad code." And he's right, to a degree, but there are three things that shouldn't be overlooked before writing cloud computing off as a failure.

• First, you should compare the problems of cloud computing to the alternatives, including running computing services in-house. Last I checked, corporate data centers also have crashing routers, bad code, and misconfigured load balancers.

• Second, you can expect reliability to increase as the companies providing cloud infrastructure and services explore the terra incognita.

• Third, don't confuse Web 2.0 with the foundational elements of cloud computing. A Web site that uses an online application at another site to mash up data from some other sites then present it using a service from yet another site is indeed susceptible to numerous points of failure. But a single-purpose infrastructure such as Amazon S3 is at least in theory a more tightly controlled, single-purpose utility that can offer higher reliability.

That's not to excuse Amazon's outage or gloss over the effect it had on business partners reliant on it. After all, S3 is the sole part of Amazon Web Services that comes with a service level agreement to promise customers reliability.

But a little silver lining to this particular cloud problem is that Amazon is setting expectations at the right level: They said in a statement, "Any downtime is unacceptable, and we won't be satisfied until it is perfect."
