July 9, 2003 1:28 PM PDT

Google cache raises copyright concerns

Like other online publishers, The New York Times charges readers to access articles on its Web site. But why pay when you can use Google instead?

Through a caching feature on the popular Google search site, people can sometimes call up snapshots of archived stories at NYTimes.com and other registration-only sites. The practice has proved a boon for readers hoping to track down Web pages that are no longer accessible at the original source, for whatever reason. But the feature has recently been putting Google at odds with some unhappy publishers.

"We are working with Google to fix that problem--we're going to close it so when you click on a link it will take you to a registration page," said Christine Mohan, a spokeswoman at New York Times Digital, the publisher of NYTimes.com. "We have established these archived links and want to maintain consistency across all these access points."

Google offers publishers a simple way to opt out of its temporary archive, and scuffles have yet to erupt into open warfare or lawsuits. Still, Google's cache links illustrate a slippery side of innovation on the Web, where cool new features that seem benign on the surface often carry unintended consequences.

The issue is particularly relevant at Google, a company that prides itself on creativity and routinely floats trial balloons for new features and services. Its culture of innovation may become increasingly risky as Google, which draws millions of visitors to its site daily and redirects them to others through secretive search formulas, cements its position as one of the most popular and powerful companies on the Web.

At the heart of Google's caching dilemma lies a thorny legal problem involving a core Web technology: When is it acceptable to copy someone else's Web page, even temporarily?

A phantom life for dead pages
Google's cache, a feature introduced in 1997, is unique among commercial search engines, but it's not unlike other archival sites on the Web that keep digital copies of Web pages. Google's relatively little-known feature lets people access a copy of almost any Web page, within Google's own site, as it appeared when the search giant last indexed it. That could mean the page accessed is either minutes or months old, depending on when Google last crawled it.

Unlike formal Web archive projects, Google says its cache feature does not attempt to create a permanent historical record of the Web. Rather, the company actively seeks to delete dead links; once a Web page disappears, the search engine seeks to purge that record and any related cached page as quickly as possible.

Still, Google's cached pages have proven to be a treasure trove for investigators seeking to recover data pulled from public Web sites. In one high-profile example, security and privacy expert Richard Smith copied Web pages detailing the backgrounds of Dr. John Poindexter, head of the Pentagon's Information Awareness Office (IAO), and other officials, from the Google cache days after they were removed from the IAO Web site. The pages were deleted after public reports surfaced on the office's development of a massive computer system to spy on Americans and potential terrorists.

"When something's been yanked, Google cache is a good place to grab it and save for posterity, because you don't know how long Google will have it," said Smith.

Google claims its caching feature benefits Web surfers by letting them access a site that may be malfunctioning or offline. Also, its cached pages highlight terms that match a search query "to make it easier for users to find relevant information," according to a spokesman at the Mountain View, Calif.-based company.

Lawyers, start your search engines
As seemingly benign and beneficial as it is, some Web site operators take issue with the feature and digitally prevent Google from recording their pages in full by adding special code to their sites. Among other arguments, they say that cached pages at Google have the potential to detour traffic from their own site, or, at worst, constitute trademark or copyright violations. In the case of an out-of-date news page in Google's cache, a Web publisher could even face legal troubles because of false data remaining on the Web but corrected at its own site.

For this reason, search experts and copyright lawyers expect the issue to come up in a court of law, joining the ranks of copyright disputes that have surfaced because of technology innovation.

"It's very much an issue that has yet to be tested, and I fully expect that it will be," said Danny Sullivan, industry pundit and editor of Search Engine Watch.

Admittedly, Google's cache is like any number of backdoors to information on the Web. For example, proxy servers can be the keys to a site that is banned by a visitor's hosting Web server. And technically, any time a Web surfer visits a site, that visit could be interpreted as a copyright violation, because the page is temporarily cached in the user's computer memory.

The digital universe is constantly changing, but its content can be either fleeting or permanent. Several Web sites, including the Internet Archive Wayback Machine and the Sept. 11 Digital Archive, have surfaced to preserve information on the Web and to keep permanent historical accounts of events and Web pages. Yet, many more pages, and even those in Google's cache, are eventually lost in the digital ether. The average lifespan of a Web site is 100 days, according to estimates by the Internet Archive.

Still, copyright lawyers and industry experts say that there are legally uncharted waters around a commercial caching service.

"Many of us copyright lawyers have been waiting for this issue to come up: Google is making copies of all the Web sites they index and they're not asking permission," said Fred von Lohman, an attorney at the Electronic Frontier Foundation. "From a strict copyright standpoint, it violates copyright."

Most search engines make a statistical record of a Web page when they "spider" it, or use "robots" to scan the page for meaning or context to related queries. For example, the engine can point to specific information contained on a page that's related to a search term, but it often doesn't have the complete picture of the page. Google goes one step beyond, however, by taking a digital picture of pages and making it available to visitors in cached links. Those pictures exist temporarily on its site until the next time Google crawls that particular page, which can happen in a few days or in six weeks or more.

Legally, what could differentiate Google from other archival sites that record pages is that it is a commercial site and that it has enormous scope and influence on the Web.


But what's kept the feature off most Web sites' radar is that, anecdotally, most people don't click on the cache. Even Google says people only "occasionally" click its cached links. If more people did, Web publishers might lose visitors--and potentially advertising dollars, which no one can afford to lose as Web publishing gets back on its feet.

Practically speaking, Web sites can "opt out," or include code that bars Google from caching their pages. A robots exclusion file, such as the one at "www.nytimes.com/robots.txt," or a "NOARCHIVE" meta tag on the page typically does the job. And that's largely what's kept the cache feature from being controversial.
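As a rough illustration of the two opt-out mechanisms, the Python sketch below checks a robots exclusion file and scans a page for a "noarchive" meta tag. The robots.txt rules, the example.com URLs and the helper names (may_crawl, forbids_caching) are hypothetical, chosen only for this example; they are not The New York Times' actual rules or Google's code. Note the difference in effect: a robots.txt "Disallow" rule keeps a crawler away from a page entirely, while "noarchive" still lets the page be indexed but asks the engine not to show a cached copy.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example publisher (an assumption,
# not any real site's file).
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /archive/

User-agent: *
Disallow:
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> style tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def forbids_caching(html: str) -> bool:
    """Return True if the page carries a noarchive directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" in parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="NOARCHIVE"></head></html>'
    print(may_crawl(ROBOTS_TXT, "Googlebot", "http://www.example.com/archive/story.html"))
    print(forbids_caching(page))
```

A publisher that wants to stay in search results but out of the cache would rely on the meta tag alone; blocking the crawler in robots.txt removes the page from the index as well.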

Search Engine Watch's Sullivan said that, even though some publishers are wary of the caching feature, many don't block Google's robots for fear of losing favor in the company's powerful search rankings. He said some Webmasters believe there's a stigma associated with the "no cache" tag, because many sites that use it have been accused of attempting to use banned methods to manipulate Google's rankings. Google said the "no cache" tag does not affect rankings.

Cache now, pay later?
Some legal experts say Google may be on shaky ground by caching first and asking questions later.

A provision in the Digital Millennium Copyright Act (DMCA) includes a safe harbor for Web caching. The safe harbor is narrowly defined to protect Internet service providers that cache Web pages to make them more readily accessible to subscribers. For example, AOL could keep a local copy of heavily trafficked Web pages on its servers so that its members could access them with greater speed and less cost to the network. Copyright lawyers are divided on whether that safe harbor would protect Google if it were tested in court.

"Most people agree that the caching exception in the DMCA is obsolete," von Lohman said. "I don't think it would cover Google's cache. Google is not waiting for users to request the page. It spiders the page before anyone asks for it."

Still, other lawyers argue that Google's practice would be protected by fair-use laws. A judge might look at the market impact of Google's caching and find that it's valuable, given that it could ultimately drive traffic to the cached site. Or the reverse could be true, depending on the nature of the page.

For its part, Google is confident that the service is within the law. "We've evaluated this from a legal perspective, including copyright law, and have determined that Google's cached page service complies with the law," a Google spokesman said.

A similar issue has played out in the courts in an image-searching case, Kelly v. Arriba Soft, filed in April 1999. Leslie Kelly, a photographer, sued the company for copyright infringement when its visual search finder cataloged thumbnails and full-size versions of his digital photos and made them accessible via its own search engine.

The court initially ruled against Kelly based on the "established importance of search engines," but Kelly appealed and won. In Feb. 2002, the 9th U.S. Circuit Court of Appeals held that Arriba's use of thumbnail images of Kelly's photos was fair use, but its display of full-size images was not fair use, because it was likely to harm the market for Kelly's work by reducing visits to his Web site and by allowing free downloads. But the opinion on full-size images was remanded by the 9th Circuit Court this week and is set to go to trial in the lower court of central California.

Judith Jennison, defense lawyer for Arriba Soft, said that one of the issues in the case is that Arriba Soft, in its process of indexing the Web, made copies of Kelly's photos and saved them for 24 hours in its servers. The 9th Circuit Court agreed that creating that copy is fair use under copyright law, she said, adding that there would be a slightly different analysis in a case related to Google. Also, the fact that the search site has an opt-out program would likely illustrate that the market for original copyrighted works can be protected, which is a significant factor in fair-use analysis.

"In Google's case, the result would likely be the same, because the temporary caching for indexing purposes would be fair use per Kelly v. Arriba Soft," Jennison said.

While it seems that many Net publishers haven't formed an official policy on Google caching, they say they are examining how it affects their business.

Randy Stearns, executive producer for ABCNews.com, said he's somewhat concerned about his company's news pages being archived temporarily on Google, because readers might access information that is not up-to-date or, in the worst case for a daily news outlet, is inaccurate. Theoretically, if a news report was issued with errors and was subsequently fixed on the publisher's site, but the erroneous report still existed in a cached version, it could raise legal issues for the publisher, he said.


Other publishers dismiss any threat, saying that not enough people actually click on those links to be a detriment to traffic. "People who find objection to what Google does likely spend enormous amounts (of time) on their content and refresh it regularly," said Harry Lin, head of ABC.com.

In contrast with the priorities of some news publishers, Web archivists say preserving pages as they first appeared can offer important documentary records for historians and others.

Brewster Kahle, head of the Wayback Machine, said many people use its archive for patent research, or "prior art" searches. Designers and students have used the archive to see the evolution of Web site design and display, he added, and the Smithsonian has used subsets of the collection in the Presidential Election memorabilia room.

News publishers agree that Google's cache is also valuable if, for example, their site was inaccessible because of technical difficulties.

"It's a great, wonderful feature, and I don't know that copyright laws would protect them," said Search Engine Watch's Sullivan. "But most people are concerned about getting into Google, not getting out of it."

 
