April 23, 2004 4:00 AM PDT
Google's chastity belt too tight
Despite claims of "advanced proprietary technology," Google's opt-in porn filter proves no better than the tools of the last decade, blocking many harmless sites, a CNET News.com investigation shows.
The indiscriminate nature of the tool is bad news for affected businesses.
By an accident of spelling, the domain name of PartsExpress.com, an Ohio electronics retailer, includes an unfortunate string of letters, "sex," which is enough to block the Web site from Google's filtered results.
PartsExpress.com is not alone. A CNET News.com investigation shows that Google's SafeSearch filter technology incorrectly blocks many innocuous Web sites based solely on strings of letters such as "sex," "girls" or "porn" embedded in their domain names.
Google's SafeSearch flaws are more than academic--they can have serious consequences for innocent Web site operators blocked out by them. Google is the most widely used search engine on the Web, and failure to appear in its listings can have a direct impact on sales for some companies, particularly smaller enterprises with limited marketing budgets.
Research company WebSideStory reported last month that Google claimed an all-time high in search referrals, 41 percent of the United States total, and the search giant's market share is steadily expanding.
"Traffic from Google can make or break a business," said Maria Medina, whose family-run clothing business at ALittleGirlsBoutique.com doesn't pass the SafeSearch censor. "Here I am, a mom of four children, creating an at-home business that sells little girl dresses and accessories, in order to spend more time with my children, and I have been filtered out as not being family friendly. Ridiculous."
Matt Cutts, the Google engineer who designed SafeSearch four years ago, said his algorithm looks for a "relatively small" number of trigger words in a Web page's address. If one of those words appears, the SafeSearch algorithm puts the address on a block list and does not take the next step of evaluating the content of the site. "We try to find the best trade-off of precision, recall and safety," Cutts said. "People who opt in to SafeSearch are mostly OK with us being on the conservative side."
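As a rough illustration of the kind of matching Cutts describes, the Python sketch below flags an address whenever a trigger string appears anywhere in its host name, without looking at the page itself. It is not Google's code: the trigger list holds only the three strings named in this article (the real list is undisclosed), `TRIGGER_WORDS` and `is_blocked` are hypothetical names, and checking the host name rather than the full address is an assumption.

```python
from urllib.parse import urlparse

# Hypothetical trigger list: only the three strings named in the article.
# The actual SafeSearch list has not been disclosed.
TRIGGER_WORDS = ("sex", "girls", "porn")


def is_blocked(url: str) -> bool:
    """Return True if any trigger string occurs anywhere in the host name.

    Page content is never examined, mirroring the shortcut described above.
    """
    # urlparse only recognizes the host when a scheme is present.
    if "://" not in url:
        url = "http://" + url
    host = urlparse(url).hostname or ""
    return any(word in host for word in TRIGGER_WORDS)


for site in ("PartsExpress.com", "ALittleGirlsBoutique.com",
             "Sporn.com", "example.com"):
    print(site, is_blocked(site))
```

Under this model, "partsexpress" is blocked because the letters "s-e-x" happen to straddle two ordinary words, which is the failure the affected site operators describe.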
Cutts would not disclose how many Web searches are done with SafeSearch enabled, saying only that it's a small percentage of the millions of queries handled by Google each day. But the sloppy filter stands out as a rare black eye for a company that prides itself on superior search technology and boasts on its payroll one of the world's highest concentrations of computer science doctoral degrees. Google claims SafeSearch "uses advanced proprietary technology that checks keywords and phrases" and filters out only Web pages "containing pornography and explicit sexual content."
"That's not very bright," said Karen Schneider, a librarian who runs the Librarians' Index to the Internet and has made a study of filtering software. SafeSearch is "certainly evocative of the very primitive CyberSitter-type tools of the mid-1990s--not a tool of fairly sophisticated development."
The Scunthorpe problem
For years, Web content filters have drawn criticism for inaccuracies. In a famously embarrassing incident in 1996, America Online's errant dirty-word filter prevented residents of the British town Scunthorpe from signing up as new customers. Google's SafeSearch makes the same mistake, blocking sites such as the local news site ThisIsScunthorpe.co.uk and the housecat-adoption site ScunthorpeDistrictCatsProtection.co.uk.
SafeSearch also marked as unsafe for children JewishSussex.com, a religious Web site; EssexCountyBeeKeepers.org of Topsfield, Mass.; BluesExcuse.SouthBurnett.com.au, an Australian blues band's site; BassExpert.com; and the Anglo-Saxon history site RomansInSussex.co.uk.
Gareth Roelofse, the Web designer of RomansInSussex.co.uk, said his filtering complaints are broader than just Google. "We also found many library Net stations, school networks and Internet cafes block sites with the word 'sex' in" the domain name, Roelofse said. "This was a challenge for RomansInSussex.co.uk because its target audience is school children."
"I think it would be nice if Google would have a 'white list' for sites like ours, but this would involve human man-hours, I guess," said Roelofse, who designed the site on behalf of the Sussex Archaeological Society and local museums.
Cutts, the Google software engineer, noted that the SafeSearch Web page permits visitors to contact the company with complaints. "In most cases it's a pretty unambiguous usage," Cutts said about the word "sex" in domain names and Web addresses. "No filter can be 100 percent accurate. We're always willing to take a fresh look at our filter and see how we can improve it."
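The "white list" Roelofse suggests could, in principle, sit in front of a substring filter. The sketch below is one hedged way to picture it, not a description of anything Google has built: the trigger strings are the three named in this article, and `ALLOWLIST` and `is_blocked` are hypothetical names. Each allowlist entry would have to be reviewed and added by a person, which is the cost Roelofse acknowledges.

```python
# Hypothetical trigger strings (the three named in the article) and a
# hypothetical, hand-maintained allowlist of reviewed host names.
TRIGGER_WORDS = ("sex", "girls", "porn")
ALLOWLIST = {"romansinsussex.co.uk"}


def is_blocked(host: str) -> bool:
    """Consult the allowlist first, then fall back to substring matching."""
    host = host.lower()
    # Treat "www.example.org" and "example.org" as the same site.
    if host.startswith("www."):
        host = host[4:]
    if host in ALLOWLIST:
        return False
    return any(word in host for word in TRIGGER_WORDS)


print(is_blocked("RomansInSussex.co.uk"))  # reviewed and allowlisted
print(is_blocked("JewishSussex.com"))      # not reviewed, still matched
```

The second call shows the limit of the approach: an allowlist clears only the sites someone has already examined, so every other innocuous domain containing a trigger string remains filtered.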
Google is not alone in seeking to lure searchers worried about encountering online raunch and ribaldry: Yahoo offers a "mature Web content" search filter, and Ask Jeeves has set up a separate Web site for kid-friendly searches. But Yahoo's filter isn't as hypersensitive as Google's, and lists domains mentioning Sussex, Essex and Scunthorpe as acceptable.
The flaws in Google's filter have persisted despite research published about a year ago that highlighted overblocking in SafeSearch.
An April 2003 report from Harvard University's Berkman Center described similar but less extensive problems with SafeSearch. That report said some news articles and political Web sites were filtered.
David Drummond, Google's vice president for business development, said that at the time of its development, SafeSearch was designed to be overly cautious. "The thinking was that SafeSearch was an opt-in feature," Drummond said. "People who turn it on care a lot more about something sneaking through than they do about something getting filtered out."
"Plainly silly" blocking
CNET News.com evaluated SafeSearch by testing tens of thousands of random Web pages and identifying which ones were incorrectly listed as pornographic. The results showed that Google encountered many of the same problems that have plagued Internet filters for almost a decade. One 1996 analysis, for instance, showed that CyberPatrol blocked National Rifle Association and gay and lesbian Web sites, and CyberSitter cordoned off Usenet newsgroups such as alt.feminism and soc.support.fat-acceptance.
The ACLU, which has warned against buggy filters
"In the end, the lists are proprietary," Steinhardt said. "Without access to the lists, you don't know precisely what's being blocked. You have to rely on the authors of the lists to have the right judgment."
The word "girls" also tends to lead SafeSearch astray. It incorrectly blocks the Web sites of the private school GirlsSchoolOfAustin.org; the bridesmaid dress shop DressyGirls.com; TatuGirls.com, a Russian band's site; and TheCalicoGirls.com, a Web site devoted to cat poetry.
"Porn" in a domain name can confuse SafeSearch just as thoroughly. It won't display Pornichet.org, devoted to improving tourism for the French seaside town of Pornichet; SpornGroup.com, a New York-based business consultancy; Sporn.com, which sells dog leashes; PornkRocks.com, a site devoted to the band Pornk; and Anti-Kinderporno.de, a German effort to oppose child pornography.
Aaron Wolfe, information systems director for SafeSearch-banned PartsExpress.com, said the company is planning to excise that unfortunate string of letters from its domain name. "We are going to modify our domain name to Parts-Express.com," Wolfe said, adding that the renaming will also help "get around spam filters on e-mail servers."