Web site owners might be amazed to learn that one of the biggest sources of duplicate content isn't external, but rather internal.
Certainly, popular sites and blogs that syndicate a lot of content have to deal with external duplication, but as I already touched on external duplicate content, we know that there are steps to minimize those challenges and to establish your site as the canonical source.
Internal, or on-site, content duplication tends to come in a few key ways, the first of which is within the key page elements. The second is from the content itself; similar to e-commerce sites using stock product copy, you may be using your own copy over and over again on your site. Third, it simply may come from too little differentiated copy.
Are you being outranked by you? Is "your" content showing up in searches, but on sites that aren't yours? Do you have multiple websites that compete against each other? Well this discussion on duplicate content from external sources should be right up your alley.
Earlier in the week, I started our discussion on duplicate content by trying to lay to rest the idea of a duplicate content penalty. Now we pick up that discussion with one aspect of duplicate content . . . content duplication from other sites.
While I'd love to start out our discussion with the idea that external duplicate content is the hardest to deal with, that may not always be the case as you'll see when we talk about duplication on our own websites. For now though, we are just going to focus on content duplication from other sites.
At this point, you are probably in one of two camps--the "Yes, help me with this please," camp or the "What in the world are you talking about?" camp. So let's start by getting everyone in the same camp at least. External content duplication can come about, generally, in three ways.
Content Theft
In every aspect of life, there are those who want to get ahead through the hard work of others, even illegally or unethically. The Web is certainly no exception to this, especially given the fact that, of all the ways to take advantage of the hard efforts of others, copy-paste must certainly be the laziest--I mean easiest.
Don't feel that this is an issue that only affects big name brands and sites, because anyone who publishes online is susceptible to this kind of attack. Keep in mind that what we are talking about here is essentially copyright infringement, not phishing sites and things like that, which is a whole other level of criminal activity.
Realistically, this is probably the hardest to combat, but in many cases, probably doesn't cause as much damage as you might think. In many ways, we might thank the search engines for this. They're out to deliver the best results they can to searchers and are certainly aware of these issues. Because of this, I truly believe they work really hard to identify authoritative and original sources of content. They can compare content they find based on when they found it, as well as links leading back to that content, and while purely speculation, I would have to imagine that it would be pretty easy for the engines to assign a score to any site based on the proportion of content on the site that appears elsewhere and determine natural and unnatural patterns.
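To make that speculation a little more concrete, here is a minimal sketch of how the proportion of a page's content that appears elsewhere could be measured. It is only an illustration, not how any search engine actually works: the function names `shingles` and `duplicated_share`, and the choice of five-word "shingles" (overlapping word sequences), are my own assumptions.

```python
import re


def shingles(text, size=5):
    """Return the set of overlapping word sequences ("shingles") in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Very short text: treat the whole thing as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def duplicated_share(page_text, other_text, size=5):
    """Fraction of page_text's shingles that also occur in other_text."""
    own = shingles(page_text, size)
    if not own:
        return 0.0
    return len(own & shingles(other_text, size)) / len(own)
```

A share near 1.0 would suggest a page that is almost entirely copied, while a share near 0.0 would suggest original copy. A real system would presumably also weigh discovery dates and inbound links, as described above.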
So what can you do about content theft? You can file reports with the search engines under the Digital Millennium Copyright Act (just search on "Google copyright infringement," or the equivalent for the respective search engine, for specific details), contact the ISP that hosts the infringing domain, or pursue further legal action. But it may be better to first weigh the impact you feel the theft really has against the resources it may take to fight it, and determine whether it is worth your attention to begin with. Sometimes, just an email or letter to the infringer might be enough.
Content Syndication
Ironically, you are probably the most responsible for your own duplicate content on other sites. Writing content and syndicating through article directories or other content syndication services, RSS feeds of blog posts, and press release syndication will probably make up far more of your duplication woes than pirated content.
Each of these instances can be addressed though. Article writing and similar content is best kept unique and different from any content you have on your own site. When it comes to this kind of content, it is often best to develop content for the sites where it is going to be placed anyway, rather than a mass distribution. Of course, you'll also want to include a byline with a link back to your site.
Blog syndication can be handled a little differently. You may decide to include only a summary of your post, or the full post. The pros and cons here must be weighed, since a partial feed may discourage some sites from even syndicating your blog. In many cases, there may be enough differentiation between your blog and the sites where your post is syndicated anyway. However, the best solution is to also include an absolute link back to the blog post on your own site. This helps signal to the search engines that your post is the source.
Press releases can be handled the same way as these other content pieces. Whether you are distributing through wire services or using RSS to syndicate from your site, including links back to your site helps signal the source. Press releases also tend to be more temporary on external sites, though you should certainly keep an archive on your own site.
Micro-Sites
The final source of external content also falls under your control. Micro-site strategy consists of creating additional websites, often around niche topical areas. This strategy evolved out of the idea that if one website was good, then many websites must be better, and would increase the chances of ranking in search engines and the number of listings for a particular search. Some view micro-sites as a good thing, while others view them as bad; however, neither view is particularly accurate. Rather, it is the implementation that makes them good or bad.
Micro-site strategy is a much bigger topic, but bad implementation is directly related to our discussion of duplicate content. Most micro-site implementations result in identical or nearly identical duplication of the main website's pages on the various micro-sites. This isn't surprising since creating unique content for one site, especially for an ecommerce site, is often challenging enough without having to create unique content for multiple sites. But rather than improving or increasing rankings, the micro-sites tend to directly compete with the main site and greater resources are needed to maintain multiple sites. Needless to say, this is why most micro-site implementations are bad.
Like many things, there are a few tools that can be used in the fight against duplicate content. One tool to help you keep on top of potential content theft issues is Copyscape, which lets you enter your page and returns a list of pages with potentially duplicated content.
Several weeks ago at SMX West I had the pleasure of meeting and having lunch with Brian White from Google. White works on Matt Cutts' Web spam team, tirelessly working to make Google's search results the best they can be, ensuring the best user experience. Quite a hefty task indeed.
You'd think that someone who spends his days fighting the never-ending battle that is Web spam might be a bit negative or jaded. If that is the case, he does an amazing job hiding it. Instead, he was upbeat and you could feel the excitement in his voice as he spoke. Here's a guy who loves what he's doing and truly wants to not only improve the searchers' experience on Google, but wants to make the Web a better place. You can't help but like a guy who's fighting the good fight.
Google's new teleportation, its search-within-search function, is getting mixed responses, at least from some site owners, who may be remembering occasions when teleportation in the Star Trek transporter went wrong. Earlier in the month, Google introduced the teleportation functionality as a way to better help searchers find information within a site by providing a search box below the snippet of the top listing, which performs a "site:" search on the domain of that listing using the additional search terms the searcher added in.
The "site:" advanced query is quite familiar to those within the search industry, but much less so to the average searcher. So bringing this functionality front and center for the searcher should be a well-received addition.
When I first saw this, I thought it was interesting--once I was able to get it to show up. It doesn't come up for every site, mainly big-name sites, nor does it come up for every search. One that it did come up for was searching for Amazon.com. After playing around with the teleportation search, I also began wondering how these big-name retailers would react and thought that some might not care for this new functionality. Why would they object?
Let me show you--except I can't use Amazon to do it anymore. According to the New York Times, Amazon is one such retailer that has already objected and asked Google to turn off this functionality for its site. It seems that most of the talk so far, like that happening at Search Engine Land (here and here), has been more about acknowledgment than anything else, but Rishi Lakhani's post at SEO Smarty shows that others have had thoughts similar to mine.
Now, before we go much further, understand that I'm not suggesting ulterior motives here on Google's part or that this is even a good or a bad thing. For regular users, I think this will be well received, and Google pays a lot of attention to delivering the best user experience it can--but that isn't to say that there isn't going to be a potential upside for the PPC program as well.
So let's take a look at some examples of how this may impact results and get a feel for why some site owners may be less than thrilled with this functionality. Let's use national retailer Target as an example while we still can since its site is powered by Amazon. We'll try this on searches for plasma TVs.
Below we see the results that someone might see doing a search in Google just for "plasma tv" which includes eight paid search ads.
Google search for "plasma tv."
Below we see the results that someone might see doing a search in Google for "target plasma tv." Notice how there are no paid search results showing up, and not surprising, Target shows up in the top organic listing.
Google search results for "target plasma tv."
Then let's see what happens if someone searches just on "target." No surprise that Target.com shows up No. 1 again in organic results and still no paid search ads. What is different is the appearance of the teleportation, search-within-search, box showing up below the sitelinks in the Target result, labeled as "Search target.com."
Google search results for "target."
Then when we do a teleportation search for "plasma tv," we get the following search results. Notice that this creates the advanced search query "plasma tv site:target.com." Now the searcher gets Target.com specific search results in the organic area, hopefully relevant to the search, but also eight paid listings that Target is now competing with.
Google teleportation search results for "plasma tv" within Target.com.
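For anyone who wants to reproduce the teleportation query by hand, the search box is simply appending a "site:" restriction to the search terms. Here is a minimal sketch; the helper names `site_query` and `search_url` are mine, and the base search URL is an assumption about how Google's results pages are addressed rather than anything documented in this post.

```python
from urllib.parse import urlencode


def site_query(terms, domain):
    """Combine the searcher's terms with a site: restriction."""
    return f"{terms.strip()} site:{domain.strip().lower()}"


def search_url(terms, domain, base="https://www.google.com/search"):
    """Build a results URL for the combined query (base URL is assumed)."""
    return f"{base}?{urlencode({'q': site_query(terms, domain)})}"
```

Calling `site_query("plasma tv", "target.com")` gives the same "plasma tv site:target.com" query described above.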
This isn't all as cut-and-dried as this example may seem. The appearance of ads can vary widely from none to many. But for now it does serve as an example of at least one scenario that site owners need to be aware of.
So what does teleportation mean for the various players? Well hopefully, for the searchers, it does get them to what they are looking for faster and easier, but this can really vary as well and may or may not be more helpful than getting directly to the site.
For Google, it means that searchers will have performed at least one more search on Google, instead of clicking through to Target.com immediately. And it may mean that it has gained an opportunity to serve up more targeted (no pun intended) search ads that otherwise may not have been served up (as we can see from the other Target focused searches which yielded no ads). Even more subtle here is the fact that many advertisers may not have bid against a big brand name to begin with. Currently, advertisers can use a trademarked brand as a trigger word as long as they don't use it in the ad itself. As much of the legislation in this area continues to be formed and reformed, who knows whether this will always be the case--but it would seem that teleportation search may provide an additional means to serve up ads around another brand without even needing the advertiser to use that brand as a trigger word.
But how might Target feel about this? Well, if it does help get searchers to their destination, then it might be happy with this. But it also might mean that its natural results are competing against paid-listings that it may not have been competing against under the other Target related searches. It also means that it may not be able to cull additional search information from its own site-search. While the quality of on-site search may vary from excellent to completely worthless, some sites invest heavily in their on-site search to not only deliver good results, but also to serve as insight into what their visitors are looking for. Being able to follow the search path, which they may be losing because of teleportation, may help improve the site experience.
Needless to say, Target might prefer to get people directly to its site and have people search on-site, which at least in this example allows it to serve up a richer experience.
Target.com on-site search for "plasma tv."
Good, bad or otherwise, what this means to site owners is that SEO may be more important than ever. Now, getting to the top listing may not be enough. Defending your brand may not be enough. Securing multiple listings through blended search may not be enough. What happens to the site that has excellent search, but terrible indexation in Google? Now more than ever, site owners need to focus on creating the most search-friendly site they can to make sure that Google and other search engines can spider and index the site as completely as possible. For some sites, this is a huge challenge, trying to overcome legacy CMS and e-commerce systems. Fortunately, there are solutions like Netconcepts' own GravityStream proxy optimization that can help many sites overcome these obstacles, but GravityStream isn't for everyone.
One thing this clearly means is that site optimization is more important than ever. Optimization will help to make sure that the teleportation results for your site are highly relevant and speak to the searcher, hopefully gaining the click-through from the searcher. If you are like Target and experience millions of searches a year just on your brand name, then you don't want to leave your optimization to chance when it comes to teleportation.
At Netconcepts, we often work with clients who have portfolios of domains. Some of these may be domains from other businesses or sites that have been acquired that are no longer active, while others are typo and brand protection names, and still others may be used for marketing purposes. These portfolios can range from a handful to hundreds or even thousands.
When kicking off work with a new client, one of the things we look at is their portfolio to see which domains are in use, what other sites they have, and which domains are parked or have redirects in place. We want to establish whether any domains are being used inefficiently. If a domain is returning a 404 Not Found and isn't currently in use, then we'd like to redirect it to a more appropriate destination to capture any traffic or link juice that may be going to the old domain.
What is more likely to be the case though is that the domains are just redirected to the main site. So what we are really interested in is how they are redirected. Many times, these domains are set up with 302 Temporary Redirects. While these redirects will still get the traffic and search engine spiders to the right destination, unfortunately these redirects will not pass along any of the PageRank or link popularity.
Once this has been identified, it is a pretty easy thing for the client's IT group to make sure their domain portfolio is working optimally. As you can imagine though, when working with a portfolio with hundreds or thousands of domains, this can be quite a task. There are individual header checkers like Rex Swain's HTTP Viewer (which is great and there is rarely a day that goes by that I don't find myself there) and Firefox add-ons, but that can still be a task with several URLs. There are some bulk checkers, but even those tend to have limits on how many URLs can be checked at a time.
But here is a quick and easy solution, demonstrated with some of CNET's own domains, that anyone can use to check a ton--maybe even two tons--of URLs using Excel and a simple formula and one of my favorite Firefox add-ons, Link Counter (see that link for an earlier post on using Link Counter and download).
Step 1 - copy and paste the URLs to be checked into Excel.
List of URLs in Excel spreadsheet.
Step 2 - if "http://" wasn't already present for the URLs, place it in a cell by itself, separate from the URL list.
Step 3 - write out this simple formula (adjusting your cell references if need be; this version assumes "http://" sits in cell B1 and the URL list starts in A1):
=HYPERLINK(CONCATENATE($B$1,A1),A1)
*if the URL list already includes the "http://" protocol, then the formula is even simpler:
=HYPERLINK(A1,A1)
Hyperlink formula to create live links.
Step 4 - copy that down for your entire list.
Step 5 - go to the "File" menu and select "Web Page Preview."
Web page preview with live links.
Step 6 - when this opens in Firefox, right-click on Link Counter in the browser status bar and select "Check link status."
Server status overlay using Link Counter.
Step 7 - now would be a good time to do some spot checking on some of the URLs, but otherwise, rejoice in the time that has been saved.
This can also be a way to double check whole lists of domains for canonicalization being in place, similar to the examples used here.
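If you would rather script the check than run it through Excel and a browser add-on, the same idea can be sketched in a few lines of Python. This is only a rough alternative under my own assumptions, not the method described above: the function names are made up for illustration, and each URL gets a single HEAD request with redirects deliberately not followed, so you see the first status code the domain returns.

```python
import http.client
from urllib.parse import urlsplit


def fetch_status(url, timeout=10):
    """Send a single HEAD request and do not follow redirects.

    Returns (status_code, Location header or None).
    """
    parts = urlsplit(url if "://" in url else "http://" + url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    try:
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def describe(status):
    """Label a status code for a quick read of the results."""
    if status in (301, 308):
        return "permanent redirect"
    if status in (302, 303, 307):
        return "temporary redirect"
    if status == 404:
        return "not found"
    if 200 <= status < 300:
        return "ok"
    return "other"


def check_all(urls, fetch=fetch_status):
    """Return one (url, status, location, label) row per URL."""
    rows = []
    for url in urls:
        try:
            status, location = fetch(url)
            rows.append((url, status, location, describe(status)))
        except (OSError, http.client.HTTPException) as err:
            rows.append((url, None, None, f"failed: {err}"))
    return rows
```

Pass `check_all` the list of domains from your spreadsheet and look for rows labeled "temporary redirect" or "not found." Some servers answer HEAD requests differently than GET requests, so spot checking a few URLs in a header viewer is still worthwhile.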
Over the past year, there has been a lot of talk about the best way to handle Flash on your site. I previously covered quite a few aspects of this heavily debated topic in Flash Alternatives Blessed by Google and in Progressive Enhancement is Good for SEO. In my previous interview with Maile Ohye, Google's support engineer, I asked her about Google's view on Flash. Maile confirmed that Google looks at the content within "noscript" tags, but she advised mirroring the Flash-based content accurately within the noscript tags, or it will look like cloaking to Googlebot.
In my recent interview with Matt Cutts, Google engineer and head of its webspam team, I asked about the status of Google reading textual content within Flash .swf files. Here's what Matt had to say:
"It is a good question. I think that we do a pretty good job of reading textual content. Now, stuff within Flash is binary and you can define it in terms of characters and strokes - so you can have things that look like normal text - but that are completely weird and are not really normal text. So it can be difficult to pull the text out a Flash file. I think we do pretty well. It used to be the case that we had our own, home-brew code to pull the text out of Flash, but I think that we have moved to the Search Engine SDK tool that Adobe/Macromedia offers. So, my hunch is that most of the search engines will standardize on using that Search Engine SDK tool to pull out the text. The easiest way to know whether you have textual content that can be read in a Flash file, is that you could always use that tool yourself and verify as well."
Not only did Matt suggest that Flash users take advantage of the search engine SDK tool, he also confirmed that Google is hoping to standardize it and work with Adobe to continue updating it.
So there you have it. If you use Flash on your website, you owe it to yourself to use the Search Engine SDK tool to gain insight into how Google "sees" your Flash content. If the Search Engine SDK tool is used by Google, why shouldn't you?
For more great advice courtesy of Matt Cutts, I invite you to either read the transcript of my interview with Google's Matt Cutts at Pubcon or you can listen to the Matt Cutts at Pubcon interview podcast (31 minutes, 3.8 MB).
MSN's Live Search team announced back on August 22 that they would be launching a set of tools for Webmasters. At that time, this was strictly a private, invitation-only beta. Even then, Webmasters and SEO practitioners alike were excited and hopeful, as one of the much-awaited features was the ability to pull up backlink information. MSN had previously turned off the special "link" and "linkdomain" query operators that provided a count of links pointing to a page or entire site, respectively.
The Live Search team is really trying to give everyone something to be thankful for. Karen Blakeman reported in October that Microsoft had restored the link and linkdomain queries, though with the slight modification of leading them off with a "+" sign, like:
+linkdomain:www.cnet.com
With apparently no official announcement from Microsoft, news of this seems to have just now picked up notice after Barry Schwartz reported it on Search Engine Land.
And now the Live Search team has ...
Search engine optimization is one of those ongoing tasks. SEO only has two directions...forward or backward, and the day you stop paying attention to SEO is the day you start moving backwards.
If you are one of the so-called little guys, you may feel overwhelmed with how you can ever compete against the big guys. Well good news, as you'll see, even some of the big guys miss the mark on some of the most basic concepts.
Canonicalization
Simply put, with regard to SEO, we might describe "canonicalization" as identifying and consolidating to one definitive source. The simplest example of this is www.domain.com versus domain.com. In most cases, both of these lead to the same "page," that is, there is no discernible difference in content between the two. This doesn't have to be the case, but let's not worry about that for now.
This is so often overlooked because it generally doesn't present any noticeable issues. After all, your visitors and search engines get to your site with either version. It's important to understand that search engines see pages based on the URL, which means, to them, these are two different URLs, and therefore two different pages--even if the content is 100 percent identical.
On a basic level, this means duplicate content. Search engines have gotten much better about handling duplicate content and will eventually choose one page or the other to serve up. On a more critical level, what this means is that you may be dividing up your link popularity in all the engines and PageRank specifically in Google.
PageRank dilution
When it comes to link popularity and PageRank, you always want to consolidate your efforts. If you don't force all the PageRank you've earned through to one canonical version, you may split that ever important "link juice" between two different URLs. That's because some folks will link to your URL without the www, just out of convenience or laziness. This SEO issue is one of those simple basics that every site should take care of, especially if it ends up being the difference between one of your pages showing up above or below your biggest competitor.
So which do you choose? Whether you go with the "www" version or the "non-www" version isn't really an issue. What is important is that you use a 301 permanent redirect in order to redirect traffic to the version of your choosing. Then you'll consolidate and flow all of the link juice to the canonical version regardless of how others link to your site--rather than diluting or splitting it.
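How you set up the redirect depends on your server, but the decision logic is the same everywhere. Here is a minimal sketch of it, assuming the "www" version was chosen as canonical; `www.example.com` and the function name `canonical_redirect` are placeholders of mine, not settings from any particular server.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.example.com"  # hypothetical choice of canonical host


def canonical_redirect(url, canonical_host=CANONICAL_HOST):
    """Return (301, target_url) when url uses the non-canonical host, else None."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if canonical_host.startswith("www."):
        bare = canonical_host[4:]
    else:
        bare = canonical_host
    # Only the www and non-www variants of this one domain are handled.
    if host == canonical_host or host not in (bare, "www." + bare):
        return None
    target = urlunsplit(
        (parts.scheme, canonical_host, parts.path or "/", parts.query, parts.fragment)
    )
    return 301, target
```

The path and query string are carried over, so a link to a deep page on the non-canonical host lands on the matching page of the canonical one rather than on the home page.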
As you can see by the list below, there are still a lot of big-name sites that haven't addressed canonicalization. Each of these sites can be reached by their www and non-www URLs. Is this negatively impacting their rankings? Maybe, maybe not, but why pass up on the simple stuff?
- levenger.com
- dwr.com
- dresses.com
- footlocker.com
- harborfreight.com
- hickoryfarms.com
- jeffersequine.com
- petco.com
- pfaltzgraff.com
- pier1.com
- qvc.com
- systemax.com
Contrast the above with the following big-name sites that DO properly redirect:
- zappos.com
- llbean.com
- wards.com
- potterybarn.com
- cabelas.com
- and of course... cnet.com
The dreaded "404 File Not Found" error page is one of the great frustrations of the online world. No doubt that it must be a close relative of the "no longer in service" recording for phone numbers.
While good 404 pages are important for visitors already on your site, they are even more important to visitors coming in from other sites or search engines. These visitors may be entirely new to your site and haven't really "invested" in it yet or realized what a great site you have. Without a good 404 page, the back button appears awfully inviting.
404 pages may not seem like a typical SEO topic, but they are very important, and even the best website will end up with a few 404's in its life. What many site owners don't realize is that there is some significance behind the "404." The 404 is actually a specific HTTP status code returned by the server, and the reason it is important is that a proper 404 tells search engines that the page is no longer available and should be removed from the index.
Some content management systems and e-commerce systems allow you to designate a page to use for 404 errors, but the page actually returns a 200 (OK) code instead of the proper 404 error code. The only way to know is to look at the status code in the response headers sent along with the page. You can do this with "View Response Headers" from the "Information" menu if you've installed the Web Developer extension, or with an online header checker like the one from Rex Swain (just enter the URL of your error page).
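The same check can be scripted. The sketch below is an illustration under my own assumptions: the function names are invented, and the probe simply requests a made-up path that should not exist on your site and reports which status code comes back.

```python
import urllib.error
import urllib.request
import uuid


def status_of(url, timeout=10):
    """Return the final HTTP status code for url (redirects are followed)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the code is what we want.
        return err.code


def check_error_page(site, get_status=status_of):
    """Request a path that should not exist and classify the response."""
    probe = site.rstrip("/") + "/no-such-page-" + uuid.uuid4().hex
    status = get_status(probe)
    if status == 404:
        verdict = "proper 404"
    elif 200 <= status < 300:
        verdict = "soft 404"  # error page served with a success code
    else:
        verdict = "other"
    return status, verdict
```

A "soft 404" verdict means the custom error page is being served with a success code, which is exactly the situation described above.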
One of the benefits to the web is that the 404 doesn't have to be a cold, harsh message. The best 404 pages will try to help the visitor by recommending related pages, providing a search form, or at least a link to the home page or sitemap. But along with that, it can also offer a chance to lighten the mood a little.
Below are some you might find amusing, and you can also find a collection of 404 pages at Smashing Magazine.
- Disney
- Homestar Runner 404
- Homestar Runner System Down
- Craigslist
- Happy Cog
- Pen-and-Paper
- Jeremy Fuksa
- Galiacho
- Martin Korner
- NextWave Performance
- Porcupine Colors
- Jason Kottke
I bet you never imagined that you could take something as dry as a 404 page, and not only dress it up and make it fun, but turn it into an opportunity for user generated content (UGC). Which also clearly makes great link bait! Well the folks at Dailymotion did. You can see how they've implemented their 404 and invited others to participate, and spend some time in their 404 gallery. Who knew 404's could be fun and entertaining!
This is an area though that has really been overlooked. Many companies that could really have fun with this have missed the boat, and there are even some that don't even have a basic, custom 404 error page in place.
Oftentimes, SEO is much easier to accomplish within a small company. It's hard to be nimble when working at a behemoth.
When I talk about items like title tags, URL structures, meta descriptions and canonical URLs, it all sounds quite logical and seems prudent to implement, doesn't it? Well, imagine what it must take for a company like IBM--whose myriad divisions and business units span 90 countries and over 30 languages--to make even the slightest SEO enhancement.
It's a big deal.
To find out more about what it's like to chip away at SEO (search engine optimization) and SEM (search engine marketing) within a mega-corporation like IBM, I went to the source. Meet Mike Moran, a "distinguished engineer" who started working on search marketing for IBM back in the early 2000s when less than 1 percent of Big Blue's traffic was coming from search engines. (Now it's over 25 percent!)
Moran is also co-author of Search Engine Marketing, Inc: Driving Search Traffic to Your Company's Website, and author of the new book called Do it Wrong Quickly: How the Web Changes the Old Marketing Rules.
Both Moran and I will be speaking at the American Marketing Association's Hot Topic: Search Engine Marketing conference on Friday in Boston and again November 2 in Chicago.
In a 45-minute podcast interview with Moran, we covered a lot of ground related to SEO for big companies, and also how to work within the internal constraints and corporate politics to get what you need done.
Here are some of the more interesting points that came out of the talk...
- Be an SEO evangelist (not an SEO dictator): If we think about SEO, a lot of things like keyword research and technical fixes such as 301 redirects are very granular things to do. In a larger company, Moran suggests: "There's no way to really make someone the czar of search marketing; it doesn't really work. What you need to do instead is to figure out how to speak with every specialist that you have in the company, and be able to teach them all what the new parts of their job are, and that's a really difficult thing." By teaching your employees good SEO techniques, not only are you reducing the amount of strain on one person, but you're also boosting your return on investment because once people learn what a great title tag looks like, they can continue to write great title tags until SEO "best practices" change.
- Avoid intramural "bidding wars": Moran offers a really powerful example of something that happened at IBM. Different departments were choosing keywords to bid on that were important to what they were doing. So instead of having one department bid on "Linux", for example, several of them were bidding on the same term, effectively competing against themselves. Moran's point is that a company can significantly reduce its costs "just by analyzing the keyword competition that was intramural, and how much money being wasted" versus having one, consolidated page that shows all of the Linux-related products and services.
- "Management by embarrassment": To communicate changes that need to be made, Moran and his team designed a color-coded "scorecard" that highlighted each division within IBM. Specific colors were used to show how well certain areas like title tags, body copy, keywords, etc. were up to benchmarked SEO standards. Predictably, the color red signaled a problem area. Despite executives not necessarily understanding the more granular aspects of what needed to be done, they started to instruct their divisions to "do whatever Moran wants" because they were tired of having their division show up each month with a bunch of red marks. Pretty powerful what can be done by capturing some benchmarks and communicating what needs to be done using the KISS (keep it simple, stupid) philosophy.
For more on how to implement SEO at a mega-corporation, check out the 45-minute podcast and the text synopsis.





