Page Hijack Exploit: 302, redirects and Google

An explanation of the page hijack exploit using 302 server redirects.

302 Exploit: How somebody else's page can appear instead of your page in the search engines.

Abstract:
An explanation of the page hijack exploit using 302 server redirects. This exploit allows any webmaster to have his own "virtual pages" rank for terms that pages belonging to another webmaster used to rank for. Successfully employed, this technique will allow the offending webmaster ("the hijacker") to displace the pages of the "target" in the Search Engine Results Pages ("SERPS"), and hence (a) cause search engine traffic to the target website to vanish, and/or (b) further redirect traffic to any other page of choice.
Published here on March 14, 2005.
Copyright: © 2005 Claus Schmidt, clsc.net
Citations (quotes, not full-text copy) are considered fair use if accompanied by author name and a link to this web site or page.
Bug Track:

2008-02-01: Status: Every now and then this issue and related problems pop up again. Even here, 3 years after I wrote this paper. Please understand that there is nothing I personally can do about it. So, even if it sounds a bit harsh -- and even if you wouldn't mind paying my horrific hourly rates -- just don't bother asking. I am genuinely sorry, but I am simply not able to solve all the server redirect related problems of the world.
2006-09-18: Status: It does not seem like this is a widespread problem with Google any more. Yahoo has no problems with this. MSN status is unclear.
2006-01-04: Status: Google is attempting a new fix, being tested Q1 2006. Status for this: Unknown.
2005-08-26: Status: STILL NOT FIXED. Although the engineers at Google have recently made new attempts to fix it, it can still not be considered a solved problem.
2005-05-08: Status: STILL NOT FIXED. Google only hides the wrong URLs artificially when you search using a special syntax ("site:www.example.com"). The wrong URLs are still in the database and they still come up for regular searches.
2005-04-19: Some information from Google here from message #108 and onwards
2005-04-18: Good news: It seems Google is fixing this issue right now
2005-03-24: Added "A short description" and a new example search
2005-03-17: Added a brief section about the "meta refresh" variant of the exploit.
2005-03-16: Edited some paragraphs and added extra information for clarity, as requested by a few nice Slashdot readers.
2005-03-15: Some minor quick edits, mostly typos.
2005-03-14: I apologize in advance for typos and such - I did not have much time to write all this.

The Google view:
These three pieces are good if you want an opinion from a Google engineer (Matt Cutts). He does not write all that I write below, but it's always nice to hear another perspective:
Url canonicalization,
The inurl operator, and
302 redirects.

Disclaimer

This exploit is published here for one reason only: To make the problem understandable and visible to as many people as possible in order to force action to be taken to prevent further abuse of this exploit. As will be shown below, this action can only be taken by the search engines themselves. Neither clsc.net nor Claus Schmidt will encourage, endorse or justify any use of this exploit. On the contrary, I (as well as the firm) strongly oppose any kind of hijacking.

What is it?

A page hijack is a technique exploiting the way search engines interpret certain commands that a web server can send to a visitor. In essence, it allows a hijacking website to replace pages belonging to target websites in the Search Engine Results Pages ("SERPs").

When a visitor searches for a term (say, foo) a hijacking webmaster can replace the pages that appear for this search with pages that (s)he controls. The new pages that the hijacking webmaster inserts into the search engine are "virtual pages", meaning that they don't exist as real pages. Technically speaking they are "server side scripts" and not pages, so the searcher is taken directly from the search engine listings to a script that the hijacker controls. The hijacked pages appear to the searcher as copies of the target pages, but with a different web address ("URL") from that of the target pages.

Once a hijack has taken place, a malicious hijacker can redirect any visitor that clicks on the target page listing to any other page the hijacker chooses to redirect to. If this redirect is hidden from the search engine spiders, the hijack can be sustained for an indefinite period of time.

Possible abuses include: making "adult" pages appear as, e.g., CNN pages in the search engines, setting up false bank front ends, false storefronts, etc. All the "usual suspects", that is.

A short description

Regarding the Search Engine Result Pages ("the SERPs"), it's not that the listed text (the "snippets") is wrong. The snippets are the right ones, and so are the page size, the headline, the SERP position, and the Google cache. The only thing that can be observed and identified as wrong in the SERPs is the URL used for the individual result.

This is what happens, in basic terms (see "The technical part: How it is done" for the full story). It's a multi-step process with several possible outcomes, sort of like this:

  1. Hijacker manages to get his script listed as the official URL for another webmaster's page.
  2. To Googlebot the script points to the other webmaster's page from now on.
  3. Searchers will see the right results in the SERPs, but the wrong URL will be on the hijacked listing.
  4. Depending on the number of successful hijacks (or some other measure of "severity" only known to Google) the search engine traffic to the other webmaster dries up and disappears, because all his pages (not just the hijacked one(s)) are now "contaminated" and no longer show up for relevant searches.
  5. Optional: The hijacker can choose to redirect the traffic from SERPs to other places for any other visitor than Googlebot.
  6. The offended webmaster can do nothing about this as long as the redirect script(s) point Googlebot to the page(s) of the offended webmaster (and Google has the script URL(s) indexed).

While step five is optional, the other steps are not. Although optional, it does indeed happen, and it is the worst case, as it can send searchers in good faith to misleading or even dangerous pages.

Step five is not the only damaging case, though: hijacking (in the sense of "hijacking the URL of another web page in the SERPs") is damaging in the other cases as well. Not all of them will be damaging to the searcher, and not all of them will be damaging to all webmasters, but all are part of this hijacking issue. The hijack is established in step one above, regardless of the later outcome.

This whole chain of events can be executed either by using a 302 redirect, a meta refresh with a zero second redirect time, or by using both in combination.
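
To make this concrete, here is a minimal sketch in PHP of the kind of everyday redirect script involved. The script name and parameter are hypothetical; the point is simply that PHP's header() call answers with a "302 Found" status by default when it is given nothing but a Location field:

<?php
// go.php (hypothetical name) - a minimal click tracker / redirect script.
// It sends the visitor on to whatever address is given in the "url" parameter.
// A real script should of course validate $target before redirecting.
$target = $_GET['url'];

// header() with a Location field answers with "302 Found" by default -
// exactly the response code this article is about.
header("Location: " . $target);
exit;
?>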

Which engines are vulnerable?

Search engines vulnerable to this exploit have been reported to include Google and MSN Search, probably others as well. The Yahoo! search engine is at the time of writing the only major one which has managed to close the hole.

Below, the emphasis will be on Google, as it is by far the largest search engine today in terms of usage - and allegedly also in terms of the number of pages indexed.

Is it deliberate wrong-doing?

I should stress that I am not a lawyer. Further, the search engines affected by this operate on a worldwide scale, and laws tend to differ a lot among countries, especially regarding the Internet.

That said, the answer is: Most likely not. This is a flaw on the technical side of the search engines. Some webmasters do of course exploit this flaw, but almost none of the cases I've seen are deliberate attempts at hijacking. The hijacker and the target are equally innocent, as this is something that happens "internally" in the search engines, and in almost all cases the hijacker does not even know that (s)he is hijacking another page.

It is important to stress that this is a search engine flaw. It affects innocent and unknowing webmasters as they go about their normal routines, maintaining their pages and links as usual. You do not have to take steps that are in any way outside the "normal" or "default" in order to become hijacked or to hijack others. On the contrary, page hijacks are accomplished using everyday standard procedures and techniques used by most webmasters.

What does it look like?

The Search Engine Results Pages ("SERPs") will look just like normal results to the searcher when a page hijack has occurred. On the other hand, to a webmaster who knows where one of his pages used to be listed, it will look a little different. The webmaster will be able to identify it because (s)he will see his/her page listed with a URL that does not belong to the site. The URL is the part in green text under listings in Google.

Example (anonymous)

This example is provided purely as an illustration. I am not implying anything whatsoever about intent, as I specifically state that in most cases this is 100% unintentional and totally unknown to the hijacker, who becomes one only by accident. It is an error that resides within the search engines, and it is the sole fault of the search engines - not any other entity, be it webmasters, individuals, or companies of any kind. So, I have no reason to believe that what you see here is intentional, and I am in fact suggesting that the implied parties are both totally innocent.

Google search: "BBC News"

Anonymous example from Google SERPs:

BBC NEWS | UK | 'Siesta syndrome' costs UK firms
Healthier food and regular breaks are urged in an effort to stop Britain's
workplace "siesta syndrome".
r.example.tld/foo/rAndoMLettERS - 31k - Cached - Similar pages

Real URL for above page: news.bbc.co.uk/1/hi/uk/4240513.stm

By comparing the green URL with the real URL for the page you will see that they are not the same. The listing, the position in the SERPs, the excerpt from the page ("the snippet"), the headline, the cached result, as well as the document size are those of the real page. The only thing that does not belong to the real page is the URL, which is written in green text, and also linked from the headline.

NEW: This search will reveal more examples when you know what to look for:

Google search: "BBC News | UK |"
Do this: Scroll down and look for listings that look exactly like the real BBC listings, i.e. listings with a headline like this:
BBC News | subject | headline

Check that these listings do not have a BBC URL. Usually the redirect URL will have a question mark in it as well.

It is important to note that the green URL that is listed (as well as the headline link) does not go to a real page. Instead, the link goes straight to a script not controlled by the target page. So, the searcher (thinking (s)he has found relevant information) is sent directly from the search results to a script that is already in place. This script just needs a slight modification to send the searcher (any User-Agent that is not "Googlebot") in any direction the hijacker chooses. Including, but not limited to, all kinds of spoofed or malicious pages.

(In the example above - if you manage to identify the real page in spite of attempts to keep it anonymous - the searcher will end up at the right page with the BBC, exactly as expected (and on the right URL as well). So, in that case there is clearly no malicious intent whatsoever, and nothing suspicious going on).

Who can control your pages in the search engines?

This is the essence of it all. In the example above, clearly the BBC controls whatever is displayed on the domain "news.bbc.co.uk", but the BBC normally does not control what is displayed on domains that it does not own. So, a mischievous webmaster controlling the "wrong URL" is free to redirect visitors to any URL of his liking once the hijack has taken place. The searcher clicking on the hijacked result (thinking that (s)he will obtain a news story on UK firms) might in fact end up with all kinds of completely unrelated "information" and/or offers instead.

As a side-effect, target domains can have so many pages hijacked that the whole domain starts to be flagged as "less valuable" in the search engine. This leads to domain poisoning, whereby all pages on the target domain slip into Google's "supplemental listings" and search engine traffic to the whole domain dries up and vanishes.

And here's the intriguing part: The target (the "hijacked webmaster") has absolutely no methods available to stop this once it has taken place. That's right. Once hijacked, you can not get your pages back. There are no known methods that will work.

The only certain way to get back your pages at this moment seems to be if the hijacker is kind enough to edit his/her script so that it returns a "404 Not Found" status code, and then proceeds to request removal of the script URL from Google. Note that this has to be done for each and every hijack script that points to the target page, and there can be many of them. Even locating these can be very difficult for an experienced searcher, so it's close to impossible for the average webmaster.

The technical part: How it is done

Here is the full recipe with every step outlined. It's extremely simplified to benefit non-tech readers, and hence not 100% accurate in the finer details, but even though I really have tried to keep it simple you may want to read it twice:
  1. Googlebot (the "web spider" that Google uses to harvest pages) visits a page with a redirect script. In this example it is a link that redirects to another page using a click tracker script, but it need not be so. That page is the "hijacking" page, or "offending" page.
  2. This click tracker script issues a server response code "302 Found" when the link is clicked. This response code is the important part; it does not need to be caused by a click tracker script. Most webmaster tools use this response code by default, as it is standard in both ASP and PHP.
  3. Googlebot indexes the content and makes a list of the links on the hijacker page (including one or more links that are really a redirect script)
  4. All the links on the hijacker page are sent to a database for storage until another Googlebot is ready to spider them. At this point the connection breaks between your site and the hijacker page, so you (as webmaster) can do nothing about the following:
  5. Some other Googlebot tries one of these links - this one happens to be the redirect script (Google has thousands of spiders, all are called "Googlebot")
  6. It receives a "302 Found" status code and goes "yummy, here's a nice new page for me"
  7. It then receives a "Location: www.your-domain.tld" header and hurries to your page to get the content.
  8. It heads straight to your page without telling your server on what page it found the link it used to get there (as, obviously, it doesn't know - another Googlebot fetched it)
  9. It has the URL of the redirect script (which is the link it was given, not the page that link was on), so now it indexes your content as belonging to that URL.
  10. It deliberately chooses to keep the redirect URL, as the redirect script has just told it that the new location (That is: The target URL, or your web page) is just a temporary location for the content. That's what 302 means: Temporary location for content.
  11. Bingo, a brand new page is created (never mind that it does not exist IRL, to Googlebot it does)
  12. Some other Googlebot finds your page at your right URL and indexes it.
  13. When both pages arrive at the reception of the "index" they are spotted by the "duplicate filter" as it is discovered that they are identical.
  14. The "duplicate filter" doesn't know that one of these pages is not a page but just a link (to a script). It has two URLs and identical content, so this is a piece of cake: Let the best page win. The other disappears.
  15. Optional: For mischievous webmasters only: For any other visitor than "Googlebot", make the redirect script point to any other page free of choice (a sketch of this is shown right after this list).
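
To illustrate the optional step 15, the same kind of redirect script only needs a small user-agent check to behave differently for Googlebot than for everyone else. This is a sketch of the behaviour described above, with placeholder domain names; it is not a recommendation:

<?php
// Sketch of step 15: show Googlebot the target page, send everyone else
// somewhere entirely different. All domain names are placeholders.
$user_agent   = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$is_googlebot = (stripos($user_agent, 'Googlebot') !== false);

if ($is_googlebot) {
    // Googlebot keeps receiving a "302 Found" pointing at the target page,
    // so the hijacked listing stays in the index.
    header("Location: http://www.target-website.com/folder/file.html");
} else {
    // Any human searcher clicking the hijacked listing ends up here instead.
    header("Location: http://www.some-other-site.tld/anything.html");
}
exit;
?>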

Added: There are many theories about how the last two steps (13-14) might work. One is the duplicate theory - another would be that the number of redirects declaring the page a "temporary" location comes to outweigh the links declaring it "permanent". This one does not explain which URL will win, however. There are other theories, even quite obscure ones - all seem to have problems the duplicate theory does not have. The duplicate theory is the most consistent, rational, and straightforward one I've seen so far, but only the Google engineers know the exact way this works.

Here, "best page" is key. Sometimes the target page will win; sometimes the redirect script will win. Specifically, if the PageRank (an internal Google "page popularity measure") of the target page is lower that the PageRank of the hijacking page, it's most likely that the target page will drop out of the SERPs.

However, examples of high PR pages being hijacked by script links from low PR pages have been observed as well. So, sometimes PR is not critical in order to make a hijack. One might even argue that (as the way Google works is fully automatic) if it is so "sometimes" then it has to be so "all the time". This implies that the examples we see of high PR pages hijacking low PR pages are just a coincidence; PR is not the reason the hijack link wins. This, in turn, means that any page is able to hijack any other page, if the target page is not sufficiently protected (see below).
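
Purely as an illustration of the "duplicate theory", here is what the choice could amount to in its simplest possible form. This is a guess at the logic, not Google's code, and as just noted PageRank may well not be the actual tie-breaker; the URLs and values below are made up:

<?php
// Illustrative sketch only: two URLs turn out to hold identical content,
// one survives, the other disappears from the SERPs. The tie-breaker shown
// here (highest PageRank wins) is an assumption, not a known fact.
function pick_surviving_url(array $candidates) {
    // $candidates: URL => PageRank, all URLs holding identical content
    arsort($candidates);          // highest value first
    reset($candidates);
    return key($candidates);      // this URL stays; every other one drops out,
                                  // real page or redirect script alike
}

// Hypothetical example with made-up PageRank values:
echo pick_surviving_url(array(
    "http://hijacker.example/go.php?url=..." => 5.2,
    "http://target.example/page.html"        => 4.8,
));
?>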

So, essentially, by doing the right thing (interpreting a 302 as per the RFC), the search engine (in the example, Google) allows another webmaster to convince its web page spider that your website is nothing but a temporary holding place for content.

Further, this leads to creation of pages in the search engine index that are not real pages. And, if you are the target, you can do nothing about it.

302 and meta refresh - both methods can be used

The method involving a 302 redirect is not the only one that can be used to perform a malicious hijack. Another, just as common, webmaster tool is also able to hijack a page in the search engine results: the "meta refresh". This is done by inserting the following piece of code in a standard static HTML page:
<meta http-equiv="refresh" content="0;url=http://www.target-website.com/folder/file.html">

The effect of this is exactly as with the 302. To be sure, some hijackers have been observed to employ both a 302 redirect and a meta redirect in the 302 response generated by the Apache server. This is not the default Apache setting, as normally the 302 response will include a standard hyperlink in the HTML part of the response (as specified in the RFC).

The casual reader might think "a standard HTML page can't be that dangerous", but that's a false assumption. A server can be configured to treat any kind of file as a script, even if it has a ".html" extension. So, this method has the exact same possibilities for abuse, it's only a little bit more sophisticated.
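
For example, on a typical Apache/PHP setup the owner of the page can map the ".html" extension to the PHP handler (with a directive along the lines of "AddType application/x-httpd-php .html"; the exact directive depends on the server configuration). After that, a perfectly ordinary-looking static page is in fact a script. A sketch, with a hypothetical file name and a placeholder destination:

<?php
// Hypothetical file "page.html". With the server configured to execute .html
// files as PHP, this "static" page can decide at request time what to put in
// the meta refresh - or send any header it likes - just like the go.php sketch.
$destination = "http://www.target-website.com/folder/file.html"; // placeholder
?>
<html>
<head>
<meta http-equiv="refresh" content="0;url=<?php echo $destination; ?>">
</head>
<body></body>
</html>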

What you can - and can not - do about it

Note the last sentence of item #4 in the list above. At a very early stage the connection between your page and the hijacking page simply breaks. This means that you can not put a script on your page that detects whether this is taking place. You can not "tell Googlebot" that your URL is the right URL for your page either.

Here are some common misconceptions. The first thoughts of technically skilled webmasters will be along the lines of "banning" something, i.e. detecting the hijack by means of some kind of script and then performing some kind of action. Let's clear up the misunderstandings first:

You can't ban 302 referrers as such
Why? Because your server will never know that a 302 is used for reaching it. This information is never passed to your server, so you can't instruct your server to react to it.

You can't ban a "go.php?someURL" redirect script
Why? Because your server will never know that a "go.php?someURL" redirect script is used for reaching it. This information is never passed to your server, so you can't instruct your server to react to it.

Even if you could, it would have no effect with Google
Why? Because Googlebot does not carry a referrer with it when it spiders, so you don't know where it's been before it visited you. As already mentioned, Googlebot could have seen a link to your page a lot of places, so it can't "just pick one". Visits by Googlebot have no referrers, so you can't tell Googlebot that one link that points to your site is good while another is bad.

You CAN ban click through from the page holding the 302 script - but it's no good
Yes you can - but this will only hit legitimate traffic, meaning that surfers clicking from the redirect URL will not be able to view your page. It also means that you will have to maintain an ever-increasing list of individual pages linking to your site. For Googlebot (and any other SE spider) those links will still work, as they pass on no referrer. So, if you do this Googlebot will never know it.

You CAN request removal of URLs from Google's index in some cases
This is definitely not for the faint of heart. I do not recommend this; I only note that some webmasters seem to have had success with it. If you feel it's not for you, then don't do it. The point here is that you as webmaster could try to get the redirect script deleted from Google.

Google does accept requests for removal, as long as the page you wish to remove has one of these three properties:

  1. The URL returns a "404 Not Found" status code, or
  2. the URL is excluded by the robots.txt file of the site it is on, or
  3. the page at the URL carries a "noindex" robots meta tag.

Only the first can be influenced by webmasters who do not control the redirect script, and the way to do it will not be appealing to all. Simply put, you have to make sure that the target page returns a 404, which means that the target page must be unavailable (with sufficient technical skills you can do this so that it only returns a 404 if there is no referrer). Then you have to request removal of the redirect script URL, i.e. not the URL of the target page. Use extreme caution: if you request removal of the target page itself while it returns a 404 error, then your own page will be removed from Google's index. You don't want to remove your own page, only the redirect script.
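
As a sketch of the "404 only if there is no referrer" trick: assuming the target page is served by PHP, a few lines at the top of it are enough. Googlebot sends no referrer, so it will see the 404 required for the removal request, while visitors arriving through links still get the page. Bear in mind that people who type the address in directly also send no referrer, so this is a crude, temporary measure rather than a clean solution:

<?php
// Temporary measure while the removal request is being processed.
// Requests without a referrer (Googlebot, but also anyone typing the
// address directly) get a "404 Not Found"; everyone else gets the page.
if (empty($_SERVER['HTTP_REFERER'])) {
    header("HTTP/1.0 404 Not Found");
    exit;
}
?>
<html>
<body>
... the normal content of the target page goes here ...
</body>
</html>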

After the request is submitted, Google will spider the URL to check whether the requirements are met. When Googlebot has seen your page via the redirect script and has received a 404 error, you can put your page back up.

Precautions against being hijacked

I have tracked this and related problems with the search engines literally for years. If there were something that you could easily do to fix it as a webmaster, I would have published it a long time ago. That said, the points listed below will most likely make your pages harder to hijack. I can not and will not promise immunity, though, and I specifically don't want to spread false hopes by promising that these will help you once a hijack has already taken place. On the other hand, once hijacked you will lose nothing by trying them.
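
One commonly mentioned precaution of this kind is to make the site answer on exactly one canonical host name and send a permanent "301 Moved Permanently" from every other variant (for example the non-www form) to it; to the extent the duplicate theory above is right, leaving the search engine with one strong, unambiguous URL per page should not hurt. A sketch in PHP, with a hypothetical host name, placed at the top of every page:

<?php
// Canonical host precaution (the host name is a placeholder). A real setup
// should compare host names more carefully (ports, subdomains) and may prefer
// to do this in the server configuration instead of in every page.
$canonical_host = "www.example.com";

if ($_SERVER['HTTP_HOST'] !== $canonical_host) {
    header("Location: http://" . $canonical_host . $_SERVER['REQUEST_URI'], true, 301);
    exit;
}
?>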

Precautions against becoming a hijacker

Of course you don't want to become a page hijacker by accident. The precautions follow directly from the analysis above: keep search engine spiders away from your own redirect scripts (for example with a robots.txt exclusion), and make redirect scripts that point to pages on other people's domains answer with a permanent "301 Moved Permanently" rather than the default "302 Found".
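
A sketch of the second point, reusing the hypothetical go.php click tracker from earlier: send a 301 whenever the destination is not on your own domain, so that the content is attributed to the destination URL rather than to your script.

<?php
// go.php (hypothetical) - a click tracker that avoids hijacking by accident:
// off-site destinations get a "301 Moved Permanently", same-site destinations
// keep the default "302 Found". A real script should compare host names more
// carefully (ports, subdomains) and validate the destination.
$target      = $_GET['url'];
$target_host = parse_url($target, PHP_URL_HOST);

if ($target_host !== null && $target_host !== $_SERVER['HTTP_HOST']) {
    header("Location: " . $target, true, 301);   // off-site: permanent redirect
} else {
    header("Location: " . $target);              // same site: the default 302 is fine
}
exit;
?>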

Recommended fix

This can not and should not be fixed by webmasters. It is an error that is generated by the search engines, it is only found within the search engines, and hence it must be fixed by the search engines.

The fix I personally recommend is simple: treat cross-domain 302 redirects differently than same-domain 302 redirects. Specifically, treat same-domain 302 redirects exactly as per the RFC, but treat cross-domain 302 redirects just like a normal link.

Meta redirects and other types of redirects should of course be treated the same way: according to the RFC only when the redirect stays within one domain - when it goes across domains it must be treated like a simple link.
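
Expressed as a sketch (this is only an illustration of the rule, not a description of how any search engine is actually implemented):

<?php
// The recommended rule, for a crawler that has just received a redirect
// (302, zero-second meta refresh, etc.) from $from_url to $to_url.
function how_to_treat_redirect($from_url, $to_url) {
    $from_host = parse_url($from_url, PHP_URL_HOST);
    $to_host   = parse_url($to_url,   PHP_URL_HOST);

    if ($from_host === $to_host) {
        // Same domain: honour the redirect exactly as the RFC describes.
        return "follow as per the RFC";
    }
    // Different domains: do not create a "virtual page" under $from_url;
    // simply treat it as an ordinary link from $from_url to $to_url.
    return "treat as a normal link";
}

echo how_to_treat_redirect("http://hijacker.example/go.php?url=...",
                           "http://target.example/page.html");
?>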

Added: A Slashdot reader made me aware of this:

RFC 2119 (Key words for use in RFCs to Indicate Requirement Levels) defines "SHOULD" as follows:

3. SHOULD This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.
So, if a search engine has a valid reason not to do as the RFC says it SHOULD, it will actually be conforming to the same RFC by not doing it.

You can help

For this to happen, we need to put some pressure on the search engines. What I did not tell you above is that this problem has been around for years. Literally (see, e.g. bottom of page here). The search engines have failed to take it seriously and hence their results pages are now filled with these wrong listings. It is not hard to find examples like the one I mentioned above.

You can help in this process by putting pressure on the search engines, e.g. by writing about the issue on your web page, in forums, or in your blog. Feel free to link to this page for the full story, but it's not required in any way unless you quote from it.

A small part of this article was originally posted by the author at 4:30 pm on Mar 9, 2005 (UTC +1) here
See specifically posts #54, #218, and #279.

Document URL: http://clsc.net/articles/google-302-page-hijack.php