An explanation of the page hijack exploit using 302 server redirects.
302 Exploit: How somebody else's page can appear instead of your page in the search engines.
2008-02-01: Status: Every now and then this issue and related problems pop up again, even here, three years after I wrote this paper. Please understand that there is nothing I personally can do about it. So, even if it sounds a bit harsh -- and even if you wouldn't mind paying my horrific hourly rates -- please don't bother asking. I am genuinely sorry, but I am simply not able to solve all the server redirect related problems of the world.
2006-09-18: Status: It does not seem like this is a widespread
problem with Google any more. Yahoo has no problems with this. MSN
status is unclear.
2006-01-04: Status: Google is attempting a new fix, being tested Q1 2006. Status for this: Unknown.
2005-08-26: Status: STILL NOT FIXED. Although the engineers at
Google have recently made new attempts to fix it, it can still not be
considered a solved problem.
2005-05-08: Status: STILL NOT FIXED. Google only hides the wrong
URLs artificially when you search using a special syntax
("site:www.example.com"). The wrong URLs are still in the database and
they still come up for regular searches.
2005-04-19: Some information from Google here, from message #108 onwards
2005-04-18: Good news: It seems Google is fixing this issue right now
2005-03-24: Added "A short description" and a new example search
2005-03-17: Added a brief section about the "meta refresh" variant of the exploit.
2005-03-16: Edited some paragraphs and added extra information for clarity, as requested by a few nice Slashdot readers.
2005-03-15: Some minor quick edits, mostly typos.
2005-03-14: I apologize in advance for typos and such - I did not have much time to write all this.
When a visitor searches for a term (say, foo), a hijacking webmaster can replace the pages that appear for this search with pages that (s)he controls. The new pages that the hijacking webmaster inserts into the search engine are "virtual pages", meaning that they don't exist as real pages. Technically speaking they are "server side scripts" and not pages, so the searcher is taken directly from the search engine listings to a script that the hijacker controls. The hijacked pages appear to the searcher as copies of the target pages, but with a different web address ("URL") than the target pages.
Once a hijack has taken place, a malicious hijacker can redirect any visitor that clicks on the target page listing to any other page the hijacker chooses to redirect to. If this redirect is hidden from the search engine spiders, the hijack can be sustained for an indefinite period of time.
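To make this concrete, here is a minimal sketch of the kind of redirect script involved. It is my own illustration, not code taken from any observed hijack; the target URL and the port number are hypothetical.

# Minimal sketch of a "virtual page": the URL maps to a script, not a document.
# The script answers every request with a 302 pointing at the target page.
# All names, URLs and the port number are hypothetical.
from http.server import BaseHTTPRequestHandler, HTTPServer

TARGET = "http://www.example.com/real-page.html"   # the page being "borrowed"

class RedirectScript(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(302)                     # "Found", i.e. a temporary redirect
        self.send_header("Location", TARGET)
        self.end_headers()
        # Changing TARGET later re-points every visitor -- and every search
        # listing that carries this script's URL -- somewhere else entirely.

if __name__ == "__main__":
    HTTPServer(("", 8080), RedirectScript).serve_forever()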
Possible abuses include: making "adult" pages appear as, e.g., CNN pages in the search engines, setting up false bank frontends, false storefronts, etc. All the "usual suspects", that is.
This is what happens, in basic terms (see "The technical part: How it is done" for the full story). It's a multi-step process with several possible outcomes, sort of like this:
While step five is optional, the other steps are not. Although it is optional it does indeed happen, and this is the worst case, as it can direct searchers in good faith to misleading, or even dangerous, pages.
Step five is not the only case, as hijacking (as defined by "hijacking the URL of another web page in the SERPs") is damaging in the other cases as well. Not all of them will be damaging to the searcher, and not all of them will be damaging to all webmasters, but all are part of this hijacking issue. The hijack is established in step one above, regardless of the later outcome.
This whole chain of events can be executed either by using a 302 redirect, a meta refresh with a zero second redirect time, or by using both in combination.
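You can see for yourself which mechanism a given redirect URL uses by requesting it without following redirects and inspecting the raw response. A small sketch; the host and path below are placeholders, not a real hijack script:

# Request a URL without following redirects, to see exactly what a search
# engine spider is given: a 302 status plus a Location header (and possibly a
# meta refresh in the body). Host and path are placeholders.
import http.client

conn = http.client.HTTPConnection("r.example.tld")   # http.client never follows redirects
conn.request("GET", "/foo/rAndoMLettERS")
resp = conn.getresponse()
print(resp.status, resp.reason)          # e.g. "302 Found"
print(resp.getheader("Location"))        # the URL the spider is pointed at
body = resp.read(2048).decode("latin-1", "replace")
if "http-equiv" in body.lower():         # crude check for the meta refresh variant
    print("the response body also contains a meta refresh")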
Below, the emphasis will be on Google, as it is by far the largest search engine today in terms of usage -- and allegedly also in terms of the number of pages indexed.
That said, the answer is: Most likely not. This is a flaw on the technical side of the search engines. Some webmasters do of course exploit this flaw, but almost all the cases I've seen are not deliberate attempts at hijacking. The hijacker and the target are equally innocent, as this is something that happens "internally" in the search engines, and in almost all cases the hijacker does not even know that (s)he is hijacking another page.
It is important to stress that this is a search engine flaw. It affects innocent and unknowing webmasters as they go about their normal routines, maintaining their pages and links as usual. You do not have to take steps that are in any way outside the "normal" or "default" in order to either become hijacked or to hijack others. On the contrary, page hijacks are accomplished using everyday standard procedures and techniques used by most webmasters.
Google search: "BBC News"
Anonymous example from Google SERPs:
BBC NEWS | UK | 'Siesta syndrome' costs UK firms
Healthier food and regular breaks are urged in an effort to stop Britain's
workplace "siesta syndrome".
r.example.tld/foo/rAndoMLettERS - 31k - Cached - Similar pages
Real URL for the above page: news.bbc.co.uk/1/hi/uk/4240513.stm
By comparing the green URL with the real URL for the page you will see that they are not the same. The listing, the position in the SERPs, the excerpt from the page ("the snippet"), the headline, the cached result, as well as the document size are those of the real page. The only thing that does not belong to the real page is the URL, which is written in green text, and also linked from the headline.
NEW: This search will reveal more examples when you know what to look for:
Google search: "BBC News | UK |"
Do this: Scroll down and look for listings that look exactly like the real BBC listings, i.e. listings with a headline like this:
BBC News | subject | headline
Check that these listings do not have a BBC URL. Usually the redirect URL will have a question mark in it as well.
It is important to note that the green URL that is listed (as well as the headline link) does not go to a real page. Instead, the link goes straight to a script not controlled by the target page. So, the searcher (thinking (s)he has found relevant information) is sent directly from the search results to a script that is already in place. This script just needs a slight modification to send the searcher (any User-Agent that is not "Googlebot") in any direction the hijacker chooses. Including, but not limited to, all kinds of spoofed or malicious pages.
(In the example above - if you manage to identify the real page in spite of attempts to keep it anonymous - the searcher will end up at the right page with the BBC, exactly as expected (and on the right URL as well). So, in that case there is clearly no malicious intent whatsoever, and nothing suspicious going on).
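The "slight modification" mentioned above can be as small as a branch on the User-Agent header. Here is a sketch, with hypothetical URLs of my own, of what that decision could look like inside such a script:

# Sketch of the "slight modification": the redirect destination now depends on
# who is asking. Spiders keep seeing the harmless redirect to the target page,
# so the hijacked listing stays in the index; human searchers go elsewhere.
# Both URLs are hypothetical.
def pick_destination(user_agent):
    target = "http://www.example.com/real-page.html"        # what Googlebot keeps seeing
    elsewhere = "http://www.example.net/spoofed-page.html"   # what everyone else gets
    if "Googlebot" in user_agent:
        return target
    return elsewhere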
As a side-effect, target domains can have so many pages hijacked that the whole domain starts to be flagged as "less valuable" in the search engine. This leads to domain poisoning, whereby all pages on the target domain slip into Google's "supplemental listings" and search engine traffic to the whole domain dries up and vanishes.
And here's the intriguing part: The target (the "hijacked webmaster") has absolutely no methods available to stop this once it has taken place. That's right. Once hijacked, you cannot get your pages back. There are no known methods that will work.
The only certain way to get your pages back at this moment seems to be if the hijacker is kind enough to edit his/her script so that it returns a "404 Not Found" status code, and then proceeds to request removal of the script URL from Google. Note that this has to be done for each and every hijack script that points to the target page, and there can be many of them. Even locating these can be very difficult for an experienced searcher, so it's close to impossible for the average webmaster.
Added: There are many theories about how the last two steps (13-14) might work. One is the duplicate theory; another is that the mass of redirects declaring the page "temporary" outweighs the links declaring the page "permanent". The latter does not explain which URL will win, however. There are other theories, even quite obscure ones, and all seem to have problems the duplicate theory does not have. The duplicate theory is the most consistent, rational, and straightforward one I've seen so far, but only the Google engineers know exactly how this works.
Here, "best page" is key. Sometimes the target page will win; sometimes the redirect script will win. Specifically, if the PageRank (an internal Google "page popularity measure") of the target page is lower that the PageRank of the hijacking page, it's most likely that the target page will drop out of the SERPs.
However, examples of high PR pages being hijacked by script links from low PR pages have been observed as well. So, sometimes PR is not critical in order to make a hijack. One might even argue that -- as the way Google works is fully automatic -- if it is so "sometimes" then it has to be so "all the time". This implies that the examples we see of high PR pages hijacking low PR pages is just a co-occurrence, PR is not the reason the hijack link wins. This, in turn, means that any page is able to hijack any other page, if the target page is not sufficiently protected (see below).
So, essentially, by doing the right thing (interpreting a 302 as per the RFC), the search engine (in the example, Google) allows another webmaster to convince its web page spider that your website is nothing but a temporary holding place for content.
Further, this leads to creation of pages in the search engine index that are not real pages. And, if you are the target, you can do nothing about it.
<meta http-equiv="refresh" content="0;url=http://www.target-website.com/folder/file.html">
The effect of this is exactly the same as with the 302. To be sure, some hijackers have been observed to employ both at once: a 302 redirect whose response body also contains a meta refresh. This is not the default Apache behaviour, as normally the body of a 302 response will include nothing but a standard hyperlink to the new location (as specified in the RFC).
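As a sketch of that combined variant (my own illustration, with a hypothetical target URL): a script that sends the 302 status and Location header, but puts a zero-second meta refresh in the response body where the plain hyperlink would normally be.

# Sketch of the combined variant: an HTTP 302 response whose HTML body carries
# a zero-second meta refresh instead of the standard hyperlink. The target URL
# is hypothetical; this illustrates the mechanism, nothing more.
from http.server import BaseHTTPRequestHandler

TARGET = "http://www.target-website.com/folder/file.html"

class CombinedRedirect(BaseHTTPRequestHandler):
    def do_GET(self):
        body = ('<html><head><meta http-equiv="refresh" '
                'content="0;url=%s"></head><body></body></html>' % TARGET).encode("ascii")
        self.send_response(302)                    # HTTP-level redirect ...
        self.send_header("Location", TARGET)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)                     # ... plus the meta refresh body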
The casual reader might think "a standard HTML page can't be that dangerous", but that's a false assumption. A server can be configured to treat any kind of file as a script, even if it has a ".html" extension. So, this method has the exact same possibilities for abuse, it's only a little bit more sophisticated.
Here are some common misconceptions. The first thoughts of technically skilled webmasters will be along the lines of "banning" something, i.e. detecting the hijack by means of some kind of script and then performing some kind of action. Let's clear up the misunderstandings first:
You can't ban 302 referrers as such
Why? Because your server will never know that a 302 is used for
reaching it. This information is never passed to your server, so you
can't instruct your server to react to it.
You can't ban a "go.php?someURL" redirect script
Why? Because your server will never know that a "go.php?someURL"
redirect script is used for reaching it. This information is never
passed to your server, so you can't instruct your server to react to it.
Even if you could, it would have no effect with Google
Why? Because Googlebot does not carry a referrer with it when it spiders, so you don't know where it has been before it visited you. As already mentioned, Googlebot could have seen a link to your page in a lot of places, so it can't "just pick one". Visits by Googlebot have no referrers, so you can't tell Googlebot that one link pointing to your site is good while another is bad.
You CAN ban click through from the page holding the 302 script - but it's no good
Yes you can - but this will only hit legitimate traffic, meaning
that surfers clicking from the redirect URL will not be able to view
your page. It also means that you will have to maintain an
ever-increasing list of individual pages linking to your site. For
Googlebot (and any other SE spider) those links will still work, as they
pass on no referrer. So, if you do this Googlebot will never know it.
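For completeness only (and not as a recommendation), here is a sketch of what such referrer-based blocking amounts to; the blocked host is hypothetical:

# Sketch of referrer-based blocking (not recommended; shown only to make the
# point concrete): requests arriving with a Referer from a known redirect URL
# are turned away. Spiders send no Referer at all, so they pass this check
# untouched. The blocked host below is hypothetical.
from urllib.parse import urlsplit

BLOCKED_REFERRERS = {"r.example.tld"}   # the ever-growing list you would have to maintain

def allow_request(referer_header):
    if not referer_header:
        return True                      # Googlebot and friends: no referrer, no block
    return urlsplit(referer_header).hostname not in BLOCKED_REFERRERS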
You CAN request removal of URLs from Google's index in some cases
This is definitely not for the faint of heart. I do not recommend it; I only note that some webmasters seem to have had success with it. If you feel it's not for you, then don't do it. The point here is that you, as webmaster, could try to get the redirect script deleted from Google.
Google does accept requests for removal, as long as the page you wish to remove has one of these three properties:
Only the first can be influenced by webmasters that do not control the redirect script, and the way to do it will not be appealing to all. Simply, you have to make sure that the target page returns a 404, which means that the target page must be unavailable (with sufficient technical skills you can do this so that it only returns a 404 if there is no referrer). Then you have to request removal of the redirect script URL, i.e. not the URL of the target page. Use extreme caution: If you request that the target page should be removed while it returns a 404 error, then it will be removed from Google's index. You don't want to remove your own page, only the redirect script.
After the request is submitted, Google will spider the URL to check whether the requirements are met. Once Googlebot has seen your pages via the redirect script and has gotten a 404 error, you can put your page back up.
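Here is a sketch of the "404 only when there is no referrer" variant mentioned above. It is my own illustration of the idea, not a recipe, and every name in it is hypothetical:

# Sketch of the "404 only if there is no referrer" trick: while the removal
# request is being processed, the target page answers 404 to requests without
# a Referer header (Googlebot sends none), but still serves normal visitors
# who arrive via a link. Purely an illustration; use with extreme caution.
from http.server import BaseHTTPRequestHandler

PAGE = b"<html><body>The real page content goes here.</body></html>"

class TemporarilyGone(BaseHTTPRequestHandler):
    def do_GET(self):
        if not self.headers.get("Referer"):
            self.send_error(404)                 # what Googlebot will see
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)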
To redirect all non-www requests to the www version of your domain, use this syntax:
RewriteCond %{HTTP_HOST} !^www\.example\.com
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
Or, for www-to-non-www redirection, use this syntax:
RewriteCond %{HTTP_HOST} !^example\.com
RewriteRule (.*) http://example.com/$1 [R=301,L]
The fix I personally recommend is simple: treat cross-domain 302 redirects differently from same-domain 302 redirects. Specifically, treat same-domain 302 redirects exactly as per the RFC, but treat cross-domain 302 redirects just like a normal link.
Meta redirects and other types of redirects should of course be treated the same way: only according to the RFC when the redirect stays within one domain; across domains, it must be treated like a simple link.
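To make the recommendation explicit, here is a sketch (my illustration, not anything from Google or any other search engine) of the indexing decision it implies:

# Sketch of the recommended policy: same-domain temporary redirects keep their
# RFC semantics, while any cross-domain redirect counts as nothing more than an
# ordinary link. Purely illustrative.
from urllib.parse import urlsplit

def index_url_for(redirecting_url, target_url, status):
    """Which URL should the fetched content be indexed under?"""
    same_domain = urlsplit(redirecting_url).hostname == urlsplit(target_url).hostname
    if status == 302 and same_domain:
        # RFC semantics: a temporary redirect means the redirecting URL stays
        # the canonical address of the content.
        return redirecting_url
    # Everything else -- notably any cross-domain 302 -- is treated like an
    # ordinary link: the content belongs to the target URL. (A real
    # implementation would compare registrable domains, not bare hostnames.)
    return target_url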
Added: A Slashdot reader made me aware of this:
RFC 2119 (Key words for use in RFCs to Indicate Requirement Levels) defines "SHOULD" as follows:
3. SHOULD This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.
So, if a search engine has a valid reason not to do as the RFC says it SHOULD, it will actually be conforming to the same RFC by not doing it.
For this to happen, we need to put some pressure on the search engines. What I did not tell you above is that this problem has been around for years. Literally (see, e.g. bottom of page here). The search engines have failed to take it seriously and hence their results pages are now filled with these wrong listings. It is not hard to find examples like the one I mentioned above.
You can help in this process by putting pressure on the search engines, e.g. by writing about the issue on your web page, in forums, or in your blog. Feel free to link to this page for the full story, but it's not required in any way unless you quote from it.