I read a bit about how spammers abuse a site's search function to reflect their own content and get it indexed, borrowing your site's standing in search engines to push their agenda. The tl;dr: a jerk links to your search page with a query that contains their content, gets that link indexed somehow, and then searches for the relevant terms may rank their site higher because it appears on yours.

So I, as the jerk here who wants to push shady pharmaceuticals, might find a bunch of wikis with open registration. I write a bot that spams those wikis with new users, and those users create pages linking to some other site's search function, with a query that carries some text and the link to my site. Search engines come along, find those wiki links, follow them to the search-reflection site, and see that the victim site, which has a high reputation, appears to link to my shady site. Now I effectively have free advertising when people use that search engine to look for terms related to my pharmaceuticals site. This works because lots of search functions tell you what your search term was when they return results, especially if those results are empty.

As the owner of a legitimate site, I might want to stop this. There are a couple of ways. The first is to not tell searchers what they searched for. So if they searched for arglebargle, the results page shouldn't say 'you searched for arglebargle and here are the results'; it should only return something like 'nothing found'.
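As a rough illustration of that first option, here is a minimal Python sketch. The function names are hypothetical and this is not how the search on this site is built (that one is client-side JS from the theme); it only contrasts a results message that echoes the query with one that doesn't.

```python
import html

def echoing_message(query, hits):
    """The risky variant: the visitor's query text ends up in the page."""
    return f"You searched for {html.escape(query)}: {len(hits)} result(s)."

def quiet_message(query, hits):
    """The safer variant: report the outcome without repeating the query."""
    if not hits:
        return "Nothing found."
    return f"{len(hits)} result(s) found."

# A query stuffed with spam text; shady.example is a placeholder domain.
spam_query = "cheap pills at shady.example"

print(echoing_message(spam_query, []))  # spam text is reflected into the page
print(quiet_message(spam_query, []))    # nothing for a crawler to pick up
```

Escaping the query, as the first function does, stops script injection but still leaves the spammer's words on the page, which is all the reflection trick needs.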

The CMS and theme I'm currently using, Pelican with Elegant, supports a search engine, which I have turned on, and this search engine does the thing where it tells you what you searched for. Instead of modifying the search engine (a bunch of JS, which I hate) or disabling it (I use it all the time to find my own posts, even though I could grep my markdown text), I'm going to do the more annoying but less difficult (for me) thing and tell my webserver to tell search engines not to index my search result pages. And yes, it's kind of hilarious that I switched to a static HTML generator to avoid issues with dynamic pages and comments and etc etc, and here I am working around what is effectively a dynamic page generator, open to anybody with an Internet connection, that will put their content on my site, and which I enabled on purpose.

I (currently) use Apache, so I'm going to use mod_headers to tell my web server to return an X-Robots-Tag header if the query is for my search page. The search page is at search.html, with a query like so: https://snowcrash.ca/blawg/search.html?q=asdf

I can see what headers are returned initially by pointing curl at that URL:

% curl -X GET -I 'https://snowcrash.ca/blawg/search.html?q=asdf'
HTTP/1.1 200 OK
Date: Fri, 26 Feb 2021 12:50:01 GMT
Server: Apache
Strict-Transport-Security: max-age=15768000
Last-Modified: Fri, 26 Feb 2021 12:35:26 GMT
ETag: "13bd-5bc3c80a78a89"
Accept-Ranges: bytes
Content-Length: 5053
Content-Type: text/html

Then I drop a .htaccess file in my document root (I could/should put it in /blawg, but I don't want my regeneration script to need to leave files behind), like this:

Header set X-Robots-Tag "noindex" "expr=%{QUERY_STRING} =~ m#.*(\&)*q=.*#"
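To get a feel for which query strings that expression catches, here is a small Python sketch. It assumes Python's `re` module treats this simple pattern the same way Apache's regex engine does, and that `=~` matches anywhere in the string. The `STRICT` pattern and the function names are hypothetical additions for comparison, not part of the .htaccess above.

```python
import re

# The pattern from the Header directive, copied as-is.
ORIGINAL = re.compile(r".*(\&)*q=.*")

# Hypothetical tighter alternative: q must be a whole parameter name.
STRICT = re.compile(r"(^|&)q=")

def matches_original(query_string):
    """True if the original pattern matches somewhere in the query string."""
    return ORIGINAL.search(query_string) is not None

def matches_strict(query_string):
    """True only if a parameter literally named q is present."""
    return STRICT.search(query_string) is not None

for qs in ["q=asdf", "page=2&q=asdf", "faq=1", "page=2", ""]:
    print(repr(qs), matches_original(qs), matches_strict(qs))
```

Because the `(\&)*` group is optional and surrounded by `.*`, the original pattern boils down to "the query string contains `q=`", so something like `faq=1` also matches. For a site with a single search page that is harmless, since the worst case is an extra noindex header.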

And I test again:

% curl -X GET -I 'https://snowcrash.ca/blawg/search.html?q=asdf'
HTTP/1.1 200 OK
Date: Fri, 26 Feb 2021 12:53:16 GMT
Server: Apache
Strict-Transport-Security: max-age=15768000
Last-Modified: Fri, 26 Feb 2021 12:35:26 GMT
ETag: "13bd-5bc3c80a78a89"
Accept-Ranges: bytes
Content-Length: 5053
X-Robots-Tag: noindex
Content-Type: text/html

Just to make sure I'm not adding that header to every single response, I also tested against a URL that is not my search page. The condition only looks at the query string, so any URL with a q= parameter gets the header. I could refine this a bit and ensure that the URL includes blawg/search.html, but I don't care very much about my CPU usage and the only thing on snowcrash is my blawg anyway.
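If you'd rather not eyeball the curl output each time, the check can be scripted. This is a minimal Python sketch, assuming you have saved the text printed by `curl -I`; the function name is hypothetical and the sample header blocks are trimmed versions of the responses shown in this post.

```python
def robots_directives(raw_headers):
    """Return the X-Robots-Tag directives found in a block of `curl -I` output."""
    directives = []
    for line in raw_headers.splitlines():
        name, sep, value = line.partition(":")
        # The status line has no colon-separated header name, so it is skipped.
        if sep and name.strip().lower() == "x-robots-tag":
            directives.extend(
                d.strip().lower() for d in value.split(",") if d.strip()
            )
    return directives

# Trimmed from the search-page response after adding the .htaccess rule.
with_rule = """HTTP/1.1 200 OK
Server: Apache
X-Robots-Tag: noindex
Content-Type: text/html
"""

# Trimmed from the response before the rule was added.
without_rule = """HTTP/1.1 200 OK
Server: Apache
Content-Type: text/html
"""

print(robots_directives(with_rule))     # ['noindex']
print(robots_directives(without_rule))  # []
```

Run it against the headers for a search URL and a non-search URL: the first should report noindex and the second should come back empty.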

This won't prevent the reflection attack, but it will prevent it from being very useful to people who want to abuse the massive clout this domain has.

I cribbed the techniques and htaccess file contents from a post by a fellow member of a closed security community I belong to. No credit given because they explicitly did not want credit.

