Scraping the scrapers for fun and (to maintain) profit: part 1

It barely needs to be said that scrapers (sites that lift content from other sources and republish it without permission) are a huge problem on the web. Even Google (indirectly) admits – through a tweet asking for data points – that the scraper problem remains unsolved. What’s more, practically everyone in the content business has to deal with content being scraped, which not only results in lost ad revenue (due to clicks misdirected to the content thieves) but can even lead to duplicate content penalties in search when Google is unable to correctly determine the original source. The problem is that my business needs traffic (and the corresponding ad impressions) to keep paying bloggers to post new content, and a situation where others profit from this content at my site’s expense is unsustainable.

Up until a few weeks ago, I had had pretty good success managing scrapers by limiting the amount of content in RSS feeds (which has the unfortunate side effect of inconveniencing some loyal readers). For the really stubborn scrapers, reaching out to their web hosts and to Google with DMCA complaints usually worked (I’ve got a ChillingEffects.org posting to prove it). The problem with filing DMCA complaints, especially with Google, is that you have to provide the corresponding original URL for each URL you are submitting from a scraper site. This works OK when dealing with ten or twenty posts, but quickly becomes untenable once you’re dealing with scrapers that have copied hundreds or even thousands of pages of content.

My initial strategy when dealing with the latter case was to collect all the links by loading each page of the scraper site and using jQuery to extract the permalinks of each stolen post. This was fairly easy using Chrome’s console and a bit of code like:

jQuery(".title a").each(function(){
    console.log(jQuery(this).attr("href"));
});
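
As a small variant (just a sketch, relying on Chrome’s copy() console utility), the hrefs can also be collected into an array and copied to the clipboard one per line, which makes building the full list across pages a bit less tedious:

// Collect the permalink hrefs into an array and copy them to the clipboard,
// one URL per line (copy() is a Chrome DevTools console helper).
var urls = jQuery(".title a").map(function () {
    return jQuery(this).attr("href");
}).get();
copy(urls.join("\n"));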

Clearly, the exact selector varies by scraper site, but since almost all of them run WordPress or a similar templated package, once you figure out the selector it’s just a matter of going through each page and running a variant of the above code to collect the URLs. Not so fast! Google’s DMCA forms (except the Web Search form, which was recently changed to be friendlier) look like this:

Not only do you have to enter each source and infringing URL into its own field, but you actually have to click “Add another source” each time you want to add another pair of URLs. Doing this for hundreds of URLs would literally take all day. Luckily, the JavaScript console comes to the rescue. First, looking at the page source, I found that the “Add another source” button simply calls a function to add a field row: addFieldRow('request_table_package_adsense_copyright_table',true);
I also found that the input fields have id values in the format “extra.adsense_copyright_source_#” and “extra.adsense_copyright_url_#” for the source and infringing URL fields respectively (where # is a number: 1, 2, 3, etc.).
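
Before throwing hundreds of URLs at the page, a quick sanity check in the console helps confirm the pattern (a minimal sketch, assuming the ids behave exactly as described above):

// Sanity check (sketch): confirm the first row's fields are reachable
// under the expected ids (the numbering appears to start at 1).
console.log(document.getElementById('extra.adsense_copyright_source_1'));
console.log(document.getElementById('extra.adsense_copyright_url_1'));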

Using this knowledge, it was easy to automate the otherwise tedious process of adding the correct number of rows, filling in a field for each infringing URL, and putting “http://www.ifans.com/” as the source (since I was dealing with hundreds of URLs that all pointed to blog posts with “- iFans” in the title). I used some string replacements in BBEdit to transform my newline-separated list of URLs into JavaScript’s literal array notation, and then added a bit of code to loop over them and fill in the fields as needed:

var bad_urls = ["http://evil.scraper.com/blog-post-1", "http://evil.scraper.com/copied-post-2/"]; // etc.
for (var i = 0; i < bad_urls.length; i++) {
    // The form already shows one empty row, so only add a new row from the second URL onward.
    if (i > 0) addFieldRow('request_table_package_adsense_copyright_table', true);
    var l = i + 1; // the field ids are numbered starting at 1
    document.getElementById('extra.adsense_copyright_source_' + l).value = "http://www.ifans.com/";
    document.getElementById('extra.adsense_copyright_url_' + l).value = bad_urls[i];
}
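
For what it’s worth, the BBEdit step isn’t strictly necessary: the newline-separated list can also be turned into an array right in the console (a sketch; raw_list here is just a placeholder for the pasted text):

// Build the array from newline-separated text instead of hand-editing it
// into literal array notation (raw_list is a hypothetical placeholder).
var raw_list = "http://evil.scraper.com/blog-post-1\nhttp://evil.scraper.com/copied-post-2/";
var bad_urls = raw_list.trim().split("\n");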

Either way, running the loop in the JavaScript console populated the form with the appropriate URLs, so after filling in the remaining fields, I submitted the complaint and went off to work on something else.

The next day, I got this email back from Google:

Hello Mircea,

Thanks for reaching out to us.

We have received your DMCA complaint regarding [scraper]. If you are the copyright holder for this content, please provide us with specific links to the original source of the copyrighted work that you claim has been republished without your authorization. This will help us further investigate the issue and take the appropriate actions.

Thank you for your cooperation in this regard.

Oops! It turns out that having “- iFans” in the title of every single post I linked is not enough proof for Google to turn off AdSense on these scraper sites. At this point, sort of frustrated, I wrote back:

Hello,

It is very difficult for us to list the hundreds of instances, but here are few examples (corresponding to the first few instances below):

http://www.ifans.com/blog/27997/

http://www.ifans.com/blog/28507/

http://www.ifans.com/blog/27991/

http://www.ifans.com/blog/28177/

Please also look at the fact that all of the articles on the infringing website have the ” – iFans” at the end of their titles, clearly a reference to the fact that they were scraped from our site. It is very hard for us to keep up with manually matching every infringing instance when these guys are automatically scraping our site.

Thank you for your help on this matter.

Best regards,
Mircea Georgescu

Given that the scraped articles looked like they had been run through some sort of machine-translation round trip (i.e. translated from English to Spanish/German/French/etc. and back with Google Translate), matching them to their original articles would require some pretty serious effort for each of the hundreds of infringing posts. My initial hope was that giving a few examples would help whoever was on the other end get the idea, but as I thought back to my days at Google as an APM intern, I recalled the cultural emphasis on making decisions based on very strong supporting data, and realized that what I had given would probably not be enough.

That’s when it hit me that although it would be hard to manually collect hundreds of scraper URLs and match them with the original posts they were based on, automating this process could allow it to scale.

In part 2 of this series, I will describe how I automated this process and share the code used to do it. There’ll be a lot less talk and a lot more code, I promise.

Update: Part 2 is now live.
