Scraping the scrapers for fun and (to maintain) profit: part 2

In the first installment of this series, I described the approach I had been taking to deal with harvesting links to scraped content and submitting them to Google’s DMCA forms. I ended that post promising to share the code that I used to more fully automate the process of collecting these links and matching them with the copied content from my site.

In this post, I will briefly describe the approach I took to automatically scrape the scrapers, and share the code to achieve this. Before I do, though, I want to share the tangible results of using this quick bit of code:

All told, I had over 1200 different pieces of stolen content removed from Google’s index using the output from this tool. On top of that, submitting the same lists of URLs to the separate Adsense DMCA form also spooked the people running these scrapers and caused them to remove iFans from their list of sites to scrape (I’ll talk more about my dealings with Adsense DMCA support in the next post in this series).

I’ve posted all the code on GitHub under a BSD license (although the underlying concepts are fairly simple to replicate without copying the code). I wrote it in PHP since I was already familiar with the MySQLi library, I found a decent DOM implementation, and I wanted to run it directly on the web server on which I run WordPress. You could achieve the same results in any language that has a decent library for parsing imperfect HTML documents and running CSS or XPath queries against them. The way this works is pretty simple and consists of the following steps:

  1. Programmer provides a CSS selector that identifies links to stolen content and a URL for the page (or page range, in the case of a paginated blog listing) containing these links.
  2. Fetch the page(s), extract the URLs and anchor text.
  3. Since the anchor text corresponds to blog entry titles, use MySQL full-text search to match the scraped URLs to the original content in the database.
  4. Output JavaScript code that can automatically fill in the Adsense DMCA form with the URL pairs.
  5. Output plaintext lists of the scraped and original URLs suitable for copy-and-pasting into other DMCA forms and emails.
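As a rough illustration of steps 1 and 2, here is a minimal sketch that uses only PHP's built-in DOM extension. It is not the code from functions.php: the helper name extractLinks(), the sample markup, and the scraper.example URLs are all made up for this example, and because the built-in extension has no CSS selector support, the selector `.art-PostHeader a` is hand-translated into an XPath query (restricted to anchors that actually carry an href).

```php
<?php
// Hypothetical helper, not part of the author's functions.php.
// Returns one ['url' => ..., 'title' => ...] entry per matching anchor.
function extractLinks(string $html, string $xpathQuery): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them;
    // scraper pages are rarely valid HTML.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query($xpathQuery) as $anchor) {
        $links[] = [
            'url'   => $anchor->getAttribute('href'),
            'title' => trim($anchor->textContent),
        ];
    }
    return $links;
}

// Hand-written XPath equivalent of the CSS selector ".art-PostHeader a".
$query = '//*[contains(concat(" ", normalize-space(@class), " "), " art-PostHeader ")]//a[@href]';

// Made-up markup standing in for a fetched listing page.
$html = '<div>'
      . '<h2 class="art-PostHeader"><a href="http://scraper.example/post-1">First post title</a></h2>'
      . '<h2 class="art-PostHeader"><a href="http://scraper.example/post-2">Second post title</a></h2>'
      . '<p><a href="http://scraper.example/about">About</a></p>'
      . '</div>';

$links = extractLinks($html, $query);
```

In a real run the `$html` string would come from fetching each listing page, and the anchor text in `title` is what gets fed to the full-text search in step 3.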

All of the functionality is contained in a single file called functions.php, which I documented fairly thoroughly. Note that the goal here was not to come up with the most efficient or robust solution, but to build something that let me get my business task done. Because of this, there are parts of the code that you will need to modify to match your needs. I also included a slightly modified example of the code I actually used to scrape the specific scraper sites I was targeting in example.php. Here’s an abridged version to give you an idea of how easy it was to use this to scrape hundreds of instances of infringing content at a time and match them up with the original posts:

<?php
// Declare output array
$final_output = array();
// Scrape a bunch of scrapers
processPaginatedUrlPattern('{page}', '.art-PostHeader a', 1, 7, $final_output);
// Match up the scraped articles with the originals
$mysqli = new mysqli("localhost", "root", "", "wordpress");
matchOutputWithArticles($final_output, $mysqli);
// Print the output in formats suitable for C&P and Google's complex DMCA form

Note that here I took advantage of the fact that the scrapers were using WordPress, which allows searching via GET requests. If you actually have to go through every single page of a scraper blog that doesn’t have its search enabled, you will need to include a step that filters out posts on the scraper blog that are not likely to be scraped from your site. In my case, this would’ve been easy because they all included the text “iFans” in their titles, but it may be more involved if this is not the case for you. I would suggest playing around with MySQL full-text search relevance values and once you find a good threshold, discarding scraper URLs that do not have a match above that threshold.
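The threshold idea could be sketched roughly as follows. Everything here is an assumption made for illustration: the query presumes a FULLTEXT index has been added to `wp_posts.post_title` (WordPress does not create one by default), the helper filterByRelevance() is hypothetical, and the relevance numbers are invented. Useful threshold values depend entirely on your own data.

```php
<?php
// Hypothetical query, shown for context only and not executed below.
// Assumes a FULLTEXT index exists on wp_posts.post_title.
const MATCH_SQL = 'SELECT ID, post_title,
                          MATCH(post_title) AGAINST (?) AS relevance
                   FROM wp_posts
                   WHERE post_status = \'publish\'
                     AND MATCH(post_title) AGAINST (?)
                   ORDER BY relevance DESC
                   LIMIT 1';

// Hypothetical helper: keep only candidates whose best match scored
// at or above the threshold. A null relevance means no match was found.
function filterByRelevance(array $candidates, float $threshold): array
{
    return array_values(array_filter(
        $candidates,
        fn(array $c): bool => $c['relevance'] !== null && $c['relevance'] >= $threshold
    ));
}

// Invented scores standing in for the results of running MATCH_SQL
// once per scraped title.
$candidates = [
    ['scraped_url' => 'http://scraper.example/post-1', 'relevance' => 12.4],
    ['scraped_url' => 'http://scraper.example/post-2', 'relevance' => 0.8],
    ['scraped_url' => 'http://scraper.example/post-3', 'relevance' => null],
];

$kept = filterByRelevance($candidates, 5.0);
```

Sorting a sample of matches by relevance and eyeballing where the correct pairs stop is probably the quickest way to pick the cutoff.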

Finally, a word of warning. This technique WILL result in a few false positives. The implications of this can be serious if you submit a DMCA claim containing them. Since Google explains it better than me, I will just quote them:

IMPORTANT: Misrepresentations made in your notice regarding whether material or activity is infringing may expose you to liability for damages (including costs and attorneys’ fees). Courts have found that you must consider copyright defenses, limitations or exceptions before sending a notice. In one case involving online content, a company paid more than $100,000 in costs and attorneys fees after targeting content protected by the U.S. fair use doctrine. Accordingly, if you are not sure whether material available online infringes your copyright, we suggest that you first contact an attorney.

In my case, I was willing to accept the risk of false positives, since I had found that all of the scrapers that I was targeting were run by people in China who seemed to be doing this to make a few pennies here and there from misdirected search traffic. If I were submitting these same claims against a US company with deep pockets, I would manually check every single pair to be absolutely sure that no misrepresentations were made. The most tedious part of dealing with this, which is matching the scraped content to the original content, would still be automated, and it would be fairly trivial to create a web interface that loads each successive matched pair in side-by-side iframes to double-check the match.
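A side-by-side review page of that kind might look something like the sketch below. None of it comes from the GitHub code: renderReviewPage(), the `scraped_url` and `original_url` array keys, the `?i=` query parameter, and the example URLs are all hypothetical. Keep in mind that some sites refuse to load inside iframes, in which case plain links to open in two tabs would have to do.

```php
<?php
// Hypothetical helper: builds one review page for a matched pair.
// $index is zero-based; $total is the number of pairs to review.
function renderReviewPage(array $pair, int $index, int $total): string
{
    // Escape both URLs before placing them in HTML attributes.
    $scraped  = htmlspecialchars($pair['scraped_url'], ENT_QUOTES, 'UTF-8');
    $original = htmlspecialchars($pair['original_url'], ENT_QUOTES, 'UTF-8');

    $position = ($index + 1) . ' of ' . $total;
    $next = ($index + 1 < $total)
        ? '<a href="?i=' . ($index + 1) . '">Next pair</a>'
        : 'Last pair';

    return "<!DOCTYPE html>\n<html><body>\n"
         . "<p>Pair $position. $next</p>\n"
         . "<iframe src=\"$scraped\" style=\"width:49%;height:90vh\"></iframe>\n"
         . "<iframe src=\"$original\" style=\"width:49%;height:90vh\"></iframe>\n"
         . "</body></html>\n";
}

// Made-up pair; the odd query string just demonstrates the escaping.
$page = renderReviewPage(
    [
        'scraped_url'  => 'http://scraper.example/?p=1&x="y"',
        'original_url' => 'http://original.example/post',
    ],
    0,
    2
);
```

A small wrapper script could read the `i` parameter, look up the corresponding pair, and echo the returned page, so that clicking through every match takes only a few seconds each.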

While the scrapers still have a sizable advantage because they can crank out stolen content faster than it can be taken down, this technique leveled the playing field significantly for me. I hope that this will be helpful to others out there.

In the next (and probably final) installment of this series, I will delve a bit deeper into the somewhat disappointing differences between the responses of the Adsense and Web Search support teams to my DMCA notice.
