Scraping the scrapers for fun and (to maintain) profit: part 2

In the first installment of this series, I described the approach I had been taking to harvest links to scraped content and submit them to Google’s DMCA forms. I ended that post promising to share the code that I used to more fully automate the process of collecting these links and matching them with the copied content from my site.

In this post, I will briefly describe the approach I took to automatically scrape the scrapers, and share the code to achieve this. Before I do, though, I want to share the tangible results of using this quick bit of code:

All told, I had over 1200 different pieces of stolen content removed from Google’s index using the output from this tool. On top of that, submitting the same lists of URLs to the separate Adsense DMCA form also spooked the people running these scrapers and caused them to remove iFans from their list of sites to scrape (I’ll talk more about my dealings with Adsense DMCA support in the next post in this series).

I’ve posted all the code on GitHub under a BSD license (although the underlying concepts are fairly simple to replicate without copying the code). I wrote it in PHP since I was already familiar with the MySQLi library, I found a decent DOM implementation, and I wanted it to run directly on the web server on which I run WordPress. You could achieve the same results in any language that has a decent library for parsing imperfect HTML documents and running CSS or XPath queries against them. The way this works is pretty simple and consists of the following steps:

  1. The programmer provides a CSS selector that identifies links to stolen content and a URL for the page (or page range, in the case of a paginated blog listing) containing these links.
  2. Fetch the page(s), extract the URLs and anchor text.
  3. Since the anchor text corresponds to blog entry titles, use MySQL full-text search to match the scraped URLs to the original content in the database (see the sketch just after this list).
  4. Output JavaScript code that can automatically fill in the Adsense DMCA form with the URL pairs.
  5. Output plaintext lists of the scraped and original URLs suitable for copy-and-pasting into other DMCA forms and emails.
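
To make step 3 concrete, here is a minimal sketch of the kind of full-text query the matching could be built on. It is an illustration rather than the exact code in matchOutputWithArticles(): wp_posts and post_title are the standard WordPress table and column, but the sketch assumes you have added a FULLTEXT index on post_title.

<?php
// Illustration only: find the published post whose title best matches
// the anchor text scraped from a scraper site.
// Assumes a FULLTEXT index has been added on wp_posts.post_title.
function findOriginalPost($anchor_text, $mysqli) {
    $title = $mysqli->real_escape_string($anchor_text);
    $result = $mysqli->query(
        "SELECT ID, post_title, MATCH(post_title) AGAINST('$title') AS score
         FROM wp_posts
         WHERE post_status = 'publish' AND MATCH(post_title) AGAINST('$title')
         ORDER BY score DESC
         LIMIT 1");
    return $result ? $result->fetch_assoc() : null; // null if nothing matches
}
?>

The relevance score that comes back with each match is also what makes the threshold-based filtering discussed further down possible.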

All of the functionality is contained in a single file called functions.php, which I documented fairly thoroughly. Note that the goal here was not to come up with the most efficient or robust solution, but to get something that let me get my business task done. Because of this, there are parts of the code that you will need to modify to match your needs. I also included, in example.php, a slightly modified version of the code I actually used to scrape the specific scraper sites I was targeting. Here’s an abridged version to give you an idea of how easy it was to scrape hundreds of instances of infringing content at a time and match them up with the original posts:

<?php
require('functions.php');
 
// Declare output array
$final_output = array();
 
// Scrape a bunch of scrapers
processPaginatedUrlPattern('http://fictional-scraper-blog.com/search/ifans/page/{page}', '.art-PostHeader a', 1, 7, $final_output);
 
// Match up the scraped articles with the originals
$mysqli = new mysqli("localhost", "root", "", "wordpress");
matchOutputWithArticles($final_output, $mysqli);
 
// Print the output in formats suitable for C&P and Google's complex DMCA form
printOutputCopyAndPaste($final_output);
printOutputJavascriptForm($final_output);
?>

Note that here I took advantage of the fact that the scrapers were using WordPress, which allows searching via GET requests. If you actually have to go through every single page of a scraper blog that doesn’t have its search enabled, you will need to include a step that filters out posts on the scraper blog that are not likely to be scraped from your site. In my case, this would’ve been easy because they all included the text “iFans” in their titles, but it may be more involved if this is not the case for you. I would suggest playing around with MySQL full-text search relevance values and once you find a good threshold, discarding scraper URLs that do not have a match above that threshold.
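
For example, the filtering could be a post-processing step along these lines. The 0.5 threshold and the shape of the matched data are made up for illustration; the real structure produced by matchOutputWithArticles() may differ, and you would tune the threshold against your own scores.

<?php
// Illustration only: drop scraped URLs whose best full-text match
// score falls below a threshold tuned against your own data.
$threshold = 0.5; // made-up starting point; adjust after inspecting real scores

foreach ($final_output as $scraped_url => $match) {
    if ($match === null || $match['score'] < $threshold) {
        unset($final_output[$scraped_url]); // probably not scraped from our site
    }
}
?>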

Finally, a word of warning. This technique WILL result in a few false positives. The implications of this can be serious if you submit a DMCA claim containing them. Since Google explains it better than me, I will just quote them:

IMPORTANT: Misrepresentations made in your notice regarding whether material or activity is infringing may expose you to liability for damages (including costs and attorneys’ fees). Courts have found that you must consider copyright defenses, limitations or exceptions before sending a notice. In one case involving online content, a company paid more than $100,000 in costs and attorneys fees after targeting content protected by the U.S. fair use doctrine. Accordingly, if you are not sure whether material available online infringes your copyright, we suggest that you first contact an attorney.

In my case, I was willing to accept the risk of false positives, since I had found that all of the scrapers I was targeting were run by people in China who seemed to be doing this to make a few pennies here and there from misdirected search traffic. If I were submitting these same claims against a US company with deep pockets, I would manually check every single pair to be absolutely sure that no misrepresentations were made. The most tedious part of dealing with this, which is matching the scraped content to the original content, would still be automated, and it would be fairly trivial to create a web interface that loads each successive matched pair in side-by-side iframes to double-check the match.
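
Such a review interface could be as simple as the rough sketch below. The matched_pairs.json input file and its scraped/original keys are made up for illustration; in practice you would feed it whatever structure your matching step produces.

<?php
// Rough sketch of a manual review page: show each matched pair of URLs
// side by side in iframes so a human can confirm the match before filing.
// The matched_pairs.json file and its keys are hypothetical.
$pairs = json_decode(file_get_contents('matched_pairs.json'), true);
$i = isset($_GET['i']) ? (int)$_GET['i'] : 0;
$i = max(0, min($i, count($pairs) - 1));
$pair = $pairs[$i];
?>
<p>
    Pair <?php echo $i + 1; ?> of <?php echo count($pairs); ?>
    (<a href="?i=<?php echo $i + 1; ?>">next</a>)
</p>
<iframe src="<?php echo htmlspecialchars($pair['scraped']); ?>" style="width: 49%; height: 600px;"></iframe>
<iframe src="<?php echo htmlspecialchars($pair['original']); ?>" style="width: 49%; height: 600px;"></iframe>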

While the scrapers still have a sizable advantage because they can crank out stolen content faster than it can be taken down, this technique leveled the playing field significantly for me. I hope that this will be helpful to others out there.

In the next (and probably final) installment of this series, I will delve a bit deeper into the somewhat disappointing differences between the responses of the Adsense and Web Search support teams to my DMCA notice.

Scraping the scrapers for fun and (to maintain) profit: part 1

It barely even needs to be mentioned that scrapers (sites that lift content from other sources and republish it without permission) are a huge problem on the web. Even Google (indirectly) admits – through a tweet asking for data points – that the scraper problem remains unsolved. What’s more, practically everyone in the content business is faced with issues relating to content being scraped, which not only results in lost ad revenue (due to clicks misdirected to the content thieves) but can even lead to duplicate content penalties in search when Google is unable to correctly determine the original source. The problem is that my business needs traffic (and the corresponding ad impressions) to keep paying bloggers to post new content, and a situation where others profit from this content at the expense of harming my site is unsustainable.

Up until a few weeks ago, I had had pretty good success with managing scrapers by limiting the amount of content in RSS feeds (which has the unfortunate side-effect of inconveniencing some loyal readers). For the really stubborn scrapers, reaching out to their web hosts and to Google with DMCA complaints usually worked (I’ve got a ChillingEffects.org posting to prove it). The problem with filing DMCA complaints, especially with Google, is that you have to provide the corresponding original URL for each URL you are submitting from a scraper site. This works OK when dealing with ten or twenty posts, but quickly becomes untenable once you’ve got scrapers that have scraped hundreds or even thousands of pages of content.

My initial strategy when dealing with the latter case was to collect all the links by loading each page of the scraper site and using jQuery to extract the permalinks of each stolen post. This was fairly easy using Chrome’s console and a bit of code like:

jQuery(".title a").each(function(){
    console.log(jQuery(this).attr("href"));
});

Clearly, the exact selector varies by scraper site, but since almost all of them use WordPress or some other similar package that uses templates, it’s just a matter of going through each page and running a variant of the above code to collect the URLs once you figure out the selector. Not so fast! Google’s DMCA forms (except the Web Search form, which was recently changed to be more friendly) look like this:

Not only do you have to input each source and infringing URL into its own field, but you actually have to click "Add another source" each time you want to add another pair of URLs. Doing this for hundreds of URLs would literally take all day. Luckily, the Javascript console comes to the rescue. First, looking at the page source, I found that the "Add another source" button simply calls a function to add a field row: addFieldRow('request_table_package_adsense_copyright_table',true);
I also found that the input fields have id values in the format "extra.adsense_copyright_source_#" and "extra.adsense_copyright_url_#" respectively for the source and infringing URL fields (where # represents a number 1, 2, 3, etc.).

Using this knowledge, it is easy to automate the otherwise tedious process of adding the correct number of rows, filling in a field for each infringing URL, and putting "http://www.ifans.com/" as the source (since I was dealing with hundreds of URLs that all pointed to blog posts with "- iFans" in the title). I used some string replacements in BBEdit to transform my newline-separated list of URLs into Javascript's literal array notation, and then added a bit of code to loop over them and fill in the fields as needed:

bad_urls = ["http://evil.scraper.com/blog-post-1","http://evil.scraper.com/copied-post-2/"]; // etc.
for(i = 0; i < bad_urls.length; i++) {
    if(i > 0) addFieldRow('request_table_package_adsense_copyright_table',true);
    l = i+1;
    document.getElementById('extra.adsense_copyright_source_'+l).value = "http://www.ifans.com/";
    document.getElementById('extra.adsense_copyright_url_'+l).value = bad_urls[i];
}

Running this code in the Javascript console populated the form with the appropriate URLs, so after filling in the rest of the form, I submitted the form and went off to work on something else.
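
(Incidentally, the BBEdit step could just as easily be replaced with a throwaway script. Assuming a file called bad_urls.txt with one URL per line, a hypothetical PHP one-liner along these lines prints the same array literal:)

<?php
// Hypothetical replacement for the BBEdit find-and-replace step:
// turn a newline-separated list of URLs into a JavaScript array literal.
$urls = file('bad_urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
echo 'bad_urls = ' . json_encode($urls) . ";\n";
?>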

The next day, I got this email back from Google:

Hello Mircea,

Thanks for reaching out to us.

We have received your DMCA complaint regarding [scraper]. If you are the copyright holder for this content, please provide us with specific links to the original source of the copyrighted work that you claim has been republished without your authorization. This will help us further investigate the issue and take the appropriate actions.

Thank you for your cooperation in this regard.

Oops! It turns out that having “- iFans” in the title of every single post I linked is not enough proof for Google to turn off AdSense on these scraper sites. At this point, sort of frustrated, I wrote back:

Hello,

It is very difficult for us to list the hundreds of instances, but here are few examples (corresponding to the first few instances below):

http://www.ifans.com/blog/27997/
http://www.ifans.com/blog/28507/
http://www.ifans.com/blog/27991/
http://www.ifans.com/blog/28177/

Please also look at the fact that all of the articles on the infringing website have the “ – iFans” at the end of their titles, clearly a reference to the fact that they were scraped from our site. It is very hard for us to keep up with manually matching every infringing instance when these guys are automatically scraping our site.

Thank you for your help on this matter.

Best regards,
Mircea Georgescu

Given the fact that the scraped articles looked like they had been run through some sort of machine translation round-trip (i.e. translated from English to Spanish/German/French/etc. and back with Google Translate), matching them to their original articles would require some pretty serious effort for each of hundreds of infringing posts. My initial hope was that giving a few examples would help whoever was on the other end get the idea, but as I thought back to my days at Google as an APM intern, I recalled the cultural emphasis on making decisions based on very strong supporting data, and realized that what I had given would probably not be enough.

That’s when it hit me that although it would be hard to manually collect hundreds of scraper URLs and match them with the original posts they were based on, automating this process could allow it to scale.

In part 2 of this series, I will describe how I automated this process and share the code used to do it. There’ll be a lot less talk and a lot more code, I promise.

Update: Part 2 is now live.

Efficiently finding the average color of a UIImage

Update 07/05/2012: As commenter Andy Matuschak kindly pointed out, this approach does not use GPU acceleration. Nonetheless, this approach is very likely to be faster than anything you write yourself, because the place where all the time is spent (resample_byte_h_4cpp_armv7) is written in tight assembly and makes use of the CPU’s vector unit.

For my final project for Stanford’s iPhone app class, I built World of Spectra, which challenges users to interact with the world around them by capturing and collecting its colors. In brief, users would take a picture of a real world object, and this picture would be reduced to a single color and added to a collection of colors (my talk includes a demo). To make this happen, I needed a way to find the “average” color of an image. Although there are a lot of clever approaches to extracting the most “important” color of an image, I just wanted something that would find a simple average of the colors in each pixel of the image.

The header file is pretty straightforward:

/*
 UIImage+AverageColor.h
*/
 
#import <UIKit/UIKit.h>
 
@interface UIImage (AverageColor)
- (UIColor *)averageColor;
@end

Now on to the actual code. The naive approach to finding the average color would be to loop over each pixel of the source image, separately accumulating the RGB values and then dividing by the total number of pixels. Of course, not only are things not so simple if we want to support images with transparency, but looping over a few megapixels in this way is not very efficient.

A better approach is to use Apple's existing graphics libraries to achieve this in far fewer lines of code. I create a Core Graphics bitmap context backed by a 4-element array (representing the RGBA values of a 1×1 image) and have Core Graphics draw the image into this context. Then, it is simple to return a UIColor correctly initialized with the “average” color of the image based on the aforementioned RGBA values. As the update above notes, Core Graphics does this scaling on the CPU rather than the GPU, but its resampling routines are heavily optimized, so we still avoid writing our own loop over the image's pixels and instead hand the work off to code designed for scaling images quickly. The result is much cleaner code and better performance:

/*
 UIImage+AverageColor.m
 
 Copyright (c) 2010, Mircea "Bobby" Georgescu
 All rights reserved.
 
 Redistribution and use in source and binary forms, with or without
 modification, are permitted provided that the following conditions are met:
 * Redistributions of source code must retain the above copyright
 notice, this list of conditions and the following disclaimer.
 * Redistributions in binary form must reproduce the above copyright
 notice, this list of conditions and the following disclaimer in the
 documentation and/or other materials provided with the distribution.
 * Neither the name of the Mircea "Bobby" Georgescu nor the
 names of its contributors may be used to endorse or promote products
 derived from this software without specific prior written permission.
 
 THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
 ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
 WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 DISCLAIMED. IN NO EVENT SHALL Mircea "Bobby" Georgescu BE LIABLE FOR ANY
 DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
 (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
 LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
 SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
 
#import "UIImage+AverageColor.h"
 
@implementation UIImage (AverageColor)
 
- (UIColor *)averageColor {
 
    // Create a 1x1 bitmap context backed by the 4-byte rgba buffer; drawing
    // the image into it makes Core Graphics do the downscaling, leaving the
    // (premultiplied) average color in rgba.
    CGColorSpaceRef colorSpace = CGColorSpaceCreateDeviceRGB();
    unsigned char rgba[4];
    CGContextRef context = CGBitmapContextCreate(rgba, 1, 1, 8, 4, colorSpace, kCGImageAlphaPremultipliedLast | kCGBitmapByteOrder32Big);
 
    CGContextDrawImage(context, CGRectMake(0, 0, 1, 1), self.CGImage);
    CGColorSpaceRelease(colorSpace);
    CGContextRelease(context);  
 
    if(rgba[3] > 0) {
        // The bitmap uses premultiplied alpha, so divide the color
        // components by the alpha value to recover the true color.
        CGFloat alpha = ((CGFloat)rgba[3])/255.0;
        CGFloat multiplier = 1.0/(alpha*255.0);
        return [UIColor colorWithRed:((CGFloat)rgba[0])*multiplier
                               green:((CGFloat)rgba[1])*multiplier
                                blue:((CGFloat)rgba[2])*multiplier
                               alpha:alpha];
    }
    else {
        return [UIColor colorWithRed:((CGFloat)rgba[0])/255.0
                               green:((CGFloat)rgba[1])/255.0
                                blue:((CGFloat)rgba[2])/255.0
                               alpha:((CGFloat)rgba[3])/255.0];
    }
}
 
@end

Feel free to use the code however you want (it is BSD licensed), and please drop me a line or leave a comment if it is useful to you.