Extracting Phone Numbers from the Common Crawl with EMR/Hadoop

Extracting phone numbers from the November 2015 CC

Motivation: A client wanted to find a large set of phone numbers from the public web; all types of numbers, not just US phone numbers.

The Common Crawl provided a convenient way to efficiently process a large portion of the web and extract phone numbers. A big thanks to Ben C. at Yelp for writing the article “Analyzing the Web For the Price of a Sandwich”; it provided a great starting place. While Ben’s goal was to search for only US phone numbers, I needed to extract phone numbers from all countries, so I opted to use the phonenumbers Python library, despite it being “slower”.

I found that using the Common Crawl’s WET files (text content only, as opposed to full HTML) was sufficient to extract phone numbers. Processing the WET files meant the total dataset was roughly 8TB, down from the 151TB of the full crawl. Despite this, it still took ~10 hours to fully process: the phonenumbers library is indeed quite a bit slower and more CPU intensive than a simple regex. In this case I think it was definitely worth it, because it provided a clean, tested method to extract all forms of phone numbers from text.

Initially I thought that I’d have to try and detect each page’s associated country in order to extract phone numbers, because if a number isn’t fully qualified, e.g. ‘555-845-3567’, then its interpretation depends on the country it’s being called from. This turned out not to be necessary: the client’s requirement was to extract at minimum 1,000 numbers for every country/zone, and the Common Crawl contains more than 1,000 fully qualified numbers for every relevant zone.

Optimizing Map Reduce (a little)

As I was using spot instances, I split the crawl into a handful of map reduce jobs, both to avoid losing work if a job failed and to avoid reducing over the entirety of the Common Crawl.
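Splitting the crawl into separate jobs comes down to chunking the list of WET file paths. A rough sketch, assuming the paths are already available as a Python iterable; the batch_paths helper and the placeholder path names are hypothetical, not the code used for the original run.

```python
from itertools import islice


def batch_paths(paths, batch_size):
    """Yield successive lists of at most batch_size WET file paths."""
    iterator = iter(paths)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Placeholder names; a real run would read the crawl's WET path listing.
    paths = ["wet-file-%05d" % i for i in range(2000)]
    for job_number, batch in enumerate(batch_paths(paths, 900)):
        # Each batch would become the input of one map reduce job.
        print(job_number, len(batch))
```

Each batch can then be submitted as its own job, so a lost spot instance costs at most one batch of work.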

Initially I tried a batch size of 900 WET files, which took 20 minutes. Not too bad, but I also noticed that the cluster often sat idle, due to a combination of a few straggling map operations and the overhead of starting a job with mrjob.

Upping the batch size to 3648 improved things a bit: that job took only 60 minutes to complete, about 25% less time than the same number of files would have needed at the smaller batch size, mostly due to eliminating the overhead associated with scheduling jobs.
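The comparison between the two batch sizes can be checked with a little arithmetic. This sketch uses only the figures quoted above (900 files in 20 minutes, 3648 files in 60 minutes); the function name compare_batches is made up for illustration.

```python
def compare_batches(small_files, small_minutes, large_files, large_minutes):
    """Compare two batch runs by throughput and by time saved."""
    small_rate = small_files / small_minutes  # files per minute
    large_rate = large_files / large_minutes
    # Time the large batch would have taken at the small batch's rate.
    minutes_at_small_rate = large_files / small_rate
    time_saved = 1 - large_minutes / minutes_at_small_rate
    return {
        "small_rate": small_rate,
        "large_rate": large_rate,
        "minutes_at_small_rate": minutes_at_small_rate,
        "time_saved": time_saved,
    }


if __name__ == "__main__":
    for key, value in compare_batches(900, 20, 3648, 60).items():
        print(key, round(value, 2))
```

At 45 files per minute the larger batch would have needed roughly 81 minutes, so finishing in 60 minutes is a saving of about 26%, in line with the "25% faster" figure.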

I had tried to size the job so that its mappers fit onto the available cores in whole stages (19 machines * 32 cores per machine * 6 = 3648), but unfortunately a few of the map tasks failed and were rerun, which added an extra final “stage” where only a few mappers were running at a time. Since each mapper processes an entire WET file, this final stage added quite a bit of processing time to the total job run time.
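The effect of a few reruns on the stage count can be sketched with a deliberately simplified model: every mapper takes the same time, mappers run in lockstep waves, and reruns are appended to the end of the queue. Real Hadoop scheduling is messier, and the count of 5 reruns below is a made-up figure for illustration, not a number from the actual run.

```python
import math


def map_stages(num_tasks, machines, cores_per_machine):
    """Return (stage count, tasks running in the final stage).

    Simplified model: one mapper per core, all mappers equally long.
    """
    if num_tasks == 0:
        return 0, 0
    slots = machines * cores_per_machine
    stages = math.ceil(num_tasks / slots)
    last_stage_tasks = num_tasks - (stages - 1) * slots
    return stages, last_stage_tasks


if __name__ == "__main__":
    # The planned job: 3648 mappers fill 608 slots for exactly 6 stages.
    print(map_stages(3648, 19, 32))
    # Hypothetical: 5 failed mappers are rerun, adding a nearly empty stage.
    print(map_stages(3648 + 5, 19, 32))
```

Under this model, the planned job fills all 608 slots for six full stages, while even a handful of reruns adds a seventh stage in which almost the entire cluster sits idle for the length of one full WET file.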

I think the moral of the story here is that it’s not easy to exactly tune a map reduce job to fit onto a cluster. There will be intermittent failures, and that’s just something you have to deal with, but larger batch sizes and smaller individual map operations will reduce the overhead associated with each failure.

Actually, if mrjob allowed scheduling multiple jobs onto the same EMR cluster, it wouldn’t have mattered as much: if one of the jobs was stuck waiting on a few stragglers to complete, the rest of the cluster would be busy processing the next job.


