Hacking for Sales, Part 2

In the last post, we introduced a lot of programming concepts and pulled one page of results from Crunchbase. In this next part, we will open the links to the individual company pages and scrape the juicy bits we’ve been wanting all along.

Ready? Set. Scrape!

In Part 1, we left with a nested list of company names and their respective Crunchbase pages:

[['http://www.crunchbase.com/company/a-s-professional-security-services', u'A & S Professional Security Services'], ['http://www.crunchbase.com/company/a-4-bandas-media', u'A 4 Bandas Media'], ['http://www.crunchbase.com/company/a-basket-for-every-occasion', u'A Basket For Every Occasion'], ['http://www.crunchbase.com/company/a-beautiful-site', u'A Beautiful'], ['http://www.crunchbase.com/company/a-better-opinion', u'A Better Opinion']]

Sorry for the letdown. I know that’s not what you paid me all this money for.

Let’s fire up python and start by saving this list into a variable.

list = [['http://www.crunchbase.com/company/a-s-professional-security-services', u'A & S Professional Security Services'], ['http://www.crunchbase.com/company/a-4-bandas-media', u'A 4 Bandas Media'], ['http://www.crunchbase.com/company/a-basket-for-every-occasion', u'A Basket For Every Occasion'], ['http://www.crunchbase.com/company/a-beautiful-site', u'A Beautiful'], ['http://www.crunchbase.com/company/a-better-opinion', u'A Better Opinion']]

We’ll come back to this list at the end and iterate through all of the elements. For now, let’s just take the first one.

first = list[0]

You should recall that lists have indices, and to get the first “thing” (i.e. element) in the list, you put its index in brackets (indices begin at zero). Type “first” and press enter to make sure that it prints ['http://www.crunchbase.com/company/a-s-professional-security-services', u'A & S Professional Security Services'].

Now, let’s isolate the company URL and get to work.

first_url = first[0]

Note: We could have also simply done first_url = list[0][0]. It should be intuitive why that works.
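Here’s a small sketch of why both routes land on the same URL. It uses a shortened two-company copy of the list, and the name companies is made up for this example.

```python
# A shortened copy of the Part 1 list: each inner list is [url, name].
companies = [
    ['http://www.crunchbase.com/company/a-s-professional-security-services',
     'A & S Professional Security Services'],
    ['http://www.crunchbase.com/company/a-4-bandas-media', 'A 4 Bandas Media'],
]

first = companies[0]     # the first inner list
first_url = first[0]     # the first item of that inner list

# Chaining the two lookups reaches the same value in one step.
same_url = companies[0][0]

print(first_url)
print(first_url == same_url)
```

The first bracket picks the inner list, and the second bracket picks the item inside it.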

Step 1. Isolate the goods. 

View your link in a browser that has Firebug or an equivalent tool to inspect the HTML on a page. I use Chrome because the element inspector is built in. The image to the right shows what we’re looking for. We want to grab only the info presented here, and do so in a way that minimizes the likelihood of rogue data breaking our script.

If you inspect the HTML circled here (in Chrome, right click to “Inspect Element”), you’ll see that this is structured as a table, with the descriptions on the left and the data on the right. While tables are great for copying and pasting, they’re usually not great for scraping. This means our scraping code will be ugly. Future tutorials will have nice clean CSS IDs and tags to work with. For now, we’re getting down and dirty with Crunchbase.

Here’s the HTML we’re working with:

<div class="col1_content">
			<tr><td class="td_left">Website</td><td class="td_right"><a href="http://www.as-profsecurity.com" target="_self" title="as-profsecurity.com">as-profsecurity.com</a></td></tr>
<tr><td class="td_left">Blog</td><td class="td_right"><a href="http://www.abercrombie-and-fitch.biz" target="_self" title="abercrombie-and-fitch.biz">abercrombie-and-f...</a></td></tr>
			<tr><td class="td_left">Category</td><td class="td_right"><a href="/companies?q=security" title="Security">Security</a></td></tr>
			<tr><td class="td_left">Phone</td><td class="td_right">(800)427-5471</td></tr>
			<tr><td class="td_left">Email</td><td class="td_right"><a href="mailto:%67%6c%79%6e%64%61@%61%73-%70%72%6f%66%73%65%63%75%72%69%74%79.%63%6f%6d" rel="nofollow" title="glynda@as-profsecurity.com">glynda@as-profsec...</a></td></tr>
			<tr><td class="td_left">Employees</td><td class="td_right"><span id="num_employees">500</span><span id="linkedin" class=" company-insider-pop-up"><img src="http://www.linkedin.com/img/icon/icon_company_insider_in_12x12.gif" width="12" height="12"></span><script language="javascript">new LinkedIn.CompanyInsiderPopup("linkedin", "A &amp; S Professional Security Services");</script></td></tr>
			<tr><td class="td_left">Founded</td><td class="td_right">4/86</td></tr>

Step 2. Decide the Approach.

I checked, and on this page there are five divs with the “col1_content” class. So even the div containing our goods isn’t unique. This is going to be a brute force approach. We’re going to run a find_all on all the <td> tags with class “td_left”. For each of those, we’ll move to the next <td> over and capture the data. This time we’re building a dictionary, because I don’t know ahead of time which rows each company will have, or how many. A list isn’t a good fit here: if one company has a website but no email address, or vice versa, the same position in the list would hold different things for different companies. Does that make sense? It should become clear later.
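To make the list problem concrete, here’s a tiny sketch with two made-up companies (the example.com and example.org values are placeholders, not Crunchbase data). The second company has no website listed.

```python
# Stored as plain lists, values shift position when a field is missing.
list_a = ['http://example.com', 'info@example.com']   # [website, email]
list_b = ['sales@example.org']                        # email only

# Position 0 is a website for company A but an email for company B.
print(list_a[0])
print(list_b[0])

# Stored as dictionaries, every value carries its own label.
dict_a = {'Website': 'http://example.com', 'Email': 'info@example.com'}
dict_b = {'Email': 'sales@example.org'}

print(dict_b['Email'])
# .get() returns None instead of raising an error when a label is missing.
print(dict_b.get('Website'))
```

With the dictionaries, asking for “Email” always gets an email, no matter which other rows a company happens to have.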

Step 3. Sure boss, let’s try it. 

I’m actually making this up as I go. Let’s get the basic libraries loaded and load the soup.  This stuff was all covered in Part 1 of our saga.

import urllib, re
from bs4 import BeautifulSoup
url = 'http://www.crunchbase.com/company/a-s-professional-security-services'
page = urllib.urlopen(url).read()
soup = BeautifulSoup(page)

From here, let’s try to get all of those td_left’s.

results = soup.find_all('td', { "class" : "td_left" })

I’m not going to explain this syntax, other than to say it’s in the BeautifulSoup documentation, and you can replace “class” with any HTML attribute you want to target, and “td_left” with whatever the value of the attribute is. So if this tag had id="first_name" then I would use { "id" : "first_name" } in the find_all function here.

You should see this when you print the results:

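[<td class="td_left">Website</td>, <td class="td_left">Blog</td>, <td class="td_left">Category</td>, <td class="td_left">Phone</td>, <td class="td_left">Email</td>, <td class="td_left">Employees</td>, <td class="td_left">Founded</td>]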

Kinda perfect, huh? That’s only part accident. Now let’s parse and save only the data we want.

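goodie = {}
for r in results:
	if r.nextSibling.a:
		# The cell holds a link, so keep where the link points
		goodie[r.string] = r.nextSibling.a['href']
	else:
		# No link, so keep the cell's plain text
		goodie[r.string] = r.nextSibling.string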

I’m actually kinda proud of this code. This builds a Python dictionary. A dictionary is different from a list in that you can call an element by its name. So within a dictionary, I don’t need to know which index the website is on. I can just do goodie["Website"] and it’ll show the website (instead of something like goodie[2], as you’d do with a list). The names are case-sensitive, so it has to be “Website” exactly as it appears in the table.

Since I know this data table is not necessarily the same for every company, I can’t hard code where the Website or Email address is. The email is not conveniently given an ID or class attribute for fast finding, and it’s not always the second row down. Instead, I’m iterating through the result of our search for <td class="td_left"> tags, and saving them into my dictionary. Two notes on this code:

1. The first line defines goodie as a dictionary. You can tell because of the curly brackets.

2. Python is awesome about if statements. An “if” test comes out false when the value is zero (the number, not the word), None, False, or something empty like “”, an empty list, or an empty dictionary. I know that if there is no <a> in the next sibling, BeautifulSoup gives back None. If I assume that there’s an <a> tag (like for Website and Email) when there’s not (like for Phone), my script will break and you’ll see an error like “TypeError: ‘NoneType’ object is not subscriptable”. Instead, I test to see if there’s an <a>, and if so, I capture the link. If not, I capture the string.
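You can try the “if” behavior without any scraping at all. In this sketch, the variable link is a stand-in for what comes back when a cell has no <a> tag, and the phone number is just the sample value from the table above.

```python
# Values Python treats as false in an "if" test.
candidates = [None, False, 0, "", [], {}]
falsy = []
for value in candidates:
    if value:
        print("%r counts as true" % (value,))
    else:
        print("%r counts as false" % (value,))
        falsy.append(value)

link = None   # stand-in for "this cell has no <a> tag"

# Without a check, asking None for ['href'] raises a TypeError.
try:
    link['href']
except TypeError as error:
    print("Without the check: %s" % error)

# With the check, we fall back to the cell's plain text.
if link:
    value = link['href']
else:
    value = "(800)427-5471"
print(value)
```

The exact wording of the TypeError message differs between Python versions, which is why the sketch prints whatever message it gets.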

If you print goodie, you should see this:

{u'Website': 'http://www.as-profsecurity.com', u'Category': '/companies?q=security', u'Employees': None, u'Founded': u'4/86', u'Blog': 'http://www.abercrombie-and-fitch.biz', u'Phone': u'(800)427-5471', u'Email': 'mailto:%67%6c%79%6e%64%61@%61%73-%70%72%6f%66%73%65%63%75%72%69%74%79.%63%6f%6d'}

By the way, where it says Employees: None, that doesn’t mean there are no employees. None is Python’s “nothing here” value (it would have to be in quotes to be the literal text “None”). It shows up because the Employees cell has several nested tags, so there’s no single string to hand back, and I don’t care enough about that data to grab it. Really, I’m probably only going to use Website, Blog, and Email addresses. Maybe Phone if the data looks good.
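If you ever do want that nested Employees number, one option is to collect all the text inside the cell instead of asking for a single string. Here’s a rough sketch using Python’s built-in HTML parser rather than BeautifulSoup, so it’s a swapped-in technique and not what the rest of this post uses. The RowCollector class and the trimmed SAMPLE markup are made up for the example.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2

# Two rows trimmed from the table above (the <script> tag is left out).
SAMPLE = (
    '<tr><td class="td_left">Phone</td>'
    '<td class="td_right">(800)427-5471</td></tr>'
    '<tr><td class="td_left">Employees</td><td class="td_right">'
    '<span id="num_employees">500</span>'
    '<span id="linkedin"><img src="icon.gif"></span></td></tr>'
)

class RowCollector(HTMLParser):
    """Pair each td_left label with all the text in the td_right after it."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.rows = {}
        self.cell = None    # class of the <td> we are inside, if any
        self.label = None   # most recent td_left text
        self.text = []      # text pieces collected for the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.cell = dict(attrs).get("class")
            self.text = []

    def handle_data(self, data):
        if self.cell:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag != "td":
            return
        value = "".join(self.text).strip()
        if self.cell == "td_left":
            self.label = value
        elif self.cell == "td_right" and self.label:
            self.rows[self.label] = value
            self.label = None
        self.cell = None

collector = RowCollector()
collector.feed(SAMPLE)
collector.close()
print(collector.rows)
```

Because the text is gathered from every nested tag, the Employees row ends up holding the number from the inner span rather than None.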

Step 4. Check data for weird stuff. Fix.

Speaking of email, yick! What’s that weird stuff after the mailto:? Turns out that’s easy to decode and Crunchbase is just trying to do these folks a favor by obscuring their emails so auto-scrapers don’t harvest them. But we’re not auto-scraping. We’re sales scraping.

See what I mean?  Let’s fix this.


First of all, we don’t want the mailto: part. That’s just for HTML, so that when you click on the link, it opens an email application. To get rid of it, you need a regular expression. Hooray, more regex!

# The hyphen goes last inside the brackets so it means a literal "-", not a range
result = re.search(r'((mailto:)([%@.0-9A-Za-z-]*))', goodie["Email"], re.I)
encoded_email = result.group(3)

Now encoded_email has just the encoded email, not the mailto: part.  I’m not going to explain the regex syntax or what group(3) means. You’ll have to google it. Try typing result.group(2) and result.group(1). You’ll start to figure out what’s going on, but it’s beyond the scope of this post because it doesn’t really matter.
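If you’d rather see the groups than google them, here’s a short sketch run against the sample link from this page. Groups are numbered by the order of their opening parentheses. Note one assumption: this pattern puts the hyphen last inside the square brackets so it is read as a literal “-” rather than as a range between two characters.

```python
import re

href = ('mailto:%67%6c%79%6e%64%61@%61%73-'
        '%70%72%6f%66%73%65%63%75%72%69%74%79.%63%6f%6d')

# Three pairs of parentheses, so three numbered groups.
pattern = r'((mailto:)([%@.0-9A-Za-z-]*))'
result = re.search(pattern, href)

print(result.group(1))   # the whole match, "mailto:" included
print(result.group(2))   # just the "mailto:" prefix
print(result.group(3))   # just the encoded address
```

Group 1 is the outer pair of parentheses, so it wraps both of the others.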

Great, we’ve isolated the email address. Now let’s decode. Fortunately, Python makes that really easy!

decoded_email = urllib.unquote(encoded_email)

Poof, decoded_email now equals glynda@as-profsecurity.com. Let’s overwrite the original email in the dictionary simply with:

goodie["Email"] = decoded_email
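To see the decoding on its own, here’s a sketch you can paste into the interpreter. Each %xx chunk is a character written as a hexadecimal number, so %67 is “g”. The try/except import is only there so the sketch runs on Python 3 as well, where unquote lives in urllib.parse.

```python
try:
    from urllib.parse import unquote   # Python 3
except ImportError:
    from urllib import unquote         # Python 2, as used in this post

encoded_email = ('%67%6c%79%6e%64%61@%61%73-'
                 '%70%72%6f%66%73%65%63%75%72%69%74%79.%63%6f%6d')

# 0x67 is the hexadecimal code for the letter "g".
print(chr(0x67))

decoded_email = unquote(encoded_email)
print(decoded_email)

# Overwrite the encoded value in a stand-in dictionary.
goodie = {"Email": "mailto:" + encoded_email}
goodie["Email"] = decoded_email
print(goodie)
```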

Step 5. Tie it all together.

You’ll find that in the process of coding, you’ll hack in snippets. Problems come up, you solve them in a few lines of code, and move on. Soon the problem becomes keeping all of this code organized. To clearly tie this together, I’ll need to show you functions. These blocks of code will make your code much easier to read (which is super helpful later on, when you adapt your scraping code for other websites).

Let’s outline how this thing works:

  • Use www.crunchbase.com/companies?c=a to build a nested list of company names and the detail URLs that hold the data we really want
  • Open each detail URL and isolate only the data we want
  • Decode the email address
  • Save a list of dictionaries holding the data we wanted

It’s no coincidence that my functions will follow this outline. This part should look familiar from Part 1.

def get_crunchbase_companies(url):
	page = urllib.urlopen(url).read()
	soup = BeautifulSoup(page)
	links = soup.find_all('a', href=True)
	all_links = []
	for link in links:
		all_links.append([link['href'], link.string])
	clean_links = []
	for a in all_links:
		if a[0]:
			match = re.search(r'(/company/[a-zA-Z0-9-]*[^"])', a[0], re.I)
			if match:
				full_match = "http://www.crunchbase.com"+match.group(1)
				clean_links.append([full_match, a[1]])
				print 'Caught:', a
	return clean_links

A couple of notes.

1. A function definition starts with def and its first line ends with a colon. Everything inside the function needs to be indented under that line. You can pass data into functions, but you need to tell the function to expect some data by putting a variable name inside the parentheses. If the function takes no parameters, you leave the parentheses empty, like ‘def my_empty_function():’.

2. You also need to tell the function to return something if you want anything back. That’s what ‘return clean_links’ does. Otherwise, clean_links will not be available outside the function.
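Here’s a small sketch of both notes. The two function names are made up for this example and aren’t part of the scraper.

```python
def build_url(slug):
    # "slug" is the parameter: the function expects one piece of data.
    full_url = "http://www.crunchbase.com/company/" + slug
    return full_url   # hand the result back to whoever called us

def build_url_no_return(slug):
    # Same work, but with no return the caller gets None back.
    full_url = "http://www.crunchbase.com/company/" + slug

with_return = build_url("a-better-opinion")
without_return = build_url_no_return("a-better-opinion")

print(with_return)
print(without_return)
```

The second function builds the same string, but it disappears when the function ends because nothing hands it back.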

Likewise, I’ll take the other parts of our code and function-ize them.

def get_crunchbase_companies_data(url):
	page = urllib.urlopen(url).read()
	soup = BeautifulSoup(page)
	results = soup.find_all('td', { "class" : "td_left" })
	goodie = {}
	for r in results:
		if r.nextSibling.a:
			if r.string == "Email":
				# Email links are obscured, so decode them first
				goodie[r.string] = decode_email(r.nextSibling.a['href'])
			else:
				goodie[r.string] = r.nextSibling.a['href']
		else:
			# No link in this cell, so keep the plain text
			goodie[r.string] = r.nextSibling.string
	return goodie

I added a new if statement here to check if we need to use the email decoding function (shown below). First I check for the <a> tag, then I check to see if the string says “Email”. Important note here: when you check the value of something, you need to use two equals: ‘==’. The reason is Python always thinks that when you have a single equals sign, you’re assigning a value. When you use two equals signs, you’re checking for value. This will become intuitive!
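A quick sketch of the difference, using a made-up variable called label:

```python
label = "Email"   # one equals sign: store "Email" in label

is_email = (label == "Email")   # two equals signs: compare the values
is_lower = (label == "email")   # comparison is case-sensitive

print(is_email)
print(is_lower)

if label == "Email":
    print("This row needs decoding")
```

The comparison gives back True or False, which is exactly what an if statement wants.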

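def decode_email(raw_email):
	# The hyphen goes last inside the brackets so it means a literal "-", not a range
	result = re.search(r'((mailto:)([%@.0-9A-Za-z-]*))', raw_email, re.I)
	if not result:
		return raw_email  # not a mailto: link, so hand it back untouched
	encoded_email = result.group(3)
	decoded_email = urllib.unquote(encoded_email)
	return decoded_email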

This could have been nested in the function above, but some day on some other site I might run into this again, and it’s nice to have it in a tight little function of its own.

And finally, we reference these functions in a clean bit of “master” code.

clean_links = get_crunchbase_companies('http://www.crunchbase.com/companies?c=a')
goodies = []
for link in clean_links:
	goodie = get_crunchbase_companies_data(link[0])
	goodies.append(goodie)  # keep each company's dictionary

If you run this, it’ll take a while and you’ll be staring at a blank Terminal. Instead, it’s much more rewarding (and fun!) to make your console tell you what’s happening. It makes you feel like a real hacker, and it’s easy to do with some print commands.

clean_links = get_crunchbase_companies('http://www.crunchbase.com/companies?c=a')
goodies = []
for link in clean_links:
	print "Scraping", link[0]
	goodie = get_crunchbase_companies_data(link[0])
	goodies.append(goodie)  # keep each company's dictionary

You’ll see something like the image featured on top. For Part 3, we’ll go over how to avoid all of this and use the Crunchbase API instead!