I've been a Python sales hacker for about two years. I taught myself Python so I could become better at sales. And now I want to help other salespeople learn, too.
There has been a lot of talk about unicorns, those elusive coders who are just as comfortable iterating in Ruby as they are in Photoshop. I argue that there is an equally sublime startup creature: the salesperson who can code.
In this brief project-based tutorial I'll get you started on the #1 reason someone in sales might learn to program - scraping. In these lessons we'll cover a lot of stuff. I'm going to teach you the practical points and gloss over the minutiae that only your engineers will care about.
Let me be very clear. The goal is not to make you a brilliant coder. The goal is to help you make more sales. My code is sloppy. It's probably not the best way to do things. But it works, and for me, it's about the end, not the means. I'll even coin a term for this quick and dirty coding style: sales hacking.
Some of you may be curious why this tutorial is in Python instead of Ruby. I have no good answer for you. It's what I picked up first and I never looked back. The concepts here can be translated to any other language, but the code examples and libraries I use are specific to Python. Someone else can write the same tutorial for Ruby. If you do, send me the link!
I use a Mac. You should too. If you're on Windows, pay $10/mo for a Rackspace Linux server you can ssh into and follow all of my examples there. It's really silly that the Windows command prompt is so bad. Because of that, I'm not even going to bother trying to work around it. Follow this tutorial on a Mac or get a Linux server.
Now let's begin! Our first victim is Crunchbase.
Let's say you want the website and email address of every business on this page. There's no single table with this data. Instead, it's organized alphabetically, with a dedicated page for each letter. You have to click the company name before the data you want is available.
In Part 1, we'll create a list of all of these company names and the links that go to their detail pages. The list will look like this:
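Something like this sketch, with hypothetical companies standing in for the real ones:

```python
# A made-up preview of the final list: [company name, profile link] pairs.
# Real entries will be actual company names and Crunchbase paths.
company_links = [
    ['Acme Widgets', '/company/acme-widgets'],
    ['Beta Labs', '/company/beta-labs'],
    ['Gamma Analytics', '/company/gamma-analytics'],
]
```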
Don't worry if that's tough to decipher. We'll get to it.
The nested directories on this Crunchbase page are very common and unfriendly to the untechnical salesperson. If you want to build a target list on Crunchbase without sales programming, your only option is to find someone overseas to copy and paste, click by click, line by line. Sounds horrible, doesn't it? (Note: Yes there is a Crunchbase API you could use, but I don't get to that until Part 3. For now, assume the API is not an option.)
Anything that requires clicking and copying in simple patterns is a coding problem. Python to the rescue!
To scrape the A's in Crunchbase, we'll have to cover these programming concepts:
importing and installing libraries
storing data in variables
making lists and arrays
iterating through lists and arrays
using regular expressions
opening and writing to files
As warned, that's a lot of stuff. But learning it will be fun! Get excited! And if you make it through this, you'll be well on your way to even more powerful tools for your sales programming arsenal. Note: if you crank through this, it should take you less than 30 mins. If you really try to learn it, you should spend at least an hour.
Step 0. Get easy_install.
To get the BeautifulSoup library, you should have easy_install. To get easy_install, you need setuptools.
To get setuptools, go here and follow the instructions: http://pypi.python.org/pypi/setuptools. You'll need to download a file. Then, in a Terminal window, go into your Downloads directory (where the file probably is) by typing "cd ~/Downloads/" and pressing enter, and type "sh setuptools-0.6c9-py2.4.egg". Replace the egg file name with whatever your downloaded file is called. Don't include those quotes or the trailing period. Note: If you see a permissions error, you probably need to use "sudo sh setuptools..". Putting that sudo in front of the last command will prompt you for a password, and then the script should run successfully.
You'll see some log messages and if your laptop doesn't blow up, you've probably installed it successfully.
Step 1. Get BeautifulSoup.
This is why you got easy_install. Open a Terminal window (or use your Linux prompt) and type [py]easy_install beautifulsoup4 [/py]
BeautifulSoup is a parsing library. This means you can load structured data (HTML, XML, etc.) into a variable, pass it into BeautifulSoup, and step through the structured elements. Not making sense? Go on... you'll see what I mean.
Step 2. Launch Python and import BeautifulSoup
For now, we're going to do all of our coding in the Python console. You'll get immediate feedback this way.
However, as you find code snippets that work, store them in a text editor and save the file with a ".py" extension. Don't use MS Word. My favorite editor is TextMate. You can also use the native Mac TextEdit (but it won't come with the fancy coloring schemas and hotkeys). You can also use a Linux native editor like vim directly within the Terminal console. Google the merits of each, try them out, and decide what works best for you. That's all I'll say about text editors.
Open the Terminal window and type "python". You'll see a prompt appear with three prepended carets, like ">>> ". Now you're ready to rock. We'll do all of this programming from the Python prompt in Terminal. I'll go over how to save scripts as .py files in the next part.
To make sure that BeautifulSoup was installed properly, type "from bs4 import BeautifulSoup". If Python doesn't spit anything nasty back at you, you're all set.
Step 3. Let's start coding!
I'm going to paste a block of code here. We can discuss each piece line by line.
[py]import urllib[/py]
This is the standard way you'll import native Python libraries. When Python loads, it doesn't bring all native functions onboard with it. You need to tell Python when you want to use some of the "extra" functions.
[py]from bs4 import BeautifulSoup[/py]
Similarly to urllib, we have to tell Python we want to use BeautifulSoup. The "from bs4" part is needed because BeautifulSoup lives inside a package called bs4; "from" reaches into that package and pulls out just the piece you want, so you can call BeautifulSoup directly. Either way, BeautifulSoup won't work unless you import it this way.
[py]url = 'http://www.crunchbase.com/companies?c=a'[/py]
Woohoo, it's your first variable! You didn't need to tell Python that you intended for "url" to store some text data before you went and stored it. In Python, you can create a home and move stuff into it with one line of code. There's nothing special about "url". We could have just as easily stored the web address in a variable called "url_from_crunchbase_that_I_want_to_scrape". But why type more than necessary?
Let's also note here that when you store strings (text) into variables, you need quotes. Double quotes would have worked fine, too.
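For instance (the variable names here are just examples):

```python
# Single and double quotes both make strings; the values are identical.
greeting_single = 'hello sales hackers'
greeting_double = "hello sales hackers"

quotes_match = (greeting_single == greeting_double)  # True
```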
[py]page = urllib.urlopen(url).read()[/py]
Hey, it's your first scrape! Yes, just like that, you took HTML off the web and stored it in a variable. Go ahead and make sure. At the Python prompt, type
[py]print page[/py]
Python will tell you the value of any locally stored variable if you simply type the variable name and press enter. Now try typing something new, like "blah" and hit Enter. You got an error, right? Python looked for a variable called "blah" and didn't find one, so it told you 'blah' is not defined.
[py]soup = BeautifulSoup(page)[/py]
Ok, we're close, but "page" contains unstructured plain text data. We can't yet step through all this HTML in a structured way. That's what this line does, using the BeautifulSoup library. You're taking the page variable, with all its raw HTML, and creating a new variable called soup that has been parsed and interpreted by BeautifulSoup. At the Python prompt, type:
[py]print soup[/py]
And press Enter. See the difference?
Step 4. Visualize the Goal
So far, so good, right? Here's where it gets difficult. The BeautifulSoup documentation kinda sucks. You have to know programming to be able to read it, but give it a skim anyway.
We're going to cover a lot of ground here. Get ready!
I'm going to start by describing in a little more detail what we want to end up with. Let's make a list.
[py]list = ['a', 123, 'xyz', url] #I assume url is still defined. If not, go back a step.[/py]
Lists are bounded with square brackets and have comma-separated elements. In this list, 'a' is a string, 123 is a number (integer), 'xyz' is a string, and url is a variable containing a string.
Lists have indices starting at 0, so 123 is at index 1. To pull the first element of the list, do list[0]. It will return 'a'. Lists are a great way to store data because they are really easy to iterate over. Check this out.
for l in list:
    print l
You should be able to read that for loop and know what it does. The only confusing thing is where "l" came from. I made it up.
for whatever in list:
    print whatever
This does the same thing! Python simply assigns each element in list to the variable "whatever" and then prints it. At each iteration, "whatever" is overwritten. I can prove it. After the loop finishes, type:
[py]print whatever[/py]
That should return the same thing as "print url" because Python set whatever equal to url in the last iteration.
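You can see the same behavior with any list. Here's a tiny self-contained sketch (the names are made up):

```python
# The loop variable sticks around after the loop ends,
# holding whatever it was assigned on the last iteration.
letters = ['a', 'b', 'c']
for whatever in letters:
    pass  # normally you'd print or process each element here

last_seen = whatever  # still bound to the final element, 'c'
```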
And guess what else? You can have a list of lists, too.
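For example, here's a sketch with made-up companies (the names and paths are hypothetical, not real Crunchbase data):

```python
# A list of two-element lists: [company name, relative profile URL].
companies = [
    ['Acme Widgets', '/company/acme-widgets'],
    ['Beta Labs', '/company/beta-labs'],
]

# Index twice to reach inside: first the row, then the column.
name_of_first = companies[0][0]   # 'Acme Widgets'
url_of_second = companies[1][1]   # '/company/beta-labs'
```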
And so on. Each element in the base list is a list with two elements: the name of the company and the Crunchbase URL of that company's profile page.
Step 5. Parse the Soup
This just might blow your mind.
[py]links = soup.find_all('a', href=True)[/py]
Just like that, we have all the links on this page collected in one go. To print them, let's iterate through the result.
for link in links:
    print link
Cool, ya? But we want the href part and the string inside of it, not all that HTML junk. Fortunately, if you recall from the BeautifulSoup documentation (I'm joking), that's pretty easy to get.
for link in links:
    print link['href'], link.string
Yep, the print command can take multiple variables on the same line if you separate them by a comma.
Now let's build our nested list.
all_links = []
for link in links:
    all_links.append([link.string, link['href']])
The for loop here should make sense. In each iteration, we're appending a list to the empty list all_links. We have to create the empty list first (all_links = []) because Python won't let you append to a list that doesn't exist yet. Append takes whatever you put in it; we're creating a list on the fly by putting these two strings inside the square brackets.
Check it out:
[py]print all_links[/py]
That's what we wanted, right? Well, sorta. There's still some junk we don't want, like links to licensing policies. Sigh, still more steps to go.
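If you want to play with the append pattern offline, away from Crunchbase, here's a self-contained sketch with made-up stand-in data:

```python
# Stand-in for what find_all hands back: (link text, href) pairs.
# These values are hypothetical -- the real ones come from the soup.
fake_links = [
    ('Acme Widgets', '/company/acme-widgets'),
    ('Privacy Policy', '/privacy'),
]

all_links = []
for text, href in fake_links:
    # Build a two-element list per link and append it.
    all_links.append([text, href])
```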
Step 6. Oops.. gotta filter too.
Hello, regular expressions! First, tell Python you want the re library by typing "import re". Then:

clean_links = []
for a in all_links:
    b = a[1]
    match = re.search(r'(/company/[a-zA-Z0-9-]*[^"])', b, re.I)
    if match:
        full_match = 'http://www.crunchbase.com' + match.group(1)
        clean_links.append([a[0], full_match])
        print full_match
    else:
        print 'Caught:', a

You know the for loop structure. "a" and "b" are made up by me. Use whatever variable names you want.
This is the structure for a regular expression match. I just want to point out that the thing I'm feeding into the pattern for testing is "b". I want to test the second element in each list, which is the URL. I've tested this pattern and found that it matches the structure of all the Crunchbase URLs.
If there's a match, I concatenate the base Crunchbase URL onto the relative URL portion to build the full URL. In Python, you concatenate two strings with a plus sign. I store the result in a variable I'm calling full_match, append the company name and full_match to clean_links, and then print it to the screen.
If there's no match (that's what the else: means), I print "Caught:" and then the list that it caught. Remember, I only tested the second element of the list. I could have also printed b to show only the second element.
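You can try the same pattern offline with made-up hrefs (no scrape needed; every value here is hypothetical):

```python
import re

# Hypothetical [name, href] pairs: one real company link, one junk link.
candidates = [
    ['Acme Widgets', '/company/acme-widgets'],
    ['Licensing Policy', '/legal/licensing'],
]

clean = []
caught = []
for name, href in candidates:
    # Same pattern as above: match Crunchbase company paths.
    match = re.search(r'(/company/[a-zA-Z0-9-]*[^"])', href, re.I)
    if match:
        # Prepend the base URL to the relative path.
        clean.append([name, 'http://www.crunchbase.com' + match.group(1)])
    else:
        caught.append([name, href])
```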
We're actually only halfway there (maybe a little less), but this is an important milestone. Let's recap.
We're using BeautifulSoup to capture all the links on a Crunchbase page and build a nested list of company names and URLs that have been filtered to only have the types of links we're looking for.
Here's the consolidated code.
import urllib, re
from bs4 import BeautifulSoup
url = 'http://www.crunchbase.com/companies?c=a'
page = urllib.urlopen(url).read()
soup = BeautifulSoup(page)
links = soup.find_all('a', href=True)
all_links = []
for link in links:
    all_links.append([link.string, link['href']])

clean_links = []
for a in all_links:
    b = a[1]
    match = re.search(r'(/company/[a-zA-Z0-9-]*[^"])', b, re.I)
    if match:
        full_match = 'http://www.crunchbase.com' + match.group(1)
        clean_links.append([a[0], full_match])
        print full_match
    else:
        print 'Caught:', a
Type 'clean_links' to see what you got. The first few lines should look like this:
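Here's a hypothetical sketch of that shape (the names below are made up, not real Crunchbase output):

```python
# Shape of clean_links: [name, absolute profile URL] pairs.
clean_links = [
    ['Acme Widgets', 'http://www.crunchbase.com/company/acme-widgets'],
    ['Beta Labs', 'http://www.crunchbase.com/company/beta-labs'],
]

# Every URL should now be absolute, pointing at a company profile.
all_absolute = all(
    url.startswith('http://www.crunchbase.com/company/')
    for name, url in clean_links
)
```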