I’ve recently been experimenting with a new project that scrapes data from web pages on the Tor network. For simplicity’s sake, I decided to write this bit of code in Python and use the handy urllib2 module to handle the HTTP requests.

For those who don’t know, Tor runs a SOCKS5 proxy that listens on 127.0.0.1:9050 by default. I thought things would be as simple as telling urllib2 to use a proxy at IP 127.0.0.1 and port 9050, but I quickly found that this doesn’t work: urllib2’s proxy support speaks HTTP, not SOCKS.
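
For reference, the naive attempt looks roughly like the sketch below. This is my reconstruction of the obvious approach, not code from the project itself, and the try/except import is only there so the snippet also loads on Python 3, where urllib2 became urllib.request. Nothing is actually fetched here.

```python
try:
    import urllib2 as urlrequest          # Python 2, as used in this post
except ImportError:
    import urllib.request as urlrequest   # Python 3 home of the same API

# ProxyHandler treats this address as an HTTP proxy; it has no SOCKS mode.
proxy = urlrequest.ProxyHandler({"http": "127.0.0.1:9050"})
opener = urlrequest.build_opener(proxy)

# Calling opener.open("http://example.com/") at this point would send an
# HTTP proxy request to Tor's SOCKS port, which is not what Tor expects.
```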

Luckily, after a bit of digging, I found a solution. It turns out that urllib2 ultimately opens its connections with the create_connection() function from Python’s socket module. If we take a look at the code for this function, we can see where our problem lies:

def create_connection(address, timeout=_GLOBAL_DEFAULT_TIMEOUT,
                      source_address=None):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple (host,
    port)) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:getdefaulttimeout
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    msg = "getaddrinfo returns an empty list"
    host, port = address
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket(af, socktype, proto)
            if timeout is not _GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
            sock.connect(sa)
            return sock

        except error, msg:
            if sock is not None:
                sock.close()

    raise error, msg

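The key line is the getaddrinfo() call, which resolves the hostname on the local machine before any socket is created. If you want to see that for yourself, here is a small sketch that temporarily swaps socket.getaddrinfo for a recording stub. The stub, the lookups list, and the example.onion hostname are all made up for illustration, and the stub raises instead of touching the network.

```python
import socket

lookups = []  # hostnames that create_connection() tried to resolve locally

def recording_getaddrinfo(host, port, *args, **kwargs):
    # Hypothetical stand-in: note the lookup, then fail instead of doing DNS.
    lookups.append(host)
    raise socket.gaierror("lookup blocked for demonstration")

original_getaddrinfo = socket.getaddrinfo
socket.getaddrinfo = recording_getaddrinfo
try:
    # create_connection() looks up getaddrinfo in the socket module at call
    # time, so it hits the stub before it ever creates a socket.
    socket.create_connection(("example.onion", 80), timeout=1)
except socket.gaierror:
    pass
finally:
    socket.getaddrinfo = original_getaddrinfo  # always restore the real one

print(lookups)
```

The hostname ends up in lookups, which means the name resolution happens in our own process rather than inside the proxy.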

Looking at create_connection(), we can see that it calls getaddrinfo() to resolve the hostname with the system’s default resolver before it connects. So even though we specified that Tor should be used as our proxy, the DNS request still bypasses the Tor network. Luckily, we can write our own create_connection() function and jerry-rig it into the socket module before we load urllib2. Doing this forces the DNS request to go through Tor, which lets us route our urllib2 traffic through the Tor network. This can be achieved with the following bit of code:

import socket
import socks

# urllib2 uses the socket module's create_connection() function.
# The way the DNS request is done won't work for our Tor connection,
# so we need to jerry-rig our own create_connection() for urllib2
def create_connection(address, timeout=None, source_address=None):
  sock = socks.socksocket()