A Seamless Authenticated Proxy Solution For Selenium

Forwarding Requests To An Authenticated Proxy With Squid

In my web scraping adventures, I’ve run into the problem of using an authenticated proxy with Selenium. Configuring a proxy is relatively easy in most web browsers, but handling proxy authentication in Selenium is far less trivial. In Chrome, for instance, the solution is to create an extension with a listener on the onAuthRequired event so it knows when to submit the credentials for the proxy. These hacks also vary from browser to browser.

The one constant across all browsers is that configuring them to use an unauthenticated proxy is simple. So, in an effort to support many browser types in my web scraping framework, I decided to set up my own local proxy that forwards requests to the authenticated proxy. This removes the authentication problem from Selenium and offloads it to the local proxy, which handles it easily.

For this I’ll be using a popular open source proxy called Squid (https://en.wikipedia.org/wiki/Squid_(software)). I’ll be running this in Docker and using Ubuntu’s image (https://hub.docker.com/r/ubuntu/squid).

Configuration

First we need a configuration file for Squid to use. It ends up being rather short, since we’re essentially telling Squid to treat the authenticated proxy as a parent cache that it must use for every request. We’ll name this file squid.conf:

http_access allow all
http_port 3128

coredump_dir /var/spool/squid3

cache_peer proxy.com parent 8080 0 no-query default login=user:password
never_direct allow all

Let’s walk through the critical line in this file, which is the one you’ll need to modify for your own use:

cache_peer proxy.com parent 8080 0 no-query default login=user:password

cache_peer tells Squid about another caching server it can use when serving requests. It takes several parameters, some of which you’ll want to update.

  • proxy.com: This is the hostname or IP address of the cache peer (parent cache server). You’ll want to change this to your proxy server’s IP or domain.
  • parent: This indicates that it’s a parent cache, meaning Squid will query this cache server before attempting to fetch directly from the origin server. In our case, since this is actually another proxy server, every request goes through it. Because the configuration also sets never_direct allow all, Squid is not allowed to contact the origin server itself, so if the authenticated proxy is unreachable the request fails with an error instead of bypassing it.
  • 8080: This is the port number on which the cache peer is listening. You’ll want to change this to the port your proxy uses.
  • 0: This is the ICP port of the cache peer, which Squid would use to send ICP (Internet Cache Protocol) queries asking whether the peer already has an object cached. It is not a load-balancing weight. Set it to 0 when the peer does not support ICP, as is the case for a plain upstream proxy.
  • no-query: This tells Squid not to send ICP queries to this cache peer. We don’t want those queries because we’re not actually dealing with any caching here.
  • default: This marks the peer as the last-resort parent that Squid falls back to when no other peer-selection method picks one. Since there are no other cache peers in the configuration, every request goes to it.
  • login=user:password: Specifies login credentials (username and password) to use when authenticating with the cache peer. This should be updated to the username and password for your authenticated proxy.
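
If you manage several upstream proxies, it can help to generate this file rather than edit it by hand. The following is a minimal sketch, not part of the original setup: render_squid_conf is a hypothetical helper name, and it assumes the host, username, and password contain no whitespace (Squid splits directives on whitespace) and that the username contains no colon.

```python
def render_squid_conf(peer_host: str, peer_port: int, user: str, password: str,
                      listen_port: int = 3128) -> str:
    """Return squid.conf text that forwards every request to one authenticated parent proxy."""
    for value in (peer_host, user, password):
        # Squid splits directives on whitespace, so a space would break the cache_peer line.
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(f"value must be non-empty and contain no whitespace: {value!r}")
    if ":" in user:
        # login= takes user:password, so a colon in the username would be ambiguous.
        raise ValueError("username must not contain ':'")

    lines = [
        "http_access allow all",
        f"http_port {listen_port}",
        "",
        "coredump_dir /var/spool/squid3",
        "",
        # hostname, peer type, HTTP port, ICP port (0), then options
        f"cache_peer {peer_host} parent {peer_port} 0 no-query default login={user}:{password}",
        "never_direct allow all",
    ]
    return "\n".join(lines) + "\n"
```

Calling render_squid_conf("proxy.com", 8080, "user", "password") produces the same directives as the file shown earlier, and the result can be written to squid.conf before starting the container.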

Running The Proxy

Now that you have a proxy configuration set up, you just need to run Squid with it. As I said earlier, I’m using Docker to do this. A Docker command that gets this done looks like this:

docker run -d --name squid -p 3128:3128 -v /path/to/your/squid.conf:/etc/squid/squid.conf:ro ubuntu/squid

Now let’s walk through this command in case you have any confusion on what it’s doing:

  • docker run: This is a basic Docker command that tells Docker to start a container with the following arguments.
  • -d: This is a flag we pass to Docker to tell it to run in detached mode, meaning it runs the container in the background and doesn’t attach our current terminal session to the standard output of the container we’re starting.
  • --name squid: This gives our container a user-friendly name of ‘squid’ to use in future commands relevant to this container.
  • -p 3128:3128: This flag maps a host port to a container port. In this case, Squid listens on its default port of 3128 inside the container, so we map the same port on the host to it.
  • -v /path/to/your/squid.conf:/etc/squid/squid.conf:ro: This might be the most important flag to understand, since it tells Docker to mount the configuration file you created on your host filesystem at the default Squid configuration path of /etc/squid/squid.conf. The :ro at the end tells Docker to give the container read-only access to the mounted file.
  • ubuntu/squid: Finally we tell the docker run command what image we want it to use.

After all of this you should be able to view logs from the container by running docker logs squid.
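
If you would rather start the container from a scraping script than from a terminal, the same command can be assembled in Python. This is a rough sketch under a few assumptions: build_docker_command and start_squid are hypothetical helper names, Docker is installed and on the PATH, and the configuration is mounted at /etc/squid/squid.conf, the default path named in the walkthrough above.

```python
import subprocess
from pathlib import Path


def build_docker_command(conf_path: str, host_port: int = 3128,
                         name: str = "squid", image: str = "ubuntu/squid") -> list[str]:
    """Assemble the docker run arguments for the Squid container."""
    # Bind mounts need an absolute host path; a bare relative name
    # would be interpreted by Docker as a named volume instead.
    conf = Path(conf_path).resolve()
    return [
        "docker", "run", "-d",
        "--name", name,
        "-p", f"{host_port}:3128",                   # host port -> Squid's port in the container
        "-v", f"{conf}:/etc/squid/squid.conf:ro",    # read-only config mount
        image,
    ]


def start_squid(conf_path: str) -> str:
    """Start the container and return the container ID that docker prints."""
    result = subprocess.run(build_docker_command(conf_path), check=True,
                            capture_output=True, text=True)
    return result.stdout.strip()
```

Passing the arguments as a list avoids shell quoting problems when the path to squid.conf contains spaces. check=True makes start_squid raise if Docker exits with an error, for example when a container named squid already exists.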

Using The Proxy

We can test our proxy using a simple curl command. Run without a proxy, this curl command should print your machine’s external IP address:

> curl icanhazip.com

Now to run our request through our local proxy, which will then route it to our authenticated proxy, we can add that argument into our request.

> curl --proxy http://localhost:3128 icanhazip.com

If everything worked, we should get the external IP of our authenticated proxy instead of our own external IP. If that’s not the result you got, run docker logs squid and check the Squid container’s logs to see what went wrong.
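
With the local proxy verified, pointing Selenium at it needs no credentials at all. Below is a minimal sketch for Chrome using Selenium’s Python bindings. It assumes the selenium package and a matching Chrome installation are available; proxy_argument, make_driver, and external_ip_via_browser are hypothetical helper names, and other browsers would need their own proxy settings.

```python
def proxy_argument(host: str = "localhost", port: int = 3128) -> str:
    """Build Chrome's --proxy-server switch for the local, unauthenticated Squid proxy."""
    return f"--proxy-server=http://{host}:{port}"


def make_driver(host: str = "localhost", port: int = 3128):
    """Start Chrome with all traffic routed through the local proxy."""
    # Imported here so proxy_argument can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument(proxy_argument(host, port))
    return webdriver.Chrome(options=options)


def external_ip_via_browser() -> str:
    """Load icanhazip.com in the proxied browser and return the IP it reports."""
    from selenium.webdriver.common.by import By

    driver = make_driver()
    try:
        driver.get("http://icanhazip.com")
        return driver.find_element(By.TAG_NAME, "body").text.strip()
    finally:
        # Always close the browser, even if the page fails to load.
        driver.quit()
```

If the setup is working, external_ip_via_browser() should return the same address as the proxied curl command above, since Squid adds the upstream credentials on the browser’s behalf.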
