In this article, I am going to give you all the information and tools necessary to get your web testing automation or web scraping project up and running quickly using Python & Selenium.
Selenium is good for automated testing of websites and web apps, as well as web scraping.
Web scraping involves finding patterns in a web page's source code that can be used to harvest useful data. Python, a language great for working with lists and text, is a natural match for Selenium, an industry-standard tool for browser automation.
Python + Selenium + Beautiful Soup Install Guide
At the time of writing, Python 3.8.5 is the latest version and is what I am linking to and installing. As new versions come out, you will probably want to grab those versions instead. The main Python download page is located here: https://www.python.org/downloads/
Linux Mint 19 Installation
This was in 2019, and I’m not covering Linux in the rest of the tutorial, but here are the commands that worked for me:
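The original command listing was lost in formatting. Linux Mint 19 is based on Ubuntu 18.04, whose stock python3 is older than 3.8, so the deadsnakes PPA below is an assumption about the route taken; building from source also works:

```shell
# Assumed reconstruction: install Python 3.8 on Mint 19 via the deadsnakes PPA
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
sudo apt install -y python3.8 python3-pip
```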
1. Download and install Python 3.8.5
If everything installed correctly, you should get a response similar to this:
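The screenshot didn't survive, but a version check from any terminal confirms the install:

```shell
python3 --version   # should print "Python 3.8.5" if that is the version you installed
```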
2. Next install Selenium & Beautiful Soup 4.x
We use Python’s package manager, pip, to install them.
From a Terminal prompt (macOS) or an elevated CMD/PowerShell prompt (Windows), execute this command:
After that completes, type this command:
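This second command, also lost in formatting, is the standard install of Beautiful Soup 4:

```shell
python3 -m pip install beautifulsoup4
```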
If you get an error, you probably need to install pip, although recent Python installers include pip by default.
Install pip using these two commands in the same terminal (macOS only):
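The two commands weren't preserved; the usual bootstrap route, via the official get-pip.py script, is:

```shell
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python3 get-pip.py
```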
3. Download Browser Webdrivers
Next, we need to pick Firefox and/or Chrome and install their webdrivers. A webdriver allows Selenium, through Python, to control the web browser as if it were a user.
The Chrome webdriver needs to be updated every time Chrome updates, but the Firefox webdriver does not. This makes Firefox more desirable to use… except there is a known bug with the Firefox webdriver on macOS, which makes it a little trickier to set up initially.
You must select the Chrome webdriver that matches the version of Chrome you are running. Because of this requirement, I recommend disabling auto-updates on Chrome or using Firefox, so that you don’t need to download a new webdriver every time Chrome releases an update.
Chrome Webdrivers Downloads
- Chrome version 86.x: Windows / Mac
- Chrome version 85.x: Windows / Mac
- Chrome version 84.x: Windows / Mac
- Other versions of Chrome: http://chromedriver.chromium.org/downloads
Firefox (geckodriver) Webdrivers Downloads
- Firefox version 60 or greater: Windows 64-bit / MacOS
- Other versions of geckodriver: https://github.com/mozilla/geckodriver/releases
There are also drivers for Safari and Edge. You can always find the latest official drivers at this URL: https://pypi.org/project/selenium/
4. Install Browser Webdrivers
You downloaded one or more webdrivers in the last step; now we need to install them. This means extracting the files and making sure the webdrivers are accessible by putting them somewhere in the system path. For Firefox on macOS 10.15 or above, it also means disabling the notarization requirement.
macOS: I recommend downloading the webdrivers and extracting them to the ~/Downloads folder. Even if you install multiple webdrivers, you only need to run these two commands once (don’t worry if the first command produces an error).
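The original listing was lost; the first command is presumably:

```shell
sudo mkdir /usr/local/bin || true   # an error here just means the folder already exists
```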
This command creates one of the default macOS path locations if it doesn’t already exist (it produces an error, which you can ignore, if it does).
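The second command's listing was also lost; one plausible reconstruction (the glob assumes your extracted webdrivers sit loose in ~/Downloads) is:

```shell
sudo ln -s ~/Downloads/* /usr/local/bin/ || true   # an error just means a link already exists
```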
This command makes the contents of the default ~/Downloads folder (where ~ is a shortcut to your user folder) part of the path by symbolically linking them into the /usr/local/bin folder.
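For the Firefox driver on macOS 10.15 or later, you also need the notarization workaround mentioned above. Clearing the quarantine attribute on the downloaded binary is the usual fix (the geckodriver path is an assumption):

```shell
xattr -d com.apple.quarantine ~/Downloads/geckodriver || true   # macOS only; harmless elsewhere
```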
For more information on macOS paths, this macOS guide is older but still accurate: https://coolestguidesontheplanet.com/add-shell-path-osx/ — as are the answers on this StackExchange topic: https://apple.stackexchange.com/questions/41542/adding-a-new-executable-to-the-path-environment-variable
Windows: I recommend extracting or moving your webdrivers to C:\Windows\ directory so that they are in a location in your path.
For Windows, this article thoroughly explains how to add to the PATH on each different version:
If the webdriver is in your path, running it from a terminal prints its version information.
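The screenshot isn't reproduced here; you can check from any terminal (the same works for geckodriver):

```shell
chromedriver --version || echo "chromedriver is not on the PATH yet"
```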
5. Download & Customize My Free Web Scraping Script
Download my Python script to your PATH (as outlined above, so continuing with the defaults, this would be your ~/Downloads folder on macOS.)
- Download learnonlineshop.py
You launch this script by typing this at elevated terminal/CMD/PowerShell:
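The command itself was lost in formatting; given the filename above, it is presumably:

```shell
python3 learnonlineshop.py   # use "python learnonlineshop.py" on Windows
```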
or if that causes you issues, on MacOS you can try this elevated command instead:
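This command was also lost; the elevated macOS variant is presumably:

```shell
sudo python3 learnonlineshop.py
```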
Using this script will launch Firefox in a special mode with a new, blank profile, which means you will need to sign in to TikTok, Instagram, YouTube, or whatever site you want to scrape. That’s hard to do while the script/bot is controlling the browser, so I’ve given you 20 seconds to do it on the first page loaded.
When successful, Chrome launches with an infobar stating that it is being controlled by automated test software.
My Real-World Use-Cases
I’ve used Python + Selenium for several clients to help them take information from the web (that they had legal rights to use) and collect that data into databases and spreadsheets for everyday business use.
For another company, I wrote Python & Selenium scripts that scraped popular social media accounts for lead generation. This project didn’t get finished because I read the TOS for the social media services and realized it was against their terms. Remember, don’t use this web scraping technology on websites that specifically forbid it.
For a company that sells appliance parts, I wrote Python & Selenium scripts that scraped partner websites for inventory, then the script applied a price markup and used Amazon MWS to automatically list these appliance parts on Amazon if they were competitive. The example script I included in this article is lightly edited from one of those scripts.
Here is the full source code of this Python web scraper:
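The full listing didn't survive formatting here, so what follows is a condensed sketch of the script's shape rather than the original source. The URLs, CSS selectors, and markup value are placeholders, and the MySQL output is reduced to returning rows:

```python
import time

MARKUP = 1.15  # placeholder: a 15% markup, as in the appliance-parts project


def apply_markup(price, markup=MARKUP):
    """Return the listing price after applying a percentage markup."""
    return round(price * markup, 2)


def build_url_queue():
    """Manually queue the category pages to scrape."""
    urlArray = []
    urlArray.append("https://www.example.com/category-a/")
    urlArray.append("https://www.example.com/category-b/")
    return urlArray


def scrape(urls):
    """Visit each URL with Selenium and harvest (name, price) rows."""
    # Imports kept local so the helpers above work without Selenium installed.
    from selenium import webdriver
    from bs4 import BeautifulSoup

    driver = webdriver.Firefox()  # or webdriver.Chrome()
    rows = []
    try:
        for i, url in enumerate(urls):
            driver.get(url)
            if i == 0:
                time.sleep(20)  # 20 seconds to sign in manually on the first page
            soup = BeautifulSoup(driver.page_source, "html.parser")
            for item in soup.select("div.product"):  # selector is a placeholder
                try:
                    name = item.select_one("h2.title").get_text(strip=True)
                    price = float(
                        item.select_one("span.price").get_text(strip=True).lstrip("$")
                    )
                    rows.append((name, apply_markup(price)))
                except (AttributeError, ValueError):
                    continue  # skip items missing the expected elements
    finally:
        driver.quit()
    return rows  # the real script INSERTed these rows into MySQL instead


# Usage: rows = scrape(build_url_queue())
```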
Editing the Script for Your Own Use
The script I’ve provided is a good starting point, but you need to fill in your own MySQL database details, and you need to target the data you want to collect using the CSS selectors that the site uses.
You use urlArray.append("https://www.example.com/category-a/") to manually add URLs to scrape. This was useful to me because there was a small, finite number of URLs I needed to scrape. On other projects, I’ve had to use urlArray.append() as part of a multi-phase process: visiting a parent URL, then adding child URLs to the urlArray for later traversal in a second or third phase.
I like to employ plenty of try / except blocks so that I know what is and isn’t working and so the script keeps running even if there are small issues.
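As a self-contained illustration of that pattern, here is a hypothetical helper (the price formats are made up) that logs bad rows and keeps going instead of crashing the whole scrape:

```python
def parse_price(text):
    """Convert a scraped price string like '$19.99' to a float, or None on failure."""
    try:
        return float(text.strip().lstrip("$").replace(",", ""))
    except (AttributeError, ValueError):
        # Report the problem and keep the scrape running.
        print(f"could not parse price: {text!r}")
        return None


prices = [parse_price(t) for t in ["$19.99", "1,049.00", None, "N/A"]]
# prices -> [19.99, 1049.0, None, None]
```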
I like to use the Chrome or Firefox developer tools, especially the inspector, to find the CSS selectors to target with this script. I’ve used this exact script as a template to scrape several other websites; the key was finding CSS patterns that allowed me to target data in specific HTML elements. Since these patterns differ with every site, it’s imperative that you use the inspector to locate the right selectors before you scrape.
I’ve also used this script to output into a CSV spreadsheet file instead of a database. The point is that you will have to modify this script regardless, so feel free to prune the MySQL content if it doesn’t fit your project.
Python & Selenium are a powerful combination that is relatively quick and easy to get started with. Let me know if I’ve helped you or if you have any additional questions, comments or concerns and I’ll be happy to help. Thanks for reading and happy coding!