Python & Selenium Windows & MacOS Guide + Videos [2020]

In this article, I am going to give you all the information and tools necessary to get your web testing automation or web scraping project up and running quickly using Python & Selenium.

Introduction

Selenium is good for automated testing of websites and web apps, as well as for web scraping.

Web scraping involves finding patterns in a web page's source code that can be used to harvest useful data. Python, with its strengths in list and text processing, is a natural match for Selenium, an industry-standard tool for browser automation.
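To give you a feel for how the two fit together, here is a minimal sketch of the basic flow: Selenium drives the browser and renders the page, and Beautiful Soup parses the resulting HTML. This is my own illustration, not the project script from later in this article, and it assumes you have completed the install steps below:

from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Firefox() # requires geckodriver on your PATH (see step 4)
browser.get("https://www.example.com/")

# Hand the rendered page source to Beautiful Soup and print every link URL
soup = BeautifulSoup(browser.page_source, "html.parser")
for link in soup.find_all("a"):
    print(link.get("href"))

browser.quit()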

Python/Selenium Beginners Free Course from Edureka

You can follow my guide below, but if you learn better with video, then I recommend this hour-long free course on the subject by Edureka:

If you’re looking to become professionally certified with Selenium, or if you learn better with a live instructor and personalized help from tutors, then I recommend checking this out instead:

Python + Selenium + Beautiful Soup Install Guide

At the time of writing, Python 3.8.5 is the latest version and is what I am linking to and installing. As new versions come out, you will probably want to grab those versions instead. The main Python download page is located here: https://www.python.org/downloads/

Linux Mint 19 Installation

I did this install back in 2019, and I’m not covering Linux in the rest of the tutorial, but here are the commands that worked for me:

sudo apt install python-pip
pip install selenium
wget https://github.com/mozilla/geckodriver/releases/download/v0.23.0/geckodriver-v0.23.0-linux64.tar.gz
tar -xvzf geckodriver-v0.23.0-linux64.tar.gz
chmod +x geckodriver
sudo mv geckodriver /usr/local/bin/
pip install setuptools
pip install mysql-connector
sudo apt-get install python-bs4
pip install beautifulsoup4

1. Download and install Python 3.8.5

Apple MacOS 64bit: https://www.python.org/ftp/python/3.8.5/python-3.8.5-macosx10.9.pkg
Microsoft Windows 64bit: https://www.python.org/ftp/python/3.8.5/python-3.8.5-amd64.exe

How to Verify Success
To ensure the install succeeded, open a Terminal prompt (MacOS) or CMD/PowerShell (Windows) and execute this command:
python -V

If everything installed correctly, you should get a response similar to this:

Python 3.8.5

2. Next install Selenium & Beautiful Soup 4.x

We use Python’s package manager, pip, to install both.

From a Terminal prompt (MacOS) or an elevated CMD/PowerShell (Windows) execute this command:

pip install selenium

After that completes, type this command:

pip install beautifulsoup4
How to Verify Success
A successful install will end with a line similar to this (the version number will vary):

Successfully installed selenium-3.141.0

If you get an error, you may need to install pip, although default installations of Python now include it.

Install pip using these two commands in the same terminal (MacOS only):

sudo curl -O https://bootstrap.pypa.io/get-pip.py
sudo python get-pip.py

3. Download Browser Webdrivers

Next, we need to pick Firefox and/or Chrome and install their webdrivers. The webdrivers allow Selenium, through Python, to control the web browser as if it were a user.

The Chrome webdriver needs to be updated every time Chrome updates, but the Firefox webdriver does not. This makes Firefox more desirable to use… except there is a known bug with the Firefox webdriver on MacOS, which makes it a little trickier to set up initially.

You must select the Chrome webdriver that corresponds to the version of the web browser you are using. Because of this requirement, I recommend disabling auto-updates on Chrome or using Firefox so that you don’t constantly need to download new webdrivers every time Chrome releases an update.

Google Chrome: Chrome Webdriver Downloads

Mozilla Firefox: Firefox (geckodriver) Webdriver Downloads

There are also drivers for Safari and Edge. You can always find the latest official drivers at this URL: https://pypi.org/project/selenium/

4. Install Browser Webdrivers

Now that you’ve downloaded one or more webdrivers in the last step, we need to install them. This means extracting the files and making sure the webdrivers are accessible by being in the system path. For Firefox on macOS 10.15 or above, it also means disabling the notarization requirement.

Webdrivers Must Be Extracted To Your PATH!

Apple MacOS: I recommend downloading the webdrivers and extracting them to the ~/Downloads folder, then running the commands below (don’t worry if the first command produces an error).

This command creates one of the default MacOS path locations if it doesn’t already exist (it produces an error you can ignore if it does):

sudo mkdir -p /usr/local/bin

This command puts a webdriver on your path by symbolically linking the extracted binary in your ~/Downloads folder (where ~ is a shortcut to your user folder) into the /usr/local/bin folder. Run it once for each webdriver you downloaded (substitute chromedriver for geckodriver as needed):

sudo ln -s ~/Downloads/geckodriver /usr/local/bin/geckodriver
Firefox geckodriver Bug Requires This Workaround
There is a bug on macOS 10.15+ with geckodriver. You must run this command to fix it:
sudo xattr -r -d com.apple.quarantine ~/Downloads/geckodriver
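If you want to confirm the quarantine attribute is gone (my own extra check, not strictly required), list the file’s extended attributes; no output means the fix worked:

xattr ~/Downloads/geckodriver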

For more information on MacOS paths, see this guide (somewhat dated, but still accurate): https://coolestguidesontheplanet.com/add-shell-path-osx/ or the answers on this Stack Exchange topic: https://apple.stackexchange.com/questions/41542/adding-a-new-executable-to-the-path-environment-variable

Microsoft Windows: I recommend extracting or moving your webdrivers to the C:\Windows\ directory so that they are in a location on your path.

For Windows, this article thoroughly explains how to add to the PATH on each different version:
https://www.java.com/en/download/help/path.xml

How to Verify Success
After you have installed the webdriver(s), you can test whether they’re in your path by typing the appropriate command below into your terminal/CMD:
chromedriver
geckodriver

You should get a startup message in response, similar to “Starting ChromeDriver … on port 9515” for Chrome or “geckodriver INFO Listening on 127.0.0.1:4444” for Firefox.

5. Download & Customize My Free Web Scraping Script

Download my Python script to your working folder (continuing with the defaults above, this would be your ~/Downloads folder on MacOS) and run it from there.

You launch this script by typing this at an elevated terminal/CMD/PowerShell prompt:

python learnonlineshop.py

or if that causes you issues, on MacOS you can try this elevated command instead:

sudo python learnonlineshop.py

Running this script launches Firefox in a special mode with a new, blank profile, which means you will need to sign in to TikTok, Instagram, YouTube, or whatever site you want to scrape. That’s hard to do while the script/bot is controlling the browser, so I’ve given you 20 seconds to do it on the first page loaded.
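If 20 seconds is not enough time to log in, one simple variation (my own suggestion, not what the script below does) is to pause until you press Enter in the terminal:

# Instead of time.sleep(20), block until the user confirms they are logged in:
input("Log in through the browser window, then press Enter here to continue...")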

How to Verify Success
If the script is able to open up Firefox and/or Chrome on its own, then so far, so good! When Firefox is in robot mode, a small robot icon appears in the address bar.

When successful, Chrome displays a banner reading “Chrome is being controlled by automated test software.”

My Real-World Use-Cases

I’ve used Python + Selenium for several clients to help them take information from the web (that they had legal rights to use) and scrape that data into databases and spreadsheets for everyday business use.

For another company, I wrote Python & Selenium scripts that scraped popular social media accounts for lead generation. That project didn’t get finished because I read the TOS for the social media services and realized it was against their terms. Remember: don’t use this web scraping technology on websites that specifically forbid it.

For a company that sells appliance parts, I wrote Python & Selenium scripts that scraped partner websites for inventory, then the script applied a price markup and used Amazon MWS to automatically list these appliance parts on Amazon if they were competitive. The example script I included in this article is lightly edited from one of those scripts.

Here is the full source code of this Python web scraper:

from selenium import webdriver # Selenium for opening browsers
import mysql.connector # We need MySQL for this project
from bs4 import BeautifulSoup # This makes HTML parsing much easier
import sys
import time
from selenium.webdriver.support.ui import Select

# Make a db connection:
mydb = mysql.connector.connect(
  host="localhost",
  user="database_user",
  passwd="database_password",
  database="database_name" # the name of the database itself, not a table
)

mycursor = mydb.cursor() #Use this for messing around with database results

browser = webdriver.Firefox() #replace with .Chrome(), or with the browser of your choice
urlArray = [] # Make an array for all our pages

result_brand = "BrandName" #fill this in before running, once per brand
urlArray.append("https://www.example.com/login/") #log in to partner website first
urlArray.append("https://www.example.com/category-a/") 
urlArray.append("https://www.example.com/category-b/") 
urlArray.append("https://www.example.com/category-c/") 
urlArray.append("https://www.example.com/category-d/") 
urlArray.append("https://www.example.com/category-r/") 

# The main stuff we're looking to extract for each product:
result_title = ""   # Inside of an A tag, inside of an H2, inside a TR inside a TD with class name "PD-name" inside table with class name "product-details" inside another table with class "product" /// Actually inside the IMAGE the ALT tag has a better title!
result_cost = "0.00" # Inside a table class name "product-cart" in a span with class "PC-Price"
result_img = "" # Inside TR in TD with class "product-image" in an A tag, in an IMG tag in the SRC 
result_partno = ""  # Inside a TR, in a TD with class "PD-number"


print("Starting Inventory Collection")

i = 0
while i < len(urlArray):
    browser.get(urlArray[i])
    try:
        if i != 0: #Don't do this on the first one
            #here we are trying to select 100 from the drop down
            select = Select(browser.find_element_by_id('MainContent_DDLPageSize'))
            select.select_by_value('100')
            print("Selected 100, waiting 5 seconds")
            time.sleep(5)
    except:
        print("Failed to change dropdown")
    searchResulted = browser.find_elements_by_xpath("//*[@class='product']")
 
    if i == 0: #Pause on the first one for 20 seconds so we can log in
        time.sleep(20)

    for searchResults in searchResulted:
        failed = 0
        try:
            try:
                result_title_pre = searchResults.find_element_by_css_selector("td.PD-name h2 a") #find the title
                result_title = result_title_pre.text
                if result_title == 'NO LONGER AVAILABLE':
                    print("This item is no longer available")
                    failed = 1
            except:
                print("Failed at title")
                failed = 1
            try:
                result_cost_pre = searchResults.find_element_by_css_selector("span.PC-price") #get the price
                result_cost = result_cost_pre.text.replace("Your Price:","").replace(" ","") #strip the label and whitespace from the price text
            except:
                print("Failed at price")
                failed = 1
            try:
                result_img_pre = searchResults.find_element_by_css_selector('.product-image a img')
                result_img = result_img_pre.get_attribute("src")
            except:
                print("Failed at image")
                failed = 1
            try:
                result_partno_pre = searchResults.find_element_by_css_selector("td.PD-number strong")
                result_partno = result_partno_pre.text
            except:
                print("Failed at Part Number")
                failed = 1
            if failed == 0:
                val = (str(result_title), str(result_partno).replace("Part Number:",""), str(result_brand), str(result_img), str(result_cost).replace("$","")) #values in query go here
                print("Title: %s --- Cost: %s --- Image: %s --- PartNo: %s ---" % (result_title, result_cost, result_img, result_partno )) 
                sql = "INSERT INTO `amazon_listings` (`title`, `description`, `part_no`, `model_no`, `brand`, `image`, `sorted`, `added_to_amazon`, `date_inserted_db`, `date_last_sorted`, `user_id_last_sorted`, `user_id_added_to_amazon`, `user_id_locking`, `az_listed_price`, `az_asin`, `az_barcode`, `az_barcode_type`, `az_sku`, `az_category`, `az_shipping_profile`, `az_qty`, `az_handling_time_days`, `az_title`, `az_brand`, `az_manufacturer`, `user_notes`, `current_supplier_id`, `current_supplier_cost`, `current_supplier_last_checked`, `current_supplier_last_checked_by_userid`, `metadata`) VALUES(%s, '', %s, '', %s, %s, 0, 0, '', '', 0, 0, 0, NULL, '', '', '', '', '', '', NULL, NULL, '', '', NULL, '', 10, %s, '', 0, '');" #query goes here

                mycursor.execute(sql, val) #execute the query
                mydb.commit() #commit the changes to the db
                print("Added %s to the database!" % result_title) 

        except:
            e = sys.exc_info()[0]
            print("\n\nError: %s\n\n" % e)
    i += 1

print("Finished Inventory Collection")

Editing the Script for Your Own Use

The script I’ve provided is a good starting point, but you need your own MySQL database details, and you need to target the data you want to collect using the CSS selectors that the site uses.

You use urlArray.append("https://www.example.com/category-a/") to manually add URLs to scrape. This was useful to me because there was a small, finite number of URLs I needed to scrape. On other projects, I’ve had to use urlArray.append() as part of a multi-phase process: visiting a parent URL, then adding child URLs to urlArray for later traversal in a second or third phase, as sketched below.
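Here is a minimal sketch of that multi-phase pattern. The a.category-link selector is a hypothetical placeholder; substitute whatever selector matches the links on your target site:

# Phase 1: visit a parent page and queue up child URLs for later traversal
browser.get("https://www.example.com/categories/")
for link in browser.find_elements_by_css_selector("a.category-link"): # hypothetical selector
    urlArray.append(link.get_attribute("href"))
# Phase 2: the main while loop then visits everything queued in urlArray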

I like to employ plenty of try / except blocks so that I know what is and isn’t working, and so the script keeps running even if it hits small issues.

I like to use the Chrome or Firefox developer tools, especially the inspector, to find CSS elements to target. I’ve used this exact script as a template to scrape several other websites; the key was finding CSS patterns that let me target data in specific HTML elements. Since these patterns differ with every site, using the inspector to locate the right CSS selectors is imperative.

I’ve also used this script to output to a CSV spreadsheet file instead of a database. The point is that you will have to modify this script regardless, so feel free to prune the MySQL content if it doesn’t fit your project.
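If CSV output fits your project better, here is a minimal sketch of that substitution, reusing the result_* variables from the script above (the inventory.csv filename is just an assumption for illustration):

import csv

# Before the while loop, open the output file and write a header row:
csvfile = open("inventory.csv", "w", newline="")
writer = csv.writer(csvfile)
writer.writerow(["title", "part_no", "brand", "image", "cost"])

# Then, in place of the mycursor.execute() / mydb.commit() lines:
writer.writerow([result_title, str(result_partno).replace("Part Number:",""),
                 result_brand, result_img, str(result_cost).replace("$","")])

# After the loop finishes:
csvfile.close()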

Conclusion

Python & Selenium are a powerful combination that is relatively quick and easy to get started with. Let me know if I’ve helped you or if you have any additional questions, comments or concerns and I’ll be happy to help. Thanks for reading and happy coding!

