Automate Web Data Extraction 101: Scraping 2400+ Entries with Python

Use case:

I helped a friend extract the list of MATTA members (https://www.matta.org.my/members) into an Excel file. With over 2,400 entries spread across 49 web pages, manually copying each entry would be impractical and a waste of time.

Solution:

Using Python’s BeautifulSoup and Requests modules, I automated the data extraction. If the site required credentials or used JavaScript, I would have needed Selenium or a similar tool.

Result: a CSV file containing all 2,400+ member entries, ready to open in Excel.

Let's get started!

First, we need to know what modules to use.

Python Modules:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

BeautifulSoup: mainly used to extract data from HTML content on webpages. If the HTML is already stored locally, BeautifulSoup alone is enough; Requests is only needed to fetch pages over the network.

How does BeautifulSoup work?

  1. Loading HTML/XML: The first step is loading the HTML content into BeautifulSoup.

  2. Creating the Soup Object: BeautifulSoup parses the content and creates a "soup" object, which represents the document.

  3. Navigating and Searching: The soup object allows you to navigate and search through the parsed document easily, using methods like find(), find_all(), select(), etc.
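Here is a minimal sketch of those three steps, using a small made-up HTML snippet instead of a real webpage:

from bs4 import BeautifulSoup

# 1. Load the HTML (here a hard-coded string; normally it comes from Requests or a local file)
html = """
<div class="card-box"><a class="search-title">Company A</a></div>
<div class="card-box"><a class="search-title">Company B</a></div>
"""

# 2. Create the soup object
soup = BeautifulSoup(html, 'html.parser')

# 3. Navigate and search the parsed tree
first = soup.find('a', class_='search-title')        # first match only
titles = soup.find_all('a', class_='search-title')   # every match
same = soup.select('div.card-box a.search-title')    # same result, CSS selector syntax

print(first.get_text(strip=True))                    # Company A
print([a.get_text(strip=True) for a in titles])      # ['Company A', 'Company B']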

Requests: makes HTTP requests in Python. We use it to fetch the HTML content of each page.
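For example, fetching the first page of the member list looks like this:

import requests

response = requests.get("https://www.matta.org.my/members?page=1")
print(response.status_code)   # 200 if the request succeeded
print(len(response.text))     # size of the HTML we got back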

Pandas: this library provides data analysis tools. We will use it to store and manipulate the extracted data in tabular form.

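As a quick illustration (my own toy rows, not the real MATTA data), a list of dictionaries becomes a table like this:

import pandas as pd

rows = [
    {'name': 'Company A', 'reg_number': '123456-A', 'location': 'Kuala Lumpur'},
    {'name': 'Company B', 'reg_number': '654321-B', 'location': 'Melaka'},
]
df = pd.DataFrame(rows)
print(df)
#         name reg_number      location
# 0  Company A   123456-A  Kuala Lumpur
# 1  Company B   654321-B        Melaka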

os: this module is used to interact with the operating system. We will use it to resolve the file path once the extraction is finished and the CSV file is created.
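For example, os.path.abspath turns a bare filename into its full location on disk:

import os

print(os.path.abspath('matta_members.csv'))
# e.g. /home/you/projects/matta_members.csv (the exact path depends on where the script runs)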

Defining the Function

def scrape_matta_members():
    base_url = "https://www.matta.org.my/members?page="
    members = []

def scrape_matta_members(): defines the function with the name stated.

base_url is the web address of the pages we want to capture. Typically, everything up to the '?' symbol is the main site address; what follows is the query string carrying the search terms or other details (here, the page number).

members = []: initializes an empty list that will store the member details later.
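To make the URL pattern concrete, the first few addresses the loop below will request are:

base_url = "https://www.matta.org.my/members?page="
for page in range(1, 4):
    print(f"{base_url}{page}")
# https://www.matta.org.my/members?page=1
# https://www.matta.org.my/members?page=2
# https://www.matta.org.my/members?page=3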

Looping

    for page in range(1, 50):
        url = f"{base_url}{page}"
        response = requests.get(url)
        if response.status_code != 200:
            print(f"Failed to load page {page}, status code: {response.status_code}")
            continue

for page in range(1, 50): loops from page 1 to page 49 (range excludes its end value).

url = f"{base_url}{page}": Constructs the URL for each page by appending the page number to the base URL. Page is int

response = requests.get(url): sends an HTTP GET request to the URL we just built.

if response.status_code != 200: checks whether the request was successful (status code 200 means OK: the request was successfully received, understood, and accepted).

print(f"Failed to load page {page}, status code: {response.status_code}"): Will print this if unable to capture from the URL

continue: Skips to the next iteration of the loop if the request failed.
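If the requests ever fail or hang, a slightly more defensive version of this block could add a timeout, a browser-like User-Agent header, and a short pause between pages. This is an optional sketch of my own, not part of the original script:

import time
import requests

HEADERS = {'User-Agent': 'Mozilla/5.0'}  # some sites block the default python-requests agent

for page in range(1, 50):
    url = f"{base_url}{page}"  # base_url is the same prefix defined in the function above
    try:
        response = requests.get(url, headers=HEADERS, timeout=30)
    except requests.RequestException as exc:
        print(f"Request for page {page} failed: {exc}")
        continue
    if response.status_code != 200:
        print(f"Failed to load page {page}, status code: {response.status_code}")
        continue
    time.sleep(1)  # be polite to the server between pages
    # ... parse the page as shown in the next section ...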

Parse

        soup = BeautifulSoup(response.content, 'html.parser')

        # Find the section containing the member details
        member_list = soup.find_all('div', class_='card-box')
        if not member_list:
            print(f"No members found on page {page}")
            continue

soup = BeautifulSoup(response.content, 'html.parser'): Parses the HTML content of the page using BeautifulSoup.

Parsing means analyzing and breaking down the HTML document into a structure that is easy to navigate; it converts a string of HTML or XML into a tree of Python objects.

For member_list = soup.find_all('div', class_='card-box'), you need to inspect the website to find the right class.

In this case, all the details sit inside <div class="card-box">, so we use that class to find every div element containing a member's details.
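Based on the class names this script searches for, a single card-box is assumed to look roughly like the fragment below; the real markup may differ, so always confirm it with your browser's inspect tool:

from bs4 import BeautifulSoup

snippet = """
<div class="card-box">
  <a class="search-title">Company A</a>
  <span class="reg-number">123456-A | MA0001</span>
  <span class="contact-number">+60 (03) 1234567</span>
  <span class="web-address">www.example.com</span>
  <span class="location">Jalan Example, Kuala Lumpur, Malaysia</span>
</div>
"""
cards = BeautifulSoup(snippet, 'html.parser').find_all('div', class_='card-box')
print(len(cards))  # 1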

if not member_list: Checks if no member elements were found.

print(f"No members found on page {page}"): Prints a message if no members were found.

continue: Skips to the next iteration of the loop if no members were found.

Extract and Store Details

        for member in member_list:
            name = member.find('a', class_='search-title').get_text(strip=True)
            reg_number = member.find('span', class_='reg-number').get_text(strip=True)
            contact_number = member.find('span', class_='contact-number').get_text(strip=True)
            web_address = member.find('span', class_='web-address').get_text(strip=True)
            location = member.find('span', class_='location').get_text(separator=", ", strip=True)

            members.append({
                'name': name,
                'reg_number': reg_number,
                'contact_number': contact_number,
                'web_address': web_address,
                'location': location
            })
        print(f"Page {page} scraped successfully.")

Iterates over each member element in member_list and extracts the details.

name = member.find('a', class_='search-title').get_text(strip=True): Extracts the member's name.

reg_number = member.find('span', class_='reg-number').get_text(strip=True): Extracts the registration number.

contact_number = member.find('span', class_='contact-number').get_text(strip=True): Extracts the contact number.

web_address = member.find('span', class_='web-address').get_text(strip=True): Extracts the web address.

location = member.find('span', class_='location').get_text(separator=", ", strip=True): Extracts the location.

Appends the extracted details as a dictionary to the members list.
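One caveat: member.find(...) returns None when an element is missing, and calling .get_text() on None raises an AttributeError. If some entries turn out to be incomplete, a small helper like this (my own addition, not in the original script) keeps the loop going:

def get_text_or_dash(parent, tag, class_name, **kwargs):
    """Return the element's text, or '-' if the element is missing."""
    element = parent.find(tag, class_=class_name)
    return element.get_text(strip=True, **kwargs) if element else '-'

# usage inside the loop, for example:
# name = get_text_or_dash(member, 'a', 'search-title')
# location = get_text_or_dash(member, 'span', 'location', separator=", ")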

Convert the Extracted Data to a DataFrame

    # Convert to DataFrame
    df = pd.DataFrame(members)
    return df

df = pd.DataFrame(members): Converts the list of dictionaries to a Pandas DataFrame.

return df: Returns the DataFrame.
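A quick sanity check after the function returns:

df = scrape_matta_members()
print(len(df))    # should be a bit over 2,400 rows if every page loaded
print(df.head())  # preview of the first five rows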

Defining the Save Function

def save_to_csv(df, filename):
    df.to_csv(filename, index=False)
    full_path = os.path.abspath(filename)
    print(f"Data successfully saved to {full_path}")

def save_to_csv(df, filename) defines the function with the stated name; it takes a DataFrame df and a filename as arguments.

df.to_csv(filename, index=False) saves the DataFrame to a CSV file with the specified filename, without including the DataFrame index.

If I set index=True instead, my CSV would look like this:

,name,reg_number,contact_number,web_address,location
0,A'Famosa Travel & Tours Sdn. Bhd.,456717-K | MA1999,+60 (06) 5520288,www.afamosa.com,"Club House Building,A'Famosa Resort, Jalan Kemus, Simpang Empat, Alor Gajah, 78000, Melaka, Malaysia"
1,Afbatni Travel & Services Sdn Bhd,1376398-M | MA6643,+60 (03332) 33272,-,"No. 36A, Tingkat 1, Lorong Bayu Tinggi 4c, Taman Bayu Tinggi, Klang, 41200, Selangor, Malaysia"

full_path = os.path.abspath(filename) takes the relative path and converts it to an absolute path. full_path is the variable that will hold the absolute path of the file.

print(f"Data successfully saved to {full_path}") is success message

Scrape Data

# Run the function and get the dataframe
df = scrape_matta_members()

df = scrape_matta_members(): Calls the scrape_matta_members function as mentioned earlier and stores the returned DataFrame in the variable df.

Save Data to CSV

# Save the dataframe to a CSV file
csv_filename = 'matta_members.csv'
save_to_csv(df, csv_filename)

  • csv_filename = 'matta_members.csv': sets the name of the output CSV file.

  • save_to_csv(df, csv_filename): Calls the save_to_csv function as defined before to save the DataFrame to the specified CSV file.
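Since the original goal was an Excel file, you can also export straight to .xlsx instead of (or in addition to) the CSV. This is an optional variant and needs the openpyxl package installed:

# Optional: save directly to Excel (pip install openpyxl)
df.to_excel('matta_members.xlsx', index=False)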

That's all. Leave a comment if you need any other help.

Full code:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import os


def scrape_matta_members():
    base_url = "https://www.matta.org.my/members?page="
    members = []

    for page in range(1, 50):  # There are 49 pages; loop through each one
        url = f"{base_url}{page}"
        response = requests.get(url)
        if response.status_code != 200:
            print(f"Failed to load page {page}, status code: {response.status_code}")
            continue

        soup = BeautifulSoup(response.content, 'html.parser')

        # Find the section containing the member details
        member_list = soup.find_all('div', class_='card-box')
        if not member_list:
            print(f"No members found on page {page}")
            continue

        # Iterate over each member entry and extract details
        for member in member_list:
            name = member.find('a', class_='search-title').get_text(strip=True)
            reg_number = member.find('span', class_='reg-number').get_text(strip=True)
            contact_number = member.find('span', class_='contact-number').get_text(strip=True)
            web_address = member.find('span', class_='web-address').get_text(strip=True)
            location = member.find('span', class_='location').get_text(separator=", ", strip=True)

            members.append({
                'name': name,
                'reg_number': reg_number,
                'contact_number': contact_number,
                'web_address': web_address,
                'location': location
            })
        print(f"Page {page} scraped successfully.")

    # Convert to DataFrame
    df = pd.DataFrame(members)
    return df


def save_to_csv(df, filename):
    df.to_csv(filename, index=False)
    full_path = os.path.abspath(filename)
    print(f"Data successfully saved to {full_path}")


# Run the function and get the dataframe
df = scrape_matta_members()

# Save the dataframe to a CSV file
csv_filename = 'matta_members.csv'
save_to_csv(df, csv_filename)