Use case:
I helped a friend extract a list of Matta members (https://www.matta.org.my/members) into an Excel file. With over 2,400 entries spread across 49 web pages, copying each entry manually is impractical and a waste of time.
Solution:
Using Python’s BeautifulSoup and Requests modules, I automated the data extraction. If the site had required credentials or relied on JavaScript, I would have needed Selenium or a similar tool.
Result:
A clean CSV file containing every member entry, ready to open in Excel.
Let's get started!
First, we need to know what modules to use.
Python Modules:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os
BeautifulSoup: Mainly used to extract data from HTML content on web pages. (If you are parsing a local HTML file instead, you do not need Requests at all; BeautifulSoup alone is enough.)
How does BeautifulSoup work?
Loading HTML/XML: The first step is loading the HTML content into BeautifulSoup.
Creating the Soup Object: BeautifulSoup parses the content and creates a "soup" object, which represents the document.
Navigating and Searching: The soup object allows you to navigate and search through the parsed document easily, using methods like find(), find_all(), select(), etc.
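Here is a minimal sketch of all three steps on a tiny hand-written HTML snippet (the markup and class names below are made up purely for illustration):

from bs4 import BeautifulSoup

# Step 1: the HTML we want to parse (normally fetched with Requests)
html = """
<div class="member"><a class="name">Alice Travel</a><span class="phone">03-1111</span></div>
<div class="member"><a class="name">Bob Tours</a><span class="phone">03-2222</span></div>
"""

# Step 2: parse it into a soup object
soup = BeautifulSoup(html, 'html.parser')

# Step 3: navigate and search the tree
print(soup.find('a', class_='name').get_text())        # Alice Travel (first match only)
for div in soup.find_all('div', class_='member'):      # every match, as a list
    print(div.find('span', class_='phone').get_text())
print(soup.select('div.member a.name')[1].get_text())  # Bob Tours (CSS selector)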
Requests: Makes HTTP requests in Python. We will use it to fetch the HTML content of each page.
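A minimal example (httpbin.org is just a public test endpoint, not the Matta site):

import requests

response = requests.get("https://httpbin.org/html")  # send an HTTP GET request
print(response.status_code)    # 200 if the request succeeded
print(response.content[:40])   # first bytes of the fetched HTML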
Pandas: This library provides data analysis tools. We will use it to store and manipulate the extracted data in tabular form, as in the example below:
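(A toy example; the two rows here are made up, but the script builds its list of dictionaries the same way:)

import pandas as pd

# one dictionary per member; keys become column names
members = [
    {'name': 'Alice Travel', 'reg_number': 'MA0001', 'location': 'Melaka'},
    {'name': 'Bob Tours', 'reg_number': 'MA0002', 'location': 'Selangor'},
]
df = pd.DataFrame(members)
print(df)  # prints a two-row table with columns name, reg_number, location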
OS: This module is used to interact with the operating system. We will use it to resolve the file path once extraction is finished and the CSV file has been created.
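For instance (the printed path depends on your working directory; the one in the comment is hypothetical):

import os

filename = 'matta_members.csv'
print(os.path.abspath(filename))  # e.g. /home/user/project/matta_members.csv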
Defining the function
def scrape_matta_members():
    base_url = "https://www.matta.org.my/members?page="
    members = []
def scrape_matta_members(): defines the function with the stated name.
base_url is the web address of the page we want to capture. The base URL normally runs up to the '?' symbol; everything after it is the query string (search terms, page numbers, and other details).
members = [] initializes an empty list that will store the member details later.
Looping
for page in range(1, 50):
    url = f"{base_url}{page}"
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to load page {page}, status code: {response.status_code}")
        continue
for page in range(1, 50): loops from page 1 to page 49 (range stops before 50).
url = f"{base_url}{page}" constructs the URL for each page by appending the page number (an int) to the base URL; for page 3, this gives https://www.matta.org.my/members?page=3.
response = requests.get(url) sends an HTTP GET request to the URL we constructed.
if response.status_code != 200: checks whether the request was successful (status code 200 means OK: the request was successfully received, understood, and accepted).
print(f"Failed to load page {page}, status code: {response.status_code}") prints a message if the page could not be fetched.
continue skips to the next iteration of the loop if the request failed.
Parse
soup = BeautifulSoup(response.content, 'html.parser')
# Find the section containing the member details
member_list = soup.find_all('div', class_='card-box')
if not member_list:
    print(f"No members found on page {page}")
    continue
soup = BeautifulSoup(response.content, 'html.parser') parses the HTML content of the page using BeautifulSoup.
Parsing means analyzing and breaking down the HTML document into a structure that is easy to navigate: a string of HTML or XML code is converted into a tree of Python objects.
For member_list = soup.find_all('div', class_='card-box'), you need to inspect the website to find the right class. In this case, all of the details sit inside <div class="card-box">, so we use this to find all div elements that contain member details.
if not member_list: checks whether no member elements were found.
print(f"No members found on page {page}") prints a message if no members were found.
continue skips to the next iteration of the loop if no members were found.
Extract and Store Details
for member in member_list:
    name = member.find('a', class_='search-title').get_text(strip=True)
    reg_number = member.find('span', class_='reg-number').get_text(strip=True)
    contact_number = member.find('span', class_='contact-number').get_text(strip=True)
    web_address = member.find('span', class_='web-address').get_text(strip=True)
    location = member.find('span', class_='location').get_text(separator=", ", strip=True)
    members.append({
        'name': name,
        'reg_number': reg_number,
        'contact_number': contact_number,
        'web_address': web_address,
        'location': location
    })
print(f"Page {page} scraped successfully.")
This iterates over each member element in member_list and extracts the details:
name = member.find('a', class_='search-title').get_text(strip=True) extracts the member's name.
reg_number = member.find('span', class_='reg-number').get_text(strip=True) extracts the registration number.
contact_number = member.find('span', class_='contact-number').get_text(strip=True) extracts the contact number.
web_address = member.find('span', class_='web-address').get_text(strip=True) extracts the web address.
location = member.find('span', class_='location').get_text(separator=", ", strip=True) extracts the location.
members.append({...}) appends the extracted details as a dictionary to the members list.
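It helps to see what strip=True and separator=", " actually do; a small sketch on made-up markup:

from bs4 import BeautifulSoup

html = '<span class="location"> Jalan Kemus <br> Alor Gajah <br> Melaka </span>'
span = BeautifulSoup(html, 'html.parser').find('span')

print(span.get_text())                            # ' Jalan Kemus  Alor Gajah  Melaka '
print(span.get_text(strip=True))                  # 'Jalan KemusAlor GajahMelaka'
print(span.get_text(separator=", ", strip=True))  # 'Jalan Kemus, Alor Gajah, Melaka'

This is why the location field uses separator=", ": addresses span several lines of markup, and the separator keeps them readable.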
Convert the extracted data to a DataFrame:
# Convert to DataFrame
df = pd.DataFrame(members)
return df
df = pd.DataFrame(members) converts the list of dictionaries to a Pandas DataFrame; each dictionary key becomes a column.
return df returns the finished DataFrame from the function.
Defining the Save Function
def save_to_csv(df, filename):
    df.to_csv(filename, index=False)
    full_path = os.path.abspath(filename)
    print(f"Data successfully saved to {full_path}")
def save_to_csv(df, filename) defines the function with the stated name; it takes a DataFrame df and a filename as arguments.
df.to_csv(filename, index=False) saves the DataFrame to a CSV file with the specified filename, without including the DataFrame index.
If I put index=True instead, my CSV would look like this:
,name,reg_number,contact_number,web_address,location
0,A'Famosa Travel & Tours Sdn. Bhd.,456717-K | MA1999,+60 (06) 5520288,www.afamosa.com,"Club House Building,A'Famosa Resort, Jalan Kemus, Simpang Empat, Alor Gajah, 78000, Melaka, Malaysia"
1,Afbatni Travel & Services Sdn Bhd,1376398-M | MA6643,+60 (03332) 33272,-,"No. 36A, Tingkat 1, Lorong Bayu Tinggi 4c, Taman Bayu Tinggi, Klang, 41200, Selangor, Malaysia"
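With index=False (as in the function), the same rows come out without the leading counter column:
name,reg_number,contact_number,web_address,location
A'Famosa Travel & Tours Sdn. Bhd.,456717-K | MA1999,+60 (06) 5520288,www.afamosa.com,"Club House Building,A'Famosa Resort, Jalan Kemus, Simpang Empat, Alor Gajah, 78000, Melaka, Malaysia"
Afbatni Travel & Services Sdn Bhd,1376398-M | MA6643,+60 (03332) 33272,-,"No. 36A, Tingkat 1, Lorong Bayu Tinggi 4c, Taman Bayu Tinggi, Klang, 41200, Selangor, Malaysia"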
full_path = os.path.abspath(filename) takes the relative path and converts it to an absolute path; full_path is the variable that holds the absolute path of the file.
print(f"Data successfully saved to {full_path}") prints the success message.
Scrape Data
# Run the function and get the dataframe
df = scrape_matta_members()
df = scrape_matta_members() calls the scrape_matta_members function defined earlier and stores the returned DataFrame in the variable df.
Save Data to CSV
# Save the dataframe to a CSV file
csv_filename = 'matta_members.csv'
save_to_csv(df, csv_filename)
csv_filename = 'matta_members.csv' names the CSV file.
save_to_csv(df, csv_filename) calls the save_to_csv function defined above to save the DataFrame to the specified CSV file.
That's all. Leave a comment if you need any other help.
Full code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

def scrape_matta_members():
    base_url = "https://www.matta.org.my/members?page="
    members = []

    for page in range(1, 50):  # There are 49 pages; loop over each one
        url = f"{base_url}{page}"
        response = requests.get(url)
        if response.status_code != 200:
            print(f"Failed to load page {page}, status code: {response.status_code}")
            continue

        soup = BeautifulSoup(response.content, 'html.parser')

        # Find the section containing the member details
        member_list = soup.find_all('div', class_='card-box')
        if not member_list:
            print(f"No members found on page {page}")
            continue

        # Iterate over each member entry and extract details
        for member in member_list:
            name = member.find('a', class_='search-title').get_text(strip=True)
            reg_number = member.find('span', class_='reg-number').get_text(strip=True)
            contact_number = member.find('span', class_='contact-number').get_text(strip=True)
            web_address = member.find('span', class_='web-address').get_text(strip=True)
            location = member.find('span', class_='location').get_text(separator=", ", strip=True)
            members.append({
                'name': name,
                'reg_number': reg_number,
                'contact_number': contact_number,
                'web_address': web_address,
                'location': location
            })
        print(f"Page {page} scraped successfully.")

    # Convert to DataFrame
    df = pd.DataFrame(members)
    return df

def save_to_csv(df, filename):
    df.to_csv(filename, index=False)
    full_path = os.path.abspath(filename)
    print(f"Data successfully saved to {full_path}")

# Run the function and get the dataframe
df = scrape_matta_members()

# Save the dataframe to a CSV file
csv_filename = 'matta_members.csv'
save_to_csv(df, csv_filename)