# Automate Web Data Extraction 101: Scraping 2400+ Entries with Python

### **Use case:**

I helped a friend extract a list of Matta members ([https://www.matta.org.my/members](https://www.matta.org.my/members)) into an Excel file. With over 2,400 entries spread across 49 web pages, copying each entry by hand is impractical and a waste of time.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1716272186389/5f8650a1-6801-43f5-999f-3373ab26687a.png align="center")

### **Solution:**

Using Python’s BeautifulSoup and Requests modules, I automated the data extraction. If the site had required a login or rendered its content with JavaScript, I would probably have needed Selenium or a similar tool.

### **Result:**

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1716279593824/75547a9c-bb17-48d2-9657-b2f76ac7ee21.png align="center")

### Let's get started!

First, we need to know what modules to use.

### Python Modules:

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os
```

**BeautifulSoup**: Mainly used to extract data from HTML content. If the HTML is already saved in a local file, BeautifulSoup alone is enough; Requests is only needed to download pages from the web.

**How does BeautifulSoup work?**

1. **Loading HTML/XML:** The first step is loading the HTML content into BeautifulSoup.
    
2. **Creating the Soup Object:** BeautifulSoup parses the content and creates a "soup" object, which represents the document.
    
3. **Navigating and Searching:** The soup object allows you to navigate and search through the parsed document easily, using methods like `find()`, `find_all()`, `select()`, etc.
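
The three steps above can be sketched with a tiny inline document. The HTML string and the values in it are made up for illustration; they are not copied from the Matta site.

```python
from bs4 import BeautifulSoup

# Step 1: some HTML content (a made-up snippet, not the real page)
html = """
<div class="card-box">
  <a class="search-title" href="#">Example Travel Sdn Bhd</a>
  <span class="reg-number">12345-X</span>
</div>
<div class="card-box">
  <a class="search-title" href="#">Another Tours Sdn Bhd</a>
  <span class="reg-number">67890-Y</span>
</div>
"""

# Step 2: parse it into a soup object
soup = BeautifulSoup(html, "html.parser")

# Step 3: navigate and search
first_title = soup.find("a", class_="search-title").get_text(strip=True)  # first match only
all_cards = soup.find_all("div", class_="card-box")                        # every match
reg_numbers = [tag.get_text(strip=True) for tag in soup.select("span.reg-number")]  # CSS selector

print(first_title)
print(len(all_cards))
print(reg_numbers)
```

`find()` returns the first match, `find_all()` returns every match, and `select()` takes a CSS selector instead of a tag name and class.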
    

**Requests**: Makes HTTP requests in Python. We use it to fetch the HTML content of each page.

**Pandas**: A library that provides data analysis tools. We use it to store and manipulate the extracted data in tabular form, as in the example below:

![example of Panda module](https://cdn.hashnode.com/res/hashnode/image/upload/v1716271361235/6290ddb8-5a49-4026-9f09-fd66b6d13ca6.png align="center")

**OS**: A module for interacting with the operating system. We use it to work out the file path once the extracted data has been written to a CSV file.

### Defining the function

```python
def scrape_matta_members():
    base_url = "https://www.matta.org.my/members?page="
    members = []
```

`def scrape_matta_members()`: Defines the function that will do the scraping.

`base_url`: The web address of the page we want to scrape. The part before the `?` symbol identifies the page itself. The part after it is the query string, which carries parameters such as a search term or, in this case, the page number.
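
To see how such a URL breaks down, here is a small sketch using the standard library's `urllib.parse`. The page number `3` is just an example value.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://www.matta.org.my/members?page=3"  # page 3 chosen only as an example
parts = urlsplit(url)

host = parts.netloc            # the website
path = parts.path              # the page on that website
query = parse_qs(parts.query)  # everything after the '?', parsed into a dict

print(host)
print(path)
print(query)
```

The scraper itself does not need this; it only appends the page number to `base_url`. The sketch just makes the two halves of the address visible.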

`members = []`: Initializes an empty list that will store the member details later.

### Looping

```python
    for page in range(1, 50):
        url = f"{base_url}{page}"
        response = requests.get(url)
        if response.status_code != 200:
            print(f"Failed to load page {page}, status code: {response.status_code}")
            continue
```

`for page in range(1, 50)`: Loops from page 1 to page 49. The end value, 50, is not included.

`url = f"{base_url}{page}"`: Constructs the URL for each page by appending the page number to the base URL. `page` is an integer, and the f-string converts it to text.

`response = requests.get(url)`: Sends an HTTP GET request to the URL we just built.

`if response.status_code != 200`: Checks whether the request failed. Status code 200 means OK: the request was successfully received, understood, and accepted.

`print(f"Failed to load page {page}, status code: {response.status_code}")`: Prints this message if the page could not be loaded.

`continue`: Skips to the next iteration of the loop if the request failed.
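
The loop logic can be tried without touching the network. In the sketch below, the `fake_status` dictionary is a made-up stand-in for real responses, used only to show how `continue` skips a failed page.

```python
base_url = "https://www.matta.org.my/members?page="

# range(1, 50) starts at 1 and stops before 50
pages = list(range(1, 50))
urls = [f"{base_url}{page}" for page in pages]

print(pages[0], pages[-1], len(pages))
print(urls[0])
print(urls[-1])

# Hypothetical status codes instead of real HTTP responses
fake_status = {1: 200, 2: 500, 3: 200}
loaded = []
for page, status in fake_status.items():
    if status != 200:
        print(f"Failed to load page {page}, status code: {status}")
        continue  # skip the rest of this iteration
    loaded.append(page)

print(loaded)
```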

### Parse

```python
        soup = BeautifulSoup(response.content, 'html.parser')

        # Find the section containing the member details
        member_list = soup.find_all('div', class_='card-box')
        if not member_list:
            print(f"No members found on page {page}")
            continue
```

`soup = BeautifulSoup(response.content, 'html.parser')`: Parses the HTML content of the page using BeautifulSoup.

Parsing means analyzing the HTML document and breaking it down into a structure that is easy to navigate. It converts a string of HTML or XML code into a tree of Python objects.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1716274302569/5ec0f7a4-875f-46f3-a3dc-de84c7c846df.png align="center")

For `member_list = soup.find_all('div', class_='card-box')`, you need to inspect the website, for example with your browser's developer tools, to find the right class.

In this case, all of the details are inside `<div class="card-box">`, so we use it to find every `div` element that contains member details.

`if not member_list`: Checks if no member elements were found.

`print(f"No members found on page {page}")`: Prints a message if no members were found.

`continue`: Skips to the next iteration of the loop if no members were found.
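
Here is a small sketch of what `find_all` gives back in both situations. The two HTML strings and the `find_cards` helper are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up pages: one with member cards, one without
page_with_members = (
    '<div class="card-box">A</div>'
    '<div class="card-box featured">B</div>'  # extra class, still matches
)
page_without_members = '<div class="other">Nothing here</div>'


def find_cards(html):
    """Hypothetical helper: return every <div> that has the class 'card-box'."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find_all("div", class_="card-box")


for label, html in [("with", page_with_members), ("without", page_without_members)]:
    member_list = find_cards(html)
    if not member_list:  # an empty result is falsy
        print(f"{label}: no members found")
        continue
    print(f"{label}: {len(member_list)} members found")
```

When nothing matches, `find_all` returns an empty result rather than raising an error, which is why `if not member_list` works as a check.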

### Extract and Store Details

```python
        for member in member_list:
            name = member.find('a', class_='search-title').get_text(strip=True)
            reg_number = member.find('span', class_='reg-number').get_text(strip=True)
            contact_number = member.find('span', class_='contact-number').get_text(strip=True)
            web_address = member.find('span', class_='web-address').get_text(strip=True)
            location = member.find('span', class_='location').get_text(separator=", ", strip=True)

            members.append({
                'name': name,
                'reg_number': reg_number,
                'contact_number': contact_number,
                'web_address': web_address,
                'location': location
            })
        print(f"Page {page} scraped successfully.")
```

Iterates over each member element in `member_list` and extracts the details.

`name = member.find('a', class_='search-title').get_text(strip=True)`: Extracts the member's name.

`reg_number = member.find('span', class_='reg-number').get_text(strip=True)`: Extracts the registration number.

`contact_number = member.find('span', class_='contact-number').get_text(strip=True)`: Extracts the contact number.

`web_address = member.find('span', class_='web-address').get_text(strip=True)`: Extracts the web address.

`location = member.find('span', class_='location').get_text(separator=", ", strip=True)`: Extracts the location.

Appends the extracted details as a dictionary to the `members` list.
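
One thing to watch out for: `find()` returns `None` when an element is missing, so calling `.get_text()` on it would raise an `AttributeError`. A possible guard is sketched below. The `text_or_default` helper, the `"-"` default, and the HTML snippet are my own assumptions, not part of the original script.

```python
from bs4 import BeautifulSoup


def text_or_default(parent, tag, css_class, default="-", separator=""):
    """Return the stripped text of the first matching child, or `default` if absent."""
    element = parent.find(tag, class_=css_class)
    if element is None:
        return default
    return element.get_text(separator=separator, strip=True)


# Made-up card with no web-address span
html = """
<div class="card-box">
  <a class="search-title">Example Travel Sdn Bhd</a>
  <span class="location">No. 1, Jalan Contoh<br/>Kuala Lumpur</span>
</div>
"""
member = BeautifulSoup(html, "html.parser").find("div", class_="card-box")

name = text_or_default(member, "a", "search-title")
web_address = text_or_default(member, "span", "web-address")  # missing in this card
location = text_or_default(member, "span", "location", separator=", ")

print(name)
print(web_address)
print(location)
```

The `separator=", "` argument is what joins the pieces of text on either side of the `<br/>` into a single comma-separated line.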

### Convert the extracted data to DataFrame:

```python
    # Convert the list of dictionaries to a DataFrame and return it
    df = pd.DataFrame(members)
    return df
```

`df = pd.DataFrame(members)`: Converts the list of dictionaries to a Pandas DataFrame.

`return df`: Returns the DataFrame.
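
As a quick illustration of that conversion, here is a sketch with two made-up entries. Each dictionary becomes a row and each key becomes a column.

```python
import pandas as pd

# Made-up entries with only two of the five fields, to keep it short
members = [
    {"name": "Example Travel Sdn Bhd", "reg_number": "12345-X"},
    {"name": "Another Tours Sdn Bhd", "reg_number": "67890-Y"},
]

df = pd.DataFrame(members)

print(df.shape)          # (rows, columns)
print(list(df.columns))  # column names come from the dictionary keys
print(df)
```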

### Defining the Save Function

```python
def save_to_csv(df, filename):
    df.to_csv(filename, index=False)
    full_path = os.path.abspath(filename)
    print(f"Data successfully saved to {full_path}")
```

`def save_to_csv(df, filename)`: Defines the save function, which takes the DataFrame `df` and a `filename` as arguments.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1716277538755/352e4a07-827d-4772-9a9b-a3bcdf44da6e.png align="center")

`df.to_csv(filename, index=False)` saves the DataFrame to a CSV file with the specified filename, without including the DataFrame index.

If I set `index=True`, the CSV gets an extra leading column holding the row index and looks like this:

```excel
,name,reg_number,contact_number,web_address,location
0,A'Famosa Travel & Tours Sdn. Bhd.,456717-K | MA1999,+60 (06) 5520288,www.afamosa.com,"Club House Building,A'Famosa Resort, Jalan Kemus, Simpang Empat, Alor Gajah, 78000, Melaka, Malaysia"
1,Afbatni Travel & Services Sdn Bhd,1376398-M | MA6643,+60 (03332) 33272,-,"No. 36A, Tingkat 1, Lorong Bayu Tinggi 4c, Taman Bayu Tinggi, Klang, 41200, Selangor, Malaysia"
```
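
To compare the two settings yourself, here is a small sketch on a one-row DataFrame with made-up values. When `to_csv` is called without a path, it returns the CSV text instead of writing a file, which makes the difference easy to print.

```python
import pandas as pd

df = pd.DataFrame([{"name": "Example Travel Sdn Bhd", "reg_number": "12345-X"}])

# No path given, so to_csv returns the CSV as a string
without_index = df.to_csv(index=False).splitlines()
with_index = df.to_csv(index=True).splitlines()

print(without_index)
print(with_index)
```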

`full_path = os.path.abspath(filename)` takes a relative path and converts it to an absolute path. `full_path` is the variable that holds the absolute path of the file.

`print(f"Data successfully saved to {full_path}")` prints a success message showing where the file was saved.
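
A quick way to see what `os.path.abspath` does is shown below. The printed path depends on the folder you run the script from.

```python
import os

filename = "matta_members.csv"
full_path = os.path.abspath(filename)  # resolved against the current working directory

print(os.getcwd())
print(full_path)
```

The file does not have to exist yet. `abspath` only combines the current working directory with the relative name.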

### Scrape Data

```python
# Run the function and get the dataframe
df = scrape_matta_members()
```

`df = scrape_matta_members()`: Calls the `scrape_matta_members` function as mentioned earlier and stores the returned DataFrame in the variable `df`.

### Save Data to CSV

```python
# Save the dataframe to a CSV file
csv_filename = 'matta_members.csv'
save_to_csv(df, csv_filename)
```

* `csv_filename = 'matta_members.csv'`: Sets the name of the CSV file.
    
* `save_to_csv(df, csv_filename)`: Calls the `save_to_csv` function as defined before to save the DataFrame to the specified CSV file.
    

That's all. Leave a comment if you need any other help.

### **Full code:**

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os


def scrape_matta_members():
    base_url = "https://www.matta.org.my/members?page="
    members = []

    for page in range(1, 50):  # There are 49 pages; loop over each one
        url = f"{base_url}{page}"
        response = requests.get(url)
        if response.status_code != 200:
            print(f"Failed to load page {page}, status code: {response.status_code}")
            continue

        soup = BeautifulSoup(response.content, 'html.parser')

        # Find the section containing the member details
        member_list = soup.find_all('div', class_='card-box')
        if not member_list:
            print(f"No members found on page {page}")
            continue

        # Iterate over each member entry and extract details
        for member in member_list:
            name = member.find('a', class_='search-title').get_text(strip=True)
            reg_number = member.find('span', class_='reg-number').get_text(strip=True)
            contact_number = member.find('span', class_='contact-number').get_text(strip=True)
            web_address = member.find('span', class_='web-address').get_text(strip=True)
            location = member.find('span', class_='location').get_text(separator=", ", strip=True)

            members.append({
                'name': name,
                'reg_number': reg_number,
                'contact_number': contact_number,
                'web_address': web_address,
                'location': location
            })
        print(f"Page {page} scraped successfully.")

    # Convert to DataFrame
    df = pd.DataFrame(members)
    return df


def save_to_csv(df, filename):
    df.to_csv(filename, index=False)
    full_path = os.path.abspath(filename)
    print(f"Data successfully saved to {full_path}")


# Run the function and get the dataframe
df = scrape_matta_members()

# Save the dataframe to a CSV file
csv_filename = 'matta_members.csv'
save_to_csv(df, csv_filename)
```

%[https://github.com/ahmadafif5321/Python/blob/main/webScrapping-%20Matta%20Members.py]
