Python Web Scraping (hallogsm.com) With BeautifulSoup and Requests to CSV file

Farid Winarto
Apr 7, 2021

This article shows how to scrape product data from a website using Python. Generally, web scraping involves accessing numerous websites and collecting data from them. However, we can also limit ourselves to collecting a large amount of information from a single source and using it as a dataset.

Web scraping is a technique for extracting large amounts of data from websites, whereby the data is extracted and saved to a local file on your computer or to a database in table (spreadsheet) format.

STEP 1: Preparation

You can skip this step if your new, blank project already has the required Python libraries installed: Requests and BeautifulSoup. The csv module is part of Python's standard library, so it does not need to be installed.

pip install requests
pip install bs4
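
To verify that everything is in place (a quick check of my own; csv ships with the standard library), you can print the installed versions:

import requests, bs4
print(requests.__version__, bs4.__version__)  # confirms both libraries import correctly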

STEP 2: Import The Packages

First, we import the packages we will use so that the project runs properly.

from bs4 import BeautifulSoup
import requests
import csv

STEP 3: Get the Response Text

To get the response text, we must know the URL that we are targeting for scraping:

www.hallogsm.com

After that, pass the URL to the requests library's get function. I will also show you how to build the URL from a search key, because here we will look up a phone type and loop over a number of pages. We send a user-agent header so the request is still treated as coming from a browser when scraping the data.

key = input('please enter the term:')
url = 'https://www.hallogsm.com/hp/{}/'.format(key)
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
}

for page in range(1, 5):
    req = requests.get(url + 'page/' + str(page), headers=headers)

Then, print the response to confirm that the request went through:

    print(req)

According to Wikipedia's list of HTTP status codes, a 2xx status means the request was received, understood, and accepted successfully.
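
As a small optional safeguard (my own addition, not from the original code), you can check the status code explicitly before parsing:

    # Only continue when the server answered with HTTP 200 (OK)
    if req.status_code != 200:
        print('Request failed with status', req.status_code)
        continue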

After that, parse the response text into a BeautifulSoup object.

soup = BeautifulSoup(req.text, 'html.parser')
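
To quickly confirm that parsing worked, you can print the page title (a sanity check of my own; it assumes the page has a title tag):

print(soup.title.text)  # should show the hallogsm page title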

STEP 4: Get the Product Elements

If you inspect the page, every product sits inside a ‘div’ tag with the class ‘aps-product-box’. Because all products share the same tag and class, we will use the “findAll” function from the BeautifulSoup module to retrieve all of the product elements.

items = soup.findAll('div', 'aps-product-box')
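
A quick way to verify that the selector matched anything (again, my own sanity check) is to count the elements found:

print(len(items))  # number of product boxes found on this page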

STEP 5: Looping Over the Items with a “for” Loop

Then we will parse the data contained in each product element by iterating with a ‘for’ loop. The data that I take in this article includes the product name, the price, and the image link, plus a cleaned-up alt text that can serve as an image filename. Each field sits in its own tag and class, so we call the ‘find’ function on every product element. You can extend this yourself by adding data variables of your own using the find('<tag>', {'<attribute name>': '<attribute value>'}) function. Use the string strip method to remove the whitespace characters at the front and back of the string.

for item in items:
    nama = item.find('h2', 'aps-product-title').text.strip()
    print(nama)
    harga = item.find('span', 'aps-price-value').text
    print(harga)
    link_image = item.find('div', 'aps-product-thumb').find('img')['data-lazy-src']
    print(link_image)
    alt_item = item.find('div', 'aps-product-thumb').find('img')['alt']
    alt_item = str(alt_item).replace(' ', '-').replace('/', '').replace('*', '') + '.jpg'
    print(alt_item)
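
Note that find returns None when an element is missing, so the .text access would then raise an AttributeError. Here is a minimal defensive sketch (the helper name safe_text is my own invention, not part of the original code):

def safe_text(parent, tag, cls, default=''):
    # Return the stripped text of a child element, or a default when it is absent
    el = parent.find(tag, cls)
    return el.text.strip() if el else default

With it, nama = safe_text(item, 'h2', 'aps-product-title') will not crash on a malformed product box.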

Congratulations, you now have the data you want from one page. If you are going to scrape data from many pages and collect thousands of rows, let's look at the next step.

STEP 6: Pagination

To do pagination, we loop over the pages within a certain range, for example from page 1 to 4, and we wrap all the scraping code in that “for” loop so that the data from each page is read properly. On the hallogsm site there are 14 pages, so note that range(1, 5) covers only pages 1 through 4; use range(1, 15) if you want them all.

for page in range(1, 5):
    req = requests.get(url + 'page/' + str(page), headers=headers)
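
If you do scrape all 14 pages, it is courteous to pause briefly between requests. A minimal sketch (the one-second delay and the time import are my own additions, not from the original article):

import time

for page in range(1, 15):  # pages 1 through 14
    req = requests.get(url + 'page/' + str(page), headers=headers)
    time.sleep(1)  # brief pause so we do not overload the server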

STEP 7: Append the Data

Create a variable that contains an empty list, then append the scraped data into it. For example, I named the variable “datas” and placed it before the loop.

datas = []

for page in range(1, 5):
    req = requests.get(url + 'page/' + str(page), headers=headers)
    print(req)
    soup = BeautifulSoup(req.text, 'html.parser')
    items = soup.findAll('div', 'aps-product-box')

Then I will append the data that we have scraped into the variable. Put the append call inside the item loop.

    for item in items:
        # ... the name, price, and image extraction from STEP 5 goes here ...
        alt_item = str(alt_item).replace(' ', '-').replace('/', '').replace('*', '') + '.jpg'
        print(alt_item)
        datas.append([nama, harga, link_image])
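
After the loops finish, a quick count (my own check) shows how many rows were collected:

print(len(datas), 'rows collected')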

STEP 8: Save to CSV File

After we have successfully scraped the data, we can save it in CSV file format using the csv library's writer; see the csv module documentation for the complete guide.

head = ['Nama', 'Harga', 'Link Gambar']
with open('handphonelistsamsung.csv', 'w', newline='') as f:
    write = csv.writer(f)
    write.writerow(head)
    for d in datas:
        write.writerow(d)
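
To double-check the output (a verification step of my own), you can read the file back with csv.reader:

# Read the CSV back and print the first few rows as a sanity check
with open('handphonelistsamsung.csv', newline='') as f:
    for i, row in enumerate(csv.reader(f)):
        print(row)
        if i >= 2:
            break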

Check the scraped CSV file. Finished, good luck :)

I hope you can benefit from this article and develop it according to your own needs and imagination. I apologize for any mistakes. Thank you for your kind attention, guys. Stay tuned for my next articles! :)

You can find the code on GitHub here.
