Web Scraping OLX using Python requests, BeautifulSoup and pandas

We are going to scrape the OLX website for car listings, collecting each vehicle's name, price, model year, kilometres driven and seller location using requests and BeautifulSoup. We will then save the collected information to a CSV file using pandas.

What is web scraping:

Web scraping is the process of collecting data from websites. The data can later be exported to other file formats such as CSV or Excel, and is useful for data analysis and plotting. Web scraping can be done manually or with automated tools. Automated tools are generally preferred because they are faster and can be less costly.

Working of a web scraper:

First, the web scraper is given one or more URLs as input. The scraper then loads the entire HTML code of the page at each URL. Next, it extracts either all the data on the page or only the specific data the user has selected as useful for the project.

Lastly, the web scraper outputs the collected data in a format that is more useful to the user. Most web scrapers output data to a CSV file or an Excel spreadsheet.
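
The load, extract and output steps can be sketched on a tiny in-memory page, so nothing is downloaded. This is only an illustration: the HTML, the class names "listing", "title" and "price", and the helper names extract_rows and rows_to_csv are all made up for the example, and the standard csv module stands in for the pandas export used later in this post.

```python
import csv
import io

from bs4 import BeautifulSoup

# Stand-in for a downloaded page; the markup and class names are invented.
SAMPLE_HTML = """
<div class="listing"><span class="title">Maruti Swift</span><span class="price">450000</span></div>
<div class="listing"><span class="title">Hyundai i20</span><span class="price">525000</span></div>
"""


def extract_rows(html):
    """Parse the HTML and pull one dict per listing (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for listing in soup.find_all("div", class_="listing"):
        title = listing.find("span", class_="title").get_text(strip=True)
        price = listing.find("span", class_="price").get_text(strip=True)
        rows.append({"title": title, "price": int(price)})
    return rows


def rows_to_csv(rows):
    """Write the extracted rows as CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "price"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_HTML)
csv_text = rows_to_csv(rows)
print(csv_text)
```

In a real scraper the SAMPLE_HTML string would be replaced by the body of an HTTP response.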

Libraries used:

1. requests
2. pandas
3. BeautifulSoup

About Libraries:

requests:    requests is used to send HTTP requests from Python. We use its get method to send a GET request to the URL passed to it. requests can be installed using pip install requests

pandas:    pandas is used for operations such as data cleansing, data filling, data normalization, merges and joins, data visualization, statistical analysis, data inspection, loading and saving data, and much more. pandas can be installed using
pip install pandas

BeautifulSoup:    BeautifulSoup is a Python library used for extracting data from HTML and XML files. BeautifulSoup is simple and great for small-scale web scraping.
If you are interested in scraping data at a larger scale, you should consider alternatives such as Scrapy or public APIs. BeautifulSoup can be installed using pip install beautifulsoup4 and imported using
from bs4 import BeautifulSoup
The code in this post passes 'lxml' as the parser, which additionally requires pip install lxml

Methods used:

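get: The get method is used to request data from the server. It returns a Response object; the HTTP status code of the request is available as its status_code attribute, and the page body as its content or text attribute.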
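
As a rough sketch of how get can be used a little more defensively, the helper below adds a timeout and raises an error on a failed status code. The name fetch_html and the 10 second timeout are assumptions for this example, not part of the original code. The second half builds the same GET request without sending it, just to show the URL that requests would ask for.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text


# Build (but do not send) the GET request to inspect the final URL.
prepared = requests.Request(
    "GET",
    "https://www.olx.in/items/q-cars",
    params={"isSearchCall": "true"},
).prepare()
print(prepared.method, prepared.url)
```

Calling fetch_html(url) would then return the page source, ready to be passed to BeautifulSoup.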

findAll(): The findAll() method traverses the HTML tree, starting at the given point, and finds all the tags that match the tag name and attributes you give. In current versions of BeautifulSoup the same method is spelled find_all(); findAll() is the older name.
Example: soup.findAll("span") will return all "span" tags in the HTML code.
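
Here is a small illustration of searching by tag name alone and by tag name plus a class attribute. The HTML snippet and the class names "price" and "title" are invented for the example; the real OLX class names are the generated ones used in the code further down and may change over time.

```python
from bs4 import BeautifulSoup

# Invented markup for illustration only.
html = """
<ul>
  <li><span class="price">₹ 4,50,000</span><span class="title">Maruti Swift</span></li>
  <li><span class="price">₹ 5,25,000</span><span class="title">Hyundai i20</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Every <span> tag, regardless of class.
all_spans = soup.find_all("span")

# Only the <span> tags whose class is "price".
price_spans = soup.find_all("span", attrs={"class": "price"})
prices = [span.get_text(strip=True) for span in price_spans]

print(len(all_spans), prices)
```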

DataFrame: DataFrame is used to create a 2-dimensional data structure with rows and columns. A DataFrame can be built from a dictionary of lists, where each list becomes one column.
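
For example, a DataFrame with the same column names as the scraper's output can be built from a few lists. The values below are made up for illustration and are not real listings.

```python
import pandas as pd

# Made-up sample values; each list becomes one column.
names = ["Maruti Swift", "Hyundai i20"]
city = ["Mumbai", "Pune"]
year = [2015, 2017]
km = [45000, 30000]
price = [450000, 525000]

df = pd.DataFrame(
    {"Names": names, "City": city, "year": year, "Distance in km": km, "Price": price}
)
print(df)
```

All the lists must have the same length, otherwise pandas raises a ValueError. This matters for a scraper, because a listing with a missing field would leave one list shorter than the others.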

to_csv: to_csv is used to write a DataFrame to a file in CSV format. It takes arguments such as the location to save to, index (whether to write the row index) and the encoding type.

read_csv: read_csv is used to read a CSV file into a DataFrame. It takes the CSV file name as its argument.
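
The two methods can be tried together as a round trip. To keep this sketch from touching the disk, to_csv is called without a path (it then returns the CSV as a string) and read_csv reads from an in-memory buffer; with a real file you would pass a file name such as "Olx.csv" to both. The sample values are made up.

```python
import io

import pandas as pd

df = pd.DataFrame({"Names": ["Maruti Swift", "Hyundai i20"], "Price": [450000, 525000]})

# index=False leaves out the row numbers, so only the two columns are written.
csv_text = df.to_csv(index=False)
header = csv_text.splitlines()[0]

# Read the CSV text back into a new DataFrame.
restored = pd.read_csv(io.StringIO(csv_text))

print(header)
print(restored)
```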

Code:

import requests
import pandas as pd
from bs4 import BeautifulSoup

# the CSV file will contain car model, city, year, km driven and price
url = "https://www.olx.in/items/q-cars?isSearchCall=true"
cars = requests.get(url)
soup = BeautifulSoup(cars.content, "lxml")

# price
price = soup.findAll("span", attrs={"class": "_89yzn"})
cost = []
for x in price:
    cost.append(x.get_text())

# car model
name = soup.findAll("span", attrs={"class": "_2tW1I"})
model = []
for y in name:
    model.append(y.get_text())

# km and year (both are in the same span)
km = soup.findAll("span", attrs={"class": "_2TVI3"})
year = []
for z in km:
    year.append(z.get_text())

# address
loc = soup.findAll("span", attrs={"class": "tjgMj"})
address = []
for a in loc:
    address.append(a.get_text())

# split each "year - km" string into two integers
yc = []
km = []
for xy in year:
    temp1, temp2 = xy.split("- ")
    yc.append(int(float(temp1)))
    # drop the trailing "km" and the thousands separators
    km.append(int(temp2[:-2].replace(",", "")))

# strip the rupee sign and commas from the price
mrp = []
for mn in cost:
    yo = mn.replace("₹", "").replace(",", "")
    mrp.append(int(yo))

# keep only the part of the title before the first comma
names = []
for title in model:
    names.append(title.split(",")[0])

# the city is the second comma-separated part of the address
city = []
for ab in address:
    city.append(ab.split(",")[1])

df = pd.DataFrame({"Names": names, "City": city, "year": yc, "Distance in km": km, "Price": mrp})
df.to_csv("Olx.csv", index=False, encoding="utf-8")

# read the file back and show rows 3 and 4
data = pd.read_csv("Olx.csv")
print(data.iloc[3:5])
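
The cleaning loops in the code assume every listing follows exactly the same text format, so a single unusual listing stops the script with an error. Below is one possible, more forgiving way to do the same cleaning. The helper names parse_price, parse_year_km and parse_city are hypothetical, and the sample formats ("₹ 4,50,000", "2015 - 45,000 km", "Andheri West, Mumbai") are guesses based on how the code splits the strings, not confirmed OLX output.

```python
import re


def parse_price(text):
    """Turn a label like '₹ 4,50,000' into an int, or None if it has no digits."""
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None


def parse_year_km(text):
    """Split a label like '2015 - 45,000 km' into (year, km), or (None, None)."""
    match = re.match(r"\s*(\d{4})\s*-\s*([\d,]+)\s*km", text)
    if match is None:
        return None, None
    year = int(match.group(1))
    km = int(match.group(2).replace(",", ""))
    return year, km


def parse_city(text):
    """Return the second comma-separated part, or the whole text if there is no comma."""
    parts = [part.strip() for part in text.split(",")]
    return parts[1] if len(parts) > 1 else parts[0]


print(parse_price("₹ 4,50,000"))
print(parse_year_km("2015 - 45,000 km"))
print(parse_city("Andheri West, Mumbai"))
```

Values that cannot be parsed come back as None, which pandas stores as missing data in the CSV instead of raising an error.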


Output: