Web scraping Python BeautifulSoup
In this post, you will learn about BeautifulSoup, a Python web scraping tool.
Web scraping is the process of extracting data from websites. The extracted data can be content, URLs, contact information, etc., which we can store in a local file or database. Scraping can be done manually, with code called a scraper, or with automated software such as a bot or web crawler. Web scraping is not always legal or permitted. Some sites disallow scraping in their 'robots.txt' file. Some popular sites provide APIs to access their data in a structured way, but not all websites do. For the rest, we need a web scraper to extract the data, mine it and store it in a structured way.
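As a rough illustration of the 'robots.txt' check mentioned above, the following sketch uses Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs and the is_allowed helper are all made up for illustration, and the rules are inlined so that the sketch runs without a network connection.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file content as an iterable of lines
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URLs on a placeholder domain
allowed_public = is_allowed(ROBOTS_TXT, "https://example.com/products")
allowed_private = is_allowed(ROBOTS_TXT, "https://example.com/private/data")

print(allowed_public)   # allowed by the rules above
print(allowed_private)  # blocked by the Disallow rule
```

For a live site, the same parser can load the real file with set_url() followed by read() instead of parse(). A passing check only reflects the site's crawling rules; it does not settle the legal question.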
Python is one of the most popular programming languages for web scraping. It provides many libraries that handle web crawling tasks smoothly. BeautifulSoup is among the most widely used of them.
The BeautifulSoup library makes it easy to scrape information from HTML or XML files. Beautiful Soup 4, imported as bs4, works on Python 3. It supports third-party parsers like html5lib and lxml in addition to Python's built-in html.parser, and lxml is generally the fastest of these. The following command installs the BeautifulSoup module using the pip tool.
pip install bs4
On successful installation, it returns the following -
Successfully built bs4
Installing collected packages: soupsieve, beautifulsoup4, bs4
Successfully installed beautifulsoup4-4.8.2 bs4-0.0.1 soupsieve-2.0
For web scraping, we first need to import this module and pass the fetched page content to BeautifulSoup to create a soup object. The library provides the find_all method to filter data from the web content.
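The steps above can be tried without fetching anything. This is a minimal sketch that builds a soup object from a hand-written HTML string and filters it with find_all. The HTML and its 'name' and 'price' class names are invented for illustration and do not come from any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the content of a fetched page
HTML = """
<html><body>
  <h1>Watches</h1>
  <span class="name">Watch A</span>
  <span class="price">100</span>
  <span class="name">Watch B</span>
</body></html>
"""

# Create the soup object using Python's built-in parser
soup = BeautifulSoup(HTML, "html.parser")

# find_all with only a tag name returns every matching tag
all_spans = soup.find_all("span")

# Adding class_ narrows the result to tags that carry that CSS class
names = [tag.get_text(strip=True) for tag in soup.find_all("span", class_="name")]

print(len(all_spans))
print(names)
```

The keyword is spelled class_ with a trailing underscore because class is a reserved word in Python.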
Python BeautifulSoup: Scrape the smartwatches from Amazon
Suppose we want to get the names of all the smartwatches, which are in 'span' tags on the requested URL. The code will be -
import requests
from bs4 import BeautifulSoup

# Search results page to scrape
URL = 'https://www.amazon.in/s?k=smartwatch&ref=nb_sb_noss'
page = requests.get(URL)

# Parse the returned HTML with Python's built-in parser
soup = BeautifulSoup(page.content, 'html.parser')

# The product names are in span tags with this class attribute
titles = soup.find_all('span', class_='a-size-medium a-color-base a-text-normal')
for title in titles:
    print(title.text.strip())
The above code prints the name of each smartwatch it finds, one per line.
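Since the extracted data is usually stored in a local file, here is a hedged sketch of that last step. It separates parsing from fetching so it can run on a hand-written sample. SAMPLE_HTML, extract_titles and titles_to_csv are hypothetical names made up for this example, and the markup only imitates the class names used in the script above. It also swaps find_all for a CSS selector through select, which matches the three classes in any order, whereas a multi-class string passed to class_ is compared against the exact attribute value.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup that imitates the class names used in the script above
SAMPLE_HTML = """
<div>
  <span class="a-size-medium a-color-base a-text-normal"> Watch One </span>
  <span class="a-text-normal a-size-medium a-color-base">Watch Two</span>
  <span class="a-size-medium">Not a title</span>
</div>
"""


def extract_titles(html):
    """Return the text of every span that carries all three title classes."""
    soup = BeautifulSoup(html, "html.parser")
    spans = soup.select("span.a-size-medium.a-color-base.a-text-normal")
    return [span.get_text(strip=True) for span in spans]


def titles_to_csv(titles):
    """Serialize the titles as CSV text with a single 'title' column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["title"])
    writer.writerows([title] for title in titles)
    return buffer.getvalue()


titles = extract_titles(SAMPLE_HTML)
csv_text = titles_to_csv(titles)
print(csv_text)
```

To write a real file, the same csv.writer can be pointed at a file opened with open('titles.csv', 'w', newline='') instead of the in-memory buffer.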