Integrating WebScrapingAPI with BeautifulSoup in Python
BeautifulSoup is a popular Python library for parsing HTML and XML files. In a web scraping project, it's useful for turning unreadable strings of HTML into clean, structured data. In this guide, we'll show you how to integrate BeautifulSoup with WebScrapingAPI.
A basic example of integrating WebScrapingAPI with BeautifulSoup
Our objective, in this case, is to scrape the Wikipedia page for web scraping and get two things: a clear picture of the page's HTML structure and the page's paragraphs, parsed for readability.
Here's the code:
import requests
from bs4 import BeautifulSoup

ENDPOINT = "https://api.webscrapingapi.com/v1/"

params = {
    "api_key": "YOUR_API_KEY",
    "url": "https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FWeb_scraping"
}

# Request the page through the API
page = requests.request("GET", ENDPOINT, params=params)

# Parse the response and isolate the <body> element
page_soup = BeautifulSoup(page.content, 'html.parser')
page_body = page_soup.find('body')

# Remove every <script> element from the body
for s in page_body.select('script'):
    s.extract()

# Save the indented HTML; "with" closes the file once writing is done
with open("cleanpage.txt", "w", encoding="utf-8") as scraped_data_file:
    scraped_data_file.write(page_body.prettify())

# Save the paragraphs, separated with new lines
paragraphs = page_soup.find_all('p')
with open("cleanparagraphs.txt", "w", encoding="utf-8") as paragraphs_file:
    for paragraph in paragraphs:
        paragraphs_file.write(str(paragraph) + "\n")
Let's look at what the script does, step by step:
- Defines the endpoint (https://api.webscrapingapi.com/v1/) and the two basic parameters you'll need to use the API: your API key and the encoded URL of the page you want to scrape (https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FWeb_scraping).
- Sends a GET request to the API, which returns the HTML of the Wikipedia page.
- Uses BeautifulSoup to parse the extracted data and save the body of the page in "page_body".
- Checks for any scripts in the HTML of the body, removing any it finds.
- Indents the HTML code with the prettify() function and saves the results in a text file named cleanpage.txt.
- Copies all paragraphs on the page and pastes the results, separated with new lines, into a text file named cleanparagraphs.txt.
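The encoded URL in the first step can be produced programmatically instead of by hand. The following is a minimal sketch using only the standard library; the helper name encode_target is hypothetical and not part of WebScrapingAPI or BeautifulSoup.

```python
from urllib.parse import quote, unquote

# The page we want to scrape, in its ordinary, unencoded form.
TARGET_URL = "https://en.wikipedia.org/wiki/Web_scraping"


def encode_target(url):
    """Percent-encode every reserved character, including ':' and '/'.

    Hypothetical helper for illustration only.
    """
    return quote(url, safe="")


encoded = encode_target(TARGET_URL)
print(encoded)

# Decoding gives back the original address.
print(unquote(encoded) == TARGET_URL)
```

Keep in mind that requests applies its own percent-encoding to values passed through params, so it is worth checking which form of the URL the API expects before combining the two.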
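Save the code in a Python file, for example beautifulsoupexample.py. To use it, run the following command in a terminal, such as the one built into your IDE (py is the Windows launcher; on macOS or Linux, use python3 instead):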
py beautifulsoupexample.py
After running the code, you will have two files: one containing indented HTML similar to what you'd see with "inspect element", and one containing the page's paragraphs, formatted so that they're easier for people to read.
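Because the script writes str(paragraph), the paragraphs file still contains the <p> tags and any inline markup. If you only want the text, BeautifulSoup's get_text() method strips the tags. Here is a minimal sketch of that variation; it runs on a small inline HTML sample (an assumption standing in for the API response), and the function name paragraphs_as_text is hypothetical.

```python
from bs4 import BeautifulSoup

# Inline sample standing in for the HTML returned by the API (assumption).
SAMPLE_HTML = """
<html><body>
<script>console.log("tracking");</script>
<p>Web scraping is <b>data scraping</b> used for extracting data from websites.</p>
<p>It can be done <a href="#">manually</a> or with a bot.</p>
</body></html>
"""


def paragraphs_as_text(html):
    """Return the text of each <p> element with its tags removed."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text() for p in soup.find_all("p")]


for text in paragraphs_as_text(SAMPLE_HTML):
    print(text)
```

To apply the same idea in the script, you could write paragraph.get_text() to the file instead of str(paragraph).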