Integrating WebScrapingAPI with BeautifulSoup in Python

BeautifulSoup is a popular Python library for parsing HTML and XML files. For a web scraping project, it's useful for turning unreadable strings of HTML into clean, structured data. In this guide, we'll show you how to integrate BeautifulSoup with WebScrapingAPI.

A basic example of integrating WebScrapingAPI with BeautifulSoup

Our objective, in this case, is to scrape the Wikipedia article on web scraping and get two things: a clear picture of the page's HTML structure and the page's paragraphs, parsed for readability.

Here's the code:

import requests
from bs4 import BeautifulSoup

ENDPOINT = "https://api.webscrapingapi.com/v1/"

# requests URL-encodes query parameters itself, so pass the plain URL here;
# a pre-encoded URL would be encoded a second time.
params = {
    "api_key": "YOUR_API_KEY",
    "url": "https://en.wikipedia.org/wiki/Web_scraping",
}

# Fetch the page through the API and parse the returned HTML
page = requests.get(ENDPOINT, params=params)
page_soup = BeautifulSoup(page.content, "html.parser")
page_body = page_soup.find("body")

# Remove every script element from the body
for s in page_body.select("script"):
    s.extract()

# Save the indented HTML of the body
with open("cleanpage.txt", "w", encoding="utf-8") as scraped_data_file:
    scraped_data_file.write(page_body.prettify())

# Save each paragraph element on its own line
paragraphs = page_soup.find_all("p")

with open("cleanparagraphs.txt", "w", encoding="utf-8") as paragraphs_file:
    for paragraph in paragraphs:
        paragraphs_file.write(str(paragraph) + "\n")

Let's look at what the script does, step by step:

  1. Defines the endpoint (https://api.webscrapingapi.com/v1/) and the basic parameters you'll need to use the API: your API key and the URL of the page you want to scrape. The URL travels in the query string in percent-encoded form (https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FWeb_scraping); requests applies that encoding to query parameters automatically.
  2. Sends a GET request to the API, which returns the HTML of the Wikipedia page.
  3. Uses BeautifulSoup to parse the returned HTML and saves the body of the page in "page_body".
  4. Finds every script element in the body and removes it.
  5. Indents the HTML with the prettify() function and saves the result in a text file named cleanpage.txt.
  6. Finds all paragraph (p) elements on the page and writes them, one per line, into a text file named cleanparagraphs.txt.
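
Step 1 is where encoding mistakes tend to creep in, so here is a minimal sketch of how the request URL is assembled. It uses only the standard library's urllib.parse.urlencode as a stand-in for the percent-encoding that requests applies to params; the helper name build_request_url is hypothetical and not part of WebScrapingAPI or requests.

```python
from urllib.parse import urlencode, parse_qs

ENDPOINT = "https://api.webscrapingapi.com/v1/"
TARGET = "https://en.wikipedia.org/wiki/Web_scraping"


def build_request_url(api_key, target_url):
    """Return the endpoint followed by percent-encoded query parameters.

    Hypothetical helper, for illustration only.
    """
    query = urlencode({"api_key": api_key, "url": target_url})
    return f"{ENDPOINT}?{query}"


# Passing the plain URL yields the encoded form exactly once.
request_url = build_request_url("YOUR_API_KEY", TARGET)
print(request_url)

# Decoding the query string recovers the original target URL.
decoded = parse_qs(request_url.split("?", 1)[1])
print(decoded["url"][0])

# Passing an already encoded URL encodes the percent signs again (%3A -> %253A).
double_encoded = build_request_url(
    "YOUR_API_KEY", "https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FWeb_scraping"
)
print(double_encoded)
```

The first print shows the target URL encoded as https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FWeb_scraping, while the last one shows what happens when an encoded string is encoded a second time, which is usually not what you want to send.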

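Save the code in a Python file, let's say beautifulsoupexample.py. To use it, run the following command in a terminal, such as the one built into your IDE (py is the Windows launcher; on macOS or Linux, use python3 instead):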

py beautifulsoupexample.py

After running the code, you will have two files - one containing indented HTML similar to what you'd see with "inspect element", and one containing each of the page's paragraphs on its own line (still wrapped in their p tags), which is easier to read than the full page.
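
If you would rather have plain text than paragraph elements with their tags, BeautifulSoup's get_text() method can strip the markup. The sketch below is one possible variation, not part of the script above: SAMPLE_HTML is a made-up stand-in for the HTML the API returns, and paragraphs_as_text is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML returned by the API
SAMPLE_HTML = """
<html><body>
<p>Web scraping is <b>data scraping</b> used for extracting data from websites.</p>
<p>   </p>
<p>It can be done manually or with a <a href="/wiki/Bot">bot</a>.</p>
</body></html>
"""


def paragraphs_as_text(html):
    """Return the text of every non-empty p element, with tags removed."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for paragraph in soup.find_all("p"):
        # get_text() concatenates the text of the element and its children
        text = paragraph.get_text().strip()
        if text:  # skip paragraphs that contain only whitespace
            texts.append(text)
    return texts


for line in paragraphs_as_text(SAMPLE_HTML):
    print(line)
```

In the main script, the same idea would mean writing paragraph.get_text().strip() to cleanparagraphs.txt instead of str(paragraph).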
