Integrating WebScrapingAPI with Scrapy in Python
Scrapy is a popular high-level framework for crawling websites and extracting structured data from them. In this guide, we'll show you how to integrate this powerful framework with WebScrapingAPI so you can combine Scrapy's spider-building capabilities with our API's data extraction features.
A basic example of integrating WebScrapingAPI with Scrapy
Let's say you want to use the Scrapy framework to send a request through the API to http://httpbin.org/ip. httpbin is a good example here because a successful request returns the client's IP address. Since WebScrapingAPI automatically rotates proxies, you should get a different IP back each time you execute the code.
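For reference, a successful response from http://httpbin.org/ip is a small JSON document like the one below (the address shown is only a placeholder):

{
  "origin": "203.0.113.42"
}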
Here's what that code looks like:
import scrapy


class BasicSpider(scrapy.Spider):
    name = 'basic'
    # Include the API's domain too, so Scrapy doesn't filter
    # the request to it as off-site.
    allowed_domains = ['httpbin.org', 'api.webscrapingapi.com']
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        # Save the raw response body; 'with' ensures the file is closed.
        with open('scrapypage.txt', mode='wb') as file:
            file.write(response.body)
        print("Scrape done. Check the result in scrapypage.txt file.")

    def start_requests(self):
        # Percent-encoded version of the target URL (https://httpbin.org/ip)
        url = 'https%3A%2F%2Fhttpbin.org%2Fip'
        scraper_url = f'https://api.webscrapingapi.com/v1/?api_key=YOUR_API_KEY&url={url}'
        yield scrapy.Request(url=scraper_url, callback=self.parse)
As you can see, requests go to the https://api.webscrapingapi.com/v1/ endpoint with two query parameters: api_key, which holds your personal key, and url, which holds the percent-encoded version of the URL you want to scrape (https%3A%2F%2Fhttpbin.org%2Fip in this case). In this example, the results are saved to the scrapypage.txt file.
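If you'd rather not hard-code the encoded URL, you can let Python's standard library do the encoding for you. Here's a minimal sketch (the variable names and the YOUR_API_KEY placeholder are just for illustration):

from urllib.parse import quote

API_KEY = 'YOUR_API_KEY'  # replace with your personal key
target_url = 'https://httpbin.org/ip'

# quote() percent-encodes the target URL so it can be passed safely as
# the value of the url= query parameter; safe='' also encodes slashes.
scraper_url = f'https://api.webscrapingapi.com/v1/?api_key={API_KEY}&url={quote(target_url, safe="")}'

print(scraper_url)
# https://api.webscrapingapi.com/v1/?api_key=YOUR_API_KEY&url=https%3A%2F%2Fhttpbin.org%2Fip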
After saving the code, all you have to do is execute the following line in your terminal:
scrapy crawl basic
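Note that scrapy crawl assumes the spider lives inside a Scrapy project. If you saved the spider as a standalone file instead (say, basic_spider.py, a filename we're assuming here), you can run it without a project:

scrapy runspider basic_spider.py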