Web Scraping WSJ Articles

Ibrahim Boussaa

I recently completed a freelance task that involved scraping articles from the Wall Street Journal. The goal was to collect articles related to "Verizon Communications Inc." that were published between July 2021 and March 2023. I'm sharing the Python script I developed for this task, since it can serve as a base for similar ones. Let's dive into it!

Understanding the Script

The script performs the following steps:
1. Request the article IDs - This part of the script gets the IDs of all the articles related to the specified query.
2. Request the article details - Once we have all the IDs, the script requests the detailed data for the article behind each ID.
3. Compile the article details - All the fetched article details are collected into a Python list.
4. Clean the data - Finally, the script cleans the data, keeping only the necessary fields, and saves the result to a .csv file.
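To make the flow of these steps concrete, here is a minimal sketch of the pipeline, not the original script. The names run_pipeline, fetch_ids, fetch_details and keep_fields are hypothetical, and the two fetchers are passed in as plain functions so the skeleton runs without touching the network. The original script uses pandas for the final step; this sketch swaps in the standard-library csv module to stay dependency-free.

```python
import csv


def run_pipeline(fetch_ids, fetch_details, keep_fields, out_path):
    """Run the four steps with caller-supplied fetchers (hypothetical names)."""
    # Step 1: get the IDs of all articles matching the query.
    article_ids = fetch_ids()

    # Steps 2 and 3: request the details for each ID and compile them into a list.
    articles = [fetch_details(article_id) for article_id in article_ids]

    # Step 4: keep only the necessary fields; missing fields become empty strings.
    rows = [
        {field: article.get(field, "") for field in keep_fields}
        for article in articles
    ]

    # Save the cleaned rows to a .csv file (stdlib csv here, pandas in the original).
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=keep_fields)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Passing the fetchers in as arguments keeps the request logic separate from the compile-and-clean logic, so each part can be swapped or tested on its own.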
Alright, let's walk through the specific code snippets and understand them better.

Code Walkthrough

Setup
First, we import the necessary Python libraries: requests for handling HTTP requests, and json and pandas for handling and storing the data.
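As a rough illustration of what each library contributes, the snippet below imports all three and exercises them offline. The URL, the query-parameter name "query", and the sample record are made up for the example and are not the real WSJ endpoint or data.

```python
import json

import pandas as pd
import requests

# requests: build (but do not send) a GET request to see how the query is encoded.
# The URL and the "query" parameter name are placeholders, not the WSJ search API.
prepared = requests.Request(
    "GET",
    "https://example.com/search",
    params={"query": "Verizon Communications Inc."},
).prepare()
request_url = prepared.url

# json: turn a raw response body into Python objects (made-up sample payload).
raw_body = '[{"id": "a1", "headline": "Sample headline", "extra": 1}]'
records = json.loads(raw_body)

# pandas: keep only the needed columns and serialise them as CSV text.
frame = pd.DataFrame(records)[["id", "headline"]]
csv_text = frame.to_csv(index=False)
```

Calling to_csv without a path returns the CSV as a string; passing a filename instead writes the file to disk.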
Fetch Article IDs
The function gettingArticleDetails(id, type) takes an ID fetched in the previous step, together with its type, and sends a GET request to the WSJ search URL for that article's details.
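A function like gettingArticleDetails could look roughly like the sketch below. This is an assumption-heavy illustration rather than the original code: the session and search_url arguments are added here so the caller supplies the HTTP session and the endpoint, the query-parameter names "id" and "type" are guesses, and the response is assumed to be JSON.

```python
def gettingArticleDetails(article_id, article_type, session, search_url):
    """Fetch one article's details as parsed JSON (sketch under assumptions).

    session is expected to behave like requests.Session; search_url is
    supplied by the caller because the real endpoint is not shown here.
    """
    # "id" and "type" as query-parameter names are assumptions.
    response = session.get(
        search_url,
        params={"id": article_id, "type": article_type},
        timeout=30,  # avoid hanging forever on a stalled connection
    )
    # Raise an exception on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.json()


# Usage (placeholder URL and values, requires the requests package):
#   import requests
#   with requests.Session() as session:
#       details = gettingArticleDetails(
#           "some-id", "article", session, "https://example.com/search"
#       )
```

Reusing one session across all IDs keeps the underlying connection open between requests, which usually helps when fetching many articles in a row.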
So there you have it! This script can be adapted to other queries, or to other websites with a similar structure.
Happy Scraping!
