Robocorp Reuters News RPA Automation

Overview
This project is a Python-based automation solution for extracting news data from the Reuters website. It was developed as part of a hiring test for a Python Automation Engineer position and showcases the ability to build a bot that automates extracting and processing news data with Robocorp's RPA Framework.
🟢 The Challenge
The task is to automate the process of extracting news data from a chosen news site. For this test, the Reuters website was selected. The automation includes:
Opening the news site.
Entering a search phrase and selecting a news category.
Extracting news details such as title, date, description, and picture.
Saving the data into an Excel file.
Processing news based on a specified number of months.
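As a rough illustration of the first two steps (opening the site and running a search), a minimal Selenium sketch could look like the following; the CSS selectors are placeholders and not necessarily the ones the project uses:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Open the news site (assumes Firefox and geckodriver are installed).
driver = webdriver.Firefox()
driver.get("https://www.reuters.com/")
wait = WebDriverWait(driver, 15)

# Open the search bar and submit the search phrase; selectors are illustrative only.
search_button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "[aria-label='Open search bar']")))
search_button.click()
search_box = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[type='search']")))
search_box.send_keys("climate change")
search_box.send_keys(Keys.ENTER)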
The Source
The automation is implemented for the Reuters website.
Parameters
The process requires the following parameters, supplied via a Robocloud work item:
main_url: The Reuters URL used to access the website.
search_input: The phrase to search for in the news.
section: The category or section of the news.
months: The number of months to retrieve news for (e.g., 1 for the current month, 2 for the current and previous month).
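For reference, these parameters can be read from the input work item roughly as in the sketch below, assuming rpaframework's RPA.Robocorp.WorkItems library:

from RPA.Robocorp.WorkItems import WorkItems

# Load the current input work item and read the expected variables.
work_items = WorkItems()
work_items.get_input_work_item()
main_url = work_items.get_work_item_variable("main_url")
search_input = work_items.get_work_item_variable("search_input")
section = work_items.get_work_item_variable("section")
months = int(work_items.get_work_item_variable("months"))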
The Process
Open the Site: Navigate to the Reuters website.
Search: Enter the search phrase and select the news section.
Retrieve News: Collect news URLs and extract details from each news article.
Extract Data:
Title
Date
Description
Picture filename
Count of occurrences of the search phrase in the title and description
Whether the title or description contains monetary amounts (see the sketch after this list)
Save Data: Store the extracted data in an Excel file and download the referenced news pictures.
Delete Files: Remove files generated during the run to keep disk usage low.
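The two derived fields (search-phrase count and monetary amounts) come down to simple text checks. A minimal sketch of such helpers, with illustrative names that do not necessarily match parse_data_utils.py:

import re

# Matches amounts like "$11.1", "$111,111.11", "11 dollars" or "11 USD".
MONEY_PATTERN = re.compile(
    r"\$\d[\d,]*(\.\d+)?|\b\d[\d,]*(\.\d+)?\s*(dollars|USD)\b",
    re.IGNORECASE,
)

def count_search_phrase(text, phrase):
    # Case-insensitive count of the search phrase in the given text.
    return text.lower().count(phrase.lower())

def contains_money(text):
    # True if the text mentions a monetary amount.
    return bool(MONEY_PATTERN.search(text))

title = "Markets rally as oil climbs to $95 a barrel"
print(count_search_phrase(title, "oil"))  # 1
print(contains_money(title))              # True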
Requirements
Python 3.8+
Robocorp Framework
Pandas
Selenium
Firefox
Installation
Clone the repository:
git clone https://github.com/danielcalp/RPA_Challenge_Fresh_news_2.0.git
Install the required packages:
pip install -r requirements.txt
Usage
Set Up Parameters: Define the parameters in the Robocloud work item or configure them in a configuration file.
Run the Automation: Execute the main task using Robocorp Control Room or local execution.
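For local runs without Control Room, the configuration-file route could be as simple as the sketch below; the file name and keys are assumptions, since the project's own Variables file is not shown here:

import json
from pathlib import Path

def load_local_parameters(path="variables.json"):
    # Hypothetical local fallback: read main_url, search_input, section and months
    # from a JSON file instead of a Robocloud work item.
    with Path(path).open(encoding="utf-8") as config_file:
        return json.load(config_file)

params = load_local_parameters()
print(params["search_input"], params["months"])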
Attention
If a GeoCaptcha error occurs, route the automation's traffic through a VPN or proxy.
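If the browser is driven through Selenium directly, one way to do this is via Firefox proxy preferences; the host and port below are placeholders:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

PROXY_HOST = "proxy.example.com"  # placeholder: your proxy host
PROXY_PORT = 8080                 # placeholder: your proxy port

options = Options()
options.set_preference("network.proxy.type", 1)  # manual proxy configuration
options.set_preference("network.proxy.http", PROXY_HOST)
options.set_preference("network.proxy.http_port", PROXY_PORT)
options.set_preference("network.proxy.ssl", PROXY_HOST)
options.set_preference("network.proxy.ssl_port", PROXY_PORT)
driver = webdriver.Firefox(options=options)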
Code Overview
selenium_utils.py: Contains utility functions for interacting with the browser.
parse_data_utils.py: Functions for parsing news data and checking for monetary values.
check_date_utils.py: Functions for handling and comparing dates (see the sketch after this list).
task.py: The entry point that coordinates the automation workflow.
save_as_csv.py: Handles saving the extracted data to an Excel file.
remove_files.py: Handles deleting generated files from the Firefox profile.
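The months parameter translates into a month-based cutoff: months=1 keeps only articles from the current month, months=2 adds the previous month, and so on. A standard-library sketch of that check (the function name is illustrative):

from datetime import date

def is_within_months(article_date, months, today=None):
    # True if the article falls within the last `months` calendar months,
    # counting the current month as month 1.
    today = today or date.today()
    month_diff = (today.year - article_date.year) * 12 + (today.month - article_date.month)
    return 0 <= month_diff < max(months, 1)

print(is_within_months(date(2024, 6, 15), months=2, today=date(2024, 7, 1)))  # True
print(is_within_months(date(2024, 4, 30), months=2, today=date(2024, 7, 1)))  # False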
Example
To run the automation locally, you can use the following command:
python task.py
Ensure that the parameters are correctly set in the Variables configuration file or passed via a Robocloud work item.
Best Practices and Considerations
Code Quality: The code adheres to PEP8 standards and follows clean code practices.
Resiliency: The automation is fault-tolerant and includes error handling for both application and website issues.
Logging and Debugging: Proper logging should be implemented for real-world use (e.g., using Python's logging module).
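A minimal version of such a logging setup (logger name, format, and the sample failure are illustrative):

import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)
logger = logging.getLogger("reuters_news_bot")  # illustrative logger name

logger.info("Starting news extraction")
try:
    raise TimeoutError("search results did not load")  # placeholder failure
except TimeoutError:
    logger.exception("Failed to load search results; retrying")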
License
This project is licensed under the MIT License.
Contact
For any questions or further information, please reach out via GitHub (https://github.com/danielcalp/) or LinkedIn (https://www.linkedin.com/in/danielpaschoalino/).