Mutual Funds Data Pipeline Implementation by Aditya GoelMutual Funds Data Pipeline Implementation by Aditya Goel

Mutual Funds Data Pipeline Implementation

Aditya Goel

Data Engineer

dbt

MySQL

Investment & Asset Management

Prodigal - Mutual Funds Data Pipeline

This assignment by Prodigal implements a robust data pipeline for scraping mutual fund data, storing it in a MySQL database, and providing various querying options for analysis. It is designed to handle several exceptions, ensure data consistency, and optimize performance for data ingestion and querying.

Features

Data Scraping:

Scrapes mutual fund data from an external source for a specified date range.

Supports both historical and regular (incremental) data ingestion into a MySQL database.

Database Schema:

Creates a mutual_funds table with the following structure:

scheme_code (VARCHAR): Unique identifier for the mutual fund.

scheme_name (VARCHAR): Name of the mutual fund.

isin_div_payout_growth (VARCHAR): ISIN for dividend payout/growth.

isin_div_reinvestment (VARCHAR): ISIN for dividend reinvestment.

net_asset_value (DECIMAL): Net asset value of the fund.

repurchase_price (DECIMAL): Repurchase price of the fund.

sale_price (DECIMAL): Sale price of the fund.

nav_date (DATE): Date of the NAV record.

Index on scheme_code and nav_date for efficient querying.

Exception Handling:

Skips rows with invalid or duplicate data.

Logs skipped rows for debugging.

Gracefully handles database-related errors.

Query Options:

Retrieve NAV trends for a specific mutual fund.

Compare NAVs across multiple mutual funds within a date range.

Run custom SQL queries.

Performance Optimizations:

Batch insertions reduce database overhead.

Indexing for faster lookups.

Skips duplicate rows by checking the database before insertion.

Setup Instructions

1. Prerequisites

Python 3.10.10

MySQL server

Required Python packages (install using requirements.txt):

2. Add Your Database Credentials

Before running the script, update the establish_connection function with your database credentials:

connection = pymysql.connect(
    charset="utf8mb4",
    connect_timeout=timeout,
    cursorclass=pymysql.cursors.DictCursor,
    db="your_database",
    host="your_host",
    password="your_password",
    port=your_port,
    user="your_user",
    write_timeout=timeout,
)

3. After this run the main file

python main.py

What Happens in the Main Function?

Table Clearance

Deletes all existing data in the mutual_funds table for a fresh start.

Ingest Historical Data

Adds historical data for a predefined date range (e.g., 01-Oct-2024 to 03-Oct-2024).

Ingest Regular Data

Automates incremental data ingestion for the last two days, ensuring the database remains up-to-date.

Interactive Query Options

Allows users to test and retrieve data using pre-defined and custom queries.

Database Connection Closure

Ensures the database connection is closed properly after execution.

Exception Handling

Data Sanitization

Rows with invalid net_asset_value, repurchase_price, or sale_price are reformatted during insertion.

The valid entries to insert the data are added based on data type

Duplicate Rows

Checks for existing rows with the same scheme_code and nav_date before insertion.

Skips duplicate rows to maintain data integrity.

Data Insertion Handling

All Data available in the sheet is added

No extra headers or blank rows are added

Performance Measures

Indexing

Indexes on scheme_code and nav_date improve query performance.

Incremental Updates

Check for duplications while adding new data

Query Options

The script provides the following query options for testing:

1. Retrieve NAV Trends

Retrieves the NAV history for a specific mutual fund.

Example Query:

2. Compare NAVs Across Funds

Compares NAVs for multiple funds within a specific date range.

Example Query:

3. Run Custom SQL Queries

Allows users to execute any SQL query directly.

Example Query:

Visual Summary

Key Features

Feature Description Data Scraping Fetches data for a specified date range. Historical & Incremental Handles both historical and regular updates. Interactive Querying Provides predefined and custom query options.

Performance Measures

Measure Description Indexing Improves lookup and query performance. Duplicate Handling Skips rows with the same scheme_code and nav_date.

Sample Queries Output Format

Query Options:
   - Retrieve NAV trends for a single mutual fund
   - Compare NAVs across multiple mutual funds within a date range
   - Run a custom SQL query
   - Exit
Enter your choice: 1
Enter the Scheme Code: 139619
NAV Trends:
{'nav_date': datetime.date(2024, 10, 1), 'net_asset_value': Decimal('10.0000')}
{'nav_date': datetime.date(2024, 10, 3), 'net_asset_value': Decimal('10.0000')}
{'nav_date': datetime.date(2024, 11, 21), 'net_asset_value': Decimal('10.0000')}

Query Options:
   - Retrieve NAV trends for a single mutual fund
   - Compare NAVs across multiple mutual funds within a date range
   - Run a custom SQL query
   - Exit
Enter your choice: 3
Enter your custom SQL query: SELECT COUNT(*) AS total_entries FROM mutual_funds;
Query Results:
{'total_entries': 24786}

Query Options:
   - Retrieve NAV trends for a single mutual fund
   - Compare NAVs across multiple mutual funds within a date range
   - Run a custom SQL query
   - Exit
Enter your choice: 4
Exiting query options.
Connection closed. Pipeline execution complete.

Like this project

Posted Apr 19, 2025

Developed a data pipeline for mutual fund data with MySQL storage and querying.

Likes

Views

Timeline

Dec 4, 2024 - Feb 19, 2025

Clients

Prodigal