Stock_Market_RealTime_Project_DataEngineering

Syed Muhammad Adil

Data Engineer
Apache Airflow
AWS
Kafka

Project Description

This project simulates a real-time stock market application in which stock market data is ingested in real time, streamed through Apache Kafka, and stored in Amazon S3 for further analytics using AWS Glue and Amazon Athena.

Components

DataSet (CSV):
The project begins with a CSV dataset containing stock market data such as prices and trading volumes. This dataset serves as the input for the stock market simulation.
Stock Market App Simulation (Python):
A Python-based application simulates real-time stock market data generation. It reads records from the CSV file and replays them to mimic live market activity, such as stock price changes.
The app acts as a Kafka producer, pushing the stock data to an Apache Kafka broker hosted on an Amazon EC2 instance.
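A producer along these lines could look like the following sketch, which assumes the kafka-python client; the broker address, topic name, dataset file name, and column layout are illustrative placeholders rather than values taken from the project.

```python
# Minimal producer sketch, assuming the kafka-python package. The broker
# address, topic name, dataset file name, and CSV columns are placeholders.
import json
import time

import pandas as pd
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="<EC2-PUBLIC-IP>:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

df = pd.read_csv("stock_data.csv")  # hypothetical dataset file name

while True:
    # Pick a random row from the historical data to mimic a live market tick.
    record = df.sample(1).to_dict(orient="records")[0]
    producer.send("stock_market", value=record)
    time.sleep(1)  # throttle to roughly one event per second
```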
Apache Kafka (on Amazon EC2):
Apache Kafka is used as the real-time data streaming platform. The stock data from the simulation is continuously produced into Kafka.
Producers (the Python app) send the data, and Consumers (downstream systems) retrieve this data for further processing.
Amazon S3:
Kafka consumers push the streamed stock data into Amazon S3, an object storage service, for persistent storage.
The data in S3 is partitioned and organized for later use.
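A matching consumer could follow the sketch below, assuming kafka-python and boto3; the bucket name, topic, and object key layout are illustrative placeholders.

```python
# Minimal consumer sketch, assuming kafka-python and boto3. The bucket name,
# topic, and key layout are placeholders.
import json

import boto3
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "stock_market",
    bootstrap_servers="<EC2-PUBLIC-IP>:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

s3 = boto3.client("s3")

for count, message in enumerate(consumer):
    # Write each event as its own JSON object so Glue and Athena can read it later.
    s3.put_object(
        Bucket="stock-market-demo-bucket",
        Key=f"raw/stock_market_{count}.json",
        Body=json.dumps(message.value),
    )
```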
AWS Glue Crawlers:
AWS Glue Crawlers are used to automatically scan the data stored in Amazon S3. They identify the schema and metadata of the stock data, allowing it to be cataloged for querying and analysis.
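The crawler can be set up in the console or scripted; the sketch below uses boto3, with the crawler name, IAM role ARN, database name, and S3 path as placeholders.

```python
# Sketch of defining and running a Glue crawler with boto3. The crawler name,
# IAM role ARN, database name, and S3 path are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="stock-market-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="stock_market_db",
    Targets={"S3Targets": [{"Path": "s3://stock-market-demo-bucket/raw/"}]},
)

# Run the crawler so the stock data schema lands in the Glue Data Catalog.
glue.start_crawler(Name="stock-market-crawler")
```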
AWS Glue:
AWS Glue is a fully managed ETL (Extract, Transform, Load) service. It processes the raw stock market data stored in Amazon S3, enabling transformations and preparing it for further analysis.
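A Glue job script for this step might follow the skeleton below; the database, table, and output path are assumptions, and the transformation shown (dropping incomplete rows and converting to Parquet) is only one example of what such a job could do.

```python
# Skeleton of a Glue ETL job. Database, table, and output path are assumed;
# the transformation is illustrative.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw stock data that the crawler cataloged.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="stock_market_db", table_name="raw"
)

# Drop incomplete ticks and rewrite as Parquet, which Athena scans more
# cheaply than raw JSON.
df = dyf.toDF().dropna()
df.write.mode("overwrite").parquet("s3://stock-market-demo-bucket/processed/")

job.commit()
```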
Amazon Athena:
Amazon Athena is used for querying the processed stock market data. It lets users run standard SQL queries directly against the data in S3 without provisioning or managing any infrastructure.
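The sketch below shows one way the cataloged data might be queried from Python with boto3; the database, table, column names, and result location are assumptions, and the same SQL can be run directly in the Athena console.

```python
# Sketch of running an Athena query with boto3. Database, table, column names,
# and the result location are placeholders.
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="""
        SELECT symbol, AVG(close) AS avg_close
        FROM raw
        GROUP BY symbol
        ORDER BY avg_close DESC
        LIMIT 10
    """,
    QueryExecutionContext={"Database": "stock_market_db"},
    ResultConfiguration={
        "OutputLocation": "s3://stock-market-demo-bucket/athena-results/"
    },
)
print("Query started:", response["QueryExecutionId"])
```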

Data Flow

CSV Data Ingestion: The stock market data starts from a CSV file.
Real-Time Simulation: Python simulates real-time data generation from the CSV, pushing it into Apache Kafka.
Kafka Stream: The Python app (Producer) streams the data to Kafka, which manages the real-time data flow.
Data Storage (S3): Kafka Consumers retrieve the data and store it in Amazon S3.
Crawling and ETL: AWS Glue Crawlers scan the S3 data, and AWS Glue processes and transforms it.
Data Analysis (Athena): The processed data is then queried using Amazon Athena for insights and reporting.

Purpose of the Project

The primary goal of this project is to demonstrate a real-time data pipeline that uses Apache Kafka to simulate a stock market feed and AWS services for data storage, transformation, and analysis. The same setup can support real-time analytics, financial reporting, or exploration of stock market trends.

Potential Extensions

Scaling: Kafka topics can be partitioned across brokers and S3 grows with the data, so the pipeline could be extended to handle high-throughput feeds for real-time monitoring of stock exchanges or broader financial markets.
Machine Learning: The processed data could be used to train machine learning models for stock price prediction or anomaly detection.
Dashboards: The output from Athena could be visualized in dashboards for real-time stock price updates and insights.