◦ Objective – To develop an efficient big data solution.
◦ Challenge – To handle a large data volume (3+ TB) drawn from multiple data sources.
◦ Approach – To deploy a cluster architecture that consolidates the data into a single data source.
◦ Solution – Implemented Hadoop (HDFS) + Spark (PySpark) + Airflow (scheduler): a Linux-based cluster architecture with a compressed data warehouse that serves as the single data source. The entire stack was open source.
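
The consolidation step described in the solution could look roughly like the PySpark sketch below. It is an illustration under stated assumptions, not the original implementation: the source names, HDFS paths, the `run_date` partition layout, and the choice of Snappy-compressed Parquet are all hypothetical, and `allowMissingColumns` needs Spark 3.1 or later.

```python
from functools import reduce

# Hypothetical source registry: name -> (format, HDFS path). Not from the original.
SOURCES = {
    "orders": ("csv", "hdfs:///raw/orders/"),
    "events": ("json", "hdfs:///raw/events/"),
}


def warehouse_path(base: str, run_date: str) -> str:
    """Build the output directory for one scheduled run."""
    return f"{base.rstrip('/')}/run_date={run_date}"


def consolidate(spark, sources: dict, base: str, run_date: str) -> str:
    """Read every source, tag rows with their origin, and write one
    compressed Parquet dataset that acts as the single data source."""
    # Imported lazily so warehouse_path() works without a Spark installation.
    from pyspark.sql import functions as F

    frames = []
    for name, (fmt, path) in sources.items():
        if fmt == "csv":
            df = spark.read.csv(path, header=True, inferSchema=True)
        else:
            df = spark.read.json(path)
        frames.append(df.withColumn("source", F.lit(name)))

    # Align columns by name; columns missing from a source are filled with nulls.
    merged = reduce(
        lambda a, b: a.unionByName(b, allowMissingColumns=True), frames
    )

    target = warehouse_path(base, run_date)
    # Snappy is an assumed codec; the original only says "compressed".
    merged.write.mode("overwrite").option("compression", "snappy").parquet(target)
    return target
```

A scheduled Airflow task could call `consolidate(spark, SOURCES, "hdfs:///warehouse", run_date)` once per run, passing the run's date so that each execution writes to its own directory.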