• Responsible for building scalable distributed data
solutions using Spark.
• Ingested log files from source servers into HDFS data
lakes using Sqoop.
• Developed Sqoop jobs to ingest customer and product data into HDFS data lakes.
• Developed Spark Streaming applications to ingest transactional data from Kafka topics into Cassandra tables in near real time (sketch 1 below).
• Developed a Spark application to flatten transactional data by joining it with various dimension tables and persist the results to Cassandra tables (sketch 2 below).
• Involved in developing a framework for metadata management on HDFS data lakes.
• Worked on various Hive optimizations such as partitioning, bucketing, vectorization, and indexing, and on choosing the right type of Hive join, such as Bucket Map Join and Sort-Merge-Bucket (SMB) join (sketch 3 below).
• Worked with various file formats such as CSV, JSON, ORC, Avro, and Parquet (sketch 4 below).
• Developed HQL scripts to create external tables and analyze incoming and intermediate data for analytics applications in Hive (sketch 5 below).
• Optimized Spark jobs using techniques such as broadcast joins, executor tuning, and persisting reused datasets (sketch 6 below).
• Responsible for developing custom UDFs, UDAFs, and UDTFs in Hive (sketch 7 below).
• Analyzed tweet JSON data using the Hive SerDe API to deserialize it into a readable, queryable format (sketch 8 below).
• Orchestrated Hadoop and Spark jobs using Oozie workflows to manage job dependencies and run multiple jobs in sequence for data processing.
• Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
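
Sketch 1: a minimal Structured Streaming sketch of the Kafka-to-Cassandra ingestion, assuming hypothetical broker, topic, keyspace, and table names, that the spark-cassandra-connector is on the classpath, and that payload parsing is simplified to raw strings.

    import org.apache.spark.sql.{DataFrame, SparkSession}

    object TxStream {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("tx-kafka-to-cassandra")
          .getOrCreate()

        // Read the transaction topic as an unbounded streaming DataFrame.
        val raw = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092") // hypothetical broker
          .option("subscribe", "transactions")               // hypothetical topic
          .load()

        // Kafka keys/values arrive as bytes; cast to strings (real parsing
        // of the transaction payload would happen here).
        val txns = raw.selectExpr("CAST(key AS STRING) AS key",
                                  "CAST(value AS STRING) AS value")

        // Write each micro-batch to Cassandra via the spark-cassandra-connector;
        // assumes the target table has matching columns.
        val query = txns.writeStream
          .foreachBatch { (batch: DataFrame, _: Long) =>
            batch.write
              .format("org.apache.spark.sql.cassandra")
              .options(Map("keyspace" -> "sales", "table" -> "transactions")) // hypothetical
              .mode("append")
              .save()
          }
          .option("checkpointLocation", "hdfs:///checkpoints/tx") // needed for fault tolerance
          .start()

        query.awaitTermination()
      }
    }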
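
Sketch 2: a sketch of the flattening job, with hypothetical table and column names; the dimension tables are broadcast on the assumption that they are small.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder().appName("flatten-tx").enableHiveSupport().getOrCreate()

    // Large fact table plus small dimension tables (hypothetical names).
    val tx       = spark.table("raw.transactions")
    val customer = spark.table("dim.customer")
    val product  = spark.table("dim.product")

    // Broadcasting the small dimensions makes the joins map-side (no shuffle
    // of the large fact table), producing one flat row per transaction.
    val flat = tx
      .join(broadcast(customer), Seq("customer_id"))
      .join(broadcast(product), Seq("product_id"))
      .select("tx_id", "tx_ts", "customer_name", "product_name", "amount")

    // Persist the flattened result to a Cassandra table.
    flat.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "sales", "table" -> "tx_flat")) // hypothetical
      .mode("append")
      .save()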
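
Sketch 3: the Hive-side bucketed-join settings plus Spark's analogous bucketing API, with hypothetical table names; the Hive SET statements appear as comments because they belong in the Hive session, not in Spark.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("bucketing-demo")
      .enableHiveSupport()
      .getOrCreate()

    // Hive session settings that enable bucketed joins (run in the Hive CLI):
    //   SET hive.optimize.bucketmapjoin = true;              -- Bucket Map Join
    //   SET hive.optimize.bucketmapjoin.sortedmerge = true;  -- SMB join
    //   SET hive.auto.convert.sortmerge.join = true;

    // Co-bucketing and sorting both sides of a frequent join on the join key
    // lets the engine skip the shuffle at read time.
    spark.table("raw.orders")                   // hypothetical source
      .write
      .bucketBy(32, "customer_id")
      .sortBy("customer_id")
      .format("parquet")
      .saveAsTable("analytics.orders_bucketed") // hypothetical target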
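
Sketch 4: reading and converting between the listed file formats in Spark, with hypothetical HDFS paths; Avro support is assumed to come from the external spark-avro package.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("formats").getOrCreate()

    // Each format has a built-in reader (Avro ships separately as spark-avro).
    val csv  = spark.read.option("header", "true").csv("hdfs:///landing/products.csv")
    val json = spark.read.json("hdfs:///landing/events.json")
    val orc  = spark.read.orc("hdfs:///warehouse/tx_orc")
    val avro = spark.read.format("avro").load("hdfs:///landing/customers.avro")

    // Columnar formats such as Parquet and ORC suit analytical scans, so
    // row-oriented landing data is typically converted on write.
    csv.write.mode("overwrite").parquet("hdfs:///warehouse/products_parquet")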
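
Sketch 5: an external-table DDL of the kind the HQL scripts would contain, run here through spark.sql with Hive support; database, columns, and location are hypothetical.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("external-tables")
      .enableHiveSupport() // lets spark.sql run HiveQL DDL against the metastore
      .getOrCreate()

    spark.sql("CREATE DATABASE IF NOT EXISTS staging")

    // External: dropping the table leaves the underlying HDFS files in place.
    spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS staging.transactions (
        tx_id STRING,
        customer_id STRING,
        amount DOUBLE
      )
      PARTITIONED BY (tx_date STRING)
      STORED AS ORC
      LOCATION 'hdfs:///data/landing/transactions'
    """)

    // Register partitions already present on disk, then sanity-check one day.
    spark.sql("MSCK REPAIR TABLE staging.transactions")
    spark.sql("SELECT COUNT(*) FROM staging.transactions WHERE tx_date = '2024-01-01'").show()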
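
Sketch 6: the broadcast and persistence techniques in one place, with illustrative executor settings in a comment; table names and relative sizes are assumptions.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast
    import org.apache.spark.storage.StorageLevel

    // Executor tuning happens at submit time, e.g. (illustrative values only):
    //   spark-submit --num-executors 10 --executor-cores 4 --executor-memory 8g ...

    val spark = SparkSession.builder().appName("opt-demo").enableHiveSupport().getOrCreate()

    val tx  = spark.table("raw.transactions") // large fact table (hypothetical)
    val dim = spark.table("dim.product")      // small dimension table

    // Broadcast the small side so the join avoids shuffling the large table.
    val joined = tx.join(broadcast(dim), Seq("product_id"))

    // Persist a result reused by several actions so it is computed once.
    joined.persist(StorageLevel.MEMORY_AND_DISK)

    println(joined.count()) // first action materializes the cache
    joined.write.mode("overwrite").parquet("hdfs:///warehouse/tx_enriched")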
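
Sketch 7: a minimal Hive UDF in the simple one-value-in/one-value-out flavor (UDAFs and UDTFs extend different base classes); the masking logic and function name are hypothetical.

    import org.apache.hadoop.hive.ql.exec.UDF
    import org.apache.hadoop.io.Text

    // Hive instantiates the class and calls evaluate() once per row.
    class MaskCard extends UDF {
      def evaluate(card: Text): Text = {
        if (card == null) return null
        val s = card.toString
        // Keep only the last four characters visible.
        new Text("*" * math.max(0, s.length - 4) + s.takeRight(4))
      }
    }

    // Registered from the Hive CLI along these lines:
    //   ADD JAR hdfs:///udfs/mask-card.jar;
    //   CREATE TEMPORARY FUNCTION mask_card AS 'MaskCard';
    //   SELECT mask_card(card_number) FROM payments;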
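
Sketch 8: a JSON SerDe table over raw tweets, using the HCatalog JsonSerDe as one commonly available implementation; columns, path, and database are hypothetical, and the hive-hcatalog-core jar is assumed to be on the classpath.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("tweets-serde")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("CREATE DATABASE IF NOT EXISTS social")

    // The SerDe deserializes each JSON line into typed, queryable columns.
    spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS social.tweets (
        id BIGINT,
        created_at STRING,
        text STRING,
        `user` STRUCT<screen_name: STRING, followers_count: INT>
      )
      ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
      STORED AS TEXTFILE
      LOCATION 'hdfs:///data/raw/tweets'
    """)

    spark.sql("SELECT `user`.screen_name, text FROM social.tweets LIMIT 10").show()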