Building End-to-End API-Integrated Data Pipelines: BHP

Sandeep Dudhraj

Cloud Infrastructure Architect
Data Engineer
Software Architect
Apache Airflow
AWS
Python
BHP

Objective:

The goal was to create a unified, repeatable API integration pattern for company-wide data pipelines that would improve efficiency and maintainability and reduce cost.

Problem Context:

BHP's data engineering (DE) team creates and maintains thousands of data pipelines across a range of data sources: APIs, databases, and files. APIs make up 75-85% of those sources. All the integration solutions are built in-house with Python and AWS Lambda. When I started working with the DE team, it was immediately clear that software design principles were not being followed. Each new API data source required minor modifications, and a new Lambda was created each time to deliver the pipeline faster.
The new Lambdas shared 90-95% of their code with existing ones, so near-duplicate Lambdas accumulated. This created maintenance overhead and made adding new features difficult. In the long term, it slowed down data pipeline creation.

Solution:

After reviewing the past implementations, I drafted a generic pattern that fits most of the use cases: authentication, data querying, and data storage. I then deployed a new Lambda for API integration as a generic solution for new data pipelines. It has been widely adopted, and 80% of the API pipelines now use it.
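
To illustrate the idea, here is a minimal sketch of what a configuration-driven handler covering those three stages could look like. It is not the code deployed at BHP. The names SourceConfig, build_auth_headers, run_pipeline, and make_handler, the event shape, and the two auth types are all hypothetical. The fetch and store steps are passed in as plain callables so the sketch runs without network or AWS access; in a real Lambda they would wrap an HTTP client and a storage client.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical callable types: fetch(url, headers) -> records, store(key, body).
Fetch = Callable[[str, Dict[str, str]], List[Dict[str, Any]]]
Store = Callable[[str, str], None]


@dataclass(frozen=True)
class SourceConfig:
    """Per-source settings (assumed shape). A new API needs a new config, not a new Lambda."""
    name: str
    base_url: str
    auth_type: str    # "api_key" or "bearer" in this sketch
    secret: str       # assumption: resolved from a secrets store before this point
    destination: str  # storage prefix for the output


def build_auth_headers(cfg: SourceConfig) -> Dict[str, str]:
    """Stage 1: authentication, selected by configuration."""
    if cfg.auth_type == "api_key":
        return {"X-API-Key": cfg.secret}
    if cfg.auth_type == "bearer":
        return {"Authorization": f"Bearer {cfg.secret}"}
    raise ValueError(f"unsupported auth_type: {cfg.auth_type}")


def run_pipeline(cfg: SourceConfig, fetch: Fetch, store: Store) -> Dict[str, Any]:
    """Run authentication, data querying, and data storage for one source."""
    headers = build_auth_headers(cfg)
    records = fetch(cfg.base_url, headers)       # Stage 2: data querying
    key = f"{cfg.destination}/{cfg.name}.json"
    store(key, json.dumps(records))              # Stage 3: data storage
    return {"source": cfg.name, "records": len(records), "key": key}


def make_handler(fetch: Fetch, store: Store):
    """Build a Lambda-style handler(event, context) around the shared pipeline."""
    def handler(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
        cfg = SourceConfig(**event["source"])
        return run_pipeline(cfg, fetch, store)
    return handler
```

With this shape, the per-source differences that previously led to a new Lambda each time live in the event configuration, while the shared logic exists in one place.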

Reflection:

During this project, I improved my planning, development, and managerial skills, and I deepened my understanding of AWS infrastructure and Python. Because I was handling the project on my own, I learned how important it is to communicate changes to the wider team and to treat their feedback as crucial input. Because this was a platform change, it affected many other developers who use the platform to build pipelines, which made the risk, opportunity, and significance of the project clear to me.

Future projects improvement:

While the planning, development, and deployment of the solution went according to plan, its adoption by the pipeline-building team was slow because there was no proper documentation. Nobody likes writing documentation, and this project had limited time allocated for it, which slowed adoption. In future projects, I would budget time for documentation from the start.