The existing data pipeline was struggling with high-frequency data: costs were rising, bugs were interrupting operations, and load times made it unable to scale as the data grew.
Outcomes
8x faster run time for the full data pipeline
90% lower running costs of storage and computing combined
Cut the average duration of large maintenance tasks from 1 week to less than 1 day
The Work
1- Identifying the pain points and bottlenecks
Listed all the performance bottlenecks that were discovered over the past 6 months
Identified the major sources of bugs during that period
Created an architecture diagram representing the current data pipeline
2- Research
Reviewed best practices for building time-series data pipelines
Evaluated potential storage, transformation, and deployment options against synthetic data that reproduced the main challenges
Researched the costs of the shortlisted options
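A benchmark like the one below is one way to compare shortlisted storage options on synthetic time-series data. This is an illustrative sketch only: the formats (plain CSV vs. gzip-compressed CSV) and the record shape are assumptions, not the actual options that were evaluated.

```python
import csv
import gzip
import os
import random
import time

# Synthetic high-frequency time-series rows: (timestamp, sensor_id, value).
rows = [(i, i % 16, random.random()) for i in range(100_000)]

def write_csv(path):
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

def write_csv_gz(path):
    with gzip.open(path, "wt", newline="") as f:
        csv.writer(f).writerows(rows)

def bench(name, write_fn, path):
    # Measure wall-clock write time and resulting file size.
    start = time.perf_counter()
    write_fn(path)
    elapsed = time.perf_counter() - start
    size_kib = os.path.getsize(path) / 1024
    print(f"{name}: {elapsed:.3f}s, {size_kib:.0f} KiB")

bench("plain CSV", write_csv, "ts.csv")
bench("gzip CSV", write_csv_gz, "ts.csv.gz")
```

Running the same harness over each candidate format makes the speed/size/cost trade-off concrete before committing to one.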
3- Architecture Design
Used the research outcomes to identify the main components of each pipeline stage
Iterated on the communication patterns and the data flow between the components
Finalized the design by drawing a diagram representing all of the above
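One lightweight way to pin down component boundaries and data flow before implementation is to sketch each pipeline stage as a small, uniform interface. The stage names and transforms below are placeholders for illustration, not the actual components of this pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Stage:
    """One pipeline component: a name plus a transform over record batches."""
    name: str
    transform: Callable[[Iterable[dict]], Iterable[dict]]

def run_pipeline(stages, records):
    # Records flow stage-to-stage in one direction, mirroring the diagram.
    for stage in stages:
        records = stage.transform(records)
    return list(records)

# Placeholder stages: ingest normalizes raw input, clean drops bad records.
ingest = Stage("ingest", lambda recs: ({"value": r["raw"]} for r in recs))
clean = Stage("clean", lambda recs: (r for r in recs if r["value"] is not None))

result = run_pipeline([ingest, clean], [{"raw": 1}, {"raw": None}])
print(result)  # [{'value': 1}]
```

Keeping every stage behind the same interface makes it cheap to reorder, swap, or test components in isolation while the design is still being iterated on.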
4- Implementation Planning
Planned the detailed technical strategy for how each component would meet the goals of the architecture design
Planned all the migration steps required
5- Implementation
Applied the implementation plan, starting with the least dependent components
Continuously evaluated performance to confirm the implementation stayed on track
Ran an end-to-end demo on synthetic data once the full pipeline was complete
Developed the logic to migrate data from the old storage and structure to the new ones
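Working on the least dependent components first amounts to a topological sort of the component dependency graph. A minimal sketch using Python's standard-library `graphlib` (the component names and edges here are hypothetical, not the pipeline's actual dependency graph):

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each component maps to what it depends on.
deps = {
    "storage": set(),
    "ingestion": {"storage"},
    "transformation": {"ingestion", "storage"},
    "serving": {"transformation"},
}

# static_order() yields components so that every dependency
# appears before the components that rely on it.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Sequencing the build this way means each component can be implemented and evaluated against real dependencies instead of mocks.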
Posted Oct 26, 2024
Improved an existing data processing pipeline by optimizing its architecture, resulting in increased speed and reliability and significantly reduced costs