Build or Buy Data Pipelines? The Only Guide You’ll Need (+Detai…

Amit Phaujdar

Total Enterprise Software Revenue Forecast[1]
IT spending on enterprise software jumped to $856.03 billion in 2023, a sign that investment in software, including 'custom-tailored' solutions, is on the rise.

To build or not to build, that is the question.
That’s a question that continues to baffle data teams. On one side, you have increasingly easy-to-use, cloud-compatible third-party tools that do the heavy lifting for you.
On the other, according to Sidu Ponnappa:

In the build vs buy debate in enterprise software, AI is going to tilt the deck steeply toward build (3-5 year horizon). The capital, time, and risk involved in developing “custom-tailored” software is going to collapse.
A natural, albeit frustrating, answer to whether you should build or buy data pipelines is ‘it depends’. A few frameworks can help you identify which requirements are most critical for your needs.
Weighing the advantages and disadvantages of building a solution in-house against purchasing a no-code SaaS solution then gives you the information you need to make a decision that aligns with your unique needs and objectives.
Here’s a decision tree you can use to settle the build vs buy debate:
Decision Tree for Build vs Buy[2]
Here are a few factors to help put this debate to rest for data pipelines:
Expertise: Whether you’re integrating a new tool or building a platform from scratch, engineering support is pivotal. This means taking a good look at your team’s capabilities and time. Does your team have the necessary expertise to build a data pipeline in-house? If they do, can they maintain the in-house pipeline without sacrificing valuable time for other priority projects? These questions will get you a step closer to a decision.
Differentiator/Utility: If the in-house data pipeline is a core differentiator for your business, the opportunity costs might be worth incurring. But if data pipelines are just a utility for you, they belong in the overhead column.
Time to Value: The success of your data pipeline solution also depends on how long it takes to build the tool you need versus buying it. You need to factor in how important time is to your team.
Control/Security: Ensuring a secure and robust data transfer is one of the most critical aspects of any data platform. Building a data pipeline from scratch gives you a holistic view of your data and the operations running on it, and can even provide granular control over your data. However, keeping that data secure requires the right vigilance layers to protect it at all times, and implementing evolving security regulations and compliance requirements is an arduous task, so keep this in mind while making the decision.
Costs/Resources: For both solutions (in-house and third-party), make a list of all the issues, items, resources, and anything else that might have a price tag attached to it. Calculate the overall sum and estimate when you’ll see tangible results and an actual return on investment (ROI); a quick breakeven sketch follows below.
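To make that last factor concrete, here’s a minimal breakeven sketch in Python. Every figure in it is a hypothetical placeholder, not a benchmark; swap in your own estimates.

```python
# Hypothetical breakeven model: the month at which "build" becomes
# cheaper than "buy". All numbers are illustrative placeholders.

def breakeven_months(build_upfront, build_monthly, buy_monthly):
    """Return the first month where cumulative 'build' cost drops below
    cumulative 'buy' cost, or None if it never does within 5 years."""
    for month in range(1, 61):
        build_total = build_upfront + build_monthly * month
        buy_total = buy_monthly * month
        if build_total < buy_total:
            return month
    return None

# Example: $120K upfront engineering and $8K/month maintenance for
# "build", versus a $15K/month vendor contract for "buy".
month = breakeven_months(build_upfront=120_000, build_monthly=8_000, buy_monthly=15_000)
print(f"Build breaks even at month {month}" if month else "Build never breaks even in 5 years")
```

If the breakeven month sits beyond your planning horizon, the ‘buy’ column starts looking a lot more attractive.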
Dan Luu (ex-Twitter) made a pretty strong case for why someone might prefer building a data pipeline over buying a third-party tool:

For example, we tried “buy” instead of “build” for a product that syncs data from Postgres to Snowflake. Syncing from Postgres is the main offering (as in the offering with the most customers) from a leading data sync company, and we found that it would lose data, duplicate data, and corrupt data. After digging into it, it turns out that the product has a design that, among other issues, relies on the data source being able to seek backward on its changelog. But Postgres throws changelogs away once they’re consumed, so the Postgres data source can’t support this operation.
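To see the constraint Luu describes in code: with Postgres logical replication, once a consumer confirms an LSN, the server is free to discard the changelog up to that point, so there is nothing left to ‘seek backward’ into. Here’s a minimal sketch using psycopg2; the DSN and the slot name my_slot are hypothetical, and the slot is assumed to already exist.

```python
import psycopg2
import psycopg2.extras

# Hypothetical DSN; the logical replication slot "my_slot" must already
# exist (created with e.g. pg_create_logical_replication_slot).
conn = psycopg2.connect(
    "dbname=app user=replicator",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()
cur.start_replication(slot_name="my_slot", decode=True)

def consume(msg):
    print(msg.payload)  # hand the decoded change off to your pipeline here
    # Acknowledging the LSN tells Postgres that everything up to this
    # point has been consumed, so the server may discard that part of
    # the changelog. There is no rewinding past it later -- exactly the
    # design constraint the quote describes.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(consume)
```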
It boils down to one word: control. Simply put, with an in-house data pipeline you have complete control over the tool’s functionality and can freely tweak it to your needs.
A third-party tool automates the grunt work of building and maintaining pipelines for you, but it might not be tailored to your exact needs the way an in-house data pipeline is.
Let’s go over the pros and cons of building a data pipeline to round out this discussion.

Pros of Building Data Pipelines

Data Security: In today’s ecosystem of privacy regulations and concerns, it’s natural to be wary of how a third-party tool uses your proprietary data. If your data is a vital component of your competitive advantage, it becomes that much more important to keep this information internal. With in-house data pipelines, you can put the most robust cybersecurity measures of your choice in place to guard your data against attacks; this kind of flexibility might be missing from third-party data pipeline tools. With an in-house pipeline, you also avoid the overhead of reviewing and approving a new data processor that might add international data flows to your current setup.
Complete Control: If your pipeline needs drastic changes, waiting on a third party could adversely impact your time-to-value. Owning the development process gives you complete control over the data, ongoing support, and the pipeline roadmap. Most third-party tools might struggle to fully integrate with your existing solutions, which is why a lot of companies run a multi-ETL architecture: one ETL tool might not have all the connectors you need. Case in point: Incident.io has been using three tools (Stitch, Fivetran, and Segment) to sync their source data. But what once worked may soon hit a ceiling where constantly onboarding new tools to cater to your needs just isn’t feasible, at which point you’d be better off building your in-house data pipeline.
Building a data pipeline would be ideal for the following use cases:
Single source, one-time replication: When your business teams need data from your source only quarterly, yearly, or just once, buying a third-party tool isn’t recommended. Similarly, when you’re dealing with a smaller volume of data or a legacy application transfer, ‘build’ is recommended over ‘buy’ (a sketch of how small such a one-off extract can be follows this list).
2-3 sources, with no schema changes: When you can count your sources on one hand and you’re working with historical data with no schema changes, building an in-house pipeline is the recommended route.
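For a sense of how small ‘build’ can be in the one-time case, here’s a minimal single-source extract: dump a Postgres table to CSV for a bulk load into your warehouse. The DSN, table name, and output file are hypothetical placeholders.

```python
import psycopg2

# One-time, single-source replication: stream a whole table out as CSV
# (with a header row) in a single pass, ready for a warehouse bulk load.
with psycopg2.connect("dbname=app user=etl") as conn:  # hypothetical DSN
    with conn.cursor() as cur, open("orders.csv", "w") as f:
        cur.copy_expert("COPY orders TO STDOUT WITH CSV HEADER", f)
```

A handful of lines like this, run once a quarter, is hard for any vendor contract to beat on cost.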

Cons of Building Data Pipelines

Data Pipeline Performance and Monitoring: Setting up high-performance data pipelines requires both engineering and DevOps bandwidth. A DIY solution needs high-performance monitoring and instrumentation systems to keep track of errors, and building a dependable system capable of meeting the requirements of all use cases and operating scales is a hard nut to crack.
Manual Error Handling: Building a reliable in-house data pipeline requires immense technical expertise. To ensure data reliability, you’ll have to handle errors such as schema changes and data variations manually. Tackling these issues can result in never-ending delays, analysts working with inconsistent data, and ultimately lower-quality decisions. Alternatively, you can invest in third-party tools that offer automated schema management and resolve the trivial errors that come up; a sketch of the kind of schema-drift check you’d otherwise own yourself follows below.
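As an illustration of the error handling you’d otherwise own, here’s a minimal schema-drift check against Postgres’s information_schema. The expected schema, table name, and DSN are all hypothetical.

```python
import psycopg2

# The column contract this pipeline was built for (hypothetical).
EXPECTED = {"id": "integer", "email": "text", "created_at": "timestamp without time zone"}

def detect_drift(conn, table):
    """Compare a table's live columns against the expected schema."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT column_name, data_type FROM information_schema.columns "
            "WHERE table_name = %s",
            (table,),
        )
        actual = dict(cur.fetchall())
    added = set(actual) - set(EXPECTED)
    dropped = set(EXPECTED) - set(actual)
    retyped = {c for c in EXPECTED.keys() & actual.keys() if EXPECTED[c] != actual[c]}
    return added, dropped, retyped

conn = psycopg2.connect("dbname=app user=etl")  # hypothetical DSN
added, dropped, retyped = detect_drift(conn, "users")
if added or dropped or retyped:
    print(f"Schema drift in users: +{added} -{dropped} ~{retyped}")
```

A production pipeline would page on-call or quarantine the load when drift is found; this sketch only surfaces it, and it’s one of many such checks you’d need to write and maintain.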
According to Gartner’s latest estimate, the global low-code development technologies market will be worth ~$10 billion in 2023 and ~$12.3 billion in 2024. Over the past decade, usage of no-code/low-code tools has steadily picked up steam, and it’ll continue to grow at a fast pace.
Now that we’ve covered the pros and cons of building in-house pipelines, let’s go over the pros and cons of no-code data pipeline tools.

Pros of Buying a Data Pipeline Tool

Quick Turnaround Time (TAT): Off-the-shelf data pipeline solutions can meet the majority of a company’s use cases pretty swiftly. After the sales cycle, the only time required is for implementation. An easy setup and faster time-to-value ensure you achieve your goals sooner.
Less Building, More Analyzing: A managed tool lets you focus on core engineering objectives while your business teams jump straight into reporting, with no delays or data dependency on you.
No More Keeping Up With APIs: Keeping up with connector changes, like API expirations, is an arduous time drain for data engineers. Many data pipeline tools provide connectors out of the box, shifting the burden of keeping up with connectors from the company to the solution provider.
Comprehensive Support: Buying a data pipeline tool lets you tap into dedicated support that guides you every step of the way, so you can get back to your primary objectives instead of putting out fires. The vendor takes care of your maintenance and technical debt, distributing these costs over their complete customer base.
Scalable: As your connectors grow in number to keep up with your company’s growth, an in-house data pipeline might no longer be feasible. With third-party SaaS tools, you can rest assured that your data pipeline tool can keep up with the increasing number of connectors. On average, data pipeline tools support anywhere between 50 and 150 connectors!

Cons of Buying a Data Pipeline Tool

Exposure to the vendor’s market risk: Opting to buy a data pipeline tool exposes you to the vendor’s market risk. You need to be confident in their ability to weather a market downturn or other factors that might be detrimental to the health of their business, so read up on how stable the vendor is before signing a contract.
Buyer journey taking longer than expected: Evaluating and purchasing a data pipeline solution doesn’t happen overnight and can take a few months. It might take longer depending on how specific your use case is, and in the end you might decide to build in-house anyway; that’s time you aren’t getting back.
Less Flexibility: Most data pipeline tools cap how much you can modify their functionality. Sure, you can request a feature that’s specific to your use case, but the turnaround time might be several weeks or months, and the vendor might never implement the change if you’re the only one asking for it.
Vendor Lock-In: Whenever you choose a tool, there is an inevitable amount of lock-in. A scenario where you’re tied to the vendor with a multi-year agreement might not be ideal if you realize that the data pipeline isn’t a good fit for you, say 2 months into using it.
Multiple tools bring multiple learning curves: Every tool has a learning curve, even low-code/no-code ones. Hiring people who aren’t familiar with the tools you use can mean slower initial development, and in multi-ETL environments this overhead compounds and drags productivity down further.
Here’s a table to use when calculating the Total Cost of Ownership for an in-house data pipeline:
Cost of Storage and Infrastructure
Estimated cost: Per month, depending on the project scale.

Cost of People Who Manage/Build the Service
Estimated cost: ~$30K-50K/month. Considering a minimum of two data engineers on the team, here’s what the cost breakdown would look like:
-$300K-500K/year for 2 salaries with insurance and other costs
-~$162K/year for a project manager
-CTO/supervisor hours
-Factor in the burnout/morale cost of being on-call and fixing the pipeline if/when it breaks down, while everything else is put on hold!

Cost of Documenting and Training People to Use the Platform
Estimated cost: ~$3K-5K (one-time cost), plus 2 months on average to get a data team member acquainted with the platform. If you’re building your own data pipeline, you’ll need to document fixes for common problems, frameworks for dealing with new connectors, and how everyone in the company will interact with the data (the last bit helps prioritize team-wise data demands when a new connector request comes in). The bare minimum is documenting common problems with your in-house data pipeline and their fixes for new engineers joining the team.

Maintenance Costs
Estimated cost: ~$46K in the first year and $57K-65K/year in subsequent years (time spent by the data team on maintenance + resources used to build the pipeline + documentation). This includes technical debt: the shortcuts and trade-offs developers might have taken while building, managing, and maintaining the pipelines, e.g., lack of modularity, scalability issues, and poor code quality. The assumption here is that you have ~5 sources and each connector requires a dedicated week of maintenance work per quarter. Consider the impact on your technical debt when you add another 5-20 connectors to your in-house data pipeline.

Opportunity Costs
As an engineer, it’s much more natural to build things, but the bandwidth allocated to building an in-house pipeline comes with opportunity costs: the engineers building the data pipeline would otherwise be available to build things that provide significant value to the business. Make a list of potential missed opportunities and resulting costs:
-What other customer features/requests will be put on hold to address common data scalability challenges?
-If you didn’t hire two data engineers to build the data pipeline from scratch, who would you have hired and what would you have built?
-What is the impact on your on-call team when V1 of your data pipeline inevitably hits scale limits, either massively dropping query response times or losing data?
Considering all these costs, an in-house data pipeline will set you back by $40K-60K/month.
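Here’s that arithmetic as a quick sketch, using midpoints of the ballpark figures above; every number is an assumption to be replaced with your own estimates.

```python
# Rough monthly TCO for an in-house pipeline, using the ballpark figures
# from the table above. All values are assumptions, not quotes.
ANNUAL = {
    "two_data_engineers": 400_000,   # midpoint of $300K-500K/yr
    "project_manager": 162_000,
    "maintenance_and_debt": 60_000,  # roughly $57K-65K/yr after year one
}
one_time_docs = 4_000                # midpoint of $3K-5K, amortized over year one
infra_per_month = 5_000              # placeholder; scales with project size

monthly = sum(ANNUAL.values()) / 12 + one_time_docs / 12 + infra_per_month
print(f"~${monthly:,.0f}/month")     # lands inside the $40K-60K/month range above
```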
Now, here’s what a no-code SaaS tool like Hevo would cost in the same circumstances:

Cost of Storage and Infrastructure
Estimated cost: Per month, depending on the project scale.

Cost of Documenting and Training People to Use the Platform
Estimated cost: $0. The Hevo team is available round the clock to extend exceptional support through chat, email, and support calls.

Maintenance Costs / Cost of People Who Manage or Build the Service
Estimated cost: $0. The Hevo platform can be set up in just a few minutes and requires minimal maintenance.

Platform Adoption
Estimated time: ~3 months. This is the time it would take to get Hevo up and running as part of your workflow.
Based on the pricing plan you choose, using Hevo as your data pipeline tool will cost anywhere between $0 (if you opt for the 14-day free plan) and $1,159/month if you process 100M events. You can pick a custom plan if you have a larger requirement.
The ‘build vs buy’ question for data pipelines is an important decision for any CTO or CEO looking to become more data-driven. Ultimately, there is no one-size-fits-all answer, as the decision should be based on the resources, capabilities, and needs of the organization.
For those with the technical capacity and resources, building a custom data pipeline may be the best solution. For those without the technical resources or with limited resources, buying an existing data pipeline solution may be the more efficient and cost-effective option.
Regardless of whether you build or buy data pipelines, the key is to ensure that data is collected, managed, and stored securely, and that the system is scalable, reliable, and cost-effective for the organization’s needs.
If you decide to buy a data pipeline, Hevo can be a good choice if you’d prefer a no-code, automated tool that replicates data in real time. For the rare times things do go wrong, Hevo ensures zero data loss. Add 24×7 customer support to the list, and you get a reliable tool that puts you in the driver’s seat with greater visibility. Start modernizing your data stack by scheduling a demo now!