Open Source | Data Source Control

Max Hora

The Data Version Control (DVC) project is an open-source version control system specifically designed for machine learning projects. DVC enables data scientists and machine learning engineers to efficiently manage data, models, and experiments.

Project Scope

The DVC team aimed to fulfill user requests by making their tool accessible via the Anaconda Package Manager. Additionally, the goal was to expand DVC’s capabilities by integrating support for Google Drive cloud storage.

Key Contributions

Anaconda DVC Package: Developed the dvc-feedstock repository, enabling DVC and all relevant dependencies to be available through the Anaconda Package Manager.
Google Drive Integration: Implemented the highly requested Google Drive support, allowing users to store data seamlessly using Google's cloud storage service. Beyond software development, I personally handled Google's verification process, successfully collaborating with Google's API team to activate necessary Google Drive API services for DVC.
Documentation Enhancement: Updated and expanded DVC documentation with comprehensive, step-by-step guides for setting up and utilizing the Google Drive integration.

Technologies Used

Python: Developed new features and resolved bugs within the DVC codebase.
Google Drive API: Read and wrote user data in Google Drive storage as required by DVC’s functionality.
Git: Utilized for effective version control and collaborative development.
Markdown: Created and maintained clear, user-friendly documentation.
Continuous Integration: Established CI processes to maintain high-quality standards and ensure the reliability of the DVC project.

Challenges and Learnings

Integrating DVC with the Anaconda Package Manager required managing all of DVC's dependencies, updating them to compatible versions, and maintaining a dedicated GitHub repository listed in the conda-forge recipe database. These technical challenges were resolved one by one, making DVC accessible to conda users.

Adding Google Drive support required reliable interaction with the Google Drive APIs. After thorough research, the decision was made to revive and maintain PyDrive2, an open-source Python wrapper for the Google Drive APIs. This wrapper was integrated into DVC, giving it straightforward access to the APIs from Python. Through extensive testing and iterative improvements, the solution was tuned to respect Google's API usage limits.
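
The write-up does not say how the API usage limits were respected. One common technique for rate-limited APIs is retrying with exponential backoff, and the following is a minimal sketch of that idea only. It is not code from DVC or PyDrive2: `RateLimitError`, `call_with_backoff`, and all parameters are hypothetical names chosen for illustration.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a 'rate limit exceeded' API error."""


def call_with_backoff(request, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call request(), retrying with exponential backoff on RateLimitError.

    Assumption: the caller translates the API's rate-limit response into
    RateLimitError. `sleep` is injectable so the delays can be tested.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Wait base_delay * 2^attempt seconds plus up to 1 s of random jitter.
            sleep(base_delay * (2 ** attempt) + random.random())
```

Doubling the delay after each failure reduces pressure on the service, and the random jitter keeps many clients from retrying at the same instant.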
Five-star feedback received for the work on the DVC project

Outcome

The DVC project was successfully published on the Anaconda Package Manager, greatly enhancing its accessibility. Additionally, the integration of Google Drive support significantly expanded DVC’s data storage capabilities, benefiting its growing community. The PyDrive2 library was revitalized and actively maintained to ensure reliable and efficient interaction with Google Drive APIs from Python-based applications.

Posted Mar 20, 2025

Anaconda Package Manager support and Google Drive storage integration allowed DVC to significantly extend accessibility for its user base.

Timeline

Jul 25, 2019 - Mar 28, 2020

Clients

Iterative.ai