Open Source Data Management Toolkit Design

Sayantika Banik

🚌 DataJourney

🪶 Short version
Design-first open source data management toolkit. Simplifies data workflows with modular, reproducible solutions.
🌲 Long version
DataJourney demonstrates how organizations can manage and use data effectively with open-source technologies. It is designed to help navigate the complex landscape of data tools, offering a structured approach to building scalable and reproducible data workflows.
Built on open-source principles, the framework guides users through the essential steps, from identifying goals and selecting tools to testing and customising workflows. With its flexible, modular design, DataJourney can be tailored to individual needs, making it an invaluable toolkit for data professionals.

🚦 Hold on, looking to contribute?

Head over to the wiki, let's make it happen together. We don't bite :)

🧱 Design Philosophy (LEGO)

Built like LEGO: layers can be added or removed, and open source is the glue that holds them together. Each layer has a certain strength of communication built in.
P0 (Base): Static home(s) to keep it together (GitHub)
P1 (Tooling): Tooling, strings (Powered by open source)
P2 (Maintenance + Monitoring): Env, automations (Pixi + GHA)
P3 (Abstraction): Layer(s), CLI/task manager for users to interact with (Pixi)

🛠 Current workflows covered

{✨ = Experimental, ✅ = Implemented}
Status: Workflow description
✅: Python packaging framework design principles
✅: GitHub Actions configured
✅: Vale.sh configured at PR level
✅: Pre-commit hooks configured for code linting/formatting
✅: Hello world LLM design example based on LangChain
✅: Environment management via pixi
✅: Reading data from online sources using intake
✅: Sample pipeline built using Dagster
✅: Building a dashboard using holoviews + panel
✅: Exploratory data analysis (EDA) using mito
✅: Web UI built on Flask
✅: Web UI redone and expanded with FastHTML
✅: Leverage AI models to analyse data (GitHub AI models Beta)

☕️ Quickly getting started with DataJourney

Fork the repository
Generate and add a GITHUB_TOKEN (instructions here); this is required to run the LLM-based workflows
Switch directory: cd DataJourney
Download pixi: prefix.dev
Activate env: pixi shell
Install the DJ framework locally: pixi run DJ_package
List all the tasks: pixi run DJ_list
Execute a specific task from the list: pixi run <TASK_NAME>
Execute a specific task with additional logs: pixi run -v <TASK_NAME>
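
The task commands above can also be driven from Python, for example from a setup or CI script. This is a minimal sketch and not part of DataJourney: the helper names build_pixi_command and run_task are hypothetical, and the only things taken from the steps are the `pixi run <TASK_NAME>` and `pixi run -v <TASK_NAME>` forms.

```python
import shutil
import subprocess


def build_pixi_command(task_name, verbose=False):
    """Build the argument list for `pixi run [-v] <TASK_NAME>`."""
    command = ["pixi", "run"]
    if verbose:
        command.append("-v")  # additional logs, as in the steps above
    command.append(task_name)
    return command


def run_task(task_name, verbose=False, which=shutil.which):
    """Run a task from the repository root (hypothetical helper).

    `which` is injectable so the PATH lookup can be replaced in tests.
    """
    if which("pixi") is None:
        raise RuntimeError("pixi not found on PATH; download it from prefix.dev")
    # check=True raises CalledProcessError if the task exits with an error.
    return subprocess.run(build_pixi_command(task_name, verbose), check=True)
```

Calling run_task("DJ_list") from the DataJourney directory would then be equivalent to typing pixi run DJ_list, and run_task("DJ_dagster", verbose=True) to pixi run -v DJ_dagster.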

🏃🏽‍♀️ Active tasks under DJ

Task name: Description
GIT_TOKEN_CHECK: Verifies the availability and validity of the Git authentication token.
DJ_package: Prepares and builds the Python package for the DataJourney project.
DJ_pre_commit: Runs pre-commit hooks to ensure code quality and adherence to standards.
DJ_dagster: Sets up and runs a Dagster workflow for orchestration in the project.
DJ_fasthtml_app: Executes a FastHTML-based web application.
DJ_flask_app: Configures and runs a Flask-based application for data services.
DJ_mito_app: Launches the Mito application for interactive data analysis in notebooks.
DJ_panel_app: Executes a Panel dashboard app for data visualization and analytics.
DJ_llm_analysis: Performs analysis using large language models (LLMs) on project data.
DJ_hello_world_langchain: Sets up a basic LangChain app as a "Hello World" example for LLMs.
DJ_spanish_eng_translation: Performs Spanish-to-English translation with DeepSeek-R1 (NOTE: takes about 30 seconds to execute).
DJ_sync_dataset_trees: Downloads and synchronizes the trees.csv dataset into the project structure.
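
To illustrate what a check like GIT_TOKEN_CHECK might involve, here is a hedged sketch in Python. It is not the project's implementation: the function names are hypothetical, and it assumes the token is read from a GITHUB_TOKEN environment variable and validated with an authenticated GET request to the GitHub REST API endpoint https://api.github.com/user.

```python
import os
import urllib.error
import urllib.request

GITHUB_USER_ENDPOINT = "https://api.github.com/user"


def read_token(env=None, name="GITHUB_TOKEN"):
    """Availability check: return the token, or None if absent or blank."""
    env = os.environ if env is None else env
    token = (env.get(name) or "").strip()
    return token or None


def build_validation_request(token):
    """Build an authenticated GET request for the current user."""
    return urllib.request.Request(
        GITHUB_USER_ENDPOINT,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )


def token_is_valid(token, timeout=10):
    """Validity check: needs network access; True if GitHub accepts the token."""
    try:
        request = build_validation_request(token)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        # For example 401 Unauthorized for a bad or expired token.
        return False
```

Splitting availability (read_token) from validity (token_is_valid) keeps the first check offline, so a missing token can be reported without making a network call.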

🔌 About pre-commit hooks and activating them

As the name suggests, pre-commit hooks run before each commit and format the code according to PEP standards.
pixi run DJ_pre_commit

🦭 Executing LLM script: Generate stock price recommendations

pixi run DJ_llm_analysis

🪼 Execute pre-configured Dagster pipeline

pixi run DJ_dagster

🐙 Panel app

pixi run DJ_panel_app
NOTE: The generated dashboard is exported to HTML and saved as stock_price_twilio_dashboard.

🐡 Mito

To explore further, visit trymito.io
pixi run DJ_mito_app

🦋 Display all data sources present via web UI

# Run FastHTML app
pixi run DJ_fasthtml_app

Posted Apr 27, 2025

Designed an open-source data management toolkit for modular workflows.