Automic ETL Development for Lakehouse Architecture

Carlos Pacheco

Automic ETL

AI-Augmented ETL Tool for Lakehouse Architecture
Automic ETL is a comprehensive data engineering platform that builds lakehouses using the medallion architecture (Bronze/Silver/Gold) on cloud storage (AWS S3, GCS, Azure Blob) with Apache Iceberg and Delta Lake tables. It features LLM integration for intelligent processing of unstructured, semi-structured, and structured data.

Features

Core Capabilities

| Feature | Description |
| --- | --- |
| Medallion Architecture | Bronze/Silver/Gold data layers with automatic transformations |
| Multi-Cloud Storage | AWS S3, Google Cloud Storage, Azure Blob Storage |
| Table Formats | Apache Iceberg and Delta Lake support |
| LLM Augmentation | Schema inference, entity extraction, NL-to-SQL |
| Data Quality | Validation rules, profiling, anomaly detection |
| Data Lineage | Full lineage tracking with impact analysis |
| REST API | Complete API for programmatic access |
| Web UI | Streamlit-based dashboard with theming |

Data Connectors

Databases

PostgreSQL, MySQL, MongoDB
Snowflake, BigQuery, Redshift

Streaming

Apache Kafka (with Schema Registry, Avro support)
AWS Kinesis (with Enhanced Fan-Out)
Google Pub/Sub

APIs

REST API (generic)
Salesforce, HubSpot, Stripe

Files & Storage

CSV, JSON, Parquet, Excel
AWS S3, Google Cloud Storage, Azure Blob
PDF and unstructured documents

Open Source Integrations

| Tool | Integration |
| --- | --- |
| Apache Spark | Distributed processing with Delta Lake/Iceberg |
| dbt | SQL transformations, model management |
| Great Expectations | Data validation and profiling |
| Apache Airflow | Workflow orchestration via REST API |
| MLflow | Experiment tracking, model registry |
| OpenMetadata | Data catalog and governance |

Installation
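
The source does not show the published package name, so the following assumes a hypothetical PyPI distribution called `automic-etl`:

```shell
pip install automic-etl
```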


Or install from source:
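
A typical source install would look like this; the repository URL is a placeholder:

```shell
git clone https://github.com/<your-org>/automic-etl.git
cd automic-etl
pip install -e .
```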

Optional Dependencies
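
Given the integrations listed above, optional dependency groups might be installed as extras. The extra names (`spark`, `dbt`, `llm`) are assumptions for illustration:

```shell
pip install "automic-etl[spark,dbt,llm]"
```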


Quick Start

Initialize Lakehouse
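
A sketch of initializing a lakehouse from the command line. The `automic` CLI name and flags are assumptions based on the features described in this README:

```shell
automic init --storage s3://my-lakehouse --table-format iceberg
```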


Basic Pipeline
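
A minimal Bronze → Silver → Gold pipeline might look like the sketch below. The module and class names (`automic_etl`, `Lakehouse`, `Pipeline`) and their methods are illustrative, not the documented API:

```python
# Hypothetical API sketch: ingest raw data to Bronze, cleanse into
# Silver, then aggregate into a Gold table.
from automic_etl import Lakehouse, Pipeline

lakehouse = Lakehouse.from_config("config/settings.yaml")

pipeline = Pipeline(name="orders_daily")
pipeline.ingest("data/orders.csv", target="bronze.orders")           # raw landing
pipeline.transform("bronze.orders", target="silver.orders",
                   rules=["deduplicate", "standardize_timestamps"])  # cleansing
pipeline.aggregate("silver.orders", target="gold.orders_by_region",
                   group_by=["region"], metrics={"revenue": "sum"})  # curation

pipeline.run(lakehouse)
```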


Streaming Ingestion (Kafka)
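
A hedged sketch of Kafka ingestion into the Bronze layer; `KafkaConnector` and its parameters are assumptions modeled on the connector list above (Schema Registry implies Avro decoding):

```python
# Hypothetical streaming connector: consume an Avro topic and land
# records in a Bronze table with periodic checkpoints.
from automic_etl.connectors import KafkaConnector

source = KafkaConnector(
    bootstrap_servers="localhost:9092",
    topic="orders",
    schema_registry_url="http://localhost:8081",  # Avro via Schema Registry
    group_id="automic-bronze-ingest",
)
source.stream_to(table="bronze.orders_stream", checkpoint_interval="30s")
```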


LLM-Powered Queries

The SQL Assistant provides natural language to SQL conversion with multi-turn conversation support, security validation, and tier-based access control.

Multi-turn Conversations
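
The multi-turn flow described above could be driven like this; `SQLAssistant` and its methods are illustrative names, not the actual API:

```python
# Hypothetical sketch of a multi-turn NL-to-SQL session.
from automic_etl.llm import SQLAssistant

assistant = SQLAssistant(tier="gold")   # tier-based access control
conv = assistant.start_conversation()

result = conv.ask("What were total sales by region last quarter?")
print(result.sql, result.confidence)

# A follow-up question reuses the conversation context
refined = conv.ask("Now only for the EMEA region, by month")
print(refined.sql)
```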


Result Explanation
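
Result explanation might be invoked as follows; the `query`/`explain` method names are assumptions:

```python
# Hypothetical sketch: ask the assistant to summarize a result set
# in plain language.
from automic_etl.llm import SQLAssistant

assistant = SQLAssistant(tier="gold")
result = assistant.query("Top 5 products by revenue this year")
print(assistant.explain(result))
```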


Using Integrations
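
A sketch of invoking two of the integrations from the table above; the `SparkIntegration` and `GreatExpectationsIntegration` class names and signatures are assumptions:

```python
# Hypothetical integration usage: Spark for a distributed transform,
# Great Expectations for validation.
from automic_etl.integrations import SparkIntegration, GreatExpectationsIntegration

spark = SparkIntegration(master="local[*]")
spark.run_transform("silver.orders", target="gold.orders_summary")

gx = GreatExpectationsIntegration()
report = gx.validate("silver.orders", suite="orders_suite")
print(report.success)
```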


REST API

Start the API server:
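
Since the API is FastAPI-based, starting it would look roughly like one of these; the `automic api` subcommand and module path are assumptions:

```shell
automic api --host 0.0.0.0 --port 8000
# or directly via uvicorn:
uvicorn automic_etl.api:app --port 8000
```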

API Endpoints

| Endpoint | Methods | Description |
| --- | --- | --- |
| /api/v1/health | GET | Health check and metrics |
| /api/v1/pipelines | GET, POST | List and create pipelines |
| /api/v1/pipelines/{id} | GET, PUT, DELETE | Manage pipeline |
| /api/v1/pipelines/{id}/run | POST | Execute pipeline |
| /api/v1/tables | GET, POST | List and create tables |
| /api/v1/tables/{id}/data | POST | Query table data |
| /api/v1/queries/execute | POST | Execute SQL or NL query |
| /api/v1/queries/natural | POST | Natural language query with conversation |
| /api/v1/queries/refine | POST | Refine previous query |
| /api/v1/queries/suggestions | GET | Get suggested queries |
| /api/v1/queries/autocomplete | GET | Query autocomplete |
| /api/v1/queries/schemas | GET | Get available table schemas |
| /api/v1/queries/rate-limit-status | GET | Check LLM rate limit status |
| /api/v1/queries/conversations/{id} | GET, DELETE | Manage conversations |
| /api/v1/connectors | GET, POST | Manage connectors |
| /api/v1/connectors/{id}/test | POST | Test connection |
| /api/v1/lineage/graph | GET | Get lineage graph |
| /api/v1/lineage/impact/{table} | GET | Impact analysis |
| /api/v1/jobs | GET, POST | Manage scheduled jobs |

Example: Execute Query via API
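
A request against `/api/v1/queries/execute` might look like this; the JSON field names are assumptions, since the payload schema is not shown here:

```shell
curl -X POST http://localhost:8000/api/v1/queries/execute \
  -H "Content-Type: application/json" \
  -d '{"query": "SELECT region, SUM(revenue) FROM gold.orders_by_region GROUP BY region"}'
```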


Example: Refine a Query
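
Refining a previous query via `/api/v1/queries/refine` could look like the following; the `conversation_id` and `refinement` field names are assumptions:

```shell
curl -X POST http://localhost:8000/api/v1/queries/refine \
  -H "Content-Type: application/json" \
  -d '{"conversation_id": "abc123", "refinement": "limit to the last 30 days"}'
```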


Web UI

Launch the Streamlit dashboard:
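
One of these invocations would be typical; the `automic ui` subcommand and the app module path are assumptions:

```shell
automic ui
# or directly via Streamlit:
streamlit run automic_etl/ui/app.py
```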

Features:
Home dashboard with medallion architecture overview
Data ingestion wizard (files, databases, APIs, streaming)
Pipeline builder with visual stage management
Query Studio - LLM-powered SQL interface with natural language support
Data profiling and quality metrics
Data lineage visualization
Monitoring dashboard
Settings and connector management

Query Studio

The Query Studio provides an intuitive interface for querying your data lakehouse using natural language or SQL:
Natural Language Tab: Ask questions in plain English and get optimized SQL
SQL Editor: Write and execute SQL with syntax validation and formatting
Conversation Tab: Multi-turn conversations with query refinement
Query History: Track and re-run previous queries
Features:
AI-powered SQL generation with confidence scores
Schema browser with tier-based access control (Bronze/Silver/Gold)
Interactive query results with auto-visualization
Export to CSV, JSON, Excel, Parquet
Query suggestions based on your data
Security validation and SQL injection prevention

CLI Usage
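
A few representative commands, sketched against the features described above; all subcommand names are illustrative assumptions:

```shell
automic status                              # lakehouse overview
automic pipeline run orders_daily           # execute a pipeline by name
automic tables list --layer silver          # list tables in a layer
automic lineage show gold.orders_by_region  # inspect table lineage
```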


Configuration

Create a config/settings.yaml file:
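
An illustrative `config/settings.yaml`; the key names are assumptions derived from the storage, table-format, and LLM options this README describes:

```yaml
storage:
  provider: s3              # s3 | gcs | azure
  bucket: my-lakehouse
  region: us-east-1

table_format: iceberg       # or: delta

llm:
  provider: anthropic       # anthropic | openai | litellm
  model: claude-sonnet-4-5

medallion:
  bronze: { path: bronze/ }
  silver: { path: silver/ }
  gold:   { path: gold/ }
```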

Architecture


Technology Stack

| Category | Technologies |
| --- | --- |
| Language | Python 3.10+ |
| Data Processing | Polars, PyArrow, PySpark |
| Table Formats | Delta Lake, Apache Iceberg |
| Cloud SDKs | boto3, google-cloud-storage, azure-storage-blob |
| LLM Providers | Anthropic, OpenAI, LiteLLM |
| Streaming | confluent-kafka, boto3 (Kinesis), google-cloud-pubsub |
| Web UI | Streamlit |
| REST API | FastAPI, Pydantic |
| CLI | Typer, Rich |
| Auth | JWT, OAuth2, RBAC |

Project Structure


Examples

See the examples/ directory for complete examples:
basic_pipeline.py - Simple ETL pipeline
streaming_pipeline.py - Kafka/Kinesis streaming
spark_pipeline.py - Distributed processing with Spark
dbt_pipeline.py - SQL transformations with dbt
llm_augmented_pipeline.py - LLM-powered data processing
scd2_pipeline.py - SCD Type 2 dimension tracking

License

MIT License - see LICENSE file for details.

Contributing

Contributions welcome! Please read our contributing guidelines.

Support

Documentation: Full docs

Posted Mar 25, 2026

Developed Automic ETL for building lakehouses using medallion architecture.