Automic ETL Development for Lakehouse Architecture

Carlos Pacheco

Automic ETL

AI-Augmented ETL Tool for Lakehouse Architecture
Automic ETL is a comprehensive data engineering platform that builds lakehouses using the medallion architecture (Bronze/Silver/Gold) on cloud storage (AWS S3, GCS, Azure Blob) with Apache Iceberg and Delta Lake tables. It features LLM integration for intelligent processing of unstructured, semi-structured, and structured data.

Features

Core Capabilities

| Feature | Description |
| --- | --- |
| Medallion Architecture | Bronze/Silver/Gold data layers with automatic transformations |
| Multi-Cloud Storage | AWS S3, Google Cloud Storage, Azure Blob Storage |
| Table Formats | Apache Iceberg and Delta Lake support |
| LLM Augmentation | Schema inference, entity extraction, NL-to-SQL |
| Data Quality | Validation rules, profiling, anomaly detection |
| Data Lineage | Full lineage tracking with impact analysis |
| REST API | Complete API for programmatic access |
| Web UI | Streamlit-based dashboard with theming |

Data Connectors

Databases

PostgreSQL, MySQL, MongoDB
Snowflake, BigQuery, Redshift

Streaming

Apache Kafka (with Schema Registry, Avro support)
AWS Kinesis (with Enhanced Fan-Out)
Google Pub/Sub

APIs

REST API (generic)
Salesforce, HubSpot, Stripe

Files & Storage

CSV, JSON, Parquet, Excel
AWS S3, Google Cloud Storage, Azure Blob
PDF and unstructured documents

Open Source Integrations

| Tool | Integration |
| --- | --- |
| Apache Spark | Distributed processing with Delta Lake/Iceberg |
| dbt | SQL transformations, model management |
| Great Expectations | Data validation and profiling |
| Apache Airflow | Workflow orchestration via REST API |
| MLflow | Experiment tracking, model registry |
| OpenMetadata | Data catalog and governance |

Installation
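
The source does not show the published package name, so the following assumes a hypothetical PyPI distribution called `automic-etl`:

```shell
pip install automic-etl
```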


Or install from source:
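
A typical source install would look like this; the repository URL is a placeholder:

```shell
git clone https://github.com/<your-org>/automic-etl.git
cd automic-etl
pip install -e .
```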

Optional Dependencies
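
Given the integrations listed above, optional dependency groups might be installed as extras. The extra names (`spark`, `dbt`, `llm`) are assumptions for illustration:

```shell
pip install "automic-etl[spark,dbt,llm]"
```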


Quick Start

Initialize Lakehouse
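
A sketch of initializing a lakehouse from the command line. The `automic` CLI name and flags are assumptions based on the features described in this README:

```shell
automic init --storage s3://my-lakehouse --table-format iceberg
```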


Basic Pipeline
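
A minimal Bronze → Silver → Gold pipeline might look like the sketch below. The module and class names (`automic_etl`, `Lakehouse`, `Pipeline`) and their methods are illustrative, not the documented API:

```python
# Hypothetical API sketch: ingest raw data to Bronze, cleanse into
# Silver, then aggregate into a Gold table.
from automic_etl import Lakehouse, Pipeline

lakehouse = Lakehouse.from_config("config/settings.yaml")

pipeline = Pipeline(name="orders_daily")
pipeline.ingest("data/orders.csv", target="bronze.orders")           # raw landing
pipeline.transform("bronze.orders", target="silver.orders",
                   rules=["deduplicate", "standardize_timestamps"])  # cleansing
pipeline.aggregate("silver.orders", target="gold.orders_by_region",
                   group_by=["region"], metrics={"revenue": "sum"})  # curation

pipeline.run(lakehouse)
```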


Streaming Ingestion (Kafka)
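
A hedged sketch of Kafka ingestion into the Bronze layer; `KafkaConnector` and its parameters are assumptions modeled on the connector list above (Schema Registry implies Avro decoding):

```python
# Hypothetical streaming connector: consume an Avro topic and land
# records in a Bronze table with periodic checkpoints.
from automic_etl.connectors import KafkaConnector

source = KafkaConnector(
    bootstrap_servers="localhost:9092",
    topic="orders",
    schema_registry_url="http://localhost:8081",  # Avro via Schema Registry
    group_id="automic-bronze-ingest",
)
source.stream_to(table="bronze.orders_stream", checkpoint_interval="30s")
```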


LLM-Powered Queries

The SQL Assistant provides natural language to SQL conversion with multi-turn conversation support, security validation, and tier-based access control.

Multi-turn Conversations
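
The multi-turn flow described above could be driven like this; `SQLAssistant` and its methods are illustrative names, not the actual API:

```python
# Hypothetical sketch of a multi-turn NL-to-SQL session.
from automic_etl.llm import SQLAssistant

assistant = SQLAssistant(tier="gold")   # tier-based access control
conv = assistant.start_conversation()

result = conv.ask("What were total sales by region last quarter?")
print(result.sql, result.confidence)

# A follow-up question reuses the conversation context
refined = conv.ask("Now only for the EMEA region, by month")
print(refined.sql)
```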


Result Explanation
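
Result explanation might be invoked as follows; the `query`/`explain` method names are assumptions:

```python
# Hypothetical sketch: ask the assistant to summarize a result set
# in plain language.
from automic_etl.llm import SQLAssistant

assistant = SQLAssistant(tier="gold")
result = assistant.query("Top 5 products by revenue this year")
print(assistant.explain(result))
```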


Using Integrations
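
A sketch of invoking two of the integrations from the table above; the `SparkIntegration` and `GreatExpectationsIntegration` class names and signatures are assumptions:

```python
# Hypothetical integration usage: Spark for a distributed transform,
# Great Expectations for validation.
from automic_etl.integrations import SparkIntegration, GreatExpectationsIntegration

spark = SparkIntegration(master="local[*]")
spark.run_transform("silver.orders", target="gold.orders_summary")

gx = GreatExpectationsIntegration()
report = gx.validate("silver.orders", suite="orders_suite")
print(report.success)
```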


REST API

Start the API server:
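
Since the API is FastAPI-based, starting it would look roughly like one of these; the `automic api` subcommand and module path are assumptions:

```shell
automic api --host 0.0.0.0 --port 8000
# or directly via uvicorn:
uvicorn automic_etl.api:app --port 8000
```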

API Endpoints

| Endpoint | Methods | Description |
| --- | --- | --- |
| /api/v1/health | GET | Health check and metrics |
| /api/v1/pipelines | GET, POST | List and create pipelines |
| /api/v1/pipelines/{id} | GET, PUT, DELETE | Manage pipeline |
| /api/v1/pipelines/{id}/run | POST | Execute pipeline |
| /api/v1/tables | GET, POST | List and create tables |
| /api/v1/tables/{id}/data | POST | Query table data |
| /api/v1/queries/execute | POST | Execute SQL or NL query |
| /api/v1/queries/natural | POST | Natural language query with conversation |
| /api/v1/queries/refine | POST | Refine previous query |
| /api/v1/queries/suggestions | GET | Get suggested queries |
| /api/v1/queries/autocomplete | GET | Query autocomplete |
| /api/v1/queries/schemas | GET | Get available table schemas |
| /api/v1/queries/rate-limit-status | GET | Check LLM rate limit status |
| /api/v1/queries/conversations/{id} | GET, DELETE | Manage conversations |
| /api/v1/connectors | GET, POST | Manage connectors |
| /api/v1/connectors/{id}/test | POST | Test connection |
| /api/v1/lineage/graph | GET | Get lineage graph |
| /api/v1/lineage/impact/{table} | GET | Impact analysis |
| /api/v1/jobs | GET, POST | Manage scheduled jobs |

Example: Execute Query via API
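
A request against `/api/v1/queries/execute` might look like this; the JSON field names are assumptions, since the payload schema is not shown here:

```shell
curl -X POST http://localhost:8000/api/v1/queries/execute \
  -H "Content-Type: application/json" \
  -d '{"query": "SELECT region, SUM(revenue) FROM gold.orders_by_region GROUP BY region"}'
```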


Example: Refine a Query
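
Refining a previous query via `/api/v1/queries/refine` could look like the following; the `conversation_id` and `refinement` field names are assumptions:

```shell
curl -X POST http://localhost:8000/api/v1/queries/refine \
  -H "Content-Type: application/json" \
  -d '{"conversation_id": "abc123", "refinement": "limit to the last 30 days"}'
```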


Web UI

Launch the Streamlit dashboard:
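
One of these invocations would be typical; the `automic ui` subcommand and the app module path are assumptions:

```shell
automic ui
# or directly via Streamlit:
streamlit run automic_etl/ui/app.py
```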

Features:
Home dashboard with medallion architecture overview
Data ingestion wizard (files, databases, APIs, streaming)
Pipeline builder with visual stage management
Query Studio - LLM-powered SQL interface with natural language support
Data profiling and quality metrics
Data lineage visualization
Monitoring dashboard
Settings and connector management

Query Studio

The Query Studio provides an intuitive interface for querying your data lakehouse using natural language or SQL:
Natural Language Tab: Ask questions in plain English and get optimized SQL
SQL Editor: Write and execute SQL with syntax validation and formatting
Conversation Tab: Multi-turn conversations with query refinement
Query History: Track and re-run previous queries
Features:
AI-powered SQL generation with confidence scores
Schema browser with tier-based access control (Bronze/Silver/Gold)
Interactive query results with auto-visualization
Export to CSV, JSON, Excel, Parquet
Query suggestions based on your data
Security validation and SQL injection prevention

CLI Usage
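
A few representative commands, sketched against the features described above; all subcommand names are illustrative assumptions:

```shell
automic status                              # lakehouse overview
automic pipeline run orders_daily           # execute a pipeline by name
automic tables list --layer silver          # list tables in a layer
automic lineage show gold.orders_by_region  # inspect table lineage
```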


Configuration

Create a config/settings.yaml file:
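
An illustrative `config/settings.yaml`; the key names are assumptions derived from the storage, table-format, and LLM options this README describes:

```yaml
storage:
  provider: s3              # s3 | gcs | azure
  bucket: my-lakehouse
  region: us-east-1

table_format: iceberg       # or: delta

llm:
  provider: anthropic       # anthropic | openai | litellm
  model: claude-sonnet-4-5

medallion:
  bronze: { path: bronze/ }
  silver: { path: silver/ }
  gold:   { path: gold/ }
```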

Architecture


Technology Stack

| Category | Technologies |
| --- | --- |
| Language | Python 3.10+ |
| Data Processing | Polars, PyArrow, PySpark |
| Table Formats | Delta Lake, Apache Iceberg |
| Cloud SDKs | boto3, google-cloud-storage, azure-storage-blob |
| LLM Providers | Anthropic, OpenAI, LiteLLM |
| Streaming | confluent-kafka, boto3 (Kinesis), google-cloud-pubsub |
| Web UI | Streamlit |
| REST API | FastAPI, Pydantic |
| CLI | Typer, Rich |
| Auth | JWT, OAuth2, RBAC |

Project Structure


Examples

See the examples/ directory for complete examples:
basic_pipeline.py - Simple ETL pipeline
streaming_pipeline.py - Kafka/Kinesis streaming
spark_pipeline.py - Distributed processing with Spark
dbt_pipeline.py - SQL transformations with dbt
llm_augmented_pipeline.py - LLM-powered data processing
scd2_pipeline.py - SCD Type 2 dimension tracking

License

MIT License - see LICENSE file for details.

Contributing

Contributions welcome! Please read our contributing guidelines.

Support

Documentation: Full docs

Posted Mar 25, 2026

Developed Automic ETL for building lakehouses using medallion architecture.