IOS Risk Data Foundry — Domain-Specific Financial Risk AI by Ugo ChukwuIOS Risk Data Foundry — Domain-Specific Financial Risk AI by Ugo Chukwu

IOS Risk Data Foundry — Domain-Specific Financial Risk AI

Ugo Chukwu

Completed work

Data Engineer

ML Engineer

AI Engineer

BeautifulSoup

Hugging Face

Python

Artificial Intelligence

Overview

This project is the second component of IntelligenceOS Risk — a domain-specific AI system I am building from scratch for financial risk and fraud detection. The Data Foundry is the data infrastructure layer: it takes raw inputs from three sources and produces a clean, versioned, training-ready dataset for fine-tuning a language model.

Deliverable: 276,772 instruction pairs in Alpaca format, published to HuggingFace Hub, versioned with SHA256 fingerprints, reproducible with a single command.

The Problem

Training a financial risk AI on generic data produces a generic model. The data infrastructure — how you engineer features, what text you ingest, how you generate synthetic edge cases — is what makes a domain-specific model meaningfully better than a general one.

Three specific challenges had to be solved:

Class imbalance. The real fraud dataset is 99.83% legitimate transactions. A model trained on this as-is learns to predict "legitimate" always and scores 99.83% accuracy — useless for fraud detection.

Domain language. Regulatory risk language (AML, BSA, SAR filings, 10-K risk sections) is not in any standard training corpus. A model that can't read a regulatory filing is not enterprise-grade.

Reasoning, not just classification. Binary fraud labels teach a model to classify. Natural language risk explanations teach it to reason — to articulate why a transaction is suspicious in the language a risk analyst would use.

What I Built

Feature Engineering Pipeline (foundry/features/tabular_features.py)

11 fraud-signal features engineered from raw transaction data, each encoding specific domain knowledge about how fraud behaves:

Velocity features (transaction counts per 1h and 24h windows) — catches card testing

Amount z-score, round-number flag, micro-transaction flag, large-transaction flag

Time features: hour of day, off-hours flag (11pm–6am), day of week proxy

Result: avg_precision improved from 0.8510 → 0.8623 (+0.0114). Recall improved by +0.051 — the model detects significantly more actual fraud.

SEC EDGAR Ingestion Pipeline (foundry/sources/sec_edgar.py)

Fetches 10-K risk filings from the public SEC EDGAR full-text search API across five risk query categories (fraud, AML, credit default, cybersecurity, operational risk). Downloads, HTML-strips, and chunks into 512-word overlapping windows formatted as instruction pairs.

Synthetic Fraud Scenario Factory

(foundry/sources/synthetic_generator.py)

Four fraud type generators producing realistic transactions with natural-language risk explanations:

Card testing: micro-transaction velocity bursts

Account takeover: large off-hours purchases

Money mule: round-number structuring below BSA reporting threshold

Bust-out fraud: sudden high-velocity large purchases

Each record includes a risk_explanation field — the LLM's training target for learning to reason about fraud, not just classify it.

Pipeline Orchestration and Export

A fully orchestrated pipeline (foundry/pipeline.py) processes all three sources, merges and deduplicates (276,772 final pairs after removing 18,035 duplicates), shuffles, and exports to HuggingFace Hub as a versioned Parquet dataset.

Dataset versioned as foundry-v1.0 with SHA256 hashes in dataset_manifest.json for full reproducibility.