
Feature engineering (foundry/features/tabular_features.py)
Regulatory text ingestion (foundry/sources/sec_edgar.py)
Synthetic fraud generation (foundry/sources/synthetic_generator.py)
risk_explanation field: the LLM's training target for learning to reason about fraud, not just classify it.
The pipeline (foundry/pipeline.py) processes all three sources, merges and deduplicates (276,772 final pairs after removing 18,035 duplicates), shuffles, and exports to the Hugging Face Hub as a versioned Parquet dataset.
Release: foundry-v1.0 git tag, with SHA256 hashes in dataset_manifest.json for full reproducibility.
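The merge, deduplicate, shuffle, and hash steps can be sketched roughly as follows. This is a minimal standard-library illustration, not the code in foundry/pipeline.py. The prompt and completion field names, the duplicate rule (a case- and whitespace-insensitive text match), the seeded shuffle, and the manifest layout are all assumptions made for the example. The Parquet export and Hub upload are left out, and a JSON Lines serialization stands in for the file that would be hashed.

```python
import hashlib
import json
import random


def pair_key(pair: dict) -> str:
    """Hash the lowercased, whitespace-normalized prompt/completion text.

    The field names "prompt" and "completion" are hypothetical.
    """
    text = "\x1f".join(
        " ".join(str(pair.get(field, "")).lower().split())
        for field in ("prompt", "completion")
    )
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def merge_dedupe_shuffle(sources: list, seed: int = 0) -> tuple:
    """Concatenate sources, keep the first copy of each pair, then shuffle."""
    seen = set()
    unique = []
    removed = 0
    for source in sources:
        for pair in source:
            key = pair_key(pair)
            if key in seen:
                removed += 1  # count duplicates so the totals can be reported
                continue
            seen.add(key)
            unique.append(pair)
    # A seeded shuffle gives the same order on every rerun.
    random.Random(seed).shuffle(unique)
    return unique, removed


def build_manifest(pairs: list, tag: str) -> dict:
    """Serialize pairs deterministically and record a SHA256 of the bytes.

    This manifest layout is an assumption, not the real dataset_manifest.json.
    """
    payload = "\n".join(json.dumps(p, sort_keys=True) for p in pairs)
    return {
        "tag": tag,
        "num_pairs": len(pairs),
        "sha256": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
    }


if __name__ == "__main__":
    # Toy records standing in for the three sources.
    tabular = [{"prompt": "Txn A", "completion": "high risk"}]
    edgar = [{"prompt": "Filing B", "completion": "low risk"}]
    synthetic = [{"prompt": "txn  a", "completion": "High Risk"}]  # duplicate of Txn A
    pairs, removed = merge_dedupe_shuffle([tabular, edgar, synthetic], seed=42)
    manifest = build_manifest(pairs, tag="foundry-v1.0")
    print(len(pairs), removed, manifest["sha256"][:12])
```

Hashing a deterministic serialization (sorted keys, fixed shuffle seed) means the manifest hash only changes when the data itself changes, which is what makes it usable as a reproducibility check.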
Posted Apr 12, 2026
A complete data engineering pipeline: feature engineering, regulatory text ingestion, synthetic fraud generation, and LLM fine-tuning dataset export.
Mar 29, 2026 - Apr 3, 2026