LegalGPT is Claw LegalTech’s proprietary Generative AI assistant tailored for Indian legal professionals. The tool allows users to perform contextual legal research, draft contracts, and understand court judgments by interacting with a chat-based interface powered by a custom-trained LLM.
At its core, LegalGPT is a hybrid RAG (Retrieval-Augmented Generation) system layered with multiple legal-specific capabilities, including citation insertion, clause extraction, judgment summarization, and precedent suggestion.
Objective:
To build a domain-specific LLM system optimized for legal texts with reliable, up-to-date Indian case law and legislation. The system needed a scalable and maintainable ETL pipeline to handle continuously evolving legal corpora.
ETL Pipeline: Design and Implementation
1. Data Sources
Case Laws: Extracted from Indian judiciary portals (e.g., Indian Kanoon, eCourts, JUDIS)
Statutes and Bare Acts: Scraped from government websites in structured and unstructured formats
Legal Contracts: Anonymized document bank from law firms (PDFs, DOCX, HTML)
Law Firm Blogs and Treatises: Extracted with semantic weighting for relevance
2. Extraction Layer
Built with Scrapy and Playwright to scrape dynamically rendered pages
Used asyncio-based job runners to schedule crawls for more than 200k case law documents per month
Integrated pdfminer and docx parsers for contract ingestion
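The asyncio-based scheduling described above can be sketched roughly as follows. This is a minimal illustration, not the production runner: fetch_document, run_crawl_jobs, CrawlResult, and the max_concurrency parameter are hypothetical names, and the fetch step is a stub standing in for the real Scrapy/Playwright request. It shows one common pattern, a semaphore that caps the number of crawls in flight while failures are recorded instead of aborting the batch.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class CrawlResult:
    url: str
    ok: bool
    detail: str  # page body on success, error message on failure


async def fetch_document(url: str) -> str:
    """Hypothetical stand-in for the real Scrapy/Playwright fetch."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    if "bad" in url:
        raise ValueError(f"could not fetch {url}")
    return f"<html>{url}</html>"


async def run_crawl_jobs(urls, max_concurrency=5, fetch=fetch_document):
    """Run one fetch job per URL with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(url):
        async with semaphore:
            try:
                body = await fetch(url)
                return CrawlResult(url, True, body)
            except Exception as exc:
                # Record the failure and keep the rest of the batch running.
                return CrawlResult(url, False, str(exc))

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(run_one(u) for u in urls))


if __name__ == "__main__":
    batch = ["https://example.org/case/1", "https://example.org/bad/2"]
    for result in asyncio.run(run_crawl_jobs(batch, max_concurrency=2)):
        print(result.url, "ok" if result.ok else "failed")
```

Capping concurrency with a semaphore keeps the crawler polite toward court portals, and returning a result object per URL makes it straightforward to requeue failed documents in a later run.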
3. Transformation Layer
Data Cleaning: Removal of procedural text, pagination artifacts, and footnotes
Clause and Entity Tagging: Custom spaCy NER pipeline to tag: