Regulatory Document Analysis

AI Engineer — Multi-Model Orchestration, Context Caching, Intelligent ETL Pipelines
Technical team at MEWE Architects
From April 2025 to July 2025 (4 months)
Building MAWA, an AI platform that transforms hundred-page urban planning regulations into structured, traceable intelligence — reducing analysis time from days to minutes.
Supabase (PostgreSQL), xAI Grok, Google Gemini, Mistral AI (OCR), Pydantic
OVERVIEW
Urban planning regulations in France are a nightmare by design: hundreds of pages of legal constraints, zoning maps, and exception tables, all buried in PDFs. I built MAWA, an AI platform that orchestrates multiple LLM providers to transform these documents into structured, legally traceable intelligence for architects and urban planners.
THE PROBLEM
French local urban planning documents (PLU, Plan Local d'Urbanisme) are notoriously complex: a single city's regulations can span over 1,000 pages of dense legal text, mixed with technical tables, zoning maps, and cross-references between sections. Every rule must be traced back to its source for legal accountability.
Classic RAG pipelines fall short here. The context is too large for simple chunking strategies, tables and maps carry critical information that text-only approaches miss, and architects need 100% verifiability — not "probably correct" summaries.
DATA & PIPELINE
The ETL pipeline reconstructs document semantics with high fidelity:
- OCR Extraction: Mistral OCR digitizes raw PDFs, preserving the spatial relationship between text blocks, tables, and visual elements — not just extracting flat text.
- Multimodal Processing: Complex tables and zoning diagrams are encoded as Base64 images and sent alongside text, so the LLM can "see" visual constraints that text-only extraction would miss.
- Schema Enforcement: Every intermediate and final output is validated through Pydantic models, ensuring data integrity from ingestion to final JSON reporting.
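The multimodal and schema-enforcement steps above can be sketched together as follows. This is a minimal illustration, not MAWA's actual schema: the `PageBlock` model, its field names, and the helpers `encode_image`, `make_block`, and `is_valid` are all hypothetical. It only assumes that the `pydantic` package is installed.

```python
import base64
from typing import Literal, Optional

from pydantic import BaseModel, Field, ValidationError


class PageBlock(BaseModel):
    """One extracted block. Field names are illustrative, not MAWA's schema."""

    page: int = Field(ge=1)  # 1-based page number in the source PDF
    kind: Literal["text", "table", "figure"]
    text: str = ""
    image_b64: Optional[str] = None  # Base64 image for tables and figures


def encode_image(raw: bytes) -> str:
    """Base64-encode raw image bytes as an ASCII string."""
    return base64.b64encode(raw).decode("ascii")


def make_block(
    page: int, kind: str, text: str = "", image: Optional[bytes] = None
) -> PageBlock:
    """Build a validated block; pydantic raises ValidationError on bad input."""
    return PageBlock(
        page=page,
        kind=kind,
        text=text,
        image_b64=encode_image(image) if image is not None else None,
    )


def is_valid(**fields) -> bool:
    """Return True if the fields satisfy the PageBlock schema."""
    try:
        PageBlock(**fields)
        return True
    except ValidationError:
        return False
```

For example, `make_block(12, "table", image=png_bytes)` yields a block whose `image_b64` can be attached to a multimodal request, while `is_valid(page=0, kind="text")` is rejected because the page number must be at least 1.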
METHODOLOGY
MAWA's intelligence rests on three pillars:
- Multi-Model Orchestration: Tasks are routed dynamically between Gemini Pro (deep regulatory analysis), Gemini Flash (speed-sensitive operations), and xAI Grok (structured reasoning) — optimizing the performance-to-cost ratio for each step.
- Context Caching: A city's "General Provisions", hundreds of pages that apply to every zone, are loaded into Gemini's Context Cache once. Individual zone analyses then reference this cache instead of resending the full text, which avoids reprocessing the same pages on every call and lowers API costs.
- Dynamic Schema Generation: Before extracting rules, the system uses an LLM to scan sample zones and generate a custom JSON schema tailored to that city's regulatory structure, so it adapts to the wide variation between French PLUs without hardcoded rules.
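The routing and caching ideas above can be sketched in plain Python. Everything here is an assumption made for illustration: the `Task` categories, the model labels in `ROUTES` (placeholders, not real model identifiers), and `SharedContextCache`, which is a local stand-in for a provider-side context cache and does not call any real API.

```python
import hashlib
from dataclasses import dataclass, field
from enum import Enum


class Task(Enum):
    DEEP_ANALYSIS = "deep_analysis"
    FAST_LOOKUP = "fast_lookup"
    STRUCTURED_REASONING = "structured_reasoning"


# Placeholder labels, not real model identifiers.
ROUTES = {
    Task.DEEP_ANALYSIS: "gemini-pro",
    Task.FAST_LOOKUP: "gemini-flash",
    Task.STRUCTURED_REASONING: "grok",
}


def route(task: Task) -> str:
    """Return the model label assigned to a task type."""
    return ROUTES[task]


@dataclass
class SharedContextCache:
    """Local stand-in for a provider-side context cache (hypothetical)."""

    _handles: dict = field(default_factory=dict)
    uploads: int = 0  # counts how often the shared text was actually sent

    def get_or_create(self, city: str, general_provisions: str) -> str:
        """Return a cache handle, uploading the shared text only once."""
        digest = hashlib.sha256(general_provisions.encode("utf-8")).hexdigest()
        key = (city, digest)
        if key not in self._handles:
            self.uploads += 1  # the expensive step happens only here
            self._handles[key] = f"cache/{city}/{digest[:12]}"
        return self._handles[key]


def build_request(
    cache: SharedContextCache,
    city: str,
    provisions: str,
    zone_text: str,
    task: Task,
) -> dict:
    """Assemble a request that references the cache instead of the full text."""
    return {
        "model": route(task),
        "cached_context": cache.get_or_create(city, provisions),
        "prompt": zone_text,
    }
```

Building requests for several zones of the same city reuses one handle, so `cache.uploads` stays at 1 however many zones are analysed. Keying the handle on a hash of the text means an updated set of General Provisions gets a fresh cache entry rather than silently reusing a stale one.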
RESULTS & IMPACT
Every AI-extracted rule is linked to a specific page number, paragraph, and table ID — giving architects the legal accountability they need to trust automated analysis.
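One way to picture this traceability requirement is a check that every extracted rule cites a location that really exists in the OCR output. The sketch below uses plain dataclasses for brevity. The names `SourceRef`, `Rule`, and `untraceable` are hypothetical, and the same shape could equally be expressed as Pydantic models.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class SourceRef:
    """Where a rule comes from in the source document."""

    page: int
    paragraph: int
    table_id: Optional[str] = None  # set only when the rule comes from a table


@dataclass(frozen=True)
class Rule:
    """An extracted rule together with its citation."""

    zone: str
    statement: str
    source: SourceRef


def untraceable(rules: Iterable[Rule], known: Set[SourceRef]) -> List[Rule]:
    """Return rules whose citation matches no location found during OCR."""
    return [rule for rule in rules if rule.source not in known]
```

A rule citing `SourceRef(15, 1, "T3")` passes if that exact page, paragraph, and table were seen during extraction. Any rule returned by `untraceable` would be held back for review rather than shown to an architect as verified.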
MAWA reduces urban regulatory analysis from days to minutes. More importantly, it demonstrates that Generative AI can move beyond chat interfaces into robust industrial software components — where precision, traceability, and cost control matter as much as raw capability.
Detailed architecture documentation available upon request.
TECH STACK
AI & Models: Google Gemini (Pro/Flash), xAI Grok, Mistral OCR.
Backend: Python, Pydantic, Supabase (PostgreSQL), SQLite.
Pipeline: Custom ETL with multimodal processing (Text + Image Base64).
This is an archived project. Please reach out if you have any questions.