System Overview & Architecture

EPGOAT Documentation - AI Reference (Educational)


Note: This is the educational, human-readable version with examples and detailed explanations. For the AI-optimized version, see 1-CONTEXT/_SYSTEM.md.


EPGOAT System Reference (AI-Optimized)

Purpose: Ultra-compressed system overview for Claude Code AI assistant
Token Budget: ~5K tokens (part of 50K Layer 1 budget)
Load Priority: HIGH (always load at session start)
Last Updated: 2025-11-14

🎯 Why Aggressive Token Compression?

Layer 1 (CONTEXT) uses aggressive compression (95% reduction: 200K → 11K tokens) for three reasons:

1. AI Context Window Limits
  • Claude has a 200K token context limit
  • Compressed docs fit easily, with room left for code
  • Enables loading the entire system context at once

2. Cost Efficiency
  • Tokens = $$$
  • Smaller context = lower costs per session
  • Budget: <50K tokens for all Layer 1 docs

3. Faster Processing
  • Less content = faster reads
  • AI can scan compressed docs quickly
  • Better for real-time decision making

Trade-off: Humans need the verbose version (Layer 3 HTML)
  • Educational content with examples
  • Diagrams and visual aids
  • "Why this matters" sections

Full Details: Documentation/03-Architecture/System-Overview.md (1363 lines, human-readable)


Identity

EPGOAT = SaaS platform generating high-quality Electronic Program Guide (EPG) data for IPTV services

Problem Solved: IPTV services have incomplete/inaccurate EPG data
Solution: AI-powered matching of IPTV channels → real sports events from databases

Business Model:
  • Provider Packs: $2.99-$6.99/month (shared EPG for popular IPTV providers)
  • BYO Builder: $5.99-$12.99/month (custom EPG from user M3U)

Target State: 96%+ match rate, 10K+ channels/day, 100-1K paying customers


Architecture (C4 Level 1-2)

📖 Understanding C4 Architecture Model

The C4 model (Context, Containers, Components, Code) provides a hierarchical way to describe systems:

Level 1 - System Context: The big picture
  • Shows EPGOAT and its external dependencies
  • Useful for: onboarding, executive overviews, compliance docs
  • Example: "EPGOAT connects to TheSportsDB API and serves EPG data to IPTV players"

Level 2 - Containers: Deployable units
  • Shows the EPG Engine, Admin Portal, Cloudflare Workers, etc.
  • Useful for: DevOps, deployment planning, technology decisions
  • Example: "EPG Engine is a Python 3.11 application deployed via GitHub Actions"

Level 3 - Components: Internal modules
  • Shows services, parsers, and domain models within each container
  • Useful for: development, code organization, refactoring
  • Example: "EPG Engine contains: M3U Parser, Pattern Matcher, XMLTV Generator"

Level 4 - Code: Class/function level
  • Shows actual code structure
  • Useful for: development, code reviews
  • Example: "Pattern Matcher uses regex patterns from ALLOWED_CHANNEL_PATTERNS"

📊 EPGOAT System Architecture (C4 Level 1)

graph TB
    User[IPTV User] --> Player[IPTV Player]
    Player --> EPG[EPG XML]
    Player --> M3U[M3U Playlist]

    Admin[Admin User] --> AdminUI[Admin Portal]
    AdminUI --> API[Cloudflare Workers API]
    API --> DB[(Supabase PostgreSQL)]

    Engine[EPG Generation Engine] --> EPG
    Engine --> DB
    Engine --> TheSportsDB[TheSportsDB API]
    Engine --> R2[Cloudflare R2 Storage]

    GH[GitHub Actions] --> Engine
    GH --> R2

    style EPG fill:#90EE90
    style Engine fill:#FFB6C1
    style AdminUI fill:#87CEEB
    style DB fill:#FFD700

Legend:
  • Green: Output consumed by IPTV players
  • Pink: Core EPG generation engine
  • Blue: Admin management interface
  • Yellow: Data storage

Core Components

1. EPG Generation Engine (backend/epgoat/)
  • Language: Python 3.11
  • Purpose: Parse M3U → Match channels → Generate XMLTV
  • Runs on: GitHub Actions (cron + manual triggers)
  • Key Services: 30+ services in backend/epgoat/services/

2. Supabase PostgreSQL Database (epgoat-events-prod, epgoat-events-staging)
  • 28 tables: events, users, subscriptions, providers, matches
  • Migration history: 20 migrations completed
  • Free tier: 500MB (current: ~3MB staging)
  • CRITICAL: D1 permanently deprecated (2025-11-13) - Supabase only

3. Cloudflare R2 Storage (epgoat-epg-files)
  • EPG XML files (per provider + per user)
  • M3U clones (with tvg-ids)
  • Logos, snapshots
  • Public URLs: https://epg.epgoat.tv/tps.xml

4. Cloudflare Workers (API Layer)
  • Serverless functions (Python via Pyodide)
  • Public API: /api/v1/* (customers)
  • Admin API: /api/admin/* (internal)
  • Stripe webhooks: /api/webhooks/stripe

5. Cloudflare Pages (Frontends)
  • Public site: https://epgoat.tv (React + Tailwind) - NOT YET BUILT
  • Admin site: https://admin.epgoat.tv (React + MUI) - 20% complete

6. GitHub Actions (Automation)
  • generate-epg.yml: Daily EPG generation (NOT YET BUILT)
  • refresh-events-db.yml: Weekly event DB refresh (NOT YET BUILT)
  • cleanup-old-epg.yml: Housekeeping (NOT YET BUILT)


EPG Matching Pipeline (Core Intelligence)

📊 EPG Matching Pipeline Flow

flowchart LR
    M3U[M3U Input] --> Parse[M3U Parser]
    Parse --> Pattern[Pattern Matcher]
    Pattern --> Time[Time Extractor]
    Time --> Match[Event Matcher]
    Match --> DB[(Event Database)]
    Match --> Schedule[Schedule Builder]
    Schedule --> XMLTV[XMLTV Generator]
    XMLTV --> Output[EPG XML]

    style M3U fill:#E8F5E9
    style Output fill:#E8F5E9
    style DB fill:#FFF3E0

Each step in the pipeline is testable independently. The pipeline processes ~1,000 channels/second.

💡 EPG Matching Pipeline - Step by Step

# Example: Processing a channel "NBA 01: Lakers vs Celtics 7:00 PM ET"

# Step 1: Parse M3U
channel = parse_m3u_line('NBA 01: Lakers vs Celtics 7:00 PM ET')
# Output: Channel(name='NBA 01: Lakers vs Celtics 7:00 PM ET', ...)

# Step 2: Pattern Match
sport = match_pattern(channel.name)
# Output: 'NBA' (matched by regex r'^NBA\s+\d+:')

# Step 3: Extract Time
event_time = extract_time(channel.name)
# Output: datetime(2025, 11, 10, 19, 0, tzinfo=ET)

# Step 4: Build Schedule
programs = build_schedule(event_time, duration=180)
# Output: [
#   Programme(start=18:30, title='NBA Pre-Game', duration=30),
#   Programme(start=19:00, title='Lakers vs Celtics', duration=180),
#   Programme(start=22:00, title='NBA Post-Game', duration=30)
# ]

# Step 5: Generate XMLTV
xml = generate_xmltv(programs)
# Output: <programme start="20251110190000 -0500" ...>
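
For reference, the XMLTV timestamp format shown in Step 5 can be produced with the standard library alone. The helper below is a simplified illustration, not the project's actual XMLTV generator; the channel id "nba.01.epgoat" is made up for the example.

import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

def xmltv_timestamp(dt: datetime) -> str:
    """Format a timezone-aware datetime the way XMLTV expects: YYYYMMDDHHMMSS +HHMM."""
    return dt.strftime("%Y%m%d%H%M%S %z")

def programme_element(channel_id: str, title: str, start: datetime, minutes: int) -> ET.Element:
    prog = ET.Element("programme", {
        "start": xmltv_timestamp(start),
        "stop": xmltv_timestamp(start + timedelta(minutes=minutes)),
        "channel": channel_id,
    })
    ET.SubElement(prog, "title", {"lang": "en"}).text = title
    return prog

# The Lakers vs Celtics game at 19:00 ET (UTC-5 in November)
eastern = timezone(timedelta(hours=-5))
elem = programme_element("nba.01.epgoat", "Lakers vs Celtics",
                         datetime(2025, 11, 10, 19, 0, tzinfo=eastern), 180)
print(ET.tostring(elem, encoding="unicode"))
# <programme start="20251110190000 -0500" stop="20251110220000 -0500" channel="nba.01.epgoat">...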

Why Three Blocks?

Sports broadcasts typically have pre-game analysis (30 min), the live game (duration varies by sport), and post-game highlights (30 min). This structure:
  • Improves EPG accuracy for viewers
  • Allows for scheduling flexibility
  • Matches real broadcast patterns

Sport-Specific Durations:
  • NBA/NHL: 180 minutes (3 hours)
  • NFL: 210 minutes (3.5 hours)
  • Soccer: 120 minutes (2 hours)
  • MLB: 180 minutes (3 hours)
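
As a concrete illustration of the three-block structure and the duration table above, here is a minimal sketch. Programme, SPORT_DURATIONS, and build_three_block_schedule are illustrative names for this example, not the actual classes in backend/epgoat/.

from dataclasses import dataclass
from datetime import datetime, timedelta

# Live-game durations in minutes, taken from the list above (default: 3 hours).
SPORT_DURATIONS = {"NBA": 180, "NHL": 180, "NFL": 210, "SOCCER": 120, "MLB": 180}

@dataclass
class Programme:
    start: datetime
    stop: datetime
    title: str

def build_three_block_schedule(sport: str, title: str, event_time: datetime) -> list[Programme]:
    """Pre-game (30 min) + live game (sport-specific) + post-game (30 min)."""
    game_minutes = SPORT_DURATIONS.get(sport, 180)
    pre = Programme(event_time - timedelta(minutes=30), event_time, f"{sport} Pre-Game")
    game = Programme(event_time, event_time + timedelta(minutes=game_minutes), title)
    post = Programme(game.stop, game.stop + timedelta(minutes=30), f"{sport} Post-Game")
    return [pre, game, post]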

Data Flow

M3U → VOD Filter → Parser → Channel Extraction → 7-Handler Pipeline → XMLTV → R2

7-Handler Matching Chain

  1. Enhanced Match Cache: O(1) lookup of previous matches
  2. Event Details Cache: Preprocessed event data
  3. Local Events DB: Bulk-downloaded events from TheSportsDB
  4. Regex Matcher: Pattern-based matching (exact + fuzzy)
  5. Cross-Provider Cache: Learn from other providers
  6. API Match: TheSportsDB/ESPN API calls
  7. LLM Fallback: Claude Haiku with prompt caching

Current Match Rate: ~35% (target: 96%+)
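
Structurally, the chain behaves like a classic chain of responsibility: each handler either returns a match or defers to the next one. The sketch below only illustrates that control flow under assumed names; the real handlers in backend/epgoat/services/ also manage caching, persistence, and API clients.

from typing import Callable, Optional

# A handler takes a channel name and returns an event id, or None to defer to the next handler.
Handler = Callable[[str], Optional[str]]

def cache_handler(name: str) -> Optional[str]:
    # O(1) lookup of previously matched channels (handlers 1-2 in the chain).
    return {"NBA 01: Lakers vs Celtics": "evt_12345"}.get(name)

def regex_handler(name: str) -> Optional[str]:
    # Pattern-based matching against the local events DB (handlers 3-4), heavily simplified.
    return "evt_67890" if name.startswith("NFL ") else None

def llm_handler(name: str) -> Optional[str]:
    # Last-resort LLM fallback (handler 7); stubbed out here.
    return None

CHAIN: list[Handler] = [cache_handler, regex_handler, llm_handler]

def match_channel(name: str) -> Optional[str]:
    """Walk the chain until a handler produces a match; None means unmatched."""
    for handler in CHAIN:
        result = handler(name)
        if result is not None:
            return result
    return None

print(match_channel("NBA 01: Lakers vs Celtics"))  # evt_12345 (cache hit)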

Key Optimizations (Phase 2)

  • VOD Filtering: 26 patterns filter 91.7% of non-event channels (355K→32K)
  • Provider Config Manager: 53x faster loads (8s→150ms), YAML cache + DB
  • Team Resolution: LLM-based canonical name resolution (self-improving)
  • Pattern Discovery: Auto-detects numbered channel series (e.g., "NBA 01:", "NFL 12")
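
Pattern Discovery refers to detecting numbered channel series such as "NBA 01:" or "NFL 12". A minimal, hypothetical version of that detection might look like the following; the production service is more sophisticated (and, per the schema notes below, verified results are stored against provider_patterns).

import re
from collections import Counter

# Matches names like "NBA 01: Lakers vs Celtics" or "NFL 12" and captures the series prefix.
NUMBERED_SERIES = re.compile(r"^(?P<prefix>[A-Z][A-Z0-9 ]*?)\s*(?P<number>\d{1,3})\b")

def discover_series(channel_names: list[str], min_channels: int = 3) -> dict[str, int]:
    """Count how many numbered channels share each prefix; frequent prefixes become patterns."""
    counts: Counter[str] = Counter()
    for name in channel_names:
        m = NUMBERED_SERIES.match(name)
        if m:
            counts[m.group("prefix").strip()] += 1
    return {prefix: n for prefix, n in counts.items() if n >= min_channels}

names = ["NBA 01: Lakers vs Celtics", "NBA 02: Suns vs Heat", "NBA 03", "NFL 12", "ESPN"]
print(discover_series(names, min_channels=2))  # {'NBA': 3}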

Database Schema (27 Tables)

Table Groups

  • Events & Sports: events, participants, leagues, sports, event_participants
  • Users & Auth: users, sessions, api_keys (NOT YET BUILT)
  • Subscriptions: subscriptions, invoices, payment_methods (NOT YET BUILT)
  • Providers: providers, provider_patterns, tvg_id_mappings, vod_filter_patterns
  • Matching: match_overrides, match_cache, unmatched_channels, learned_patterns
  • Team Resolution: team_aliases, team_resolution_cache (Phase 2 NEW)
  • System: audit_log, daily_stats, jobs, schema_migrations

Latest Migration: 019 (2025-11-12) - LLM verification columns for provider_patterns

Schema Details: Documentation/05-Reference/Database/Schema.md


Technology Stack (Production)

Backend

  • Python 3.11 (EPG engine)
  • Supabase PostgreSQL (database, replaces D1)
  • Cloudflare Workers (API, Python via Pyodide)
  • GitHub Actions (automation, cron)

Frontend

  • React 18 + Vite + Tailwind (public site - NOT BUILT)
  • React 18 + MUI (admin site - 20% complete)

Infrastructure

  • Cloudflare: Workers + Pages + R2 (free tier covers dev + initial prod)
  • Supabase: PostgreSQL database (free tier: 500MB, REQUIRED - no D1)
  • GitHub: Actions (free for public repos)

External APIs

  • TheSportsDB: Primary event database ($5/mo Patreon tier)
  • ESPN API: Fallback event data (free, unofficial)
  • Claude API: LLM fallback matching ($2-5/mo estimated)
  • Stripe: Payment processing (2.9% + $0.30/tx)

Cost Model

  • Development: $0/month
  • Production (100 customers): ~$60/month ($600 revenue = 90% margin)
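
As a quick sanity check on the 90% figure (the ~$6 blended monthly price is an assumption across the $2.99-$12.99 plans):

customers = 100
avg_price = 6.00                      # assumed blended monthly price across plan tiers
revenue = customers * avg_price       # $600
costs = 60.00                         # ~$60/month production estimate from above
margin = (revenue - costs) / revenue  # 0.90
print(f"revenue=${revenue:.0f}/mo, gross margin={margin:.0%}")  # revenue=$600/mo, gross margin=90%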

Project Structure

epgoat-internal/
├── backend/epgoat/             # EPG generation engine (Python)
│   ├── domain/                 # Core business logic (models, parsers, schemas)
│   ├── services/               # 30+ services (matching, caching, APIs)
│   ├── infrastructure/         # External deps (database, clients, parsers)
│   ├── application/            # High-level workflows (epg_generator.py)
│   ├── cli/                    # Command-line interfaces (run_provider.py)
│   └── tests/                  # Test suite (784 tests, 98.2% passing)
├── backend/config/             # Configuration YAML (patterns, sport emojis, categories)
├── frontend/                   # Web frontends (NOT YET BUILT)
│   ├── public-site/            # Customer-facing React app
│   └── admin/                  # Admin UI (React + MUI, 20% complete)
├── .github/workflows/          # GitHub Actions (NOT YET BUILT)
├── Documentation/              # System documentation (151 files, UNDER RENOVATION)
└── CLAUDE.md                   # AI assistant instructions

Code Location: backend/epgoat/ is the heart of the system


Current Status (2025-11-10)

✅ Complete

  • EPG matching engine (80% - missing Learning Engine, full LLM integration)
  • M3U parsing + XMLTV generation
  • VOD filtering (Phase 2)
  • Provider config manager (Phase 2)
  • Team resolution service (Phase 2)
  • 7-handler matching pipeline (Phase 2)
  • Match overrides UI (partial, 20%)
  • Database schema (designed, 17 migrations complete)
  • Supabase staging DB (2,782 events, 28 tables)
  • 784 tests (98.2% passing - 14 failures due to pre-existing bugs)

🚧 In Progress

  • Documentation overhaul (Pyramid Architecture, Phases 1-4)
  • Fixing 14 failing tests (cache bugs, mocking issues)

❌ Not Started

  • Public website (React + Tailwind) - 40-60 hours
  • Subscription system (Stripe integration) - 30-40 hours
  • Admin UI enhancements (80% remaining) - 40-50 hours
  • GitHub Actions workflows (automation) - 20-30 hours
  • Customer API (rate limiting, API keys) - 20-30 hours
  • Email system (SendGrid/Resend) - 10-15 hours

Key Architectural Principles

  1. Serverless-First: Zero server management (Cloudflare + GitHub Actions)
  2. Edge-Native: Global distribution via Cloudflare's 300+ data centers
  3. Cost-Efficient: Free tier covers dev + initial prod
  4. Self-Learning: System improves matching over time (learned_patterns, team_aliases)
  5. API-First: All functionality exposed via REST API
  6. Data is Forever: Soft deletes only (record_status field), historical data preserved

🎯 Why Soft Deletes?

We use soft deletes (record_status: active/archived/deleted) instead of hard deletes for three key reasons:

1. Future Analytics (Core Principle #2: We are a Data Company)
  • Historical data enables trend analysis
  • Match accuracy improvements over time
  • User behavior insights

2. API Monetization (Core Principle #5: API-First Design)
  • Paid API customers may want historical data ("Give me all NBA games from the 2024 season")
  • Premium feature: historical EPG generation

3. Data Recovery
  • Accidental deletes can be recovered
  • Audit trail for compliance
  • Rollback capabilities

Cost: Minimal storage impact (database is <100MB currently)
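
A minimal sketch of the convention using the supabase-py client: the events table and record_status values come from the schema above, while the function names and environment variable names are illustrative.

import os
from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])

def soft_delete_event(event_id: int) -> None:
    """Archive instead of DELETE: flip record_status so history stays queryable."""
    supabase.table("events").update({"record_status": "deleted"}).eq("id", event_id).execute()

def active_events():
    """Normal reads filter on record_status so archived/deleted rows stay invisible."""
    return (supabase.table("events")
            .select("*")
            .eq("record_status", "active")
            .execute()
            .data)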


Deployment (Cloudflare)

Environments

  • Production: epgoat.tv (DB: epgoat-events-prod, R2: epgoat-epg-files)
  • Staging: staging.epgoat.tv (DB: epgoat-events-staging, R2: epgoat-epg-files-staging)
  • Development: localhost (Local SQLite, local files)

Domains

  • epgoat.tv: Public website (NOT YET DEPLOYED)
  • admin.epgoat.tv: Admin UI (DEPLOYED, 20% complete)
  • epg.epgoat.tv: EPG file CDN (via R2 custom domain)
  • api.epgoat.tv: API endpoints (via Workers)

Authentication

  • Public: Clerk ($25/mo) or Auth0 (NOT YET IMPLEMENTED)
  • Admin: Cloudflare Access (zero-trust, email @epgoat.tv only)

Critical Decisions (ADRs)

  • ADR-001: Use Supabase PostgreSQL exclusively (2025-11-13)
    • Reason: Better tooling, PostgreSQL compatibility, no storage limits
    • Migration: Complete - D1 permanently deprecated, all code migrated
    • Breaking Change: EventDatabase now requires Supabase (no graceful degradation)
  • ADR-002: EPG Matching Architecture (7-handler chain) (2025-11 draft)
    • Reason: Optimize for speed (cache first) + accuracy (fallback to LLM)
  • ADR-003: Phase 2 Service Architecture (2025-11 draft)
    • Reason: Separate concerns, improve testability, enable caching
  • ADR-004: Multi-Stage Regex Matcher (2025-11 draft)
    • Reason: Balance speed vs accuracy (exact first, fuzzy fallback)
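
ADR-004's "exact first, fuzzy fallback" idea can be illustrated with a small sketch built on difflib from the standard library. The real multi-stage matcher is more elaborate (multiple regex stages, per-sport thresholds); the names and the 0.85 threshold below are assumptions for the example.

from difflib import SequenceMatcher
from typing import Optional

def normalize(name: str) -> str:
    return " ".join(name.lower().split())

def match_team(candidate: str, known_teams: list[str], threshold: float = 0.85) -> Optional[str]:
    """Stage 1: exact match on normalized names. Stage 2: fuzzy ratio, only on misses."""
    norm = normalize(candidate)
    for team in known_teams:                      # exact stage (cheap, run first)
        if normalize(team) == norm:
            return team
    best, best_score = None, 0.0
    for team in known_teams:                      # fuzzy stage (slower, run only on misses)
        score = SequenceMatcher(None, norm, normalize(team)).ratio()
        if score > best_score:
            best, best_score = team, score
    return best if best_score >= threshold else None

print(match_team("Boston Celtcs", ["Los Angeles Lakers", "Boston Celtics"]))  # Boston Celtics (fuzzy)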

Full ADRs: Documentation/06-Decisions/


Performance Targets

Metric            Current    Target (Phase 11)
Match Rate        35%        96%+
Channels/Day      1,261      10,000
Processing Time   2 hours    <60 seconds
Customers         0          100-1,000
Monthly Cost      $0         $15-25

Common Operations

Generate EPG for TPS provider:

cd backend/epgoat
python cli/run_provider.py --provider tps --date 2025-11-10

Refresh event database:

cd backend/epgoat
python utilities/refresh_event_db_v2.py --date 2025-11-10

Run tests:

make test  # from project root

Admin UI:

https://admin.epgoat.tv

Security

  • Data at Rest: AES-256 encryption (Supabase + R2)
  • Data in Transit: TLS 1.3 (all connections)
  • Secrets: Cloudflare Workers env vars + GitHub Secrets
  • Rate Limiting: By tier (100-100K req/day; see the sketch after this list)
  • PII: Email hashing, GDPR compliance
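
Tier-based limits could be enforced with a simple daily counter per API key. This is only a sketch of the idea (an in-memory counter), not the production Workers implementation, and the per-tier numbers are an assumed split of the 100-100K range quoted above.

from collections import defaultdict
from datetime import date

# Requests allowed per day by tier. The split across tiers is an assumption;
# the doc only states an overall 100-100K req/day range.
TIER_LIMITS = {"free": 100, "provider_pack": 10_000, "byo_builder": 100_000}

_usage: dict[tuple[str, date], int] = defaultdict(int)

def allow_request(api_key: str, tier: str) -> bool:
    """Return True if this API key still has quota left for today."""
    bucket = (api_key, date.today())
    if _usage[bucket] >= TIER_LIMITS.get(tier, 0):
        return False
    _usage[bucket] += 1
    return True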

Glossary

  • EPG: Electronic Program Guide (TV schedule data)
  • XMLTV: XML-based EPG format (industry standard)
  • M3U: Playlist format used by IPTV services
  • tvg-id: Unique channel identifier in M3U files

⚠️ M3U Parsing - Missing tvg-id

Problem: Many M3U playlists don't include tvg-id attributes, causing EPG matching to fail silently.

Solution: EPGOAT uses channel name pattern matching as a fallback:

  1. Try tvg-id first (if present)
  2. Fall back to pattern matching on channel name
  3. Extract event details from name (team, date, time)
  4. Match against Event Database

This dual approach achieves an ~85% match rate, versus ~30% with tvg-id alone.
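
A condensed version of that fallback order, using tiny lookup tables in place of the real Event Database (all names and ids here are made up for the example):

from typing import Optional

# Tiny illustrative lookup tables; the real system queries the Event Database.
TVG_ID_INDEX = {"nba.lakers.celtics": "evt_12345"}
NAME_PATTERN_INDEX = {"lakers vs celtics": "evt_12345"}

def resolve_channel(tvg_id: Optional[str], channel_name: str) -> Optional[str]:
    """Try the explicit tvg-id first; fall back to pattern matching on the channel name."""
    if tvg_id and tvg_id in TVG_ID_INDEX:                 # step 1: tvg-id, when present
        return TVG_ID_INDEX[tvg_id]
    for phrase, event_id in NAME_PATTERN_INDEX.items():   # steps 2-4: name-based fallback
        if phrase in channel_name.lower():
            return event_id
    return None

print(resolve_channel(None, "NBA 01: Lakers vs Celtics 7:00 PM ET"))  # evt_12345 via name fallback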

  • VOD: Video on Demand (movies/shows, not live events)
  • Provider: IPTV service (TPS, Trex, Necro, etc.)
  • Family: Channel grouping (e.g., "NBA", "Flo Sports", "Paramount+")
  • League: Actual sports league (NBA, NFL, UFC, NCAAF, etc.)
  • Match Rate: % of channels successfully matched to events

Next Review: After Phase 1 complete (Public Frontend)
For Details: See Documentation/03-Architecture/System-Overview.md (full 1363-line version)