System Overview & Architecture (Educational Version)
Note: This is the educational, human-readable version with examples and detailed explanations. For the AI-optimized version, see 1-CONTEXT/_SYSTEM.md.
EPGOAT System Reference (AI-Optimized)
- Purpose: Ultra-compressed system overview for Claude Code AI assistant
- Token Budget: ~5K tokens (part of 50K Layer 1 budget)
- Load Priority: HIGH (always load at session start)
- Last Updated: 2025-11-14
🎯 Why Aggressive Token Compression?
Layer 1 (CONTEXT) uses aggressive compression (95% reduction: 200K → 11K tokens) for three reasons:
1. AI Context Window Limits
   - Claude has 200K token context limit
   - Compressed docs fit easily with room for code
   - Enables loading entire system context at once
2. Cost Efficiency
   - Tokens = $$$
   - Smaller context = lower costs per session
   - Budget: <50K tokens for all Layer 1 docs
3. Faster Processing
   - Less content = faster reads
   - AI can scan compressed docs quickly
   - Better for real-time decision making
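The figures quoted in this section can be checked with a few lines of arithmetic. This is only an illustrative sketch: the variable names are invented for the example, and the numbers are the ones stated above.

```python
# Figures taken from this section; the names are illustrative only.
VERBOSE_TOKENS = 200_000     # size of the uncompressed documentation
COMPRESSED_TOKENS = 11_000   # size of Layer 1 after compression
LAYER1_BUDGET = 50_000       # budget for all Layer 1 docs
CONTEXT_LIMIT = 200_000      # context window quoted above

# Percentage of tokens removed by compression.
reduction_pct = 100 * (VERBOSE_TOKENS - COMPRESSED_TOKENS) / VERBOSE_TOKENS
# Tokens still available for code once Layer 1 is loaded.
remaining_for_code = CONTEXT_LIMIT - COMPRESSED_TOKENS
within_budget = COMPRESSED_TOKENS <= LAYER1_BUDGET

print(f"reduction: {reduction_pct:.1f}%")
print(f"context left for code: {remaining_for_code:,} tokens")
print(f"within Layer 1 budget: {within_budget}")
```

The computed reduction is 94.5%, which rounds to the 95% quoted above.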
Trade-off: Humans need the verbose version (Layer 3 HTML):
- Educational content with examples
- Diagrams and visual aids
- "Why this matters" sections
Full Details: Documentation/03-Architecture/System-Overview.md (1363 lines, human-readable)
Identity
EPGOAT = SaaS platform generating high-quality Electronic Program Guide (EPG) data for IPTV services
Problem Solved: IPTV services have incomplete/inaccurate EPG data
Solution: AI-powered matching of IPTV channels → real sports events from databases
Business Model:
- Provider Packs: $2.99-$6.99/month (shared EPG for popular IPTV providers)
- BYO Builder: $5.99-$12.99/month (custom EPG from user M3U)
Target State: 96%+ match rate, 10K+ channels/day, 100-1K paying customers
Architecture (C4 Level 1-2)
📖 Understanding C4 Architecture Model
The C4 model (Context, Containers, Components, Code) provides a hierarchical way to describe systems:
Level 1 - System Context: The big picture
- Shows EPGOAT and its external dependencies
- Useful for: Onboarding, executive overviews, compliance docs
- Example: "EPGOAT connects to TheSportsDB API and serves EPG data to IPTV players"
Level 2 - Containers: Deployable units
- Shows EPG Engine, Admin Portal, Cloudflare Workers, etc.
- Useful for: DevOps, deployment planning, technology decisions
- Example: "EPG Engine is a Python 3.11 application deployed via GitHub Actions"
Level 3 - Components: Internal modules
- Shows services, parsers, domain models within each container
- Useful for: Development, code organization, refactoring
- Example: "EPG Engine contains: M3U Parser, Pattern Matcher, XMLTV Generator"
Level 4 - Code: Class/function level
- Shows actual code structure
- Useful for: Development, code reviews
- Example: "Pattern Matcher uses regex patterns from ALLOWED_CHANNEL_PATTERNS"
📊 EPGOAT System Architecture (C4 Level 1)
graph TB
User[IPTV User] --> Player[IPTV Player]
Player --> EPG[EPG XML]
Player --> M3U[M3U Playlist]
Admin[Admin User] --> AdminUI[Admin Portal]
AdminUI --> API[Cloudflare Workers API]
API --> D1[(Supabase PostgreSQL)]
Engine[EPG Generation Engine] --> EPG
Engine --> D1
Engine --> TheSportsDB[TheSportsDB API]
Engine --> R2[Cloudflare R2 Storage]
GH[GitHub Actions] --> Engine
GH --> R2
style EPG fill:#90EE90
style Engine fill:#FFB6C1
style AdminUI fill:#87CEEB
style D1 fill:#FFD700
Legend:
- Green: Output consumed by IPTV players
- Pink: Core EPG generation engine
- Blue: Admin management interface
- Yellow: Data storage
Core Components
1. EPG Generation Engine (backend/epgoat/)
- Language: Python 3.11
- Purpose: Parse M3U → Match channels → Generate XMLTV
- Runs on: GitHub Actions (cron + manual triggers)
- Key Services: 30+ services in backend/epgoat/services/
2. Supabase PostgreSQL Database (epgoat-events-prod, epgoat-events-staging)
- 28 tables: events, users, subscriptions, providers, matches
- Migration history: 20 migrations completed
- Free tier: 500MB (current: ~3MB staging)
- CRITICAL: D1 permanently deprecated (2025-11-13) - Supabase only
3. Cloudflare R2 Storage (epgoat-epg-files)
- EPG XML files (per provider + per user)
- M3U clones (with tvg-ids)
- Logos, snapshots
- Public URLs: https://epg.epgoat.tv/tps.xml
4. Cloudflare Workers (API Layer)
- Serverless functions (Python via Pyodide)
- Public API: /api/v1/* (customers)
- Admin API: /api/admin/* (internal)
- Stripe webhooks: /api/webhooks/stripe
5. Cloudflare Pages (Frontends)
- Public site: https://epgoat.tv (React + Tailwind) - NOT YET BUILT
- Admin site: https://admin.epgoat.tv (React + MUI) - 20% complete
6. GitHub Actions (Automation)
- generate-epg.yml: Daily EPG generation (NOT YET BUILT)
- refresh-events-db.yml: Weekly event DB refresh (NOT YET BUILT)
- cleanup-old-epg.yml: Housekeeping (NOT YET BUILT)
EPG Matching Pipeline (Core Intelligence)
📊 EPG Matching Pipeline Flow
flowchart LR
M3U[M3U Input] --> Parse[M3U Parser]
Parse --> Pattern[Pattern Matcher]
Pattern --> Time[Time Extractor]
Time --> Match[Event Matcher]
Match --> DB[(Event Database)]
Match --> Schedule[Schedule Builder]
Schedule --> XMLTV[XMLTV Generator]
XMLTV --> Output[EPG XML]
style M3U fill:#E8F5E9
style Output fill:#E8F5E9
style DB fill:#FFF3E0
Each step in the pipeline is testable independently. The pipeline processes ~1000 channels/second.
💡 EPG Matching Pipeline - Step by Step
# Example: Processing a channel "NBA 01: Lakers vs Celtics 7:00 PM ET"
# Step 1: Parse M3U
channel = parse_m3u_line('NBA 01: Lakers vs Celtics 7:00 PM ET')
# Output: Channel(name='NBA 01: Lakers vs Celtics 7:00 PM ET', ...)
# Step 2: Pattern Match
sport = match_pattern(channel.name)
# Output: 'NBA' (matched by regex r'^NBA\s+\d+:')
# Step 3: Extract Time
event_time = extract_time(channel.name)
# Output: datetime(2025, 11, 10, 19, 0, tzinfo=ET)
# Step 4: Build Schedule
programs = build_schedule(event_time, duration=180)
# Output: [
# Programme(start=18:30, title='NBA Pre-Game', duration=30),
# Programme(start=19:00, title='Lakers vs Celtics', duration=180),
# Programme(start=22:00, title='NBA Post-Game', duration=30)
# ]
# Step 5: Generate XMLTV
xml = generate_xmltv(programs)
# Output: <programme start="20251110183000 -0500" ...>
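Step 5 can be made concrete with the standard library. The following is a minimal sketch, not the engine's actual generator: the `Programme` dataclass, `programme_to_xml`, and the channel id `nba-01` are hypothetical names chosen for the example. It shows how an XMLTV `programme` element with `YYYYMMDDhhmmss +ZZZZ` timestamps might be produced for the pre-game block.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
import xml.etree.ElementTree as ET

# XMLTV timestamps look like "20251110183000 -0500".
XMLTV_TIME_FORMAT = "%Y%m%d%H%M%S %z"

@dataclass
class Programme:
    """Hypothetical stand-in for the engine's programme model."""
    start: datetime
    duration_min: int
    title: str

def programme_to_xml(channel_id: str, prog: Programme) -> ET.Element:
    """Build one <programme> element with start, stop and channel attributes."""
    stop = prog.start + timedelta(minutes=prog.duration_min)
    elem = ET.Element("programme", {
        "start": prog.start.strftime(XMLTV_TIME_FORMAT),
        "stop": stop.strftime(XMLTV_TIME_FORMAT),
        "channel": channel_id,
    })
    title = ET.SubElement(elem, "title", {"lang": "en"})
    title.text = prog.title
    return elem

# Fixed UTC-5 offset, used here only to keep the example self-contained.
EASTERN_STANDARD = timezone(timedelta(hours=-5))
pre_game = Programme(
    start=datetime(2025, 11, 10, 18, 30, tzinfo=EASTERN_STANDARD),
    duration_min=30,
    title="NBA Pre-Game",
)
xml_text = ET.tostring(programme_to_xml("nba-01", pre_game), encoding="unicode")
print(xml_text)
```

A real generator would also need a `<tv>` root and `<channel>` elements, and would likely use a proper time zone database instead of a fixed offset so that daylight saving time is handled.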
Why Three Blocks?
Sports broadcasts typically have pre-game analysis (30min), live game (varies by sport), and post-game highlights (30min). This structure:
- Improves EPG accuracy for viewers
- Allows for scheduling flexibility
- Matches real broadcast patterns
Sport-Specific Durations:
- NBA/NHL: 180 minutes (3 hours)
- NFL: 210 minutes (3.5 hours)
- Soccer: 120 minutes (2 hours)
- MLB: 180 minutes (3 hours)
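The three-block structure and the per-sport durations can be sketched as follows. This is an illustrative version under stated assumptions: `Block`, `build_three_blocks`, and the constant names are hypothetical and do not come from the codebase; only the durations and the 30-minute pre/post blocks are taken from the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Live-game durations in minutes, copied from the list above.
SPORT_DURATIONS = {"NBA": 180, "NHL": 180, "NFL": 210, "Soccer": 120, "MLB": 180}
PRE_GAME_MIN = 30
POST_GAME_MIN = 30

@dataclass
class Block:
    """Hypothetical schedule entry: one block of the EPG."""
    start: datetime
    duration_min: int
    title: str

def build_three_blocks(sport: str, event_title: str, event_start: datetime) -> list:
    """Return pre-game, live and post-game blocks around the event start."""
    live_min = SPORT_DURATIONS[sport]
    pre_start = event_start - timedelta(minutes=PRE_GAME_MIN)
    post_start = event_start + timedelta(minutes=live_min)
    return [
        Block(pre_start, PRE_GAME_MIN, f"{sport} Pre-Game"),
        Block(event_start, live_min, event_title),
        Block(post_start, POST_GAME_MIN, f"{sport} Post-Game"),
    ]

blocks = build_three_blocks("NBA", "Lakers vs Celtics", datetime(2025, 11, 10, 19, 0))
for block in blocks:
    print(block.start.strftime("%H:%M"), block.duration_min, block.title)
```

For the 7:00 PM example this yields blocks starting at 18:30, 19:00 and 22:00, matching the schedule shown in Step 4.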
Data Flow
M3U → VOD Filter → Parser → Channel Extraction → 7-Handler Pipeline → XMLTV → R2
7-Handler Matching Chain
- Enhanced Match Cache: O(1) lookup of previous matches
- Event Details Cache: Preprocessed event data
- Local Events DB: Bulk-downloaded events from TheSportsDB
- Regex Matcher: Pattern-based matching (exact + fuzzy)
- Cross-Provider Cache: Learn from other providers
- API Match: TheSportsDB/ESPN API calls
- LLM Fallback: Claude Haiku with prompt caching
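The chain above follows a cheapest-first pattern: each handler either returns a match or passes the channel on. The sketch below illustrates that idea with three toy handlers instead of seven. All function names, the `CHAIN` list, and the event ids are assumptions made for the example, not the engine's real interfaces.

```python
import re
from typing import Optional, Tuple

# Previous matches, keyed by channel name. The event id is made up.
match_cache = {"NBA 01: Lakers vs Celtics": "event-1001"}

def cache_handler(channel_name: str) -> Optional[str]:
    """Cheapest step: dictionary lookup of a previous match."""
    return match_cache.get(channel_name)

def regex_handler(channel_name: str) -> Optional[str]:
    """Toy pattern step: recognises 'TEAM vs TEAM' and builds a fake id."""
    found = re.search(r"(\w+) vs (\w+)", channel_name)
    if found is None:
        return None
    return f"regex:{found.group(1).lower()}-{found.group(2).lower()}"

def llm_fallback_handler(channel_name: str) -> Optional[str]:
    """Placeholder for the most expensive step; it never matches here."""
    return None

# Ordered from cheapest to most expensive.
CHAIN = [("cache", cache_handler), ("regex", regex_handler), ("llm", llm_fallback_handler)]

def match_channel(channel_name: str) -> Optional[Tuple[str, str]]:
    """Return (handler label, event id) from the first handler that matches."""
    for label, handler in CHAIN:
        event_id = handler(channel_name)
        if event_id is not None:
            # Remember the result so the next lookup is answered by the cache.
            match_cache.setdefault(channel_name, event_id)
            return label, event_id
    return None

first = match_channel("NFL 12: Chiefs vs Bills")
second = match_channel("NFL 12: Chiefs vs Bills")
print(first, second)
```

The second lookup of the same channel is answered by the cache handler, which is the behaviour the ordering is meant to encourage.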
Current Match Rate: ~35% (target: 96%+)
Key Optimizations (Phase 2)
- VOD Filtering: 26 patterns filter 91.7% of non-event channels (355K→32K)
- Provider Config Manager: 53x faster loads (8s→150ms), YAML cache + DB
- Team Resolution: LLM-based canonical name resolution (self-improving)
- Pattern Discovery: Auto-detects numbered channel series (e.g., "NBA 01:", "NFL 12")
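As an illustration of the pattern discovery idea, the sketch below groups channel names by a leading prefix followed by a channel number and reports prefixes that occur with several distinct numbers. The regular expression, the `discover_series` function, and the `min_channels` threshold are assumptions for this example; the actual detection logic may differ.

```python
import re
from collections import defaultdict

# A prefix of letters/spaces, then a 1-3 digit number, then ':' or whitespace or end.
SERIES_RE = re.compile(r"^(?P<prefix>[A-Za-z][A-Za-z+ ]*?)\s+(?P<number>\d{1,3})(?::|\s|$)")

def discover_series(channel_names, min_channels=3):
    """Return {prefix: sorted channel numbers} for prefixes seen often enough."""
    numbers_by_prefix = defaultdict(set)
    for name in channel_names:
        found = SERIES_RE.match(name.strip())
        if found:
            prefix = found.group("prefix").upper()
            numbers_by_prefix[prefix].add(int(found.group("number")))
    return {
        prefix: sorted(numbers)
        for prefix, numbers in numbers_by_prefix.items()
        if len(numbers) >= min_channels
    }

sample = [
    "NBA 01: Lakers vs Celtics",
    "NBA 02: Knicks vs Heat",
    "NBA 03:",
    "NFL 12",
    "HBO Max",
    "Sky Sports 1",
]
series = discover_series(sample)
print(series)
```

In this sample only the NBA prefix appears with three or more distinct numbers, so it is the only series reported.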
Database Schema (28 Tables)
Table Groups
- Events & Sports: events, participants, leagues, sports, event_participants
- Users & Auth: users, sessions, api_keys (NOT YET BUILT)
- Subscriptions: subscriptions, invoices, payment_methods (NOT YET BUILT)
- Providers: providers, provider_patterns, tvg_id_mappings, vod_filter_patterns
- Matching: match_overrides, match_cache, unmatched_channels, learned_patterns
- Team Resolution: team_aliases, team_resolution_cache (Phase 2 NEW)
- System: audit_log, daily_stats, jobs, schema_migrations
Latest Migration: 019 (2025-11-12) - LLM verification columns for provider_patterns
Schema Details: Documentation/05-Reference/Database/Schema.md
Technology Stack (Production)
Backend
- Python 3.11 (EPG engine)
- Supabase PostgreSQL (database, replaces D1)
- Cloudflare Workers (API, Python via Pyodide)
- GitHub Actions (automation, cron)
Frontend
- React 18 + Vite + Tailwind (public site - NOT BUILT)
- React 18 + MUI (admin site - 20% complete)
Infrastructure
- Cloudflare: Workers + Pages + R2 (free tier covers dev + initial prod)
- Supabase: PostgreSQL database (free tier: 500MB, REQUIRED - no D1)
- GitHub: Actions (free for public repos)
External APIs
- TheSportsDB: Primary event database ($5/mo Patreon tier)
- ESPN API: Fallback event data (free, unofficial)
- Claude API: LLM fallback matching ($2-5/mo estimated)
- Stripe: Payment processing (2.9% + $0.30/tx)
Cost Model
- Development: $0/month
- Production (100 customers): ~$60/month ($600 revenue = 90% margin)
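The margin figure can be reproduced directly from the numbers in the cost model. This is a small worked check; the variable names are illustrative.

```python
# Figures from the cost model above.
customers = 100
monthly_cost = 60.0      # USD, approximate production cost
monthly_revenue = 600.0  # USD

# Gross margin = (revenue - cost) / revenue.
margin = (monthly_revenue - monthly_cost) / monthly_revenue
# Average revenue each customer must contribute to reach the stated revenue.
avg_revenue_per_customer = monthly_revenue / customers

print(f"gross margin: {margin:.0%}")
print(f"implied average revenue per customer: ${avg_revenue_per_customer:.2f}/month")
```

The implied average of $6.00 per customer per month falls inside the price ranges listed under Business Model.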
Project Structure
epgoat-internal/
├── backend/epgoat/ # EPG generation engine (Python)
│ ├── domain/ # Core business logic (models, parsers, schemas)
│ ├── services/ # 30+ services (matching, caching, APIs)
│ ├── infrastructure/ # External deps (database, clients, parsers)
│ ├── application/ # High-level workflows (epg_generator.py)
│ ├── cli/ # Command-line interfaces (run_provider.py)
│ └── tests/ # Test suite (784 tests, 98.2% passing)
├── backend/config/ # Configuration YAML (patterns, sport emojis, categories)
├── frontend/ # Web frontends (NOT YET BUILT)
│ ├── public-site/ # Customer-facing React app
│ └── admin/ # Admin UI (React + MUI, 20% complete)
├── .github/workflows/ # GitHub Actions (NOT YET BUILT)
├── Documentation/ # System documentation (151 files, UNDER RENOVATION)
└── CLAUDE.md # AI assistant instructions
Code Location: backend/epgoat/ is the heart of the system
Current Status (2025-11-10)
✅ Complete
- EPG matching engine (80% - missing Learning Engine, full LLM integration)
- M3U parsing + XMLTV generation
- VOD filtering (Phase 2)
- Provider config manager (Phase 2)
- Team resolution service (Phase 2)
- 7-handler matching pipeline (Phase 2)
- Match overrides UI (partial, 20%)
- Database schema (designed, 17 migrations complete)
- Supabase staging DB (2,782 events, 28 tables)
- 784 tests (98.2% passing - 14 failures due to pre-existing bugs)
🚧 In Progress
- Documentation overhaul (Pyramid Architecture, Phases 1-4)
- Fixing 14 failing tests (cache bugs, mocking issues)
❌ Not Started
- Public website (React + Tailwind) - 40-60 hours
- Subscription system (Stripe integration) - 30-40 hours
- Admin UI enhancements (80% remaining) - 40-50 hours
- GitHub Actions workflows (automation) - 20-30 hours
- Customer API (rate limiting, API keys) - 20-30 hours
- Email system (SendGrid/Resend) - 10-15 hours
Key Architectural Principles
- Serverless-First: Zero server management (Cloudflare + GitHub Actions)
- Edge-Native: Global distribution via Cloudflare's 300+ data centers
- Cost-Efficient: Free tier covers dev + initial prod
- Self-Learning: System improves matching over time (learned_patterns, team_aliases)
- API-First: All functionality exposed via REST API
- Data is Forever: Soft deletes only (record_status field), historical data preserved
🎯 Why Soft Deletes?
We use soft deletes (record_status: active/archived/deleted) instead of hard deletes for three key reasons:
1. Future Analytics (Core Principle #2: We are a Data Company)
   - Historical data enables trend analysis
   - Match accuracy improvements over time
   - User behavior insights
2. API Monetization (Core Principle #5: API-First Design)
   - Paid API customers may want historical data
   - "Give me all NBA games from 2024 season"
   - Premium feature: historical EPG generation
3. Data Recovery
   - Accidental deletes can be recovered
   - Audit trail for compliance
   - Rollback capabilities
Cost: Minimal storage impact (database is <100MB currently)
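A soft delete amounts to an UPDATE of the record_status field instead of a DELETE. The sketch below demonstrates the idea with an in-memory SQLite database so that it runs anywhere. The production database is Supabase PostgreSQL, and the table layout and helper names here are assumptions for illustration only; just the record_status values come from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        record_status TEXT NOT NULL DEFAULT 'active'
            CHECK (record_status IN ('active', 'archived', 'deleted'))
    )
""")
conn.executemany(
    "INSERT INTO events (title) VALUES (?)",
    [("Lakers vs Celtics",), ("Chiefs vs Bills",)],
)

def soft_delete(db, event_id):
    """Mark a row as deleted without removing it."""
    db.execute("UPDATE events SET record_status = 'deleted' WHERE id = ?", (event_id,))

def restore(db, event_id):
    """Undo a soft delete."""
    db.execute("UPDATE events SET record_status = 'active' WHERE id = ?", (event_id,))

def active_titles(db):
    """Titles visible to normal queries, which filter on record_status."""
    rows = db.execute("SELECT title FROM events WHERE record_status = 'active' ORDER BY id")
    return [row[0] for row in rows]

soft_delete(conn, 1)
after_delete = active_titles(conn)
total_rows = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
restore(conn, 1)
after_restore = active_titles(conn)
print(after_delete, total_rows, after_restore)
```

The row count stays at two after the delete, which is what makes recovery and historical queries possible. The cost is that every ordinary query has to filter on record_status.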
Deployment (Cloudflare)
Environments
- Production: epgoat.tv (DB: epgoat-events-prod, R2: epgoat-epg-files)
- Staging: staging.epgoat.tv (DB: epgoat-events-staging, R2: epgoat-epg-files-staging)
- Development: localhost (Local SQLite, local files)
Domains
- epgoat.tv: Public website (NOT YET DEPLOYED)
- admin.epgoat.tv: Admin UI (DEPLOYED, 20% complete)
- epg.epgoat.tv: EPG file CDN (via R2 custom domain)
- api.epgoat.tv: API endpoints (via Workers)
Authentication
- Public: Clerk ($25/mo) or Auth0 (NOT YET IMPLEMENTED)
- Admin: Cloudflare Access (zero-trust, email @epgoat.tv only)
Critical Decisions (ADRs)
- ADR-001: Use Supabase PostgreSQL exclusively (2025-11-13)
  - Reason: Better tooling, PostgreSQL compatibility, no storage limits
  - Migration: Complete - D1 permanently deprecated, all code migrated
  - Breaking Change: EventDatabase now requires Supabase (no graceful degradation)
- ADR-002: EPG Matching Architecture (7-handler chain) (2025-11 draft)
  - Reason: Optimize for speed (cache first) + accuracy (fallback to LLM)
- ADR-003: Phase 2 Service Architecture (2025-11 draft)
  - Reason: Separate concerns, improve testability, enable caching
- ADR-004: Multi-Stage Regex Matcher (2025-11 draft)
  - Reason: Balance speed vs accuracy (exact first, fuzzy fallback)
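The "exact first, fuzzy fallback" idea behind ADR-004 can be sketched in a few lines. Note the substitution: this example uses the standard library's `difflib.get_close_matches` for the fuzzy stage rather than regular expressions, and the event list, `normalize`, `match_event`, and the 0.8 cutoff are all assumptions made for illustration.

```python
import difflib
from typing import Optional, Tuple

# Made-up event titles standing in for the event database.
KNOWN_EVENTS = ["Lakers vs Celtics", "Chiefs vs Bills", "Arsenal vs Chelsea"]

def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())

def match_event(candidate: str, cutoff: float = 0.8) -> Optional[Tuple[str, str]]:
    """Return (event title, stage) using an exact stage, then a fuzzy stage."""
    wanted = normalize(candidate)
    by_normalized = {normalize(event): event for event in KNOWN_EVENTS}

    # Stage 1: exact match on the normalized string (fast, no false positives).
    if wanted in by_normalized:
        return by_normalized[wanted], "exact"

    # Stage 2: fuzzy match, only reached when the exact stage fails.
    close = difflib.get_close_matches(wanted, list(by_normalized), n=1, cutoff=cutoff)
    if close:
        return by_normalized[close[0]], "fuzzy"
    return None

print(match_event("lakers  vs  celtics"))
print(match_event("Lakers vs Celtic"))
print(match_event("UFC 300 Main Card"))
```

Running the cheap exact stage first keeps the slower similarity comparison off the common path. The cutoff controls how tolerant the fallback is and would need tuning against real channel names.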
Full ADRs: Documentation/06-Decisions/
Performance Targets
| Metric | Current | Target (Phase 11) |
|---|---|---|
| Match Rate | 35% | 96%+ |
| Channels/Day | 1,261 | 10,000 |
| Processing Time | 2 hours | <60 seconds |
| Customers | 0 | 100-1,000 |
| Monthly Cost | $0 | $15-25 |
Common Operations
Generate EPG for TPS provider:
cd backend/epgoat
python cli/run_provider.py --provider tps --date 2025-11-10
Refresh event database:
cd backend/epgoat
python utilities/refresh_event_db_v2.py --date 2025-11-10
Run tests:
make test # from project root
Admin UI:
https://admin.epgoat.tv
Security
- Data at Rest: AES-256 encryption (Supabase + R2)
- Data in Transit: TLS 1.3 (all connections)
- Secrets: Cloudflare Workers env vars + GitHub Secrets
- Rate Limiting: By tier (100-100K req/day)
- PII: Email hashing, GDPR compliance
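Tier-based rate limiting can be pictured as a per-key daily counter. The sketch below is an assumption-heavy illustration: the tier names `lowest` and `highest` are hypothetical (the text above only gives the 100 to 100K requests/day range), and the in-process dictionary stands in for whatever shared storage a serverless deployment would actually need.

```python
from collections import defaultdict
from datetime import date

# Hypothetical tier names; the limits are the two ends of the range quoted above.
DAILY_LIMITS = {"lowest": 100, "highest": 100_000}

class DailyQuota:
    """Counts requests per (API key, day) and rejects those over the tier limit."""

    def __init__(self, limits):
        self._limits = limits
        self._counts = defaultdict(int)  # (api_key, day) -> requests seen so far

    def allow(self, api_key: str, tier: str, today: date) -> bool:
        key = (api_key, today)
        if self._counts[key] >= self._limits[tier]:
            return False
        self._counts[key] += 1
        return True

quota = DailyQuota(DAILY_LIMITS)
day = date(2025, 11, 10)
results = [quota.allow("key-a", "lowest", day) for _ in range(101)]
print(results.count(True), results[-1])
```

Because the counter is keyed by day, the quota resets naturally when the date changes.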
Glossary
- EPG: Electronic Program Guide (TV schedule data)
- XMLTV: XML-based EPG format (industry standard)
- M3U: Playlist format used by IPTV services
- tvg-id: Unique channel identifier in M3U files
⚠️ M3U Parsing - Missing tvg-id
Problem: Many M3U playlists don't include tvg-id attributes, causing EPG matching to fail silently.
Solution: EPGOAT uses channel name pattern matching as a fallback:
- Try tvg-id first (if present)
- Fall back to pattern matching on channel name
- Extract event details from name (team, date, time)
- Match against Event Database
This dual approach achieves ~85% match rate vs ~30% with tvg-id only.
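The first two steps of this fallback can be sketched as follows. This is a simplified illustration, not the project's parser: `parse_extinf`, `identify`, and the small sport-prefix pattern are hypothetical, and the parsing assumes there are no commas inside attribute values.

```python
import re

TVG_ID_RE = re.compile(r'tvg-id="([^"]*)"')
# Illustrative subset of channel-name patterns.
SPORT_PREFIX_RE = re.compile(r"^(NBA|NFL|NHL|MLB)\s+\d+:")

def parse_extinf(line: str):
    """Split an #EXTINF line into (tvg-id or None, display name).

    Simplification: the display name is taken as everything after the first comma.
    """
    attrs, _, name = line.partition(",")
    found = TVG_ID_RE.search(attrs)
    tvg_id = found.group(1).strip() if found else None
    return (tvg_id or None), name.strip()

def identify(line: str):
    """Prefer tvg-id; otherwise fall back to pattern matching on the name."""
    tvg_id, name = parse_extinf(line)
    if tvg_id:
        return ("tvg-id", tvg_id)
    found = SPORT_PREFIX_RE.match(name)
    if found:
        return ("pattern", found.group(1))
    return ("unmatched", None)

with_id = '#EXTINF:-1 tvg-id="nba.01.us" group-title="Sports",NBA 01: Lakers vs Celtics 7:00 PM ET'
empty_id = '#EXTINF:-1 tvg-id="" group-title="Sports",NBA 01: Lakers vs Celtics 7:00 PM ET'
no_id = '#EXTINF:-1,Movie Channel'

for line in (with_id, empty_id, no_id):
    print(identify(line))
```

An empty tvg-id attribute is treated the same as a missing one, so the second sample still resolves through the name pattern instead of failing silently.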
- VOD: Video on Demand (movies/shows, not live events)
- Provider: IPTV service (TPS, Trex, Necro, etc.)
- Family: Channel grouping (e.g., "NBA", "Flo Sports", "Paramount+")
- League: Actual sports league (NBA, NFL, UFC, NCAAF, etc.)
- Match Rate: % of channels successfully matched to events
Next Review: After Phase 1 complete (Public Frontend)
For Details: See Documentation/03-Architecture/System-Overview.md (full 1363-line version)