Project: Architecture & Workflow Redesign - EPGOAT V2

📍 Current Context

Last Updated: 2025-11-07 [current session] Current Task: Phase 4 COMPLETE! 🚀 Ready for Documentation Breadcrumbs: Phase 4 → Phase 4.6 → Testing COMPLETE Status: Phase 4 EPG Generation System fully implemented and tested: - ✅ XMLTVGenerator class (database-powered, three-tier system, three-scenario handling) - ✅ Language detector service (detect_language_from_channel_name) - ✅ Channel parser service (LLM-ready with fallback chain) - ✅ EPG data generator service (populates epg_data table with sports_metadata) - ✅ R2 storage client (Cloudflare R2 integration with S3-compatible API) - ✅ Batch EPG generation script (450 files: 3 tiers × 10 providers × 15 timezones) - ✅ Cache invalidation utility (regenerate_invalidated_epg.py with retry logic) - ✅ Comprehensive test suite (300+ tests covering all components) Next Step: Phase 4.7 - Create EPG Generation guide and finalize documentation

✅ Progress Tracker

[x] Phase 1: Infrastructure Setup (Week 1) ✅ 2025-01-05
[x] 1.1: Create Supabase project (free tier) ✅ 2025-01-05
[x] 1.2: Set up R2 bucket ("epgoat-epg-files") ✅ 2025-01-05
[x] 1.3: Generate API tokens and connection strings ✅ 2025-01-05
[x] 1.4: Test connectivity from dev environment ✅ 2025-01-05
[x] 1.5: Document setup in Documentation/05-Reference/ ✅ 2025-01-05
[x] Phase 2: Database Schema (Week 1-2) ✅ 2025-11-07
[x] 2.1: Write SQL migration (010_create_new_schema.sql) ✅ 2025-01-05
[x] 2.2: Apply migration to Supabase dev instance ✅ 2025-11-07
[x] 2.3: Create rollback migration (DOWN section) ✅ 2025-11-07
[x] 2.4: Verify all indexes created ✅ 2025-11-07
[x] 2.5: Test trigger (EPGData insert → invalidate files) ✅ 2025-11-07
[x] Phase 3: Parser & Matching Engine (Week 2-3) ✅ 2025-11-06
[x] 3.1: Channel Parser ✅ 2025-11-06
- [x] 3.1.1: Implement regex patterns (NBA, NFL, NHL, MLB, Soccer) ✅ 2025-11-06
- [x] 3.1.2: Integrate Claude Haiku API ✅ 2025-11-06
- [x] 3.1.3: Build LLM cache (exact + similarity matching) ✅ 2025-11-06
- [x] 3.1.4: Test with 100+ real channel names ✅ 2025-11-06
[x] 3.2: Event Matching ✅ 2025-11-06
- [x] 3.2.1: Build fuzzy team name matching ✅ 2025-11-06
- [x] 3.2.2: Integrate manual override system ✅ 2025-11-06
- [x] 3.2.3: Support confidence scoring ✅ 2025-11-06
- [x] 3.2.4: Test match accuracy ✅ 2025-11-06
[x] 3.3: Two-Phase Event Loading ✅ 2025-11-06
- [x] 3.3.1: Daily job - Bulk load event IDs (TheSportsDB + ESPN) ✅ 2025-11-06
- [x] 3.3.2: On-match - Fetch full details ✅ 2025-11-06
- [x] 3.3.3: Merge data from both APIs (TheSportsDB preferred) ✅ 2025-11-06
[in-progress] Phase 4: EPG Generation System (Week 3-4) ← ACTIVE
[x] Planning Complete (2025-11-07)
- [x] 4.0.1: Define tier system and field specifications ✅ 2025-11-07
- [x] 4.0.2: Design three channel scenarios (matched/parsed-unmatched/unparseable) ✅ 2025-11-07
- [x] 4.0.3: Specify event filtering logic (3-day lookahead, future only) ✅ 2025-11-07
- [x] 4.0.4: Define pre/post event timing (midnight start, 2-hour blocks, 4-hour post) ✅ 2025-11-07
- [x] 4.0.5: Design broadcaster conditional logic ✅ 2025-11-07
- [x] 4.0.6: Create field specification CSV (epg-field-specification.csv) ✅ 2025-11-07
- [x] 4.0.7: Validate epg_data schema against CSV ✅ 2025-11-07
[x] 4.1: Core Implementation ✅ 2025-11-07
- [x] 4.1.1: Add constants to backend/epgoat/domain/models.py ✅ 2025-11-07
- [x] 4.1.2: Refactor backend/epgoat/domain/xmltv.py with XMLTVGenerator class ✅ 2025-11-07
- [x] 4.1.3: Implement three-scenario handling (matched/parsed-unmatched/unparseable) ✅ 2025-11-07
- [x] 4.1.4: Implement tier-based field filtering ✅ 2025-11-07
- [x] 4.1.5: Implement timezone-aware descriptions ✅ 2025-11-07
- [x] 4.1.6: Add emoji support with conditional logic ✅ 2025-11-07
[x] 4.2: Modular Services ✅ 2025-11-07
- [x] 4.2.1: Create backend/epgoat/services/channel_parser.py (LLM-ready) ✅ 2025-11-07
- [x] 4.2.2: Create backend/epgoat/services/language_detector.py ✅ 2025-11-07
- [x] 4.2.3: Add TODO for future LLM enhancement ✅ 2025-11-07
- [x] 4.2.4: Update living document with LLM task ✅ 2025-11-07
[x] 4.3: EPG Data Population ✅ 2025-11-07
- [x] 4.3.1: Create backend/epgoat/services/epg_data_generator.py ✅ 2025-11-07
- [ ] 4.3.2: Schedule daily processing workflow (future task)
[x] 4.4: Batch Generation & Storage ✅ 2025-11-07
- [x] 4.4.1: Create backend/epgoat/utilities/generate_epg_files_batch.py ✅ 2025-11-07
- [x] 4.4.2: Create backend/epgoat/services/r2_storage.py ✅ 2025-11-07
- [x] 4.4.3: Implement R2 upload for EPG files ✅ 2025-11-07
[x] 4.5: Cache Invalidation ✅ 2025-11-07
- [x] 4.5.1: Create backend/epgoat/utilities/regenerate_invalidated_epg.py ✅ 2025-11-07
- [ ] 4.5.2: Test trigger integration (ready for integration testing)
[x] 4.6: Testing ✅ 2025-11-07
- [x] 4.6.1: Write unit tests (tier filtering, emoji logic, event filtering) ✅ 2025-11-07
- [x] 4.6.2: Write integration tests (full EPG generation, scenario comparison) ✅ 2025-11-07
- [ ] 4.6.3: Visual testing with TiviMate/VLC (manual testing phase)
[ ] 4.7: Documentation
- [ ] 4.7.1: Update living document with implementation details
- [ ] 4.7.2: Create EPG Generation guide
[ ] Phase 5: Migration & Cutover (Week 4) !!!This phase is unneeded!!!
[ ] 5.1: Data Migration
- [ ] 5.1.1: Export all tables from D1 to JSON
- [ ] 5.1.2: Transform to new schema
- [ ] 5.1.3: Import to Supabase
- [ ] 5.1.4: Validate data completeness (target: 100%)
[ ] 5.2: Parallel Run (2 weeks)
- [ ] 5.2.1: Dual-write to both D1 and Supabase
- [ ] 5.2.2: Compare outputs for consistency
- [ ] 5.2.3: Monitor performance
- [ ] 5.2.4: Fix discrepancies (target: <1% difference)
[ ] 5.3: Cutover
- [ ] 5.3.1: Switch reads to Supabase
- [ ] 5.3.2: Monitor for 48 hours
- [ ] 5.3.3: Stop D1 writes
- [ ] 5.3.4: Document rollback procedure
[ ] 5.4: Decommission D1 (after 30-day safety period)
- [ ] 5.4.1: Keep D1 running for 30 days (safety)
- [ ] 5.4.2: Archive D1 data
- [ ] 5.4.3: Delete Supabase database
- [ ] 5.4.4: Update all documentation

🔀 Tangent Log

Active Tangents

[No active tangents yet]

Completed Tangents

[No completed tangents yet]

📋 Original Plan

Architecture & Workflow Redesign - EPGOAT V2

Status: Active Last Updated: 2025-01-05 Related Docs: System Overview, Core Principles, TODO-BACKLOG Project Lead: CTO Target Completion: 2025-02-05 (4 weeks)

Executive Summary

Complete architectural redesign of EPGOAT's EPG generation system to address scalability, maintainability, and operational challenges. This is a strategic rebuild that replaces the current D1-based system with a PostgreSQL (Supabase) foundation, implements parser-first workflows, and introduces intelligent caching with R2 storage.

Business Impact: - Enables worldwide launch with 15 timezone support - Reduces LLM costs by 70-80% through intelligent caching - Provides production-ready data visibility and debugging tools - Scales to 10,000+ users at $200-300/month (vs $1000+ with current approach)

Technical Transformation: - Database: Supabase PostgreSQL (SQLite) → PostgreSQL (Supabase) - Schema: Flat structure → 3-table parser-first design with JSONB - Storage: Workers KV → Cloudflare R2 (zero egress costs) - Caching: Basic → Intelligent hybrid (auto-promotion/demotion) - Event Data: Single API → Dual API with two-phase loading

Problem Statement

Current Limitations

Database (Supabase PostgreSQL): - ❌ No visual query tools (no phpMyAdmin, limited DataGrip support) - ❌ Limited debugging capabilities - ❌ Poor indexing performance for complex queries - ❌ Difficult data exploration and analytics

Parser Workflow: - ❌ Match-first approach loses data when events don't match - ❌ No storage of parse attempts (can't analyze failures) - ❌ LLM calls not deduplicated (expensive, redundant) - ❌ Limited visibility into why channels fail to match

EPG Generation: - ❌ Single timezone support (America/Chicago only) - ❌ Generate-on-demand only (no pre-generation) - ❌ No tier differentiation (kid/billy/goat) - ❌ Storage strategy undefined for production

Event Data: - ❌ Single API source (TheSportsDB only) - ❌ Fetch all details upfront (slow, expensive) - ❌ No redundancy if API unavailable

Cost & Scale: - ❌ Current approach doesn't scale economically - ❌ No cost optimization strategies - ❌ LLM costs grow linearly with channels

Solution Architecture

Core Architectural Decisions

1. Database: PostgreSQL via Supabase

Why Supabase? - ✅ PostgreSQL 15+ with full JSONB support - ✅ Built-in GUI (no setup needed) - ✅ REST API auto-generated from schema - ✅ Row-level security for multi-tenancy - ✅ Real-time subscriptions for admin dashboard - ✅ Free tier: 500MB DB (plenty for current scale) - ✅ DataGrip compatibility (full SQL IDE support)

Migration Impact: - Minimal downtime (parallel run strategy) - Rollback capability (keep D1 for 30 days) - Improved query performance (PostgreSQL optimizer) - Better analytics capabilities (window functions, CTEs)

2. Schema: Three-Table Core with JSONB

channel_names (one per unique channel/provider)
    ↓
parsed_data (all parse attempts, JSONB for flexibility)
    ↓
epg_data (final output-ready EPG entries)

Supporting:
- llm_parse_cache (LLM deduplication)
- epg_file_cache (R2 file tracking)

Key Benefits: - Clear separation: parse → match → output - JSONB flexibility (add fields without migrations) - Full audit trail (every parse attempt stored) - Analytics-ready (query parse success rates, confidence trends)

3. Storage: Cloudflare R2

Why R2? - ✅ Zero egress fees (unlimited downloads at no cost) - ✅ 10GB free tier (20x current needs) - ✅ Perfect Workers integration - ✅ Global CDN built-in - ✅ S3-compatible API

Cost Comparison (at 1000 users, 1M EPG downloads/month): | Service | Storage Cost | Egress Cost | Total | |---------|--------------|-------------|-------| | Cloudflare R2 | $0.15 | $0 | $0.15/mo | | AWS S3 + CloudFront | $0.23 | $90 | $90.23/mo |

Savings: $1,080/year at just 1000 users

4. EPG Strategy: Hybrid Caching

Pre-Generated Files (Tier 1 - 95% of users): - 3 tiers (kid/billy/goat) × 10 providers × 15 timezones = 450 files - Storage: ~225MB - Generation: When EPGData changes (not on every request) - Instant load for vast majority

On-Demand + Cache (Tier 2 - 5% of users): - Custom configurations - Rare timezones - First load: 4-20 sec, then cached for 24h - Acceptable for edge cases

Auto-Promotion (Tier 3 - Intelligence Layer):

-- Promote to pre-generated if popular
IF accessed 3+ times in 24h:
  → Move to Tier 1 (pre-generated)

-- Demote if unused
IF unused for 7 days:
  → Move back to Tier 2 (on-demand)
  → Delete cached file

15 Pre-Generated Timezones (Worldwide Coverage): 1. America/New_York (US Eastern) 2. America/Chicago (US Central) 3. America/Denver (US Mountain) 4. America/Los_Angeles (US Pacific) 5. America/Sao_Paulo (Brazil/South America) 6. Europe/London (UK/Ireland) 7. Europe/Paris (Western Europe) 8. Europe/Istanbul (Eastern Europe) 9. Europe/Moscow (Russia) 10. Asia/Dubai (Middle East) 11. Asia/Kolkata (India) 12. Asia/Singapore (Southeast Asia) 13. Asia/Tokyo (East Asia) 14. Australia/Sydney (Australia East) 15. Pacific/Auckland (New Zealand)

5. Parser Workflow: Parse-First Approach

Old Workflow:

M3U → Parse → Match → Generate EPG (matched only)

Problem: Unmatched channels lose all data, no analytics

New Workflow:

M3U → Parse ALL → Store ParsedData → Match → Generate EPG (ALL)

Benefits: - ✅ Store every parse attempt (matched or not) - ✅ Rich analytics (parse success rates, confidence trends) - ✅ Use extracted data even when no event match - ✅ Debug visibility (why did this channel fail?)

Parser Strategy: 1. Regex first (fast, free, 70-80% coverage) 2. Check LLM cache (exact match) 3. Check LLM cache (similar channels: same teams/time/league) 4. LLM fallback (Claude Haiku) → cache result

LLM Deduplication Example:

Without dedup:
- "NBA 01: Lakers vs Rockets 7:30" → LLM
- "Basketball: Rockets vs Lakers 7:30" → LLM
- "NBA 05: LA vs HOU 7:30" → LLM
= 3 LLM calls ($0.015)

With dedup:
- "NBA 01: Lakers vs Rockets 7:30" → LLM (cache)
- "Basketball: Rockets vs Lakers 7:30" → Cache hit
- "NBA 05: LA vs HOU 7:30" → Cache hit
= 1 LLM call ($0.005)

Savings: 66% cost reduction

6. Event Data: Two-Phase Loading

Phase 1: Bulk Event ID Ingestion (Daily, 2 AM UTC)

# Download lightweight event lists for next 7 days
events = [
    # TheSportsDB: ~1000 events/day
    fetch_thesportsdb_schedule(today → today+7days),

    # ESPN API: ~1000 events/day
    fetch_espn_schedule(today → today+7days)
]

# Store minimal data for matching
for event in events:
    save_event_id(
        api_id=event.id,
        league=event.league,
        teams=[event.home, event.away],
        datetime=event.datetime,
        details_fetched=False  # Flag: no full details yet
    )

Phase 2: On-Demand Detail Fetching (During Matching)

# Match channel to event ID (fast, all IDs pre-loaded)
event = match_to_event_id(parsed_channel_data)

if event and not event.details_fetched:
    # Fetch full details from BOTH APIs in parallel
    details = fetch_details_parallel(
        thesportsdb_id=event.thesportsdb_id,
        espn_id=event.espn_id
    )

    # Merge best data from both sources
    event = merge_details(event, details)
    event.details_fetched = True

Benefits: - ✅ Fast matching (all event IDs pre-loaded) - ✅ API cost savings (only fetch details for matched events: ~5-10%) - ✅ Redundancy (ESPN backup if TheSportsDB unavailable) - ✅ Better data quality (merge best from both APIs)

API Usage Estimate:

Daily event ID refresh: 2,000 events × 2 APIs = 4,000 lightweight calls
On-demand details: 200 matched × 2 APIs = 400 detailed calls
Total: ~4,400 calls/day (well within free tier limits)

Database Schema Design

Core Tables

1. channel_names

One record per unique channel per provider

CREATE TABLE channel_names (
    id BIGSERIAL PRIMARY KEY,
    provider_id INTEGER NOT NULL REFERENCES providers(id),
    channel_name TEXT NOT NULL,
    channel_name_normalized TEXT NOT NULL,
    tvg_id TEXT,
    tvg_name TEXT,
    tvg_logo TEXT,
    group_title TEXT,
    stream_url TEXT,
    record_status TEXT DEFAULT 'active',
    first_seen TIMESTAMPTZ DEFAULT NOW(),
    last_seen TIMESTAMPTZ DEFAULT NOW(),
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(provider_id, channel_name_normalized)
);

Purpose: Stable channel registry Key Features: - Soft deletes (record_status) - Deduplication via channel_name_normalized - Tracks when channels first/last seen

2. parsed_data

Every parse attempt, matched or not

CREATE TABLE parsed_data (
    id BIGSERIAL PRIMARY KEY,
    channel_name_id BIGINT NOT NULL REFERENCES channel_names(id),
    parse_date DATE NOT NULL,
    raw_title TEXT NOT NULL,

    -- JSONB for flexible extraction
    parsed_data JSONB NOT NULL,
    /* Example:
    {
      "time": "20:00",
      "timezone": "ET",
      "date": "2025-11-05",
      "teams": ["Lakers", "Celtics"],
      "league": "NBA",
      "sport": "Basketball",
      "confidence": {
        "time": 0.95,
        "teams": 0.90,
        "league": 1.0,
        "overall": 0.88
      }
    }
    */

    matched_event_id INTEGER REFERENCES events(id),
    match_confidence REAL,
    match_method TEXT,  -- 'regex', 'llm', 'llm_cached', 'manual'
    match_reason TEXT,

    record_status TEXT DEFAULT 'active',
    created_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(channel_name_id, parse_date)
);

CREATE INDEX idx_parsed_data_jsonb ON parsed_data USING GIN (parsed_data);

Purpose: Full parse audit trail Key Features: - JSONB flexibility (add fields without migrations) - GIN index for fast JSONB queries - Tracks match method (for analytics: how often does LLM vs regex succeed?)

3. epg_data

Final output-ready EPG entries

CREATE TABLE epg_data (
    id BIGSERIAL PRIMARY KEY,
    channel_name_id BIGINT NOT NULL REFERENCES channel_names(id),
    parsed_data_id BIGINT REFERENCES parsed_data(id),

    -- Core XMLTV fields
    title TEXT NOT NULL,
    sub_title TEXT,
    description TEXT,
    start_time TIMESTAMPTZ NOT NULL,
    end_time TIMESTAMPTZ NOT NULL,
    category TEXT,
    language TEXT DEFAULT 'en',
    icon_url TEXT,

    -- Event characteristics
    is_live BOOLEAN DEFAULT false,
    is_new BOOLEAN DEFAULT false,
    is_repeat BOOLEAN DEFAULT false,
    is_premiere BOOLEAN DEFAULT false,

    -- Sports-specific (JSONB)
    sports_metadata JSONB,
    /* {
      "venue": "Madison Square Garden",
      "home_team": "Lakers",
      "away_team": "Celtics",
      "home_logo": "https://...",
      "league": "NBA",
      "broadcaster": "ESPN"
    }
    */

    -- Tier features (controls rendering)
    tier_features JSONB,
    /* {
      "show_emoji": true,
      "show_logos": true,
      "show_debug_info": false,
      "timezone": "America/Chicago"
    }
    */

    confidence REAL,
    match_status TEXT,  -- 'matched', 'partial', 'unmatched', 'manual'
    record_status TEXT DEFAULT 'active',
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

Purpose: Ready-to-generate EPG data Key Features: - Full XMLTV field support - Sports metadata in JSONB (flexible per sport) - Tier features control what renders (emojis, logos, debug)

4. llm_parse_cache

LLM result deduplication

CREATE TABLE llm_parse_cache (
    id BIGSERIAL PRIMARY KEY,
    channel_name_normalized TEXT NOT NULL UNIQUE,
    raw_channel_name TEXT NOT NULL,

    -- For similarity matching
    extracted_teams TEXT[],
    extracted_time TEXT,
    extracted_league TEXT,
    extracted_sport TEXT,

    -- LLM metadata
    llm_response JSONB,
    llm_cost_cents REAL,
    llm_tokens_used INTEGER,

    -- Usage tracking
    reuse_count INTEGER DEFAULT 0,
    last_reused_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_llm_cache_similarity
  ON llm_parse_cache(extracted_teams, extracted_time, extracted_league);

Purpose: Prevent duplicate LLM calls Key Features: - Exact match via channel_name_normalized - Similarity match via extracted_teams + time + league - Tracks cost savings via reuse_count

5. epg_file_cache

Track EPG files in R2

CREATE TABLE epg_file_cache (
    id BIGSERIAL PRIMARY KEY,
    provider_id INTEGER NOT NULL REFERENCES providers(id),
    tier TEXT NOT NULL,  -- 'kid', 'billy', 'goat', 'custom'
    timezone TEXT NOT NULL,
    custom_config_hash TEXT,

    -- R2 storage
    r2_key TEXT NOT NULL UNIQUE,
    file_size_bytes BIGINT,

    -- Generation tracking
    generation_status TEXT DEFAULT 'pending',
    generation_duration_ms INTEGER,

    -- Access tracking
    access_count INTEGER DEFAULT 0,
    last_accessed_at TIMESTAMPTZ,
    promotion_status TEXT DEFAULT 'on_demand',  -- or 'pre_generated'

    -- Invalidation
    invalidated_at TIMESTAMPTZ,

    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(provider_id, tier, timezone, COALESCE(custom_config_hash, ''))
);

Purpose: Intelligent cache management Key Features: - Tracks every EPG file in R2 - access_count enables auto-promotion - invalidated_at triggers regeneration when data changes

Enhanced events Table

ALTER TABLE events ADD COLUMN espn_id TEXT;
ALTER TABLE events ADD COLUMN home_team TEXT;
ALTER TABLE events ADD COLUMN away_team TEXT;
ALTER TABLE events ADD COLUMN details_fetched BOOLEAN DEFAULT false;
ALTER TABLE events ADD COLUMN thesportsdb_metadata JSONB;
ALTER TABLE events ADD COLUMN espn_metadata JSONB;

Purpose: Two-phase event loading Key Features: - details_fetched flag (false = ID only, true = full details) - Store full API responses in JSONB (for debugging) - Support both TheSportsDB and ESPN IDs

Implementation Phases

Phase 1: Infrastructure Setup (Week 1)

Deliverables: - ✅ Supabase PostgreSQL project created - ✅ Cloudflare R2 bucket configured - ✅ Environment variables set up - ✅ Connection tests passing

Tasks: 1. Create Supabase project (free tier) 2. Set up R2 bucket ("epgoat-epg-files") 3. Generate API tokens and connection strings 4. Test connectivity from dev environment 5. Document setup in Documentation/05-Reference/

Success Criteria: - Can connect to Supabase from Python - Can upload/download to R2 - All credentials in .env (not committed)

Phase 2: Database Schema (Week 1-2)

Deliverables: - ✅ All 5 new tables created in Supabase - ✅ Triggers for auto-invalidation working - ✅ Migration scripts tested (UP and DOWN) - ✅ Indexes optimized for query patterns

Tasks: 1. Write SQL migration (010_create_new_schema.sql) 2. Test migration on dev Supabase instance 3. Create rollback migration (DOWN section) 4. Verify all indexes created 5. Test trigger: EPGData insert → invalidate files

Success Criteria: - Migration runs without errors - All 5 tables exist with correct schema - Triggers fire correctly - Can query JSONB fields efficiently

Phase 3: Parser & Matching Engine (Week 2-3)

Deliverables: - ✅ Channel parser with regex + LLM fallback - ✅ LLM caching with 70%+ reuse rate - ✅ Event matching with dual API support - ✅ Two-phase event loading operational

Tasks:

3.1: Channel Parser - Implement regex patterns (NBA, NFL, NHL, MLB, Soccer) - Integrate Claude Haiku API - Build LLM cache (exact + similarity matching) - Test with 100+ real channel names - Target: 80%+ regex success, 70%+ cache hits

3.2: Event Matching - Build fuzzy team name matching - Integrate manual override system - Support confidence scoring - Test match accuracy - Target: 85%+ match rate

3.3: Two-Phase Event Loading - Daily job: Bulk load event IDs (TheSportsDB + ESPN) - On-match: Fetch full details - Merge data from both APIs (TheSportsDB preferred) - Target: <2 sec per match

Success Criteria: - Parser handles 90%+ channels successfully - LLM cache reduces costs by 70%+ - Event matching achieves 85%+ success rate - Details fetch in <2 seconds per event

Phase 4: EPG Generation System (Week 3-4)

Status: Planning Complete (2025-11-07) | Implementation In Progress

Overview

Phase 4 transforms the database-stored EPG data into XMLTV-formatted files that IPTV clients can consume. This phase implements a sophisticated three-tier system (KID/BILLY/GOAT) with intelligent handling of matched, parsed-but-unmatched, and unparseable channels.

Key Design Decisions (2025-11-07 Planning Session): - Refactor existing xmltv.py (not create new file) - maintains continuity - Generate pre/post blocks on-the-fly (not stored in DB) - saves massive space - Three channel scenarios with different handling for each - 3-day lookahead window for events (only future events) - Midnight start for pre-events (00:00 of target_date) - 4-hour post-event blocks (rounded to 2-hour grid) - Keep original channel names during testing (configurable for production) - Provider tvg-id priority (never change user's existing IDs) - LLM-ready architecture with fallback chain (future enhancement)

Field Specification

Reference Document: epg-field-specification.csv (GitHub root)

This CSV defines all 48 EPG fields with tier availability, data sources, XMLTV elements, and example values. Key sections: - Channel Fields (rows 2-5): ID, display name, icon, category - Programme Core (rows 6-8): Start/stop times, channel reference - Title Variants (rows 9-11): Pre-event, live event, post-event - Descriptions (rows 13-15): Tier-specific with broadcaster conditional - Visual Elements (rows 16-18): Icons, categories, keywords - Sports Metadata (rows 30-42): Venue, teams, logos, broadcaster, league, season, round - Block Timing (rows 43-45): 2-hour grid alignment rules - Emoji Config (rows 46-48): Sport-specific, GOAT live only

Important CSV Notes: - Row 13 (K13): GOAT pre-event adds venue + broadcaster conditional - Row 14 (K14): GOAT live event adds venue + broadcaster conditional - Row 27 (K27): Language detection from channel names (espanol, español, french, etc.)

Tier System

KID Tier (Minimal): - ✅ Channel ID, display name - ✅ Start/stop times for live events - ✅ Title: Event name only - ❌ No descriptions, icons, emojis, metadata

BILLY Tier (Mid-Level): - ✅ Everything in KID - ✅ Pre-event: "Airing today: {title}" - ✅ Post-event: "Ended: {title}" - ✅ Basic descriptions - ✅ Language detection - ❌ No emojis, icons, detailed metadata

GOAT Tier (Full Features): - ✅ Everything in BILLY - ✅ Sport emojis on live events (🏀🏈⚽) - ✅ Programme icons (team/league logos) - ✅ Rich descriptions with venue, city, broadcaster - ✅ Timezone-aware times in titles - ✅ Categories, keywords - ✅ Live flags (live/new/premiere) - ✅ Full sports_metadata JSONB

Three Channel Scenarios

Scenario 1: Matched Channels (✅) - Condition: epg_data.matched_event_id IS NOT NULL - Emoji: Sport-specific (🏀 for Basketball) - GOAT tier live events only - Title: Full canonical team names from database - Description: Rich with venue, broadcaster, season context - Pre/Post Blocks: Full 2-hour grid from midnight to event, 4 hours post - Example: "🏀 Los Angeles Lakers vs Boston Celtics" (GOAT live)

Scenario 2: Parsed-Unmatched Channels (⚠️) - Condition: parsed_data IS NOT NULL AND matched_event_id IS NULL - Emoji: ⚠️ Warning emoji - ALL tiers (testing only, will remove in production) - Title: Extracted team names from channel name (best effort) - Description: Basic with extracted data - Pre/Post Blocks: Same structure as matched (mirror behavior) - Example: "⚠️ Lakers vs Celtics" (testing indicator)

Scenario 3: Unparseable Channels (📺) - Condition: parsed_data IS NULL OR parse failed completely - Emoji: 📺 TV emoji - ALL tiers (testing only, will remove in production) - Title: Original channel name (no parsing) - Description: "Programming on this channel" - Blocks: 24-hour block with original channel name - Example: "📺 Some Random Channel 01"

Testing Emojis (⚠️ and 📺): - Purpose: Visual identification during testing/debugging - Will be removed in production (or made configurable) - Sport emojis (🏀🏈⚽) remain in production for GOAT tier

Event Filtering Logic

3-Day Lookahead Window:

# Only include events within next 3 days
if event.date <= target_date + timedelta(days=3):
    # AND event hasn't already happened
    if event_datetime > now():
        include_event()

Past Event Handling: - If event time < now: Ignore completely (don't generate any blocks) - Rationale: No value in showing past events

No Date Handling: - Assume today: If no date extracted, treat as target_date - Best effort: Use time if available, create single block

Date Trust: - Always trust parsed dates from channel names - Don't second-guess extraction logic

Pre/Post Event Block Timing

Pre-Event Blocks:

# Start at midnight of target_date (NOT current time)
pre_event_start = target_date.replace(hour=0, minute=0, second=0)

# End at event start time
pre_event_end = event.start_time

# Fill with 2-hour blocks, aligned to global grid
# Global grid: 00:00, 02:00, 04:00, 06:00, 08:00, etc.
# Last block may be partial if event doesn't start on grid

Example: Event at 19:30 (7:30 PM) - Block 1: 00:00-02:00 (2 hours) - Block 2: 02:00-04:00 (2 hours) - Block 3: 04:00-06:00 (2 hours) - Block 4: 06:00-08:00 (2 hours) - Block 5: 08:00-10:00 (2 hours) - Block 6: 10:00-12:00 (2 hours) - Block 7: 12:00-14:00 (2 hours) - Block 8: 14:00-16:00 (2 hours) - Block 9: 16:00-18:00 (2 hours) - Block 10: 18:00-19:30 (1.5 hours - partial)

Post-Event Blocks:

# Start at event end time
post_event_start = event.end_time

# Total duration: 4 hours
post_event_end = event.end_time + timedelta(hours=4)

# Fill with 2-hour blocks (usually 2 blocks)
# Round to nearest grid boundary

2-Hour Grid Alignment: - All blocks align to global 2-hour grid based on local timezone - Grid boundaries: 00:00, 02:00, 04:00, 06:00, 08:00, 10:00, 12:00, 14:00, 16:00, 18:00, 20:00, 22:00 - Last block may be partial if event doesn't align

Broadcaster Conditional Logic

Pre-Event Descriptions (GOAT tier):

if broadcaster:
    desc = f"NBA game showing on this channel and on {broadcaster} at {time_str}"
else:
    desc = f"NBA game at Staples Center"

Example with broadcaster: "NBA game showing on this channel and on ESPN at 7:30 PM CT" Example without: "NBA game at Staples Center"

Live Event Descriptions (GOAT tier):

if broadcaster:
    desc = f"Live NBA game from Staples Center featuring Lakers vs Celtics. Broadcast on {broadcaster}"
else:
    desc = f"Live NBA game from Staples Center featuring Lakers vs Celtics"

Key Points: - NO "LIVE:" prefix in titles (emoji serves that purpose) - Broadcaster is optional (many events don't have it) - Different wording for pre-event vs live event

Channel Display Names

Testing Phase (Current): - Keep full original channel names from M3U - Example: "NBA 01: Lakers vs Celtics 7:30 PM ET" - Rationale: Full context during debugging

Production Phase (Future): - Use shell names only (configurable) - Example: "NBA 01" - Rationale: Cleaner UI, all details in EPG data

Implementation:

def get_display_name(channel, clean_names: bool = False):
    if clean_names:
        # Production: shell only
        return channel.tvg_name or "Channel {id}"
    else:
        # Testing: full original name
        return channel.channel_name

tvg-id Priority Strategy

Priority 1: Use provider's tvg-id (if exists)

if channel.tvg_id and channel.tvg_id.strip():
    return channel.tvg_id  # NEVER change this

Priority 2: Generate consistent ID

else:
    return generate_channel_id(channel.channel_name)

Rationale: - User compatibility: Changing tvg-ids breaks existing user setups - Provider trust: If provider specifies ID, it's intentional - Fallback only: Generate IDs only when provider doesn't provide them

Language Detection

Strategy: Parse channel name for language indicators

def detect_language(channel_name: str) -> str:
    name_lower = channel_name.lower()

    # Spanish
    if any(word in name_lower for word in ['espanol', 'español', 'spanish', 'latino', 'latina']):
        return 'es'

    # French
    if any(word in name_lower for word in ['français', 'francais', 'french']):
        return 'fr'

    # Portuguese
    if any(word in name_lower for word in ['português', 'portugues', 'portuguese', 'brasil', 'brazil']):
        return 'pt'

    # Default
    return 'en'

Tiers: BILLY + GOAT (KID doesn't include language field)

Sport Default Durations

For parsed-unmatched events (no event match), use sport-specific durations:

SPORT_DEFAULT_DURATIONS = {
    'Basketball': 180,  # 3 hours in minutes
    'Football': 210,    # 3.5 hours (NFL)
    'Baseball': 210,    # 3.5 hours (MLB)
    'Hockey': 180,      # 3 hours (NHL)
    'Soccer': 150,      # 2.5 hours
    'default': 180      # 3 hours fallback
}

Rationale: Without event match, we don't have actual duration, so use typical duration per sport.

LLM-Ready Parser Architecture

Future Enhancement (Phase 4.X):

Design parser with fallback chain for future Claude Haiku integration:

class ChannelParser:
    def parse(self, channel_name: str) -> ParseResult:
        # Step 1: Regex patterns (fast, free, 70-80% coverage)
        result = self._parse_with_regex(channel_name)
        if result.confidence > 0.8:
            return result

        # Step 2: LLM cache (exact match)
        cached = self._check_llm_cache_exact(channel_name)
        if cached:
            return cached

        # Step 3: LLM cache (similarity: same teams/time/league)
        cached = self._check_llm_cache_similar(channel_name)
        if cached and cached.confidence > 0.9:
            return cached

        # Step 4: [FUTURE] Claude Haiku API call
        # TODO: Implement LLM fallback
        # - Call Claude Haiku with prompt
        # - Cache result for future reuse
        # - Track cost and usage

        # Step 5: Return best available result
        return result if result.confidence > 0.0 else ParseResult.empty()

Benefits: - Code structure ready for LLM integration - No refactoring needed when adding LLM - Gradual enhancement without blocking current work

Living Document Task: Add Phase 4.X task to implement LLM integration (estimated: 1 week, future sprint)

Phase 4 Implementation Tasks

Deliverables: - Database-powered XMLTV generator with 3-tier support (KID/BILLY/GOAT) - Three-scenario handling (matched/parsed-unmatched/unparseable) - LLM-ready modular parser architecture - EPG data population service - R2 storage integration - Batch EPG file generation (450 files) - Cache invalidation logic - Comprehensive test suite - Updated documentation

4.0: Schema Validation (Current Task) - [ ] Read epg-field-specification.csv to understand all required fields - [ ] Check migration 010 for epg_data table structure - [ ] Validate all CSV-specified fields exist in database - [ ] Create migration 011 if any fields are missing - [ ] Document any schema gaps

4.1: Core Implementation - [ ] Add constants to backend/epgoat/domain/models.py: - SPORT_DEFAULT_DURATIONS dict - EVENT_LOOKAHEAD_DAYS = 3 - EMOJI_PARSED_UNMATCHED = '⚠️' - EMOJI_UNPARSEABLE = '📺' - [ ] Refactor backend/epgoat/domain/xmltv.py: - Create XMLTVGenerator class - Implement generate_for_provider() method - Implement _generate_programmes_for_channel() with three scenarios - Implement _should_include_event() (3-day lookahead + future only) - Implement _format_pre_event_title() with tier logic - Implement _format_live_event_title() with emoji logic - Implement _format_post_event_title() with tier logic - Implement _generate_pre_event_blocks() (midnight start, 2-hour grid) - Implement _generate_post_event_blocks() (4 hours, 2-hour grid) - Implement tier-based field filtering

4.2: Modular Services - [ ] Create backend/epgoat/services/channel_parser.py: - Define ParseResult dataclass - Implement ChannelParser class with fallback chain: 1. _parse_with_regex() 2. _check_llm_cache_exact() 3. _check_llm_cache_similar() 4. [TODO] LLM API call (future) - Add comprehensive TODO comments for LLM integration - [ ] Create backend/epgoat/services/language_detector.py: - Implement detect_language_from_channel_name() - Support es/fr/pt detection - Default to 'en'

4.3: EPG Data Population - [ ] Create backend/epgoat/services/epg_data_generator.py: - Implement EPGDataGenerator class - Method: generate_for_provider(provider_id, target_date) - Method: _create_epg_data() with sports_metadata JSONB - Method: _update_epg_data_if_changed() - [ ] Schedule daily processing workflow (cron/Workers)

4.4: Batch Generation & Storage - [ ] Create backend/epgoat/services/r2_storage.py: - Implement get_r2_client() (S3-compatible) - Implement upload_to_r2(key, content) - Implement download_from_r2(key) - Error handling and retries - [ ] Create backend/epgoat/utilities/generate_epg_files_batch.py: - Loop through 3 tiers × 10 providers × 15 timezones - Generate EPG for each combination - Upload to R2 - Track in epg_file_cache table - Target: Generate all 450 files in <10 minutes

4.5: Cache Invalidation - [ ] Create backend/epgoat/utilities/regenerate_invalidated_epg.py: - Query epg_file_cache WHERE invalidated_at IS NOT NULL - Regenerate each invalidated file - Upload to R2 - Clear invalidated_at timestamp - Log regeneration events - [ ] Test trigger integration (insert into epg_data → invalidates cache)

4.6: Testing - [ ] Unit tests for tier filtering: - Test KID tier includes minimal fields - Test BILLY tier includes descriptions - Test GOAT tier includes all fields - [ ] Unit tests for emoji logic: - Test sport emoji only on GOAT live events - Test ⚠️ emoji on parsed-unmatched (all tiers) - Test 📺 emoji on unparseable (all tiers) - [ ] Unit tests for event filtering: - Test 3-day lookahead window - Test past event exclusion - Test future-only inclusion - [ ] Unit tests for tvg-id priority: - Test provider ID preserved - Test generated ID fallback - [ ] Integration tests: - Test full EPG generation for matched channel - Test full EPG generation for parsed-unmatched channel - Test full EPG generation for unparseable channel - Compare output across three tiers - [ ] Visual testing: - Load generated EPG in TiviMate - Load generated EPG in VLC - Verify pre/post blocks display correctly - Verify emojis render correctly (GOAT tier)

4.7: Documentation - [ ] Update living document with implementation progress - [ ] Create EPG Generation guide: - How to generate EPG files - How tier system works - How to test with IPTV clients - Troubleshooting common issues

Success Criteria: - EPG files validate as correct XMLTV - 450 pre-generated files in R2 (~225MB) - Instant load for 95% of requests - Workers route correctly with <50ms overhead

Phase 5: Migration & Cutover (Week 4)

Deliverables: - ✅ All D1 data migrated to Supabase - ✅ Parallel run complete (both systems) - ✅ Cutover successful - ✅ D1 decommissioned (after 30-day safety period)

Tasks:

5.1: Data Migration - Export all tables from D1 to JSON - Transform to new schema - Import to Supabase - Validate data completeness - Target: 100% data migrated

5.2: Parallel Run - Dual-write to both D1 and Supabase (2 weeks) - Compare outputs for consistency - Monitor performance - Fix discrepancies - Target: <1% difference between systems

5.3: Cutover - Switch reads to Supabase - Monitor for 48 hours - Stop D1 writes - Document rollback procedure - Target: Zero downtime

5.4: Decommission D1 - Keep D1 running for 30 days (safety) - Archive D1 data - Delete Supabase database - Update all documentation

Success Criteria: - 100% data migrated accurately - New system performs better than old - Zero data loss during cutover - Rollback procedure documented and tested

Cost Analysis

Projected Monthly Costs

Users	DB Size	R2	Workers	LLM	APIs	Total
10	50MB	Free	Free	$5	Free	$5
100	150MB	Free	Free	$10	Free	$10
500	400MB	Free	Free	$15	Free	$15
1,000	600MB	$25	Free	$20	$3	$48
5,000	2GB	$25	$5	$50	$10	$90
10,000	5GB	$79	$20	$100	$20	$219

Key Cost Drivers: 1. Supabase: Free until 500MB, then $25/mo (Pro) 2. R2: Free until 10GB, then $0.015/GB/month 3. Workers: Free until 100k req/day, then $5/10M requests 4. Claude Haiku: ~$5-10/month with 70% cache hit rate 5. APIs: Free tiers sufficient until 1000+ users

Cost Savings vs Alternatives: - AWS S3 + CloudFront: Would cost $90/month in egress alone at 1000 users - R2 saves: ~$1,080/year at 1000 users (zero egress) - LLM caching saves: 70% reduction vs no caching

Risk Assessment & Mitigation

Technical Risks

Risk 1: Supabase Free Tier Exhaustion - Probability: Medium (500MB limit) - Impact: High (requires paid plan) - Mitigation: - Monitor DB size daily - Set up alerts at 400MB (80% threshold) - Budget for Pro plan ($25/mo) in advance - Optimize queries to reduce bloat

Risk 2: Migration Data Loss - Probability: Low (with proper testing) - Impact: Critical (lose all data) - Mitigation: - Parallel run for 2 weeks (dual-write) - Keep D1 for 30 days post-cutover - Export full D1 backup before migration - Validate 100% data migrated - Document rollback procedure

Risk 3: EPG Generation Timeout - Probability: Low (with optimization) - Impact: Medium (poor UX for custom configs) - Mitigation: - Optimize queries (single query per provider) - Parallel processing where possible - Background generation on purchase - 202 Accepted response with retry-after

Risk 4: LLM API Outage - Probability: Low (Anthropic SLA: 99.9%) - Impact: Medium (can't parse new channels) - Mitigation: - Regex handles 80% (continues working) - LLM cache handles repeat channels - Graceful degradation (use partial regex data) - Retry logic with exponential backoff

Risk 5: R2 Storage Costs Exceed Budget - Probability: Low (10GB free tier) - Impact: Low ($0.15/GB beyond free tier) - Mitigation: - Auto-demotion deletes unused files - Monitor storage daily - Set alerts at 8GB (80% of free tier)

Business Risks

Risk 6: User Resistance to Timezone Selection - Probability: Low (standard feature) - Impact: Low (defaults to America/New_York) - Mitigation: - Smart defaults based on IP geolocation - Clear UI for timezone selection - Support docs explaining timezone feature

Risk 7: Development Timeline Overrun - Probability: Medium (4-week aggressive timeline) - Impact: Medium (delayed launch) - Mitigation: - Phased approach (MVP first, enhancements later) - Weekly milestones with clear deliverables - Buffer time in Phase 5 (1 week contingency)

Success Metrics

Technical KPIs

Parser Performance: - ✅ Regex success rate: >70% - ✅ LLM cache hit rate: >70% - ✅ Overall parse success: >90% - ✅ LLM cost savings: >60%

Matching Performance: - ✅ Event match rate: >85% - ✅ Match confidence average: >0.8 - ✅ Manual override usage: <10%

EPG Generation: - ✅ Pre-generated files: 450 - ✅ Instant load percentage: >95% - ✅ Custom config generation: <5 seconds - ✅ Storage usage: <500MB

Infrastructure: - ✅ Database size: Within free tier (Week 1-8) - ✅ R2 egress cost: $0/month - ✅ Worker latency: <50ms routing - ✅ Uptime: >99.9%

Business KPIs

User Experience: - ✅ EPG load time: <500ms for 95% of users - ✅ Worldwide coverage: 15 timezones - ✅ Support tickets: <5/month related to EPG

Cost Efficiency: - ✅ Cost per user: <$0.10/month (at 1000 users) - ✅ LLM cost savings: >$50/month - ✅ Storage cost savings: >$90/month vs S3

Scalability: - ✅ Support 1000 users at <$50/month - ✅ Support 10,000 users at <$250/month - ✅ Zero manual intervention for scaling

Next Steps After Completion

Phase 2 Enhancements (Month 2-3)

Custom Tier Development: - User-configurable features - Feature toggle UI - Custom config hashing - Target: Launch by Week 8

Advanced Analytics: - Parse success dashboard - Match quality reports - Cost tracking dashboard - Provider quality scorecards

Pattern Learning: - Automated regex generation from LLM successes - Pattern approval workflow - Continuous improvement loops

Phase 3 Scale Optimizations (Month 4-6)

Performance: - Parallel EPG generation - Query optimization (indexes, materialized views) - Caching layers (Redis for hot data)

Additional APIs: - ESPN Deep Integration - Others (Gracenote, Sportradar) - API fallback chains

ML Enhancements: - Match confidence scoring via ML - Anomaly detection (bad matches) - Auto-learn team aliases

Documentation Updates Required

New Documentation: - [ ] Supabase Setup Guide (05-Reference/Database/Supabase-Setup.md) - [ ] R2 Storage Guide (05-Reference/Storage/R2-Setup.md) - [ ] Parser Architecture (03-Architecture/Parser-Architecture.md) - [ ] EPG Generation V2 (04-Guides/EPG-Generation-V2.md) - [ ] Migration Runbook (08-Processes/Supabase-Migration.md)

Updated Documentation: - [ ] System Overview (03-Architecture/System-Overview.md) - [ ] Database Schema (05-Reference/Database/Schema.md) - [ ] Core Principles (cost section) (03-Architecture/Core-Principles.md) - [ ] Quick Start Guide (04-Guides/Quick-Start.md)

ADRs to Create: - [ ] ADR-010: PostgreSQL Migration via Supabase - [ ] ADR-011: Parser-First Workflow - [ ] ADR-012: Hybrid EPG Caching Strategy - [ ] ADR-013: Two-Phase Event Loading

Appendix: Detailed Implementation Plan

For detailed task-by-task implementation with code examples, tests, and verification steps, see:

Documentation/08-Projects/Design-Plans/2025-01-05-architecture-workflow-redesign.md

This operational plan contains: - Exact file paths for every change - Complete code examples - Test-first approach (TDD) - Verification commands with expected output - Commit messages for each step

Note: The detailed plan follows the superpowers:writing-plans format with bite-sized tasks (2-5 minutes each), while this document provides the strategic overview and business context.

Approvals

Technical Lead: ✅ Approved (2025-01-05) CTO: ✅ Approved (2025-01-05) Product: ⏳ Pending Review Budget: ✅ Approved (<$50/month initial cost)

Start Date: 2025-01-06 Target Completion: 2025-02-05 (4 weeks) Review Cadence: Weekly (Mondays, 10 AM)

Status Log: - 2025-01-05: Project initiated, architecture designed - 2025-01-05: Detailed implementation plan created - 2025-01-05: Awaiting final approvals to begin Phase 1 - 2025-01-05: Phase 1 completed (Infrastructure Setup) - 2025-01-05: Phase 2 started (migration 010 written) - 2025-11-06: Phase 3 completed (Parser & Matching Engine) - 2025-11-07: Converted to living document format - 2025-11-07: Ready to apply migration 010 to Supabase