Project: Architecture & Workflow Redesign - EPGOAT V2
Current Context
Last Updated: 2025-11-07 [current session]
Current Task: Phase 4 COMPLETE! Ready for Documentation
Breadcrumbs: Phase 4 → Phase 4.6 → Testing COMPLETE
Status: Phase 4 EPG Generation System fully implemented and tested:
- ✅ XMLTVGenerator class (database-powered, three-tier system, three-scenario handling)
- ✅ Language detector service (detect_language_from_channel_name)
- ✅ Channel parser service (LLM-ready with fallback chain)
- ✅ EPG data generator service (populates epg_data table with sports_metadata)
- ✅ R2 storage client (Cloudflare R2 integration with S3-compatible API)
- ✅ Batch EPG generation script (450 files: 3 tiers × 10 providers × 15 timezones)
- ✅ Cache invalidation utility (regenerate_invalidated_epg.py with retry logic)
- ✅ Comprehensive test suite (300+ tests covering all components)
Next Step: Phase 4.7 - Create EPG Generation guide and finalize documentation
Progress Tracker
- [x] Phase 1: Infrastructure Setup (Week 1) ✅ 2025-01-05
- [x] 1.1: Create Supabase project (free tier) ✅ 2025-01-05
- [x] 1.2: Set up R2 bucket ("epgoat-epg-files") ✅ 2025-01-05
- [x] 1.3: Generate API tokens and connection strings ✅ 2025-01-05
- [x] 1.4: Test connectivity from dev environment ✅ 2025-01-05
- [x] 1.5: Document setup in Documentation/05-Reference/ ✅ 2025-01-05
- [x] Phase 2: Database Schema (Week 1-2) ✅ 2025-11-07
- [x] 2.1: Write SQL migration (010_create_new_schema.sql) ✅ 2025-01-05
- [x] 2.2: Apply migration to Supabase dev instance ✅ 2025-11-07
- [x] 2.3: Create rollback migration (DOWN section) ✅ 2025-11-07
- [x] 2.4: Verify all indexes created ✅ 2025-11-07
- [x] 2.5: Test trigger (EPGData insert → invalidate files) ✅ 2025-11-07
- [x] Phase 3: Parser & Matching Engine (Week 2-3) ✅ 2025-11-06
- [x] 3.1: Channel Parser ✅ 2025-11-06
- [x] 3.1.1: Implement regex patterns (NBA, NFL, NHL, MLB, Soccer) ✅ 2025-11-06
- [x] 3.1.2: Integrate Claude Haiku API ✅ 2025-11-06
- [x] 3.1.3: Build LLM cache (exact + similarity matching) ✅ 2025-11-06
- [x] 3.1.4: Test with 100+ real channel names ✅ 2025-11-06
- [x] 3.2: Event Matching ✅ 2025-11-06
- [x] 3.2.1: Build fuzzy team name matching ✅ 2025-11-06
- [x] 3.2.2: Integrate manual override system ✅ 2025-11-06
- [x] 3.2.3: Support confidence scoring ✅ 2025-11-06
- [x] 3.2.4: Test match accuracy ✅ 2025-11-06
- [x] 3.3: Two-Phase Event Loading ✅ 2025-11-06
- [x] 3.3.1: Daily job - Bulk load event IDs (TheSportsDB + ESPN) ✅ 2025-11-06
- [x] 3.3.2: On-match - Fetch full details ✅ 2025-11-06
- [x] 3.3.3: Merge data from both APIs (TheSportsDB preferred) ✅ 2025-11-06
- [in-progress] Phase 4: EPG Generation System (Week 3-4) (ACTIVE)
- [x] Planning Complete (2025-11-07)
- [x] 4.0.1: Define tier system and field specifications ✅ 2025-11-07
- [x] 4.0.2: Design three channel scenarios (matched/parsed-unmatched/unparseable) ✅ 2025-11-07
- [x] 4.0.3: Specify event filtering logic (3-day lookahead, future only) ✅ 2025-11-07
- [x] 4.0.4: Define pre/post event timing (midnight start, 2-hour blocks, 4-hour post) ✅ 2025-11-07
- [x] 4.0.5: Design broadcaster conditional logic ✅ 2025-11-07
- [x] 4.0.6: Create field specification CSV (epg-field-specification.csv) ✅ 2025-11-07
- [x] 4.0.7: Validate epg_data schema against CSV ✅ 2025-11-07
- [x] 4.1: Core Implementation ✅ 2025-11-07
- [x] 4.1.1: Add constants to backend/epgoat/domain/models.py ✅ 2025-11-07
- [x] 4.1.2: Refactor backend/epgoat/domain/xmltv.py with XMLTVGenerator class ✅ 2025-11-07
- [x] 4.1.3: Implement three-scenario handling (matched/parsed-unmatched/unparseable) ✅ 2025-11-07
- [x] 4.1.4: Implement tier-based field filtering ✅ 2025-11-07
- [x] 4.1.5: Implement timezone-aware descriptions ✅ 2025-11-07
- [x] 4.1.6: Add emoji support with conditional logic ✅ 2025-11-07
- [x] 4.2: Modular Services ✅ 2025-11-07
- [x] 4.2.1: Create backend/epgoat/services/channel_parser.py (LLM-ready) ✅ 2025-11-07
- [x] 4.2.2: Create backend/epgoat/services/language_detector.py ✅ 2025-11-07
- [x] 4.2.3: Add TODO for future LLM enhancement ✅ 2025-11-07
- [x] 4.2.4: Update living document with LLM task ✅ 2025-11-07
- [x] 4.3: EPG Data Population ✅ 2025-11-07
- [x] 4.3.1: Create backend/epgoat/services/epg_data_generator.py ✅ 2025-11-07
- [ ] 4.3.2: Schedule daily processing workflow (future task)
- [x] 4.4: Batch Generation & Storage ✅ 2025-11-07
- [x] 4.4.1: Create backend/epgoat/utilities/generate_epg_files_batch.py ✅ 2025-11-07
- [x] 4.4.2: Create backend/epgoat/services/r2_storage.py ✅ 2025-11-07
- [x] 4.4.3: Implement R2 upload for EPG files ✅ 2025-11-07
- [x] 4.5: Cache Invalidation ✅ 2025-11-07
- [x] 4.5.1: Create backend/epgoat/utilities/regenerate_invalidated_epg.py ✅ 2025-11-07
- [ ] 4.5.2: Test trigger integration (ready for integration testing)
- [x] 4.6: Testing ✅ 2025-11-07
- [x] 4.6.1: Write unit tests (tier filtering, emoji logic, event filtering) ✅ 2025-11-07
- [x] 4.6.2: Write integration tests (full EPG generation, scenario comparison) ✅ 2025-11-07
- [ ] 4.6.3: Visual testing with TiviMate/VLC (manual testing phase)
- [ ] 4.7: Documentation
- [ ] 4.7.1: Update living document with implementation details
- [ ] 4.7.2: Create EPG Generation guide
- [ ] Phase 5: Migration & Cutover (Week 4) (NOTE: this phase is unneeded)
- [ ] 5.1: Data Migration
- [ ] 5.1.1: Export all tables from D1 to JSON
- [ ] 5.1.2: Transform to new schema
- [ ] 5.1.3: Import to Supabase
- [ ] 5.1.4: Validate data completeness (target: 100%)
- [ ] 5.2: Parallel Run (2 weeks)
- [ ] 5.2.1: Dual-write to both D1 and Supabase
- [ ] 5.2.2: Compare outputs for consistency
- [ ] 5.2.3: Monitor performance
- [ ] 5.2.4: Fix discrepancies (target: <1% difference)
- [ ] 5.3: Cutover
- [ ] 5.3.1: Switch reads to Supabase
- [ ] 5.3.2: Monitor for 48 hours
- [ ] 5.3.3: Stop D1 writes
- [ ] 5.3.4: Document rollback procedure
- [ ] 5.4: Decommission D1 (after 30-day safety period)
- [ ] 5.4.1: Keep D1 running for 30 days (safety)
- [ ] 5.4.2: Archive D1 data
- [ ] 5.4.3: Delete D1 database
- [ ] 5.4.4: Update all documentation
Tangent Log
Active Tangents
[No active tangents yet]
Completed Tangents
[No completed tangents yet]
Original Plan
Architecture & Workflow Redesign - EPGOAT V2
Status: Active
Last Updated: 2025-01-05
Related Docs: System Overview, Core Principles, TODO-BACKLOG
Project Lead: CTO
Target Completion: 2025-02-05 (4 weeks)
Executive Summary
Complete architectural redesign of EPGOAT's EPG generation system to address scalability, maintainability, and operational challenges. This is a strategic rebuild that replaces the current D1-based system with a PostgreSQL (Supabase) foundation, implements parser-first workflows, and introduces intelligent caching with R2 storage.
Business Impact:
- Enables worldwide launch with 15-timezone support
- Reduces LLM costs by 70-80% through intelligent caching
- Provides production-ready data visibility and debugging tools
- Scales to 10,000+ users at $200-300/month (vs $1,000+ with the current approach)

Technical Transformation:
- Database: Cloudflare D1 (SQLite) → PostgreSQL (Supabase)
- Schema: Flat structure → 3-table parser-first design with JSONB
- Storage: Workers KV → Cloudflare R2 (zero egress costs)
- Caching: Basic → Intelligent hybrid (auto-promotion/demotion)
- Event Data: Single API → Dual API with two-phase loading
Problem Statement
Current Limitations
Database (Cloudflare D1):
- ❌ No visual query tools (no phpMyAdmin, limited DataGrip support)
- ❌ Limited debugging capabilities
- ❌ Poor indexing performance for complex queries
- ❌ Difficult data exploration and analytics

Parser Workflow:
- ❌ Match-first approach loses data when events don't match
- ❌ No storage of parse attempts (can't analyze failures)
- ❌ LLM calls not deduplicated (expensive, redundant)
- ❌ Limited visibility into why channels fail to match

EPG Generation:
- ❌ Single timezone support (America/Chicago only)
- ❌ Generate-on-demand only (no pre-generation)
- ❌ No tier differentiation (kid/billy/goat)
- ❌ Storage strategy undefined for production

Event Data:
- ❌ Single API source (TheSportsDB only)
- ❌ Fetches all details upfront (slow, expensive)
- ❌ No redundancy if the API is unavailable

Cost & Scale:
- ❌ Current approach doesn't scale economically
- ❌ No cost optimization strategies
- ❌ LLM costs grow linearly with channels
Solution Architecture
Core Architectural Decisions
1. Database: PostgreSQL via Supabase
Why Supabase?
- ✅ PostgreSQL 15+ with full JSONB support
- ✅ Built-in GUI (no setup needed)
- ✅ REST API auto-generated from the schema
- ✅ Row-level security for multi-tenancy
- ✅ Real-time subscriptions for the admin dashboard
- ✅ Free tier: 500MB DB (plenty for current scale)
- ✅ DataGrip compatibility (full SQL IDE support)

Migration Impact:
- Minimal downtime (parallel run strategy)
- Rollback capability (keep D1 for 30 days)
- Improved query performance (PostgreSQL optimizer)
- Better analytics capabilities (window functions, CTEs)
2. Schema: Three-Table Core with JSONB
channel_names (one per unique channel/provider)
    ↓
parsed_data (all parse attempts, JSONB for flexibility)
    ↓
epg_data (final output-ready EPG entries)

Supporting tables:
- llm_parse_cache (LLM deduplication)
- epg_file_cache (R2 file tracking)
Key Benefits:
- Clear separation: parse → match → output
- JSONB flexibility (add fields without migrations)
- Full audit trail (every parse attempt stored)
- Analytics-ready (query parse success rates, confidence trends)
3. Storage: Cloudflare R2
Why R2?
- ✅ Zero egress fees (unlimited downloads at no cost)
- ✅ 10GB free tier (20x current needs)
- ✅ Perfect Workers integration
- ✅ Global CDN built-in
- ✅ S3-compatible API
Cost Comparison (at 1,000 users, 1M EPG downloads/month):

| Service | Storage Cost | Egress Cost | Total |
|---------|--------------|-------------|-------|
| Cloudflare R2 | $0.15 | $0 | $0.15/mo |
| AWS S3 + CloudFront | $0.23 | $90 | $90.23/mo |
Savings: $1,080/year at just 1000 users
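Because R2 exposes an S3-compatible API, the storage client can simply be boto3 pointed at the account's R2 endpoint. A minimal sketch follows; the environment variable names and the default bucket argument are placeholders, not necessarily what backend/epgoat/services/r2_storage.py uses.

```python
import os
import boto3


def get_r2_client():
    """S3-compatible client for Cloudflare R2 (env var names are placeholders)."""
    account_id = os.environ["R2_ACCOUNT_ID"]
    return boto3.client(
        "s3",
        endpoint_url=f"https://{account_id}.r2.cloudflarestorage.com",
        aws_access_key_id=os.environ["R2_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["R2_SECRET_ACCESS_KEY"],
        region_name="auto",
    )


def upload_to_r2(key: str, content: bytes, bucket: str = "epgoat-epg-files"):
    """Upload a generated EPG file; content type assumes XMLTV output."""
    get_r2_client().put_object(
        Bucket=bucket, Key=key, Body=content, ContentType="application/xml"
    )
```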
4. EPG Strategy: Hybrid Caching
Pre-Generated Files (Tier 1 - 95% of users):
- 3 tiers (kid/billy/goat) × 10 providers × 15 timezones = 450 files
- Storage: ~225MB
- Generation: when EPGData changes (not on every request)
- Instant load for the vast majority of users

On-Demand + Cache (Tier 2 - 5% of users):
- Custom configurations
- Rare timezones
- First load: 4-20 seconds, then cached for 24h
- Acceptable for edge cases
Auto-Promotion (Tier 3 - Intelligence Layer):

-- Promote to pre-generated if popular
IF accessed 3+ times in 24h:
    → Move to Tier 1 (pre-generated)

-- Demote if unused
IF unused for 7 days:
    → Move back to Tier 2 (on-demand)
    → Delete cached file
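A minimal sketch of how this promotion/demotion pass could run against the epg_file_cache table defined later in this document, assuming psycopg2. Note that access_count in the schema is cumulative; a production version would track a rolling 24-hour window and would also delete the demoted object from R2.

```python
from datetime import datetime, timedelta, timezone

import psycopg2


def apply_cache_promotions(conn):
    """One pass of the Tier 3 rules: promote hot files, demote stale ones."""
    now = datetime.now(timezone.utc)
    with conn.cursor() as cur:
        # Promote: accessed 3+ times within the last 24 hours
        cur.execute(
            """UPDATE epg_file_cache
               SET promotion_status = 'pre_generated'
               WHERE promotion_status = 'on_demand'
                 AND access_count >= 3
                 AND last_accessed_at >= %s""",
            (now - timedelta(hours=24),),
        )
        # Demote: not accessed for 7 days (R2 object cleanup happens separately)
        cur.execute(
            """UPDATE epg_file_cache
               SET promotion_status = 'on_demand'
               WHERE promotion_status = 'pre_generated'
                 AND last_accessed_at < %s""",
            (now - timedelta(days=7),),
        )
    conn.commit()


# Example usage (placeholder DSN):
# conn = psycopg2.connect(os.environ["SUPABASE_DB_URL"])
# apply_cache_promotions(conn)
```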
15 Pre-Generated Timezones (Worldwide Coverage):
1. America/New_York (US Eastern)
2. America/Chicago (US Central)
3. America/Denver (US Mountain)
4. America/Los_Angeles (US Pacific)
5. America/Sao_Paulo (Brazil/South America)
6. Europe/London (UK/Ireland)
7. Europe/Paris (Western Europe)
8. Europe/Istanbul (Eastern Europe)
9. Europe/Moscow (Russia)
10. Asia/Dubai (Middle East)
11. Asia/Kolkata (India)
12. Asia/Singapore (Southeast Asia)
13. Asia/Tokyo (East Asia)
14. Australia/Sydney (Australia East)
15. Pacific/Auckland (New Zealand)
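For reference, the same list expressed as a Python constant; combined with the 3 tiers and 10 providers it yields the 450 pre-generated combinations (the provider IDs below are placeholders):

```python
from itertools import product

TIERS = ["kid", "billy", "goat"]
PRE_GENERATED_TIMEZONES = [
    "America/New_York", "America/Chicago", "America/Denver", "America/Los_Angeles",
    "America/Sao_Paulo", "Europe/London", "Europe/Paris", "Europe/Istanbul",
    "Europe/Moscow", "Asia/Dubai", "Asia/Kolkata", "Asia/Singapore",
    "Asia/Tokyo", "Australia/Sydney", "Pacific/Auckland",
]

provider_ids = range(1, 11)  # 10 providers (placeholder IDs)
combinations = list(product(TIERS, provider_ids, PRE_GENERATED_TIMEZONES))
assert len(combinations) == 450  # 3 x 10 x 15 pre-generated files
```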
5. Parser Workflow: Parse-First Approach
Old Workflow:
M3U → Parse → Match → Generate EPG (matched only)
Problem: Unmatched channels lose all data, no analytics
New Workflow:
M3U → Parse ALL → Store ParsedData → Match → Generate EPG (ALL)
Benefits:
- ✅ Store every parse attempt (matched or not)
- ✅ Rich analytics (parse success rates, confidence trends)
- ✅ Use extracted data even when there is no event match
- ✅ Debug visibility (why did this channel fail?)
Parser Strategy:
1. Regex first (fast, free, 70-80% coverage)
2. Check LLM cache (exact match)
3. Check LLM cache (similar channels: same teams/time/league)
4. LLM fallback (Claude Haiku) → cache result
LLM Deduplication Example:
Without dedup:
- "NBA 01: Lakers vs Rockets 7:30" → LLM
- "Basketball: Rockets vs Lakers 7:30" → LLM
- "NBA 05: LA vs HOU 7:30" → LLM
= 3 LLM calls ($0.015)

With dedup:
- "NBA 01: Lakers vs Rockets 7:30" → LLM (cached)
- "Basketball: Rockets vs Lakers 7:30" → Cache hit
- "NBA 05: LA vs HOU 7:30" → Cache hit
= 1 LLM call ($0.005)
Savings: 66% cost reduction
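A sketch of the exact-then-similar lookup against the llm_parse_cache table defined later in this document, assuming a psycopg2 cursor. The similarity rule here (identical team set, time, and league) is one plausible reading of the strategy above, not the production matching logic.

```python
def lookup_llm_cache(cur, normalized_name, teams, time_str, league):
    """Return a cached LLM parse for this channel, or None on a miss."""
    # 1. Exact match on the normalized channel name
    cur.execute(
        "SELECT llm_response FROM llm_parse_cache WHERE channel_name_normalized = %s",
        (normalized_name,),
    )
    row = cur.fetchone()
    if row:
        return row[0]

    # 2. Similarity match: same set of teams, same time, same league
    cur.execute(
        """SELECT llm_response FROM llm_parse_cache
           WHERE extracted_teams @> %s AND extracted_teams <@ %s
             AND extracted_time = %s AND extracted_league = %s""",
        (teams, teams, time_str, league),
    )
    row = cur.fetchone()
    return row[0] if row else None
```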
6. Event Data: Two-Phase Loading
Phase 1: Bulk Event ID Ingestion (Daily, 2 AM UTC)
from datetime import date, timedelta

# Download lightweight event lists for the next 7 days
today = date.today()
events = (
    # TheSportsDB: ~1000 events/day
    fetch_thesportsdb_schedule(today, today + timedelta(days=7))
    # ESPN API: ~1000 events/day
    + fetch_espn_schedule(today, today + timedelta(days=7))
)

# Store minimal data for matching
for event in events:
    save_event_id(
        api_id=event.id,
        league=event.league,
        teams=[event.home, event.away],
        datetime=event.datetime,
        details_fetched=False,  # Flag: no full details yet
    )
Phase 2: On-Demand Detail Fetching (During Matching)
# Match channel to event ID (fast, all IDs pre-loaded)
event = match_to_event_id(parsed_channel_data)

if event and not event.details_fetched:
    # Fetch full details from BOTH APIs in parallel
    details = fetch_details_parallel(
        thesportsdb_id=event.thesportsdb_id,
        espn_id=event.espn_id,
    )
    # Merge the best data from both sources
    event = merge_details(event, details)
    event.details_fetched = True
Benefits:
- ✅ Fast matching (all event IDs pre-loaded)
- ✅ API cost savings (details fetched only for matched events: ~5-10%)
- ✅ Redundancy (ESPN backup if TheSportsDB is unavailable)
- ✅ Better data quality (merge the best from both APIs)
API Usage Estimate:
Daily event ID refresh: 2,000 events Γ 2 APIs = 4,000 lightweight calls
On-demand details: 200 matched Γ 2 APIs = 400 detailed calls
Total: ~4,400 calls/day (well within free tier limits)
Database Schema Design
Core Tables
1. channel_names
One record per unique channel per provider
CREATE TABLE channel_names (
id BIGSERIAL PRIMARY KEY,
provider_id INTEGER NOT NULL REFERENCES providers(id),
channel_name TEXT NOT NULL,
channel_name_normalized TEXT NOT NULL,
tvg_id TEXT,
tvg_name TEXT,
tvg_logo TEXT,
group_title TEXT,
stream_url TEXT,
record_status TEXT DEFAULT 'active',
first_seen TIMESTAMPTZ DEFAULT NOW(),
last_seen TIMESTAMPTZ DEFAULT NOW(),
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(provider_id, channel_name_normalized)
);
Purpose: Stable channel registry
Key Features:
- Soft deletes (record_status)
- Deduplication via channel_name_normalized
- Tracks when channels first/last seen
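The schema does not spell out the normalization rules, so the following is only an illustrative, hypothetical normalizer for channel_name_normalized (lower-case, strip punctuation, collapse whitespace); the real implementation may differ.

```python
import re


def normalize_channel_name(name: str) -> str:
    """Hypothetical normalizer used for deduplication keys."""
    name = name.lower()
    name = re.sub(r"[^a-z0-9 ]+", " ", name)   # drop punctuation/symbols
    return re.sub(r"\s+", " ", name).strip()   # collapse whitespace


# normalize_channel_name("NBA 01: Lakers vs. Rockets  7:30") -> "nba 01 lakers vs rockets 7 30"
```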
2. parsed_data
Every parse attempt, matched or not
CREATE TABLE parsed_data (
id BIGSERIAL PRIMARY KEY,
channel_name_id BIGINT NOT NULL REFERENCES channel_names(id),
parse_date DATE NOT NULL,
raw_title TEXT NOT NULL,
-- JSONB for flexible extraction
parsed_data JSONB NOT NULL,
/* Example:
{
"time": "20:00",
"timezone": "ET",
"date": "2025-11-05",
"teams": ["Lakers", "Celtics"],
"league": "NBA",
"sport": "Basketball",
"confidence": {
"time": 0.95,
"teams": 0.90,
"league": 1.0,
"overall": 0.88
}
}
*/
matched_event_id INTEGER REFERENCES events(id),
match_confidence REAL,
match_method TEXT, -- 'regex', 'llm', 'llm_cached', 'manual'
match_reason TEXT,
record_status TEXT DEFAULT 'active',
created_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(channel_name_id, parse_date)
);
CREATE INDEX idx_parsed_data_jsonb ON parsed_data USING GIN (parsed_data);
Purpose: Full parse audit trail
Key Features:
- JSONB flexibility (add fields without migrations)
- GIN index for fast JSONB queries
- Tracks match method (for analytics: how often does LLM vs regex succeed?)
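As an example of the analytics this enables, a JSONB containment query can use the GIN index on parsed_data directly. A small sketch with psycopg2 (the connection string is a placeholder):

```python
import json

import psycopg2

conn = psycopg2.connect("postgresql://user:pass@host:5432/postgres")  # placeholder DSN
with conn.cursor() as cur:
    # The @> containment operator is served by the GIN index on parsed_data
    cur.execute(
        """SELECT raw_title, match_method, match_confidence
           FROM parsed_data
           WHERE parsed_data @> %s::jsonb""",
        (json.dumps({"league": "NBA"}),),
    )
    for raw_title, method, confidence in cur.fetchall():
        print(raw_title, method, confidence)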
3. epg_data
Final output-ready EPG entries
CREATE TABLE epg_data (
id BIGSERIAL PRIMARY KEY,
channel_name_id BIGINT NOT NULL REFERENCES channel_names(id),
parsed_data_id BIGINT REFERENCES parsed_data(id),
-- Core XMLTV fields
title TEXT NOT NULL,
sub_title TEXT,
description TEXT,
start_time TIMESTAMPTZ NOT NULL,
end_time TIMESTAMPTZ NOT NULL,
category TEXT,
language TEXT DEFAULT 'en',
icon_url TEXT,
-- Event characteristics
is_live BOOLEAN DEFAULT false,
is_new BOOLEAN DEFAULT false,
is_repeat BOOLEAN DEFAULT false,
is_premiere BOOLEAN DEFAULT false,
-- Sports-specific (JSONB)
sports_metadata JSONB,
/* {
"venue": "Madison Square Garden",
"home_team": "Lakers",
"away_team": "Celtics",
"home_logo": "https://...",
"league": "NBA",
"broadcaster": "ESPN"
}
*/
-- Tier features (controls rendering)
tier_features JSONB,
/* {
"show_emoji": true,
"show_logos": true,
"show_debug_info": false,
"timezone": "America/Chicago"
}
*/
confidence REAL,
match_status TEXT, -- 'matched', 'partial', 'unmatched', 'manual'
record_status TEXT DEFAULT 'active',
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
Purpose: Ready-to-generate EPG data
Key Features:
- Full XMLTV field support
- Sports metadata in JSONB (flexible per sport)
- Tier features control what renders (emojis, logos, debug)
4. llm_parse_cache
LLM result deduplication
CREATE TABLE llm_parse_cache (
id BIGSERIAL PRIMARY KEY,
channel_name_normalized TEXT NOT NULL UNIQUE,
raw_channel_name TEXT NOT NULL,
-- For similarity matching
extracted_teams TEXT[],
extracted_time TEXT,
extracted_league TEXT,
extracted_sport TEXT,
-- LLM metadata
llm_response JSONB,
llm_cost_cents REAL,
llm_tokens_used INTEGER,
-- Usage tracking
reuse_count INTEGER DEFAULT 0,
last_reused_at TIMESTAMPTZ,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_llm_cache_similarity
ON llm_parse_cache(extracted_teams, extracted_time, extracted_league);
Purpose: Prevent duplicate LLM calls
Key Features:
- Exact match via channel_name_normalized
- Similarity match via extracted_teams + time + league
- Tracks cost savings via reuse_count
5. epg_file_cache
Track EPG files in R2
CREATE TABLE epg_file_cache (
id BIGSERIAL PRIMARY KEY,
provider_id INTEGER NOT NULL REFERENCES providers(id),
tier TEXT NOT NULL, -- 'kid', 'billy', 'goat', 'custom'
timezone TEXT NOT NULL,
custom_config_hash TEXT,
-- R2 storage
r2_key TEXT NOT NULL UNIQUE,
file_size_bytes BIGINT,
-- Generation tracking
generation_status TEXT DEFAULT 'pending',
generation_duration_ms INTEGER,
-- Access tracking
access_count INTEGER DEFAULT 0,
last_accessed_at TIMESTAMPTZ,
promotion_status TEXT DEFAULT 'on_demand', -- or 'pre_generated'
-- Invalidation
invalidated_at TIMESTAMPTZ,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- PostgreSQL does not allow expressions in a table-level UNIQUE constraint,
-- so uniqueness across the nullable custom_config_hash uses a unique expression index:
CREATE UNIQUE INDEX idx_epg_file_cache_unique
ON epg_file_cache (provider_id, tier, timezone, COALESCE(custom_config_hash, ''));
Purpose: Intelligent cache management
Key Features:
- Tracks every EPG file in R2
- access_count enables auto-promotion
- invalidated_at triggers regeneration when data changes
Enhanced events Table
ALTER TABLE events ADD COLUMN espn_id TEXT;
ALTER TABLE events ADD COLUMN home_team TEXT;
ALTER TABLE events ADD COLUMN away_team TEXT;
ALTER TABLE events ADD COLUMN details_fetched BOOLEAN DEFAULT false;
ALTER TABLE events ADD COLUMN thesportsdb_metadata JSONB;
ALTER TABLE events ADD COLUMN espn_metadata JSONB;
Purpose: Two-phase event loading
Key Features:
- details_fetched flag (false = ID only, true = full details)
- Store full API responses in JSONB (for debugging)
- Support both TheSportsDB and ESPN IDs
Implementation Phases
Phase 1: Infrastructure Setup (Week 1)
Deliverables:
- ✅ Supabase PostgreSQL project created
- ✅ Cloudflare R2 bucket configured
- ✅ Environment variables set up
- ✅ Connection tests passing
Tasks:
1. Create Supabase project (free tier)
2. Set up R2 bucket ("epgoat-epg-files")
3. Generate API tokens and connection strings
4. Test connectivity from dev environment
5. Document setup in Documentation/05-Reference/
Success Criteria:
- Can connect to Supabase from Python
- Can upload/download to R2
- All credentials in .env (not committed)
Phase 2: Database Schema (Week 1-2)
Deliverables:
- ✅ All 5 new tables created in Supabase
- ✅ Triggers for auto-invalidation working
- ✅ Migration scripts tested (UP and DOWN)
- ✅ Indexes optimized for query patterns
Tasks:
1. Write SQL migration (010_create_new_schema.sql)
2. Test migration on dev Supabase instance
3. Create rollback migration (DOWN section)
4. Verify all indexes created
5. Test trigger: EPGData insert → invalidate files
Success Criteria:
- Migration runs without errors
- All 5 tables exist with the correct schema
- Triggers fire correctly
- JSONB fields can be queried efficiently
Phase 3: Parser & Matching Engine (Week 2-3)
Deliverables:
- ✅ Channel parser with regex + LLM fallback
- ✅ LLM caching with 70%+ reuse rate
- ✅ Event matching with dual API support
- ✅ Two-phase event loading operational
Tasks:
3.1: Channel Parser
- Implement regex patterns (NBA, NFL, NHL, MLB, Soccer)
- Integrate Claude Haiku API
- Build LLM cache (exact + similarity matching)
- Test with 100+ real channel names
- Target: 80%+ regex success, 70%+ cache hits

3.2: Event Matching
- Build fuzzy team name matching
- Integrate manual override system
- Support confidence scoring
- Test match accuracy
- Target: 85%+ match rate

3.3: Two-Phase Event Loading
- Daily job: bulk load event IDs (TheSportsDB + ESPN)
- On-match: fetch full details
- Merge data from both APIs (TheSportsDB preferred)
- Target: <2 sec per match

Success Criteria:
- Parser handles 90%+ of channels successfully
- LLM cache reduces costs by 70%+
- Event matching achieves an 85%+ success rate
- Details fetch in <2 seconds per event
Phase 4: EPG Generation System (Week 3-4)
Status: Planning Complete (2025-11-07) | Implementation In Progress
Overview
Phase 4 transforms the database-stored EPG data into XMLTV-formatted files that IPTV clients can consume. This phase implements a sophisticated three-tier system (KID/BILLY/GOAT) with intelligent handling of matched, parsed-but-unmatched, and unparseable channels.
Key Design Decisions (2025-11-07 Planning Session):
- Refactor the existing xmltv.py (not a new file) to maintain continuity
- Generate pre/post blocks on the fly (not stored in DB) to save significant space
- Three channel scenarios, each handled differently
- 3-day lookahead window for events (future events only)
- Midnight start for pre-events (00:00 of target_date)
- 4-hour post-event blocks (rounded to the 2-hour grid)
- Keep original channel names during testing (configurable for production)
- Provider tvg-id priority (never change a user's existing IDs)
- LLM-ready architecture with fallback chain (future enhancement)
Field Specification
Reference Document: epg-field-specification.csv (GitHub root)
This CSV defines all 48 EPG fields with tier availability, data sources, XMLTV elements, and example values. Key sections:
- Channel Fields (rows 2-5): ID, display name, icon, category
- Programme Core (rows 6-8): start/stop times, channel reference
- Title Variants (rows 9-11): pre-event, live event, post-event
- Descriptions (rows 13-15): tier-specific with broadcaster conditional
- Visual Elements (rows 16-18): icons, categories, keywords
- Sports Metadata (rows 30-42): venue, teams, logos, broadcaster, league, season, round
- Block Timing (rows 43-45): 2-hour grid alignment rules
- Emoji Config (rows 46-48): sport-specific, GOAT live only
Important CSV Notes:
- Row 13 (K13): GOAT pre-event adds venue + broadcaster conditional
- Row 14 (K14): GOAT live event adds venue + broadcaster conditional
- Row 27 (K27): Language detection from channel names (espanol, español, french, etc.)
Tier System
KID Tier (Minimal):
- ✅ Channel ID, display name
- ✅ Start/stop times for live events
- ✅ Title: event name only
- ❌ No descriptions, icons, emojis, metadata

BILLY Tier (Mid-Level):
- ✅ Everything in KID
- ✅ Pre-event: "Airing today: {title}"
- ✅ Post-event: "Ended: {title}"
- ✅ Basic descriptions
- ✅ Language detection
- ❌ No emojis, icons, detailed metadata

GOAT Tier (Full Features):
- ✅ Everything in BILLY
- ✅ Sport emojis on live events (🏀🏈⚽)
- ✅ Programme icons (team/league logos)
- ✅ Rich descriptions with venue, city, broadcaster
- ✅ Timezone-aware times in titles
- ✅ Categories, keywords
- ✅ Live flags (live/new/premiere)
- ✅ Full sports_metadata JSONB
Three Channel Scenarios
Scenario 1: Matched Channels (✅)
- Condition: epg_data.matched_event_id IS NOT NULL
- Emoji: sport-specific (🏀 for Basketball), GOAT tier live events only
- Title: full canonical team names from the database
- Description: rich with venue, broadcaster, season context
- Pre/Post Blocks: full 2-hour grid from midnight to the event, 4 hours post
- Example: "🏀 Los Angeles Lakers vs Boston Celtics" (GOAT live)

Scenario 2: Parsed-Unmatched Channels (⚠️)
- Condition: parsed_data IS NOT NULL AND matched_event_id IS NULL
- Emoji: ⚠️ warning emoji, ALL tiers (testing only, will be removed in production)
- Title: team names extracted from the channel name (best effort)
- Description: basic, built from extracted data
- Pre/Post Blocks: same structure as matched (mirror behavior)
- Example: "⚠️ Lakers vs Celtics" (testing indicator)

Scenario 3: Unparseable Channels (📺)
- Condition: parsed_data IS NULL OR the parse failed completely
- Emoji: 📺 TV emoji, ALL tiers (testing only, will be removed in production)
- Title: original channel name (no parsing)
- Description: "Programming on this channel"
- Blocks: 24-hour block with the original channel name
- Example: "📺 Some Random Channel 01"

Testing Emojis (⚠️ and 📺):
- Purpose: visual identification during testing/debugging
- Will be removed in production (or made configurable)
- Sport emojis (🏀🏈⚽) remain in production for the GOAT tier
Event Filtering Logic
3-Day Lookahead Window:
# Only include events within the next 3 days
if event.date <= target_date + timedelta(days=3):
    # AND the event hasn't already happened
    if event_datetime > now():
        include_event()
Past Event Handling:
- If the event time < now: ignore completely (don't generate any blocks)
- Rationale: no value in showing past events

No-Date Handling:
- Assume today: if no date was extracted, treat it as target_date
- Best effort: use the time if available and create a single block

Date Trust:
- Always trust parsed dates from channel names
- Don't second-guess the extraction logic
Pre/Post Event Block Timing
Pre-Event Blocks:
# Start at midnight of target_date (NOT the current time)
pre_event_start = target_date.replace(hour=0, minute=0, second=0)

# End at the event start time
pre_event_end = event.start_time

# Fill with 2-hour blocks, aligned to the global grid
# Global grid: 00:00, 02:00, 04:00, 06:00, 08:00, etc.
# The last block may be partial if the event doesn't start on the grid
Example: Event at 19:30 (7:30 PM)
- Block 1: 00:00-02:00 (2 hours)
- Block 2: 02:00-04:00 (2 hours)
- Block 3: 04:00-06:00 (2 hours)
- Block 4: 06:00-08:00 (2 hours)
- Block 5: 08:00-10:00 (2 hours)
- Block 6: 10:00-12:00 (2 hours)
- Block 7: 12:00-14:00 (2 hours)
- Block 8: 14:00-16:00 (2 hours)
- Block 9: 16:00-18:00 (2 hours)
- Block 10: 18:00-19:30 (1.5 hours, partial)
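A minimal sketch of the pre-event fill that reproduces the example above (midnight start, 2-hour grid, partial last block); it assumes both datetimes share the same local timezone:

```python
from datetime import datetime, timedelta


def generate_pre_event_blocks(target_date: datetime, event_start: datetime):
    """Fill midnight-to-event with 2-hour blocks aligned to the global grid."""
    blocks = []
    cursor = target_date.replace(hour=0, minute=0, second=0, microsecond=0)
    while cursor < event_start:
        # Next grid boundary (00:00, 02:00, 04:00, ...), capped at the event start
        block_end = min(cursor + timedelta(hours=2), event_start)
        blocks.append((cursor, block_end))
        cursor = block_end
    return blocks


# For a 19:30 event this yields the ten blocks listed above,
# ending with the partial 18:00-19:30 block.
blocks = generate_pre_event_blocks(
    datetime(2025, 11, 7), datetime(2025, 11, 7, 19, 30)
)
assert len(blocks) == 10
```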
Post-Event Blocks:
# Start at the event end time
post_event_start = event.end_time

# Total duration: 4 hours
post_event_end = event.end_time + timedelta(hours=4)

# Fill with 2-hour blocks (usually 2 blocks)
# Round to the nearest grid boundary
2-Hour Grid Alignment:
- All blocks align to the global 2-hour grid based on the local timezone
- Grid boundaries: 00:00, 02:00, 04:00, 06:00, 08:00, 10:00, 12:00, 14:00, 16:00, 18:00, 20:00, 22:00
- The last block may be partial if the event doesn't align
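A matching sketch for the post-event fill, under the assumption that the first block runs from the event end to the next grid boundary and subsequent blocks follow the grid until the 4-hour window closes:

```python
import math
from datetime import datetime, timedelta


def next_grid_boundary(t: datetime) -> datetime:
    """Next strictly-later 2-hour boundary (00:00, 02:00, ... in local time)."""
    midnight = t.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed_hours = (t - midnight).total_seconds() / 3600
    return midnight + timedelta(hours=2 * (math.floor(elapsed_hours / 2) + 1))


def generate_post_event_blocks(event_end: datetime, total_hours: int = 4):
    """Fill the post-event window with blocks that snap to the 2-hour grid."""
    post_end = event_end + timedelta(hours=total_hours)
    blocks, start = [], event_end
    while start < post_end:
        end = min(next_grid_boundary(start), post_end)
        blocks.append((start, end))
        start = end
    return blocks


# An event ending at 22:00 yields two blocks: 22:00-00:00 and 00:00-02:00.
assert len(generate_post_event_blocks(datetime(2025, 11, 7, 22, 0))) == 2
```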
Broadcaster Conditional Logic
Pre-Event Descriptions (GOAT tier):
if broadcaster:
    desc = f"NBA game showing on this channel and on {broadcaster} at {time_str}"
else:
    desc = "NBA game at Staples Center"
Example with broadcaster: "NBA game showing on this channel and on ESPN at 7:30 PM CT" Example without: "NBA game at Staples Center"
Live Event Descriptions (GOAT tier):
if broadcaster:
    desc = f"Live NBA game from Staples Center featuring Lakers vs Celtics. Broadcast on {broadcaster}"
else:
    desc = "Live NBA game from Staples Center featuring Lakers vs Celtics"
Key Points:
- No "LIVE:" prefix in titles (the emoji serves that purpose)
- Broadcaster is optional (many events don't have one)
- Different wording for pre-event vs live event
Channel Display Names
Testing Phase (Current):
- Keep full original channel names from the M3U
- Example: "NBA 01: Lakers vs Celtics 7:30 PM ET"
- Rationale: full context during debugging

Production Phase (Future):
- Use shell names only (configurable)
- Example: "NBA 01"
- Rationale: cleaner UI, all details in the EPG data
Implementation:
def get_display_name(channel, clean_names: bool = False):
    if clean_names:
        # Production: shell name only
        return channel.tvg_name or f"Channel {channel.id}"
    else:
        # Testing: full original name
        return channel.channel_name
tvg-id Priority Strategy
Priority 1: Use the provider's tvg-id (if it exists)

if channel.tvg_id and channel.tvg_id.strip():
    return channel.tvg_id  # NEVER change this

Priority 2: Generate a consistent ID

else:
    return generate_channel_id(channel.channel_name)
Rationale:
- User compatibility: changing tvg-ids breaks existing user setups
- Provider trust: if the provider specifies an ID, it's intentional
- Fallback only: generate IDs only when the provider doesn't provide them
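generate_channel_id() is referenced above but not defined here; the following is one hypothetical fallback implementation that produces stable, deterministic IDs (slug plus short hash; the ".epgoat" suffix is purely illustrative):

```python
import hashlib
import re


def generate_channel_id(channel_name: str) -> str:
    """Hypothetical fallback: only used when the provider supplies no tvg-id."""
    slug = re.sub(r"[^a-z0-9]+", "-", channel_name.lower()).strip("-")
    digest = hashlib.sha1(channel_name.encode("utf-8")).hexdigest()[:8]
    return f"{slug}.{digest}.epgoat"


# generate_channel_id("NBA 01: Lakers vs Celtics") is stable across runs,
# so regenerated EPG files keep pointing at the same channel ID.
```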
Language Detection
Strategy: Parse channel name for language indicators
def detect_language(channel_name: str) -> str:
    name_lower = channel_name.lower()
    # Spanish
    if any(word in name_lower for word in ['espanol', 'español', 'spanish', 'latino', 'latina']):
        return 'es'
    # French
    if any(word in name_lower for word in ['français', 'francais', 'french']):
        return 'fr'
    # Portuguese
    if any(word in name_lower for word in ['português', 'portugues', 'portuguese', 'brasil', 'brazil']):
        return 'pt'
    # Default
    return 'en'
Tiers: BILLY + GOAT (KID doesn't include language field)
Sport Default Durations
For parsed-unmatched events (no event match), use sport-specific durations:
SPORT_DEFAULT_DURATIONS = {
    'Basketball': 180,  # 3 hours in minutes
    'Football': 210,    # 3.5 hours (NFL)
    'Baseball': 210,    # 3.5 hours (MLB)
    'Hockey': 180,      # 3 hours (NHL)
    'Soccer': 150,      # 2.5 hours
    'default': 180,     # 3 hours fallback
}
Rationale: Without event match, we don't have actual duration, so use typical duration per sport.
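Illustrative usage for a parsed-but-unmatched event, relying on the SPORT_DEFAULT_DURATIONS dict defined above (the function name is an assumption, not necessarily what the generator calls it):

```python
from datetime import datetime, timedelta


def estimate_end_time(start_time: datetime, sport: str) -> datetime:
    """Fall back to the sport's typical duration when no matched event provides one."""
    minutes = SPORT_DEFAULT_DURATIONS.get(sport, SPORT_DEFAULT_DURATIONS['default'])
    return start_time + timedelta(minutes=minutes)


# A parsed-unmatched NBA channel starting at 19:30 gets a 22:30 end time.
assert estimate_end_time(datetime(2025, 11, 7, 19, 30), 'Basketball').hour == 22
```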
LLM-Ready Parser Architecture
Future Enhancement (Phase 4.X):
Design parser with fallback chain for future Claude Haiku integration:
class ChannelParser:
    def parse(self, channel_name: str) -> ParseResult:
        # Step 1: Regex patterns (fast, free, 70-80% coverage)
        result = self._parse_with_regex(channel_name)
        if result.confidence > 0.8:
            return result

        # Step 2: LLM cache (exact match)
        cached = self._check_llm_cache_exact(channel_name)
        if cached:
            return cached

        # Step 3: LLM cache (similarity: same teams/time/league)
        cached = self._check_llm_cache_similar(channel_name)
        if cached and cached.confidence > 0.9:
            return cached

        # Step 4: [FUTURE] Claude Haiku API call
        # TODO: Implement LLM fallback
        # - Call Claude Haiku with prompt
        # - Cache result for future reuse
        # - Track cost and usage

        # Step 5: Return the best available result
        return result if result.confidence > 0.0 else ParseResult.empty()
Benefits:
- Code structure ready for LLM integration
- No refactoring needed when the LLM is added
- Gradual enhancement without blocking current work
Living Document Task: Add Phase 4.X task to implement LLM integration (estimated: 1 week, future sprint)
Phase 4 Implementation Tasks
Deliverables:
- Database-powered XMLTV generator with 3-tier support (KID/BILLY/GOAT)
- Three-scenario handling (matched/parsed-unmatched/unparseable)
- LLM-ready modular parser architecture
- EPG data population service
- R2 storage integration
- Batch EPG file generation (450 files)
- Cache invalidation logic
- Comprehensive test suite
- Updated documentation

4.0: Schema Validation (Current Task)
- [ ] Read epg-field-specification.csv to understand all required fields
- [ ] Check migration 010 for the epg_data table structure
- [ ] Validate that all CSV-specified fields exist in the database
- [ ] Create migration 011 if any fields are missing
- [ ] Document any schema gaps
4.1: Core Implementation
- [ ] Add constants to backend/epgoat/domain/models.py:
- SPORT_DEFAULT_DURATIONS dict
- EVENT_LOOKAHEAD_DAYS = 3
- EMOJI_PARSED_UNMATCHED = '⚠️'
- EMOJI_UNPARSEABLE = '📺'
- [ ] Refactor backend/epgoat/domain/xmltv.py:
- Create XMLTVGenerator class
- Implement generate_for_provider() method
- Implement _generate_programmes_for_channel() with three scenarios
- Implement _should_include_event() (3-day lookahead + future only)
- Implement _format_pre_event_title() with tier logic
- Implement _format_live_event_title() with emoji logic
- Implement _format_post_event_title() with tier logic
- Implement _generate_pre_event_blocks() (midnight start, 2-hour grid)
- Implement _generate_post_event_blocks() (4 hours, 2-hour grid)
- Implement tier-based field filtering
4.2: Modular Services
- [ ] Create backend/epgoat/services/channel_parser.py:
- Define ParseResult dataclass
- Implement ChannelParser class with fallback chain:
1. _parse_with_regex()
2. _check_llm_cache_exact()
3. _check_llm_cache_similar()
4. [TODO] LLM API call (future)
- Add comprehensive TODO comments for LLM integration
- [ ] Create backend/epgoat/services/language_detector.py:
- Implement detect_language_from_channel_name()
- Support es/fr/pt detection
- Default to 'en'
4.3: EPG Data Population
- [ ] Create backend/epgoat/services/epg_data_generator.py:
- Implement EPGDataGenerator class
- Method: generate_for_provider(provider_id, target_date)
- Method: _create_epg_data() with sports_metadata JSONB
- Method: _update_epg_data_if_changed()
- [ ] Schedule daily processing workflow (cron/Workers)
4.4: Batch Generation & Storage
- [ ] Create backend/epgoat/services/r2_storage.py:
- Implement get_r2_client() (S3-compatible)
- Implement upload_to_r2(key, content)
- Implement download_from_r2(key)
- Error handling and retries
- [ ] Create backend/epgoat/utilities/generate_epg_files_batch.py:
- Loop through 3 tiers × 10 providers × 15 timezones
- Generate EPG for each combination
- Upload to R2
- Track in epg_file_cache table
- Target: Generate all 450 files in <10 minutes
4.5: Cache Invalidation
- [ ] Create backend/epgoat/utilities/regenerate_invalidated_epg.py:
- Query epg_file_cache WHERE invalidated_at IS NOT NULL
- Regenerate each invalidated file
- Upload to R2
- Clear invalidated_at timestamp
- Log regeneration events
- [ ] Test trigger integration (insert into epg_data → invalidates the cache); see the regeneration sketch below
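A rough sketch of that regeneration loop, assuming psycopg2 and hypothetical generate_epg_xml()/upload_to_r2() helpers (the real utility also retries failed uploads, per the Current Context section):

```python
import psycopg2


def regenerate_invalidated_files(conn):
    """Regenerate and re-upload every EPG file flagged by the invalidation trigger."""
    with conn.cursor() as cur:
        cur.execute(
            """SELECT id, provider_id, tier, timezone, r2_key
               FROM epg_file_cache
               WHERE invalidated_at IS NOT NULL"""
        )
        for file_id, provider_id, tier, tz, r2_key in cur.fetchall():
            xml = generate_epg_xml(provider_id, tier, tz)   # hypothetical helper
            upload_to_r2(r2_key, xml.encode("utf-8"))        # hypothetical helper
            cur.execute(
                "UPDATE epg_file_cache SET invalidated_at = NULL, updated_at = NOW() WHERE id = %s",
                (file_id,),
            )
    conn.commit()
```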
4.6: Testing
- [ ] Unit tests for tier filtering:
  - Test that the KID tier includes minimal fields
  - Test that the BILLY tier includes descriptions
  - Test that the GOAT tier includes all fields
- [ ] Unit tests for emoji logic:
  - Test sport emoji only on GOAT live events
  - Test ⚠️ emoji on parsed-unmatched (all tiers)
  - Test 📺 emoji on unparseable (all tiers)
- [ ] Unit tests for event filtering:
  - Test the 3-day lookahead window
  - Test past-event exclusion
  - Test future-only inclusion
- [ ] Unit tests for tvg-id priority:
  - Test that the provider ID is preserved
  - Test the generated-ID fallback
- [ ] Integration tests:
  - Test full EPG generation for a matched channel
  - Test full EPG generation for a parsed-unmatched channel
  - Test full EPG generation for an unparseable channel
  - Compare output across the three tiers
- [ ] Visual testing:
  - Load the generated EPG in TiviMate
  - Load the generated EPG in VLC
  - Verify pre/post blocks display correctly
  - Verify emojis render correctly (GOAT tier)
4.7: Documentation
- [ ] Update living document with implementation progress
- [ ] Create EPG Generation guide:
  - How to generate EPG files
  - How the tier system works
  - How to test with IPTV clients
  - Troubleshooting common issues
Success Criteria:
- EPG files validate as correct XMLTV
- 450 pre-generated files in R2 (~225MB)
- Instant load for 95% of requests
- Workers route correctly with <50ms overhead
Phase 5: Migration & Cutover (Week 4)
Deliverables:
- ✅ All D1 data migrated to Supabase
- ✅ Parallel run complete (both systems)
- ✅ Cutover successful
- ✅ D1 decommissioned (after 30-day safety period)
Tasks:
5.1: Data Migration
- Export all tables from D1 to JSON
- Transform to the new schema
- Import to Supabase
- Validate data completeness
- Target: 100% of data migrated

5.2: Parallel Run
- Dual-write to both D1 and Supabase (2 weeks)
- Compare outputs for consistency
- Monitor performance
- Fix discrepancies
- Target: <1% difference between systems

5.3: Cutover
- Switch reads to Supabase
- Monitor for 48 hours
- Stop D1 writes
- Document rollback procedure
- Target: zero downtime

5.4: Decommission D1
- Keep D1 running for 30 days (safety)
- Archive D1 data
- Delete the D1 database
- Update all documentation
Success Criteria:
- 100% of data migrated accurately
- New system performs better than the old one
- Zero data loss during cutover
- Rollback procedure documented and tested
Cost Analysis
Projected Monthly Costs
| Users | DB Size | R2 | Workers | LLM | APIs | Total |
|---|---|---|---|---|---|---|
| 10 | 50MB | Free | Free | $5 | Free | $5 |
| 100 | 150MB | Free | Free | $10 | Free | $10 |
| 500 | 400MB | Free | Free | $15 | Free | $15 |
| 1,000 | 600MB | $25 | Free | $20 | $3 | $48 |
| 5,000 | 2GB | $25 | $5 | $50 | $10 | $90 |
| 10,000 | 5GB | $79 | $20 | $100 | $20 | $219 |
Key Cost Drivers:
1. Supabase: free until 500MB, then $25/mo (Pro)
2. R2: free until 10GB, then $0.015/GB/month
3. Workers: free until 100k req/day, then $5 per 10M requests
4. Claude Haiku: ~$5-10/month with a 70% cache hit rate
5. APIs: free tiers sufficient until 1,000+ users

Cost Savings vs Alternatives:
- AWS S3 + CloudFront: would cost $90/month in egress alone at 1,000 users
- R2 saves ~$1,080/year at 1,000 users (zero egress)
- LLM caching cuts LLM spend by 70% vs no caching
Risk Assessment & Mitigation
Technical Risks
Risk 1: Supabase Free Tier Exhaustion
- Probability: Medium (500MB limit)
- Impact: High (requires a paid plan)
- Mitigation:
  - Monitor DB size daily
  - Set up alerts at 400MB (80% threshold)
  - Budget for the Pro plan ($25/mo) in advance
  - Optimize queries to reduce bloat

Risk 2: Migration Data Loss
- Probability: Low (with proper testing)
- Impact: Critical (lose all data)
- Mitigation:
  - Parallel run for 2 weeks (dual-write)
  - Keep D1 for 30 days post-cutover
  - Export a full D1 backup before migration
  - Validate that 100% of data migrated
  - Document the rollback procedure

Risk 3: EPG Generation Timeout
- Probability: Low (with optimization)
- Impact: Medium (poor UX for custom configs)
- Mitigation:
  - Optimize queries (single query per provider)
  - Parallel processing where possible
  - Background generation on purchase
  - 202 Accepted response with retry-after

Risk 4: LLM API Outage
- Probability: Low (Anthropic SLA: 99.9%)
- Impact: Medium (can't parse new channels)
- Mitigation:
  - Regex handles 80% (continues working)
  - LLM cache handles repeat channels
  - Graceful degradation (use partial regex data)
  - Retry logic with exponential backoff

Risk 5: R2 Storage Costs Exceed Budget
- Probability: Low (10GB free tier)
- Impact: Low ($0.015/GB beyond the free tier)
- Mitigation:
  - Auto-demotion deletes unused files
  - Monitor storage daily
  - Set alerts at 8GB (80% of the free tier)
Business Risks
Risk 6: User Resistance to Timezone Selection
- Probability: Low (standard feature)
- Impact: Low (defaults to America/New_York)
- Mitigation:
  - Smart defaults based on IP geolocation
  - Clear UI for timezone selection
  - Support docs explaining the timezone feature

Risk 7: Development Timeline Overrun
- Probability: Medium (aggressive 4-week timeline)
- Impact: Medium (delayed launch)
- Mitigation:
  - Phased approach (MVP first, enhancements later)
  - Weekly milestones with clear deliverables
  - Buffer time in Phase 5 (1-week contingency)
Success Metrics
Technical KPIs
Parser Performance:
- ✅ Regex success rate: >70%
- ✅ LLM cache hit rate: >70%
- ✅ Overall parse success: >90%
- ✅ LLM cost savings: >60%

Matching Performance:
- ✅ Event match rate: >85%
- ✅ Average match confidence: >0.8
- ✅ Manual override usage: <10%

EPG Generation:
- ✅ Pre-generated files: 450
- ✅ Instant load percentage: >95%
- ✅ Custom config generation: <5 seconds
- ✅ Storage usage: <500MB

Infrastructure:
- ✅ Database size: within free tier (Weeks 1-8)
- ✅ R2 egress cost: $0/month
- ✅ Worker latency: <50ms routing
- ✅ Uptime: >99.9%
Business KPIs
User Experience:
- ✅ EPG load time: <500ms for 95% of users
- ✅ Worldwide coverage: 15 timezones
- ✅ Support tickets: <5/month related to EPG

Cost Efficiency:
- ✅ Cost per user: <$0.10/month (at 1,000 users)
- ✅ LLM cost savings: >$50/month
- ✅ Storage cost savings: >$90/month vs S3

Scalability:
- ✅ Support 1,000 users at <$50/month
- ✅ Support 10,000 users at <$250/month
- ✅ Zero manual intervention for scaling
Next Steps After Completion
Phase 2 Enhancements (Month 2-3)
Custom Tier Development:
- User-configurable features
- Feature toggle UI
- Custom config hashing
- Target: launch by Week 8

Advanced Analytics:
- Parse success dashboard
- Match quality reports
- Cost tracking dashboard
- Provider quality scorecards

Pattern Learning:
- Automated regex generation from LLM successes
- Pattern approval workflow
- Continuous improvement loops
Phase 3 Scale Optimizations (Month 4-6)
Performance:
- Parallel EPG generation
- Query optimization (indexes, materialized views)
- Caching layers (Redis for hot data)

Additional APIs:
- ESPN deep integration
- Others (Gracenote, Sportradar)
- API fallback chains

ML Enhancements:
- Match confidence scoring via ML
- Anomaly detection (bad matches)
- Auto-learned team aliases
Documentation Updates Required
New Documentation:
- [ ] Supabase Setup Guide (05-Reference/Database/Supabase-Setup.md)
- [ ] R2 Storage Guide (05-Reference/Storage/R2-Setup.md)
- [ ] Parser Architecture (03-Architecture/Parser-Architecture.md)
- [ ] EPG Generation V2 (04-Guides/EPG-Generation-V2.md)
- [ ] Migration Runbook (08-Processes/Supabase-Migration.md)
Updated Documentation:
- [ ] System Overview (03-Architecture/System-Overview.md)
- [ ] Database Schema (05-Reference/Database/Schema.md)
- [ ] Core Principles (cost section) (03-Architecture/Core-Principles.md)
- [ ] Quick Start Guide (04-Guides/Quick-Start.md)
ADRs to Create:
- [ ] ADR-010: PostgreSQL Migration via Supabase
- [ ] ADR-011: Parser-First Workflow
- [ ] ADR-012: Hybrid EPG Caching Strategy
- [ ] ADR-013: Two-Phase Event Loading
Appendix: Detailed Implementation Plan
For detailed task-by-task implementation with code examples, tests, and verification steps, see:
Documentation/08-Projects/Design-Plans/2025-01-05-architecture-workflow-redesign.md
This operational plan contains:
- Exact file paths for every change
- Complete code examples
- Test-first approach (TDD)
- Verification commands with expected output
- Commit messages for each step
Note: The detailed plan follows the superpowers:writing-plans format with bite-sized tasks (2-5 minutes each), while this document provides the strategic overview and business context.
Approvals
Technical Lead: ✅ Approved (2025-01-05)
CTO: ✅ Approved (2025-01-05)
Product: ⏳ Pending Review
Budget: ✅ Approved (<$50/month initial cost)
Start Date: 2025-01-06 Target Completion: 2025-02-05 (4 weeks) Review Cadence: Weekly (Mondays, 10 AM)
Status Log:
- 2025-01-05: Project initiated, architecture designed
- 2025-01-05: Detailed implementation plan created
- 2025-01-05: Awaiting final approvals to begin Phase 1
- 2025-01-05: Phase 1 completed (Infrastructure Setup)
- 2025-01-05: Phase 2 started (migration 010 written)
- 2025-11-06: Phase 3 completed (Parser & Matching Engine)
- 2025-11-07: Converted to living document format
- 2025-11-07: Ready to apply migration 010 to Supabase