YouChannel: Agent Loops, YouTube Data, and the Cost of Real-Time Learning
A code-led postmortem of YouChannel, an open-source YouTube language-learning product built around bounded agent loops, YouTube Data API reads, Gemini video analysis, Gemini Live practice, Supabase, Fastify, and pg-boss.
YouChannel was my attempt to turn YouTube from passive input into an interactive language-learning workspace.
The basic idea was simple: learners already spend time with videos they care about. They watch interviews, creator channels, lectures, product demos, travel vlogs, comedy clips, language content, and technical talks. The content is authentic. It has emotional context. It is much stickier than a synthetic lesson.
The missing layer is not “more content.” It is a loop that converts one video into study, conversation, memory, and practice.
That is where YouChannel became interesting. It was not one chatbot glued beside a video player. It was a set of bounded agent loops: fetch YouTube data, select material, analyze the video, turn the analysis into learning surfaces, run real-time spoken practice, persist the useful parts, and feed assessment back into future drills.
The Product Loop
The product flow I wanted was less like an import tool and more like a learning flywheel.
Browse live YouTube material
The app reads the learner's playlists on demand. Nothing durable is created until the learner chooses a video.
- State checkpoint: OAuth account, access token, page token
- Code evidence: fetchPlaylistSummaries, fetchPlaylistVideosPage
Turn playlist entries into candidates
The playlist UI filters already-owned videos, private placeholders, duration violations, and quota pressure before submission.
- State checkpoint: candidate list, selected IDs, quota summary
- Code evidence: routes/_layout/playlists.tsx
Create durable work
The service inserts user-owned videos, deduplicates by YouTube ID, creates pending analysis records, and queues pg-boss jobs.
- State checkpoint: videos, video_analyses, pg-boss job
- Code evidence: POST /openapi/analysis
Ask Gemini for a study pack
The worker sends a YouTube URL plus a strict JSON prompt, validates the output, stores usage, and handles retry or refund paths.
- State checkpoint: quota event, model usage, analysis JSON
- Code evidence: apps/jobs/src/workers.ts
Render the study surface
The learning page binds video playback, summary, wiki moments, transcript, and character chat into one workspace.
- State checkpoint: latest completed analysis
- Code evidence: learn/$videoId.tsx, LearningTabs, ChatSidebar
Feed speech back into the system
Live sessions, assessments, and shadowing attempts become new context for future practice instead of disappearing after a call.
- State checkpoint: live messages, assessments, shadowing_attempts
- Code evidence: live.tsx, assessment.ts, practice.ts
The implementation eventually made an important product choice: YouTube is a source of material, not the product database. Durable product state begins when a user selects a video and YouChannel creates user-owned rows, analysis history, quota events, live sessions, assessments, and practice attempts.
Agent Loop Design
The word “agent” can hide too much. In YouChannel, the useful pattern was not one autonomous master process. It was a handful of loops with clear triggers, bounded model calls, durable checkpoints, and recoverable failure states.
- Analysis: Video analysis loop
- Roleplay: Video character roleplay loop
- Live coach: General live practice loop
- Drills: Assessment to shadowing loop
Business Workflow
The production workflow had more bookkeeping than the original demo idea suggested.
Connect YouTube with a read-only OAuth grant
The app builds a Google OAuth URL with youtube.readonly, requests offline access, and stores the refreshable account link. Server functions refresh the token when it is close to expiry.
Fetch playlists passively
The playlist page calls server functions that read current YouTube playlists and playlist items. React Query uses an infinite stale time and disables automatic refetch on focus or reconnect. Refresh is explicit.
Let the user choose videos
The UI maps playlist items into analysis candidates. It filters already-imported videos, private placeholders, and videos longer than the active per-video limit. The review dialog estimates the quota impact before submission.
Submit analysis as an external API call
The frontend calls POST /openapi/analysis on the jobs service with a shared service key. The service deduplicates by YouTube video ID, inserts only new user-owned rows, and enqueues only inserted videos.
Consume quota in the worker
The worker validates the video, parses duration, calls consume_quota, invokes Gemini, saves usage and output, and refunds quota when a terminal worker failure means the paid work did not complete.
Expose the result as learning state
The app reads the latest analysis status and renders the learning page when the JSON is complete. The user sees a video player, structured tabs, and Live chat grounded in the analysis.
Code evidence: the analysis API is a service boundary (youchannel-service/apps/jobs/src/server/routes/openapi.ts)
POST /openapi/analysis uses a service-key pre-handler, validates userId and video payloads, deduplicates by youtubeVideoId, inserts new user-owned videos, and passes only inserted rows to enqueueAnalyses. The response reports requested, unique, inserted, existing, enqueued, skipped, and skip reasons.
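The dedupe-then-enqueue accounting described above can be sketched as a pure function. This is a minimal illustration, not the route's actual code: `VideoInput`, `SubmissionSummary`, `dedupeByYoutubeId`, and `planSubmission` are hypothetical names, and the `ownedIds` set stands in for a database lookup.

```typescript
// Hypothetical payload and summary shapes; the real route's types may differ.
interface VideoInput {
  youtubeVideoId: string;
  title: string;
}

interface SubmissionSummary {
  requested: number;
  unique: number;
  inserted: string[];
  existing: string[];
  enqueued: string[];
}

// Deduplicate by youtubeVideoId, keeping the first occurrence of each ID.
function dedupeByYoutubeId(videos: VideoInput[]): VideoInput[] {
  const seen = new Set<string>();
  const unique: VideoInput[] = [];
  for (const video of videos) {
    if (seen.has(video.youtubeVideoId)) continue;
    seen.add(video.youtubeVideoId);
    unique.push(video);
  }
  return unique;
}

// `ownedIds` is an assumption standing in for "videos this user already owns".
function planSubmission(videos: VideoInput[], ownedIds: Set<string>): SubmissionSummary {
  const unique = dedupeByYoutubeId(videos);
  const inserted = unique
    .filter((video) => !ownedIds.has(video.youtubeVideoId))
    .map((video) => video.youtubeVideoId);
  const existing = unique
    .filter((video) => ownedIds.has(video.youtubeVideoId))
    .map((video) => video.youtubeVideoId);
  return {
    requested: videos.length,
    unique: unique.length,
    inserted,
    existing,
    // Only newly inserted rows are handed to the queue.
    enqueued: inserted.slice(),
  };
}
```

Reporting every count back to the caller lets the UI explain why a submission of three videos produced only one job.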
Code evidence: the worker owns quota and retries (youchannel-service/apps/jobs/src/workers.ts)
The worker calls consume_quota before analysis, persists usage, saves completed, failed, or skipped status, classifies timeout, 429, and 5xx errors as retryable, and calls refund_quota for terminal failures that should not charge the learner.
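A rough sketch of that classification follows. The `WorkerError` shape, `classifyFailure`, and `shouldRefund` are hypothetical names, and treating exhausted retries as refundable is an assumption rather than something the source confirms.

```typescript
type FailureKind = "retryable" | "terminal";

// Hypothetical error shape; the real worker may inspect provider errors differently.
interface WorkerError {
  status?: number; // HTTP status from the provider, if any
  timedOut?: boolean; // true when the call exceeded its deadline
}

// Timeouts, 429, and 5xx are worth retrying; everything else is terminal.
function classifyFailure(error: WorkerError): FailureKind {
  if (error.timedOut === true) return "retryable";
  if (error.status === 429) return "retryable";
  if (error.status !== undefined && error.status >= 500 && error.status <= 599) {
    return "retryable";
  }
  return "terminal";
}

// Assumption: refund when the job is abandoned, either because the error is
// terminal or because the last allowed attempt has failed.
function shouldRefund(error: WorkerError, attempt: number, maxAttempts: number): boolean {
  return classifyFailure(error) === "terminal" || attempt >= maxAttempts;
}
```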
YouTube Data Sources
The current product reads YouTube data through the YouTube Data API v3 and Google OAuth. It does not scrape YouTube pages, and it does not call a YouTube captions endpoint for the transcript.
accounts.google.com/o/oauth2/v2/auth
- Request shape: scope=youtube.readonly, access_type=offline, prompt=consent
- Product payload: The connected YouTube account and refreshable access token.
- Sync pitfall: A connected account is still fragile. Refresh tokens can be missing, revoked, or invalidated outside the app.
- Code path: src/lib/server/youtube.ts
GET /youtube/v3/playlists
- Request shape: part=snippet,contentDetails, mine=true, maxResults=50
- Product payload: Playlist ID, title, description, thumbnails, and page tokens.
- Sync pitfall: Empty accounts must be allowed. Pagination is part of the product path, not a rare edge case.
- Code path: fetchPlaylistSummaries
GET /youtube/v3/playlistItems
- Request shape: part=snippet,contentDetails, playlistId, maxResults<=50, pageToken
- Product payload: Playlist membership, item snippets, and video IDs.
- Sync pitfall: Playlist items do not provide enough reliable duration/status information for quota decisions.
- Code path: fetchPlaylistVideosPage
GET /youtube/v3/videos
- Request shape: part=snippet,contentDetails, id=<up to 50 IDs>
- Product payload: ISO 8601 duration, video snippet, status-ish detail, and thumbnails.
- Sync pitfall: Private, deleted, and unavailable videos can collapse into placeholders or missing details.
- Code path: fetchVideoDetails
chat message with video URL part
- Request shape: type=video URL plus strict JSON prompt
- Product payload: Scene, summary, wiki timeline, characters, voices, languages, transcript.
- Sync pitfall: The transcript-like output is model-generated learning material, not a separate YouTube captions import.
- Code path: apps/jobs/src/workers.ts
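The request shapes listed above can be sketched as plain URL builders. This is an illustration under assumptions: the function names are hypothetical, and `client_id`, `redirect_uri`, `response_type`, and `state` are standard OAuth parameters that the source does not spell out. The 60-second refresh skew is likewise an arbitrary illustrative value.

```typescript
const GOOGLE_AUTH_URL = "https://accounts.google.com/o/oauth2/v2/auth";
const YOUTUBE_API_BASE = "https://www.googleapis.com/youtube/v3";
const READONLY_SCOPE = "https://www.googleapis.com/auth/youtube.readonly";
const MAX_IDS_PER_REQUEST = 50;

// Consent URL with read-only scope, offline access, and a forced consent prompt.
function buildConsentUrl(clientId: string, redirectUri: string, state: string): string {
  const params = new URLSearchParams({
    client_id: clientId,
    redirect_uri: redirectUri,
    response_type: "code",
    scope: READONLY_SCOPE,
    access_type: "offline",
    prompt: "consent",
    state,
  });
  return `${GOOGLE_AUTH_URL}?${params.toString()}`;
}

// Hypothetical helper: refresh when the token expires within `skewMs`.
function needsRefresh(expiresAtMs: number, nowMs: number, skewMs = 60_000): boolean {
  return expiresAtMs - nowMs <= skewMs;
}

// One page of playlist items; pageToken is only set when continuing.
function buildPlaylistItemsUrl(playlistId: string, pageToken?: string): string {
  const params = new URLSearchParams({
    part: "snippet,contentDetails",
    playlistId,
    maxResults: "50",
  });
  if (pageToken) params.set("pageToken", pageToken);
  return `${YOUTUBE_API_BASE}/playlistItems?${params.toString()}`;
}

// videos.list accepts up to 50 IDs per call, so longer lists are chunked.
function buildVideoDetailUrls(videoIds: string[]): string[] {
  const urls: string[] = [];
  for (let start = 0; start < videoIds.length; start += MAX_IDS_PER_REQUEST) {
    const chunk = videoIds.slice(start, start + MAX_IDS_PER_REQUEST);
    const params = new URLSearchParams({
      part: "snippet,contentDetails",
      id: chunk.join(","),
    });
    urls.push(`${YOUTUBE_API_BASE}/videos?${params.toString()}`);
  }
  return urls;
}
```

The builders make no network calls, so the same shapes can be checked in tests without touching YouTube API quota.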
The subtle thing is that YouTube playlist data is not clean product data. It is a moving third-party view. A video can become private. A user can remove it from a playlist. A title can become “Private video.” Durations arrive as ISO 8601 strings. Page tokens are stateful. API quota can be consumed just by browsing. OAuth can fail after the user thinks they are connected.
That is why the final code moved toward passive browsing and explicit selection instead of maintaining a constantly synchronized playlist mirror.
Code evidence: passive playlist reads (youchannel/src/routes/_layout/playlists.tsx)
The playlist page uses query keys for YouTube playlists and playlist items, disables refetch on window focus and reconnect, and relies on explicit refresh. The page maps playlist items into local candidates, checks existing library IDs, detects private placeholders, enforces per-video duration limits, and then calls the analysis endpoint only for selected videos.
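The candidate mapping can be sketched roughly as below. The type names, the skip-reason labels, and the placeholder-title check are assumptions for illustration; the page may detect private placeholders differently.

```typescript
// Hypothetical merged view of a playlist item plus its videos.list details.
interface PlaylistVideo {
  videoId: string;
  title: string;
  duration?: string; // ISO 8601, e.g. "PT12M30S"; may be missing for unavailable videos
}

type SkipReason = "already_owned" | "private_placeholder" | "missing_duration" | "too_long";

interface Candidate {
  videoId: string;
  title: string;
  durationSeconds: number | null;
  skipReason: SkipReason | null; // null means the video is selectable
}

// Assumption: placeholders are recognised by the titles YouTube substitutes.
const PLACEHOLDER_TITLES = new Set(["Private video", "Deleted video"]);

// Parse durations such as "PT1H2M3S" or "P1DT2H" into seconds.
function parseIsoDurationSeconds(iso: string): number | null {
  const match = /^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$/.exec(iso);
  if (!match) return null;
  const [, days, hours, minutes, seconds] = match;
  return (
    Number(days ?? 0) * 86400 +
    Number(hours ?? 0) * 3600 +
    Number(minutes ?? 0) * 60 +
    Number(seconds ?? 0)
  );
}

function toCandidate(item: PlaylistVideo, ownedIds: Set<string>, maxSeconds: number): Candidate {
  const durationSeconds = item.duration ? parseIsoDurationSeconds(item.duration) : null;
  let skipReason: SkipReason | null = null;
  if (ownedIds.has(item.videoId)) skipReason = "already_owned";
  else if (PLACEHOLDER_TITLES.has(item.title)) skipReason = "private_placeholder";
  else if (durationSeconds === null) skipReason = "missing_duration";
  else if (durationSeconds > maxSeconds) skipReason = "too_long";
  return { videoId: item.videoId, title: item.title, durationSeconds, skipReason };
}
```

Keeping the skip reason on the candidate, rather than silently dropping the row, lets the review dialog show why a video cannot be selected.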
From Sync Engine to Passive Fetching
The migration history tells the real story better than a diagram.
e2ccc06, 2086a66
Background sync looked clean on paper
The system tried to keep playlist state mirrored through sync jobs, workers, schedules, and logs.
This created a second source of truth beside YouTube and the learner-owned library.
20250102020000_add_sync_support.sql
The schema exposed YouTube drift
Playlist entries and videos needed active, lost, auth-invalid, removed, unavailable, skipped, and error paths.
Those states were real, but they made the product feel like infrastructure before it felt like learning.
89f612f
The sync worker was deleted
Until then, schedulers, workers, sync routes, scripts, and process config had been part of the app.
The app stopped trying to reconcile the world in the background.
cbc16ec, ddebe2a
Playlist browsing became explicit
Previously, the app had acted like it owned playlist state.
The playlist page reads on demand, paginates, and refreshes only when the user asks.
20260109000000, 20260109010000
Videos became user-owned
Previously, playlist membership had shaped durable video rows.
A selected video belongs to the user. Analysis history is independent of playlist membership.
The earlier “bidirectional sync” idea was attractive in product language: connect a YouTube playlist, keep it aligned, react when items appear or disappear, and make the learning library follow. In practice, the code was telling a different story. Sync needed playlist status, video status, lost/auth-invalid states, job runs, retry windows, sync logs, dedupe rules, and cleanup paths. It also blurred ownership: if the same YouTube video appeared in two playlists, which product object owned the analysis?
The final model is less magical and more reliable:
- read YouTube playlists when the learner opens the page,
- page through playlist items only when needed,
- let the learner choose videos,
- create user-owned video records only after selection,
- keep analysis history independent from playlist membership,
- rely on queue infrastructure for async work instead of custom sync state,
- make refresh an explicit UI action.
That migration was a product decision, not just a refactor. Passive fetching reduced surprise. It also made quota easier to explain: the user is charged for selected analysis work, not for invisible background synchronization.
Video Analysis Prompt Design
The analysis prompt had to produce learning primitives, not prose.
The worker sends Gemini two things in one user message: a video input with the YouTube URL, and a text prompt that demands strict JSON. The model configuration uses low temperature, a large output cap, JSON MIME type, and a JSON schema. After streaming, the worker strips fences, parses JSON, validates the shape, validates timestamps, validates characters, and persists both output and usage.
| Output field | Why it existed | UI or agent use |
|---|---|---|
| scene | Capture the setting and vibe of the video. | Gives chat and Live prompts grounding beyond facts. |
| summarize | Provide a compact 3-6 sentence overview. | The overview tab and quick memory for later prompts. |
| wiki | Create chronological moments with timestamps. | Clickable learning timeline that can seek the video. |
| characters | Identify speakers or roles with traits, topics, language, and voice. | Character directory and Gemini Live roleplay setup. |
| transcript | Preserve original-language segments with timestamps and speakers. | Transcript tab, seeking, and study context. |
This prompt design reflects the product. A generic “summarize this video” output would have been cheaper, but it would not power the learning interface. The system needed timestamps for navigation, speakers for roleplay, language labels for Live, voice choices for audio personality, and transcript segments for precise study.
Code evidence: the prompt is built for interaction (youchannel-service/apps/jobs/src/workers.ts)
ANALYSIS_OUTPUT_SCHEMA requires scene, summarize, wiki, characters, and transcript. The prompt tells the model not to invent facts, to keep wiki timestamps chronological, to keep transcript text in the original spoken language, and to select a TTS voice for each main character. The worker calls Gemini with the video URL as a video part and the prompt as a text part.
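The post-processing side of that contract (strip fences, parse, check required fields, check timestamp order) might look like the sketch below. The function names, the `ParseResult` shape, and the assumption that each wiki entry carries a `timestamp` string in `mm:ss` or `hh:mm:ss` form are all hypothetical.

```typescript
const REQUIRED_FIELDS = ["scene", "summarize", "wiki", "characters", "transcript"] as const;
const FENCE = "`".repeat(3); // a Markdown code fence, built indirectly

type ParseResult =
  | { ok: true; value: Record<string, unknown> }
  | { ok: false; error: string };

// Remove a leading and trailing Markdown fence if the model added one.
function stripFences(raw: string): string {
  const opening = new RegExp("^" + FENCE + "(?:json)?\\s*", "i");
  const closing = new RegExp("\\s*" + FENCE + "$");
  return raw.trim().replace(opening, "").replace(closing, "");
}

// Assumed timestamp format: "mm:ss" or "hh:mm:ss".
function timestampToSeconds(timestamp: string): number | null {
  if (!/^\d{1,2}(?::\d{2}){1,2}$/.test(timestamp)) return null;
  return timestamp
    .split(":")
    .map(Number)
    .reduce((total, part) => total * 60 + part, 0);
}

function parseAnalysis(raw: string): ParseResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(stripFences(raw));
  } catch {
    return { ok: false, error: "invalid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, error: "expected a JSON object" };
  }
  const value = parsed as Record<string, unknown>;
  for (const field of REQUIRED_FIELDS) {
    if (!(field in value)) return { ok: false, error: `missing field: ${field}` };
  }
  const wiki = value.wiki;
  if (!Array.isArray(wiki)) return { ok: false, error: "wiki must be an array" };
  let previous = -1;
  for (const entry of wiki) {
    const timestamp =
      typeof entry === "object" && entry !== null
        ? (entry as { timestamp?: unknown }).timestamp
        : undefined;
    const seconds = typeof timestamp === "string" ? timestampToSeconds(timestamp) : null;
    if (seconds === null) return { ok: false, error: "invalid wiki timestamp" };
    if (seconds < previous) return { ok: false, error: "wiki timestamps out of order" };
    previous = seconds;
  }
  return { ok: true, value };
}
```

Returning a tagged result instead of throwing keeps the failure reason available for the analysis row's failed status.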
Code evidence: token accounting is stored (youchannel-service/apps/jobs/src/workers.ts)
The stream collector records provider usage from the done chunk. updateVideoAnalysis persists that usage on the video_analyses row. The analysis call sets maxTokens: 65536, which is a sign that transcript-heavy output can become large even before counting video input tokens.
The cost lesson was blunt. Video input is not a normal chat prompt. Tokens scale with duration, audio, visual sampling, and the amount of structured output requested. In project testing, a 10-minute video analyzed with a heavier Gemini 3 Pro style setup could land near the low six figures of tokens. The checked-in repo does not include raw provider usage samples, so I would not present that as a benchmark. But the design pressure is visible in the code: quota ledgers, per-video duration limits, usage persistence, retries, refunds, and admin operations all exist because long-video analysis is a metered product, not a free utility call.
Planning estimate: roughly 120k tokens for a Gemini 3 Pro style run with a rich prompt on a 10-minute reference video.
This also affected prompt strategy:
- ask for structured learning fields once, rather than many scattered follow-up calls,
- keep the schema small enough to validate,
- use timestamps as stable UI anchors,
- keep transcripts concise and allow truncation,
- store usage for later cost analysis,
- make quota deduction idempotent around the analysis record,
- refund on terminal worker failure where the product should not charge.
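The last two points, idempotent deduction and refund on terminal failure, can be illustrated with an in-memory ledger keyed by analysis ID. This is a sketch only: `QuotaLedger` is a hypothetical class, the quota unit is left abstract, and the real consume_quota and refund_quota calls run against the database.

```typescript
// In-memory stand-in for a quota ledger; units are deliberately abstract.
class QuotaLedger {
  private remainingUnits: number;
  private charges = new Map<string, number>(); // analysisId -> units charged

  constructor(initialUnits: number) {
    this.remainingUnits = initialUnits;
  }

  // Charging the same analysis twice is a no-op, so a retried job cannot
  // double-charge the learner.
  consume(analysisId: string, units: number): boolean {
    if (this.charges.has(analysisId)) return true;
    if (units > this.remainingUnits) return false;
    this.remainingUnits -= units;
    this.charges.set(analysisId, units);
    return true;
  }

  // Refunding is also idempotent: a second refund returns 0.
  refund(analysisId: string): number {
    const charged = this.charges.get(analysisId);
    if (charged === undefined) return 0;
    this.charges.delete(analysisId);
    this.remainingUnits += charged;
    return charged;
  }

  get remaining(): number {
    return this.remainingUnits;
  }
}
```

Keying both operations on the analysis record is what makes retries safe: the queue can redeliver a job without the ledger drifting.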
The unpleasant tradeoff is that the best product output is often the most expensive output. A rich study pack wants transcript, timeline, roles, and summary. A cheap study pack wants fewer fields. YouChannel never fully resolved that pricing/product tension.
The Learning Interface
The main learning screen was designed around simultaneous context: the learner should not lose the video while reading notes or speaking.
Desktop
Resizable study workspace
The desktop route uses a resizable layout: video in the main area, learning tabs below, and Live chat on the side. The user can keep the video visible while reading summary, wiki, or transcript.
Mobile
Video-first switching
The mobile route keeps the video at the top and uses a compact tab model for Learn and Chat. It is not the same layout squeezed smaller; it is a different interaction shape.
Navigation
Timestamp as control surface
Wiki and transcript entries include timestamp buttons. Clicking them seeks the YouTube player, so generated structure becomes player control.
Conversation
Characters become roles
The chat sidebar reads analysis characters, lets the learner choose a role and voice, and builds a Gemini Live system prompt from that role plus the video context.
Code evidence: learning page layout (youchannel/src/routes/_layout/learn/$videoId.tsx)
The route composes VideoPlayerCard, LearningTabs, and ChatSidebar. It passes a seek handler down into the wiki and transcript surfaces, so analysis timestamps can control the YouTube iframe player.
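A seek handler of this kind can be sketched as follows. `SeekablePlayer` mirrors the `seekTo(seconds, allowSeekAhead)` method of the YouTube IFrame Player API; `parseTimestamp` and `createSeekHandler` are hypothetical names, and the `mm:ss` or `hh:mm:ss` timestamp format is an assumption about the analysis output.

```typescript
// Minimal slice of the YouTube IFrame player that the handler needs.
interface SeekablePlayer {
  seekTo(seconds: number, allowSeekAhead: boolean): void;
}

// Assumed timestamp format: "mm:ss" or "hh:mm:ss".
function parseTimestamp(timestamp: string): number | null {
  if (!/^\d{1,2}(?::\d{2}){1,2}$/.test(timestamp)) return null;
  return timestamp
    .split(":")
    .map(Number)
    .reduce((total, part) => total * 60 + part, 0);
}

// Returns a handler that wiki and transcript rows can call on click.
// The handler reports whether the seek was issued.
function createSeekHandler(player: SeekablePlayer): (timestamp: string) => boolean {
  return (timestamp) => {
    const seconds = parseTimestamp(timestamp);
    if (seconds === null) return false; // ignore malformed model output
    player.seekTo(seconds, true);
    return true;
  };
}
```

Depending only on a one-method interface keeps the tabs testable without loading the iframe.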
Code evidence: character prompt design (youchannel/src/lib/dashboard/learn/components/ChatSidebar.tsx)
buildSystemPrompt tells the model to become the selected character, not a generic assistant. It includes role, description, traits, speaking style, audio profile, video memories, and language preference. It also instructs the model to admit uncertainty when the video context does not support an answer.
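To make the shape of such a prompt concrete, here is a loose sketch. It is not the repository's buildSystemPrompt: the `CharacterProfile` fields, the `buildCharacterPrompt` name, and all wording are assumptions for illustration.

```typescript
// Hypothetical character shape derived from the analysis output.
interface CharacterProfile {
  name: string;
  role: string;
  description: string;
  traits: string[];
  speakingStyle: string;
  language: string;
  memories: string[]; // short facts from the video the character can draw on
}

function buildCharacterPrompt(character: CharacterProfile): string {
  const lines = [
    `You are ${character.name}, ${character.role}. Stay in character; do not act as a generic assistant.`,
    `Description: ${character.description}`,
    `Traits: ${character.traits.join(", ")}`,
    `Speaking style: ${character.speakingStyle}`,
    `Speak in ${character.language} unless the learner asks otherwise.`,
    "What you remember from the video:",
    ...character.memories.map((memory) => `- ${memory}`),
    "If the video context does not support an answer, say you are not sure instead of inventing details.",
  ];
  return lines.join("\n");
}
```

Building the prompt from structured fields, rather than pasting the whole analysis, keeps the roleplay grounded in what the video actually contained.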
The design mistake I would avoid next time: I let too many learning surfaces mature at once. Overview, wiki, transcript, character chat, general Live, assessment, shadowing, and quota all competed for product attention. The better wedge would have been one excellent study screen from one video, then one excellent practice loop from that study screen.
Gemini Live for Real-Time Learning
The Live system was the most promising part emotionally. It was also where the product needed the most care.
The app did not expose the long-lived Gemini API key to the browser. It created a short-lived Live auth token on the server, then the client opened a Gemini Live session with native audio output. The client handled microphone capture, audio chunking, output playback, input and output transcriptions, session resumption, function-call plumbing, and message synchronization.
Profile and preferences enter the room
- Durable state: live_user_profile_versions, chat preferences, device context
The browser opens a Live audio session
- Durable state: ephemeral token, native audio model, resumption handle
The learner speaks, the model responds in audio
- Durable state: input transcription, output transcription, audio chunks
Only useful turns become durable
- Durable state: message queue, synced IDs, sessionStorage guard
The session becomes feedback
- Durable state: CEFR-style assessment, strengths, weaknesses, recommendations
Assessment becomes shadowing practice
- Durable state: shadowing drill, recorded attempt, score, next focus
| Live design part | Implementation detail | Product reason |
|---|---|---|
| Short-lived access | getGeminiToken creates a 30-minute, single-use token from the server API key. | Avoid shipping the provider key to the browser. |
| Audio model | gemini-2.5-flash-native-audio-preview-12-2025 is the Live default in the client hook. | Keep the experience spoken, not text-first. |
| Audio pipeline | Input chunks are sent at 16 kHz; output audio is played at 24 kHz. | Browser microphone behavior needs a dedicated real-time path. |
| Session config | Audio response modality, affective dialog, input/output transcription, optional resumption handle. | The app needs speech, transcripts, continuity, and a more natural conversation shape. |
| Proactivity | proactiveAudio: false in the Live config. | Let the product control openings and turns instead of letting the model interrupt. |
| Persistence | Messages are synced in batches, on disconnect, page hide, and before reconnect. | Streaming partials are noisy. Durable history should store useful turns, not every fragment. |
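
The input half of the audio pipeline can be sketched as a conversion from browser float samples to 16-bit PCM, split into fixed-duration chunks. The function names and the 100 ms chunk size are assumptions, resampling to 16 kHz is assumed to have happened already, and the encoding step before sending is omitted.

```typescript
const INPUT_SAMPLE_RATE = 16_000; // Hz, matching the Live input rate above

// Convert Web Audio float samples in [-1, 1] to signed 16-bit PCM.
function floatTo16BitPcm(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, samples[i] ?? 0));
    // Negative values scale to -32768, positive values to 32767.
    pcm[i] = clamped < 0 ? Math.round(clamped * 0x8000) : Math.round(clamped * 0x7fff);
  }
  return pcm;
}

// Split PCM into chunks of `chunkMs` milliseconds; the last chunk may be shorter.
function splitIntoChunks(pcm: Int16Array, chunkMs = 100): Int16Array[] {
  const samplesPerChunk = Math.floor((INPUT_SAMPLE_RATE * chunkMs) / 1000);
  const chunks: Int16Array[] = [];
  for (let start = 0; start < pcm.length; start += samplesPerChunk) {
    chunks.push(pcm.slice(start, start + samplesPerChunk));
  }
  return chunks;
}
```

At 16 kHz, a 100 ms chunk holds 1,600 samples, which keeps latency low without flooding the socket with tiny frames.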
There were two major Live modes.
Mode 1
Video roleplay
The learner opens a video and talks to a generated role from the analysis. The character prompt carries video context, speaking style, topics, voice, and language.
Mode 2
General language coach
The learner opens the Live page and speaks with a multilingual coach. The system prompt is short, calibrated, and conversation-first.
The general coach had its own role design. It was not meant to be a verbose teacher. The system prompt asks for short one-to-three sentence replies, one engaging question, calibrated difficulty, natural recasts instead of constant grammar lectures, and a willingness to ask for repetition when speech is garbled. The assistant’s job is to keep the learner talking.
The surrounding agents gave the Live coach memory:
| Role | Model behavior | Durable output |
|---|---|---|
| Conversation partner | Real-time Gemini Live voice conversation. | Session messages and resumption handle. |
| Profile builder | Gemini 3 Flash preview formats optional preferences, audio intro, and approximate region into a privacy-bounded profile. | Append-only live profile versions. |
| Session evaluator | After the call, the app asks the Live session to evaluate user utterances and can format the result into strict JSON. | CEFR-style assessment, strengths, weaknesses, and recommended drills. |
| Shadowing scorer | A separate scoring prompt compares a learner’s recorded attempt against a target sentence. | shadowing_attempts with scores and next focus. |
| Drill recommender | Recent assessments and attempts are sampled into future practice suggestions. | A practice queue biased toward weak, recent, or unattempted areas. |
Code evidence: Live session loop (youchannel/src/routes/_layout/live.tsx)
The Live route builds device context, optional profile context, and the base Live system prompt, then connects with a fresh token. It sends an initial greeting after connection, stores resumption handles, retries reconnects with backoff, syncs previous sessions before reconnecting, batches message persistence, evaluates the session on end, and generates a title after the call.
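The reconnect backoff mentioned above might be computed like this. The base delay, cap, and function names are illustrative assumptions, not values taken from the route.

```typescript
// Exponential backoff with a cap: 500 ms, 1 s, 2 s, 4 s, then 8 s at most.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  const exponent = Math.max(0, Math.floor(attempt));
  return Math.min(maxMs, baseMs * 2 ** exponent);
}

// The full delay schedule for a bounded number of reconnect attempts.
function reconnectSchedule(maxAttempts: number): number[] {
  return Array.from({ length: maxAttempts }, (_, attempt) => backoffDelayMs(attempt));
}
```

Bounding both the delay and the attempt count keeps a dropped session from retrying forever while the learner waits.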
Code evidence: Live role and profile design (youchannel/src/lib/dashboard/live/constants.ts, profile.ts, assessment.ts)
The base assistant is a multilingual conversation partner and language coach. Profile generation uses gemini-3-flash-preview and privacy constraints. Assessment uses Live session context and can reformat vague output into strict JSON. Shadowing drills later reuse assessment weaknesses.
The key migration inside Live was also from eager synchronization to passive, batched persistence. Streaming audio and transcripts produce too many intermediate states. The app only needs durable turns, session metadata, assessment output, and practice signals. So the implementation queues message sync, filters empty or streaming fragments, retries failed batches, and flushes on lifecycle events.
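A minimal sketch of that queue, assuming hypothetical names (`LiveMessage`, `MessageSyncQueue`) and leaving the network call to the caller:

```typescript
// Hypothetical message shape; `streaming` marks a partial, still-changing turn.
interface LiveMessage {
  id: string;
  role: "user" | "model";
  text: string;
  streaming: boolean;
}

class MessageSyncQueue {
  private pending = new Map<string, LiveMessage>();
  private synced = new Set<string>();
  private batchSize: number;

  constructor(batchSize = 20) {
    this.batchSize = batchSize;
  }

  // Accept only finished, non-empty turns that have not been synced yet.
  enqueue(message: LiveMessage): boolean {
    if (message.streaming) return false;
    if (message.text.trim() === "") return false;
    if (this.synced.has(message.id)) return false;
    this.pending.set(message.id, message);
    return true;
  }

  // Peek at the next batch without removing it. If the send fails, the
  // same messages are simply returned again on the next call.
  nextBatch(): LiveMessage[] {
    return Array.from(this.pending.values()).slice(0, this.batchSize);
  }

  // Call after the server confirms a batch.
  markSynced(ids: string[]): void {
    for (const id of ids) {
      this.pending.delete(id);
      this.synced.add(id);
    }
  }

  get size(): number {
    return this.pending.size;
  }
}
```

A flush on disconnect or page hide is then a loop over `nextBatch` until `size` reaches zero, with `markSynced` called only after each successful write.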
That pattern mirrors the YouTube migration. The app moved away from “sync everything all the time” and toward “observe the source, then persist product state at deliberate checkpoints.”
Why the Full Product Stopped
The honest reason is not one dramatic failure. The system simply became a serious metered product before the learning habit had enough proof.
| Pressure | What the code revealed | Why it mattered |
|---|---|---|
| YouTube dependency | OAuth refresh, page tokens, private placeholders, unavailable videos, and API quota all became normal paths. | The source material was valuable but not fully under product control. |
| Video analysis cost | The prompt wanted scene, summary, wiki, roles, voices, languages, and transcript. | The best learning artifact was expensive to generate. |
| Live complexity | Real-time audio needed token minting, browser capture, playback, transcriptions, session resumption, profile context, and assessment. | A mediocre voice loop feels worse than a simpler text product. |
| State surface | Videos, analyses, quotas, grants, usage events, sessions, messages, profiles, assessments, and attempts all mattered. | YouChannel became an operating system for learning before it became a focused habit. |
| Business model | Hosted mode requires cost control; BYOK reduces platform cost but adds trust, support, privacy, and UX complexity. | Normal learners do not want to debug provider keys before studying. |
Cloudflare could host the static archive cleanly. A full Cloudflare-native YouChannel would need a deeper rewrite: Workers, Queues, D1 or Hyperdrive, a different job model, secrets and BYOK strategy, and a replacement for the Node process assumptions around pg-boss workers. That is possible, but it is not just “deploy the existing app to Pages.”
So I stopped the full product and kept the useful artifacts: source code, deployment notes, OpenAPI docs, the product archive, and this postmortem.
What I Would Build Instead
If I restarted this idea, I would make the wedge smaller and sharper.
Wedge 1
One URL to one study pack
Skip playlist sync, admin dashboards, and Live voice at first. Paste a YouTube URL, generate one excellent study pack, and measure whether people return to it.
Wedge 2
Shadowing before open conversation
Shadowing is easier to score and explain than free-form voice chat. It gives the learner a clear action and clear feedback.
Wedge 3
Passive data, explicit cost
Read YouTube only when the user asks. Charge quota only when the user selects analysis. Make cost visible before the model call.
Wedge 4
Roles after evidence
Add character roleplay only when the video analysis can reliably identify speakers and useful speaking styles.
The core lesson survived: authentic media plus structured AI can become a powerful learning interface. But the product has to respect cost, third-party instability, and user attention. The real challenge is not connecting YouTube to an LLM. The challenge is designing a loop that a learner can trust enough to repeat.
The static project archive is at youchannel-product.pages.dev. The code remains useful as a product-engineering notebook: a record of how a simple idea turned into agent loops, quota ledgers, passive YouTube reads, Gemini video analysis, real-time speech practice, and eventually a decision to stop before the platform became larger than the learning problem.