YouChannel: Agent Loops, YouTube Data, and the Cost of Real-Time Learning

YouChannel was my attempt to turn YouTube from passive input into an interactive language-learning workspace.

The basic idea was simple: learners already spend time with videos they care about. They watch interviews, creator channels, lectures, product demos, travel vlogs, comedy clips, language content, and technical talks. The content is authentic. It has emotional context. It is much stickier than a synthetic lesson.

The missing layer is not “more content.” It is a loop that converts one video into study, conversation, memory, and practice.

That is where YouChannel became interesting. It was not one chatbot glued beside a video player. It was a set of bounded agent loops: fetch YouTube data, select material, analyze the video, turn the analysis into learning surfaces, run real-time spoken practice, persist the useful parts, and feed assessment back into future drills.

Primary loop: YouTube to study to speech

User app: React + TanStack Start

State layer: Supabase Postgres

Async backend: Fastify + pg-boss

The Product Loop

The product flow I wanted was less like an import tool and more like a learning flywheel.

Browse live YouTube material

The app reads the learner's playlists on demand. Nothing durable is created until the learner chooses a video.

State checkpoint: OAuth account, access token, page token
Code evidence: fetchPlaylistSummaries, fetchPlaylistVideosPage

The implementation eventually made an important product choice: YouTube is a source of material, not the product database. Durable product state begins when a user selects a video and YouChannel creates user-owned rows, analysis history, quota events, live sessions, assessments, and practice attempts.

Agent Loop Design

The word “agent” can hide too much. In YouChannel, the useful pattern was not one autonomous master process. It was a handful of loops with clear triggers, bounded model calls, durable checkpoints, and recoverable failure states.

Analysis: video analysis loop

Trigger: Selected playlist videos

Model call: Gemini video input plus strict JSON prompt

Checkpoint: video_analyses, usage, quota events

User surface: Study pack, wiki timeline, transcript, roles

Business Workflow

The production workflow had more bookkeeping than the original demo idea suggested.

Connect YouTube with a read-only OAuth grant

The app builds a Google OAuth URL with youtube.readonly, requests offline access, and stores the refreshable account link. Server functions refresh the token when it is close to expiry.

Fetch playlists passively

The playlist page calls server functions that read current YouTube playlists and playlist items. React Query uses an infinite stale time and disables automatic refetch on focus or reconnect. Refresh is explicit.

Let the user choose videos

The UI maps playlist items into analysis candidates. It filters out already-imported videos, private placeholders, and videos longer than the active per-video limit. The review dialog estimates the quota impact before submission.

Submit analysis as an external API call

The frontend calls POST /openapi/analysis on the jobs service with a shared service key. The service deduplicates by YouTube video ID, inserts only new user-owned rows, and enqueues only inserted videos.

Consume quota in the worker

The worker validates the video, parses duration, calls consume_quota, invokes Gemini, saves usage and output, and refunds quota when a terminal worker failure means the paid work did not complete.

Expose the result as learning state

The app reads the latest analysis status and renders the learning page when the JSON is complete. The user sees a video player, structured tabs, and Live chat grounded in the analysis.

Code evidence: the analysis API is a service boundary (youchannel-service/apps/jobs/src/server/routes/openapi.ts)

POST /openapi/analysis uses a service-key pre-handler, validates userId and video payloads, deduplicates by youtubeVideoId, inserts new user-owned videos, and passes only inserted rows to enqueueAnalyses. The response reports requested, unique, inserted, existing, enqueued, skipped, and skip reasons.
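The deduplicate-then-insert step can be sketched as a pure function. This is a minimal illustration, not the route's actual code: `planAnalysisSubmission`, the `VideoInput` shape, and the in-memory `existingIds` set are assumptions standing in for the database lookup, and the skipped counter and skip reasons are left out because their exact rules are not described here.

```typescript
// Hypothetical input shape; only youtubeVideoId matters for deduplication.
interface VideoInput {
  youtubeVideoId: string;
  title: string;
}

interface SubmissionReport {
  requested: number;
  unique: number;
  inserted: number;
  existing: number;
  enqueued: number;
}

// existingIds stands in for "videos this user already owns" (an assumption;
// the real service would look this up in Postgres).
function planAnalysisSubmission(
  videos: VideoInput[],
  existingIds: Set<string>,
): { report: SubmissionReport; toInsert: VideoInput[] } {
  const seen = new Set<string>();
  const toInsert: VideoInput[] = [];
  let existing = 0;

  for (const video of videos) {
    if (seen.has(video.youtubeVideoId)) continue; // duplicate inside the request
    seen.add(video.youtubeVideoId);
    if (existingIds.has(video.youtubeVideoId)) {
      existing += 1; // already in the library: neither inserted nor enqueued
    } else {
      toInsert.push(video);
    }
  }

  return {
    toInsert,
    report: {
      requested: videos.length,
      unique: seen.size,
      inserted: toInsert.length,
      existing,
      enqueued: toInsert.length, // only newly inserted rows reach the queue
    },
  };
}
```

Keeping `enqueued` tied to `inserted` is what makes a repeated submission of the same selection cheap: the second call reports the videos as existing and puts nothing new on the queue.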

Code evidence: the worker owns quota and retries (youchannel-service/apps/jobs/src/workers.ts)

The worker calls consume_quota before analysis, persists usage, saves completed, failed, or skipped status, classifies timeout, 429, and 5xx errors as retryable, and calls refund_quota for terminal failures that should not charge the learner.
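The retry classification described above might look roughly like the following. The `WorkerError` shape, the function names, and the rule that exhausted retries count as terminal are assumptions for illustration. The sketch also refunds on every terminal failure, which is simpler than the real rule of refunding only failures that should not charge the learner.

```typescript
// Hypothetical error shape: an HTTP status when the provider answered,
// or a timedOut flag when it did not.
interface WorkerError {
  status?: number;
  timedOut?: boolean;
}

type FailureKind = "retryable" | "terminal";

// Timeouts, 429, and 5xx are treated as retryable; everything else is terminal.
function classifyFailure(error: WorkerError): FailureKind {
  if (error.timedOut === true) return "retryable";
  if (error.status === 429) return "retryable";
  if (error.status !== undefined && error.status >= 500 && error.status <= 599) {
    return "retryable";
  }
  return "terminal";
}

// Assumption: once attempts are exhausted, a retryable error becomes terminal
// and the quota taken before the analysis is handed back.
function nextAction(
  error: WorkerError,
  attempt: number,
  maxAttempts: number,
): "retry" | "fail_and_refund" {
  const canRetry = classifyFailure(error) === "retryable" && attempt < maxAttempts;
  return canRetry ? "retry" : "fail_and_refund";
}
```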

YouTube Data Sources

The current product reads YouTube data through the YouTube Data API v3 and Google OAuth. It does not scrape YouTube pages, and it does not call a YouTube captions endpoint for the transcript.

accounts.google.com/o/oauth2/v2/auth

Request shape: scope=youtube.readonly, access_type=offline, prompt=consent
Product payload: The connected YouTube account and refreshable access token.
Sync pitfall: A connected account is still fragile. Refresh tokens can be missing, revoked, or invalidated outside the app.
Code path: src/lib/server/youtube.ts
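As a rough illustration of that request shape, the authorization URL can be assembled like this. `buildYouTubeAuthUrl`, `needsRefresh`, the input field names, and the 60-second refresh margin are assumptions for the sketch, not the helpers in src/lib/server/youtube.ts.

```typescript
// Placeholder inputs: the real client ID, redirect URI, and state handling
// live in the app's server code and are not shown in this post.
interface AuthUrlInput {
  clientId: string;
  redirectUri: string;
  state: string; // opaque per-request value used to reject forged callbacks
}

const GOOGLE_AUTH_ENDPOINT = "https://accounts.google.com/o/oauth2/v2/auth";
const YOUTUBE_READONLY_SCOPE = "https://www.googleapis.com/auth/youtube.readonly";

function buildYouTubeAuthUrl(input: AuthUrlInput): string {
  const params: Array<[string, string]> = [
    ["client_id", input.clientId],
    ["redirect_uri", input.redirectUri],
    ["response_type", "code"],
    ["scope", YOUTUBE_READONLY_SCOPE],
    ["access_type", "offline"], // ask for a refresh token
    ["prompt", "consent"], // re-show consent, the usual way to get a refresh token again
    ["state", input.state],
  ];
  const query = params
    .map((pair) => `${encodeURIComponent(pair[0])}=${encodeURIComponent(pair[1])}`)
    .join("&");
  return `${GOOGLE_AUTH_ENDPOINT}?${query}`;
}

// Refresh a little early so a request never starts with a token about to expire.
// The 60-second margin is an arbitrary illustrative value.
function needsRefresh(expiresAtMs: number, nowMs: number, marginMs: number = 60000): boolean {
  return expiresAtMs - nowMs <= marginMs;
}
```

Even with a correct URL, the pitfall above still applies: the callback handler has to cope with a response that carries no refresh token, and later refreshes have to cope with a revoked one.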

The subtle thing is that YouTube playlist data is not clean product data. It is a moving third-party view. A video can become private. A user can remove it from a playlist. A title can become “Private video.” Durations arrive as ISO 8601 strings. Page tokens are stateful. API quota can be consumed just by browsing. OAuth can fail after the user thinks they are connected.

That is why the final code moved toward passive browsing and explicit selection instead of maintaining a constantly synchronized playlist mirror.
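In query-client terms, passive browsing comes down to a few options. The sketch below uses a local type that mirrors the TanStack Query option names (`staleTime`, `refetchOnWindowFocus`, `refetchOnReconnect`) so it compiles without the library; the helper name and the query key shape are assumptions.

```typescript
// Local mirror of the relevant TanStack Query options, declared here so the
// sketch stands alone.
interface PassiveQueryOptions {
  queryKey: ReadonlyArray<string | null>;
  staleTime: number;
  refetchOnWindowFocus: boolean;
  refetchOnReconnect: boolean;
}

// Hypothetical helper; the key layout is illustrative only.
function passivePlaylistItemsQuery(playlistId: string, pageToken?: string): PassiveQueryOptions {
  return {
    queryKey: ["youtube", "playlist-items", playlistId, pageToken ?? null],
    staleTime: Infinity, // cached pages never go stale on their own
    refetchOnWindowFocus: false, // returning to the tab does not spend API quota
    refetchOnReconnect: false, // neither does a network reconnect
  };
}
```

With these settings the only way to hit the YouTube API again is an explicit refresh, which would typically invalidate the matching query key.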

Code evidence: passive playlist reads (youchannel/src/routes/_layout/playlists.tsx)

The playlist page uses query keys for YouTube playlists and playlist items, disables refetch on window focus and reconnect, and relies on explicit refresh. The page maps playlist items into local candidates, checks existing library IDs, detects private placeholders, enforces per-video duration limits, and then calls the analysis endpoint only for selected videos.
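The candidate checks can be sketched as follows. The `PlaylistItem` shape, the skip reason names, and detecting a private placeholder by its "Private video" title are assumptions for illustration. The duration parser handles the day, hour, minute, and second parts of an ISO 8601 duration.

```typescript
// Illustrative item shape; the real playlist item carries more fields.
interface PlaylistItem {
  youtubeVideoId: string;
  title: string;
  duration: string; // ISO 8601 duration, e.g. "PT12M30S"
}

type SkipReason = "already_imported" | "private_placeholder" | "too_long" | "bad_duration";

function parseIsoDurationSeconds(value: string): number | null {
  const match = /^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$/.exec(value);
  // Reject non-matches and degenerate forms such as "P" or "PT".
  if (match === null || value === "P" || value.charAt(value.length - 1) === "T") {
    return null;
  }
  const days = Number(match[1] || 0);
  const hours = Number(match[2] || 0);
  const minutes = Number(match[3] || 0);
  const seconds = Number(match[4] || 0);
  return days * 86400 + hours * 3600 + minutes * 60 + seconds;
}

// Returns null when the item is a valid analysis candidate.
function classifyItem(
  item: PlaylistItem,
  importedIds: Set<string>,
  maxSeconds: number,
): SkipReason | null {
  if (importedIds.has(item.youtubeVideoId)) return "already_imported";
  if (item.title === "Private video") return "private_placeholder"; // assumed placeholder check
  const seconds = parseIsoDurationSeconds(item.duration);
  if (seconds === null) return "bad_duration";
  if (seconds > maxSeconds) return "too_long";
  return null;
}

function selectCandidates(
  items: PlaylistItem[],
  importedIds: Set<string>,
  maxSeconds: number,
): PlaylistItem[] {
  return items.filter((item) => classifyItem(item, importedIds, maxSeconds) === null);
}
```

Returning a reason instead of a bare boolean lets the review dialog explain why a video cannot be selected.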

From Sync Engine to Passive Fetching

The migration history tells the real story better than a diagram.

e2ccc06, 2086a66: Background sync looked clean on paper

Before: The system tried to keep playlist state mirrored through sync jobs, workers, schedules, and logs.

After: This created a second source of truth beside YouTube and the learner-owned library.

A sync engine is a product commitment, not a helper script.

The earlier “bidirectional sync” idea was attractive in product language: connect a YouTube playlist, keep it aligned, react when items appear or disappear, and make the learning library follow. In practice, the code was telling a different story. Sync needed playlist status, video status, lost/auth-invalid states, job runs, retry windows, sync logs, dedupe rules, and cleanup paths. It also blurred ownership: if the same YouTube video appeared in two playlists, which product object owned the analysis?

The final model is less magical and more reliable:

read YouTube playlists when the learner opens the page,
page through playlist items only when needed,
let the learner choose videos,
create user-owned video records only after selection,
keep analysis history independent from playlist membership,
rely on queue infrastructure for async work instead of custom sync state,
make refresh an explicit UI action.

That migration was a product decision, not just a refactor. Passive fetching reduced surprise. It also made quota easier to explain: the user is charged for selected analysis work, not for invisible background synchronization.

Video Analysis Prompt Design

The analysis prompt had to produce learning primitives, not prose.

The worker sends Gemini two things in one user message: a video input with the YouTube URL, and a text prompt that demands strict JSON. The model configuration uses low temperature, a large output cap, JSON MIME type, and a JSON schema. After streaming, the worker strips code fences, parses the JSON, validates the overall shape, the timestamps, and the characters list, and persists both output and usage.

Output field	Why it existed	UI or agent use
`scene`	Capture the setting and vibe of the video.	Gives chat and Live prompts grounding beyond facts.
`summarize`	Provide a compact 3-6 sentence overview.	The overview tab and quick memory for later prompts.
`wiki`	Create chronological moments with timestamps.	Clickable learning timeline that can seek the video.
`characters`	Identify speakers or roles with traits, topics, language, and voice.	Character directory and Gemini Live roleplay setup.
`transcript`	Preserve original-language segments with timestamps and speakers.	Transcript tab, seeking, and study context.

This prompt design reflects the product. A generic “summarize this video” output would have been cheaper, but it would not power the learning interface. The system needed timestamps for navigation, speakers for roleplay, language labels for Live, voice choices for audio personality, and transcript segments for precise study.

Code evidence: the prompt is built for interaction (youchannel-service/apps/jobs/src/workers.ts)

ANALYSIS_OUTPUT_SCHEMA requires scene, summarize, wiki, characters, and transcript. The prompt tells the model not to invent facts, to keep wiki timestamps chronological, to keep transcript text in the original spoken language, and to select a TTS voice for each main character. The worker calls Gemini with the video URL as a video part and the prompt as a text part.
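A minimal sketch of the post-processing step (strip fences, parse, validate shape, check timestamp order) might look like this. The field types are assumptions: in particular, wiki timestamps are treated as numeric seconds here, and the real schema may encode them differently.

```typescript
// Assumed shapes for illustration; the real ANALYSIS_OUTPUT_SCHEMA is richer.
interface WikiEntry {
  timestamp: number; // seconds from the start of the video (assumption)
  title: string;
}

interface AnalysisOutput {
  scene: string;
  summarize: string;
  wiki: WikiEntry[];
  characters: unknown[];
  transcript: unknown[];
}

const FENCE = "\u0060\u0060\u0060"; // three backticks, written as escapes

// Remove a leading fence line (with optional language tag) and the closing fence.
function stripFences(raw: string): string {
  let text = raw.trim();
  if (text.indexOf(FENCE) === 0) {
    const firstNewline = text.indexOf("\n");
    text = firstNewline === -1 ? "" : text.slice(firstNewline + 1);
    const closing = text.lastIndexOf(FENCE);
    if (closing !== -1) text = text.slice(0, closing);
  }
  return text.trim();
}

function parseAnalysis(raw: string): AnalysisOutput {
  const data: unknown = JSON.parse(stripFences(raw));
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    throw new Error("analysis must be a JSON object");
  }
  const record = data as { [key: string]: unknown };
  for (const field of ["scene", "summarize"]) {
    if (typeof record[field] !== "string") throw new Error(`${field} must be a string`);
  }
  for (const field of ["wiki", "characters", "transcript"]) {
    if (!Array.isArray(record[field])) throw new Error(`${field} must be an array`);
  }
  // Wiki entries must carry finite, non-decreasing timestamps.
  let previous = 0;
  for (const entry of record["wiki"] as Array<{ timestamp?: unknown }>) {
    const value = entry === null || typeof entry !== "object" ? undefined : entry.timestamp;
    if (typeof value !== "number" || !isFinite(value) || value < previous) {
      throw new Error("wiki timestamps must be numeric and chronological");
    }
    previous = value;
  }
  return record as unknown as AnalysisOutput;
}
```

Failing loudly here matters because the UI treats timestamps as seek targets. An out-of-order timeline is better rejected in the worker than discovered by a learner clicking a button.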

Code evidence: token accounting is stored (youchannel-service/apps/jobs/src/workers.ts)

The stream collector records provider usage from the done chunk. updateVideoAnalysis persists that usage on the video_analyses row. The analysis call sets maxTokens: 65536, which is a sign that transcript-heavy output can become large even before counting video input tokens.

The cost lesson was blunt. Video input is not a normal chat prompt. Tokens scale with duration, audio, visual sampling, and the amount of structured output requested. In project testing, a 10-minute video analyzed with a heavier Gemini 3 Pro style setup could land near the low six figures of tokens. The checked-in repo does not include raw provider usage samples, so I would not present that as a benchmark. But the design pressure is visible in the code: quota ledgers, per-video duration limits, usage persistence, retries, refunds, and admin operations all exist because long-video analysis is a metered product, not a free utility call.

Planning estimate for a 10-minute video (Gemini 3 Pro style run, rich prompt):

Prompt payload: 78k tokens
Structured output: 42k tokens
Total: 120k tokens
Relative pressure: 1.0x the 10-minute reference

Analysis shapes considered: compact summary, rich study pack, transcript-heavy.
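To make the arithmetic behind that planning estimate concrete, here is a small sketch that scales the 10-minute reference by video length. Linear scaling is an assumption for planning only, not measured provider behavior; the one grounded constraint is that structured output cannot exceed the 65536-token cap set on the analysis call.

```typescript
// Reference point taken from the planning estimate above.
const REFERENCE = { minutes: 10, promptTokens: 78000, outputTokens: 42000 };
const OUTPUT_TOKEN_CAP = 65536; // maxTokens on the analysis call

interface TokenEstimate {
  pressure: number; // multiple of the 10-minute reference
  promptTokens: number;
  outputTokens: number;
  totalTokens: number;
}

// Assumption: both prompt and output grow linearly with duration,
// with output clipped at the configured cap.
function estimateTokens(videoMinutes: number): TokenEstimate {
  const pressure = videoMinutes / REFERENCE.minutes;
  const promptTokens = Math.round(REFERENCE.promptTokens * pressure);
  const outputTokens = Math.min(
    OUTPUT_TOKEN_CAP,
    Math.round(REFERENCE.outputTokens * pressure),
  );
  return { pressure, promptTokens, outputTokens, totalTokens: promptTokens + outputTokens };
}
```

Under these assumptions a 20-minute video already hits the output cap, which is one way to see why transcripts have to stay concise and allow truncation.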

This also affected prompt strategy:

ask for structured learning fields once, rather than many scattered follow-up calls,
keep the schema small enough to validate,
use timestamps as stable UI anchors,
keep transcripts concise and allow truncation,
store usage for later cost analysis,
make quota deduction idempotent around the analysis record,
refund on terminal worker failure where the product should not charge.
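The last two points, idempotent deduction and refund on terminal failure, can be sketched with an in-memory ledger keyed by analysis ID. `QuotaLedger` and its unit of account are hypothetical; the real consume_quota and refund_quota functions live in the database and are not reproduced here.

```typescript
// In-memory stand-in for a quota ledger. "Units" are deliberately abstract.
class QuotaLedger {
  private remaining: number;
  private charges: Map<string, number> = new Map<string, number>(); // analysisId -> units charged

  constructor(initialUnits: number) {
    this.remaining = initialUnits;
  }

  // Charging the same analysis twice is a no-op, so a retried job cannot double-bill.
  consume(analysisId: string, units: number): boolean {
    if (this.charges.has(analysisId)) return true;
    if (units > this.remaining) return false;
    this.remaining -= units;
    this.charges.set(analysisId, units);
    return true;
  }

  // Refund returns the units handed back, or 0 when there was nothing to refund.
  refund(analysisId: string): number {
    const units = this.charges.get(analysisId);
    if (units === undefined) return 0;
    this.charges.delete(analysisId);
    this.remaining += units;
    return units;
  }

  balance(): number {
    return this.remaining;
  }
}
```

Keying both operations on the analysis record means a retry and a refund can each run more than once without changing the balance a second time.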

The unpleasant tradeoff is that the best product output is often the most expensive output. A rich study pack wants transcript, timeline, roles, and summary. A cheap study pack wants fewer fields. YouChannel never fully resolved that pricing/product tension.

The Learning Interface

The main learning screen was designed around simultaneous context: the learner should not lose the video while reading notes or speaking.

Desktop

Resizable study workspace

The desktop route uses a resizable layout: video in the main area, learning tabs below, and Live chat on the side. The user can keep the video visible while reading summary, wiki, or transcript.

Mobile

Video-first switching

The mobile route keeps the video at the top and uses a compact tab model for Learn and Chat. It is not the same layout squeezed smaller; it is a different interaction shape.

Navigation

Timestamp as control surface

Wiki and transcript entries include timestamp buttons. Clicking them seeks the YouTube player, so generated structure becomes player control.

Conversation

Characters become roles

The chat sidebar reads analysis characters, lets the learner choose a role and voice, and builds a Gemini Live system prompt from that role plus the video context.

Code evidence: learning page layout (youchannel/src/routes/_layout/learn/$videoId.tsx)

The route composes VideoPlayerCard, LearningTabs, and ChatSidebar. It passes a seek handler down into the wiki and transcript surfaces, so analysis timestamps can control the YouTube iframe player.
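The seek wiring can be illustrated without React. In this sketch the timestamp label format ("mm:ss" or "hh:mm:ss") is an assumption, and `makeTimestampClick` is a hypothetical helper. In the real route the handler would wrap the YouTube iframe player's seek call.

```typescript
// Convert "mm:ss" or "hh:mm:ss" into seconds; returns null for anything else.
function timestampToSeconds(label: string): number | null {
  const parts = label.trim().split(":");
  if (parts.length < 2 || parts.length > 3) return null;
  let total = 0;
  for (const part of parts) {
    if (!/^\d+$/.test(part)) return null;
    total = total * 60 + Number(part);
  }
  return total;
}

// The route passes one seek handler down; each timestamp button binds its own label.
type SeekHandler = (seconds: number) => void;

function makeTimestampClick(label: string, onSeek: SeekHandler): () => void {
  return () => {
    const seconds = timestampToSeconds(label);
    if (seconds !== null) onSeek(seconds); // ignore malformed labels instead of seeking to 0
  };
}
```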

Code evidence: character prompt design (youchannel/src/lib/dashboard/learn/components/ChatSidebar.tsx)

buildSystemPrompt tells the model to become the selected character, not a generic assistant. It includes role, description, traits, speaking style, audio profile, video memories, and language preference. It also instructs the model to admit uncertainty when the video context does not support an answer.
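A role prompt of that kind could be assembled along these lines. This is not the text of buildSystemPrompt: the `RoleCharacter` fields, the function name `buildRolePrompt`, and every sentence in the prompt are illustrative assumptions.

```typescript
// Hypothetical character shape derived from the analysis output.
interface RoleCharacter {
  name: string;
  role: string;
  description: string;
  traits: string[];
  speakingStyle: string;
  language: string;
}

function buildRolePrompt(character: RoleCharacter, videoMemories: string[]): string {
  const lines: string[] = [
    `You are ${character.name}, ${character.role}. Stay in character and do not act as a generic assistant.`,
    `Description: ${character.description}`,
    `Traits: ${character.traits.join(", ")}`,
    `Speaking style: ${character.speakingStyle}`,
    `Reply in ${character.language} unless the learner asks otherwise.`,
    "What you remember from the video:",
  ];
  for (const memory of videoMemories) {
    lines.push(`- ${memory}`);
  }
  // Grounding rule: uncertainty is preferred over invented detail.
  lines.push("If the video context does not support an answer, say you are not sure.");
  return lines.join("\n");
}
```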

The design mistake I would avoid next time: I let too many learning surfaces mature at once. Overview, wiki, transcript, character chat, general Live, assessment, shadowing, and quota all competed for product attention. The better wedge would have been one excellent study screen from one video, then one excellent practice loop from that study screen.

Gemini Live for Real-Time Learning

The Live system was the most promising part emotionally. It was also where the product needed the most care.

The app did not expose the long-lived Gemini API key to the browser. It created a short-lived Live auth token on the server, then the client opened a Gemini Live session with native audio output. The client handled microphone capture, audio chunking, output playback, input and output transcriptions, session resumption, function-call plumbing, and message synchronization.

Profile and preferences enter the room

I will keep replies short, recast mistakes gently, and ask one concrete question at a time.

Durable state: live_user_profile_versions, chat preferences, device context

Live design part	Implementation detail	Product reason
Short-lived access	`getGeminiToken` creates a 30-minute, single-use token from the server API key.	Avoid shipping the provider key to the browser.
Audio model	`gemini-2.5-flash-native-audio-preview-12-2025` is the Live default in the client hook.	Keep the experience spoken, not text-first.
Audio pipeline	Input chunks are sent at 16 kHz; output audio is played at 24 kHz.	Browser microphone behavior needs a dedicated real-time path.
Session config	Audio response modality, affective dialog, input/output transcription, optional resumption handle.	The app needs speech, transcripts, continuity, and a more natural conversation shape.
Proactivity	`proactiveAudio: false` in the Live config.	Let the product control openings and turns instead of letting the model interrupt.
Persistence	Messages are synced in batches, on disconnect, page hide, and before reconnect.	Streaming partials are noisy. Durable history should store useful turns, not every fragment.
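For the audio pipeline row, the conversion from browser samples to fixed-size input chunks can be sketched as below. The 16-bit PCM encoding and the 100 ms chunk length are assumptions; the only figure taken from the table is the 16 kHz input rate.

```typescript
const INPUT_SAMPLE_RATE = 16000; // Hz, from the table above

// Convert Web Audio style float samples (-1..1) to 16-bit signed PCM.
function floatTo16BitPcm(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, samples[i] ?? 0));
    pcm[i] = clamped < 0 ? Math.round(clamped * 32768) : Math.round(clamped * 32767);
  }
  return pcm;
}

// Number of samples in one chunk of the given duration.
function samplesPerChunk(chunkMs: number, sampleRate: number = INPUT_SAMPLE_RATE): number {
  return Math.floor((sampleRate * chunkMs) / 1000);
}

// Split a PCM buffer into chunks; the last chunk may be shorter.
function chunkPcm(pcm: Int16Array, chunkMs: number): Int16Array[] {
  const size = samplesPerChunk(chunkMs);
  const chunks: Int16Array[] = [];
  for (let start = 0; start < pcm.length; start += size) {
    chunks.push(pcm.slice(start, start + size));
  }
  return chunks;
}
```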

There were two major Live modes.

Mode 1

Video roleplay

The learner opens a video and talks to a generated role from the analysis. The character prompt carries video context, speaking style, topics, voice, and language.

Mode 2

General language coach

The learner opens the Live page and speaks with a multilingual coach. The system prompt is short, calibrated, and conversation-first.

The general coach had its own role design. It was not meant to be a verbose teacher. The system prompt asks for short one-to-three sentence replies, one engaging question, calibrated difficulty, natural recasts instead of constant grammar lectures, and a willingness to ask for repetition when speech is garbled. The assistant’s job is to keep the learner talking.

The surrounding agents gave the Live coach memory:

Role	Model behavior	Durable output
Conversation partner	Real-time Gemini Live voice conversation.	Session messages and resumption handle.
Profile builder	Gemini 3 Flash preview formats optional preferences, audio intro, and approximate region into a privacy-bounded profile.	Append-only live profile versions.
Session evaluator	After the call, the app asks the Live session to evaluate user utterances and can format the result into strict JSON.	CEFR-style assessment, strengths, weaknesses, and recommended drills.
Shadowing scorer	A separate scoring prompt compares a learner’s recorded attempt against a target sentence.	`shadowing_attempts` with scores and next focus.
Drill recommender	Recent assessments and attempts are sampled into future practice suggestions.	A practice queue biased toward weak, recent, or unattempted areas.
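The drill recommender row can be made concrete with a small scoring sketch. The table describes sampling; this example swaps in a deterministic ranking to keep it short, and the `DrillCandidate` fields and the 0.5 / 0.3 / 0.2 weights are invented for illustration.

```typescript
// Hypothetical candidate built from recent assessments and attempts.
interface DrillCandidate {
  id: string;
  weakness: number; // 0 (strong) to 1 (weak), from the latest assessment
  daysSinceAssessment: number;
  attempts: number; // prior shadowing attempts on this drill
}

// Higher means "practice this sooner". Weights are arbitrary illustrative values.
function drillPriority(candidate: DrillCandidate): number {
  const recency = 1 / (1 + candidate.daysSinceAssessment); // recent signals count more
  const novelty = 1 / (1 + candidate.attempts); // unattempted drills score 1
  return 0.5 * candidate.weakness + 0.3 * recency + 0.2 * novelty;
}

function recommendDrills(candidates: DrillCandidate[], limit: number): DrillCandidate[] {
  return candidates
    .slice() // do not reorder the caller's array
    .sort((a, b) => drillPriority(b) - drillPriority(a))
    .slice(0, limit);
}
```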

Code evidence: Live session loop (youchannel/src/routes/_layout/live.tsx)

The Live route builds device context, optional profile context, and the base Live system prompt, then connects with a fresh token. It sends an initial greeting after connection, stores resumption handles, retries reconnects with backoff, syncs previous sessions before reconnecting, batches message persistence, evaluates the session on end, and generates a title after the call.
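The reconnect behavior can be sketched as a planning function that decides how long to wait and whether to resume. The base delay, cap, attempt limit, and the `ReconnectPlan` shape are assumptions; the route's real values are not shown in this post.

```typescript
interface ReconnectPlan {
  delayMs: number;
  resumeHandle: string | null; // null means "start a fresh session"
}

// Exponential backoff with a cap. Returns null once the attempt budget is spent.
function planReconnect(
  attempt: number, // 0 for the first retry
  storedHandle: string | null,
  maxAttempts: number = 5,
  baseMs: number = 500,
  maxMs: number = 15000,
): ReconnectPlan | null {
  if (attempt >= maxAttempts) return null;
  const delayMs = Math.min(maxMs, baseMs * Math.pow(2, attempt));
  return { delayMs, resumeHandle: storedHandle };
}
```

In the flow described above, the caller would sync the previous session's messages and fetch a fresh token before acting on the plan, since the short-lived token from the failed connection may no longer be usable.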

Code evidence: Live role and profile design (youchannel/src/lib/dashboard/live/constants.ts, profile.ts, assessment.ts)

The base assistant is a multilingual conversation partner and language coach. Profile generation uses gemini-3-flash-preview and privacy constraints. Assessment uses Live session context and can reformat vague output into strict JSON. Shadowing drills later reuse assessment weaknesses.

The key migration inside Live was also from eager synchronization to passive, batched persistence. Streaming audio and transcripts produce too many intermediate states. The app only needs durable turns, session metadata, assessment output, and practice signals. So the implementation queues message sync, filters empty or streaming fragments, retries failed batches, and flushes on lifecycle events.
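A minimal sketch of that queue, under stated assumptions: the `LiveMessage` shape and `MessageSyncQueue` are hypothetical, and the sender is synchronous here to keep the example small, whereas a real client would await a network call.

```typescript
interface LiveMessage {
  id: string;
  text: string;
  streaming: boolean; // true while the transcript fragment is still changing
}

// Returns true when the batch was persisted.
type SendBatch = (batch: LiveMessage[]) => boolean;

class MessageSyncQueue {
  private pending: LiveMessage[] = [];
  private send: SendBatch;
  private batchSize: number;

  constructor(send: SendBatch, batchSize: number) {
    this.send = send;
    this.batchSize = batchSize;
  }

  enqueue(message: LiveMessage): void {
    // Only settled, non-empty turns are worth storing.
    if (message.streaming || message.text.trim() === "") return;
    this.pending.push(message);
    if (this.pending.length >= this.batchSize) this.flush();
  }

  // Also meant to be called on disconnect, page hide, and before reconnect.
  flush(): number {
    let sent = 0;
    while (this.pending.length > 0) {
      const batch = this.pending.slice(0, this.batchSize);
      if (!this.send(batch)) break; // failed batch stays queued for the next flush
      this.pending = this.pending.slice(batch.length);
      sent += batch.length;
    }
    return sent;
  }

  pendingCount(): number {
    return this.pending.length;
  }
}
```

Because a failed batch is never dropped, a later lifecycle flush doubles as the retry path.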

That pattern mirrors the YouTube migration. The app moved away from “sync everything all the time” and toward “observe the source, then persist product state at deliberate checkpoints.”

Why the Full Product Stopped

The honest reason is not one dramatic failure. The system simply became a serious metered product before the learning habit had enough proof.

Pressure	What the code revealed	Why it mattered
YouTube dependency	OAuth refresh, page tokens, private placeholders, unavailable videos, and API quota all became normal paths.	The source material was valuable but not fully under product control.
Video analysis cost	The prompt wanted scene, summary, wiki, roles, voices, languages, and transcript.	The best learning artifact was expensive to generate.
Live complexity	Real-time audio needed token minting, browser capture, playback, transcriptions, session resumption, profile context, and assessment.	A mediocre voice loop feels worse than a simpler text product.
State surface	Videos, analyses, quotas, grants, usage events, sessions, messages, profiles, assessments, and attempts all mattered.	YouChannel became an operating system for learning before it became a focused habit.
Business model	Hosted mode requires cost control; BYOK reduces platform cost but adds trust, support, privacy, and UX complexity.	Normal learners do not want to debug provider keys before studying.

Cloudflare could host the static archive cleanly. A full Cloudflare-native YouChannel would need a deeper rewrite: Workers, Queues, D1 or Hyperdrive, a different job model, secrets and BYOK strategy, and a replacement for the Node process assumptions around pg-boss workers. That is possible, but it is not just “deploy the existing app to Pages.”

So I stopped the full product and kept the useful artifacts: source code, deployment notes, OpenAPI docs, the product archive, and this postmortem. The next experiment is Zota, which takes the YouChannel lesson into scenario-based practice instead of media-based study.

What I Would Build Instead

If I restarted this idea, I would make the wedge smaller and sharper.

Wedge 1

One URL to one study pack

Skip playlist sync, admin dashboards, and Live voice at first. Paste a YouTube URL, generate one excellent study pack, and measure whether people return to it.

Wedge 2

Shadowing before open conversation

Shadowing is easier to score and explain than free-form voice chat. It gives the learner a clear action and clear feedback.

Wedge 3

Passive data, explicit cost

Read YouTube only when the user asks. Charge quota only when the user selects analysis. Make cost visible before the model call.

Wedge 4

Roles after evidence

Add character roleplay only when the video analysis can reliably identify speakers and useful speaking styles.

The core lesson survived: authentic media plus structured AI can become a powerful learning interface. But the product has to respect cost, third-party instability, and user attention. The real challenge is not connecting YouTube to an LLM. The challenge is designing a loop that a learner can trust enough to repeat.

The static project archive is at youchannel-product.pages.dev. The code remains useful as a product-engineering notebook: a record of how a simple idea turned into agent loops, quota ledgers, passive YouTube reads, Gemini video analysis, real-time speech practice, and eventually a decision to stop before the platform became larger than the learning problem. The follow-up project is Zota, where the same durable-state idea moves from video analysis into memory-led, scenario-based communication practice.