Whisper transcribes it. Llama picks the best moments. FFmpeg renders them to 9:16. Your video never leaves your machine; the extracted audio and the transcript are the only things we send anywhere.
OpusClip is a $30M company. Videotto is a clone of it: the same stack, rearranged so your machine does most of the work.
Web Audio API decodes your video's audio track to 16kHz mono WAV. This happens in your tab. No upload.
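A sketch of that in-tab step. In the browser, `AudioContext.decodeAudioData` yields Float32 channel data; the helpers below (illustrative names, not the actual implementation) downmix it to mono, resample it with naive linear interpolation, and pack a 16kHz 16-bit PCM WAV. An `OfflineAudioContext` would resample more accurately; this is the minimal version.

```typescript
// Downmix N channels to mono by averaging.
function downmixToMono(channels: Float32Array[]): Float32Array {
  const out = new Float32Array(channels[0].length);
  for (const ch of channels)
    for (let i = 0; i < ch.length; i++) out[i] += ch[i] / channels.length;
  return out;
}

// Naive linear-interpolation resampler (e.g. 48000 -> 16000 Hz).
function resampleLinear(input: Float32Array, from: number, to: number): Float32Array {
  const n = Math.round((input.length * to) / from);
  const out = new Float32Array(n);
  for (let i = 0; i < n; i++) {
    const pos = (i * from) / to;
    const j = Math.floor(pos);
    const frac = pos - j;
    const a = input[j] ?? 0;
    const b = input[j + 1] ?? a;
    out[i] = a + (b - a) * frac;
  }
  return out;
}

// Wrap float samples in a minimal 16-bit PCM WAV container.
function encodeWavPcm16(samples: Float32Array, rate: number): Uint8Array {
  const buf = new ArrayBuffer(44 + samples.length * 2);
  const v = new DataView(buf);
  const str = (off: number, s: string) => {
    for (let i = 0; i < s.length; i++) v.setUint8(off + i, s.charCodeAt(i));
  };
  str(0, "RIFF"); v.setUint32(4, 36 + samples.length * 2, true); str(8, "WAVE");
  str(12, "fmt "); v.setUint32(16, 16, true);
  v.setUint16(20, 1, true);              // PCM
  v.setUint16(22, 1, true);              // mono
  v.setUint32(24, rate, true);           // sample rate
  v.setUint32(28, rate * 2, true);       // byte rate: 16-bit mono
  v.setUint16(32, 2, true); v.setUint16(34, 16, true);
  str(36, "data"); v.setUint32(40, samples.length * 2, true);
  for (let i = 0; i < samples.length; i++)
    v.setInt16(44 + i * 2, Math.max(-32768, Math.min(32767, samples[i] * 32767)), true);
  return new Uint8Array(buf);
}
```

In the real pipeline the channel data comes from `audioBuffer.getChannelData(i)` after `decodeAudioData`; everything above runs in the tab, so nothing is uploaded yet.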
Just the audio (not the video) gets sent to Workers AI's Whisper endpoint. Word-level timestamps come back.
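What that audio-only upload might look like against Cloudflare's REST endpoint for Workers AI. The URL shape and `@cf/openai/whisper` model name match Cloudflare's published API; the account ID and token are placeholders, and building the request as a plain object here is just for illustration.

```typescript
const WHISPER_MODEL = "@cf/openai/whisper"; // Workers AI's hosted Whisper

// Build the transcription request: body is raw audio bytes only.
// The video file itself never appears here.
function whisperRequest(accountId: string, apiToken: string, wavBytes: Uint8Array) {
  return {
    url: `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/run/${WHISPER_MODEL}`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/octet-stream",
      },
      body: wavBytes,
    },
  };
}

// Usage (browser): const { url, init } = whisperRequest(id, token, wav);
// const { result } = await (await fetch(url, init)).json();
// result carries the transcript with word-level timestamps.
```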
The transcript is sent to Llama 3.3 70B with a prompt: "return the 4 most engaging 15–55s segments as JSON."
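The exact prompt wording and response schema are assumptions, but the shape of the step looks like this: ask for JSON, then validate it before trusting it, since LLMs drift from the requested format.

```typescript
interface Segment { start: number; end: number; reason?: string }

// Hypothetical prompt matching the "4 most engaging 15-55s segments" ask.
const buildPrompt = (transcript: string) =>
  `Return the 4 most engaging 15-55 second segments of this transcript as JSON: ` +
  `[{"start": seconds, "end": seconds, "reason": "..."}]\n\n${transcript}`;

// Validate the model's reply: numeric bounds, 15-55s duration, at most 4 segments.
function parseSegments(raw: string): Segment[] {
  const parsed = JSON.parse(raw) as Segment[];
  return parsed
    .filter((s) => Number.isFinite(s.start) && Number.isFinite(s.end))
    .filter((s) => s.end - s.start >= 15 && s.end - s.start <= 55)
    .slice(0, 4);
}
```

Validation matters more than the prompt: a segment outside 15–55s would produce an unusable clip downstream, so it is cheaper to drop it here than to render it.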
FFmpeg compiled to WebAssembly runs inside your browser. It cuts, crops to 9:16, and encodes. Your video never touches our server. The paid clones upload your file to their backend, run FFmpeg on a GPU they rent, and charge you for it.
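One clip's render, expressed as the argv an ffmpeg.wasm call would receive (a sketch, not the actual command the app builds): `crop=ih*9/16:ih` takes a centered full-height vertical slice of a landscape frame, and putting `-ss` before `-i` with `-t` keeps the seek fast.

```typescript
// Build ffmpeg args to cut [start, end) and center-crop to 9:16.
function clipArgs(input: string, start: number, end: number, output: string): string[] {
  return [
    "-ss", String(start),              // fast input-side seek
    "-i", input,
    "-t", String(end - start),         // clip duration
    "-vf", "crop=ih*9/16:ih",          // centered 9:16 crop at full height
    "-c:v", "libx264",
    "-preset", "ultrafast",            // speed over file size: WASM is slow
    "-c:a", "copy",
    output,
  ];
}

// Usage with @ffmpeg/ffmpeg (v0.12+ API):
//   await ffmpeg.exec(clipArgs("in.mp4", 120, 150, "clip1.mp4"));
```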
No. Only the extracted audio (a few MB) goes to Whisper. Only the transcript text goes to Llama. The video file itself is never uploaded.
The heavy parts run on a free tier and on your machine: Cloudflare Workers AI handles transcription, your browser handles rendering. There's no paid GPU rental because there's no server-side render.
No. OpusClip has animated captions, better viral-moment scoring trained on their own engagement data, and face-tracking that reframes on the speaker. This has none of that. But for a raw "give me the 4 best 30-second moments" it's shockingly close.
Workers AI has a payload limit on Whisper. For 16kHz mono audio, that caps input at roughly 5 minutes. Longer videos would need chunked transcription: doable, but not done yet.
FFmpeg in WASM is single-threaded and your browser isn't built for this. A 30-second clip takes roughly 30–90 seconds to render. Trade-off for $0.