<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://scrapingenthusiast.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://scrapingenthusiast.github.io/" rel="alternate" type="text/html" /><updated>2026-04-08T10:45:59+07:00</updated><id>https://scrapingenthusiast.github.io/feed.xml</id><title type="html">Scraping Enthusiast</title><subtitle>Learn no-code web scraping from the experts at Novi Develop. Master data extraction tools for TikTok, X.com, and more with step-by-step tutorials.</subtitle><author><name>Novi Develop</name></author><entry><title type="html">Claude Code Architecture (Part 3): Masterclass in Memory &amp;amp; Context Compaction</title><link href="https://scrapingenthusiast.github.io/performance/claude-code-context-memory-compaction/" rel="alternate" type="text/html" title="Claude Code Architecture (Part 3): Masterclass in Memory &amp;amp; Context Compaction" /><published>2026-04-06T12:00:00+07:00</published><updated>2026-04-06T12:00:00+07:00</updated><id>https://scrapingenthusiast.github.io/performance/claude-code-context-memory-compaction</id><content type="html" xml:base="https://scrapingenthusiast.github.io/performance/claude-code-context-memory-compaction/"><![CDATA[<p>We complete the ScrapingEnthusiast architecture series today. Combining our <a href="/performance/claude-code-agentic-search-optimization/">Search knowledge in Part 2</a> with terminal bounds, we arrive at the absolute climax of LLM engineering: Token and Context Scaling.</p>

<p>Every tool call and background query adds to the context window until the conversation reaches the hard limit (roughly 200,000 tokens), at which point the API rejects the request with <code class="language-plaintext highlighter-rouge">prompt too long (413 HTTP)</code>. Claude avoids this bottleneck through staged <code class="language-plaintext highlighter-rouge">Compaction</code>.</p>

<h2 id="the-hybrid-tracking-method">The Hybrid Tracking Method</h2>

<p>Accurate token accounting decides how long a session survives. Rather than paying for an exact API token count on every turn, the local state combines the counts the API does report with a cheap heuristic of roughly 1 token per 4 characters of plain text.</p>
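<p>The hybrid approach described above can be sketched in a few lines. This is a minimal illustration, not Claude Code's actual implementation: the function names <code class="language-plaintext highlighter-rouge">estimate_tokens</code> and <code class="language-plaintext highlighter-rouge">hybrid_token_count</code> are hypothetical, and the only value taken from the article is the ratio of about 4 characters per token.</p>

```python
import math

CHARS_PER_TOKEN = 4  # heuristic ratio quoted in the article


def estimate_tokens(text):
    """Cheap local estimate: about one token per four characters."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def hybrid_token_count(last_api_count, new_messages):
    """Start from the last exact count reported by the API and add
    heuristic estimates for messages appended since then."""
    return last_api_count + sum(estimate_tokens(m) for m in new_messages)
```

<p>The exact count anchors the total, so the heuristic error only accumulates over the messages added since the last API response.</p>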

<h2 id="progressive-multi-layer-compaction">Progressive Multi-Layer Compaction</h2>

<p>As the loop runs, it reduces the token footprint in 5 progressive layers. The first four are listed here; the fifth, Autocompact, is covered in the next section:</p>
<ol>
  <li><strong>Tool Result Caps:</strong> Large tool outputs such as JSON dumps are truncated once they exceed the <code class="language-plaintext highlighter-rouge">20,000 max string limits</code> cap, and a pointer to an <code class="language-plaintext highlighter-rouge">.md</code> file holding the stored output is returned instead.</li>
  <li><strong>Transient Pruning:</strong> Earlier analytical <code class="language-plaintext highlighter-rouge">HISTORY_SNIP</code> blocks are dropped to cut overhead, keeping only the formal outline of what was executed.</li>
  <li><strong>Active Microcompaction:</strong> <code class="language-plaintext highlighter-rouge">cache_edits</code> payloads quietly clear transient tool results on the backend, piggybacking on requests that are being sent anyway.</li>
  <li><strong>Context Collapse States:</strong> An experimental mode that condenses branches of the conversation.</li>
</ol>
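<p>The first layer, capping tool results, is the easiest to picture in code. The sketch below is an assumption-heavy illustration: the 20,000 figure comes from the article and is treated here as a character count, while <code class="language-plaintext highlighter-rouge">cap_tool_result</code>, the file naming scheme, and the wording of the truncation notice are all hypothetical.</p>

```python
from pathlib import Path

MAX_RESULT_CHARS = 20_000  # cap quoted in the article, assumed to be characters


def cap_tool_result(result, storage_dir):
    """Return the result unchanged if it fits; otherwise save the full
    text to a .md file and return a truncated preview plus a pointer."""
    if MAX_RESULT_CHARS >= len(result):
        return result
    storage_dir = Path(storage_dir)
    storage_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one numbered file per spilled result.
    index = len(list(storage_dir.glob("tool_result_*.md")))
    spill_path = storage_dir / f"tool_result_{index}.md"
    spill_path.write_text(result, encoding="utf-8")
    preview = result[:MAX_RESULT_CHARS]
    return f"{preview}\n[Output truncated; full result saved to {spill_path}]"
```

<p>The model keeps a bounded preview in context and can still read the full file later if it turns out to matter.</p>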

<h3 id="the-asynchronous-fail-safe-autocompact">The Asynchronous Fail-safe: Autocompact</h3>

<p>When the total size crosses the internal warning threshold (~13k overhead), Claude starts an asynchronous summarization pass that runs independently in the background.
Session Memory is built as structured notes extracted from the session (for example, a <code class="language-plaintext highlighter-rouge"># Errors &amp; Corrections</code> section) and written locally to a hidden path inside <code class="language-plaintext highlighter-rouge">.claude/session_memory</code>.</p>

<p>The prompt engine then discards everything before the boundary checkpoint while keeping the summarized history. The session can keep looping long after the raw conversation would have overflowed the context window.</p>

<p>We hope this 3-part series expanded your perspective massively surrounding modern LLM backend constraints!</p>]]></content><author><name>Scraping Enthusiast</name></author><category term="Token Budget" /><category term="Memory Storage" /><category term="LLM Limits" /><category term="Context Management" /><summary type="html"><![CDATA[Our final architectural deep dive exploring how Claude natively controls expansive token payloads to securely prevent Context Overflow.]]></summary></entry><entry><title type="html">Claude Code Architecture (Part 2): Optimizing Agentic Search via Ripgrep</title><link href="https://scrapingenthusiast.github.io/performance/claude-code-agentic-search-optimization/" rel="alternate" type="text/html" title="Claude Code Architecture (Part 2): Optimizing Agentic Search via Ripgrep" /><published>2026-04-06T11:00:00+07:00</published><updated>2026-04-06T11:00:00+07:00</updated><id>https://scrapingenthusiast.github.io/performance/claude-code-agentic-search-optimization</id><content type="html" xml:base="https://scrapingenthusiast.github.io/performance/claude-code-agentic-search-optimization/"><![CDATA[<p>Building from our <a href="/performance/claude-code-architecture-performance-loop/">Part 1 overview</a> of the core engine, we arrive at a critical bottleneck: querying the filesystem.</p>

<p>When an agentic LLM works against large enterprise repositories containing thousands of files, searching through Node.js filesystem APIs is prohibitively slow. Claude addresses this by offloading search to a local Rust binary.</p>

<h2 id="ripgrep-optimizations">Ripgrep Optimizations</h2>

<p><code class="language-plaintext highlighter-rouge">GrepTool</code> runs Ripgrep directly instead of reading files through Node. To avoid cold-start delays, Claude bundles an embedded binary, so it does not have to wait on <code class="language-plaintext highlighter-rouge">$PATH</code> resolution.</p>

<h3 id="the-eagain-trade-off">The EAGAIN Trade-off</h3>

<p>One of the cleverest performance safeguards deals with thread saturation.
In constrained environments (such as Docker containers with execution limits), an overly aggressive <code class="language-plaintext highlighter-rouge">rg</code> thread pool can exhaust the available resources and trigger an <code class="language-plaintext highlighter-rouge">EAGAIN</code> (resource temporarily unavailable) error.</p>

<p>Claude detects the <code class="language-plaintext highlighter-rouge">EAGAIN</code> error. Rather than aborting the search loop, it quietly retries with the <code class="language-plaintext highlighter-rouge">-j 1</code> argument (single-threaded execution). It trades execution speed for robustness.</p>
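<p>The retry pattern looks roughly like the sketch below. It is written in Python for consistency with the rest of this site, not taken from Claude Code: <code class="language-plaintext highlighter-rouge">run_ripgrep</code> and <code class="language-plaintext highlighter-rouge">is_eagain</code> are hypothetical names, and matching on the stderr text is an assumption about how the error surfaces. The <code class="language-plaintext highlighter-rouge">run</code> parameter exists only so the logic can be exercised without a real <code class="language-plaintext highlighter-rouge">rg</code> binary.</p>

```python
import subprocess


def is_eagain(stderr):
    """Heuristic check on ripgrep's stderr; the exact message text is an
    assumption and may differ between platforms."""
    lowered = stderr.lower()
    return "eagain" in lowered or "resource temporarily unavailable" in lowered


def run_ripgrep(pattern, path, run=subprocess.run):
    """Run rg normally, then retry once single-threaded on EAGAIN."""
    result = run(["rg", "--line-number", pattern, path],
                 capture_output=True, text=True)
    # rg exits with 0 (matches) or 1 (no matches) on success.
    if result.returncode not in (0, 1) and is_eagain(result.stderr):
        # -j 1 limits ripgrep to a single worker thread.
        result = run(["rg", "-j", "1", "--line-number", pattern, path],
                     capture_output=True, text=True)
    return result
```

<p>Retrying only once keeps the fallback predictable: if the single-threaded run also fails, the error is reported instead of looping.</p>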

<h2 id="memory-bounded-pagination">Memory Bounded Pagination</h2>

<p>A search that returns 20,000 matches would exhaust the token budget instantly. How does Claude search without overflowing the context window?</p>

<p>Search output is capped by a hard-coded limit (a static <code class="language-plaintext highlighter-rouge">head_limit</code>). If Ripgrep finds thousands of matches, the payload is truncated to a sample of the results, followed by a system notice:
<code class="language-plaintext highlighter-rouge">[Showing results with pagination = limit: 250]</code></p>

<p>The LLM is instructed to recognize this notice and, when it needs more results, to issue a follow-up call with <code class="language-plaintext highlighter-rouge">offset: 250</code>, repeating as necessary. Searching your entire repository therefore behaves much like paging through a database result set with a cursor.</p>
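<p>The cursor-style behavior can be sketched as follows. The parameter names <code class="language-plaintext highlighter-rouge">head_limit</code> and <code class="language-plaintext highlighter-rouge">offset</code> and the notice text come from the article; the functions <code class="language-plaintext highlighter-rouge">paginate</code> and <code class="language-plaintext highlighter-rouge">read_all</code> are hypothetical, and the rule for when the notice appears is an assumption.</p>

```python
def paginate(matches, head_limit=250, offset=0):
    """Return one page of matches plus a notice when more remain."""
    page = matches[offset:offset + head_limit]
    remaining = len(matches) - (offset + len(page))
    notice = None
    if remaining > 0:
        notice = f"[Showing results with pagination = limit: {head_limit}]"
    return page, notice


def read_all(matches, head_limit=250):
    """Mimic the follow-up calls: keep advancing offset until no notice."""
    offset, collected = 0, []
    while True:
        page, notice = paginate(matches, head_limit, offset)
        collected.extend(page)
        if notice is None:
            return collected
        offset += head_limit
```

<p>In practice the model would rarely read every page. The point is that each call costs a bounded number of tokens, and the model decides whether another page is worth it.</p>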

<p>In <strong>Part 3</strong>, we will move into Claude’s greatest performance implementation yet: The 5-layer Context Compaction parameters.</p>]]></content><author><name>Scraping Enthusiast</name></author><category term="Ripgrep" /><category term="Search Performance" /><category term="Binary Execution" /><category term="Memory Optimization" /><summary type="html"><![CDATA[Part 2 of our series. Explore how Claude scales massive native search loops flawlessly utilizing offset pagination and embedded Ripgrep binaries.]]></summary></entry><entry><title type="html">Claude Code Architecture (Part 1): Performance Benchmarks &amp;amp; The Core Query Loop</title><link href="https://scrapingenthusiast.github.io/performance/claude-code-architecture-performance-loop/" rel="alternate" type="text/html" title="Claude Code Architecture (Part 1): Performance Benchmarks &amp;amp; The Core Query Loop" /><published>2026-04-06T10:00:00+07:00</published><updated>2026-04-06T10:00:00+07:00</updated><id>https://scrapingenthusiast.github.io/performance/claude-code-architecture-performance-loop</id><content type="html" xml:base="https://scrapingenthusiast.github.io/performance/claude-code-architecture-performance-loop/"><![CDATA[<p>Welcome to Part 1 of our Claude Code architecture series. Today, we look at Claude Code completely through the lens of <strong>Performance Optimization and Execution Contexts</strong>.</p>

<p>For AI performance enthusiasts, seeing how an LLM agent wraps around a native Node.js interface seamlessly is exhilarating. Let’s unbox the core architectural structure behind the blindingly fast experience it delivers in the terminal.</p>

<h2 id="architectural-bottlenecks--the-fast-path">Architectural Bottlenecks &amp; The Fast-Path</h2>

<p>A typical Node.js CLI tool pays a heavy startup cost for cascading imports. Claude tackles startup latency with heavily engineered <strong>Fast-Path Routing</strong> inside <code class="language-plaintext highlighter-rouge">src/entrypoints</code>.</p>

<p>Before any dependencies or large React/Ink components load, a trivial scan handles simple arguments. A <code class="language-plaintext highlighter-rouge">TCP</code> connection to the Anthropic API is opened early, so DNS resolution and connection setup run in parallel before the UI even renders. Background promises read configuration concurrently without blocking the shell's visual indicators.</p>

<h2 id="the-query-loop-streaming-performance">The Query Loop: Streaming Performance</h2>

<p>In the execution layer (<code class="language-plaintext highlighter-rouge">QueryEngine.ts</code>), the primary focus is reducing response latency. Perceived latency drops sharply because chunks are streamed as they arrive using <code class="language-plaintext highlighter-rouge">for await</code>. As the Anthropic API returns the response incrementally, Claude pipes each chunk straight into the virtual UI render state.</p>

<p>To further lower API friction, Claude checks mutating terminal actions against local Zod validation schemas before they run. Instead of waiting for a network round-trip to report a schema failure, the local check rejects malformed JSON generated by the LLM and triggers an internal retry.</p>
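<p>Zod is a TypeScript library, so the sketch below only mirrors the idea in plain Python: validate the model's JSON locally and return an error message that can be fed back to it, without touching the network. The <code class="language-plaintext highlighter-rouge">write_file</code> schema and the <code class="language-plaintext highlighter-rouge">validate_tool_call</code> helper are hypothetical and do not reflect Claude Code's real schemas.</p>

```python
import json

# Hypothetical schema for an illustrative "write_file" tool.
WRITE_FILE_SCHEMA = {"path": str, "content": str}


def validate_tool_call(raw_json, schema):
    """Return (args, None) when valid, or (None, error message)."""
    try:
        args = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(args, dict):
        return None, "expected a JSON object"
    for field, expected_type in schema.items():
        if field not in args:
            return None, f"missing field: {field}"
        if not isinstance(args[field], expected_type):
            return None, f"field {field} must be {expected_type.__name__}"
    return args, None
```

<p>Returning the error as data instead of raising makes it straightforward to hand the message back to the model as the result of the failed call.</p>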

<p>Performance doesn’t stop randomly at execution loops. In <strong>Part 2</strong>, we will address the scalability optimizations within Claude’s native <code class="language-plaintext highlighter-rouge">Agentic Search</code> engine, illustrating how it safely handles executing terminal binaries across localized directories infinitely.</p>]]></content><author><name>Scraping Enthusiast</name></author><category term="Performance" /><category term="Node.js" /><category term="Async Streams" /><category term="System Architecture" /><summary type="html"><![CDATA[Part 1 of our ScrapingEnthusiast deep dive into Claude Code architecture. We analyze the fast-path routing and latency benchmarks of the QueryEngine.]]></summary></entry><entry><title type="html">Scaling Scraping Operations: Bypassing WAFs with Network Identity and Infrastructure</title><link href="https://scrapingenthusiast.github.io/scaling-scraping-network-infrastructure/" rel="alternate" type="text/html" title="Scaling Scraping Operations: Bypassing WAFs with Network Identity and Infrastructure" /><published>2026-04-03T10:00:00+07:00</published><updated>2026-04-03T10:00:00+07:00</updated><id>https://scrapingenthusiast.github.io/scaling-scraping-network-infrastructure</id><content type="html" xml:base="https://scrapingenthusiast.github.io/scaling-scraping-network-infrastructure/"><![CDATA[<p>When building a web scraper to target platforms protected by Cloudflare, Akamai, or DataDome, bypassing the initial block is only half the battle. The true challenge lies in scaling that operation securely and reliably.</p>

<p>To successfully extract data at scale without triggering <strong>CAPTCHA</strong> loops or IP bans, you must meticulously manage your network identity and decouple your scraping infrastructure.</p>

<h2 id="1-establishing-network-integrity">1. Establishing Network Integrity</h2>

<p>The first line of defense for enterprise anti-bot solutions is IP reputation. Datacenter IP ranges (AWS, GCP, DigitalOcean) are commonly blacklisted or aggressively rate-limited.</p>

<ul>
  <li><strong>Residential Proxy Networks:</strong> The gold standard is utilizing <strong>Rotating Residential Proxies</strong> sourced from P2P networks. These IPs are assigned by ISPs to actual home users, making your traffic indistinguishable from organic user requests.</li>
  <li><strong>Strategy: Sticky Sessions vs. Rotation:</strong>
    <ul>
      <li>If your workflow requires authenticated login states, configure your proxy manager for <strong>Sticky Sessions</strong> to maintain the same IP.</li>
      <li>For stateless, high-volume scraping, rotate the IP on every single request to distribute the load and mitigate rate-limiting.</li>
    </ul>
  </li>
  <li><strong>Mobile Proxies (4G/5G):</strong> For the most restrictive targets, mobile proxies are unmatched. Because CGNAT (Carrier-Grade NAT) forces thousands of legitimate mobile users to share a single gateway IP, anti-bot systems are extremely hesitant to block them to avoid catastrophic false positives.</li>
</ul>
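<p>The difference between sticky sessions and per-request rotation can be sketched with a small pool class. This is an illustration under assumptions: many proxy providers handle both modes on their gateway, and the <code class="language-plaintext highlighter-rouge">ProxyPool</code> class and its method names are hypothetical.</p>

```python
import itertools


class ProxyPool:
    """Round-robin rotation with optional per-session pinning."""

    def __init__(self, proxies):
        self._cycle = itertools.cycle(proxies)
        self._sticky = {}

    def rotating(self):
        """Stateless scraping: a different proxy on each call."""
        return next(self._cycle)

    def sticky(self, session_id):
        """Authenticated flows: the same proxy for a given session."""
        if session_id not in self._sticky:
            self._sticky[session_id] = next(self._cycle)
        return self._sticky[session_id]
```

<p>A logged-in worker would call <code class="language-plaintext highlighter-rouge">sticky("account-42")</code> for every request in that session, while stateless workers call <code class="language-plaintext highlighter-rouge">rotating()</code> each time.</p>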

<h2 id="2-decoupled-infrastructure-and-scale">2. Decoupled Infrastructure and Scale</h2>

<p>Once you have established a trusted network identity, the focus shifts to hardware utilization and execution design.</p>

<ul>
  <li><strong>Headless Browser Management:</strong> Running 100+ instances of Playwright or Puppeteer is massively RAM-intensive. To scale, you must offload browser execution. Utilize services like <strong>Browserless.io</strong> or deploy a <strong>Dockerized Selenium Grid</strong>. This perfectly decouples your lightweight scraper logic from the heavy, resource-draining rendering engine.</li>
  <li><strong>Data Normalization at the Edge:</strong> Raw HTML is volatile and prone to sudden structural changes. Implement strict schema validation (e.g., using <strong>Pydantic</strong> in Python) immediately upon data extraction. If the target site alters its DOM, your parser should fail loudly rather than ingesting corrupted data into your database.</li>
  <li><strong>Storage and Deduplication Pipelines:</strong> Managing state across distributed scrapers is critical. Utilize <strong>Redis</strong> to manage a distributed “URL Frontier”—ensuring you aren’t scraping the same resource across multiple workers and wasting expensive proxy bandwidth. For long-term storage, <strong>PostgreSQL</strong> with JSONB fields provides the necessary flexibility for unstructured scraped data.</li>
</ul>
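<p>The deduplicating URL frontier mentioned above boils down to an atomic "claim" operation. The sketch below uses an in-memory set as a stand-in so it runs without a Redis server; the <code class="language-plaintext highlighter-rouge">UrlFrontier</code> class is hypothetical, and URL normalization is deliberately minimal.</p>

```python
import hashlib


class UrlFrontier:
    """In-memory stand-in for a shared set of already-claimed URLs."""

    def __init__(self):
        self._seen = set()

    def claim(self, url):
        """Return True only for the first caller that claims this URL."""
        # Hash the URL so keys have a fixed size regardless of URL length.
        key = hashlib.sha256(url.strip().encode("utf-8")).hexdigest()
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

<p>In a distributed setup the set would live in Redis, where adding a member to a set reports whether it was new. That gives every worker the same first-claim-wins behavior without extra locking.</p>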

<h2 id="3-play-by-the-rules">3. Play by the Rules</h2>

<p>High-performance scraping must respect the ecosystem:</p>
<ul>
  <li>Always verify the target’s <code class="language-plaintext highlighter-rouge">robots.txt</code> unless you possess a high-value, legally sound use case for bypassing it.</li>
  <li>Understand the Terms of Service (ToS); bypassing technical measures may hold legal implications.</li>
  <li>Implement human-like rate limiting. Your goal is to gather data, not execute a Denial of Service (DoS) attack against the target’s infrastructure.</li>
</ul>
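<p>Two of these rules are easy to automate with the Python standard library. The sketch below assumes the <code class="language-plaintext highlighter-rouge">robots.txt</code> body has already been downloaded; the helper names and the default delay values are illustrative choices, not recommendations for any particular site.</p>

```python
import random
import time
from urllib import robotparser


def build_robot_rules(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def polite_delay(base_seconds=2.0, jitter_seconds=1.5):
    """Sleep for a randomized interval and return the chosen delay."""
    delay = base_seconds + random.uniform(0, jitter_seconds)
    time.sleep(delay)
    return delay
```

<p>A worker would call <code class="language-plaintext highlighter-rouge">rules.can_fetch(user_agent, url)</code> before each request and <code class="language-plaintext highlighter-rouge">polite_delay()</code> between requests to the same host.</p>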

<p>By combining high-quality residential networks with a decoupled, containerized architecture, you can build data pipelines capable of reliably handling massive throughput against modern enterprise defenses. If you’d rather skip the infrastructure setup, our <a href="/tools/">ready-to-use scraping tools</a> for TikTok, Twitter, and more already handle proxy management, browser orchestration, and anti-bot evasion at scale.</p>]]></content><author><name>Novi Develop</name></author><category term="Infrastructure" /><category term="Data Engineering" /><category term="Scraping" /><category term="proxies" /><category term="docker" /><category term="scaling" /><category term="python" /><category term="architecture" /><summary type="html"><![CDATA[When building a web scraper to target platforms protected by Cloudflare, Akamai, or DataDome, bypassing the initial block is only half the battle. The true challenge lies in scaling that operation securely and reliably.]]></summary></entry><entry><title type="html">The Ecosystem Impact of the Claude Code Leak: Vectors, Moats, and Ethics</title><link href="https://scrapingenthusiast.github.io/news/ecosystem-impact-claude-code-leak/" rel="alternate" type="text/html" title="The Ecosystem Impact of the Claude Code Leak: Vectors, Moats, and Ethics" /><published>2026-04-03T09:00:00+07:00</published><updated>2026-04-03T09:00:00+07:00</updated><id>https://scrapingenthusiast.github.io/news/ecosystem-impact-claude-code-leak</id><content type="html" xml:base="https://scrapingenthusiast.github.io/news/ecosystem-impact-claude-code-leak/"><![CDATA[<p>Following the infamous “Great Claude Code Leak” of March 2026, the tech industry is reeling from the implications of having state-of-the-art AI orchestration logic exposed into the wild. After uncovering the technical oversight—an accidentally packaged source map—we must examine how this event alters the programming ecosystem and cybersecurity landscape.</p>

<h2 id="1-the-erosion-of-competitive-moats">1. The Erosion of Competitive “Moats”</h2>

<p>For AI startups, this leak effectively commoditized Anthropic’s specific orchestration logic. Competitors across the ecosystem now have a detailed blueprint for complex LLM deployment:</p>
<ul>
  <li><strong>Dynamic Boundary Prompting:</strong> The leak exposed exactly how to structure agent system prompts to minimize hallucinations specifically during file-writing tasks.</li>
  <li><strong>Error Recovery Loops:</strong> We can now see the exact logic used to retry failed bash commands or to step around <strong>Permission Denied</strong> errors autonomously, a previously heavily-guarded trade secret.</li>
</ul>

<h2 id="2-supply-chain-security-risks">2. Supply Chain Security Risks</h2>

<p>Predictably, the leak triggered an immediate surge in malicious activity. Threat actors quickly published “cracked” or “unlocked” versions of Claude Code on GitHub and npm.</p>

<p>Many of these unofficial forks contain <strong>obfuscated malware</strong>—specifically <strong>Remote Access Trojans (RATs)</strong>—designed to systematically exfiltrate developers’ <code class="language-plaintext highlighter-rouge">.env</code> files and AWS credentials during “agentic” runs.</p>

<blockquote>
  <p><strong>Technical Advisory:</strong> Engineering teams should audit local environments and instantly block unauthorized npm scopes. Running leaked or unofficial AI agent binaries presently poses a Tier-1 risk to corporate intellectual property and security.</p>
</blockquote>

<h3 id="devsecops-lessons">DevSecOps Lessons</h3>

<p>This event serves as a brutal case study for <strong>Artifact Scanning</strong> in the CI/CD pipeline.</p>
<ul>
  <li><strong>Validation:</strong> DevOps teams must implement automated checks (e.g., using robust GitHub Actions) to strictly ensure no <code class="language-plaintext highlighter-rouge">.map</code> or <code class="language-plaintext highlighter-rouge">.ts</code> files are present in production distribution folders.</li>
  <li><strong>Runtime Auditing:</strong> The failure of the Bun bundler in this incident emphasizes that developers cannot rely solely on tool flags; explicit binary and package inspection must become an uncompromising part of the release lifecycle.</li>
</ul>
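<p>A validation step of this kind can be a short script that runs before publishing. The sketch below is one possible shape, assuming the build output lives in a single directory; the function name <code class="language-plaintext highlighter-rouge">find_forbidden_files</code> is hypothetical, and the suffix list simply mirrors the two extensions named above.</p>

```python
from pathlib import Path

FORBIDDEN_SUFFIXES = (".map", ".ts")


def find_forbidden_files(dist_dir):
    """Return sorted paths under dist_dir that should not be shipped."""
    root = Path(dist_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.name.endswith(FORBIDDEN_SUFFIXES)
    )
```

<p>A CI job would fail the build whenever the returned list is non-empty. Note that this simple suffix check also flags <code class="language-plaintext highlighter-rouge">.d.ts</code> declaration files, which some packages ship on purpose, so a real pipeline may need an allowlist.</p>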

<h2 id="3-ethical-and-legal-boundaries">3. Ethical and Legal Boundaries</h2>

<p>The leak has sparked an intense, industry-wide debate regarding <strong>AI Transparency vs. Intellectual Property</strong>. While the foundation model <strong>Weights</strong> remain entirely secure behind Anthropic’s private API, the unprecedented exposure of the “System Instructions” (Prompts) highlights the vanishing line between a commercial product and its configuration.</p>

<p>Moving forward, the industry is highly likely to see an accelerated shift toward <strong>Model Context Protocol (MCP)</strong> standardization. As Anthropic’s leaked code clearly demonstrated, maintaining proprietary, highly-fragmented tool-calling logic is evolving from a competitive advantage into a severe maintenance burden and security liability. In the data extraction space, this same principle holds — standardized, well-audited <a href="/tools/">scraping tools</a> are replacing fragile bespoke scripts, lowering supply-chain risks while improving reliability.</p>

<p>The overarching takeaway is clear: In the age of AI, the “Agentic Logic”—how an AI thinks and uses tools—is just as valuable as the neural model underlying it.</p>]]></content><author><name>Novi Develop</name></author><category term="DevSecOps" /><category term="AI" /><category term="Tech News" /><category term="cybersecurity" /><category term="ai-ethics" /><category term="supply-chain" /><category term="anthropic" /><summary type="html"><![CDATA[Following the infamous “Great Claude Code Leak” of March 2026, the tech industry is reeling from the implications of having state-of-the-art AI orchestration logic exposed into the wild. After uncovering the technical oversight—an accidentally packaged source map—we must examine how this event alters the programming ecosystem and cybersecurity landscape.]]></summary></entry><entry><title type="html">TikTok to YouTube Pipeline Part 4: Scheduling &amp;amp; Publishing to YouTube</title><link href="https://scrapingenthusiast.github.io/tiktok/tiktok-youtube-part4-scheduling-publishing/" rel="alternate" type="text/html" title="TikTok to YouTube Pipeline Part 4: Scheduling &amp;amp; Publishing to YouTube" /><published>2026-03-27T08:00:00+07:00</published><updated>2026-03-27T08:00:00+07:00</updated><id>https://scrapingenthusiast.github.io/tiktok/tiktok-youtube-part4-scheduling-publishing</id><content type="html" xml:base="https://scrapingenthusiast.github.io/tiktok/tiktok-youtube-part4-scheduling-publishing/"><![CDATA[<p>This is the <strong>final part</strong> of our series on building an automated YouTube news channel from TikTok data. We’ve searched TikTok (<a href="/tiktok/tiktok-youtube-part1-searching-api/">Part 1</a>), downloaded videos (<a href="/tiktok/tiktok-youtube-part2-downloading-videos/">Part 2</a>), and generated AI narration (<a href="/tiktok/tiktok-youtube-part3-ai-narration/">Part 3</a>). 
Now we’ll tie it all together: <strong>uploading to YouTube</strong> and <strong>scheduling the pipeline to run automatically</strong>.</p>

<blockquote>
  <p><strong>Series Navigation</strong></p>
  <ul>
    <li><a href="/tiktok/tiktok-youtube-part1-searching-api/">Part 1: Searching TikTok with the Apify Python Client</a></li>
    <li><a href="/tiktok/tiktok-youtube-part2-downloading-videos/">Part 2: Downloading &amp; Organizing Video Files</a></li>
    <li><a href="/tiktok/tiktok-youtube-part3-ai-narration/">Part 3: Auto-Generating News Scripts with AI</a></li>
    <li><strong>Part 4: Scheduling &amp; Publishing to YouTube</strong> ← You are here</li>
  </ul>
</blockquote>

<h2 id="prerequisites">Prerequisites</h2>

<ul>
  <li>Compiled video segments from Part 3</li>
  <li>A Google Cloud project with YouTube Data API v3 enabled</li>
  <li>OAuth2 credentials (<code class="language-plaintext highlighter-rouge">client_secret.json</code>)</li>
</ul>

<h2 id="step-1-install-dependencies">Step 1: Install Dependencies</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>google-auth google-auth-oauthlib google-api-python-client
</code></pre></div></div>

<h2 id="step-2-youtube-oauth2-authentication">Step 2: YouTube OAuth2 Authentication</h2>

<p>Create <code class="language-plaintext highlighter-rouge">youtube_auth.py</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># youtube_auth.py
</span>
<span class="kn">import</span> <span class="n">os</span>
<span class="kn">import</span> <span class="n">pickle</span>
<span class="kn">from</span> <span class="n">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="kn">from</span> <span class="n">google.auth.transport.requests</span> <span class="kn">import</span> <span class="n">Request</span>
<span class="kn">from</span> <span class="n">google_auth_oauthlib.flow</span> <span class="kn">import</span> <span class="n">InstalledAppFlow</span>
<span class="kn">from</span> <span class="n">googleapiclient.discovery</span> <span class="kn">import</span> <span class="n">build</span>

<span class="n">SCOPES</span> <span class="o">=</span> <span class="p">[</span><span class="sh">"</span><span class="s">https://www.googleapis.com/auth/youtube.upload</span><span class="sh">"</span><span class="p">]</span>
<span class="n">TOKEN_PATH</span> <span class="o">=</span> <span class="nc">Path</span><span class="p">(</span><span class="sh">"</span><span class="s">credentials/youtube_token.pickle</span><span class="sh">"</span><span class="p">)</span>
<span class="n">CLIENT_SECRET_PATH</span> <span class="o">=</span> <span class="nc">Path</span><span class="p">(</span><span class="sh">"</span><span class="s">credentials/client_secret.json</span><span class="sh">"</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">get_authenticated_service</span><span class="p">():</span>
    <span class="sh">"""</span><span class="s">Authenticate and return a YouTube API service object.</span><span class="sh">"""</span>
    <span class="n">credentials</span> <span class="o">=</span> <span class="bp">None</span>

    <span class="k">if</span> <span class="n">TOKEN_PATH</span><span class="p">.</span><span class="nf">exists</span><span class="p">():</span>
        <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="n">TOKEN_PATH</span><span class="p">,</span> <span class="sh">"</span><span class="s">rb</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">token</span><span class="p">:</span>
            <span class="n">credentials</span> <span class="o">=</span> <span class="n">pickle</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="n">token</span><span class="p">)</span>

    <span class="k">if</span> <span class="ow">not</span> <span class="n">credentials</span> <span class="ow">or</span> <span class="ow">not</span> <span class="n">credentials</span><span class="p">.</span><span class="n">valid</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">credentials</span> <span class="ow">and</span> <span class="n">credentials</span><span class="p">.</span><span class="n">expired</span> <span class="ow">and</span> <span class="n">credentials</span><span class="p">.</span><span class="n">refresh_token</span><span class="p">:</span>
            <span class="n">credentials</span><span class="p">.</span><span class="nf">refresh</span><span class="p">(</span><span class="nc">Request</span><span class="p">())</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">flow</span> <span class="o">=</span> <span class="n">InstalledAppFlow</span><span class="p">.</span><span class="nf">from_client_secrets_file</span><span class="p">(</span>
                <span class="nf">str</span><span class="p">(</span><span class="n">CLIENT_SECRET_PATH</span><span class="p">),</span> <span class="n">SCOPES</span>
            <span class="p">)</span>
            <span class="n">credentials</span> <span class="o">=</span> <span class="n">flow</span><span class="p">.</span><span class="nf">run_local_server</span><span class="p">(</span><span class="n">port</span><span class="o">=</span><span class="mi">8090</span><span class="p">)</span>

        <span class="n">TOKEN_PATH</span><span class="p">.</span><span class="n">parent</span><span class="p">.</span><span class="nf">mkdir</span><span class="p">(</span><span class="n">parents</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="n">TOKEN_PATH</span><span class="p">,</span> <span class="sh">"</span><span class="s">wb</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">token</span><span class="p">:</span>
            <span class="n">pickle</span><span class="p">.</span><span class="nf">dump</span><span class="p">(</span><span class="n">credentials</span><span class="p">,</span> <span class="n">token</span><span class="p">)</span>

    <span class="k">return</span> <span class="nf">build</span><span class="p">(</span><span class="sh">"</span><span class="s">youtube</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">v3</span><span class="sh">"</span><span class="p">,</span> <span class="n">credentials</span><span class="o">=</span><span class="n">credentials</span><span class="p">)</span>
</code></pre></div></div>

<p>The first time you run this, a browser window will open for Google OAuth consent. After that, the token is cached locally.</p>

<h2 id="step-3-upload-videos-to-youtube">Step 3: Upload Videos to YouTube</h2>

<p>Create <code class="language-plaintext highlighter-rouge">upload_youtube.py</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># upload_youtube.py
</span>
<span class="kn">import</span> <span class="n">sys</span>
<span class="kn">from</span> <span class="n">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>
<span class="kn">from</span> <span class="n">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="kn">from</span> <span class="n">googleapiclient.http</span> <span class="kn">import</span> <span class="n">MediaFileUpload</span>

<span class="kn">from</span> <span class="n">youtube_auth</span> <span class="kn">import</span> <span class="n">get_authenticated_service</span>

<span class="n">OUTPUT_DIR</span> <span class="o">=</span> <span class="nc">Path</span><span class="p">(</span><span class="sh">"</span><span class="s">data/output</span><span class="sh">"</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">generate_metadata</span><span class="p">(</span><span class="n">keyword</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Generate title, description, and tags for a YouTube upload.</span><span class="sh">"""</span>
    <span class="n">topic</span> <span class="o">=</span> <span class="n">keyword</span><span class="p">.</span><span class="nf">replace</span><span class="p">(</span><span class="sh">"</span><span class="s">_</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s"> </span><span class="sh">"</span><span class="p">).</span><span class="nf">title</span><span class="p">()</span>
    <span class="n">date_str</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">.</span><span class="nf">now</span><span class="p">().</span><span class="nf">strftime</span><span class="p">(</span><span class="sh">"</span><span class="s">%B %d, %Y</span><span class="sh">"</span><span class="p">)</span>

    <span class="k">return</span> <span class="p">{</span>
        <span class="sh">"</span><span class="s">title</span><span class="sh">"</span><span class="p">:</span> <span class="sa">f</span><span class="sh">"</span><span class="s">🔴 </span><span class="si">{</span><span class="n">topic</span><span class="si">}</span><span class="s"> — Latest Updates | </span><span class="si">{</span><span class="n">date_str</span><span class="si">}</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">description</span><span class="sh">"</span><span class="p">:</span> <span class="p">(</span>
            <span class="sa">f</span><span class="sh">"</span><span class="s">Today</span><span class="sh">'</span><span class="s">s coverage of </span><span class="si">{</span><span class="n">topic</span><span class="si">}</span><span class="s">. </span><span class="sh">"</span>
            <span class="sa">f</span><span class="sh">"</span><span class="s">This segment compiles the latest on-the-ground footage </span><span class="sh">"</span>
            <span class="sa">f</span><span class="sh">"</span><span class="s">and verified reports.</span><span class="se">\n\n</span><span class="sh">"</span>
            <span class="sa">f</span><span class="sh">"</span><span class="s">⚠️ Some footage may be graphic. Viewer discretion advised.</span><span class="se">\n\n</span><span class="sh">"</span>
            <span class="sa">f</span><span class="sh">"</span><span class="s">📌 Subscribe for daily global news updates.</span><span class="se">\n</span><span class="sh">"</span>
            <span class="sa">f</span><span class="sh">"</span><span class="s">🔔 Turn on notifications to never miss breaking news.</span><span class="se">\n\n</span><span class="sh">"</span>
            <span class="sa">f</span><span class="sh">"</span><span class="s">#GlobalNews #</span><span class="si">{</span><span class="n">keyword</span><span class="p">.</span><span class="nf">replace</span><span class="p">(</span><span class="sh">'</span><span class="s">_</span><span class="sh">'</span><span class="p">,</span> <span class="sh">''</span><span class="p">)</span><span class="si">}</span><span class="s"> #BreakingNews</span><span class="sh">"</span>
        <span class="p">),</span>
        <span class="sh">"</span><span class="s">tags</span><span class="sh">"</span><span class="p">:</span> <span class="p">[</span>
            <span class="n">topic</span><span class="p">,</span> <span class="sh">"</span><span class="s">breaking news</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">global news</span><span class="sh">"</span><span class="p">,</span>
            <span class="n">keyword</span><span class="p">.</span><span class="nf">replace</span><span class="p">(</span><span class="sh">"</span><span class="s">_</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s"> </span><span class="sh">"</span><span class="p">),</span> <span class="sh">"</span><span class="s">world news</span><span class="sh">"</span><span class="p">,</span>
            <span class="sh">"</span><span class="s">current events</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">news today</span><span class="sh">"</span><span class="p">,</span>
        <span class="p">],</span>
    <span class="p">}</span>


<span class="k">def</span> <span class="nf">upload_video</span><span class="p">(</span>
    <span class="n">youtube</span><span class="p">,</span>
    <span class="n">video_path</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">title</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">description</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">tags</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span>
    <span class="n">category_id</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="sh">"</span><span class="s">25</span><span class="sh">"</span><span class="p">,</span>  <span class="c1"># News &amp; Politics
</span>    <span class="n">privacy</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="sh">"</span><span class="s">private</span><span class="sh">"</span><span class="p">,</span>
<span class="p">):</span>
    <span class="sh">"""</span><span class="s">Upload a single video to YouTube.</span><span class="sh">"""</span>
    <span class="n">body</span> <span class="o">=</span> <span class="p">{</span>
        <span class="sh">"</span><span class="s">snippet</span><span class="sh">"</span><span class="p">:</span> <span class="p">{</span>
            <span class="sh">"</span><span class="s">title</span><span class="sh">"</span><span class="p">:</span> <span class="n">title</span><span class="p">,</span>
            <span class="sh">"</span><span class="s">description</span><span class="sh">"</span><span class="p">:</span> <span class="n">description</span><span class="p">,</span>
            <span class="sh">"</span><span class="s">tags</span><span class="sh">"</span><span class="p">:</span> <span class="n">tags</span><span class="p">,</span>
            <span class="sh">"</span><span class="s">categoryId</span><span class="sh">"</span><span class="p">:</span> <span class="n">category_id</span><span class="p">,</span>
        <span class="p">},</span>
        <span class="sh">"</span><span class="s">status</span><span class="sh">"</span><span class="p">:</span> <span class="p">{</span>
            <span class="sh">"</span><span class="s">privacyStatus</span><span class="sh">"</span><span class="p">:</span> <span class="n">privacy</span><span class="p">,</span>
            <span class="sh">"</span><span class="s">selfDeclaredMadeForKids</span><span class="sh">"</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span>
        <span class="p">},</span>
    <span class="p">}</span>

    <span class="n">media</span> <span class="o">=</span> <span class="nc">MediaFileUpload</span><span class="p">(</span>
        <span class="n">video_path</span><span class="p">,</span>
        <span class="n">mimetype</span><span class="o">=</span><span class="sh">"</span><span class="s">video/mp4</span><span class="sh">"</span><span class="p">,</span>
        <span class="n">resumable</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="n">chunksize</span><span class="o">=</span><span class="mi">1024</span> <span class="o">*</span> <span class="mi">1024</span> <span class="o">*</span> <span class="mi">10</span><span class="p">,</span>  <span class="c1"># 10 MB chunks
</span>    <span class="p">)</span>

    <span class="n">request</span> <span class="o">=</span> <span class="n">youtube</span><span class="p">.</span><span class="nf">videos</span><span class="p">().</span><span class="nf">insert</span><span class="p">(</span>
        <span class="n">part</span><span class="o">=</span><span class="sh">"</span><span class="s">snippet,status</span><span class="sh">"</span><span class="p">,</span>
        <span class="n">body</span><span class="o">=</span><span class="n">body</span><span class="p">,</span>
        <span class="n">media_body</span><span class="o">=</span><span class="n">media</span><span class="p">,</span>
    <span class="p">)</span>

    <span class="n">response</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="k">while</span> <span class="n">response</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">status</span><span class="p">,</span> <span class="n">response</span> <span class="o">=</span> <span class="n">request</span><span class="p">.</span><span class="nf">next_chunk</span><span class="p">()</span>
        <span class="k">if</span> <span class="n">status</span><span class="p">:</span>
            <span class="n">pct</span> <span class="o">=</span> <span class="nf">int</span><span class="p">(</span><span class="n">status</span><span class="p">.</span><span class="nf">progress</span><span class="p">()</span> <span class="o">*</span> <span class="mi">100</span><span class="p">)</span>
            <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">   📤 Uploading... </span><span class="si">{</span><span class="n">pct</span><span class="si">}</span><span class="s">%</span><span class="sh">"</span><span class="p">)</span>

    <span class="n">video_id</span> <span class="o">=</span> <span class="n">response</span><span class="p">[</span><span class="sh">"</span><span class="s">id</span><span class="sh">"</span><span class="p">]</span>
    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">   ✅ Uploaded: https://youtu.be/</span><span class="si">{</span><span class="n">video_id</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">video_id</span>


<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">youtube</span> <span class="o">=</span> <span class="nf">get_authenticated_service</span><span class="p">()</span>

    <span class="n">video_files</span> <span class="o">=</span> <span class="nf">sorted</span><span class="p">(</span><span class="n">OUTPUT_DIR</span><span class="p">.</span><span class="nf">glob</span><span class="p">(</span><span class="sh">"</span><span class="s">*_news_segment.mp4</span><span class="sh">"</span><span class="p">))</span>

    <span class="k">if</span> <span class="ow">not</span> <span class="n">video_files</span><span class="p">:</span>
        <span class="nf">print</span><span class="p">(</span><span class="sh">"</span><span class="s">❌ No compiled segments found in data/output/</span><span class="sh">"</span><span class="p">)</span>
        <span class="n">sys</span><span class="p">.</span><span class="nf">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>

    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">📋 Found </span><span class="si">{</span><span class="nf">len</span><span class="p">(</span><span class="n">video_files</span><span class="p">)</span><span class="si">}</span><span class="s"> segment(s) to upload.</span><span class="se">\n</span><span class="sh">"</span><span class="p">)</span>

    <span class="k">for</span> <span class="n">video_path</span> <span class="ow">in</span> <span class="n">video_files</span><span class="p">:</span>
        <span class="n">keyword</span> <span class="o">=</span> <span class="n">video_path</span><span class="p">.</span><span class="n">stem</span><span class="p">.</span><span class="nf">replace</span><span class="p">(</span><span class="sh">"</span><span class="s">_news_segment</span><span class="sh">"</span><span class="p">,</span> <span class="sh">""</span><span class="p">)</span>
        <span class="n">meta</span> <span class="o">=</span> <span class="nf">generate_metadata</span><span class="p">(</span><span class="n">keyword</span><span class="p">)</span>

        <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">🎬 Uploading: </span><span class="si">{</span><span class="n">meta</span><span class="p">[</span><span class="sh">'</span><span class="s">title</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>

        <span class="nf">upload_video</span><span class="p">(</span>
            <span class="n">youtube</span><span class="p">,</span>
            <span class="nf">str</span><span class="p">(</span><span class="n">video_path</span><span class="p">),</span>
            <span class="n">title</span><span class="o">=</span><span class="n">meta</span><span class="p">[</span><span class="sh">"</span><span class="s">title</span><span class="sh">"</span><span class="p">],</span>
            <span class="n">description</span><span class="o">=</span><span class="n">meta</span><span class="p">[</span><span class="sh">"</span><span class="s">description</span><span class="sh">"</span><span class="p">],</span>
            <span class="n">tags</span><span class="o">=</span><span class="n">meta</span><span class="p">[</span><span class="sh">"</span><span class="s">tags</span><span class="sh">"</span><span class="p">],</span>
            <span class="n">privacy</span><span class="o">=</span><span class="sh">"</span><span class="s">private</span><span class="sh">"</span><span class="p">,</span>  <span class="c1"># Change to "public" when ready
</span>        <span class="p">)</span>

        <span class="nf">print</span><span class="p">()</span>

    <span class="nf">print</span><span class="p">(</span><span class="sh">"</span><span class="s">🎉 All uploads complete!</span><span class="sh">"</span><span class="p">)</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="sh">"</span><span class="s">__main__</span><span class="sh">"</span><span class="p">:</span>
    <span class="nf">main</span><span class="p">()</span>
</code></pre></div></div>
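
<p>Because <code class="language-plaintext highlighter-rouge">generate_metadata</code> builds the title from the keyword and the date, a long keyword can push it past YouTube's 100-character title limit, and the API also caps the combined length of all tags at 500 characters. The sketch below shows one way to trim metadata before uploading. <code class="language-plaintext highlighter-rouge">clamp_metadata</code> and both constants are hypothetical names that are not part of the original script, and the tag accounting is deliberately conservative.</p>

```python
# Hypothetical helper (not in the original tutorial): trims the dict returned
# by generate_metadata() so it fits YouTube's documented length limits.

MAX_TITLE_CHARS = 100  # maximum title length accepted by YouTube
MAX_TAGS_CHARS = 500   # combined character budget for all tags


def clamp_metadata(meta: dict) -> dict:
    """Return a copy of meta with the title and tags trimmed to fit."""
    title = meta["title"]
    if len(title) > MAX_TITLE_CHARS:
        # Leave room for a single ellipsis character at the end.
        title = title[: MAX_TITLE_CHARS - 1].rstrip() + "…"

    tags, used = [], 0
    for tag in meta["tags"]:
        # Conservative estimate: tags with spaces are counted as if quoted,
        # and every tag is charged one extra character for a separator.
        cost = len(tag) + (2 if " " in tag else 0) + 1
        if used + cost > MAX_TAGS_CHARS:
            break
        tags.append(tag)
        used += cost

    return {**meta, "title": title, "tags": tags}
```

<p>In <code class="language-plaintext highlighter-rouge">main()</code>, this could wrap the existing call as <code class="language-plaintext highlighter-rouge">meta = clamp_metadata(generate_metadata(keyword))</code>.</p>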

<blockquote>
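  <p><strong>Tip:</strong> Start with <code class="language-plaintext highlighter-rouge">privacy="private"</code> to review your uploads before making them public. Note that YouTube restricts uploads from API projects that have not passed its compliance audit to private, so switching to <code class="language-plaintext highlighter-rouge">"public"</code> may have no effect until the project is verified.</p>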
</blockquote>
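
<p>The upload loop in <code class="language-plaintext highlighter-rouge">upload_video</code> calls <code class="language-plaintext highlighter-rouge">next_chunk()</code> without any error handling, so a single dropped connection aborts the whole upload. Below is a minimal sketch of a retry wrapper with exponential backoff. <code class="language-plaintext highlighter-rouge">upload_with_retry</code> and its parameters are hypothetical names, not part of the original script. It only assumes that the request object exposes <code class="language-plaintext highlighter-rouge">next_chunk()</code> returning <code class="language-plaintext highlighter-rouge">(status, response)</code>. In practice you would probably also add <code class="language-plaintext highlighter-rouge">googleapiclient.errors.HttpError</code> to the retriable exceptions after checking for a 5xx status.</p>

```python
import random
import time


def upload_with_retry(
    request,
    max_retries: int = 5,
    base_delay: float = 1.0,
    retriable=(ConnectionError, TimeoutError),
    sleep=time.sleep,
):
    """Drive a resumable upload to completion, retrying transient failures.

    `request` is any object with a next_chunk() method that returns
    (status, response), such as the one built in upload_video().
    """
    response = None
    attempt = 0
    while response is None:
        try:
            status, response = request.next_chunk()
            attempt = 0  # a successful chunk resets the retry budget
            if status:
                print(f"   Uploading... {int(status.progress() * 100)}%")
        except retriable:
            attempt += 1
            if attempt > max_retries:
                raise  # give up and surface the last error
            # Exponential backoff with jitter: roughly 1s, 2s, 4s, ...
            sleep(base_delay * 2 ** (attempt - 1) + random.random())
    return response
```

<p>With this in place, the <code class="language-plaintext highlighter-rouge">while response is None</code> loop could be replaced by <code class="language-plaintext highlighter-rouge">response = upload_with_retry(request)</code>. Resetting the counter after each successful chunk means a long upload tolerates several separate hiccups rather than five in total.</p>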

<h2 id="step-4-the-master-pipeline-script">Step 4: The Master Pipeline Script</h2>

<p>Create <code class="language-plaintext highlighter-rouge">pipeline.py</code> to orchestrate the entire workflow in a single command:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># pipeline.py
</span>
<span class="kn">import</span> <span class="n">subprocess</span>
<span class="kn">import</span> <span class="n">sys</span>
<span class="kn">from</span> <span class="n">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>


<span class="k">def</span> <span class="nf">run_step</span><span class="p">(</span><span class="n">script</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">args</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">Run a pipeline step and exit on failure.</span><span class="sh">"""</span>
    <span class="n">cmd</span> <span class="o">=</span> <span class="p">[</span><span class="n">sys</span><span class="p">.</span><span class="n">executable</span><span class="p">,</span> <span class="n">script</span><span class="p">]</span> <span class="o">+</span> <span class="p">(</span><span class="n">args</span> <span class="ow">or</span> <span class="p">[])</span>
    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="se">\n</span><span class="si">{</span><span class="sh">'</span><span class="s">=</span><span class="sh">'</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">▶️  Running: </span><span class="si">{</span><span class="sh">'</span><span class="s"> </span><span class="sh">'</span><span class="p">.</span><span class="nf">join</span><span class="p">(</span><span class="n">cmd</span><span class="p">)</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="si">{</span><span class="sh">'</span><span class="s">=</span><span class="sh">'</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="se">\n</span><span class="sh">"</span><span class="p">)</span>

    <span class="n">result</span> <span class="o">=</span> <span class="n">subprocess</span><span class="p">.</span><span class="nf">run</span><span class="p">(</span><span class="n">cmd</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">result</span><span class="p">.</span><span class="n">returncode</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">:</span>
        <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="se">\n</span><span class="s">❌ Step failed: </span><span class="si">{</span><span class="n">script</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
        <span class="n">sys</span><span class="p">.</span><span class="nf">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">find_latest_results</span><span class="p">()</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Find the most recent search results JSON file.</span><span class="sh">"""</span>
    <span class="n">results</span> <span class="o">=</span> <span class="nf">sorted</span><span class="p">(</span><span class="nc">Path</span><span class="p">(</span><span class="sh">"</span><span class="s">data/raw</span><span class="sh">"</span><span class="p">).</span><span class="nf">glob</span><span class="p">(</span><span class="sh">"</span><span class="s">tiktok_results_*.json</span><span class="sh">"</span><span class="p">))</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">results</span><span class="p">:</span>
        <span class="k">raise</span> <span class="nc">FileNotFoundError</span><span class="p">(</span><span class="sh">"</span><span class="s">No results found in data/raw/</span><span class="sh">"</span><span class="p">)</span>
    <span class="k">return</span> <span class="nf">str</span><span class="p">(</span><span class="n">results</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>


<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="nf">print</span><span class="p">(</span><span class="sh">"</span><span class="s">🚀 Starting TikTok → YouTube Pipeline</span><span class="se">\n</span><span class="sh">"</span><span class="p">)</span>

    <span class="c1"># Step 1: Search TikTok
</span>    <span class="nf">run_step</span><span class="p">(</span><span class="sh">"</span><span class="s">search_tiktok.py</span><span class="sh">"</span><span class="p">)</span>

    <span class="c1"># Step 2: Download videos
</span>    <span class="n">latest_json</span> <span class="o">=</span> <span class="nf">find_latest_results</span><span class="p">()</span>
    <span class="nf">run_step</span><span class="p">(</span><span class="sh">"</span><span class="s">download_videos.py</span><span class="sh">"</span><span class="p">,</span> <span class="p">[</span><span class="n">latest_json</span><span class="p">])</span>

    <span class="c1"># Step 3: Generate scripts
</span>    <span class="nf">run_step</span><span class="p">(</span><span class="sh">"</span><span class="s">generate_script.py</span><span class="sh">"</span><span class="p">)</span>

    <span class="c1"># Step 4: Generate TTS audio
</span>    <span class="nf">run_step</span><span class="p">(</span><span class="sh">"</span><span class="s">generate_tts.py</span><span class="sh">"</span><span class="p">)</span>

    <span class="c1"># Step 5: Compile videos
</span>    <span class="nf">run_step</span><span class="p">(</span><span class="sh">"</span><span class="s">compile_video.py</span><span class="sh">"</span><span class="p">)</span>

    <span class="c1"># Step 6: Upload to YouTube
</span>    <span class="nf">run_step</span><span class="p">(</span><span class="sh">"</span><span class="s">upload_youtube.py</span><span class="sh">"</span><span class="p">)</span>

    <span class="nf">print</span><span class="p">(</span><span class="sh">"</span><span class="se">\n</span><span class="s">🎉 Pipeline complete! Check your YouTube Studio.</span><span class="sh">"</span><span class="p">)</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="sh">"</span><span class="s">__main__</span><span class="sh">"</span><span class="p">:</span>
    <span class="nf">main</span><span class="p">()</span>
</code></pre></div></div>

<p>Run the entire pipeline:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python pipeline.py
</code></pre></div></div>
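
<p>Note that <code class="language-plaintext highlighter-rouge">find_latest_results</code> sorts by filename, so it only returns the newest file if the names sort chronologically (for example, a zero-padded timestamp suffix). If that is not guaranteed, a variant that compares modification times is a safer choice. The function name <code class="language-plaintext highlighter-rouge">find_latest_by_mtime</code> and its parameters are illustrative, not part of the original script.</p>

```python
from pathlib import Path


def find_latest_by_mtime(
    directory: str = "data/raw",
    pattern: str = "tiktok_results_*.json",
) -> str:
    """Return the most recently modified file in directory matching pattern."""
    candidates = list(Path(directory).glob(pattern))
    if not candidates:
        raise FileNotFoundError(f"No files matching {pattern} in {directory}/")
    # Pick by modification time instead of relying on filename order.
    return str(max(candidates, key=lambda p: p.stat().st_mtime))
```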

<h2 id="step-5-schedule-with-apify">Step 5: Schedule with Apify</h2>

<p>To run this pipeline automatically, you have two options:</p>

<h3 id="option-a-apify-scheduler--webhooks">Option A: Apify Scheduler + Webhooks</h3>

<ol>
  <li>Go to the <a href="https://apify.com/novi/advanced-search-tiktok-api?fpr=7hce1m">Advanced TikTok Search API Actor</a> page on <a href="https://apify.com?fpr=7hce1m">Apify</a>.</li>
  <li>Click <strong>Schedule</strong> and set it to run every 12 hours.</li>
  <li>Add a <strong>Webhook</strong> that triggers on run completion, pointing to your server endpoint.</li>
  <li>Your server receives the webhook, fetches the dataset, and runs steps 2–6.</li>
</ol>
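
<p>The server endpoint from steps 3 and 4 can be sketched with the standard library alone. This is a minimal illustration, assuming Apify's default webhook payload, in which <code class="language-plaintext highlighter-rouge">eventType</code> names the event and <code class="language-plaintext highlighter-rouge">resource.defaultDatasetId</code> identifies the run's dataset. Check the payload template configured on your webhook before relying on these fields. <code class="language-plaintext highlighter-rouge">extract_dataset_id</code>, <code class="language-plaintext highlighter-rouge">WebhookHandler</code>, and <code class="language-plaintext highlighter-rouge">serve</code> are hypothetical names.</p>

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def extract_dataset_id(raw_body: bytes):
    """Return the dataset ID from a succeeded-run webhook body, else None."""
    try:
        payload = json.loads(raw_body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return None
    if not isinstance(payload, dict):
        return None
    # Assumed field names, based on Apify's default payload template.
    if payload.get("eventType") != "ACTOR.RUN.SUCCEEDED":
        return None
    return (payload.get("resource") or {}).get("defaultDatasetId")


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        dataset_id = extract_dataset_id(self.rfile.read(length))
        if dataset_id is None:
            self.send_response(400)
        else:
            # Placeholder: fetch the dataset and start steps 2-6 here.
            print(f"Run succeeded, dataset: {dataset_id}")
            self.send_response(200)
        self.end_headers()


def serve(port: int = 8000):
    """Block forever, handling webhook POSTs on the given port."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

<p>Calling <code class="language-plaintext highlighter-rouge">serve()</code> starts the listener. A production endpoint should also authenticate the caller (for example, with a secret in the URL or a header) and hand the long-running steps to a background worker so the webhook response returns quickly.</p>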

<h3 id="option-b-cron-job-self-hosted">Option B: Cron Job (Self-Hosted)</h3>

<p>For a simpler setup, add a cron job on your server:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Run every 12 hours at 6:00 AM and 6:00 PM</span>
0 6,18 <span class="k">*</span> <span class="k">*</span> <span class="k">*</span> <span class="nb">cd</span> /path/to/project <span class="o">&amp;&amp;</span> /usr/bin/python3 pipeline.py <span class="o">&gt;&gt;</span> logs/pipeline.log 2&gt;&amp;1
</code></pre></div></div>
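
<p>Cron starts a new run on schedule whether or not the previous one has finished, so a stalled upload could overlap with the next run. One way to guard against that is a lock file checked at the top of <code class="language-plaintext highlighter-rouge">pipeline.py</code>. The sketch below is an assumption-laden example: <code class="language-plaintext highlighter-rouge">single_instance</code> and the <code class="language-plaintext highlighter-rouge">logs/pipeline.lock</code> path are hypothetical, and a process killed abruptly would leave a stale lock that has to be removed by hand.</p>

```python
import os
from contextlib import contextmanager


@contextmanager
def single_instance(lock_path: str = "logs/pipeline.lock"):
    """Yield True if this process acquired the lock, False if it is held."""
    try:
        # O_EXCL makes creation fail atomically if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        yield False
        return
    try:
        # Record the owner's PID to make stale locks easier to diagnose.
        with os.fdopen(fd, "w") as lock_file:
            lock_file.write(str(os.getpid()))
        yield True
    finally:
        os.remove(lock_path)
```

<p>Inside <code class="language-plaintext highlighter-rouge">main()</code>, the pipeline steps would run under <code class="language-plaintext highlighter-rouge">with single_instance() as acquired:</code> and exit early when <code class="language-plaintext highlighter-rouge">acquired</code> is false.</p>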

<h2 id="final-project-structure">Final Project Structure</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tiktok-youtube-pipeline/
├── .env
├── config.py
├── search_tiktok.py          # Part 1
├── download_videos.py        # Part 2
├── generate_script.py        # Part 3
├── generate_tts.py           # Part 3
├── compile_video.py          # Part 3
├── youtube_auth.py           # Part 4
├── upload_youtube.py         # Part 4
├── pipeline.py               # Orchestrator
├── credentials/
│   ├── client_secret.json
│   └── youtube_token.pickle
├── data/
│   ├── raw/                  # JSON search results
│   ├── videos/               # Downloaded .mp4 files
│   ├── scripts/              # AI-generated narration
│   ├── audio/                # TTS audio files
│   └── output/               # Final compiled segments
└── logs/
</code></pre></div></div>

<h2 id="conclusion">Conclusion</h2>

<p>Over this 4-part series, we built a complete automated pipeline that:</p>

<ol>
  <li><strong>Searches</strong> TikTok for breaking news footage using the <a href="https://apify.com/novi/advanced-search-tiktok-api?fpr=7hce1m">Advanced TikTok Search API</a></li>
  <li><strong>Downloads</strong> watermark-free video clips asynchronously</li>
  <li><strong>Generates</strong> professional narration scripts and TTS audio using AI</li>
  <li><strong>Compiles</strong> everything into polished news segments with FFmpeg</li>
  <li><strong>Uploads</strong> to YouTube with proper metadata</li>
  <li><strong>Runs automatically</strong> on a schedule</li>
</ol>

<p>This pipeline turns raw social media footage into narrated news segments using only Python and a handful of APIs. Each run costs approximately $0.05 (Apify compute plus OpenAI tokens), which keeps daily news production inexpensive.</p>

<blockquote>
  <p><strong>Series Navigation</strong></p>
  <ul>
    <li><a href="/tiktok/tiktok-youtube-part1-searching-api/">Part 1: Searching TikTok with the Apify Python Client</a></li>
    <li><a href="/tiktok/tiktok-youtube-part2-downloading-videos/">Part 2: Downloading &amp; Organizing Video Files</a></li>
    <li><a href="/tiktok/tiktok-youtube-part3-ai-narration/">Part 3: Auto-Generating News Scripts with AI</a></li>
    <li><strong>Part 4: Scheduling &amp; Publishing to YouTube</strong> ← You are here</li>
  </ul>
</blockquote>]]></content><author><name>Novi Develop</name></author><category term="TikTok" /><category term="Web Scraping" /><category term="YouTube" /><category term="apify" /><category term="tiktok-api" /><category term="python" /><category term="youtube-api" /><category term="tutorial" /><summary type="html"><![CDATA[This is the final part of our series on building an automated YouTube news channel from TikTok data. We’ve searched TikTok (Part 1), downloaded videos (Part 2), and generated AI narration (Part 3). Now we’ll tie it all together: uploading to YouTube and scheduling the pipeline to run automatically.]]></summary></entry><entry><title type="html">TikTok to YouTube Pipeline Part 3: Auto-Generating News Scripts with AI</title><link href="https://scrapingenthusiast.github.io/tiktok/tiktok-youtube-part3-ai-narration/" rel="alternate" type="text/html" title="TikTok to YouTube Pipeline Part 3: Auto-Generating News Scripts with AI" /><published>2026-03-26T08:00:00+07:00</published><updated>2026-03-26T08:00:00+07:00</updated><id>https://scrapingenthusiast.github.io/tiktok/tiktok-youtube-part3-ai-narration</id><content type="html" xml:base="https://scrapingenthusiast.github.io/tiktok/tiktok-youtube-part3-ai-narration/"><![CDATA[<p>This is <strong>Part 3</strong> of our series on building an automated YouTube news channel from TikTok data. In <a href="/tiktok/tiktok-youtube-part2-downloading-videos/">Part 2</a>, we downloaded watermark-free videos and organized them by topic. Now we’ll turn those raw clips into a polished news segment by <strong>generating narration scripts with AI</strong>, converting them to <strong>text-to-speech audio</strong>, and <strong>compiling everything with FFmpeg</strong>.</p>

<blockquote>
  <p><strong>Series Navigation</strong></p>
  <ul>
    <li><a href="/tiktok/tiktok-youtube-part1-searching-api/">Part 1: Searching TikTok with the Apify Python Client</a></li>
    <li><a href="/tiktok/tiktok-youtube-part2-downloading-videos/">Part 2: Downloading &amp; Organizing Video Files</a></li>
    <li><strong>Part 3: Auto-Generating News Scripts with AI</strong> ← You are here</li>
    <li><a href="/tiktok/tiktok-youtube-part4-scheduling-publishing/">Part 4: Scheduling &amp; Publishing to YouTube</a></li>
  </ul>
</blockquote>

<h2 id="prerequisites">Prerequisites</h2>

<ul>
  <li>Downloaded videos from Part 2</li>
  <li>Python 3.10+</li>
  <li>FFmpeg installed on your system (<code class="language-plaintext highlighter-rouge">brew install ffmpeg</code> on macOS)</li>
  <li>An OpenAI API key</li>
</ul>

<h2 id="step-1-install-dependencies">Step 1: Install Dependencies</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>openai ffmpeg-python
</code></pre></div></div>

<p>Add your OpenAI key to <code class="language-plaintext highlighter-rouge">.env</code>:</p>

<pre><code class="language-env">OPENAI_API_KEY=sk-your_key_here
</code></pre>

<h2 id="step-2-generate-narration-scripts-from-video-descriptions">Step 2: Generate Narration Scripts from Video Descriptions</h2>

<p>The TikTok video descriptions (<code class="language-plaintext highlighter-rouge">desc</code> field) captured in Part 1 contain valuable context — hashtags, locations, and brief captions. We’ll feed a batch of these to an LLM to produce a professional news script.</p>

<p>Create <code class="language-plaintext highlighter-rouge">generate_script.py</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># generate_script.py
</span>
<span class="kn">import</span> <span class="n">json</span>
<span class="kn">import</span> <span class="n">os</span>
<span class="kn">from</span> <span class="n">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="kn">from</span> <span class="n">dotenv</span> <span class="kn">import</span> <span class="n">load_dotenv</span>
<span class="kn">from</span> <span class="n">openai</span> <span class="kn">import</span> <span class="n">OpenAI</span>

<span class="nf">load_dotenv</span><span class="p">()</span>

<span class="n">METADATA_PATH</span> <span class="o">=</span> <span class="nc">Path</span><span class="p">(</span><span class="sh">"</span><span class="s">data/videos/metadata_index.json</span><span class="sh">"</span><span class="p">)</span>
<span class="n">SCRIPTS_DIR</span> <span class="o">=</span> <span class="nc">Path</span><span class="p">(</span><span class="sh">"</span><span class="s">data/scripts</span><span class="sh">"</span><span class="p">)</span>

<span class="n">SYSTEM_PROMPT</span> <span class="o">=</span> <span class="sh">"""</span><span class="s">You are a professional news anchor script writer.
Given a list of TikTok video descriptions about a specific topic,
write a concise, factual news narration script (60-90 seconds when
read aloud). The script should:
- Start with a strong headline opener
- Summarize the key events shown in the videos
- Maintain a neutral, journalistic tone
- End with a brief outlook or call to stay updated
Do NOT mention TikTok or social media in the script.
Output ONLY the script text, no titles or formatting.</span><span class="sh">"""</span>


<span class="k">def</span> <span class="nf">load_metadata</span><span class="p">()</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">]]:</span>
    <span class="sh">"""</span><span class="s">Load metadata and group by keyword.</span><span class="sh">"""</span>
    <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="n">METADATA_PATH</span><span class="p">,</span> <span class="sh">"</span><span class="s">r</span><span class="sh">"</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="sh">"</span><span class="s">utf-8</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">metadata</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>

    <span class="n">grouped</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">metadata</span><span class="p">:</span>
        <span class="n">keyword</span> <span class="o">=</span> <span class="n">item</span><span class="p">[</span><span class="sh">"</span><span class="s">keyword</span><span class="sh">"</span><span class="p">]</span>
        <span class="n">grouped</span><span class="p">.</span><span class="nf">setdefault</span><span class="p">(</span><span class="n">keyword</span><span class="p">,</span> <span class="p">[]).</span><span class="nf">append</span><span class="p">(</span><span class="n">item</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">grouped</span>


<span class="k">def</span> <span class="nf">generate_script_for_topic</span><span class="p">(</span>
    <span class="n">client</span><span class="p">:</span> <span class="n">OpenAI</span><span class="p">,</span>
    <span class="n">keyword</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">videos</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">],</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Generate a narration script for a topic using the LLM.</span><span class="sh">"""</span>
    <span class="c1"># Keep the first 15 non-empty descriptions to stay within token limits
</span>    <span class="n">descriptions</span> <span class="o">=</span> <span class="p">[</span><span class="n">v</span><span class="p">[</span><span class="sh">"</span><span class="s">description</span><span class="sh">"</span><span class="p">]</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">videos</span> <span class="k">if</span> <span class="n">v</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">description</span><span class="sh">"</span><span class="p">)][:</span><span class="mi">15</span><span class="p">]</span>
    <span class="n">descriptions_text</span> <span class="o">=</span> <span class="sh">"</span><span class="se">\n</span><span class="sh">"</span><span class="p">.</span><span class="nf">join</span><span class="p">(</span>
        <span class="sa">f</span><span class="sh">"</span><span class="s">- </span><span class="si">{</span><span class="n">desc</span><span class="si">}</span><span class="sh">"</span> <span class="k">for</span> <span class="n">desc</span> <span class="ow">in</span> <span class="n">descriptions</span>
    <span class="p">)</span>

    <span class="n">user_prompt</span> <span class="o">=</span> <span class="p">(</span>
        <span class="sa">f</span><span class="sh">"</span><span class="s">Topic: </span><span class="si">{</span><span class="n">keyword</span><span class="p">.</span><span class="nf">replace</span><span class="p">(</span><span class="sh">'</span><span class="s">_</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s"> </span><span class="sh">'</span><span class="p">)</span><span class="si">}</span><span class="se">\n\n</span><span class="sh">"</span>
        <span class="sa">f</span><span class="sh">"</span><span class="s">Video descriptions from the field:</span><span class="se">\n</span><span class="si">{</span><span class="n">descriptions_text</span><span class="si">}</span><span class="se">\n\n</span><span class="sh">"</span>
        <span class="sa">f</span><span class="sh">"</span><span class="s">Write the news narration script.</span><span class="sh">"</span>
    <span class="p">)</span>

    <span class="n">response</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">chat</span><span class="p">.</span><span class="n">completions</span><span class="p">.</span><span class="nf">create</span><span class="p">(</span>
        <span class="n">model</span><span class="o">=</span><span class="sh">"</span><span class="s">gpt-4o-mini</span><span class="sh">"</span><span class="p">,</span>
        <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
            <span class="p">{</span><span class="sh">"</span><span class="s">role</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">system</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">content</span><span class="sh">"</span><span class="p">:</span> <span class="n">SYSTEM_PROMPT</span><span class="p">},</span>
            <span class="p">{</span><span class="sh">"</span><span class="s">role</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">user</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">content</span><span class="sh">"</span><span class="p">:</span> <span class="n">user_prompt</span><span class="p">},</span>
        <span class="p">],</span>
        <span class="n">temperature</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span>
        <span class="n">max_tokens</span><span class="o">=</span><span class="mi">500</span><span class="p">,</span>
    <span class="p">)</span>

    <span class="k">return</span> <span class="p">(</span><span class="n">response</span><span class="p">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">message</span><span class="p">.</span><span class="n">content</span> <span class="ow">or</span> <span class="sh">""</span><span class="p">).</span><span class="nf">strip</span><span class="p">()</span>


<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">client</span> <span class="o">=</span> <span class="nc">OpenAI</span><span class="p">(</span><span class="n">api_key</span><span class="o">=</span><span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="sh">"</span><span class="s">OPENAI_API_KEY</span><span class="sh">"</span><span class="p">])</span>
    <span class="n">grouped</span> <span class="o">=</span> <span class="nf">load_metadata</span><span class="p">()</span>

    <span class="n">SCRIPTS_DIR</span><span class="p">.</span><span class="nf">mkdir</span><span class="p">(</span><span class="n">parents</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

    <span class="k">for</span> <span class="n">keyword</span><span class="p">,</span> <span class="n">videos</span> <span class="ow">in</span> <span class="n">grouped</span><span class="p">.</span><span class="nf">items</span><span class="p">():</span>
        <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">✍️  Generating script for: </span><span class="si">{</span><span class="n">keyword</span><span class="si">}</span><span class="s">...</span><span class="sh">"</span><span class="p">)</span>
        <span class="n">script</span> <span class="o">=</span> <span class="nf">generate_script_for_topic</span><span class="p">(</span><span class="n">client</span><span class="p">,</span> <span class="n">keyword</span><span class="p">,</span> <span class="n">videos</span><span class="p">)</span>

        <span class="n">script_path</span> <span class="o">=</span> <span class="n">SCRIPTS_DIR</span> <span class="o">/</span> <span class="sa">f</span><span class="sh">"</span><span class="si">{</span><span class="n">keyword</span><span class="si">}</span><span class="s">_script.txt</span><span class="sh">"</span>
        <span class="n">script_path</span><span class="p">.</span><span class="nf">write_text</span><span class="p">(</span><span class="n">script</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="sh">"</span><span class="s">utf-8</span><span class="sh">"</span><span class="p">)</span>
        <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">   📝 Saved to </span><span class="si">{</span><span class="n">script_path</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
        <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">   Preview: </span><span class="si">{</span><span class="n">script</span><span class="p">[</span><span class="si">:</span><span class="mi">120</span><span class="p">]</span><span class="si">}</span><span class="s">...</span><span class="se">\n</span><span class="sh">"</span><span class="p">)</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="sh">"</span><span class="s">__main__</span><span class="sh">"</span><span class="p">:</span>
    <span class="nf">main</span><span class="p">()</span>
</code></pre></div></div>
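
<p>The 15-description cap above limits the number of items but not their total length, so a few very long captions could still inflate the prompt. A minimal sketch of a length-aware selector follows. The helper name <code class="language-plaintext highlighter-rouge">select_descriptions</code> and the <code class="language-plaintext highlighter-rouge">max_chars</code> budget of 4000 are assumptions made for illustration, and character count is only a rough stand-in for tokens.</p>

```python
def select_descriptions(videos: list[dict], max_items: int = 15, max_chars: int = 4000) -> list[str]:
    """Pick non-empty descriptions until either the item or character budget is hit."""
    selected = []
    used = 0
    for video in videos:
        # Treat a missing key, None, or whitespace-only text as "no description".
        desc = (video.get("description") or "").strip()
        if not desc:
            continue
        if len(selected) >= max_items or used + len(desc) > max_chars:
            break
        selected.append(desc)
        used += len(desc)
    return selected


# Hypothetical sample records, shaped like the metadata entries used above.
videos = [
    {"description": "Flooding closes the main bridge"},
    {"description": ""},
    {"description": None},
    {},
    {"description": "Volunteers fill sandbags overnight"},
]
print(select_descriptions(videos))
```

<p>If you adopt something like this, the list comprehension in <code class="language-plaintext highlighter-rouge">generate_script_for_topic</code> could be swapped for a call to the helper. Tune the budget to the model you actually use.</p>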

<h2 id="step-3-convert-scripts-to-speech">Step 3: Convert Scripts to Speech</h2>

<p>Create <code class="language-plaintext highlighter-rouge">generate_tts.py</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># generate_tts.py
</span>
<span class="kn">import</span> <span class="n">os</span>
<span class="kn">from</span> <span class="n">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="kn">from</span> <span class="n">dotenv</span> <span class="kn">import</span> <span class="n">load_dotenv</span>
<span class="kn">from</span> <span class="n">openai</span> <span class="kn">import</span> <span class="n">OpenAI</span>

<span class="nf">load_dotenv</span><span class="p">()</span>

<span class="n">SCRIPTS_DIR</span> <span class="o">=</span> <span class="nc">Path</span><span class="p">(</span><span class="sh">"</span><span class="s">data/scripts</span><span class="sh">"</span><span class="p">)</span>
<span class="n">AUDIO_DIR</span> <span class="o">=</span> <span class="nc">Path</span><span class="p">(</span><span class="sh">"</span><span class="s">data/audio</span><span class="sh">"</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">client</span> <span class="o">=</span> <span class="nc">OpenAI</span><span class="p">(</span><span class="n">api_key</span><span class="o">=</span><span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="sh">"</span><span class="s">OPENAI_API_KEY</span><span class="sh">"</span><span class="p">])</span>
    <span class="n">AUDIO_DIR</span><span class="p">.</span><span class="nf">mkdir</span><span class="p">(</span><span class="n">parents</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

    <span class="n">script_files</span> <span class="o">=</span> <span class="nf">sorted</span><span class="p">(</span><span class="n">SCRIPTS_DIR</span><span class="p">.</span><span class="nf">glob</span><span class="p">(</span><span class="sh">"</span><span class="s">*_script.txt</span><span class="sh">"</span><span class="p">))</span>

    <span class="k">for</span> <span class="n">script_path</span> <span class="ow">in</span> <span class="n">script_files</span><span class="p">:</span>
        <span class="n">keyword</span> <span class="o">=</span> <span class="n">script_path</span><span class="p">.</span><span class="n">stem</span><span class="p">.</span><span class="nf">replace</span><span class="p">(</span><span class="sh">"</span><span class="s">_script</span><span class="sh">"</span><span class="p">,</span> <span class="sh">""</span><span class="p">)</span>
        <span class="n">audio_path</span> <span class="o">=</span> <span class="n">AUDIO_DIR</span> <span class="o">/</span> <span class="sa">f</span><span class="sh">"</span><span class="si">{</span><span class="n">keyword</span><span class="si">}</span><span class="s">_narration.mp3</span><span class="sh">"</span>

        <span class="k">if</span> <span class="n">audio_path</span><span class="p">.</span><span class="nf">exists</span><span class="p">():</span>
            <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">⏭️  Skipping </span><span class="si">{</span><span class="n">keyword</span><span class="si">}</span><span class="s"> (audio exists)</span><span class="sh">"</span><span class="p">)</span>
            <span class="k">continue</span>

        <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">🔊 Generating TTS for: </span><span class="si">{</span><span class="n">keyword</span><span class="si">}</span><span class="s">...</span><span class="sh">"</span><span class="p">)</span>
        <span class="n">script_text</span> <span class="o">=</span> <span class="n">script_path</span><span class="p">.</span><span class="nf">read_text</span><span class="p">(</span><span class="n">encoding</span><span class="o">=</span><span class="sh">"</span><span class="s">utf-8</span><span class="sh">"</span><span class="p">)</span>

        <span class="n">response</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">audio</span><span class="p">.</span><span class="n">speech</span><span class="p">.</span><span class="nf">create</span><span class="p">(</span>
            <span class="n">model</span><span class="o">=</span><span class="sh">"</span><span class="s">tts-1</span><span class="sh">"</span><span class="p">,</span>
            <span class="n">voice</span><span class="o">=</span><span class="sh">"</span><span class="s">onyx</span><span class="sh">"</span><span class="p">,</span>  <span class="c1"># Deep, authoritative news voice
</span>            <span class="nb">input</span><span class="o">=</span><span class="n">script_text</span><span class="p">,</span>
        <span class="p">)</span>

        <span class="n">response</span><span class="p">.</span><span class="nf">write_to_file</span><span class="p">(</span><span class="nf">str</span><span class="p">(</span><span class="n">audio_path</span><span class="p">))</span>
        <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">   ✅ Saved to </span><span class="si">{</span><span class="n">audio_path</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="sh">"</span><span class="s">__main__</span><span class="sh">"</span><span class="p">:</span>
    <span class="nf">main</span><span class="p">()</span>
</code></pre></div></div>
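
<p><code class="language-plaintext highlighter-rouge">generate_tts.py</code> sends each script to the speech endpoint in a single request. With <code class="language-plaintext highlighter-rouge">max_tokens=500</code> in Step 2 the scripts should stay short, but if you raise that limit you may run into the endpoint's per-request input cap. The sketch below splits text at sentence boundaries. The 4096-character constant reflects the documented limit as far as we know, so treat it as an assumption and check the current API reference. <code class="language-plaintext highlighter-rouge">split_for_tts</code> is a hypothetical helper, not part of the OpenAI SDK.</p>

```python
import re

MAX_TTS_CHARS = 4096  # assumed per-request input limit; verify against the API docs


def split_for_tts(text: str, limit: int = MAX_TTS_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that is longer than the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


for number, chunk in enumerate(split_for_tts("First sentence. Second sentence!", limit=20), start=1):
    print(number, chunk)
```

<p>Each chunk would then need its own speech request, and the resulting audio files would have to be joined afterwards (for example with FFmpeg), which this sketch does not cover.</p>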

<h2 id="step-4-compile-videos-with-ffmpeg">Step 4: Compile Videos with FFmpeg</h2>

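<p>Now we bring it all together: combine the downloaded video clips with the AI narration audio into a single news segment. The script shells out to <code class="language-plaintext highlighter-rouge">ffmpeg</code> and <code class="language-plaintext highlighter-rouge">ffprobe</code>, so both must be installed and available on your <code class="language-plaintext highlighter-rouge">PATH</code>.</p>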

<p>Create <code class="language-plaintext highlighter-rouge">compile_video.py</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># compile_video.py
</span>
<span class="kn">import</span> <span class="n">json</span>
<span class="kn">import</span> <span class="n">subprocess</span>
<span class="kn">from</span> <span class="n">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="n">VIDEOS_DIR</span> <span class="o">=</span> <span class="nc">Path</span><span class="p">(</span><span class="sh">"</span><span class="s">data/videos</span><span class="sh">"</span><span class="p">)</span>
<span class="n">AUDIO_DIR</span> <span class="o">=</span> <span class="nc">Path</span><span class="p">(</span><span class="sh">"</span><span class="s">data/audio</span><span class="sh">"</span><span class="p">)</span>
<span class="n">OUTPUT_DIR</span> <span class="o">=</span> <span class="nc">Path</span><span class="p">(</span><span class="sh">"</span><span class="s">data/output</span><span class="sh">"</span><span class="p">)</span>
<span class="n">METADATA_PATH</span> <span class="o">=</span> <span class="n">VIDEOS_DIR</span> <span class="o">/</span> <span class="sh">"</span><span class="s">metadata_index.json</span><span class="sh">"</span>


<span class="k">def</span> <span class="nf">get_video_duration</span><span class="p">(</span><span class="n">video_path</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Get video duration in seconds using ffprobe.</span><span class="sh">"""</span>
    <span class="n">result</span> <span class="o">=</span> <span class="n">subprocess</span><span class="p">.</span><span class="nf">run</span><span class="p">(</span>
        <span class="p">[</span>
            <span class="sh">"</span><span class="s">ffprobe</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">-v</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">error</span><span class="sh">"</span><span class="p">,</span>
            <span class="sh">"</span><span class="s">-show_entries</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">format=duration</span><span class="sh">"</span><span class="p">,</span>
            <span class="sh">"</span><span class="s">-of</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">default=noprint_wrappers=1:nokey=1</span><span class="sh">"</span><span class="p">,</span>
            <span class="n">video_path</span><span class="p">,</span>
        <span class="p">],</span>
        <span class="n">capture_output</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">text</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">check</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="p">)</span>
    <span class="k">return</span> <span class="nf">float</span><span class="p">(</span><span class="n">result</span><span class="p">.</span><span class="n">stdout</span><span class="p">.</span><span class="nf">strip</span><span class="p">())</span>


<span class="k">def</span> <span class="nf">compile_topic</span><span class="p">(</span><span class="n">keyword</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">video_paths</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]):</span>
    <span class="sh">"""</span><span class="s">Compile videos for a single topic into one news segment.</span><span class="sh">"""</span>
    <span class="n">audio_path</span> <span class="o">=</span> <span class="n">AUDIO_DIR</span> <span class="o">/</span> <span class="sa">f</span><span class="sh">"</span><span class="si">{</span><span class="n">keyword</span><span class="si">}</span><span class="s">_narration.mp3</span><span class="sh">"</span>

    <span class="k">if</span> <span class="ow">not</span> <span class="n">audio_path</span><span class="p">.</span><span class="nf">exists</span><span class="p">():</span>
        <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">⚠️  No narration audio for </span><span class="si">{</span><span class="n">keyword</span><span class="si">}</span><span class="s">, skipping.</span><span class="sh">"</span><span class="p">)</span>
        <span class="k">return</span>

    <span class="c1"># Filter to videos that actually exist and take the top 5
</span>    <span class="n">existing</span> <span class="o">=</span> <span class="p">[</span><span class="n">p</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">video_paths</span> <span class="k">if</span> <span class="nc">Path</span><span class="p">(</span><span class="n">p</span><span class="p">).</span><span class="nf">exists</span><span class="p">()][:</span><span class="mi">5</span><span class="p">]</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">existing</span><span class="p">:</span>
        <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">⚠️  No videos found for </span><span class="si">{</span><span class="n">keyword</span><span class="si">}</span><span class="s">, skipping.</span><span class="sh">"</span><span class="p">)</span>
        <span class="k">return</span>

    <span class="n">OUTPUT_DIR</span><span class="p">.</span><span class="nf">mkdir</span><span class="p">(</span><span class="n">parents</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

    <span class="c1"># Create a concat file for FFmpeg
</span>    <span class="n">concat_file</span> <span class="o">=</span> <span class="n">OUTPUT_DIR</span> <span class="o">/</span> <span class="sa">f</span><span class="sh">"</span><span class="si">{</span><span class="n">keyword</span><span class="si">}</span><span class="s">_concat.txt</span><span class="sh">"</span>
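    <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="n">concat_file</span><span class="p">,</span> <span class="sh">"</span><span class="s">w</span><span class="sh">"</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="sh">"</span><span class="s">utf-8</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>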
        <span class="k">for</span> <span class="n">vp</span> <span class="ow">in</span> <span class="n">existing</span><span class="p">:</span>
            <span class="n">f</span><span class="p">.</span><span class="nf">write</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">file </span><span class="sh">'</span><span class="si">{</span><span class="nc">Path</span><span class="p">(</span><span class="n">vp</span><span class="p">).</span><span class="nf">absolute</span><span class="p">()</span><span class="si">}</span><span class="sh">'</span><span class="se">\n</span><span class="sh">"</span><span class="p">)</span>

    <span class="n">output_path</span> <span class="o">=</span> <span class="n">OUTPUT_DIR</span> <span class="o">/</span> <span class="sa">f</span><span class="sh">"</span><span class="si">{</span><span class="n">keyword</span><span class="si">}</span><span class="s">_news_segment.mp4</span><span class="sh">"</span>

    <span class="c1"># Step 1: Concatenate video clips
</span>    <span class="n">temp_video</span> <span class="o">=</span> <span class="n">OUTPUT_DIR</span> <span class="o">/</span> <span class="sa">f</span><span class="sh">"</span><span class="si">{</span><span class="n">keyword</span><span class="si">}</span><span class="s">_temp_concat.mp4</span><span class="sh">"</span>
    <span class="n">subprocess</span><span class="p">.</span><span class="nf">run</span><span class="p">([</span>
        <span class="sh">"</span><span class="s">ffmpeg</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">-y</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">-f</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">concat</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">-safe</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">0</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">-i</span><span class="sh">"</span><span class="p">,</span> <span class="nf">str</span><span class="p">(</span><span class="n">concat_file</span><span class="p">),</span>
        <span class="sh">"</span><span class="s">-c:v</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">libx264</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">-crf</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">23</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">-preset</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">fast</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">-vf</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">scale=1920:1080:force_original_aspect_ratio=decrease,</span><span class="sh">"</span>
               <span class="sh">"</span><span class="s">pad=1920:1080:(ow-iw)/2:(oh-ih)/2</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">-r</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">30</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">-an</span><span class="sh">"</span><span class="p">,</span>  <span class="c1"># Remove original audio
</span>        <span class="nf">str</span><span class="p">(</span><span class="n">temp_video</span><span class="p">),</span>
    <span class="p">],</span> <span class="n">check</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

    <span class="c1"># Step 2: Overlay narration audio
</span>    <span class="n">subprocess</span><span class="p">.</span><span class="nf">run</span><span class="p">([</span>
        <span class="sh">"</span><span class="s">ffmpeg</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">-y</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">-i</span><span class="sh">"</span><span class="p">,</span> <span class="nf">str</span><span class="p">(</span><span class="n">temp_video</span><span class="p">),</span>
        <span class="sh">"</span><span class="s">-i</span><span class="sh">"</span><span class="p">,</span> <span class="nf">str</span><span class="p">(</span><span class="n">audio_path</span><span class="p">),</span>
        <span class="sh">"</span><span class="s">-c:v</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">copy</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">-c:a</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">aac</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">-b:a</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">192k</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">-shortest</span><span class="sh">"</span><span class="p">,</span>
        <span class="nf">str</span><span class="p">(</span><span class="n">output_path</span><span class="p">),</span>
    <span class="p">],</span> <span class="n">check</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

    <span class="c1"># Cleanup
</span>    <span class="n">temp_video</span><span class="p">.</span><span class="nf">unlink</span><span class="p">(</span><span class="n">missing_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="n">concat_file</span><span class="p">.</span><span class="nf">unlink</span><span class="p">(</span><span class="n">missing_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

    <span class="n">duration</span> <span class="o">=</span> <span class="nf">get_video_duration</span><span class="p">(</span><span class="nf">str</span><span class="p">(</span><span class="n">output_path</span><span class="p">))</span>
    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">🎬 </span><span class="si">{</span><span class="n">keyword</span><span class="si">}</span><span class="s">: </span><span class="si">{</span><span class="n">output_path</span><span class="si">}</span><span class="s"> (</span><span class="si">{</span><span class="n">duration</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s">s)</span><span class="sh">"</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="n">METADATA_PATH</span><span class="p">,</span> <span class="sh">"</span><span class="s">r</span><span class="sh">"</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="sh">"</span><span class="s">utf-8</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">metadata</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>

    <span class="c1"># Group by keyword
</span>    <span class="n">grouped</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">metadata</span><span class="p">:</span>
        <span class="n">keyword</span> <span class="o">=</span> <span class="n">item</span><span class="p">[</span><span class="sh">"</span><span class="s">keyword</span><span class="sh">"</span><span class="p">]</span>
        <span class="n">grouped</span><span class="p">.</span><span class="nf">setdefault</span><span class="p">(</span><span class="n">keyword</span><span class="p">,</span> <span class="p">[]).</span><span class="nf">append</span><span class="p">(</span><span class="n">item</span><span class="p">[</span><span class="sh">"</span><span class="s">local_path</span><span class="sh">"</span><span class="p">])</span>

    <span class="k">for</span> <span class="n">keyword</span><span class="p">,</span> <span class="n">paths</span> <span class="ow">in</span> <span class="n">grouped</span><span class="p">.</span><span class="nf">items</span><span class="p">():</span>
        <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="se">\n</span><span class="s">🎞️  Compiling: </span><span class="si">{</span><span class="n">keyword</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
        <span class="nf">compile_topic</span><span class="p">(</span><span class="n">keyword</span><span class="p">,</span> <span class="n">paths</span><span class="p">)</span>

    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="se">\n</span><span class="s">✅ All segments compiled in </span><span class="si">{</span><span class="n">OUTPUT_DIR</span><span class="si">}</span><span class="s">/</span><span class="sh">"</span><span class="p">)</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="sh">"</span><span class="s">__main__</span><span class="sh">"</span><span class="p">:</span>
    <span class="nf">main</span><span class="p">()</span>
</code></pre></div></div>

<h2 id="step-5-run-the-full-post-processing-pipeline">Step 5: Run the Full Post-Processing Pipeline</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Generate narration scripts</span>
python generate_script.py

<span class="c"># Convert to speech</span>
python generate_tts.py

<span class="c"># Compile final videos</span>
python compile_video.py
</code></pre></div></div>

<p>Expected output:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>✍️  Generating script for: Ukraine_war...
   📝 Saved to data/scripts/Ukraine_war_script.txt
🔊 Generating TTS for: Ukraine_war...
   ✅ Saved to data/audio/Ukraine_war_narration.mp3

🎞️  Compiling: Ukraine_war
🎬 Ukraine_war: data/output/Ukraine_war_news_segment.mp4 (87.3s)

✅ All segments compiled in data/output/
</code></pre></div></div>
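
<p>If you would rather launch the three stages with one command, a small driver can run them in order and stop at the first failure. This is a minimal sketch and not part of the original pipeline: the helper <code class="language-plaintext highlighter-rouge">run_steps</code> and the <code class="language-plaintext highlighter-rouge">STEPS</code> list are hypothetical, and it assumes the three scripts sit in the current working directory.</p>

```python
import subprocess
import sys

# Assumed to live in the current working directory, as in the commands above.
STEPS = ["generate_script.py", "generate_tts.py", "compile_video.py"]


def run_steps(commands):
    """Run each command in order; return the index of the first failure, or None."""
    for index, command in enumerate(commands):
        print("Running:", " ".join(command))
        completed = subprocess.run(command)
        if completed.returncode != 0:
            # Stop here: later stages depend on this stage's output files.
            return index
    return None
```

<p>Calling <code class="language-plaintext highlighter-rouge">run_steps([[sys.executable, step] for step in STEPS])</code> returns <code class="language-plaintext highlighter-rouge">None</code> when every stage exits with status 0, or the position of the stage that failed, so TTS generation is never attempted on a script that was not written.</p>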

<h2 id="whats-next">What’s Next?</h2>

<p>In <a href="/tiktok/tiktok-youtube-part4-scheduling-publishing/">Part 4</a>, we’ll upload these compiled news segments to YouTube using the YouTube Data API, and set up Apify scheduling to run this entire pipeline automatically every 12 hours.</p>

<blockquote>
  <p><strong>Looking for more TikTok data sources?</strong> Beyond keyword search, you can also pull trending videos, user profiles, hashtags, and comments. Browse our full collection of <a href="/tools/">TikTok and Twitter scraping tools</a> to expand your data pipeline.</p>
</blockquote>

<blockquote>
  <p><strong>Series Navigation</strong></p>
  <ul>
    <li><a href="/tiktok/tiktok-youtube-part1-searching-api/">Part 1: Searching TikTok with the Apify Python Client</a></li>
    <li><a href="/tiktok/tiktok-youtube-part2-downloading-videos/">Part 2: Downloading &amp; Organizing Video Files</a></li>
    <li><strong>Part 3: Auto-Generating News Scripts with AI</strong> ← You are here</li>
    <li><a href="/tiktok/tiktok-youtube-part4-scheduling-publishing/">Part 4: Scheduling &amp; Publishing to YouTube</a></li>
  </ul>
</blockquote>]]></content><author><name>Novi Develop</name></author><category term="TikTok" /><category term="Web Scraping" /><category term="YouTube" /><category term="apify" /><category term="tiktok-api" /><category term="python" /><category term="openai" /><category term="ffmpeg" /><category term="tutorial" /><summary type="html"><![CDATA[This is Part 3 of our series on building an automated YouTube news channel from TikTok data. In Part 2, we downloaded watermark-free videos and organized them by topic. Now we’ll turn those raw clips into a polished news segment by generating narration scripts with AI, converting them to text-to-speech audio, and compiling everything with FFmpeg.]]></summary></entry><entry><title type="html">TikTok to YouTube Pipeline Part 2: Downloading &amp;amp; Organizing Video Files</title><link href="https://scrapingenthusiast.github.io/tiktok/tiktok-youtube-part2-downloading-videos/" rel="alternate" type="text/html" title="TikTok to YouTube Pipeline Part 2: Downloading &amp;amp; Organizing Video Files" /><published>2026-03-25T08:00:00+07:00</published><updated>2026-03-25T08:00:00+07:00</updated><id>https://scrapingenthusiast.github.io/tiktok/tiktok-youtube-part2-downloading-videos</id><content type="html" xml:base="https://scrapingenthusiast.github.io/tiktok/tiktok-youtube-part2-downloading-videos/"><![CDATA[<p>This is <strong>Part 2</strong> of our series on building an automated YouTube news channel with TikTok data. In <a href="/tiktok/tiktok-youtube-part1-searching-api/">Part 1</a>, we used the <a href="https://apify.com/novi/advanced-search-tiktok-api?fpr=7hce1m">Advanced TikTok Search API Actor</a> to search TikTok and save structured JSON results. Now, we’ll write an <strong>async video downloader</strong> that fetches watermark-free clips and organizes them by topic.</p>

<blockquote>
  <p><strong>Series Navigation</strong></p>
  <ul>
    <li><a href="/tiktok/tiktok-youtube-part1-searching-api/">Part 1: Searching TikTok with the Apify Python Client</a></li>
    <li><strong>Part 2: Downloading &amp; Organizing Video Files</strong> ← You are here</li>
    <li><a href="/tiktok/tiktok-youtube-part3-ai-narration/">Part 3: Auto-Generating News Scripts with AI</a></li>
    <li><a href="/tiktok/tiktok-youtube-part4-scheduling-publishing/">Part 4: Scheduling &amp; Publishing to YouTube</a></li>
  </ul>
</blockquote>

<h2 id="prerequisites">Prerequisites</h2>

<ul>
  <li>The JSON output file from Part 1</li>
  <li>Python 3.10+</li>
</ul>

<h2 id="step-1-install-dependencies">Step 1: Install Dependencies</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>httpx tqdm
</code></pre></div></div>

<p>We’ll use <code class="language-plaintext highlighter-rouge">httpx</code> for its first-class async support and <code class="language-plaintext highlighter-rouge">tqdm</code> for progress bars.</p>

<h2 id="step-2-the-async-video-downloader">Step 2: The Async Video Downloader</h2>

<p>Create <code class="language-plaintext highlighter-rouge">download_videos.py</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># download_videos.py
</span>
<span class="kn">import</span> <span class="n">asyncio</span>
<span class="kn">import</span> <span class="n">json</span>
<span class="kn">import</span> <span class="n">re</span>
<span class="kn">import</span> <span class="n">sys</span>
<span class="kn">from</span> <span class="n">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="kn">import</span> <span class="n">httpx</span>
<span class="kn">from</span> <span class="n">tqdm.asyncio</span> <span class="kn">import</span> <span class="n">tqdm_asyncio</span>

<span class="n">MAX_CONCURRENT</span> <span class="o">=</span> <span class="mi">5</span>
<span class="n">DOWNLOAD_DIR</span> <span class="o">=</span> <span class="nc">Path</span><span class="p">(</span><span class="sh">"</span><span class="s">data/videos</span><span class="sh">"</span><span class="p">)</span>
<span class="n">TIMEOUT</span> <span class="o">=</span> <span class="mf">60.0</span>


<span class="k">def</span> <span class="nf">sanitize_filename</span><span class="p">(</span><span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Remove unsafe characters from a filename.</span><span class="sh">"""</span>
    <span class="k">return</span> <span class="n">re</span><span class="p">.</span><span class="nf">sub</span><span class="p">(</span><span class="sa">r</span><span class="sh">'</span><span class="s">[^\w\-.]</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">_</span><span class="sh">'</span><span class="p">,</span> <span class="n">name</span><span class="p">)[:</span><span class="mi">80</span><span class="p">]</span>


<span class="k">def</span> <span class="nf">load_results</span><span class="p">(</span><span class="n">json_path</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">]:</span>
    <span class="sh">"""</span><span class="s">Load the search results JSON file.</span><span class="sh">"""</span>
    <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="n">json_path</span><span class="p">,</span> <span class="sh">"</span><span class="s">r</span><span class="sh">"</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="sh">"</span><span class="s">utf-8</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">json</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">build_download_manifest</span><span class="p">(</span><span class="n">results</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">]:</span>
    <span class="sh">"""</span><span class="s">Build a list of downloads, skipping entries without an ID or URL and duplicate IDs.</span><span class="sh">"""</span>
    <span class="n">manifest</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">seen_ids</span> <span class="o">=</span> <span class="nf">set</span><span class="p">()</span>

    <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">results</span><span class="p">:</span>
        <span class="n">aweme_id</span> <span class="o">=</span> <span class="n">item</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">aweme_id</span><span class="sh">"</span><span class="p">)</span>
        <span class="n">url</span> <span class="o">=</span> <span class="n">item</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">video_url_no_watermark</span><span class="sh">"</span><span class="p">)</span>

        <span class="k">if</span> <span class="ow">not</span> <span class="n">aweme_id</span> <span class="ow">or</span> <span class="ow">not</span> <span class="n">url</span><span class="p">:</span>
            <span class="k">continue</span>
        <span class="k">if</span> <span class="n">aweme_id</span> <span class="ow">in</span> <span class="n">seen_ids</span><span class="p">:</span>
            <span class="k">continue</span>

        <span class="n">seen_ids</span><span class="p">.</span><span class="nf">add</span><span class="p">(</span><span class="n">aweme_id</span><span class="p">)</span>
        <span class="n">keyword</span> <span class="o">=</span> <span class="nf">sanitize_filename</span><span class="p">(</span><span class="n">item</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">_search_keyword</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">unknown</span><span class="sh">"</span><span class="p">))</span>
        <span class="n">filename</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"</span><span class="si">{</span><span class="n">aweme_id</span><span class="si">}</span><span class="s">.mp4</span><span class="sh">"</span>

        <span class="n">manifest</span><span class="p">.</span><span class="nf">append</span><span class="p">({</span>
            <span class="sh">"</span><span class="s">aweme_id</span><span class="sh">"</span><span class="p">:</span> <span class="n">aweme_id</span><span class="p">,</span>
            <span class="sh">"</span><span class="s">url</span><span class="sh">"</span><span class="p">:</span> <span class="n">url</span><span class="p">,</span>
            <span class="sh">"</span><span class="s">keyword</span><span class="sh">"</span><span class="p">:</span> <span class="n">keyword</span><span class="p">,</span>
            <span class="sh">"</span><span class="s">output_dir</span><span class="sh">"</span><span class="p">:</span> <span class="n">DOWNLOAD_DIR</span> <span class="o">/</span> <span class="n">keyword</span><span class="p">,</span>
            <span class="sh">"</span><span class="s">output_path</span><span class="sh">"</span><span class="p">:</span> <span class="n">DOWNLOAD_DIR</span> <span class="o">/</span> <span class="n">keyword</span> <span class="o">/</span> <span class="n">filename</span><span class="p">,</span>
            <span class="sh">"</span><span class="s">description</span><span class="sh">"</span><span class="p">:</span> <span class="n">item</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">description</span><span class="sh">"</span><span class="p">,</span> <span class="sh">""</span><span class="p">),</span>
        <span class="p">})</span>

    <span class="k">return</span> <span class="n">manifest</span>


<span class="k">async</span> <span class="k">def</span> <span class="nf">download_one</span><span class="p">(</span>
    <span class="n">client</span><span class="p">:</span> <span class="n">httpx</span><span class="p">.</span><span class="n">AsyncClient</span><span class="p">,</span>
    <span class="n">task</span><span class="p">:</span> <span class="nb">dict</span><span class="p">,</span>
    <span class="n">semaphore</span><span class="p">:</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">Semaphore</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Download a single video file with retry logic.</span><span class="sh">"""</span>
    <span class="n">task</span><span class="p">[</span><span class="sh">"</span><span class="s">output_dir</span><span class="sh">"</span><span class="p">].</span><span class="nf">mkdir</span><span class="p">(</span><span class="n">parents</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

    <span class="k">if</span> <span class="n">task</span><span class="p">[</span><span class="sh">"</span><span class="s">output_path</span><span class="sh">"</span><span class="p">].</span><span class="nf">exists</span><span class="p">():</span>
        <span class="k">return</span> <span class="p">{</span><span class="sh">"</span><span class="s">aweme_id</span><span class="sh">"</span><span class="p">:</span> <span class="n">task</span><span class="p">[</span><span class="sh">"</span><span class="s">aweme_id</span><span class="sh">"</span><span class="p">],</span> <span class="sh">"</span><span class="s">status</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">skipped</span><span class="sh">"</span><span class="p">}</span>

    <span class="k">async</span> <span class="k">with</span> <span class="n">semaphore</span><span class="p">:</span>
        <span class="k">for</span> <span class="n">attempt</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="mi">3</span><span class="p">):</span>
            <span class="k">try</span><span class="p">:</span>
                <span class="k">async</span> <span class="k">with</span> <span class="n">client</span><span class="p">.</span><span class="nf">stream</span><span class="p">(</span><span class="sh">"</span><span class="s">GET</span><span class="sh">"</span><span class="p">,</span> <span class="n">task</span><span class="p">[</span><span class="sh">"</span><span class="s">url</span><span class="sh">"</span><span class="p">],</span>
                                         <span class="n">timeout</span><span class="o">=</span><span class="n">TIMEOUT</span><span class="p">)</span> <span class="k">as</span> <span class="n">resp</span><span class="p">:</span>
                    <span class="n">resp</span><span class="p">.</span><span class="nf">raise_for_status</span><span class="p">()</span>
                    <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="n">task</span><span class="p">[</span><span class="sh">"</span><span class="s">output_path</span><span class="sh">"</span><span class="p">],</span> <span class="sh">"</span><span class="s">wb</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
                        <span class="k">async</span> <span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">resp</span><span class="p">.</span><span class="nf">aiter_bytes</span><span class="p">(</span>
                            <span class="n">chunk_size</span><span class="o">=</span><span class="mi">1024</span> <span class="o">*</span> <span class="mi">64</span>
                        <span class="p">):</span>
                            <span class="n">f</span><span class="p">.</span><span class="nf">write</span><span class="p">(</span><span class="n">chunk</span><span class="p">)</span>

                <span class="k">return</span> <span class="p">{</span><span class="sh">"</span><span class="s">aweme_id</span><span class="sh">"</span><span class="p">:</span> <span class="n">task</span><span class="p">[</span><span class="sh">"</span><span class="s">aweme_id</span><span class="sh">"</span><span class="p">],</span> <span class="sh">"</span><span class="s">status</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">ok</span><span class="sh">"</span><span class="p">}</span>

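            <span class="k">except</span> <span class="n">httpx</span><span class="p">.</span><span class="n">HTTPError</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
                <span class="c1"># HTTPError also covers connection errors; drop any partial file so a re-run retries it
</span>                <span class="n">task</span><span class="p">[</span><span class="sh">"</span><span class="s">output_path</span><span class="sh">"</span><span class="p">].</span><span class="nf">unlink</span><span class="p">(</span><span class="n">missing_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>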
                <span class="k">if</span> <span class="n">attempt</span> <span class="o">==</span> <span class="mi">2</span><span class="p">:</span>
                    <span class="k">return</span> <span class="p">{</span>
                        <span class="sh">"</span><span class="s">aweme_id</span><span class="sh">"</span><span class="p">:</span> <span class="n">task</span><span class="p">[</span><span class="sh">"</span><span class="s">aweme_id</span><span class="sh">"</span><span class="p">],</span>
                        <span class="sh">"</span><span class="s">status</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">failed</span><span class="sh">"</span><span class="p">,</span>
                        <span class="sh">"</span><span class="s">error</span><span class="sh">"</span><span class="p">:</span> <span class="nf">str</span><span class="p">(</span><span class="n">e</span><span class="p">),</span>
                    <span class="p">}</span>
                <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="nf">sleep</span><span class="p">(</span><span class="mi">2</span> <span class="o">**</span> <span class="n">attempt</span><span class="p">)</span>

    <span class="k">return</span> <span class="p">{</span><span class="sh">"</span><span class="s">aweme_id</span><span class="sh">"</span><span class="p">:</span> <span class="n">task</span><span class="p">[</span><span class="sh">"</span><span class="s">aweme_id</span><span class="sh">"</span><span class="p">],</span> <span class="sh">"</span><span class="s">status</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">failed</span><span class="sh">"</span><span class="p">}</span>


<span class="k">async</span> <span class="k">def</span> <span class="nf">download_all</span><span class="p">(</span><span class="n">manifest</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">]:</span>
    <span class="sh">"""</span><span class="s">Download all videos concurrently with a semaphore limit.</span><span class="sh">"""</span>
    <span class="n">semaphore</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="nc">Semaphore</span><span class="p">(</span><span class="n">MAX_CONCURRENT</span><span class="p">)</span>
    <span class="n">headers</span> <span class="o">=</span> <span class="p">{</span>
        <span class="sh">"</span><span class="s">User-Agent</span><span class="sh">"</span><span class="p">:</span> <span class="p">(</span>
            <span class="sh">"</span><span class="s">Mozilla/5.0 (Windows NT 10.0; Win64; x64) </span><span class="sh">"</span>
            <span class="sh">"</span><span class="s">AppleWebKit/537.36</span><span class="sh">"</span>
        <span class="p">),</span>
        <span class="sh">"</span><span class="s">Referer</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">https://www.tiktok.com/</span><span class="sh">"</span><span class="p">,</span>
    <span class="p">}</span>

    <span class="k">async</span> <span class="k">with</span> <span class="n">httpx</span><span class="p">.</span><span class="nc">AsyncClient</span><span class="p">(</span>
        <span class="n">headers</span><span class="o">=</span><span class="n">headers</span><span class="p">,</span> <span class="n">follow_redirects</span><span class="o">=</span><span class="bp">True</span>
    <span class="p">)</span> <span class="k">as</span> <span class="n">client</span><span class="p">:</span>
        <span class="n">tasks</span> <span class="o">=</span> <span class="p">[</span>
            <span class="nf">download_one</span><span class="p">(</span><span class="n">client</span><span class="p">,</span> <span class="n">task</span><span class="p">,</span> <span class="n">semaphore</span><span class="p">)</span>
            <span class="k">for</span> <span class="n">task</span> <span class="ow">in</span> <span class="n">manifest</span>
        <span class="p">]</span>
        <span class="n">results</span> <span class="o">=</span> <span class="k">await</span> <span class="n">tqdm_asyncio</span><span class="p">.</span><span class="nf">gather</span><span class="p">(</span>
            <span class="o">*</span><span class="n">tasks</span><span class="p">,</span> <span class="n">desc</span><span class="o">=</span><span class="sh">"</span><span class="s">Downloading videos</span><span class="sh">"</span>
        <span class="p">)</span>

    <span class="k">return</span> <span class="n">results</span>


<span class="k">def</span> <span class="nf">save_metadata</span><span class="p">(</span><span class="n">manifest</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">],</span> <span class="n">output_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">Save a metadata index alongside the downloaded videos.</span><span class="sh">"""</span>
    <span class="n">metadata</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">task</span> <span class="ow">in</span> <span class="n">manifest</span><span class="p">:</span>
        <span class="n">metadata</span><span class="p">.</span><span class="nf">append</span><span class="p">({</span>
            <span class="sh">"</span><span class="s">aweme_id</span><span class="sh">"</span><span class="p">:</span> <span class="n">task</span><span class="p">[</span><span class="sh">"</span><span class="s">aweme_id</span><span class="sh">"</span><span class="p">],</span>
            <span class="sh">"</span><span class="s">keyword</span><span class="sh">"</span><span class="p">:</span> <span class="n">task</span><span class="p">[</span><span class="sh">"</span><span class="s">keyword</span><span class="sh">"</span><span class="p">],</span>
            <span class="sh">"</span><span class="s">description</span><span class="sh">"</span><span class="p">:</span> <span class="n">task</span><span class="p">[</span><span class="sh">"</span><span class="s">description</span><span class="sh">"</span><span class="p">],</span>
            <span class="sh">"</span><span class="s">local_path</span><span class="sh">"</span><span class="p">:</span> <span class="nf">str</span><span class="p">(</span><span class="n">task</span><span class="p">[</span><span class="sh">"</span><span class="s">output_path</span><span class="sh">"</span><span class="p">]),</span>
        <span class="p">})</span>

    <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="n">output_path</span><span class="p">,</span> <span class="sh">"</span><span class="s">w</span><span class="sh">"</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="sh">"</span><span class="s">utf-8</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">json</span><span class="p">.</span><span class="nf">dump</span><span class="p">(</span><span class="n">metadata</span><span class="p">,</span> <span class="n">f</span><span class="p">,</span> <span class="n">ensure_ascii</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">indent</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="k">if</span> <span class="nf">len</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">:</span>
        <span class="nf">print</span><span class="p">(</span><span class="sh">"</span><span class="s">Usage: python download_videos.py &lt;path_to_results.json&gt;</span><span class="sh">"</span><span class="p">)</span>
        <span class="n">sys</span><span class="p">.</span><span class="nf">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>

    <span class="n">json_path</span> <span class="o">=</span> <span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">results</span> <span class="o">=</span> <span class="nf">load_results</span><span class="p">(</span><span class="n">json_path</span><span class="p">)</span>
    <span class="n">manifest</span> <span class="o">=</span> <span class="nf">build_download_manifest</span><span class="p">(</span><span class="n">results</span><span class="p">)</span>

    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">📋 </span><span class="si">{</span><span class="nf">len</span><span class="p">(</span><span class="n">manifest</span><span class="p">)</span><span class="si">}</span><span class="s"> unique videos to download </span><span class="sh">"</span>
          <span class="sa">f</span><span class="sh">"</span><span class="s">(from </span><span class="si">{</span><span class="nf">len</span><span class="p">(</span><span class="n">results</span><span class="p">)</span><span class="si">}</span><span class="s"> search results)</span><span class="sh">"</span><span class="p">)</span>

    <span class="n">download_results</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="nf">run</span><span class="p">(</span><span class="nf">download_all</span><span class="p">(</span><span class="n">manifest</span><span class="p">))</span>

    <span class="n">ok</span> <span class="o">=</span> <span class="nf">sum</span><span class="p">(</span><span class="mi">1</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">download_results</span> <span class="k">if</span> <span class="n">r</span><span class="p">[</span><span class="sh">"</span><span class="s">status</span><span class="sh">"</span><span class="p">]</span> <span class="o">==</span> <span class="sh">"</span><span class="s">ok</span><span class="sh">"</span><span class="p">)</span>
    <span class="n">skipped</span> <span class="o">=</span> <span class="nf">sum</span><span class="p">(</span><span class="mi">1</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">download_results</span> <span class="k">if</span> <span class="n">r</span><span class="p">[</span><span class="sh">"</span><span class="s">status</span><span class="sh">"</span><span class="p">]</span> <span class="o">==</span> <span class="sh">"</span><span class="s">skipped</span><span class="sh">"</span><span class="p">)</span>
    <span class="n">failed</span> <span class="o">=</span> <span class="nf">sum</span><span class="p">(</span><span class="mi">1</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">download_results</span> <span class="k">if</span> <span class="n">r</span><span class="p">[</span><span class="sh">"</span><span class="s">status</span><span class="sh">"</span><span class="p">]</span> <span class="o">==</span> <span class="sh">"</span><span class="s">failed</span><span class="sh">"</span><span class="p">)</span>

    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="se">\n</span><span class="s">✅ Downloaded: </span><span class="si">{</span><span class="n">ok</span><span class="si">}</span><span class="s">  ⏭️ Skipped: </span><span class="si">{</span><span class="n">skipped</span><span class="si">}</span><span class="s">  ❌ Failed: </span><span class="si">{</span><span class="n">failed</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>

    <span class="c1"># Save metadata index
</span>    <span class="n">meta_path</span> <span class="o">=</span> <span class="n">DOWNLOAD_DIR</span> <span class="o">/</span> <span class="sh">"</span><span class="s">metadata_index.json</span><span class="sh">"</span>
    <span class="nf">save_metadata</span><span class="p">(</span><span class="n">manifest</span><span class="p">,</span> <span class="n">meta_path</span><span class="p">)</span>
    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">📁 Metadata saved to </span><span class="si">{</span><span class="n">meta_path</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="sh">"</span><span class="s">__main__</span><span class="sh">"</span><span class="p">:</span>
    <span class="nf">main</span><span class="p">()</span>
</code></pre></div></div>

<h2 id="step-3-run-the-downloader">Step 3: Run the Downloader</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python download_videos.py data/raw/tiktok_results_20260324_080000.json
</code></pre></div></div>

<p>Expected output:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>📋 118 unique videos to download (from 122 search results)
Downloading videos: 100%|████████████████| 118/118 [01:42&lt;00:00]

✅ Downloaded: 112  ⏭️ Skipped: 0  ❌ Failed: 6
📁 Metadata saved to data/videos/metadata_index.json
</code></pre></div></div>

<h2 id="directory-structure-after-download">Directory Structure After Download</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data/
└── videos/
    ├── metadata_index.json
    ├── Ukraine_war/
    │   ├── 7229167805625847041.mp4
    │   ├── 7229167805625847042.mp4
    │   └── ...
    ├── earthquake_Turkey/
    │   ├── 7229167805625847050.mp4
    │   └── ...
    └── protest_France/
        ├── 7229167805625847060.mp4
        └── ...
</code></pre></div></div>

<p>Videos are automatically grouped by keyword, making it easy to find footage for a specific news topic.</p>
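
<p>The folder names in the tree above look like the search keywords with spaces replaced by underscores. As a rough sketch of how such a mapping could work (the helper names <code class="language-plaintext highlighter-rouge">keyword_to_dirname</code> and <code class="language-plaintext highlighter-rouge">video_path</code> are hypothetical and not taken from the downloader itself):</p>

```python
from pathlib import Path

# Assumed to match the "data/videos" root shown in the tree above.
DOWNLOAD_DIR = Path("data/videos")


def keyword_to_dirname(keyword: str) -> str:
    """Turn a search keyword into a folder name (hypothetical helper)."""
    # Collapse any run of whitespace into a single underscore.
    return "_".join(keyword.split())


def video_path(keyword: str, aweme_id: str) -> Path:
    """Build the local file path for one video under its keyword folder."""
    return DOWNLOAD_DIR / keyword_to_dirname(keyword) / f"{aweme_id}.mp4"
```

<p>Keeping the mapping in one small function means the downloader and any later pipeline step can agree on where a given video lives.</p>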

<h2 id="key-design-decisions">Key Design Decisions</h2>

<h3 id="why-async">Why Async?</h3>

<p>TikTok video CDN URLs are short-lived and can be slow from certain regions. Downloading 100+ videos sequentially could take 30+ minutes. With <code class="language-plaintext highlighter-rouge">asyncio</code> and a concurrency limit of 5, we typically finish in under 3 minutes.</p>
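
<p>One common way to cap concurrency in <code class="language-plaintext highlighter-rouge">asyncio</code> is a semaphore. The sketch below illustrates the idea only: <code class="language-plaintext highlighter-rouge">fake_download</code> is a stand-in that sleeps instead of making a network request, and the names are assumptions rather than the downloader's actual functions.</p>

```python
import asyncio

MAX_CONCURRENCY = 5  # the concurrency limit mentioned above


async def fake_download(video_id: str, semaphore: asyncio.Semaphore) -> str:
    """Stand-in for a real download; sleeps instead of doing network I/O."""
    async with semaphore:  # at most MAX_CONCURRENCY of these bodies run at once
        await asyncio.sleep(0.01)
        return video_id


async def download_all(video_ids: list[str]) -> list[str]:
    """Schedule every download, letting the semaphore throttle them."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    tasks = [fake_download(v, semaphore) for v in video_ids]
    # gather returns results in the same order as the input tasks.
    return await asyncio.gather(*tasks)
```

<p>Calling <code class="language-plaintext highlighter-rouge">asyncio.run(download_all(ids))</code> starts all tasks at once, but only five hold the semaphore at any moment, which keeps the load on the CDN bounded.</p>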

<h3 id="duplicate-detection">Duplicate Detection</h3>

<p>The <code class="language-plaintext highlighter-rouge">build_download_manifest</code> function tracks <code class="language-plaintext highlighter-rouge">aweme_id</code> values in a set, ensuring we never download the same video twice — even if it appears in multiple search results.</p>
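
<p>A minimal sketch of set-based deduplication, assuming each search result is a dictionary with an <code class="language-plaintext highlighter-rouge">aweme_id</code> key as in Part 1. The function name <code class="language-plaintext highlighter-rouge">dedupe_by_aweme_id</code> is hypothetical; it shows the technique, not the exact body of <code class="language-plaintext highlighter-rouge">build_download_manifest</code>.</p>

```python
def dedupe_by_aweme_id(results: list[dict]) -> list[dict]:
    """Keep the first occurrence of each aweme_id, preserving input order."""
    seen: set[str] = set()
    unique: list[dict] = []
    for item in results:
        video_id = item.get("aweme_id")
        # Skip entries without an ID and entries we have already kept.
        if video_id is None or video_id in seen:
            continue
        seen.add(video_id)
        unique.append(item)
    return unique
```

<p>Because the first occurrence wins, a video that matches several keywords ends up filed under whichever search returned it first.</p>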

<h3 id="retry-with-exponential-backoff">Retry with Exponential Backoff</h3>

<p>Requests to CDN URLs can occasionally return a 403 or time out. Our downloader retries up to 3 times with exponential backoff (1s, 2s, 4s) before marking a video as failed.</p>
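
<p>The retry schedule can be sketched as follows. This is an illustration under stated assumptions: it reads "3 retries" as one initial attempt plus three retries, and <code class="language-plaintext highlighter-rouge">with_retries</code>, <code class="language-plaintext highlighter-rouge">backoff_delays</code>, and the injectable <code class="language-plaintext highlighter-rouge">sleep</code> parameter are hypothetical names, not the downloader's own.</p>

```python
import asyncio

MAX_RETRIES = 3   # retries after the first attempt (assumed interpretation)
BASE_DELAY = 1.0  # seconds; doubles after each failed attempt


def backoff_delays(retries: int = MAX_RETRIES, base: float = BASE_DELAY) -> list[float]:
    """Return the wait before each retry: base, 2*base, 4*base, ..."""
    return [base * 2 ** attempt for attempt in range(retries)]


async def with_retries(operation, retries=MAX_RETRIES, base=BASE_DELAY,
                       sleep=asyncio.sleep):
    """Await operation(); on an exception, back off and try again.

    `operation` is any zero-argument coroutine function. `sleep` can be
    swapped out in tests so they do not actually wait.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            return await operation()
        except Exception as exc:  # a real downloader would catch narrower errors
            last_error = exc
            if attempt == retries:
                break  # out of retries; fall through and re-raise
            await sleep(base * 2 ** attempt)
    raise last_error
```

<p>Doubling the delay gives a struggling CDN node progressively more time to recover, while the retry cap keeps one bad URL from stalling the whole batch.</p>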

<h3 id="metadata-index">Metadata Index</h3>

<p>The <code class="language-plaintext highlighter-rouge">metadata_index.json</code> file serves as a manifest for the next steps in the pipeline. It maps each video to its local path, original keyword, and description — exactly the data we’ll need to generate narration scripts in Part 3.</p>
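
<p>To make the shape of such an index concrete, here is a hedged sketch. The manifest field names (<code class="language-plaintext highlighter-rouge">local_path</code>, <code class="language-plaintext highlighter-rouge">keyword</code>) and the helper names are assumptions for illustration; the real <code class="language-plaintext highlighter-rouge">save_metadata</code> function may structure the file differently.</p>

```python
import json
from pathlib import Path


def build_metadata_index(manifest: list[dict]) -> dict[str, dict]:
    """Map each aweme_id to the fields a later pipeline step would need."""
    return {
        item["aweme_id"]: {
            "local_path": str(item["local_path"]),  # hypothetical field name
            "keyword": item["keyword"],             # hypothetical field name
            "description": item.get("description", ""),
        }
        for item in manifest
    }


def write_index(index: dict, path: Path) -> None:
    """Write the index as UTF-8 JSON, creating parent folders if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-Latin descriptions readable in the file.
        json.dump(index, f, ensure_ascii=False, indent=2)
```

<p>Keying the index by <code class="language-plaintext highlighter-rouge">aweme_id</code> makes lookups by video ID cheap for whichever step consumes the file next.</p>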

<h2 id="whats-next">What’s Next?</h2>

<p>In <a href="/tiktok/tiktok-youtube-part3-ai-narration/">Part 3</a>, we’ll use the downloaded videos and their descriptions to auto-generate news narration scripts with AI (OpenAI / Gemini), convert them to speech, and compile everything into a single news segment using FFmpeg.</p>

<blockquote>
  <p><strong>Series Navigation</strong></p>
  <ul>
    <li><a href="/tiktok/tiktok-youtube-part1-searching-api/">Part 1: Searching TikTok with the Apify Python Client</a></li>
    <li><strong>Part 2: Downloading &amp; Organizing Video Files</strong> ← You are here</li>
    <li><a href="/tiktok/tiktok-youtube-part3-ai-narration/">Part 3: Auto-Generating News Scripts with AI</a></li>
    <li><a href="/tiktok/tiktok-youtube-part4-scheduling-publishing/">Part 4: Scheduling &amp; Publishing to YouTube</a></li>
  </ul>
</blockquote>]]></content><author><name>Novi Develop</name></author><category term="TikTok" /><category term="Web Scraping" /><category term="YouTube" /><category term="apify" /><category term="tiktok-api" /><category term="python" /><category term="tutorial" /><summary type="html"><![CDATA[This is Part 2 of our series on building an automated YouTube news channel with TikTok data. In Part 1, we used the Advanced TikTok Search API Actor to search TikTok and save structured JSON results. Now, we’ll write an async video downloader that fetches watermark-free clips and organizes them by topic.]]></summary></entry><entry><title type="html">TikTok to YouTube Pipeline Part 1: Searching TikTok with the Apify Python Client</title><link href="https://scrapingenthusiast.github.io/tiktok/tiktok-youtube-part1-searching-api/" rel="alternate" type="text/html" title="TikTok to YouTube Pipeline Part 1: Searching TikTok with the Apify Python Client" /><published>2026-03-24T08:00:00+07:00</published><updated>2026-03-24T08:00:00+07:00</updated><id>https://scrapingenthusiast.github.io/tiktok/tiktok-youtube-part1-searching-api</id><content type="html" xml:base="https://scrapingenthusiast.github.io/tiktok/tiktok-youtube-part1-searching-api/"><![CDATA[<p>This is <strong>Part 1</strong> of a 4-part series on building an automated YouTube news channel powered by TikTok data. In this installment, we’ll write the Python code that searches TikTok using the <a href="https://apify.com/novi/advanced-search-tiktok-api?fpr=7hce1m">Advanced TikTok Search API Actor</a> on Apify and saves structured results to a local JSON file.</p>

<blockquote>
  <p><strong>Series Navigation</strong></p>
  <ul>
    <li><strong>Part 1: Searching TikTok with the Apify Python Client</strong> ← You are here</li>
    <li><a href="/tiktok/tiktok-youtube-part2-downloading-videos/">Part 2: Downloading &amp; Organizing Video Files</a></li>
    <li><a href="/tiktok/tiktok-youtube-part3-ai-narration/">Part 3: Auto-Generating News Scripts with AI</a></li>
    <li><a href="/tiktok/tiktok-youtube-part4-scheduling-publishing/">Part 4: Scheduling &amp; Publishing to YouTube</a></li>
  </ul>
</blockquote>

<h2 id="prerequisites">Prerequisites</h2>

<ul>
  <li>Python 3.10+</li>
  <li>An <a href="https://apify.com?fpr=7hce1m">Apify account</a> (free tier works for testing)</li>
  <li>Your Apify API token (found in <strong>Settings → Integrations</strong>)</li>
</ul>

<h2 id="step-1-install-dependencies">Step 1: Install Dependencies</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>apify-client python-dotenv
</code></pre></div></div>

<p>Create a <code class="language-plaintext highlighter-rouge">.env</code> file in your project root:</p>

<pre><code class="language-env">APIFY_TOKEN=your_apify_api_token_here
</code></pre>

<h2 id="step-2-define-your-search-configuration">Step 2: Define Your Search Configuration</h2>

<p>Create a file called <code class="language-plaintext highlighter-rouge">config.py</code> to centralize your search parameters. This makes it easy to manage multiple hotspot topics:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># config.py
</span>
<span class="n">HOTSPOT_SEARCHES</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">{</span>
        <span class="sh">"</span><span class="s">keyword</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">Ukraine war</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">region</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">UA</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">sortType</span><span class="sh">"</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>       <span class="c1"># Most recent
</span>        <span class="sh">"</span><span class="s">publishTime</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">WEEK</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">limit</span><span class="sh">"</span><span class="p">:</span> <span class="mi">50</span><span class="p">,</span>
    <span class="p">},</span>
    <span class="p">{</span>
        <span class="sh">"</span><span class="s">keyword</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">earthquake Turkey</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">region</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">TR</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">sortType</span><span class="sh">"</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">publishTime</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">YESTERDAY</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">limit</span><span class="sh">"</span><span class="p">:</span> <span class="mi">50</span><span class="p">,</span>
    <span class="p">},</span>
    <span class="p">{</span>
        <span class="sh">"</span><span class="s">keyword</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">protest France</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">region</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">FR</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">sortType</span><span class="sh">"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>       <span class="c1"># Most liked
</span>        <span class="sh">"</span><span class="s">publishTime</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">WEEK</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">limit</span><span class="sh">"</span><span class="p">:</span> <span class="mi">30</span><span class="p">,</span>
    <span class="p">},</span>
<span class="p">]</span>
</code></pre></div></div>

<p>Each dictionary maps directly to the input schema of the <a href="https://apify.com/novi/advanced-search-tiktok-api?fpr=7hce1m">Advanced TikTok Search API Actor</a>. The key parameters are:</p>

<table>
  <thead>
    <tr>
      <th>Parameter</th>
      <th>Purpose</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">keyword</code></td>
      <td>The search term (supports spaces)</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">region</code></td>
      <td>Two-letter country code to localize results</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">sortType</code></td>
      <td><code class="language-plaintext highlighter-rouge">0</code> = Relevance, <code class="language-plaintext highlighter-rouge">1</code> = Most Liked, <code class="language-plaintext highlighter-rouge">2</code> = Most Recent</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">publishTime</code></td>
      <td>Time filter: <code class="language-plaintext highlighter-rouge">YESTERDAY</code>, <code class="language-plaintext highlighter-rouge">WEEK</code>, <code class="language-plaintext highlighter-rouge">MONTH</code>, etc.</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">limit</code></td>
      <td>Soft cap on the number of videos returned</td>
    </tr>
  </tbody>
</table>
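
<p>Since a typo in the configuration only surfaces once an Actor run has started, a quick local check can save a wasted run. The sketch below validates only what the table states; <code class="language-plaintext highlighter-rouge">validate_search</code> is a hypothetical helper, and treating every field except <code class="language-plaintext highlighter-rouge">keyword</code> as optional is an assumption.</p>

```python
# sortType values as listed in the table above.
SORT_TYPES = {0: "Relevance", 1: "Most Liked", 2: "Most Recent"}


def validate_search(params: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    if not str(params.get("keyword", "")).strip():
        problems.append("keyword is missing or empty")

    region = params.get("region")
    if region is not None:
        looks_ok = isinstance(region, str) and len(region) == 2 and region.isalpha()
        if not looks_ok:
            problems.append(f"region should be a two-letter code, got {region!r}")

    if "sortType" in params and params["sortType"] not in SORT_TYPES:
        problems.append(f"sortType must be one of {sorted(SORT_TYPES)}")

    if "limit" in params:
        limit = params["limit"]
        if not (isinstance(limit, int) and limit >= 1):
            problems.append("limit must be a positive integer")

    return problems
```

<p>Running this over every entry in <code class="language-plaintext highlighter-rouge">HOTSPOT_SEARCHES</code> before calling the Actor catches configuration mistakes without spending any platform credits.</p>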

<h2 id="step-3-write-the-search-script">Step 3: Write the Search Script</h2>

<p>Create <code class="language-plaintext highlighter-rouge">search_tiktok.py</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># search_tiktok.py
</span>
<span class="kn">import</span> <span class="n">json</span>
<span class="kn">import</span> <span class="n">os</span>
<span class="kn">from</span> <span class="n">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>
<span class="kn">from</span> <span class="n">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="kn">from</span> <span class="n">apify_client</span> <span class="kn">import</span> <span class="n">ApifyClient</span>
<span class="kn">from</span> <span class="n">dotenv</span> <span class="kn">import</span> <span class="n">load_dotenv</span>

<span class="kn">from</span> <span class="n">config</span> <span class="kn">import</span> <span class="n">HOTSPOT_SEARCHES</span>

<span class="nf">load_dotenv</span><span class="p">()</span>

<span class="n">ACTOR_ID</span> <span class="o">=</span> <span class="sh">"</span><span class="s">novi/advanced-search-tiktok-api</span><span class="sh">"</span>
<span class="n">OUTPUT_DIR</span> <span class="o">=</span> <span class="nc">Path</span><span class="p">(</span><span class="sh">"</span><span class="s">data/raw</span><span class="sh">"</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">run_search</span><span class="p">(</span><span class="n">client</span><span class="p">:</span> <span class="n">ApifyClient</span><span class="p">,</span> <span class="n">search_params</span><span class="p">:</span> <span class="nb">dict</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">]:</span>
    <span class="sh">"""</span><span class="s">Run a single TikTok search and return the results.</span><span class="sh">"""</span>
    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">🔍 Searching: </span><span class="sh">'</span><span class="si">{</span><span class="n">search_params</span><span class="p">[</span><span class="sh">'</span><span class="s">keyword</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="sh">'</span><span class="s"> </span><span class="sh">"</span>
          <span class="sa">f</span><span class="sh">"</span><span class="s">(region=</span><span class="si">{</span><span class="n">search_params</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">'</span><span class="s">region</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">ANY</span><span class="sh">'</span><span class="p">)</span><span class="si">}</span><span class="s">)...</span><span class="sh">"</span><span class="p">)</span>

    <span class="n">run</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="nf">actor</span><span class="p">(</span><span class="n">ACTOR_ID</span><span class="p">).</span><span class="nf">call</span><span class="p">(</span><span class="n">run_input</span><span class="o">=</span><span class="n">search_params</span><span class="p">)</span>
    <span class="n">items</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="n">client</span><span class="p">.</span><span class="nf">dataset</span><span class="p">(</span><span class="n">run</span><span class="p">[</span><span class="sh">"</span><span class="s">defaultDatasetId</span><span class="sh">"</span><span class="p">]).</span><span class="nf">iterate_items</span><span class="p">())</span>

    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">   ✅ Found </span><span class="si">{</span><span class="nf">len</span><span class="p">(</span><span class="n">items</span><span class="p">)</span><span class="si">}</span><span class="s"> videos.</span><span class="sh">"</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">items</span>


<span class="k">def</span> <span class="nf">extract_essential_fields</span><span class="p">(</span><span class="n">video</span><span class="p">:</span> <span class="nb">dict</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Extract only the fields we need for the pipeline.</span><span class="sh">"""</span>
    <span class="n">stats</span> <span class="o">=</span> <span class="n">video</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">statistics</span><span class="sh">"</span><span class="p">,</span> <span class="p">{})</span>
    <span class="n">author</span> <span class="o">=</span> <span class="n">video</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">author</span><span class="sh">"</span><span class="p">,</span> <span class="p">{})</span>
    <span class="n">play_addr</span> <span class="o">=</span> <span class="n">video</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">video</span><span class="sh">"</span><span class="p">,</span> <span class="p">{}).</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">play_addr</span><span class="sh">"</span><span class="p">,</span> <span class="p">{})</span>
    <span class="n">download_addr</span> <span class="o">=</span> <span class="n">video</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">video</span><span class="sh">"</span><span class="p">,</span> <span class="p">{}).</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">download_addr</span><span class="sh">"</span><span class="p">,</span> <span class="p">{})</span>

    <span class="k">return</span> <span class="p">{</span>
        <span class="sh">"</span><span class="s">aweme_id</span><span class="sh">"</span><span class="p">:</span> <span class="n">video</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">aweme_id</span><span class="sh">"</span><span class="p">),</span>
        <span class="sh">"</span><span class="s">description</span><span class="sh">"</span><span class="p">:</span> <span class="n">video</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">desc</span><span class="sh">"</span><span class="p">,</span> <span class="sh">""</span><span class="p">),</span>
        <span class="sh">"</span><span class="s">share_url</span><span class="sh">"</span><span class="p">:</span> <span class="n">video</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">share_url</span><span class="sh">"</span><span class="p">,</span> <span class="sh">""</span><span class="p">),</span>
        <span class="sh">"</span><span class="s">create_time</span><span class="sh">"</span><span class="p">:</span> <span class="n">video</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">create_time</span><span class="sh">"</span><span class="p">),</span>
        <span class="sh">"</span><span class="s">region</span><span class="sh">"</span><span class="p">:</span> <span class="n">video</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">region</span><span class="sh">"</span><span class="p">),</span>
        <span class="c1"># Author info
</span>        <span class="sh">"</span><span class="s">author_username</span><span class="sh">"</span><span class="p">:</span> <span class="n">author</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">unique_id</span><span class="sh">"</span><span class="p">),</span>
        <span class="sh">"</span><span class="s">author_nickname</span><span class="sh">"</span><span class="p">:</span> <span class="n">author</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">nickname</span><span class="sh">"</span><span class="p">),</span>
        <span class="c1"># Statistics
</span>        <span class="sh">"</span><span class="s">play_count</span><span class="sh">"</span><span class="p">:</span> <span class="n">stats</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">play_count</span><span class="sh">"</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span>
        <span class="sh">"</span><span class="s">like_count</span><span class="sh">"</span><span class="p">:</span> <span class="n">stats</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">digg_count</span><span class="sh">"</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span>
        <span class="sh">"</span><span class="s">comment_count</span><span class="sh">"</span><span class="p">:</span> <span class="n">stats</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">comment_count</span><span class="sh">"</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span>
        <span class="sh">"</span><span class="s">share_count</span><span class="sh">"</span><span class="p">:</span> <span class="n">stats</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">share_count</span><span class="sh">"</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span>
        <span class="c1"># Video URLs
</span>        <span class="sh">"</span><span class="s">video_url_no_watermark</span><span class="sh">"</span><span class="p">:</span> <span class="p">(</span>
            <span class="p">(</span><span class="n">play_addr</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">url_list</span><span class="sh">"</span><span class="p">)</span> <span class="ow">or</span> <span class="p">[</span><span class="bp">None</span><span class="p">])[</span><span class="mi">0</span><span class="p">]</span>
        <span class="p">),</span>
        <span class="sh">"</span><span class="s">video_url_watermark</span><span class="sh">"</span><span class="p">:</span> <span class="p">(</span>
            <span class="p">(</span><span class="n">download_addr</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">url_list</span><span class="sh">"</span><span class="p">)</span> <span class="ow">or</span> <span class="p">[</span><span class="bp">None</span><span class="p">])[</span><span class="mi">0</span><span class="p">]</span>
        <span class="p">),</span>
        <span class="sh">"</span><span class="s">video_duration_ms</span><span class="sh">"</span><span class="p">:</span> <span class="n">video</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">video</span><span class="sh">"</span><span class="p">,</span> <span class="p">{}).</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">duration</span><span class="sh">"</span><span class="p">),</span>
        <span class="c1"># Hashtags
</span>        <span class="sh">"</span><span class="s">hashtags</span><span class="sh">"</span><span class="p">:</span> <span class="p">[</span>
            <span class="n">t</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">hashtag_name</span><span class="sh">"</span><span class="p">,</span> <span class="sh">""</span><span class="p">)</span>
            <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">video</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">text_extra</span><span class="sh">"</span><span class="p">,</span> <span class="p">[])</span>
            <span class="k">if</span> <span class="n">t</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">type</span><span class="sh">"</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span>
        <span class="p">],</span>
    <span class="p">}</span>


<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">token</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">APIFY_TOKEN</span><span class="sh">"</span><span class="p">)</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">token</span><span class="p">:</span>
        <span class="k">raise</span> <span class="nc">SystemExit</span><span class="p">(</span><span class="sh">"</span><span class="s">❌ APIFY_TOKEN not set. Check your .env file.</span><span class="sh">"</span><span class="p">)</span>

    <span class="n">client</span> <span class="o">=</span> <span class="nc">ApifyClient</span><span class="p">(</span><span class="n">token</span><span class="p">)</span>
    <span class="n">OUTPUT_DIR</span><span class="p">.</span><span class="nf">mkdir</span><span class="p">(</span><span class="n">parents</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="n">timestamp</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">.</span><span class="nf">now</span><span class="p">().</span><span class="nf">strftime</span><span class="p">(</span><span class="sh">"</span><span class="s">%Y%m%d_%H%M%S</span><span class="sh">"</span><span class="p">)</span>

    <span class="n">all_results</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="k">for</span> <span class="n">search_params</span> <span class="ow">in</span> <span class="n">HOTSPOT_SEARCHES</span><span class="p">:</span>
        <span class="n">raw_videos</span> <span class="o">=</span> <span class="nf">run_search</span><span class="p">(</span><span class="n">client</span><span class="p">,</span> <span class="n">search_params</span><span class="p">)</span>
        <span class="n">extracted</span> <span class="o">=</span> <span class="p">[</span><span class="nf">extract_essential_fields</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">raw_videos</span><span class="p">]</span>

        <span class="c1"># Tag each result with the original search keyword
</span>        <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">extracted</span><span class="p">:</span>
            <span class="n">item</span><span class="p">[</span><span class="sh">"</span><span class="s">_search_keyword</span><span class="sh">"</span><span class="p">]</span> <span class="o">=</span> <span class="n">search_params</span><span class="p">[</span><span class="sh">"</span><span class="s">keyword</span><span class="sh">"</span><span class="p">]</span>

        <span class="n">all_results</span><span class="p">.</span><span class="nf">extend</span><span class="p">(</span><span class="n">extracted</span><span class="p">)</span>

    <span class="c1"># Save to a single JSON file
</span>    <span class="n">output_path</span> <span class="o">=</span> <span class="n">OUTPUT_DIR</span> <span class="o">/</span> <span class="sa">f</span><span class="sh">"</span><span class="s">tiktok_results_</span><span class="si">{</span><span class="n">timestamp</span><span class="si">}</span><span class="s">.json</span><span class="sh">"</span>
    <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="n">output_path</span><span class="p">,</span> <span class="sh">"</span><span class="s">w</span><span class="sh">"</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="sh">"</span><span class="s">utf-8</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">json</span><span class="p">.</span><span class="nf">dump</span><span class="p">(</span><span class="n">all_results</span><span class="p">,</span> <span class="n">f</span><span class="p">,</span> <span class="n">ensure_ascii</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">indent</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>

    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="se">\n</span><span class="s">📁 Saved </span><span class="si">{</span><span class="nf">len</span><span class="p">(</span><span class="n">all_results</span><span class="p">)</span><span class="si">}</span><span class="s"> videos to </span><span class="si">{</span><span class="n">output_path</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="sh">"</span><span class="s">__main__</span><span class="sh">"</span><span class="p">:</span>
    <span class="nf">main</span><span class="p">()</span>
</code></pre></div></div>

<h2 id="step-4-run-it">Step 4: Run It</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python search_tiktok.py
</code></pre></div></div>

<p>Expected output:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>🔍 Searching: 'Ukraine war' (region=UA)...
   ✅ Found 50 videos.
🔍 Searching: 'earthquake Turkey' (region=TR)...
   ✅ Found 42 videos.
🔍 Searching: 'protest France' (region=FR)...
   ✅ Found 30 videos.

📁 Saved 122 videos to data/raw/tiktok_results_20260324_080000.json
</code></pre></div></div>

<h2 id="understanding-the-output">Understanding the Output</h2>

<p>Each entry in the JSON file looks like this:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"aweme_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"7229167805625847041"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Breaking: massive protest in central Paris #protest #france"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"share_url"</span><span class="p">:</span><span class="w"> </span><span class="s2">"https://www.tiktok.com/@user/video/7229167805625847041"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"create_time"</span><span class="p">:</span><span class="w"> </span><span class="mi">1683171893</span><span class="p">,</span><span class="w">
  </span><span class="nl">"region"</span><span class="p">:</span><span class="w"> </span><span class="s2">"FR"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"author_username"</span><span class="p">:</span><span class="w"> </span><span class="s2">"news_reporter_01"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"author_nickname"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Breaking News FR"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"play_count"</span><span class="p">:</span><span class="w"> </span><span class="mi">585709</span><span class="p">,</span><span class="w">
  </span><span class="nl">"like_count"</span><span class="p">:</span><span class="w"> </span><span class="mi">25006</span><span class="p">,</span><span class="w">
  </span><span class="nl">"comment_count"</span><span class="p">:</span><span class="w"> </span><span class="mi">183</span><span class="p">,</span><span class="w">
  </span><span class="nl">"share_count"</span><span class="p">:</span><span class="w"> </span><span class="mi">492</span><span class="p">,</span><span class="w">
  </span><span class="nl">"video_url_no_watermark"</span><span class="p">:</span><span class="w"> </span><span class="s2">"https://v19.tiktokcdn-us.com/..."</span><span class="p">,</span><span class="w">
  </span><span class="nl">"video_url_watermark"</span><span class="p">:</span><span class="w"> </span><span class="s2">"https://v19.tiktokcdn-us.com/..."</span><span class="p">,</span><span class="w">
  </span><span class="nl">"video_duration_ms"</span><span class="p">:</span><span class="w"> </span><span class="mi">54635</span><span class="p">,</span><span class="w">
  </span><span class="nl">"hashtags"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"protest"</span><span class="p">,</span><span class="w"> </span><span class="s2">"france"</span><span class="p">],</span><span class="w">
  </span><span class="nl">"_search_keyword"</span><span class="p">:</span><span class="w"> </span><span class="s2">"protest France"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">video_url_no_watermark</code> field is the key asset: it holds the clean video URL we’ll download in Part 2.</p>
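
<p>As a minimal sketch of how one of these entries might be consumed, the snippet below parses a trimmed copy of the sample record, converts <code class="language-plaintext highlighter-rouge">create_time</code> to a UTC timestamp, and pulls out the clean URL. It assumes <code class="language-plaintext highlighter-rouge">create_time</code> is a Unix epoch value in seconds, which the sample suggests but this post does not state. The helper name <code class="language-plaintext highlighter-rouge">summarize_entry</code> is hypothetical.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json
from datetime import datetime, timezone

# A trimmed copy of the sample entry above; a real file holds many such records.
SAMPLE = """
{
  "aweme_id": "7229167805625847041",
  "create_time": 1683171893,
  "region": "FR",
  "video_url_no_watermark": "https://v19.tiktokcdn-us.com/...",
  "hashtags": ["protest", "france"]
}
"""


def summarize_entry(entry):
    """Return the fields a downloader would most likely need (hypothetical helper)."""
    # Assumption: create_time is a Unix timestamp in seconds.
    created = datetime.fromtimestamp(entry["create_time"], tz=timezone.utc)
    return {
        "id": entry["aweme_id"],
        "created_utc": created.isoformat(),
        "region": entry["region"],
        # .get() returns None instead of raising if the clean URL is missing.
        "url": entry.get("video_url_no_watermark"),
    }


summary = summarize_entry(json.loads(SAMPLE))
print(summary["id"], summary["created_utc"], summary["region"])
</code></pre></div></div>

<p>Reading the URL with <code class="language-plaintext highlighter-rouge">.get()</code> is a deliberate choice here: a record without a clean URL can then be skipped rather than crashing the whole run.</p>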

<h2 id="whats-next">What’s Next?</h2>

<p>In <a href="/tiktok/tiktok-youtube-part2-downloading-videos/">Part 2</a>, we’ll write an async downloader that fetches all these watermark-free videos, organizes them by topic and date, and handles retries for failed downloads.</p>

<blockquote>
  <p><strong>Series Navigation</strong></p>
  <ul>
    <li><strong>Part 1: Searching TikTok with the Apify Python Client</strong> ← You are here</li>
    <li><a href="/tiktok/tiktok-youtube-part2-downloading-videos/">Part 2: Downloading &amp; Organizing Video Files</a></li>
    <li><a href="/tiktok/tiktok-youtube-part3-ai-narration/">Part 3: Auto-Generating News Scripts with AI</a></li>
    <li><a href="/tiktok/tiktok-youtube-part4-scheduling-publishing/">Part 4: Scheduling &amp; Publishing to YouTube</a></li>
  </ul>
</blockquote>]]></content><author><name>Novi Develop</name></author><category term="TikTok" /><category term="Web Scraping" /><category term="YouTube" /><category term="apify" /><category term="tiktok-api" /><category term="python" /><category term="tutorial" /><summary type="html"><![CDATA[This is Part 1 of a 4-part series on building an automated YouTube news channel powered by TikTok data. In this installment, we’ll write the Python code that searches TikTok using the Advanced TikTok Search API Actor on Apify and saves structured results to a local JSON file.]]></summary></entry><entry><title type="html">Building a YouTube News Channel on Global Hot Spots with TikTok Data</title><link href="https://scrapingenthusiast.github.io/tiktok/youtube-news-channel-tiktok-data/" rel="alternate" type="text/html" title="Building a YouTube News Channel on Global Hot Spots with TikTok Data" /><published>2026-03-23T08:00:00+07:00</published><updated>2026-03-23T08:00:00+07:00</updated><id>https://scrapingenthusiast.github.io/tiktok/youtube-news-channel-tiktok-data</id><content type="html" xml:base="https://scrapingenthusiast.github.io/tiktok/youtube-news-channel-tiktok-data/"><![CDATA[<p>In the fast-paced world of digital media, short-form content reigns supreme. If you’ve ever considered building a YouTube news channel focused on global hot spots—from geopolitical conflicts to breaking natural disasters—TikTok is arguably your most vital source of on-the-ground footage. However, manually sifting through millions of TikTok videos to find relevant, watermark-free, and hyper-local content is incredibly time-consuming.</p>

<p>This is where the <a href="https://apify.com/novi/advanced-search-tiktok-api?fpr=7hce1m">Advanced TikTok Search API Actor</a> on Apify comes into play. In this post, we’ll explore an innovative idea: automating the curation of a YouTube news channel using this powerful web scraping tool.</p>

<h2 id="the-idea-a-data-driven-global-news-channel">The Idea: A Data-Driven Global News Channel</h2>

<p>The concept is straightforward: create a YouTube channel (using YouTube Shorts and long-form compilations) that reports on current global hot spots. The content relies on citizen journalism and on-the-ground footage sourced directly from TikTok.</p>

<p>To execute this, you need a reliable pipeline to:</p>
<ol>
  <li><strong>Search</strong> for trending keywords related to specific regions or events (e.g., protests in Paris, earthquakes in Japan).</li>
  <li><strong>Filter</strong> the results by date to ensure the news is breaking.</li>
  <li><strong>Download</strong> the videos without watermarks for clean editing.</li>
  <li><strong>Compile</strong> and add editorial narrative or AI voiceovers to provide context.</li>
</ol>

<h2 id="leveraging-the-advanced-tiktok-search-api">Leveraging the Advanced TikTok Search API</h2>

<p>The Advanced TikTok Search API Actor is uniquely positioned to handle the data collection phase of this pipeline. Let’s look at how its features align with our content creation strategy.</p>

<h3 id="1-granular-filtering-for-breaking-news">1. Granular Filtering for Breaking News</h3>

<p>When covering global hot spots, recency is everything. The actor allows you to use the <code class="language-plaintext highlighter-rouge">publishTime</code> filter. By setting it to <code class="language-plaintext highlighter-rouge">YESTERDAY</code> or <code class="language-plaintext highlighter-rouge">WEEK</code>, you ensure that you are only scraping the most recent footage of an ongoing event.</p>

<p>Furthermore, the <code class="language-plaintext highlighter-rouge">region</code> parameter lets you target specific countries using their two-letter code (e.g., <code class="language-plaintext highlighter-rouge">UA</code> for Ukraine, <code class="language-plaintext highlighter-rouge">SD</code> for Sudan). This filters out much of the irrelevant global noise and makes it more likely that the footage originates from the hot spot itself, although a region match alone does not verify where a clip was filmed.</p>

<h3 id="2-keyword-searching-and-sorting">2. Keyword Searching and Sorting</h3>

<p>You can automate searches using specific, localized keywords. The actor’s <code class="language-plaintext highlighter-rouge">keyword</code> parameter is perfect for this. Once you have a search phrase, the <code class="language-plaintext highlighter-rouge">sortType</code> parameter becomes incredibly useful:</p>
<ul>
  <li>Setting <code class="language-plaintext highlighter-rouge">sortType: 2</code> (Most recent) gives you a chronological feed of breaking events.</li>
  <li>Setting <code class="language-plaintext highlighter-rouge">sortType: 1</code> (Most liked) helps you find the most viral, impactful footage that is already resonating with audiences.</li>
</ul>
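
<p>The parameters discussed so far can be combined into a single actor input. The sketch below only assembles that dictionary. The parameter names and values (<code class="language-plaintext highlighter-rouge">keyword</code>, <code class="language-plaintext highlighter-rouge">region</code>, <code class="language-plaintext highlighter-rouge">publishTime</code>, <code class="language-plaintext highlighter-rouge">sortType</code>) are the ones named in this post; the helper <code class="language-plaintext highlighter-rouge">build_search_input</code>, its defaults, and its validation rules are assumptions, and the actor may accept values beyond those shown.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sort values as described in this post; the actor may accept others.
SORT_MOST_LIKED = 1
SORT_MOST_RECENT = 2


def build_search_input(keyword, region, publish_time="WEEK", sort_type=SORT_MOST_RECENT):
    """Assemble an input dict for one search (hypothetical helper)."""
    if not keyword.strip():
        raise ValueError("keyword must not be empty")
    # Assumption: region is a two-letter country code such as UA or SD.
    if len(region) != 2 or not region.isalpha():
        raise ValueError("region must be a two-letter country code")
    return {
        "keyword": keyword.strip(),
        "region": region.upper(),
        "publishTime": publish_time,
        "sortType": sort_type,
    }


searches = [
    build_search_input("Ukraine war", "UA"),
    build_search_input("protest France", "FR", publish_time="YESTERDAY"),
]
for run_input in searches:
    print(run_input)
</code></pre></div></div>

<p>With the Apify Python client, a dictionary like this would typically be passed as <code class="language-plaintext highlighter-rouge">run_input</code> when calling the actor. The actor’s input schema remains the authoritative list of fields and accepted values.</p>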

<h3 id="3-clean-watermark-free-footage">3. Clean, Watermark-Free Footage</h3>

<p>A professional YouTube news channel cannot have distracting TikTok watermarks bouncing around the screen. The Advanced TikTok Search API Actor provides access to both <code class="language-plaintext highlighter-rouge">play_addr</code> (usually watermark-free) and <code class="language-plaintext highlighter-rouge">download_addr</code> (with watermark) in its JSON output. Accessing the clean video URL allows for seamless integration into video editing software like Premiere Pro or automated editing pipelines.</p>

<h3 id="4-bypassing-limits-for-deep-dives">4. Bypassing Limits for Deep Dives</h3>

<p>For major events, you might need hundreds of clips to find the perfect angles. Using the <code class="language-plaintext highlighter-rouge">limit</code> and <code class="language-plaintext highlighter-rouge">isUnlimited: true</code> parameters, you can instruct the actor to dig deep into the search results. While unlimited scraping takes more time and resources, it provides a comprehensive dataset of videos for your editorial team (or automated scripts) to review.</p>

<h2 id="the-automated-workflow">The Automated Workflow</h2>

<p>Here’s a high-level view of how you could automate this YouTube channel using Apify and other tools:</p>

<ol>
  <li><strong>Trigger:</strong> A cron job runs every 12 hours on the Apify platform, triggering the Advanced TikTok Search API Actor with predefined keywords for current global hot spots (e.g., “Paris protest”, “Taiwan earthquake”).</li>
  <li><strong>Extract:</strong> The actor extracts the metadata and watermark-free video URLs for the top 50 most recent videos for each keyword.</li>
  <li><strong>Process:</strong> A webhook sends the JSON data to a platform like Make.com or n8n.</li>
  <li><strong>Download &amp; Edit:</strong> A script downloads the <code class="language-plaintext highlighter-rouge">.mp4</code> files from the provided URLs. You can then use tools like FFmpeg to concatenate clips, add a standard channel intro/outro, or even integrate AI text-to-speech APIs to read auto-generated news scripts based on the video captions (<code class="language-plaintext highlighter-rouge">desc</code> field).</li>
  <li><strong>Publish:</strong> The final video is automatically uploaded to YouTube via the YouTube Data API.</li>
</ol>
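
<p>The concatenation part of step 4 could start from something like the sketch below, which writes a list file for FFmpeg’s concat demuxer and builds the matching command without running it. The clip names, the <code class="language-plaintext highlighter-rouge">build_concat_command</code> helper, and the use of stream copy (<code class="language-plaintext highlighter-rouge">-c copy</code>, which only works when all clips share the same codec and encoding parameters) are assumptions for illustration.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import tempfile
from pathlib import Path


def build_concat_command(clips, list_path, output_path):
    """Write a concat list file and return the FFmpeg argument list (hypothetical helper)."""
    lines = []
    for clip in clips:
        # The concat demuxer expects one "file 'path'" directive per line;
        # a single quote inside a path is written as '\''.
        escaped = str(clip).replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    Path(list_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return [
        "ffmpeg",
        "-f", "concat",
        "-safe", "0",  # accept paths that the default safe mode would reject
        "-i", str(list_path),
        "-c", "copy",  # assumption: clips share codec and encoding parameters
        str(output_path),
    ]


with tempfile.TemporaryDirectory() as workdir:
    list_file = Path(workdir) / "clips.txt"
    # Hypothetical clip names: channel intro, one downloaded clip, outro.
    clips = ["intro.mp4", "clip_7229167805625847041.mp4", "outro.mp4"]
    command = build_concat_command(clips, list_file, "compilation.mp4")
    list_text = list_file.read_text(encoding="utf-8")
    print(list_text)
    print(" ".join(command))
</code></pre></div></div>

<p>Once FFmpeg is installed, the returned argument list can be handed to <code class="language-plaintext highlighter-rouge">subprocess.run</code>. Clips with differing codecs or resolutions would need re-encoding instead of stream copy.</p>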

<h2 id="conclusion">Conclusion</h2>

<p>Building a news channel around global hot spots requires speed, accuracy, and access to raw footage. By using the <a href="https://apify.com/novi/advanced-search-tiktok-api?fpr=7hce1m">Advanced TikTok Search API Actor</a> to systematically scrape and filter TikTok data, you can build a highly efficient news aggregation pipeline. It transforms the chaotic landscape of social media into a structured, easily accessible database of breaking news, ready to be edited and delivered to a global YouTube audience.</p>

<h2 id="implementation-series-build-it-with-python">Implementation Series: Build It with Python</h2>

<p>Ready to build this pipeline? Follow our step-by-step Python implementation series:</p>

<ol>
  <li><a href="/tiktok/tiktok-youtube-part1-searching-api/">Part 1: Searching TikTok with the Apify Python Client</a></li>
  <li><a href="/tiktok/tiktok-youtube-part2-downloading-videos/">Part 2: Downloading &amp; Organizing Video Files</a></li>
  <li><a href="/tiktok/tiktok-youtube-part3-ai-narration/">Part 3: Auto-Generating News Scripts with AI</a></li>
  <li><a href="/tiktok/tiktok-youtube-part4-scheduling-publishing/">Part 4: Scheduling &amp; Publishing to YouTube</a></li>
</ol>]]></content><author><name>Novi Develop</name></author><category term="TikTok" /><category term="Web Scraping" /><category term="YouTube" /><category term="apify" /><category term="tiktok-api" /><category term="content-creation" /><category term="osint" /><summary type="html"><![CDATA[In the fast-paced world of digital media, short-form content reigns supreme. If you’ve ever considered building a YouTube news channel focused on global hot spots—from geopolitical conflicts to breaking natural disasters—TikTok is arguably your most vital source of on-the-ground footage. However, manually sifting through millions of TikTok videos to find relevant, watermark-free, and hyper-local content is incredibly time-consuming.]]></summary></entry></feed>