How to crawl videos and subtitles from TED pages
qiuzijian7
Posted on May 6, 2023
I. Introduction
TED talks are stories told by people who are either leaders in an established field or pioneers in an emerging one, recounting their extraordinary experiences. TED is a private non-profit organization in the United States, famous for its TED conference and its motto, "ideas worth spreading". The site is a good place to learn English and practice listening, but not every talk offers its video and audio for download, and downloaded subtitles in multiple languages need reformatting, which is time-consuming and laborious. I therefore built TED Downloader, a free web-based crawling tool that, given the URL of a talk, crawls its video, audio, and subtitles and generates a bilingual PDF in Cornell Notes format, making the material easy for English learners to obtain. This article shares how the tool works, covering HLS streaming audio/video download, SRT subtitle crawling, and react-pdf-based document generation.
Keywords: TED; TED Downloader; crawler; Cornell notes; English; TED download
- Get the basic TED talk information. Fetch the talk page, parse the returned HTML with cheerio.load, extract the `script#__NEXT_DATA__` element, and parse its contents as JSON. From this data you can read the talk's id, name (slug), and other fields, as well as the set of languages the talk supports; recording this information lets you later request the transcript in a target language.
// Fetch the TED talk page HTML
const shotUrl = text.replace('https://www.ted.com', '/ted');
const response = await axios.get(shotUrl, {
  withCredentials: true
});
// Parse the HTML with Cheerio and extract the __NEXT_DATA__ JSON
const $ = cheerio.load(response.data);
const scriptData = $('script#__NEXT_DATA__').html();
const jsonData = JSON.parse(scriptData);
const videoData = jsonData.props.pageProps.videoData;
setId(videoData.id);
setName(videoData.slug);
setLog("[INFO] id:" + videoData.id + " name:" + videoData.slug);
const jsonPlayerData = JSON.parse(videoData.playerData);
const topicNames = videoData.topics.nodes.map(node => node.name);
const concatenatedNames = topicNames.join(';');
setTopics(concatenatedNames);
setPlayerData(jsonPlayerData);
setTranscriptData(jsonData.props.pageProps.transcriptData.translation);
toast.loading(`Validating...`, { duration: 800 });
// Collect the supported subtitle languages from the alternate links, then
// update state once (do not push into the React state array directly)
const langs = [];
$('link[rel="alternate"]').each((i, el) => {
  const hreflang = $(el).attr('hreflang');
  if (hreflang && hreflang !== 'x-default') {
    langs.push(hreflang);
  }
});
setHreflangs(langs);
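For orientation, the slice of the `__NEXT_DATA__` JSON that this code reads has roughly the following shape (the field values here are illustrative, and the real payload carries many more fields):

```javascript
// Illustrative shape of the __NEXT_DATA__ slice read by the code above;
// values are made up, and the real payload contains many more fields.
const jsonDataExample = {
  props: {
    pageProps: {
      videoData: {
        id: '12345',
        slug: 'sample_speaker_sample_talk',
        // playerData is itself a JSON string and is parsed a second time
        playerData: '{}',
        topics: { nodes: [{ name: 'science' }, { name: 'education' }] },
      },
      transcriptData: { translation: { paragraphs: [] } },
    },
  },
};

const videoData = jsonDataExample.props.pageProps.videoData;
console.log(videoData.topics.nodes.map((n) => n.name).join(';'));
// science;education
```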
- The video files in a TED talk are delivered as m3u8-based streaming files. An m3u8 file is an M3U file in UTF-8 encoding, and M3U itself is a plain-text index file: when you open one, the player does not play the file itself but follows its index to the network addresses of the corresponding audio and video segments for online playback. M3U8 is a common streaming-media format built around file lists, supports both live and on-demand playback, and is especially common on Android, iOS, and similar platforms. The tool downloads the audio and video segments listed in the m3u8 files parsed from the URL, then uses ffmpeg to combine them into an MP4 video. Reference: Fenngtun, "Audio/Video Codecs: The M3U8 File Format" (音视频编解码--M3U8文件格式).
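An m3u8 index is human-readable. For illustration, a minimal media playlist might look like this (tags simplified, segment URIs made up):

```
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:6
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:6.0,
1080p-v1-a1/segment-0.ts
#EXTINF:6.0,
1080p-v1-a1/segment-1.ts
#EXT-X-ENDLIST
```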
async function startDownload(mediaURLs, mime, suffix) {
  if (downloadState === STARTING_DOWNLOAD || downloadState === SEGMENT_STARTING_DOWNLOAD || downloadState === SEGMENT_STICHING) {
    setLog(`[INFO] data is downloading, please wait a moment ... `);
    return;
  }
  setdownloadState(STARTING_DOWNLOAD);
  setLog(`[INFO] Download MP4 job started`);
  try {
    setLog(`[INFO] Fetching segments`);
    let getSegments = {
      type: SEGMENT,
      data: [],
    };
    for (let i = 0; i < mediaURLs.length; ++i) {
      let tmpSegments = await parseHlsSegment({ hlsUrl: mediaURLs[i], headers: {} });
      getSegments.data = [...getSegments.data, ...tmpSegments.data];
    }
    if (getSegments.type !== SEGMENT)
      throw new Error(`Invalid segment url, please refresh the page`);
    let segments = getSegments.data.map((s, i) => ({ ...s, index: i }));
    setLog(`[INFO] Initializing ffmpeg`);
    const ffmpeg = createFFmpeg({ log: false });
    await ffmpeg.load();
    setLog(`[INFO] ffmpeg loaded`);
    setdownloadState(SEGMENT_STARTING_DOWNLOAD);
    // Download segments in batches of SEGMENT_CHUNK_SIZE parallel requests
    let segmentChunks = [];
    for (let i = 0; i < segments.length; i += SEGMENT_CHUNK_SIZE) {
      segmentChunks.push(segments.slice(i, i + SEGMENT_CHUNK_SIZE));
    }
    let successSegmentsVideo = [];
    let successSegmentsAudio = [];
    for (let i = 0; i < segmentChunks.length; i++) {
      setLog(
        `[Downloading segment chunks] ${i}/${segmentChunks.length} - Chunksize: ${SEGMENT_CHUNK_SIZE}`
      );
      let segmentChunk = segmentChunks[i];
      await Promise.all(
        segmentChunk.map(async (segment) => {
          const parts = segment.uri.split('/');
          const filename = parts[parts.length - 1];
          let fileId = filename;
          try {
            let getFile = await fetch(segment.uri, {
              headers: {
                ...(sendHeaderWhileFetchingTS ? headers : {}),
              },
            });
            if (!getFile.ok) throw new Error("File failed to fetch");
            // Stream the response body so per-segment progress can be logged
            const totalSize = parseInt(getFile.headers.get('content-length'), 10);
            let downloadedSize = 0;
            let fileUint8Array = new Uint8Array(totalSize);
            const reader = getFile.body.getReader();
            while (true) {
              const { done, value } = await reader.read();
              if (done) break;
              fileUint8Array.set(value, downloadedSize);
              downloadedSize += value.length;
              const progress = Math.round((downloadedSize / totalSize) * 100);
              setLog(`[Downloading segment ${segment.index} in chunk ${i}] downloading ${progress}%`);
            }
            // Write the segment into ffmpeg's in-memory filesystem
            ffmpeg.FS("writeFile", fileId, fileUint8Array);
            // Video segment filenames contain 'v1'; everything else is audio
            if (fileId.includes('v1')) {
              successSegmentsVideo.push(fileId);
            } else {
              successSegmentsAudio.push(fileId);
            }
            setLog(`[Downloading segment ${segment.index} in chunk ${i}] downloaded into ${fileId}`);
          } catch (error) {
            setLog(`[ERROR] Segment ${fileId} download error`);
          }
        })
      );
    }
    // Sort segment files numerically by the first number in the filename.
    // (A /g regex is stateful across exec() calls, so use match() instead.)
    const segmentIndex = (name) => parseInt(name.match(/\d+/)[0], 10);
    successSegmentsVideo.sort((a, b) => segmentIndex(a) - segmentIndex(b));
    successSegmentsAudio.sort((a, b) => segmentIndex(a) - segmentIndex(b));
    setLog(`[INFO] Stitching segments started`);
    setdownloadState(SEGMENT_STICHING);
    // Concatenate the video and audio segments and mux them without re-encoding
    await ffmpeg.run(
      "-i",
      `concat:${successSegmentsVideo.join("|")}`,
      "-i",
      `concat:${successSegmentsAudio.join("|")}`,
      "-c:v",
      "copy",
      "-c:a",
      "copy",
      "output.mp4"
    );
    setLog(`[INFO] Stitching segments finished`);
    // Clean up the downloaded segments from the in-memory filesystem
    [...successSegmentsVideo, ...successSegmentsAudio].forEach((segment) => {
      try {
        ffmpeg.FS("unlink", segment);
      } catch (_) {}
    });
    let data;
    try {
      data = ffmpeg.FS("readFile", "output.mp4");
    } catch (_) {
      throw new Error(`Something went wrong while stitching!`);
    }
    setLog("[INFO] Download finished");
    setdownloadState(JOB_FINISHED);
    const objurl = URL.createObjectURL(new Blob([data.buffer], { type: mime }));
    // Create an anchor element, append it to the document body, and click it
    // to trigger the download
    const downloadLink = document.createElement('a');
    downloadLink.href = objurl;
    downloadLink.download = tedName + suffix;
    document.body.appendChild(downloadLink);
    downloadLink.click();
    setTimeout(() => {
      ffmpeg.exit(); // ffmpeg.exit() is callable only after load()
    }, 1000);
  } catch (error) {
    setLog(error.message);
    setdownloadState(DOWNLOAD_ERROR);
    toast.error(error.message);
  }
}
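The `parseHlsSegment` helper used above is not shown in this post. Assuming it fetches a plain (unencrypted) media playlist and returns the segment URIs resolved against the playlist URL, a minimal sketch could look like the following (the function names `parsePlaylist` and `parseHlsSegmentSketch` are hypothetical, not the tool's actual implementation):

```javascript
// Minimal sketch of an HLS media-playlist parser. It keeps the non-comment
// lines of the playlist (the segment URIs) and resolves each one against
// the playlist URL, tagging it with its position in the list.
function parsePlaylist(playlistText, hlsUrl) {
  const segments = playlistText
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line && !line.startsWith('#')) // drop #EXT tags/comments
    .map((uri, index) => ({
      uri: new URL(uri, hlsUrl).href, // resolve relative segment URIs
      index,
    }));
  return { type: 'SEGMENT', data: segments };
}

async function parseHlsSegmentSketch({ hlsUrl, headers }) {
  const response = await fetch(hlsUrl, { headers });
  if (!response.ok) throw new Error('Playlist failed to fetch');
  return parsePlaylist(await response.text(), hlsUrl);
}
```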
- SRT files are plain-text subtitle files used during video playback, so they contain no video data. Time codes in an SRT file consist of hours, minutes, seconds, and milliseconds, displayed in the format HH:MM:SS,MIL. Display coordinates for a subtitle may appear after the end time code; if there are none, the subtitle is shown at the bottom center of the video by default. The translated text provided by TED carries only a start time for each cue, so converting it to SRT means computing each cue's end time and writing the entries in the SRT layout. The effect is as follows:
const transcriptUrl = '/ted/graphql?operationName=Transcript&variables={"id":"' + tedName + '","language":"' + code + '"}&extensions={"persistedQuery":{"version":1,"sha256Hash":"906b90e820733c27cab3bb5de1cb4578657af4610c346b235b4ece9e89dc88bd"}}';
try {
  // Request the transcript data from TED's GraphQL endpoint
  const transcriptResponse = await axios.get(transcriptUrl);
  // Extract the transcript paragraphs and build the SRT output
  const paragraphs = transcriptResponse.data.data.translation.paragraphs;
  let counter = 1;
  let srtData = '';
  for (let i = 0; i < paragraphs.length; i++) {
    const cues = paragraphs[i].cues;
    const nextCues = i < paragraphs.length - 1 ? paragraphs[i + 1].cues : null;
    for (let j = 0; j < cues.length; j++) {
      const { time, text } = cues[j];
      // A cue ends where the next cue starts; the very last cue has no
      // successor, so assume a fixed display duration (3 seconds here)
      const endTime = j < cues.length - 1 ? cues[j + 1].time
        : nextCues ? nextCues[0].time
        : time + 3000;
      // Format the entry in SRT layout and append it to the output
      const item = new SubRipItem(counter, time, endTime, text);
      srtData += item.toString();
      counter++;
    }
  }
  setLog("[INFO] Download finished");
  setdownloadState(JOB_FINISHED);
  // Wrap the SRT data in a Blob and download it through a temporary anchor
  const blob = new Blob([srtData], { type: 'text/plain' });
  const url = URL.createObjectURL(blob);
  const downloadLink = document.createElement('a');
  downloadLink.href = url;
  downloadLink.download = tedName + '_' + code + '.srt';
  document.body.appendChild(downloadLink);
  downloadLink.click();
} catch (error) {
  setLog(error.message);
  setdownloadState(DOWNLOAD_ERROR);
  toast.error(error.message);
}
}
export class SubRipItem {
  constructor(index, startMs, endMs, text) {
    this.index = index;
    this.start = this.formatTime(startMs);
    this.end = this.formatTime(endMs);
    this.text = text;
  }
  // Format a millisecond offset as the SRT time code HH:MM:SS,mmm
  formatTime(ms) {
    const date = new Date(ms);
    const hh = this.padNumber(date.getUTCHours(), 2);
    const mm = this.padNumber(date.getUTCMinutes(), 2);
    const ss = this.padNumber(date.getUTCSeconds(), 2);
    const msFormatted = this.padNumber(date.getUTCMilliseconds(), 3);
    return `${hh}:${mm}:${ss},${msFormatted}`;
  }
  padNumber(num, size) {
    let padded = num.toString();
    while (padded.length < size) {
      padded = "0" + padded;
    }
    return padded;
  }
  toString() {
    // SRT entries are separated by a blank line
    return `${this.index}\n${this.start} --> ${this.end}\n${this.text}\n\n`;
  }
}
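As a quick check of the SRT layout, here is a self-contained copy of the SubRipItem logic (renamed `SubRipItemDemo` so the snippet runs on its own) formatting a single cue:

```javascript
// Self-contained copy of the SubRipItem logic above, used to show the
// output layout of one SRT entry.
class SubRipItemDemo {
  constructor(index, startMs, endMs, text) {
    this.index = index;
    this.start = SubRipItemDemo.formatTime(startMs);
    this.end = SubRipItemDemo.formatTime(endMs);
    this.text = text;
  }
  // HH:MM:SS,mmm with zero padding, as required by SRT
  static formatTime(ms) {
    const pad = (n, w) => String(n).padStart(w, '0');
    const date = new Date(ms);
    return `${pad(date.getUTCHours(), 2)}:${pad(date.getUTCMinutes(), 2)}:` +
           `${pad(date.getUTCSeconds(), 2)},${pad(date.getUTCMilliseconds(), 3)}`;
  }
  toString() {
    return `${this.index}\n${this.start} --> ${this.end}\n${this.text}\n\n`;
  }
}

console.log(new SubRipItemDemo(1, 13000, 16500, 'Hello, world.').toString());
// 1
// 00:00:13,000 --> 00:00:16,500
// Hello, world.
```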
- Cornell Notes style PDF generation based on react-pdf (https://github.com/wojtekmaj/react-pdf). A CornellFormat React component takes the title, topic categories, author, date, and the text in both languages as props and lays them out as a Cornell Notes page; the generated PDF is then downloaded locally.
export default function CornellFormat({ title, topics, author, data, children }) {
  // Register the font; fonts of different weights can be registered here,
  // each requiring its own font file to be imported
  Font.register({
    family: "SIMKAI",
    fonts: [
      {
        src: font,
      }
    ],
  });
  return (
    <View>
      <View style={{ flexDirection: 'row', justifyContent: 'space-between' }}>
        <Text style={{ fontSize: 24 }}>{title}</Text>
      </View>
      <View style={{ flexDirection: 'row', marginTop: 10 }}>
        <View style={{ flex: 8 }}>
          <View style={{ flexDirection: 'row', alignItems: 'center' }}>
            <View style={{ width: 10, height: 10, borderWidth: 1, marginRight: 5 }} />
            <Text style={{ fontSize: 12 }}>Topics: {topics}</Text>
          </View>
          <View style={{ flexDirection: 'row', alignItems: 'center' }}>
            <View style={{ width: 10, height: 10, borderWidth: 1, marginRight: 5 }} />
            <Text style={{ fontSize: 12 }}>Author: {author}</Text>
          </View>
          <View style={{ flexDirection: 'row', alignItems: 'center' }}>
            <View style={{ width: 10, height: 10, borderWidth: 1, marginRight: 5 }} />
            <Text style={{ fontSize: 12 }}>Date: {data}</Text>
          </View>
          <View style={{ marginTop: 5, fontSize: 12, fontFamily: "SIMKAI" }}>
            {
              // children alternates between original-language paragraphs
              // (even indices) and their translation (odd indices); the
              // translation is rendered character by character inside a
              // wrapping row so long CJK lines break correctly
              (children || []).map((content, index) => {
                if (index % 2 === 0) {
                  return (<Text key={index} style={{ fontSize: 12, flexWrap: "wrap" }}>{content}</Text>);
                } else {
                  return (
                    <React.Fragment key={index}>
                      <View style={{ flexDirection: "row", fontSize: 10, flexWrap: "wrap" }}>
                        {Array.from(content).map((char, charIndex) => <Text key={charIndex}>{char}</Text>)}
                      </View>
                      <View style={{ width: 10, height: 10, borderWidth: 1, marginRight: 5 }} />
                    </React.Fragment>
                  );
                }
              })
            }
          </View>
        </View>
        <View style={{ width: 1, backgroundColor: 'black', marginHorizontal: 10 }} />
        <View style={{ flex: 2 }}>
          <View style={{ flexDirection: 'row', alignItems: 'center' }}>
            <View style={{ width: 10, height: 10, borderWidth: 1, marginRight: 5 }} />
            <Text>Summary</Text>
          </View>
          <View style={{ marginTop: 5 }}>
            <Text></Text>
          </View>
        </View>
      </View>
    </View>
  );
}
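The CornellFormat component consumes its children as a flat array alternating between original lines (even indices) and translated lines (odd indices). A small sketch of how such an array might be built from two parallel transcripts (the helper name and the sample data are made up for illustration):

```javascript
// Interleave two parallel transcripts into the alternating array that the
// CornellFormat component above consumes (even index: original line,
// odd index: its translation). Tolerates a translation with fewer lines.
function interleaveTranscripts(original, translation) {
  const children = [];
  for (let i = 0; i < original.length; i++) {
    children.push(original[i]);
    children.push(translation[i] ?? ''); // pad missing translated lines
  }
  return children;
}

const english = ['Ideas worth spreading.', 'Thank you.'];
const chinese = ['值得传播的思想。', '谢谢。'];
console.log(interleaveTranscripts(english, chinese));
// ['Ideas worth spreading.', '值得传播的思想。', 'Thank you.', '谢谢。']
```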