How to extract an embedded video with Python code

Below, I show you how I extracted the embedded video source, along with a code example in Python. All right, pick a Yahoo article, and let’s dig!

HTML

As many embedded video extraction tutorials point out, finding the embedded video source through a browser’s web inspector is relatively easy. However, in order to build an efficient extractor that covers all Yahoo! Japan News articles, we need a clear, reproducible path from the original URL to the video source.

First, the obvious: I access an article and search for video-related extensions like “mp4” and “m3u8” in the source code.

Article source code

Nada. Oftentimes, the word “player” is associated with videos, so let’s see if that works.

Article source code

Aha! An external JavaScript file, _embed.js_. Note the _contentid_ and _spaceid_ values in the parameters sent to _embed.js_. They look useful. Now, let’s check what’s inside.
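To make this step reproducible, here is a minimal sketch that pulls both values out of the page source. The sample tag is a hypothetical stand-in (the src path is a placeholder, not the real location of embed.js), and find_embed_params is a name I made up for illustration; only the yvpub-player class and the contentid and spaceid parameters come from the article page.

```python
import re
from html import unescape

# Hypothetical sample of the embed tag; the src host and path are placeholders.
SAMPLE_HTML = (
    '<script class="yvpub-player" '
    'src="https://example.invalid/embed.js?contentid=1602163&amp;spaceid=2078710353">'
    '</script>'
)

def find_embed_params(page):
    """Return contentid and spaceid from the player <script> tag, or None."""
    tag = re.search(r'<script[^>]+class=["\']yvpub-player["\'][^>]*>', page)
    if tag is None:
        return None
    # Unescape first so that "&amp;" in the attribute becomes a plain "&".
    attrs = unescape(tag.group(0))
    params = dict(re.findall(r'[?&](contentid|spaceid)=([^&"\']+)', attrs))
    if 'contentid' not in params or 'spaceid' not in params:
        return None
    return params

print(find_embed_params(SAMPLE_HTML))
# {'contentid': '1602163', 'spaceid': '2078710353'}
```

Returning None instead of raising keeps the helper usable on pages that have no embedded player at all.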

embed.js

The code seems to reference another script, player.js, and includes one parameter: the current UNIX timestamp converted to hours.
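For reference, a value like that is easy to reproduce in Python. This is only a sketch: I am assuming the script floors the result to whole hours, and unix_hours is my own name, not something taken from embed.js.

```python
import time

def unix_hours(now=None):
    """Whole hours elapsed since the UNIX epoch (assumes the value is floored)."""
    seconds = time.time() if now is None else now
    return int(seconds // 3600)

print(unix_hours())      # changes once per hour
print(unix_hours(7200))  # 2
```

A value that only changes once an hour would behave like an hourly cache-buster, though that is a guess on my part.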

Network Inspection

Let’s take a look inside player.js with Google Chrome DevTools.

player.js

player.js is huge and scary-looking, and it doesn’t contain any useful mp4 or [m3u8](https://www.lifewire.com/m3u8-file-2621956) URLs either.

Okay, let’s work backwards and search for mp4 in requests that were made when we loaded the page (you may want to reload the page).
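If you would rather not click through every request by hand, the same search can be sketched in Python against a HAR file exported from the DevTools Network tab. The helper names and the sample data here are hypothetical; the nesting (log, entries, request.url, response.content.text) follows the HAR format.

```python
import json

def load_har(path):
    """Read a HAR file exported from the DevTools Network tab."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)

def find_in_har(har, needle):
    """Return the URLs of requests whose response body contains needle."""
    hits = []
    for entry in har.get('log', {}).get('entries', []):
        # Text bodies are stored as-is. Binary bodies may be base64-encoded;
        # this sketch does not decode them.
        body = entry.get('response', {}).get('content', {}).get('text') or ''
        if needle in body:
            hits.append(entry['request']['url'])
    return hits

# Hand-made, HAR-shaped sample with placeholder URLs (not real traffic).
sample_har = {'log': {'entries': [
    {'request': {'url': 'https://example.invalid/player.js'},
     'response': {'content': {'text': 'function play() {}'}}},
    {'request': {'url': 'https://example.invalid/content/123'},
     'response': {'content': {'text': '{"Url": "https://example.invalid/v.mp4"}'}}},
]}}

print(find_in_har(sample_har, 'mp4'))
# ['https://example.invalid/content/123']
```

With a real export you would call find_in_har(load_har('page.har'), 'mp4') instead, where page.har is whatever file name you saved.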

JSON response from https://feapi-yvpub.yahooapis.jp/v1/content/

Bingo! A JSON response with our m3u8 and mp4 sources. This response is generated by a request our page made to https://feapi-yvpub.yahooapis.jp/v1/content/ with the query parameters appid, output, space_id, domain, ak, and device_type, plus a few thumb values.

What’s the 1602163 in …/v1/content/1602163? That’s the value of contentid we noted earlier. And space_id here matches our spaceid. Nice. What about appid? Let’s see if its value is mentioned in any other responses.
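Put together, the request URL is just the endpoint, the contentid as the last path segment, and a query string. Here is a small sketch using only the standard library. build_content_url is a hypothetical helper name, and I only fill in the parameters whose values we know so far.

```python
from urllib.parse import urlencode

API_BASE = 'https://feapi-yvpub.yahooapis.jp/v1/content/'

def build_content_url(content_id, params):
    """Join the endpoint, the contentid path segment, and the query string."""
    return '%s%s?%s' % (API_BASE, content_id, urlencode(params))

print(build_content_url('1602163', {'output': 'json', 'space_id': '2078710353'}))
# https://feapi-yvpub.yahooapis.jp/v1/content/1602163?output=json&space_id=2078710353
```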

player.script.js

There it is, hard-coded in player.script.js! And a quick look at other news articles confirms that this value is used for all embedded video requests. Three values down. Now, let’s search for ak’s value.

Unfortunately, the ak value is nowhere else to be found but in this JSON query. Where is ak coming from?

Breakpoints

I have a hunch that ak might be a key in a JavaScript object. Let’s search for “ak:”.

player.js

player.script.js

Aha! It looks like ak is a concatenation of “_” and two strings in both player.js and player.script.js. We also see it in player.js passed to function k.md5().

My guess is that this concatenated string value is converted to an md5 hash value. But first we must figure out the values being concatenated to “_”.

Let’s open player.js in the Sources tab (right click inside Response body and choose Open in Sources panel) and add a breakpoint after ak is defined.

player.js

Next, let’s close our open file tabs under Sources and reload the page.

Odd. It doesn’t seem to hit that line. Let’s try the same thing in player.script.js.

player.script.js

(Make sure to click the Pretty Print brackets on the bottom left after reloading the page.)

player.script.js

Bingo! i appears to be the same value as the spaceid that we noted earlier, and r is “headlines.yahoo.co.jp”, our host name, so we now have the string value “2078710353_headlines.yahoo.co.jp”.

It looks nothing like the long, cryptic ak value in the request query, but remember that k.md5 function call in player.js? Let’s check its md5 hash value.

Would you look at that! It’s the same value as ak in the JSON request query. Nailed it.
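The same check can be sketched in Python with hashlib. make_ak is my own helper name, and the recipe (spaceid, an underscore, then the host) is what I inferred from player.script.js, not a documented API.

```python
import hashlib

def make_ak(space_id, host):
    """md5 hex digest of '<spaceid>_<host>', as inferred from player.script.js."""
    return hashlib.md5(('%s_%s' % (space_id, host)).encode('utf-8')).hexdigest()

ak = make_ak('2078710353', 'headlines.yahoo.co.jp')
print(ak)  # 32 hex characters to compare against the ak in the request query
```

Because the digest depends only on the spaceid and the host name, it can be recomputed for any article without touching the player scripts.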

Recap

Recall that the JSON request included the following parameters, minus the thumb values.

appid: dj0zaiZpPVZMTV…jcmV0Jng9YjU-
output: json
space_id: 2078710353
domain: headlines.yahoo.co.jp
ak: 40e90ec7a4ffb34260fcbb9497778731
device_type: 1100

We now have all the unique values (and, more importantly, their sources) that are required to make a JSON request programmatically in our code for any article. One last thing on the parameters: is device_type necessary? Let’s make a request without it.

No video data. Apparently it is, so we’ll keep it.
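The same drop-one-and-retry check could be automated for every parameter. The sketch below is hypothetical throughout: required_params takes any callable that reports whether a request returned video data, and fake_has_video merely pretends that two of the keys are needed so the example runs without a network connection.

```python
def required_params(params, has_video):
    """Return the names whose removal makes has_video() report no video data."""
    required = []
    for name in params:
        trimmed = {k: v for k, v in params.items() if k != name}
        if not has_video(trimmed):
            required.append(name)
    return required

# Stand-in for a real request; the set of "needed" keys here is purely
# illustrative and says nothing about the real API.
def fake_has_video(params):
    return {'ak', 'device_type'} <= set(params)

sample_params = {'output': 'json', 'ak': 'placeholder', 'device_type': '1100'}
print(required_params(sample_params, fake_has_video))
# ['ak', 'device_type']
```

In real use, has_video would send the request with the trimmed parameters and inspect the JSON that comes back.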

Here’s a quick review of how we got to our video data:

Article URL (extracted host name)
Article HTML source code (extracted contentid and spaceid)
md5 hash generator (ran on spaceid + “_” + host to get value of ak)
player.script.js (extracted appid)
Request to https://feapi-yvpub.yahooapis.jp/v1/content/{contentid} with our contentid, appid, spaceid, and ak values.

Code

What might this extraction process look like in code? Here’s a rough example in Python:

import hashlib
import re
import requests


_VALID_URL = r'https?://(?P<host>(?:news|headlines)\.yahoo\.co\.jp)[^\d]*(?P<id>\d[\d-]*\d)?'

# More functions here...

def yahoojnews_extract(url):
    mobj = re.match(_VALID_URL, url)
    if not mobj:
        raise ValueError('Invalid url %s' % url)
    host = mobj.group('host')
    display_id = mobj.group('id') or host
    webpage = _download_webpage(url)

    title = _search_title(webpage)
    
    if display_id == host:
        # Headline page (w/ multiple BC playlists) ('news.yahoo.co.jp', 'headlines.yahoo.co.jp/videonews/', ...)
        return _playlist_result(webpage)

    # Article page
    description = _search_description(webpage)
    thumbnail = _search_thumbnail(webpage)

    space_id = _search_regex([
            r'<script[^>]+class=["\']yvpub-player["\'][^>]+spaceid=([^&"\']+)',
            r'YAHOO\.JP\.srch\.\w+link\.onLoad[^;]+spaceID["\' ]*:["\' ]+([^"\']+)',
            r'<!--\s+SpaceID=(\d+)'
        ], webpage, 'spaceid')
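    content_match = re.search(
        r'<script[^>]+class=(["\'])yvpub-player\1[^>]+contentid=(?P<contentid>[^&"\']+)',
        webpage,
    )
    if not content_match:
        # Fail with a clear error instead of an AttributeError on None
        raise ValueError('No embedded video found in %s' % url)
    content_id = content_match.group('contentid')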

    r = requests.get(
        'https://feapi-yvpub.yahooapis.jp/v1/content/%s' % content_id,
        headers={
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Origin': 'https://s.yimg.jp',
            'Host': 'feapi-yvpub.yahooapis.jp',
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
            'Referer': 'https://s.yimg.jp/images/yvpub/player/vamos/pc/latest/player.html',
        },
        params={
            'appid': 'gj0zaiZpPVZMTVFJR0F...VycbVjcmV0jng9Yju-',
            'output': 'json',
            'space_id': space_id,
            'domain': host,
            'ak': hashlib.md5('_'.join((space_id, host)).encode()).hexdigest(),
            'device_type': '1100',
        },
    )
    r.raise_for_status()
    json_data = r.json()

    formats = _parse_formats(json_data)

    return {
        'id': display_id,
        'title': title,
        'description': description,
        'thumbnail': thumbnail,
        'formats': formats,
    }

Example of Yahoo! Japan News article embedded video extraction.

Note that in yahoojnews_extract() I include header values that match the ones in the request made from our page, to avoid suspicion. Once I have the JSON data, I pass it to _parse_formats to extract the video data (URLs, fps, etc.) and return it along with other information such as the title.
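To see how _VALID_URL separates headline pages from article pages, here is a small standalone sketch. split_url is a hypothetical helper that mirrors the first few lines of yahoojnews_extract(), and both URLs are illustrative; in particular, the query string of the article-style URL is an assumption.

```python
import re

_VALID_URL = r'https?://(?P<host>(?:news|headlines)\.yahoo\.co\.jp)[^\d]*(?P<id>\d[\d-]*\d)?'

def split_url(url):
    """Return (host, display_id); display_id falls back to the host."""
    mobj = re.match(_VALID_URL, url)
    if not mobj:
        raise ValueError('Invalid url %s' % url)
    host = mobj.group('host')
    return host, mobj.group('id') or host

# No digits after the host: treated as a headline page.
print(split_url('https://news.yahoo.co.jp'))
# ('news.yahoo.co.jp', 'news.yahoo.co.jp')

# Digits and hyphens after the host: treated as an article page.
print(split_url('https://headlines.yahoo.co.jp/hl?a=20190721-00000001-example-ent'))
# ('headlines.yahoo.co.jp', '20190721-00000001')
```

When the id group is empty, display_id equals the host, which is exactly the condition the extractor uses to hand the page off as a playlist.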

View the extraction code I wrote for youtube-dl here or download and watch it in action:

$ pip install youtube-dl && youtube-dl https://news.yahoo.co.jp

Thanks for reading!
