How to extract an embedded video in your Python code

Learn how I extracted an embedded video from Yahoo and put it into code.

Below, I show you how I extracted the embedded video source, along with a code example in Python. All right, pick a Yahoo article, and let’s dig!

HTML

As many embedded video extraction tutorials point out, finding the embedded video source through a browser’s web inspector is relatively easy. However, in order to build an efficient extractor that covers all Yahoo! Japan News articles, we need a clear, reproducible path from the original URL to the video source.

First, the obvious: I access an article and search for video-related extensions like “mp4” and “m3u8” in the source code.

Article source code

Nada. Oftentimes, the word “player” is associated with videos, so let’s see if that works.

Article source code

Aha! An external JavaScript file, _embed.js_. Note the _contentid_ and _spaceid_ values in the parameters sent to _embed.js_. They look useful. Now, let’s check what’s inside.
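Here is a minimal sketch of pulling those two values out of the page with regular expressions. The sample HTML is hypothetical: the script URL is a placeholder and the real tag may carry other attributes. Only the yvpub-player class name and the contentid/spaceid parameters come from this walkthrough.

```python
import re

# Hypothetical snippet modeled on the script tag found in the article source.
# The src host and path are placeholders, not the real embed.js location.
SAMPLE_HTML = (
    '<script class="yvpub-player" '
    'src="https://example.invalid/embed.js?contentid=1602163&spaceid=2078710353">'
    '</script>'
)


def find_player_params(html):
    """Return (contentid, spaceid) from the yvpub-player script tag, or None."""
    content = re.search(
        r'<script[^>]+class=["\']yvpub-player["\'][^>]+contentid=([^&"\']+)', html)
    space = re.search(
        r'<script[^>]+class=["\']yvpub-player["\'][^>]+spaceid=([^&"\']+)', html)
    if not content or not space:
        return None
    return content.group(1), space.group(1)


print(find_player_params(SAMPLE_HTML))
```

Returning None when either value is missing makes it easy to tell a page without an embedded player apart from one that has it.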

embed.js

The code seems to reference another script, player.js, and includes a parameter: the current UNIX timestamp converted to hours.
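For reference, a timestamp in hours can be computed like this in Python. This is only a sketch: whether embed.js floors or rounds the value is an assumption (floor division is used here), and the function name is made up for illustration.

```python
import time


def unix_hours(now=None):
    """Return a UNIX timestamp (seconds) converted to whole hours.

    Uses the current time when `now` is not given. Floor division is an
    assumption about how the original script rounds.
    """
    seconds = time.time() if now is None else now
    return int(seconds // 3600)


print(unix_hours())      # current time, in hours since the epoch
print(unix_hours(7200))  # two hours after the epoch
```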

Network Inspection

Let’s take a look inside player.js with Google Chrome DevTools.

player.js

player.js is huge and scary-looking, and it doesn’t contain any useful mp4 or [m3u8](https://www.lifewire.com/m3u8-file-2621956) URLs either.

Okay, let’s work backwards and search for mp4 in requests that were made when we loaded the page (you may want to reload the page).

JSON response from https://feapi-yvpub.yahooapis.jp/v1/content/

Bingo! A JSON response with our m3u8 and mp4 sources. This response comes from a request we made to https://feapi-yvpub.yahooapis.jp/v1/content/ with several query parameters, including appid, space_id, and ak (the full list is in the recap below).

What’s the 1602163 in …/v1/content/1602163? That’s the value of contentid we noted earlier. And space_id here matches our spaceid. Nice. What about appid? Let’s see if its value is mentioned in any other responses.

player.script.js

There it is, hard-coded in player.script.js! A quick look at other news articles confirms that this value is used for all embedded video requests. Three values down. Now, let’s search for ak’s value.

Unfortunately, the ak value is nowhere else to be found but in this JSON query. Where is ak coming from?
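Before chasing ak, it can help to see how the pieces identified so far sit in the request URL. The sketch below splits a captured request into its content ID and query parameters. The URL is shortened for illustration: the path, space_id, domain, output, and device_type values come from this walkthrough, while appid and ak are left out.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened request URL for illustration; appid and ak are omitted.
REQUEST_URL = (
    'https://feapi-yvpub.yahooapis.jp/v1/content/1602163'
    '?output=json&space_id=2078710353&domain=headlines.yahoo.co.jp&device_type=1100'
)


def split_request(url):
    """Return (content_id, params) for a captured content API request."""
    parts = urlsplit(url)
    # The content ID is the last path segment.
    content_id = parts.path.rstrip('/').rsplit('/', 1)[-1]
    # parse_qs returns a list per key; keep the first value of each.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return content_id, params


content_id, params = split_request(REQUEST_URL)
print(content_id)
print(params)
```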

Breakpoints

I have a hunch that ak might be a JavaScript object property name. Let’s search for “ak:”.

_player.js_

player.script.js

Aha! In both player.js and player.script.js, ak looks like two strings concatenated with “_” between them. In player.js, we also see it passed to the function k.md5().

My guess is that this concatenated string value is converted to an md5 hash value. But first we must figure out the values being concatenated to “_”.

Let’s open player.js in the Sources tab (right click inside Response body and choose Open in Sources panel) and add a breakpoint after ak is defined.

player.js

Next, let’s close our open file tabs under Sources and reload the page.

Odd. It doesn’t seem to hit that line. Let’s try the same thing in player.script.js.

player.script.js

(Make sure to click the Pretty Print brackets on the bottom left after reloading the page.)

player.script.js

Bingo! i appears to be the same value as the spaceid that we noted earlier, and r is “headlines.yahoo.co.jp”, our host name, so we now have the string value “2078710353_headlines.yahoo.co.jp”.

It looks nothing like the long, cryptic ak value in the request query, but remember that k.md5 function call in player.js? Let’s check its md5 hash value.

Would you look at that! It’s the same value as ak in the JSON request query. Nailed it.
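The same check can be reproduced in Python with hashlib. This is a minimal sketch with a hypothetical helper name, make_ak. If the observation above holds, the printed digest should equal the ak value seen in the request query.

```python
import hashlib


def make_ak(space_id, host):
    """Return the md5 hex digest of '<space_id>_<host>'."""
    raw = '{}_{}'.format(space_id, host)
    return hashlib.md5(raw.encode('utf-8')).hexdigest()


ak = make_ak('2078710353', 'headlines.yahoo.co.jp')
print(ak)  # compare this with the ak value in the request query
```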

Recap

Recall that the JSON request included the following parameters, minus the thumb values.

  • appid: dj0zaiZpPVZMTV…jcmV0Jng9YjU-

  • output: json

  • space_id: 2078710353

  • domain: headlines.yahoo.co.jp

  • ak: 40e90ec7a4ffb34260fcbb9497778731

  • device_type: 1100

We now have all the unique values, and more importantly their sources, that are required to make a JSON request programmatically in our code for any article. One last thing on the parameters: is device_type necessary? Let’s make a request without it.

No video data. Apparently it is, so we’ll keep it.
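The same experiment can be run for every parameter by dropping one at a time. The sketch below is an assumption-laden illustration: find_required_params and fake_fetch_ok are hypothetical names, and the fake fetch function only stands in for a real request so the logic can be tried offline. It says nothing about how the server really behaves.

```python
def find_required_params(params, fetch_ok):
    """Return the keys whose removal makes fetch_ok(params) return False.

    fetch_ok is any callable that sends the request with the given
    parameters and reports whether video data came back.
    """
    required = []
    for key in params:
        trimmed = {k: v for k, v in params.items() if k != key}
        if not fetch_ok(trimmed):
            required.append(key)
    return required


def fake_fetch_ok(params):
    # Stand-in for a real request (hypothetical behavior): pretend the
    # server only insists on ak and device_type.
    return 'ak' in params and 'device_type' in params


sample = {'output': 'json', 'ak': 'placeholder', 'device_type': '1100'}
print(find_required_params(sample, fake_fetch_ok))
```

In real use, fetch_ok would wrap the actual HTTP request and check the response for video entries.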

Here’s a quick review of how we got to our video data:

  • Article URL (extracted host name)
  • Article HTML source code (extracted contentid and spaceid)
  • md5 hash generator (ran on spaceid + “_” + host to get value of ak)
  • player.script.js (extracted appid)
  • Request to https://feapi-yvpub.yahooapis.jp/v1/content/{contentid} with our contentid, appid, spaceid, host (domain), and ak values.

Code

What might this extraction process look like in code? Here’s a rough example in Python:

import hashlib
import re
import requests


_VALID_URL = r'https?://(?P<host>(?:news|headlines)\.yahoo\.co\.jp)[^\d]*(?P<id>\d[\d-]*\d)?'

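# Helper functions used below (_download_webpage, _search_title,
# _search_description, _search_thumbnail, _search_regex, _playlist_result,
# _parse_formats) are omitted here.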

def yahoojnews_extract(url):
    mobj = re.match(_VALID_URL, url)
    if not mobj:
        raise ValueError('Invalid url %s' % url)
    host = mobj.group('host')
    display_id = mobj.group('id') or host
    webpage = _download_webpage(url)

    title = _search_title(webpage)

    if display_id == host:
        # Headline page (w/ multiple BC playlists) ('news.yahoo.co.jp', 'headlines.yahoo.co.jp/videonews/', ...)
        return _playlist_result(webpage)

    # Article page
    description = _search_description(webpage)
    thumbnail = _search_thumbnail(webpage)

    space_id = _search_regex([
            r'<script[^>]+class=["\']yvpub-player["\'][^>]+spaceid=([^&"\']+)',
            r'YAHOO\.JP\.srch\.\w+link\.onLoad[^;]+spaceID["\' ]*:["\' ]+([^"\']+)',
            r'<!--\s+SpaceID=(\d+)'
        ], webpage, 'spaceid')
    content_id = re.search(
        r'<script[^>]+class=(["\'])yvpub-player\1[^>]+contentid=(?P<contentid>[^&"\']+)',
        webpage,
    ).group('contentid')

    r = requests.get(
        'https://feapi-yvpub.yahooapis.jp/v1/content/%s' % content_id,
        headers={
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Origin': 'https://s.yimg.jp',
            'Host': 'feapi-yvpub.yahooapis.jp',
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
            'Referer': 'https://s.yimg.jp/images/yvpub/player/vamos/pc/latest/player.html',
        },
        params={
            'appid': 'gj0zaiZpPVZMTVFJR0F...VycbVjcmV0jng9Yju-',
            'output': 'json',
            'space_id': space_id,
            'domain': host,
            'ak': hashlib.md5('_'.join((space_id, host)).encode()).hexdigest(),
            'device_type': '1100',
        },
    )
    r.raise_for_status()
    json_data = r.json()

    formats = _parse_formats(json_data)

    return {
        'id': display_id,
        'title': title,
        'description': description,
        'thumbnail': thumbnail,
        'formats': formats,
    }

Example of Yahoo! Japan News article embedded video extraction.

Note that in yahoojnews_extract() I include header values that match the ones in the request made from our page, so the request doesn’t look suspicious. Once I have the JSON data, I pass it to _parse_formats to extract the video data (URLs, fps, etc.) and return it along with other information such as the title.
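The body of _parse_formats is not shown above, and the exact layout of the JSON response is not documented here. As a rough stand-in, the sketch below walks any nested JSON structure and collects string values that end in .m3u8 or .mp4. The function name find_media_urls and the sample data are made up for illustration and do not reflect the real response shape.

```python
def find_media_urls(node, extensions=('.m3u8', '.mp4')):
    """Recursively collect string values that look like media URLs."""
    found = []
    if isinstance(node, dict):
        for value in node.values():
            found.extend(find_media_urls(value, extensions))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_media_urls(item, extensions))
    elif isinstance(node, str):
        # Ignore any query string when checking the file extension.
        path = node.split('?', 1)[0]
        if path.lower().endswith(extensions):
            found.append(node)
    return found


# Made-up response shape, for illustration only.
sample_response = {
    'items': [
        {'video': {'url': 'https://example.invalid/a/master.m3u8?x=1', 'fps': 30}},
        {'video': {'url': 'https://example.invalid/a/clip.mp4'}},
        {'thumb': 'https://example.invalid/a/thumb.jpg'},
    ]
}
print(find_media_urls(sample_response))
```

A real parser would read known keys to pick up details such as bitrate and resolution, but a generic walk like this is a quick way to confirm that a response contains playable sources.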

View the extraction code I wrote for youtube-dl here or download and watch it in action:

$ pip install youtube-dl && youtube-dl https://news.yahoo.co.jp

Thanks for reading!
