Below, I show you how I extracted the embedded video source, along with a code example in Python. All right, pick a Yahoo article, and let’s dig!
As many embedded video extraction tutorials point out, finding the embedded video source through a browser’s web inspector is relatively easy. However, in order to build an efficient extractor that covers all Yahoo! Japan News articles, we need a clear, reproducible path from the original URL to the video source.
First, the obvious: I access an article and search for video-related extensions like “mp4” and “m3u8” in the source code.
Article source code
Nada. Oftentimes, the word “player” is associated with videos, so let’s see if that works.
Article source code
Aha! An external JavaScript file, _embed.js_. Note the _contentid_ and _spaceid_ values in the parameters sent to _embed.js_. They look useful. Now, let’s check what’s inside.
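To pull those two values out in code rather than by eye, one option is to parse the query string of the script tag's src attribute. This is only a sketch: the sample markup, the example.invalid URL, and the helper name `embed_params` are made up for illustration, and the real tag may be shaped differently.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Hypothetical markup modeled on the article source; the real src path may differ.
SAMPLE_HTML = (
    '<script class="yvpub-player" type="text/javascript" '
    'src="https://example.invalid/embed.js?contentid=1602163&amp;spaceid=2078710353">'
    '</script>'
)


def embed_params(html):
    """Return the query parameters of the first embed.js script tag found."""
    match = re.search(r'<script[^>]+src=["\']([^"\']*embed\.js\?[^"\']*)["\']', html)
    if match is None:
        return {}
    # Attribute values in HTML often escape "&" as "&amp;".
    src = match.group(1).replace("&amp;", "&")
    query = urlsplit(src).query
    # parse_qs maps each key to a list of values; keep the first of each.
    return {key: values[0] for key, values in parse_qs(query).items()}


params = embed_params(SAMPLE_HTML)
print(params.get("contentid"), params.get("spaceid"))
```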
embed.js
The code seems to reference another script, player.js, and includes a parameter: the current UNIX timestamp converted to hours.
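For reference, a UNIX timestamp in hours is just the epoch seconds divided by 3600, so the value only changes once an hour. The sketch below floors the result; whether embed.js floors or rounds is an assumption on my part, and `unix_hours` is a hypothetical helper name.

```python
import time


def unix_hours(now=None):
    """Whole hours elapsed since the UNIX epoch (floored; rounding mode is assumed)."""
    seconds = time.time() if now is None else now
    return int(seconds // 3600)


print(unix_hours())
```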
Let’s take a look inside player.js with Google Chrome DevTools.
player.js
player.js is huge and scary looking, and it doesn’t contain any useful mp4 or [m3u8](https://www.lifewire.com/m3u8-file-2621956) URLs either.
Okay, let’s work backwards and search for mp4 in the requests that were made when we loaded the page (you may need to reload the page with the Network tab open).
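The same search can be scripted against a HAR export of the Network tab. Here is a minimal sketch, assuming the export includes response bodies as plain text (some bodies are stored base64-encoded, which this does not handle). The function name `find_in_har` and the sample data are illustrative only.

```python
import json


def find_in_har(har, needle):
    """Return the URLs of requests whose response body contains `needle`."""
    hits = []
    for entry in har.get("log", {}).get("entries", []):
        body = entry.get("response", {}).get("content", {}).get("text") or ""
        if needle in body:
            hits.append(entry["request"]["url"])
    return hits


# Tiny stand-in for a real export; normally you would json.load() the saved file.
sample_har = {
    "log": {
        "entries": [
            {
                "request": {"url": "https://example.invalid/player.js"},
                "response": {"content": {"text": "var a = 1;"}},
            },
            {
                "request": {"url": "https://example.invalid/content"},
                "response": {
                    "content": {
                        "text": json.dumps({"url": "https://example.invalid/video.mp4"})
                    }
                },
            },
        ]
    }
}

print(find_in_har(sample_har, "mp4"))
```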
JSON response from https://feapi-yvpub.yahooapis.jp/v1/content/
Bingo! A JSON response with our m3u8 and mp4 sources. This response is generated by a request we made to https://feapi-yvpub.yahooapis.jp/v1/content/ with a set of query parameters: appid, output, space_id, domain, ak, and device_type, plus some thumb values.
What’s the 1602163 in …/v1/content/1602163? That’s the value of contentid we noted earlier. And space_id here matches our spaceid. Nice. What about appid? Let’s see if its value is mentioned in any other responses.
player.script.js
There it is, hard coded in player.script.js! And a quick look at other news articles confirms that this value is used for all embedded video requests. Three values down. Now, let’s search for ak’s value.
Unfortunately, the ak value is nowhere to be found except in this JSON request query. Where is ak coming from?
I have a hunch that ak might be a JavaScript object property name. Let’s search for “ak:”.
_player.js_
player.script.js
Aha! It looks like ak is a concatenation of two strings joined by “_” in both player.js and player.script.js. In player.js, we also see it passed to the function k.md5().
My guess is that this concatenated string value is converted to an md5 hash value. But first we must figure out the values being concatenated to “_”.
Let’s open player.js in the Sources tab (right-click inside the Response body and choose Open in Sources panel) and add a breakpoint after ak is defined.
player.js
Next, let’s close our open file tabs under Sources and reload the page.
Odd. It doesn’t seem to hit that line. Let’s try the same thing in player.script.js.
player.script.js
(Make sure to click the Pretty Print brackets on the bottom left after reloading the page.)
player.script.js
Bingo! i appears to be the same value as the spaceid that we noted earlier, and r is “headlines.yahoo.co.jp”, our host name, so we now have the string value “2078710353_headlines.yahoo.co.jp”.
It looks nothing like the long, cryptic ak value in the request query, but remember that k.md5 function call in player.js? Let’s check its md5 hash value.
Would you look at that! It’s the same value as ak in the JSON request query. Nailed it.
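The same check takes a few lines with Python's hashlib. A small sketch: `make_ak` is a hypothetical helper name, and the separator and argument order follow what the debugger showed (spaceid, then “_”, then host).

```python
import hashlib


def make_ak(space_id, host):
    """Return the md5 hex digest of '<space_id>_<host>'."""
    return hashlib.md5(f"{space_id}_{host}".encode()).hexdigest()


ak = make_ak("2078710353", "headlines.yahoo.co.jp")
print(ak)  # compare this against the ak value in the request query
```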
Recall that the JSON request included the following parameters, minus the thumb values:
appid: dj0zaiZpPVZMTV…jcmV0Jng9YjU-
output: json
space_id: 2078710353
domain: headlines.yahoo.co.jp
ak: 40e90ec7a4ffb34260fcbb9497778731
device_type: 1100
We now have all the unique values (and, more importantly, their sources) that are required to make a JSON request programmatically in our code for any article. One last thing on the parameters: is device_type necessary? Let’s make a request without it.
No video data. Apparently it is, so we’ll keep it.
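To run this kind of check from code, one approach is to build the query once and drop one key at a time. The sketch below sends nothing over the network; `build_params` and `without` are hypothetical helper names, and APPID_PLACEHOLDER stands in for the value hard coded in player.script.js.

```python
def build_params(appid, space_id, host, ak):
    """Assemble the query parameters seen in the browser's request."""
    return {
        "appid": appid,
        "output": "json",
        "space_id": space_id,
        "domain": host,
        "ak": ak,
        "device_type": "1100",
    }


def without(params, key):
    """Return a copy of `params` with one key removed."""
    return {k: v for k, v in params.items() if k != key}


full = build_params(
    "APPID_PLACEHOLDER",  # not a real appid
    "2078710353",
    "headlines.yahoo.co.jp",
    "40e90ec7a4ffb34260fcbb9497778731",
)
trimmed = without(full, "device_type")
print(sorted(trimmed))
```

Passing `full` and then `trimmed` as the `params` argument of `requests.get` and comparing the two responses would reproduce the experiment for any parameter.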
Quick review on how we got to our video data.
- Article source code (extracted contentid and spaceid)
- md5 hash generator (ran on spaceid + “_” + host to get the value of ak)
- player.script.js (extracted appid)
- JSON request with the contentid, appid, spaceid, and ak values.
What might this extraction process look like in code? Here’s a rough example in Python:
    import hashlib
    import re

    import requests

    _VALID_URL = r'https?://(?P<host>(?:news|headlines)\.yahoo\.co\.jp)[^\d]*(?P<id>\d[\d-]*\d)?'

    # More functions here...


    def yahoojnews_extract(url):
        mobj = re.match(_VALID_URL, url)
        if not mobj:
            raise ValueError('Invalid url %s' % url)
        host = mobj.group('host')
        # Fall back to the host when the URL carries no article ID
        display_id = mobj.group('id') or host

        webpage = _download_webpage(url)
        title = _search_title(webpage)

        if display_id == host:
            # Headline page (w/ multiple BC playlists) ('news.yahoo.co.jp', 'headlines.yahoo.co.jp/videonews/', ...)
            return _playlist_result(webpage)

        # Article page
        description = _search_description(webpage)
        thumbnail = _search_thumbnail(webpage)

        space_id = _search_regex([
            r'<script[^>]+class=["\']yvpub-player["\'][^>]+spaceid=([^&"\']+)',
            r'YAHOO\.JP\.srch\.\w+link\.onLoad[^;]+spaceID["\' ]*:["\' ]+([^"\']+)',
            r'<!--\s+SpaceID=(\d+)'
        ], webpage, 'spaceid')

        content_id_mobj = re.search(
            r'<script[^>]+class=(["\'])yvpub-player\1[^>]+contentid=(?P<contentid>[^&"\']+)',
            webpage,
        )
        if not content_id_mobj:
            # Without this check, a page with no embedded player raises AttributeError
            raise ValueError('No embedded video found in %s' % url)
        content_id = content_id_mobj.group('contentid')

        r = requests.get(
            'https://feapi-yvpub.yahooapis.jp/v1/content/%s' % content_id,
            # Mirror the headers the browser sent with the original request
            headers={
                'Accept': 'application/json, text/javascript, */*; q=0.01',
                'Origin': 'https://s.yimg.jp',
                'Host': 'feapi-yvpub.yahooapis.jp',
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
                'Referer': 'https://s.yimg.jp/images/yvpub/player/vamos/pc/latest/player.html',
            },
            params={
                'appid': 'gj0zaiZpPVZMTVFJR0F...VycbVjcmV0jng9Yju-',
                'output': 'json',
                'space_id': space_id,
                'domain': host,
                # ak is the md5 hash of '<spaceid>_<host>'
                'ak': hashlib.md5('_'.join((space_id, host)).encode()).hexdigest(),
                'device_type': '1100',
            },
        )
        r.raise_for_status()
        json_data = r.json()

        formats = _parse_formats(json_data)

        return {
            'id': display_id,
            'title': title,
            'description': description,
            'thumbnail': thumbnail,
            'formats': formats,
        }
Example of Yahoo! Japan News article embedded video extraction.
Note that in yahoojnews_extract() I send request headers that match the ones in the request made from our page, to avoid suspicion. Once I have the JSON data, I pass it to _parse_formats to extract the video data (URLs, fps, etc.) and return it along with other information such as the title.
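The body of _parse_formats isn't shown above, and the exact layout of the JSON response isn't reproduced in this article, so what follows is not that function. It is a schema-agnostic sketch that walks whatever structure comes back and collects string values that look like mp4 or m3u8 URLs. The name `collect_media_urls` and the sample payload are invented for illustration.

```python
from urllib.parse import urlsplit

MEDIA_EXTENSIONS = (".mp4", ".m3u8")


def collect_media_urls(node):
    """Recursively gather URL strings whose path ends in a media extension."""
    found = []
    if isinstance(node, dict):
        for value in node.values():
            found.extend(collect_media_urls(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_media_urls(item))
    elif isinstance(node, str) and node.startswith(("http://", "https://")):
        # Check the path only, so query strings such as "?sig=1" don't get in the way.
        if urlsplit(node).path.lower().endswith(MEDIA_EXTENSIONS):
            found.append(node)
    return found


# Made-up payload; the real response has its own keys and nesting.
sample = {
    "items": [
        {"Url": "https://example.invalid/v/low.mp4?sig=1"},
        {"Url": "https://example.invalid/v/master.m3u8"},
    ],
    "title": "not a url",
}

print(collect_media_urls(sample))
```

A real format parser would also read per-format metadata (bitrate, resolution, fps) from the fields next to each URL, which requires knowing the actual response schema.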
View the extraction code I wrote for youtube-dl here or download and watch it in action:
$ pip install youtube-dl && youtube-dl https://news.yahoo.co.jp
Thanks for reading!