Learn how I extracted an embedded video from Yahoo and put it into code.
As many embedded video extraction tutorials point out, finding the embedded video source through a browser’s web inspector is relatively easy. However, in order to build an efficient extractor that covers all Yahoo! Japan News articles, we need a clear, reproducible path from the original URL to the video source.
First, the obvious: I access an article and search for video-related extensions like “mp4” and “m3u8” in the source code.
Article source code
Nada. Oftentimes, the word “player” is associated with videos, so let’s see if that works.
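If you would rather not eyeball the page source, the keyword searches above can be sketched in a few lines of Python. This is only an illustration: `find_keywords` is a helper name I made up, and the sample HTML is a stand-in, not the real article markup.

```python
def find_keywords(html, keywords=("mp4", "m3u8", "player")):
    """Return, per keyword, the (line number, line) pairs that mention it."""
    hits = {kw: [] for kw in keywords}
    for number, line in enumerate(html.splitlines(), start=1):
        lowered = line.lower()
        for kw in keywords:
            if kw in lowered:
                hits[kw].append((number, line.strip()))
    return hits


# Hypothetical page source, used only to show the shape of the result.
sample = """<html>
<body>
<script src="https://example.invalid/embed.js?spaceid=0"></script>
<div id="player"></div>
</body>
</html>"""

hits = find_keywords(sample)
for keyword, lines in hits.items():
    print(keyword, lines)
```

An empty list for a keyword is the scripted version of “Nada.”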
Article source code
The search turns up a script, _embed.js_. Note the _contentid_ and _spaceid_ values in the parameters sent to _embed.js_. They look useful. Now, let’s check what’s inside.
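Here is a minimal sketch of pulling those values out of the script URL in code. I am assuming they arrive as ordinary query-string parameters named `contentid` and `spaceid`. The URL below is a made-up placeholder (only the content ID 1602163 comes from this walkthrough), and `embed_params` is a hypothetical helper.

```python
from urllib.parse import urlparse, parse_qs


def embed_params(script_src):
    """Pick contentid and spaceid out of the embed script's query string."""
    query = parse_qs(urlparse(script_src).query)
    wanted = ("contentid", "spaceid")
    return {name: query[name][0] for name in wanted if name in query}


# Placeholder URL: the host and the spaceid value are invented for illustration.
src = "https://example.invalid/embed.js?contentid=1602163&spaceid=0000000000"
params = embed_params(src)
print(params)
```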
The code seems to reference another script, player.js, and includes a parameter: the current UNIX timestamp converted to hours.
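For reference, a UNIX timestamp in hours is just the seconds count divided by 3600. The sketch below assumes the value is rounded down. I have not confirmed how the page's script rounds, and `unix_hours` is my own helper name.

```python
import time


def unix_hours(now=None):
    """Whole hours since the UNIX epoch (assumes the value is floored)."""
    seconds = time.time() if now is None else now
    return int(seconds // 3600)


print(unix_hours())      # current hour count
print(unix_hours(7200))  # two hours after the epoch
```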
Let’s take a look inside player.js with Google Chrome DevTools.
player.js is huge and scary looking and doesn’t contain any useful [m3u8](https://www.lifewire.com/m3u8-file-2621956) URLs either.
Okay, let’s work backwards and search for mp4 in the requests that were made when we loaded the page (you may want to reload the page).
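The same search can be run offline if you export the Network tab as a HAR file, which is plain JSON. The sketch below walks the standard `log.entries` structure and lists requests whose response body mentions a string. `find_in_har` is a hypothetical helper and the sample data is invented. Bodies stored as base64 are not decoded here.

```python
def find_in_har(har, needle):
    """Return URLs of requests whose response body text contains needle."""
    urls = []
    for entry in har.get("log", {}).get("entries", []):
        body = entry.get("response", {}).get("content", {}).get("text") or ""
        if needle in body:
            urls.append(entry["request"]["url"])
    return urls


# Invented sample; a real export would come from json.load() on the .har file.
sample_har = {
    "log": {
        "entries": [
            {
                "request": {"url": "https://example.invalid/player.js"},
                "response": {"content": {"text": "var player = {};"}},
            },
            {
                "request": {"url": "https://feapi-yvpub.yahooapis.jp/v1/content/1602163"},
                "response": {"content": {"text": '{"url": "https://example.invalid/video.mp4"}'}},
            },
        ]
    }
}

print(find_in_har(sample_har, "mp4"))
```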
JSON response from https://feapi-yvpub.yahooapis.jp/v1/content/
A JSON response with our mp4 sources. This response is generated by a request we made to https://feapi-yvpub.yahooapis.jp/v1/content/ with the following parameters:
1602163 in …/v1/content/1602163? That’s the value of contentid we noted earlier. And space_id here matches our spaceid. Nice. What about appid? Let’s see if its value is mentioned in any other responses.
There it is, hard coded in player.script.js! And a quick look at other news articles confirms that this value is used for all embedded video requests. Three values down. Now, let’s search for ak.
The ak value is nowhere else to be found but in this JSON query. Where is ak coming from?
I have a hunch that
Aha! It looks like ak is a concatenation of “_” and two strings in player.script.js. We also see it in player.js, passed to the function k.md5.
My guess is that this concatenated string value is converted to an md5 hash value. But first we must figure out the values being concatenated to “_”.
Let’s open player.js in the Sources tab (right click inside the Response body and choose Open in Sources panel) and add a breakpoint after ak is defined.
Next, let’s close our open file tabs under Sources and reload the page.
Odd. It doesn’t seem to hit that line. Let’s try the same thing in
(Make sure to click the Pretty Print brackets on the bottom left after reloading the page.)
i appears to be the same value as the spaceid that we noted earlier, and r is “headlines.yahoo.co.jp,” our host name, so we now have the string value: the spaceid, “_”, and the host name joined together.
It looks nothing like the long, cryptic ak value in the request query, but remember that k.md5 function call in player.js? Let’s check its md5 hash value.
Would you look at that! It’s the same value as ak in the JSON request query. Nailed it.
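In Python, the same check takes a couple of lines with the standard `hashlib` module. This is a sketch of the rule we just worked out (md5 of spaceid + “_” + host). `make_ak` is my own name for it, and the spaceid below is a placeholder rather than a real value.

```python
import hashlib


def make_ak(spaceid, host):
    """md5 hex digest of '<spaceid>_<host>', the pattern observed above."""
    raw = f"{spaceid}_{host}"
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


# Placeholder spaceid; the host name is the one seen in the debugger.
ak = make_ak("0000000000", "headlines.yahoo.co.jp")
print(ak)
```

The result is a 32-character hex string, which is the shape of the ak value in the request query.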
Recall that the JSON request included the following parameters, minus the thumb values.
We now have all the unique values (and, more importantly, their sources) that are required to make a JSON request programmatically in our code for any article. One last thing on the parameters: is device_type necessary? Let’s make a request without it.
No video data. Apparently it is, so we’ll keep it.
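Putting the parameters together, the request URL could be assembled roughly like this. It is a sketch under assumptions: the parameter names (appid, space_id, ak, device_type) are the ones seen in the request, but every value below except the content ID is a placeholder, the real request may carry other parameters, and `build_content_url` is a hypothetical helper that only builds the URL without sending anything.

```python
from urllib.parse import urlencode

API_BASE = "https://feapi-yvpub.yahooapis.jp/v1/content/"


def build_content_url(contentid, spaceid, appid, ak, device_type):
    """Assemble the content API URL from the values gathered so far."""
    query = urlencode({
        "appid": appid,
        "space_id": spaceid,
        "ak": ak,
        "device_type": device_type,  # dropping this returned no video data
    })
    return f"{API_BASE}{contentid}?{query}"


# All values except the content ID are placeholders for illustration.
url = build_content_url(
    contentid="1602163",
    spaceid="0000000000",
    appid="APPID_PLACEHOLDER",
    ak="AK_PLACEHOLDER",
    device_type="DEVICE_TYPE_PLACEHOLDER",
)
print(url)
```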
Quick review on how we got to our video data.
md5 hash generator (run on spaceid + “_” + host to get the value of ak)
What might this extraction process look like in code? Here’s a rough example in Python:
Example of Yahoo! Japan News article embedded video extraction.
Note in yahoojnews_extract() that I include header values in the request that match the actual values in the request made from our page to avoid suspicion. Once I have the JSON data, I pass it to _parse_formats to extract the video data (URLs, fps, etc.) and return it along with other information such as the title.
$ pip install youtube-dl && youtube-dl https://news.yahoo.co.jp
Thanks for reading!