Rupert Beatty

Preview Extractor for News, Articles and Full-texts in Swift

ReadabilityKit

Preview extractor for news, articles and full-texts in Swift


Features

Extracts:

  •  Title
  •  Description
  •  Top image
  •  Top video
  •  Keywords
  •  Date

Usage

let articleUrl = URL(string: "https://someurl.com/")!
Readability.parse(url: articleUrl, completion: { data in
    let title = data?.title
    let description = data?.description
    let keywords = data?.keywords
    let imageUrl = data?.topImage
    let videoUrl = data?.topVideo
    let datePublished = data?.datePublished
})

To run the example project, clone the repo, and run pod install from the Example directory first.

Installation

CocoaPods

pod 'ReadabilityKit'

Carthage

github "exyte/ReadabilityKit"

Manually

  1. Install Ji XML parser.
  2. Download and drop all files from the Sources folder into your project.

Development Environment Setup

  1. Install Carthage.
  2. Check out and build the project's dependencies:
carthage bootstrap --platform <name>

Requirements

  • iOS 10.0+ / macOS 10.12+ / tvOS 10.0+ / watchOS 3.0+

Download Details:

Author: Exyte
Source Code: https://github.com/exyte/ReadabilityKit 
License: MIT license

#swift #extract #preview 

Lawrence Lesch

PythonVSCode: This Extension is Now Maintained in The Microsoft fork

Python extension for Visual Studio Code

A Visual Studio Code extension with rich support for the Python language (for all actively supported versions of the language: 2.7, >=3.6), including features such as IntelliSense, linting, debugging, code navigation, code formatting, Jupyter notebook support, refactoring, variable explorer, test explorer, snippets, and more!

Quick start

Set up your environment

Select your Python interpreter by clicking on the status bar

Configure the debugger through the Debug Activity Bar

Configure tests by running the Configure Tests command

Jupyter Notebook quick start

Open or create a Jupyter Notebook file (.ipynb) and start coding in our Notebook Editor!


Useful commands

Open the Command Palette (Command+Shift+P on macOS and Ctrl+Shift+P on Windows/Linux) and type in one of the following commands:

  • Python: Select Interpreter - Switch between Python interpreters, versions, and environments.
  • Python: Start REPL - Start an interactive Python REPL using the selected interpreter in the VS Code terminal.
  • Python: Run Python File in Terminal - Runs the active Python file in the VS Code terminal. You can also run a Python file by right-clicking on the file and selecting Run Python File in Terminal.
  • Python: Select Linter - Switch from Pylint to Flake8 or other supported linters.
  • Format Document - Formats code using the provided formatter in the settings.json file.
  • Python: Configure Tests - Select a test framework and configure it to display the Test Explorer.

To see all available Python commands, open the Command Palette and type Python.

Feature details

Learn more about the rich features of the Python extension:

IntelliSense: Edit your code with auto-completion, code navigation, syntax checking and more

Linting: Get additional code analysis with Pylint, Flake8 and more

Code formatting: Format your code with black, autopep8, or yapf

Debugging: Debug your Python scripts, web apps, remote or multi-threaded processes

Testing: Run and debug tests through the Test Explorer with unittest, pytest or nose

Jupyter Notebooks: Create and edit Jupyter Notebooks, add and run code cells, render plots, visualize variables through the variable explorer, visualize dataframes with the data viewer, and more

Environments: Automatically activate and switch between virtualenv, venv, pipenv, conda and pyenv environments

Refactoring: Restructure your Python code with variable extraction, method extraction and import sorting

Supported locales

The extension is available in multiple languages: de, en, es, fa, fr, it, ja, ko-kr, nl, pl, pt-br, ru, tr, zh-cn, zh-tw

Questions, issues, feature requests, and contributions

  • If you have a question about how to accomplish something with the extension, please ask on Stack Overflow
  • If you come across a problem with the extension, please file an issue
  • Contributions are always welcome! Please see our contributing guide for more details
  • Any and all feedback is appreciated and welcome!
    • If someone has already filed an issue that encompasses your feedback, please leave a 👍/👎 reaction on the issue
    • Otherwise please file a new issue
  • If you're interested in the development of the extension, you can read about our development process

Data and telemetry

The Microsoft Python Extension for Visual Studio Code collects usage data and sends it to Microsoft to help improve our products and services. Read our privacy statement to learn more. This extension respects the telemetry.enableTelemetry setting which you can learn more about at https://code.visualstudio.com/docs/supporting/faq#_how-to-disable-telemetry-reporting.

Download Details:

Author: DonJayamanne
Source Code: https://github.com/DonJayamanne/pythonVSCode 
License: MIT license

#typescript #python #testing #editor #terminal #jupyter #extract 

Ruthie Bugala

What is Palindrome in Python? | Algorithms

In this Python article, let's learn what a palindrome is in Python, along with the relevant code and algorithms. A palindrome is a word, phrase, number, or another sequence of units that can be read the same way in either direction, with general allowances for adjustments to punctuation and word dividers. Palindromes can be numeric as well: when the digits of such a number are reversed, they turn out to be the same as the original number. For example, madam and 1234321. This blog will teach us how to create a Palindrome in Python.

If you want to dive further, check out this free course on Palindrome in Python and PG Programs on Software Engineering. It covers the fundamentals of python programming, such as its syntax, variables, data types, operators, tokens, and strings. This course also offers you a certificate on completion to help you stay ahead of the competition.

What is Palindrome?

A palindrome is a word, phrase, number, or another sequence of units that may be read the same way in either direction, with general allowances for punctuation and word dividers.

Happy belated multi-cultural palindrome day! 02/02/2020 was a unique day in February. It works whether your preferred date format is MM/DD/YYYY or DD/MM/YYYY or YYYY/MM/DD.

Palindrome example

These patterns are called palindromes. Reading them from the first character or backward doesn’t make any difference. This is an interesting introductory problem to solve with the use of programming. In this blog, we will understand the thought process, go step by step, and come up with various solutions to check whether the string is a palindrome.

A palindrome is a word, phrase, number, or another sequence of characters that reads the same backward as forward.

They are classified into three types: palindrome numbers, palindrome strings, and palindrome phrases (collections of words and special characters).

What is a Palindrome Number?

A Palindrome Number is a collection of numbers that remains the same when read backward. These numbers are also said to be symmetrical. When its digits are reversed, they turn out to be the same number as the original number. E.g., 1234321 is a Palindrome. If its digits are reversed, it again becomes 1234321, our original number. 1234232 is not a Palindrome. When reversed, the new number becomes 2324321, which is different from the original.

What is a Palindrome String?

A Palindrome String is a collection of alphabets that remains the same when read backward. They are also called Symmetrical Alphabets. When its alphabets are written in reverse order, they turn out to be the same combination of alphabets as the original string. E.g., “madam” is a Palindrome. If its alphabets are reversed, it again becomes “madam,” which was our original string. “napkin” is not a Palindrome. When reversed, the new string becomes “nikpan,” which is different from the original string.

What is the Palindrome Phrase?

A Palindrome Phrase is a collection of words and special characters that remains the same when read backward. These phrases are also said to be symmetrical. When the phrase is reversed, it turns out to be exactly the same phrase as the original one. For example, a1b2c33c2b1a is a Palindrome: if the phrase is reversed, it again becomes a1b2c33c2b1a, our original phrase. a4b523kg is not a Palindrome: when reversed, it becomes gk325b4a, which is different from the original phrase.


Palindrome Examples

Below are a few examples of Palindromes:

  • Mom
  • Madam
  • a2332a
  • Rubber
  • Dad
  • 123454321

Trivia: Is 02/02/2020 a palindrome string when considered a palindrome phrase?
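
One way to settle the trivia above is to test it directly. A minimal sketch (treating the date purely as a string, then as a phrase with the separators stripped):

date = "02/02/2020"
print(date == date[::-1])            # False: as a raw string, the '/' positions do not mirror
cleaned = date.replace("/", "")
print(cleaned == cleaned[::-1])      # True: "02022020" reads the same backward, so it is a palindrome phrase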

Palindrome in Python Algorithm

You can enroll in these Python-related courses to get comfortable in Python Programming Language and get your free certificate on Great Learning Academy before practicing Palindromes algorithm and code in Python.

Data Science with Python
Python for Machine Learning
Data Visualization using Python
Artificial Intelligence with Python

Now how to create Palindromes in Python?

Consider the algorithm for the Problem Statement: Find if a string is a Palindrome or not.

  1. Check whether the first-index and last-index characters are the same; if they are not the same, return False.
  2. Increment the first index and decrement the last index.
  3. Repeat steps 1 and 2 while first < last. If the loop ends without a mismatch (first >= last), return True (a compact sketch of these steps follows this list).
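
A compact sketch of the steps above (the article walks through its own full version in the code sections below):

def is_palindrome_two_pointer(s):
    first, last = 0, len(s) - 1
    while first < last:
        if s[first] != s[last]:   # step 1: a mismatch means it is not a palindrome
            return False
        first += 1                # step 2: move both indices toward the middle
        last -= 1
    return True                   # step 3: the indices crossed without any mismatch

print(is_palindrome_two_pointer("madam"))   # True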

Now let us consider an algorithm for the Problem Statement: Find if a number is a Palindrome or not.

  1. Copy the input number in another variable to compare them later.
  2. Next, we reverse the given number. To reverse the number, follow these steps:
    1. Isolate the last digit of a number. The modulo operator (%) returns the remainder of a division
    2. Append lastDigit to reverse. reverse = (reverse * 10) + lastDigit.
    3. Remove the last digit from the number: number = number // 10 (integer division).
    4. Iterate this process. while (number > 0)
  3. Now we compare the reversed number with the original number.
  4. If the numbers are the same, then the number is a palindrome; otherwise it is not (a compact sketch of these steps follows this list).
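
Before the article's full walkthrough below, here is a compact sketch of the number algorithm:

def is_palindrome_number(number):
    original = number                 # step 1: keep a copy for the final comparison
    reverse = 0
    while number > 0:                 # step 2: peel off digits until none remain
        last_digit = number % 10
        reverse = reverse * 10 + last_digit
        number = number // 10
    return original == reverse        # steps 3-4: palindrome if the reversed number matches

print(is_palindrome_number(1234321))  # True
print(is_palindrome_number(1234232))  # False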

Now that we have the algorithm, let us convert it into code by following a similar logic.

Palindrome in Python Code

Using While Loop (number)

number=int(input("Enter any number :"))
#store a copy of this number
temp=number
#calculate reverse of this number
reverse_num=0
while(number>0):
    #extract last digit of this number
    digit=number%10
    #append this digit to the reversed number
    reverse_num=reverse_num*10+digit
    #floor-divide to drop the last digit from the number
    number=number//10
#compare reverse to original number
if(temp==reverse_num):
    print("The number is palindrome!")
else:
    print("Not a palindrome!")

Using While Loop (string)

def check_palindrome(string):
    length = len(string)
    first = 0
    last = length - 1
    status = 1
    while(first < last):
        if(string[first] == string[last]):
            first = first + 1
            last = last - 1
        else:
            status = 0
            break
    return int(status)
string = input("Enter the string: ")
print("Method 1")
status= check_palindrome(string)
if(status):
    print("It is a palindrome ")
else:
    print("Sorry! Try again")

TEST THE CODE

Input – madam
Output – It is a palindrome

This is a good approach, but Python also lets us use slicing to reverse a string. We know that a word is a palindrome if it reads the same forwards and backwards. Hence, let us generate the forward and backward strings and check whether the two strings are the same.

Using Reverse Function

def check_palindrome_1(string):
    reversed_string = string[::-1]
    status=1
    if(string!=reversed_string):
        status=0
    return status


string = input("Enter the string: ")
status= check_palindrome_1(string)
if(status):
    print("It is a palindrome ")
else:
    print("Sorry! Try again")

TEST THE CODE

Input: Enter the string: malayalam
Output: It is a palindrome


Palindrome Program in Python

In this article, we will see different ways of implementing the palindrome program in Python

Palindrome String

Method 1:

  1. Finding the reverse of a string
  2. Checking if the reverse and original are the same or not
def isPalindrome(s):
	return s == s[::-1]

# Driver code
s = "kayak"
ans = isPalindrome(s)

if ans:
	print("Yes")

else:
	print("No")

Steps:  

  1. We create a function isPalindrome.
  2. It returns the result of comparing the string with its reverse, obtained by slicing (s[::-1]).
  3. In our driver code, we define a string.
  4. Finally, in the if-else condition, we print Yes if it is a palindrome and No otherwise.

Method 2:

  • Using iterative loop
def isPalindrome(s):

	# Compare each character with its mirror from the other end of the string
	for i in range(0, int(len(s)/2)):
		if s[i] != s[len(s)-i-1]:
			return False
	return True

# main function
s = "kayak"
ans = isPalindrome(s)

if (ans):
	print("Yes")

else:
	print("No")

Steps:  

  1. A loop runs from the start of the string to its midpoint.
  2. At each step, the character at position i is compared with the character at the mirrored position from the end (the first with the last, the second with the second-last, and so on).
  3. If any pair of characters does not match, the string is not a palindrome.

Method 3:

  • Using the in-built function to reverse a string
def isPalindrome(s):

	rev = ''.join(reversed(s))

	if (s == rev):
		return True
	return False

# main function
s = "kayak"
ans = isPalindrome(s)

if(ans):
	print("Yes")
else:
	print("No")

Steps:

In this method, we use the built-in reversed() function together with the string method ''.join() to build the reversed string, then compare it with the original.

Method 4:

  • Using recursion 
def isPalindrome(s):

	s = s.lower()
	l = len(s)

	# A string of length 0 or 1 is a palindrome
	if l < 2:
		return True

	# If the first and last characters match, recurse on the substring between them
	elif s[0] == s[l - 1]:
		return isPalindrome(s[1: l - 1])
	else:
		return False

s = "Kayak"
ans = isPalindrome(s)

if ans:
	print("Yes")
else:
	print("No")

Steps:

This method compares the first and last characters of the string and passes the remaining substring to a recursive call to itself.

Palindrome in a Linked List

Let’s step this up and consider another data structure. What if the data is stored in a linked list? To tackle this, we need to understand linked lists. A linked list is a data structure with a non-contiguous allocation of memory.

Linked List representation

We will begin by defining a linked list in Python:

class ListNode:
    def __init__(self, x):
        self.val = x
        self.next = None
        
class Solution:
    def __init__(self,seq):
        """prepends item of lists into linked list"""
        self.head = None
        for item in seq:
            node = ListNode(item)
            node.next = self.head
            self.head = node


    def palindrome(self):
        """ Check if linked list is palindrome and return True/False."""
        node = self.head
        var = node #var is initialized to head
        prev = None #initially, prev is None
    
        # prev approaches to middle of list till var reaches end or None 
        while var and var.next:
            var = var.next.next
            temp = node.next   #reverse elements of first half of list
            node.next = prev
            prev = node
            node = temp
    
        if var:  # in case of odd num elements
            tail = node.next
        else:    # in case of even num elements
            tail = node
    
        while prev:
            # compare reverse element and next half elements          
            if prev.val == tail.val:
                tail = tail.next
                prev = prev.next
            else:
                return False
        return True
# Test Cases
list_1 = Solution([7, 8, 6 ,  3 , 7 ,3 , 6, 8, 7])
print([7, 8, 6 ,  3 , 7 ,3 , 6, 8, 7],end='->')
print(list_1.palindrome())
list_2 = Solution([6 , 3 , 4, 6])
print([6 , 3 , 4, 6],end='->')
print(list_2.palindrome())
list_3 = Solution([3, 7 ,3 ])
print([ 3 , 7, 3],end='->')
print(list_3.palindrome())
list_4 = Solution([1])
print([1],end='->')
print( list_4.palindrome())

TEST THE CODE

Output –
7, 8, 6, 3, 7, 3, 6, 8, 7 – True
6, 3, 4, 6 – False
3, 7, 3 – True
1 – True

The logic for checking if a linked list is a palindrome or not is the modified version of the one we implemented on strings and arrays. We check if the reverse of the linked list is the same as the original sequence. Instead of reversing the entire linked list and storing it in a temporary location, we reverse the first half of the linked list and check if the first half and second half match after reversal.
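
For comparison, here is a simpler sketch that trades memory for clarity: it copies the node values into a Python list and reuses the reverse-and-compare check. It assumes the same ListNode and Solution classes defined above, and it uses O(n) extra space, unlike the in-place reversal of the first half.

def is_palindrome_linked_list(head):
    # Walk the list once, collecting values, then compare the list with its reverse
    values = []
    node = head
    while node:
        values.append(node.val)
        node = node.next
    return values == values[::-1]

print(is_palindrome_linked_list(Solution([6, 3, 4, 6]).head))   # False
print(is_palindrome_linked_list(Solution([3, 7, 3]).head))      # True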

Check out the A* Algorithm in Artificial Intelligence.

Therefore, we define a method called palindrome that works with the local variables node, var (short for variable), prev, and tail. The variable var jumps two nodes at a time toward the end of the list while the first half of the list is reversed into prev; comparing prev.val and tail.val in the final loop then gives us the answer.

# Test Cases
list_1 = Solution([7, 8, 6, 3, 7, 3, 6, 8, 7])
print(list_1.palindrome())
list_2 = Solution([6, 3, 4, 6])
print(list_2.palindrome())
list_3 = Solution([3, 7, 3])
print(list_3.palindrome())
list_4 = Solution([1])
print(list_4.palindrome())

In this article, we looked at palindromes inside and out and understood them thoroughly. Try developing better implementation techniques using different data structures to improve your command over coding. We shall keep posting many more articles on implementing data structures and algorithms using Python. Stay tuned, and read the Top Ten Python Books.


Original article source at: https://www.mygreatlearning.com

#python 

Dexter Goodwin

Webpack Extract Translation Keys Plugin

Webpack Extract Translation Keys Plugin

Webpack provides an official plugin for managing translations, i18n-webpack-plugin, but it only allows for build-time translations by replacing strings in the source code.

This plugin serves a similar purpose, but instead of replacing translation keys with actual string values it just collects the translation keys, letting you know exactly which translations are necessary for your client side.

An approach like this also allows you to serve dynamically generated translation bundles to the client, so you can get real-time updates to translations without regenerating the whole client-side bundle.

This plugin is also compatible with Webpack 5.

Usage

Configuration

First you need to install the plugin:

npm install --save-dev webpack-extract-translation-keys-plugin

And then include it in your configuration:

// webpack.config.js

var ExtractTranslationKeysPlugin = require('webpack-extract-translation-keys-plugin');
module.exports = {
    plugins: [
        new ExtractTranslationKeysPlugin({
            functionName: '_TR_',
            output: path.join(__dirname, 'dist', 'translation-keys.json')
        })
    ]

    // rest of your configuration...
}

Now inside your module you can write something like this:

console.log(_TR_('translation-key-1'));
console.log(_TR_('translation-key-2'));

If you run webpack now, you should get a dist/translation-keys.json file with the following content:

{
    "translation-key-1": "translation-key-1",
    "translation-key-2": "translation-key-2"
}

It may seem like a waste to output a map where the keys and values are the same thing, but the purpose is to keep the output format consistent with the case where the mangle option is enabled.

Output

If the output string contains [name], one output file will be created per entry key at the corresponding output path:

// ...
    plugins: [
        new ExtractTranslationKeysPlugin({
            output: path.join(__dirname, 'dist', '[name]/translation-keys.json')
        })
    ]
// ...

Key Mangling

In some applications translation keys are quite long, so for situations where you want to save some additional bytes in your application, you can enable mangling during the plugin initialization:

// ...
    plugins: [
        new ExtractTranslationKeysPlugin({
            mangle: true,
            functionName: '_TR_',
            output: path.join(__dirname, 'dist', 'translation-keys.json')
        })
    ]
// ...

This setting changes the behavior of the plugin to replace the key name with a minimal ascii-readable string.

In order to be able to map back to the original translation key, the plugin outputs mapping object with keys being original keys and the values being the mangled ones:

{ "translation-key-1": " ", "translation-key-2": "!" }

It's recommended to only enable mangling for production builds, as it makes the debugging harder and also may break hot reloading, depending on your setup.

Runtime

Since this plugin doesn't replace the function with something else, it's up to you to provide a function that will actually handle translation at runtime. It can be a globally defined function, or you can use webpack.ProvidePlugin inside your configuration:

module.exports = {
    plugins: [
        new ExtractTranslationKeysPlugin({
            functionName: '__',
            output: path.join(__dirname, 'dist', 'translation-keys.json')
        }),
        new webpack.ProvidePlugin({
            '__': 'path/to/module/with/translation/function.js'
        })
    ]

    // rest of your configuration...
}

Default options

  • functionName : __
  • done : function (result) {}
  • output : false
  • mangle : false

Error handling

The plugin throws an error if you try to call the translation function without any arguments or with a non-string argument (e.g. a variable).

Release Notes

6.0.0

Support for Webpack 5. If you are using Webpack 4, please install the 5.x.x version of this plugin. This can be done by running:

npm install --save-dev webpack-extract-translation-keys-plugin@5

5.0.0

Support for multiple output files for multiple entries via [name] inside the output string. If [name] is not present, only one output file will be created for all entries.

4.0.0

Support for Webpack 4. If you are using Webpack 3, please install the 3.x.x version of this plugin. This can be done by running:

npm install --save-dev webpack-extract-translation-keys-plugin@3

3.0.0

Support for Webpack 2. If you are using Webpack 1, please install the 2.x.x version of this plugin. This can be done by running:

npm install --save-dev webpack-extract-translation-keys-plugin@2

2.0.0

Support for key mangling. The format of the output without mangling has changed from an array to a map. If you want the old behavior, you can implement it using the done callback option.

Download Details:

Author: Grassator
Source Code: https://github.com/grassator/webpack-extract-translation-keys 
License: Apache License

#javascript #webpack #extract 

Gordon Taylor

i18n-extract: Manage Localization with Static analysis

i18n-extract

Manage localization with static analysis. 

Installation

npm install --save-dev i18n-extract

The problem solved

This module analyses code statically for key usages, such as i18n.t('some.key'), in order to:

  • Report keys that are missing
  • Report keys that are unused.
  • Report keys that are highly duplicated.

This module works well in conjunction with other i18n tooling.

Supported keys

  • static:
i18n('key.static')
  • string concatenation:
i18n('key.' + 'concat')
  • template string:
i18n(`key.template`)
  • dynamic:
i18n(`key.${dynamic}`)
  • comment:
/* i18n-extract key.comment */

API

extractFromCode(code, [options])

Parse the code to extract the argument of calls of i18n(key).

  • code should be a string.
  • Return an array containing keys used.

Example

import {extractFromCode} from 'i18n-extract';
const keys = extractFromCode("const followMe = i18n('b2b.follow');", {
  marker: 'i18n',
});
// keys = ['b2b.follow']

extractFromFiles(files, [options])

Parse the files to extract the argument of calls of i18n(key).

  • files can be either an array of strings or a string. You can also use a glob.
  • Return an array containing keys used in the source code.

Example

import {extractFromFiles} from 'i18n-extract';
const keys = extractFromFiles([
  '*.jsx',
  '*.js',
], {
  marker: 'i18n',
});

Options

  • marker: The name of the internationalized string marker function. Defaults to i18n.
  • keyLoc: An integer indicating the position of the key in the arguments. Defaults to 0. Negative numbers, e.g., -1, indicate a position relative to the end of the argument list.
  • parser: An enum indicating the parser to use; can be typescript or flow. Defaults to flow.
  • babelOptions: A Babel configuration object to allow applying custom transformations or plugins before scanning for i18n keys. Defaults to a config with all babylon plugins enabled.

findMissing(locale, keysUsed)

Report the missing keys. Those keys should probably be translated.

  • locale should be an object containing the translations.
  • keysUsed should be an array containing the keys used in the source code. It can be retrieved with extractFromFiles or extractFromCode.
  • Return a report.

Example

import {findMissing} from 'i18n-extract';
const missing = findMissing({
  key1: 'key 1',
}, ['key1', 'key2']);

/**
 * missing = [{
 *   type: 'MISSING',
 *   key: 'key2',
 * }];
 */

Plugins

findUnused(locale, keysUsed)

Report the unused keys. Those keys should probably be removed.

  • locale should be an object containing the translations.
  • keysUsed should be an array containing the keys used in the source code. It can be retrieved with extractFromFiles or extractFromCode.
  • Return a report.

Example

import {findUnused} from 'i18n-extract';
const unused = findUnused({
  key1: 'key 1',
  key2: 'key 2',
}, ['key1']);

/**
 * unused = [{
 *   type: 'UNUSED',
 *   key: 'key2',
 * }];
 */

findDuplicated(locale, keysUsed, options)

Report the duplicated keys. Those keys should probably be shared. The default threshold is 1, which reports any duplicated translations.

  • locale should be an object containing the translations.
  • keysUsed should be an array containing the keys used in the source code. It can be retrieved with extractFromFiles or extractFromCode.
  • options should be an object. You can provide a threshold property to change how many duplicated values are required before an entry is added to the report.
  • Return a report.

Example

import {findDuplicated} from 'i18n-extract';
const duplicated = findDuplicated({
  key1: 'Key 1',
  key2: 'Key 2',
  key3: 'Key 2',
});

/**
 * unused = [{
 *   type: 'DUPLICATED',
 *   keys: [
 *     'key2',
 *     'key3',
 *   ],
 *   value: 'Key 2',
 * }];
 */

forbidDynamic(locale, keysUsed)

Report any dynamic keys. It's arguably more dangerous to use dynamic keys, since they may break.

  • locale should be an object containing the translations.
  • keysUsed should be an array containing the keys used in the source code. It can be retrieved with extractFromFiles or extractFromCode.
  • Return a report.

Example

import {forbidDynamic} from 'i18n-extract';
const report = forbidDynamic({}, ['key.*']);

/**
 * report = [{
 *   type: 'FORBID_DYNAMIC',
 *   key: 'key.*',
 * }];
 */

flatten(object)

Flatten the object.

  • object should be an object.

Example

import {flatten} from 'i18n-extract';
const flattened = flatten({
  key2: 'Key 2',
  key4: {
    key41: 'Key 4.1',
    key42: {
      key421: 'Key 4.2.1',
    },
  },
});

/**
 * flattened = {
 *   key2: 'Key 2',
 *   'key4.key41': 'Key 4.1',
 *   'key4.key42.key421': 'Key 4.2.1',
 * };
 */

mergeMessagesWithPO(messages, poInput, poOutput)

Output a new po file with only the messages present in messages. If a message is already present in the poInput, we keep the translation. If a message is not present, we add a new empty translation.

  • messages should be an array.
  • poInput should be a string.
  • poOutput should be a string.

Example

import {mergeMessagesWithPO} from 'i18n-extract';

const messages = ['Message 1', 'Message 2'];
mergeMessagesWithPO(messages, 'messages.po', 'messages.output.po');

/**
 * Will output :
 * > messages.output.po has 812 messages.
 * > We have added 7 messages.
 * > We have removed 3 messages.
 */

Download Details:

Author: oliviertassinari
Source Code: https://github.com/oliviertassinari/i18n-extract 
License: MIT license

#javascript #i18n #extract 

Rocio O'Keefe

A New Flutter Package Project for Extracting Ogp Data on Web Pages

ogp_data_extract

A simple Dart library for extracting Open Graph protocol data from web pages.

Getting Started

In your package's pubspec.yaml file add the dependency.

dependencies:
  ogp_data_extract: ^0.x.x

You can install packages from the command line.

With Dart:

$ dart pub get

With Flutter:

$ flutter pub get

Structure

reference : The Open Graph protocol

OgpData:
    - url
    - type
    - title
    - description
    - image
    - imageSecureUrl
    - imageType
    - imageWidth
    - imageHeight
    - imageAlt    
    - siteName
    - determiner
    - locale
    - localeAlternate    
    - latitude
    - longitude
    - streetAddress
    - locality
    - region
    - postalCode
    - countryName
    - email
    - phoneNumber
    - faxNumber
    - video
    - videoSecureUrl
    - videoHeight
    - videoWidth
    - videoType
    - audio
    - audioSecureUrl
    - audioTitle
    - audioArtist
    - audioAlbum
    - audioType
    - fbAdmins
    - fbAppId
    - twitterCard
    - twitterSite

Usage

Parse OgpData for a given URL

void main() async {
    const String url = 'https://pub.dev/';
    final OgpData? ogpData = await OgpDataExtract.execute(url);
    print(ogpData?.url); // https://pub.dev/
    print(ogpData?.type); // website
    print(ogpData?.title); // Dart packages
    print(ogpData?.description); // Pub is the package manager for the Dart programming language, containing reusable libraries & packages for Flutter, AngularDart, and general Dart programs.
    print(ogpData?.image); // https://pub.dev/static/img/pub-dev-icon-cover-image.png?hash=vg86r2r3mbs62hiv4ldop0ife5um2g5g
    print(ogpData?.siteName); // Dart packages
}

Specify the User-Agent when parsing

void main() async {
    const String url = 'https://pub.dev/';
    const String userAgent = 'Mozilla/5.0 (iPhone; CPU iPhone OS 12_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Mobile/15E148 Safari/604.1';
    final OgpData? ogpData = await OgpDataExtract.execute(url, userAgent: userAgent);
    print(ogpData);
}

Use the parser manually

void main() async {
    const String url = 'https://pub.dev/';
    final http.Response response = await http.get(Uri.parse(url));
    final Document? document = OgpDataExtract.toDocument(response);
    final OgpData ogpData = OgpDataParser(document).parse();
    print(ogpData);
}

Use this package as a library

Depend on it

Run this command:

With Flutter:

 $ flutter pub add ogp_data_extract

This will add a line like this to your package's pubspec.yaml (and run an implicit flutter pub get):

dependencies:
  ogp_data_extract: ^0.1.2

Alternatively, your editor might support flutter pub get. Check the docs for your editor to learn more.

Import it

Now in your Dart code, you can use:

import 'package:ogp_data_extract/ogp_data_extract.dart';

example/main.dart

import 'package:ogp_data_extract/ogp_data_extract.dart';

void main() async {
  const String url = 'https://pub.dev/';
  final OgpData? ogpData = await OgpDataExtract.execute(url);
  print(ogpData?.url); // https://pub.dev/
  print(ogpData?.type); // website
  print(ogpData?.title); // Dart packages
  print(ogpData
      ?.description); // Pub is the package manager for the Dart programming language, containing reusable libraries & packages for Flutter, AngularDart, and general Dart programs.
  print(ogpData
      ?.image); // https://pub.dev/static/img/pub-dev-icon-cover-image.png?hash=vg86r2r3mbs62hiv4ldop0ife5um2g5g
  print(ogpData?.siteName); // Dart packages
}

Credit

This library is inspired by metadata_fetch.

However, this one is specialized for Open Graph protocol extraction.

Download Details:

Author: KINTO-JP
Source Code: https://github.com/KINTO-JP/ogp_data_extract 
License: MIT license

#flutter #dart  #data #extract 


ExtractMacro.jl: Provides A Convenience Julia Macro to Extract Fields

ExtractMacro.jl

This Julia package provides a macro to extract fields from composite objects.

Installation

To install the module, use Julia's package manager: start pkg mode by pressing ] and then enter:

(v1.3) pkg> add ExtractMacro

The module can then be loaded like any other Julia module:

julia> using ExtractMacro

Documentation

  • STABLE: the most recently tagged version of the documentation.
  • DEV: the in-development version of the documentation.

Download Details:

Author: Carlobaldassi
Source Code: https://github.com/carlobaldassi/ExtractMacro.jl 
License: View license

#julia #extract 

黎 飞

Web Scraping in Python: How to Scrape Sci-Fi Movies from IMDB

Have you ever struggled to find a dataset for your data science project? If you're like me, the answer is yes.

Fortunately, many free datasets are available, but sometimes you want something more specific or tailor-made. For that, web scraping is a good skill to have in your toolbox for pulling data out of your favorite website.

What is covered in this article?

This article includes a Python script you can use to scrape data on sci-fi movies (or whatever genre you choose!) from the IMDB website. It can then write that data to a dataframe for further exploration.

I will wrap up this article with a bit of exploratory data analysis (EDA). Through that, you'll see what further data science projects are possible for you to try.

Disclaimer: while web scraping is a great way to pull data from websites programmatically, please do it responsibly. My script uses the sleep function, for example, to intentionally slow down the requests so as not to overload IMDB's servers. Some websites frown upon the use of web scrapers, so use them wisely.
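
As a minimal illustration of that throttling idea (the URL, headers, and delay range simply mirror what the full script below does; the two page offsets are just for demonstration):

from random import randint
from time import sleep
from requests import get

headers = {'Accept-Language': 'en-US,en;q=0.8'}   # ask IMDB for English-language pages

for page in (1, 51):                              # first two result pages only, for illustration
    response = get("https://www.imdb.com/search/title?genres=sci-fi&start=" + str(page), headers=headers)
    print(page, response.status_code)
    sleep(randint(8, 15))                         # deliberate pause so requests are spaced out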

The Web Scraping and Data Cleaning Script

Let's get into the scraping script and get it running. The script pulls in movie titles, years, ratings (PG-13, R, and so on), genres, runtimes, reviews, and votes for each movie. You can choose how many pages you want to scrape based on your data needs.

Note: the more pages you select, the longer it will take. It takes 40 minutes to scrape 200 web pages using a Google Colab Notebook.

For those who haven't tried it before, Google Colab is a cloud-based, Jupyter Notebook-style Python development tool that lives in the Google suite of applications. You can use it out of the box, with many of the packages common in data science already installed.

Below is an image of the Colab workspace and its layout:

Introducing the Google Colab user interface

With that, let's dive in! First of all, you should always import your packages as their own cell. If you forget a package, you can re-run just that cell. This cuts down on development time.

Note: some of these packages need pip install package_name run first to install them. If you choose to run the code locally using something like a Jupyter Notebook, you will need to do that. If you want to get up and running quickly, you can use a Google Colab notebook, which has all of these installed by default.

from requests import get
from bs4 import BeautifulSoup
from warnings import warn
from time import sleep
from random import randint
import numpy as np, pandas as pd
import seaborn as sns

How to Do the Web Scraping

You can run the following code, which does the actual web scraping. It pulls all of the columns mentioned above into arrays and fills them in one movie at a time, one page at a time.

There are also some data cleaning steps I added and documented in this code. I removed the parentheses from the string data mentioning each movie's year, for example, and then converted those values to integers. Things like this make exploratory data analysis and modeling easier.

Note that I use the sleep function to avoid being restricted by IMDB for cycling through their web pages too quickly.

# Note this takes about 40 min to run if np.arange is set to 9951 as the stopping point.

pages = np.arange(1, 9951, 50) # Last time I tried, I could only go to 10000 items because after that the URI has no discernable pattern to combat webcrawlers; I just did 4 pages for demonstration purposes. You can increase this for your own projects.
headers = {'Accept-Language': 'en-US,en;q=0.8'} # If this is not specified, the default language is Mandarin

#initialize empty lists to store the variables scraped
titles = []
years = []
ratings = []
genres = []
runtimes = []
imdb_ratings = []
imdb_ratings_standardized = []
metascores = []
votes = []

for page in pages:

   #get request for sci-fi
   response = get("https://www.imdb.com/search/title?genres=sci-fi&"
                  + "start="
                  + str(page)
                  + "&explore=title_type,genres&ref_=adv_prv", headers=headers)

   sleep(randint(8,15))

   #throw warning for status codes that are not 200
   if response.status_code != 200:
       warn('Request: {}; Status code: {}'.format(requests, response.status_code))

   #parse the content of current iteration of request
   page_html = BeautifulSoup(response.text, 'html.parser')

   movie_containers = page_html.find_all('div', class_ = 'lister-item mode-advanced')

   #extract the 50 movies for that page
   for container in movie_containers:

       #conditional for all with metascore
       if container.find('div', class_ = 'ratings-metascore') is not None:

           #title
           title = container.h3.a.text
           titles.append(title)

           if container.h3.find('span', class_= 'lister-item-year text-muted unbold') is not None:

             #year released
             year = container.h3.find('span', class_= 'lister-item-year text-muted unbold').text # remove the parentheses around the year and make it an integer
             years.append(year)

           else:
             years.append(None) # each of the additional if clauses are to handle type None data, replacing it with an empty string so the arrays are of the same length at the end of the scraping

           if container.p.find('span', class_ = 'certificate') is not None:

             #rating
             rating = container.p.find('span', class_= 'certificate').text
             ratings.append(rating)

           else:
             ratings.append("")

           if container.p.find('span', class_ = 'genre') is not None:

             #genre
             genre = container.p.find('span', class_ = 'genre').text.replace("\n", "").rstrip().split(',') # remove the whitespace character, strip, and split to create an array of genres
             genres.append(genre)

           else:
             genres.append("")

           if container.p.find('span', class_ = 'runtime') is not None:

             #runtime
             time = int(container.p.find('span', class_ = 'runtime').text.replace(" min", "")) # remove the minute word from the runtime and make it an integer
             runtimes.append(time)

           else:
             runtimes.append(None)

           if float(container.strong.text) is not None:

             #IMDB ratings
             imdb = float(container.strong.text) # non-standardized variable
             imdb_ratings.append(imdb)

           else:
             imdb_ratings.append(None)

           if container.find('span', class_ = 'metascore').text is not None:

             #Metascore
             m_score = int(container.find('span', class_ = 'metascore').text) # make it an integer
             metascores.append(m_score)

           else:
             metascores.append(None)

           if container.find('span', attrs = {'name':'nv'})['data-value'] is not None:

             #Number of votes
             vote = int(container.find('span', attrs = {'name':'nv'})['data-value'])
             votes.append(vote)

           else:
               votes.append(None)

Pandas dataframes take input data arrays for each of their columns as key:value pairs. I did a couple of extra data cleaning steps here to finish things up.

After running the following cell, you should have a dataframe with the data you scraped.

sci_fi_df = pd.DataFrame({'movie': titles,
                      'year': years,
                      'rating': ratings,
                      'genre': genres,
                      'runtime_min': runtimes,
                      'imdb': imdb_ratings,
                      'metascore': metascores,
                      'votes': votes}
                      )

sci_fi_df.loc[:, 'year'] = sci_fi_df['year'].str[-5:-1] # two more data transformations after scraping
# Drop 'ovie' bug
# Make year an int
sci_fi_df['n_imdb'] = sci_fi_df['imdb'] * 10
final_df = sci_fi_df.loc[sci_fi_df['year'] != 'ovie'] # One small issue with the scrape on these two movies so just dropping those ones.
final_df.loc[:, 'year'] = pd.to_numeric(final_df['year'])

Exploratory Data Analysis

Now that you have the data, one of the first things you might want to do is learn more about it at a high level. The following commands are a useful first look at any data, and we will use them next:

final_df.head()

This command shows you the first 5 rows of your dataframe. It helps you see that nothing looks weird and everything is ready for analysis. You can see the output here:

The first five rows of data output using the final_df.head() command

final_df.describe()

This command gives you the mean, standard deviation, and other summaries. Count can show whether any columns have null values, which is useful information to know. The year column, for example, shows the range of movies scraped, from 1927 to 2022.

You can see the output below and inspect the others:

Running final_df.describe() produces summary statistics showing the number of data points, means, standard deviations, and more.

final_df.info()

This command lets you know the data types you are working with in each of your columns.

As a data scientist, this information can be helpful to you. Certain functions and methods need certain data types. You can also make sure your underlying data types are in a format that makes sense for what they represent.

For example: a 5-star rating should be a float or an int (if decimals aren't allowed). It should not be a string, since it is a number. Here is a summary of the data format of each variable after scraping:

Running final_df.info() shows how many values you have in each column and what their data types are.

The next command for learning more about your variables produces a heatmap. The heatmap shows the correlation between all of your quantitative variables. It is a quick way to assess the relationships that may exist between variables. I like to see the coefficients rather than trying to decipher the color code, so I use the annot=True argument.

sns.heatmap(final_df.corr(), annot=True);

The command above produces the following visualization using the Seaborn data visualization package:

A heatmap of correlations after running sns.heatmap(final_df.corr(), annot=True);

You can see that the strongest correlation is between the IMDB score and the metascore. This is not surprising, since two movie rating systems are likely to rate similarly.

The next strongest correlation you can see is between the IMDB rating and the number of votes. That is interesting because as the number of votes increases, you have a more representative sample of the population's rating. It is odd to see that the association between the two is weak, though.

The number of votes also roughly increases as the runtime increases.

You can also see a slight negative association between the IMDB score or metascore and the year the movie came out. We will look at this shortly.

You can visually examine some of these relationships with a scatter plot using this code:

import matplotlib.pyplot as plt  # needed for the plots below; not included in the imports cell above

x = final_df['n_imdb']
y = final_df['votes']
plt.scatter(x, y, alpha=0.5) # s= is size var, c= is color var
plt.xlabel("IMDB Rating Standardized")
plt.ylabel("Number of Votes")
plt.title("Number of Votes vs. IMDB Rating")
plt.ticklabel_format(style='plain')
plt.show()

That results in this visualization:

Number of Votes vs. IMDB Rating

The association above shows a few outliers. Generally, we see a greater number of votes on movies with a standardized IMDB rating of 85 or higher. Movies with a rating of 75 or lower have fewer reviews.

Drawing boxes around the data can show you what I mean. There are roughly two groupings of different magnitudes:

Two core groups in the data

Another thing that might be interesting to see is how many movies there are of each rating. This can show you where sci-fi tends to land in the ratings data. Use this code to get a bar chart of the ratings:

ax = final_df['rating'].value_counts().plot(kind='bar',
                                   figsize=(14,8),
                                   title="Number of Movies by Rating")
ax.set_xlabel("Rating")
ax.set_ylabel("Number of Movies")
ax.plot();

That code produces this chart, which shows us that R and PG-13 make up the majority of these sci-fi movies on IMDB.

Number of movies by rating

I did see that a few movies were rated "Approved" and was curious what that was. You can filter the dataframe with this code to dig in:

final_df[final_df['rating'] == 'Approved']

This showed that most of these movies were made before the 80s:

All of the movies rated "Approved"

I went to the MPAA website, and there was no mention of this rating on their ratings information page. It must have been phased out at some point.

You can also see whether any years or decades reviewed better than others. I took the average metascore by year and plotted it with the following code to explore further:

# What are the average metascores by year?
final_df.groupby('year')['metascore'].mean().plot(kind='bar', figsize=(16,8), title="Avg. Metascore by Year", xlabel="Year", ylabel="Avg. Metascore")
plt.xticks(rotation=90)
plt.plot();

That results in the following chart:

Average metascore by movie release year

Now I'm not saying I know why, but there is a gradual, modest decline in the average metascore as you move forward through history. Ratings seem to have leveled off around 55-60 over the last couple of decades. This could be because we have more data on newer movies, or because newer movies tend to get more reviews.

final_df['year'].value_counts().plot(kind='bar', figsize=[20,9])

Run the code above and you'll see that the 1927 movies had a sample of only one review. That score is therefore biased and over-inflated. You'll also see that recent movies are better represented in the reviews, as I suspected:

Number of movies by year

Ideas to Take This Data Science Project Further

You have text, categorical, and numeric variables here. There are more options you could try exploring.

One thing you could do is use natural language processing (NLP) to see whether there are any naming conventions across movie ratings, or within the sci-fi world (or whatever genre you chose, if you went with something different!).

You could also change the web scraping code to pull in more genres. With that, you could create a new cross-genre database to see whether there are naming conventions by genre.

You could then try to predict a movie's genre from its name. You could also try to predict the IMDB rating from a movie's genre or the year it came out. The latter idea would work better for the last few decades, since that is where most of the observations are.

I hope this tutorial sparks your curiosity about the world of data science and what is possible!

You will find that there are always more questions to ask in exploratory data analysis. Working with that constraint is about prioritizing based on business objectives. It is important to start with those objectives up front, or you could be exploring in the data weeds forever.

If the field of data science interests you and you want to expand your skills and enter it professionally, consider checking out Springboard's Data Science Career Track. In that course, Springboard guides you through all of the key concepts in depth and pairs you with a one-on-one expert mentor to support you on your journey.

I have written other articles that frame data science projects around business problems and walk through the technical approaches to solving them. Check those out if you're interested!

Happy coding!

Source: https://www.freecodecamp.org/news/web-scraping-sci-fi-movies-from-imdb-with-python/

 #python 


Web Scraping in Python: Scraping Sci-Fi Movies from IMDB

Have you ever struggled to find a dataset for your data science project? If you're like me, the answer is yes.

Fortunately, many free datasets are available, but sometimes you want something more specific or tailor-made. For that, web scraping is a good skill to have in your toolbox for pulling data out of your favorite website.

What is covered in this article?

This article includes a Python script you can use to scrape data on sci-fi movies (or whatever genre you choose!) from the IMDB website. It can then write that data to a dataframe for further exploration.

I will wrap up this article with a bit of exploratory data analysis (EDA). Through that, you'll see what further data science projects are possible for you to try.

Disclaimer: while web scraping is a great way to pull data from websites programmatically, please do it responsibly. My script uses the sleep function, for example, to intentionally slow down the requests so as not to overload IMDB's servers. Some websites frown upon the use of web scrapers, so use them wisely.

The Web Scraping and Data Cleaning Script

Let's get into the scraping script and get it running. The script pulls in movie titles, years, ratings (PG-13, R, and so on), genres, runtimes, reviews, and votes for each movie. You can choose how many pages you want to scrape based on your data needs.

Note: the more pages you select, the longer it will take. It takes 40 minutes to scrape 200 web pages using a Google Colab Notebook.

For those who haven't tried it before, Google Colab is a cloud-based, Jupyter Notebook-style Python development tool that lives in the Google suite of applications. You can use it out of the box, with many of the packages common in data science already installed.

Below is an image of the Colab workspace and its layout:

Introducing the Google Colab user interface

With that, let's dive in! First of all, you should always import your packages as their own cell. If you forget a package, you can re-run just that cell. This cuts down on development time.

Note: some of these packages need pip install package_name run first to install them. If you choose to run the code locally using something like a Jupyter Notebook, you will need to do that. If you want to get up and running quickly, you can use a Google Colab notebook, which has all of these installed by default.

from requests import get
from bs4 import BeautifulSoup
from warnings import warn
from time import sleep
from random import randint
import numpy as np, pandas as pd
import seaborn as sns

How to Do the Web Scraping

You can run the following code, which does the actual web scraping. It pulls all of the columns mentioned above into arrays and fills them in one movie at a time, one page at a time.

There are also some data cleaning steps I added and documented in this code. I removed the parentheses from the string data mentioning each movie's year, for example, and then converted those values to integers. Things like this make exploratory data analysis and modeling easier.

Note that I use the sleep function to avoid being restricted by IMDB for cycling through their web pages too quickly.

# Note this takes about 40 min to run if np.arange is set to 9951 as the stopping point.

pages = np.arange(1, 9951, 50) # Last time I tried, I could only go to 10000 items because after that the URI has no discernable pattern to combat webcrawlers; I just did 4 pages for demonstration purposes. You can increase this for your own projects.
headers = {'Accept-Language': 'en-US,en;q=0.8'} # If this is not specified, the default language is Mandarin

#initialize empty lists to store the variables scraped
titles = []
years = []
ratings = []
genres = []
runtimes = []
imdb_ratings = []
imdb_ratings_standardized = []
metascores = []
votes = []

for page in pages:
  
   #get request for sci-fi
   response = get("https://www.imdb.com/search/title?genres=sci-fi&"
                  + "start="
                  + str(page)
                  + "&explore=title_type,genres&ref_=adv_prv", headers=headers)
  
   sleep(randint(8,15))
   
   #throw warning for status codes that are not 200
   if response.status_code != 200:
       warn('Request: {}; Status code: {}'.format(requests, response.status_code))

   #parse the content of current iteration of request
   page_html = BeautifulSoup(response.text, 'html.parser')
      
   movie_containers = page_html.find_all('div', class_ = 'lister-item mode-advanced')
  
   #extract the 50 movies for that page
   for container in movie_containers:

       #conditional for all with metascore
       if container.find('div', class_ = 'ratings-metascore') is not None:

           #title
           title = container.h3.a.text
           titles.append(title)

           if container.h3.find('span', class_= 'lister-item-year text-muted unbold') is not None:
            
             #year released
             year = container.h3.find('span', class_= 'lister-item-year text-muted unbold').text # remove the parentheses around the year and make it an integer
             years.append(year)

           else:
             years.append(None) # each of the additional if clauses are to handle type None data, replacing it with an empty string so the arrays are of the same length at the end of the scraping

           if container.p.find('span', class_ = 'certificate') is not None:
            
             #rating
             rating = container.p.find('span', class_= 'certificate').text
             ratings.append(rating)

           else:
             ratings.append("")

           if container.p.find('span', class_ = 'genre') is not None:
            
             #genre
             genre = container.p.find('span', class_ = 'genre').text.replace("\n", "").rstrip().split(',') # remove the whitespace character, strip, and split to create an array of genres
             genres.append(genre)
          
           else:
             genres.append("")

           if container.p.find('span', class_ = 'runtime') is not None:

             #runtime
             time = int(container.p.find('span', class_ = 'runtime').text.replace(" min", "")) # remove the minute word from the runtime and make it an integer
             runtimes.append(time)

           else:
             runtimes.append(None)

           if float(container.strong.text) is not None:

             #IMDB ratings
             imdb = float(container.strong.text) # non-standardized variable
             imdb_ratings.append(imdb)

           else:
             imdb_ratings.append(None)

           if container.find('span', class_ = 'metascore').text is not None:

             #Metascore
             m_score = int(container.find('span', class_ = 'metascore').text) # make it an integer
             metascores.append(m_score)

           else:
             metascores.append(None)

           if container.find('span', attrs = {'name':'nv'})['data-value'] is not None:

             #Number of votes
             vote = int(container.find('span', attrs = {'name':'nv'})['data-value'])
             votes.append(vote)

           else:
               votes.append(None)

Pandas dataframes take input data arrays for each of their columns as key:value pairs. I did a couple of extra data cleaning steps here to finish things up.

After running the following cell, you should have a dataframe with the data you scraped.

sci_fi_df = pd.DataFrame({'movie': titles,
                      'year': years,
                      'rating': ratings,
                      'genre': genres,
                      'runtime_min': runtimes,
                      'imdb': imdb_ratings,
                      'metascore': metascores,
                      'votes': votes}
                      )

sci_fi_df.loc[:, 'year'] = sci_fi_df['year'].str[-5:-1] # two more data transformations after scraping
# Drop 'ovie' bug
# Make year an int
sci_fi_df['n_imdb'] = sci_fi_df['imdb'] * 10
final_df = sci_fi_df.loc[sci_fi_df['year'] != 'ovie'] # One small issue with the scrape on these two movies so just dropping those ones.
final_df.loc[:, 'year'] = pd.to_numeric(final_df['year'])

Exploratory Data Analysis

Now that you have the data, one of the first things you might want to do is learn more about it at a high level. The following commands are a useful first look at any data, and we will use them next:

final_df.head()

This command shows you the first 5 rows of your dataframe. It helps you see that nothing looks weird and everything is ready for analysis. You can see the output here:

The first five rows of data output using the final_df.head() command

final_df.describe()

This command gives you the mean, standard deviation, and other summaries. Count can show whether any columns have null values, which is useful information to know. The year column, for example, shows the range of movies scraped, from 1927 to 2022.

You can see the output below and inspect the others:

Running final_df.describe() produces summary statistics showing the number of data points, means, standard deviations, and more.

final_df.info()

Este comando permite que você conheça os tipos de dados com os quais está trabalhando em cada uma de suas colunas.

Como cientista de dados, essas informações podem ser úteis para você. Certas funções e métodos precisam de certos tipos de dados. Você também pode garantir que seus tipos de dados subjacentes estejam em um formato que faça sentido para o que são.

For example: a 5-star rating should be a float or an int (if decimals aren't allowed). It shouldn't be a string, since it's a number. Here's a summary of the data format for each variable after scraping:

Running final_df.info() shows how many values you have in each column and what their data types are.

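If info() shows a column stored as a generic object that should be numeric, you can coerce it explicitly. A minimal sketch, assuming the column names used above:

# Coerce columns to sensible numeric types; errors='coerce' turns unparseable values into NaN
final_df['votes'] = pd.to_numeric(final_df['votes'], errors='coerce')
final_df['metascore'] = pd.to_numeric(final_df['metascore'], errors='coerce')
final_df['imdb'] = final_df['imdb'].astype(float)  # ratings stay floats, since decimals are allowed
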
The next command for learning more about your variables produces a heatmap. The heatmap shows the correlation between all of your quantitative variables. This is a quick way to assess the relationships that may exist between variables. I like to see the coefficients rather than trying to decipher the color code, so I use the annot=True argument.

sns.heatmap(final_df.corr(), annot=True);

The command above produces the following visualization using the Seaborn data visualization package:

A heatmap of correlations after running sns.heatmap(final_df.corr(), annot=True);

You can see that the strongest correlation is between the IMDB score and the metascore. This isn't surprising, since two movie rating systems are likely to rate similarly.

The next strongest correlation you can see is between the IMDB rating and the number of votes. This is interesting because as the number of votes goes up, you have a more representative sample of the population's rating. It's odd to see that the association between the two is weak, though.

The number of votes roughly increases as the runtime increases as well.

You can also see a slight negative association between the IMDB rating or metascore and the year the movie came out. We'll look at this shortly.

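If you want the exact coefficient for a single pair of variables rather than reading it off the heatmap, pandas can compute it directly (a small aside of mine, using the same columns):

print(final_df['n_imdb'].corr(final_df['votes']))    # IMDB rating vs. number of votes
print(final_df['metascore'].corr(final_df['year']))  # metascore vs. release year
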
You can check some of these relationships visually via a scatter plot with this code:

import matplotlib.pyplot as plt # needed for the plots below; matplotlib wasn't in the earlier import cell

x = final_df['n_imdb']
y = final_df['votes']
plt.scatter(x, y, alpha=0.5) # s= is size var, c= is color var
plt.xlabel("IMDB Rating Standardized")
plt.ylabel("Number of Votes")
plt.title("Number of Votes vs. IMDB Rating")
plt.ticklabel_format(style='plain')
plt.show()

This results in this visualization:

IMDB ratings versus the number of votes

The association above shows some outliers. Generally, we see a larger number of votes on movies with a standardized IMDB rating of 85 or higher. There are fewer reviews on movies with a rating of 75 or lower.

Drawing these boxes around the data can show what I mean. There are roughly two groupings of different magnitudes:

Two core groups in the data

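One way to put rough numbers on those two groups (a sketch of my own, reusing the 85 and 75 cut-offs mentioned above) is to split the dataframe on the standardized rating and compare vote counts:

high = final_df[final_df['n_imdb'] >= 85]  # higher-rated cluster
low = final_df[final_df['n_imdb'] <= 75]   # lower-rated cluster
print(len(high), high['votes'].median())   # how many movies, and their median vote count
print(len(low), low['votes'].median())
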
Another thing that might be interesting to see is how many movies there are of each rating. This can show you where sci-fi tends to land in the ratings data. Use this code to get a bar chart of the ratings:

ax = final_df['rating'].value_counts().plot(kind='bar',
                                   figsize=(14,8),
                                   title="Number of Movies by Rating")
ax.set_xlabel("Rating")
ax.set_ylabel("Number of Movies")
ax.plot();

That code results in this chart, which shows us that R and PG-13 make up the majority of these sci-fi movies on IMDB.

Number of movies by rating

I saw that there were a few movies rated "Approved" and was curious what that was. You can filter the dataframe with this code to drill down on that:

final_df[final_df['rating'] == 'Approved']

This revealed that most of these movies were made before the 80s:

All the movies rated "Approved"

I went to the MPAA's website and there was no mention of it on their ratings information page. It must have been phased out at some point.

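To put numbers behind the "mostly before the 80s" observation, you could summarize the release years of just the Approved titles — a quick check I've added, not part of the original article:

approved = final_df[final_df['rating'] == 'Approved']
print(approved['year'].describe())                   # min, max, and quartiles of the release years
print(approved['year'].value_counts().sort_index())  # how many Approved titles in each year
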
You can also check whether any years or decades outperformed others in reviews. I took the average metascore per year and plotted it with the following code to explore further:

# What are the average metascores by year?
final_df.groupby('year')['metascore'].mean().plot(kind='bar', figsize=(16,8), title="Avg. Metascore by Year", xlabel="Year", ylabel="Avg. Metascore")
plt.xticks(rotation=90)
plt.plot();

This results in the following chart:

Average metascore by movie year

Now, I'm not saying I know why, but there is a gradual, gentle decline in the average metascore variable as you progress through history. It looks like ratings have leveled off around 55-60 over the last couple of decades. This may be because we have more data on newer movies, or newer movies tend to get reviewed more.

final_df['year'].value_counts().plot(kind='bar', figsize=[20,9])

Run the code above and you'll see that the 1927 movie had a sample of only 1 review. That score is therefore biased and over-inflated. You'll also see that more recent movies are better represented in reviews, as I suspected:

Number of movies by year

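If you want that sample-size caveat and the average score side by side, a single groupby can return both — a hedged sketch reusing the same columns:

# Average metascore per year together with the number of scraped movies behind it
per_year = final_df.groupby('year')['metascore'].agg(['mean', 'count'])
print(per_year[per_year['count'] >= 5])  # e.g. only compare years backed by at least 5 movies
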
Data Science Project Ideas to Take This Further

You have textual, categorical, and numerical variables here. There are a few options you could try to explore further.

One thing you could do is use Natural Language Processing (NLP) to see whether there are naming conventions for movie ratings, or within the world of sci-fi (or whichever genre you chose, if you decided to do a different one!).

You could also alter the web scraping code to pull in many other genres, as sketched below. With that, you could create a new cross-genre database to see whether there are naming conventions by genre.

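As a sketch of that idea (the genre names below are examples I picked for illustration, not an official IMDB list), you could loop over the genres parameter in the same search URL the scraping script already uses:

genres_to_scrape = ['sci-fi', 'horror', 'comedy', 'drama']  # hypothetical genre list for illustration
start = 1                                                   # first page of results
for g in genres_to_scrape:
    url = ("https://www.imdb.com/search/title?genres=" + g
           + "&start=" + str(start)
           + "&explore=title_type,genres&ref_=adv_prv")
    # ...then reuse the same request, parse, and append logic from the scraping loop above,
    # storing g alongside each movie so the combined dataset stays labeled by genre
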
You could then try to predict the genre based on the name of the movie. You could also try to predict the IMDB rating based on the genre or the year the movie came out. The latter idea would work better over the last few decades, since that's where most of the observations are.

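As one hedged sketch of that last idea — it assumes scikit-learn is installed and is only meant to show the shape of the task, not a tuned model — you could regress the standardized IMDB rating on release year for the recent decades:

from sklearn.linear_model import LinearRegression  # assumes scikit-learn is available

recent = final_df[final_df['year'] >= 1980].dropna(subset=['year', 'n_imdb'])
X = recent[['year']]   # single feature: release year
y = recent['n_imdb']   # target: standardized IMDB rating
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # slope per year and the baseline rating
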
I hope this tutorial has sparked curiosity in you about the world of data science and what's possible!

You'll find in exploratory data analysis that there are always more questions to ask. Working with that constraint means prioritizing based on the business goal(s). It's important to start with those goals up front, or you could be in the data weeds exploring forever.

If the field of data science interests you and you want to expand your skill set and get into it professionally, consider checking out Springboard's Data Science Career Track. In that course, Springboard walks you through all the key concepts in depth, with a 1:1 expert mentor paired with you to support you on your journey.

I've written other articles that frame data science projects with respect to business problems and walk through technical approaches to solving them. Check those out if you're interested!

Happy coding!

Source: https://www.freecodecamp.org/news/web-scraping-sci-fi-movies-from-imdb-with-python/

 #python 

Web Scraping in Python – How to Scrape Sci-Fi Movies From IMDB

Thierry  Perret

Thierry Perret

1660227960

Web Scraping En Python - Gratter Des Films De Science-fiction à Partir

Avez-vous déjà eu du mal à trouver un ensemble de données pour votre projet de science des données ? Si vous êtes comme moi, la réponse est oui.

Heureusement, de nombreux ensembles de données gratuits sont disponibles, mais vous souhaitez parfois quelque chose de plus spécifique ou sur mesure. Pour cela, le grattage Web est une bonne compétence à avoir dans votre boîte à outils pour extraire des données de votre site Web préféré.

Qu'est-ce qui est couvert dans cet article ?

Cet article contient un script Python que vous pouvez utiliser pour récupérer les données sur les films de science-fiction (ou quel que soit le genre que vous choisissez !) À partir du site Web IMDB . Il peut ensuite écrire ces données dans une base de données pour une exploration plus approfondie.

Je conclurai cet article avec un peu d'analyse exploratoire des données (EDA). Grâce à cela, vous verrez quels autres projets de science des données il vous est possible d'essayer.

Avis de non-responsabilité : bien que le scraping Web soit un excellent moyen d'extraire par programme des données de sites Web, veuillez le faire de manière responsable. Mon script utilise par exemple la fonction sleep pour ralentir intentionnellement les pull requests, afin de ne pas surcharger les serveurs d'IMDB. Certains sites Web désapprouvent l'utilisation de grattoirs Web, alors utilisez-les judicieusement.

Script de grattage Web et de nettoyage des données

Passons au script de grattage et lancez-le. Le script extrait les titres de films, les années, les classements (PG-13, R, etc.), les genres, les durées d'exécution, les critiques et les votes pour chaque film. Vous pouvez choisir le nombre de pages que vous souhaitez récupérer en fonction de vos besoins en données.

Remarque : plus vous sélectionnez de pages, plus cela prendra de temps. Il faut 40 minutes pour gratter 200 pages Web à l'aide de Google Colab Notebook .

Pour ceux d'entre vous qui ne l'ont pas encore essayé, Google Colab est un outil de développement Python de style Jupyter Notebook basé sur le cloud qui réside dans la suite d'applications Google. Vous pouvez l'utiliser prêt à l'emploi avec de nombreux packages déjà installés qui sont courants en science des données.

Vous trouverez ci-dessous une image de l'espace de travail Colab et de sa disposition :

R9sAuHzGHrEvRK_hiAWsy4W41W72et6clD38gIYeAA6AtA32e97xxw0W5ub_96xmgSMTDB2VjRK-gz_YgYtZoV1YyCHjKftaB7-HD2NQ7qt_8hcdnDfqaibp0ONwPr9-4zO5gv3FuXdxiOMsN6eF8eF8bA

Présentation de l'interface utilisateur de Google Colab

Sur ce, plongeons dedans ! Tout d'abord, vous devez toujours importer vos packages dans leur propre cellule. Si vous oubliez un package, vous pouvez réexécuter uniquement cette cellule. Cela réduit le temps de développement.

Remarque : certains de ces packages doivent pip install package_nameêtre exécutés pour les installer en premier. Si vous choisissez d'exécuter le code localement en utilisant quelque chose comme un cahier Jupyter, vous devrez le faire. Si vous souhaitez être opérationnel rapidement, vous pouvez utiliser le bloc-notes Google Colab. Cela a tout cela installé par défaut.

from requests import get
from bs4 import BeautifulSoup
from warnings import warn
from time import sleep
from random import randint
import numpy as np, pandas as pd
import seaborn as sns

Comment faire du Web Scraping

Vous pouvez exécuter le code suivant qui effectue le grattage Web réel. Il tirera toutes les colonnes mentionnées ci-dessus dans des tableaux et les remplira un film à la fois, une page à la fois.

Il y a aussi quelques étapes de nettoyage des données que j'ai ajoutées et documentées dans ce code également. J'ai supprimé les parenthèses des données de chaîne mentionnant l'année du film par exemple. Je les ai ensuite convertis en nombres entiers. De telles choses facilitent l'analyse et la modélisation exploratoire des données.

Notez que j'utilise la fonction de veille pour éviter d'être limité par IMDB lorsqu'il s'agit de parcourir trop rapidement leurs pages Web.

# Note this takes about 40 min to run if np.arange is set to 9951 as the stopping point.

pages = np.arange(1, 9951, 50) # Last time I tried, I could only go to 10000 items because after that the URI has no discernable pattern to combat webcrawlers; I just did 4 pages for demonstration purposes. You can increase this for your own projects.
headers = {'Accept-Language': 'en-US,en;q=0.8'} # If this is not specified, the default language is Mandarin

#initialize empty lists to store the variables scraped
titles = []
years = []
ratings = []
genres = []
runtimes = []
imdb_ratings = []
imdb_ratings_standardized = []
metascores = []
votes = []

for page in pages:
  
   #get request for sci-fi
   response = get("https://www.imdb.com/search/title?genres=sci-fi&"
                  + "start="
                  + str(page)
                  + "&explore=title_type,genres&ref_=adv_prv", headers=headers)
  
   sleep(randint(8,15))
   
   #throw warning for status codes that are not 200
   if response.status_code != 200:
       warn('Request: {}; Status code: {}'.format(requests, response.status_code))

   #parse the content of current iteration of request
   page_html = BeautifulSoup(response.text, 'html.parser')
      
   movie_containers = page_html.find_all('div', class_ = 'lister-item mode-advanced')
  
   #extract the 50 movies for that page
   for container in movie_containers:

       #conditional for all with metascore
       if container.find('div', class_ = 'ratings-metascore') is not None:

           #title
           title = container.h3.a.text
           titles.append(title)

           if container.h3.find('span', class_= 'lister-item-year text-muted unbold') is not None:
            
             #year released
             year = container.h3.find('span', class_= 'lister-item-year text-muted unbold').text # remove the parentheses around the year and make it an integer
             years.append(year)

           else:
             years.append(None) # each of the additional if clauses are to handle type None data, replacing it with an empty string so the arrays are of the same length at the end of the scraping

           if container.p.find('span', class_ = 'certificate') is not None:
            
             #rating
             rating = container.p.find('span', class_= 'certificate').text
             ratings.append(rating)

           else:
             ratings.append("")

           if container.p.find('span', class_ = 'genre') is not None:
            
             #genre
             genre = container.p.find('span', class_ = 'genre').text.replace("\n", "").rstrip().split(',') # remove the whitespace character, strip, and split to create an array of genres
             genres.append(genre)
          
           else:
             genres.append("")

           if container.p.find('span', class_ = 'runtime') is not None:

             #runtime
             time = int(container.p.find('span', class_ = 'runtime').text.replace(" min", "")) # remove the minute word from the runtime and make it an integer
             runtimes.append(time)

           else:
             runtimes.append(None)

           if float(container.strong.text) is not None:

             #IMDB ratings
             imdb = float(container.strong.text) # non-standardized variable
             imdb_ratings.append(imdb)

           else:
             imdb_ratings.append(None)

           if container.find('span', class_ = 'metascore').text is not None:

             #Metascore
             m_score = int(container.find('span', class_ = 'metascore').text) # make it an integer
             metascores.append(m_score)

           else:
             metascores.append(None)

           if container.find('span', attrs = {'name':'nv'})['data-value'] is not None:

             #Number of votes
             vote = int(container.find('span', attrs = {'name':'nv'})['data-value'])
             votes.append(vote)

           else:
               votes.append(None)

           else:
               votes.append(None)

Les dataframes Pandas prennent en entrée des tableaux de données pour chacune de leurs colonnes dans des paires clé:valeur. J'ai effectué quelques étapes supplémentaires de nettoyage des données ici pour finaliser le nettoyage des données.

Après avoir exécuté la cellule suivante, vous devriez avoir une trame de données avec les données que vous avez récupérées.

sci_fi_df = pd.DataFrame({'movie': titles,
                      'year': years,
                      'rating': ratings,
                      'genre': genres,
                      'runtime_min': runtimes,
                      'imdb': imdb_ratings,
                      'metascore': metascores,
                      'votes': votes}
                      )

sci_fi_df.loc[:, 'year'] = sci_fi_df['year'].str[-5:-1] # two more data transformations after scraping
# Drop 'ovie' bug
# Make year an int
sci_fi_df['n_imdb'] = sci_fi_df['imdb'] * 10
final_df = sci_fi_df.loc[sci_fi_df['year'] != 'ovie'] # One small issue with the scrape on these two movies so just dropping those ones.
final_df.loc[:, 'year'] = pd.to_numeric(final_df['year'])

L'analyse exploratoire des données

Maintenant que vous disposez des données, l'une des premières choses que vous voudrez peut-être faire est d'en savoir plus à un niveau élevé. Les commandes suivantes sont un premier aperçu utile de toutes les données et nous les utiliserons ensuite :

final_df.head()

Cette commande vous montre les 5 premières lignes de votre dataframe. Cela vous aide à voir que rien ne semble bizarre et que tout est prêt pour l'analyse. Vous pouvez voir la sortie ici :

TCYKlpEKIJOVJIAtIGN4wzDhCySaYIXI9cyBizZxR3XHsAQO_YH9mh626hCq8fdItaAF0N0cxSs1PP1eYujRsOt8HgeXtcC3hff-y0Jl4tvN__itH97iXqb6DrN6wJrngdsNaKQTQag5StHfOIcy5A0

Les cinq premières lignes de données générées à l'aide de la final_df.head()commande

final_df.describe()

Cette commande vous fournira la moyenne, l'écart type et d'autres résumés. Count peut vous montrer s'il y a des valeurs nulles dans certaines des colonnes, ce qui est une information utile à connaître. La colonne de l'année, par exemple, vous montre la gamme de films récupérés - de 1927 à 2022.

Vous pouvez voir la sortie ci-dessous et inspecter les autres :

Zeo_Y8ipyIejyYIBa2Aaocz4obHNlMVU76YTylZGl_wpRovYVFNS4e0m1DYAwkcqhpoYikJFL_dSgZSH-qoghJM3VMXESMUykrfs1e3JuXRkrp9iEZhPPnqGvsSamdYQe6Noz0Q0OA-Wen616-pmbDQ

L'exécution final_df.describe()produit des statistiques récapitulatives indiquant le nombre de points de données, les moyennes, les écarts-types, etc.

final_df.info()

Cette commande vous permet de connaître les types de données avec lesquels vous travaillez dans chacune de vos colonnes.

En tant que data scientist, ces informations peuvent vous être utiles. Certaines fonctions et méthodes nécessitent certains types de données. Vous pouvez également vous assurer que vos types de données sous-jacents sont dans un format qui a du sens pour ce qu'ils sont.

Par exemple : une note de 5 étoiles doit être un flottant ou un entier (si les décimales ne sont pas autorisées). Il ne doit pas s'agir d'une chaîne car il s'agit d'un nombre. Voici un résumé du format de données pour chaque variable après le grattage :

PT7Fa9XFYErtorVw6bNxw7Q1mI-p2_hlKgTbTs90RRpALPDlqd95F_EOwCQ7cV2cDymqZ-mXIa_0blqxxJ5wZ8Bznzd0iFyTB6kFroIUK2DJNzfRZgwgsRHr0pjDyE1ZUrQILf-22w856OoufnnKmRI

L'exécution final_df.info()des résultats vous montre combien de valeurs vous avez dans chaque colonne et quels sont leurs types de données.

La commande suivante pour en savoir plus sur vos variables produit une carte thermique. La carte thermique montre la corrélation entre toutes vos variables quantitatives. C'est un moyen rapide d'évaluer les relations qui peuvent exister entre les variables. J'aime voir les coefficients plutôt que d'essayer de déchiffrer le code couleur, alors j'utilise l' annot=Trueargument.

sns.heatmap(final_df.corr(), annot=True);

La commande ci-dessus produit la visualisation suivante à l'aide du package de visualisation de données Seaborn :

niHLKP7bps1EpZ_39u5k3dPDF0Xuz8Zuhal8Bbc8wtImKUv50M_7fEH65rCAkrTglGtZTJpZ2sRfIE0E6Kjn9m_CYGkRct83_3wWzVp0rnHA8nh5UuveFO0OqtjVfoOzMsKGq0lZ2uxw66Lp4g69aMo

Une carte thermique des corrélations après l'exécutionsns.heatmap(final_df.corr(), annot=True);

Vous pouvez voir que la corrélation la plus forte est entre le score IMDB et le métascore. Ce n'est pas surprenant puisqu'il est probable que deux systèmes de classification de films obtiennent une note similaire.

La prochaine corrélation la plus forte que vous pouvez voir est entre la note IMDB et le nombre de votes. C'est intéressant car à mesure que le nombre de votes augmente, vous avez un échantillon plus représentatif de la cote de la population. Il est étrange de voir qu'il existe une faible association entre les deux, cependant.

Le nombre de votes augmente à peu près à mesure que la durée d'exécution augmente également.

Vous pouvez également voir une légère association négative entre IMDB ou metascore et l'année de sortie du film. Nous verrons cela sous peu.

Vous pouvez vérifier visuellement certaines de ces relations via un nuage de points avec ce code :

x = final_df['n_imdb']
y = final_df['votes']
plt.scatter(x, y, alpha=0.5) # s= is size var, c= is color var
plt.xlabel("IMDB Rating Standardized")
plt.ylabel("Number of Votes")
plt.title("Number of Votes vs. IMDB Rating")
plt.ticklabel_format(style='plain')
plt.show()

Cela se traduit par cette visualisation :

vvqxh5VwbHPoypyGlNBstgZW8puVWKa5m_hl6MYB_r78OfRC7TWBx9jxjf8PFflJO93hq83ZdIqX97uq6C_WjlZV5jorCDgtU3U3_dESuUgsStfLEgkeiikHTq2noabW_tPJQRRGpFrVmQ90gja4xAo

Notes IMDB par rapport au nombre de votes

L'association ci-dessus montre quelques valeurs aberrantes. Généralement, nous voyons un plus grand nombre de votes sur les films qui ont une cote IMDB de 85 ou plus. Il y a moins de critiques sur les films avec une note de 75 ou moins.

Dessiner ces cases autour des données peut vous montrer ce que je veux dire. Il existe à peu près deux groupes de grandeurs différentes :

QEUbjZrtSiLCbdcXIR1MN0MKvcCZgVxeW2sPzMo4KL36pjCQq87rkRdgKKwK2yWSh2Uz0HMoIckyOa0qcNX4hCQok_kuuyqq4PddFHVuC5Tzyg9-WdZobdZgWfOpW1PnKWFKfQLaDAEDXoDHfiuU5mY

Deux groupes principaux dans les données

Une autre chose qui pourrait être intéressante à voir est le nombre de films de chaque classement. Cela peut vous montrer où la science-fiction a tendance à atterrir dans les données d'évaluation. Utilisez ce code pour obtenir un graphique à barres des notes :

ax = final_df['rating'].value_counts().plot(kind='bar',
                                   figsize=(14,8),
                                   title="Number of Movies by Rating")
ax.set_xlabel("Rating")
ax.set_ylabel("Number of Movies")
ax.plot();

Ce code donne ce tableau qui nous montre que R et PG-13 constituent la majorité de ces films de science-fiction sur IMDB.

7896rs2HtqgI4nIPyI-vUU5w43C3_Dcuyc_DdjUOudq76aIHstBINNVf5e0-1G3MUZzFgKDzK_2Jhsnno5swbXIoZwMuxqg1icY8aPbxWjOsCIm3BB9lObzY7HiDSAhmLTfcpfi2HWdW4VoUjBck

Nombre de films par note

J'ai vu qu'il y avait quelques films classés comme "approuvés" et j'étais curieux de savoir ce que c'était. Vous pouvez filtrer la trame de données avec ce code pour y accéder :

final_df[final_df['rating'] == 'Approved']

Cela a révélé que la plupart de ces films ont été réalisés avant les années 80 :

OIxMNDTgcXcPo_Wy8N7miq44OOAai4o-A8upYa1pNbqWjDVzPduRxNcMgUPuG9-OFyNd1AFgwWeq_o4E3Kv9pXy8xVSH7p6ZZi9uoOy78dBFK0LjvDnN9k7WYDTiZYxwpVgCiqXokWLPuMo746jvWMo

Tous les films classés "Approuvés"

Je suis allé sur le site Web de la MPAA et il n'y avait aucune mention d'eux sur leur page d'information sur les évaluations. Il a dû être supprimé à un moment donné.

Vous pouvez également vérifier si des années ou des décennies ont surpassé les autres lors des évaluations. J'ai pris le métascore moyen par année et l'ai tracé avec le code suivant pour explorer davantage :

# What are the average metascores by year?
final_df.groupby('year')['metascore'].mean().plot(kind='bar', figsize=(16,8), title="Avg. Metascore by Year", xlabel="Year", ylabel="Avg. Metascore")
plt.xticks(rotation=90)
plt.plot();

Cela se traduit par le graphique suivant :

BTwJBQhq0zBTr5UH5J3n7CSR6k3Ft8l9GBQ_czZRlu_LO192AXd0G_ozwXzsb6pctS-8lHCvgVLx6VZ7hWH-trp8C4oFAPCGufh2gq-F2WV96u90xt05KqUGYCqSFpmXxPEsFKSZQceglNItwChRnfE

Moy. Metascore par année de film

Maintenant, je ne dis pas que je sais pourquoi, mais il y a un léger déclin progressif à mesure que vous progressez dans l'histoire de la variable de métascore moyen. Il semble que les notes se soient stabilisées autour de 55-60 au cours des deux dernières décennies. Cela peut être dû au fait que nous avons plus de données sur les films les plus récents ou que les films les plus récents ont tendance à être davantage examinés.

final_df['year'].value_counts().plot(kind='bar', figsize=[20,9])

Exécutez le code ci-dessus et vous verrez que le film de 1927 n'avait qu'un échantillon de 1 critique. Ce score est alors biaisé et sur-gonflé. Vous verrez aussi que les films les plus récents sont mieux représentés dans les critiques comme je m'en doutais :

d2C-t1DLqSjRY8DpeoudyBNHG4SevXCZXFK4xoaw3QHpj_j4qnEf479Tn7wNyBqOwKAzR5GVidaRZ79XB5Eo36msA8LRBxNaJu_9Xk1VKE5oeo2Pue1TLnbjMX3y48Gc5xfOBnlZ1x9rdTktI_N2Bpg

Nombre de films par année

Data science project ideas to take this further

You have text, categorical, and numeric variables here, so there are a few options you could try to explore further.

One thing you could do is use Natural Language Processing (NLP) to see whether there are any naming conventions for movie ratings, or within the world of sci-fi (or whatever genre you chose, if you went with a different one!).

You could also change the web scraping code to pull many more genres. With that, you could build a new cross-genre database to see whether there are naming conventions by genre.

You could then try to predict a movie's genre from its name. You could also try to predict the IMDB rating based on the genre or the year the movie came out. The latter idea would work better over the last few decades, since that is where most of the observations are.
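As a rough sketch of the title-based idea, a bag-of-words classifier is enough to get started. The snippet below uses scikit-learn and assumes a hypothetical cross-genre dataframe called multi_genre_df with 'movie' (title) and 'genre' (single label) columns; those names are placeholders, not part of the scraping script in this article.

# Sketch only: predict a movie's genre from its title with TF-IDF + logistic regression.
# `multi_genre_df` is an assumed dataframe with 'movie' and 'genre' columns.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X_train, X_test, y_train, y_test = train_test_split(
    multi_genre_df['movie'], multi_genre_df['genre'], test_size=0.2, random_state=42)

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),  # turn titles into word/bigram features
    LogisticRegression(max_iter=1000))              # simple multi-class classifier

model.fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))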

I hope this tutorial has sparked your curiosity about the world of data science and what is possible!

You will find in exploratory data analysis that there are always more questions to ask. Working with that constraint is about prioritizing based on business goals. It's important to start with those goals up front, or you could be out in the weeds exploring data forever.

If the field of data science interests you and you want to expand your skill set and get into it professionally, consider checking out Springboard's Data Science Career Track. In that course, Springboard guides you through all of the key concepts in depth, with a 1:1 expert mentor paired with you to support you along the way.

I've written other articles that frame data science projects around business problems and walk through the technical approaches to solving them. Check them out if you're interested!

Happy coding!

Source: https://www.freecodecamp.org/news/web-scraping-sci-fi-movies-from-imdb-with-python/

#python 

Web Scraping in Python: Scraping Sci-Fi Movies From IMDB

Web Scraping in Python: Scraping Sci-Fi Movies From IMDB

Have you ever struggled to find a dataset for your data science project? If you're like me, the answer is yes.

Fortunately, there are plenty of free datasets available, but sometimes you want something more specific or bespoke. For that, web scraping is a good skill to have in your toolbox for pulling data off your favorite website.

What is covered in this article?

This article has a Python script you can use to scrape data on sci-fi movies (or whichever genre you choose!) from the IMDB website. You can then write this data to a dataframe for further exploration.

I'll conclude the article with a bit of exploratory data analysis (EDA). Through that, you'll see what other data science projects you could try.

Disclaimer: while web scraping is a great way to programmatically pull data from websites, please do so responsibly. My script uses the sleep function, for example, to intentionally slow down the scraping requests so as not to overload IMDB's servers. Some websites frown upon the use of web scrapers, so use this wisely.

Web scraping and data cleaning script

Let's get to the scraping script and get it working. The script pulls movie titles, years, ratings (PG-13, R, and so on), genres, runtimes, reviews, and votes for each movie. You can choose how many pages you want to scrape based on your data needs.

Note: it will take longer the more pages you select. It takes 40 minutes to scrape 200 web pages using a Google Colab notebook.

For those of you who haven't tried it before, Google Colab is a cloud-based, Jupyter Notebook-style Python development tool that lives in Google's app suite. You can use it right out of the box, with many of the packages that are common in data science already installed.

Below is an image of the Colab workspace and its layout:

The Google Colab user interface

With that, let's dive in! First things first, you should always import your packages in their own cell. If you forget a package, you can re-run just that cell. This cuts down on development time.

Note: some of these packages need to be installed first by running pip install package_name. If you choose to run the code locally, using something like a Jupyter Notebook, you will need to do that. If you want to get up and running quickly, you can use a Google Colab notebook, which has all of this installed by default.
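If you do run this locally, a rough install cell (with package names assumed from the imports below) would look something like this; in Colab or Jupyter the leading ! runs it as a shell command:

!pip install requests beautifulsoup4 numpy pandas seaborn matplotlib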

from requests import get
from bs4 import BeautifulSoup
from warnings import warn
from time import sleep
from random import randint
import numpy as np, pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt  # needed for the plt.* plotting calls used later on

How to do the web scraping

You can run the following code, which does the actual web scraping. It pulls all of the columns mentioned above into arrays and fills them in one movie at a time, one page at a time.

There are also some data cleaning steps that I've added and documented in this code. For example, I removed the parentheses from the string data mentioning the movie's year and then converted those values to integers. Things like this make the exploratory analysis and modeling easier.

Note that I use the sleep function to avoid being throttled by IMDB for cycling through its web pages too quickly.

# Note this takes about 40 min to run if np.arange is set to 9951 as the stopping point.

pages = np.arange(1, 9951, 50) # Last time I tried, I could only go to 10000 items because after that the URI has no discernable pattern to combat webcrawlers; I just did 4 pages for demonstration purposes. You can increase this for your own projects.
headers = {'Accept-Language': 'en-US,en;q=0.8'} # If this is not specified, the default language is Mandarin

#initialize empty lists to store the variables scraped
titles = []
years = []
ratings = []
genres = []
runtimes = []
imdb_ratings = []
imdb_ratings_standardized = []
metascores = []
votes = []

for page in pages:
  
   #get request for sci-fi
   response = get("https://www.imdb.com/search/title?genres=sci-fi&"
                  + "start="
                  + str(page)
                  + "&explore=title_type,genres&ref_=adv_prv", headers=headers)
  
   sleep(randint(8,15))
   
   #throw warning for status codes that are not 200
   if response.status_code != 200:
       warn('Request: {}; Status code: {}'.format(page, response.status_code))

   #parse the content of current iteration of request
   page_html = BeautifulSoup(response.text, 'html.parser')
      
   movie_containers = page_html.find_all('div', class_ = 'lister-item mode-advanced')
  
   #extract the 50 movies for that page
   for container in movie_containers:

       #conditional for all with metascore
       if container.find('div', class_ = 'ratings-metascore') is not None:

           #title
           title = container.h3.a.text
           titles.append(title)

           if container.h3.find('span', class_= 'lister-item-year text-muted unbold') is not None:
            
             #year released
             year = container.h3.find('span', class_= 'lister-item-year text-muted unbold').text # remove the parentheses around the year and make it an integer
             years.append(year)

           else:
             years.append(None) # each of the additional if clauses are to handle type None data, replacing it with an empty string so the arrays are of the same length at the end of the scraping

           if container.p.find('span', class_ = 'certificate') is not None:
            
             #rating
             rating = container.p.find('span', class_= 'certificate').text
             ratings.append(rating)

           else:
             ratings.append("")

           if container.p.find('span', class_ = 'genre') is not None:
            
             #genre
             genre = container.p.find('span', class_ = 'genre').text.replace("\n", "").rstrip().split(',') # remove the whitespace character, strip, and split to create an array of genres
             genres.append(genre)
          
           else:
             genres.append("")

           if container.p.find('span', class_ = 'runtime') is not None:

             #runtime
             time = int(container.p.find('span', class_ = 'runtime').text.replace(" min", "")) # remove the minute word from the runtime and make it an integer
             runtimes.append(time)

           else:
             runtimes.append(None)

           if float(container.strong.text) is not None:

             #IMDB ratings
             imdb = float(container.strong.text) # non-standardized variable
             imdb_ratings.append(imdb)

           else:
             imdb_ratings.append(None)

           if container.find('span', class_ = 'metascore').text is not None:

             #Metascore
             m_score = int(container.find('span', class_ = 'metascore').text) # make it an integer
             metascores.append(m_score)

           else:
             metascores.append(None)

           if container.find('span', attrs = {'name':'nv'})['data-value'] is not None:

             #Number of votes
             vote = int(container.find('span', attrs = {'name':'nv'})['data-value'])
             votes.append(vote)

           else:
               votes.append(None)

Pandas dataframes take arrays of data for each of their columns as input, in key:value pairs. I did a couple of extra data cleaning steps here to finish things off.

After running the next cell, you should have a dataframe containing the data you scraped.

sci_fi_df = pd.DataFrame({'movie': titles,
                      'year': years,
                      'rating': ratings,
                      'genre': genres,
                      'runtime_min': runtimes,
                      'imdb': imdb_ratings,
                      'metascore': metascores,
                      'votes': votes}
                      )

sci_fi_df.loc[:, 'year'] = sci_fi_df['year'].str[-5:-1] # two more data transformations after scraping
# Drop 'ovie' bug
# Make year an int
sci_fi_df['n_imdb'] = sci_fi_df['imdb'] * 10
final_df = sci_fi_df.loc[sci_fi_df['year'] != 'ovie'] # One small issue with the scrape on these two movies so just dropping those ones.
final_df.loc[:, 'year'] = pd.to_numeric(final_df['year'])

Exploratory data analysis

Now that you have the data, one of the first things you may want to do is learn more about it at a high level. The following commands give a useful first look at any dataset, and we'll use them next:

final_df.head()

This command shows you the first 5 rows of your dataframe. It helps you see that nothing looks weird and that everything is ready for analysis. You can see the output here:

The first five rows of data, generated with the final_df.head() command

final_df.describe()

This command gives you the mean, standard deviation, and other summaries. Count can show you whether there are null values in some of the columns, which is useful to know. The year column, for example, shows the range of movies scraped, from 1927 to 2022.

You can see the output below and inspect the others:

Running final_df.describe() produces summary statistics showing the number of data points, averages, standard deviations, and more

final_df.info()

This command lets you know the data types you are working with in each of your columns.

As a data scientist, this information can be useful to you. Certain functions and methods need certain data types, and you can also make sure your underlying data types are in a format that makes sense for what they represent.

For example: a 5-star rating should be a float or an integer (if decimals aren't allowed). It shouldn't be a string, since it is a number. Here is a summary of what the data format of each variable was after scraping:

The output of running final_df.info() shows you how many values you have in each column and what their data types are
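If a column comes back from the scrape with a type that doesn't make sense, pandas can coerce it. This is just a small illustration using the columns created above (the cleaning cell earlier already handles the year column):

# Check the current dtypes, then coerce a column that should be numeric.
print(final_df.dtypes)
final_df['year'] = pd.to_numeric(final_df['year'], errors='coerce')  # bad values become NaN instead of raising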

The next command for learning more about your variables produces a heatmap. The heatmap shows the correlation between all of your quantitative variables. This is a quick way to assess relationships that may exist between variables. I like to see the coefficients rather than trying to decode the color scale, so I use the annot=True argument.

sns.heatmap(final_df.corr(), annot=True);

The command above produces the following visualization using the Seaborn data visualization package:

A heatmap of the correlations after running sns.heatmap(final_df.corr(), annot=True);

You can see that the strongest correlation is between the IMDB score and the metascore. This is not surprising, since two movie rating systems are likely to rate similarly.

The next strongest correlation you can see is between the IMDB rating and the number of votes. This is interesting because, as the number of votes increases, you have a more representative sample of the population's rating. It is odd, though, that the association between the two is weak.

The number of votes also increases roughly as runtime increases.

You can also see a slight negative association between IMDB score or metascore and the year the movie came out. We'll look at this shortly.

You can check some of these relationships visually through a scatter plot with this code:

x = final_df['n_imdb']
y = final_df['votes']
plt.scatter(x, y, alpha=0.5) # s= is size var, c= is color var
plt.xlabel("IMDB Rating Standardized")
plt.ylabel("Number of Votes")
plt.title("Number of Votes vs. IMDB Rating")
plt.ticklabel_format(style='plain')
plt.show()

This results in the following visualization:

IMDB ratings vs. number of votes

The association above shows some outliers. In general, we see a greater number of votes on movies with an IMDB rating of 85 or higher. There are fewer reviews of movies rated 75 or lower.
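To put rough numbers on that split, you could compare vote counts above and below a threshold on the standardized IMDB score. A quick sketch using the columns defined above:

# Compare vote counts for higher- vs. lower-rated movies (n_imdb is on a 0-100 scale).
high = final_df[final_df['n_imdb'] >= 85]['votes']
low = final_df[final_df['n_imdb'] < 75]['votes']
print("Movies rated 85+:", len(high), "| median votes:", high.median())
print("Movies rated <75:", len(low), "| median votes:", low.median())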

Drawing these boxes around the data can show you what I mean. There are roughly two groupings of different magnitudes:

Two main groupings in the data

Another thing that might be interesting to look at is how many movies there are of each rating. This can show you where sci-fi tends to land in the ratings data. Use this code to get a bar chart of the ratings:

ax = final_df['rating'].value_counts().plot(kind='bar',
                                   figsize=(14,8),
                                   title="Number of Movies by Rating")
ax.set_xlabel("Rating")
ax.set_ylabel("Number of Movies")
ax.plot();

That code produces this chart, which shows us that R and PG-13 make up the majority of these sci-fi movies on IMDB.

Number of movies by rating

I saw that there were a few movies rated "Approved" and was curious what that was. You can filter the dataframe with this code to dig into that:

final_df[final_df['rating'] == 'Approved']

This revealed that most of these movies were made before the 80s:

All of the "Approved" rated movies

I went to the MPAA website and there was no mention of this rating on their ratings information page. It must have been retired at some point.

You can also check whether any years or decades outperformed the others in reviews. I took the average metascore by year and plotted it with the following code to explore further:

# What are the average metascores by year?
final_df.groupby('year')['metascore'].mean().plot(kind='bar', figsize=(16,8), title="Avg. Metascore by Year", xlabel="Year", ylabel="Avg. Metascore")
plt.xticks(rotation=90)
plt.plot();

This results in the following chart:

Avg. Metascore by movie year

Now, I'm not saying I know why, but there is a slight, gradual decline in the average metascore variable as you move forward through history. It looks like ratings have leveled off around 55-60 over the last couple of decades. This could be because we have more data on newer movies, or because newer movies tend to get reviewed more.

final_df['year'].value_counts().plot(kind='bar', figsize=[20,9])

Run the code above and you will see that the movie from 1927 had a sample of only 1 review. That score is therefore skewed and over-inflated. You will also see that more recent movies are better represented in reviews, as I suspected:

Number of movies by year

Data science project ideas to take this further

You have text, categorical, and numeric variables here, so there are a few options you could try to explore further.

One thing you could do is use Natural Language Processing (NLP) to see whether there are any naming conventions for movie ratings, or within the world of sci-fi (or whatever genre you chose, if you went with a different one!).

You could also change the web scraping code to include many more genres. With that, you could build a new cross-genre database to see whether there are naming conventions by genre.

You could then try to predict a movie's genre from its name. You could also try to predict the IMDB rating based on the genre or the year the movie came out. The latter idea would work better over the last few decades, since that is where most of the observations are.
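As a very rough sketch of that last idea, you could fit a simple regression of the standardized IMDB score on release year with scikit-learn, restricted to recent decades as suggested; treat it as a starting point rather than a finished model:

# Sketch only: regress the standardized IMDB score on release year for recent movies.
from sklearn.linear_model import LinearRegression

recent = final_df[final_df['year'] >= 1990].dropna(subset=['year', 'n_imdb'])
X = recent[['year']]   # scikit-learn expects a 2D feature matrix
y = recent['n_imdb']

reg = LinearRegression().fit(X, y)
print("Slope per year:", reg.coef_[0], "R^2:", reg.score(X, y))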

I hope this tutorial has sparked your curiosity about the world of data science and what is possible!

You will find in exploratory data analysis that there are always more questions to ask. Working with that constraint is about prioritizing based on business goals. It's important to start with those goals up front, or you could be out in the weeds exploring data forever.

If the field of data science interests you and you want to expand your skill set and get into it professionally, consider checking out Springboard's Data Science Career Track. In that course, Springboard guides you through all of the key concepts in depth, with a 1:1 expert mentor paired with you to support you along the way.

I've written other articles that frame data science projects around business problems and walk through the technical approaches to solving them. Check them out if you're interested!

Happy coding!

Source: https://www.freecodecamp.org/news/web-scraping-sci-fi-movies-from-imdb-with-python/

#python 

Web Scraping in Python: Scraping Sci-Fi Movies From IMDB
Elian  Harber

Elian Harber

1652428740

Xurls: Extract Urls From Text

xurls

Extract urls from text using regular expressions. Requires Go 1.17 or later.

import "mvdan.cc/xurls/v2"

func main() {
    rxRelaxed := xurls.Relaxed()
    rxRelaxed.FindString("Do gophers live in golang.org?")  // "golang.org"
    rxRelaxed.FindString("This string does not have a URL") // ""

    rxStrict := xurls.Strict()
    rxStrict.FindAllString("must have scheme: http://foo.com/.", -1) // []string{"http://foo.com/"}
    rxStrict.FindAllString("no scheme, no match: foo.com", -1)       // []string{}
}

Since the API is centered around regexp.Regexp, many other methods are available, such as finding the byte indexes for all matches.

The regular expressions are compiled when the API is first called. Any subsequent calls will use the same regular expression pointers.

cmd/xurls

To install the tool globally:

go install mvdan.cc/xurls/v2/cmd/xurls@latest
$ echo "Do gophers live in http://golang.org?" | xurls
http://golang.org

Author: Mvdan
Source Code: https://github.com/mvdan/xurls 
License: BSD-3-Clause license

#go #golang #extract #text 

Xurls: Extract Urls From Text

Face Recognition with OpenCV and Python

Introduction

What is face recognition? Or what is recognition? When you look at an apple, your mind immediately tells you that this is an apple. This process, your mind telling you that this is an apple, is recognition in simple words. So what is face recognition then? I am sure you have guessed it right. When you look at your friend walking down the street, or at a picture of him, you recognize that he is your friend Paulo. Interestingly, when you look at your friend or a picture of him, you look at his face first before looking at anything else. Ever wondered why you do that? This is so that you can recognize him by looking at his face. Well, this is you doing face recognition.

But the real question is, how does face recognition work? It is quite simple and intuitive. Take a real-life example: when you meet someone for the first time in your life, you don't recognize him, right? While he talks or shakes hands with you, you look at his face, eyes, nose, mouth, color, and overall look. This is your mind learning or training for face recognition of that person by gathering face data. Then he tells you that his name is Paulo. At this point your mind knows that the face data it just learned belongs to Paulo. Now your mind is trained and ready to do face recognition on Paulo's face. The next time you see Paulo or his face in a picture, you will immediately recognize him. This is how face recognition works. The more you meet Paulo, the more data your mind will collect about Paulo, especially his face, and the better you will become at recognizing him.

Now the next question is how to code face recognition with OpenCV; after all, this is the only reason why you are reading this article, right? OK then. You might say that our mind can do these things easily, but isn't it difficult to actually code them into a computer? Don't worry, it is not. Thanks to OpenCV, coding face recognition is easier than it sounds. The coding steps for face recognition are the same as the ones we discussed in the real-life example above.

  • Training Data Gathering: Gather face data (face images in this case) of the persons you want to recognize
  • Training of Recognizer: Feed that face data (and respective names of each face) to the face recognizer so that it can learn.
  • Recognition: Feed new faces of the persons and see if the face recognizer you just trained recognizes them.

OpenCV comes equipped with a built-in face recognizer; all you have to do is feed it the face data. It's that simple, and this is how it will look once we are done coding it.

visualization

OpenCV Face Recognizers

OpenCV has three built-in face recognizers and, thanks to OpenCV's clean coding, you can use any of them by just changing a single line of code. Below are the names of those face recognizers and their OpenCV calls.

  1. EigenFaces Face Recognizer - cv2.face.createEigenFaceRecognizer()
  2. FisherFaces Face Recognizer - cv2.face.createFisherFaceRecognizer()
  3. Local Binary Patterns Histograms (LBPH) Face Recognizer - cv2.face.createLBPHFaceRecognizer()

We have got three face recognizers, but do you know which one to use and when? Or which one is better? I guess not. So why not go through a brief summary of each, what do you say? I am assuming you said yes :) So let's dive into the theory of each.

EigenFaces Face Recognizer

This algorithm considers the fact that not all parts of a face are equally important or equally useful. When you look at someone, you recognize them by their distinct features, like the eyes, nose, cheeks, and forehead, and how they vary with respect to each other. So you are actually focusing on the areas of maximum change (mathematically speaking, this change is variance) of the face. For example, from the eyes to the nose there is a significant change, and the same is the case from the nose to the mouth. When you look at multiple faces, you compare them by looking at these parts of the faces because these parts are the most useful and important components of a face. Important because they catch the maximum change among faces, change that helps you differentiate one face from the other. This is exactly how the EigenFaces face recognizer works.

The EigenFaces face recognizer looks at all the training images of all the persons as a whole and tries to extract the components which are important and useful (the components that catch the maximum variance/change) and discards the rest. This way it not only extracts the important components from the training data but also saves memory by discarding the less important ones. The important components it extracts are called principal components. Below is an image showing the principal components extracted from a list of faces.

Principal components

You can see that the principal components actually represent faces. These faces are called eigenfaces, and hence the name of the algorithm.

So this is how the EigenFaces face recognizer trains itself (by extracting principal components). Remember, it also keeps a record of which principal component belongs to which person. One thing to note in the above image is that the Eigenfaces algorithm also considers illumination as an important component.

Later, during recognition, when you feed a new image to the algorithm, it repeats the same process on that image. It extracts the principal components from that new image, compares them with the list of components it stored during training, finds the best match, and returns the person label associated with that best-match component.
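OpenCV handles all of this internally, but to make the idea of principal components concrete, here is a small sketch using scikit-learn's PCA on flattened grayscale faces. It assumes a list called faces of equally sized grayscale face crops; that sizing assumption is mine and is not guaranteed by the detection code later in this article.

# Illustration only: compute "eigenfaces" with PCA, not OpenCV's internal implementation.
import numpy as np
from sklearn.decomposition import PCA

# `faces` is assumed to be a list of equally sized grayscale face images (2D numpy arrays).
h, w = faces[0].shape
X = np.array([f.reshape(-1) for f in faces])      # each face flattened into one row

pca = PCA(n_components=10)                        # keep the 10 directions of maximum variance
pca.fit(X)

eigenfaces = pca.components_.reshape((10, h, w))  # each component can be displayed as a face-like image
projection = pca.transform(X[:1])                 # one face summarized by just 10 numbers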

Easy peasy, right? Next one is easier than this one.

FisherFaces Face Recognizer

This algorithm is an improved version of the EigenFaces face recognizer. The Eigenfaces face recognizer looks at all the training faces of all the persons at once and finds principal components from all of them combined. By capturing principal components from all of them combined, you are not focusing on the features that discriminate one person from another, but on the features that represent all the persons in the training data as a whole.

This approach has drawbacks. For example, images with sharp changes (like lighting changes, which are not a useful feature at all) may dominate the rest of the images, and you may end up with features that come from an external source like light and are not useful for discrimination at all. In the end, your principal components will represent light changes and not the actual face features.

The Fisherfaces algorithm, instead of extracting useful features that represent all the faces of all the persons, extracts useful features that discriminate one person from the others. This way, the features of one person do not dominate over the others, and you have the features that discriminate one person from another.

Below is an image of features extracted using Fisherfaces algorithm.

Fisher faces

You can see that the features extracted actually represent faces, and these faces are called fisher faces, hence the name of the algorithm.

One thing to note here is that even with the Fisherfaces algorithm, if multiple persons have images with sharp changes due to external sources like light, those changes will dominate over other features and affect recognition accuracy.

Getting bored with this theory? Don't worry, only one face recognizer is left and then we will dive deep into the coding part.

Local Binary Patterns Histograms (LBPH) Face Recognizer

I wrote a detailed explanation of Local Binary Patterns Histograms in my previous article on face detection using local binary patterns histograms. So here I will just give a brief overview of how it works.

We know that Eigenfaces and Fisherfaces are both affected by light, and in real life we can't guarantee perfect light conditions. The LBPH face recognizer is an improvement that overcomes this drawback.

The idea is not to look at the image as a whole but instead to find the local features of an image. The LBPH algorithm tries to find the local structure of an image, and it does that by comparing each pixel with its neighboring pixels.

Take a 3x3 window and move it across the image. At each move (each local part of the image), compare the pixel at the center with its neighboring pixels. The neighbors with an intensity value less than or equal to the center pixel are denoted by 1 and the others by 0. Then you read these 0/1 values under the 3x3 window in clockwise order, and you get a binary pattern like 11100011 that is local to some area of the image. You do this over the whole image and you will have a list of local binary patterns.

LBP labeling

Now you see why this algorithm has Local Binary Patterns in its name: because you get a list of local binary patterns. You may be wondering, what about the histogram part of LBPH? Well, after you get a list of local binary patterns, you convert each binary pattern into a decimal number (as shown in the above image) and then you make a histogram of all of those values. A sample histogram looks like this.

Sample histogram

I guess this answers the question about the histogram part. So in the end you will have one histogram for each face image in the training data set. That means if there were 100 images in the training data set, then LBPH will extract 100 histograms after training and store them for later recognition. Remember, the algorithm also keeps track of which histogram belongs to which person.

Later, during recognition, when you feed a new image to the recognizer, it will generate a histogram for that new image, compare it with the histograms it already has, find the best-matching histogram, and return the person label associated with that best match.
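OpenCV's LBPH recognizer does all of this for you, but a bare-bones numpy sketch of the idea (following this article's convention that neighbors less than or equal to the center become 1, and collapsing the whole image into a single histogram for brevity) could look like this:

# Sketch only: LBP codes, one histogram per image, and nearest-histogram matching.
import numpy as np

def lbp_image(gray):
    #compute an LBP code for every interior pixel of a grayscale image (2D array)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]  # clockwise
    h, w = gray.shape
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            center = gray[i, j]
            bits = [1 if gray[i + di, j + dj] <= center else 0 for di, dj in offsets]
            codes[i - 1, j - 1] = int("".join(map(str, bits)), 2)  # binary pattern -> decimal
    return codes

def lbp_histogram(gray):
    #one normalized 256-bin histogram describing the whole image
    hist, _ = np.histogram(lbp_image(gray), bins=256, range=(0, 256))
    return hist / hist.sum()

def best_match(test_hist, train_hists):
    #index of the stored histogram closest to the test histogram (chi-square distance)
    eps = 1e-10
    dists = [np.sum((test_hist - h) ** 2 / (test_hist + h + eps)) for h in train_hists]
    return int(np.argmin(dists))

The real LBPH recognizer also splits each image into a grid and concatenates per-cell histograms, which keeps some spatial information that a single whole-image histogram throws away.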

Below is a list of faces and their respective local binary patterns images. You can see that the LBP images are not affected by changes in light conditions.

LBP faces

The theory part is over and now comes the coding part! Ready to dive into coding? Let's get into it then.

Coding Face Recognition with OpenCV

The Face Recognition process in this tutorial is divided into three steps.

  1. Prepare training data: In this step we will read training images for each person/subject along with their labels, detect faces from each image and assign each detected face an integer label of the person it belongs to.
  2. Train Face Recognizer: In this step we will train OpenCV's LBPH face recognizer by feeding it the data we prepared in step 1.
  3. Testing: In this step we will pass some test images to face recognizer and see if it predicts them correctly.

[There should be a visualization diagram for above steps here]

To detect faces, I will use the code from my previous article on face detection. So if you have not read it, I encourage you to do so to understand how face detection works and its Python coding.

Import Required Modules

Before starting the actual coding we need to import the required modules for coding. So let's import them first.

  • cv2: is OpenCV module for Python which we will use for face detection and face recognition.
  • os: We will use this Python module to read our training directories and file names.
  • numpy: We will use this module to convert Python lists to numpy arrays as OpenCV face recognizers accept numpy arrays.
#import OpenCV module
import cv2
#import os module for reading training data directories and paths
import os
#import numpy to convert python lists to numpy arrays as 
#it is needed by OpenCV face recognizers
import numpy as np

#matplotlib for display our images
import matplotlib.pyplot as plt
%matplotlib inline 

Training Data

The more images used in training the better. Normally a lot of images are used for training a face recognizer so that it can learn different looks of the same person, for example with glasses, without glasses, laughing, sad, happy, crying, with beard, without beard etc. To keep our tutorial simple we are going to use only 12 images for each person.

So our training data consists of a total of 2 persons, with 12 images of each person. All training data is inside the training-data folder, which contains one folder for each person. Each folder is named with the format sLabel (e.g. s1, s2), where Label is the integer label assigned to that person. For example, the folder named s1 contains the images for person 1. The directory structure tree for the training data is as follows:

training-data
|-------------- s1
|               |-- 1.jpg
|               |-- ...
|               |-- 12.jpg
|-------------- s2
|               |-- 1.jpg
|               |-- ...
|               |-- 12.jpg

The test-data folder contains images that we will use to test our face recognizer after it has been successfully trained.

As the OpenCV face recognizer accepts labels as integers, we need to define a mapping between integer labels and persons' actual names. So below I am defining a mapping of each person's integer label to their respective name.

Note: As we have not assigned label 0 to any person, the mapping for label 0 is empty.

#there is no label 0 in our training data so subject name for index/label 0 is empty
subjects = ["", "Tom Cruise", "Shahrukh Khan"]

Prepare training data

You may be wondering why data preparation, right? Well, the OpenCV face recognizer accepts data in a specific format. It accepts two vectors: one vector of the faces of all the persons and a second vector of integer labels for each face, so that when processing a face the face recognizer knows which person that particular face belongs to.

For example, if we had 2 persons and 2 images for each person.

PERSON-1    PERSON-2   

img1        img1         
img2        img2

Then the prepare data step will produce following face and label vectors.

FACES                        LABELS

person1_img1_face              1
person1_img2_face              1
person2_img1_face              2
person2_img2_face              2

The prepare data step can be further divided into the following sub-steps.

  1. Read all the folder names of subjects/persons provided in training data folder. So for example, in this tutorial we have folder names: s1, s2.
  2. For each subject, extract label number. Do you remember that our folders have a special naming convention? Folder names follow the format sLabel where Label is an integer representing the label we have assigned to that subject. So for example, folder name s1 means that the subject has label 1, s2 means subject label is 2 and so on. The label extracted in this step is assigned to each face detected in the next step.
  3. Read all the images of the subject, detect face from each image.
  4. Add each face to faces vector with corresponding subject label (extracted in above step) added to labels vector.

[There should be a visualization for above steps here]

Did you read my last article on face detection? No? Then you better do so right now because to detect faces, I am going to use the code from my previous article on face detection. So if you have not read it, I encourage you to do so to understand how face detection works and its coding. Below is the same code.

#function to detect face using OpenCV
def detect_face(img):
    #convert the test image to gray image as opencv face detector expects gray images
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    
    #load OpenCV face detector, I am using LBP which is fast
    #there is also a more accurate but slow Haar classifier
    face_cascade = cv2.CascadeClassifier('opencv-files/lbpcascade_frontalface.xml')

    #let's detect multiscale (some images may be closer to camera than others) images
    #result is a list of faces
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=5);
    
    #if no faces are detected then return original img
    if (len(faces) == 0):
        return None, None
    
    #under the assumption that there will be only one face,
    #extract the face area
    (x, y, w, h) = faces[0]
    
    #return only the face part of the image
    return gray[y:y+h, x:x+w], faces[0]

I am using OpenCV's LBP face detector. First I convert the image to grayscale, because most operations in OpenCV are performed in grayscale, and then I load the LBP face detector using the cv2.CascadeClassifier class. After that I use the detectMultiScale method to detect all the faces in the image. From the detected faces I only pick the first one, under the assumption that there will be only one prominent face per image. As the faces returned by the detectMultiScale method are actually rectangles (x, y, width, height) and not actual face images, we have to extract the face image area from the main image. So finally I extract the face area from the gray image and return both the face image area and the face rectangle.

Now you have got a face detector and you know the 4 steps to prepare the data, so are you ready to code the prepare data step? Yes? So let's do it.

#this function will read all persons' training images, detect face from each image
#and will return two lists of exactly same size, one list 
# of faces and another list of labels for each face
def prepare_training_data(data_folder_path):
    
    #------STEP-1--------
    #get the directories (one directory for each subject) in data folder
    dirs = os.listdir(data_folder_path)
    
    #list to hold all subject faces
    faces = []
    #list to hold labels for all subjects
    labels = []
    
    #let's go through each directory and read images within it
    for dir_name in dirs:
        
        #our subject directories start with letter 's' so
        #ignore any non-relevant directories if any
        if not dir_name.startswith("s"):
            continue;
            
        #------STEP-2--------
        #extract label number of subject from dir_name
        #format of dir name = slabel
        #, so removing letter 's' from dir_name will give us label
        label = int(dir_name.replace("s", ""))
        
        #build path of directory containing images for the current subject
        #sample subject_dir_path = "training-data/s1"
        subject_dir_path = data_folder_path + "/" + dir_name
        
        #get the images names that are inside the given subject directory
        subject_images_names = os.listdir(subject_dir_path)
        
        #------STEP-3--------
        #go through each image name, read image, 
        #detect face and add face to list of faces
        for image_name in subject_images_names:
            
            #ignore system files like .DS_Store
            if image_name.startswith("."):
                continue;
            
            #build image path
            #sample image path = training-data/s1/1.jpg
            image_path = subject_dir_path + "/" + image_name

            #read image
            image = cv2.imread(image_path)
            
            #display an image window to show the image 
            cv2.imshow("Training on image...", image)
            cv2.waitKey(100)
            
            #detect face
            face, rect = detect_face(image)
            
            #------STEP-4--------
            #for the purpose of this tutorial
            #we will ignore faces that are not detected
            if face is not None:
                #add face to list of faces
                faces.append(face)
                #add label for this face
                labels.append(label)
            
    cv2.destroyAllWindows()
    cv2.waitKey(1)
    cv2.destroyAllWindows()
    
    return faces, labels

I have defined a function that takes the path where the training subjects' folders are stored as a parameter. This function follows the same 4 prepare-data sub-steps mentioned above.

(step-1) I use the os.listdir method to read the names of all the folders stored on the path passed to the function, and I define empty faces and labels lists.

(step-2) After that I traverse the subjects' folder names and extract the label information from each of them. As folder names follow the sLabel naming convention, removing the letter s from a folder name gives us the label assigned to that subject.

(step-3) Next I read all the image names of the current subject being traversed and go through those images one by one. I am using OpenCV's imshow(window_title, image) along with OpenCV's waitKey(interval) method to display the current image being traversed. The waitKey(interval) method pauses the code flow for the given interval (in milliseconds); I am using it with a 100ms interval so that we can view each image window for 100ms. I then detect the face in the current image.

(step-4) Finally, I add the detected face and its label to their respective lists.

But a function can't do anything unless we call it on some data that it has to prepare, right? Don't worry, I have got data of two beautiful and famous celebrities. I am sure you will recognize them!


Let's call this function on images of these beautiful celebrities to prepare data for training of our Face Recognizer. Below is a simple code to do that.

#let's first prepare our training data
#data will be in two lists of same size
#one list will contain all the faces
#and other list will contain respective labels for each face
print("Preparing data...")
faces, labels = prepare_training_data("training-data")
print("Data prepared")

#print total faces and labels
print("Total faces: ", len(faces))
print("Total labels: ", len(labels))
Preparing data...
Data prepared
Total faces:  23
Total labels:  23

This was probably the boring part, right? Don't worry, the fun stuff is coming up next. It's time to train our own face recognizer so that, once trained, it can recognize new faces of the persons it was trained on. Ready? OK, then let's train our face recognizer.

Train Face Recognizer

As we know, OpenCV comes equipped with three face recognizers.

  1. EigenFace Recognizer: This can be created with cv2.face.createEigenFaceRecognizer()
  2. FisherFace Recognizer: This can be created with cv2.face.createFisherFaceRecognizer()
  3. Local Binary Patterns Histograms (LBPH): This can be created with cv2.face.createLBPHFaceRecognizer()

I am going to use the LBPH face recognizer, but you can use any face recognizer of your choice. No matter which of OpenCV's face recognizers you use, the code will remain the same. You just have to change one line: the face recognizer initialization line given below.

#create our LBPH face recognizer 
face_recognizer = cv2.face.createLBPHFaceRecognizer()

#or use EigenFaceRecognizer by replacing above line with 
#face_recognizer = cv2.face.createEigenFaceRecognizer()

#or use FisherFaceRecognizer by replacing above line with 
#face_recognizer = cv2.face.createFisherFaceRecognizer()

Now that we have initialized our face recognizer and we also have prepared our training data, it's time to train the face recognizer. We will do that by calling the train(faces-vector, labels-vector) method of face recognizer.

#train our face recognizer of our training faces
face_recognizer.train(faces, np.array(labels))

Did you notice that instead of passing the labels vector directly to the face recognizer, I am first converting it to a numpy array? This is because OpenCV expects the labels vector to be a numpy array.

Still not satisfied? Want to see some action? Next step is the real action, I promise!

Prediction

Now comes my favorite part, the prediction part. This is where we actually get to see if our algorithm is recognizing our trained subjects' faces or not. We will take two test images of our celebrities, detect faces in each of them, and then pass those faces to our trained face recognizer to see if it recognizes them.

Below are some utility functions that we will use for drawing a bounding box (rectangle) around a face and putting the celebrity's name near the face bounding box.

#function to draw rectangle on image 
#according to given (x, y) coordinates and 
#given width and height
def draw_rectangle(img, rect):
    (x, y, w, h) = rect
    cv2.rectangle(img, (x, y), (x+w, y+h), (0, 255, 0), 2)
    
#function to draw text on give image starting from
#passed (x, y) coordinates. 
def draw_text(img, text, x, y):
    cv2.putText(img, text, (x, y), cv2.FONT_HERSHEY_PLAIN, 1.5, (0, 255, 0), 2)

The first function, draw_rectangle, draws a rectangle on an image based on the passed rectangle coordinates. It uses OpenCV's built-in function cv2.rectangle(img, topLeftPoint, bottomRightPoint, rgbColor, lineWidth) to draw the rectangle. We will use it to draw a rectangle around the face detected in the test image.

The second function, draw_text, uses OpenCV's built-in function cv2.putText(img, text, startPoint, font, fontSize, rgbColor, lineWidth) to draw text on an image.

Now that we have the drawing functions, we just need to call the face recognizer's predict(face) method to test our face recognizer on test images. The following function does the prediction for us.

#this function recognizes the person in image passed
#and draws a rectangle around detected face with name of the 
#subject
def predict(test_img):
    #make a copy of the image as we don't want to change the original image
    img = test_img.copy()
    #detect face from the image
    face, rect = detect_face(img)

    #predict the image using our face recognizer
    label, confidence = face_recognizer.predict(face)  # predict() returns the label and a confidence value
    #get name of respective label returned by face recognizer
    label_text = subjects[label]
    
    #draw a rectangle around face detected
    draw_rectangle(img, rect)
    #draw name of predicted person
    draw_text(img, label_text, rect[0], rect[1]-5)
    
    return img
  • make a copy of the test image so we don't modify the original
  • detect the face in the test image
  • recognize the face by calling the face recognizer's predict(face) method, which returns the predicted label (and a confidence value)
  • get the subject name associated with that label
  • draw a rectangle around the detected face
  • draw the name of the predicted subject above the face rectangle

Now that we have the prediction function well defined, the next step is to actually call this function on our test images and display those test images to see if our face recognizer correctly recognized them. So let's do it. This is what we have been waiting for.

print("Predicting images...")

#load test images
test_img1 = cv2.imread("test-data/test1.jpg")
test_img2 = cv2.imread("test-data/test2.jpg")

#perform a prediction
predicted_img1 = predict(test_img1)
predicted_img2 = predict(test_img2)
print("Prediction complete")

#create a figure of 2 plots (one for each test image)
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))

#display test image1 result
ax1.imshow(cv2.cvtColor(predicted_img1, cv2.COLOR_BGR2RGB))

#display test image2 result
ax2.imshow(cv2.cvtColor(predicted_img2, cv2.COLOR_BGR2RGB))

#display both images
cv2.imshow("Tom cruise test", predicted_img1)
cv2.imshow("Shahrukh Khan test", predicted_img2)
cv2.waitKey(0)
cv2.destroyAllWindows()
cv2.waitKey(1)
cv2.destroyAllWindows()
Predicting images...
Prediction complete

Woohoo! Isn't it beautiful? Indeed, it is!

End Notes

Face Recognition is a fascinating idea to work on and OpenCV has made it extremely simple and easy for us to code it. It just takes a few lines of code to have a fully working face recognition application and we can switch between all three face recognizers with a single line of code change. It's that simple.

Although the EigenFaces, FisherFaces, and LBPH face recognizers are good, there are even better ways to perform face recognition, such as using Histograms of Oriented Gradients (HOGs) and neural networks. The more advanced face recognition algorithms are nowadays implemented using a combination of OpenCV and machine learning. I have plans to write some articles on those more advanced methods as well, so stay tuned!
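For a taste of those more modern approaches, the popular face_recognition package (built on dlib's HOG face detector and a deep metric-learning model) shrinks the whole pipeline to a few calls. A minimal sketch, where the image file names are placeholders you would swap for your own:

# Sketch using the third-party `face_recognition` package (pip install face_recognition), not OpenCV.
import face_recognition

known_image = face_recognition.load_image_file("person1_known.jpg")      # placeholder file name
unknown_image = face_recognition.load_image_file("person1_unknown.jpg")  # placeholder file name

# Each encoding is a 128-dimensional vector produced by a deep network.
known_encoding = face_recognition.face_encodings(known_image)[0]
unknown_encoding = face_recognition.face_encodings(unknown_image)[0]

# True if the two faces are likely the same person.
print(face_recognition.compare_faces([known_encoding], unknown_encoding))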

Download Details:
Author: informramiz
Source Code: https://github.com/informramiz/opencv-face-recognition-python
License: MIT License

#opencv  #python #facerecognition 

Face Recognition with OpenCV and Python