Dataset creation for beginners using Software

INTRODUCTION

WHY THIS ARTICLE?

This article is the second part of the web-scraping series.

As I mentioned in my first article, I chose to write about scraping because, while building my project Fake-News Detection System, it took me days of research, as I wasn't able to find a dataset that matched my needs.

So, if you didn't go through my first article, I would strongly recommend reading it once, especially if you have a programming background.

WHO IS THIS ARTICLE USEFUL FOR?

For users with a programming background, especially those who know Python, I have already written a blog, and I would suggest scraping with Python instead of any software: I find it easier than spending days understanding the interface of a particular tool.

But for those of you who don't have a programming background, you can follow along with me and get familiar with the interface and workings of this software.


OVERVIEW

This article covers the second part of the series: scraping web pages using software, specifically Octoparse.

However, there are many tools that you can easily find on the internet for automating this purpose, such as:

  • ParseHub
  • ScrapeSimple
  • Diffbot
  • Mozenda


Brief Introduction to different automated software:

1. ParseHub:

Purpose: Parsehub is a phenomenal tool for building web scrapers without coding to extract tremendous data. It is used by data scientists, data journalists, data analysts, E-commerce websites, job boards, marketing & sales, finance & many more.

Features: Its interface is dead simple to use: you can build web scrapers simply by clicking on the data you want. It then exports the data in JSON or Excel format. It has many handy features such as automatic IP rotation, scraping behind login walls, navigating dropdowns and tabs, extracting data from tables and maps, and much more. In addition, it has a generous free tier, allowing users to scrape up to 200 pages of data in just 40 minutes! ParseHub also provides desktop clients for Windows, Mac OS, and Linux, so you can use it from your computer no matter what system you're running.

2. ScrapeSimple:

Purpose: ScrapeSimple is the perfect service for people who want a custom scraper built for them. Web scraping is made as simple as filling out a form with instructions for what kind of data you want.

Features: ScrapeSimple lives up to its name with a fully managed service that builds and maintains custom web scrapers for customers. Just tell them what information you need from which sites, and they will design a custom web scraper to deliver the information to you periodically (daily, weekly, monthly, or on whatever schedule you like) in CSV format directly to your inbox. This service is perfect for businesses that just want an HTML scraper without needing to write any code themselves. Response times are quick and the support is incredibly friendly and helpful, making it ideal for people who just want the full data extraction process taken care of for them.

3. Diffbot:

Purpose: Enterprises that have specific data crawling and screen scraping needs, particularly those who scrape websites that often change their HTML structure.

Features: Diffbot is different from most page scraping tools out there in that it uses computer vision (instead of HTML parsing) to identify relevant information on a page. This means that even if the HTML structure of a page changes, your web scrapers will not break as long as the page looks the same visually. This is an incredible feature for long-running mission-critical web scraping jobs. While they may be a bit pricey (the cheapest plan is $299/month), they do a great job offering a premium service that may make it worth it for large customers.

4. Mozenda:

Purpose: Enterprises looking for a cloud-based self serve webpage scraping platform need look no further. With over 7 billion pages scraped, Mozenda has experience in serving enterprise customers from all around the world.

Features: Mozenda allows enterprise customers to run web scrapers on its robust cloud platform. They set themselves apart with their customer service (providing both phone and email support to all paying customers). The platform is highly scalable and allows for on-premise hosting as well. Like Diffbot, they are a bit pricey: their lowest plans start at $250/month.

  • In this article, however, I am going to talk about Octoparse in detail, since that is the one I have used.

OCTOPARSE

Purpose: Octoparse is a fantastic tool for people who want to extract data from websites without having to code, while still having control over the full process through its easy-to-use interface.

Features: Octoparse is the perfect tool for people who want to scrape websites without learning to code. It features a point-and-click screen scraper, allowing users to scrape behind login forms, fill in forms, input search terms, scroll through infinite scroll, render JavaScript, and more. It also includes a site parser and a hosted solution for users who want to run their scrapers in the cloud. Best of all, it comes with a generous free tier allowing users to build up to 10 crawlers for free. For enterprise-level customers, they also offer fully customized crawlers and managed solutions where they take care of running everything for you and deliver the data to you directly.

Step-by-step explanation: extracting data from thousands of news articles

Step-1: Download Octoparse

  • Download and install the software, following the community's guidelines.

Step-2: Sign-up

  • After you have finished downloading and installing it, sign up for an account if you haven't created one before.

Step-3: Explore it

  • Before starting on your own, I strongly recommend exploring its different sections; this will ultimately help you interact with the interface while working on it later.
  • Go through the popular templates section: it contains templates for popular websites, and you might find your required data there.
  • Go through the tutorials on both template mode and advanced mode.

Step-4: Enter URL

  • If you want to scrape data from just one website, you can simply paste the copied URL on the home page and click Start.

#data-science #web-scraping-tools #web-scraping #beginner #data-analysis


Chloe Butler


Pdf2gerb: Perl Script Converts PDF Files to Gerber format


Pdf2Gerb generates Gerber 274X photoplotting and Excellon drill files from PDFs of a PCB. Up to three PDFs are used: the top copper layer, the bottom copper layer (for 2-sided PCBs), and an optional silk screen layer. The PDFs can be created directly from any PDF drawing software, or a PDF print driver can be used to capture the Print output if the drawing software does not directly support output to PDF.

The general workflow is as follows:

  1. Design the PCB using your favorite CAD or drawing software.
  2. Print the top and bottom copper and top silk screen layers to a PDF file.
  3. Run Pdf2Gerb on the PDFs to create Gerber and Excellon files.
  4. Use a Gerber viewer to double-check the output against the original PCB design.
  5. Make adjustments as needed.
  6. Submit the files to a PCB manufacturer.

Please note that Pdf2Gerb does NOT perform DRC (Design Rule Checks), as these will vary according to individual PCB manufacturer conventions and capabilities. Also note that Pdf2Gerb is not perfect, so the output files must always be checked before submitting them. As of version 1.6, Pdf2Gerb supports most PCB elements, such as round and square pads, round holes, traces, SMD pads, ground planes, no-fill areas, and panelization. However, because it interprets the graphical output of a Print function, there are limitations in what it can recognize (or there may be bugs).

See docs/Pdf2Gerb.pdf for install/setup, config, usage, and other info.


pdf2gerb_cfg.pm

#Pdf2Gerb config settings:
#Put this file in same folder/directory as pdf2gerb.pl itself (global settings),
#or copy to another folder/directory with PDFs if you want PCB-specific settings.
#There is only one user of this file, so we don't need a custom package or namespace.
#NOTE: all constants defined in here will be added to main namespace.
#package pdf2gerb_cfg;

use strict; #trap undef vars (easier debug)
use warnings; #other useful info (easier debug)


##############################################################################################
#configurable settings:
#change values here instead of in main pdf2gerb.pl file

use constant WANT_COLORS => ($^O !~ m/Win/); #ANSI colors no worky on Windows? this must be set < first DebugPrint() call

#just a little warning; set realistic expectations:
#DebugPrint("${\(CYAN)}Pdf2Gerb.pl ${\(VERSION)}, $^O O/S\n${\(YELLOW)}${\(BOLD)}${\(ITALIC)}This is EXPERIMENTAL software.  \nGerber files MAY CONTAIN ERRORS.  Please CHECK them before fabrication!${\(RESET)}", 0); #if WANT_DEBUG

use constant METRIC => FALSE; #set to TRUE for metric units (only affect final numbers in output files, not internal arithmetic)
use constant APERTURE_LIMIT => 0; #34; #max #apertures to use; generate warnings if too many apertures are used (0 to not check)
use constant DRILL_FMT => '2.4'; #'2.3'; #'2.4' is the default for PCB fab; change to '2.3' for CNC

use constant WANT_DEBUG => 0; #10; #level of debug wanted; higher == more, lower == less, 0 == none
use constant GERBER_DEBUG => 0; #level of debug to include in Gerber file; DON'T USE FOR FABRICATION
use constant WANT_STREAMS => FALSE; #TRUE; #save decompressed streams to files (for debug)
use constant WANT_ALLINPUT => FALSE; #TRUE; #save entire input stream (for debug ONLY)

#DebugPrint(sprintf("${\(CYAN)}DEBUG: stdout %d, gerber %d, want streams? %d, all input? %d, O/S: $^O, Perl: $]${\(RESET)}\n", WANT_DEBUG, GERBER_DEBUG, WANT_STREAMS, WANT_ALLINPUT), 1);
#DebugPrint(sprintf("max int = %d, min int = %d\n", MAXINT, MININT), 1); 

#define standard trace and pad sizes to reduce scaling or PDF rendering errors:
#This avoids weird aperture settings and replaces them with more standardized values.
#(I'm not sure how photoplotters handle strange sizes).
#Fewer choices here gives more accurate mapping in the final Gerber files.
#units are in inches
use constant TOOL_SIZES => #add more as desired
(
#round or square pads (> 0) and drills (< 0):
    .010, -.001,  #tiny pads for SMD; dummy drill size (too small for practical use, but needed so StandardTool will use this entry)
    .031, -.014,  #used for vias
    .041, -.020,  #smallest non-filled plated hole
    .051, -.025,
    .056, -.029,  #useful for IC pins
    .070, -.033,
    .075, -.040,  #heavier leads
#    .090, -.043,  #NOTE: 600 dpi is not high enough resolution to reliably distinguish between .043" and .046", so choose 1 of the 2 here
    .100, -.046,
    .115, -.052,
    .130, -.061,
    .140, -.067,
    .150, -.079,
    .175, -.088,
    .190, -.093,
    .200, -.100,
    .220, -.110,
    .160, -.125,  #useful for mounting holes
#some additional pad sizes without holes (repeat a previous hole size if you just want the pad size):
    .090, -.040,  #want a .090 pad option, but use dummy hole size
    .065, -.040, #.065 x .065 rect pad
    .035, -.040, #.035 x .065 rect pad
#traces:
    .001,  #too thin for real traces; use only for board outlines
    .006,  #minimum real trace width; mainly used for text
    .008,  #mainly used for mid-sized text, not traces
    .010,  #minimum recommended trace width for low-current signals
    .012,
    .015,  #moderate low-voltage current
    .020,  #heavier trace for power, ground (even if a lighter one is adequate)
    .025,
    .030,  #heavy-current traces; be careful with these ones!
    .040,
    .050,
    .060,
    .080,
    .100,
    .120,
);
#Areas larger than the values below will be filled with parallel lines:
#This cuts down on the number of aperture sizes used.
#Set to 0 to always use an aperture or drill, regardless of size.
use constant { MAX_APERTURE => max((TOOL_SIZES)) + .004, MAX_DRILL => -min((TOOL_SIZES)) + .004 }; #max aperture and drill sizes (plus a little tolerance)
#DebugPrint(sprintf("using %d standard tool sizes: %s, max aper %.3f, max drill %.3f\n", scalar((TOOL_SIZES)), join(", ", (TOOL_SIZES)), MAX_APERTURE, MAX_DRILL), 1);

#NOTE: Compare the PDF to the original CAD file to check the accuracy of the PDF rendering and parsing!
#for example, the CAD software I used generated the following circles for holes:
#CAD hole size:   parsed PDF diameter:      error:
#  .014                .016                +.002
#  .020                .02267              +.00267
#  .025                .026                +.001
#  .029                .03167              +.00267
#  .033                .036                +.003
#  .040                .04267              +.00267
#This was usually ~ .002" - .003" too big compared to the hole as displayed in the CAD software.
#To compensate for PDF rendering errors (either during CAD Print function or PDF parsing logic), adjust the values below as needed.
#units are pixels; for example, a value of 2.4 at 600 dpi = .004 inch, 2 at 600 dpi = .0033"
use constant
{
    HOLE_ADJUST => -0.004 * 600, #-2.6, #holes seemed to be slightly oversized (by .002" - .004"), so shrink them a little
    RNDPAD_ADJUST => -0.003 * 600, #-2, #-2.4, #round pads seemed to be slightly oversized, so shrink them a little
    SQRPAD_ADJUST => +0.001 * 600, #+.5, #square pads are sometimes too small by .00067, so bump them up a little
    RECTPAD_ADJUST => 0, #(pixels) rectangular pads seem to be okay? (not tested much)
    TRACE_ADJUST => 0, #(pixels) traces seemed to be okay?
    REDUCE_TOLERANCE => .001, #(inches) allow this much variation when reducing circles and rects
};

#Also, my CAD's Print function or the PDF print driver I used was a little off for circles, so define some additional adjustment values here:
#Values are added to X/Y coordinates; units are pixels; for example, a value of 1 at 600 dpi would be ~= .002 inch
use constant
{
    CIRCLE_ADJUST_MINX => 0,
    CIRCLE_ADJUST_MINY => -0.001 * 600, #-1, #circles were a little too high, so nudge them a little lower
    CIRCLE_ADJUST_MAXX => +0.001 * 600, #+1, #circles were a little too far to the left, so nudge them a little to the right
    CIRCLE_ADJUST_MAXY => 0,
    SUBST_CIRCLE_CLIPRECT => FALSE, #generate circle and substitute for clip rects (to compensate for the way some CAD software draws circles)
    WANT_CLIPRECT => TRUE, #FALSE, #AI doesn't need clip rect at all? should be on normally?
    RECT_COMPLETION => FALSE, #TRUE, #fill in 4th side of rect when 3 sides found
};

#allow .012 clearance around pads for solder mask:
#This value effectively adjusts pad sizes in the TOOL_SIZES list above (only for solder mask layers).
use constant SOLDER_MARGIN => +.012; #units are inches

#line join/cap styles:
use constant
{
    CAP_NONE => 0, #butt (none); line is exact length
    CAP_ROUND => 1, #round cap/join; line overhangs by a semi-circle at either end
    CAP_SQUARE => 2, #square cap/join; line overhangs by a half square on either end
    CAP_OVERRIDE => FALSE, #cap style overrides drawing logic
};
    
#number of elements in each shape type:
use constant
{
    RECT_SHAPELEN => 6, #x0, y0, x1, y1, count, "rect" (start, end corners)
    LINE_SHAPELEN => 6, #x0, y0, x1, y1, count, "line" (line seg)
    CURVE_SHAPELEN => 10, #xstart, ystart, x0, y0, x1, y1, xend, yend, count, "curve" (bezier 2 points)
    CIRCLE_SHAPELEN => 5, #x, y, 5, count, "circle" (center + radius)
};
#const my %SHAPELEN =
#Readonly my %SHAPELEN =>
our %SHAPELEN =
(
    rect => RECT_SHAPELEN,
    line => LINE_SHAPELEN,
    curve => CURVE_SHAPELEN,
    circle => CIRCLE_SHAPELEN,
);

#panelization:
#This will repeat the entire body the number of times indicated along the X or Y axes (files grow accordingly).
#Display elements that overhang PCB boundary can be squashed or left as-is (typically text or other silk screen markings).
#Set "overhangs" TRUE to allow overhangs, FALSE to truncate them.
#xpad and ypad allow margins to be added around outer edge of panelized PCB.
use constant PANELIZE => {'x' => 1, 'y' => 1, 'xpad' => 0, 'ypad' => 0, 'overhangs' => TRUE}; #number of times to repeat in X and Y directions

# Set this to 1 if you need TurboCAD support.
#$turboCAD = FALSE; #is this still needed as an option?

#CIRCAD pad generation uses an appropriate aperture, then moves it (stroke) "a little" - we use this to find pads and distinguish them from PCB holes. 
use constant PAD_STROKE => 0.3; #0.0005 * 600; #units are pixels
#convert very short traces to pads or holes:
use constant TRACE_MINLEN => .001; #units are inches
#use constant ALWAYS_XY => TRUE; #FALSE; #force XY even if X or Y doesn't change; NOTE: needs to be TRUE for all pads to show in FlatCAM and ViewPlot
use constant REMOVE_POLARITY => FALSE; #TRUE; #set to remove subtractive (negative) polarity; NOTE: must be FALSE for ground planes

#PDF uses "points", each point = 1/72 inch
#combined with a PDF scale factor of .12, this gives 600 dpi resolution (72 / .12 = 600 dpi)
use constant INCHES_PER_POINT => 1/72; #0.0138888889; #multiply point-size by this to get inches

# The precision used when computing a bezier curve. Higher numbers are more precise but slower (and generate larger files).
#$bezierPrecision = 100;
use constant BEZIER_PRECISION => 36; #100; #use const; reduced for faster rendering (mainly used for silk screen and thermal pads)

# Ground planes and silk screen or larger copper rectangles or circles are filled line-by-line using this resolution.
use constant FILL_WIDTH => .01; #fill at most 0.01 inch at a time

# The max number of characters to read into memory
use constant MAX_BYTES => 10 * M; #bumped up to 10 MB, use const

use constant DUP_DRILL1 => TRUE; #FALSE; #kludge: ViewPlot doesn't load drill files that are too small so duplicate first tool

my $runtime = time(); #Time::HiRes::gettimeofday(); #measure my execution time

print STDERR "Loaded config settings from '${\(__FILE__)}'.\n";
1; #last value must be truthful to indicate successful load


#############################################################################################
#junk/experiment:

#use Package::Constants;
#use Exporter qw(import); #https://perldoc.perl.org/Exporter.html

#my $caller = "pdf2gerb::";

#sub cfg
#{
#    my $proto = shift;
#    my $class = ref($proto) || $proto;
#    my $settings =
#    {
#        $WANT_DEBUG => 990, #10; #level of debug wanted; higher == more, lower == less, 0 == none
#    };
#    bless($settings, $class);
#    return $settings;
#}

#use constant HELLO => "hi there2"; #"main::HELLO" => "hi there";
#use constant GOODBYE => 14; #"main::GOODBYE" => 12;

#print STDERR "read cfg file\n";

#our @EXPORT_OK = Package::Constants->list(__PACKAGE__); #https://www.perlmonks.org/?node_id=1072691; NOTE: "_OK" skips short/common names

#print STDERR scalar(@EXPORT_OK) . " consts exported:\n";
#foreach(@EXPORT_OK) { print STDERR "$_\n"; }
#my $val = main::thing("xyz");
#print STDERR "caller gave me $val\n";
#foreach my $arg (@ARGV) { print STDERR "arg $arg\n"; }

Download Details:

Author: swannman
Source Code: https://github.com/swannman/pdf2gerb

License: GPL-3.0 license

#perl 

Hoang Ha


How to Scrape Data from Twitter Using Tweepy and Snscrape


You can use data gathered from social media in a number of ways, like sentiment analysis (analyzing people's opinions) on a specific issue or field of interest.

There are several ways you can scrape (or gather) data from Twitter. And in this article, we will look at two of those ways: using Tweepy and Snscrape.

We will learn a method to scrape public conversations from people on a specific trending topic, as well as tweets from a particular user.

Now without further ado, let’s get started.

Tweepy vs Snscrape – Introduction to Our Scraping Tools

Now, before we get into the implementation of each platform, let's try to grasp the differences and limits of each platform.

Tweepy

Tweepy is a Python library for integrating with the Twitter API. Because Tweepy is connected with the Twitter API, you can perform complex queries in addition to scraping tweets. It enables you to take advantage of all of the Twitter API's capabilities.

But there are some drawbacks, like the fact that its standard API only allows you to collect tweets up to a week old (that is, Tweepy does not allow recovery of tweets beyond a one-week window, so historical data retrieval is not permitted).

Also, there are limits to how many tweets you can retrieve from a user's account. You can read more about Tweepy's functionalities here.

Snscrape

Snscrape is another approach for scraping information from Twitter that does not require the use of an API. Snscrape allows you to scrape basic information such as a user's profile, tweet content, source, and so on.

Snscrape is not limited to Twitter, but can also scrape content from other prominent social media networks like Facebook, Instagram, and others.

Its advantages are that there are no limits to the number of tweets you can retrieve or the window of tweets (that is, the date range of tweets). So Snscrape allows you to retrieve old data.

But the one disadvantage is that it lacks all the other functionalities of Tweepy – still, if you only want to scrape tweets, Snscrape would be enough.

Now that we've clarified the distinction between the two methods, let's go over their implementation one by one.

How to Use Tweepy to Scrape Tweets

Before we begin using Tweepy, we must first make sure that our Twitter credentials are ready. With that, we can connect Tweepy to our API key and begin scraping.

If you do not have Twitter credentials, you can register for a Twitter developer account by going here. You will be asked some basic questions about how you intend to use the Twitter API. After that, you can begin the implementation.

The first step is to install the Tweepy library on your local machine, which you can do by typing:

pip install git+https://github.com/tweepy/tweepy.git

How to Scrape Tweets from a User on Twitter

Now that we’ve installed the Tweepy library, let’s scrape 100 tweets from a user called john on Twitter. We'll look at the full code implementation that will let us do this and discuss it in detail so we can grasp what’s going on:

import tweepy
import pandas as pd  #needed later to build the dataframe
import time          #needed in the exception handler

consumer_key = "XXXX" #Your API/Consumer key 
consumer_secret = "XXXX" #Your API/Consumer Secret Key
access_token = "XXXX"    #Your Access token key
access_token_secret = "XXXX" #Your Access token Secret key

#Pass in our twitter API authentication key
auth = tweepy.OAuth1UserHandler(
    consumer_key, consumer_secret,
    access_token, access_token_secret
)

#Instantiate the tweepy API
api = tweepy.API(auth, wait_on_rate_limit=True)


username = "john"
no_of_tweets = 100


try:
    #Retrieve the most recent tweets from the user
    tweets = api.user_timeline(screen_name=username, count=no_of_tweets)
    
    #Pulling Some attributes from the tweet
    attributes_container = [[tweet.created_at, tweet.favorite_count,tweet.source,  tweet.text] for tweet in tweets]

    #Creation of column list to rename the columns in the dataframe
    columns = ["Date Created", "Number of Likes", "Source of Tweet", "Tweet"]
    
    #Creation of Dataframe
    tweets_df = pd.DataFrame(attributes_container, columns=columns)
except BaseException as e:
    print('Status Failed On,',str(e))
    time.sleep(3)

Now let's go over each part of the code in the above block.

import tweepy

consumer_key = "XXXX" #Your API/Consumer key 
consumer_secret = "XXXX" #Your API/Consumer Secret Key
access_token = "XXXX"    #Your Access token key
access_token_secret = "XXXX" #Your Access token Secret key

#Pass in our twitter API authentication key
auth = tweepy.OAuth1UserHandler(
    consumer_key, consumer_secret,
    access_token, access_token_secret
)

#Instantiate the tweepy API
api = tweepy.API(auth, wait_on_rate_limit=True)

In the above code, we've imported the Tweepy library, then created some variables to store our Twitter credentials (the Tweepy authentication handler requires four of them). We then pass those variables into the Tweepy authentication handler and save the result into another variable.

Then the last statement is where we instantiated the Tweepy API and passed in the required parameters.

username = "john"
no_of_tweets =100


try:
    #Retrieve the most recent tweets from the user
    tweets = api.user_timeline(screen_name=username, count=no_of_tweets)
    
    #Pulling Some attributes from the tweet
    attributes_container = [[tweet.created_at, tweet.favorite_count,tweet.source,  tweet.text] for tweet in tweets]

    #Creation of column list to rename the columns in the dataframe
    columns = ["Date Created", "Number of Likes", "Source of Tweet", "Tweet"]
    
    #Creation of Dataframe
    tweets_df = pd.DataFrame(attributes_container, columns=columns)
except BaseException as e:
    print('Status Failed On,',str(e))

In the above code, we created variables for the name of the user (the @name on Twitter) whose tweets we want to retrieve and for the number of tweets. We then created an exception handler to help us catch errors in a more effective way.

After that, api.user_timeline() returns a collection of the most recent tweets posted by the user we picked via the screen_name parameter, limited to the number of tweets we want to retrieve.

In the next line of code, we passed in some attributes we want to retrieve from each tweet and saved them into a list. To see more attributes you can retrieve from a tweet, read this.

In the last chunk of code, we created a dataframe and passed in the list we built, along with the names of the columns.

Note that the column names must be in the sequence of how you passed them into the attributes container (that is, how you passed those attributes in a list when you were retrieving the attributes from the tweet).
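That positional pairing is plain Python behavior, independent of pandas; a tiny sketch (with made-up sample values, not real tweet data) shows what happens:

```python
# Hypothetical attribute row, in the same order we pulled them from a tweet:
# created_at, favorite_count, source, text
row = ["2022-01-01", 5, "Twitter Web App", "hello world"]

# The column names must follow that exact order,
# otherwise labels end up attached to the wrong values.
columns = ["Date Created", "Number of Likes", "Source of Tweet", "Tweet"]

record = dict(zip(columns, row))
print(record["Tweet"])            # hello world
print(record["Number of Likes"])  # 5
```

pd.DataFrame(attributes_container, columns=columns) performs the same positional match for every row.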

If you correctly followed the steps I described, you should have something like this:


Image by Author

Now that we are done, let's go over one more example before we move into the Snscrape implementation.

How to Scrape Tweets from a Text Search

In this method, we will be retrieving a tweet based on a search. You can do that like this:

import tweepy
import pandas as pd  #needed later to build the dataframe
import time          #needed in the exception handler

consumer_key = "XXXX" #Your API/Consumer key 
consumer_secret = "XXXX" #Your API/Consumer Secret Key
access_token = "XXXX"    #Your Access token key
access_token_secret = "XXXX" #Your Access token Secret key

#Pass in our twitter API authentication key
auth = tweepy.OAuth1UserHandler(
    consumer_key, consumer_secret,
    access_token, access_token_secret
)

#Instantiate the tweepy API
api = tweepy.API(auth, wait_on_rate_limit=True)


search_query = "sex for grades"
no_of_tweets = 150


try:
    #Retrieve the most recent tweets from the search
    tweets = api.search_tweets(q=search_query, count=no_of_tweets)
    
    #Pulling Some attributes from the tweet
    attributes_container = [[tweet.user.name, tweet.created_at, tweet.favorite_count, tweet.source,  tweet.text] for tweet in tweets]

    #Creation of column list to rename the columns in the dataframe
    columns = ["User", "Date Created", "Number of Likes", "Source of Tweet", "Tweet"]
    
    #Creation of Dataframe
    tweets_df = pd.DataFrame(attributes_container, columns=columns)
except BaseException as e:
    print('Status Failed On,',str(e))

The above code is similar to the previous code, except that we changed the API method from api.user_timeline() to api.search_tweets(). We've also added tweet.user.name to the attributes container list.

In the code above, you can see that we passed in two attributes. This is because if we only pass in tweet.user, it would return the entire user object. So we must also specify the attribute we want to retrieve from the user object, which is name.
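A toy stand-in (hypothetical classes, not the real Tweepy ones) illustrates why tweet.user alone isn't what we want:

```python
# Hypothetical stand-ins for the nested objects Tweepy returns
class User:
    def __init__(self, name):
        self.name = name

class Tweet:
    def __init__(self, user):
        self.user = user

tweet = Tweet(User("John Doe"))

print(type(tweet.user).__name__)  # User -- a whole object, not a string
print(tweet.user.name)            # John Doe
```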

You can go here to see a list of additional attributes that you can retrieve from a user object. Now you should see something like this once you run it:


Image by Author.

Alright, that just about wraps up the Tweepy implementation. Just remember that there is a limit to the number of tweets you can retrieve, and you cannot retrieve tweets more than 7 days old using Tweepy.

How to Use Snscrape to Scrape Tweets

As I mentioned previously, Snscrape does not require Twitter credentials (an API key) to access it. There is also no limit to the number of tweets you can fetch.

For this example, though, we'll just retrieve the same tweets as in the previous example, but using Snscrape instead.

To use Snscrape, we must first install its library on our PC. You can do that by typing:

pip3 install git+https://github.com/JustAnotherArchivist/snscrape.git

How to Scrape Tweets from a User with Snscrape

Snscrape includes two methods for getting tweets from Twitter: the command-line interface (CLI) and a Python wrapper. Just keep in mind that the Python wrapper is currently undocumented, but we can still get by with trial and error.

In this example, we will use the Python wrapper because it is more intuitive than the CLI method. But if you get stuck with some code, you can always turn to the GitHub community for assistance. The contributors will be happy to help you.

To retrieve tweets from a particular user, we can do the following:

import snscrape.modules.twitter as sntwitter
import pandas as pd

# Created a list to append all tweet attributes(data)
attributes_container = []

# Using TwitterSearchScraper to scrape data and append tweets to list
for i,tweet in enumerate(sntwitter.TwitterSearchScraper('from:john').get_items()):
    if i>100:
        break
    attributes_container.append([tweet.date, tweet.likeCount, tweet.sourceLabel, tweet.content])
    
# Creating a dataframe from the tweets list above 
tweets_df = pd.DataFrame(attributes_container, columns=["Date Created", "Number of Likes", "Source of Tweet", "Tweets"])

Let's go over some of the code that you might not understand at first glance:

for i,tweet in enumerate(sntwitter.TwitterSearchScraper('from:john').get_items()):
    if i>100:
        break
    attributes_container.append([tweet.date, tweet.likeCount, tweet.sourceLabel, tweet.content])
    
  
# Creating a dataframe from the tweets list above 
tweets_df = pd.DataFrame(attributes_container, columns=["Date Created", "Number of Likes", "Source of Tweet", "Tweets"])

In the above code, what sntwitter.TwitterSearchScraper does is return an object of tweets matching the name of the user we passed into it (which is john).

As I mentioned earlier, Snscrape has no limit on the number of tweets, so it will return however many tweets the user has. To limit this, we wrap the iteration in the enumerate function, which adds a counter as it goes through the object, so we can stop after the most recent 100 tweets from the user.
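The counter-and-break pattern can be sketched against a hypothetical stand-in generator instead of a live scraper. One detail worth noting: a condition like if i>100 actually lets 101 items through (i runs from 0 to 100), so use i >= 100 for exactly 100:

```python
def fake_tweet_stream():
    """Hypothetical stand-in for an endless stream of tweets."""
    n = 0
    while True:
        yield f"tweet-{n}"
        n += 1

collected = []
for i, tweet in enumerate(fake_tweet_stream()):
    if i >= 100:   # stop after exactly 100 items
        break
    collected.append(tweet)

print(len(collected))  # 100
```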

You can see that the attribute syntax we get from each tweet looks like the one from Tweepy. This is the list of attributes that we can get from a Snscrape tweet, curated by Martin Beck.


Credit: Martin Beck

More attributes might be added, as the Snscrape library is still in development. For instance, in the above image, source has been replaced with sourceLabel. If you pass in only source, it will return an object.

If you run the above code, you should see something like this as well:


Image by Author

Now let's do the same for scraping by search.

How to Scrape Tweets from a Text Search with Snscrape

import snscrape.modules.twitter as sntwitter
import pandas as pd

# Creating list to append tweet data to
attributes_container = []

# Using TwitterSearchScraper to scrape data and append tweets to list
for i,tweet in enumerate(sntwitter.TwitterSearchScraper('sex for grades since:2021-07-05 until:2022-07-06').get_items()):
    if i>150:
        break
    attributes_container.append([tweet.user.username, tweet.date, tweet.likeCount, tweet.sourceLabel, tweet.content])
    
# Creating a dataframe to load the list
tweets_df = pd.DataFrame(attributes_container, columns=["User", "Date Created", "Number of Likes", "Source of Tweet", "Tweet"])

Again, you can access a lot of historical data using Snscrape (unlike Tweepy, whose standard API cannot go back more than 7 days; the premium API allows 30 days). So we can pass the date we want the search to start from and the date we want it to end to the sntwitter.TwitterSearchScraper() method.

What we've done in the preceding code is basically what we discussed before. The only thing to bear in mind is that until works like Python's range function (that is, the end value is excluded). So if you want to get tweets from today, you need to set the "until" parameter to tomorrow's date.
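For example, you can compute the exclusive end date with the standard datetime module instead of hardcoding it. The search keywords below are just the ones used above, and the fixed date stands in for date.today():

```python
from datetime import date, timedelta

# `until` is exclusive (like range()), so to include a given day's
# tweets the end date must be the following day
last_day = date(2022, 7, 5)            # stand-in for date.today()
until = last_day + timedelta(days=1)

query = f"sex for grades since:2021-07-05 until:{until.isoformat()}"
print(query)  # sex for grades since:2021-07-05 until:2022-07-06
```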

[Image: dataframe of the search results. Image by Author]

Now you know how to scrape tweets with Snscrape, too!

When to use each approach

Now that we've seen how each method works, you might be wondering when to use which.

Well, there is no universal rule for when to utilize each method. It all comes down to personal preference and your use case.

If you want to acquire an endless number of tweets, you should use Snscrape. But if you want to use extra features that Snscrape cannot provide (like geolocation, for example), then you should definitely use Tweepy. It is directly integrated with the Twitter API and provides complete functionality.

Even so, Snscrape is the most commonly used method for basic scraping.

Conclusion

In this article, we learned how to scrape data from Twitter using Tweepy and Snscrape in Python. But this was only a brief overview of how each approach works. You can learn more by exploring each library's documentation and examples on the web.

I've included some useful resources that you can use if you need additional information. Thank you for reading.

 Source: https://www.freecodecamp.org/news/python-web-scraping-tutorial/

#python #web 

许 志强

许 志强

1657769340

如何使用 Tweepy 和 Snscape 从 Twitter 上抓取数据

如果您是数据爱好者,您可能会同意社交媒体是现实世界数据中最丰富的来源之一。像 Twitter 这样的网站充满了数据。

您可以通过多种方式使用从社交媒体获得的数据,例如针对特定问题或感兴趣领域的情绪分析(分析人们的想法)。

您可以通过多种方式从 Twitter 上抓取(或收集)数据。在本文中,我们将研究其中两种方式:使用 Tweepy 和 Snscrap。

我们将学习一种方法来抓取人们关于特定趋势主题的公开对话,以及来自特定用户的推文。

现在事不宜迟,让我们开始吧。

Tweepy vs Snscrape——我们的抓取工具简介

现在,在我们进入每个平台的实现之前,让我们尝试掌握每个平台的差异和限制。

呸呸呸

Tweepy 是一个用于与 Twitter API 集成的 Python 库。因为 Tweepy 与 Twitter API 连接,除了抓取推文之外,您还可以执行复杂的查询。它使您能够利用 Twitter API 的所有功能。

但也有一些缺点——比如它的标准 API 只允许您收集长达一周的推文(也就是说,Tweepy 不允许恢复超过一周窗口的推文,因此不允许检索历史数据)。

此外,您可以从用户帐户中检索多少条推文也是有限制的。您可以在此处阅读有关 Tweepy 功能的更多信息

刮擦

Snscape 是另一种从 Twitter 上抓取信息的方法,不需要使用 API。Snscrape 允许您抓取基本信息,例如用户的个人资料、推文内容、来源等。

Snscape 不仅限于 Twitter,还可以从其他著名的社交媒体网络(如 Facebook、Instagram 等)中抓取内容。

它的优点是可以检索的推文数量或推文窗口(即推文的日期范围)没有限制。因此,Snscape 允许您检索旧数据。

但一个缺点是它缺乏 Tweepy 的所有其他功能——不过,如果你只想抓取推文,Snscrap 就足够了。

现在我们已经阐明了这两种方法之间的区别,让我们一一来看看它们的实现。

如何使用 Tweepy 抓取推文

在我们开始使用 Tweepy 之前,我们必须首先确保我们的 Twitter 凭据已准备好。有了它,我们可以将 Tweepy 连接到我们的 API 密钥并开始抓取。

如果您没有 Twitter 凭据,您可以前往此处注册 Twitter 开发者帐户。您将被问及一些关于您打算如何使用 Twitter API 的基本问题。之后,您可以开始实施。

第一步是在你的本地机器上安装 Tweepy 库,你可以通过键入:

pip install git+https://github.com/tweepy/tweepy.git

如何在 Twitter 上抓取用户的推文

现在我们已经安装了 Tweepy 库,让我们从johnTwitter 上调用的用户那里抓取 100 条推文。我们将查看完整的代码实现,让我们这样做并详细讨论它,以便我们了解发生了什么:

import tweepy

consumer_key = "XXXX" #Your API/Consumer key 
consumer_secret = "XXXX" #Your API/Consumer Secret Key
access_token = "XXXX"    #Your Access token key
access_token_secret = "XXXX" #Your Access token Secret key

#Pass in our twitter API authentication key
auth = tweepy.OAuth1UserHandler(
    consumer_key, consumer_secret,
    access_token, access_token_secret
)

#Instantiate the tweepy API
api = tweepy.API(auth, wait_on_rate_limit=True)


username = "john"
no_of_tweets =100


try:
    #The number of tweets we want to retrieved from the user
    tweets = api.user_timeline(screen_name=username, count=no_of_tweets)
    
    #Pulling Some attributes from the tweet
    attributes_container = [[tweet.created_at, tweet.favorite_count,tweet.source,  tweet.text] for tweet in tweets]

    #Creation of column list to rename the columns in the dataframe
    columns = ["Date Created", "Number of Likes", "Source of Tweet", "Tweet"]
    
    #Creation of Dataframe
    tweets_df = pd.DataFrame(attributes_container, columns=columns)
except BaseException as e:
    print('Status Failed On,',str(e))
    time.sleep(3)

现在让我们回顾一下上面代码块中的每一部分代码。

import tweepy

consumer_key = "XXXX" #Your API/Consumer key 
consumer_secret = "XXXX" #Your API/Consumer Secret Key
access_token = "XXXX"    #Your Access token key
access_token_secret = "XXXX" #Your Access token Secret key

#Pass in our twitter API authentication key
auth = tweepy.OAuth1UserHandler(
    consumer_key, consumer_secret,
    access_token, access_token_secret
)

#Instantiate the tweepy API
api = tweepy.API(auth, wait_on_rate_limit=True)

在上面的代码中,我们将 Tweepy 库导入到我们的代码中,然后我们创建了一些变量来存储我们的 Twitter 凭据(Tweepy 身份验证处理程序需要我们的四个 Twitter 凭据)。所以我们然后将这些变量传递给 Tweepy 身份验证处理程序并将它们保存到另一个变量中。

然后最后一个调用语句是我们实例化 Tweepy API 并传入 require 参数的地方。

username = "john"
no_of_tweets =100


try:
    #The number of tweets we want to retrieved from the user
    tweets = api.user_timeline(screen_name=username, count=no_of_tweets)
    
    #Pulling Some attributes from the tweet
    attributes_container = [[tweet.created_at, tweet.favorite_count,tweet.source,  tweet.text] for tweet in tweets]

    #Creation of column list to rename the columns in the dataframe
    columns = ["Date Created", "Number of Likes", "Source of Tweet", "Tweet"]
    
    #Creation of Dataframe
    tweets_df = pd.DataFrame(attributes_container, columns=columns)
except BaseException as e:
    print('Status Failed On,',str(e))

在上面的代码中,我们创建了要从中检索推文的用户名(Twitter 中的@name)以及推文的数量。然后我们创建了一个异常处理程序来帮助我们以更有效的方式捕获错误。

之后,api.user_timeline()返回我们在参数中选择的用户发布的最新推文的集合以及screen_name您要检索的推文数量。

在下一行代码中,我们传入了一些我们想从每条推文中检索的属性,并将它们保存到一个列表中。要查看可以从推文中检索到的更多属性,请阅读

在最后一段代码中,我们创建了一个数据框,并传入了我们创建的列表以及我们创建的列的名称。

请注意,列名必须按照您将它们传递到属性容器的顺序(即,当您从推文中检索属性时,您如何在列表中传递这些属性)。

如果你正确地按照我描述的步骤,你应该有这样的:

图像 17

作者图片

现在我们已经完成了,在我们进入 Snscrap 实现之前,让我们再看一个例子。

如何从文本搜索中抓取推文

在这种方法中,我们将根据搜索检索推文。你可以这样做:

import tweepy

consumer_key = "XXXX" #Your API/Consumer key 
consumer_secret = "XXXX" #Your API/Consumer Secret Key
access_token = "XXXX"    #Your Access token key
access_token_secret = "XXXX" #Your Access token Secret key

#Pass in our twitter API authentication key
auth = tweepy.OAuth1UserHandler(
    consumer_key, consumer_secret,
    access_token, access_token_secret
)

#Instantiate the tweepy API
api = tweepy.API(auth, wait_on_rate_limit=True)


search_query = "sex for grades"
no_of_tweets =150


try:
    #The number of tweets we want to retrieved from the search
    tweets = api.search_tweets(q=search_query, count=no_of_tweets)
    
    #Pulling Some attributes from the tweet
    attributes_container = [[tweet.user.name, tweet.created_at, tweet.favorite_count, tweet.source,  tweet.text] for tweet in tweets]

    #Creation of column list to rename the columns in the dataframe
    columns = ["User", "Date Created", "Number of Likes", "Source of Tweet", "Tweet"]
    
    #Creation of Dataframe
    tweets_df = pd.DataFrame(attributes_container, columns=columns)
except BaseException as e:
    print('Status Failed On,',str(e))

上面的代码与前面的代码类似,只是我们将 API 方法从 更改api.user_timeline()api.search_tweets()。我们还添加tweet.user.name了属性容器列表。

在上面的代码中,你可以看到我们传入了两个属性。这是因为如果我们只传入tweet.user,它只会返回一个字典用户对象。所以我们还必须传入另一个我们想从用户对象中检索的属性,即name.

您可以在此处查看可以从用户对象中检索的附加属性列表。现在,一旦您运行它,您应该会看到类似这样的内容:

图像 18

图片由作者提供。

好的,这就是 Tweepy 的实现。请记住,您可以检索的推文数量是有限制的,并且您不能使用 Tweepy 检索超过 7 天的推文。

如何使用 Snscrape 来抓取推文

正如我之前提到的,Snscrape 不需要 Twitter 凭据(API 密钥)来访问它。您可以获取的推文数量也没有限制。

但是,对于这个示例,我们将只检索与上一个示例相同的推文,但使用 Snscape。

要使用 Snscrap,我们必须首先在我们的 PC 上安装它的库。您可以通过键入:

pip3 install git+https://github.com/JustAnotherArchivist/snscrape.git

如何使用 Snscrape 抓取用户的推文

Snscrape 包括两种从 Twitter 获取推文的方法:命令行界面 (CLI) 和 Python Wrapper。请记住,Python Wrapper 目前没有文档记录——但我们仍然可以通过反复试验来度过难关。

在本例中,我们将使用 Python Wrapper,因为它比 CLI 方法更直观。但是,如果您遇到一些代码问题,您可以随时向 GitHub 社区寻求帮助。贡献者将很乐意为您提供帮助。

要检索特定用户的推文,我们可以执行以下操作:

import snscrape.modules.twitter as sntwitter
import pandas as pd

# Created a list to append all tweet attributes(data)
attributes_container = []

# Using TwitterSearchScraper to scrape data and append tweets to list
for i,tweet in enumerate(sntwitter.TwitterSearchScraper('from:john').get_items()):
    if i>100:
        break
    attributes_container.append([tweet.date, tweet.likeCount, tweet.sourceLabel, tweet.content])
    
# Creating a dataframe from the tweets list above 
tweets_df = pd.DataFrame(attributes_container, columns=["Date Created", "Number of Likes", "Source of Tweet", "Tweets"])

让我们复习一下你可能第一眼看不懂的一些代码:

for i,tweet in enumerate(sntwitter.TwitterSearchScraper('from:john').get_items()):
    if i>100:
        break
    attributes_container.append([tweet.date, tweet.likeCount, tweet.sourceLabel, tweet.content])
    
  
# Creating a dataframe from the tweets list above 
tweets_df = pd.DataFrame(attributes_container, columns=["Date Created", "Number of Likes", "Source of Tweet", "Tweets"])

在上面的代码中,所做的sntwitter.TwitterSearchScaper是从我们传递给它的用户名(即 john)中返回一个推文对象。

正如我之前提到的,Snscrape 对推文的数量没有限制,因此它会返回来自该用户的许多推文。为了解决这个问题,我们需要添加枚举函数,该函数将遍历对象并添加一个计数器,以便我们可以访问用户最近的 100 条推文。

您可以看到,我们从每条推文中获得的属性语法与 Tweepy 中的类似。这些是我们可以从 Martin Beck 策划的 Snscape 推文中获得的属性列表。

Sns.Scrape

学分:马丁贝克

可能会添加更多属性,因为 Snscape 库仍在开发中。例如上图中的,source已替换为sourceLabel. 如果你只传入source它会返回一个对象。

如果你运行上面的代码,你应该也会看到类似这样的东西:

图像 19

作者图片

现在让我们对通过搜索进行抓取做同样的事情。

如何使用 Snscrape 从文本搜索中抓取推文

import snscrape.modules.twitter as sntwitter
import pandas as pd

# Creating list to append tweet data to
attributes_container = []

# Using TwitterSearchScraper to scrape data and append tweets to list
for i,tweet in enumerate(sntwitter.TwitterSearchScraper('sex for grades since:2021-07-05 until:2022-07-06').get_items()):
    if i>150:
        break
    attributes_container.append([tweet.user.username, tweet.date, tweet.likeCount, tweet.sourceLabel, tweet.content])
    
# Creating a dataframe to load the list
tweets_df = pd.DataFrame(attributes_container, columns=["User", "Date Created", "Number of Likes", "Source of Tweet", "Tweet"])

同样,您可以使用 Snscrape 访问大量历史数据(与 Tweepy 不同,因为它的标准 API 不能超过 7 天。高级 API 是 30 天。)。所以我们可以在方法中传入我们想要开始搜索的日期和想要结束的日期sntwitter.TwitterSearchScraper()

我们在前面的代码中所做的基本上就是我们之前讨论过的。唯一要记住的是,直到与 Python 中的范围函数类似(也就是说,它不包括最后一个整数)。因此,如果您想从今天开始获取推文,则需要在“直到”参数中包含今天之后的一天。

图像 21

作者的形象。

现在您也知道如何使用 Snscape 抓取推文了!

何时使用每种方法

现在我们已经了解了每种方法的工作原理,您可能想知道何时使用哪种方法。

好吧,对于何时使用每种方法没有通用规则。一切都取决于问题偏好和您的用例。

如果你想获得无穷无尽的推文,你应该使用 Snscrap。但是,如果您想使用 Snscrape 无法提供的额外功能(例如地理定位),那么您绝对应该使用 Tweepy。它直接与 Twitter API 集成并提供完整的功能。

即便如此,Snscrape 是最常用的基本刮削方法。

结论

在本文中,我们学习了如何使用 Tweepy 和 Snscrap 从 Python 中抓取数据。但这只是对每种方法如何工作的简要概述。您可以通过浏览网络了解更多信息以获取更多信息。

我提供了一些有用的资源,如果您需要更多信息,可以使用它们。感谢您的阅读。

 来源:https ://www.freecodecamp.org/news/python-web-scraping-tutorial/

#python #web 

Como Extrair Dados Do Twitter Usando Tweepy E Snscrape

Se você é um entusiasta de dados, provavelmente concordará que uma das fontes mais ricas de dados do mundo real são as mídias sociais. Sites como o Twitter estão cheios de dados.

Você pode usar os dados obtidos nas mídias sociais de várias maneiras, como análise de sentimentos (analisando os pensamentos das pessoas) sobre um assunto ou campo de interesse específico.

Existem várias maneiras de extrair (ou coletar) dados do Twitter. E neste artigo, veremos duas dessas maneiras: usando o Tweepy e o Snscrape.

Aprenderemos um método para extrair conversas públicas de pessoas sobre um tópico de tendência específico, bem como tweets de um usuário específico.

Agora sem mais delongas, vamos começar.

Tweepy vs Snscrape – Introdução às nossas ferramentas de raspagem

Agora, antes de entrarmos na implementação de cada plataforma, vamos tentar entender as diferenças e os limites de cada plataforma.

Tweepy

Tweepy é uma biblioteca Python para integração com a API do Twitter. Como o Tweepy está conectado à API do Twitter, você pode realizar consultas complexas além de extrair tweets. Ele permite que você aproveite todos os recursos da API do Twitter.

Mas existem algumas desvantagens – como o fato de que sua API padrão só permite coletar tweets por até uma semana (ou seja, o Tweepy não permite a recuperação de tweets além de uma janela de semana, portanto, a recuperação de dados históricos não é permitida).

Além disso, há limites para quantos tweets você pode recuperar da conta de um usuário. Você pode ler mais sobre as funcionalidades do Tweepy aqui .

Snscrape

Snscrape é outra abordagem para extrair informações do Twitter que não requer o uso de uma API. O Snscrape permite extrair informações básicas, como o perfil de um usuário, conteúdo do tweet, fonte e assim por diante.

O Snscrape não se limita ao Twitter, mas também pode extrair conteúdo de outras redes sociais proeminentes, como Facebook, Instagram e outros.

Suas vantagens são que não há limites para o número de tweets que você pode recuperar ou a janela de tweets (ou seja, o intervalo de datas dos tweets). Então Snscrape permite que você recupere dados antigos.

Mas a única desvantagem é que ele não possui todas as outras funcionalidades do Tweepy – ainda assim, se você quiser apenas raspar tweets, o Snscrape seria suficiente.

Agora que esclarecemos a distinção entre os dois métodos, vamos analisar sua implementação um por um.

Como usar o Tweepy para raspar tweets

Antes de começarmos a usar o Tweepy, devemos primeiro ter certeza de que nossas credenciais do Twitter estão prontas. Com isso, podemos conectar o Tweepy à nossa chave de API e começar a raspar.

Se você não tiver credenciais do Twitter, poderá se registrar para uma conta de desenvolvedor do Twitter acessando aqui . Serão feitas algumas perguntas básicas sobre como você pretende usar a API do Twitter. Depois disso, você pode começar a implementação.

O primeiro passo é instalar a biblioteca Tweepy em sua máquina local, o que você pode fazer digitando:

pip install git+https://github.com/tweepy/tweepy.git

Como raspar tweets de um usuário no Twitter

Agora que instalamos a biblioteca Tweepy, vamos extrair 100 tweets de um usuário chamado johnno Twitter. Veremos a implementação completa do código que nos permitirá fazer isso e discutiremos em detalhes para que possamos entender o que está acontecendo:

import tweepy

consumer_key = "XXXX" #Your API/Consumer key 
consumer_secret = "XXXX" #Your API/Consumer Secret Key
access_token = "XXXX"    #Your Access token key
access_token_secret = "XXXX" #Your Access token Secret key

#Pass in our twitter API authentication key
auth = tweepy.OAuth1UserHandler(
    consumer_key, consumer_secret,
    access_token, access_token_secret
)

#Instantiate the tweepy API
api = tweepy.API(auth, wait_on_rate_limit=True)


username = "john"
no_of_tweets =100


try:
    #The number of tweets we want to retrieved from the user
    tweets = api.user_timeline(screen_name=username, count=no_of_tweets)
    
    #Pulling Some attributes from the tweet
    attributes_container = [[tweet.created_at, tweet.favorite_count,tweet.source,  tweet.text] for tweet in tweets]

    #Creation of column list to rename the columns in the dataframe
    columns = ["Date Created", "Number of Likes", "Source of Tweet", "Tweet"]
    
    #Creation of Dataframe
    tweets_df = pd.DataFrame(attributes_container, columns=columns)
except BaseException as e:
    print('Status Failed On,',str(e))
    time.sleep(3)

Agora vamos examinar cada parte do código no bloco acima.

import tweepy

consumer_key = "XXXX" #Your API/Consumer key 
consumer_secret = "XXXX" #Your API/Consumer Secret Key
access_token = "XXXX"    #Your Access token key
access_token_secret = "XXXX" #Your Access token Secret key

#Pass in our twitter API authentication key
auth = tweepy.OAuth1UserHandler(
    consumer_key, consumer_secret,
    access_token, access_token_secret
)

#Instantiate the tweepy API
api = tweepy.API(auth, wait_on_rate_limit=True)

No código acima, importamos a biblioteca Tweepy para nosso código e criamos algumas variáveis ​​nas quais armazenamos nossas credenciais do Twitter (o manipulador de autenticação do Tweepy requer quatro de nossas credenciais do Twitter). Então, passamos essas variáveis ​​para o manipulador de autenticação Tweepy e as salvamos em outra variável.

Em seguida, a última instrução de chamada é onde instanciamos a API do Tweepy e passamos os parâmetros require.

username = "john"
no_of_tweets =100


try:
    #The number of tweets we want to retrieved from the user
    tweets = api.user_timeline(screen_name=username, count=no_of_tweets)
    
    #Pulling Some attributes from the tweet
    attributes_container = [[tweet.created_at, tweet.favorite_count,tweet.source,  tweet.text] for tweet in tweets]

    #Creation of column list to rename the columns in the dataframe
    columns = ["Date Created", "Number of Likes", "Source of Tweet", "Tweet"]
    
    #Creation of Dataframe
    tweets_df = pd.DataFrame(attributes_container, columns=columns)
except BaseException as e:
    print('Status Failed On,',str(e))

No código acima, criamos o nome do usuário (o @name no Twitter) do qual queremos recuperar os tweets e também o número de tweets. Em seguida, criamos um manipulador de exceção para nos ajudar a detectar erros de maneira mais eficaz.

Depois disso, o api.user_timeline()retorna uma coleção dos tweets mais recentes postados pelo usuário que escolhemos no screen_nameparâmetro e o número de tweets que você deseja recuperar.

Na próxima linha de código, passamos alguns atributos que queremos recuperar de cada tweet e os salvamos em uma lista. Para ver mais atributos que você pode recuperar de um tweet, leia isto .

No último pedaço de código criamos um dataframe e passamos a lista que criamos junto com os nomes da coluna que criamos.

Observe que os nomes das colunas devem estar na sequência de como você os passou para o contêiner de atributos (ou seja, como você passou esses atributos em uma lista quando estava recuperando os atributos do tweet).

Se você seguiu corretamente os passos que descrevi, você deve ter algo assim:

imagem-17

Imagem do autor

Agora que terminamos, vamos ver mais um exemplo antes de passarmos para a implementação do Snscrape.

Como raspar tweets de uma pesquisa de texto

Neste método, estaremos recuperando um tweet com base em uma pesquisa. Você pode fazer assim:

import tweepy

consumer_key = "XXXX" #Your API/Consumer key 
consumer_secret = "XXXX" #Your API/Consumer Secret Key
access_token = "XXXX"    #Your Access token key
access_token_secret = "XXXX" #Your Access token Secret key

#Pass in our twitter API authentication key
auth = tweepy.OAuth1UserHandler(
    consumer_key, consumer_secret,
    access_token, access_token_secret
)

#Instantiate the tweepy API
api = tweepy.API(auth, wait_on_rate_limit=True)


search_query = "sex for grades"
no_of_tweets =150


try:
    #The number of tweets we want to retrieved from the search
    tweets = api.search_tweets(q=search_query, count=no_of_tweets)
    
    #Pulling Some attributes from the tweet
    attributes_container = [[tweet.user.name, tweet.created_at, tweet.favorite_count, tweet.source,  tweet.text] for tweet in tweets]

    #Creation of column list to rename the columns in the dataframe
    columns = ["User", "Date Created", "Number of Likes", "Source of Tweet", "Tweet"]
    
    #Creation of Dataframe
    tweets_df = pd.DataFrame(attributes_container, columns=columns)
except BaseException as e:
    print('Status Failed On,',str(e))

O código acima é semelhante ao código anterior, exceto que alteramos o método da API de api.user_timeline()para api.search_tweets(). Também adicionamos tweet.user.nameà lista de contêineres de atributos.

No código acima, você pode ver que passamos dois atributos. Isso ocorre porque se passarmos apenas tweet.user, ele retornaria apenas um objeto de usuário de dicionário. Portanto, também devemos passar outro atributo que queremos recuperar do objeto de usuário, que é name.

Você pode acessar aqui para ver uma lista de atributos adicionais que podem ser recuperados de um objeto de usuário. Agora você deve ver algo assim depois de executá-lo:

imagem-18

Imagem do Autor.

Tudo bem, isso praticamente encerra a implementação do Tweepy. Apenas lembre-se de que há um limite para o número de tweets que você pode recuperar, e você não pode recuperar tweets com mais de 7 dias usando o Tweepy.

Como usar o Snscrape para raspar tweets

Como mencionei anteriormente, o Snscrape não requer credenciais do Twitter (chave de API) para acessá-lo. Também não há limite para o número de tweets que você pode buscar.

Para este exemplo, porém, apenas recuperaremos os mesmos tweets do exemplo anterior, mas usando o Snscrape.

Para usar o Snscrape, devemos primeiro instalar sua biblioteca em nosso PC. Você pode fazer isso digitando:

pip3 install git+https://github.com/JustAnotherArchivist/snscrape.git

Como raspar tweets de um usuário com Snscrape

O Snscrape inclui dois métodos para obter tweets do Twitter: a interface de linha de comando (CLI) e um Python Wrapper. Apenas tenha em mente que o Python Wrapper não está documentado no momento – mas ainda podemos nos virar com tentativa e erro.

Neste exemplo, usaremos o Python Wrapper porque é mais intuitivo que o método CLI. Mas se você ficar preso a algum código, sempre poderá recorrer à comunidade do GitHub para obter assistência. Os colaboradores terão prazer em ajudá-lo.

Para recuperar tweets de um usuário específico, podemos fazer o seguinte:

import snscrape.modules.twitter as sntwitter
import pandas as pd

# Created a list to append all tweet attributes(data)
attributes_container = []

# Using TwitterSearchScraper to scrape data and append tweets to list
for i,tweet in enumerate(sntwitter.TwitterSearchScraper('from:john').get_items()):
    if i>100:
        break
    attributes_container.append([tweet.date, tweet.likeCount, tweet.sourceLabel, tweet.content])
    
# Creating a dataframe from the tweets list above 
tweets_df = pd.DataFrame(attributes_container, columns=["Date Created", "Number of Likes", "Source of Tweet", "Tweets"])

Vamos revisar alguns dos códigos que você pode não entender à primeira vista:

for i,tweet in enumerate(sntwitter.TwitterSearchScraper('from:john').get_items()):
    if i>100:
        break
    attributes_container.append([tweet.date, tweet.likeCount, tweet.sourceLabel, tweet.content])
    
  
# Creating a dataframe from the tweets list above 
tweets_df = pd.DataFrame(attributes_container, columns=["Date Created", "Number of Likes", "Source of Tweet", "Tweets"])

No código acima, o que o sntwitter.TwitterSearchScaperfaz é retornar um objeto de tweets do nome do usuário que passamos para ele (que é john).

Como mencionei anteriormente, o Snscrape não tem limites no número de tweets, então ele retornará quantos tweets desse usuário. Para ajudar com isso, precisamos adicionar a função enumerate que irá percorrer o objeto e adicionar um contador para que possamos acessar os 100 tweets mais recentes do usuário.

Você pode ver que a sintaxe de atributos que obtemos de cada tweet se parece com a do Tweepy. Esta é a lista de atributos que podemos obter do tweet do Snscrape, com curadoria de Martin Beck.

Sns.Scrape

Crédito: Martin Beck

Mais atributos podem ser adicionados, pois a biblioteca Snscrape ainda está em desenvolvimento. Como por exemplo na imagem acima, sourcefoi substituído por sourceLabel. Se você passar apenas sourceele retornará um objeto.

Se você executar o código acima, deverá ver algo assim também:

imagem-19

Imagem do autor

Agora vamos fazer o mesmo para raspagem por pesquisa.

Como raspar tweets de uma pesquisa de texto com Snscrape

import snscrape.modules.twitter as sntwitter
import pandas as pd

# Creating list to append tweet data to
attributes_container = []

# Using TwitterSearchScraper to scrape data and append tweets to list
for i,tweet in enumerate(sntwitter.TwitterSearchScraper('sex for grades since:2021-07-05 until:2022-07-06').get_items()):
    if i>150:
        break
    attributes_container.append([tweet.user.username, tweet.date, tweet.likeCount, tweet.sourceLabel, tweet.content])
    
# Creating a dataframe to load the list
tweets_df = pd.DataFrame(attributes_container, columns=["User", "Date Created", "Number of Likes", "Source of Tweet", "Tweet"])

Novamente, você pode acessar muitos dados históricos usando o Snscrape (ao contrário do Tweepy, pois sua API padrão não pode exceder 7 dias. A API premium é de 30 dias). Assim, podemos passar a data a partir da qual queremos iniciar a pesquisa e a data em que queremos que ela termine no sntwitter.TwitterSearchScraper()método.

O que fizemos no código anterior é basicamente o que discutimos antes. A única coisa a ter em mente é que até funciona de forma semelhante à função range em Python (ou seja, exclui o último inteiro). Portanto, se você deseja obter tweets de hoje, precisa incluir o dia depois de hoje no parâmetro "até".

imagem-21

Imagem do Autor.

Agora você também sabe como raspar tweets com o Snscrape!

Quando usar cada abordagem

Agora que vimos como cada método funciona, você deve estar se perguntando quando usar qual.

Bem, não existe uma regra universal para quando utilizar cada método. Tudo se resume a uma preferência de assunto e seu caso de uso.

Se você deseja adquirir um número infinito de tweets, deve usar o Snscrape. Mas se você quiser usar recursos extras que o Snscrape não pode fornecer (como geolocalização, por exemplo), você deve definitivamente usar o Tweepy. Ele é integrado diretamente com a API do Twitter e oferece funcionalidade completa.

Mesmo assim, o Snscrape é o método mais comumente usado para raspagem básica.

Conclusão

Neste artigo, aprendemos como extrair dados do Python usando Tweepy e Snscrape. Mas esta foi apenas uma breve visão geral de como cada abordagem funciona. Você pode aprender mais explorando a web para obter informações adicionais.

Incluí alguns recursos úteis que você pode usar se precisar de informações adicionais. Obrigado por ler.

 Fonte: https://www.freecodecamp.org/news/python-web-scraping-tutorial/ 

#python #web 

坂本  篤司

坂本 篤司

1657773780

TweepyとSnscrapeを使用してTwitterからデータをスクレイピングする方法

あなたがデータ愛好家であれば、実世界のデータの最も豊富な情報源の1つがソーシャルメディアであることに同意するでしょう。Twitterのようなサイトはデータでいっぱいです。

ソーシャルメディアから取得できるデータは、特定の問題や関心のある分野に関する感情分析(人々の考えを分析する)など、さまざまな方法で使用できます。

Twitterからデータを取得(または収集)する方法はいくつかあります。この記事では、TweepyとSnscrapeの2つの方法について説明します。

特定のトレンドトピックに関する人々からの公開会話や、特定のユーザーからのツイートをスクレイプする方法を学びます。

さて、これ以上面倒なことはせずに、始めましょう。

Tweepy vs Snscrape –スクレイピングツールの紹介

さて、各プラットフォームの実装に入る前に、各プラットフォームの違いと限界を把握してみましょう。

Tweepy

Tweepyは、TwitterAPIと統合するためのPythonライブラリです。TweepyはTwitterAPIに接続されているため、ツイートをスクレイピングするだけでなく、複雑なクエリを実行できます。これにより、TwitterAPIのすべての機能を利用できるようになります。

ただし、いくつかの欠点があります。たとえば、標準APIでは最大1週間のツイートしか収集できない(つまり、Tweepyでは1週間を超えるツイートの回復が許可されないため、履歴データの取得は許可されません)。

また、ユーザーのアカウントから取得できるツイートの数には制限があります。Tweepyの機能について詳しくは、こちらをご覧ください。

Snscrape

Snscrapeは、APIを使用せずにTwitterから情報を取得するためのもう1つのアプローチです。Snscrapeを使用すると、ユーザーのプロファイル、ツイートコンテンツ、ソースなどの基本情報をスクレイピングできます。

SnscrapeはTwitterだけでなく、Facebook、Instagramなどの他の著名なソーシャルメディアネットワークからコンテンツをスクレイピングすることもできます。

その利点は、取得できるツイートの数やツイートのウィンドウ(つまり、ツイートの日付範囲)に制限がないことです。したがって、Snscrapeを使用すると、古いデータを取得できます。

ただし、1つの欠点は、Tweepyの他のすべての機能が不足していることです。それでも、ツイートをスクレイピングするだけの場合は、Snscrapeで十分です。

2つのメソッドの違いが明確になったので、それらの実装を1つずつ見ていきましょう。

Tweepyを使用してツイートをスクレイピングする方法

Tweepyの使用を開始する前に、まずTwitterのクレデンシャルの準備ができていることを確認する必要があります。これで、TweepyをAPIキーに接続して、スクレイピングを開始できます。

Twitterの資格情報をお持ちでない場合は、こちらからTwitter開発者アカウントに登録できます。TwitterAPIをどのように使用するかについての基本的な質問がいくつかあります。その後、実装を開始できます。

最初のステップは、ローカルマシンにTweepyライブラリをインストールすることです。これは、次のように入力することで実行できます。

pip install git+https://github.com/tweepy/tweepy.git

Twitterでユーザーからのツイートをスクレイピングする方法

Tweepyライブラリをインストールしたので、johnTwitterで呼び出されたユーザーから100件のツイートを取得してみましょう。これを可能にする完全なコード実装を見て、何が起こっているのかを把握できるように詳細に説明します。

import tweepyconsumer_key = "XXXX" #Your API/Consumer key consumer_secret = "XXXX" #Your API/Consumer Secret Keyaccess_token = "XXXX"    #Your Access token keyaccess_token_secret = "XXXX" #Your Access token Secret key#Pass in our twitter API authentication keyauth = tweepy.OAuth1UserHandler(    consumer_key, consumer_secret,    access_token, access_token_secret)#Instantiate the tweepy APIapi = tweepy.API(auth, wait_on_rate_limit=True)username = "john"no_of_tweets =100try:    #The number of tweets we want to retrieved from the user    tweets = api.user_timeline(screen_name=username, count=no_of_tweets)        #Pulling Some attributes from the tweet    attributes_container = [[tweet.created_at, tweet.favorite_count,tweet.source,  tweet.text] for tweet in tweets]    #Creation of column list to rename the columns in the dataframe    columns = ["Date Created", "Number of Likes", "Source of Tweet", "Tweet"]        #Creation of Dataframe    tweets_df = pd.DataFrame(attributes_container, columns=columns)except BaseException as e:    print('Status Failed On,',str(e))    time.sleep(3)

次に、上記のブロックのコードの各部分を見ていきましょう。

import tweepyconsumer_key = "XXXX" #Your API/Consumer key consumer_secret = "XXXX" #Your API/Consumer Secret Keyaccess_token = "XXXX"    #Your Access token keyaccess_token_secret = "XXXX" #Your Access token Secret key#Pass in our twitter API authentication keyauth = tweepy.OAuth1UserHandler(    consumer_key, consumer_secret,    access_token, access_token_secret)#Instantiate the tweepy APIapi = tweepy.API(auth, wait_on_rate_limit=True)

上記のコードでは、Tweepyライブラリをコードにインポートしてから、Twitterクレデンシャルを格納する変数をいくつか作成しました(Tweepy認証ハンドラーには4つのTwitterクレデンシャルが必要です)。次に、それらの変数をTweepy認証ハンドラーに渡し、別の変数に保存します。

次に、呼び出しの最後のステートメントは、Tweepy APIをインスタンス化し、requireパラメーターを渡した場所です。

username = "john"no_of_tweets =100try:    #The number of tweets we want to retrieved from the user    tweets = api.user_timeline(screen_name=username, count=no_of_tweets)        #Pulling Some attributes from the tweet    attributes_container = [[tweet.created_at, tweet.favorite_count,tweet.source,  tweet.text] for tweet in tweets]    #Creation of column list to rename the columns in the dataframe    columns = ["Date Created", "Number of Likes", "Source of Tweet", "Tweet"]        #Creation of Dataframe    tweets_df = pd.DataFrame(attributes_container, columns=columns)except BaseException as e:    print('Status Failed On,',str(e))

上記のコードでは、ツイートを取得するユーザーの名前(Twitterでは@name)と、ツイートの数を作成しました。次に、より効果的な方法でエラーをキャッチするのに役立つ例外ハンドラーを作成しました。

その後api.user_timeline()、パラメータで選択したユーザーによって投稿された最新のツイートのコレクションと、screen_name取得するツイートの数が返されます。

次のコード行では、各ツイートから取得する属性をいくつか渡して、リストに保存しました。ツイートから取得できるその他の属性を確認するには、こちらをお読みください。

コードの最後のチャンクで、データフレームを作成し、作成した列の名前とともに作成したリストを渡しました。

列名は、属性コンテナにどのように渡したか(つまり、ツイートから属性を取得するときにリストでこれらの属性をどのように渡したか)の順序である必要があることに注意してください。

私が説明した手順を正しく実行した場合は、次のようになります。

画像-17

著者による画像

これで完了です。Snscrapeの実装に移る前に、もう1つの例を見ていきましょう。

テキスト検索からツイートをスクレイピングする方法

この方法では、検索に基づいてツイートを取得します。あなたはこのようにそれを行うことができます:

import tweepyconsumer_key = "XXXX" #Your API/Consumer key consumer_secret = "XXXX" #Your API/Consumer Secret Keyaccess_token = "XXXX"    #Your Access token keyaccess_token_secret = "XXXX" #Your Access token Secret key#Pass in our twitter API authentication keyauth = tweepy.OAuth1UserHandler(    consumer_key, consumer_secret,    access_token, access_token_secret)#Instantiate the tweepy APIapi = tweepy.API(auth, wait_on_rate_limit=True)search_query = "sex for grades"no_of_tweets =150try:    #The number of tweets we want to retrieved from the search    tweets = api.search_tweets(q=search_query, count=no_of_tweets)        #Pulling Some attributes from the tweet    attributes_container = [[tweet.user.name, tweet.created_at, tweet.favorite_count, tweet.source,  tweet.text] for tweet in tweets]    #Creation of column list to rename the columns in the dataframe    columns = ["User", "Date Created", "Number of Likes", "Source of Tweet", "Tweet"]        #Creation of Dataframe    tweets_df = pd.DataFrame(attributes_container, columns=columns)except BaseException as e:    print('Status Failed On,',str(e))

The code above is similar to the previous code, except that we changed the API method from api.user_timeline() to api.search_tweets(). We also added tweet.user.name to the attributes container list.

You can see that we passed in two attributes there. This is because if we pass in only tweet.user, it would return only a dictionary-like user object. So we also have to pass in the attribute we want to retrieve from the user object, which is name.

You can go here to see the list of additional attributes you can retrieve from a user object. When you run the code, you should see something like this:

[Image: resulting dataframe. Image by author]

Alright, that just about wraps up the Tweepy implementation. Just remember that there is a limit to the number of tweets you can retrieve, and you cannot retrieve tweets older than 7 days using Tweepy.

How to Use Snscrape to Scrape Tweets

As I mentioned earlier, Snscrape does not require Twitter credentials (API keys) for access. There is also no limit to the number of tweets you can fetch.

For this example, though, we'll retrieve the same tweets as in the previous example, but using Snscrape instead.

To use Snscrape, we first have to install the library on our PC. You can do that by typing:

```shell
pip3 install git+https://github.com/JustAnotherArchivist/snscrape.git
```

How to Scrape Tweets from a User with Snscrape

Snscrape includes two methods for getting tweets from Twitter: the command-line interface (CLI) and a Python wrapper. Just keep in mind that the Python wrapper is currently undocumented, but we can still get by with trial and error.

In this example, we'll use the Python wrapper because it's more intuitive than the CLI method. But if you get stuck somewhere in the code, you can always turn to the GitHub community for assistance. The contributors will be glad to help you.

To retrieve tweets from a particular user, we can do the following:

```python
import snscrape.modules.twitter as sntwitter
import pandas as pd

# Created a list to append all tweet attributes (data)
attributes_container = []

# Using TwitterSearchScraper to scrape data and append tweets to list
for i, tweet in enumerate(sntwitter.TwitterSearchScraper('from:john').get_items()):
    if i > 100:
        break
    attributes_container.append([tweet.date, tweet.likeCount, tweet.sourceLabel, tweet.content])

# Creating a dataframe from the tweets list above
tweets_df = pd.DataFrame(attributes_container, columns=["Date Created", "Number of Likes", "Source of Tweet", "Tweets"])
```

Let's go over some of the code that you might not understand at first glance:

```python
for i, tweet in enumerate(sntwitter.TwitterSearchScraper('from:john').get_items()):
    if i > 100:
        break
    attributes_container.append([tweet.date, tweet.likeCount, tweet.sourceLabel, tweet.content])

# Creating a dataframe from the tweets list above
tweets_df = pd.DataFrame(attributes_container, columns=["Date Created", "Number of Likes", "Source of Tweet", "Tweets"])
```

In the code above, sntwitter.TwitterSearchScraper returns an object of tweets from the name of the user we passed into it (in this case, john).

As I mentioned earlier, Snscrape has no limit on the number of tweets, so it would return however many tweets that user has. To handle this, we add the enumerate function, which iterates through the object and adds a counter so we can access only the most recent 100 tweets from the user.

You can see that the attribute syntax we use on each tweet looks similar to the one from Tweepy. These are the attributes we can get from a Snscrape tweet, as curated by Martin Beck:

[Image: list of Snscrape tweet attributes. Credit: Martin Beck]

More attributes might be added, as the Snscrape library is still under development. For instance, in the image above, source has been replaced with sourceLabel. If you pass in only source, it returns an object.

If you run the code above, you should also see something like this:

[Image: resulting dataframe. Image by author]

Now let's do the same for scraping by search.

How to Scrape Tweets from a Text Search with Snscrape

```python
import snscrape.modules.twitter as sntwitter
import pandas as pd

# Creating a list to append tweet data to
attributes_container = []

# Using TwitterSearchScraper to scrape data and append tweets to list
for i, tweet in enumerate(sntwitter.TwitterSearchScraper('sex for grades since:2021-07-05 until:2022-07-06').get_items()):
    if i > 150:
        break
    attributes_container.append([tweet.user.username, tweet.date, tweet.likeCount, tweet.sourceLabel, tweet.content])

# Creating a dataframe to load the list
tweets_df = pd.DataFrame(attributes_container, columns=["User", "Date Created", "Number of Likes", "Source of Tweet", "Tweet"])
```

Again, you can access a lot of historical data with Snscrape (unlike Tweepy, whose standard API cannot go back more than 7 days; the premium API allows 30 days). So we can pass the date we want the search to start from and the date we want it to end into the sntwitter.TwitterSearchScraper() method.

What we did in the code above is basically what we discussed earlier. The only thing to bear in mind is that until works like Python's range function (that is, it excludes the last date). So if you want to get tweets from today, you need to include the day after today in the "until" parameter.
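Because until is exclusive, one way to always capture today's tweets is to compute tomorrow's date programmatically instead of hard-coding it. A minimal sketch (the search text and since date are just placeholders reused from the example above):

```python
from datetime import date, timedelta

# "until" is exclusive, so use tomorrow's date to include today's tweets
tomorrow = date.today() + timedelta(days=1)

# Build the query string the same way the Snscrape example above does
query = f"sex for grades since:2021-07-05 until:{tomorrow.isoformat()}"
print(query)
```

You could then pass query straight into sntwitter.TwitterSearchScraper(query).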

[Image: resulting dataframe. Image by author]

With that, you now also know how to scrape tweets with Snscrape!

When to Use Each Approach

Now that we've seen how each method works, you might be wondering when to use which.

Well, there is no universal rule for when to use each method. It all comes down to preference and your use case.

If you want to fetch an unlimited number of tweets, you should use Snscrape. But if you need features that Snscrape cannot provide (like geolocation, for example), then you should definitely use Tweepy. It is integrated directly with the Twitter API and provides complete functionality.

Even so, Snscrape is the most commonly used method for basic scraping.

Conclusion

In this article, we learned how to scrape data from Twitter using Tweepy and Snscrape in Python. But this was only a brief overview of how each approach works. You can learn more by exploring the web for additional information.

I've included some useful resources you can turn to if you need more information. Thank you for reading.

Source: https://www.freecodecamp.org/news/python-web-scraping-tutorial/

#python #web