Web pages and crawler packages on FB or IG

FB_IG_crawler

Project origin

A friend's job requires collecting, every day (or weekly), the follower counts of competitor FB and IG fan pages and how many posts they made that day, as input for data analysis.

The catch: the number of FB and IG fan pages to track together exceeds 200 (really crazy, this is a real case).

Doing it by hand means clicking into each page, then copying and pasting the numbers into Excel.

Repeating that action two hundred times is exhausting just to think about (and it has to be redone every time), so he asked me how these steps could be automated.

Project goal

  1. Successfully crawl FB and IG data
  2. Open a virtual web page for the crawler to run in
  3. Parse the needed information out of the crawled pages
  4. Write the captured data into a cloud Google Sheet
  5. Execute daily (optional)
  6. Send a LINE notification after execution (optional)

Project technology

  1. nodejs
  2. npm packages for virtual web pages and crawling
  3. Google Cloud related APIs
  4. cron scheduling basics
  5. LINE bot basics

Project writing logic

FB pages cannot be searched with jQuery, and IG requires logging in first.

For these two reasons, FB and IG block ordinary crawler programs; the crawl must run in an actual browser page.

So this time, selenium-webdriver is used to handle it.

FB & IG crawler ideas

  1. Log in with a personal account
  2. Jump to each fan page and capture the follower count
  3. Scan the posts on the current page and count any published today
  4. If the current page still shows posts from today, trigger a scroll-down command and keep searching
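Steps 3 and 4 hinge on deciding whether a post was published today. A minimal sketch of that check, as a hypothetical helper (not the project's actual code):

```javascript
// Hypothetical helper: returns true when a post's timestamp falls on the
// same calendar day as "now", so the crawler knows whether to keep
// scrolling down for more posts (step 4 above).
function isPostedToday(postDate, now = new Date()) {
  return (
    postDate.getFullYear() === now.getFullYear() &&
    postDate.getMonth() === now.getMonth() &&
    postDate.getDate() === now.getDate()
  );
}

// Compare post times against a fixed "now" (2024-01-15 23:20):
const fixedNow = new Date(2024, 0, 15, 23, 20);
console.log(isPostedToday(new Date(2024, 0, 15, 9, 0), fixedNow)); // true
console.log(isPostedToday(new Date(2024, 0, 14, 9, 0), fixedNow)); // false
```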

google sheet operation ideas

  1. Check whether the sheet exists (create it if it does not)
  2. Check whether all fan page titles are in the sheet (write them in if they are not)
  3. Sort the program's internal data to match the order listed in the Google Sheet
  4. Add a new column (named with a timestamp) and paste in the crawler data
  5. Adjust the Google Sheet column widths
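Step 3 above, keeping the program's data in the same order as the sheet, can be sketched as a small pure function (hypothetical names, not the project's actual code):

```javascript
// Hypothetical helper: given the fan-page titles already listed in the
// Google Sheet and a map of crawled results, produce the new column's
// values in sheet order. Pages with no result get an empty cell.
function alignToSheetOrder(sheetTitles, results) {
  return sheetTitles.map((title) => (title in results ? results[title] : ''));
}

const sheetTitles = ['PageA', 'PageB', 'PageC'];
const crawled = { PageC: 300, PageA: 100 };
console.log(alignToSheetOrder(sheetTitles, crawled)); // [ 100, '', 300 ]
```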

Scheduling logic

  1. Execute daily (currently defaults to 23:20)
  2. Runs in the background without daily manual intervention
  3. Sends a LINE notification after execution
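With the cron package, the 23:20 default would be expressed as a pattern like `'20 23 * * *'` (minute 20, hour 23, every day). As a plain-JavaScript illustration of what that pattern matches (a hypothetical helper, not the project's code):

```javascript
// Mimics what a daily cron pattern like '20 23 * * *' matches:
// the job fires once per day, at hour 23 and minute 20.
function matchesDailySchedule(date, hour = 23, minute = 20) {
  return date.getHours() === hour && date.getMinutes() === minute;
}

console.log(matchesDailySchedule(new Date(2024, 0, 15, 23, 20))); // true
console.log(matchesDailySchedule(new Date(2024, 0, 15, 23, 21))); // false
```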

Prerequisite installation

NVM: Node.js version management

node.js Windows download page, Windows official introduction

On Mac, it is recommended to install Homebrew first

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"
After downloading, install nvm

brew install nvm
nvm install 12.6

git: source version control. Windows download page, Mac download page, official introduction

yarn: makes installing packages more convenient. Windows download page, Mac download page, official introduction

VSCode: an IDE is recommended for writing the program. Download page

If you are using Windows

You need to download chromedriver. Its version must match your Chrome version. Put the driver in the root directory of this project.

Download this project

git clone https://github.com/dean9703111/FB_IG_crawler.git

cd FB_IG_crawler

Install the packages

yarn

Package descriptions

  1. dotenv: loads the .env environment configuration file
  2. googleapis: accesses the Google Sheets related APIs
  3. selenium-webdriver: the focus of this project; creates virtual web pages and runs the crawler
  4. cron: runs the schedule
  5. forever-monitor: monitors project execution
  6. dateformat: normalizes timestamps
  7. xmlhttprequest: issues HTTP GET/POST requests
  8. to-boolean: true/false values read from .env arrive as strings, so this package is needed to convert them
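Item 8 exists because dotenv hands every value back as a string. A quick plain-JavaScript illustration of the pitfall (not the project's code):

```javascript
// Every value read from .env is a string, so the string 'false' is truthy.
const raw = 'false'; // what process.env.USE_CRON would contain

console.log(Boolean(raw)); // true, so the naive conversion is wrong

// A manual conversion (this is what the to-boolean package handles for you):
const useCron = raw === 'true';
console.log(useCron); // false
```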

Set your own environment variables

Please make a copy of .env.example and rename it to .env, then fill in the SPREADSHEET_ID of your target sheet and add your own FB and IG account credentials.
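A sketch of what the resulting .env might look like. SPREADSHEET_ID and USE_CRON are mentioned elsewhere in this README; the account and password key names below are assumptions, so use the names that actually appear in .env.example:

```
# SPREADSHEET_ID and USE_CRON come from this README; the other key names
# are guesses for illustration only.
SPREADSHEET_ID=your_google_sheet_id
FB_ACCOUNT=your_fb_account
FB_PASSWORD=your_fb_password
IG_ACCOUNT=your_ig_account
IG_PASSWORD=your_ig_password
USE_CRON=true
```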

Modify the JSON files to the titles and URLs of your crawl targets

Please copy ex_fb.json and ex_ig.json in the json folder, rename them to fb.json and ig.json, and change the title and url fields to the fan pages you want to crawl.
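The exact shape comes from ex_fb.json and ex_ig.json; as a hypothetical illustration, each entry would pair a title with a url, along these lines:

```json
[
  { "title": "ExamplePage", "url": "https://www.facebook.com/ExamplePage" },
  { "title": "AnotherPage", "url": "https://www.facebook.com/AnotherPage" }
]
```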

Start node && enable Google Sheets authorization

Please follow the Google Sheets API tutorial to complete the application.

The name is up to you; for the type, select Desktop App, then click “DOWNLOAD CLIENT CONFIGURATION”.

After downloading, put the new credentials file into this project's google_key folder.

Then run the following command

node index.js

A URL will then pop up for you to obtain an authentication code. After you copy and paste it, the program can write data into Google Sheets.

Windows schedule setting (I later wrote a cron.js schedule, so this section is just for extra knowledge)

The screenshots basically walk you through the setup, so you can mostly ignore the English text.

Schedule setting

If you want the crawler to run automatically every day, you need to install the forever package globally (and remember to set USE_CRON=true in .env)

npm install forever -g

Next, run this command in the project directory; it will keep running in the background

forever start index.js

List the schedules currently being executed

forever list 

Stop all currently executing schedules

forever stopall 

Obtaining a LINE Notify token

Apply for a LINE Notify token at this website.

This blog post also covers other LINE Notify applications!

Error handling

  1. GaxiosError: Insufficient Permission — appears when the Google Sheets permissions are insufficient
  2. Chrome not reachable — appears when the Chrome version does not match the chromedriver version
  3. For other Selenium WebDriver errors, see reference resource 1 and reference resource 2

Download Details:

Author: dean9703111

GitHub: https://github.com/dean9703111/FB_IG_crawler

#nodejs #node #javascript
