“The User-Agent request header is a characteristic string that lets servers and network peers identify the application, operating system, vendor, and/or version of the requesting user agent.” ― MDN web docs
To reach this goal, we are going to randomly select a valid User-Agent from a file containing a list of valid User-Agent strings.
First, we need such a file, with one User-Agent string per line. Second, we have to read it and pick a random line. This can be achieved with the following function:
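<?php
function getRandomUserAgent() {
    // default User-Agent, used when the list file is missing or empty
    $userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0";
    // reading a randomly chosen User-Agent string from the User-Agent list file
    if (is_readable("user_agents.txt")) {
        // file() loads every line and closes the file for us;
        // trimming and filtering drops line endings and blank lines
        $userAgents = array_filter(array_map("trim", file("user_agents.txt")));
        // array_rand() fails on an empty array, so only pick when entries exist
        if (!empty($userAgents)) {
            $userAgent = $userAgents[array_rand($userAgents)];
        }
    }
    return trim($userAgent);
}
?>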
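To show where the selected string ends up, here is a minimal sketch that attaches a User-Agent to an HTTP request through a PHP stream context. The helper name createContextWithUserAgent, the 10-second timeout, and the example.com URL are assumptions made for illustration; they are not part of the original script, which may send its requests differently.

```php
<?php
// Hypothetical helper (not from the original script): builds a stream
// context that sends the given User-Agent header with HTTP(S) requests.
function createContextWithUserAgent(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'user_agent' => $userAgent,
            'timeout'    => 10, // seconds; arbitrary value chosen for this sketch
        ],
    ]);
}

// Usage (placeholder URL, left commented out so the sketch makes no request):
// $context = createContextWithUserAgent(getRandomUserAgent());
// $html = file_get_contents('https://example.com/', false, $context);
```

Calling getRandomUserAgent() once per request, rather than once per script run, is what makes the header vary from one request to the next.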
To implement the IP rotation, we are going to use a proxy server.
“A proxy server is basically another computer which serves as a hub through which internet requests are processed. By connecting through one of these servers, your computer sends your requests to the server which then processes your request and returns what you were wanting. Moreover, in this way it serves as an intermediary between your home machine and the rest of the computers on the internet.” ―What Is My IP?
When using a proxy, the website we are making the request to sees the IP address of the proxy server, not ours. This lets us scrape the target website without exposing our own IP address, which greatly reduces the risk of it being banned or blocked.
Using a single proxy means that the proxy's IP address can be banned, interrupting our script. To avoid this, we would need to build a pool of proxies to route our requests through. Instead, we are going to use the Tor proxy. If you are not familiar with Tor, reading the following article is highly recommended: How Does Tor Really Work?
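As a rough sketch of what routing a request through Tor can look like in PHP, the snippet below points cURL at a local Tor client. It assumes the PHP cURL extension is installed and that Tor is running locally on its default SOCKS port, 127.0.0.1:9050. The function names buildTorCurlOptions and fetchThroughTor and the 30-second timeout are hypothetical choices for this example, not the original script's code.

```php
<?php
// Sketch only: assumes a local Tor client listening on 127.0.0.1:9050
// and the PHP cURL extension. Function names are illustrative.
function buildTorCurlOptions(string $userAgent, string $proxy = '127.0.0.1:9050'): array
{
    return [
        CURLOPT_PROXY          => $proxy,
        // SOCKS5 with hostname resolution done by the proxy, so DNS
        // lookups also go through Tor instead of the local resolver
        CURLOPT_PROXYTYPE      => CURLPROXY_SOCKS5_HOSTNAME,
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_TIMEOUT        => 30,   // seconds; arbitrary value for this sketch
    ];
}

function fetchThroughTor(string $url, string $userAgent)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, buildTorCurlOptions($userAgent));
    // curl_exec() returns the response body as a string, or false on failure
    return curl_exec($ch);
}

// Usage (placeholder URL, left commented out so the sketch makes no request):
// $html = fetchThroughTor('https://example.com/', getRandomUserAgent());
```

Keeping the options in a separate function makes it easy to inspect or adjust them without sending a request.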