Are all spiders supposed to use +http in their user agent string?


Here are some spider user agent strings I've seen recently. They all seem to include a URL prefixed with +:
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; meanpathbot/1.0; +http://www.meanpath.com/meanpathbot.html)
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Is that just a convention that most spiders follow, or is it specified somewhere? I couldn't find a specification.

It's just a convention that some spiders follow. There is no constraint on what people can put in a user agent header.
Take a look at this list of user agents that contain "GoogleBot". You'll notice that many of these don't contain "+http".
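To see how loose the convention is, the following sketch pulls a "+http" URL out of a user agent string with a regular expression and returns None when there isn't one. The function name and the pattern are illustrative assumptions; nothing in HTTP defines this format.

```python
import re

# Matches a URL introduced by "+", e.g. "+http://www.google.com/bot.html".
# The pattern is a heuristic; no specification defines this format.
_BOT_URL = re.compile(r"\+(https?://[^\s;)]+)")


def bot_info_url(user_agent):
    """Return the '+http...' URL advertised in a user agent string, or None."""
    match = _BOT_URL.search(user_agent)
    return match.group(1) if match else None


if __name__ == "__main__":
    samples = [
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)",
        "Googlebot-Image/1.0",
    ]
    for ua in samples:
        print(ua, "->", bot_info_url(ua))
```

A crawler that sends no such URL is still a perfectly valid HTTP client, so code like this should treat None as "unknown", not as "not a bot".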

Related

$remote_port in Nginx log file changes each time even though it's the same visitor

The Nginx log file shows that $remote_port (the client port) changes every time, even when the same visitor is visiting the same site:
180.163.220.3 - - [19/Jan/2020:14:18:07 +0800] "GET /home/images/logo.svg HTTP/2.0" 200 4997 "https://www.powerprocess.cn/home/index.php" "Mozilla/5.0 (Linux; U; Android 8.1.0; zh-CN; EML-AL00 Build/HUAWEIEML-AL00) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.108 baidu.sogo.uc.UCBrowser/11.9.4.974 UWS/2.13.1.48 Mobile Safari/537.36 AliApp(DingTalk/4.5.11) com.alibaba.android.rimet/10487439 Channel/227200 language/zh-CN"
180.163.220.3 - - [19/Jan/2020:14:18:07 +0800] "GET /home/images/banner.svg HTTP/2.0" 200 25161 "https://www.powerprocess.cn/home/index.php" "Mozilla/5.0 (Linux; U; Android 8.1.0; zh-CN; EML-AL00 Build/HUAWEIEML-AL00) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.108 baidu.sogo.uc.UCBrowser/11.9.4.974 UWS/2.13.1.48 Mobile Safari/537.36 AliApp(DingTalk/4.5.11) com.alibaba.android.rimet/10487439 Channel/227200 language/zh-CN"
180.163.220.3 - - [19/Jan/2020:14:18:07 +0800] "GET /home/images/about-us-team.svg HTTP/2.0" 200 58413 "https://www.powerprocess.cn/home/index.php" "Mozilla/5.0 (Linux; U; Android 8.1.0; zh-CN; EML-AL00 Build/HUAWEIEML-AL00) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.108 baidu.sogo.uc.UCBrowser/11.9.4.974 UWS/2.13.1.48 Mobile Safari/537.36 AliApp(DingTalk/4.5.11) com.alibaba.android.rimet/10487439 Channel/227200 language/zh-CN"
180.163.220.3 - - [19/Jan/2020:14:18:07 +0800] "GET /home/images/planning.svg HTTP/2.0" 200 10871 "https://www.powerprocess.cn/home/index.php" "Mozilla/5.0 (Linux; U; Android 8.1.0; zh-CN; EML-AL00 Build/HUAWEIEML-AL00) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.108 baidu.sogo.uc.UCBrowser/11.9.4.974 UWS/2.13.1.48 Mobile Safari/537.36 AliApp(DingTalk/4.5.11) com.alibaba.android.rimet/10487439 Channel/227200 language/zh-CN"
As you can see, the port keeps changing, which I didn't expect. Can anyone suggest the reason? Or is it that, although the visitor is requesting the same URI, the browser actually makes multiple requests to fetch the resources, and the client port changes with each request?
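The guess at the end of the question matches how TCP normally behaves: the client's operating system picks an ephemeral source port for each new connection, so a client that opens several connections shows up with several ports. A minimal sketch using only local sockets (the helper name client_ports is made up for illustration):

```python
import socket


def client_ports(n=3):
    """Open n simultaneous connections to a local listener and return the
    client-side (source) port the server sees for each one."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(n)
    address = server.getsockname()

    clients, accepted, ports = [], [], []
    try:
        for _ in range(n):
            clients.append(socket.create_connection(address))
            conn, peer = server.accept()
            accepted.append(conn)
            ports.append(peer[1])  # the value a server would log as the remote port
    finally:
        for sock in clients + accepted + [server]:
            sock.close()
    return ports


if __name__ == "__main__":
    print(client_ports())
```

Requests that share one connection (keep-alive or HTTP/2 multiplexing) would share one port, so a changing port suggests the requests arrived over separate connections.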

Change user-agent for Symfony Panther Chromeclient

How do I change the user-agent in a headless Chrome created by Symfony's Panther createChromeClient()?
When I create a Chrome client with
$client = \Symfony\Component\Panther\Client::createChromeClient();
I see in the access_log a user-agent of
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/77.0.3865.90 Safari/537.36"
I searched for solutions and think I have to change the user-agent string via Chrome's command-line arguments, but I can't find the right way, because the answers on the web aren't for PHP or Panther.
Cheers!

I found it:
$client = \Symfony\Component\Panther\Client::createChromeClient(null, [
    '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
    '--window-size=1200,1100',
    '--headless',
    '--disable-gpu',
]);
This question gave me the idea.

Change IP address dynamically?

Consider this case:
I want to crawl websites frequently, but my IP address gets blocked after a few days or once some limit is reached.
How can I change my IP address dynamically? Or are there any other ideas?

An approach using Scrapy makes use of two components, RandomProxy and RotateUserAgentMiddleware.
Modify DOWNLOADER_MIDDLEWARES in settings.py as follows to insert the new components:
# Recent Scrapy releases use the scrapy.downloadermiddlewares.* paths below;
# very old releases used scrapy.contrib.downloadermiddleware.* instead.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'tutorial.randomproxy.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    # Disable the built-in user agent middleware in favour of the rotating one.
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'tutorial.spiders.rotate_useragent.RotateUserAgentMiddleware': 400,
}
Random Proxy
You can use scrapy-proxies. This component will process Scrapy requests using a random proxy from a list to avoid IP ban and improve crawling speed.
You can build up your proxy list from a quick internet search. Copy the proxy addresses into the list.txt file in the URL format the component expects.
Rotation of user agent
For each scrapy request a random user agent will be used from a list you define in advance:
import logging
import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Pick a random user agent for every outgoing request.
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)
            # Add desired logging message here.
            spider.log(
                u'User-Agent: {} {}'.format(request.headers.get('User-Agent'), request),
                level=logging.DEBUG
            )

    # The default user_agent_list covers Chrome, IE, Firefox, Mozilla, Opera, Netscape.
    # More user agent strings can be found at http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
More details here.
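Outside Scrapy, the same idea (a random proxy and a random user agent per request) can be sketched with the standard library alone. The proxy addresses below are placeholders and build_request is a hypothetical helper, so treat this as an outline, not a drop-in solution.

```python
import random
import urllib.request

# Placeholder values: replace with proxies and user agents you actually use.
PROXIES = ["http://127.0.0.1:8080", "http://127.0.0.1:8081"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]


def build_request(url):
    """Return (opener, request) using a randomly chosen proxy and user agent."""
    proxy = random.choice(PROXIES)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    request = urllib.request.Request(
        url, headers={"User-Agent": random.choice(USER_AGENTS)}
    )
    return opener, request


if __name__ == "__main__":
    opener, request = build_request("http://example.com")
    # urllib stores header names capitalized as "User-agent".
    print(request.get_header("User-agent"))
    # opener.open(request, timeout=10) would send it through the chosen proxy.
```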

You can try using proxy servers to avoid being blocked. There are services that provide working proxies. The best I have tried is https://gimmeproxy.com, which frequently checks its proxies against various parameters.
To get a proxy from them, you just need to make the following request:
https://gimmeproxy.com/api/getProxy
They return a JSON response with all the proxy data, which you can use later as needed:
{
  "supportsHttps": true,
  "protocol": "socks5",
  "ip": "179.162.22.82",
  "port": "36915",
  "get": true,
  "post": true,
  "cookies": true,
  "referer": true,
  "user-agent": true,
  "anonymityLevel": 1,
  "websites": {
    "example": true,
    "google": false,
    "amazon": true
  },
  "country": "BR",
  "tsChecked": 1517952910,
  "curl": "socks5://179.162.22.82:36915",
  "ipPort": "179.162.22.82:36915",
  "type": "socks5",
  "speed": 37.78,
  "otherProtocols": {}
}
You can use it like this with Curl:
curl -x socks5://179.162.22.82:36915 http://example.com
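To use such a response from code instead of curl, the protocol, ip, and port fields can be combined into a proxy URL. A small sketch, assuming the response has the shape of the sample above (proxy_url is an illustrative name, and no request is actually sent here):

```python
import json

# Trimmed copy of the sample response shown above.
SAMPLE = """
{
  "protocol": "socks5",
  "ip": "179.162.22.82",
  "port": "36915",
  "supportsHttps": true
}
"""


def proxy_url(response_text):
    """Build 'protocol://ip:port' from a getProxy-style JSON response."""
    data = json.loads(response_text)
    return "{}://{}:{}".format(data["protocol"], data["ip"], data["port"])


if __name__ == "__main__":
    print(proxy_url(SAMPLE))
```

For the sample this yields the same value as the response's own "curl" field, so in practice you could read that field directly if the service keeps providing it.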

If you are using R, you could do the web crawling through Tor. I think Tor changes its IP address every 10 minutes(?) automatically. I believe there is a way to force Tor to change the IP at shorter intervals, but that didn't work for me. Instead you could set up multiple instances of Tor and then switch between the independent instances (a good explanation of how to set up multiple instances of Tor is here: https://tor.stackexchange.com/questions/2006/how-to-run-multiple-tor-browsers-with-different-ips).
After that you could do something like the following in R, using the ports of your independent Tor instances and a list of user agents. Every time you call the getURL function, cycle through your list of ports and user agents.
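library(RCurl)

# Placeholder values: substitute the SOCKS ports of your own Tor instances,
# your own user agent strings, the URL to fetch, and the options you need.
port <- c(9050, 9052)
proxy <- paste("socks5h://127.0.0.1:", port, sep="")
ua <- c("first user agent string", "second user agent string")
url <- "http://example.com"

# Pick one proxy and one user agent at random for this request.
opt <- list(proxy=sample(proxy, 1),
            useragent=sample(ua, 1),
            followlocation=TRUE,
            referer="",
            timeout=30,
            verbose=FALSE,
            ssl.verifypeer=TRUE)
webpage <- getURL(url=url, .opts=opt)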
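The cycling described above can also be done in strict rotation instead of random sampling. A Python sketch of that rotation, assuming local Tor instances listening on SOCKS ports 9050 and 9052 (both port numbers and the user agent strings are placeholders, and identities is a hypothetical helper):

```python
from itertools import cycle


def identities(ports, user_agents):
    """Yield (proxy URL, user agent) pairs in round-robin order, forever."""
    proxies = ["socks5h://127.0.0.1:{}".format(port) for port in ports]
    # zip over two endless cycles never stops; take one pair per request.
    return zip(cycle(proxies), cycle(user_agents))


if __name__ == "__main__":
    rotation = identities([9050, 9052], ["agent-a", "agent-b", "agent-c"])
    for _ in range(4):
        print(next(rotation))
```

Using lists of different lengths, as here, means the proxy and user agent pairings shift over time instead of repeating the same pairs.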

Some VPN applications let you automatically switch to a new random IP address at a set interval, such as every 2 minutes. Both HMA! Pro VPN and VPN4ALL support this feature.

A word of warning about VPNs: check their terms and conditions carefully, because scraping through them can go against their user policy (Astrill is one such example). I tried a scraping tool and got my account locked.

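If you have several public IPs, add them to your interface and, if you are using Linux, use iptables to switch between them.
Sample iptables rules for two IPs:
iptables -t nat -A POSTROUTING -m statistic --mode random --probability 0.5 -j SNAT --to-source 192.168.0.2
iptables -t nat -A POSTROUTING -j SNAT --to-source 192.168.0.3
Rules are evaluated in order, so the first rule takes half of the new connections and the second rule, which has no probability match, takes the rest. (Giving both rules a probability of 0.5 would leave about a quarter of the connections without SNAT.) If you have 4 IPs, the probabilities become 0.25, 0.3333, and 0.5, followed by an unconditional last rule.
You can also set up your own proxy server in a few simple steps. These rules will then allow the proxy server to switch its outgoing IPs.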
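Because iptables evaluates rules in order, a rule's --probability applies only to the connections that earlier rules did not match. To give N addresses an equal share, rule k (counting from 0) needs probability 1/(N-k), with the last rule matching everything that is left. The arithmetic can be checked with a short sketch (the function names are illustrative):

```python
from fractions import Fraction


def rule_probabilities(n):
    """Per-rule --probability values that give n addresses equal shares."""
    return [Fraction(1, n - k) for k in range(n)]


def effective_shares(probabilities):
    """Fraction of all connections that each rule ends up matching."""
    shares, remaining = [], Fraction(1)
    for p in probabilities:
        shares.append(remaining * p)  # matched by this rule
        remaining *= 1 - p            # falls through to the next rule
    return shares


if __name__ == "__main__":
    probs = rule_probabilities(4)
    print([str(p) for p in probs])
    print([str(s) for s in effective_shares(probs)])
    # Two rules that both use 0.5 leave part of the traffic unmatched:
    print([str(s) for s in effective_shares([Fraction(1, 2)] * 2)])
```

The last line shows why repeating the same probability on every rule does not split traffic evenly: the second rule only sees what the first one let through.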

Detect Computer or Mobile Device in ASP.Net

When a user visits your site, is there a way of knowing what type of device they are using: computer, tablet, or mobile device? For example, when I get emails I sometimes see that they say "sent via iPhone"; how do they know that? Any help or a pointer in the right direction would be much appreciated.

Short answer: You can't.
Long answer:
All you have is the information in the HTTP User-Agent header, which usually contains the OS name and version.
Usually, browsers running on Mac OS and Linux send enough information to identify the exact OS. For example, here's my User-Agent header:
Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.7) Gecko/2009030423 Ubuntu/8.10 (intrepid) Firefox/3.0.7
You can see that I'm running Ubuntu 8.10 Intrepid Ibex.
And here's what Firefox and Safari 4 Beta report on my MacBook Pro:
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.7) Gecko/2009021906 Firefox/3.0.7
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6; en-us) AppleWebKit/528.16 (KHTML, like Gecko) Version/4.0 Safari/528.16
Windows browsers, on the other hand, usually only report the OS version and not the specific package (Pro, Business, etc.):
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:x.x.x) Gecko/20041107 Firefox/x.x
I did not write this answer; I copied it from "Detect exact OS version from browser". Thanks to Can Berk Güder.
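Since the User-Agent header is all there is, a common and imperfect approach is to look for device keywords in it. A deliberately crude sketch of that kind of guess follows; the keyword lists are assumptions and far from complete, and any client can send a misleading header.

```python
def guess_device(user_agent):
    """Guess 'tablet', 'mobile', or 'desktop' from a User-Agent string.

    Heuristic only: the header is client-controlled and the keyword
    lists are illustrative, not exhaustive.
    """
    ua = user_agent.lower()
    if "ipad" in ua or "tablet" in ua:
        return "tablet"
    if any(word in ua for word in ("iphone", "mobile", "blackberry")):
        return "mobile"
    if "android" in ua:
        # Android user agents without "Mobile" are commonly tablets.
        return "tablet"
    return "desktop"


if __name__ == "__main__":
    print(guess_device(
        "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.7) "
        "Gecko/2009030423 Ubuntu/8.10 (intrepid) Firefox/3.0.7"
    ))
```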

Where to get this browser info?

I'm working on some logging functionality for a website, and I need to update a table with information about the users' browser info. The table was created a while ago by someone else, and I have no idea where they were getting this info. Does anyone recognize this? (each line is a single row)
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; GTB0.0; .NET CLR 1.1.4322; InfoPath.1; .NET CLR 2.0.50727)
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; FunWebProducts; FBSMTWB; SIMBAR={8FCFDD51-4B26-489E-A39E-AB2744B
Java/1.6.0_06
Opera/9.80 (Windows NT 5.1; U; en) Presto/2.5.24 Version/10.53
BlackBerry9630/4.7.1.61 Profile/MIDP-2.0 Configuration/CLDC-1.1 VendorID/105

It's in the HTTP User-Agent header.

Here is the code you would use in ASP.
Dim sAgent: sAgent = Request.ServerVariables("HTTP_USER_AGENT")
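The same header is available in other server environments. For comparison, here is a minimal Python WSGI sketch that reads it; WSGI exposes request headers as HTTP_-prefixed keys in environ, and the application name app is arbitrary.

```python
def app(environ, start_response):
    """Minimal WSGI application that echoes the caller's User-Agent header."""
    agent = environ.get("HTTP_USER_AGENT", "")  # empty if the header is absent
    body = agent.encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

For logging, the value of agent would be written to the table as-is, which matches the rows shown in the question.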
