The Nginx log shows that $remote_port (the client port) changes on every request, even when the same visitor is visiting the same site:
180.163.220.3 - - [19/Jan/2020:14:18:07 +0800] "GET /home/images/logo.svg HTTP/2.0" 200 4997 "https://www.powerprocess.cn/home/index.php" "Mozilla/5.0 (Linux; U; Android 8.1.0; zh-CN; EML-AL00 Build/HUAWEIEML-AL00) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.108 baidu.sogo.uc.UCBrowser/11.9.4.974 UWS/2.13.1.48 Mobile Safari/537.36 AliApp(DingTalk/4.5.11) com.alibaba.android.rimet/10487439 Channel/227200 language/zh-CN"
180.163.220.3 - - [19/Jan/2020:14:18:07 +0800] "GET /home/images/banner.svg HTTP/2.0" 200 25161 "https://www.powerprocess.cn/home/index.php" "Mozilla/5.0 (Linux; U; Android 8.1.0; zh-CN; EML-AL00 Build/HUAWEIEML-AL00) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.108 baidu.sogo.uc.UCBrowser/11.9.4.974 UWS/2.13.1.48 Mobile Safari/537.36 AliApp(DingTalk/4.5.11) com.alibaba.android.rimet/10487439 Channel/227200 language/zh-CN"
180.163.220.3 - - [19/Jan/2020:14:18:07 +0800] "GET /home/images/about-us-team.svg HTTP/2.0" 200 58413 "https://www.powerprocess.cn/home/index.php" "Mozilla/5.0 (Linux; U; Android 8.1.0; zh-CN; EML-AL00 Build/HUAWEIEML-AL00) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.108 baidu.sogo.uc.UCBrowser/11.9.4.974 UWS/2.13.1.48 Mobile Safari/537.36 AliApp(DingTalk/4.5.11) com.alibaba.android.rimet/10487439 Channel/227200 language/zh-CN"
180.163.220.3 - - [19/Jan/2020:14:18:07 +0800] "GET /home/images/planning.svg HTTP/2.0" 200 10871 "https://www.powerprocess.cn/home/index.php" "Mozilla/5.0 (Linux; U; Android 8.1.0; zh-CN; EML-AL00 Build/HUAWEIEML-AL00) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.108 baidu.sogo.uc.UCBrowser/11.9.4.974 UWS/2.13.1.48 Mobile Safari/537.36 AliApp(DingTalk/4.5.11) com.alibaba.android.rimet/10487439 Channel/227200 language/zh-CN
As you can see, the port keeps changing, which I did not expect. Can anyone suggest the reason? Or is it that, although the visitor is viewing a single URI, the browser is actually making multiple requests to fetch the page's resources, and the client port changes with each request?
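One way to check the hypothesis above is to open several TCP connections from one client and look at the source port the server sees for each. The following is a minimal sketch using only Python's standard library against a throwaway local listener; the function name `observed_client_ports` is made up for illustration, and nothing here is specific to Nginx. A client normally gets a fresh ephemeral port for every new TCP connection, so requests arriving on different connections would be expected to log different `$remote_port` values, while requests sharing one keep-alive or HTTP/2 connection would share a port.

```python
import socket


def observed_client_ports(n=3):
    """Open n TCP connections to a local listener and return the client
    port the server sees for each one."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(n)
    address = server.getsockname()

    clients, ports = [], []
    try:
        for _ in range(n):
            # Keep every client socket open so its port stays in use.
            clients.append(socket.create_connection(address))
            conn, peer = server.accept()
            ports.append(peer[1])  # peer is (client_ip, client_port)
            conn.close()
    finally:
        for client in clients:
            client.close()
        server.close()
    return ports


if __name__ == "__main__":
    print(observed_client_ports())
```

Each connection in the returned list reports a different client port, even though the client IP address is the same throughout.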
Consider this case:
I want to crawl websites frequently, but my IP address gets blocked after a few days or after some request limit.
So how can I change my IP address dynamically? Or are there any other ideas?
An approach using Scrapy makes use of two components, RandomProxy and RotateUserAgentMiddleware.
Modify DOWNLOADER_MIDDLEWARES in settings.py as follows to insert the new components:
DOWNLOADER_MIDDLEWARES = {
    # Scrapy 1.0 and later use scrapy.downloadermiddlewares; older releases used scrapy.contrib.downloadermiddleware.
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'tutorial.randomproxy.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    # Disable the built-in user agent middleware so the rotating one takes over.
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'tutorial.spiders.rotate_useragent.RotateUserAgentMiddleware': 400,
}
Random Proxy
You can use scrapy-proxies. This component processes Scrapy requests through a random proxy from a list, to avoid IP bans and improve crawling speed.
You can build your proxy list from a quick internet search. Copy the proxy addresses into the list.txt file, following the URL format the component expects.
Rotation of user agent
For each Scrapy request, a random user agent is picked from a list you define in advance:
import logging
import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Pick a random user agent for each outgoing request.
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)
            # Add desired logging message here.
            spider.log(
                u'User-Agent: {} {}'.format(request.headers.get('User-Agent'), request),
                level=logging.DEBUG
            )

    # Sample user agent strings; more can be found at
    # http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
You can try using proxy servers to avoid being blocked. There are services that provide working proxies. The best I have tried is https://gimmeproxy.com, which frequently checks its proxies against various parameters.
To get a proxy from them, you just need to make the following request:
https://gimmeproxy.com/api/getProxy
They return a JSON response with all the proxy data, which you can then use as needed:
{
  "supportsHttps": true,
  "protocol": "socks5",
  "ip": "179.162.22.82",
  "port": "36915",
  "get": true,
  "post": true,
  "cookies": true,
  "referer": true,
  "user-agent": true,
  "anonymityLevel": 1,
  "websites": {
    "example": true,
    "google": false,
    "amazon": true
  },
  "country": "BR",
  "tsChecked": 1517952910,
  "curl": "socks5://179.162.22.82:36915",
  "ipPort": "179.162.22.82:36915",
  "type": "socks5",
  "speed": 37.78,
  "otherProtocols": {}
}
You can use it like this with Curl:
curl -x socks5://179.162.22.82:36915 http://example.com
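The same response can be turned into a proxy URL in code. This is a minimal sketch that assumes the field names shown in the sample response above (`protocol`, `ip`, `port`); the helper names `proxy_url` and `proxies_mapping` are hypothetical, and the sketch parses a hard-coded record instead of calling the API.

```python
import json


def proxy_url(record):
    """Build a proxy URL such as 'socks5://1.2.3.4:1080' from one record."""
    return "{}://{}:{}".format(record["protocol"], record["ip"], record["port"])


def proxies_mapping(record):
    """Return a scheme-to-proxy mapping, the shape many HTTP clients accept."""
    url = proxy_url(record)
    return {"http": url, "https": url}


# A trimmed copy of the sample response, used here in place of a live API call.
sample = json.loads(
    '{"supportsHttps": true, "protocol": "socks5",'
    ' "ip": "179.162.22.82", "port": "36915"}'
)

if __name__ == "__main__":
    print(proxy_url(sample))
```

For the sample record this prints the same address that is passed to curl above.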
If you are using R, you could do the web crawling through Tor. I think Tor changes its IP address automatically every 10 minutes(?). I believe there is a way to force Tor to change the IP at shorter intervals, but that didn't work for me. Instead, you could set up multiple instances of Tor and switch between the independent instances (a good explanation of how to set up multiple instances of Tor is here: https://tor.stackexchange.com/questions/2006/how-to-run-multiple-tor-browsers-with-different-ips).
After that you could do something like the following in R. Use the ports of your independent Tor instances and a list of user agents; every time you call the getURL function, switch to another port and user agent from your lists.
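library(RCurl)

# Placeholder values: replace them with the SOCKS ports of your own Tor
# instances, your own user agent strings, and the page you want to fetch.
port <- c(9050, 9052)
ua <- c("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5")
url <- "http://example.com"

proxy <- paste("socks5h://127.0.0.1:", port, sep="")
# Pick a random proxy and user agent for this request.
opt <- list(proxy=sample(proxy, 1),
            useragent=sample(ua, 1),
            followlocation=TRUE,
            referer="",
            timeout=30,
            verbose=FALSE,
            ssl.verifypeer=TRUE)
webpage <- getURL(url=url, .opts=opt)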
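The R snippet picks a proxy and user agent at random. If you would rather walk through the lists in a fixed order, a round-robin rotation is one alternative. This is a minimal sketch in Python, not part of the R approach above; `make_rotation` is a hypothetical helper, and the ports and user agent strings passed to it are placeholders.

```python
from itertools import cycle


def make_rotation(ports, user_agents):
    """Return a function that yields the next proxy/user agent pair on each
    call, wrapping around when either list is exhausted."""
    proxies = cycle(["socks5h://127.0.0.1:{}".format(p) for p in ports])
    agents = cycle(list(user_agents))

    def next_options():
        return {"proxy": next(proxies), "useragent": next(agents)}

    return next_options


if __name__ == "__main__":
    # Placeholder ports and user agents for illustration only.
    next_options = make_rotation([9050, 9052], ["UA-1", "UA-2", "UA-3"])
    for _ in range(4):
        print(next_options())
```

Because the two lists rotate independently, lists of different lengths produce more distinct proxy and user agent combinations before the sequence repeats.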
Some VPN applications let you automatically switch to a new random IP address at a set interval, such as every 2 minutes. Both HMA! Pro VPN and VPN4ALL support this feature.
A word of warning about VPNs: check their terms and conditions carefully, because scraping through them may violate their usage policy (Astrill is one such example). I tried a scraping tool and got my account locked.
If you have several public IPs, add them to your interface and, if you are using Linux, use iptables to switch between them.
Sample iptables rules for two IPs:
iptables -t nat -A POSTROUTING -m statistic --mode random --probability 0.5 -j SNAT --to-source 192.168.0.2
iptables -t nat -A POSTROUTING -j SNAT --to-source 192.168.0.3
The rules are evaluated in order, so the last rule carries no probability match and catches every connection the earlier rules did not. If both rules used 0.5, a quarter of the connections would match neither. With 4 IPs, the first three rules use probabilities 0.25, 0.33 and 0.5, and the fourth is unconditional.
You can also create your own proxy with simple steps.
These rules will allow the proxy server to switch its outgoing IPs.
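Because iptables evaluates rules in order, an even split over n addresses needs a different probability on each rule: 1/n on the first, 1/(n-1) on the second, and so on, with the last rule matching everything that is left. The following is a minimal sketch that only generates the command strings; `snat_rules` is a hypothetical helper and the addresses are the example ones used above.

```python
def snat_rules(source_ips):
    """Return iptables commands that spread new connections evenly over
    source_ips, assuming the rules are appended in the returned order."""
    rules = []
    total = len(source_ips)
    for index, ip in enumerate(source_ips):
        remaining = total - index
        match = ""
        if remaining > 1:
            # Each rule only sees traffic that the earlier rules skipped.
            match = "-m statistic --mode random --probability {:.4f} ".format(
                1.0 / remaining
            )
        rules.append(
            "iptables -t nat -A POSTROUTING {}-j SNAT --to-source {}".format(match, ip)
        )
    return rules


if __name__ == "__main__":
    for rule in snat_rules(["192.168.0.2", "192.168.0.3"]):
        print(rule)
```

The helper prints the commands rather than running them, so they can be reviewed before being applied as root.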
An ASP.NET MVC4 web application (a shopping cart) is running with Mono, Apache and mod_mono on Debian.
Sometimes it stops responding or is slow.
The Apache access log contains sections that could perhaps be used to reproduce the issue:
1.4.24.123 - - [25/Nov/2014:19:50:06 +0200] "GET /store/StoreImage/Thumb?product=350-00315&size=198 HTTP/1.1" 200 5231 "http://www.example.com/store/Store/Details?product=350-00315" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0"
1.4.24.123 - - [25/Nov/2014:19:50:06 +0200] "GET /store/Image/Icon HTTP/1.1" 200 2721 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0"
1.4.24.123 - - [25/Nov/2014:19:50:06 +0200] "GET /store/Image/Icon HTTP/1.1" 200 2721 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0"
etc.
How can I send those requests to the web site from Windows to test it? Is there a free tool or converter that reads an Apache log file and sends those requests?
Or how can I create an MVC4 controller that accepts this file as an upload, parses it and issues the web requests to the site? Or is there a web page that can do this?
Or is there a testing spider that rapidly requests all URLs from all pages of the site?
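As one possible starting point for replaying the log, the request paths can be pulled out of each line and joined to the site's base URL. This is a minimal sketch in Python, not an existing tool; `urls_from_access_log` and the regular expression are assumptions based on the log lines shown above, and only GET and HEAD requests are extracted so that nothing with side effects is replayed.

```python
import re
from urllib.parse import urljoin

# Matches the quoted request part of a log line, e.g. "GET /path HTTP/1.1".
REQUEST_RE = re.compile(r'"(GET|HEAD) (\S+) HTTP/[\d.]+"')


def urls_from_access_log(lines, base_url):
    """Return absolute URLs for the GET/HEAD requests found in log lines."""
    urls = []
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            urls.append(urljoin(base_url, match.group(2)))
    return urls


if __name__ == "__main__":
    sample_lines = [
        '1.4.24.123 - - [25/Nov/2014:19:50:06 +0200] '
        '"GET /store/Image/Icon HTTP/1.1" 200 2721 "-" "Mozilla/5.0"',
    ]
    for url in urls_from_access_log(sample_lines, "http://www.example.com"):
        print(url)
```

The resulting URLs could then be requested in a loop, for example with `urllib.request.urlopen`, from any machine that has Python installed.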
I have a site that's embedded inside an iframe.
On occasion I'm seeing a phantom GET request generated for each GET or POST.
The nginx logs show this occurring; notice the GET request sent immediately after each POST:
XX.XXX.XXX.XX - - [06/Oct/2012:20:55:47 +0000] "POST /website_widget/users HTTP/1.1" 200 1996 "http://subdomain.mysite.com/website_widget/users/sign_up" "Mozilla/5.0 (Windows NT 5.1; rv:15.0) Gecko/20100101 Firefox/15.0.1"
XX.XXX.XXX.XX - - [06/Oct/2012:20:55:47 +0000] "GET /website_widget/users HTTP/1.1" 404 781 "http://subdomain.mysite.com/website_widget/users" "Mozilla/5.0 (Windows NT 5.1; rv:15.0) Gecko/20100101 Firefox/15.0.1"
XX.XXX.XXX.XX - - [06/Oct/2012:20:55:53 +0000] "POST /website_widget/users HTTP/1.1" 200 1993 "http://subdomain.mysite.com/website_widget/users" "Mozilla/5.0 (Windows NT 5.1; rv:15.0) Gecko/20100101 Firefox/15.0.1"
XX.XXX.XXX.XX - - [06/Oct/2012:20:55:53 +0000] "GET /website_widget/users HTTP/1.1" 404 781 "http://subdomain.mysite.com/website_widget/users" "Mozilla/5.0 (Windows NT 5.1; rv:15.0) Gecko/20100101 Firefox/15.0.1"
This also happens with standard GET requests. From my Rails logs I can see:
Started GET "/website_widget/users/sign_in" for XX.XX.XXX.XX at 2012-10-06 20:45:35 +0000
[b7e895726057452d0af6a2ac5cd1668d] Processing by WebsiteWidget::MyController#new as HTML
Started GET "/website_widget/users/sign_in" for XX.XX.XXX.XX at 2012-10-06 20:45:37 +0000
[b20e57fcc205ee6cf958589ab1660c9f] Processing by WebsiteWidget::MyController#new as */*
Notice the */* in the second log entry, which suggests the MIME type is not set to HTML, or not set at all.
Has anyone come across this kind of thing before, or have any idea how I can debug it further? It's proving quite difficult to recreate.
So it looks like this was caused by a Firefox plugin, probably a site-ranking plugin of some description.
I'm using the CSS3 ability to apply multiple background images to an element. Currently, I have this code in my stylesheet:
body{background:url("images/emblem.png") top center no-repeat, url("images/background.png");background-color:#EAE6D9}
The code works in all browsers that support multiple backgrounds, and those that don't fall back to the background-color.
However, watching the access log files for the site, I'm noticing 404 errors for what looks like a malformed request based on this CSS rule. The funny thing is, they are coming from someone using Firefox 5. I'm using Firefox 5 too, and I cannot get an error to show up in the log for my IP.
Here's the error line from the log:
10.21.7.246 - - [28/Jun/2011:12:02:01 -0500] "GET /templates/images/emblem.png%22),%20url(%22http://ulabs.illinoisstate.edu/templates/images/background.png HTTP/1.1" 404 1005 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0"
I have a feeling the problem comes from the " and the space being URL encoded, but I'm definitely not doing that. And it doesn't happen all the time. Looking at requests from my IP address, the request is properly split up:
10.1.8.129 - - [28/Jun/2011:12:29:33 -0500] "GET /templates/images/background.png HTTP/1.1" 304 - "http://ulabs.illinoisstate.edu/templates/style.1308848695.php" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0"
10.1.8.129 - - [28/Jun/2011:12:29:33 -0500] "GET /templates/images/emblem.png HTTP/1.1" 304 - "http://ulabs.illinoisstate.edu/templates/style.1308848695.php" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0"
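Decoding the path from the 404 line makes the malformed request easier to read. This is a minimal sketch using Python's standard `urllib.parse.unquote`; the variable names are just for illustration, and the path is copied from the error line above.

```python
from urllib.parse import unquote

# Request path copied from the 404 entry in the access log.
logged_path = (
    "/templates/images/emblem.png%22),%20url(%22"
    "http://ulabs.illinoisstate.edu/templates/images/background.png"
)

decoded = unquote(logged_path)

if __name__ == "__main__":
    print(decoded)
```

The decoded text contains the end of the first url(...) value, the comma, and the start of the second url(...) value from the stylesheet, which suggests that whatever issued the request treated the whole background value as a single URL instead of splitting it into two images.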
Has anyone experienced this behavior before? Or have any ideas on what I might try to resolve the issue?
We've discovered that YSlow is causing the error. When running YSlow, the error appears in the log immediately for that IP address. Since this isn't really a problem, luckily there's nothing we need to fix on our end.