For example, I have a website under the domain example.com. On that site, I have a page like example.com/hello. Now I need to point my second domain hello.com to that page, example.com/hello. It should not be a redirect: the visitor should stay on hello.com but see the content from the page example.com/hello. Is this possible? Can we do it in DNS or in nginx?
The access log after using proxy_pass:
123.231.120.120 - - [10/Mar/2016:19:53:18 +0530] "GET / HTTP/1.1" 200 1598 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36"
123.231.120.120 - - [10/Mar/2016:19:53:18 +0530] "GET /a4e1020a9f19bd46f895c136e8e9ecb839666e7b.js?meteor_js_resource=true HTTP/1.1" 404 44 "http://swimamerica.lk/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.$
123.231.120.120 - - [10/Mar/2016:19:53:18 +0530] "GET /9b342ac50483cb063b76a0b64df1e2d913a82675.css?meteor_css_resource=true HTTP/1.1" 200 73 "http://swimamerica.lk/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.262$
123.231.120.120 - - [10/Mar/2016:19:53:18 +0530] "GET /images/favicons/favicon-16x16.png HTTP/1.1" 200 1556 "http://swimamerica.lk/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36"
123.231.120.120 - - [10/Mar/2016:19:53:19 +0530] "GET /images/favicons/favicon-96x96.png HTTP/1.1" 200 1556 "http://swimamerica.lk/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36"
123.231.120.120 - - [10/Mar/2016:19:53:19 +0530] "GET /images/favicons/favicon-32x32.png HTTP/1.1" 200 1556 "http://swimamerica.lk/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36"
123.231.120.120 - - [10/Mar/2016:19:53:19 +0530] "GET /images/favicons/android-icon-192x192.png HTTP/1.1" 200 1556 "http://swimamerica.lk/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36"
You can use the proxy_pass directive. Just create a new server associated with the domain hello.com, and for location = / set proxy_pass to http://example.com/hello/:
server {
    server_name hello.com;
    # ...
    location = / {
        proxy_pass http://example.com/hello/;
    }
    # serve static content (ugly way)
    location ~* \.(jpg|jpeg|gif|png|css|js|ico|xml|rss|txt)$ {
        proxy_pass http://example.com/hello/$uri$is_args$args;
    }
    # serve static content (better way,
    # but requires collecting all assets under a common root)
    location ~ /static/ {
        proxy_pass http://example.com/static/;
    }
}
UPD: Here is an exact solution for your situation:
server {
    server_name swimamerica.lk;
    location = / {
        proxy_pass http://killerwhales.lk/swimamerica;
    }
    # serve static content (ugly way) - added woff and woff2 extensions
    location ~* \.(jpg|jpeg|gif|png|css|js|ico|xml|rss|txt|woff|woff2)$ {
        proxy_pass http://killerwhales.lk$uri$is_args$args;
    }
    # added location for web sockets
    location ~* sockjs {
        proxy_pass http://killerwhales.lk$uri$is_args$args;
    }
}
I'm trying to do a simple GET request, but no matter how I configure the headers I keep getting a 403 response. The page loads fine in a browser. No login is required and there are no tracked cookies either. The link I'm trying to get a response from is below, followed by my simple code.
https://i7.sportsdatabase.com/nba/query.json?sdql=50+%3C+Kobe+Bryant%3Apoints+and+site%3Daway&sport=nba
url = 'https://i7.sportsdatabase.com/nba/query.json?sdql=50+%3C+Kobe+Bryant%3Apoints+and+site%3Daway&sport=nba'
headers = {
    'Host': 'i7.sportsdatabase.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
}
r = requests.get(url, headers)
I'm not seeing any other headers that need adding to the request. The full in-browser request headers are below:
Host: i7.sportsdatabase.com
Connection: keep-alive
Cache-Control: max-age=0
sec-ch-ua: "Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"
sec-ch-ua-mobile: ?0
DNT: 1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en,en-US;q=0.9,it;q=0.8,es;q=0.7
If-None-Match: "be833f0fb26eb81487fc09e05c85ac8c8646fc7b"
Try:
- Make your URL a string
- Pass the dict with the headers= keyword (requests.get(url, headers) sends the dict as params, so your custom headers were never used)
- Add the Accept headers
This works:
url = 'https://i7.sportsdatabase.com/nba/query.json?sdql=50+%3C+Kobe+Bryant%3Apoints+and+site%3Daway&sport=nba'
headers = {
    'Host': 'i7.sportsdatabase.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
}
r = requests.get(url, headers=headers)
Try using requests.Session():
import requests
s = requests.Session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Mobile Safari/537.36',
}
s.get('https://i7.sportsdatabase.com/nba/trends', headers=headers)
url = 'https://i7.sportsdatabase.com/nba/query.json?sdql=50+%3C+Kobe+Bryant%3Apoints+and+site%3Daway&sport=nba'
r = s.get(url, headers=headers)
print(r)
Output:
print(r)
<Response [200]>
I have defined a custom nginx log format using the template below:
log_format main escape=none '"$time_local" client=$remote_addr '
    'request="$request" '
    'status=$status'
    'Req_Header=$req_headers '
    'request_body=$req_body '
    'response_body=$resp_body '
    'referer=$http_referer '
    'user_agent="$http_user_agent" '
    'upstream_addr=$upstream_addr '
    'upstream_status=$upstream_status '
    'request_time=$request_time '
    'upstream_response_time=$upstream_response_time '
    'upstream_connect_time=$upstream_connect_time ';
In return, I get the request logged like below:
"09/Sep/2019:13:28:39 +0530" client=59.152.52.190 request="POST /api/onboard/checkExistence HTTP/1.1"status=200Req_Header=Headers: accept: application/json
host: uat-pwa.abc.com
from: https://uat-pwa.abc.com/onboard/mf/onboard-info_v1.2.15.3
sec-fetch-site: same-origin
accept-language: en-GB,en-US;q=0.9,en;q=0.8
content-type: application/json
connection: keep-alive
content-length: 46
cookie: _ga=GA1.2.51303468.1558948708; _gid=GA1.2.1607663960.1568015582; _gat_UA-144276655-2=1
referer: https://uat-pwa.abc.com/onboard/mf/onboard-info
accept-encoding: gzip, deflate, br
ticket: aW52ZXN0aWNh
businessunit: MF
sec-fetch-mode: cors
userid: Onboarding
origin: https://uat-pwa.abc.com
investorid:
user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36
request_body={"PAN":"ABCDEGH","mobile":null,"both":"no"} response_body={"timestamp":"2019-09-09T13:28:39.132+0530","message":"Client Already Exist. ","details":"Details are in Logger database","payLoad":null,"errorCode":"0050","userId":"Onboarding","investorId":"","sessionUUID":"a2161b89-d2d7-11e9-aa73-3dba15bc0e1c"} referer=https://uat-pwa.abc.com/onboard/mf/onboard-info user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36" upstream_addr=[::1]:8080 upstream_status=200 request_time=0.069 upstream_response_time=0.068 upstream_connect_time=0.000
I am facing issues writing a parser rule for the Telegraf logparser section. Once this data is parsed properly, Telegraf can write it into InfluxDB.
I have tried various solutions online to find a parsing rule, but I have not been able to, as I am new to this. Any assistance will be appreciated.
Thanks, and do let me know if any further information is required.
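As a rough illustration (not a Telegraf grok rule, just a minimal Python sketch assuming lines follow the log_format above), this shows the leading fields a logparser pattern would need to capture; the field names are taken from the format string:
import re

# Sketch: pull the leading fields out of one line produced by the custom
# log_format above. Note there is no space after $status in that format,
# so "status=200" runs straight into "Req_Header=".
LINE_RE = re.compile(
    r'^"(?P<time_local>[^"]+)"\s+client=(?P<client>\S+)\s+'
    r'request="(?P<request>[^"]+)"\s*'
    r'status=(?P<status>\d+)'
)

def parse_line(line):
    """Return a dict of the leading fields, or None if the line does not match."""
    m = LINE_RE.match(line)
    return m.groupdict() if m else None

sample = ('"09/Sep/2019:13:28:39 +0530" client=59.152.52.190 '
          'request="POST /api/onboard/checkExistence HTTP/1.1"status=200Req_Header=...')
print(parse_line(sample))  # {'time_local': '09/Sep/2019:13:28:39 +0530', ...}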
I'm trying to ban an annoying bot by user agent. I put this into the server section of my nginx config:
server {
    listen 80 default_server;
    ....
    if ($http_user_agent ~* (AhrefsBot)) {
        return 444;
    }
Checking with curl:
[root@vm85559 site_avaliable]# curl -I -H 'User-agent: Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)' localhost/
curl: (52) Empty reply from server
So I checked /var/log/nginx/access.log, and I see some connections get 444, but other connections get 200!
51.255.65.78 - - [25/Jun/2017:15:47:36 +0300 - -] "GET /product/kovriki-avtomobilnie/volkswagen/?PAGEN_1=10 HTTP/1.1" 444 0 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)" 1498394856.155
217.182.132.60 - - [25/Jun/2017:15:47:50 +0300 - 2.301] "GET /product/bryzgoviki/toyota/ HTTP/1.1" 200 14500 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)" 1498394870.955
How is this possible?
OK, got it!
I added $server_name and $server_addr to the nginx log format and saw that the cunning bot connects by IP, without a server name:
51.255.65.40 - _ *myip* - [25/Jun/2017:16:22:27 +0300 - 2.449] "GET /product/soyuz_96_2/mitsubishi/l200/ HTTP/1.1" 200 9974 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)" 1498396947.308
So I added this, and the bot can't connect anymore:
server {
    listen *myip*:80;
    server_name _;
    return 403;
}
I am quite new to Scrapy (and my background is not in informatics). There is a website that I can't visit from my local IP, since I am banned, but I can visit it in the browser using a VPN service. So that my spider can crawl it, I set up a pool of proxies that I found here: http://proxylist.hidemyass.com/. With that, my spider is able to crawl and scrape items, but my doubt is whether I have to change the proxy pool list every day. Sorry if my question is a dumb one...
Here is my settings.py:
BOT_NAME = 'reviews'
SPIDER_MODULES = ['reviews.spiders']
NEWSPIDER_MODULE = 'reviews.spiders'
DOWNLOAD_DELAY = 1
RANDOMIZE_DOWNLOAD_DELAY = True
DOWNLOADER_MIDDLEWARES = {
    # disabled to avoid "exceptions.IOError: Not a gzipped file"
    'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': None,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'reviews.rotate_useragent.RotateUserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'reviews.middlewares.ProxyMiddleware': 100,
}
PROXIES = [
    {'ip_port': '168.63.249.35:80', 'user_pass': ''},
    {'ip_port': '162.17.98.242:8888', 'user_pass': ''},
    {'ip_port': '70.168.108.216:80', 'user_pass': ''},
    {'ip_port': '45.64.136.154:8080', 'user_pass': ''},
    {'ip_port': '149.5.36.153:8080', 'user_pass': ''},
    {'ip_port': '185.12.7.74:8080', 'user_pass': ''},
    {'ip_port': '150.129.130.180:8080', 'user_pass': ''},
    {'ip_port': '185.22.9.145:8080', 'user_pass': ''},
    {'ip_port': '200.20.168.135:80', 'user_pass': ''},
    {'ip_port': '177.55.64.38:8080', 'user_pass': ''},
]
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'reviews (+http://www.yourdomain.com)'
Here is my middlewares.py:
import base64
import random

from settings import PROXIES

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)
        request.meta['proxy'] = "http://%s" % proxy['ip_port']
        # only send Proxy-Authorization when credentials are actually set
        # (an empty user_pass string means no auth is needed)
        if proxy['user_pass']:
            encoded_user_pass = base64.encodestring(proxy['user_pass'])
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
Another question: if I have a website that is HTTPS, should I have a proxy pool list for HTTPS only, and then another class, HTTPSProxyMiddleware(object), that receives a list HTTPS_PROXIES?
My rotate_useragent.py:
import random

from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)

    # the default user_agent_list contains Chrome, IE, Firefox, Mozilla, Opera and Netscape strings
    # for more user agent strings, see http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
One last question (sorry if it is again a stupid one): in settings.py there is a commented-out default part,
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'reviews (+http://www.yourdomain.com)'
Should I uncomment it and put my personal information there, or just leave it like that? I want to crawl efficiently, but while following the good policies and habits that avoid possible ban issues...
I am asking all this because, with these changes, my spiders started to throw errors like:
twisted.internet.error.TimeoutError: User timeout caused connection failure: Getting http://www.example.com/browse/?start=884 took longer than 180.0 seconds.
and
Error downloading <GET http://www.example.com/article/2883892/x-review.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
and
Error downloading <GET http://www.example.com/browse/?start=6747>: TCP connection timed out: 110: Connection timed out.
Thanks so much for your help and time.
There is already a library to do this: https://github.com/aivarsk/scrapy-proxies
Please download it from there. It is not on pypi.org yet, so you can't install it easily using pip or easy_install.
There's no single correct answer for this. Some proxies are not always available, so you have to check them now and then. Also, if you use the same proxy every time, the server you are scraping may block its IP as well, but that depends on the security mechanisms that server has.
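As a rough illustration of that periodic check (a sketch only; the test URL, timeout, and helper name are assumptions, not part of Scrapy), you could filter the pool with requests before a crawl:
import requests

# Sketch: keep only the proxies from the pool that answer a simple request.
# The test URL and timeout are arbitrary choices.
def alive_proxies(proxies, test_url='http://example.com/', timeout=5):
    working = []
    for proxy in proxies:
        address = 'http://%s' % proxy['ip_port']
        try:
            requests.get(test_url, proxies={'http': address}, timeout=timeout)
            working.append(proxy)
        except requests.RequestException:
            pass  # dead or too slow this time; skip it
    return working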
Yes, because you don't know if all the proxies you have in your pool support HTTPS. Or you could have just one pool and add a field to each proxy that indicates its HTTPS support.
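For example (only a sketch of that single-pool idea, with a hypothetical 'https' flag added to each entry), the middleware could pick a suitable proxy based on the request scheme:
import random

# Sketch: one pool, each entry flagged with assumed HTTPS support.
PROXIES = [
    {'ip_port': '168.63.249.35:80', 'user_pass': '', 'https': True},
    {'ip_port': '162.17.98.242:8888', 'user_pass': '', 'https': False},
]

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # pick only HTTPS-capable proxies for secure requests
        if request.url.startswith('https'):
            pool = [p for p in PROXIES if p['https']]
        else:
            pool = PROXIES
        proxy = random.choice(pool)
        request.meta['proxy'] = 'http://%s' % proxy['ip_port']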
In your settings you are disabling the user agent middleware ('scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None), so the USER_AGENT setting won't have any effect.
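If you did want USER_AGENT to apply (a sketch of the alternative setup, not what your current configuration does), you would leave the built-in middleware enabled instead of overriding it with None:
# settings.py sketch: let Scrapy's built-in UserAgentMiddleware use USER_AGENT
USER_AGENT = 'reviews (+http://www.yourdomain.com)'
DOWNLOADER_MIDDLEWARES = {
    # note: no 'useragent.UserAgentMiddleware': None entry here
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'reviews.middlewares.ProxyMiddleware': 100,
}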
Some logs:
46.196.164.146 - - [21/Feb/2015:20:05:45 +0300] "GET / HTTP/1.1" 200 10930 "http://22on8mj7w7wpcc.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:16.0) Gecko/20100101 Firefox/16.0" "-"
78.171.167.204 - - [21/Feb/2015:20:05:45 +0300] "GET / HTTP/1.1" 200 10931 "http://y707yvc8a.net/" "Opera/9.80 (Windows NT 6.1; WOW64; U; Edition Romania Local; ru) Presto/2.10.289 Version/8.09" "-"
78.171.167.204 - - [21/Feb/2015:20:05:45 +0300] "GET / HTTP/1.1" 200 10930 "http://87rk11k0.ua/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:16.0) Gecko/20100101 Firefox/16.0" "-"
78.174.146.52 - - [21/Feb/2015:20:05:45 +0300] "GET / HTTP/1.1" 200 10931 "http://8811mm213kc34.org/" "Mozilla/5.0 (Windows NT 5.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0" "-"
176.43.210.33 - - [21/Feb/2015:20:05:45 +0300] "GET / HTTP/1.1" 200 10930 "http://qh0lx1wqp17.ru/" "Mozilla/5.0 (Windows NT 5.1; WOW64; rv:11.0) Gecko/20100101 Firefox/11.0" "-"
The user agent changes randomly. Is there any way to do this in the nginx conf?
if (preg_match('/\d{3}/', $invalid_referer)) {
    return 403;
}
You should just make a rewrite rule in this case:
if ($http_referer ~* ".*[0-9]{2}.*") {
    return 403;
}