I'm trying to create a Service that Scraping websites by using Google Cached Pages.
Example
https://webcache.googleusercontent.com/search?q=cache:nike.com
The Response that I get is the HTML from Google cache, which is an older version of the Nike site.
And it works fine as long as I run it locally on my computer,
but when I deploy to google cloud platform, there I use porxy server
I get a 403 error that I can not access the information through a porxy server
Example of response from proxy server
433. That’s an error.Your client does not have permission to get URL /s
earch?q=cache:http://nike.com from this server. (Client IP address: XX.XXX.XX.XXX)<br
Please see Google's Terms of Service posted at
https://policies.google.com/terms If you believe that you
have received this response in error, please report your
problem. However, please make sure to take a look at our Terms of
Service (http://www.google.com/terms_of_service.html). In your email,
please send us the entire code displayed below. Please also
send us any information you may know about how you are performing your
Google searches-- for example, "I' m using the Opera browser on Linux
to do searches from home. My Internet access is through a dial-up
account I have with the FooCorp ISP." or "I'm using the Konqueror
browser on Linux t o search from my job at myFoo.com. My machine's IP
address is 10.20.30.40, but all of myFoo' s web traffic goes through
some kind of proxy server whose IP address is 10.11.12.13." (If y ou
don't know any information like this, that's OK. But this kind of
information can help us track down problems, so please tell us what
you can.)We will use all this information to diagnose the
problem, and we'll hopefully have you back up and searching with
Google agai n quickly! Please note that although we read all
the email we receive, we are not always able to send a personal
response to each and every email. So don't despair if you don't hear
back from u s! Also note that if you do not send us the
entire code below, we will not be able to help
you.Best wishes,The Google
Article that talks about the problem https://proxyserver.com/web-scraping-crawling/scraping-websites-via-google-cached-pages/
How can I solve this problem, and run requests from the cloud as well without being blocked? Add parameters?
Thanks :)
I guess that you should add a property in the header of your http request
for example :
URL u = new URL("https://www.google.com//search?q=c");
URLConnection c = u.openConnection();
c.setRequestProperty("User-Agent", "MSIE 7.0");
or
HttpRequest request =HttpRequest.newBuilder(new URI("https://www.google.com//search?q=c")).header("User-Agent", "MSIE 7.0").GET().build();
// note to change the URI
this two examples are in Java but the same concept is applied in all environments I guess
hope that was helpfull
I've encountered a weird situation, after registration we're sending an email with a verification link, pretty standard stuff, but somehow clicking on the link seems to make the request twice, looking at the logs, the first time it comes from my IP and the second request comes from some Google IP: 66.102.8.60 (doing a reverse lookup shows google-proxy-66-102-8-60.google.com).
Any idea what's going on and how to prevent this?
The server is running Nginx and the site is Ruby on Rails if that helps.
I do not know the root cause but my best guess is same as Tripleee wrote above - most probably google is scanning urls. This happens in all browsers (well at least in Chrome and Firefox), but only under following circumstances:
the url is clicked from gmail (if you copy paste it to browser tab, the second request is not issued)
the url is clicked for the first time... Subsequent clicks from the same email do not trigger second request
I know it is probably not the answer you expected, but after giving it some thought I figured that operation like this should be handled on server side. In my case I am tracking information about confirmation urls anyways, so the first time the request comes to my backend I am deleting it and proceeding with confirmation normally. Since the confirmation entry is missing in the database for the second request it returns immediately with status 404, 422 or something whatever suits you.
Hope that helps anyone who gets here looking for an answer to this problem ;)
I am having Postfix server configured for domain. From last few days my mails are marking as spam in gmail. I have already configured DKIM,SPF and DMARC for this domain. I have checked mail source and getting
"Authentication-Results: mydomain; dmarc=fail header.from=mydomain"
I have checked all the support docs but didn`t find anything.
Could you provide full sample headers or run a classification test with a third party?
Often, the TXT records aren't correctly created with the proper name style or have a syntax error causing the mail server parser to fail. Since DMARC fails the issue can be in either SPF or DKIM. In relaxed mode, the [SPF]-authenticated domain and RFC5322.From domain must have the same Organizational Domain. In strict mode, only an exact DNS domain match is considered to produce Identifier Alignment.
Generally speaking, its really hard to help without proper details. Kindly always provide DNS and configuration samples.
I have recently setup forgot password functionality on my site using the stock symfony2 implementation.
Problem is my reset password email gets sent to my junk folder.
What causes this? Is it the content of the email itself?
Here it is:
Hello myemail#hotmail.com!
To reset your password - please visit http://application.mysite.com/resetting/reset/yLbv6BLD6ItSlmXSl4tFI7la78Es5UnS1GqvJnN_5uR
Regards,
the Team.
Could it be something in my settings?
There is a lot of possibilities that can cause this problem.
It's most often coming from the server (e-mail) configuration.
Look at the "original message" (with headers) to see if there is no explicit problem, but it's very difficult to debug.
Look at your email configuration (postfix local ? gmail ?), search for working examples and hopes you find the problem, especially if it's your production server.
Good luck
There can be lot of reasons:
you send emails from shared IP segment
to low ammount of text in your message
spammy look sender email address (for example "noreply#...")
subject of message
url thas point somwhere to testing environment (for example 127.0.0.1)
Try to change these, and experiment...
I have a website "www.website.com".
Recently I found out that somebody has set up a reverse proxy with an almost identical url "www.website1.com" in front of my website.
I'm concerned of those users who came to my website through that reverse proxy. Their username and passwords might be logged when they login.
Is there a way for me to have my web server refuse reverse proxy?
For example, I've set up a reverse proxy using squid with the url "www.fakestackoverflow.com" in front of "www.stackoverflow.com". So whenever I type "www.fakestackoverflow.com" in my web browser address bar, I'll be redirected to "www.stackoverflow.com" by the reverse proxy. Then I notice the url in my address bar changed to "www.stackoverflow.com" indicating that I'm no longer going through the reverse proxy.
"www.stackoverflow.com" must've detected that I came to the website from another url and then redirected me to the website through the actual url.
How do I do something like that in ASP.NET web application?
Also asked on server fault.
First, use JavaScript to sniff document.location.href and match it against your domain:
var MyHostName = "www.mydomain.com";
if (0 == document.location.href.indexOf("https://"))
{
MyHostName = "https://" + MyHostName + "/";
if (0 != document.location.href.indexOf(MyHostName)) {
var new_location = document.location.href.replace(/https:\/\/[^\/]+\//, MyHostName);
if(new_location != document.location.href)
document.location.replace(new_location);
}
}
else
{
MyHostName = "http://" + MyHostName + "/";
if (0 != document.location.href.indexOf(MyHostName)) {
var new_location = document.location.href.replace(/http:\/\/[^\/]+\//, MyHostName);
if(new_location != document.location.href)
document.location.replace(new_location);
}
}
Second: write a init script to all your ASP pages to check if the remote user IP address matches the address of the reverse proxy. If it matches, redirect to a tinyurl link which redirects back to your real domain. Use tinyurl or other redirection service to counter reverse proxy's url rewriting.
Third: write a scheduled task to do a DNS lookup on the fake domain, and update a configuration file which your init script in step 2 uses. Note: Do not do a DNS lookup in your ASP because DNS lookups can stall for 5 seconds. This opens a door for DOS against your site. Also, don't block solely based on IP address because it's easy to relocate.
Edit: If you're considered of the proxy operator stealing user passwords and usernames, you should log all users who are served to the proxy's IP address, and disable their accounts. Then send email to them explaining that they have been victims of a phishing attack via a misspelled domain name, and request them to change their passwords.
After days of searching and experimenting, I think I've found an explanation to my question. In my question, I used stackoverflow.com as an example but now I'm going to use whatismyipaddress.com as my example since both exhibit the same behaviour in the sense of url rewriting plus whatismyipaddress.com is able to tell my ip address.
First, in order to reproduce the behaviour, I visited whatismyipaddress.com and got my ip address, say 111.111.111.111. Then I visited www.whatismyipaddress.com (note the additional www. as its prefix) and the url in my browser's address bar changed back to whatismyipaddress.com discarding the prefix. After reading comments from Josh Stodola, it strucked me to prove this point.
Next, I set up a reverse proxy with the url www.myreverseproxy.com and ip address 222.222.222.222 and I have it performed the two scenarios below:
I have the reverse proxy points to whatismyipaddress.com (without the prefix **www.). Then typed www.myreverseproxy.com in my browser's address bar. The reverse proxy then relayed me to whatismyipaddress.com and the url in my address bar didn't change (still showing www.myreverseproxy.com). I further confirmed this by checking the ip address on the webpage which showed 222.222.222.222 (which is the ip address of the reverse proxy). This means that I'm still viewing the webpage through the reverse proxy and not directly connected to whatismyipaddress.com.
Then I have the reverse proxy points to www.whatismyipaddress.com (with the prefix wwww. this time). I visited www.myreverseproxy.com and this time the url in my address bar changed from www.myreverseproxy.com to whatismyipaddress.com. The webpage showed my ip address as 111.111.111.111 (which is the real ip address of my pc). This means that I'm no longer viewing the webpage through the reverse proxy and redirected straight to whatismyipaddress.com.
I think this is some kind of url rewriting trick which Josh Stodola has pointed out. I think I'm gonna read more on this. As to how to protect a server from reverse proxy, the best bet is to use SSL. Encrypted information passing through a proxy will be of no use since it can't be read in plain sight thus preventing eavesdropping and man-in-the-middle attack which what reverse proxy exactly is.
Safeguarding with javascript though can be seen trivial since javascript can be stripped off easily by a reverse proxy and also prevent other online services like google translate from accessing your website.
If you were to do Authentication over SSL using https://, you can bypass the proxy in most cases.
You can also look for the X-Forwarded-For header in the incoming request and match it against the suspicious proxy.
As I see it, your fundamental issue here is that whatever application layer defence measures you put in place to mitigate this attack can be worked around by the attacker, assuming this really is a malicious attack made by a competent adversary.
In my view, you should definitely be using HTTPS, which in principle would allow the user to confirm for sure whether they're talking to the right server, but this relies on the user knowing to check for this. Some browsers these days display extra information in the URL bar about which legal entity owns the SSL certificate, which would help, as it's unlikely an attacker would be able to persuade a legitimate certificate authority to issue a certificate in your name.
Some of the other comments here said that HTTPS can be intercepted by intermediate proxy servers, which is not actually true. With HTTPS, the client issues a CONNECT request to the proxy server, which tunnels all future traffic direct to the origin server, without being able to read any of it. If we assume that this proxy server is entirely bespoke and malicious, then it can terminate the SSL session and intercept the traffic, but it can only do that with its own SSL certificate, not with yours. This certificate will either be self signed (in which case clients will get lots of warning messages) or a genuine certificate issued by a certificate authority, in which case it'll have the wrong legal entity name, and you should be able to go back to the certificate authority, have the cert revoked and potentially ask the police to take action against the owner of the certificate, if you have reasonable suspicion that they are phishing.
The other thing I can think of which would mitigate this threat to some extent would be to implement one-time password functionality, either using a hardware/software token or using (my personal favorite) an SMS sent to the user's phone when they log in. This wouldn't prevent the attacker getting access to the session once, but should prevent them being able to log in in future. You could further protect the users by requiring another one time password before allowing them to see/edit particularly sensitive details.
There's very little you can do to prevent this without causing legitimate proxies (translation, google cache, etc..) from failing. If you don't care if people use such services, then simply set your web app to always redirect if the base url is not correct.
There are some steps you can take if you are aware of the proxies, and can find out their IP addresses, but that can change and you would have to stay on top of it. #jmz's answer is quite good in that regard.
I have come with an idea, and I think a solution.
First of all you do not need all page to be overwrite because this way you block other proxies, and other services (like google automatic translate).
So let say that you won to be absolute sure about the login page.
So what you do, when a user gets on login.aspx page you make a redirect with the full path of your site again to login.aspx.
if(Not all ready redirect on header / or on parametres from url)
Responce.Redirect("https://www.mysite.com/login.aspx");
This way I do not think that transparent proxy can change the get header and change it.
Also you can log any proxy, and or big requests from some ips and check it. When you found a Fishing site like the one you say you can also report it.
http://www.antiphishing.org/report_phishing.html
https://submit.symantec.com/antifraud/phish.cgi
http://www.google.com/safebrowsing/report_phish/
Maybe create a black-list of URLs and compare requests with Response.Referer if the website is on that list then kill the request or do a redirection of your own.
The black-list is obviously something you would have to manually update.
Ok i have went throu a similar situation but i managed to overcome it by using another forwarded domain that points to my original perminantly , then checking with code if the client is the reverse server or not if it it i would redirect them to my second domain which will go to the original
Check out more info from here: http://alphablog.xyz/your-website-is-being-mirrored-by-someone-else-without-your-knowledge/
The simplest way would probably be to put some Javascript code on your page that examines window.location to see if the top level domain (TLD) matches what you expect, and if not, replaces it with your correct domain (causing the browser to reload to the proper site instead).