Scraping websites via Google Cached Pages has been blocked - proxy server

I'm trying to create a service that scrapes websites by using Google Cached Pages.
Example
https://webcache.googleusercontent.com/search?q=cache:nike.com
The response I get is the HTML from the Google cache, which is an older version of the Nike site.
It works fine as long as I run it locally on my computer, but when I deploy it to Google Cloud Platform, where I use a proxy server, I get a 403 error saying that I cannot access the information through a proxy server.
Example of the response from the proxy server:
403. That's an error. Your client does not have permission to get URL /search?q=cache:http://nike.com from this server. (Client IP address: XX.XXX.XX.XXX)
Please see Google's Terms of Service posted at https://policies.google.com/terms
If you believe that you have received this response in error, please report your problem. However, please make sure to take a look at our Terms of Service (http://www.google.com/terms_of_service.html). In your email, please send us the entire code displayed below. Please also send us any information you may know about how you are performing your Google searches -- for example, "I'm using the Opera browser on Linux to do searches from home. My Internet access is through a dial-up account I have with the FooCorp ISP." or "I'm using the Konqueror browser on Linux to search from my job at myFoo.com. My machine's IP address is 10.20.30.40, but all of myFoo's web traffic goes through some kind of proxy server whose IP address is 10.11.12.13." (If you don't know any information like this, that's OK. But this kind of information can help us track down problems, so please tell us what you can.)
We will use all this information to diagnose the problem, and we'll hopefully have you back up and searching with Google again quickly!
Please note that although we read all the email we receive, we are not always able to send a personal response to each and every email. So don't despair if you don't hear back from us! Also note that if you do not send us the entire code below, we will not be able to help you.
Best wishes,
The Google
Here is an article that talks about the problem: https://proxyserver.com/web-scraping-crawling/scraping-websites-via-google-cached-pages/
How can I solve this problem and run requests from the cloud as well without being blocked? Do I need to add parameters?
Thanks :)

I guess you should add a property to the header of your HTTP request, for example:
URL u = new URL("https://www.google.com/search?q=c");
URLConnection c = u.openConnection();
c.setRequestProperty("User-Agent", "MSIE 7.0");
or
HttpRequest request = HttpRequest.newBuilder(new URI("https://www.google.com/search?q=c"))
        .header("User-Agent", "MSIE 7.0")
        .GET()
        .build();
// note: change the URI to the page you want
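For completeness, here is a minimal, self-contained sketch of sending that request with Java 11's HttpClient and reading the response. The cache URL is the one from the question; whether any particular User-Agent actually gets past the block on a cloud/proxy IP is an assumption, not a guarantee:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CacheFetch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://webcache.googleusercontent.com/search?q=cache:nike.com"))
                .header("User-Agent", "Mozilla/5.0") // browser-like UA; an assumption, not a guaranteed fix
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        // 200 means the cached page came back; 403 means Google is still blocking the client IP
        System.out.println(response.statusCode());
    }
}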
These two examples are in Java, but I guess the same concept applies in any environment.
Hope that was helpful!

Related

Invalid HTTP Request - linkedin.com/oauth/v2/authorization

Suddenly LinkedIn OAuth2 stopped working! As per the instructions found here:
https://developer.linkedin.com/docs/oauth2
When invoking this:
https://www.linkedin.com/oauth/v2/authorization?response_type=code&client_id=75jdo0an3ktnbx&redirect_uri=https://app.myapp.com/account/linkedin_login&state=fregfdgfasd&scope=r_basicprofile%20r_emailaddress
Instead of a valid response, I get a 400 error:
LinkedIn
Invalid HTTP Request
Could not process this client request HTTP method request for URL. Please double-check the URL (address) you used, or contact us if you feel you have reached this page in error.
I am experiencing the same problem using Chrome, but not with Edge or Firefox. I contacted LinkedIn; the reply was that they are working on it, with no estimate of when it will be solved. The new profile update seems to be botched in Chrome, OK with Edge, and still not updated to the new look when using Firefox.
LinkedIn has problems far deeper than poor coding; they have forgotten the meaning of being social in networking. The site is becoming a pile of stale resumes, non-existent debates and bad-quality networking.
I am not fluent enough in OAuth to tell you why, but they have two different systems: OAuth and OAuth legacy.
I personally couldn't find a way to retrieve a valid token with OAuth, but I could with OAuth legacy. The main difference is the URL and the authorization window.
You are currently using https://www.linkedin.com/oauth/v2 for your API calls.
OAuth legacy uses https://www.linkedin.com/uas/oauth2.
The whole process is the same, so you won't have to change your code, just the URL.
See the OAuth legacy doc: linkedin.com/docs/oauth2-legacy
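To make the URL swap concrete, here is a minimal Java sketch that builds the legacy authorization URL using the exact parameters from the question (the client_id, redirect_uri and state are the question's placeholder values, not working credentials):

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class LegacyAuthUrl {
    public static void main(String[] args) {
        // only the path changes: /uas/oauth2/authorization instead of /oauth/v2/authorization
        String url = "https://www.linkedin.com/uas/oauth2/authorization"
                + "?response_type=code"
                + "&client_id=75jdo0an3ktnbx"
                + "&redirect_uri=" + URLEncoder.encode("https://app.myapp.com/account/linkedin_login", StandardCharsets.UTF_8)
                + "&state=fregfdgfasd"
                + "&scope=" + URLEncoder.encode("r_basicprofile r_emailaddress", StandardCharsets.UTF_8);
        System.out.println(url);
    }
}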
The bad side is the authorization window: the user has to literally log in (email + password) before clicking on the 'Authorized' button and being redirected to your callback URL.
I agree, this website has something buggy. When visited from France (browser language set to fr-FR and an IP geolocated in France), the whole interface is written in Dutch...
Anyway, I hope it helps.

sendmail genericstable not used when mailing

I want to forward all mail for root (so basically the output of all cron jobs, but other mail for root as well) to an external email address (hotmail).
The easiest method would be to use the aliases file, so I updated the root alias:
root: mymail@hotmail.com
And ran newaliases.
When an email is sent, I see that the hotmail MX server "accepts" my mail. Standard MS security through obscurity makes me think it's silently discarding my email (it's not in junk mail, ...).
This server is used to send/receive mail for a domain (and more domains in the future).
I've checked the logs, and it seems the mail is sent with a from field of root@mail.domain.com.
I'm pretty sure this is the root cause of my mail never being received at hotmail.
The existing email addresses use user@domain.com as the from address.
Now I would like to rewrite this (mail) from address/ctladdr.
I thought this would be an easy fix with genericstable.
Genericstable (I had multiple tries):
root info@domain.com
root@localhost info@domain.com
root@mail.domain.com info@domain.com
Regenerated the db with makemap.
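For reference, and assuming the table lives at /etc/mail/genericstable (that path is my assumption), the rebuild I mean is the standard one:

makemap hash /etc/mail/genericstable < /etc/mail/genericstable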
I tried with different settings.
I also removed the EXPOSED_USER root (from the generic m4 file). I can see it's not in the generated cf file.
I also added root to the trusted users.
In my m4 file:
FEATURE(genericstable)dnl
GENERICS_DOMAIN(domain.com)dnl
dnl GENERICS_DOMAIN(mail.domain.com)dnl
dnl GENERICS_DOMAIN_FILE(`/etc/mail/generics-domains')dnl
FEATURE(masquerade_envelope)dnl
dnl define(`LOCAL_RELAY', `localhost')dnl
I have a submit mc file as well. Not sure if this matters but I don't think so.
(I don't have sendmail running in MSP mode, as far as I know.)
I've tried GENERICS_DOMAIN both as the domain that I want the address to become and as the domain that I want rewritten.
Then I ran:
make all install
and restarted sendmail.
Still, the mail just seems to go out as root@mail.domain.com.
I tried sendmail in address test mode (-bt; tryflags hs and try esmtp root). This correctly rewrites to the wanted source address: info@domain.com.
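For anyone wanting to reproduce that test, the session looks roughly like this (domain.com stands in for the real domain, and the intermediate rewriting output is elided):

sendmail -bt
> /tryflags HS
> /try esmtp root

The final line of the output should show the rewritten address, info@domain.com in my case.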
Does anyone have other ideas why this is not working, or more ways to debug it?
Do I need LOCAL_RELAY to make this work? What's expected to be in the hosts file? The FQDN (mail.domain.com) and the hostname (so mail) for 127.0.0.1?
EDIT: I should probably mention that I have an incoming queue for MailScanner.
Thanks a lot in advance!
I believe the source of my issue is that I was expecting all mail servers mentioned in the headers to have mail.example.com removed.
However, the first header is for submitting the mail to the local queue.
Only when sendmail sends the mail out (connecting to the outside MX of example.com) does the translation get done.
So the servers mentioned in the headers keep mail.example.com.
I thought mail.example.com was the culprit in hotmail not delivering my email. That seems to have been wrong.
After investigating for a long time, I noticed that if I sent an email from info@example.com to hotmail, it showed up nowhere (no, not even in spam, ...) even though it was accepted.
If I first sent an email to info@example.com and then sent one back from info@example.com, the mail got successfully delivered to the hotmail mailbox.
This also seems to be the case with other users of the same example.com domain (so not solely with info@).
After some more investigating I noticed: HTML email seems to be delivered more easily (sent through SquirrelMail); plain-text-only mails seem to be ignored.
NOTE: in all cases my mail was accepted by the hotmail mail server, so no error code 550 or anything. I was always sending mail from the mail.example.com server (either from the command line or through SquirrelMail).
EDIT: I had yet another annoying encounter with Hotmail. Again my message was accepted and just disappeared. I had been sending to this destination address before without any issues, but for some reason all of a sudden the Hotmail mail servers got "improved".
I'd like to throw in this reference to a topic that was opened years ago and is still ongoing with no feedback from MS: https://answers.microsoft.com/en-us/outlook_com/forum/oemail-osend/messages-reported-as-250-queued-for-delivery-but/f451cda5-ba7d-45ff-b643-501efe2413dc?page=2 . So you're definitely not alone. But also understand that there can be multiple issues leading to the same symptoms.
So I'd like to add some steps which might help prevent a massive headache for others:
Use a footer that clearly states your company and domain.
Use HTML mail
For some reason, I sometimes see mails getting delivered directly to the Deleted folder, not to Spam.
For some reason, sending more mail from your domain is better, as you gain more "reputation".
You can open a case with Microsoft here:
https://support.microsoft.com/en-us/getsupport?oaspworkflow=start_1.0.0.0&wfname=capsub&productkey=edfsmsbl3&locale=en-us&ccsid=635754176123391261
Don't set your expectations high. They'll mainly send you an email back saying that you're not eligible for remediation and later answer your case with a standard reply. HOWEVER, what creating this case does do is probably get you confirmation that your email did indeed get "filtered" by the mighty SmartScreen (they will not tell you why). But this way at least you know it's the spam filter, and the points below might help you out.
Make sure to pass the message ID, timestamp, ... (a log entry from the maillog is what I did).
The answer to your case will certainly mention SNDS (Smart Network Data Services) and JMRP (Junk Mail Reporting Program).
SNDS: I've subscribed and never seen anything listed here. So if you have low email volume, don't expect anything to show up.
JMRP: this is a service that will send you an email when a message gets marked as spam by users. I've never gotten anything useful out of this either.
Make sure that your DNS settings are correct (MX record, A record, PTR record). This was all correct for me, and nobody could point out a flaw in my configuration.
If you open a case, they'll also send you a link to "Improving E-mail Deliverability into Windows Live Hotmail". You can find this on Google as well, and it might give you some pointers.
If you're clearly sending an email campaign, add an opt-out link (which again was not the case for me).
Even if the destination address has your email address whitelisted, your mail might be silently discarded. This goes beyond all logic.
Having them send you an email and replying to it might get your mail delivered as well, although it looks clumsy to ask someone to send you an email so that you can actually use email.
Basically, the filter tries to "intelligently" determine what normal mail behavior is and takes action based on that. So there's a big chance you can get your mail delivered by improving the content of your mails.
All in all, I can only recommend not using hotmail, either for yourself or for your customers if you're a business, unless you always want to be doubting whether the other side actually received your mail. Sometimes you might be able to call, but if this is a lead through your site and they never get your response, that's lost business. Of course it's the user's choice, but if you can, try to convince them to use another mail account they have, as none of the other providers just silently deletes mail (or at least I've never seen it).
I hope this helps someone else.

JMeter NTLM/Windows Authentication Load Testing

What is to be done?
We have an application deployed on a SharePoint (corporate) server which uses Windows credentials to log into the application.
App URL format: http://testmachine:1000/sites/test/
Windows credentials format: user_id@domain.co.in
The objective is to perform load/performance testing on the application (especially the log-in functionality) for some number n of users.
Normally, when I hit the app URL in Firefox/IE, it pops up a window asking for credentials. I enter the credentials, browse the app and then log out. I intend to capture this in JMeter and simulate it for a large number of users.
Where I’m stuck?
Now I start the JMeter proxy server and then try the same steps as above. But when the pop-up window appears, JMeter simply doesn't record it, nor does it record anything else after the login.
What I’ve tried?
If I try the same steps after enabling "Automatically detect intranet network" in IE, then it simply auto-detects my Windows credentials (no credentials pop-up), logs me into the app (this is not recorded in JMeter either) and takes me to the home page. Any page I hit thereafter gets recorded in JMeter.
I've also tried to use the HTTP Authorization Manager with the following parameters:
BaseURL : http://testmachine:1000/sites/test/
Username: DOMAIN\USER_ID
Password: i_wont_tell_you
Domain: \
Realm:
It didn't help. I am quite confused about how to use the above element, and I'm not even sure whether it's the right approach to solve my problem.
Any help/suggestions?
P.S. I know about a tool called Badboy, but I'll go for it only as a last resort. I'm also not even sure whether it records the pop-up windows.
And sorry if the post is verbose.
UPDATE:
I have also tried -
Username: USER_ID and Domain: my_company_domain
But this is not the actual problem. The problem is that when I replay the pages I recorded previously (automation), they return a success response even if I haven't used the HTTP Authorization Manager. I'm not sure what I'm missing.
OK. Finally I got what was missing.
First, I had to change the implementation of every request to HttpClient3.1.
Second, it was really frustrating to see that the JMeter documentation was misleading. It says that the config file httpclient.parameters should be edited as follows:
http.authentication.preemptive$Boolean=false
But that didn't work. Changing it to true worked like a charm.
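For reference, the combination that ended up working looked like this. Note that jmeter.properties has to point at the parameter file for it to be read at all; that wiring via httpclient.parameters.file is my understanding of how JMeter loads HttpClient 3.1 parameters, so double-check it in your version:

# in jmeter.properties
httpclient.parameters.file=httpclient.parameters

# in httpclient.parameters -- the opposite of what the documentation suggests
http.authentication.preemptive$Boolean=true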
Hope this helps other people.
JMeter works at the HTTP layer, so the proxy will only capture requests made over this protocol layer. It sounds to me like you have already found the right approach for recording by using "Automatically detect intranet network" in IE: you can use this method to capture most requests, and you will have to figure out authentication manually. How you do that depends on how your application communicates with your server to authenticate a user.

Is there a way to redirect POST requests preserving POST data, or an alternative?

I am setting up a CDN relying only on header redirects or temporary URLs served by an API controlled by a database cluster.
The goal is to reduce hardware costs and have flexible nodes with only FTP/HTTP/PHP as requirements, creating a cheap solution for websites that can work with this.
However, my problem is that I want to have a static address that file uploads (containing a ClientID and token) can be sent to. I am using a simple POST.
But the file should be sent directly to the most idle server.
So what I want is to have a POST request to http://whatever.com/upload.php which is redirected to http://server-in-cdn.whatever.com/upload.php without losing the data.
The problem is that the POST request gets converted into a GET request and the POST data is lost.
The W3C documentation states that the 307 status code could be used, but it's not reliable and user confirmation is required.
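In other words, the exchange I'm after would look roughly like this (a sketch; whether the client silently repeats the POST or asks the user first depends on the browser):

POST /upload.php HTTP/1.1
Host: whatever.com
Content-Type: multipart/form-data; boundary=----x

(file data)

HTTP/1.1 307 Temporary Redirect
Location: http://server-in-cdn.whatever.com/upload.php

The client is then expected to repeat the identical POST, body included, against the Location URL.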
Or is there an alternative? I am not really into network stuff... but I think the classic solution would be some sort of load balancer or router running BGP/Quagga or something like that, and the traffic would still go over that node. Is that correct?
Or is there a way to redirect the traffic entirely at the network/DNS level?
Thanks in advance.

Redirect loop in ASP.NET app when used in America

I have a bunch of programs written in ASP.NET 3.5 and 4. I can load them fine (I'm in England) and so can my England-based colleagues. My American colleagues, however, are suffering redirect loops when trying to load any of the apps. I have tried it myself using Hide My Ass and can consistently recreate the issue.
I'm stumped. What could be causing a redirect loop for users in a specific country?!
The apps are hosted on IIS 6 on a dedicated Windows Server 2003. I have restarted IIS with no luck.
Edit
I should have made it clear that, unfortunately, I do not have access to the machines in the US to run Firefox Firebug/Fiddler. The message I get in Chrome is "This webpage has a redirect loop."
When you say "a redirect loop", do you mean a redirect as in an HTTP redirect? Or do you mean you have a TCP/IP routing loop?
A TCP/IP loop can be positively identified by performing a ping from one of the affected client boxes. If you get a "TTL expired" or similar message, then this is routing and unlikely to be application-related.
If you really meant an HTTP redirect, try running Fiddler, or even better HttpWatch Pro, and look at both the request headers and the corresponding responses. Better still, try comparing the request/response headers from the working non-US clients/servers to the failing US counterparts.
You could take a look with Live HTTP Headers in Firefox and see what it's trying to redirect to. It could possibly be trying to redirect to a URL based on the visitor's language/country, or perhaps the DNS is not fully propagated.
If you want to post the URL, I could give you the redirect trace.
"What could be causing a redirect loop for users in a specific country?!"
Globalization / localization related code
Geo-IP based actions
Using different base URLs in each country, and then redirecting from one to itself. For example, if you used uk.example.com in the UK, and us.example.com in the US, and had us.example.com redirect accidentally to itself for some reason.
Incorrect redirects on 404 Not Found errors.
Spurious meta redirect tags
Incorrect redirects based on authentication errors
Many other reasons
"I have tried myself using Hide My Ass and can consistently recreate this issue."
"I have restarted IIS with no luck."
"I do not have access to the machines in the US to run Firefox Firebug/Fiddler."
The third statement above doesn't make sense in light of the other two. If you can restart IIS or access the sites with a proxy, then you can run Fiddler, since it's a client-side application. Looking at the generated HTML and corresponding HTTP headers will be the best way to diagnose your problem.
