White Hat webscrape whitelisting in R

I've been periodically doing web scrapes for an eCommerce client of mine and getting through with read_html with no issues up until recently. It seems they've now upgraded their website security and my current attempts are now being blocked.
As this is an expected function, I should be able to get them to add me to their whitelist (and maybe use a more efficient scraping technique).
As I've never asked IT to whitelist a crawler before, would I just need them to whitelist my IP address? Is there some sort of bot profile that I need to create? Any help will be appreciated. For now, I just need to be able to scrape the raw HTML.

I got things sorted. They needed a combination of my user agent string and my IP address, so I sent them xxx.xxx.xxx.xxx and "ExampleBot; +https://example.net".
Something like this worked for the read_html command:
html <- try(read_html(GET(webpage, user_agent("ExampleBot; +https://example.net"))))
# Not my real bot's user agent
That code reads the HTML of the page into the html variable so I can parse it with rvest.
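In case it is useful to anyone later, the parsing step then carries on as usual with rvest; a rough sketch (the URL, user-agent string, and CSS selector are all made up for illustration):

library(httr)
library(rvest)   # read_html() is re-exported from xml2

webpage <- "https://www.example-shop.test/products"        # placeholder URL
ua      <- user_agent("ExampleBot; +https://example.net")   # the whitelisted UA string

html <- try(read_html(GET(webpage, ua)))

if (!inherits(html, "try-error")) {
  # ".product-title" is a hypothetical selector; adjust it to the site's markup
  product_names <- html_text2(html_elements(html, ".product-title"))
}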

Related

How do I fix my localhost for Shiny app OAuth2.0 Authentication?

I am attempting to create a program which uses a user's Spotify data. I've conducted the following steps as per the documentation:
Set up application
Registered redirect urls on application dashboard
Obtained Client ID and secret.
The code I'm trying to use to get authentication is below:
client_id <- "<CLIENT_ID>"
redirect_url <- "http://localhost:8888/callback/"
link <- glue::glue('https://accounts.spotify.com/authorize?client_id={client_id}&response_type=code&redirect_uri={redirect_url}&scope=user-top-read playlist-modify-public playlist-modify-private user-read-private user-library-read user-library-modify')
browseURL(link,
          browser = getOption("browser"),
          encodeIfNeeded = FALSE)
I was able to get it to show an authorization page once; when I tried to approve the application, I received a localhost connection error (Connection Refused). This error now happens as soon as I run the code (no authorization page is generated).
I've gone through all the steps to fix this issue (flushing DNS, disabling the firewall, trying different redirect URLs, resetting my router), but nothing seems to work.
Does anyone have any suggestions on what I might be doing wrong?
I think the proper way of doing OAuth 2.0 authentication is via the httr::oauth2.0_* family. They do not show an example for Spotify, but it should be rather straightforward to set the "dance" up with this framework.
Type demo("oauth2-github") (or refer to the code repo on GitHub) for an example using OAuth 2.0 for GitHub and adapt the code for Spotify. Be aware that httr provides a convenience function (oauth_endpoints) for some providers (but not Spotify). Hence, you have to provide the necessary config (mainly the proper URLs) using oauth_endpoint (note the missing s).
If you have particular questions, come back with some code and I am sure we can help.
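A minimal sketch of that setup, untested against Spotify (the authorize/token URLs below are the ones from Spotify's own documentation; the client ID, secret, and app name are placeholders):

library(httr)

# Spotify is not covered by oauth_endpoints(), so define the endpoint by hand
spotify_endpoint <- oauth_endpoint(
  authorize = "https://accounts.spotify.com/authorize",
  access    = "https://accounts.spotify.com/api/token"
)

spotify_app <- oauth_app(
  appname = "my-spotify-app",   # any label you like
  key     = "<CLIENT_ID>",
  secret  = "<CLIENT_SECRET>"
)

token <- oauth2.0_token(
  endpoint = spotify_endpoint,
  app      = spotify_app,
  scope    = "user-top-read playlist-modify-public playlist-modify-private user-read-private user-library-read user-library-modify"
)

# Use the token on subsequent API calls, e.g.
GET("https://api.spotify.com/v1/me", config(token = token))

One detail worth checking: httr's default OAuth callback is http://localhost:1410/, so that exact redirect URI (not port 8888) has to be registered on the Spotify dashboard.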

Scraping Websites via Google Cached Pages has been blocked

I'm trying to create a service that scrapes websites using Google Cached Pages.
Example
https://webcache.googleusercontent.com/search?q=cache:nike.com
The Response that I get is the HTML from Google cache, which is an older version of the Nike site.
It works fine as long as I run it locally on my computer, but when I deploy it to Google Cloud Platform, where I use a proxy server, I get a 403 error saying that I cannot access the information through the proxy server.
Example of the response from the proxy server:

403. That's an error. Your client does not have permission to get URL /search?q=cache:http://nike.com from this server. (Client IP address: XX.XXX.XX.XXX)

Please see Google's Terms of Service posted at https://policies.google.com/terms. If you believe that you have received this response in error, please report your problem. However, please make sure to take a look at our Terms of Service (http://www.google.com/terms_of_service.html). In your email, please send us the entire code displayed below. Please also send us any information you may know about how you are performing your Google searches -- for example, "I'm using the Opera browser on Linux to do searches from home. My Internet access is through a dial-up account I have with the FooCorp ISP." or "I'm using the Konqueror browser on Linux to search from my job at myFoo.com. My machine's IP address is 10.20.30.40, but all of myFoo's web traffic goes through some kind of proxy server whose IP address is 10.11.12.13." (If you don't know any information like this, that's OK. But this kind of information can help us track down problems, so please tell us what you can.) We will use all this information to diagnose the problem, and we'll hopefully have you back up and searching with Google again quickly! Please note that although we read all the email we receive, we are not always able to send a personal response to each and every email. So don't despair if you don't hear back from us! Also note that if you do not send us the entire code below, we will not be able to help you. Best wishes, The Google
An article that talks about the problem: https://proxyserver.com/web-scraping-crawling/scraping-websites-via-google-cached-pages/
How can I solve this problem and run requests from the cloud without being blocked? Do I need to add parameters?
Thanks :)
I guess that you should add a property to the headers of your HTTP request, for example:
URL u = new URL("https://www.google.com//search?q=c");
URLConnection c = u.openConnection();
c.setRequestProperty("User-Agent", "MSIE 7.0");
or
HttpRequest request = HttpRequest.newBuilder(new URI("https://www.google.com//search?q=c"))
        .header("User-Agent", "MSIE 7.0")
        .GET()
        .build();
// note to change the URI
These two examples are in Java, but the same concept applies in any environment, I guess.
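In R with httr, the same header trick would look roughly like this (the user-agent string is only an example):

library(httr)

# Same concept in R: send an explicit User-Agent header with the request
resp <- GET(
  "https://webcache.googleusercontent.com/search?q=cache:nike.com",
  user_agent("Mozilla/5.0 (compatible; MSIE 7.0)")
)
status_code(resp)   # check whether the 403 goes away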
Hope that was helpful.

Scraping data from stats.nba.com, Getting Error in curl::curl_fetch_memory(url, handle = handle)

I'd like to scrape team advanced stats from stats.nba.com.
My current code to get the XHR file where the data is stored is:
library(httr)
library(jsonlite)
nba <- GET('https://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=11%2F12%2F2019&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Advanced&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2019-20&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&TwoWay=0&VsConference=&VsDivision=')
I get the URL via these steps in Chrome:
Inspect -> Network -> XHR
The code throws this error:
Error in curl::curl_fetch_memory(url, handle = handle) :
LibreSSL SSL_read: SSL_ERROR_SYSCALL, errno 60
I also tried it with custom advanced filters on the website, which either results in the same error or the code running forever. I'm not that great at web scraping, so I would appreciate it if anyone could point out what the issue is here.
I have had a good look at this. It looks like this site goes to some lengths to prevent scraping, and won't give you the json from that url unless you provide it with cookies that are generated by a back-and-forth between your browser's javascript and their own servers. They also monitor request timings with New Relic technology and are therefore likely to block your IP if you scrape multiple pages. It wouldn't be impossible, but very, very hard.
If you are desperate for the data you could look into using the NBA API, which requires a sign-up but is free to use for 1,000 requests per day.
The other option is to automate a browser using RSelenium to get the html of the fully rendered pages (a rough sketch follows at the end of this answer).
Of course, if you only want this one page, you can just copy the html from your Chrome's inspector, then use rvest::read_html(readClipboard())
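If you do go down the RSelenium road, a rough and untested sketch of the idea (it assumes Chrome with a working chromedriver; the port is arbitrary):

library(RSelenium)
library(rvest)

rd    <- rsDriver(browser = "chrome", port = 4545L, verbose = FALSE)
remDr <- rd$client

# The page behind the XHR call in the question; the path may have changed since
remDr$navigate("https://stats.nba.com/teams/advanced/")
Sys.sleep(5)   # crude wait for the JavaScript-rendered tables to appear

page   <- read_html(remDr$getPageSource()[[1]])
tables <- html_table(page)

remDr$close()
rd$server$stop()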

HTML scraping using YQL

I am trying to use YQL to scrape some websites. When I test various queries in the YQL console I get an empty results node. So, for example, when I run:
select * from html where url="http://www.reverbnation.com/" and xpath='/html/body'
I get an empty <results /> node.
Thanks in advance!
http://www.reverbnation.com may be blocking the request coming from Yahoo! based on certain criteria, like headers. I had a look at reverbnation's robots.txt, and they aren't blocking Yahoo! based on the "Yahoo Pipes 2.0" user agent, so it must be something else.
To re-create the issue, make a YQL query against your own site, then look at the full access logs to see the full request and all headers that came from Yahoo! Then make a similar request using a tool like cURL.
You can also try running netcat on a port and querying http://yoursite.com:PORT to see the full request.
Related issue discussed here.

Is there a way to redirect Post requests preserving post data or an alternative?

I am setting up a CDN relying only on header redirects or temporary URLs served by an API controlled by a database cluster.
The goal is to reduce hardware costs and have flexible nodes, with only FTP/HTTP/PHP as requirements, and create a cheap solution for websites that can work with this.
However, my problem is that I want to have a static address where file uploads (containing a ClientID and Token) can be sent. I am using a simple POST.
But the file should be sent directly to the most idle server.
So what I want is a POST request to http://whatever.com/upload.php which is redirected to http://server-in-cdn.whatever.com/upload.php without losing the data.
The problem is that the POST request gets converted into a GET request and the POST data is lost.
The W3C documentation states that the 307 status code could be used, but it's not reliable and user confirmation is required.
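To make that concrete, here is a rough sketch in R/httr (httpbin.org is just a stand-in target, nothing to do with my actual setup) showing the behaviour I need, namely that the POST body survives the redirect:

library(httr)

# httpbin's /redirect-to replies with the requested status code; with a 307 this
# client re-sends the form body to the final endpoint, which echoes it back.
r <- POST(
  modify_url("https://httpbin.org/redirect-to",
             query = list(url = "https://httpbin.org/post", status_code = 307)),
  body = list(ClientID = "123", Token = "abc"),
  encode = "form"
)
content(r)$form   # the posted fields arrive at the redirect target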
Or is there an alternative? I am not really into network stuff, but I think the classic solution would be some sort of load balancer or a router running BGP/Quagga or something like that, and the traffic would still go over that node. Is that correct?
Or is there a way to redirect the traffic entirely at the network/DNS level?
Thanks in advance.
