HTML scraping using YQL

I am trying to use YQL to scrape some websites. When I test various queries in the YQL console I get an empty results node. For example, when I run:
select * from html where url="http://www.reverbnation.com/" and xpath='/html/body'
I get an empty <results /> node back.
Thanks in advance!

http://www.reverbnation.com may be blocking the request coming from Yahoo! based on certain criteria, like headers. I had a look at reverbnation's robots.txt, and they aren't blocking Yahoo! based on the "Yahoo Pipes 2.0" user agent, so it must be something else.
To re-create the issue, make a YQL query against your own site, then check your access logs to see the full request and all the headers that came from Yahoo! After that, make a similar request using a tool like cURL and compare.
You can also run netcat listening on a port and point a YQL query at http://yoursite.com:PORT to capture the full request.
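If it helps to see it outside the console, here is a rough sketch in R with httr that sends the same kind of query to the public YQL REST endpoint, pointed at a site you control so you can watch the request arrive in your own access logs (the yoursite.com URL is a placeholder):
library(httr)

# Ask YQL's public endpoint to fetch a page you control, then inspect
# your server's access logs for the request Yahoo! actually made.
yql <- 'select * from html where url="http://yoursite.com/" and xpath="/html/body"'
res <- GET("https://query.yahooapis.com/v1/public/yql",
           query = list(q = yql, format = "xml"))
cat(content(res, as = "text"))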
Related issue discussed here.

Related

White Hat web scrape whitelisting

I've been periodically doing web scrapes for an eCommerce client of mine and getting through with read_html with no issues until recently. It seems they've now upgraded their website security, and my current attempts are being blocked.
As this is an expected function, I should be able to get them to add me to their whitelist (and maybe use a more efficient scraping technique).
As I've never asked IT to whitelist a crawler before, would I just need them to whitelist my IP address? Is there some sort of bot profile that I need to create? Any help will be appreciated. For now, I just need to be able to scrape the raw HTML.
I got things sorted. They needed a combination of my user agent string and my IP address, so I sent them xxx.xxx.xxx.xxx and "ExampleBot; +https://example.net".
Something like this worked for the read_html command:
library(httr)   # provides GET() and user_agent()
library(rvest)  # provides read_html() for parsing

webpage <- "https://example.com/products"  # placeholder for the client's page URL
# "ExampleBotBot; +https://example.net" is a placeholder, not my real bot's user agent
html <- try(read_html(GET(webpage, user_agent("ExampleBotBot; +https://example.net"))))
That code reads the HTML of the page into the html variable so I can parse it with rvest.
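From there, parsing is standard rvest; for example (the CSS selector below is made up, not from the client's actual site):
# Hypothetical follow-up: pull the text of matching nodes out of the parsed page
titles <- html_text(html_nodes(html, ".product-title"))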

Scraping data from stats.nba.com: Error in curl::curl_fetch_memory(url, handle = handle)

I'd like to scrape team advanced stats from stats.nba.com.
My current code to get the XHR file where the data is stored is:
library(httr)
library(jsonlite)
nba <- GET('https://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=11%2F12%2F2019&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Advanced&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2019-20&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&TwoWay=0&VsConference=&VsDivision=')
I get the URL via these steps in Chrome:
Inspect -> Network -> XHR
The code throws this error:
Error in curl::curl_fetch_memory(url, handle = handle) :
LibreSSL SSL_read: SSL_ERROR_SYSCALL, errno 60
I also tried it with custom advanced filters on the website, which either produce the same error or leave the code running forever. I'm not that great at web scraping, so I would appreciate it if anyone could point out what the issue is here.
I have had a good look at this. It looks like this site goes to some lengths to prevent scraping, and it won't give you the JSON from that URL unless you provide cookies that are generated by a back-and-forth between your browser's JavaScript and their servers. They also monitor request timings with New Relic technology, so they are likely to block your IP if you scrape multiple pages. It wouldn't be impossible, but it would be very, very hard.
If you are desperate for the data, you could look into the NBA API, which requires a sign-up but is free to use for 1000 requests per day.
The other option is to automate a browser using RSelenium to get the HTML of the fully rendered pages.
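A minimal sketch of the RSelenium route, assuming you have a working Selenium setup (rsDriver() will try to download and start a browser driver; the URL and the fixed wait are illustrative only):
library(RSelenium)
library(rvest)

rd <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- rd$client
remDr$navigate("https://stats.nba.com/teams/advanced/")
Sys.sleep(10)  # crude wait for the JavaScript-rendered tables to load
page_source <- remDr$getPageSource()[[1]]
tables <- html_table(read_html(page_source))
remDr$close()
rd$server$stop()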
Of course, if you only want this one page, you can just copy the HTML from Chrome's inspector and then use rvest::read_html(readClipboard()) (note that readClipboard() is Windows-only).

What will the RightSignature API send to my callback URL when a signer signs a document?

When I send a one-off document to RightSignature via their API, I'm specifying a callback location in the XML document as specified in RightSignature's schema definition. I then get a signer-link value back from their API for the document. I display the HTML response from the signer-link URL in an iFrame on our website. When our user signs the document in this iFrame, which is rendering the responses from their website, I want their website to post to our callback location.
Can I do this with the RightSignature API and does it make sense?
So far, I'm only getting content in the iFrame that indicates that the signing was successful. The callback location does not seem to be getting called.
I got it solved just now. Basically, I was doing two things wrong. First, you have to go into your RightSignature account and set the callback URL there:
Account > Settings > Advanced Settings
The thing RightSignature doesn't make clear is that this URL cannot be a localhost address; it needs to be a live HTTPS URL for your site, like:
https://stagingmysite.azurewebsites.net/User/CallBackFunction
Then, in your callback, just write these two lines and you will receive the complete XML, which includes the GUID and the document status as well.
byte[] data = Request.BinaryRead(Request.TotalBytes);           // read the raw POST body
string callBackXML = System.Text.Encoding.UTF8.GetString(data);  // decode the bytes into the XML string
I found the answer with some help from the API team at RightSignature. I was using callback_location, but what I really wanted was redirect_location. Their online documentation was difficult to follow and did not clearly point out the difference.
I got this working after a lot of trial and error.

How to make POST request using OAuth via Youtube API?

I have been trying to get this to work for a couple of days without any luck, since it's my first time working with the OAuth system.
I have been experimenting here: https://developers.google.com/youtube/v3/docs/subscriptions/insert#try-it
With the following settings:
http://i.gyazo.com/5cd28f1194d5dfebee25d07bc0db965e.png
When I execute the code, it successfully subscribes the authorized account to the channel with the specified channelId.
I have tried copying and pasting the shown POST URL into my browser without any luck. The plan was just to test it, as I would like to implement this in PHP.
Now to my questions:
Is {YOUR_API_KEY} where I am supposed to put the access token? If so, do I need the &mine=true parameter at all?
I just realized that there are no IDs in the URL, but there is a JSON object in the request box example. Am I supposed to convert a string to a JSON object and pass it to the $fields= parameter?
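For what it's worth, the API key and the OAuth access token are different credentials: the key identifies your application, while the access token authorizes the user. For subscriptions.insert, the token goes in the Authorization header and the channel ID goes in the JSON request body, not the URL. A rough sketch in R with httr (access_token is assumed to hold a valid OAuth 2.0 token obtained beforehand; the channelId is just an example):
library(httr)
library(jsonlite)

# access_token is assumed to have been obtained through the OAuth flow already
body <- list(snippet = list(resourceId = list(
  kind = "youtube#channel",
  channelId = "UC_x5XG1OV2P6uZZ5FSM9Ttw"  # example channel to subscribe to
)))
res <- POST("https://www.googleapis.com/youtube/v3/subscriptions",
            query = list(part = "snippet"),
            add_headers(Authorization = paste("Bearer", access_token)),
            body = toJSON(body, auto_unbox = TRUE),
            content_type_json())
content(res)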

How to find HTTP POST Data sent to a CGI Page?

I searched google for a good number of hours. Maybe I searched for the wrong keywords.
Here is what I want to do.
I'm posting data to a website, which then makes an HTTP POST request and returns a .CGI webpage. I want to know the parameters the page uses to send that HTTP POST request, so that I can link directly from my webpage to the final .CGI page by having the user enter the data on my own webpage.
How do I achieve it?
Usually the POST body is piped into STDIN; just read it as you would a normal file.
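For example, a minimal CGI script in R that echoes the POST body back (CONTENT_LENGTH is the standard CGI environment variable holding the body size):
#!/usr/bin/env Rscript
# The web server sets CONTENT_LENGTH and pipes the POST body to stdin.
len <- as.integer(Sys.getenv("CONTENT_LENGTH", "0"))
body <- readChar(file("stdin"), nchars = len)
cat("Content-Type: text/plain\n\n")   # CGI response header, then blank line
cat("Received POST body:", body, "\n")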
