I had a nice little package to scrape Google Ngram data, but I have discovered they have switched to SSL and my package has broken. Switching from readLines to getURL gets some of the way there, but some of the script included in the page is missing. Do I need to get fancy with user agents or something?
Here is what I have tried so far (pretty basic):
library(RCurl)
myurl <- "https://books.google.com/ngrams/graph?content=hacker&year_start=1950&year_end=2000"
getURL(myurl)
Comparing the results with the source shown after entering the URL in a browser reveals that the crucial content is missing from what is returned to R. In the browser, the source includes content like this:
<script type="text/javascript">
var data = [{"ngram": "hacker", "type": "NGRAM", "timeseries": [9.4930387994907051e-09,
1.1685493106483591e-08, 1.0784501440023556e-08, 1.0108472218003532e-08,
etc.
Any suggestions would be greatly appreciated!
Sorry, not a direct solution, but it doesn't seem to be a user-agent problem. When you open your URL in a browser, you can see that there is a redirect that adds a parameter to the end of the address: direct_url=t1%3B%2Chacker%3B%2Cc0.
If you use getURL() to download this new URL, complete with the added parameter, the JavaScript you mention is present in the result.
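For example, a minimal sketch (the direct_url value is the one from the redirect above; other n-grams will have a different value):
library(RCurl)
myurl <- paste0("https://books.google.com/ngrams/graph?",
                "content=hacker&year_start=1950&year_end=2000",
                "&direct_url=t1%3B%2Chacker%3B%2Cc0")
page <- getURL(myurl)
# the embedded 'var data = [...]' script block should now be present
grepl("var data", page)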
Another solution could be to try to access data via Google BigQuery, as mentioned in this SO question :
Google N-Gram Web API
I tried
download.file('https://www.dropbox.com/s/r3asyvybozbizrm/Himalayas.jpg',
              destfile = "1.jpg",
              method = "auto")
but it returns the HTML source of that page.
I also tried a bit of rdrop2:
library(rdrop2)
# please put in your key/secret
drop_auth(new_user = FALSE, key = key, secret = secret, cache = TRUE)
And the pop-up website reports:
Invalid redirect_uri: "http://localhost:1410": It must exactly match one of the redirect URIs you've pre-configured for your app (including the path).
I don't understand the URI thing very well. Can somebody recommend some documentation to read, please?
I read some posts, but most of them discuss how to read data from Excel files, and repmis worked only for reading Excel files...
library(repmis)
repmis::source_DropboxData("test.csv",
                           "tcppj30pkluf5ko",
                           sep = ",",
                           header = FALSE)
Also tried
library(RCurl)
url='https://www.dropbox.com/s/tcppj30pkluf5ko/test.csv'
x = getURL(url)
read.csv(textConnection(x))
And it didn't work...
Any help and discussion is appreciated. Thanks!
The first issue is because the https://www.dropbox.com/s/r3asyvybozbizrm/Himalayas.jpg link points to a preview page, not the file content itself, which is why you get the HTML. You can modify links like this though to point to the file content, as shown here:
https://www.dropbox.com/help/201
E.g., add a raw=1 URL parameter:
https://www.dropbox.com/s/r3asyvybozbizrm/Himalayas.jpg?raw=1
Your downloader will need to follow redirects for that to work though.
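For instance, a minimal sketch with RCurl (followlocation tells curl to follow the redirect):
library(RCurl)
# raw=1 serves the file content rather than the preview page
bin <- getBinaryURL("https://www.dropbox.com/s/r3asyvybozbizrm/Himalayas.jpg?raw=1",
                    followlocation = TRUE)
writeBin(bin, "1.jpg")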
The second issue is because you're trying to use an OAuth 2 app authorization flow, which requires that all redirect URIs be pre-registered. You can register redirect URIs (in your case, http://localhost:1410) for Dropbox API apps on the app's page in the App Console:
https://www.dropbox.com/developers/apps
For more information on using OAuth, you can refer to the Dropbox API OAuth guide here:
https://www.dropbox.com/developers/reference/oauthguide
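Once the redirect URI is registered, the drop_auth() call from your question should work as-is:
library(rdrop2)
# assumes http://localhost:1410 has been added on the App Console
drop_auth(new_user = FALSE, key = key, secret = secret, cache = TRUE)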
I use read.table(url("yourdropboxpubliclink")). For instance, instead of using https://www.dropbox.com/s/xyo8sy9velpkg5y/foo.txt?dl=0, which is the shared link on Dropbox, I use
https://dl.dropboxusercontent.com/u/15634209/histogram/foo.txt
and for a non-public link, raw=1 will work.
It works fine for me.
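In other words, a minimal sketch (whether header and sep match depends on your file):
# use the direct-content host instead of the share page
df <- read.table(url("https://dl.dropboxusercontent.com/u/15634209/histogram/foo.txt"),
                 header = FALSE)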
I have a coding problem regarding Python 3.5 web crawling.
I am trying to use requests.get to extract the real link from 'http://www.baidu.com/link?url=ePp1pCIHlDpkuhgOrvIrT3XeWQ5IRp3k0P8knV3tH0QNyeA042ZtaW6DHomhrl_aUXOaQvMBu8UmDjySGFD2qCsHHtf1pBbAq-e2jpWuUd3'. An example of the code is below:
import requests
response = requests.get('http://www.baidu.com/link?url=ePp1pCIHlDpkuhgOrvIrT3XeWQ5IRp3k0P8knV3tH0QNyeA042ZtaW6DHomhrl_aUXOaQvMBu8UmDjySGFD2qCsHHtf1pBbAq-e2jpWuUd3')
c = response.url
I expected 'c' to be 'caifu.cnstock.com/fortune/sft_jj/tjj_yndt/201605/3787477.htm'. (I removed http:// from the link as I can't post two links in one question.)
However, it doesn't work; it keeps returning the same link I put in.
Can anyone help with this? Many thanks in advance.
Thanks a lot to Charlie.
I have found the solution. I first used .content.decode to read the response body, but that is mixed up with a lot of irrelevant info. I then used re.findall to extract the redirect URL, which is the first quoted URL in the page source. Then I used requests.get to retrieve the info. Below is the code:
import requests
import re

rep1 = requests.get(url)
cont = rep1.content.decode('utf-8')
# the redirect target is the first quoted string in the page source
extract_cont = re.findall('"([^"]*)"', cont)
redir_url = extract_cont[0]
rep = requests.get(redir_url)
You may consider looking into the response headers for a 'location' header.
response.headers['location']
You may also consider looking at the response history, which contains a Response object for each hop in a chain of redirects:
response.history
Your sample URL doesn't redirect at the HTTP level; the response is a 200, and the page then uses a JavaScript window.location change. The requests library won't follow this type of redirect.
<script>window.location.replace("http://caifu.cnstock.com/fortune/sft_jj/tjj_yndt/201605/3787477.htm")</script>
<noscript><META http-equiv="refresh" content="0;URL='http://caifu.cnstock.com/fortune/sft_jj/tjj_yndt/201605/3787477.htm'"></noscript>
If you know you will always be using this one service, you could parse the response, maybe using regex.
If you don't know what service will always be used and also want to handle every possible situation, you might need to instantiate a WebKit instance or something and somehow try to determine when it finally finishes. I'm sure there's a page load complete event which you could use, but you still might have pages that do a window.location change after the page is loaded using a timer. This will be very heavyweight and still not cover every conceivable type of redirect.
I recommend starting with writing a special handler for each type of edge case and fallback on a default handler that just looks at the response.url. As new edge cases come up, write new handlers. It's kind of the 'trial and error' approach.
I want to upload a csv file to a REST API.
The API is accessible via a URL like
http://sampledomain.com/api/data/?key=xxx
A provided sample curl call looks as follows:
curl --form "file=@my_data.zip" \
"http://sampledomain.com/api/data/?key=xxx"
How can I translate this call into R?
I heard of the RCurl package, but can't figure out how to use it in this case.
Regards
I am not sure RCurl will handle it, as you can see from the limitations listed on the first page of its documentation:
Limitations: One doesn't yet have full control over the contents of a POST form, such as specifying files or content type. Error handling uses a single global variable at present.
However, another package from Hadley, httr, might solve your problem:
POST("http://sampledomain.com/api/data/?key=xxx", body = list(y = upload_file(system.file("my_data.zip"))))
I'm attempting to create a streaming web service; unfortunately I'm not even sure of the overall concept. My idea is to have a method that returns a string with the URL of the streaming page.
I've tried many different ways to do this, but none of them worked. I tried using DownloadString, even writing the raw URL, but I always got errors, so I found one way to just make it happen:
[WebMethod]
public string WatchMedia(string title)
{
    Global.Media = title;
    Streaming str = new Streaming(); // Streaming.aspx
    return str.GetURL();
}
Okay, so in my aspx.cs I included this:
internal string GetURL()
{
    return HttpContext.Current.Request.Url.AbsoluteUri.ToString();
}
Don't really ask me about the 'internal'; I'm so tired of trying different ways to get this to work that I just go along with what VS builds for me.
That does give me the URL I thought I wanted, BUT it doesn't work. Why? Because it says, give or take (directly translated):
Request format is unrecognized for URL unexpectedly ending in '/WatchMedia'
WatchMedia is the name of my method, as seen above.
Now, besides hoping someone can give me a straight answer as to what ridiculous sin I'm committing here, I'd like to know whether this is the way a streaming web service should work. I can't seem to find any real information about video streaming web services on the web; not even Google will tell me!
If you ever have the same problem, just forget creating an object of the aspx page and get the URL raw: run the page and copy it, and then all you have to do is change the localhost port, which you can get from HttpContext.
I am trying to use the XML and RCurl packages to read some HTML tables from the following URL:
http://www.nse-india.com/marketinfo/equities/cmquote.jsp?key=SBINEQN&symbol=SBIN&flag=0&series=EQ#
Here is the code I am using
library(RCurl)
library(XML)
options(RCurlOptions = list(useragent = "R"))
url <- "http://www.nse-india.com/marketinfo/equities/cmquote.jsp?key=SBINEQN&symbol=SBIN&flag=0&series=EQ#"
wp <- getURLContent(url)
doc <- htmlParse(wp, asText = TRUE)
docName(doc) <- url
tmp <- readHTMLTable(doc)
## Required tables
tmp[[13]]
tmp[[14]]
If you look at the tables, you can see that it has not been able to parse the values from the webpage.
I guess this is due to some JavaScript evaluation happening on the fly.
Now, if I use the "save page as" option in Google Chrome (it does not work in Mozilla), save the page, and then run the above code on the saved file, I am able to read in the values.
But is there a workaround so that I can read the table on the fly?
It would be great if you could help.
Regards,
It looks like they're building the page with JavaScript, by accessing http://www.nse-india.com/marketinfo/equities/ajaxGetQuote.jsp?symbol=SBIN&series=EQ and parsing out some string. Maybe you could grab that data and parse it instead of scraping the page itself.
Looks like you'll have to build a request with the proper referrer headers using cURL, though. As you can see, you can't just hit that ajaxGetQuote page with a bare request.
You can probably read the appropriate headers to put in by using the Web Inspector in Chrome or Safari, or by using Firebug in Firefox.
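For example, a minimal sketch with RCurl (which headers the server actually checks is an assumption; copy the real ones from the Web Inspector):
library(RCurl)
ajax <- "http://www.nse-india.com/marketinfo/equities/ajaxGetQuote.jsp?symbol=SBIN&series=EQ"
ref <- "http://www.nse-india.com/marketinfo/equities/cmquote.jsp?key=SBINEQN&symbol=SBIN&flag=0&series=EQ"
# send a Referer header so the request doesn't look bare
res <- getURL(ajax, httpheader = c(Referer = ref, `User-Agent` = "R"))
# res should contain the raw quote string that the page's JavaScript parses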