Setting "an informative User-Agent string" in getURL - r

I am trying to access a Wikipedia page to get a list of pages, and I get the following error:
library(RCurl)
u <- "http://en.wikipedia.org/w/index.php?title=Special%3APrefixIndex&prefix=tal&namespace=4"
getURL(u)
[1] "Scripts should use an informative User-Agent string with contact information, or they may be IP-blocked without notice.\n"
I was hoping to get to that page through the Wikipedia API, but I am not sure it would work.
The odd thing is that other pages are read without a problem, for example:
u <- "http://en.wikipedia.org/wiki/Wikipedia:Talk"
getURL(u)
Any suggestions?
Side note: in general I would rather not scrape wiki pages and instead go through the API, but I fear that these specific pages are not yet available through the API...

According to the RCurl documentation, you can specify additional headers by adding an httpheader parameter:
getURL(u, httpheader = c('User-Agent' = "Informative string with your contact info"))
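For example, applied to the URL from the question (a sketch; the User-Agent string and contact address below are placeholders you should replace with your own details):
library(RCurl)
u <- "http://en.wikipedia.org/w/index.php?title=Special%3APrefixIndex&prefix=tal&namespace=4"
getURL(u, httpheader = c(`User-Agent` = "MyWikiScript/0.1 (contact: you@example.com)"))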

How can I download an rds file from Dropbox in R? [duplicate]

I tried
download.file('https://www.dropbox.com/s/r3asyvybozbizrm/Himalayas.jpg',
              destfile = "1.jpg",
              method = "auto")
but it returns the HTML source of that page.
I also tried a bit of rdrop2:
library(rdrop2)
# please put in your key/secret
drop_auth(new_user = FALSE, key = key, secret = secret, cache = TRUE)
And the pop up website reports:
Invalid redirect_uri: "http://localhost:1410": It must exactly match one of the redirect URIs you've pre-configured for your app (including the path).
I don't understand the redirect URI thing very well. Can somebody recommend some documentation to read, please?
I read some posts, but most of them discuss how to read data from Excel files.
repmis worked only for reading Excel files...
library(repmis)
repmis::source_DropboxData("test.csv",
                           "tcppj30pkluf5ko",
                           sep = ",",
                           header = FALSE)
Also tried
library(RCurl)
url <- 'https://www.dropbox.com/s/tcppj30pkluf5ko/test.csv'
x <- getURL(url)
read.csv(textConnection(x))
And it didn't work...
Any help and discussion is appreciated. Thanks!
The first issue is because the https://www.dropbox.com/s/r3asyvybozbizrm/Himalayas.jpg link points to a preview page, not the file content itself, which is why you get the HTML. You can modify links like this to point to the file content, though, as shown here:
https://www.dropbox.com/help/201
E.g., add a raw=1 URL parameter:
https://www.dropbox.com/s/r3asyvybozbizrm/Himalayas.jpg?raw=1
Your downloader will need to follow redirects for that to work though.
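For example, with base R (a sketch: mode = "wb" keeps the JPEG binary intact, and method = "libcurl" is one downloader that follows the redirect triggered by raw=1):
download.file("https://www.dropbox.com/s/r3asyvybozbizrm/Himalayas.jpg?raw=1",
              destfile = "Himalayas.jpg",
              method = "libcurl",
              mode = "wb")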
The second issue is because you're trying to use an OAuth 2 app authorization flow, which requires that all redirect URIs be pre-registered. You can register redirect URIs (in your case, http://localhost:1410) for Dropbox API apps on the app's page in the App Console:
https://www.dropbox.com/developers/apps
For more information on using OAuth, you can refer to the Dropbox API OAuth guide here:
https://www.dropbox.com/developers/reference/oauthguide
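Once the redirect URI is registered, the rdrop2 flow from the question should get past that error. A sketch, assuming key and secret come from the same App Console page (note the corrected new_user argument name):
library(rdrop2)
drop_auth(new_user = FALSE, key = key, secret = secret, cache = TRUE)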
I use read.table(url("yourdropboxpubliclink")). For instance, instead of using
https://www.dropbox.com/s/xyo8sy9velpkg5y/foo.txt?dl=0, which is the shared link on Dropbox, I use
https://dl.dropboxusercontent.com/u/15634209/histogram/foo.txt
For a non-public link, appending raw=1 will also work.
It works fine for me.
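As a sketch of that pattern, mirroring the link above (adjust header and sep arguments to suit the file):
read.table(url("https://dl.dropboxusercontent.com/u/15634209/histogram/foo.txt"))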

api authorization header r httr

I am trying to access the Open Apparel Registry api using httr.
NB: It's free to sign up (you need to log in and get an authentication token from your profile page).
But you can see the swagger api docs here: https://openapparel.org/api/docs/#!/facilities/facilities_list
Here is how I tried to authorize in R, mirroring the web version:
oar_root_api <- "https://openapparel.org/api/facilities/"
oar_token <- XXX
oar_api_facilities_GET <- httr::GET(url = oar_root_api,
                                    add_headers(`Authorization` = oar_token),
                                    verbose())
The status code I receive back is 401, so something is wrong with my authorization, but I've tried so many ways and I can't figure out how to specify this correctly.
Sorry to hear you've been having difficulties. This additional documentation may help: https://docs.google.com/document/d/1ZKCN84Eu9WDAXUokojOw7Dcg5TAJw0vKnVk7RPrTPZ0/edit?usp=sharing
We tend to find that users need to add the "Token" prefix (see page 3), which I appreciate isn't standard practice - this is something we intend to change!
Let us know how you get on.
OAR
The Open Apparel Registry (OAR) uses Django REST Framework to provide API endpoints. The TokenAuthentication class requires that the Authorization header value have a "Token " prefix. From the documentation:
For clients to authenticate, the token key should be included in the Authorization HTTP header. The key should be prefixed by the string literal "Token", with whitespace separating the two strings. For example:
Authorization: Token 9944b09.....
I am not familiar with R, but I searched for string concatenation and it looks like the paste function will build the header value that you need.
oar_root_api <- "https://openapparel.org/api/facilities/"
oar_token <- XXX
oar_api_facilities_GET <- httr::GET(url = oar_root_api,
                                    add_headers(`Authorization` = paste("Token", oar_token)),
                                    verbose())
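As a quick follow-up check (a sketch using standard httr helpers), you can confirm the 401 has cleared and inspect the payload:
httr::status_code(oar_api_facilities_GET)   # should now be 200
str(httr::content(oar_api_facilities_GET, as = "parsed"))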

http request post username and password with header to get auth/key in R

Can someone recommend resources for learning how to make an HTTP POST request with a username and password in the header to get an auth key in R? Thanks.
The Examples section of httr's GET function documentation is useful.
There is one example which demonstrates how to add headers...
GET(url, add_headers(a = 1, b = 2))
...and another which shows how to authenticate...
GET(url, authenticate("username", "password"))
A good place for a quick start is the httr package's quickstart vignette: https://cran.r-project.org/web/packages/httr/vignettes/quickstart.html
POST requests are covered by the POST() function; the exact implementation and next steps depend on your use case.
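For example, a minimal sketch of a POST that sends credentials and reads back a key; the endpoint and field names here are hypothetical placeholders, not a real API:
library(httr)
resp <- POST("https://example.com/api/login",
             body = list(username = "myuser", password = "mypass"),
             encode = "json")
key <- content(resp)$key   # assumes the service returns JSON with a "key" field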

How to properly set cookies to get URL content using httr

I need to download information from a web site that is protected using cookies. I get past this protection manually and then pass the cookies to httr.
Here is a similar topic, but it does not solve my problem: (Copying cookie for httr)
library(httr)
url<-"http://smida.gov.ua/db/emitent/year/xml/showform/32153/125/templ"
cook<-"_SMIDA=9117a9eb136353bd6956651bd59acd37; __utmt=1; __utma=29983421.1729484844.1413489369.1413625619.1413627797.3; __utmb=29983421.7.10.1413627797; __utmc=29983421; __utmz=29983421.1413489369.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)"
response <- GET(url, config(cookie= cook))
content(x = response,as = 'text', encoding = "UTF-8")
When I use content(), it returns information saying that I am not logged in (the same as without the cookie).
How can I solve this problem?
(Test credentials are login: mytest2, pass: qwerty12)
This would be the way to set_cookies with GET & httr:
GET("http://smida.gov.ua/db/emitent/year/xml/showform/32153/125/templ",
set_cookies(`_SMIDA` = "7cf9ea4bfadb60bbd0950e2f8f4c279d",
`__utma` = "29983421.138599299.1413649536.1413649536.1413649536.1",
`__utmb` = "29983421.5.10.1413649536",
`__utmc` = "29983421",
`__utmt` = "1",
`__utmz` = "29983421.1413649536.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)"))
That worked for me - well, at least I think it did, as I cannot read the language. A table comes back with the same structure and no prompt to log in.
Unfortunately the captcha on login prevents the use of RSelenium (or other, similar, crawling packages), so you'll have to continue to manually grab those cookies (or use a utility to extract them from the session).
Finally, you probably want to seriously consider changing those credentials, now :-)
EDIT: @VadymB and I both found that the code didn't work until we restarted RStudio. Your mileage may vary.
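For completeness, a sketch of capturing the response and pulling the body back out, reusing content() from the question (add the remaining __utm* cookies as above; the encoding is a guess):
library(httr)
resp <- GET("http://smida.gov.ua/db/emitent/year/xml/showform/32153/125/templ",
            set_cookies(`_SMIDA` = "7cf9ea4bfadb60bbd0950e2f8f4c279d"))
content(resp, as = "text", encoding = "UTF-8")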

Extracting html tables from website

I am trying to use the XML and RCurl packages to read some HTML tables from the following URL:
http://www.nse-india.com/marketinfo/equities/cmquote.jsp?key=SBINEQN&symbol=SBIN&flag=0&series=EQ#
Here is the code I am using:
library(RCurl)
library(XML)
options(RCurlOptions = list(useragent = "R"))
url <- "http://www.nse-india.com/marketinfo/equities/cmquote.jsp?key=SBINEQN&symbol=SBIN&flag=0&series=EQ#"
wp <- getURLContent(url)
doc <- htmlParse(wp, asText = TRUE)
docName(doc) <- url
tmp <- readHTMLTable(doc)
## Required tables
tmp[[13]]
tmp[[14]]
If you look at the tables, it has not been able to parse the values from the webpage.
I guess this is due to some JavaScript evaluation happening on the fly.
Now, if I use the "Save page as" option in Google Chrome (it does not work in Mozilla),
save the page, and then use the above code, I am able to read in the values.
But is there a workaround so that I can read the table on the fly?
It would be great if you can help.
Regards,
It looks like they're building the page using JavaScript by accessing http://www.nse-india.com/marketinfo/equities/ajaxGetQuote.jsp?symbol=SBIN&series=EQ and parsing out some string. Maybe you could grab that data and parse it instead of scraping the page itself.
It looks like you'll have to build a request with the proper referrer headers using cURL, though. As you can see, you can't just hit that ajaxGetQuote page with a bare request.
You can probably read the appropriate headers to put in by using the Web Inspector in Chrome or Safari, or by using Firebug in Firefox.
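A sketch of such a request with RCurl, reusing the quote page as the Referer (the exact headers the site requires are a guess; inspect them in the browser tools mentioned above to confirm):
library(RCurl)
ajax_url <- "http://www.nse-india.com/marketinfo/equities/ajaxGetQuote.jsp?symbol=SBIN&series=EQ"
quote_txt <- getURLContent(ajax_url,
                           httpheader = c(
                             Referer = "http://www.nse-india.com/marketinfo/equities/cmquote.jsp?key=SBINEQN&symbol=SBIN&flag=0&series=EQ",
                             `User-Agent` = "R"))
quote_txt   # a string to parse yourself, not an HTML table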
