What is the maximum length of a URL in R?

I am constructing a URI in R, generated on the fly, with ~40,000 characters.
I tried using
RCurl
jsonlite
curl
All three give a "bad URL" error when connecting through an HTTP GET request. I am refraining from using httr, as it installs 5 additional dependencies and I want minimal dependencies in my R program. I am also unsure whether even httr could handle that many characters in a URL.
Is there a way that I can encode/pack it to an allowed limit, or a better approach/package that can handle URLs of any length, similar to Python's urllib?
Thanks in advance.

This is not a limitation of RCurl.
Let's make a long URL and try it:
> s = paste0(rep(letters,2000),collapse="")
> nchar(s)
[1] 52000
That's 52,000 characters of a-z. Stick it on a URL:
> url = paste0("http://www.omegahat.net/RCurl/", s)
> nchar(url)
[1] 52030
> substr(url, 1, 40)
[1] "http://www.omegahat.net/RCurl/abcdefghij"
Now try and get it:
> library(RCurl)
> txt = getURL(url)
> txt
[1] "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML 2.0//EN\">\n<html><head>\n<title>414 Request-URI Too Large</title>\n</head><body>\n<h1>Request-URI Too Large</h1>\n<p>The requested URL's length exceeds the capacity\nlimit for this server.<br />\n</p>\n</body></html>\n"
>
That's the correct response from the server: it decided the URL was too long, returned a 414 error, and this proves RCurl can send requests with URLs of over 40,000 characters.
Until we know more, I can only presume the "bad URL" message is coming from the server, about which we know nothing.
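If the server you are talking to enforces a shorter limit, the usual workaround is to send the data in a POST body instead of the URL. A minimal sketch with RCurl (assuming the endpoint also accepts POST; the URL is just the example host from above):
library(RCurl)
# Build the same ~52,000-character payload as above
long_query <- paste0(rep(letters, 2000), collapse = "")
# Setting a POST body (CURLOPT_POSTFIELDS) makes libcurl issue a POST,
# so the payload travels in the request body rather than the URL
txt <- getURL("http://www.omegahat.net/RCurl/", postfields = long_query)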

Related

R problems with encoding after reading from MS SQL via ODBC

We have a database and R scripts running on different virtual machines.
First we connect to the database:
library(DBI)
library(odbc)
con <- dbConnect(
  odbc(),
  Driver = "SQL Server",
  Server = "server", Database = "db", UID = "uid", PWD = "pwd",
  encoding = "UTF-8"
)
and gather the data:
data <- dbGetQuery(con, "SELECT * FROM TableName")
The problem is the following: when different machines run the same script, on some of them we face encoding issues in character variables.
For example, this is what we have on machine A
> data$char_var[1]
[1] "фамилия"
> Encoding(data$char_var[1])
[1] "UTF-8"
> Sys.getlocale()
[1] "LC_COLLATE=Russian_Russia.1251;LC_CTYPE=Russian_Russia.1251;LC_MONETARY=Russian_Russia.1251;LC_NUMERIC=C;LC_TIME=Russian_Russia.1251"
> Encoding(data$char_var[1]) <- "1251"
> data$char_var[1]
[1] "гревцев"
and this is what we have on machine B
> data$char_var[1]
[1] "<e3><f0><e5><e2><f6><e5><e2>"
> Encoding(data$char_var[1])
[1] "UTF-8"
> Sys.getlocale()
[1] "LC_COLLATE=Russian_Russia.1251;LC_CTYPE=Russian_Russia.1251;LC_MONETARY=Russian_Russia.1251;LC_NUMERIC=C;LC_TIME=Russian_Russia.1251"
> Encoding(data$char_var[1]) <- "1251"
> data$char_var[1]
[1] "фамилия"
On machine A the initial value prints correctly, but re-declaring the encoding turns it into gibberish. The same code on machine B initially prints raw bytes and only shows readable text after the encoding is re-declared. What could be the reason for such a difference?
As a result, we want a script that produces the same "фамилия" output value on both machines, to show it on a dashboard.
According to the result of your call to Encoding(data$char_var[1]), both machines are declaring the returned results to be encoded using UTF-8.
On the first machine, that appears to be true, because you're seeing valid output. Then you mess it up by declaring the encoding incorrectly as "1251", and you see gibberish.
On the second machine, the result comes to you declared as UTF-8, but it isn't (which is why it looks like gibberish to start). When you change the declared encoding to be "1251" it looks okay, so that's what it must have been all along.
So you have two choices:
Make sure both machines are consistent about what they return from the dbGetQuery. You can handle either encoding, but you need to know what it is and make sure it's declared correctly.
Alternatively, try to detect what is being returned, and declare it appropriately. One way to do this might be to put a known string in the database and compare the result to that. If you know you should get "фамилия" and you get something else, switch the declared encoding. You could also try the readr::guess_encoding() function.
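For example, a small sketch of the detection idea (readr::guess_encoding() accepts a raw vector, so charToRaw() lets you test an in-memory string):
library(readr)
# Guess the encoding of the raw bytes behind the first returned value
guess_encoding(charToRaw(data$char_var[1]))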
One other issue is that some downstream function might only be able to handle one of the UTF-8 and 1251 encodings. (Windows R is really bad at non-native encodings, and UTF-8 is never native on Windows.) In that case you may want to actually convert to a common encoding, using the iconv() function, e.g.
iconv(char_var, from = "cp1251", to = "UTF-8")
will try to convert to UTF-8.
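To make the declared-versus-actual distinction concrete, here is a small illustration built from the bytes machine B printed above (<e3><f0><e5><e2><f6><e5><e2> is cp1251-encoded text):
# Rebuild the cp1251 bytes shown on machine B
x <- rawToChar(as.raw(c(0xe3, 0xf0, 0xe5, 0xe2, 0xf6, 0xe5, 0xe2)))
Encoding(x)                               # "unknown": nothing has been declared
iconv(x, from = "cp1251", to = "UTF-8")   # actually converts the bytes: "гревцев"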

Get the number of redirects from a URL in R

I have to extract a feature, the number of redirects, from the URLs in my dataframe. Is there a way to find this number in R, like there is in Python:
import requests

r = requests.get(url)
i = 0
for h in r.history:
    i = i + 1
print(i)
The return value from httr::GET is not documented in much detail, but the headers etc. from redirects appear in the $all_headers element:
> url = "http://github.com"
> g = httr::GET(url)
> length(g$all_headers)
[1] 2
because http redirects to https. If you go straight to https you don't see a redirect:
> url = "https://github.com"
> g = httr::GET(url)
> length(g$all_headers)
[1] 1
The return value of httr::GET is an httr::response object, whose core documentation is at ?httr::response. You can examine the whole object with str() to see the parts that aren't salient to most R users; it has been documented for a long time.
Since what you want is the count of redirects, you might actually care about real redirects rather than a naive count of all the response headers, e.g.
res <- httr::GET("http://1.usa.gov/1J6GNoW")
sum(sapply(res$all_headers, `[[`, "status") %/% 100 == 3)  # count only the 3xx responses
That's 3 (and may not be exactly what you want either).
length(res$all_headers)
is 4, and I doubt you should be including 4xx responses in the redirect count; but you could be clearer in your question about whether you want just the number of 3xx's or the total length of the HTTP chain.
You might also want to consider:
cat(rawToChar(curl::curl_fetch_memory("http://1.usa.gov/1J6GNoW")$headers))
and count the actual redirects from that output (depending on what the actual "mission" is).
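Another option is to ask libcurl for its own redirect counter (CURLINFO_REDIRECT_COUNT). A sketch with RCurl, assuming you reuse a handle so the info can be read back afterwards:
library(RCurl)
h <- getCurlHandle()
# Perform the request on a persistent handle, following redirects
invisible(getURL("http://1.usa.gov/1J6GNoW", curl = h, followlocation = TRUE))
# libcurl's own count of redirects followed for the last request
getCurlInfo(h)$redirect.count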

Why does url.exists return FALSE when the URL does exist, using RCurl?

For example:
library(RCurl)
if(url.exists("http://www.google.com")) {
  # Two ways to submit a query to google. Searching for RCurl
  getURL("http://www.google.com/search?hl=en&lr=&ie=ISO-8859-1&q=RCurl&btnG=Search")
  # Here we let getForm do the hard work of combining the names and values.
  getForm("http://www.google.com/search", hl="en", lr="", ie="ISO-8859-1", q="RCurl", btnG="Search")
  # And here if we already have the parameters as a list/vector.
  getForm("http://www.google.com/search", .params = c(hl="en", lr="", ie="ISO-8859-1", q="RCurl", btnG="Search"))
}
This is an example from RCurl package manual. However, it does not work:
> url.exists("http://www.google.com")
[1] FALSE
I found there is an answer to this here: Rcurl: url.exists returns false when url does exists. It said this is because the default user agent is not useful. But I do not understand what a user agent is or how to set one.
Also, this error happened when I was working at my company. I tried the same code at home and it worked fine, so I am guessing this is because of a proxy, or some other reason I have not realized.
I need to use RCurl to send my queries to Google and then extract information such as titles and descriptions from the results. In this case, how do I use a user agent? Or can the httr package do this?
Thanks a lot for the help, guys. I think I just figured out how to do it. The important thing is the proxy. If I use:
> opts <- list(
    proxy = "http://*******",
    proxyusername = "*****",
    proxypassword = "*****",
    proxyport = 8080
  )
> url.exists("http://www.google.com", .opts = opts)
[1] TRUE
Then all done! You can find your proxy settings under System --> Proxy if you use Windows 10. At the same time:
> site <- getForm("http://www.google.com.au", hl="en",
                  lr="", q="r-project", btnG="Search", .opts = opts)
> htmlTreeParse(site)
$file
[1] "<buffer>"
.........
In getForm, opts needs to be passed in as well. Two other posts here (RCurl default proxy settings and Proxy setting for R) answer the same question. I have not yet tried extracting information from the results.
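For the user-agent part of the question: the fix in the linked answer amounts to supplying a User-Agent header, since some servers reject requests that arrive without one. A minimal sketch (the agent string is just an example, not a required value):
library(RCurl)
# Supply an explicit User-Agent header instead of RCurl's default (none)
url.exists("http://www.google.com",
           .opts = list(useragent = "Mozilla/5.0 (compatible; R RCurl)"))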

RCurl - Boolean Options

These Curl docs: http://curl.haxx.se/docs/manpage.html#-d list many boolean options.
How do I specify these options in a postForm call in RCurl? For example, how do I specify the --sslv3 flag?
I tried
postForm(url, .opts = list(sslv3=TRUE))
but received the error:
Warning message:
In mapCurlOptNames(names(.els), asNames = TRUE) :
Unrecognized CURL options: sslv3
Thanks in advance.
SOLUTION
Through some trial and error, I found that this works:
options(RCurlOptions = list(sslversion = 3))
postForm(url)
If anyone could clarify how to translate the curl options to the RCurl options, it would be appreciated!
The name "curl" refers to a few things (http://daniel.haxx.se/docs/curl-vs-libcurl.html). The problem here is that you are looking at what the curl command-line tool does, when you should instead ask how the libcurl library implements it.
RCurl uses the libcurl library, which is accessed via an API. The "symbols" used in the API are listed at http://curl.haxx.se/libcurl/c/symbols-in-versions.html. We can compare them to the options listed by RCurl:
library(RCurl)
# Scrape the symbol list and keep the CURLOPT_ entries
cInfo <- getURL("http://curl.haxx.se/libcurl/c/symbols-in-versions.html")
cInfo <- unlist(strsplit(cInfo, "\n"))
cInfo <- cInfo[grep("CURLOPT_", cInfo)]
# Keep the first whitespace-delimited token (the symbol name)
cInfo <- gsub("^(\\S+)\\s.*", "\\1", cInfo)
# Apply RCurl's mapping: drop the prefix, replace '_' with '.', lower-case
cInfo <- gsub("CURLOPT_", "", cInfo)
cInfo <- tolower(gsub("_", ".", cInfo))
listCurlOptions()[!listCurlOptions() %in% cInfo]
From the above we can see that all RCurl options are derived from libcurl API symbols: the CURLOPT_ prefix is removed, _ is replaced by ., and the letters are lower-cased.
The question then arises as to what types the symbols represent. I usually just look at the PHP library documentation to discover this. http://php.net/manual/en/function.curl-setopt.php lists
CURLOPT_SSLVERSION: The SSL version (2 or 3) to use. By default PHP will try to determine this itself, although in some cases this must be set manually.
as an integer type, expecting the value 2 or 3.
Alternatively you can look at the curl_easy_setopt manual page http://curl.haxx.se/libcurl/c/curl_easy_setopt.html.
CURLOPT_SSLVERSION
Pass a long as parameter to control what version of SSL/TLS to attempt to use. The available options are:
CURL_SSLVERSION_DEFAULT
The default action. This will attempt to figure out the remote SSL protocol version, i.e. either SSLv3 or TLSv1 (but not SSLv2, which became disabled by default with 7.18.1).
CURL_SSLVERSION_TLSv1
Force TLSv1
CURL_SSLVERSION_SSLv2
Force SSLv2
CURL_SSLVERSION_SSLv3
Force SSLv3
It says we would need to pass a long with value CURL_SSLVERSION_SSLv3 to stipulate sslv3.
What is the value of CURL_SSLVERSION_SSLv3? We can examine RCurl:::SSLVERSION_SSLv3 and its siblings:
> c(RCurl:::SSLVERSION_DEFAULT, RCurl:::SSLVERSION_TLSv1, RCurl:::SSLVERSION_SSLv2, RCurl:::SSLVERSION_SSLv3)
[1] 0 1 2 3
>
So in fact the permissible values for sslversion are 0, 1, 2 or 3.
The confusion in this case arose because the command-line curl program, which presumably uses the libcurl API, exposes this option as a boolean flag.
So the correct way to use this option is:
postForm(url, .opts = list(sslversion = 3))
or
postForm(url, .opts = list(sslv = 3))
You can use the shorter sslv because .opts is passed to mapCurlOptNames, which uses pmatch to resolve it to sslversion.
To be fair to the author of RCurl, this is all explained in http://www.omegahat.org/RCurl/philosophy.html (also located at /RCurl/inst/doc/philosophy.html). An excerpt reads:
Each of these and what it controls is described in the libcurl
man(ual) page for curl_easy_setopt and that is the authoritative
documentation. Anything we provide here is merely repetition or
additional explanation.
The names of the options require a slight explanation. These
correspond to symbolic names in the C code of libcurl. For example,
the option url in R corresponds to CURLOPT_URL in C. Firstly,
uppercase letters are annoying to type and read, so we have mapped
them to lower case letters in R. We have also removed the prefix
"CURLOPT_" since we know the context in which they option names are
being used. And lastly, any option names that have a _ (after we have
removed the CURLOPT_ prefix) are changed to replace the '_' with a '.'
so we can type them in R without having to quote them. For example,
combining these three rules, "CURLOPT_URL" becomes url and
CURLOPT_NETRC_FILE becomes netrc.file. That is the mapping scheme.
Try this (after reviewing the examples in ?curlOptions, to which ?postForm refers):
myOpts = curlOptions(sslversion = 3)
postForm(url, .opts = myOpts)
Although I admit I thought your code should work. You may also need to post your version numbers. There is also curlSetOpt, which might be more "assertive".

read.csv fails to read a CSV file from google docs

I wish to use read.csv to read a Google Docs spreadsheet.
I tried using the following code:
data_url <- "http://spreadsheets0.google.com/spreadsheet/pub?hl=en&hl=en&key=0AgMhDTVek_sDdGI2YzY2R1ZESDlmZS1VYUxvblQ0REE&single=true&gid=0&output=csv"
read.csv(data_url)
Which results in the following error:
Error in file(file, "rt") : cannot open the connection
I'm on Windows 7, and the code was tried on R 2.12 and 2.13.
I remember trying this a few months ago and it worked fine.
Any suggestions for what might be causing this or how to solve it?
Thanks.
It might have something to do with the fact that Google is reporting a 302 Moved Temporarily response.
> download.file(data_url, "~/foo.csv", method = "wget")
--2011-04-29 18:01:01-- http://spreadsheets0.google.com/spreadsheet/pub?hl=en&hl=en&key=0AgMhDTVek_sDdGI2YzY2R1ZESDlmZS1VYUxvblQ0REE&single=true&gid=0&output=csv
Resolving spreadsheets0.google.com... 74.125.230.132, 74.125.230.128, 74.125.230.130, ...
Connecting to spreadsheets0.google.com|74.125.230.132|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://spreadsheets0.google.com/spreadsheet/pub?hl=en&hl=en&key=0AgMhDTVek_sDdGI2YzY2R1ZESDlmZS1VYUxvblQ0REE&single=true&gid=0&output=csv [following]
--2011-04-29 18:01:01-- https://spreadsheets0.google.com/spreadsheet/pub?hl=en&hl=en&key=0AgMhDTVek_sDdGI2YzY2R1ZESDlmZS1VYUxvblQ0REE&single=true&gid=0&output=csv
Connecting to spreadsheets0.google.com|74.125.230.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: `/home/gavin/foo.csv'
[ <=> ] 41 --.-K/s in 0s
2011-04-29 18:01:02 (1.29 MB/s) - `/home/gavin/foo.csv' saved [41]
> read.csv("~/foo.csv")
  column1 column2
1       a       1
2       b       2
3      ds       3
4       d       4
5       f       5
6      ga       5
I'm not sure R's internal download code is capable of responding to such redirects:
> download.file(data_url, "~/foo.csv")
trying URL 'http://spreadsheets0.google.com/spreadsheet/pub?hl=en&hl=en&key=0AgMhDTVek_sDdGI2YzY2R1ZESDlmZS1VYUxvblQ0REE&single=true&gid=0&output=csv'
Error in download.file(data_url, "~/foo.csv") :
cannot open URL 'http://spreadsheets0.google.com/spreadsheet/pub?hl=en&hl=en&key=0AgMhDTVek_sDdGI2YzY2R1ZESDlmZS1VYUxvblQ0REE&single=true&gid=0&output=csv'
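As an aside for later readers (this postdates the question): R's libcurl download method does follow redirects, so on a modern R something like the following sketch should work without extra packages:
# method = "libcurl" follows the 302 redirect that trips up the default method
download.file(data_url, "~/foo.csv", method = "libcurl")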
I ran into the same problem and eventually found a solution in a forum thread. Using my own public CSV file:
library(RCurl)
tt = getForm("https://spreadsheets.google.com/spreadsheet/pub",
             hl = "en_US", key = "0Aonsf4v9iDjGdHRaWWRFbXdQN1ZvbGx0LWVCeVd0T1E",
             output = "csv",
             .opts = list(followlocation = TRUE, verbose = TRUE, ssl.verifypeer = FALSE))
holidays <- read.csv(textConnection(tt))
Check the solution on http://blog.forret.com/2011/07/google-docs-infamous-moved-temporarily-error-fixed/:
So what is the solution: just add "&ndplr=1" to your URL and you will skip the authentication redirect. I'm not sure what the NDPLR parameter name stands for, let's just call it: "Never Do Published Link Redirection".
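In R, a sketch of that quoted workaround (assuming the parameter still behaves as the blog describes):
# Append &ndplr=1 so the published-sheet URL skips the redirect,
# letting plain read.csv() work without any extra packages
data <- read.csv(paste0(data_url, "&ndplr=1"))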
