I am trying to download files from the NSE India website (nseindia.com). The problem is that webmaster does not like scraping programs downloading files or reading pages from the website. They have a user agent based restriction it seems.
The file I am trying to download is http://www.nseindia.com/archives/equities/bhavcopy/pr/PR280815.zip
I am able to download this from the linux shell using
curl -v -A "Mozilla" http://www.nseindia.com/archives/equities/bhavcopy/pr/PR280815.zip
The output is this
About to connect() to www.nseindia.com port 80 (#0)
* Trying 115.112.4.12... % Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- --:--:--
--:--:-- 0connected
GET /archives/equities/bhavcopy/pr/PR280815.zip HTTP/1.1
User-Agent: Mozilla
Host: www.nseindia.com
Accept: /
< HTTP/1.1 200 OK < Server: Oracle-iPlanet-Web-Server/7.0 < Content-Length: 374691 < X-frame-options: SAMEORIGIN < Last-Modified:
Fri, 28 Aug 2015 12:20:02 GMT < ETag: "5b7a3-55e051f2" <
Accept-Ranges: bytes < Content-Type: application/zip < Date: Sat, 29
Aug 2015 17:56:05 GMT < Connection: keep-alive < { [data not shown] PK
5 365k 5 19977 0 0 34013 0 0:00:11 --:--:-- 0:00:11
56592
This allows me to the download the file.
The code I am using in R Curl is this
library("RCurl")
jurl <- "http://www.nseindia.com/archives/equities/bhavcopy/pr/PR280815.zip"
juseragent <- "Mozilla"
myOpts = curlOptions(verbose = TRUE, header = TRUE, useragent = juseragent)
jfile <- getURL(jurl,.opts=myOpts)
This, too, does not work.
I have also unsuccessfully tried using download.file from the base library with the user agent changed.
Any help will be appreciated.
library(curl) # this is not RCurl, you need to download curl
to download file in the working directory
curl_download("http://www.nseindia.com/archives/equities/bhavcopy/pr/PR280815.zip","tt.zip",handle = new_handle("useragent" = "my_user_agent"))
First, your problem is not setting the user agent, but downloading binary data. This works:
jfile <- getURLContent(jurl, .opts=myOpts, binary=TRUE)
Here is a (more) complete example using httr instead of RCurl.
library(httr)
url <- "http://www.nseindia.com/archives/equities/bhavcopy/pr/PR280815.zip"
response <- GET(url, user_agent("Mozilla"))
response$status # 200 OK
# [1] 200
tf <- tempfile()
writeBin(content(response, "raw"), tf) # write response content (the zip file) to a temporary file
files <- unzip(tf, exdir=tempdir()) # unzips to system temp directory and returns a vector of file names
df.lst <- lapply(files[grepl("\\.csv$",files)],read.csv) # convert .csv files to list of data.frames
head(df.lst[[2]])
# SYMBOL SERIES SECURITY HIGH.LOW INDEX.FLAG
# 1 AGRODUTCH EQ AGRO DUTCH INDUSTRIES LTD H NA
# 2 ALLSEC EQ ALLSEC TECHNOLOGIES LTD H NA
# 3 ALPA BE ALPA LABORATORIES LTD H NA
# 4 AMTL EQ ADV METERING TECH LTD H NA
# 5 ANIKINDS BE ANIK INDUSTRIES LTD H NA
# 6 ARSHIYA EQ ARSHIYA LIMITED H NA
Related
When trying to put to WebHDFS in order to create a file and write to it (using the following link: https://hadoop.apache.org/docs/r1.0.4/webhdfs.html#CREATE) I run into issues using httr.
Using RCurl or RWebHDFS is not possible because the target Hadoop cluster is secure.
Here is the code I have attempted to use:
library(httr)
r <- PUT("https://hadoopmgr1p.global.ad:14000/webhdfs/v1/user/testuser/temp/loadfile_testuser_2019-11-28_15_28_41411?op=CREATE&permission=755&user.name=testuser",
authenticate(":", "", type = "gssnegotiate"),
verbose())
testuser is a super user with permissions to R/W. I get the following error:
<- HTTP/1.1 400 Data upload requests must have content-type set to 'application/octet-stream'
<- Date: Fri, 29 Nov 2019 15:42:30 GMT
<- Date: Fri, 29 Nov 2019 15:42:30 GMT
<- Pragma: no-cache
<- X-Content-Type-Options: nosniff
<- X-XSS-Protection: 1; mode=block
<- Content-Length: 0
The error is pretty explanatory, so I then attempt to PUT with a content-type:
r <- PUT("https://hadoopmgr1p.global.ad:14000/webhdfs/v1/user/testuser/temp/loadfile_testuser_2019-11-28_15_28_41411?op=CREATE&permission=755&user.name=testuser",
authenticate(":", "", type = "gssnegotiate"),
content_type("application/octet-stream"),
verbose())
I get a success - however it is not truly successful:
<- Date: Fri, 29 Nov 2019 16:04:52 GMT
<- Cache-Control: no-cache
<- Expires: Fri, 29 Nov 2019 16:04:52 GMT
<- Date: Fri, 29 Nov 2019 16:04:52 GMT
<- Pragma: no-cache
<- Content-Type: application/json;charset=utf-8
<- X-Content-Type-Options: nosniff
<- X-XSS-Protection: 1; mode=block
<- Content-Length: 0
There is no file that was uploaded. Uploading a file with that first request, gives me another error:
<- HTTP/1.1 307 Temporary Redirect
<- Date: Fri, 29 Nov 2019 16:07:24 GMT
<- Cache-Control: no-cache
<- Expires: Fri, 29 Nov 2019 16:07:24 GMT
<- Date: Fri, 29 Nov 2019 16:07:24 GMT
<- Pragma: no-cache
<- Content-Type: application/json;charset=utf-8
<- X-Content-Type-Options: nosniff
<- X-XSS-Protection: 1; mode=block
Error in curl::curl_fetch_memory(url, handle = handle) :
necessary data rewind wasn't possible
The code in question:
library(httr)
temp_file <- httr::upload_file(lfs_temp_file, type = "text/plain")
r <- PUT("https://hadoopmgr1p.global.ad:14000/webhdfs/v1/user/testuser/temp/loadfile_testuser_2019-11-28_15_28_41411?op=CREATE&permission=755&user.name=testuser",
authenticate(":", "", type = "gssnegotiate"),
body=temp_file,
content_type("application/octet-stream"),
verbose())
Attempting the same command using curl works without issue:
curl -i -k -X PUT --negotiate -u : "https://hadoopmgr1p.global.ad:14000/webhdfs/v1/user/testuser/temp/loadfile_testuser_2019-11-28_15_28_4141?op=CREATE&permission=755&user.name=testuser"
This results in the following:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0HTTP/1.1 307 Temporary Redirect
Date: Thu, 28 Nov 2019 23:27:16 GMT
Cache-Control: no-cache
Expires: Thu, 28 Nov 2019 23:27:16 GMT
Date: Thu, 28 Nov 2019 23:27:16 GMT
Pragma: no-cache
Content-Type: application/json;charset=utf-8
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
WWW-Authenticate: Negotiate <stuff>/
Set-Cookie: hadoop.auth="<stuff>"; Path=/; Secure; HttpOnly
Location: https://hadoopmgr1p.global.ad:14000/webhdfs/v1/user/testuser/temp/loadfile_testuser_2019-11-28_15_28_4141?op=CREATE&data=true&user.name=testuser&permission=755
Content-Length: 0
Following the Location header lets us create the file successfully.
What am I doing wrong?
Thanks
Good work, including the curl output. I believe that answers it.
Your curl command uses PUT, and your httr command uses POST. Try https://www.rdocumentation.org/packages/httr/versions/1.4.1/topics/PUT .
Hint for future reference: POST commands are not typically used if you're specifying an exact location. That's what PUT is for.
httr is attempting to follow the redirect, and failing. To fix the issue, tell httr to stop following the location config(followlocation = 0L).
The PUT command will be as follows:
r <- PUT("https://hadoopmgr1p.global.ad:14000/webhdfs/v1/user/testuser/temp/
loadfile_testuser_2019-11-28_15_28_41411?op=CREATE&permission=755&user.name=testuser",
authenticate(":", "", type = "gssnegotiate"),
body=NULL,
config(followlocation = 0L),
verbose())
This should return a valid reponse with a Location header.
I called an API from R using getURL that returns a JSON response.
When I check with typeof in R, it gives me [1] "character".
I am trying to have my data in JSON format as it should be, to be able to convert it to a DataTable. What could be the reason that it is a character list and how do I fix it?
This is what I am getting in the data returned from the API:
[1] "HTTP/1.1 200 OK\r\nDate: Thu, 04 Jan 2018 20:38:50 GMT\r\nContent-Type: application/json; charset=utf-8\r\nTransfer-Encoding: chunked\r\nConnection: keep-alive\r\nSet-Cookie: __cfduid=d6bbf45645c3bd5332f83d25d06d8b8ca1515098329; expires=Fri, 04-Jan-19 20:38:49 GMT; path=/; domain=.onesignal.com; HttpOnly\r\nStatus: 200 OK\r\nCache-Control: public, max-age=7200\r\nAccess-Control-Allow-Origin: *\r\nX-XSS-Protection: 1; mode=block\r\nX-Request-Id: bd2552de-bf7d-4a0c-94d6-ff1b6856002a\r\nAccess-Control-Allow-Headers: SDK-Version\r\nETag: W/\"47580e0a23e806945b01f1237219175c\"\r\nX-Frame-Options: SAMEORIGIN\r\nX-Runtime: 0.112902\r\nX-Content-Type-Options: nosniff\r\nX-Powered-By: Phusion Passenger 5.1.4\r\nCF-Cache-Status: REVALIDATED\r\nExpires: Thu, 04 Jan 2018 22:38:50 GMT\r\nServer: cloudflare-nginx\r\nCF-RAY: 3d8100f109c6a23f-ICN\r\n\r\n{\"total_count\":2057,\"offset\":0,\"limit\":50,\"notifications\":[{\"adm_big_picture\":\"\",\"adm_group\":\"\",\"adm_group_message\":{\"en\":\"\... <truncated>
If I try to use fromJSON function with this data,
I get:
Error in file(con, "r") : cannot open the connection
jsonlite::fromJSON works great for parsing JSON. Your problem is that you have a bunch of stuff in front of your JSON. (Maybe after too, can't tell...)
I think the JSON starts at the first {, so we'll remove everything before that. Calling your data x:
x = sub('^[^\\{]*\\{', '{', x)
jsonlite::fromJSON(x)
Type the unescaped version of the patter into the Regex101 tool for an explanation. (Unescaped version uses single not double backslashes: ^[^\{]*\{ . In R strings we need to double the backslashes.)
Here's a working example based on your data:
x = 'HTTP/1.1 200 OK
Date: Thu, 04 Jan 2018 20:38:50 GMT
Content-Type: application/json; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Set-Cookie: __cfduid=d6bbf45645c3bd5332f83d25d06d8b8ca1515098329; expires=Fri, 04-Jan-19 20:38:49 GMT; path=/; domain=.onesignal.com; HttpOnly
Status: 200 OK
Cache-Control: public, max-age=7200
Access-Control-Allow-Origin: *
X-XSS-Protection: 1; mode=block
X-Request-Id: bd2552de-bf7d-4a0c-94d6-ff1b6856002a
Access-Control-Allow-Headers: SDK-Version
ETag: W/\"47580e0a23e806945b01f1237219175c\"
X-Frame-Options: SAMEORIGIN
X-Runtime: 0.112902
X-Content-Type-Options: nosniff
X-Powered-By: Phusion Passenger 5.1.4
CF-Cache-Status: REVALIDATED\r\nExpires: Thu, 04 Jan 2018 22:38:50 GMT
Server: cloudflare-nginx
CF-RAY: 3d8100f109c6a23f-ICN
{\"total_count\":2057,\"offset\":0,\"limit\":50,\"notifications\":[{\"adm_big_picture\":\"\",\"adm_group\":\"\"}]}'
y = gsub('^[^\\{]*\\{', '{', x)
jsonlite::fromJSON(sub('^(^\\{)*\\{', '{', y))
# $total_count
# [1] 2057
#
# $offset
# [1] 0
#
# $limit
# [1] 50
#
# $notifications
# adm_big_picture adm_group
# 1
You can use the rjson package to transform your input into a json. Using simplifyDataFrame parameter fromJSON should output a dataframe object.
Importing data from a JSON file into R
[edit]
Your data is returning with some header, you can overcome it removing it from the string and passing to fromJSON
library(stringr)
library(rjson)
json <- str_sub(str_extract(data, "ICN\\r\\n\\r\\n.*"), 8)
df <- as.data.frame(fromJSON(json))
> head(df)
total_count
1 2057
I am having trouble trying to retrieve the gzip'd content of the following URL:
https://www.lendingclub.com/browse/browseNotesAj.action?method=getResultsInitial&startindex=0&pagesize=1
I can see that the content is encoded using gzip by looking at the response headers:
HTTP/1.1 200 OK
Content-Encoding: gzip
I have tried RCurl using getURL as well as this post with no luck. Can someone help me try to get the content into a variable (hopefully without requiring writing and reading from file)?
Or in httr
library(httr)
library(jsonlite)
out <- GET("https://www.lendingclub.com/browse/browseNotesAj.action?method=getResultsInitial&startindex=0&pagesize=1")
jsonlite::fromJSON(content(out, "text"))
$result
[1] "success"
$searchresult
$searchresult$loans
loanGrade purpose loanAmtRemaining loanUnfundedAmount noFee primeTotalInvestment title
1 C5 debt_consolidation 25 25 0 0 Debt consolidation
isInCurrentOrder alreadySelected primeFractions fico wholeLoanTimeRemaining loanType primeUnfundedAmount
1 FALSE FALSE 0 720-724 -69999 Personal 0
hasCosigner amountToInvest loan_status alreadyInvestedIn loanLength searchrank loanRateDiff loanGUID
1 FALSE 0 INFUNDING FALSE 36 1 .00 35783459
isWholeLoan loanAmt loanAmountRequested primeMarkedInvestment loanRate loanTimeRemaining
1 0 7650 7650 0 14.99 1199721001
$searchresult$totalRecords
[1] 1472
Turns out RCurl handles gzip encoding:
getURL('https://www.lendingclub.com/browse/browseNotesAj.action?method=getResultsInitial&startindex=0&pagesize=1',
encoding="gzip")
I've been using httr to export data from REDCap databases into R for a few months now. We recently upgraded our R Studio Server to the most recent version (v0.98.1049) and upgraded to R 3.1.1 at the same time. After that upgrade, my httr::POST calls stopped working, sometimes. The error I keep getting is
Error in function (type, msg, asError = TRUE) :
GnuTLS recv error (-9): A TLS packet with unexpected length was received.
At first I thought it might be an SSL issue, but the error only occurs in certain databases, and in those databases, I can still download the data using RCurl. That is, this code will work
RCurl::postForm(uri=[URL],
.params=list(token=[TOKEN],
content='record',
format='csv'))
But this code will not
httr::POST(url=[URL],
body=list(token=[TOKEN],
content='record',
format='csv'))
Adding further to my confusion, even though I can't export data in projects where this error occurs, I can import data.
I'm out of ideas on where to start. I'd much appreciate any ideas on what might be going wrong here.
(I'd like to provide a reproducible example, but I'm afraid I'm working with healthcare data. sorry)
As requested, here's the verbose() output. It's a slightly different call, but produces the same error. (I used a call that wouldn't risk exposing confidential information)
> httr::POST(url=whi$url,
+ body=list(token=whi$token,
+ content='metadata',
+ format='csv'),
+ httr::verbose(data_in=TRUE, info=TRUE))
* Hostname was found in DNS cache
* Hostname in DNS cache was stale, zapped
* Trying 172.26.30.4...
* Connected to [URL] (172.26.30.4) port 443 (#7)
* found 153 certificates in /home/nutterb/R/x86_64-unknown-linux-gnu-library/3.1/httr/cacert.pem
* SSL re-using session ID
* server certificate verification OK
* common name: [URL] (matched)
* server certificate expiration date OK
* server certificate activation date OK
* certificate public key: RSA
* certificate version: #3
* subject: OU=Domain Control Validated,CN=[URL]
* start date: Thu, 10 Apr 2014 17:06:17 GMT
*
* expire date: Sat, 21 Mar 2015 16:35:07 GMT
*
* issuer: C=US,ST=Arizona,L=Scottsdale,O=Starfield Technologies\, Inc.,OU=http://certs.starfieldtech.com/repository/,CN=Starfield Secure Certificate Authority - G2
* compression: NULL
* cipher: AES-128-CBC
* MAC: SHA1
-> POST /redcap/api/ HTTP/1.1
-> User-Agent: curl/7.35.0 Rcurl/1.95.4.1 httr/0.5.0.9000
-> Host: [URL]
-> Accept-Encoding: gzip
-> accept: application/json, text/xml, */*
-> Content-Length: 374
-> Expect: 100-continue
-> Content-Type: multipart/form-data; boundary=------------------------05c968969cc362a9
->
<- HTTP/1.1 100 Continue
>> --------------------------05c968969cc362a9
>> Content-Disposition: form-data; name="token"
>>
>> [TOKEN]
>> --------------------------05c968969cc362a9
>> Content-Disposition: form-data; name="content"
>>
>> metadata
>> --------------------------05c968969cc362a9
>> Content-Disposition: form-data; name="format"
>>
>> csv
>> --------------------------05c968969cc362a9--
<- HTTP/1.1 200 OK
<- Date: Sat, 06 Sep 2014 09:55:43 GMT
<- Expires: 0
<- cache-control: no-store, no-cache, must-revalidate
<- Pragma: no-cache
<- Access-Control-Allow-Origin: *
<- Vary: Accept-Encoding
<- Content-Type: text/html; charset=utf-8
<- Connection: close
<- Content-Encoding: gzip
<-
<< NA
<< NA
<< NA
<< NA
<< NA
<< NA
<< NA
<< NA
<< NA
<< NA
<< NA
* GnuTLS recv error (-9): A TLS packet with unexpected length was received.
* Closing connection 7
Error in function (type, msg, asError = TRUE) :
GnuTLS recv error (-9): A TLS packet with unexpected length was received.
In addition: There were 11 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In strsplit(x, "\n", fixed = TRUE) : input string 1 is invalid in this locale
2: In strsplit(x, "\n", fixed = TRUE) : input string 1 is invalid in this locale
3: In strsplit(x, "\n", fixed = TRUE) : input string 1 is invalid in this locale
4: In strsplit(x, "\n", fixed = TRUE) : input string 1 is invalid in this locale
5: In strsplit(x, "\n", fixed = TRUE) : input string 1 is invalid in this locale
6: In strsplit(x, "\n", fixed = TRUE) : input string 1 is invalid in this locale
7: In strsplit(x, "\n", fixed = TRUE) : input string 1 is invalid in this locale
8: In strsplit(x, "\n", fixed = TRUE) : input string 1 is invalid in this locale
9: In strsplit(x, "\n", fixed = TRUE) : input string 1 is invalid in this locale
10: In strsplit(x, "\n", fixed = TRUE) : input string 1 is invalid in this locale
11: In strsplit(x, "\n", fixed = TRUE) : input string 1 is invalid in this locale
>
I'm trying to POST data to JIRA Project using R and I keep getting: Error Bad Request. At first I thought it must be the JSON format that I created. So I wrote the JSON to file and did a curl command from console (see below) and the POST worked just fine.
curl -D- -u fred:fred -X POST -d #sample.json -H "Content-Type: application/json" http://localhost:8090/rest/api/2/issue/
Which brings the issue to my R code. Can someone tell me what am I doing wrong with the RCurl postForm?
Source:
library(RJSONIO)
library(RCurl)
x <- list (
fields = list(
project = c(
c(key="TEST")
),
summary="The quick brown fox jumped over the lazy dog",
description = "silly old billy",
issuetype = c(name="Task")
)
)
curl.opts <- list(
userpwd = "fred:fred",
verbose = TRUE,
httpheader = c('Content-Type' = 'application/json',Accept = 'application/json'),
useragent = "RCurl"
)
postForm("http://jirahost:8080/jira/rest/api/2/issue/",
.params= c(data=toJSON(x)),
.opts = curl.opts,
style="POST"
)
rm(list=ls())
gc()
Here's the output of the response:
* About to connect() to jirahost port 80 (#0)
* Trying 10.102.42.58... * connected
* Connected to jirahost (10.102.42.58) port 80 (#0)
> POST /jira/rest/api/2/issue/ HTTP/1.1
User-Agent: RCurl
Host: jirahost
Content-Type: application/json
Accept: application/json
Content-Length: 337
< HTTP/1.1 400 Bad Request
< Date: Mon, 07 Apr 2014 19:44:08 GMT
< Server: Apache-Coyote/1.1
< X-AREQUESTID: 764x1525x1
< X-AUSERNAME: anonymous
< Cache-Control: no-cache, no-store, no-transform
< Content-Type: application/json;charset=UTF-8
< Set-Cookie: atlassian.xsrf.token=B2LW-L6Q7-15BO- MTQ3|bcf6e0a9786f879a7b8df47c8b41a916ab51da0a|lout; Path=/jira
< Connection: close
< Transfer-Encoding: chunked
<
* Closing connection #0
Error: Bad Request
You might find it easier to use httr which has been constructed with
the needs of modern APIs in mind, and tends to set better default
options. The equivalent httr code would be:
library(httr)
x <- list(
fields = list(
project = c(key = "TEST"),
summary = "The quick brown fox jumped over the lazy dog",
description = "silly old billy",
issuetype = c(name = "Task")
)
)
POST("http://jirahost:8080/jira/rest/api/2/issue/",
body = RJSONIO::toJSON(x),
authenticate("fred", "fred", "basic"),
add_headers("Content-Type" = "application/json"),
verbose()
)
If that doesn't work, you'll need to supply the output from a successful
verbose curl on the console, and a failed httr call in R.