How to extract data from a file on an FTP server without downloading it all in R? - encoding error?

I am trying to get a large dataset (3+ GB) from the following server:
ftp://podaac-ftp.jpl.nasa.gov/allData/ghrsst/data/L4/GLOB/JPL/MUR
I know RCurl is a good package for getting data from FTP. The file is a netCDF file compressed with bz2; I need to uncompress it before I can read it into R using ncdf4.
Importantly, the file is larger than I want on my hard drive, so saving a copy locally is not an ideal option. How can I access data on the file without saving a copy to my disk first?
Here's my attempt so far:
library(RCurl); library(ncdf4)
d = getURL('ftp://podaac-ftp.jpl.nasa.gov/allData/ghrsst/data/L4/GLOB/JPL/MUR/2015/144/20150524-JPL-L4UHfnd-GLOB-v01-fv04-MUR.nc.bz2')
d = bzfile(d, open = 'r')
d = nc_open(d)
But I'm stuck at this cryptic error after the first line:
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
embedded nul in string: 'BZh91AY&SY¦ÁÀÉ\0033[ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿáåÏ\035\017)³îÎ\u009dÍØcn]sw7½ÎkÜÞõï=uÎׯv]ìçn\u009dÎn½îê·±Þìê÷wS­M\u008có·+ÎçW¹Ý=Ù×¹\u009cγ­ÜëÞs½ÛN¹²w;\u009buÍÝ]{·k^çuªnìº-³6«[+Üå;\033m»Û½ow:w¹ïo{uyîî\u00937¬\\Ƶl¶½\u009dÖVìç¯{ÎõïoSm]Ý×\u009eî\u008dæî®î®î\vÛÕïgW\036î®wqîÝ\\ïw«6½Þï\036Ýrë§=¬Fg·\\íåÔÙº÷gu·3\u009bKmÛ\027­Þ»\u0092îî\016îêwwm»\u009b­·s;MÞÁ½½­ÎóÍso^»q¯o;k\033iµ\u009bÛuyÝÞní5w:ï]ÓuÎo[«\033:åÞvEÜíÎç½ÝË­\u009eìQNöÔ\u008e\u0094vmÝȯg»e lÍ^\u008a©'
It seems to be an encoding issue based on other similar problems, but I tried both .encoding = 'UTF-8' and .encoding = 'ISO-8859-1' as shown in the getURL() documentation and neither works.
I've seen other answers to problems like this but they all seem to involve editing the source file. I don't have write access to this file, however. Any help?

I'd use httr for this
library("httr")
library("ncdf4")
url <- 'ftp://podaac-ftp.jpl.nasa.gov/allData/ghrsst/data/L4/GLOB/JPL/MUR/2015/144/20150524-JPL-L4UHfnd-GLOB-v01-fv04-MUR.nc.bz2'
res <- GET(url, write_disk(basename(url)))
# uncompress - I used OSX's default compression tool
nc_open(sub("\\.bz2", "", res$request$output$path))
The only step I didn't sort out programmatically is uncompressing the bz2 file; I just did that with OS X's default tool.
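If you want that step in R as well, the R.utils package (my addition, not part of the original answer) has bunzip2(), which should cover it; a minimal sketch:

library(httr)
library(ncdf4)
library(R.utils)  # assumed extra dependency, used only for decompression

url <- 'ftp://podaac-ftp.jpl.nasa.gov/allData/ghrsst/data/L4/GLOB/JPL/MUR/2015/144/20150524-JPL-L4UHfnd-GLOB-v01-fv04-MUR.nc.bz2'
res <- GET(url, write_disk(basename(url), overwrite = TRUE))

# decompress the .bz2 next to it, keeping the compressed copy
nc_path <- bunzip2(basename(url), remove = FALSE, overwrite = TRUE)
nc <- nc_open(as.character(nc_path))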

I don't know much at all about R, but you should be able to do this with curl in FTP mode by changing the output to stdout rather than a local filename and then using bzip2 to uncompress the file you want from its standard input.
So, for example, I can do this:
curl --output - --user user:password 'ftp://127.0.0.1/somefile.bz2' | bzip2 -d ...
Maybe you can start that from within R? Or make a fifo with:
mkfifo fifo
curl ....
and then read from the file called fifo in R.
Or, since R has a system() command, you could do:
system('mkfifo fifo; curl ..... | bzip2 -d .... > fifo &')
and then read from the file called fifo in R.
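A rough R translation of the stdout idea (a sketch, assuming curl and bunzip2 are on the PATH; note the decompressed .nc still lands on disk, it just skips storing the .bz2 copy):

library(ncdf4)

url <- 'ftp://podaac-ftp.jpl.nasa.gov/allData/ghrsst/data/L4/GLOB/JPL/MUR/2015/144/20150524-JPL-L4UHfnd-GLOB-v01-fv04-MUR.nc.bz2'

# stream the download through bunzip2 so only the uncompressed file is written
system(sprintf("curl --silent --output - '%s' | bunzip2 > MUR.nc", url))
nc <- nc_open('MUR.nc')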

Related

Read open excel file in R

Is there a way to read an open Excel file into R?
When an Excel file is open in Excel, Excel puts a lock on the file, such that the reading method in R cannot access the file.
Can you circumvent this lock?
Thanks
Edit: this occurs under Windows with the original Excel.
I too have no problem opening xlsx files that are already open in Excel, but if you do, I have a workaround that might work:
path_to_xlsx <- "C:/Some/Path/to/test.xlsx"
temp <- tempdir()
file.copy(path_to_xlsx, to = paste0(temp, "/test.xlsx"))
df <- openxlsx::read.xlsx(paste0(temp, "/test.xlsx"))
This copies the file (which should not be blocked) to a temporary directory and then loads it from there. Again, I'm not sure if this is needed, as I do not have the problem you have.
You could try something like this using the ps package. I've used it on Windows and Mac to read from files that I had downloaded from some web resource and opened in Excel with openxlsx2, but it should work with other packages or programs too.
# get the path to the open file via the ps package
library(ps)
p <- ps()
# get the pid for the current program, in my case Excel on Mac
ppid <- p$pid[grepl("Excel", p$name)]
# get the list of open files for that program
pfiles <- ps_open_files(ps_handle(ppid))
pfile <- pfiles[grepl(".xlsx", pfiles$path),]
# drop Excel's ~$ lock files and return the path to the real file
sel <- grepl("^(.|[^~].*)\\.xlsx", basename(pfile$path))
path <- pfile$path[sel]
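From there you can read the file with whatever reader you prefer; for example (openxlsx2::read_xlsx() is just one option here, not something the original answer prescribes):

df <- openxlsx2::read_xlsx(path)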
What do you mean by "the reading method in R", and by "cannot access the file" (i.e. what code are you using and what error message do you get exactly)? I'm successfully importing Excel files that are currently open, with something like:
dat <- readxl::read_excel("PATH/TO/FILE.xlsx")
If the file is being edited in Excel, R imports the last saved version.
EDIT: I've now tried it on both Linux and Windows and it still works, at least with version 1.3.1 of 'readxl'.

R - Cannot Download gz file from FTP Server

I have been trying for three days now to download a file from an FTP server with R without a result. I have really tried everything and read all questions but still cannot manage.
The url is:
u <- "ftp://user:password#109.2.160.55/AGLO/2020/10/AGLO_00001_03-0_GDBX_1000077_202010032206_860101.CSV.gz"
When I copy paste this link in Firefox I can download the file, but with R I cannot. I tried download.file, GET, writeBin, getURL. All failed: getURL gives the following error:
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
embedded nul in string: '\037‹\b\bøÙx_\0\003AGLO_00001_03-0_GDBX_1000077_202010032206_860101.CSV\0¬\\M³£¸’Ý¿_Áî­º\002I Xɶ®­*\fn>nÔ­MÇ›\231·èͼ\210î×\021óóç¤\004\006#¹.Œ§+¢ë’צŽ’TæÉ\017¡ÎUS*üï·\030ÿ±ßbñKüÛùtøþ\033#A–ýÆc\036ãgÁy,\177Ë%~f_ŽÝ{é~,é\v%}¡\\~ÐIN¿ÿùï?~ÿ\217¿þýÏ¿þ(\017±\034oY¾ýë¯?þû÷?ÿ$qù·Å/p\v÷\177{£òð]U½.umÊ¿\235»¦/{s~ƒ”\220–7ÓŸ¢Ã¿þø¯\177þã¯ÿ)ó¸ÈŠ\022_eEœã·îúz-K\031—í £¯ZÕÑW5´º+…H9/{Uéú¨K^BfNôsT©è]·­n\215ŽÊ2\021œ©²¦¯”e/æ»7e\fIy‹\231Ä_"ayF?ÈñÏýsåQUGüâ3ô¼H’d\t\177\024Hàgœ•ê=ºªópR„\235笼\002á¹VÇ\aðºRFy°y\b6Ç_xz¹d¯Àf<—I2Â.\030›\004\f°3\002­ZÓõ#\027\035Z£ê\023ÀDz(+\\7CwT}ÉJ\215¿»nè£þ¢\201\230Ô^>"§\033×ø3#çLäž¾éc›õ\035§‰`wàr\022\020pg-êøë »èÖjØCï\003çeý÷\037j\210êoM}n"ÝG׫ÆCÂ:£úï]S«è¢a_*°\034¹Z\016\036C\034XŽÜ¼œBð8YX\217»&ã‘\t=†›êz=´X…`yyÓ]g-G\177é¾Ü¾(ü颶9`\235Ñ­\031ÎXË\016\033*RÚwÏmèûgàešf|\001Þ]\023xžYËï/F·\235}\004p\tM{Òj\200·);ݾ\033\230ýE\003úY_u\215‡`j,´Û¾\0^°8}iëf<\023Kå\217\002,àP2a­Éß\006‚%åM\r¦ªì“¸Ý`jã%°“{ú±AùiÊ_s;ô_ºÄî\004Öí0\vý4ÜTÿá+ÿÚœL5þv‡¹0ž’ÃxÁ夒\025l\
There is no proxy problem whatsoever, since I found the u url by browsing the FTP directory itself.
How can I download this file?
Also, another way I could eventually work around this is using:
browseURL(u,
browser = "C:/Program Files (x86)/Mozilla Firefox/firefox.exe")
The issue with this is that it will open a Firefox browser that will:
Ask me if I am sure I want to go to this site, and then
Ask me what I want to do with this file (not so much of a problem since I can choose a default to always download, but still I do not want to).
In short, I do not want to open a browser and I do not want to be asked whether I want to go to the site or download the file. There are many files on this server, so I want to do all of this automatically, and I will need to be working in parallel, so having a browser pop up is not great; but if all else fails I can accept it.
I am so desperate that I can give you the user name and password in private.
Apparently downloading the file to disk using httr solved the problem. It is possible to combine write_disk and httr::GET to download files to disk in the following way:
library(httr)
to_download <- "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
# Download pdf to disk
GET(to_download, write_disk("dummy.pdf"))
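Applied to the .gz file from the question, the same pattern would look roughly like this (a sketch; the local file name and the use of gzfile() for reading are my additions):

library(httr)
# URL exactly as given in the question
u <- "ftp://user:password#109.2.160.55/AGLO/2020/10/AGLO_00001_03-0_GDBX_1000077_202010032206_860101.CSV.gz"
# download the compressed file to disk as raw bytes
GET(u, write_disk("AGLO.CSV.gz", overwrite = TRUE))
# read.csv() can read straight through a gzfile() connection
dat <- read.csv(gzfile("AGLO.CSV.gz"))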

Attempting to download files from SFTP using R

I'm trying to implement R in the workplace and save a bit of time from all the data churning we do.
A lot of files we receive are sent to us via SFTP as they contain sensitive information.
I've looked around on StackOverflow & Google but nothing seems to work for me. I tried using the RCurl library from an example I found online, but it doesn't allow me to include the port (22) as part of the login details.
library(RCurl)
protocol <- "sftp"
server <- "hostname"
userpwd <- "user:password"
tsfrFilename <- "Reports/Excelfile.xlsx"
ouptFilename <- "~/Test.xlsx"
url <- paste0(protocol, "://", server, tsfrFilename)
data <- getURL(url = url, userpwd=userpwd)
I end up getting the error code
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
embedded nul in string:
Any help would be greatly appreciated as this will save us loads of time!
Thanks,
Shan
Looks like a similar situation here: Using R to download SAS file from ftp-server
I'm no expert in R, but there it looks like getBinaryURL() worked instead of getURL() in the example given.
Hope that helps
M
Note that there are two packages, RCurl and curl. With RCurl, I successfully used keyfiles to connect via SFTP:
opts <- list(
  ssh.public.keyfile = pubkey,       # file name
  ssh.private.keyfile = privatekey,  # file name
  keypasswd = keypasswd              # optional password
)
RCurl::getURL(url=uri, .opts = opts, curl = RCurl::getCurlHandle())
For this to work, you need to create the keyfiles first, e.g. via PuTTY or similar.
I too was having problems specifying the port options when using the getURI() and getURL() functions.
In order to specify the port, you simply add the port as port = #### instead of port(####). For example:
data <- getURI(url = url,
               userpwd = userpwd,
               port = 22)
Now, as @MarkThomas pointed out, whenever you get an encoding error, try getBinaryURL() instead of getURI(). In most cases, this will also let you download SAS files as well as .csv files encoded in UTF-8 or LATIN1!
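Putting the two suggestions together for the original example (a sketch; the local file name and the readxl call are my assumptions, not part of the answers above):

library(RCurl)

url <- "sftp://hostname/Reports/Excelfile.xlsx"
userpwd <- "user:password"

# getBinaryURL() returns a raw vector, which sidesteps the embedded-nul error
bin <- getBinaryURL(url = url, userpwd = userpwd, port = 22)

# write the raw bytes to disk, then read the workbook as usual
writeBin(bin, "~/Test.xlsx")
df <- readxl::read_excel("~/Test.xlsx")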

download.file() fails when appending a random suffix to the filename

I'm trying to download a file in R on a remote server which sits behind a number of proxies. Something - I can't figure out what - is causing the file to be returned cached whenever I try and access it on that server, whether I do so through R or just through a Web Browser.
I've tried using cacheOK=FALSE in my download.file call and this has had no effect.
Per Is there a way to force browsers to refresh/download images? I have tried adding a random suffix to the end of the URL:
download.file(url = paste("http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_Factors_daily.zip?",
format(Sys.time(), "%d%m%Y"),sep=""),
destfile = "F-F_Research_Data_Factors_daily.zip", cacheOK=FALSE)
This produces, e.g., the following URL:
http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_Factors_daily.zip?17092012
Which when accessed from a Web Browser on the remote server, indeed returns the latest version of the file. However, when accessed using download.file in R, this returns a corrupted zip archive. Both WinRAR and R's unzip function complain that the zip file is corrupt.
unzip("F-F_Research_Data_Factors_daily.zip")
1: In unzip("F-F_Research_Data_Factors_daily.zip") :
internal error in unz code
I can't see why downloading this file via R would cause a corrupted file to be returned, whereas downloading it via a Web Browser gives no problem.
Can anyone suggest either a way to beat the cache from R (about which I'm not hopeful), or a reason why download.file doesn't like my URL with ?someRandomString tacked onto the end of it?
It will work if you use mode="wb". On Windows, download.file() only switches to binary mode automatically when the URL ends in a known binary extension such as .zip; the random ?17092012 suffix hides that extension, so the file is written in text mode and the archive gets corrupted. Setting mode explicitly avoids this:
download.file(url = paste("http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_Factors_daily.zip?",format(Sys.time(),"%d%m%Y"),sep=""),
destfile = "F-F_Research_Data_Factors_daily.zip", mode='wb', cacheOK=FALSE)
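A quick sanity check after downloading (my addition, just to confirm the archive is no longer corrupt):

# list the archive contents without extracting; an error here means a bad download
unzip("F-F_Research_Data_Factors_daily.zip", list = TRUE)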

Connect to the Twitter Streaming API using R

I just started playing around with the Twitter Streaming API; using the command line, I redirect the raw JSON responses to a file with the command below:
curl https://stream.twitter.com/1/statuses/sample.json -u USER:PASSWORD -o "somefile.txt"
Is it possible to stay completely within R and leverage RCurl to do the same thing? Instead of just saving the output to a file, I would like to parse each response that is returned. I have parsed twitter search results in the past, but I would like to do this as each response is received. Essentially, apply a function to each JSON response.
Thanks in advance.
EDIT: Here is the code that I have tried in R (I am on Windows, unfortunately). I need to include the reference to the .pem file to avoid the error. However, the code just "runs" and I cannot seem to see what is returned. I have tried print, cat, etc.
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")
getURL("https://stream.twitter.com/1/statuses/sample.json",
userpwd="USER:PWD",
cainfo = "cacert.pem")
I was able to figure out the basics, hopefully this helps.
#==============================================================================
# Streaming twitter using RCURL
#==============================================================================
library(RCurl)
library(rjson)
# set the directory
setwd("C:\\")
#### redirects output to a file
WRITE_TO_FILE <- function(x) {
  if (nchar(x) > 0) {
    write.table(x, file = "Twitter Stream Capture.txt", append = TRUE, row.names = FALSE, col.names = FALSE)
  }
}
### windows users will need to get this certificate to authenticate
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")
### write the raw JSON data from the Twitter firehose to a text file
getURL("https://stream.twitter.com/1/statuses/sample.json",
       userpwd = "USER:PASSWORD",
       cainfo = "cacert.pem",
       write = WRITE_TO_FILE)
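If you want to apply a function to each response instead of writing raw text, you can swap in a callback that parses each chunk as it arrives (a sketch: the streaming API delimits statuses with newlines, but a chunk handed to the callback is not guaranteed to contain exactly one complete JSON object, so real code would need to buffer partial lines):

### parse each chunk as it arrives instead of appending it to a file
PARSE_CHUNK <- function(x) {
  if (nchar(x) > 0) {
    for (line in strsplit(x, "\r\n", fixed = TRUE)[[1]]) {
      status <- tryCatch(fromJSON(line), error = function(e) NULL)
      if (!is.null(status) && !is.null(status$text)) {
        cat(status$text, "\n")   # replace with whatever per-tweet processing you need
      }
    }
  }
}

getURL("https://stream.twitter.com/1/statuses/sample.json",
       userpwd = "USER:PASSWORD",
       cainfo = "cacert.pem",
       write = PARSE_CHUNK)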
Try the Twitter API package for R.
install.packages('twitteR')
library(twitteR)
I think this is what you need.
