R: reading a binary file that is zipped - r

I am trying to parse some online weather data with R. The data is a binary file that has been gzipped. An example file is:
ftp://ftp.cpc.ncep.noaa.gov/precip/CPC_UNI_PRCP/GAUGE_GLB/V1.0/2005/PRCP_CU_GAUGE_V1.0GLB_0.50deg.lnx.20050101.gz
If I download the file to my computer and manually unzip it, I can easily do the following:
myFile <- ( "/tmp/PRCP_CU_GAUGE_V1.0GLB_0.50deg.lnx.20050101" )
to.read = file( myFile, "rb")
myPoints <- readBin(to.read, real(), n=1e6, size = 4, endian = "little")
What I would prefer to do is automate both the download/unzip along with the read. So I thought that would be as simple as the following:
p <- gzcon( url( "ftp://ftp.cpc.ncep.noaa.gov/precip/CPC_UNI_PRCP/GAUGE_GLB/V1.0/2005/PRCP_CU_GAUGE_V1.0GLB_0.50deg.lnx.20050101.gz" ) )
myPoints <- readBin(p, real(), n=1e6, size = 4, endian = "little")
This seems to work just dandy, but in the manual step the vector myPoints has length 518400, which is accurate. However if R handles the download and read as in the second example, I get a different length vector every time I run the code. Seriously. I'm not smoking anything. I swear. I run it multiple time and each time the vector is a different length, always less than the expected 518400.
I also tried getting R to download the gzip file using the following:
temp <- tempfile()
myFile <- download.file("ftp://ftp.cpc.ncep.noaa.gov/precip/CPC_UNI_PRCP/GAUGE_GLB/V1.0/2005/PRCP_CU_GAUGE_V1.0GLB_0.50deg.lnx.20050101.gz",temp)
I found that often that would return a Warning about the file not being the expected size. Like the following:
Warning message:
In download.file("ftp://ftp.cpc.ncep.noaa.gov/precip/CPC_UNI_PRCP/GAUGE_GLB/V1.0/2005/PRCP_CU_GAUGE_V1.0GLB_0.50deg.lnx.20050101.gz", :
downloaded length 162176 != reported length 179058
Any tips you can throw my way that might help me solve this?
-J

Try this:
R> remfname <- "ftp://ftp.cpc.ncep.noaa.gov/precip/CPC_UNI_PRCP/GAUGE_GLB/V1.0/2005/PRCP_CU_GAUGE_V1.0GLB_0.50deg.lnx.20050101.gz"
R> locfname <- "/tmp/data.gz"
R> download.file(remfname, locfname)
trying URL 'ftp://ftp.cpc.ncep.noaa.gov/precip/CPC_UNI_PRCP/GAUGE_GLB/V1.0/2005/PRCP_CU_GAUGE_V1.0GLB_0.50deg.lnx.20050101.gz'
ftp data connection made, file length 179058 bytes
opened URL
==================================================
downloaded 174 Kb
R> con <- gzcon(file(locfname, "rb"))
R> myPoints <- readBin(con, real(), n=1e6, size = 4, endian = "little")
R> close(con)
R> str(myPoints)
num [1:518400] 0 0 0 0 0 0 0 0 0 0 ...
R>

Related

R: Cannot run certain function after cleaning temporary directory

I get the error:
Error in file(fn, "rb") : cannot open the connection
In addition: Warning message:
In file(fn, "rb") :
cannot open file 'C:\Users\***\AppData\Local\Temp\Rtmpwh6Zih\raster\r_tmp_2020-05-
13_170601_12152_33882.gri': No such file or directory
When I run the following code in RStudio (1.2.5042):
raster.binair <- vector(mode = "list", length = length(aggregated.rasters))
for (i in 1:NROW(aggregated.rasters)) {
+ clamped <- clamp(aggregated.rasters[[i]], upper=12, useValues=FALSE)
+ raster.binair[[i]] <- clamped
+ }
"aggregated.rasters" is a list of 96 rasters and when I separately run it, I get the correct list. I recently cleaned my temporary directory (accessed by tempdir()) and deleted the files in there. I suppose the part:
cannot open file 'C:\Users\***\AppData\Local\Temp\Rtmpwh6Zih\raster\r_tmp_2020-05-
13_170601_12152_33882.gri': No such file or directory
is referring to this. I don't know what I did wrong here. Can I get these files back or work around this error?
Files in the temp folder are deleted when an R session ends. So you should never count on them. You can run the code again, but if you want to permanently keep the results you need to write them elsewhere. Here are two options
Write many files
raster.binair <- vector(mode = "list", length = length(aggregated.rasters))
for (i in 1:NROW(aggregated.rasters)) {
f <- paste0("raster_", i)
clamped <- clamp(aggregated.rasters[[i]], upper=12, useValues=FALSE, filename=f)
raster.binair[[i]] <- clamped
}
Write a single file
raster.binair <- vector(mode = "list", length = length(aggregated.rasters))
for (i in 1:NROW(aggregated.rasters)) {
raster.binair[[i]] <- clamp(aggregated.rasters[[i]], upper=12, useValues=FALSE)
}
s <- stack(raster.binair)
s <- writeRaster(s, filename="mydata.tif")

Read in large text file in chunks

I'm working with limited RAM (AWS free tier EC2 server - 1GB).
I have a relatively large txt file "vectors.txt" (800mb) I'm trying to read into R. Having tried various methods I have failed to read in this vector to memory.
So, I was researching ways of reading it in in chunks. I know that the dim of the resulting data frame should be 300K * 300. If I was able to read in the file e.g. 10K lines at a time and then save each chunk as an RDS file I would be able to loop over the results and get what I need, albeit just a little slower with less convenience than having the whole thing in memory.
To reproduce:
# Get data
url <- 'https://github.com/eyaler/word2vec-slim/blob/master/GoogleNews-vectors-negative300-SLIM.bin.gz?raw=true'
file <- "GoogleNews-vectors-negative300-SLIM.bin.gz"
download.file(url, file) # takes a few minutes
R.utils::gunzip(file)
# word2vec r library
library(rword2vec)
w2v_gnews <- "GoogleNews-vectors-negative300-SLIM.bin"
bin_to_txt(w2v_gnews,"vector.txt")
So far so good. Here's where I struggle:
word_vectors = as.data.frame(read.table("vector.txt",skip = 1, nrows = 10))
Returns "cannot allocate a vector of size [size]" error message.
Tried alternatives:
word_vectors <- ff::read.table.ffdf(file = "vector.txt", header = TRUE)
Same, not enough memory
word_vectors <- readr::read_tsv_chunked("vector.txt",
callback = function(x, i) saveRDS(x, i),
chunk_size = 10000)
Resulted in:
Parsed with column specification:
cols(
`299567 300` = col_character()
)
|=========================================================================================| 100% 817 MB
Error in read_tokens_chunked_(data, callback, chunk_size, tokenizer, col_specs, :
Evaluation error: bad 'file' argument.
Is there any other way to turn vectors.txt into a data frame? Maybe by breaking it into pieces and reading in each piece, saving as a data frame and then to rds? Or any other alternatives?
EDIT:
From Jonathan's answer below, tried:
library(rword2vec)
library(RSQLite)
# Download pre trained Google News word2vec model (Slimmed down version)
# https://github.com/eyaler/word2vec-slim
url <- 'https://github.com/eyaler/word2vec-slim/blob/master/GoogleNews-vectors-negative300-SLIM.bin.gz?raw=true'
file <- "GoogleNews-vectors-negative300-SLIM.bin.gz"
download.file(url, file) # takes a few minutes
R.utils::gunzip(file)
w2v_gnews <- "GoogleNews-vectors-negative300-SLIM.bin"
bin_to_txt(w2v_gnews,"vector.txt")
# from https://privefl.github.io/bigreadr/articles/csv2sqlite.html
csv2sqlite <- function(tsv,
every_nlines,
table_name,
dbname = sub("\\.txt$", ".sqlite", tsv),
...) {
# Prepare reading
con <- RSQLite::dbConnect(RSQLite::SQLite(), dbname)
init <- TRUE
fill_sqlite <- function(df) {
if (init) {
RSQLite::dbCreateTable(con, table_name, df)
init <<- FALSE
}
RSQLite::dbAppendTable(con, table_name, df)
NULL
}
# Read and fill by parts
bigreadr::big_fread1(tsv, every_nlines,
.transform = fill_sqlite,
.combine = unlist,
... = ...)
# Returns
con
}
vectors_data <- csv2sqlite("vector.txt", every_nlines = 1e6, table_name = "vectors")
Resulted in:
Splitting: 12.4 seconds.
Error: nThread >= 1L is not TRUE
Another option would be to do the processing on-disk, e.g. using an SQLite file and dplyr's database functionality. Here's one option: https://stackoverflow.com/a/38651229/4168169
To get the CSV into SQLite you can also use the bigreadr package which has an article on doing just this: https://privefl.github.io/bigreadr/articles/csv2sqlite.html

Using R to access FTP Server and Download Files Results in Status "530 Not logged in"

What I'm Attempting to Do
I'm attempting to download several weather data files from the US National Climatic Data Centre's FTP server but am running into problems with an error message after successfully completing several file downloads.
After successfully downloading two station/year combinations I start getting an error "530 Not logged in" message. I've tried starting at the offending year and running from there and get roughly the same results. It downloads a year or two of data and then stops with the same error message about not being logged in.
Working Example
Following is a working example (or not) with the output truncated and pasted below.
options(timeout = 300)
ftp <- "ftp://ftp.ncdc.noaa.gov/pub/data/gsod/"
td <- tempdir()
station <– c("983240-99999", "983250-99999", "983270-99999", "983280-99999", "984260-41231", "984290-99999", "984300-99999", "984320-99999", "984330-99999")
years <- 1960:2016
for (i in years) {
remote_file_list <- RCurl::getURL(
paste0(ftp, "/", i, "/"), ftp.use.epsv = FALSE, ftplistonly = TRUE,
crlf = TRUE, ssl.verifypeer = FALSE)
remote_file_list <- strsplit(remote_file_list, "\r*\n")[[1]]
file_list <- paste0(station, "-", i, ".op.gz")
file_list <- file_list[file_list %in% remote_file_list]
file_list <- paste0(ftp, i, "/", file_list)
Map(function(ftp, dest) utils::download.file(url = ftp,
destfile = dest, mode = "wb"),
file_list, file.path(td, basename(file_list)))
}
trying URL 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1960/983250-99999-1960.op.gz'
Content type 'unknown' length 7135 bytes
==================================================
downloaded 7135 bytes
...
trying URL 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1961/984290-99999-1961.op.gz'
Content type 'unknown' length 7649 bytes
==================================================
downloaded 7649 bytes
trying URL 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1962/983250-99999-1962.op.gz'
downloaded 0 bytes
Error in utils::download.file(url = ftp, destfile = dest, mode = "wb") :
cannot download all files In addition: Warning message:
In utils::download.file(url = ftp, destfile = dest, mode = "wb") :
URL ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1962/983250-99999-1962.op.gz':
status was '530 Not logged in'
Different Methods and Ideas I've Tried but Haven't Yet Been Successful
So far I've tried to slow the requests down using Sys.sleep in a for loop and any other manner of retrieving the files more slowly by opening then closing connections, etc. It's puzzling because: i) it works for a bit then stops and it's not related to the particular year/station combination per se; ii) I can use nearly the exact same code and download much larger annual files of global weather data without any errors over a long period of years like this; and iii) it's not always stopping after 1961 going to 1962, sometimes it stops at 1960 when it starts on 1961, etc., but it does seem to be consistently between years, not within from what I've found.
The login is anonymous, but you can use userpwd "ftp:your#email.address". So far I've been unsuccessful in using that method to ensure that I was logged in to download the station files.
I think you're going to need a more defensive strategy when working with this FTP server:
library(curl) # ++gd > RCurl
library(purrr) # consistent "data first" functional & piping idioms FTW
library(dplyr) # progress bar
# We'll use this to fill in the years
ftp_base <- "ftp://ftp.ncdc.noaa.gov/pub/data/gsod/%s/"
dir_list_handle <- new_handle(ftp_use_epsv=FALSE, dirlistonly=TRUE, crlf=TRUE,
ssl_verifypeer=FALSE, ftp_response_timeout=30)
# Since you, yourself, noted the server was perhaps behaving strangely or under load
# it's prbly a much better idea (and a practice of good netizenship) to cache the
# results somewhere predictable rather than a temporary, ephemeral directory
cache_dir <- "./gsod_cache"
dir.create(cache_dir, showWarnings=FALSE)
# Given the sporadic efficacy of server connection, we'll wrap our calls
# in safe & retry functions. Change this variable if you want to have it retry
# more times.
MAX_RETRIES <- 6
# Wrapping the memory fetcher (for dir listings)
s_curl_fetch_memory <- safely(curl_fetch_memory)
retry_cfm <- function(url, handle) {
i <- 0
repeat {
i <- i + 1
res <- s_curl_fetch_memory(url, handle=handle)
if (!is.null(res$result)) return(res$result)
if (i==MAX_RETRIES) { stop("Too many retries...server may be under load") }
}
}
# Wrapping the disk writer (for the actual files)
# Note the use of the cache dir. It won't waste your bandwidth or the
# server's bandwidth or CPU if the file has already been retrieved.
s_curl_fetch_disk <- safely(curl_fetch_disk)
retry_cfd <- function(url, path) {
# you should prbly be a bit more thorough than `basename` since
# i think there are issues with the 1971 and 1972 filenames.
# Gotta leave some work up to the OP
cache_file <- sprintf("%s/%s", cache_dir, basename(url))
if (file.exists(cache_file)) return()
i <- 0
repeat {
i <- i + 1
if (i==6) { stop("Too many retries...server may be under load") }
res <- s_curl_fetch_disk(url, cache_file)
if (!is.null(res$result)) return()
}
}
# the stations and years
station <- c("983240-99999", "983250-99999", "983270-99999", "983280-99999",
"984260-41231", "984290-99999", "984300-99999", "984320-99999",
"984330-99999")
years <- 1960:2016
# progress indicators are like bowties: cool
pb <- progress_estimated(length(years))
walk(years, function(yr) {
# the year we're working on
year_url <- sprintf(ftp_base, yr)
# fetch the directory listing
tmp <- retry_cfm(year_url, handle=dir_list_handle)
con <- rawConnection(tmp$content)
fils <- readLines(con)
close(con)
# sift out only the target stations
map(station, ~grep(., fils, value=TRUE)) %>%
keep(~length(.)>0) %>%
flatten_chr() -> fils
# grab the stations files
walk(paste(year_url, fils, sep=""), retry_cfd)
# tick off progress
pb$tick()$print()
})
You may also want to set curl_interrupt to TRUE in the curl handle if you want to be able to stop/esc/interrupt the downloads.

"download.file" Incomplete and inconsistent downloads

Am trying to understand why I am having inconsistent results downloading CSV files from a website archive. Don't know if the problem is at my end, the other side or just failed communications in between. Any suggestions are welcomed.
Using a R script to automate the downloading of CSV files by month and year from the HYCOM archives for analysis. The script generated the following URL trying URL 'http://ncss.hycom.org/thredds/ncss/GLBu0.08/reanalysis/3hrly?var=salinity&var=water_temp&var=water_u&var=water_v&latitude=13.875&longitude=-72.25&time_start=2012-05-01T00:00:00Z&time_end=2012-05-31T21:00:00Z&vertCoord=&accept=csv'
Running download.file successfully obtains the file about half the time, otherwise fails. Any suggestions are welcomed. The images below shows the failed run. Successful run is below.
Successful Log
#download one month of data
MM = '05'
LastDay = ndays(paste(year,MM,'01',sep="-"))
H1 = paste( as shown in image)
H2 = '-01T00:00:00Z&time_end='
#H3 = 'T21:00:00Z&timeStride=1&vertCoord=&accept=csv'
H3 = 'T21:00:00Z&vertCoord=&accept=csv'
HtmlLink <- paste(H1,year,"-",MM,H2,year,"-",MM,"-",LastDay,H3,sep="")
dest = paste("../data/",year,MM,".csv",sep="")
download.file(url =HtmlLink ,destfile=dest,cacheOK=FALSE, method="auto")
trying URL 'as shown in image'
Content type 'text/plain;charset=UTF-8' length unknown
..................................................
................downloaded 666 KB
user system elapsed
28.278 6.605 5201.421
LOG OF FAILED RUN
You can/should turn the following into a function accepting parameters and replace the hardcoded values with said params (I used httr:::parse_query() to make the list):
library(httr)
URL <- "http://ncss.hycom.org/thredds/ncss/GLBu0.08/reanalysis/3hrly"
params <- list(var = "salinity",
var = "water_temp",
var = "water_u",
var = "water_v",
latitude = "13.875",
longitude = "-72.25",
time_start = "2012-05-01T00:00:00Z",
time_end = "2012-05-31T21:00:00Z",
vertCoord = "",
accept = "csv")
dest_file <- "filename"
res <- GET(url=URL,
query=params,
timeout(360),
write_disk(dest_file, overwrite=TRUE),
verbose())
warn_for_status(res)
You can (eventually) remove the verbose() from that GET call, but it's helpful during debugging.
The main issue is that this server is s l o w and times out before the transfer is complete. Even the value of 360 might not be enough (you'll need to experiment).
Many thanks to all for the help. The suggestion by hrbrmstr appears to be an elegant answer and I look forwards to testing it. However, I was unable to install a working copy using the program manager. Installation from a local download also failed since R complained that the OS X version that I downloaded from CRAN was a windows version, not OS X. Yes, I repeated the download several times to make sure I had the right package.
As suggested by Cyrus Mohammadian, I tried the procedures in the curl library.
Running the same URL, download.file transfers failed about 50% of the time. Using curl reduced the transfer times from 2000 seconds to 1000 seconds with no failures in 12 tries.
## calculate number of days in month
ndays <- function(d) {
last_days <- 28:31
rev(last_days[which(!is.na(
as.Date( paste( substr(d, 1, 8),
last_days, sep = ''),
'%Y-%m-%d')))])[1] }
nlat = 13.875
elon = -72.25
#download one month of data
year = 2008
MM = '01'
LastDay = ndays(paste(year,MM,'01',sep="-"))
H1 = paste('http://ncss.hycom.org/thredds/ncss/GLBu0.08/reanalysis/3hrly?
var=salinity&var=water_temp&var=water_u&var=water_v&latitude=',
nlat,'&longitude=', elon,'&time_start=',sep="")
H2 = '-01T00:00:00Z&time_end='
H3 = 'T21:00:00Z&timeStride=1&vertCoord=&accept=csv'
HtmlLink <- paste(H1,year,"-",MM,H2,year,"-",MM,"-",LastDay,H3,sep="")
dest = paste("../data/",year,MM,".csv",sep="")
curl_download(url =HtmlLink ,destfile=dest,quiet=FALSE, mode="wb")

Reading binary file into R

I'm trying to read a binary file into R but this file has rows of data written in binary code. So it doesnt have one full set of data belonging to one column instead it is stored as rows of data. Here's what my data looks like:
Bytes 1-4: int ID
Byte 5: char response character
Bytes 6-9: int Resp Dollars
Byte 10: char Type char
Anyone can help me figure out how to read this file into R?
Here is the code I have tried so far. I tried a couple of things with limited success. Unfortunately, I cant post any of the data on public sites, apologies. I’m relatively new to R, so I need some help in terms of how to improve the code.
> binfile = file("File Location", "rb")
> IDvals = readBin(binfile, integer(), size=4, endian = "little")
> Responsevals = readBin(binfile, character (), size = 5)
> ResponseDollarsvals = readBin (binfile, integer (), size = 9, endian= "little")
Error in readBin(binfile, integer(), size = 9, endian = "little") :
size 9 is unknown on this machine
> Typevals = readBin (binfile, character (), size=4)
> binfile1= cbind(IDvals, Responsevals, ResponseDollarsvals, Typevals)
> dimnames(binfile1)[[2]]
[1] "IDvals" "Responsevals" "ResponseDollarsvals" "Typevals"
> colnames(binfile1)=binfile
Error in `colnames<-`(`*tmp*`, value = 4L) :
length of 'dimnames' [2] not equal to array extent
You could open the file as a raw file, then issue readBin or readChar commands to get each line. Append each value to a column as you go.
my.file <- file('path', 'rb')
id <- integer(0)
response <- character(0)
...
Loop around this block:
id = c(id, readBin(my.file, integer(), size = 4, endian = 'little'))
response = c(response, readChar(my.file, 1))
...
readChar(my.file, size = 1) # For UNIX newlines. Use size = 2 for Windows newlines.
Then create your data frame.
See here: http://www.ats.ucla.edu/stat/r/faq/read_binary.htm

Resources