Read binary raster files in R

I want to read binary integers in R and convert them into raster grids.
The files have the following characteristics:
NCols= 4320
NRows= 2160
pixel-size: 1/12 = 0.0833 degrees
upper-left-lat: 90.0-1/24
upper-left-lon: -180.0+1/24
lower-right-lat: -90.0+1/24
lower-right-lon: 180.0
nodata= -5000
scale-factor= 10000
datatype: 16-bit signed integer
byte-order: big endian
Here is what I do:
file <-"http://nasanex.s3.amazonaws.com/AVHRR/GIMMS/3G/1980s/geo81aug15a.n07-VI3g"
dat <- readBin(file,what="integer", size=4, signed = TRUE, n = NRows * NCols, endian = "big")
r <- raster(nrow=2160, ncol=4320)
r[] <- dat
But this doesn't seem to be right; I'd appreciate any suggestions.
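For reference, here is a minimal sketch of a readBin()-based approach built purely from the file characteristics listed above (16-bit signed integers imply size = 2 rather than 4; big-endian byte order; nodata = -5000; scale factor = 10000). The local file name is an assumption; download the remote file first, e.g. with download.file():

library(raster)

NCols <- 4320
NRows <- 2160

## assumed to be a local copy of the remote -VI3g file
vi3g <- "geo81aug15a.n07-VI3g"

## 16-bit signed integers -> size = 2 (not 4), big endian
dat <- readBin(vi3g, what = "integer", size = 2, signed = TRUE,
               n = NRows * NCols, endian = "big")

## global grid matching the corner coordinates stated above
r <- raster(nrows = NRows, ncols = NCols,
            xmn = -180, xmx = 180, ymn = -90, ymx = 90)
r[] <- dat

r[r == -5000] <- NA   # mask nodata values
r <- r / 10000        # apply the scale factor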

I built greenbrown from source (based on the files staged on GitHub) and found that it took quite a long time to process a single file.
system.time(
  r1 <- ReadVI3g("http://nasanex.s3.amazonaws.com/AVHRR/GIMMS/3G/1980s/geo81aug15a.n07-VI3g")
)
#    user  system elapsed
#   3.252   0.973 143.846
Therefore, I suggest having a look at the gimms package, which has been designed for this particular kind of data and, moreover, is available from CRAN. Note that in contrast to ReadVI3g, it does not offer automated quality control yet, but this feature is scheduled for the next version update. In the meantime, overlay from the raster package can be employed to discard low-quality values.
# install.packages("gimms")
library(gimms)
system.time({
  ## download file, see ?downloadGimms for further options
  f <- updateInventory()
  f <- downloadGimms(f[3], overwrite = TRUE) # download 3rd file in 'f', viz. geo81aug15a.n07-VI3g

  ## rasterize ndvi and flags
  ndvi <- rasterizeGimms(f)
  flag <- rasterizeGimms(f, flag = TRUE)

  ## perform quality control
  r2 <- overlay(ndvi, flag, fun = function(x, y) {
    x[y[] > 1] <- NA
    return(x)
  })
})
#    user  system elapsed
#   4.538   3.894  26.781
The two resulting images are obviously identical
> unique(r1 - r2, na.rm = TRUE)
[1] 0
but as you can see, the gimms-based code performs much faster. Moreover, it offers parallel functionality (via doParallel) in case you would like to download and process multiple files at once.
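As a rough sketch of the parallel route (whether the gimms functions pick up a registered doParallel backend automatically, and for which package versions, is an assumption here, so check the package documentation):

library(gimms)
library(doParallel)

## register a parallel backend; gimms is stated to use doParallel, so a
## registered cluster should be usable for downloading/processing several
## files at once (assumption: the package runs its loops via foreach)
cl <- makeCluster(3)
registerDoParallel(cl)

f <- updateInventory()
f <- downloadGimms(f[1:6], overwrite = TRUE)  # several files at once
ndvi <- rasterizeGimms(f)

stopCluster(cl)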

You can read such files with the greenbrown R package.
Install it in R with
install.packages("greenbrown", repos="http://R-Forge.R-project.org")
If that fails because the package needs to be rebuilt by its authors, an alternative is to download the sources directly from the repository and install them manually, as explained in the greenbrown installation instructions. In that case you may also have to install a couple of packages that greenbrown depends on first (Kendall, bfast, strucchange) via install.packages(), as shown below.
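For example (the tarball path below is only a placeholder for wherever you saved the downloaded sources):

## install the packages greenbrown depends on, then greenbrown itself from source
install.packages(c("Kendall", "bfast", "strucchange"))
install.packages("path/to/greenbrown.tar.gz", repos = NULL, type = "source")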
After installation, reading the raster from a URL is as easy as:
library(greenbrown)
r <- ReadVI3g("http://nasanex.s3.amazonaws.com/AVHRR/GIMMS/3G/1980s/geo81aug15a.n07-VI3g")
The object returned by greenbrown::ReadVI3g is a RasterLayer. We can display it with
plot(r)
which displays the global NDVI grid.

Related

Fast way to download a really big (14 million row) csv from a zip file? Unzip and read_csv and read.csv never stop loading

I am trying to download the dataset linked below. It is about 14,000,000 rows long.
I ran this code chunk, and I am stuck at unzip(). The code has been running for a really long time and my computer is getting hot.
I tried a few different approaches that don't use unzip, and then I get stuck at the read.csv/vroom/read_csv step.
Any ideas? This is a public dataset, so anyone can try.
library(vroom)
temp <- tempfile()
download.file("https://files.consumerfinance.gov/hmda-historic-loan-data/hmda_2017_nationwide_all-records_labels.zip", temp)
unzip(temp, "hmda_2017_nationwide_all-records_labels.csv")
df2017 <- vroom("hmda_2017_nationwide_all-records_labels.csv")
unlink(temp)
Since the data set is quite large, here are two possible solutions:
With data.table (very fast, but only feasible if the data fits into memory)
require(data.table)
system('curl https://files.consumerfinance.gov/hmda-historic-loan-data/hmda_2017_nationwide_all-records_labels.zip > hmda_2017_nationwide_all-records_labels.zip && unzip hmda_2017_nationwide_all-records_labels.zip')
dat <- fread("hmda_2017_nationwide_all-records_labels.csv")
# System errno 22 unmapping file: Invalid argument
# Error in fread("hmda_2017_nationwide_all-records_labels.csv") :
# Opened 10.47GB (11237068086 bytes) file ok but could not memory map it.
# This is a 64bit process. There is probably not enough contiguous virtual memory available.
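If the memory-mapping error above is the blocker, one option before giving up on fread() is to read only the columns that are actually needed via its select argument; the column names below are only illustrative, not taken from the file:

require(data.table)

## read a subset of columns to keep the memory footprint down;
## replace these illustrative names with the ones you actually need
dat <- fread("hmda_2017_nationwide_all-records_labels.csv",
             select = c("action_taken_name", "loan_amount_000s", "state_abbr"))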
With readLines (read the data step by step)
f <- file("./hmda_2017_nationwide_all-records_labels.csv", "r")
# if header:
header <- unlist(strsplit(unlist(strsplit(readLines(f, n=1), "\",\"")), ","))
dd <- as.data.frame(t(data.frame(strsplit(readLines(f, n=100), "\",\"") )))
colnames(dd) <- header
rownames(dd) <- 1:nrow(dd)
Repeat and add to the data frame if needed:
de <- t(as.data.frame( strsplit(readLines(f, n=10), "\",\"") ) )
colnames(de) <- header
dd <- rbind( dd, de )
rownames(dd) <- 1:nrow(dd)
close(f)
Use seek to jump within the data.
I was able to download the file to my computer first, and then use vroom (https://vroom.r-lib.org/) to load it without unzipping it:
library(vroom)
df2017 <- vroom("hmda_2017_nationwide_all-records_labels.zip")
I get a warning about possible truncation, but the object has these dimensions:
> dim(df2017)
[1] 5448288 78
One nice thing about vroom is that it doesn't load the data straight into memory.
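If memory does become an issue later, vroom can also be told to materialize only the columns you need via its col_select argument; the column names below are only illustrative (check names(df2017) for the real ones):

library(vroom)
## pull only a few columns out of the zip instead of all 78
df2017_small <- vroom("hmda_2017_nationwide_all-records_labels.zip",
                      col_select = c(action_taken_name, loan_amount_000s, state_abbr))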

Why does the function write.FCS in the flowCore package corrupt my FCS file?

I am analyzing FCS files from a CyTOF experiment using the flowCore package. When I import and export my FCS files using read.FCS and write.FCS, I find that these functions corrupt my FCS file: all channels are affected, and the data looks like the tSNE in the picture below (not what is expected or meaningful).
I'm using R (ver. 3.6), RStudio (1.2.1335), and flowCore (ver. 3.9).
Here is the code I have used:
library(flowCore)

# Import FCS file
myfilename <- "export_MIX_NT_Ungated_viSNE.fcs"
myfile_fcs <- read.FCS(myfilename,
                       transformation = "linearize", which.lines = NULL,
                       alter.names = FALSE, column.pattern = NULL)

# I plan to do some data analysis here in the final version before exporting below

# Export the fcs file and rename it to T_ + filename
write.FCS(myfile_fcs, paste("T_", keyword(myfile_fcs)$"$FIL", sep = ""), what = "numeric")
This is what the original file looks like before import into R, and this is what the exported result looks like after export.
Here is the file we used for this code: dropbox link for the example file
I've looked into your problem, and at first I was skeptical about the transformation applied by read.FCS. Looking into your example file, I also see that there are already columns for your original (full plot) tSNE plot, so I'm assuming FlowJo is rewriting the tSNE values after you read/write the file in R. Since flowCore is generally targeted more towards flow data than CyTOF, I took a few pieces of this Bioc2017 walkthrough and recreated the transformations, which seems to work better, although I'm not sure how FlowJo will handle the data now. If you were going to do more work on the data, we now have it at an accessible low level, so you can basically do whatever you want. Here's my code.
fcs_raw <- read.flowSet("~/Downloads/export_MIX_NT_Ungated_viSNE.fcs",
                        transformation = FALSE,
                        truncate_max_range = FALSE)

fcs <- fsApply(fcs_raw, function(x, cofactor = 5) {
  expr <- exprs(x)
  expr <- asinh(expr[, ] / cofactor)
  exprs(x) <- expr
  x
})
expr <- fsApply(fcs, exprs)
library(matrixStats)
rng <- colQuantiles(expr, probs = c(0.01, 0.99))
expr01 <- t((t(expr) - rng[, 1]) / (rng[, 2] - rng[, 1]))
expr01[expr01 < 0] <- 0
expr01[expr01 > 1] <- 1
expr01
summary(expr01)
Be aware that this does mess up your original tSNE column numbers, so if those are important to you, I would read the flowSet, make a copy of those columns, and then move on with the data analysis in the code. If you have future questions about or analyses of flow data, feel free to contact me directly.
@csugai, thanks for your answer. The truncate_max_range = FALSE argument in the read.flowSet function caught my eye, so I included it in my read.FCS call, and that fixed the problem! Although I didn't really understand the other parts of your code that resulted in binned data.
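For reference, a minimal sketch of the original import/export with that fix folded in (the remaining arguments are unchanged from the question):

library(flowCore)

## truncate_max_range = FALSE keeps the high CyTOF channel values instead
## of clipping them at the stated channel range
myfile_fcs <- read.FCS("export_MIX_NT_Ungated_viSNE.fcs",
                       transformation = "linearize",
                       truncate_max_range = FALSE)

write.FCS(myfile_fcs,
          paste0("T_", keyword(myfile_fcs)$"$FIL"),
          what = "numeric")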

How to use readLines in R to read all lines between a certain range?

I am trying to split a large JSONL (.gz) file into a number of .csv files. I have been able to use the code below to create a working .csv file for the first 25,000 entries. I now want to read and parse the 25,001st to the 50,000th line, and have been unable to do so. I feel like it should be easily done, but my search has been fruitless thus far.
Is there a way to manipulate the 'n' argument in the readLines function to select a specific range of lines?
(P.S. I'm learning ;))
setwd("filename")
a<-list.files(pattern="(.*?).0.jsonl.gz")
a[1]
raw.data<- readLines(gzfile(a[1]), warn = "T",n=25000)
rd <- fromJSON(paste("[",paste(raw.data,collapse=','),']'))
rd2<-do.call("cbind", rd)
file=paste0(a,".csv.gz")
write.csv.gz(rd2, file, na="", row.names=FALSE)
The read_lines() function within the readr package is faster than base::readLines(), and can be used to specify a start and end line for the read. For example:
library(readr)
myFile <- "./data/veryLargeFile.txt"
first25K <- read_lines(myFile,skip=0,n_max = 25000)
second25K <- read_lines(myFile,skip=25000,n_max=25000)
Here is a complete, working example using the NOAA StormData data set. The file describes the location, event type, and damage information for over 900,000 extreme weather events in the United States between 1950 and 2011. We will use readr::read_lines() to read the first 50,000 lines in groups of 25,000 after downloading and unzipping the file.
Warning: the zip file is about 50Mb.
library(R.utils)
library(readr)
dlMethod <- "curl"
if(substr(Sys.getenv("OS"),1,7) == "Windows") dlMethod <- "wininet"
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url,destfile='StormData.csv.bz2',method=dlMethod,mode="wb")
bunzip2("StormData.csv.bz2","StormData.csv")
first25K <- read_lines("StormData.csv",skip=0,n_max = 25000)
second25K <- read_lines("StormData.csv",skip=25000,n_max=25000)
...and the objects as viewed in the RStudio Environment Viewer:
Here are the performance timings comparing base::readLines() with readr::read_lines() on an HP Spectre x-360 laptop with an Intel i7-6500U processor.
> # check performance of readLines()
> system.time(first25K <- readLines("stormData.csv", n = 25000))
   user  system elapsed
   0.05    0.00    0.04
> # check performance of readr::read_lines()
> system.time(first25K <- read_lines("StormData.csv", skip = 0, n_max = 25000))
   user  system elapsed
   0.00    0.00    0.01
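Applied back to the original JSONL question, the same skip/n_max pattern can be wrapped in a loop that writes one csv per 25,000-line block. The input file name is a placeholder, and jsonlite is assumed as the JSON parser:

library(readr)
library(jsonlite)

infile    <- "myfile.0.jsonl.gz"   # placeholder; read_lines() reads .gz files directly
chunksize <- 25000

i <- 0
repeat {
  ## note: skip re-reads from the start of the file each time, so later chunks are slower
  lines <- read_lines(infile, skip = i * chunksize, n_max = chunksize)
  if (length(lines) == 0) break
  ## each line is one JSON object; wrap them in [] to parse as an array
  ## (assumes flat objects; nested fields may need jsonlite::flatten())
  rd <- fromJSON(paste0("[", paste(lines, collapse = ","), "]"))
  write.csv(rd, sprintf("chunk_%02d.csv", i + 1), na = "", row.names = FALSE)
  i <- i + 1
}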

Defining a large matrix with the "bigmemory" package in R

I am using the bigmemory package and need to define a large matrix (20000 x 20000).
A <- big.matrix(20000, 20000, type = "double", init = 0)
Resulting in:
Error: memory could not be allocated for instance of type big.matrix
My questions:
(1.) Does the package enable a matrix of that size in general?
(2.) If not, are there any other options to create such a matrix in R?
Many thanks for your help.
This answer expands on Imo's explanation of specifying file-backing.
Unfortunately, the current CRAN version of the package (4.5.36) doesn't contain a vignette anymore, but thankfully it's possible to download older versions that contain it. For example, the vignette for version 4.5.28 contains the following piece of code:
x <- read.big.matrix("airline.csv", type = "integer", header = TRUE,
                     backingfile = "airline.bin",
                     descriptorfile = "airline.desc",
                     extraCols = "Age")
If you wish to keep your working directory clean, you can use the tempfile() and tempdir() functions. Here's one example:
library(bigmemory)

temp_file <- gsub("/", "", tempfile(tmpdir = ""))
A <- big.matrix(
  20000, 20000, type = "double", init = 0,
  backingpath = tempdir(),
  backingfile = paste0(temp_file, ".bak"),
  descriptorfile = paste0(temp_file, ".desc")
)
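Because the matrix is file-backed, it can also be re-attached later, even in a new R session, from its descriptor file. A small sketch reusing temp_file from above:

library(bigmemory)

## re-attach the file-backed matrix via the descriptor file written above
B <- attach.big.matrix(paste0(temp_file, ".desc"), path = tempdir())
B[1, 1]  # elements are pulled from the backing file on demand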

Obtain the date of publication of a package's first version

Documentation of R packages only includes the date of the last update/publication.
Version numbering does not follow a pattern common to all packages.
Therefore, it is quite difficult to know at a glance whether a package is old or new. Sometimes you need to decide between two packages with similar functions, and knowing the age of a package could guide the decision.
My first approach was to plot downloads per year by tracking CRAN downloads. This method also indicates the relative popularity/usage of a package. However, it requires a lot of memory and time to run. Therefore, I would rather have a faster way to look into the history of one package.
Is there a quick way to know or visualize the date of the first release of one specific package, or even to compare several packages at once?
The purpose is to facilitate a mental map of all available packages in R, especially for newcomers. Getting to know packages and managing them is probably one of the main reasons people give up on R.
Just for fun:
## not all repositories have the same archive structure!
archinfo <- function(pkgname, repos = "http://www.cran.r-project.org") {
  pkg.url <- paste(contrib.url(repos), "Archive", pkgname, sep = "/")
  r <- readLines(pkg.url)
  ## lame scraping code
  r2 <- gsub("<[^>]+>", " ", r)                ## drop HTML tags
  r2 <- r2[-(1:grep("Parent Directory", r2))]  ## drop header
  r2 <- r2[grep(pkgname, r2)]                  ## drop footer
  strip.white <- function(x) gsub("(^ +| +$)", "", x)
  r2 <- strip.white(gsub("&nbsp;", "", r2))    ## more cleaning
  r3 <- do.call(rbind, strsplit(r2, " +"))     ## pull out data frame
  data.frame(
    pkgvec = gsub(paste0("(", pkgname, "_|\\.tar\\.gz)"), "", r3[, 1]),
    pkgdate = as.Date(r3[, 2], format = "%d-%b-%Y"),
    ## assumes English locale for month abbreviations
    size = r3[, 4])
}
AERinfo <- archinfo("AER")
lme4info <- archinfo("lme4")
comb <- rbind(data.frame(pkg = "AER", AERinfo),
              data.frame(pkg = "lme4", lme4info))
We can't compare package numbers directly because everyone uses different numbering schemes ...
library(dplyr) ## overkill
comb2 <- comb %>% group_by(pkg) %>% mutate(numver=seq(n()))
If you want to arrange by package date:
comb2 <- arrange(comb2,pkg,pkgdate)
Pretty pictures ...
library(ggplot2); theme_set(theme_bw())
ggplot(comb2,aes(x=pkgdate,y=numver,colour=pkg))+geom_line()
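Coming back to the original question about first release dates, the scraped data frame above already answers it directly (keeping in mind that the CRAN Archive lists superseded versions, not necessarily the one currently on CRAN):

## earliest archived release date per package
aggregate(pkgdate ~ pkg, data = comb, FUN = min)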
As Andrew Taylor suggested, the CRAN Archive contains all previous versions, and the date of each release is indicated.
