writedlm saves a large matrix, e.g. of size (100000, 1000) with only a few nonzeros, to a very big file (~1 GB). Is there a more efficient method?
As Colin mentioned, a sparse matrix (from SparseArrays) plus the JLD package will do:
using JLD, SparseArrays, LinearAlgebra
a = sparse(1.0I, 1_000_000, 1_000_000)   # sparse identity; speye() was removed in Julia 1.0
save("/tmp/foo.jld", "a", a)
You could also use the Serialization standard library's serialize()/deserialize() if you don't care about the opacity of the format.
# write to file
using Serialization
using SparseArrays, LinearAlgebra
a = sparse(1.0I, 1_000_000, 1_000_000)
open("mat.dat", "w") do f
    serialize(f, a)
end
# read back (passing a function to open() closes the file automatically)
a = open(deserialize, "mat.dat")
Related
I have a few hundred thousand very small .dat.gz files that I want to read into R in the most efficient way possible. I read in the file and then immediately aggregate and discard the data, so I am not worried about managing memory as I get near the end of the process. I just really want to speed up the bottleneck, which happens to be unzipping and reading in the data.
Each dataset consists of 366 rows and 17 columns. Here is a reproducible example of what I am doing so far:
Building reproducible data:
require(data.table)
# Make dir
system("mkdir practice")
# Function to create data
create_write_data <- function(file.nm) {
  dt <- data.table(Day = 0:365)
  dt[, (paste0("V", 1:17)) := lapply(1:17, function(x) rnorm(n = 366))]
  write.table(dt, paste0("./practice/", file.nm), row.names = FALSE, sep = "\t", quote = FALSE)
  system(paste0("gzip ./practice/", file.nm))
}
And here is the code that applies it:
# Apply function to create 10 fake zipped data.frames (550 kb on disk)
tmp <- lapply(paste0("dt", 1:10,".dat"), function(x) create_write_data(x))
And here is my most efficient code so far to read in the data:
# Function to read in files as fast as possible
read_Fast <- function(path.gz) {
  system(paste0("gzip -d ", path.gz))  # Unzip file
  path.dat <- gsub(".gz", "", path.gz)
  dat_run <- fread(path.dat)
}
# Apply above function
dat.files <- list.files(path="./practice", full.names = TRUE)
system.time(dat.list <- rbindlist(lapply(dat.files, read_Fast), fill=TRUE))
dat.list
I have wrapped this up in a function and applied it in parallel, but it is still much too slow for what I need.
I have already tried h2o.importFolder from the wonderful h2o package, but it is actually much slower than plain R with data.table. Maybe there is a way to speed up the unzipping of the files, but I am unsure. From the few times I have run this, I have noticed that unzipping the files usually takes about two-thirds of the function's run time.
I'm sort of surprised that this actually worked. Hopefully it works for your case. I'm quite curious to know how the speed compares to reading the compressed data from disk directly in R (albeit with a penalty for non-vectorization).
tblNames = fread('cat *dat.gz | gunzip | head -n 1')[, colnames(.SD)]
tbl = fread('cat *dat.gz | gunzip | grep -v "^Day"')
setnames(tbl, tblNames)
tbl
R has the ability to read gzipped files natively, using the gzfile function. See if this works.
rbindlist(lapply(dat.files, function(f) {
  read.delim(gzfile(f))
}))
The bottleneck might be caused by the use of the system() call to an external application.
You should try using the built-in functions to extract the archive.
This answer explains how: Decompress gz file using R
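For example, recent versions of data.table can decompress .gz files internally, which removes the system() call entirely. A minimal sketch, assuming data.table >= 1.12 with the R.utils package installed:
library(data.table)
# fread() decompresses each .dat.gz file itself; no external gzip process is spawned.
dat.files <- list.files(path = "./practice", pattern = "\\.dat\\.gz$", full.names = TRUE)
dat.list <- rbindlist(lapply(dat.files, fread), fill = TRUE)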
I realize this is a pretty basic question, but I want to make sure I do it right, so I wanted to ask just to confirm. I have a vector in one project that I want to be able to use in another project, and I was wondering if there was a simple way to export the vector in a form that I can easily import it to another project.
The way that I figured out how to do it so far is to convert it to a df, then export the df as a csv, then import and unpack it to vector form, but that seems needlessly complicated. It's just a simple numeric vector.
There are a number of ways to read and write data/files in R. For reading, you may want to look at: read.table, read.csv, readLines, source, dget, load, unserialize, and readRDS. For writing, you will want to look at write.table, writeLines, dump, dput, save, serialize, and saveRDS.
x <- 1:3
x
# [1] 1 2 3
save(x, file = "myvector.rda")
# Change x to prove a point.
x <- 4:6
x
# [1] 4 5 6
# Better yet, we could remove it entirely
rm(x)
x
# Error: object 'x' not found
# Now load what we saved to get us back to where we started.
load("myvector.rda")
x
# [1] 1 2 3
Alternatively, you can use saveRDS and readRDS -- best practice/convention is to use the .rds extension; note, however, that loading the object is slightly different as saveRDS does not save the object name:
saveRDS(x, file = "myvector_serialized.rds")
x <- readRDS("myvector_serialized.rds")
Finally, saveRDS is a lower-level function and therefore can only save one object at a time. The traditional save approach allows you to save multiple objects at the same time, but it can become a nightmare if you re-use the same names in different projects/files/scripts...
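To make the point about reused names concrete, here is a small hypothetical illustration of how load() can silently overwrite an object that already exists in your workspace:
x <- 1:3
save(x, file = "project_a.rda")  # hypothetical file written in project A
x <- "something else entirely"   # the name x is reused in another script
load("project_a.rda")            # silently replaces the current x
x
# [1] 1 2 3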
I agree that saveRDS is a good way to go, but I also recommend the save and save.image functions, which I will demonstrate below.
# save.image
x <- c(5,6,8)
y <- c(8,9,11)
save.image(file="~/vectors.Rdata") # saves all workspace objects
Or alternatively choose which objects you want to save
x <- c(5,6,8)
y <- c(8,9,11)
save(x, y, file="~/vectors.Rdata") # saves only the selected objects
One advantage to using .Rdata over .Rda (a minor one) is that you can click on the file in a file explorer (e.g. on Windows) and it will be loaded into the R environment. This doesn't work with .Rda files in, say, RStudio on Windows.
I'm new to R and I have basically no knowledge of data management in this language. I'm using the dynaTrees package to do some machine learning and I'd like to export the model to a file for further use.
The model is obtained by calling the dynaTrees function:
model <- dynaTrees(
  as.matrix(training.data[, -1]),
  as.matrix(training.data[, 1]),
  R = 10
)
I then want to export this model object so it can be loaded in another script later on. I tried the simple:
write(model, file="model.dat")
but that doesn't work (type list not supported).
Is there a generic way (or a dedicated package) in R to export complex data structure to file?
You probably want saveRDS (see ? saveRDS for details). Example:
saveRDS(model, file = "model.Rds")
This saves a single R object to file so that you can restore it later (using readRDS). save is an alternative that is designed for saving multiple R objects (or an entire workspace), which can be accessed later using load.
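For completeness, restoring the model in the second script would look like this (the file name matches the saveRDS() call above):
# In the other script: read the serialized model back in.
model <- readRDS("model.Rds")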
Your intuition was to use the write function, which is actually a rarely used tool for writing a matrix to a text representation. Here's an example:
write(as.matrix(warpbreaks[1:3,]), file = stdout())
# 26
# 30
# 54
# A
# A
# A
# L
# L
# L
I'm working on an R script which has to load data (obviously). The data loading takes a lot of time (a ~500 MB file) and I wonder if I can avoid having to go through the loading step every time I rerun the script, which I do a lot during development.
I appreciate that I could do the whole thing in an interactive R session, but developing multi-line functions is just so much less convenient at the R prompt.
Example:
#!/usr/bin/Rscript
d <- read.csv("large.csv", header=T) # 500 MB ~ 15 seconds
head(d)
How, if possible, can I modify the script so that on subsequent executions d is already available? Is there something like a cache=TRUE option, as in R Markdown code chunks?
Sort of. There are a few answers:
Use a faster csv reader: fread() in the data.table package is beloved by many. Your time may come down to a second or two.
Similarly, read the csv once, then write it out in a compact binary form via saveRDS() so that next time you can do readRDS(), which will be faster because you do not have to load and parse the data again (see the sketch below).
Don't read the data, but memory-map it via the mmap package. That is more involved but likely very fast; databases use such a technique internally.
Load on demand; the SOAR package, for example, is useful here.
Direct caching, however, is not possible.
Edit: Actually, direct caching "sort of" works if you save your data set with your R session at the end. Many of us advise against that, as clearly reproducible scripts which make the loading explicit are preferable in our view -- but R can help via the load() / save() mechanism (which loads several objects at once, whereas saveRDS() / readRDS() work on a single object).
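A minimal sketch of the saveRDS()/readRDS() caching pattern from the second point above (the .rds file name is just a placeholder):
# First run: parse the CSV and write a compact binary cache.
# Later runs: load the cache, which skips the slow parsing step.
if (file.exists("large.rds")) {
  d <- readRDS("large.rds")
} else {
  d <- read.csv("large.csv", header = TRUE)
  saveRDS(d, "large.rds")
}
head(d)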
Another option is the R.cache package:
library(R.cache)
library(WDI)
start_year <- 2000
end_year <- 2013
brics_countries <- c("BR", "RU", "IN", "CN", "ZA")
indics <- c("NY.GDP.PCAP.CD", "TX.VAL.TECH.CD", "SP.POP.TOTL", "IP.JRN.ARTC.SC",
            "GB.XPD.RSDV.GD.ZS", "BX.GSR.CCIS.ZS", "BX.GSR.ROYL.CD", "BM.GSR.ROYL.CD")
key <- list(brics_countries, indics, start_year, end_year)
brics_data <- loadCache(key)
if (is.null(brics_data)) {
  brics_data <- WDI(country = brics_countries, indicator = indics,
                    start = start_year, end = end_year, extra = FALSE, cache = NULL)
  saveCache(brics_data, key = key, comment = "brics_data")
}
I use exists to check whether the object is already present and load it conditionally, i.e.:
if (!exists("d")) {
  d <- read.csv("large.csv", header = T)
  # Any further processing on loading
}
# The rest of the script
If you want to load/process the file again, just run rm(d) before sourcing. Just be careful not to use an object name that is already used elsewhere, otherwise the check will find that object and skip the load.
I wrote up some of the common ways of caching in R in "Caching in R" and published it to R-Bloggers. For your purpose, I would recommend just using saveRDS() or the 'qs' (quick serialization) package. My package, 'mustashe', uses qs for reading and writing files, so you could just use mustashe::stash(), too.
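A rough sketch of the qs route, assuming the package's qsave()/qread() pair (the file names are placeholders):
library(qs)
# First run: read the data once and cache it in qs's fast binary format.
d <- read.csv("large.csv", header = TRUE)
qsave(d, "large.qs")
# Subsequent runs: restore from the cache instead of re-parsing the CSV.
d <- qread("large.qs")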
I have a .csv file, example.csv, with 8000 columns x 40000 rows. The csv file has a string header for each column. All fields contain integer values between 0 and 10. When I try to load this file with read.csv it turns out to be extremely slow. It is also very slow when I add the parameter nrows=100. I wonder if there is a way to accelerate read.csv, or use some other function instead of read.csv, to load the file into memory as a matrix or data.frame?
Thanks in advance.
If your CSV only contains integers, you should use scan instead of read.csv, since ?read.csv says:
‘read.table’ is not the right tool for reading large matrices,
especially those with many columns: it is designed to read _data
frames_ which may have columns of very different classes. Use
‘scan’ instead for matrices.
Since your file has a header, you will need skip=1, and it will probably be faster if you set what=integer(). If you must use read.csv and speed / memory consumption are a concern, setting the colClasses argument is a huge help.
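A rough sketch of both suggestions, assuming 8000 integer columns and a comma separator (adjust ncol, sep, and the file name to match the real data):
# scan() the values straight into an integer matrix, skipping the header row.
m <- matrix(scan("example.csv", what = integer(), sep = ",", skip = 1),
            ncol = 8000, byrow = TRUE)
# Or keep read.csv() but declare all column types up front via colClasses.
d <- read.csv("example.csv", colClasses = rep("integer", 8000))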
Try using data.table::fread(). This is by far one of the fastest ways to read .csv files into R. There is a good benchmark here.
library(data.table)
data <- fread("c:/data.csv")
If you want to make it even faster, you can also read only the subset of columns you want to use:
data <- fread("c:/data.csv", select = c("col1", "col2", "col3"))
Also try Hadley Wickham's readr package:
library(readr)
data <- read_csv("file.csv")
If you'll read the file often, it might well be worth saving it from R in a binary format using the save function. Specifying compress=FALSE often results in faster load times.
...You can then load it in with the (surprise!) load function.
d <- as.data.frame(matrix(1:1e6,ncol=1000))
write.csv(d, "c:/foo.csv", row.names=FALSE)
# Load file with read.csv
system.time( a <- read.csv("c:/foo.csv") ) # 3.18 sec
# Load file using scan
system.time( b <- matrix(scan("c:/foo.csv", 0L, skip=1, sep=','),
ncol=1000, byrow=TRUE) ) # 0.55 sec
# Load (binary) file using load
save(d, file="c:/foo.bin", compress=FALSE)
system.time( load("c:/foo.bin") ) # 0.09 sec
It might be worth trying the new vroom package:
vroom is a new approach to reading delimited and fixed width data into R.
It stems from the observation that, when parsing files, reading data from disk and finding the delimiters is generally not the main bottleneck. Instead, (re)allocating memory and parsing the values into R data types (particularly for characters) takes the bulk of the time.
Therefore you can obtain very rapid input by first performing a fast indexing step and then using the ALTREP (ALTernative REPresentations) framework available in R versions 3.5+ to access the values in a lazy / delayed fashion.
This approach potentially also allows you to work with data that is larger than memory. As long as you are careful to avoid materializing the entire dataset at once, it can be efficiently queried and subset.
#install.packages("vroom",
# dependencies = TRUE, repos = "https://cran.rstudio.com")
library(vroom)
df <- vroom('example.csv')
Benchmark: readr vs data.table vs vroom for a 1.57GB file