reading sparse data in h2o using svmlight - r

I am trying to read a dataset in SVMLight format into h2o. Writing it to a file on disk and reading it back is working OK but reading it directly from R's memory is not. I would like to know if there is a different function or a different way of calling the function I have used below.
Here's an example (R 3.3.3, h2o 3.10.3.6):
require(data.table)
require(h2o)
set.seed(1000)
tot_obs <- 100
tot_var <- 500
vars_per_obs <- round(.0*tot_var,0):round(.1*tot_var,0)
#randomly generated data
mat.dt <- do.call('rbind', lapply(1:tot_obs, function(n) {
  nvar <- sample(vars_per_obs, 1)
  if (nvar > 0) data.table(obs=n, var=sample(1:tot_var, nvar))[, value:=sample(10:50, .N, replace=TRUE)]
}))
event.dt <- data.table(obs=1:tot_obs)[, is_event:=sample(0:1,.N,prob=c(.9,.1),replace=TRUE)]
#SVMLight format
setorder(mat.dt, obs, var)
mat.agg.dt <- mat.dt[, .(feature=paste(paste0(var,":",value), collapse=" ")), obs]
mat.agg.dt <- merge(event.dt, mat.agg.dt, by="obs", sort=FALSE, all.x=TRUE)
mat.agg.dt[is.na(feature), feature:=""]
mat.agg.dt[, svmlight:=paste(is_event,feature)][, c("obs","is_event","feature"):=NULL]
fwrite(mat.agg.dt, file="svmlight.txt", col.names=FALSE)
#h2o
localH2o <- h2o.init(nthreads=-1, max_mem_size="4g")
h2o.no_progress()
#works
h2o.orig <- h2o.importFile("svmlight.txt", parse=TRUE)
#does NOT work
tmp <- as.h2o(mat.agg.dt)
h2o.orig.1 <- h2o.parseRaw(tmp, parse_type="SVMLight")

The easy answer is that you probably don't have enough R memory to perform this action, so one solution is to increase the amount of memory in R (if that's an option for you). It could also mean that you don't have enough memory in your H2O cluster, so you could increase that as well.
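For example, a minimal sketch of giving the H2O cluster more memory at startup ("8g" here is just an illustrative value, size it to your data):
# hypothetical example: bump the H2O cluster memory at startup
library(h2o)
localH2o <- h2o.init(nthreads = -1, max_mem_size = "8g")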
The only way to go directly from R memory to the H2O cluster is the as.h2o() function, so you are definitely using the right command. Under the hood, the as.h2o() function writes the frame from R memory to disk (stored in a temp file) and then reads it directly into the H2O cluster using H2O's native parallel read functionality.
We recently added the ability to use data.table's read/write functionality any place that we use base R, so since you have data.table installed, you should be able to get around this bottleneck by adding this to the top of your script: options("h2o.use.data.table"=TRUE). This forces the use of data.table instead of base R to write to disk for the first half of the as.h2o() conversion process. It should work for you, since it does the same thing your code is already doing where you use fwrite to write to disk and h2o.importFile() to read it back in.
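For instance, a minimal sketch of that approach, assuming your h2o version recognizes the option:
# enable data.table-backed I/O for the temp-file step of as.h2o()
options("h2o.use.data.table" = TRUE)
h2o.orig.1 <- as.h2o(mat.agg.dt)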
Also you don't need the last line with h2o.parseRaw():
tmp <- as.h2o(mat.agg.dt)
h2o.orig.1 <- h2o.parseRaw(tmp, parse_type="SVMLight")
You can just do:
h2o.orig.1 <- as.h2o(mat.agg.dt)
There is a related post that shows how to use data.table to solve the reverse problem (using as.data.frame() instead of as.h2o()) here.

Related

How to read multiple large sas data files into R, filter rows and save subset datasets as .rds

I have 30 SAS files (dataset1.sas7bdat through dataset30.sas7bdat, approx. 10 GB per file) in a folder, and need to analyse a subset of rows in these data files (all rows where the character variable A begins with 10).
Thus, I need to read each of the SAS files into R, filter a subset with grep on variable A, and then save each of these filtered datasets as a .rds file.
I'm trying to achieve this using a for loop over list.files() and the haven package to read the SAS files. In order to avoid going out of memory, I need to remove the imported dataset on each iteration after the subset has been filtered and saved as .rds.
Though neither elegant nor satisfying, I could hard-code it manually 30 times over like this, copy/pasting and incrementing the suffixes by 1 each time:
dt1 <- haven::read_sas("~/folder/dataset1.sas7bdat")
dt1 <- data.table::as.data.table(dt1)
dt1 <- dt1[grep("^10", A)]
saveRDS(dt1, "~/folder/subset1.rds")
dt2 <- haven::read_sas("~/folder/dataset2.sas7bdat")
dt2 <- data.table::as.data.table(dt2)
dt2 <- dt2[grep("^10", A)]
saveRDS(dt2, "~/folder/subset2.rds")
etc.
While the following for loop technically works to read the files into memory, it will never finish because it runs badly out of memory, so it does not let me filter the data:
folder <- "~/folder/"
file_list <- list.files(path = folder, pattern = "^dataset")
for (i in 1:length(file_list)) {
  assign(file_list[i], haven::read_sas(paste(folder, file_list[i], sep='')))
}
Is there a way to - on each iteration in the loop - filter the dataset, remove the unfiltered dataset and save the subset in a .rds-file?
I can't seem to come up with a way to incorporate this into my approach of using the assign() function.
Is there a better way to go about this?
You could potentially free up memory by clearing the workspace after each load-and-filter step via
rm(list=ls())
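A minimal sketch of that idea applied per iteration (object and path names are taken from the question); using rm() on the imported object plus gc() keeps the loop variables alive, unlike rm(list=ls()):
folder <- "~/folder/"
file_list <- list.files(path = folder, pattern = "^dataset")
for (i in seq_along(file_list)) {
  dt <- data.table::as.data.table(haven::read_sas(paste0(folder, file_list[i])))
  dt <- dt[grep("^10", A)]
  saveRDS(dt, paste0(folder, "subset", i, ".rds"))
  rm(dt) # drop the object...
  gc()   # ...and release the memory before the next file
}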
I slept on it, and woke up with a working solution using a function to do the work, and a loop to cycle through filenames. This also enables me to save the output in a different folder (my raw data folder is read-only):
library(haven)
library(data.table)
fromFolder <- "~/folder_with_input_data/"
toFolder <- "~/folder_with_output_data/"
import_sas <- function(filename) {
  dt <- read_sas(paste(fromFolder, filename, sep=''), NULL)
  dt <- as.data.table(dt)
  dt <- dt[grep("^10", A)]
  saveRDS(dt, paste(toFolder, filename, '.rds', sep=''), compress = FALSE)
  remove(dt)
}
file_list <- list.files(path = fromFolder, pattern="^dataset")
for (filename in file_list) {
import_sas(filename)
}
I haven't tested this with the full 30 files yet. I'll do that tonight.
If I encounter problems, I will post an update tomorrow. Otherwise, this question can be closed in 48 hours.
Update: It worked without a hitch and completed the 297 GB conversion in around 13 hours. I don't think it can be optimized to accomplish the task much faster; the vast majority of computing time is spent on opening the SAS files, which I don't think can be done faster by other means than haven. Unless someone has an idea to optimize the process, this question can be closed.

R - Creating subsets of several datasets in a loop

I have quite a large number of quite heavy datasets. I would like to extract a subset from each of them and save it into a different csv file (one for each dataset). These are the commands I would like to loop over for all the files I have in the folder:
df <-read.csv("1985.csv",header=FALSE,stringsAsFactors=TRUE,sep="\t")
df_short <- df[df$V6=="OPP", ]
write.csv(df_short, file = "OPP_1985.csv",row.names=FALSE)
rm(df)
rm(df_short)
This is probably a very noob question, but I am struggling to understand how to do it, so I would really appreciate help with this!
EDIT:
Following #SimonShine's suggestion, I have run this code and it works!
You don't specify whether you are trying to collect the subsets into one dataset or trying to make one file per subset. You refer to OPP_1985, which appears to be out of scope for the code you wrote. Did you mean to refer to df_short?
You could start by abstracting what you want to do with one datafile into a function, e.g.:
extract_and_save_from_dataset <- function(csvfile) {
  df <- read.csv(csvfile, header=FALSE, stringsAsFactors=TRUE, sep="\t")
  df_short <- df[df$V6 == "OPP", ]
  csvfile_short <- gsub(".csv", "_short.csv", csvfile)
  write.csv(df_short, file=csvfile_short, row.names=FALSE)
}
Assuming you have a collection of dataset filenames, you could apply this function multiple times:
# csvfiles <- c("OPP_1985.csv", "OPP_1986.csv", ...)
csvfiles <- list.files("/path/to/my/csvfiles")
for (csvfile in csvfiles) {
extract_and_save_from_dataset(csvfile)
}
The data.table approach is probably the fastest option, especially if you have a large dataset. The fwrite function from data.table writes in parallel using multiple CPUs, making it extremely fast.
Here is how you can divide your original data according to subgroups defined based on the values of df$V6 and save each subset into a separate .csv file.
library(data.table)
setDT(df)[, fwrite(.SD, paste0("output_", V6, ".csv")), by = V6, .SDcols = names(df)]
P.S. The file names will be output_*.csv, where * is the corresponding V6 value.

as.h2o() in R to upload files to h2o environment takes a long time

I am using h2o to carry out some modelling. Having tuned the model, I would now like to use it to carry out a very large number of predictions, approximately 6 billion rows, with 80 columns of data needed per prediction row.
I have already broken the input dataset down into about 500 chunks of roughly 12 million rows, each with the relevant 80 columns of data.
However, uploading a data.table that is 12 million rows by 80 columns to h2o takes quite a long time, and doing it 500 times is taking a prohibitively long time... I think it's because the object is parsed first before it is uploaded.
The prediction part is relatively quick in comparison....
Are there any suggestions to speed this part up? Would changing the number of cores help?
Below is a reproducible example of the issue...
# Load libraries
library(h2o)
library(data.table)
# start up h2o using all cores...
localH2O = h2o.init(nthreads=-1,max_mem_size="16g")
# create a test input dataset
temp <- CJ(v1=seq(20),
           v2=seq(7),
           v3=seq(24),
           v4=seq(60),
           v5=seq(60))
temp <- do.call(cbind, lapply(seq(16), function(y){temp}))
colnames(temp) <- paste0('v', seq(80))
# this is the part that takes a long time!!
system.time(tmp.obj <- as.h2o(localH2O,temp,key='test_input'))
#|======================================================================| 100%
# user system elapsed
#357.355 6.751 391.048
Since you are running H2O locally, you want to save that data as a file and then use:
h2o.importFile(localH2O, file_path, key='test_input')
This will have each thread read their parts of the file in parallel. If you run H2O on a separate server, then you would need to copy the data to a location that the server can read from (most people don't set the servers to read from the file system on their laptops).
as.h2o() serially uploads the file to H2O. With h2o.importFile(), the H2O server finds the file and reads it in parallel.
It looks like you are using version 2 of H2O. The same commands will work in H2Ov3, but some of the parameter names have changed a little. The new parameter names are here: http://cran.r-project.org/web/packages/h2o/h2o.pdf
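For reference, a hedged sketch of the same import with the H2O v3 R API, where the connection object argument is dropped and key becomes destination_frame (the file path is a placeholder):
library(h2o)
h2o.init(nthreads = -1, max_mem_size = "16g")
# v3 style: no localH2O argument, destination_frame instead of key
tmp.obj <- h2o.importFile(path = "test_input.csv",
                          destination_frame = "test_input")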
Having also struggled with this problem, I did some tests and found that for objects in R memory (i.e. you don't have the luxury of already having them available in .csv or .txt form), by far the quickest way to load them (~21 x) is to use the fwrite function in data.table to write a csv to disk and read it using h2o.importFile.
The four approaches I tried:
1. Direct use of as.h2o()
2. Writing to disk using write.csv(), then loading with h2o.importFile()
3. Splitting the data in half, running as.h2o() on each half, then combining with h2o.rbind()
4. Writing to disk using fwrite() from data.table, then loading with h2o.importFile()
I performed the tests on a data.frame of varying size, and the results seem pretty clear.
The code, if anyone is interested in reproducing, is below.
library(h2o)
library(data.table)
h2o.init()
testdf <- as.data.frame(matrix(nrow=4000000, ncol=100))
testdf[1:1000000,] <- 1000 # R won't let me assign the whole thing at once
testdf[1000001:2000000,] <- 1000
testdf[2000001:3000000,] <- 1000
testdf[3000001:4000000,] <- 1000
resultsdf <- as.data.frame(matrix(nrow=20, ncol=5))
names(resultsdf) <- c("subset", "method 1 time", "method 2 time", "method 3 time", "method 4 time")
for(i in 1:20){
  subdf <- testdf[1:(200000*i),]
  resultsdf[i,1] <- 200000*i
  # 1: use as.h2o()
  start <- Sys.time()
  as.h2o(subdf)
  stop <- Sys.time()
  resultsdf[i,2] <- as.numeric(stop) - as.numeric(start)
  # 2: use write.csv then h2o.importFile()
  start <- Sys.time()
  write.csv(subdf, "hundredsandthousands.csv", row.names=FALSE)
  h2o.importFile("hundredsandthousands.csv")
  stop <- Sys.time()
  resultsdf[i,3] <- as.numeric(stop) - as.numeric(start)
  # 3: split dataset in half, load both halves, then merge
  start <- Sys.time()
  length_subdf <- dim(subdf)[1]
  h2o1 <- as.h2o(subdf[1:(length_subdf/2),])
  h2o2 <- as.h2o(subdf[(1+length_subdf/2):length_subdf,])
  h2o.rbind(h2o1, h2o2)
  stop <- Sys.time()
  resultsdf[i,4] <- as.numeric(stop) - as.numeric(start)
  # 4: use fwrite then h2o.importFile()
  start <- Sys.time()
  fwrite(subdf, file="hundredsandthousands.csv", row.names=FALSE)
  h2o.importFile("hundredsandthousands.csv")
  stop <- Sys.time()
  resultsdf[i,5] <- as.numeric(stop) - as.numeric(start)
  plot(resultsdf[,1], resultsdf[,2], xlim=c(0,4000000), ylim=c(0,900), xlab="rows", ylab="time/s",
       main="Scaling of different methods of h2o frame loading")
  for (j in 1:3){
    points(resultsdf[,1], resultsdf[,(j+2)], col=j+1)
  }
  legendtext <- c("as.h2o", "write.csv then h2o.importFile", "Split in half, as.h2o and rbind", "fwrite then h2o.importFile")
  legend("topleft", legend=legendtext, col=c(1,2,3,4), pch=1)
  print(resultsdf)
  flush.console()
}

R: use LaF (reads fixed column width data FAST) with SAScii (parses SAS dictionary for import instructions)

I'm trying to quickly read into R an ASCII fixed-column-width dataset, based on a SAS import file (the file that declares the column widths, etc.).
I know I can use the SAScii R package for translating the SAS import file (parse.SAScii) and actually importing (read.SAScii). It works, but it is too slow, because read.SAScii uses read.fwf to do the data import, which is slow. I would like to replace that with a fast import method, laf_open_fwf from the "LaF" package.
I'm almost there, using parse.SAScii() and laf_open_fwf(), but I'm not able to correctly connect the output of parse.SAScii() to the arguments of laf_open_fwf().
Here is the code; the data is from PNAD, the Brazilian national household survey, 2013:
# Set working dir.
setwd("C:/User/Desktop/folder")
# installing packages:
install.packages("SAScii")
install.packages("LaF")
library(SAScii)
library(LaF)
# Download and unzip data and documentation files
# Data
file_url <- "ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/2013/Dados.zip"
download.file(file_url,"Dados.zip", mode="wb")
unzip("Dados.zip")
# Documentation files
file_url <- "ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/2013/Dicionarios_e_input_20150814.zip"
download.file(file_url,"Dicionarios_e_input.zip", mode="wb")
unzip("Dicionarios_e_input.zip")
# importing with read.SAScii(), based on read.fwf(): Works fine
dom.pnad2013.teste1 <- read.SAScii("Dados/DOM2013.txt","Dicionarios_e_input/input DOM2013.txt")
# importing with parse.SAScii() and laf_open_fwf() : stuck here
dic_dom2013 <- parse.SAScii("Dicionarios_e_input/input DOM2013.txt")
head(dic_dom2013)
data <- laf_open_fwf("Dados/DOM2013.txt",
column_types=????? ,
column_widths=dic_dom2013[,"width"],
column_names=dic_dom2013[,"Varname"])
I'm stuck on the last command, passing the import arguments to laf_open_fwf().
UPDATE: here are two solutions, using packages LaF and readr.
Solution using readr (8 seconds)
readr is based on LaF but surprisingly faster. More info on readr here
# Load packages
library(SAScii)    # for parse.SAScii()
library(readr)
library(data.table)
# Parse SAS import file
dic_pes2013 <- parse.SAScii("./Dicionários e input/input PES2013.sas")
setDT(dic_pes2013) # convert to data.table
# Read to data frame
pesdata2 <- read_fwf("Dados/PES2013.txt",
                     fwf_widths(dput(dic_pes2013[, width]),
                                col_names = dput(dic_pes2013[, varname])),
                     progress = interactive())
Takeaway: readr seems to be the best option: it's faster, you don't need to worry about column types, the code is shorter, and it shows a progress bar :)
Solution using LaF (20 seconds)
LaF is one of the (maybe THE) fastest ways to read fixed-width files in R, according to this benchmark. It took me 20 sec. to read the person-level file (PES) into a data frame.
Here is the code:
# Parse SAS import file
dic_pes2013 <- parse.SAScii("./Dicionários e input/input PES2013.sas")
# Read .txt file using LaF. This is virtually instantaneous
pesdata <- laf_open_fwf("./Dados/PES2013.txt",
                        column_types = rep("character", length(dic_pes2013[, "width"])),
                        column_widths = dic_pes2013[, "width"],
                        column_names = dic_pes2013[, "varname"])
# Convert to data frame. This took me 20 sec.
system.time( pesdata <- pesdata[, ] )
Note that I've used character in column_types. I'm not quite sure why the command returns an error if I try integer or numeric. This shouldn't be a problem, since you can convert all columns to numeric like this:
# convert all columns whose names contain "V" to numeric
varposition <- grep("V", colnames(pesdata))
pesdata[varposition] <- sapply(pesdata[varposition], as.numeric)
sapply(pesdata, class)
You can try read.SAScii.sqlite, also by Anthony Damico. It's 4x faster and leads to no RAM issues (as the author himself describes). But it imports data into a self-contained SQLite database file (no SQL server needed) -- not into a data.frame. You can then open it in R using a DBI connection. Here is the GitHub address for the code:
https://github.com/ajdamico/usgsd/blob/master/SQLite/read.SAScii.sqlite.R
In the R console, you can just run:
source("https://raw.githubusercontent.com/ajdamico/usgsd/master/SQLite/read.SAScii.sqlite.R")
Its arguments are almost the same as those for the regular read.SAScii.
I know you are asking for a tip on how to use LaF. But I thought this could also be useful to you.
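To illustrate the DBI connection part mentioned above, here is a hedged sketch of reading the resulting SQLite file back into R; the database file name and table name are placeholders, use whatever you told read.SAScii.sqlite to create:
library(DBI)
library(RSQLite)
con <- dbConnect(RSQLite::SQLite(), "pnad2013.sqlite")     # placeholder file name
dbListTables(con)                                          # find the table name
dom <- dbGetQuery(con, "SELECT * FROM dom2013 LIMIT 10")   # placeholder table name
dbDisconnect(con)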
I think that the best choice is to use fwf2csv() from the descr package (C++ code). I will illustrate the procedure with PNAD 2013. Be aware that I'm assuming you already have the dictionary with 3 variables (beginning of the field, size of the field, variable name) and the data at Dados/.
library(bit64)
library(data.table)
library(descr)
library(reshape)
library(survey)
library(xlsx)
end_dom <- dicdom$beggining + dicdom$size - 1
fwf2csv(fwffile='Dados/DOM2013.txt', csvfile='dadosdom.csv', names=dicdom$variable, begin=dicdom$beggining, end=end_dom)
dadosdom <- fread(input='dadosdom.csv', sep='auto', sep2='auto', integer64='double')

Make function and apply to read data in R?

I have a set of data files (around 50,000 files, each about 1.5 MB). To load and process the data, I first used this code:
data <- list() # creates a list
listcsv <- dir(pattern = "*.txt") # creates the list of all the .txt files in the directory
Then I use a for loop to load each file:
for (k in 1:length(listcsv)){
  data[[k]] <- read.csv(listcsv[k], sep = "", as.is = TRUE, comment.char = "", skip = 37)
  my <- as.matrix(as.double(data[[k]][1:57600, 2]))
  # (intermediate steps computing ort_my from `my` are omitted in the question;
  #  see the readFiles() function in the answer below for the full calculation)
  print(ort_my)
  a[k] <- ort_my
  write(a, file = "D:/ddd/ads.txt", sep = '\t', ncolumns = 1)
}
I set the program running, but even after 6 hours it hadn't finished, although I have a decent PC with 32 GB RAM and a 6-core CPU.
I have searched the forum, and people say the fread function may be helpful. However, all the examples I have found so far deal with reading a single file with fread.
Can anyone suggest a faster way to loop over, read, and process data with this many rows and columns?
I am guessing there has to be a way to make the extraction of what you want more efficient, but I think running in parallel could save you a bunch of time, and save memory by not storing each file.
library("data.table")
#Create function you want to eventually loop through in parallel
readFiles <- function(x) {
data <- fread(x,skip=37)
my <- as.matrix(data[1:57600,2,with=F]);
mesh <- array(my, dim = c(120,60,8));
Ms<-1350*10^3 # A/m
asd2=(mesh[70:75,24:36 ,2])/Ms; # in A/m
ort_my<- mean(asd2);
return(ort_my)
}
#R Code to run functions in parallel
library(“foreach”);library(“parallel”);library(“doMC”)
detectCores() #This will tell you how many cores are available
registerDoMC(8) #Register the parallel backend
#Can change .combine from rbind to list
OutputList <- foreach(listcsv,.combine=rbind,.packages=c(”data.table”)) %dopar% (readFiles(x))
registerDoSEQ() #Very important to close out parallel backend.
