I have a large data frame containing about 4 million rows and 15 variables. I'm trying to implement a model selection algorithm that, at each step, adds the variable giving the largest increase in the R-squared of the lm model.
The following code snippet is where my function fails because of the data size. I tried biglm but still had no luck. I use mtcars here just to illustrate.
library(biglm)
library(dplyr)
data <- mtcars
y <- "mpg"
vars.model <- "cyl"
vars.remaining <- setdiff(names(data), c("mpg", "cyl"))
new.rsq <- sapply(vars.remaining,
                  function(x) {
                    vars.test <- paste(vars.model, x, sep = "+")
                    fit.sum <- biglm(as.formula(paste(y, vars.test, sep = "~")),
                                     data) %>% summary()
                    new.rsq <- fit.sum$rsq
                  })
new.rsq
I'm not sure how exactly R handles the memory here, but the biglm output for my 4 million rows of data is extremely small (6.6 KB). I don't understand how it accumulates to several GB when I wrap it in sapply. Any tips on how to optimise this are greatly appreciated.
Memory usage goes up because each call to biglm() makes a copy of the data in memory. Since sapply() is basically a for loop, using doMC (or doParallel) lets you go through the loop with a single copy of the data in memory. Here is one possibility:
EDIT: As #moho wu pointed out, parallel fitting helps, but not quite enough. Factors are more efficient than plain characters, so converting helps too. ff can help even more, as it keeps bigger data sets on disk rather than in memory. I updated the code below to make it a complete toy example using ff and doMC.
library(tidyverse)
library(pryr)
# toy data
df <- sample_n(mtcars, size = 1e7, replace = T)
df$A <- as.factor(letters[1:5])
# get objects / save on disk
all_vars <- names(df)
y <- "mpg"
vars.model <- "cyl"
vars.remaining <- all_vars[-c(1:2)]
save(y, vars.model, vars.remaining, file = "all_vars.RData")
readr::write_delim(df, path = "df.csv", delim = ";")
# close R session and start fresh
library(ff)
library(biglm)
library(doMC)
library(tidyverse)
# read flat file as "ff" ; also read variables
ff_df <- read.table.ffdf(file = "df.csv", sep = ";", header = TRUE)
load("all_vars.RData")
# prepare the "cluster"
nc <- 2 # number of cores to use
registerDoMC(cores = nc)
# make all formulas
fo <- paste0(y, "~", vars.model, "+", vars.remaining)
fo <- modify(fo, as.formula) %>%
  set_names(vars.remaining)

# fit models in parallel
all_rsq <- foreach(fo = fo) %dopar% {
  fit <- biglm(formula = fo, data = ff_df)
  new.rsq <- summary(fit)$rsq
}
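To close the loop on the original goal (finding which variable gives the biggest jump in R-squared), the result list can then be named and scanned; a small follow-up sketch, relying on foreach returning results in the same order as the formulas:

names(all_rsq) <- vars.remaining
unlist(all_rsq)                    # R-squared per candidate variable
names(which.max(unlist(all_rsq)))  # candidate with the largest R-squared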
The culprit in my problem is that I have a lot of character columns. My original script works fine after I change them all to factors:
data %>%
  mutate_if(is.character, as.factor)
#meriops' answer is also sound. Parallel processing might be something to consider if factorising your data frame doesn't solve the problem.
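As a rough illustration of why the factor conversion matters for memory, you can compare the size of a long character column with its factor equivalent; a minimal sketch, assuming the pryr package used above is available (exact numbers depend on your strings):

library(pryr)

x_chr <- sample(letters, 4e6, replace = TRUE)   # a long character column
x_fct <- as.factor(x_chr)                       # the same values as a factor

object_size(x_chr)   # character vector
object_size(x_fct)   # integer codes plus a small table of levels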
Summary: Despite a complicated lead-up, the solution was very simple: In order to plot a row of a dataframe as a line instead of a lattice, I needed to transpose the data in order to invert from x obs of y variables to y obs of x variables.
I am using RStudio on a Windows 10 computer.
I am using scientific equipment to write measurements to a csv file. Then I ZIP several files and read to R using read.csv. However, the data frame behaves strangely. Commands "length" and "dim" disagree and the "plot" function throws errors. Because I can create simulated data that doesn't throw the errors, I think the problem is either in how the machine wrote the data or in my loading and processing of the data.
Two ZIP files are located in my stackoverflow repository (with "Monterey Jack" in the name):
https://github.com/baprisbrey/stackoverflow
Here is my code for reading and processing them:
# Unzip the folders
unZIP <- function(folder){
  orig.directory <- getwd()
  setwd(folder)
  zipped.folders <- list.files(pattern = ".*zip")
  for (i in zipped.folders){
    unzip(i)
  }
  setwd(orig.directory)
}
folder <- "C:/Users/user/Documents/StackOverflow"
unZIP(folder)
# Load the data into a list of lists
library(stringr)   # needed for str_remove() below
pullData <- function(folder){
  orig.directory <- getwd()
  setwd(folder)
  #zipped.folders <- list.files(pattern = ".*zip")
  #unzipped.folders <- list.files(folder)[!(list.files(folder) %in% zipped.folders)]
  unzipped.folders <- list.dirs(folder)[-1] # Removing itself as the first directory.
  oData <- vector(mode = "list", length = length(unzipped.folders))
  names(oData) <- str_remove(unzipped.folders, paste(folder, "/", sep = ""))
  for (i in unzipped.folders) {
    filenames <- list.files(i, pattern = "*.csv")
    #setwd(paste(folder, i, sep="/"))
    setwd(i)
    files <- lapply(filenames, read.csv, skip = 5, header = TRUE, fileEncoding = "UTF-16LE") # Note unusual encoding
    oData[[str_remove(i, paste(folder, "/", sep = ""))]] <- vector(mode = "list", length = length(files))
    oData[[str_remove(i, paste(folder, "/", sep = ""))]] <- files
  }
  setwd(orig.directory)
  return(oData)
}
theData <- pullData(folder) #Load the data into a list of lists
# Process the data into frames
bigFrame <- function(bigList) {
  # where bigList is theData, the result of pullData
  # initialize the holding list of frames per set
  preList <- vector(mode = "list", length = length(bigList))
  names(preList) <- names(bigList)
  # process the data
  for (i in 1:length(bigList)){
    step1 <- lapply(bigList[[i]], t)  # transpose each data set
    step2 <- do.call(rbind, step1)    # roll it up into its own matrix
    # original error that wasn't reproduced: length(step2) was 24048 when i = 1
    # while dim(step2) was 48 501. Any comments on why?
    firstRow <- step2[1,]                          # hold onto the first row to become the names
    step3 <- as.data.frame(step2)                  # turn it into a frame
    step4 <- step3[grepl("µA", rownames(step3)),]  # get rid of all those excess name rows
    rownames(step4) <- 1:(nrow(step4))             # change the row names to row IDs
    colnames(step4) <- firstRow                    # change the column names to the first-row steps
    step4$ID <- rep(names(bigList[i]), nrow(step4))        # add an I.D. column
    step4$Class[grepl("pos", tolower(step4$ID))] <- "Yes"  # add "Yes" class
    step4$Class[grepl("neg", tolower(step4$ID))] <- "No"   # add "No" class
    preList[[i]] <- step4
  }
  # bigFrame <- do.call(rbind, preList)  # failed due to different numbers of measurements
  #                                      # (rows that become columns) across the data sets
  # return(bigFrame)
  return(preList)  # Works!
}
frameList <- bigFrame(theData)
monterey <- rbind(frameList[[1]],frameList[[2]])
# Odd behaviors
dim(monterey) #48 503
length(monterey) #503 #This is not reproducing my original error of length = 24048
rowOne <- monterey[1,1:(ncol(monterey)-2)]
plot(rowOne) #Error in plot.new() : figure margins too large
#describe the data
quantile(rowOne, seq(0, 1, length.out = 11) )
quantile(rowOne, seq(0, 1, length.out = 11) ) %>% plot #produces undesired lattice plot
# simulate the data
doppelganger <- sample(1:20461,501,replace = TRUE)
names(doppelganger) <- names(rowOne)
# describe the data
plot(doppelganger) #Successful scatterplot. (With my non-random data, I want a line where the numbers in colnames are along the x-axis)
quantile(doppelganger, seq(0, 1, length.out = 11) ) #the random distribution is mildly different
quantile(doppelganger, seq(0, 1, length.out = 11) ) %>% plot # a simple line of dots as desired
# investigating structure
str(rowOne) # results in a dataframe of 1 observation of 501 variables. This is a correct interpretation.
str(as.data.frame(doppelganger)) # results in 501 observations of 1 variable. This is not a correct interpretation but creates the plot that I want.
How do I convert rowOne so that it plots like doppelganger?
It looks like one of my errors is not reproducing, where calls to "dim" and "length" apparently disagree.
However, I'm confused as to why the "plot" function is producing a lattice plot on my processed data and a line of dots on my simulated data.
What I would like is to plot each row of data as a line. (Next, and out of the scope of this question, is I would like to classify the data with adaboost. My concern is that if "plot" behaves strangely then the classifier won't work.)
Any tips or suggestions or explanations or advice would be greatly appreciated.
Edit: Investigating the structure with ("str") of the two examples explains the difference between plots. I guess my modified question is, how do I switch between the two structures to enable plotting a line (like doppelganger) instead of a lattice (like rowOne)?
I am answering my own question.
I am leaving aside the part about the discrepancy between "length" and "dim", since I can't provide a reproducible example. However, I'm happy to leave it up for comment.
The answer is that in order to produce my plot, I simply have to transpose the row as follows:
rowOne %>% t() %>% as.data.frame() %>% plot
This inverts the structure from one observation of 501 variables to 501 obs of one variable as follows:
rowOne %>% t() %>% as.data.frame() %>% str()
#'data.frame': 501 obs. of 1 variable:
# $ 1: num 8712 8712 8712 8712 8712 ...
Because of the unusual encoding I used, and the strange "length" result, I failed to see a simple solution to my "plot" problem.
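As a follow-on, since the eventual goal is to plot every row as its own line, one option is to transpose all of the measurement columns at once and hand them to matplot; a sketch, assuming the measurement columns are numeric and that their column names hold the x-axis step values:

meas   <- monterey[, 1:(ncol(monterey) - 2)]   # drop the ID / Class columns
x_vals <- as.numeric(colnames(meas))           # step values stored in the column names

# one line per original row of the data frame
matplot(x_vals, t(as.matrix(meas)), type = "l", lty = 1,
        xlab = "step", ylab = "value")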
I won't pretend that this code is even remotely optimal, but here is the problem I have. I have a list of files with multiple columns read in with sapply(), such that if I call file.list[[1]] I get a summary of that data.frame, and summary(file.list) is a list of files.
I am fitting curves to the data using the mgcv package as follows:
gam_data <- function(curves) {
  out <- gam(curves[, 15] ~ s(curves[, 23]))
  pd <- plot(out)
  return(pd)
}
out <- lapply(file.list, gam_data)

split_curves <- function(splitting) {
  pd_2 <- c(splitting[[1]]$fit)
  pd_3 <- c(splitting[[1]]$x)
  pd_4 <- c(splitting[[1]]$se)
  curveg <- cbind(pd_2, pd_3, pd_4)
  colnames(curveg) <- c("fitted", "sphro", "se")
  return(curveg)
}
out2 <- lapply(out, split_curves)
The first block performs the gam fit and the second extracts the fitted curve. However, after all of that, the original information in file.list, such as replicate, genotype, etc., is lost, and the data frames are no longer the same length. This is probably a trivial question, but how does one retain that information through processing? I'm applying this to hundreds of data frames, so I cannot just manually recreate the columns.
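One pattern that might help, sketched here under the assumption that file.list is named by file and that identifiers such as replicate or genotype can be recovered from those names (or looked up per file): iterate over the names rather than the elements, and bind the identifier onto each extracted curve before combining.

out2 <- lapply(names(file.list), function(nm) {
  pd     <- gam_data(file.list[[nm]])   # fit and plot, as above
  curveg <- split_curves(pd)            # extract fitted values, x, and se
  data.frame(file = nm, curveg)         # carry the identifier along
})
combined <- do.call(rbind, out2)        # one long frame, keyed by 'file'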
I'm using the ff and ffbase libraries to manage a big csv file (~40 GB and 275e6 observations). I'd like to split/partition this file according to one of its columns (a factor column).
With a normal data frame, I would do something like that:
a <- data.frame(rnorm(10000, 0, 1),
                sample(1:100, 10000, replace = T),
                sample(letters, 10000, replace = T))
names(a) <- c('V1', 'V2', 'V3')
a_partition <- split(a, a$V3)
names(a_partition) <- paste("df", names(a_partition), sep = "_")
list2env(a_partition, globalenv())
but ff and ffbase don't have a split function. So, looking in the ffbase documentation, I found ffdfdply and tried to use it as follows:
ffa <- as.ffdf(a)
ffa_partition <- ffdfdply(x = ffa, split = ffa$V3)
Alas, I get the following log message:
calculating split sizes
building up split locations
working on split 1/1, extracting data in RAM of 26 split elements,
totalling, 0.00015 GB, while max specified
data specified using BATCHBYTES is 0.01999 GB
... applying FUN to selected data
Error: argument "FUN" is missing, with no default
I tried FUN = as.data.frame (since the result of the function must be a data frame), but with no luck: doing so makes ffa_partition a copy of ffa...
How can I partition my ffdf?
Two years late, but I believe this does what you needed:
result_list <- list()
for(letter in letters){
  result_list[[letter]] <- subset(ffa, V3 == letter)
}
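For reference, the "argument "FUN" is missing" error arises because ffdfdply() is a group-wise apply rather than a split: FUN receives an in-memory data.frame holding the rows for one or more split values, must return a data.frame, and the pieces are rbind-ed back into a single ffdf. A sketch of that usage, based on my reading of the ffbase documentation (so verify the details locally):

library(ffbase)

group_stats <- ffdfdply(x = ffa, split = ffa$V3,
                        FUN = function(d) {
                          # several split levels can arrive in the same chunk,
                          # so aggregate within FUN rather than assume one level
                          aggregate(V1 ~ V3, data = d, FUN = mean)
                        })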
I need to read hundreds of .bil files (reproducible example):
d19810101 <- data.frame(ID=c(1:10),year=rep(1981,10),month=rep(1,10),day=rep(1,10),value=c(11:20))
d19810102 <- data.frame(ID=c(1:10),year=rep(1981,10),month=rep(1,10),day=rep(2,10),value=c(12:21))
d19820101 <- data.frame(ID=c(1:10),year=rep(1982,10),month=rep(1,10),day=rep(1,10),value=c(13:22))
d19820102 <- data.frame(ID=c(1:10),year=rep(1982,10),month=rep(1,10),day=rep(2,10),value=c(14:23))
The code I wrote works fine when testing on a small number of files, but when I tried to run it on the entire set it became extremely slow. Please let me know if there is any way I can improve it. What I need to do is simply get the average of 33 years of daily data. Here is the code for testing on a small number of files:
years <- c(1981:1982)
days <- substr(as.numeric(format(seq(as.Date("1981/1/1"), as.Date("1981/1/2"), "day"), '%Y%m%d')),5,8)
X_Y <- NULL
for (j in days) {
  for (i in years) {
    XYi <- read.table(paste(i, substr(j, 1, 2), substr(j, 3, 4), ".csv", sep = ''),
                      header = T, sep = ",", stringsAsFactors = F)
    X_Y <- rbind(X_Y, XYi)
    cat(paste("Data in ", i, j, " are processing now.", sep = ""), "\n")
  }
  library(plyr)
  X_Y1 <- ddply(X_Y, .(ID, month, day), summarize, mean(value, na.rm = T))
  cat(paste("Data in ", i, j, " are processing now.", sep = ""), "\n")
}
EDIT:
Thank you for all your help! I tried putting the files in a list to read, but since these are .bil files that need their raster characteristics extracted, I got an error; that's why I need to read them one by one. Sorry for not making that clear earlier.
Read.files <- function(file.names, sep = ",") {
  library(raster)
  ldply(file.names, function(fn) data.frame(Filename = fn, layer <- raster(fn, sep = ",")))
}
data1 <- Read.files(paste("filenames here",days,".bil",sep=''), sep=",")
"Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class 'structure("RasterLayer", package = "raster")' into a data.frame.
EDIT 2:
The structure of my data is actually the same as the example data, except that my data is grid data and needs to be extracted (using raster functions instead of read.csv) and then put into a data frame, so I need to do the following steps:
for (i in days) {
  layer <- raster(paste("filename here", i, ".bil", sep = ''))
  projection <- projection(layer)
  cellsize <- res(layer)[1]
  ...
  s <- resample(layer, r, method = 'ngb')
  XY <- data.frame(rasterToPoints(s))
  names(XY) <- c('Long', 'Lat', 'Data')
}
It's hard to tell exactly how you are managing file IO, but I think an easier way to achieve this would be to read the files in, put them into one data.frame (e.g. using rbind()), and then get the summary statistics you need via tapply():
data <- do.call(rbind, mget(ls(pattern = "d[0-9]*"))) # combine data
with(data, tapply(value, list(month, day), mean)) # get mean for each month and day combination
This assumes you have already read in all of the files, to objects named as in your example.
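If the files really do have to be opened one at a time, note that the main cost in the original loop is growing X_Y with rbind() on every iteration, which re-copies everything already read. A sketch of the usual workaround, with the file path and the per-file reader as placeholders (for the real .bil files the reader would be the raster() / rasterToPoints() steps from EDIT 2):

library(plyr)

# placeholder reader: swap in the raster-based extraction for .bil files
read_one <- function(f) {
  read.table(f, header = TRUE, sep = ",", stringsAsFactors = FALSE)
}

files    <- list.files("path/to/files", full.names = TRUE)  # placeholder path
per_file <- lapply(files, read_one)    # read everything into a list first
X_Y      <- do.call(rbind, per_file)   # combine once, instead of rbind in the loop

X_Y1 <- ddply(X_Y, .(ID, month, day), summarize,
              mean_value = mean(value, na.rm = TRUE))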
I have a fitted model that I'd like to apply to score a new dataset stored as a CSV. Unfortunately, the new data set is fairly large, and the predict procedure runs out of memory if I do it all at once. So, I'd like to convert the procedure below, which worked fine for small sets, into a batch mode that processes 500 lines at a time and writes out a file for each scored batch of 500.
I understand from this answer (What is a good way to read line-by-line in R?) that I can use readLines for this. So, I'd be converting from:
trainingdata <- as.data.frame(read.csv('in.csv'), stringsAsFactors=F)
fit <- mymodel(Y~., data=trainingdata)
newdata <- as.data.frame(read.csv('newstuff.csv'), stringsAsFactors=F)
preds <- predict(fit,newdata)
write.csv(preds, file=filename)
to something like:
trainingdata <- as.data.frame(read.csv('in.csv'), stringsAsFactors=F)
fit <- mymodel(Y~., data=trainingdata)
con <- file("newstuff.csv", open = "r")
i = 0
while (length(mylines <- readLines(con, n = 500, warn = FALSE)) > 0) {
  i = i + 1
  newdata <- as.data.frame(mylines, stringsAsFactors = F)
  preds <- predict(fit, newdata)
  write.csv(preds, file = paste(filename, i, '.csv', sep = ''))
}
close(con)
However, when I print the mylines object inside the loop, it isn't parsed the way read.csv output is: the headers are still mixed in with the data, and the splitting of each line into columns that read.csv does under the hood isn't happening.
Whenever I find myself writing barbaric things like cutting the first row and wrapping the columns, I generally suspect R has a better way to do things. Any suggestions for how I can get a read.csv-like output from a readLines csv connection?
You can read your data into memory in chunks using read.csv with the skip and nrows arguments. In pseudo-code:
read_chunk = function(start, n) {
  read.csv(file, skip = start, nrows = n)
}

start_indices = (0:no_chunks) * chunk_size + 1

lapply(start_indices, function(x) {
  dat = read_chunk(x, chunk_size)
  pred = predict(fit, dat)
  write.csv(pred)
})
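One practical detail with this approach: once skip moves past the first line, read.csv's default header = TRUE will treat a data row as the header, so it helps to read the column names once and pass them explicitly to every chunk, e.g.:

# read just the header once, then parse every chunk the same way
header <- names(read.csv("newstuff.csv", nrows = 1))
read_chunk = function(start, n) {
  read.csv("newstuff.csv", skip = start, nrows = n,
           header = FALSE, col.names = header)
}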
Alternatively, you could put the data into an SQLite database and use the RSQLite package to query the data in chunks. See also this answer, or do some digging with [r] large csv on SO.
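If you go the database route, here is a minimal sketch with DBI and RSQLite, reusing the fit and csv from the question (the one-time dbWriteTable load still reads the csv into memory once, so adapt that step if even a single read is too big):

library(DBI)

con <- dbConnect(RSQLite::SQLite(), "newstuff.db")
dbWriteTable(con, "newdata", read.csv("newstuff.csv"), overwrite = TRUE)

# stream the table back in 500-row chunks and score each one
res <- dbSendQuery(con, "SELECT * FROM newdata")
i <- 0
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 500)
  if (nrow(chunk) == 0) break   # guard against a final empty fetch
  i <- i + 1
  preds <- predict(fit, chunk)
  write.csv(preds, file = paste0("preds_", i, ".csv"))
}
dbClearResult(res)
dbDisconnect(con)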