Splitting an ffdf object

I'm using the ff and ffbase packages to manage a big CSV file (~40 GB, 275e6 observations). I'd like to split/partition this file according to one of its columns (a factor column).
With a normal data frame, I would do something like this:
a <- data.frame(rnorm(10000, 0, 1),
                sample(1:100, 10000, replace = TRUE),
                sample(letters, 10000, replace = TRUE))
names(a) <- c('V1', 'V2', 'V3')
a_partition <- split(a, a$V3)
names(a_partition) <- paste("df", names(a_partition), sep = "_")
list2env(a_partition, globalenv())
but ff and ffbase don't have a split function. So, looking in the ffbase documentation, I found ffdfdply and tried to use it as follows:
ffa <- as.ffdf(a)
ffa_partition <- ffdfdply(x = ffa, split = ffa$V3)
Alas, I get this log message:
calculating split sizes
building up split locations
working on split 1/1, extracting data in RAM of 26 split elements, totalling, 0.00015 GB, while max specified data specified using BATCHBYTES is 0.01999 GB
... applying FUN to selected data
Error: argument "FUN" is missing, with no default
I tried FUN = as.data.frame (since the result of the function must be a data frame) with no luck: doing so makes ffa_partition a copy of ffa...
How can I partition my ffdf?

Two years late, but I believe this does what you needed:
result_list <- list()
for (letter in letters) {
  result_list[[letter]] <- subset(ffa, V3 == letter)
}
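If you also want the partitions named and dropped into the global environment, as in the data.frame version from the question, here is a small follow-up sketch (assuming result_list from the loop above):
# hedged follow-up: mirror the naming scheme used in the question
names(result_list) <- paste("df", names(result_list), sep = "_")
list2env(result_list, globalenv())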

Related

RMSE on dataframe of multiple files in R

My goal is to read many files into R and, ultimately, run a root mean square error (RMSE) function on each pair of columns within each file.
I have this code:
# Packages used below
library(readxl)      # read_excel()
library(data.table)  # rbindlist()
library(Metrics)     # rmse()
# This gathers the file names
filnames <- dir("~/Desktop/LGsampleHUCsWgraphs/testRSMEs", pattern = "_45Fall_")
# This reads each file
read_data <- function(z) {
  dat <- read_excel(z, skip = 0)
  return(dat)
}
# This combines them into one list and splits them by the names in the first column
datalist <- lapply(filnames, read_data)
bigdata <- rbindlist(datalist, use.names = TRUE)
splitByHUCs <- split(bigdata, f = bigdata$HUC...1, sep = "\n", lex.order = TRUE)
So far, all is working well. Now I want to apply an rmse [library(Metrics)] analysis on each of the "splits" created above, but I don't know what to call the "splits". Here I have used names, but that is the name of a base R function and won't work. I tried the bigdata object, but that didn't work either. I also tried to use splitByHUCs and rMSEs.
rMSEs <- sapply(splitByHUCs, function(x) rmse(names$Predicted, names$Actual))
write.csv(rMSEs, file = "~/Desktop/testRMSEs.csv")
The rmse code works fine when I run it on a single file and create a name for the dataframe:
read_excel("bcc1_45Fall_1010002.xlsm")
bcc1F1010002 <- read_excel("bcc1_45Fall_1010002.xlsm")
rmse(bcc1F1010002$Predicted, bcc1F1010002$Actual)
The "splits" are named by the "splitByHUCs" script, like this:
They are named for the file they came from, appropriately. I need some kind of reference name for the rmse formula and I don't know what it would be. Any ideas? Thanks. I made some small versions of the files, but I don't know how to add them here.
As it is a list, we can loop over it with sapply/lapply as in the OP's code, but the names$ is incorrect: the anonymous function's argument is x, which stands for each element of the list (i.e. a data.frame). Therefore, instead of names$, use x$:
sapply(splitByHUCs, function(x) rmse(x$Predicted, x$Actual))
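The result is a named numeric vector, one RMSE per HUC, so the names from the split carry through to the CSV written in the question. A small usage sketch:
# hedged usage sketch: names(rMSEs) are the HUC names from the split
rMSEs <- sapply(splitByHUCs, function(x) rmse(x$Predicted, x$Actual))
write.csv(data.frame(HUC = names(rMSEs), RMSE = rMSEs),
          file = "~/Desktop/testRMSEs.csv", row.names = FALSE)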

How to Loop over Rasters to convert them to data.frames

I have some rasters that I would like to transform to data frames. I can do it manually one by one, but it is inefficient. When I try to make a loop (using a list or a vector of names), the code doesn't work and R's error says: Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class ‘structure("RasterLayer", package = "raster")’ to a data.frame
I have tried to do it using the function assign(), but it doesn't work either. When using a vector of names, I can only get R to make a data frame of one single observation containing the name of the vector.
When I do it one by one, R actually does what I want. My code for one raster is just:
#"a" is the name of the raster
r_1 <- as.data.frame(a, xy=TRUE, na.rm=TRUE, centroids=TRUE)
I have tried several things to make a loop, but all have failed. First, I tried creating a vector and looping with the function assign():
# "a" and "b" are the names of my rasters
o2 <- c("a","b")
for(i in 1:length(o2)){
nam <- substr(o2[i],1,nchar(o2))
assign(nam,as.data.frame(o2[i], xy=TRUE, na.rm=TRUE, centroids=TRUE))
}
But this only creates a dataframe named a1 with one observation "a1" and one variable.
I have tried to make a list, too:
o4 <- list(a, b)
for (i in 1:length(o4)) {
  nam <- substr(o4[i], 1, nchar(o4[i]))
  r_i <- as.data.frame(o4[i], xy = TRUE, na.rm = TRUE, centroids = TRUE)
}
The error this time says: Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class ‘structure("RasterLayer", package = "raster")’ to a data.frame
I expect to have a data frame with three columns and as many rows as cells in my raster. The columns should be the latitude and longitude of the centroid of each cell, plus a column with the information in each cell. I don't see any mistake in my code; maybe someone can help me.
I created the rasters myself using different shapefiles. I have more than 40 rasters with the following characteristics: width: 8806, height: 10389, origin: -77.6699, 4.94778, pixel size: 0.001041666, CRS: EPSG:4326 - WGS 84 - geographic. As I said, I created the rasters myself and all of them have those same characteristics.
When asking a question like this, always include some example data (normally not your data). Here we use three (identical) raster files:
f <- system.file("external/test.grd", package="raster")
ff <- c(f,f,f)
Now use lists to accomplish what you want.
r <- lapply(ff, raster)
x <- lapply(r, function(i) as.data.frame(i, xy=TRUE, na.rm=TRUE))
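Each element of x is now a plain data frame with x and y (the cell-centre coordinates) plus the cell values, which are exactly the three columns the question asks for. A hedged usage sketch, reusing ff from above:
# hedged usage sketch: carry the file names over and inspect one result
names(x) <- basename(ff)
str(x[[1]])  # columns: x, y, and the layer's values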
Never use assign().
Instead of a loop you can use lapply:
s <- c(raster1, raster2, raster3)
lapply(s, as.data.frame)

R - Error in colMeans(wind.speed, na.rm = T) : 'x' must be numeric

I am trying to import a single column from a set of text files, where each file is a single day of data. I want to take the mean of each day's wind speed. Here is the code I have written for that:
daily.wind.speed <- c()
file.names <- dir("C:\\Users\\User Name\\Desktop\\R project\\Sightings Data\\Weather Data", pattern = ".txt")
for (i in 1:length(file.names)) {
  ## import data file into data frame
  weather.data <- read.delim(file.names[i])
  ## extract wind speed column
  wind.speed <- weather.data[3]
  ## attempt to fix numeric error
  ## wind.speed.num <- as.numeric(wind.speed)
  ## take the column mean of wind speed
  daily.avg <- colMeans(wind.speed, na.rm = TRUE)
  ## add daily average to the results
  daily.wind.speed <- c(daily.wind.speed, daily.avg)
  ## print for troubleshooting and progress
  print(daily.wind.speed)
}
This code seems to work on some files in my data set, but others give me this error during this section of the code:
> daily.avg <- colMeans(wind.speed,na.rm=T)
Error in colMeans(wind.speed, na.rm = T) : 'x' must be numeric
I am also having trouble converting these values to numeric, and am looking for options to either convert my data to numeric or to take the mean in a different way that doesn't encounter this issue.
> as.numeric(wind.speed.df)
Error: (list) object cannot be coerced to type 'double'
(screenshot of an example weather.data file omitted)
Even though this is not a reproducible example, the problem is that you are applying a matrix function to a vector, so it won't work. Just change colMeans to mean.
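A minimal sketch of that fix, assuming (as in the loop above) that the third column holds the wind speed; [[ extracts the column as a plain vector, and as.numeric() guards against it having been read as character:
## extract the column as a vector, coercing to numeric just in case
wind.speed <- as.numeric(weather.data[[3]])
## mean() on a vector replaces colMeans() on a data frame
daily.avg <- mean(wind.speed, na.rm = TRUE)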

Memory issue with lm

I have a large data frame containing about 4 million rows and 15 variables. I'm trying to implement a model selection algorithm that, at each step, adds the variable giving the largest increase in the lm model's r-squared.
The following code snippet is where my function fails due to the large data size. I tried biglm but still no luck. I use mtcars as an example here just to illustrate.
library(biglm)
library(dplyr)

data <- mtcars
y <- "mpg"
vars.model <- "cyl"
vars.remaining <- setdiff(names(data), c("mpg", "cyl"))

new.rsq <- sapply(vars.remaining, function(x) {
  vars.test <- paste(vars.model, x, sep = "+")
  fit.sum <- biglm(as.formula(paste(y, vars.test, sep = "~")), data) %>%
    summary()
  fit.sum$rsq
})
new.rsq
I'm not sure how exactly R handles the memory here, but the biglm output for my 4 million rows of data is extremely small (6.6 KB). I don't know how it accumulates to several GB when I wrap it in sapply. Any tips on how to optimise this are greatly appreciated.
Memory usage goes up because each call to biglm() makes a copy of the data in memory. Since sapply() is basically a for loop, using doMC (or doParallel) allows going through the loop with a single copy of the data in memory. Here is one possibility:
EDIT: As #moho wu pointed out, parallel fitting helps, but not quite enough. Factors are more efficient than plain characters, so that helps too. Then ff can help even more, as it keeps bigger data sets on disk rather than in memory. I updated the code below to make it a complete toy example using ff and doMC.
library(tidyverse)
library(pryr)

# toy data
df <- sample_n(mtcars, size = 1e7, replace = TRUE)
df$A <- as.factor(letters[1:5])

# get objects / save on disk
all_vars <- names(df)
y <- "mpg"
vars.model <- "cyl"
vars.remaining <- all_vars[-c(1:2)]
save(y, vars.model, vars.remaining, file = "all_vars.RData")
readr::write_delim(df, path = "df.csv", delim = ";")
# close R session and start fresh
library(ff)
library(biglm)
library(doMC)
library(tidyverse)

# read the flat file as "ff"; also read the saved variables
ff_df <- read.table.ffdf(file = "df.csv", sep = ";", header = TRUE)
load("all_vars.RData")

# prepare the "cluster"
nc <- 2  # number of cores to use
registerDoMC(cores = nc)

# make all the formulas
fo <- paste0(y, "~", vars.model, "+", vars.remaining)
fo <- modify(fo, as.formula) %>%
  set_names(vars.remaining)

# fit the models in parallel
all_rsq <- foreach(fo = fo) %dopar% {
  fit <- biglm(formula = fo, data = ff_df)
  summary(fit)$rsq
}
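all_rsq comes back as a list in the same order as vars.remaining, so the selection step described in the question (pick the variable giving the biggest r-squared) could look like this sketch:
# hedged follow-up: name the results and pick the best candidate variable
names(all_rsq) <- vars.remaining
best_var <- names(which.max(unlist(all_rsq)))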
The culprit in my problem was that I have a lot of character columns. It works fine after I change them all to factors, using my original script:
data <- data %>%
  mutate_if(is.character, as.factor)
#meriops' answer is also sound. Parallel processing might be something to consider if factorising your data frame doesn't solve the problem.

Prevent a numeric from being converted into a factor

I'm building a table from a CSV file. When the file is initially loaded, I need to load everything as character:
datset <- read.csv("outcome-of-care-measures.csv", colClasses = "character")
I have a function to convert a factor containing numbers (from another Stack Overflow question):
as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}
I clean up the file with:
i <- 17
datset[datset == "Not Available"] <- NA
datset <- datset[complete.cases(datset[, i]), ]
x <- as.numeric.factor(datset[, i])
The datset table contains lots of columns I don't need, so I build a new table:
dat <- data.frame(cbind("HospitalName" = datset[,2], "State" = datset[,7], "Rating" = x))
My problem is that even though x is numeric, it gets turned into a factor when loaded into the data frame. I can verify this from debug mode with:
class(x)
[1] "numeric"
class(dat[,3])
[1] "factor"
In later code I'm trying to sort the Rating column, but it's failing due to it being a factor, I guess.
I've even tried appending stringsAsFactors = FALSE to read.csv but this has no effect.
How can I prevent x from being converted into a factor when loading to a DF?
As Henrik explained in his comment, this:
dat <- data.frame(cbind("HospitalName" = datset[,2], "State" = datset[,7], "Rating" = x))
is a poor way to construct a data frame. cbind converts everything to a matrix, which can only hold a single data type. Hence the coercion.
It would be better to do:
dat <- data.frame(HospitalName = datset[,2], State = datset[,7], Rating = x)
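Note that on R versions before 4.0, data.frame() itself defaults to stringsAsFactors = TRUE; the conversion happened in data.frame(), not read.csv, which is why adding stringsAsFactors = FALSE to read.csv had no effect. A hedged variant for pre-4.0 R:
# keep the character columns as character on pre-4.0 R
dat <- data.frame(HospitalName = datset[,2], State = datset[,7],
                  Rating = x, stringsAsFactors = FALSE)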
However, it is also true, as Roland mentioned, that you should be able to specify this one column as numeric when reading the data in, via:
colclasses <- rep("character", 40)
colclasses[17] <- "numeric"
and then passing that to read.csv.
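Putting the two pieces together, a sketch of the full read step; the 40-column count is carried over from the answer above, and na.strings maps the question's "Not Available" entries to NA at read time so the numeric column parses cleanly:
colclasses <- rep("character", 40)
colclasses[17] <- "numeric"  # the Rating column (i <- 17 in the question)
datset <- read.csv("outcome-of-care-measures.csv",
                   colClasses = colclasses,
                   na.strings = "Not Available")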
