lapply to sum columns of many data frames - r

I have a large number of CSV files with x, y, value and cluster columns in one folder. I want to use lapply() to extract the value and cluster columns from each file and end up with one data frame holding the sums of both columns across all files. What is the best way to do this?

Do you mean something like below?
aggregate(
cbind(value, cluster) ~ .,
do.call(rbind, lapply(list.files(pattern = "\\.csv$"), read.csv)),
sum
)
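If the goal is only the column totals, a base R sketch along the following lines may help. The temporary folder and the two tiny demo files are made up for illustration; only the column names come from the question.

```r
# Hypothetical demo data: two small CSV files in a temporary folder
demo_dir <- file.path(tempdir(), "csv_sum_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(x = 1:2, y = 1:2, value = c(1, 2), cluster = c(1, 1)),
          file.path(demo_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(x = 1:2, y = 1:2, value = c(10, 20), cluster = c(2, 3)),
          file.path(demo_dir, "b.csv"), row.names = FALSE)

# pattern is a regular expression, so escape the dot and anchor the end
csv_files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# One row of column sums per file
per_file <- do.call(rbind, lapply(csv_files, function(f) {
  d <- read.csv(f)
  data.frame(file = basename(f),
             value = sum(d$value, na.rm = TRUE),
             cluster = sum(d$cluster, na.rm = TRUE))
}))

# Grand totals over all files
totals <- colSums(per_file[, c("value", "cluster")])
```

per_file keeps one row per file, so it is easy to check a single file before trusting the grand totals.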

An option with tidyverse is to read the CSV files with read_csv from readr, row-bind them (the _dfr suffix in map_dfr), group by the 'x' and 'y' columns, and take the sum of the numeric columns:
library(purrr)
library(readr)
library(dplyr)
files <- list.files(pattern = "\\.csv$")
map_dfr(files, read_csv) %>%
group_by(x, y) %>%
summarise(across(where(is.numeric), sum, na.rm = TRUE))
If we want to do this in parallel, use future.apply
library(future.apply)
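# multiprocess is deprecated in the future package; multisession is its replacement
future::plan(future::multisession, workers = length(files))
options(future.globals.maxSize= +Inf)
# future_lapply takes the data first and the function second
out <- future.apply::future_lapply(files, read_csv)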
future::plan(sequential)
bind_rows(out) %>%
group_by(x, y) %>%
summarise(across(where(is.numeric), sum, na.rm = TRUE))
Or make use of parallel together with foreach:
library(foreach)  # provides foreach() and %dopar%
ncores <- min(parallel::detectCores(), length(files))
cl <- parallel::makeCluster(ncores, type = "SOCK")
doSNOW::registerDoSNOW(cl)
out2 <- foreach(i = seq_along(files),
                .packages = c("data.table")) %dopar% {
  fread(files[i])
}
parallel::stopCluster(cl)
library(data.table)
rbindlist(out2)[, lapply(.SD, sum, na.rm = TRUE), .(x, y)]

Related

Using lapply and an anonymous function in a list of data frames

In a list of data frames listdf, I want to determine the mean, max, min, standard deviation and the number of rows (number of values) for the column Concentration in each data frame:
mean <- lapply(listdf, function(x) {mean(x$Concentration, na.rm = F)})
max <- lapply(listdf, function(x) {max(x$Concentration, na.rm = F)})
min <- lapply(listdf, function(x) {min(x$Concentration, na.rm = F)})
sd <- lapply(listdf, function(x) {sd(x$Concentration, na.rm = F)})
nbr <- lapply(listdf, function(x) {nrow(x$Concentration, na.rm = F)})
However, nrow does not work with lapply and a function. How can I modify it?
Also, is it possible to add (via lapply and a function, or a tibble?) a sixth list of results for listdf stating that the number of rejected data frames in listdf is "NA"? I know this because listdf contains only the data frames without rejected values, but I want to include this step in the script for further analysis.
nbr_rejected <- lapply(listdf, ??? "NA")
(nbr_rejected="NA")
You can use length:
a <- list(mtcars, mtcars)
lapply(a, function(x) {length(x$hp[!is.na(x$hp)])})
EDIT
Your second question:
a <- list(mtcars, mtcars)
lapply(a, function(x) {data.frame(value = matrix(NA, nrow(x), 1))})
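For reference, nrow() expects a whole data frame (or matrix) rather than a single column, and it has no na.rm argument, which is why the call in the question fails. A possible sketch that collects all the statistics in one table follows; listdf_demo is a made-up stand-in for listdf, and na.rm = TRUE is an assumption (the question used na.rm = F).

```r
# Hypothetical stand-in for listdf
listdf_demo <- list(
  a = data.frame(Concentration = c(1, 2, NA)),
  b = data.frame(Concentration = c(4, 6))
)

# One row of statistics per data frame
conc_stats <- do.call(rbind, lapply(listdf_demo, function(d) {
  v <- d$Concentration
  data.frame(
    mean = mean(v, na.rm = TRUE),
    max = max(v, na.rm = TRUE),
    min = min(v, na.rm = TRUE),
    sd = sd(v, na.rm = TRUE),
    n_rows = nrow(d),            # nrow() works on the data frame itself
    n_values = sum(!is.na(v)),   # number of non-missing values
    nbr_rejected = NA            # placeholder column, as asked in the question
  )
}))
```

Because the list is named, the rows of conc_stats are labelled with the names of the data frames.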

Speeding up dplyr pipe including checks with mutate_if and if_else on larger tables

I wrote some code to perform oversampling, meaning that I replicate my observations in a data.frame and add noise to the replicates, so they are not exactly the same anymore. I'm quite happy that it works now as intended, but it is too slow. I'm just learning dplyr and have no clue about data.table, but I hope there is a way to improve my function. I'm running this code in a function for 100s of data.frames, each of which may contain about 10,000 columns and 400 rows.
This is some toy data:
library(tidyverse)
train_set1 <- rep(0, 300)
train_set2 <- rep("Factor1", 300)
train_set3 <- data.frame(replicate(1000, sample(0:1, 300, rep = TRUE)))
train_set <- cbind(train_set1, train_set2, train_set3)
row.names(train_set) <- c(paste("Sample", c(1:nrow(train_set)), sep = "_"))
This is the code to replicate each row a given number of times and a function to determine whether the added noise later will be positive or negative:
# replicate each row twice, added row.names contain a "."
train_oversampled <- train_set[rep(seq_len(nrow(train_set)), each = 3), ]
# create a flip function
flip <- function() {
sample(c(-1,1), 1)
}
In the relevant "too slow" piece of code, I'm subsetting the row.names for the added "." to filter for the replicates. Then I select only the numeric columns. I go through those columns row by row and leave the values untouched if they are 0. If not, a certain amount is added (here +- 1 %). Later on, I combine this data set with the original data set and have my oversampled data.frame.
# add percentage of noise to non-zero values in numerical columns
noised_copies <- train_oversampled %>%
rownames_to_column(var = "rowname") %>%
filter(grepl("\\.", row.names(train_oversampled))) %>%
rowwise() %>%
mutate_if(~ is.numeric(.), ~ if_else(. == 0, 0,. + (. * flip() * 0.01 ))) %>%
ungroup() %>%
column_to_rownames(var = "rowname")
# combine original and oversampled, noised data set
train_noised <- rbind(noised_copies, train_set)
I assume there are faster ways using e.g. data.table, but it was already tough work to get this code running and I have no idea how to improve its performance.
EDIT:
The solution is working perfectly fine with fixed values, but called within a for loop I receive "Error in paste(Sample, n, sep = ".") : object 'Sample' not found"
Code to replicate:
library(data.table)
train_set <- data.frame(
x = c(rep(0, 10)),
y = c(0:9),
z = c(rep("Factor1", 10)))
# changing the row name to avoid confusion with "Sample"
row.names(train_set) <- c(paste("Observation", c(1:nrow(train_set)), sep = "_"))
train_list <- list(aa = train_set, bb = train_set, cc = train_set)
for(current_table in train_list) {
setDT(current_table, keep.rownames="Sample")
cols <- names(current_table)[sapply(current_table, is.numeric)]
noised_copies <- lapply(c(1,2), function(n) {
copy(current_table)[,
c("Sample", cols) := c(.(paste(Sample, n, sep=".")),
.SD * sample(c(0.99, 1.01), .N*ncol(.SD), TRUE)),
.SDcols=cols]
})
train_noised <- rbindlist(c(noised_copies, list(train_set)), use.names=FALSE)
# As this is an example, I did not write anything to actually
# store the results, so I have to remove the object
rm(train_noised)
}
Any ideas why the column Sample can't be found now?
Here is a more vectorized approach using data.table:
library(data.table)
setDT(train_set, keep.rownames="Sample")
cols <- names(train_set)[sapply(train_set, is.numeric)]
noised_copies <- lapply(c(1,2), function(n) {
copy(train_set)[,
c("Sample", cols) := c(.(paste(Sample, n, sep=".")),
.SD * sample(c(0.99, 1.01), .N*ncol(.SD), TRUE)),  # +- 1 % noise; zeros stay zero
.SDcols=cols]
})
train_noised <- rbindlist(c(noised_copies, list(train_set)), use.names=FALSE)
With data.table version >= 1.12.9, you can pass is.numeric directly to the .SDcols argument, and there may be a shorter way (e.g. (.SD) or names(.SD)) to specify the left-hand side of :=.
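The same vectorized idea can be sketched without data.table: draw one random sign per cell and scale all numeric columns in a single matrix operation. The helper name add_noise, the demo data and the pct argument are made up for illustration, and the factors 0.99 and 1.01 follow the +- 1 % described in the question.

```r
set.seed(1)

# Hypothetical toy data: two numeric columns and one non-numeric column
train_demo <- data.frame(a = c(0, 10, 20), b = c(5, 0, 15), grp = "Factor1")
num_cols <- names(train_demo)[sapply(train_demo, is.numeric)]

# Scale every numeric cell by (1 - pct) or (1 + pct); zeros stay zero
add_noise <- function(df, cols, pct = 0.01) {
  m <- as.matrix(df[cols])
  signs <- matrix(sample(c(-1, 1), length(m), replace = TRUE), nrow = nrow(m))
  df[cols] <- m * (1 + signs * pct)
  df
}

noised <- add_noise(train_demo, num_cols)
```

No explicit check for zeros is needed, since multiplying zero by any factor leaves it at zero.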
To address the OP's updated post:
The issue is that although each data.frame within the list is converted to a data.table inside the loop, train_list itself is not updated. You can convert the list elements with lapply and assign the result back to train_list before the for loop:
library(data.table)
train_set <- data.frame(
x = c(rep(0, 10)),
y = c(0:9),
z = c(rep("Factor1", 10)))
# changing the row name to avoid confusion with "Sample"
row.names(train_set) <- c(paste("Observation", c(1:nrow(train_set)), sep = "_"))
train_list <- list(aa = train_set, bb = copy(train_set), cc = copy(train_set))
train_list <- lapply(train_list, setDT, keep.rownames="Sample")
for(current_table in train_list) {
cols <- names(current_table)[sapply(current_table, is.numeric)]
noised_copies <- lapply(c(1,2), function(n) {
copy(current_table)[,
c("Sample", cols) := c(.(paste(Sample, n, sep=".")),
.SD * sample(c(0.99, 1.01), .N*ncol(.SD), TRUE)),  # +- 1 % noise; zeros stay zero
.SDcols=cols]
})
train_noised <- rbindlist(c(noised_copies, list(current_table)), use.names=FALSE)
# As this is an example, I did not write anything to actually
# store the results, so I have to remove the object
rm(train_noised)
}

Analysis by row with multiple functions in dplyr

I'm trying to parse the cases (rows) of a data.frame with dplyr, but to no avail. I created two functions for this:
f1 <- function(x) {
c(s = sum(x),
m = mean(x),
v = var(x))
}
f2 <- function(x) {
apply(x, 1, f1)
}
My data.frame (data_1):
for (i in 1:6) {
assign(paste('var', i, sep = '_'),
runif(30, 20, 100))
}
data_1 <- do.call(
cbind.data.frame,
mget(ls(pattern = '^var_'))
)
Using dplyr functions:
library(dplyr)
data_1 %>%
mutate_at(.vars = vars (starts_with('v')),
.funs = funs(.= f2))
data_1 %>%
mutate_if(is.numeric, .funs = funs(.= f2))
Error in mutate_impl(.data, dots) : Evaluation error: dim(X) must have a positive length.
Since the analysis is done in the rows, and I have three functions (sum, mean, and variance), the expected return is three columns.
Although it is not deprecated, rowwise() does not play well with other grouping and summary functions, so it is best avoided in dplyr. A useful alternative is to group by row number. Here is a solution to the above using this approach:
colNames <- syms(paste0("var_", 1:6))
data_1 %>%
group_by (row_number()) %>%
summarize(dataMean = mean(c(!!!colNames)),  # c() is needed: mean() only averages its first argument
dataSum = sum(!!!colNames))
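For purely numeric columns, a base R sketch can produce the three expected columns (sum, mean, variance) per row without any grouping. The names data_demo, mat and row_stats are illustrative; the data mimic data_1 from the question.

```r
set.seed(42)

# Six numeric columns named var_1 ... var_6, as in the question
data_demo <- as.data.frame(replicate(6, runif(30, 20, 100)))
names(data_demo) <- paste0("var_", 1:6)

mat <- as.matrix(data_demo)

# One row of statistics per row of the original data
row_stats <- data.frame(
  s = rowSums(mat),        # row sums
  m = rowMeans(mat),       # row means
  v = apply(mat, 1, var)   # row variances (no built-in rowVars in base R)
)
```

rowSums() and rowMeans() are vectorized, so only the variance needs apply().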

5 lists in data.frame get their mean, sd, removed outliers

I have a data.frame of 5 lists
each list has 3 columns:
T_C is an indicator of TEST or CONTROL
id, T_C, SPEND
I know how to use lapply to get a mean of T or C, but how do you do that with multiple lists ?
dfList <- lapply(tableListBase, function(t) fetch(dbSendQuery(con, paste0("SELECT * FROM ", t))))
dfList <- setNames(dfList, tableListBase)
??
For a single list I can do this:
means <- tapply(NET_SPEND, TC_INDICATOR, mean)
I am learning :-)
My goal for now is to get the mean() and sd() by T/C over these 5 lists.
My ultimate goal is to identify the values outside mean - 3 sd() and mean + 3 sd(), remove them from this set of lists, and create a new set without the outliers.
I know how to do this in a more manual way, with more lines of code, but I would like to learn how to employ more FUN() methods :-)
Here are a few approaches you can take. I've ordered them by which one I would most likely use myself:
# Make a list of 5 data frames. I'll use `mtcars` for convenience, since
# I don't have your data.
X <- lapply(1:5,
function(i) mtcars[sample(1:nrow(mtcars),
size = nrow(mtcars),
replace = TRUE), ])
library(dplyr)
# Bring all of the tables together and summarise
mapply(function(df, i){ df$tbl_id <- i; df},
X,
seq_along(X),
SIMPLIFY = FALSE) %>%
bind_rows() %>%
group_by(tbl_id, am) %>%
summarise(mean = mean(mpg),
sd = sd(mpg))
# Make a list of summaries
lapply(X,
function(df)
{
df %>%
group_by(am) %>%
summarise(mean = mean(mpg),
sd = sd(mpg))
})
# Run tapply separately for the means and sds
mean_list <-
lapply(X,
function(df)
{
tapply(df$mpg, df$am, mean)
}
)
sd_list <-
lapply(X,
function(df)
{
tapply(df$mpg, df$am, sd)
}
)
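The outlier step from the question (dropping values outside mean +- 3 sd within each TEST/CONTROL group) could be sketched as follows. The helper remove_outliers and the demo tables are made up; the column names SPEND and T_C follow the question.

```r
# Keep only rows whose value lies within k standard deviations of its group mean
remove_outliers <- function(df, value, group, k = 3) {
  mu <- ave(df[[value]], df[[group]], FUN = function(v) mean(v, na.rm = TRUE))
  sigma <- ave(df[[value]], df[[group]], FUN = function(v) sd(v, na.rm = TRUE))
  keep <- abs(df[[value]] - mu) <= k * sigma
  df[!is.na(keep) & keep, , drop = FALSE]
}

# Hypothetical table: 20 TEST rows (one extreme value) and 20 CONTROL rows
demo_tbl <- data.frame(
  id = 1:40,
  T_C = rep(c("T", "C"), each = 20),
  SPEND = c(rep(c(9, 10, 11), length.out = 19), 1000,
            rep(c(9, 10, 11), length.out = 20))
)
dfList_demo <- list(t1 = demo_tbl, t2 = demo_tbl)

# Apply the same filter to every table in the list
cleaned <- lapply(dfList_demo, remove_outliers, value = "SPEND", group = "T_C")
```

Note that in very small groups a single extreme value inflates the sd so much that it can never exceed 3 sd, so this rule only makes sense with enough rows per group.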

Dealing with NAs when calculating mean (summarize_each) on group_by

I have a data frame md:
md <- data.frame(x = c(3,5,4,5,3,5), y = c(5,5,5,4,4,1), z = c(1,3,4,3,5,5),
device1 = c("c","a","a","b","c","c"), device2 = c("B","A","A","A","B","B"))
md[2,3] <- NA
md[4,1] <- NA
md
I want to calculate means by device1 / device2 combinations using dplyr:
library(dplyr)
md %>% group_by(device1, device2) %>% summarise_each(funs(mean))
However, I am getting some NAs. I want the NAs to be ignored (na.rm = TRUE) - I tried, but the function doesn't want to accept this argument.
Both these lines result in error:
md %>% group_by(device1, device2) %>% summarise_each(funs(mean), na.rm = TRUE)
md %>% group_by(device1, device2) %>% summarise_each(funs(mean, na.rm = TRUE))
The other answers showed you the syntax for passing mean(., na.rm = TRUE) into summarize/_each.
Personally, I deal with this so often, and find it so annoying, that I just define the following convenience set of NA-aware basic functions (e.g. in my .Rprofile). You can then apply them in dplyr with summarize(mean_) and no pesky argument passing; this also keeps the source code cleaner and more readable, which is another strong plus:
mean_ <- function(...) mean(..., na.rm=T)
median_ <- function(...) median(..., na.rm=T)
sum_ <- function(...) sum(..., na.rm=T)
sd_ <- function(v) sqrt(sum_((v-mean_(v))^2) / sum(!is.na(v))) # population SD over the non-NA values
cor_ <- function(...) cor(..., use='pairwise.complete.obs')
max_ <- function(...) max(..., na.rm=T)
min_ <- function(...) min(..., na.rm=T)
pmax_ <- function(...) pmax(..., na.rm=T)
pmin_ <- function(...) pmin(..., na.rm=T)
table_ <- function(...) table(..., useNA='ifany')
mode_ <- function(...) {
tab <- table(...)
names(tab[tab==max(tab)]) # table() already drops NA values by default
}
clamp_ <- function(..., minval=0, maxval=70) pmax(minval, pmin(maxval,...))
What you really want is one global switch, along the lines of na.action/na.pass/na.omit/na.fail, that tells functions what to do with NAs by default, so that they neither throw errors nor behave inconsistently across different packages, as they currently do.
There used to be a CRAN package called Defaults for setting per-function defaults, but it has not been maintained since 2014 (pre-3.x). For more about it, see "Setting Function Defaults R on a Project Specific Basis".
try:
library(dplyr)
md %>% group_by(device1, device2) %>%
summarise_each(funs(mean(., na.rm = TRUE)))
Simple as that:
funs(mean(., na.rm = TRUE))
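Outside dplyr, the same NA handling can be sketched in base R: the non-formula interface of aggregate() forwards extra arguments such as na.rm to the summary function. The data below repeat md from the question; the name md_means is illustrative.

```r
# Same data as in the question
md <- data.frame(x = c(3,5,4,5,3,5), y = c(5,5,5,4,4,1), z = c(1,3,4,3,5,5),
                 device1 = c("c","a","a","b","c","c"),
                 device2 = c("B","A","A","A","B","B"))
md[2,3] <- NA
md[4,1] <- NA

# Group means per device1 / device2 combination, ignoring NAs
md_means <- aggregate(md[c("x", "y", "z")],
                      by = md[c("device1", "device2")],
                      FUN = mean, na.rm = TRUE)
```

A group whose values are all NA (here x for device1 == "b") comes back as NaN rather than NA, because the mean of an empty vector is NaN.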
