Analysis by row with multiple functions in dplyr - r

I'm trying to analyze the cases (rows) of a data.frame with dplyr, but to no avail. I created two functions for this:
f1 <- function(x) {
c(s = sum(x),
m = mean(x),
v = var(x))
}
f2 <- function(x) {
apply(x, 1, f1)
}
My data.frame (data_1):
for (i in 1:6) {
assign(paste('var', i, sep = '_'),
runif(30, 20, 100))
}
data_1 <- do.call(
cbind.data.frame,
mget(ls(pattern = '^var_'))
)
Using dplyr functions:
library(dplyr)
data_1 %>%
mutate_at(.vars = vars (starts_with('v')),
.funs = funs(.= f2))
data_1 %>%
mutate_if(is.numeric, .funs = funs(.= f2))
Error in mutate_impl(.data, dots) : Evaluation error: dim(X) must have a positive length.
Since the analysis is done in the rows, and I have three functions (sum, mean, and variance), the expected return is three columns.

Although not deprecated, rowwise() does not play well with other grouping and summary functions, so it is best avoided in dplyr. A useful alternative is to group by row number. Here is a solution to the above using this approach.
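colNames <- syms(paste0("var_", 1:6))
data_1 %>%
  group_by(row_number()) %>%
  # sum() accepts several arguments; mean() and var() need a single vector
  summarize(dataSum = sum(!!!colNames),
            dataMean = mean(c(!!!colNames)),
            dataVar = var(c(!!!colNames)))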
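The error in the question arises because mutate_at() and mutate_if() hand f2 one column at a time, that is, a plain vector without a dim attribute, while apply() needs a matrix or data frame. A minimal base R sketch, assuming the same f1 as in the question and a regenerated data_1 (the seed and the names row_stats and result are only for illustration), applies f1 once per row and binds the three resulting columns back on:

```r
# Rebuild an example data frame shaped like data_1 in the question.
set.seed(123)
data_1 <- as.data.frame(replicate(6, runif(30, 20, 100)))
names(data_1) <- paste0("var_", 1:6)

f1 <- function(x) {
  c(s = sum(x), m = mean(x), v = var(x))
}

# apply() over rows returns a 3 x 30 matrix (one column per row of data_1),
# so transpose it to get one row per case and columns s, m, v.
row_stats <- t(apply(data_1, 1, f1))

# Attach the three summary columns to the original data.
result <- cbind(data_1, row_stats)
head(result)
```

This keeps the whole data frame as the input to apply(), which is what f2 expected in the first place.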

Related

lapply to sum columns of many data frames

I have a large number of CSV files with x, y, value and cluster columns in one folder. I want to use lapply() to take out the value and cluster column of each file to result in one data frame with the sum for both columns of all files. What is the best way to do this?
Do you mean something like below?
aggregate(
cbind(value, cluster) ~ .,
do.call(rbind, lapply(list.files(pattern = "\\.csv$"), read.csv)),
sum
)
A tidyverse option is to read the CSV files with read_csv() from readr, row-bind them with map_dfr(), group by the 'x' and 'y' columns, and sum the numeric columns:
library(purrr)
library(readr)
library(dplyr)
files <- list.files(pattern = "\\.csv$")
map_dfr(files, read_csv) %>%
group_by(x, y) %>%
summarise(across(where(is.numeric), sum, na.rm = TRUE))
If we want to do this in parallel, use future.apply
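library(future.apply)
# multisession replaces the deprecated multiprocess strategy
future::plan(multisession, workers = length(files))
options(future.globals.maxSize = +Inf)
# future_lapply() takes the data first and the function second
out <- future.apply::future_lapply(files, read_csv)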
future::plan(sequential)
bind_rows(out) %>%
group_by(x, y) %>%
summarise(across(where(is.numeric), sum, na.rm = TRUE))
Or make use of parallel
library(foreach)     # provides foreach() and %dopar%
library(data.table)
ncores <- min(parallel::detectCores(), length(files))
cl <- parallel::makeCluster(ncores, type = "SOCK")
doSNOW::registerDoSNOW(cl)
out2 <- foreach(i = seq_along(files),
                .packages = c("data.table")) %dopar% {
  fread(files[i])
}
parallel::stopCluster(cl)
rbindlist(out2)[, lapply(.SD, sum, na.rm = TRUE), .(x, y)]
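To see the read, bind and sum steps end to end without any packages, here is a small base R sketch. It assumes two toy CSV files written to a temporary folder; the folder, the file names a.csv and b.csv, and the values are made up for illustration. Note that the pattern argument of list.files() is a regular expression, not a shell glob.

```r
# Create a scratch folder with two small CSV files (illustrative data only).
tmp <- tempfile("csv_demo_")
dir.create(tmp)
write.csv(data.frame(x = c(1, 2), y = c(1, 1), value = c(10, 20), cluster = c(1, 2)),
          file.path(tmp, "a.csv"), row.names = FALSE)
write.csv(data.frame(x = c(1, 2), y = c(1, 1), value = c(5, 7), cluster = c(3, 4)),
          file.path(tmp, "b.csv"), row.names = FALSE)

# "\\.csv$" matches names that end in ".csv".
files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# Read every file and stack the rows into one data frame.
all_rows <- do.call(rbind, lapply(files, read.csv))

# Sum value and cluster for each (x, y) combination.
totals <- aggregate(cbind(value, cluster) ~ x + y, data = all_rows, FUN = sum)
totals
```

If a single grand total per column is wanted instead, colSums(all_rows[c("value", "cluster")]) on the stacked data would do.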

FUN == 'x' does not work, how to go around it in R

I am trying to write a function that uses some other function FUN as an argument - in this case, I want to (among other things), alter what to do if I set FUN = match0.
library(dplyr)
library(purrr)
f <- function(df, pair, FUN, ...){
df1 <- df %>%
group_split()
w <- df1 %>%
map(~ .x %>%
nrow() %>%
seq())
x <- map2(w, df1, ~map(.x, mean, df = .y))
y <- map(x, unlist)
l <- map2(y, df1, ~map(.x, function(x, df = .y){
if(deparse(substitute(FUN)) == 'match0'){
out <- x
} else{
out <- df[x, pair]}
return(out)
}
)) %>% unlist()
df <- bind_rows(df1) %>% bind_cols(index = l)
return(df)
}
If I run, for instance:
a <- data.frame(n = c(15,20,15,20,15,20)) %>% group_by(n)
x <- f(a, pair = 'pairs0', FUN = match0)
I get Column 'pairs0' doesn't exist.
This would be the case if, in fact, the conditional statement evaluated to FALSE. How can I change this?
To be honest, I'm not quite sure how to use deparse, substitute and the like; I've just tried to follow some other posts. FWIW, I thought it would work because if I test deparse(substitute(match0)) == 'match0', I get TRUE.
Any help?
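One likely reason the condition is FALSE: substitute() only replaces symbols that are bound in the environment of the function it is called from, usually that function's own arguments. In f, the call sits inside a nested anonymous function where FUN is not an argument, so substitute(FUN) stays the symbol FUN and deparse() returns "FUN". A minimal sketch of the effect, using hypothetical names outer_fn, inner and fun_name that are not part of the original code:

```r
outer_fn <- function(FUN) {
  # Here FUN is an argument of the current function, so substitute()
  # returns the expression the caller supplied.
  fun_name <- deparse(substitute(FUN))

  # In this nested function FUN is not an argument, so substitute()
  # leaves the symbol untouched and deparse() gives "FUN".
  inner <- function() deparse(substitute(FUN))

  c(captured = fun_name, nested = inner())
}

# match0 does not need to exist: substitute() never evaluates it.
res <- outer_fn(match0)
res
```

Under that reading, a possible fix is to compute the name once on the first line of f, for example fun_name <- deparse(substitute(FUN)), and compare fun_name == 'match0' inside the nested function.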

5 lists in data.frame get their mean, sd, removed outliers

I have a data.frame of 5 lists.
Each list has 3 columns: id, T_C, SPEND.
T_C is an indicator of TEST or CONTROL.
I know how to use lapply to get a mean of T or C, but how do you do that with multiple lists ?
dfList <- lapply(tableListBase, function(t) fetch(dbSendQuery(con, paste0("SELECT * FROM ", t))))
dfList <- setNames(dfList, tableListBase)
??
For a single list I can do this:
means <- tapply(NET_SPEND, TC_INDICATOR, mean)
I am learning :-)
My goal to get the mean(), sd() over these 5 lists T/C for now.
My ultimate goal is to identify the values below mean - 3 sd() or above mean + 3 sd(), remove them from this 5-list set, and create a new set without the outliers.
I know how to do this in a more manual way, with more lines of code, but would like to learn how to employ more FUN() methods :- )
Here are a few approaches you can take. I've ordered them by which one I would most likely use myself:
# Make a list of 5 data frames. I'll use `mtcars` for convenience, since
# I don't have your data.
X <- lapply(1:5,
function(i) mtcars[sample(1:nrow(mtcars),
size = nrow(mtcars),
replace = TRUE), ])
library(dplyr)
# Bring all of the tables together and summarise
mapply(function(df, i){ df$tbl_id <- i; df},
X,
seq_along(X),
SIMPLIFY = FALSE) %>%
bind_rows() %>%
group_by(tbl_id, am) %>%
summarise(mean = mean(mpg),
sd = sd(mpg))
# Make a list of summaries
lapply(X,
function(df)
{
df %>%
group_by(am) %>%
summarise(mean = mean(mpg),
sd = sd(mpg))
})
# Run tapply separately for the means and sds
mean_list <-
lapply(X,
function(df)
{
tapply(df$mpg, df$am, mean)
}
)
sd_list <-
lapply(X,
function(df)
{
tapply(df$mpg, df$am, sd)
}
)
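For the outlier step in the question (dropping values beyond mean ± 3 sd within each TEST/CONTROL group), a base R sketch along the same lines might look like this. It reuses mtcars as a stand-in, with mpg playing the role of SPEND and am the role of T_C; drop_outliers and its arguments are hypothetical names introduced here, not part of the original answer.

```r
# Stand-in list of 5 data frames, as in the examples above.
set.seed(1)
X <- lapply(1:5, function(i) mtcars[sample(nrow(mtcars), replace = TRUE), ])

# Keep only rows whose value lies within k standard deviations of its group mean.
drop_outliers <- function(df, value = "mpg", group = "am", k = 3) {
  centre <- ave(df[[value]], df[[group]], FUN = mean)  # group mean, one per row
  spread <- ave(df[[value]], df[[group]], FUN = sd)    # group sd, one per row
  # A group with a single row has sd NA; keep such rows rather than drop them.
  keep <- is.na(spread) | abs(df[[value]] - centre) <= k * spread
  df[keep, , drop = FALSE]
}

# New list with the outliers removed from every data frame.
X_clean <- lapply(X, drop_outliers)
sapply(X_clean, nrow)
```

With the real data, value = "SPEND" and group = "T_C" would be the natural arguments, assuming those are the actual column names.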

Dealing with NAs when calculating mean (summarize_each) on group_by

I have a data frame md:
md <- data.frame(x = c(3,5,4,5,3,5), y = c(5,5,5,4,4,1), z = c(1,3,4,3,5,5),
device1 = c("c","a","a","b","c","c"), device2 = c("B","A","A","A","B","B"))
md[2,3] <- NA
md[4,1] <- NA
md
I want to calculate means by device1 / device2 combinations using dplyr:
library(dplyr)
md %>% group_by(device1, device2) %>% summarise_each(funs(mean))
However, I am getting some NAs. I want the NAs to be ignored (na.rm = TRUE) - I tried, but the function doesn't want to accept this argument.
Both these lines result in error:
md %>% group_by(device1, device2) %>% summarise_each(funs(mean), na.rm = TRUE)
md %>% group_by(device1, device2) %>% summarise_each(funs(mean, na.rm = TRUE))
The other answers showed you the syntax for passing mean(., na.rm = TRUE) into summarize/_each.
Personally, I deal with this so often, and find it so annoying, that I define the following convenience set of NA-aware basic functions (e.g. in my .Rprofile). You can then apply them in dplyr with summarize(mean_) and no pesky argument passing, which also keeps the source code cleaner and more readable:
mean_ <- function(...) mean(..., na.rm = TRUE)
median_ <- function(...) median(..., na.rm = TRUE)
sum_ <- function(...) sum(..., na.rm = TRUE)
# population sd (divides by n); count only the non-NA values
sd_ <- function(v) sqrt(sum_((v - mean_(v))^2) / sum(!is.na(v)))
cor_ <- function(...) cor(..., use = 'pairwise.complete.obs')
max_ <- function(...) max(..., na.rm = TRUE)
min_ <- function(...) min(..., na.rm = TRUE)
pmax_ <- function(...) pmax(..., na.rm = TRUE)
pmin_ <- function(...) pmin(..., na.rm = TRUE)
table_ <- function(...) table(..., useNA = 'ifany')
mode_ <- function(...) {
  tab <- table(...)  # table() drops NA values by default
  names(tab[tab == max(tab)])
}
clamp_ <- function(..., minval = 0, maxval = 70) pmax(minval, pmin(maxval, ...))
Ideally you could flip one global switch once and for all, along the lines of na.action with na.pass/na.omit/na.fail, that tells every function how to treat NAs by default, instead of functions throwing errors or behaving inconsistently across packages as they currently do.
There used to be a CRAN package called Defaults for setting per-function defaults, but it has not been maintained since 2014 (pre-3.x). For more about it, see "Setting Function Defaults R on a Project Specific Basis".
try:
library(dplyr)
md %>% group_by(device1, device2) %>%
summarise_each(funs(mean(., na.rm = TRUE)))
Simple as that:
funs(mean(., na.rm = TRUE))
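In current dplyr releases summarise_each() and funs() are deprecated. A sketch of the same computation with across(), assuming dplyr 1.0.0 or later and the md data frame from the question (the name res is only for illustration):

```r
library(dplyr)

# Same example data as in the question, including the two NAs.
md <- data.frame(x = c(3, 5, 4, 5, 3, 5), y = c(5, 5, 5, 4, 4, 1), z = c(1, 3, 4, 3, 5, 5),
                 device1 = c("c", "a", "a", "b", "c", "c"),
                 device2 = c("B", "A", "A", "A", "B", "B"))
md[2, 3] <- NA
md[4, 1] <- NA

# across() applies the lambda to each listed column within every group;
# na.rm = TRUE is passed inside the lambda, so no extra argument plumbing is needed.
res <- md %>%
  group_by(device1, device2) %>%
  summarise(across(c(x, y, z), ~ mean(.x, na.rm = TRUE)), .groups = "drop")
res
```

A group whose values are all NA for a column (here x for device1 "b") comes out as NaN rather than NA, because the mean of an empty set is undefined.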

Why is sum() working in this dplyr expression while quantile() isn't?

I want to calculate the quantiles of each row of a data frame and return the result as a matrix. Since I want to calculate an arbitrary number of quantiles (and I imagine that it is faster to calculate them all at once, rather than re-running the function), I tried using a formula I found in this question:
library(dplyr)
df<- as.data.frame(matrix(rbinom(1000,10,0.5),nrow = 2))
interim_res <- df %>%
rowwise() %>%
do(out = sapply(min(df):max(df), function(i) sum(i==.)))
interim_res <- interim_res[[1]] %>% do.call(rbind,.) %>% as.data.frame(.)
This makes sense, but when I try to apply the same framework to the quantile() function, as coded here,
interim_res <- df %>%
rowwise() %>%
do(out = quantile(.,probs = c(0.1,0.5,0.9)))
interim_res <- interim_res[[1]] %>% do.call(rbind,.) %>% as.data.frame(.)
I get this error message:
Error in sort.int(x, na.last = na.last, decreasing = decreasing, ...) :
'x' must be atomic
Why am I getting an error with quantile and not sum? How should I fix this issue?
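Inside do(), . is a data frame (a single row of df), not an atomic vector, which is why quantile() fails. sum(i == .) works because comparing a data frame with a scalar yields a logical matrix, which sum() accepts. This works: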
df %>%
rowwise() %>%
do(data.frame(as.list(quantile(unlist(.),probs = c(0.1,0.5,0.9)))))
but risks being horrendously slow. Why not just:
apply(df, 1, quantile, probs = c(0.1,0.5,0.9))
Here are some timings with larger data:
df <- as.data.frame(matrix(rbinom(100000,10,0.5),nrow = 1000))
library(microbenchmark)
microbenchmark(
df %>% rowwise() %>% do(data.frame(as.list(quantile(unlist(.),probs = c(0.1,0.5,0.9))))),
apply(df, 1, quantile, probs = c(0.1,0.5,0.9)),
times=5
)
Produces:
expr        min        lq      mean    median        uq       max neval
dplyr 2375.2319 2376.6658 2446.4070 2419.4561 2454.6017 2606.0794     5
apply  224.7869  231.7193  246.7137  233.4757  245.0718  298.5144     5
If you go the apply route you should probably stick with a matrix from the get go.
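A short sketch of the matrix-first route, assuming the same simulated data as in the question (the seed and the names m and q are only for illustration). apply() returns one column per input row, so the result is transposed to get one row of quantiles per original row:

```r
set.seed(42)

# Keep the data as a matrix instead of converting it to a data frame.
m <- matrix(rbinom(1000, 10, 0.5), nrow = 2)

# apply() over rows gives a 3 x 2 matrix (quantiles in rows); transpose it
# so each row of q holds the quantiles for the matching row of m.
q <- t(apply(m, 1, quantile, probs = c(0.1, 0.5, 0.9)))
q
```

The columns of q are named after the requested probabilities, and as.data.frame(q) converts the result if a data frame is needed downstream.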
