Get the mean across list of dataframes by rows

I have a list of dataframes and I want to calculate the mean of all the first rows, of all the second rows, etc.
I think this is possible by creating a common factor as an index, putting the dataframes together using rbind, and then calculating the mean with aggregate(value ~ row.index, large.df, mean). However, I guess there is a more straightforward way?
Here is my example:
df1 = data.frame(val = c(4,1,0))
df2 = data.frame(val = c(5,2,1))
df3 = data.frame(val = c(6,3,2))
myLs=list(df1, df2, df3)
[[1]]
val
1 4
2 1
3 0
[[2]]
val
1 5
2 2
3 1
[[3]]
val
1 6
2 3
3 2
And my expected dataframe output, as row-wise means:
df.means
mean
1 5
2 2
3 1
My first steps, not working as expected yet:
# Calculate the mean of list by rows
lapply(myLs, function(x) mean(x[1,]))
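The rbind-plus-aggregate route described in the question could look roughly like the sketch below. It assumes every data frame has the same number of rows; row.index is a helper column introduced here for illustration, not something present in the original data.

```r
df1 <- data.frame(val = c(4, 1, 0))
df2 <- data.frame(val = c(5, 2, 1))
df3 <- data.frame(val = c(6, 3, 2))
myLs <- list(df1, df2, df3)

# Tag every row with its position inside its own data frame.
# row.index is a hypothetical helper column added for grouping.
tagged <- lapply(myLs, function(d) cbind(row.index = seq_len(nrow(d)), d))

# Stack the tagged frames, then average val within each row position.
large.df <- do.call(rbind, tagged)
df.means <- aggregate(val ~ row.index, data = large.df, FUN = mean)
df.means
```

This works, but it is more machinery than the problem needs, which is why the answers below combine the frames column-wise instead.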

A simple way would be to cbind the list and calculate mean of each row with rowMeans
rowMeans(do.call(cbind, myLs))
#[1] 5 2 1
We can also use bind_cols from dplyr to combine all the dataframes.
rowMeans(dplyr::bind_cols(myLs))
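To get the data frame shape shown in the question rather than a bare vector, the rowMeans result can be wrapped in data.frame. A minimal sketch, assuming every data frame in the list has a single numeric column and the same number of rows:

```r
df1 <- data.frame(val = c(4, 1, 0))
df2 <- data.frame(val = c(5, 2, 1))
df3 <- data.frame(val = c(6, 3, 2))
myLs <- list(df1, df2, df3)

# Bind the frames side by side, average across each row,
# and store the result in a column called "mean".
df.means <- data.frame(mean = rowMeans(do.call(cbind, myLs)))
df.means
```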

Here is another base R solution using unlist + data.frame + rowMeans, i.e.,
rowMeans(data.frame(unlist(myLs, recursive = FALSE)))
# [1] 5 2 1

Using double loop:
sapply(1:3, function(i) mean(sapply(myLs, function(j) j[i, ] )))
# [1] 5 2 1

Another base R possibility could be:
Reduce("+", myLs)/length(myLs)
val
1 5
2 2
3 1
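One property of the Reduce approach is that it averages element by element, so it also handles data frames with several columns, keeping one mean per cell. The sketch below uses two made-up two-column frames (not from the question) and assumes all frames share the same shape and are fully numeric.

```r
# Hypothetical two-column frames, invented for illustration.
a <- data.frame(x = c(4, 1, 0), y = c(10, 20, 30))
b <- data.frame(x = c(6, 3, 2), y = c(30, 40, 50))
frames <- list(a, b)

# Sum the frames cell by cell, then divide by the number of frames.
elementwise <- Reduce(`+`, frames) / length(frames)
elementwise
```

Note that a single NA in any frame makes the corresponding cell NA, since `+` has no na.rm option.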

Related

changing column names of a data frame by changing values - R

Suppose I have the data frame below.
df.open<-c(1,4,5)
df.close<-c(2,8,3)
df<-data.frame(df.open, df.close)
> df
df.open df.close
1 1 2
2 4 8
3 5 3
I want to change column names that include "open" to "a" and column names that include "close" to "b".
Namely, I want to obtain the data frame below:
a b
1 1 2
2 4 8
3 5 3
I have a lot of such data frames. The prefixes (here it is "df.") change, but "open" and "close" are fixed.
Thanks a lot.

We can create a function for reuse
f1 <- function(dat) {
  names(dat)[grep('open$', names(dat))] <- 'a'
  names(dat)[grep('close$', names(dat))] <- 'b'
  dat
}
and apply on the data
df <- f1(df)
-output
df
a b
1 1 2
2 4 8
3 5 3
if these datasets are in a list
lst1 <- list(df, df)
lst1 <- lapply(lst1, f1)
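Because f1 matches on the end of each name, it does not depend on the prefix or on the column order. A small sketch with made-up frames (the "px." prefix is an assumption for illustration):

```r
f1 <- function(dat) {
  names(dat)[grep("open$", names(dat))] <- "a"
  names(dat)[grep("close$", names(dat))] <- "b"
  dat
}

# Hypothetical frames with different prefixes and column orders.
lst1 <- list(
  data.frame(df.open = c(1, 4, 5), df.close = c(2, 8, 3)),
  data.frame(px.close = c(9, 7), px.open = c(6, 5))
)

renamed <- lapply(lst1, f1)
names(renamed[[1]])  # "a" "b"
names(renamed[[2]])  # "b" "a"
```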

Thanks to @akrun's insightful suggestion, we can do it in one go. We pass character vectors as the pattern and replacement arguments of str_replace to carry out both operations at once. Each argument can be a character vector of length one or more; if longer than one, the lengths of the two vectors should correspond. Note that str_replace pairs the patterns with the column names element by element, so this relies on the columns appearing in the same order as the patterns. As the documentation says:
References of the form \1, \2, etc will be replaced with the contents
of the respective matched group (created by ())
library(dplyr)
library(stringr)
df %>%
  rename_with(~ str_replace(., c(".*\\.open", ".*\\.close"), c("a", "b")))
a b
1 1 2
2 4 8
3 5 3

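Another base R option using sub + match + setNames. The pattern keeps only the trailing "open" or "close" of each name, so it does not depend on which characters the prefix contains:
setNames(
  df,
  c("a", "b")[match(
    sub(".*(open|close)$", "\\1", names(df)),
    c("open", "close")
  )]
)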
gives
a b
1 1 2
2 4 8
3 5 3

Using the apply or plyr when the return has a variable number of columns

I'm wondering if there is a way to directly return a data frame from an apply or plyr call when the return from the function can have a variable number of columns (but will always have the same number of rows). For example:
df <- data.frame(A = 1:3, B = c("a","b", "c"))
my_fun <- function(x) {
  if (is.numeric(unlist(x))) {
    return(x)
  } else {
    return(cbind(x, x))
  }
}
The closest I've been able to get is by returning a list and converting it into a data frame:
library(plyr)
data.frame(alply(df, 2, my_fun))
## A X2.B X2.B.1
## 1 1 a a
## 2 2 b b
## 3 3 c c
It feels like there should be a way to do this without the extra conversion, is there?

I use lapply() a lot in this way, when you want to apply a function to several columns of a data frame. In base R, you can treat a data frame as a list, where each column is one element. If you use lapply() as usual it will return a list, which isn't what we want.
> lapply(df, my_fun)
$A
[1] 1 2 3
$B
x x
[1,] 1 1
[2,] 2 2
[3,] 3 3
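But if you assign the result to df[] it will signal to R that you want a subset of your original data frame (the full subset, which isn't a subset at all), thus preserving the data frame object type. (The values shown for B come from an R version before 4.0.0, where B is stored as a factor and cbind() returns its integer codes; with character columns the values would be "a", "b", "c".)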
> df[] <- lapply(df, my_fun)
> df
A B.x B.x
1 1 1 1
2 2 2 2
3 3 3 3
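If the goal is a data frame with one real column per returned column, another option is to pass the lapply result straight to data.frame(), which expands matrix results into separate columns. This is only a sketch; as far as I can tell, the df[] assignment above keeps the cbind result as a single two-column matrix column, so ncol(df) would still report 2 there.

```r
df <- data.frame(A = 1:3, B = c("a", "b", "c"))

my_fun <- function(x) {
  # Numeric columns pass through; anything else is duplicated side by side.
  if (is.numeric(unlist(x))) x else cbind(x, x)
}

# data.frame() flattens the matrix returned for B into two columns.
out <- data.frame(lapply(df, my_fun))
dim(out)
```

The exact names given to the expanded columns depend on how data.frame() de-duplicates them, so rename them afterwards if they matter.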

R grouping by name and perform stats (t-test)

I have two data.frames:
word1=c("a","a","a","a","b","b","b")
word2=c("a","a","a","a","c","c","c")
values1 = c(1,2,3,4,5,6,7)
values2 = c(3,3,0,1,2,3,4)
df1 = data.frame(word1,values1)
df2 = data.frame(word2,values2)
df1:
word1 values1
1 a 1
2 a 2
3 a 3
4 a 4
5 b 5
6 b 6
7 b 7
df2:
word2 values2
1 a 3
2 a 3
3 a 0
4 a 1
5 c 2
6 c 3
7 c 4
I would like to split these dataframes by word*, and perform two sample t.tests in R.
For example, the word "a" is in both data.frames. What's the t.test between the data.frames for the word "a"? And do this for all the words that are in both data.frames.
The result is a data.frame(result):
word tvalues
1 a 0.4778035
Thanks

Find the words common to both dataframes, then loop over these words, subsetting both dataframes and performing the t.test on the subsets.
E.g.:
df1 <- data.frame(word=sample(letters[1:5], 30, replace=TRUE),
                  x=rnorm(30))
df2 <- data.frame(word=sample(letters[1:5], 30, replace=TRUE),
                  x=rnorm(30))
common_words <- sort(intersect(df1$word, df2$word))
setNames(lapply(common_words, function(w) {
  t.test(subset(df1, word==w, x), subset(df2, word==w, x))
}), common_words)
This returns a list, where each element is the output of the t.test for one of the common words. setNames just names the list elements so you can see which words they correspond to.
Note I've created new example data here since your example data only have one word in common (a) and so don't really resemble your true problem.
If you just want a matrix of statistics, you can do something like:
t(sapply(common_words, function(w) {
  test <- t.test(subset(df1, word==w, x), subset(df2, word==w, x))
  c(test$statistic, test$parameter, p=test$p.value,
    `2.5%`=test$conf.int[1], `97.5%`=test$conf.int[2])
}))
## t df p 2.5% 97.5%
## a 0.9141839 8.912307 0.38468553 -0.4808054 1.1313220
## b -0.2182582 7.589109 0.83298193 -1.1536056 0.9558315
## c -0.2927253 8.947689 0.77640684 -1.5340097 1.1827691
## d -2.7244728 12.389709 0.01800568 -2.5016301 -0.2826952
## e -0.3683153 7.872407 0.72234501 -1.9404345 1.4072499
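Applied to the data from the question, the same idea can return a data frame directly. This is a sketch using t.test's default Welch test on plain vectors; the column names tvalue and pvalue are my own choice. For these data the 0.4778035 quoted in the question appears to be closer to the p-value than to the t statistic.

```r
df1 <- data.frame(word1 = c("a", "a", "a", "a", "b", "b", "b"),
                  values1 = c(1, 2, 3, 4, 5, 6, 7))
df2 <- data.frame(word2 = c("a", "a", "a", "a", "c", "c", "c"),
                  values2 = c(3, 3, 0, 1, 2, 3, 4))

# Words present in both data frames (as.character guards against factors).
common_words <- sort(intersect(as.character(df1$word1),
                               as.character(df2$word2)))

# One row per common word: Welch two-sample t-test on the matching values.
result <- do.call(rbind, lapply(common_words, function(w) {
  test <- t.test(df1$values1[df1$word1 == w],
                 df2$values2[df2$word2 == w])
  data.frame(word = w,
             tvalue = unname(test$statistic),
             pvalue = test$p.value)
}))
result
```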

Grouping an entire data set and aggregating

I have a dataset of 20 variables, V1, V2, V3, ..., V20, with 1,200 rows.
I want to average every four rows in my data frame, i.e. my output dataset should have 20 columns containing V1, V2, V3, ..., V20 and 300 rows, each containing the average of a group of 4 rows.
I cannot use tapply, because it takes one variable at a time; I want to average all 20 variables at once.
Is there an efficient way to do this? I want to use functions from the apply family and would like to avoid looping.

Using lapply with colMeans
set.seed(42)
dat <- as.data.frame(matrix(sample(1:20, 20*1200, replace=TRUE), ncol=20))
n <- seq_len(nrow(dat))
res <- do.call(rbind,lapply(split(dat, (n-1)%/%4 +1),colMeans, na.rm=TRUE))
dim(res)
#[1] 300 20
Explanation
The idea is to create a grouping variable that splits the dataset into a list of subsets, so that rows 1:4 go into the first subset, rows 5:8 into the second, and so on; the last subset holds rows 1197:1200. For easier understanding, take a subset of rows. Suppose your dataset has 10 rows:
n1 <- seq_len(10)
n1
#[1] 1 2 3 4 5 6 7 8 9 10
(n1-1) %/%4 #created a numeric index to split by group
# [1] 0 0 0 0 1 1 1 1 2 2
I added 1 to the above to start from 1 instead of 0
(n1-1) %/%4 +1
#[1] 1 1 1 1 2 2 2 2 3 3
You could also use gl. Its first argument is the number of levels, so set it to the number of groups, i.e.
gl(3, 4, 10)
For the dataset, it should be
gl(300, 4, 1200)
Now, you can either split n1 by the newly created grouping index or the dataset
split(n1,(n1-1) %/%4 +1) # you can check the result of this
For a subset of 10 rows of the dataset
split(dat[1:10,], (n1-1) %/%4 +1)
and then use lapply along with colMeans to get the column means of each list element and rbind them using do.call(rbind,..)
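The whole pipeline is easy to check by hand on a tiny example. The sketch below uses a made-up 8-row, 2-column data frame (toy is not from the question), so the two group means per column can be verified mentally; aggregate is shown as a base R cross-check.

```r
# Hypothetical toy data: 8 rows, so two groups of 4.
toy <- data.frame(V1 = 1:8, V2 = seq(10, 80, by = 10))

# Grouping index: 1 1 1 1 2 2 2 2
grp <- (seq_len(nrow(toy)) - 1) %/% 4 + 1

# Split by group, take column means per piece, and stack the results.
res_toy <- do.call(rbind, lapply(split(toy, grp), colMeans, na.rm = TRUE))
res_toy

# Cross-check with aggregate, which returns a data frame instead of a matrix.
agg <- aggregate(toy, by = list(grp = grp), FUN = mean)
agg
```

If the number of rows is not a multiple of 4, the last group is simply averaged over the rows it has.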
Or
summarise with across from dplyr (dplyr >= 1.0.0; the older summarise_each and funs are deprecated)
library(dplyr)
res2 <- dat %>%
  mutate(N = (row_number() - 1) %/% 4 + 1) %>%
  group_by(N) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE))) %>%
  select(-N)
dim(res2)
#[1] 300 20
all.equal(as.data.frame(res), as.data.frame(res2), check.attributes=FALSE)
#[1] TRUE
Or
Using data.table
library(data.table)
DT1 <- setDT(dat)[, N:=(seq_len(.N)-1)%/%4 +1][,
lapply(.SD, mean, na.rm=TRUE), by=N][,N:=NULL]
dim(DT1)
#[1] 300 20

Reshaping count-summarised data into long form in R [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 4 years ago.
Embarrassingly basic question, but if you don't know.. I need to reshape a data.frame of count summarised data into what it would've looked like before being summarised. This is essentially the reverse of {plyr} count() e.g.
> (d = data.frame(value=c(1,1,1,2,3,3), cat=c('A','A','A','A','B','B')))
value cat
1 1 A
2 1 A
3 1 A
4 2 A
5 3 B
6 3 B
> (summry = plyr::count(d))
value cat freq
1 1 A 3
2 2 A 1
3 3 B 2
If you start with summry what is the quickest way back to d? Unless I'm mistaken (very possible), {Reshape2} doesn't do this..

Just use rep:
summry[rep(rownames(summry), summry$freq), c("value", "cat")]
# value cat
# 1 1 A
# 1.1 1 A
# 1.2 1 A
# 2 2 A
# 3 3 B
# 3.1 3 B
A variation of this approach can be found in expandRows from my "SOfun" package. If you had that loaded, you would be able to simply do:
expandRows(summry, "freq")
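If you prefer plain integer row names and no extra package, a small variation is to index by row position and reset the row names afterwards. A minimal sketch, rebuilding summry by hand so it runs without plyr:

```r
# Same content as plyr::count(d) in the question, typed in directly.
summry <- data.frame(value = c(1, 2, 3),
                     cat = c("A", "A", "B"),
                     freq = c(3, 1, 2))

# Repeat each row index freq times, keep the original columns,
# then drop the "1.1", "1.2", ... row names created by the repetition.
d <- summry[rep(seq_len(nrow(summry)), summry$freq), c("value", "cat")]
rownames(d) <- NULL
d
```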

There is a good table to dataframe function on the R cookbook website that you can modify slightly. The only modifications were changing 'Freq' -> 'freq' (to be consistent with plyr::count) and making sure the rownames were reset as increasing integers.
expand.dft <- function(x, na.strings = "NA", as.is = FALSE, dec = ".") {
  # Take each row in the source data frame table and replicate it
  # using the freq value
  DF <- sapply(1:nrow(x),
               function(i) x[rep(i, each = x$freq[i]), ],
               simplify = FALSE)
  # Take the above list and rbind it to create a single DF
  # Also subset the result to eliminate the freq column
  DF <- subset(do.call("rbind", DF), select = -freq)
  # Now apply type.convert to the character coerced factor columns
  # to facilitate data type selection for each column
  for (i in 1:ncol(DF)) {
    DF[[i]] <- type.convert(as.character(DF[[i]]),
                            na.strings = na.strings,
                            as.is = as.is, dec = dec)
  }
  # Reset the row names to increasing integers
  row.names(DF) <- seq(nrow(DF))
  DF
}
expand.dft(summry)
value cat
1 1 A
2 1 A
3 1 A
4 2 A
5 3 B
6 3 B
