Mean of elements in a list of data.frames - r

Suppose I have a list of data.frames (all with the same numbers of rows and columns):
dat1 <- as.data.frame(matrix(rnorm(25), ncol=5))
dat2 <- as.data.frame(matrix(rnorm(25), ncol=5))
dat3 <- as.data.frame(matrix(rnorm(25), ncol=5))
all.dat <- list(dat1=dat1, dat2=dat2, dat3=dat3)
How can I return a single data.frame that is the element-wise mean (or sum, etc.) across the data.frames in the list (e.g., the mean of the row 1, column 1 elements of dat1, dat2, and dat3, and so on)? I have tried lapply and ldply from plyr, but these return the statistic for each data.frame within the list.
Edit: For some reason, this was retagged as homework. Not that it matters either way, but this is not a homework question. I just don't know why I can't get this to work. Thanks for any insight!
Edit2: For further clarification:
I can get the results using loops, but I was hoping there was a simpler and faster way, because the data I am actually using is a list of 1000+ data.frames, each 12 rows by 100 columns.
z <- matrix(0, nrow(all.dat$dat1), ncol(all.dat$dat1))
for(l in 1:nrow(all.dat$dat1)){
  for(m in 1:ncol(all.dat$dat1)){
    z[l, m] <- mean(unlist(lapply(all.dat, `[`, i = l, j = m)))
  }
}
With a result of the means:
> z
            [,1]        [,2]        [,3]        [,4]       [,5]
[1,] -0.64185488  0.06220447 -0.02153806  0.83567173  0.3978507
[2,] -0.27953054 -0.19567085  0.45718399 -0.02823715  0.4932950
[3,]  0.40506666  0.95157856  1.00017954  0.57434125 -0.5969884
[4,]  0.71972821 -0.29190645  0.16257478 -0.08897047  0.9703909
[5,] -0.05570302  0.62045662  0.93427522 -0.55295824  0.7064439
I was wondering if there was a less clunky and faster way to do this. Thanks!

Here is a one-liner with plyr. You can replace mean with any other function you want.
ans1 = aaply(laply(all.dat, as.matrix), c(2, 3), mean)
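For example, an element-wise sum across the list just swaps in sum (a small sketch using the same all.dat):
library(plyr)
# element-wise sum: laply builds a 3 x 5 x 5 array, aaply collapses the first dimension
ans2 = aaply(laply(all.dat, as.matrix), c(2, 3), sum)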

You would have an easier time changing the data structure, combining the three two-dimensional matrices into a single three-dimensional array (using the abind package). Then the solution is more direct: use apply and specify the dimensions to average over.
EDIT:
When I answered the question, it was tagged homework, so I just gave an approach. The original poster removed that tag, so I will take him/her at his/her word that it isn't.
library("abind")
all.matrix <- abind(all.dat, along=3)
apply(all.matrix, c(1,2), mean)

I gave one answer that uses a completely different data structure to achieve the result. This answer uses the data structure (list of data frames) given directly. I think it is less elegant, but wanted to provide it anyway.
Reduce(`+`, all.dat) / length(all.dat)
The logic is to add the data frames together element by element (which + will do with data frames), then divide by the number of data frames. Using Reduce is necessary since + can only take two arguments at a time; because addition is associative, folding the list pairwise gives the same result.
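For instance, the element-wise sum is just the Reduce step without the division (a small sketch on the same all.dat):
# pairwise fold: equivalent to (all.dat$dat1 + all.dat$dat2) + all.dat$dat3
all.sum <- Reduce(`+`, all.dat)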

Another approach using only base functions to change the structure of the object:
listVec <- lapply(all.dat, c, recursive=TRUE)
m <- do.call(cbind, listVec)
Now you can calculate the mean with rowMeans or the median with apply:
means <- rowMeans(m)
medians <- apply(m, 1, median)
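If you want the means back in the original row-by-column layout, you can reshape the vector. A small sketch, assuming every data.frame in all.dat has the same dimensions as dat1 (c() flattens a data.frame column by column, and matrix() refills in the same column-major order):
z.means <- matrix(rowMeans(m), nrow = nrow(all.dat$dat1),
                  dimnames = dimnames(all.dat$dat1))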

I would take a slightly different approach:
library(plyr)
tmp <- ldply(all.dat) # convert to df
tmp$counter <- 1:5 # recycled over the stacked rows; use 1:12 for your actual situation
ddply(tmp, .(counter), function(x) colMeans(x[2:ncol(x)]))

Couldn't you just use nested lapply() calls?
This appears to give the correct result on my machine:
mean.dat <- lapply(all.dat, function (x) lapply(x, mean, na.rm=TRUE))

Related

How to find elements with not enough observations in a list

Say I have the following list, where each element is a data.frame of a different size:
df1 <- data.frame(matrix(rnorm(12346), ncol = 2))
df2 <- data.frame(matrix(rnorm(14330), ncol = 2))
df3 <- data.frame(matrix(rnorm(2422), ncol = 2))
l <- list(df1, df2, df3)
In my example each data.frame represents a year of observations, and clearly df3 contains a lot fewer observations compared to the other two.
My question is then:
What is the best approach to detect those elements of the list l that do not have a comparable number of rows, and then remove them from the list?
I've so far tried using the median, but as this will tend to remove about half of the elements in l, I'm not sure it is the best solution for future use:
library(magrittr)
cutoff <- input %>%
  vapply(nrow, FUN.VALUE = integer(1)) %>%
  median()
idx <- vapply(input, function(x) nrow(x) >= cutoff, logical(1))
input[idx]
where input is a list like l above.
NOTE: As this is my first question on SO, please feel free to edit the question if it does not live up to the standards of this community, or give feedback on asking better questions. Thanks in advance
EDIT:
The question is not so much about how to use the median to remove elements of the list, but rather whether the median is the right method to remove those data.frames which have far fewer observations than the others.
Does this work:
l[sapply(l, function(x) nrow(x) >= median(unlist(lapply(l, nrow))))]
purrr::keep is the way to go when filtering lists with conditions.
library(purrr)
keep(l, ~ nrow(.x) > median(map_dbl(l, nrow)))
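One small design note: the predicate above recomputes the median for every element of l; precomputing it once is equivalent and slightly cheaper (a sketch):
med <- median(map_dbl(l, nrow))  # number of rows per data.frame, summarised once
keep(l, ~ nrow(.x) > med)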
It looks like you have a variable number of rows in your data frames, and you want to identify those that are unusually low. This is a statistical problem called outlier detection.
Programmatically, you want to extract the number of rows from your list of data frames, which is easily done with
rows <- sapply(l, nrow)
Statistically, you now want to take a look at your data and its distribution. Good, simple visualisations in R can be
hist(rows)
boxplot(rows)
Note that these will work better if you have many dfs, and are pretty useless with 3.
How to now determine which values are outside an "expected" distribution is not always trivial. Some resources:
outliers tag on CrossValidated
a nice RBloggers post
Note that it's also acceptable for you to choose a cutoff manually if you can reasonably justify it.
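As one concrete illustration (my own assumption of a rule, not the only choice), the usual boxplot convention flags values below Q1 - 1.5 * IQR, which translates to something like:
rows <- sapply(l, nrow)
cutoff <- quantile(rows, 0.25) - 1.5 * IQR(rows)  # boxplot-style lower fence
l.kept <- l[rows >= cutoff]                       # drop unusually short data.frames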

for loop and apply to calculate means

I want to calculate the means for 32 vectors in a list. I thought this code should do the job:
for(i in sequence(length(means16list))){
  mat.means16 <- apply(means16list, 1, mean)
}
where means16list contains 32 numeric vectors and mat.means16 should contain the means. It is a 4 x 4 matrix defined in a previous step.
Maybe I did not understand how loops work yet.
Can someone help?
Cheers
mat.means16 is being overwritten each time; you should create a list and store the results there instead.
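A minimal sketch of that fix, assuming each element of means16list is a numeric vector:
mat.means16 <- vector("list", length(means16list))  # pre-allocate a list
for (i in seq_along(means16list)) {
  mat.means16[[i]] <- mean(means16list[[i]])        # store each result instead of overwriting
}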
There are likely better ways to do this if you post example data; I'm assuming you want rowMeans() of a matrix.
results <- lapply(means16list, rowMeans)
Unless I misunderstood, why not just:
# Sample data
set.seed(2017);
means16list <- lapply(1:32, function(x) runif(10))
# Return a list of the sample means
lapply(means16list, mean);
# Return a vector of the sample means
sapply(means16list, mean);
I don't see the point of the for loop. lapply/sapply will loop through every element of your list and apply the function mean to it.

How to apply operation and sum over columns in R?

I want to apply some operations to the values in a number of columns, and then sum the results of each row across columns. I can do this using:
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$a2 <- x$a^2
x$b2 <- x$b^2
x$result <- x$a2 + x$b2
but this will become arduous with many columns, and I'm wondering if anyone can suggest a simpler way. Note that the dataframe contains other columns that I do not want to include in the calculation (in this example, column sample is not to be included).
Many thanks!
I would simply subset the columns of interest and apply everything directly on the matrix using the rowSums function.
x <- data.frame(sample=1:3, a=4:6, b=7:9)
# put column indices and apply your function
x$result <- rowSums(x[,c(2,3)]^2)
This of course assumes your function is vectorized. If not, you would need to use some apply variation (which you are seeing many of). That said, you can still use rowSums if you find it useful like so. Note, I use sapply which also returns a matrix.
# random custom function
myfun <- function(x){
  return(x^2 + 3)
}
rowSums(sapply(x[,c(2,3)], myfun))
I would suggest converting the data set into 'long' format, grouping it by sample, and then calculating the result. Here is the solution using data.table:
library(data.table)
melt(setDT(x), id.vars = 'sample')[, sum(value^2), by = sample]
#   sample  V1
#1:      1  65
#2:      2  89
#3:      3 117
You can easily replace value^2 by any function you want.
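For example, a hypothetical swap to summing absolute values instead of squares would be:
melt(setDT(x), id.vars = 'sample')[, sum(abs(value)), by = sample]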
You can use the apply function and select the columns you need with c(i1, i2, ..., etc.):
apply(x[, c(2, 3)]^2, 1, sum)
If you want to apply a function named somefunction to some of the columns, whose indices or colnames are in the vector col_indices, and then sum the results, you can do :
# if somefunction can be vectorized :
x$results<-apply(x[,col_indices],1,function(x) sum(somefunction(x)))
# if not :
x$results<-apply(x[,col_indices],1,function(x) sum(sapply(x,somefunction)))
I want to come at this one from a "no extensions" R POV.
It's important to remember what kind of data structure you are working with. Data frames are actually lists of vectors: each column is itself a vector. So you can use the handy-dandy lapply function to apply a function to the desired columns in the list/data frame.
I'm going to define the function as the square, as you have above, but of course it can be any function of any complexity (so long as it takes a vector as input and returns a vector of the same length; if it doesn't, it won't fit into the original data.frame!).
The steps below are extra pedantic to show each little bit, but obviously it can be compressed into one or two steps. Note that I only retain the sum of the squares of each column, given that you might want to save space in memory if you are working with lots and lots of data.
1. Create the data; define the function.
2. Grab the columns you want as a separate (temporary) data.frame.
3. Apply the function to the data.frame/list you just created.
4. lapply returns a list, so if you intend to retain it separately, make it a temporary data.frame. (This is not necessary.)
5. Calculate the sums of the rows of the temporary data.frame and append the result as a new column in x.
6. Remove the temp data.frame.
Code:
x <- data.frame(sample=1:3, a=4:6, b=7:9); square <- function(x) x^2 #step 1
x[2:3] #Step 2
temp <- data.frame(lapply(x[2:3], square)) #step 3 and step 4
x$squareRowSums <- rowSums(temp) #step 5
rm(temp) #step 6
Here is another apply solution:
cols <- c("a", "b")
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$result <- apply(x[, cols], 1, function(x) sum(x^2))

How can I include a column of matrices in a data.frame?

I want to store the output from many regression models including regression coefficients and information matrix from each model.
To store the results, it will be convenient if can use a data frame with two columns, one for the regression coefficients, and one for the information matrix. How can I create such a data frame?
res = data.frame(mu = I(matrix(0, m, n)), j = ???)
(It seems j should be an array in such a situation.)
You can, just not at the creation of the data.frame as you're trying. You can add it on later (as I show below). I've done the same thing on occasion and thus far no R gods have attempted to destroy me. Maybe not the best thing, but a data.frame is a list, so it can be done. Sometimes, though, the visual table format of the data.frame may be nicer than a list.
dat <- data.frame(coeff = 1:10)
dat$mats <- lapply(1:10, function(i) matrix(1:4, 2))
dat[1, 2]
## [[1]]
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
Data.frames work best when you have rectangular data, specifically a collection of atomic vectors of the same length. Trying to shove other data in there is not a good idea. Plus, adding rows one by one to a data.frame is not an efficient operation. The general container for all objects in R is the list. It can hold anything you like, and you can name the elements whatever you like. I'm not sure why you think you need a data.frame.
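A minimal sketch of that list-based approach (the names coef and infomat are only illustrative, not from the question):
res <- list(
  model1 = list(coef = c(0.5, 1.2), infomat = diag(2)),
  model2 = list(coef = c(0.1, 0.9), infomat = diag(2))
)
res$model1$infomat  # retrieve the information matrix for the first model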

Correct implementation of lapply

Insofar as I understand it, when using R it can be more elegant to use functions such as lapply rather than the for loops that are used more often than not in other (object-oriented) languages. However, I cannot get my head around the syntax and am making foolish errors when trying to implement simple tasks with the command. For example:
I have a series of data.frames loaded from csv files using a for loop. The following dummy data.frames adequately describe the data:
x <- c(0,10,11,12,13)
y <- c(1,NA,NA,NA,NA)
z <- c(2,20,21,22,23)
a <- c(0,6,5,4,3)
b <- c(1,7,8,9,10)
c <- c(2,NA,NA,NA,NA)
df1 <- data.frame(x,y,z)
df2 <- data.frame(a,b,c)
I first generate a list of data.frame names (data_names; I do this when loading the csv files) and then simply want to sum the columns. My attempt of course does not work:
lapply(data_names, function(df) {
  counts <- colSums(!is.na(data_names))
})
I could of course use lists (and I realise in the long run this maybe better) however from a pedagogical point of view I would like to understand lapply better.
Many thanks for any pointers
It's really just your use of is.na, and the fact that you don't need to use the assignment operator <- inside the function. lapply returns a list which is the result of applying FUN to each element of the input list. You assign the output of lapply to a variable, e.g. res <- lapply( .... , FUN ).
I'm also not too sure how you made the list initially, but the below should suffice. You also don't need an anonymous function in this case; you can use the named colSums and also provide the na.rm = TRUE argument to take care of pesky NAs in your data:
lapply( list( df1, df2 ) , colSums , na.rm = TRUE )
[[1]]
 x  y  z
46  1 88

[[2]]
 a  b  c
18 35  2
So you can read this as:
For each df in the list:
apply colSums with the argument na.rm = TRUE
The result is a list, each element of which is the result of applying colSums to each df in the list.
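If the original goal was counting the non-NA values per column, as in the question's attempt, the anonymous-function form would look like this (a small sketch):
lapply(list(df1, df2), function(df) colSums(!is.na(df)))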
