Mean of each row of data.frames column nested in list - r

I have many data.frames nested in a list and want to get the row means of a column in the data.frames. Here is my MWE. I am wondering how to accomplish this in R?
set.seed(12345)
df1 <- data.frame(x=rnorm(10))
df2 <- data.frame(x=rnorm(10))
ls1 <- list(df1=df1, df2=df2)
ls1
$df1
x
1 0.5855288
2 0.7094660
3 -0.1093033
4 -0.4534972
5 0.6058875
6 -1.8179560
7 0.6300986
8 -0.2761841
9 -0.2841597
10 -0.9193220
$df2
x
1 -0.1162478
2 1.8173120
3 0.3706279
4 0.5202165
5 -0.7505320
6 0.8168998
7 -0.8863575
8 -0.3315776
9 1.1207127
10 0.2987237
Something like this one
(ls1$df1+ls1$df2)/2
x
1 0.23464051
2 1.26338903
3 0.13066227
4 0.03335964
5 -0.07232227
6 -0.50052806
7 -0.12812949
8 -0.30388085
9 0.41827645
10 -0.31029915
Edited
ls1 <- list(df=df1)
ls2 <- list(df=df2)
How can (ls1$df+ls2$df)/2 be written more coherently?
x
1 0.23464051
2 1.26338903
3 0.13066227
4 0.03335964
5 -0.07232227
6 -0.50052806
7 -0.12812949
8 -0.30388085
9 0.41827645
10 -0.31029915

Extract the column, cbind the data.frames, and calculate the row means:
rowMeans(do.call(cbind, lapply(ls1, "[", "x")))

If there are no NA values, another option is to add the data.frames element-wise (+) with Reduce and divide by the length of 'ls1':
Reduce(`+`, ls1)/length(ls1)
# x
#1 0.23464051
#2 1.26338903
#3 0.13066227
#4 0.03335964
#5 -0.07232227
#6 -0.50052806
#7 -0.12812949
#8 -0.30388085
#9 0.41827645
#10 -0.31029915
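If some values may be NA, a variation on the rowMeans idea can skip them. The following is only a sketch using the question's data; the names m, la and lb are made up for illustration. The last part shows one possible way to handle the "Edited" case, where the data.frames sit in two separate lists with matching element names.

```r
set.seed(12345)
df1 <- data.frame(x = rnorm(10))
df2 <- data.frame(x = rnorm(10))
ls1 <- list(df1 = df1, df2 = df2)

# Pull column "x" out of every data.frame; sapply returns a 10 x 2 matrix
m <- sapply(ls1, `[[`, "x")

# Row-wise mean; na.rm = TRUE ignores missing values instead of returning NA
row_mean <- rowMeans(m, na.rm = TRUE)

# "Edited" case: two lists (hypothetical names la, lb) with matching elements
la <- list(df = df1)
lb <- list(df = df2)
pair_mean <- Map(function(a, b) (a + b) / 2, la, lb)
```

Here row_mean is a plain numeric vector, while pair_mean is a list holding one data.frame per element name, so pair_mean$df has the same shape as (ls1$df+ls2$df)/2.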

Related

Adding new columns to dataframe with suffix

I want to subtract one column from another and create a new one using the corresponding suffix in the first column. I have approx 50 columns
I can do it "manually" as follows...
df$new1 <- df$col_a1 - df$col_b1
df$new2 <- df$col_a2 - df$col_b2
What is the easiest way to create a loop that does the job for me?
We can use grep to identify the columns that have "col_a" and "col_b" in their names and subtract them directly.
a_cols <- grep("col_a", names(df))
b_cols <- grep("col_b", names(df))
df[paste0("new", seq_along(a_cols))] <- df[a_cols] - df[b_cols]
df
# col_a1 col_a2 col_b1 col_b2 new1 new2
#1 10 15 1 5 9 10
#2 9 14 2 6 7 8
#3 8 13 3 7 5 6
#4 7 12 4 8 3 4
#5 6 11 5 9 1 2
#6 5 10 6 10 -1 0
data
Tested on this data
df <- data.frame(col_a1 = 10:5, col_a2 = 15:10, col_b1 = 1:6, col_b2 = 5:10)
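The grep approach pairs the columns by position, so it assumes the "a" and "b" columns appear in the same order. If that is not guaranteed, one possible variation is to pair them by suffix instead. This is a sketch on the same test data; the variable names suffix and b_cols are only illustrative.

```r
df <- data.frame(col_a1 = 10:5, col_a2 = 15:10, col_b1 = 1:6, col_b2 = 5:10)

# Names of the "a" columns and the suffix that follows "col_a"
a_cols <- grep("^col_a", names(df), value = TRUE)
suffix <- sub("^col_a", "", a_cols)

# Build the matching "b" column names from the same suffixes
b_cols <- paste0("col_b", suffix)
stopifnot(all(b_cols %in% names(df)))  # every "a" column needs a partner

# Subtract pairwise and store the result under new<suffix>
df[paste0("new", suffix)] <- df[a_cols] - df[b_cols]
```

With the test data this should give the same new1 and new2 columns as above, but the pairing no longer depends on column order.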

How do I add observations to an existing data frame column?

I have a data frame. Let's say it looks like this:
Input data set
I have simulated some values and put them into a vector c(4,5,8,8). I want to add these simulated values to columns a, b and c.
I have tried rbind or inserting the vector into the existing data frame, but that replaced the existing values with the simulated ones, instead of adding the simulated values below the existing ones.
x <- data.frame("a" = c(2,3,1), "b" = c(5,1,2), "c" = c(6,4,7))
y <- c(4,5,8,8)
This is the output I expect to see:
Output
Help would be greatly appreciated. Thank you.
Can do:
as.data.frame(sapply(x,
function(z)
append(z,y)))
a b c
1 2 5 6
2 3 1 4
3 1 2 7
4 4 4 4
5 5 5 5
6 8 8 8
7 8 8 8
An option is assignment
n <- nrow(x)
x[n + seq_along(y), ] <- y
x
# a b c
#1 2 5 6
#2 3 1 4
#3 1 2 7
#4 4 4 4
#5 5 5 5
#6 8 8 8
#7 8 8 8
Another option is to replicate 'y' and rbind
rbind(x, `colnames<-`(replicate(ncol(x), y), names(x)))
Or assign to the new row indices explicitly (starting from the original 'x'):
x[(nrow(x)+1):(nrow(x)+length(y)),] <- y
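A slightly more explicit take on the rbind idea, in case it helps: build a small data.frame that repeats 'y' once per column and bind it below 'x'. This is just a sketch; new_rows and out are names chosen for illustration.

```r
x <- data.frame(a = c(2, 3, 1), b = c(5, 1, 2), c = c(6, 4, 7))
y <- c(4, 5, 8, 8)

# One copy of y per column of x, named like the columns of x
new_rows <- as.data.frame(setNames(rep(list(y), ncol(x)), names(x)))

# Append below the existing rows; x itself is left unchanged
out <- rbind(x, new_rows)
```

Because the result goes into a new object, the original 'x' stays available, which avoids appending the same values twice when the code is re-run.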

How to use stack to produce multiple columns data frame?

I want to convert a list of lists into a data.frame. At first each sublist was only of length 1, so I used stack(as.data.frame(...)), but stack does not seem to be able to produce a multi-column data.frame. So what is the best way to achieve that:
# works fine with only sublists of length 1
l = list(a = sample(1:5, 5), b = sample(1:5, 5))
> stack(as.data.frame(l))
values ind
1 5 a
2 4 a
3 1 a
4 2 a
5 3 a
6 2 b
7 1 b
8 3 b
9 5 b
10 4 b
Now my list is a list of lists:
l = list(a = list(first = sample(1:5, 5), sec = sample(1:5, 5)), b = list(first = sample(1:5, 5), sec = sample(1:5, 5)))
stack(as.data.frame(l))
values ind
1 4 a.first
2 5 a.first
3 3 a.first
4 1 a.first
5 2 a.first
6 3 a.sec
7 5 a.sec
8 1 a.sec
9 2 a.sec
10 4 a.sec
11 5 b.first
12 4 b.first
13 3 b.first
14 1 b.first
15 2 b.first
16 3 b.sec
17 4 b.sec
18 1 b.sec
19 2 b.sec
20 5 b.sec
whereas I'd still like to have a column ind with a and b, plus two columns first and sec.
We can flatten the list by concatenating (c) the nested elements ('l1'), then derive two substrings from the names of 'l1': 'nm1', obtained by removing the prefix up to and including the first ., and 'nm2', obtained by removing the suffix starting with the first .. We set the names of 'l1' to 'nm2', split it by 'nm1', and loop through the resulting list to stack each element ('lst'). Finally, we cbind the 'ind' column (which is the same in all the list elements, so we take it from the first one, lst[[1]][2]) with the 'values' column, i.e. the first column, of each element.
l1 <- do.call(c, l)
nm1 <- sub("[^.]+\\.", "", names(l1))
nm2 <- sub("\\..*", "", names(l1))
lst <- lapply(split(setNames(l1, nm2), nm1), stack)
cbind(lst[[1]][2],lapply(lst, `[[`, 1))
# ind first sec
#1 a 1 1
#2 a 5 5
#3 a 4 4
#4 a 3 3
#5 a 2 2
#6 b 3 4
#7 b 4 5
#8 b 2 2
#9 b 1 3
#10 b 5 1
Or using dplyr/purrr we can get the expected output.
library(purrr)
library(dplyr)
l1 <- transpose(l)
n1 <- names(l1)
l1 %>%
map(stack) %>%
bind_cols %>%
setNames(., make.unique(names(.))) %>%
select(ind, matches("value")) %>%
setNames(., c("ind", n1))
# ind first sec
# (fctr) (int) (int)
#1 a 1 1
#2 a 5 5
#3 a 4 4
#4 a 3 3
#5 a 2 2
#6 b 3 4
#7 b 4 5
#8 b 2 2
#9 b 1 3
#10 b 5 1
Here is another approach:
df <- stack(as.data.frame(l))
# split names of variables
indVars <- strsplit(as.character(df$ind), split="\\.")
# add variables to data.frame
df$letters <- sapply(indVars, function(i) i[1])
df$order <- sapply(indVars, function(i) i[2])
# get final data.frame
cbind("order"=unstack(df, letters~order)[,1], unstack(df, values~order))
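For comparison, here is a base-R sketch that avoids parsing the names: turn each sublist into a data.frame, tag it with its outer name, and rbind the pieces. It assumes every sublist has the same element names and lengths; set.seed(1), pieces and out are illustrative choices, not part of the answers above.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
l <- list(a = list(first = sample(1:5, 5), sec = sample(1:5, 5)),
          b = list(first = sample(1:5, 5), sec = sample(1:5, 5)))

# One data.frame per outer element, with the outer name repeated in "ind"
pieces <- lapply(names(l), function(nm) {
  data.frame(ind = nm, as.data.frame(l[[nm]]))
})

# Stack the pieces on top of each other: columns ind, first, sec
out <- do.call(rbind, pieces)
```

The result has one row per value position and one column per inner name, which is the layout asked for in the question.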

r cumsum-like function for splitting dataframe

Given the following dataframe:
mydf <- data.frame(x=c(1:10,10:1),y=c(10:1,1:10))
How is it possible to split it such that each sub-dataframe will have consecutive values of one column which are greater than the other column?
For example in mydf, the outcome that I am hoping for is splitting it into three dataframes:
(y > x; should contain the first 5 rows of mydf)
(x > y; should contain rows 6 to 15 of mydf)
(y > x again; should contain the last 5 rows of mydf)
I tried using the following code but it produced bad results where each y > x would be split individually; moreover, dataframes where x > y would contain a y > x in the first row:
split(mydf, cumsum(mydf$x > mydf$y))
Another, less elegant approach I tried is sapply with individual ifs inside the split function, but I don't want to go down this path because of performance issues.
Try
rl <- with(mydf, rle(x >y))
grp <- inverse.rle(within.list(rl , values <- seq_along(values)))
split(mydf, grp)
#$`1`
# x y
#1 1 10
#2 2 9
#3 3 8
#4 4 7
#5 5 6
#$`2`
# x y
#6 6 5
#7 7 4
#8 8 3
#9 9 2
#10 10 1
#11 10 1
#12 9 2
#13 8 3
#14 7 4
#15 6 5
#$`3`
# x y
#16 5 6
#17 4 7
#18 3 8
#19 2 9
#20 1 10
Or
group <- with(mydf, cumsum(c(1,abs(diff(x >y)))))
split(mydf, group)
Or you can use rleid from the devel version of data.table, i.e. v1.9.5 (from @David Arenburg's comments). Instructions to install it are here
library(data.table)
split(mydf, rleid(with(mydf, y > x)))
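All three variants rely on the same idea: start a new group whenever the comparison x > y changes value. A step-by-step sketch of that idea in base R, with the intermediate names flag and changed chosen only for illustration:

```r
mydf <- data.frame(x = c(1:10, 10:1), y = c(10:1, 1:10))

# TRUE where x exceeds y, FALSE otherwise
flag <- mydf$x > mydf$y

# TRUE at the first row and wherever flag differs from the previous row
changed <- c(TRUE, flag[-1] != flag[-length(flag)])

# Running count of changes gives one id per run of consecutive equal flags
grp <- cumsum(changed)

parts <- split(mydf, grp)
sizes <- sapply(parts, nrow)
```

For mydf this should give three pieces with 5, 10 and 5 rows. The cumsum in the question counted every TRUE row instead of every change, which is why each row ended up in its own group.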

How to reshape multiple rows to a single row with several columns

This may seem an obvious question for someone who has practice with the reshape package, but I'm trying to get used to its functions and I can't figure out the right syntax!
Let's have the following data frame,
df <- data.frame(matrix(1:12,ncol=3),row.names=letters[1:4])
X1 X2 X3
a 1 5 9
b 2 6 10
c 3 7 11
d 4 8 12
how can we bind the rows into columns in order to get the following result?
X1.a X2.a X3.a X1.b X2.b X3.b X1.c X2.c X3.c X1.d X2.d X3.d
1 5 9 2 6 10 3 7 11 4 8 12
Thank you
This too would work:
vec <- c(t(df))
names(vec) <- c(outer(colnames(df), rownames(df), paste, sep="."))
## > vec
## X1.a X2.a X3.a X1.b X2.b X3.b X1.c X2.c X3.c X1.d X2.d X3.d
## 1 5 9 2 6 10 3 7 11 4 8 12
Since you want it as a vector, there's no need for reshape per se. You can just unlist it and then use setNames to set the names accordingly.
df.t <- as.data.frame(t(df))
vec <- unlist(df.t, use.names=FALSE) # gives a vector not matrix/data.frame
vec.names <- do.call(paste, c(expand.grid(rownames(df.t), colnames(df.t)), sep="."))
vec <- setNames(vec, vec.names)
# X1.a X2.a X3.a X1.b X2.b X3.b X1.c X2.c X3.c X1.d X2.d X3.d
# 1 5 9 2 6 10 3 7 11 4 8 12
Here's one:
library(reshape) # provides melt() and cast()
m <- melt(cbind(df, rn=rownames(df)), id.vars='rn')
cast(m, ~ rn + variable)
## value a_X1 a_X2 a_X3 b_X1 b_X2 b_X3 c_X1 c_X2 c_X3 d_X1 d_X2 d_X3
## 1 (all) 1 5 9 2 6 10 3 7 11 4 8 12
Or, as Arun indicates, acast (from the reshape2 package) gives a matrix (without the additional value column):
acast(m, . ~ variable+rn)
## X1_a X1_b X1_c X1_d X2_a X2_b X2_c X2_d X3_a X3_b X3_c X3_d
## [1,] 1 2 3 4 5 6 7 8 9 10 11 12
(Note that the permutation is in the other order, due to the formula being flipped.)
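If a one-row data.frame is wanted rather than a named vector or matrix, the named-vector answer can be taken one step further. A minimal sketch, where wide is just an illustrative name:

```r
df <- data.frame(matrix(1:12, ncol = 3), row.names = letters[1:4])

# Read the values row by row and label them as <column>.<row>
vec <- c(t(df))
names(vec) <- c(outer(colnames(df), rownames(df), paste, sep = "."))

# t() turns the named vector into a 1 x 12 matrix; then make it a data.frame
wide <- as.data.frame(t(vec))
```

The column order follows the question's expected output (X1.a, X2.a, X3.a, X1.b, ...), since the names come from the same outer() call as in the vector version.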
