calculate means and occurences from multiple matrices - r

I have a number of matrices that they all have the same type of elements but different lengths. Columns in all files are the same (lets call them "A" and "B") but rows between files are mostly the same elements but not always.
Here are some example data (in the form of dataframes)
df1 <- data.frame(A = 1:3, B = 3:1)
rownames(df1)=c("alpha","beta","gamma")
df2 <- data.frame(A = 1:5,B = 5:1)
rownames(df2)=c("alpha","beta","delta","gamma","zeta")
df3 <- data.frame(A = 1:7, B = 7:1)
rownames(df3)=c("alpha","beta","delta","gamma","zeta","theta","epsilon")
as you can see as far as the rows go even though "alpha","beta" and "gamma" are always present many of the others are not always there
I would like to calculate 2 things:
the average values of all A and B columns in all matrices and ideally that would be by creating an ave.matr that would have all rownames and the average/mean values of the columns "A" and "B"
A B
alpha 1 7
beta 2 6
delta 3 5
gamma 4 4
zeta 5 3
theta 6 2
epsilon 7 1
(where the above numbers are the mean values of all matrices)
and then an occurrence matrix, lets call it occur.matr that would count the number of occurrences of each row across all matrices and it should look like that
A B
alpha 3
beta 3
delta 2
gamma 3
zeta 2
theta 1
epsilon 1
I have started working on this today but I cannot figure out how to do it.
I started by creating a list and a matrix with the unique rownames from all matrices
list=c(rownames(df1),rownames(df2),rownames(df3))
unique=unique(list)
avematr<-matrix(NA,nrow=length(unique),ncol=2)
and my next step would be to make rownames of all matrices identical. I tried with match but i cannot figure it out but at this moment I dont even know if this is the best strategy...
And all similar questions out there are related to merging the matrices (which is not what I want to do).
Any help is greatly appreciated

Here is a tidyverse approach:
library(tidyverse)
df1 <- data.frame(A = 1:3, B = 3:1)
rownames(df1)=c("alpha","beta","gamma")
df2 <- data.frame(A = 1:5,B = 5:1)
rownames(df2)=c("alpha","beta","delta","gamma","zeta")
df3 <- data.frame(A = 1:7, B = 7:1)
rownames(df3)=c("alpha","beta","delta","gamma","zeta","theta","epsilon")
dat <- list(df1, df2, df3) %>%
map_dfr(rownames_to_column)
avg_dat <- dat %>%
group_by(id) %>%
summarise(A = mean(A),
B = mean(B))
#> `summarise()` ungrouping output (override with `.groups` argument)
avg_dat
#> # A tibble: 7 x 3
#> id A B
#> <chr> <dbl> <dbl>
#> 1 alpha 1 5
#> 2 beta 2 4
#> 3 delta 3 4
#> 4 epsilon 7 1
#> 5 gamma 3.67 2.33
#> 6 theta 6 2
#> 7 zeta 5 2
occ_dat <- dat %>% count(id)
occ_dat
#> id n
#> 1 alpha 3
#> 2 beta 3
#> 3 delta 2
#> 4 epsilon 1
#> 5 gamma 3
#> 6 theta 1
#> 7 zeta 2
Created on 2021-01-27 by the reprex package (v0.3.0)

If you want to stick to base R:
For the averaging task it makes things easier when you add your rowname as a column. This prevents autonumbering of rownames when combining the dataframes. You then can simply loop over every unique rowname and construct the averages. A quick and dirty solution could look like this:
df1 <- data.frame(A = 1:3, B = 3:1)
rownames(df1)=c("alpha","beta","gamma")
df2 <- data.frame(A = 1:5,B = 5:1)
rownames(df2)=c("alpha","beta","delta","gamma","zeta")
df3 <- data.frame(A = 1:7, B = 7:1)
rownames(df3)=c("alpha","beta","delta","gamma","zeta","theta","epsilon")
add_row_names_to_df <- function(df) {
df$rn <- rownames(df)
return(df)
}
new_df <- rbind(add_row_names_to_df(df1),
add_row_names_to_df(df2),
add_row_names_to_df(df3))
avg_df <- as.data.frame(matrix(unique(new_df$rn),
nrow = length(unique(new_df$rn)),
ncol = 3))
for(i in 1:nrow(avg_df)) {
avg.df[i,] <- c(avg_df[i,1],
mean(new_df$A[new_df$rn==avg_df[i,1]]),
mean(new_df$B[new_df$rn==avg_df[i,1]]))
}
colnames(avg_df) <- c("rowname", "avgA", "avgB")
avg_df
results in:
rowname avgA avgB
1 alpha 1 5
2 beta 2 4
3 gamma 3.66666666666667 2.33333333333333
4 delta 3 4
5 zeta 5 2
6 theta 6 2
7 epsilon 7 1
For the occurence matrix you can use the table() function from R:
as.matrix(table(c(rownames(df1),rownames(df2),rownames(df3))))
yields:
[,1]
alpha 3
beta 3
delta 2
epsilon 1
gamma 3
theta 1
zeta 2

Related

Split character vector values by fixed delimiter - return data frame with one row for each vector value

The delimiter is present in all vector values, and only once, thus each vector value should result in exactly one pair and the result should be a two column data frame.
I am not unhappy about my solution, but wondered if there might be cool functions around that make this easier. Open for any package, but base R preferred.
test <- rep("a,b", 5)
# expected result
data.frame(t(do.call(cbind, strsplit(test, ","))))
#> X1 X2
#> 1 a b
#> 2 a b
#> 3 a b
#> 4 a b
#> 5 a b
You can use tidyr::separate().
test <- data.frame(x = rep("a,b", 5))
separate(test,x, c("X1","X2"))
#> X1 X2
#> 1 a b
#> 2 a b
#> 3 a b
#> 4 a b
#> 5 a b
You can use extract:
library(tidyr)
data.frame(test) %>%
extract(col = test,
into = c("X1", "X2"),
regex = "(.),(.)")
X1 X2
1 a b
2 a b
3 a b
4 a b
5 a b
Data:
test <- rep("a,b", 5)

Separate data frame into ID columns (with only one unique value), and variable columns (with more than one value)

I have a data frame with several ID cols containing only one unique value and columns that actually contain variables. How to separate those?
I have come up with the following approach using a conditional statement in sapply, but I wondered if there may be a more elegant way to do that?
I am happy with any package, and any output where the data frames are separated, this can also be in a list. Each frame does not need to be assigned to a new object.
mydf <- data.frame(a = 'a', b = 'b', val1 = 1:10, val2 = 10:1)
head(mydf,3)
#> a b val1 val2
#> 1 a b 1 10
#> 2 a b 2 9
#> 3 a b 3 8
id_cols <- mydf[sapply(names(mydf), function(x) {length(unique(mydf[[x]])) == 1})]
variable_cols <- mydf[sapply(names(mydf), function(x) {length(unique(mydf[[x]])) != 1})]
head(id_cols, 3)
#> a b
#> 1 a b
#> 2 a b
#> 3 a b
head(variable_cols, 3)
#> val1 val2
#> 1 1 10
#> 2 2 9
#> 3 3 8
Created on 2020-04-02 by the reprex package (v0.3.0)
A very, very slightly shorter way would be
Var = lengths(lapply(mydf, unique)) > 1
id_cols = mydf[, Var]
variable_cols = mydf[, !Var]

Creating a mean variable, from variable names and weights supplied by vectors

Suppose I want to create a mean variable in a given dataframe based on two vectors, one specifying the names of the variables to use, and one specifying weights by which these variables should go into the mean variable:
vars <- c("a", "b", "c","d"))
weights <- c(0.5, 0.7, 0.8, 0.2))
df <- data.frame(cbind(c(1,4,5,7), c(2,3,7,5), c(1,1,2,3),
c(4,5,3,3), c(3,2,2,1), c(5,5,7,1)))
colnames(df) <- c("a","b","c","d","e","f")
How could I use dplyr::mutate() to create a mean variable that uses vars and weights to calculate a rowwise score? mutate() should specifically use the variables supplied by vars
The result should basically do the following:
df <- df %>%
rowwise() %>%
mutate(comp = mean(c(vars[1]*weights[1], vars[2]*weights[2], ...)))
Written out:
df2 <- df %>%
rowwise() %>%
mutate(comp = mean(c(0.5*a, 0.7*b, 0.8*c, 0.2*d)))
I can't figure out how to do this because, although vars contains the exact variable names that I want to use for mutate in my df, inside vars they are strings. How could I make mutate() understand that the strings vars contains relate to columns in my df? If you know another procedure not using mutate() that's fine also. Thanks!
You may use
df %>% mutate(wmean = apply(.[vars], 1, weighted.mean, weights))
# a b c d e f mean
# 1 1 2 1 4 3 5 1.590909
# 2 4 3 1 5 2 5 2.681818
# 3 5 7 2 3 2 7 4.363636
# 4 7 5 3 3 1 1 4.545455
but there is not much to gain with tidyverse as base R approaches can be almost the same and end up being shorter:
df$wmean <- apply(df[vars], 1, weighted.mean, weights)
or one of the following:
df$wmean <- colSums(t(df[vars]) * weights) / sum(weights)
df$wmean <- as.matrix(df[vars]) %*% weights / sum(weights)
df$wmean <- rowSums(sweep(df[vars], 2, weights, `*`)) / sum(weights)
Row-wise operations can be a bit tricky in the tidyverse. This is a case where some base R knowledge can be really handy. For example, you can do it in one line with apply (note that I corrected a typo in the line that creates weights and drop columns e and f, which do not have weights):
vars <- c("a", "b", "c","d")
weights <- c(0.5, 0.7, 0.8, 0.2)
df <- data.frame(cbind(c(1,4,5,7), c(2,3,7,5), c(1,1,2,3),
c(4,5,3,3), c(3,2,2,1), c(5,5,7,1)))
colnames(df) <- c("a","b","c","d","e","f")
df$weighted.mean <- apply(df %>% select(-e, -f), 1, weighted.mean, weights)
a b c d e f weighted.mean
1 1 2 1 4 3 5 1.590909
2 4 3 1 5 2 5 2.681818
3 5 7 2 3 2 7 4.363636
4 7 5 3 3 1 1 4.545455
If you really wanted to do it in the tidyverse, this should get you started:
library(tidyverse)
df.weights <- data.frame(vars, weights)
df.new <- df %>%
mutate(row.num = 1:n()) %>%
gather(variable, value, -row.num) %>%
left_join(df.weights, by = c(variable = 'vars')) %>%
filter(variable %in% vars) %>%
group_by(row.num) %>%
mutate(weighted.mean = weighted.mean(value, weights))
There should be a tidyverse solution using pmap, but it eludes me. Here's another approach using tidyverse packages purrr and tibble
library(tidyverse)
vars <- c("a", "b", "c", "d")
weights <- c(0.5, 0.7, 0.8, 0.2)
df <- data.frame(cbind(c(1,4,5,7), c(2,3,7,5), c(1,1,2,3),
c(4,5,3,3), c(3,2,2,1), c(5,5,7,1)))
colnames(df) <- c("a","b","c","d","e","f")
df %>%
transpose() %>%
simplify_all() %>%
map_dbl(~weighted.mean(.x[vars], weights)) %>%
add_column(df, wmean = .)
#> a b c d e f wmean
#> 1 1 2 1 4 3 5 1.590909
#> 2 4 3 1 5 2 5 2.681818
#> 3 5 7 2 3 2 7 4.363636
#> 4 7 5 3 3 1 1 4.545455
Created on 2018-11-24 by the reprex package (v0.2.1)

Recursively sum data frames for matching rows

I would like to combine a set of data frames into a single data frame by summing columns that have matching variables (instead of appending columns).
For example, given
df1 <- data.frame(A = c(0,0,1,1,1,2,2), B = c(1,2,1,2,3,1,5), x = c(2,3,1,5,3,7,0))
df2 <- data.frame(A = c(0,1,1,2,2,2), B = c(1,1,3,2,4,5), x = c(4,8,4,1,0,3))
df3 <- data.frame(A = c(0,1,2), B = c(5,4,2), x = c(5,3,1))
I want to match by "A" and "B" and sum the values of "x". For this example, I can get the desired result as follows:
library(plyr)
library(dplyr)
# rename columns so that join_all preserves them all:
colnames(df1)[3] <- "x1"
colnames(df2)[3] <- "x2"
colnames(df3)[3] <- "x3"
# join the data frames by matching "A" and "B" values:
res <- join_all(list(df1, df2, df3), by = c("A", "B"), type = "full")
# get the sums and drop superfluous columns:
arrange(res, A, B) %>%
rowwise() %>%
mutate(x = sum(x1, x2, x3, na.rm = TRUE)) %>%
select(A, B, x)
Result:
A B x
<dbl> <dbl> <dbl>
1 0 1 6
2 0 2 3
3 0 5 5
4 1 1 9
5 1 2 5
6 1 3 7
7 1 4 3
8 2 1 7
9 2 2 2
10 2 4 0
11 2 5 3
A more general solution is
library(dplyr)
# function to get the desired result for two data frames:
my_merge <- function(df1, df2)
{
m1 <- merge(df1, df2, by = c("A", "B"), all = TRUE)
m1 <- rowwise(res) %>%
mutate(x = sum(x.x, x.y, na.rm = TRUE)) %>%
select(A, B, x)
return(m1)
}
l1 <- list(df2, df3) # omit the first data frame
res <- df1 # initial value of the result
for(df in l1) res <- my_merge(res, df) # call the function repeatedly
Is there a more efficient option for combining a large set of data frames? Ideally it should be recursive (i.e. it's better not to join all data frames into one massive data frame before calculating the sums).
An easier option is to bind the rows of the datasets, then group by the columns of interest and get the summarised output by getting the sum of 'x'
library(tidyverse)
bind_rows(df1, df2, df3) %>%
group_by(A, B) %>%
summarise(x = sum(x))
# A tibble: 11 x 3
# Groups: A [?]
# A B x
# <dbl> <dbl> <dbl>
# 1 0 1 6
# 2 0 2 3
# 3 0 5 5
# 4 1 1 9
# 5 1 2 5
# 6 1 3 7
# 7 1 4 3
# 8 2 1 7
# 9 2 2 2
#10 2 4 0
#11 2 5 3
If there are many objects in the global environment with the pattern "df" followed by some digits
mget(ls(pattern= "^df\\d+")) %>%
bind_rows %>%
group_by(A, B) %>%
summarise(x = sum(x))
As the OP mentioned about memory constraints, if we do the join first and then use rowSums or + with reduce, it would be more efficient
mget(ls(pattern= "^df\\d+")) %>%
reduce(full_join, by = c("A", "B")) %>%
transmute(A, B, x = rowSums(.[3:5], na.rm = TRUE)) %>%
arrange(A, B)
# A B x
#1 0 1 6
#2 0 2 3
#3 0 5 5
#4 1 1 9
#5 1 2 5
#6 1 3 7
#7 1 4 3
#8 2 1 7
#9 2 2 2
#10 2 4 0
#11 2 5 3
This could also be done with data.table
library(data.table)
rbindlist(mget(ls(pattern= "^df\\d+")))[, .(x = sum(x)), by = .(A, B)]
Ideally it should be recursive (i.e. it's better not to join all data frames into one massive data frame before calculating the sums).
If you're memory constrained and willing to sacrifice speed (vs #akrun's data.table approach), use one table at a time in a loop:
library(data.table)
tabs = c("df1", "df2", "df3")
# enumerate all combos for the results table
# initializing sum to 0
res = CJ(A = 0:2, B = 1:5, x = 0)
# loop over tabs, adding on
for (i in seq_along(tabs)){
tab = get(tabs[[i]])
res[tab, on=.(A, B), x := x + i.x][]
rm(tab)
}
If you need to read tables from disk, change tabs to file names and get to fread or whatever function.
I am skeptical that you can fit all the tables in memory, but cannot also fit an rbind-ed copy of them together.
Similarly (thanks to #akrun's comment), use his approach pairwise:
res = data.table(get(tabs[[1]]))[0L]
for (i in seq_along(tabs)){
tab = get(tabs[[i]])
res = rbind(res, tab)[, .(x = sum(x)), by=.(A,B)]
rm(tab)
}

Bind data frames on longer identifiers R

I've got two data frames in which the unique identifiers common to both frames differ in the number of observations. I would like to create a dataframe from both in which the observations from each frame are taken if they have more observations for a common identifier. For example:
f1 <- data.frame(x = c("a", "a", "b", "c", "c", "c"), y = c(1,1,2,3,3,3))
f2 <- data.frame(x = c("a","b", "b", "c", "c"), y = c(4,5,5,6,6))
I would like this to generate a merge based on the longer x such that it produces:
x y
a 1
a 1
b 5
b 5
c 3
c 3
c 3
Any and all thoughts would be great.
Here's a solution using split
dd<-rbind(cbind(f1, s="f1"), cbind(f2, s="f2"))
keep<-unsplit(lapply(split(dd$s, dd$x), FUN=function(x) {
y<-table(x)
x == names(y[which.max(y)])
}), dd$x)
dd <- dd[keep,]
Normally i'd prefer to use the ave function here but because i'm changing data.types from a factor to a logical, it wasn't as appropriate so I basically copied the idea that ave uses and used split.
dplyr solution
library(dplyr)
First we combine the data:
with rbind() and introduce a new variable called ref to know where each observation came from:
both <- rbind( f1, f2 )
both$ref <- rep( c( "f1", "f2" ) , c( nrow(f1), nrow(f2) ) )
then count the observations:
make another new variable that contains how many observations for each ref and x combination:
both_with_counts <- both %>%
group_by( ref ,x ) %>%
mutate( counts = n() )
then filter for the largest count:
both_with_counts %>% group_by( x ) %>% filter( n==max(n) )
note: you could also select only the x and y cols with select(x,y)...
this gives:
## Source: local data frame [7 x 4]
## Groups: x
##
## x y ref counts
## 1 a 1 f1 2
## 2 a 1 f1 2
## 3 c 3 f1 3
## 4 c 3 f1 3
## 5 c 3 f1 3
## 6 b 5 f2 2
## 7 b 5 f2 2
Altogether now...
what_I_want <-
rbind(cbind(f1,ref = "f1"),cbind(f2,ref = "f2")) %>%
group_by(ref,x) %>%
mutate(counts = n()) %>%
group_by( x ) %>%
filter( counts==max(counts) ) %>%
select( x, y )
and thus:
> what_I_want
# Source: local data frame [7 x 2]
# Groups: x
#
# x y
# 1 a 1
# 2 a 1
# 3 c 3
# 4 c 3
# 5 c 3
# 6 b 5
# 7 b 5
Not a elegant answer but still give the desired result. Hope this help.
f1table <- data.frame(table(f1$x))
colnames(f1table) <- c("x","freq")
f1new <- merge(f1,f1table)
f2table <- data.frame(table(f2$x))
colnames(f2table) <- c("x","freq")
f2new <- merge(f2,f2table)
table <- rbind(f1table, f2table)
table <- table[with(table, order(x,-freq)), ]
table <- table[!duplicated(table$x), ]
data <-rbind(f1new, f2new)
merge(data, table, by=c("x","freq"))[,c(1,3)]
x y
1 a 1
2 a 1
3 b 5
4 b 5
5 c 3
6 c 3
7 c 3

Resources