R - Sum list of matrix with different columns - r

I have a large list of matrix with different columns and I would like to sum these matrix counting 0 if column X does not exist in one matrix.
If you have used the function rbind.fill from plyr I would like something similar but with sum function. Of course I could build a function to do that, but I'm thinking about a native function efficiently programmed in Frotrain or C due to my large data.
Here an example:
This is the easy example where I have the same columns:
aa <- list(
m1 = matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, dimnames = list(c(1,2,3),c('a','b','c'))),
m2 = matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, dimnames = list(c(1,2,3),c('a','b','c')))
)
aa
Reduce('+',aa)
Giving the results:
> aa
$m1
a b c
1 1 4 7
2 2 5 8
3 3 6 9
$m2
a b c
1 1 4 7
2 2 5 8
3 3 6 9
> Reduce('+',aa)
a b c
1 2 8 14
2 4 10 16
3 6 12 18
And with my data:
bb <- list(
m1 = matrix(c(1,2,3,7,8,9), nrow = 3, dimnames = list(c(1,2,3),c('a','c'))),
m2 = matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, dimnames = list(c(1,2,3),c('a','b','c')))
)
bb
Reduce('+',bb)
Here I would like to have b = c(0,0,0) in the first matrix to sum them.
> bb
$m1
a c
1 1 7
2 2 8
3 3 9
$m2
a b c
1 1 4 7
2 2 5 8
3 3 6 9
Many thanks!
Xevi

One option would be
un1 <- sort(unique(unlist(lapply(bb, colnames))))
bb1 <- lapply(bb, function(x) {
nm1 <- setdiff(un1, colnames(x))
m1 <- matrix(0, nrow = nrow(x), ncol = length(nm1), dimnames = list(NULL, nm1))
cbind(x, m1)[, un1]})
and use the Reduce
Reduce(`+`, bb1)
# a b c
# 1 2 4 14
# 2 4 5 16
# 3 6 6 18

Related

Remove column name pattern in multiple dataframes in R

I have >100 dataframes loaded into R with column name prefixes in some but not all columns that I would like to remove. In the below example with 3 dataframes, I would like to remove the pattern x__ in the 3 dataframes but keep all the dataframe names and everything else the same. How could this be done?
df1 <- data.frame(`x__a` = rep(3, 5), `x__b` = seq(1, 5, 1), `x__c` = letters[1:5])
df2 <- data.frame(`d` = rep(5, 5), `x__e` = seq(2, 6, 1), `f` = letters[6:10])
df3 <- data.frame(`x__g` = rep(5, 5), `x__h` = seq(2, 6, 1), `i` = letters[6:10])
You could put the data frames in a list and use an anonymous function with gsub.
lst <- mget(ls(pattern='^df\\d$'))
lapply(lst, \(x) setNames(x, gsub('x__', '', names(x))))
# $df1
# a b c
# 1 3 1 a
# 2 3 2 b
# 3 3 3 c
# 4 3 4 d
# 5 3 5 e
#
# $df2
# d e f
# 1 5 2 f
# 2 5 3 g
# 3 5 4 h
# 4 5 5 i
# 5 5 6 j
#
# $df3
# g h i
# 1 5 2 f
# 2 5 3 g
# 3 5 4 h
# 4 5 5 i
# 5 5 6 j
If you have no use of the list, move the changed dfs back into .GlobalEnv using list2env, but I don't recommend it, since it overwrites.
lapply(lst, \(x) setNames(x, gsub('x__', '', names(x)))) |> list2env(.GlobalEnv)

Creating a list with column-wise partitions of a data.frame

I have a data.frame with a single "identifier" column and many additional columns. I am interested in turning this data.frame into a list of length K, whose elements are sets of columns partitioning the data.frame.
For example, given the below data.frame:
# Example data.frame
df <- data.frame(id = 1:10,
x1 = rnorm(10),
x2 = rnorm(10),
x3 = rnorm(10),
x4 = rnorm(10))
I'd like to have some function that converts it into this:
# Partitioning function
foo(df, partitions = 3)
# Expected output
list(data.frame(id = df$id, x1 = df[ ,2]),
data.frame(id = df$id, x2 = df[ ,3]),
data.frame(id = df$id, x3 = df[ ,4], x4 = df[ ,5]),
Bonus points if you can extend this so that you can specify how many non-id columns each element of the list should contain by passing a numeric vector. Imagine the same output with an input that looks like this or equivalent.
columns_per_element <- c(1,1,2)
foo(df, columns_per_element)
It is actually easier to define a function with the splitting sequence. The key functions here are repand split.default i.e.
f2 <- function(df, n, split){
i1 <- rep(seq(n), split)
res_list <- split.default(df[-1], i1)
return(lapply(res_list, function(i)cbind.data.frame(ID = df$id, i)))
}
f2(df, 3, c(1, 1, 2))
$`1`
ID x1
1 1 1.54960977
2 2 -1.59144017
3 3 0.02853548
4 4 -0.14231391
5 5 1.26989801
6 6 0.87495876
7 7 0.27373774
8 8 -0.75600983
9 9 0.32216493
10 10 -1.05113771
$`2`
ID x2
1 1 0.8529416
2 2 0.4555094
3 3 -0.3620756
4 4 1.4779813
5 5 -1.6484066
6 6 -0.5697431
7 7 -0.2139384
8 8 0.1619074
9 9 -0.5390306
10 10 -0.2228809
$`3`
ID x3 x4
1 1 -0.2579865 1.185526074
2 2 -0.0519554 -0.388179976
3 3 2.5350092 -0.675504829
4 4 -1.7051955 0.073448252
5 5 0.6207733 -0.637220508
6 6 0.3015831 -1.324024114
7 7 -0.5647717 0.969025962
8 8 0.1404714 -1.575383604
9 9 1.3049560 -1.846413101
10 10 -0.6716643 0.008675125
f2(df, 3, c(1, 2, 1))
$`1`
ID x1
1 1 1.54960977
2 2 -1.59144017
3 3 0.02853548
4 4 -0.14231391
5 5 1.26989801
6 6 0.87495876
7 7 0.27373774
8 8 -0.75600983
9 9 0.32216493
10 10 -1.05113771
$`2`
ID x2 x3
1 1 0.8529416 -0.2579865
2 2 0.4555094 -0.0519554
3 3 -0.3620756 2.5350092
4 4 1.4779813 -1.7051955
5 5 -1.6484066 0.6207733
6 6 -0.5697431 0.3015831
7 7 -0.2139384 -0.5647717
8 8 0.1619074 0.1404714
9 9 -0.5390306 1.3049560
10 10 -0.2228809 -0.6716643
$`3`
ID x4
1 1 1.185526074
2 2 -0.388179976
3 3 -0.675504829
4 4 0.073448252
5 5 -0.637220508
6 6 -1.324024114
7 7 0.969025962
8 8 -1.575383604
9 9 -1.846413101
10 10 0.008675125
Here is solution with two parameters in the function with a vectorized column select. note this assumes the first column is id and is called id. second if the sum of the vector is greater than ncol(df)-1 (this will be your input df) it will throw an error.
f2 <- function(x,y){
#keep id
id <- x[,"id" , drop = FALSE]
#keep all other variables
df2 <- x[,-1]
#get sequence for columns
y2 <- lapply(cumsum(y), function(x){sequence(x)})
#grab correct columns
y3 <- c(y2[1],mapply(dplyr::setdiff ,y2[2:length(y2)],y2[1:2]))
#recreate df
lapply(y3,
function(x){
cbind.data.frame(id, df2[,x, drop = FALSE])
})
}
f2(df, c(1,1,2))

extract and format data from dataset into matrix in R

I want to make this dataframe
into this matrix
I have tried:
x <- read.csv("sample1.csv")
ax <- matrix(c(x[1,1],x[2,1],x[1,3],x[1,1],x[3,1],x[1,4],x[1,1],x[4,1],x[1,5],x[1,1],x[5,1],x[1,6],x[1,1],x[6,1],x[1,7],x[2,1],x[1,1],x[2,2],x[2,1],x[3,1],x[2,4],x[2,1],x[4,1],x[2,5],x[2,1],x[5,1],x[2,6],x[3,1],x[6,1],x[2,7],x[3,1],x[1,1],x[3,2],x[3,1],x[2,1],x[3,3],x[3,1],x[4,1],x[3,5],x[3,1],x[5,1],x[3,6],x[3,1],x[6,1],x[3,7],x[4,1],x[1,1],x[4,2],x[4,1],x[2,1],x[4,3],x[4,1],x[3,1],x[4,4],x[4,1],x[5,1],x[4,6],x[4,1],x[6,1],x[4,7],x[5,1],x[1,1],x[2,2],x[5,1],x[2,1],x[2,4],x[5,1],x[3,1],x[2,5],x[5,1],x[4,1],x[2,6],x[5,1],x[6,1],x[2,7],x[6,1],x[1,1],x[2,2],x[6,1],x[2,1],x[2,4],x[6,1],x[3,1],x[2,5],x[6,1],x[4,1],x[2,6],x[6,1],x[5,1],x[2,7]),10,3, byrow=TRUE)
bx <- ax[order(ax[,3], decreasing = TRUE),]
But it's not beautiful at all, and also it's gonna be lots of work if I got different sample data.
So I wish to simplified it if possible, any suggestion?
This can be achieved by using melt() function from reshape2 package:
> a = matrix(c(1:9), nrow = 3, ncol = 3, dimnames = list(LETTERS[1:3], letters[1:3]))
> a
a b c
A 1 4 7
B 2 5 8
C 3 6 9
> library(reshape2)
> melt(a, na.rm = TRUE)
Var1 Var2 value
1 A a 1
2 B a 2
3 C a 3
4 A b 4
5 B b 5
6 C b 6
7 A c 7
8 B c 8
9 C c 9

R - find clusters of group 2 (pairs)

I am looking for a way to find clusters of group 2 (pairs).
Is there a simple way to do that?
Imagine I have some kind of data where I want to match on x and y, like
library(cluster)
set.seed(1)
df = data.frame(id = 1:10, x_coord = sample(10,10), y_coord = sample(10,10))
I want to find the closest pair of distances between the x_coord and y_coord:
d = stats::dist(df[,c(1,2)], diag = T)
h = hclust(d)
plot(h)
I get a dendrogram like the one below. What I would like is that the pairs (9,10), (1,3), (6,7), (4,5) be grouped together. And that in fact the cases 8 and 2, be left alone and removed.
Maybe there is a more effective alternative for doing this than clustering.
Ultimately I would like is to remove the unmatched ids and keep the pairs and have a dataset like this one:
id x_coord y_coord pair_id
1 9 3 1
3 7 5 1
4 1 8 2
5 2 2 2
6 5 6 3
7 3 10 3
9 6 4 4
10 8 7 4
You could use the element h$merge. Any rows of this two-column matrix that both contain negative values represent a pairing of singletons. Therefore you can do:
pairs <- -h$merge[apply(h$merge, 1, function(x) all(x < 0)),]
df$pair <- (match(df$id, c(pairs)) - 1) %% nrow(pairs) + 1
df <- df[!is.na(df$pair),]
df
#> id x_coord y_coord pair
#> 1 1 9 3 4
#> 3 3 7 5 4
#> 4 4 1 8 1
#> 5 5 2 2 1
#> 6 6 5 6 2
#> 7 7 3 10 2
#> 9 9 6 4 3
#> 10 10 8 7 3
Note that the pair numbers equate to "height" on the dendrogram. If you want them to be in ascending order according to the order of their appearance in the dataframe you can add the line
df$pair <- as.numeric(factor(df$pair, levels = unique(df$pair)))
Anyway, if we repeat your plotting code on our newly modified df, we can see there are no unpaired singletons left:
d = stats::dist(df[,c(1,2)], diag = T)
h = hclust(d)
plot(h)
And we can see the method scales nicely:
df = data.frame(id = 1:50, x_coord = sample(50), y_coord = sample(50))
d = stats::dist(df[,c(1,2)], diag = T)
h = hclust(d)
pairs <- -h$merge[apply(h$merge, 1, function(x) all(x < 0)),]
df$pair <- (match(df$id, c(pairs)) - 1) %% nrow(pairs) + 1
df <- df[!is.na(df$pair),]
d = stats::dist(df[,c(1,2)], diag = T)
h = hclust(d)
plot(h)

change data.frame column into rows in R

A <- c(1,6)
B <- c(2,7)
C <- c(3,8)
D <- c(4,9)
E <- c(5,0)
df <- data.frame(A,B,C,D,E)
df
A B C D E
1 1 2 3 4 5
2 6 7 8 9 0
I would like to have this:
df
1 2
A 1 6
B 2 7
C 3 8
D 4 9
E 5 0
If your dataframe is truly in that format, then all of your vectors will be character vectors. Or, you basically have a character matrix and you could do this:
data.frame(t(df))
It would be better, though, to just define it the way you want it from the get-go
df <- data.frame(c('A','B','C','D','E'),
c(1, 2, 3, 4, 5),
c(6, 7, 8, 9, 0))
You could also do this
df <- data.frame(LETTERS[1:5], 1:5, c(6:9, 0))
If you wanted to give the columns names, you could do this
df <- data.frame(L = LETTERS[1:5], N1 = 1:5, N2 = c(6:9, 0))
Sometimes, if I use read.DIF of Excel data the data gets transposed. Is that how you got the original data in? If so, you can call
read.DIF(filename, transpose = T)
to get the data in the correct orientation.
I really recommend data.table approach without manual steps becauce they are error-prone
A <- c(1,6)
B <- c(2,7)
C <- c(3,8)
D <- c(4,9)
E <- c(5,0)
df <- data.frame(A,B,C,D,E)
df
library('data.table')
dat.m <- melt(as.data.table(df, keep.rownames = "Vars"), id.vars = "Vars") # https://stackoverflow.com/a/44128640/54964
dat.m
Output
A B C D E
1 1 2 3 4 5
2 6 7 8 9 0
Vars variable value
1: 1 A 1
2: 2 A 6
3: 1 B 2
4: 2 B 7
5: 1 C 3
6: 2 C 8
7: 1 D 4
8: 2 D 9
9: 1 E 5
10: 2 E 0
R: 3.4.0 (backports)
OS: Debian 8.7

Resources