Populate matrix by colname identity - r

I have many samples, each one of which has a corresponding abundance matrix. From these abundance matrices, I would like to create a large matrix that contains abundance information for each sample in rows.
For example, a single abundance matrix would look like:
A B C D
sample1 1 3 4 2
where A, B, C, and D represent colnames, and the abundances are the row values.
I would like to populate my larger matrix, which has as colnames all possible letters (A:Z) and all possible samples (sample1:sampleN) as rows, by matching the colname values.
For ex. :
A B C D E F G .... Z
sample1 1 3 4 2 NA NA NA ....
sample2 NA NA 2 5 7 NA NA ....
sample3 4 NA 6 9 2 NA 2 .....
....
sampleN
Different samples have a varying mix of abundances, in no guaranteed order.
When iteratively adding to this larger matrix, how could I ensure that the correct columns are populated by the right abundance values (ex. column "A" is only filled by values corresponding to abundances of "A" in different samples)? Thanks!

Starting data, changing just a little to highlight differences:
m1 <- as.matrix(read.table(header=TRUE, text="
A B C Z
sample1 1 3 4 2"))
m2 <- as.matrix(read.table(header=TRUE, text="
A B C D E F G
sample2 NA NA 2 5 7 NA NA
sample3 4 NA 6 9 2 NA 2"))
First, we need to make sure both matrices have the same column names:
newcols <- setdiff(colnames(m2), colnames(m1))
m1 <- cbind(m1, matrix(NA, nr=nrow(m1), nc=length(newcols), dimnames=list(NULL, newcols)))
newcols <- setdiff(colnames(m1), colnames(m2))
m2 <- cbind(m2, matrix(NA, nr=nrow(m2), nc=length(newcols), dimnames=list(NULL, newcols)))
m1
# A B C Z D E F G
# sample1 1 3 4 2 NA NA NA NA
m2
# A B C D E F G Z
# sample2 NA NA 2 5 7 NA NA NA
# sample3 4 NA 6 9 2 NA 2 NA
And now we combine them; regular cbind needs the column names to be aligned as well:
rbind(m2, m1[,colnames(m2),drop=FALSE])
# A B C D E F G Z
# sample2 NA NA 2 5 7 NA NA NA
# sample3 4 NA 6 9 2 NA 2 NA
# sample1 1 3 4 NA NA NA NA 2

You should be able to take advantage of matrix indexing, like so:
big[cbind(rownames(abun),colnames(abun))] <- abun
Using this example abundance matrix, and a big matrix to fill:
abun <- matrix(c(1,3,4,2),nrow=1,dimnames=list("sample1",LETTERS[1:4]))
big <- matrix(NA,nrow=5,ncol=26,dimnames=list(paste0("sample",1:5),LETTERS))

Another solution using reduce from purrr package and union_all from dplyr package:
library(purrr)
library(dplyr)
sample_names <- c("sample1","sample2","sample3")
Generating 3 random abundance dataframes:
num1 <- round(runif(runif(1,min = 1, max = 10),min = 1, max = 10))
df1 <- data.frame(t(num1))
colnames(df1) <- sample(LETTERS,length(num1))
num2 <- round(runif(runif(1,min = 1, max = 10),min = 1, max = 10))
df2 <- data.frame(t(num2))
colnames(df2) <- sample(LETTERS,length(num2))
num3 <- round(runif(runif(1,min = 1, max = 10),min = 1, max = 10))
df3 <- data.frame(t(num3))
colnames(df3) <- sample(LETTERS,length(num3))
This is actually the code that does all the magic:
A <- reduce(list(df1,df2,df3),union_all)
col_order <- sort(colnames(A),decreasing = FALSE)
A <- A[,col_order]
rownames(A) <- sample_names
Output:
> A
A C E F O P Q U W Y
sample1 9 NA NA NA 9 NA 5 6 NA NA
sample2 NA NA NA NA 5 4 NA NA 5 NA
sample3 NA 6 5 9 NA NA 3 NA 5 7

Related

function to rbind list of dataframes different columns and rows

I want to create a function that merges a list of dataframes with different column numbers and the rows have different names that I'd like to keep. Essentially I want to stack dataframes where the column names just become another row to be appended.
df <- list()
df[[1]] <- data.frame(d = c(4,5), e = c("c", "d"))
rownames(df[[1]]) <- c("df2_row_1", "df2_row_2")
df[[2]] <- data.frame(a = c(1,2,3), b = c("a", "b", "c"), c = c("one", "two", "three"))
rownames(df[[2]]) <- c("df1_row_1", "df1_row_2", "df1_row_3")
df[[3]] <- data.frame(f = c(6,7,8), g = c("e", "f", "g"), h = c("one", "two", "three"), w = c(100,101,102))
rownames(df[[3]]) <- c("df3_row_1", "df3_row_2", "df3_row_3")
Current Output:
do.call(bind_rows, df)
d e a b c f g h w
1 4 c NA <NA> <NA> NA <NA> <NA> NA
2 5 d NA <NA> <NA> NA <NA> <NA> NA
3 NA <NA> 1 a one NA <NA> <NA> NA
4 NA <NA> 2 b two NA <NA> <NA> NA
5 NA <NA> 3 c three NA <NA> <NA> NA
6 NA <NA> NA <NA> <NA> 6 e one 100
7 NA <NA> NA <NA> <NA> 7 f two 101
8 NA <NA> NA <NA> <NA> 8 g three 102
Desired Output
d e
df2_row_1 4 c
df2_row_2 5 d
a b c
df1_row_1 1 a one
df1_row_2 2 b two
df1_row_3 3 c three
f g h w
df3_row_1 6 e one 100
df3_row_2 7 f two 101
df3_row_3 8 g three 102
I've tried (unsuccessfully) creating a function that finds the longest data frame, then appends empty columns to the data frames that are shorter than the longest, then gives all the data frames the same name for each of those columns.
I also realize this couldn't be more NOT tidy - is this possible?
Thank you!!!
This can be achieved with a for loop (I think it could be achieved with mapply to, check ?mapply). The overall strategy is filling each df in the list with NAs (cbinding them) and then rbindlisting the resulting list:
library(data.table)
cols <- max(sapply(df, ncol))
# This is the length of the NA vectors that make the cbinding dfs:
lengths <- (cols - sapply(df, ncol))*sapply(df, nrow)
newdf <- list()
for (i in 1:length(df)){
if (ncol(df[[i]]) != cols){
newdf[[i]] <- cbind(df[[i]],
as.data.frame(matrix(rep(NA, lengths[i]),
ncol = lengths[i] / nrow(df[[i]]))))
} else {
newdf[[i]] <- df[[i]]
}
}
rbindlist(newdf, use.names = FALSE)
Which results in:
d e V1 V2
1: 4 c <NA> NA
2: 5 d <NA> NA
3: 1 a one NA
4: 2 b two NA
5: 3 c three NA
6: 6 e one 100
7: 7 f two 101
8: 8 g three 102

Merge data.frame columns on set number of columns removing na's unless not enough values in row

I'd like to remove the NA values from my columns, merge all columns into four columns, while keeping NA's if there is not 4 values in each row.
Say I have data like this,
df <- data.frame('a' = c(1,4,NA,3),
'b' = c(3,NA,3,NA),
'c' = c(NA,2,NA,NA),
'd' = c(4,2,NA,NA),
'e'= c(NA,5,3,NA),
'f'= c(1,NA,NA,4),
'g'= c(NA,NA,NA,4))
#> a b c d e f g
#> 1 1 3 NA 4 NA 1 NA
#> 2 4 NA 2 2 5 NA NA
#> 3 NA 3 NA NA 3 NA NA
#> 4 3 NA NA NA NA 4 4
My desired outcome would be,
df.desired <- data.frame('a' = c(1,4,3,3),
'b' = c(3,2,3,4),
'c' = c(4,2,NA,4),
'd' = c(1,5,NA,NA))
df.desired
#> a b c d
#> 1 1 3 4 1
#> 2 4 2 2 5
#> 3 3 3 NA NA
#> 4 3 4 4 NA
You could've probably explored a bit more on SO to tweak two answers 1 & 2.
Shifting all the Numbers with NAs
Remove the columns where you've got All NAs
Result:
df <- data.frame('a' = c(1,4,NA,3),
'b' = c(3,NA,3,NA),
'c' = c(NA,2,NA,NA),
'd' = c(4,2,NA,NA),
'e'= c(NA,5,3,NA),
'f'= c(1,NA,NA,4),
'g'= c(NA,NA,NA,4))
df.new<-do.call(rbind,lapply(1:nrow(df),function(x) t(matrix(df[x,order(is.na(df[x,]))])) ))
colnames(df.new)<-colnames(df)
df.new
df.new[,colSums(is.na(df.new))<nrow(df.new)]
Output:
> df.new[,colSums(is.na(df.new))<nrow(df.new)]
a b c d
[1,] 1 3 4 1
[2,] 4 2 2 5
[3,] 3 3 NA NA
[4,] 3 4 4 NA
I believe there are more efficient ways, anyhow that is my try:
x00=sapply(1:nrow(df),function(x) df[x,][!is.na( df[x,])])
x01=lapply(x00,function(x) x=c(x,rep(NA,7-length(x)-1)))
x02=as.data.frame(do.call("rbind",x01))
x02 <- x02[,colSums(is.na(x02))<nrow(x02)]
I have following solution:
df <- data.frame('a' = c(1,4,NA,3),
'b' = c(3,NA,3,NA),
'c' = c(NA,2,NA,NA),
'd' = c(4,2,NA,NA),
'e'= c(NA,5,3,NA),
'f'= c(1,NA,NA,4),
'g'= c(NA,NA,NA,4))
df
x <-list()
for(i in 1:nrow(df)){
x[[i]] <- df[i,]
x[[i]] <- x[[i]][!is.na(x[[i]])]
# x[[i]] <- as.data.frame(x[[i]], stringsAsFactors = FALSE)
x[[i]] <- c(x[[i]], rep(0, 5 -length(x[[i]])))
}
result <- do.call(rbind, x)
result

Match Dataframes Excluding Last Non-NA Value and disregarding order

I have two dataframes:
Partner<-c("Alpha","Beta","Zeta")
COL1<-c("A","C","M")
COL2<-c("B","D","K")
COL3<-c("C","F",NA)
COL4<-c("D",NA,NA)
df1<-data.frame(Partner,COL1,COL2,COL3,COL4)
lift<-c(9,10,11,12,12,23,12,24)
RULE1<-c("B","B","D","A","C","K","M","K")
RULE2<-c("A","A","C","B","A","M","T","M")
RULE3<-c("G","D","M","C" ,"M", "E",NA,NA)
RULE4<-c(NA,NA,"K","D" ,NA, NA,NA,NA)
df2<-data.frame(lift,RULE1,RULE2,RULE3,RULE4)
df1
Partner COL1 COL2 COL3 COL4
Alpha A B C D
Beta C D F NA
Zeta M K NA NA
df2
lift RULE1 RULE2 RULE3 RULE4
9 B A G NA
10 B A D NA
11 D C M K
12 A B C D
12 C A M NA
23 K M E NA
12 M T NA NA
24 K M NA NA
This is a market basket analysis. df1 is the customer/partner that bought each of the items listed: A, B, C...etc.
df2 are the recommendations associated with the items bought in the past.
The last value in each of the df2 rows represent the recommendation. So the preceding values in each row from the last non-NA value are the "baskets".
So for example in the first row of df2, it is stating: If B and A are bought together, recommend G.
I want to be able to figure out if each partner from df1 bought ALL the values in each row excluding the final value since that is the recommendation. Then add that recommendation to the end of each row of the new dataframe.
For example:
For partner: Alpha, would it be good to recommend value G from the first row? Answer would be yes because they bought all the values from that row in df2 (A and B).
For partner: Beta, it would not be good to recommend value G because not all of the values from the first row of df2 are found in the Beta row.
Final Output:
Partner COL1 COL2 COL3 COL4 lift RULE1 RULE2 RULE3 RULE4 Does Last Non-NA Value Exist in Row?
Alpha A B C D 9 B A G NA No
Alpha A B C D 10 B A D NA Yes
Alpha A B C D 12 A B C D Yes
Alpha A B C D 12 C A M NA No
Zeta M K NA NA 23 K M E NA No
Zeta M K NA NA 12 M T NA NA No
Zeta M K NA NA 24 K M NA NA Yes
Written out results for clarity:
df3
row1 outputs "No" because G is not found in Alpha Partner and all values before G show up in Alpha Partner (B,A)
row2 outputs "Yes" because D is found in Alpha Partner and all values before D show up in Alpha Partner (B,A)
row3 outputs "Yes" because D is found in Alpha Partner and all values before D show up in Alpha Partner (A,B,C)
row4 outputs "No" because M is not found in Alpha Partner and all values before M show up in Alpha Partner (C,A)
row5 outputs "No" because E is not found in Zeta Partner and all values before E show up in Zeta Partner (K,M)
row6 outputs "No" because T is not found in Zeta Partner and all values before T show up in Zeta Partner (M)
row7 outputs "Yes" because M is found in Zeta Partner and all values before M show up in Zeta Partner (K)
I think that has to be a join or a match of some kind but can't figure out how to do it.
This would be extremely helpful if someone can help me out with this.
Thanks.
This was the attempt:
df1<-cbind(df1_id=1:nrow(df1),df1)
df2 <- cbind(df2_id=1:nrow(df2),df2)
d11 <- df1 %>% gather(Col, Value,starts_with("C")) #Long
d11 <- d11 %>% na.omit() %>%group_by(df1_id) %>% slice(-n()) #remove last non NA
d22 <- df2 %>% gather(Rule, Value,starts_with("R"))
res <- inner_join(d11,d22)
rm(d22)
rm(d11)
final<-cbind(df1[res$df1_id,],df2[res$df2_id,])
final$Exist <- apply(final, 1, FUN = function(x)
c("No", "Yes")[(anyDuplicated(x[!is.na(x) & x != "" ])!=0) +1])
But this didn't work because it didn't take all of the values into account, only if one of them matched...not all.
This is quite tricky because the purchases of n customers have to be compared to a set of m rules. Besides this, there are two points which add to the complexity:
The last non-NA RULE column in df2 is semantically different from the others. Unfortunately, the given data structure doesn't reflect this. So, df2 is missing an explicite recommended column.
Finally, it has to be determined whether a partner already has purchased the recommended item.
The approach below relies on melt(), dcast() and join operations of the data.table package for performance reasons. However, in order to avoid creation of cartesian crossproduct of n * m rows, a loop is used.
EDIT The dcast() has been moved out of the lapply() function.
Prepare data for n:m join
library(data.table)
# convert to data.table and add row numbers
# here, a copy is used insteasd of setDT() in order to rename the data.tables
purchases <- as.data.table(df1)[, rnp := seq_len(.N)]
rules <- as.data.table(df2)[, rnr := seq_len(.N)]
# prepare purchases for joins
lp <- melt(purchases, id.vars = c("rnp", "Partner"), na.rm = TRUE)
wp <- dcast(lp, rnp ~ value, drop = FALSE)
wp
# rnp A B C D F K M
#1: 1 A B C D NA NA NA
#2: 2 NA NA C D F NA NA
#3: 3 NA NA NA NA NA K M
# prepare rules
lr <- melt(rules, id.vars = c("rnr", "lift"), na.rm = TRUE)
# identify last column of each rule which becomes the recommendation
rn_of_last_col <- lr[, last(.I), by = rnr][, V1]
# reshape from long to wide without recommendation
wr <- dcast(lr[-rn_of_last_col], rnr ~ value)
# add column with recommendations (kind of cbind, no join)
wr[, recommended := lr[rn_of_last_col, value]]
wr
# rnr A B C D K M recommended
#1: 1 A B NA NA NA NA G
#2: 2 A B NA NA NA NA D
#3: 3 NA NA C D NA M K
#4: 4 A B C NA NA NA D
#5: 5 A NA C NA NA NA M
#6: 6 NA NA NA NA K M E
#7: 7 NA NA NA NA NA M T
#8: 8 NA NA NA NA K NA M
Combine rules and purchases
combi <- rbindlist(
# implied loop over rules to find matching purchases for each rule
lapply(seq_len(nrow(rules)), function(i) {
# get col names except last col which is the recommendation
cols <- lr[rnr == i, value[-.N]]
# join single rule with all partners on relevant cols for this rule
wp[wr[i, .SD, .SDcols = c(cols, "rnr", "recommended")], on = cols, nomatch = 0]
})
)
# check if recommendation was purchased already
combi[, already_purchased := Reduce(`|`, lapply(.SD, function(x) x == recommended)),
.SDcols = -c("rnp", "rnr", "recommended")]
# clean up already purchased
combi[is.na(already_purchased), already_purchased := FALSE
][, already_purchased := ifelse(already_purchased, "Yes", "No")]
combi
# rnp A B C D F K M rnr recommended already_purchased
#1: 1 A B C D NA NA NA 1 G No
#2: 1 A B C D NA NA NA 2 D Yes
#3: 1 A B C D NA NA NA 4 D Yes
#4: 1 A B C D NA NA NA 5 M No
#5: 3 NA NA NA NA NA K M 6 E No
#6: 3 NA NA NA NA NA K M 7 T No
#7: 3 NA NA NA NA NA K M 8 M Yes
In creating combi, the trick is to join only on those columns which are included in each rule. This is why the join needs to be done for each rule separately.
Essentially, we are done now. However, it doesn't look like the desired output.
Final joins
tmp_rules <- rules[combi[, .(rnp, rnr, recommended, already_purchased)], on = "rnr"]
tmp_purch <- purchases[combi[, .(rnp, rnr)], on = "rnp"]
result <- tmp_purch[tmp_rules, on = c("rnp", "rnr")]
result[, (c("rnp", "rnr")) := NULL]
result
# Partner COL1 COL2 COL3 COL4 lift RULE1 RULE2 RULE3 RULE4 recommend already_purchased
#1: Alpha A B C D 9 B A G NA G No
#2: Alpha A B C D 10 B A D NA D Yes
#3: Alpha A B C D 12 A B C D D Yes
#4: Alpha A B C D 12 C A M NA M No
#5: Zeta M K NA NA 23 K M E NA E No
#6: Zeta M K NA NA 12 M T NA NA T No
#7: Zeta M K NA NA 24 K M NA NA M Yes

Average over specific number of rows, according to criteria

My data is structured as follows:
set.seed(20)
RawData <- data.frame(Trial = c(rep(1, 10), rep(2, 10)),
X_Velocity = runif(20, 1, 3),
Y_Velocity = runif(20, 4, 6))
I now wish to calculate an average for X_Velocity and Y_Velocity across every two rows, for each Trial. My anticipated output, for the first four rows would be:
X_Velocity_AVG Y_Velocity_AVG
NA NA
2.6460545 4.522224
NA NA
1.8081265 4.5175165
How do I complete this?
You could do this using function f in which the average of every two elements is computed:
f <- function(a) tapply(a, rep(1:(length(a)/2), each = 2), FUN = mean)
res <- data.frame(X_Velocity_AVG=rep(NA, nrow(RawData)),
Y_Velocity_AVG=rep(NA, nrow(RawData)))
res$X_Velocity_AVG[c(F,T)] <- f(RawData$X_Velocity)
res$Y_Velocity_AVG[c(F,T)] <- f(RawData$Y_Velocity)
# X_Velocity_AVG Y_Velocity_AVG
# 1 NA NA
# 2 2.646055 4.522224
# 3 NA NA
# 4 1.808127 4.517517
# 5 NA NA
# 6 2.943262 4.334551
# 7 NA NA
# 8 1.162082 5.899396
# 9 NA NA
# 10 1.697668 4.739195
# 11 NA NA
# 12 2.473324 4.778723
# 13 NA NA
# 14 1.744730 5.020097
# 15 NA NA
# 16 1.644518 4.986245
# 17 NA NA
# 18 1.431219 5.375815
# 19 NA NA
# 20 2.108719 4.909284

Calculate cumsum() while ignoring NA values

Consider the following named vector x.
( x <- setNames(c(1, 2, 0, NA, 4, NA, NA, 6), letters[1:8]) )
# a b c d e f g h
# 1 2 0 NA 4 NA NA 6
I'd like to calculate the cumulative sum of x while ignoring the NA values. Many R functions have an argument na.rm which removes NA elements prior to calculations. cumsum() is not one of them, which makes this operation a bit tricky.
I can do it this way.
y <- setNames(numeric(length(x)), names(x))
z <- cumsum(na.omit(x))
y[names(y) %in% names(z)] <- z
y[!names(y) %in% names(z)] <- x[is.na(x)]
y
# a b c d e f g h
# 1 3 3 NA 7 NA NA 13
But this seems excessive, and makes a lot of new assignments/copies. I'm sure there's a better way.
What better methods are there to return the cumulative sum while effectively ignoring NA values?
You can do this in one line with:
cumsum(ifelse(is.na(x), 0, x)) + x*0
# a b c d e f g h
# 1 3 3 NA 7 NA NA 13
Or, similarly:
library(dplyr)
cumsum(coalesce(x, 0)) + x*0
# a b c d e f g h
# 1 3 3 NA 7 NA NA 13
It's an old question but tidyr gives a new solution.
Based on the idea of replacing NA with zero.
require(tidyr)
cumsum(replace_na(x, 0))
a b c d e f g h
1 3 3 3 7 7 7 13
Do you want something like this:
x2 <- x
x2[!is.na(x)] <- cumsum(x2[!is.na(x)])
x2
[edit] Alternatively, as suggested by a comment above, you can change NA's to 0's -
miss <- is.na(x)
x[miss] <- 0
cs <- cumsum(x)
cs[miss] <- NA
# cs is the requested cumsum
Here's a function I came up from the answers to this question. Thought I'd share it, since it seems to work well so far. It calculates the cumulative FUNC of x while ignoring NA. FUNC can be any one of sum(), prod(), min(), or max(), and x is a numeric vector.
cumSkipNA <- function(x, FUNC)
{
d <- deparse(substitute(FUNC))
funs <- c("max", "min", "prod", "sum")
stopifnot(is.vector(x), is.numeric(x), d %in% funs)
FUNC <- match.fun(paste0("cum", d))
x[!is.na(x)] <- FUNC(x[!is.na(x)])
x
}
set.seed(1)
x <- sample(15, 10, TRUE)
x[c(2,7,5)] <- NA
x
# [1] 4 NA 9 14 NA 14 NA 10 10 1
cumSkipNA(x, sum)
# [1] 4 NA 13 27 NA 41 NA 51 61 62
cumSkipNA(x, prod)
# [1] 4 NA 36 504 NA 7056 NA
# [8] 70560 705600 705600
cumSkipNA(x, min)
# [1] 4 NA 4 4 NA 4 NA 4 4 1
cumSkipNA(x, max)
# [1] 4 NA 9 14 NA 14 NA 14 14 14
Definitely nothing new, but maybe useful to someone.
Another option is using the collapse package with fcumsum function like this:
( x <- setNames(c(1, 2, 0, NA, 4, NA, NA, 6), letters[1:8]) )
#> a b c d e f g h
#> 1 2 0 NA 4 NA NA 6
library(collapse)
fcumsum(x)
#> a b c d e f g h
#> 1 3 3 NA 7 NA NA 13
Created on 2022-08-24 with reprex v2.0.2

Resources