R - Merge Matrices on a common column

I am trying to merge a list of matrices all by the first column like this:
a x x
a q q
b y y
c z z
d w w
e v v
e r r

a x x
a q q
b y y
c z z
d w w
e v v
e r r
---------->
x x x x
q q q q
y y y y
z z z z
w w w w
v v v v
r r r r
I would like to use the first column to combine the matrices, but it does not need to be in the resulting matrix. What is challenging me is that there are multiple instances of the same value in the first column (a and e).
I have been looking around but have been unable to find any solutions that account for repeated values in the column that the matrices are being joined on. With my current code (shown below) I get something like:
x x x x
q q q q
x x x x
q q q q
x x x x
q q q q
y y y y
z z z z
w w w w
v v v v
r r r r
v v v v
r r r r
v v v v
r r r r
I can't seem to find out why the duplicate rows are appearing, but it has something to do with the length of the list, so I am assuming it happens in the merge function.
mergeM <- function(list){ # list is a list of matrices
  len = length(list)
  mat = merge(list[[1]],list[[2]],by.x = "V1", by.y = "V1", all = TRUE)
  if(len >2){
    for(i in 3:len){
      mat = merge(mat,list[[i]],by.x = "V1", by.y = "V1", all = TRUE)
    }
  }
  mat = mat[,-1]
  return(mat)
}# end function
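The duplicate rows come from how merge treats repeated keys: every row with key a in one matrix is paired with every row with key a in the other, so two a rows on each side already give four rows, and each further merge multiplies them again. Below is a minimal sketch of one way around this, assuming the repeated keys should be paired in order of appearance (first a with first a, and so on). The names merge_by_occurrence and occ are hypothetical and not part of the original code.

```r
# Hypothetical helper: merge matrices on column 1, pairing repeated keys
# by their order of appearance instead of in every combination.
merge_by_occurrence <- function(mats) {
  dfs <- lapply(seq_along(mats), function(i) {
    df <- as.data.frame(mats[[i]], stringsAsFactors = FALSE)
    # unique data column names per matrix, so merge never has to add suffixes
    names(df) <- c("key", paste0("m", i, "_", seq_len(ncol(df) - 1)))
    # occ numbers the repeats of each key: 1, 2, ... within each key value
    df$occ <- ave(seq_along(df$key), df$key, FUN = seq_along)
    df
  })
  out <- Reduce(function(x, y) merge(x, y, by = c("key", "occ"), all = TRUE), dfs)
  # drop the helper columns; only the data columns remain
  out[, setdiff(names(out), c("key", "occ")), drop = FALSE]
}

# The matrix from the question, used twice
m <- matrix(c("a", "a", "b", "c", "d", "e", "e",
              "x", "q", "y", "z", "w", "v", "r",
              "x", "q", "y", "z", "w", "v", "r"), ncol = 3)
res <- merge_by_occurrence(list(m, m))
print(res)  # 7 rows and 4 data columns
```

Merging on the pair (key, occ) makes every join key unique, which is why the row count stays at seven no matter how many matrices are in the list.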

Tensorflow: Find greater than pairs and stack along axis

The problem I have using tensorflow is as follows:
For one tensor X with dims n X m
X = [[x11,x12...,x1m],[x21,x22...,x2m],...[xn1,xn2...,xnm]]
I want to get an n X m X m tensor, i.e. a stack of n matrices of size m X m.
Each m X m matrix is the result of:
tf.math.greater(tf.reshape(x,(-1,1)), x) where x is a row of X
In words, for every row k in X, I'm trying to get the pairs i,j where xki > xkj. This gives me a matrix, and then I want to stack those matrices along the first axis to get an n X m X m cube.
Example:
X = [[1,2],[4,3], [5,7]]
Result = [[[False, False],[True, False]],[[False, True],[False, False]], [[False, False],[True, False]]]
Result has shape 3 X 2 X 2
Reshaping each row is the same as reshaping all rows. Try this:
import tensorflow as tf

def fun(X):
    n, m = X.shape
    X1 = tf.expand_dims(X, -1)      # shape (n, m, 1)
    X2 = tf.reshape(X, (n, 1, m))   # shape (n, 1, m)
    return tf.math.greater(X1, X2)  # broadcasts to shape (n, m, m)

X = tf.Variable([[1,2],[4,3], [5,7]])
print(fun(X))
Output:
tf.Tensor(
[[[False False]
[ True False]]
[[False True]
[False False]]
[[False False]
[ True False]]], shape=(3, 2, 2), dtype=bool)

R - build a matrix from other matrices with linking information [duplicate]

I need to build a matrix from data that is stored in several other matrices that all have a pointer in their first column. This is how the original matrices might look, with a-e being the pointers connecting the data from all the matrices and v-z being the data that is linked together. The arrow points to what I want my final matrix to look like.
a x x
b y y
c z z
d w w
e v v

e v v
d w w
c z z
b y y
a x x
----->
x x x x
y y y y
z z z z
w w w w
v v v v
I can't seem to write the right algorithm to do this; I am getting either "subscript out of bounds" errors or "replacement has length zero" errors. Here is what I have now, but it is not working.
for(i in 1:length(matlist)){
  tempmatrix = matlist[[i]] # list of matrices to be combined
  genMatrix[1,i] = tempmatrix[1,2]
  for(j in 2:length(tempmatrix[,1])){
    index = which(indexv == tempmatrix[j,1]) #the row index for the data that needs to be match
    # with an ECID
    for(k in 1:length(tempmatrix[1,])){
      genMatrix[index,k+i] = tempmatrix[j,k]
    }
    # places the data in same row as the ecid
  }
}
print(genMatrix)
EDIT: I just want to clarify that my example only shows two matrices but in the list matlist there can be any number of matrices. I need to find a way of merging them without having to know how many matrices are in matlist at the time.
We can merge all the matrices in the list using Reduce and merge from base R.
as.matrix(read.table(text="a x x
b y y
c z z
d w w
e v v")) -> mat1
as.matrix(read.table(text="e v v
d w w
c z z
b y y
a x x")) -> mat2
as.matrix(read.table(text="e x z
d z w
c w v
b y x
a v y")) -> mat3
matlist <- list(mat1=mat1, mat2=mat2, mat3=mat3)
Reduce(function(m1, m2) merge(m1, m2, by = "V1", all.x = TRUE),
       matlist)[,-1]
#> V2.x V3.x V2.y V3.y V2 V3
#> 1 x x x x v y
#> 2 y y y y y x
#> 3 z z z z w v
#> 4 w w w w z w
#> 5 v v v v x z
Created on 2019-06-05 by the reprex package (v0.3.0)
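With more than three matrices, the default merge suffixes (.x, .y) start to repeat, so the result can end up with duplicated column names. Here is a minimal sketch that renames the columns before merging, assuming the matrices have no column names of their own (so as.data.frame yields V1, V2, ...). The names merge_all and key are hypothetical.

```r
# Hypothetical helper: prefix each matrix's data columns with its position
# in the list, so merged column names stay unique for any number of matrices.
merge_all <- function(matlist) {
  dfs <- lapply(seq_along(matlist), function(i) {
    df <- as.data.frame(matlist[[i]], stringsAsFactors = FALSE)
    names(df) <- c("key", paste0("m", i, ".", names(df)[-1]))
    df
  })
  # successive left joins on the key column, then drop the key
  Reduce(function(x, y) merge(x, y, by = "key", all.x = TRUE), dfs)[, -1]
}

# Four identical toy matrices with keys a-e in the first column
mats <- replicate(4, cbind(letters[1:5], LETTERS[1:5], LETTERS[1:5]),
                  simplify = FALSE)
res <- merge_all(mats)
print(names(res))
```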
Or we can append all the matrices together and then use tidyr to go from long to wide and get the desired output.
library(tidyr)
library(dplyr)
bind_rows(lapply(matlist, as.data.frame), .id = "mat") %>%
  gather(matkey, val, c("V2","V3")) %>%
  unite(matkeyt, mat, matkey, sep = ".") %>%
  spread(matkeyt, val) %>%
  select(-V1)
#> mat1.V2 mat1.V3 mat2.V2 mat2.V3 mat3.V2 mat3.V3
#> 1 x x x x v y
#> 2 y y y y y x
#> 3 z z z z w v
#> 4 w w w w z w
#> 5 v v v v x z
Created on 2019-06-06 by the reprex package (v0.3.0)

R - How pass the environment of a data.table to a function?

For example, I have a data.table like this:
dt <- data.table(a=1:3, b=5:7, c=10:8)
# a b c
#1: 1 5 10
#2: 2 6 9
#3: 3 7 8
and I want to pass the environment of one row at a time to a function, for example:
f <- function(a,b,c){
  x <- a*b
  y <- a*c
  z <- a/b
  return( x + y + z)
}
I know I could use mapply in this case to evaluate this multivariate function, but in my real use case I have a function that manipulates almost 150 variables of a data.table, and I don't want to assign the variable names one by one. I also tried some .SD manipulations, but that didn't work either.
I would like something where I pass the row number of the data.table, and inside the function the objects a, b and c are taken from the data.table environment.
Something similar to this:
f <- function(row_id){
  # set function parent env as data.table[row_id]
  # and *a = data.table[row_id, a]* and successively to b and c...
  x <- a*b
  y <- a*c
  z <- a/b
  return( x + y + z)
}
One way would be to adapt the function to take in a given data.table and a row and output your x + y + z:
f <- function(dataTable,row_id){
  a <- dataTable[row_id,a]
  b <- dataTable[row_id,b]
  c <- dataTable[row_id,c]
  x <- a*b
  y <- a*c
  z <- a/b
  return( x + y + z)
}
If you call f(dt) it'll give you all of the x+y+z values, and if you call f(dt,1) it'll return the value for the first row only.
EDIT:
Assuming that your column names are the variable names you want to assign, you could try this:
f <- function(dataTable,row_id){
  for(i in colnames(dataTable)){
    # create a local variable named after each column, holding that column's value for row_id
    assign(i, dataTable[[i]][row_id])
  }
  x <- a*b
  y <- a*c
  z <- a/b
  return( x + y + z)
}
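Another option that avoids assign altogether is to evaluate the function body with the selected row's columns as its variables, using base R's with. This is a minimal sketch that uses a plain data.frame as a stand-in for the data.table (the same call should work on a data.table, since dt[row_id, ] also returns the selected rows). The name f_row is hypothetical.

```r
# A data.frame standing in for the question's data.table
dt <- data.frame(a = 1:3, b = 5:7, c = 10:8)

f_row <- function(data, row_id) {
  # as.list() turns the selected row(s) into a named list of columns;
  # with() then evaluates the block with those names visible as variables,
  # so none of the column names has to be assigned by hand.
  with(as.list(data[row_id, ]), {
    x <- a * b
    y <- a * c
    z <- a / b
    x + y + z
  })
}

res1 <- f_row(dt, 1)
print(res1)  # 1*5 + 1*10 + 1/5 = 15.2
```

Because the arithmetic is vectorized, passing several row numbers (for example f_row(dt, 1:3)) returns one value per row.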

Number of differences between columns in a data frame in R

I have a data frame with sequences as columns and amino acid sites as rows. I would like to compare the difference between these sequences at each site.
seq1 seq2 seq3 seq4 seq5 seq6 seq7 seq8
1 K E K K A A A A
2 V D A A T A A A
3 W W W W W W W W
4 R R R R R R S R
5 F S F F F Y F F
6 P P P P P P P P
7 N N N C N N N N
8 V I D D Q Q Q Q
9 Q Q Q Q Q Q Q Q
10 E E G G L I S F
11 L L Q L L L L L
12 N N Y Y V V S S
13 N N N N Q Q P P
14 L L L L L L L L
15 T T T T T T T I
Ideally, I would like to be able to have an additional column in my data frame that shows me the sites that are the same in all sequences and those that are the same only between seq1-4 or seq 5-8.
I am not sure what the best way to do this is, and any help is greatly appreciated.
Also, is there a way to add another column that shows the types of amino acids observed at each site?
Thanks in advance!
First, I compute a vector flagging the rows where all columns are the same:
allsame <- apply(df,1,function(x){
  ifelse(length(unique(x)) == 1,1,0)
})
Next, I compute vectors flagging the rows where either set of columns is the same:
startfour <- apply(df[,1:4],1,function(x){
  ifelse(length(unique(x)) == 1,1,0)
})
lastfour <- apply(df[,5:8],1,function(x){
  ifelse(length(unique(x)) == 1,1,0)
})
gen <- startfour + lastfour
eithersame <- ifelse(gen == 0,0,1)
Finally, you can create a column vector as required from the above two vectors and join it to the data frame:
output <- character(length(allsame))
for(i in seq_along(allsame)){
  if(allsame[i] == 1){
    output[i] <- "all same"
  } else if(eithersame[i] == 1){
    output[i] <- "either same"
  } else {
    output[i] <- "none same"
  }
}
df <- cbind(df,output)
Here is a quick and dirty way to create the flags that you mentioned. Assuming the dataframe is called amino:
amino$first_flag<-with(amino,ifelse(seq1==seq2 & seq2==seq3 & seq3 == seq4,"same","diff"))
amino$second_flag<-with(amino,ifelse(seq5==seq6 & seq6==seq7 & seq7 == seq8,"same","diff"))
amino$total_flag<-with(amino,ifelse(first_flag=="same" & second_flag=="same" & seq1==seq5,"same","diff"))
Hopefully that works.
edit: and for your last question, I'm not sure what you mean but if you just want the letters that appear in each row then something like this could work:
amino$types <- NA_character_ # create the column first so it can be filled row by row
for(i in 1:nrow(amino)) amino$types[i]<-paste(unique(unlist(amino[i,1:8])),collapse=",")
It will give you a column containing a comma separated list of the letters that appeared in each row.
edit2: If you have significantly more than 8 sequences, then a modified form of Ganesh's solution might work better (his output code isn't actually necessary):
amino$first_flag <- apply(amino[,1:4],1,function(x){
  ifelse(length(unique(x)) == 1,"same","diff")
})
amino$second_flag <- apply(amino[,5:8],1,function(x){
  ifelse(length(unique(x)) == 1,"same","diff")
})
amino$total_flag <- apply(amino[,1:8],1,function(x){
  ifelse(length(unique(x)) == 1,"same","diff")
})
amino$types <- apply(amino[,1:8],1,function(x) paste(unique(x),collapse=","))
And for your new question-
amino$one_diff <- apply(amino[,1:8],1,function(x){
  ifelse(7 %in% as.data.frame(table(x))[,2,drop=TRUE],"1 diff",NA)
})
This uses the table() function which normally gives you a count based on a vector or a column like table(amino$seq1). Using apply, we instead stick a row of the 8 sequences into it, it returns the counts, then we use as.data.frame and the brackets [] to get rid of some extra table() output that we don't need. The "7 %in%" part means if there are 7 of the same letters then there must be 1 different one. Anything else (i.e., all 8 same or more than 1 difference) will get NA.
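The separate flags can also be folded into the single column the question asks for. Below is a minimal sketch on the first five sites of the example data. The helper same, the column name status and the label strings are hypothetical choices, not taken from the answers above.

```r
# First five sites of the question's data, typed in by hand
amino <- data.frame(
  seq1 = c("K", "V", "W", "R", "F"), seq2 = c("E", "D", "W", "R", "S"),
  seq3 = c("K", "A", "W", "R", "F"), seq4 = c("K", "A", "W", "R", "F"),
  seq5 = c("A", "T", "W", "R", "F"), seq6 = c("A", "A", "W", "R", "Y"),
  seq7 = c("A", "A", "W", "S", "F"), seq8 = c("A", "A", "W", "R", "F"),
  stringsAsFactors = FALSE
)

# TRUE for each row whose values are identical across the given columns
same <- function(cols) {
  apply(amino[, cols], 1, function(x) length(unique(x)) == 1)
}
first4 <- same(1:4)
last4  <- same(5:8)
all8   <- same(1:8)

# Most specific condition first, so each row gets exactly one label
amino$status <- ifelse(all8, "all same",
                ifelse(first4 & last4, "both groups same",
                ifelse(first4, "seq1-4 same",
                ifelse(last4, "seq5-8 same", "none same"))))
print(amino$status)
```

The "both groups same" label covers sites where seq1-4 agree and seq5-8 agree, but the two groups differ from each other.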

Concatenate a column from data.frame elements of a list

I am looking for an idiomatic way to join a column, say named 'x', which exists in every data.frame element of a list. I came up with a solution with two steps by using lapply and Reduce. The second attempt trying to use only Reduce failed. Can I actually use only Reduce with one anonymous function to do this?
#data
xs <- replicate(5, data.frame(x=sample(letters, 10, T), y =runif(10)), simplify = FALSE)
# This works, but may be still unnecessarily long
otmap = lapply(xs, function(df) df$x)
jotm = Reduce(c, otmap)
# This does not count as another solution:
jotm = Reduce(c, lapply(xs, function(df) df$x))
# Try to use only Reduce function. This produces an error
jotr =Reduce(function(a,b){c(a$x,b$x)}, xs)
# Error in a$x : $ operator is invalid for atomic vectors
We can unlist after extracting the 'x' column
unlist(lapply(xs, `[[`, 'x'))
#[1] b y y i z o q w p d f f z b h m c u f s j e i v y b w j n q e w i r h p z q f x a b v z e x l c q f
#Levels: b d i o p q w y z c f h m s u e j n v r x a l
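As for using only Reduce: the attempt in the question fails because after the first step the accumulator a is already an atomic vector, so a$x is no longer valid. Supplying an initial value to Reduce keeps the two roles apart: the accumulator is always a vector and the second argument is always a data.frame. This is a minimal sketch, assuming the x column is character (hence stringsAsFactors = FALSE); the seed and sizes are arbitrary.

```r
set.seed(1)  # arbitrary seed, only to make the toy data reproducible
xs <- replicate(3,
                data.frame(x = sample(letters, 4, TRUE), y = runif(4),
                           stringsAsFactors = FALSE),
                simplify = FALSE)

# acc starts as character(0) and grows by one data.frame's x column per step
jotr <- Reduce(function(acc, df) c(acc, df$x), xs, character(0))
print(jotr)
```

The result matches unlist(lapply(xs, `[[`, "x")), which is still the shorter idiom.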
