I have a data frame containing 6 columns and I want to add a column with information about the 'parent-row'. This sounds vague, so I'll clarify with an example. Below you can see a data frame:
` Nclass0 Nclass1 BestS BestI impurity n
[1,] 5 5 4 36.0 0.2500000 10
[2,] 5 2 1 37.0 0.2040816 7
[3,] 4 0 -1 -1.0 0.0000000 4
[4,] 1 2 2 0.5 0.2222222 3
[5,] 1 0 -1 -1.0 0.0000000 1
[6,] 0 2 -1 -1.0 0.0000000 2
[7,] 0 3 -1 -1.0 0.0000000 3`
Using Nclass0 and Nclass1, I want to add a column in which matching pairs have the same id. The first row is the parent row (with id=0). Two rows match if [rowX,1] + [rowY,1] equals the parent row's Nclass0 and [rowX,2] + [rowY,2] equals the parent row's Nclass1. RowX and rowY are the child rows and should get id=1.
In this case the parent row [1,] has child rows [2,]&[7,] and these rows should get id=1. After this the second row becomes the parent row with its own child rows [3,] and [4,] with id=2, until all rows with child rows have been assigned an id.
I have made several attempts but failed miserably. Does anyone have a suggestion how this can be done? The desired output for this case would be:
` Nclass0 Nclass1 BestS BestI impurity n id
[1,] 5 5 4 36.0 0.2500000 10 0
[2,] 5 2 1 37.0 0.2040816 7 1
[3,] 4 0 -1 -1.0 0.0000000 4 2
[4,] 1 2 2 0.5 0.2222222 3 2
[5,] 1 0 -1 -1.0 0.0000000 1 4
[6,] 0 2 -1 -1.0 0.0000000 2 4
[7,] 0 3 -1 -1.0 0.0000000 3 1`
Here's a solution that makes use of a while loop. The loop will run until either every row has an id value, or until it has evaluated all of the rows in the data frame. I'm sure there are some weaknesses, but it's a good start:
Note: I think this could get unbearably slow on a large data frame, so I hope you don't need to do this on anything large (each call to outer takes about 1 second on a vector of 10,000).
DF <-
structure(list(Nclass0 = c(5, 5, 4, 1, 1, 0, 0),
Nclass1 = c(5, 2, 0, 2, 0, 2, 3),
BestS = c(4, 1, -1, 2, -1, -1, -1),
BestI = c(36, 37, -1, 0.5, -1, -1, -1),
impurity = c(0.25, 0.2040816, 0, 0.2222222, 0, 0, 0),
n = c(10, 7, 4, 3, 1, 2, 3)),
.Names = c("Nclass0", "Nclass1", "BestS", "BestI", "impurity", "n"),
row.names = c(NA, -7L), class = "data.frame")
DF[["id"]] <- c(0, rep(NA, nrow(DF) - 1))
i <- 1
while (sum(is.na(DF[["id"]])) > 0) {
  # pairwise sums of Nclass0 that equal row i's value (each pair counted once)
  cross0 <- outer(DF[["Nclass0"]], DF[["Nclass0"]], `+`)
  match0 <- cross0 == DF[["Nclass0"]][i] & lower.tri(cross0)
  # same for Nclass1
  cross1 <- outer(DF[["Nclass1"]], DF[["Nclass1"]], `+`)
  match1 <- cross1 == DF[["Nclass1"]][i] & lower.tri(cross1)
  # rows that match on both columns are the children of row i
  rows <- as.vector(which(match0 & match1, arr.ind = TRUE))
  if (length(rows)) DF[["id"]][rows] <- i
  if (i == nrow(DF)) break else i <- i + 1
}
Explanation
To try to clarify your problem, you are looking for pairs where x1 + x2 = x_ref AND y1 + y2 = y_ref.
What this code does is make a matrix of all of the possible pairwise sums of a vector with itself. This is accomplished with outer.
outer(DF[["Nclass0"]], DF[["Nclass0"]], `+`)
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 10 10 9 6 6 5 5
[2,] 10 10 9 6 6 5 5
[3,] 9 9 8 5 5 4 4
[4,] 6 6 5 2 2 1 1
[5,] 6 6 5 2 2 1 1
[6,] 5 5 4 1 1 0 0
[7,] 5 5 4 1 1 0 0
When trying to find the x-match for the first row, we compare this matrix to DF$Nclass0[1] (and set the upper triangle and the diagonal to FALSE to avoid duplicates and self-pairs).
match0 <- cross0 == DF[["Nclass0"]][1] & lower.tri(cross0)
match0
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[6,] TRUE TRUE FALSE FALSE FALSE FALSE FALSE
[7,] TRUE TRUE FALSE FALSE FALSE FALSE FALSE
We repeat this process for Nclass1
cross1 <- outer(DF[["Nclass1"]], DF[["Nclass1"]], `+`)
match1 <- cross1 == DF[["Nclass1"]][1] & lower.tri(cross1)
match1
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[5,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[7,] FALSE TRUE FALSE TRUE FALSE TRUE FALSE
To find the row indices, we want to find the intersection of these two match matrices--in other words which positions in both matrices are TRUE
as.vector(which(match0 & match1, arr.ind = TRUE))
[1] 7 2
So rows 7 and 2 are related to the first row. We can repeat this operation for each subsequent row until we've assigned an ID for every row.
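Since the two matrices of pairwise sums do not depend on which parent row is being examined, they can be built once, outside the loop. Here is a minimal sketch of that variant, assuming the same matching rule as above; the names `lower` and `hit` are mine, and only the two columns that matter are kept in the toy data.

```r
# Toy data from the question, reduced to the two columns used for matching.
DF <- data.frame(Nclass0 = c(5, 5, 4, 1, 1, 0, 0),
                 Nclass1 = c(5, 2, 0, 2, 0, 2, 3))

# The pairwise sums never change, so compute them a single time.
cross0 <- outer(DF$Nclass0, DF$Nclass0, `+`)
cross1 <- outer(DF$Nclass1, DF$Nclass1, `+`)
lower  <- lower.tri(cross0)  # count each pair once and drop self-pairs

id <- c(0, rep(NA_real_, nrow(DF) - 1))
for (i in seq_len(nrow(DF))) {
  if (!anyNA(id)) break      # stop as soon as every row has an id
  hit <- which(lower & cross0 == DF$Nclass0[i] & cross1 == DF$Nclass1[i],
               arr.ind = TRUE)
  if (length(hit)) id[as.vector(hit)] <- i
}
DF$id <- id
DF
```

This only removes the repeated calls to outer; the two n-by-n matrices still have to fit in memory, so the caveat about large data frames still applies.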
Turning it into a function
Here's a function that takes a data frame, a column name for the x-match, the column name for the y-match, and a character to name the id variable. I've added some bells and whistles to check the inputs.
assign_id <- function(DF, class0, class1, id_var){
  # fail early if checkmate is missing, without attaching it to the search path
  if (!requireNamespace("checkmate", quietly = TRUE))
    stop("Install the checkmate package")
  checkmate::assert_character(x = class0,
                              len = 1)
  checkmate::assert_character(x = class1,
                              len = 1)
  checkmate::assert_character(x = id_var,
                              len = 1)
  checkmate::assert_subset(c(class0, class1),
                           choices = names(DF))
  i <- 1
  DF[[id_var]] <- c(0, rep(NA, nrow(DF) - 1))
  while (sum(is.na(DF[[id_var]])) > 0) {
    cross0 <- outer(DF[[class0]], DF[[class0]], `+`)
    match0 <- cross0 == DF[[class0]][i] & lower.tri(cross0)
    cross1 <- outer(DF[[class1]], DF[[class1]], `+`)
    match1 <- cross1 == DF[[class1]][i] & lower.tri(cross1)
    rows <- as.vector(which(match0 & match1, arr.ind = TRUE))
    if (length(rows)) DF[[id_var]][rows] <- i
    if (i == nrow(DF)) break else i <- i + 1
  }
  DF
}
assign_id(DF, "Nclass0", "Nclass1", "id")
Related
I have a matrix of values with thousands of rows and a couple dozen columns. For a given row, $$R_0$$, I'd like to find all other complementary rows. A complementary row is defined as:
if given row has a non-zero value for a column, then the complement must have a zero value for that column
the sum of the elements of a given row and its complements must be less than 1.0
To illustrate, here is a toy matrix:
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0 0 0 0.1816416 0 0.1796779
[2,] 0.1889351 0 0 0 0 0
[3,] 0 0 0.1539683 0 0 0.1983812
[4,] 0 0.155489 0.1869410 0 0 0
[5,] 0 0 0 0 0.1739382 0
For row 1, there are values for columns 4 and 6. A complementary row must have "0" for columns 4 and 6.
I don't know what data structure my desired output should be. But I know the output should tell me:
row 1 has the following complementary rows: 2, 4, 5
row 2 has the following complementary rows: 1, 3, 4, 5
row 3 has the following complementary rows: 2, 5
row 4 has the following complementary rows: 1, 2, 5
row 5 has the following complementary rows: 1, 2, 3, 4
Perhaps a list of lists? I.e.:
[1: 2, 4, 5;
2: 1, 3, 4, 5;
3: 2, 5;
4: 1, 2, 5;
5: 1, 2, 3, 4]
But I'm open to other data structures.
The following code generates the toy matrix above.
set.seed(1)
a = runif(n=30, min=0, max=0.2)
a[a<0.15] = 0
A = matrix(a, # the data elements
nrow=5, # number of rows
ncol=6, # number of columns
byrow = TRUE) # fill matrix by rows
Is there a package or clever way to approach this problem?
We can create a function to check whether two rows are complements of each other:
check_compliment <- function(x, y) {
all(A[y, A[x,] != 0] == 0) & sum(c(A[x, ], A[y, ])) < 1
}
Here, we subset row y for columns where x is not 0 and check if all of them are 0. Also check if sum of x and y rows is less than 1.
Apply this function for every combination using outer
sapply(data.frame(outer(1:nrow(A), 1:nrow(A), Vectorize(check_compliment))), which)
#$X1
#[1] 2 4 5
#$X2
#[1] 1 3 4 5
#$X3
#[1] 2 5
#$X4
#[1] 1 2 5
#$X5
#[1] 1 2 3 4
The outer step gives a TRUE/FALSE value for every pair of rows, indicating whether one is a complement of the other:
outer(1:nrow(A), 1:nrow(A), Vectorize(check_compliment))
# [,1] [,2] [,3] [,4] [,5]
#[1,] FALSE TRUE FALSE TRUE TRUE
#[2,] TRUE FALSE TRUE TRUE TRUE
#[3,] FALSE TRUE FALSE FALSE TRUE
#[4,] TRUE TRUE FALSE FALSE TRUE
#[5,] TRUE TRUE TRUE TRUE FALSE
We convert this to data frame and use which to get indices for every column.
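With thousands of rows, calling a Vectorize'd function once per pair may get slow. Below is a sketch of the same test done with matrix operations, under the assumption that "complement" means exactly the two conditions from the question; the names `nz`, `overlap`, `total` and `comp` are mine, and the toy matrix is typed in by hand rather than generated.

```r
# Toy matrix from the question, entered row by row.
A <- matrix(c(0,         0,        0,         0.1816416, 0,         0.1796779,
              0.1889351, 0,        0,         0,         0,         0,
              0,         0,        0.1539683, 0,         0,         0.1983812,
              0,         0.155489, 0.1869410, 0,         0,         0,
              0,         0,        0,         0,         0.1739382, 0),
            nrow = 5, byrow = TRUE)

nz      <- (A != 0) + 0          # 1 where a value is non-zero, 0 otherwise
overlap <- tcrossprod(nz)        # [i, j] = number of columns non-zero in both rows
total   <- outer(rowSums(A), rowSums(A), `+`)  # sum of all elements of rows i and j

comp <- overlap == 0 & total < 1 # no shared non-zero column, combined sum below 1
res  <- lapply(seq_len(nrow(A)), function(i) which(comp[i, ]))
res
```

The result is a plain list with one integer vector per row, which is close to the "list of lists" the question asks about.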
I currently have three lists List.1, List.2, and List.3 which each contains 500 matrices, each of which has dimensions 100 x 100. Hence, List.1[[1]] is a matrix of dimensions 100 x 100.
For a given matrix in List.2, I would like to find which elements lie strictly between the corresponding elements of the matching matrices in List.1 and List.3. For one matrix in the 3 lists, the operation is:
+(List.2[[1]] < List.3[[1]] & List.2[[1]] > List.1[[1]])
which returns a matrix of 1's and 0's: 1 where the condition above is satisfied and 0 where it is not.
I would like to then do this over all 500 matrices in the list, without having to resort to loops. Is there a way to do this with the Reduce or lapply function, or both?
So far what I have is:
zero.one.mat <- List.1[[1]]-List.1[[1]] # Create empty zero matrix
for(i in 1:500){
zero.one.mat <- zero.one.mat + +(List.2[[i]] < List.3[[i]] & List.2[[i]] > List.1[[i]])
}
which obviously isn't the most ideal way to do it. Any thought would be appreciated. Thanks!
list.1 <- list()
list.2 <- list()
list.3 <- list()
N=5
list.1[[1]] <- matrix(1,nrow=N, ncol=N)
list.2[[1]] <- matrix(2,nrow=N, ncol=N)
list.3[[1]] <- matrix(3,nrow=N, ncol=N)
list.1[[2]] <- matrix(-1,nrow=N, ncol=N)
list.2[[2]] <- matrix(-2,nrow=N, ncol=N)
list.3[[2]] <- matrix(-3,nrow=N, ncol=N)
list.1[[3]] <- matrix(rnorm(N*N),nrow=N, ncol=N)
list.2[[3]] <- matrix(rnorm(N*N),nrow=N, ncol=N)
list.3[[3]] <- matrix(rnorm(N*N),nrow=N, ncol=N)
list.result <- lapply(1:length(list.1), FUN=function(i){list.2[[i]] < list.3[[i]] & list.2[[i]] > list.1[[i]]})
# [[1]]
# [,1] [,2] [,3] [,4] [,5]
# [1,] TRUE TRUE TRUE TRUE TRUE
# [2,] TRUE TRUE TRUE TRUE TRUE
# [3,] TRUE TRUE TRUE TRUE TRUE
# [4,] TRUE TRUE TRUE TRUE TRUE
# [5,] TRUE TRUE TRUE TRUE TRUE
#
# [[2]]
# [,1] [,2] [,3] [,4] [,5]
# [1,] FALSE FALSE FALSE FALSE FALSE
# [2,] FALSE FALSE FALSE FALSE FALSE
# [3,] FALSE FALSE FALSE FALSE FALSE
# [4,] FALSE FALSE FALSE FALSE FALSE
# [5,] FALSE FALSE FALSE FALSE FALSE
#
# [[3]]
# [,1] [,2] [,3] [,4] [,5]
# [1,] FALSE FALSE FALSE TRUE FALSE
# [2,] FALSE FALSE FALSE FALSE FALSE
# [3,] FALSE FALSE FALSE FALSE FALSE
# [4,] FALSE TRUE FALSE FALSE FALSE
# [5,] FALSE FALSE FALSE TRUE TRUE
# If you need to find the sum of all of them, then you can add Reduce:
Reduce("+",lapply(1:length(list.1),
FUN=function(i){
list.2[[i]] < list.3[[i]] &
list.2[[i]] > list.1[[i]]}))
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 1 1 2 1
# [2,] 1 1 1 1 1
# [3,] 1 1 1 1 1
# [4,] 1 2 1 1 1
# [5,] 1 1 1 2 2
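The same idea can also be written with Map, which walks the three lists in parallel and avoids indexing by position. This is only a sketch on small made-up 2 x 2 matrices (the values are mine, chosen so the result is easy to check by hand).

```r
# Hypothetical toy data: lower bounds, values to test, upper bounds.
list.1 <- list(matrix(1, 2, 2), matrix(-1, 2, 2), matrix(c(0, 0, 5, 5), 2, 2))
list.2 <- list(matrix(2, 2, 2), matrix(-2, 2, 2), matrix(c(1, 9, 6, 4), 2, 2))
list.3 <- list(matrix(3, 2, 2), matrix(-3, 2, 2), matrix(c(2, 2, 7, 7), 2, 2))

# One 0/1 matrix per triple: 1 where lo < x < hi.
between <- Map(function(lo, x, hi) +(x > lo & x < hi), list.1, list.2, list.3)

# Element-wise count of how many triples satisfied the condition.
counts <- Reduce(`+`, between)
counts
```

Converting to 0/1 inside Map means the result is numeric even when the lists hold a single matrix, in which case Reduce simply returns that one element.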
I want to compare each value of a row of a data.frame to its corresponding value in a vector. Here is an example:
df1 <- matrix(c(2,2,4,8,6,9,9,6,4), ncol = 3)
df2 <- c(5,4,6)
> df1
[,1] [,2] [,3]
[1,] 2 8 9
[2,] 2 6 6
[3,] 4 9 4
> df2
[1] 5 4 6
The comparison would be, if a value in a row of df1 is smaller than its corresponding value in df2, so row1: 2 < 5, 8 < 5, 9 < 5; row2: 2 < 4, 6 < 4, 6 < 4; row3: 4 < 6, 9 < 6, 4 < 6
> result
[,1] [,2] [,3]
[1,] TRUE FALSE FALSE
[2,] TRUE FALSE FALSE
[3,] TRUE FALSE TRUE
Is there any way to do this without use of a loop?
Thanks lads!
We can just do a comparison to create the logical matrix
df1 < df2
# [,1] [,2] [,3]
#[1,] TRUE FALSE FALSE
#[2,] TRUE FALSE FALSE
#[3,] TRUE FALSE TRUE
This works because of vector recycling. R stores a matrix column by column, so 'df2' is recycled down each column of 'df1': df2[1] is compared with every element of row 1, df2[2] with every element of row 2, and so on. This lines up only because the length of 'df2' equals the number of rows of 'df1'.
To make the row-wise pairing explicit instead of relying on recycling, we can index the vector by row number
df1 < df2[row(df1)]
# [,1] [,2] [,3]
#[1,] TRUE FALSE FALSE
#[2,] TRUE FALSE FALSE
#[3,] TRUE FALSE TRUE
Or another option is sweep
sweep(df1, 1, df2, "<")
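To see that the three forms agree on the example data, here is a small check. It also prints the vector expanded to the shape of the matrix, which makes it visible which element of df2 each cell is compared against; the name `expanded` is mine.

```r
df1 <- matrix(c(2, 2, 4, 8, 6, 9, 9, 6, 4), ncol = 3)
df2 <- c(5, 4, 6)

# df2 laid out in the shape of df1: row i is filled with df2[i].
expanded <- matrix(df2[row(df1)], nrow = nrow(df1))
expanded

by_recycling <- df1 < df2             # relies on column-major recycling
by_index     <- df1 < df2[row(df1)]   # explicit row-wise pairing
by_sweep     <- sweep(df1, 1, df2, "<")

all(by_recycling == by_index, by_recycling == by_sweep)
```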
I want to set NA's in every element of a matrix where the value in a column is greater than or equal to the value of a given vector. For example, I can create a matrix:
set.seed(1)
zz <- matrix(data = round(10L * runif(12)), nrow = 4, ncol = 3)
which gives for zz:
[,1] [,2] [,3]
[1,] 8 5 7
[2,] 6 5 1
[3,] 5 10 3
[4,] 9 1 9
and for the comparison vector (for example):
xx <- round(10L * runif(4))
where xx is:
[1] 6 3 8 2
if I perform this operation:
apply(zz,2,function(x) x >= xx)
I get:
[,1] [,2] [,3]
[1,] TRUE FALSE TRUE
[2,] TRUE TRUE FALSE
[3,] FALSE TRUE FALSE
[4,] TRUE FALSE TRUE
What I want is everywhere I have a TRUE element I want an NA and everywhere I have a FALSE I get the number in the zz matrix (e.g., manually ...):
NA 5 NA
NA NA 1
5 NA 3
NA 1 NA
I can cobble together some "for" loops to do what I want, but is there a vector-based way to do this??
Thanks for any tips.
You could simply do:
zz[zz>=xx] <- NA
# [,1] [,2] [,3]
#[1,] NA 5 NA
#[2,] NA NA 1
#[3,] 5 NA 3
#[4,] NA 1 NA
Here is one option to get the expected output. We get a logical matrix (zz >= xx), using NA^ on that returns NA for the TRUE values and 1 for the FALSE, then multiply it with original matrix 'zz' so that NA remains as such while the 1 changes to the corresponding value in 'zz'.
NA^(zz >= xx)*zz
# [,1] [,2] [,3]
#[1,] NA 5 NA
#[2,] NA NA 1
#[3,] 5 NA 3
#[4,] NA 1 NA
Or another option is ifelse
ifelse(zz >= xx, NA, zz)
data
zz <- structure(c(8, 6, 5, 9, 5, 5, 10, 1, 7, 1, 3, 9), .Dim = c(4L, 3L))
xx <- c(6, 3, 8, 2)
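If the original matrix should be left untouched, base R's replace returns a modified copy instead of assigning in place. A quick sketch comparing it with the two expressions above, using the data just given (the result names are mine):

```r
zz <- structure(c(8, 6, 5, 9, 5, 5, 10, 1, 7, 1, 3, 9), .Dim = c(4L, 3L))
xx <- c(6, 3, 8, 2)

via_replace <- replace(zz, zz >= xx, NA)   # copy of zz with NA where zz >= xx
via_power   <- NA^(zz >= xx) * zz          # NA^TRUE is NA, NA^FALSE is 1
via_ifelse  <- ifelse(zz >= xx, NA, zz)

# zz itself is unchanged, and all three results agree.
anyNA(zz)
isTRUE(all.equal(via_replace, via_power)) && isTRUE(all.equal(via_replace, via_ifelse))
```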
I have a data frame of the form:
my.df = data.frame(ID=c(1,2,3,4,5,6,7), STRAND=c('+','+','+','-','+','-','+'), COLLAPSE=c(0,0,1,0,1,0,0))
and another matrix of dimensions nrow(my.df) by nrow(my.df). It is a correlation matrix, but that's not important for the discussion.
For example:
mat = matrix(rnorm(n=nrow(my.df)*nrow(my.df),mean=1,sd=1), nrow = nrow(my.df), ncol=nrow(my.df))
The question is how to retrieve only the upper-triangle elements of mat for pairs of rows of my.df that both have COLLAPSE == 0 and are on the same strand.
In this specific example, I'd be interested in retrieving the following entries from matrix mat in a vector:
mat[1,2]
mat[1,7]
mat[2,7]
mat[4,6]
The logic is as follows: rows 1 and 2 are on the same strand and both have a COLLAPSE value of zero, so mat[1,2] should be retrieved; row 3 would never be combined with any other row because it has COLLAPSE = 1; rows 1 and 7 are on the same strand and both have COLLAPSE = 0, so mat[1,7] should also be retrieved, and so on.
I could write a for loop but I am looking for a more crantastic way to achieve such results...
Here's one way to do it using outer:
First, find indices with identical STRAND values and where COLLAPSE == 0:
idx <- with(my.df, outer(STRAND, STRAND, "==") &
outer(COLLAPSE, COLLAPSE, Vectorize(function(x, y) !any(x, y))))
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] TRUE TRUE FALSE FALSE FALSE FALSE TRUE
# [2,] TRUE TRUE FALSE FALSE FALSE FALSE TRUE
# [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [4,] FALSE FALSE FALSE TRUE FALSE TRUE FALSE
# [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [6,] FALSE FALSE FALSE TRUE FALSE TRUE FALSE
# [7,] TRUE TRUE FALSE FALSE FALSE FALSE TRUE
Second, set values in lower triangle and on the diagonal to FALSE. Create a numeric index:
idx2 <- which(idx & upper.tri(idx), arr.ind = TRUE)
# row col
# [1,] 1 2
# [2,] 4 6
# [3,] 1 7
# [4,] 2 7
Extract values:
mat[idx2]
# [1] 1.72165093 0.05645659 0.74163428 3.83420241
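The Vectorize call can be avoided by first turning COLLAPSE into a logical "keep" vector and combining it with outer and "&". This is a sketch of that variant; the names `keep` and `pairs` are mine, and mat is filled with 1..49 instead of random numbers so the extracted values can be checked by eye.

```r
my.df <- data.frame(ID = 1:7,
                    STRAND = c('+', '+', '+', '-', '+', '-', '+'),
                    COLLAPSE = c(0, 0, 1, 0, 1, 0, 0),
                    stringsAsFactors = FALSE)
# Deterministic stand-in for the correlation matrix: mat[i, j] = (j - 1) * 7 + i.
mat <- matrix(seq_len(49), nrow = 7, ncol = 7)

keep <- my.df$COLLAPSE == 0
idx  <- outer(my.df$STRAND, my.df$STRAND, "==") &  # same strand
        outer(keep, keep, "&") &                   # neither row collapsed
        upper.tri(mat)                             # upper triangle, no diagonal

pairs <- which(idx, arr.ind = TRUE)
pairs
mat[pairs]
```

As with which on any matrix, the pairs come back in column-major order, so (4, 6) appears before (1, 7).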
Here's one way to do it.
# select only the 0 collapse records
sel <- my.df$COLLAPSE==0
# split the data frame by strand
groups <- split(my.df$ID[sel], my.df$STRAND[sel])
# generate all possible pairs of IDs within the same strand
pairs <- lapply(groups, combn, 2)
# subset the entries from the matrix
lapply(pairs, function(ij) mat[t(ij)])
df <- my.df[my.df$COLLAPSE == 0, ]
strand <- c("+", "-")
idx <- do.call(rbind, lapply(strand, function(strand){
t(combn(x = df$ID[df$STRAND == strand], m = 2))
}))
idx
# [,1] [,2]
# [1,] 1 2
# [2,] 1 7
# [3,] 2 7
# [4,] 4 6
mat[idx]