R: How to detect a pattern in a matrix by row

I have a big matrix with 4 columns, containing values normalized by column (mean ≈ 0, standard deviation = 1).
I would like to see if there is a pattern in the matrix and, if so, cluster the rows by pattern. By pattern I mean the ordering of the values in a given row. For example,
for row N:
if the value in column 1 < column 2 < column 3 < column 4, then call it, say, pattern 1.
Basically there are 4^4 = 256 possible patterns (in theory).
Is there a way to do this in R?
Thanks in advance
Rad

Yes. (Although the number of distinct orderings is only 4! = 4*3*2*1 = 24: after the first value is chosen there are only three possibilities for the second, and after the second is fixed only two orderings remain.) Applying the order function to each row gives the desired permutation of 1, 2, 3, 4:
mtx <- matrix(rnorm(10000), ncol=4)
res <- apply(mtx, 1, function(x) paste( order(x), collapse=".") )
> table(res)
res
1.2.3.4 1.2.4.3 1.3.2.4 1.3.4.2 1.4.2.3 1.4.3.2
98 112 95 120 114 118
2.1.3.4 2.1.4.3 2.3.1.4 2.3.4.1 2.4.1.3 2.4.3.1
101 114 105 102 104 122
3.1.2.4 3.1.4.2 3.2.1.4 3.2.4.1 3.4.1.2 3.4.2.1
105 82 107 90 97 86
4.1.2.3 4.1.3.2 4.2.1.3 4.2.3.1 4.3.1.2 4.3.2.1
99 93 100 108 118 110
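To go from tabulating patterns to actually clustering the rows, as the question asked, one option is to split the row indices by the pattern string. A minimal sketch reusing the answer's mtx and res:

```r
mtx <- matrix(rnorm(10000), ncol = 4)
res <- apply(mtx, 1, function(x) paste(order(x), collapse = "."))

# Group row indices by their ordering pattern
groups <- split(seq_len(nrow(mtx)), res)

# All rows whose values satisfy column 1 < 2 < 3 < 4
rows_1234 <- mtx[groups[["1.2.3.4"]], , drop = FALSE]
```

With 2500 rows, all 24 permutations will essentially always occur, so `groups` has 24 components, each holding the row numbers of one pattern.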

Related

lapply with different indices in list of lists

I'm trying to get the output of a certain column ($new_age = numeric values) within lists of lists.
Data is "my_groups", which consists of 28 lists. Those lists have lists themselves of irregular size:
92 105 96 86 91 94 73 100 87 89 88 90 112 82 95 83 94 106 91 101 86 81 89 68 89 87 109 73 (len_df)
The 1st list has 92 lists, the 2nd 105 etc. ... until the 28th list with 73 lists.
First, I want my function to iterate through the 28 years of data and, second, within each year, iterate through len_df, since $new_age sits in the nested lists.
What I tried is this:
test <- lapply(seq(1:28), function(i) sapply(seq(1:len_df), function(j) (my_groups[[i]][[j]]$new_age) ) )
However, the index is out of bounds, and I'm not sure how to combine two different indices for the nested lists. unlist is not ideal, since I have to keep the data as separate groups, sorted by year.
Expected output: $new_age (numeric values) for each of the 28 years e.g. 1st = 92 values, 2nd = 105 values etc.
Any idea how to make this work? Thank you!
Here are a few different approaches:
1) whole object approach. Assuming that the input is the L shown reproducibly in the Note at the end, and that what is wanted is a list of length(L) numeric vectors of the new_age values, i.e. list(1:2, 3:5):
lapply(L, sapply, `[[`, "new_age")
giving:
[[1]]
[1] 1 2
[[2]]
[1] 3 4 5
2) indices. If you want to do it using indices, as in the code shown in the question, then use seq_along:
ix <- seq_along(L)
lapply(ix, function(i) sapply(seq_along(L[[i]]), function(j) L[[i]][[j]]$new_age))
3) unlist. To use unlist, form an appropriate grouping variable with rep and split the flattened values into separate vectors by it. This assumes that the new_age values are the only leaves, which may or may not be the case in your data but is the case in the reproducible example in the Note at the end.
split(unname(unlist(L)), rep(seq_along(L), lengths(L)))
Note
L <- list(list(list(new_age = 1), list(new_age = 2)),
list(list(new_age = 3), list(new_age = 4), list(new_age = 5)))
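As a quick sanity check, re-running the three approaches on the Note's L shows they agree (split names its components "1", "2", ..., so those names are stripped before comparing):

```r
L <- list(list(list(new_age = 1), list(new_age = 2)),
          list(list(new_age = 3), list(new_age = 4), list(new_age = 5)))

r1 <- lapply(L, sapply, `[[`, "new_age")
r2 <- lapply(seq_along(L), function(i)
  sapply(seq_along(L[[i]]), function(j) L[[i]][[j]]$new_age))
r3 <- split(unname(unlist(L)), rep(seq_along(L), lengths(L)))

identical(r1, r2)          # TRUE
identical(r1, unname(r3))  # TRUE
```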

R: Get the index of the first element in a vector greater than a variable, for a whole column

I am trying to create a new variable in a data.table. For each observation, it should compare a variable in the data.table to a vector and return the index of the first element of the vector that is greater than that observation.
Example
ComparatorVector <- seq(1000, 200000, 1000)
Variable <- runif(10, min = 1000, max = 200000)
For each observation in Variable I'd like to know the index of the first observation in ComparatorVector that is larger than the observation of Variable.
I've played around with min(which()) but couldn't get it to scan the whole ComparatorVector. I also saw the match() function, but couldn't get it to return anything but the index of an exact match.
An option is findInterval
findInterval(Variable, ComparatorVector) + 1
#[1] 190 152 99 107 38 148 114 95 53 73
Or with sapply
sapply(Variable, function(x) which(ComparatorVector > x)[1])
#[1] 190 152 99 107 38 148 114 95 53 73
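One caveat (my addition, not from the answers above): the two approaches differ when a value exceeds max(ComparatorVector). findInterval(...) + 1 then yields an index one past the end of the vector, while the which() version returns NA:

```r
ComparatorVector <- seq(1000, 200000, 1000)

v <- 250000                             # larger than every element
findInterval(v, ComparatorVector) + 1   # 201: one past the last valid index
which(ComparatorVector > v)[1]          # NA: no element is greater
```

Which behaviour is preferable depends on how out-of-range values should be handled downstream.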

How to get a specific column from a matrix in r?

I have a matrix as following, how can I extract the desired column with [?
MX <- matrix(101:112,ncol=3)
MX[,2]
# [1] 105 106 107 108
`[`(MX, c(1:4,2))
# [1] 101 102 103 104 102
Obviously, it does not extract the 2nd column as one might guess; instead it treats the matrix as a single column-major vector and returns the elements at those positions.
What I am really asking is how to express MX[,2] with [.
Please advise, Thanks
Keep the row index blank:
`[`(MX, ,2)
#[1] 105 106 107 108
or, if we need to extract selected rows (1:4) of a specific column (2), specify the row and column indices as separate arguments without concatenating them: c would collapse the row and column indices into a single index vector instead of two
`[`(MX, 1:4, 2)
#[1] 105 106 107 108
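A related point worth knowing (my addition): with a single index vector, [ walks the matrix as one long column-major vector, and the drop argument controls whether a single-column result keeps its matrix shape. A small sketch:

```r
MX <- matrix(101:112, ncol = 3)

# A single index counts down the columns: the 5th element overall is MX[1, 2]
`[`(MX, 5)                  # 105

# drop = FALSE keeps the result as a one-column matrix instead of a vector
`[`(MX, , 2, drop = FALSE)
```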

What can I do to find and remove semi-duplicate rows in a matrix?

Assume I have this matrix
set.seed(123)
x <- matrix(rnorm(410),205,2)
x[8,] <- c(0.13152348, -0.05235148) #similar to x[5,]
x[16,] <- c(1.21846582, 1.695452178) #similar to x[11,]
The values are very similar to the rows specified above, and in the context of the whole data, they are semi-duplicates. What could I do to find and remove them? My original data is an array that contains many such matrices, but the position of the semi duplicates is the same across all matrices.
I know of agrep but the function operates on vectors as far as I understand.
You will need to set a threshold, but you can simply compute the distance between each pair of rows using dist and find the points that are sufficiently close together. Of course, each point is near itself, so you need to ignore the diagonal of the distance matrix.
DM = as.matrix(dist(x))
diag(DM) = 1 ## ignore diagonal
which(DM < 0.025, arr.ind=TRUE)
row col
8 8 5
5 5 8
16 16 11
11 11 16
48 48 20
20 20 48
168 168 71
91 91 73
73 73 91
71 71 168
This finds the "close" points that you created and a few others that got generated at random.
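To then remove the semi-duplicates, as the question asked, one reasonable choice (a sketch, using the same threshold) is to exploit the fact that each close pair (i, j) also appears as (j, i) in the which() output, and drop the larger row index of each pair:

```r
set.seed(123)
x <- matrix(rnorm(410), 205, 2)
x[8, ]  <- c(0.13152348, -0.05235148)   # similar to x[5,]
x[16, ] <- c(1.21846582, 1.695452178)   # similar to x[11,]

DM <- as.matrix(dist(x))
diag(DM) <- 1                            # ignore self-distances
close <- which(DM < 0.025, arr.ind = TRUE)

# Each pair appears twice; keep the smaller index, drop the larger
drop_rows <- unique(pmax(close[, "row"], close[, "col"]))
x_clean <- x[-drop_rows, ]
```

For the array case described in the question, since the positions are the same across matrices, drop_rows computed once can be applied to every slice.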

data.matrix() when characters are involved

In order to calculate the highest contribution of a row per ID, I have a script which works when the IDs are numeric. Today, however, I found out that IDs can also contain characters (for instance ABC10101). For the function to work, the dataset is converted to a matrix, but data.matrix(df) does not support characters. Can the code be altered so that the function works with all kinds of IDs (character, numeric, etc.)? Currently I use a quick workaround that converts character IDs to numeric, but that slows the process down for large datasets.
Example with code (function: extract the first entry with the highest contribution, so if 2 entries have the same contribution it selects the first):
Note: in this example ID is interpreted as a factor and data.matrix() converts it to a numeric value. In the code below, the type of the ID column should be character, and the output should be as shown at the bottom. The order of the IDs must remain the same.
tc <- textConnection('
ID contribution uniqID
ABCUD022221 40 101
ABCUD022221 40 102
ABCUD022222 20 103
ABCUD022222 10 104
ABCUD022222 90 105
ABCUD022223 75 106
ABCUD022223 15 107
ABCUD022223 10 108 ')
df <- read.table(tc,header=TRUE)
#Function that needs to be altered
uniqueMaxContr <- function(m, ID = 1, contribution = 2) {
  t(
    vapply(
      split(1:nrow(m), m[, ID]),
      function(i, x, contribution)
        x[i, , drop = FALSE][which.max(x[i, contribution]), ],
      m[1, ],
      x = m, contribution = contribution
    )
  )
}
df <- data.matrix(df) # only works when ID is numeric
highestdf <- uniqueMaxContr(df)
highestdf <- as.data.frame(highestdf)
In this case the outcome should be:
ID contribution uniqID
ABCUD022221 40 101
ABCUD022222 90 105
ABCUD022223 75 106
Others might be able to make it more concise, but this is my attempt at a data.table solution:
tc <- textConnection('
ID contribution uniqID
ABCUD022221 40 101
ABCUD022221 40 102
ABCUD022222 20 103
ABCUD022222 10 104
ABCUD022222 90 105
ABCUD022223 75 106
ABCUD022223 15 107
ABCUD022223 10 108 ')
df <- read.table(tc,header=TRUE)
library(data.table)
dt <- as.data.table(df)
setkey(dt,uniqID)
dt2 <- dt[,list(contribution=max(contribution)),by=ID]
setkeyv(dt2,c("ID","contribution"))
setkeyv(dt,c("ID","contribution"))
dt[dt2,mult="first"]
## ID contribution uniqID
## [1,] ABCUD022221 40 101
## [2,] ABCUD022222 90 105
## [3,] ABCUD022223 75 106
EDIT -- more concise solution
You can use .SD, which is the subset of the data.table for each group, and then use which.max to extract a single row.
In one line:
dt[,.SD[which.max(contribution)],by=ID]
## ID contribution uniqID
## [1,] ABCUD022221 40 101
## [2,] ABCUD022222 90 105
## [3,] ABCUD022223 75 106
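If pulling in data.table is not an option, the same "first row with the maximum contribution per ID" logic can be sketched in base R, working on the data.frame directly so the character IDs never need converting (this sidesteps data.matrix() entirely):

```r
df <- read.table(textConnection('
ID contribution uniqID
ABCUD022221 40 101
ABCUD022221 40 102
ABCUD022222 20 103
ABCUD022222 10 104
ABCUD022222 90 105
ABCUD022223 75 106
ABCUD022223 15 107
ABCUD022223 10 108'), header = TRUE, stringsAsFactors = FALSE)

# which.max returns the first maximum, matching the "first entry wins" rule
highestdf <- do.call(rbind, lapply(split(df, df$ID),
                     function(d) d[which.max(d$contribution), ]))
```

split orders the groups by ID, so the original ID order is preserved here because the input is already sorted by ID.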
