Using pmatch for checking columns of a matrix into another - r

I have two matrices and I want to check which (column) vectors of the first one are also in the second one, and if so to get their index.
I tried to use pmatch, but I have to tweak the result because it first converts the matrices into vectors; see the MWE:
X <- matrix(rnorm(12), 3, 4)
x <- X[, c(2, 4)]
pm <- pmatch(x, X)
print(pm)
[1] 4 5 6 10 11 12
d1 <- dim(X)[1]
d2 <- length(pm)/d1
ind <- pmatch(x, X)[d1*c(1:d2)]/d1
print(ind)
[1] 2 4
ind is what I want, but I guess there might be a prebuilt function for this. I'm also concerned about computational efficiency.

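We can loop over the columns of 'x' and compare each one elementwise with the columns of 'X' using !=; a column of 'X' matches when it has no differing entries, i.e. its colSums is 0: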
sapply(seq_len(ncol(x)), function(i) which(!colSums(X != x[,i])))
#[1] 2 4
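If there are many columns to look up, one possible alternative (a sketch, not a built-in for this task) is to collapse each column into a single string key and let match() do the lookup. The helper col_key is a hypothetical name, and the approach assumes the columns of x are exact copies of columns of X, so that their pasted values agree.

```r
set.seed(1)                      # only to make the example reproducible
X <- matrix(rnorm(12), 3, 4)
x <- X[, c(2, 4)]

# col_key is a hypothetical helper: it builds one string per column
col_key <- function(m) apply(m, 2, paste, collapse = "_")

# Position in X of each column of x (NA if a column is not found)
ind <- match(col_key(x), col_key(X))
print(ind)
#[1] 2 4
```

This makes one pass over each matrix to build the keys instead of comparing every column of x against all of X.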

Related

Call apply-like function on two rows to match

I have a dataframe with multiple rows. I want to call a function on any two rows. For example, let's say I have this data and this myFunc, which accepts two args:
df <- data.frame(q1=c(1,2,5), q2=c(5,5,5), q3=c(5,2,5), q4=c(5,5,5), q5=c(2,3,1))
df
q1 q2 q3 q4 q5
1 1 5 5 5 2
2 2 5 2 5 3
3 5 5 5 5 1
myFunc<-function(a,b) sum((df[a,]==df[b,] & df[a,]==5)*1)
I want to apply myFunc to rows 1 and 2, myFunc(1,2), and I expect 2: myFunc computes how many 5s rows 1 and 2 have in common in the same column.
Since I have thousands of rows and I want to match all pairs, I want to do this without writing a for loop, maybe with do.call or the apply function family.
I tried this:
a=c(1,2) # match the row 1 and 2
b=c(2,3) # match the row 2 and 3
my_list=list(a,b)
do.call("myFunc", my_list)
But I got 4, instead of 2 and 2, any ideas?
The question recently changed. My understanding of it is that the input should be a list of pairs of row numbers and the output should be the same length as that list such that each component of the output is the number of columns with both entries equal to 5 in both rows defined by the corresponding pair. Thus for df shown in the question the list L shown below would correspond to c(myFunc(1, 2), myFunc(2, 3)) where myFunc is as defined in the question.
L <- list(1:2, 2:3)
myFunc2 <- function(x) myFunc(x[1], x[2])
sapply(L, myFunc2)
## [1] 2 2
Note that *1 in myFunc is unnecessary since sum will coerce a logical argument to numeric.
An alternative might be to specify the first row numbers as a vector and the second row numbers as another vector. In terms of L that would be a <- sapply(L, "[", 1); b <- sapply(L, "[", 2). Then use mapply.
a <- c(1, 2) # L[[1]][1], L[[2]][1]
b <- c(2, 3) # L[[1]][2], L[[2]][2]
mapply(myFunc, a, b)
## [1] 2 2
Try passing the rows instead of the row index
df <- data.frame(q1=c(1,2,5), q2=c(5,5,5), q3=c(5,2,5), q4=c(5,5,5), q5=c(2,3,1))
myFunc<-function(a,b) sum((a==b & a==5)*1)
myFunc(df[1,],df[2,])
This worked for me (returned 2)
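Since the question asks about all pairs of rows, another option is to skip the per-pair calls entirely. The following is a minimal sketch, assuming the data frame is entirely numeric: it builds a 0/1 matrix marking the 5s and multiplies it by its transpose, so entry [i, j] of the result counts the columns where rows i and j are both 5. The names is5 and counts are illustrative.

```r
df <- data.frame(q1=c(1,2,5), q2=c(5,5,5), q3=c(5,2,5), q4=c(5,5,5), q5=c(2,3,1))

# 0/1 indicator matrix: 1 where the entry equals 5
is5 <- (as.matrix(df) == 5) * 1

# counts[i, j] = number of columns where rows i and j are both 5
counts <- tcrossprod(is5)

counts[1, 2]   # same value as myFunc(1, 2), i.e. 2
counts[2, 3]   # same value as myFunc(2, 3), i.e. 2
```

Note that the result is an n by n matrix, so with many thousands of rows memory may become the limiting factor.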

Flag rows in matrix that contain the same set of values

I have a matrix of integers
m <- rbind(c(1,2),
c(3,6),
c(5,1),
c(2,1),
c(6,3))
and I am looking for a function that takes this matrix as input and outputs a vector flag with length(flag) == nrow(m) that assigns the same unique (let's say integer) value to rows that contain the same set of integers.
For the above example, the desired output would be:
flag <- c(1, 2, 3, 1, 2)
So rows 1 and 4 in m get the same flag 1, because they both contain the same set of integers, in this case {1, 2}. Similarly, rows 2 and 5 get the same flag.
The solution should work for any number of columns.
The only thing I could come up with is the following approach ...
FlagSymmetric <- function(x) {
  vec_sim <- rep(NA, nrow(x)) # object containing flags
  ind_ord <- ncol(x)
  counter <- 1
  for(i in 1:nrow(x)) {
    if(is.na(vec_sim[i])) { # if that row is not flagged yet, proceed ...
      vec_sim[i] <- counter # ... and give the next free flag
      for(j in (i+1):nrow(x)) {
        if( (i+1) > nrow(x) ) next # in case of tiny matrices
        ind <- x[j, ] %in% x[i, ]
        if(sum(ind)==ind_ord) vec_sim[j] <- counter # if the same, assign flag
      }
      counter <- counter + 1
    }
  }
  return(vec_sim)
}
... which does what I want:
> FlagSymmetric(m)
[1] 1 2 3 1 2
If n = nrow(m) this needs 1/2 n^2 operations. Of course, I could make it much quicker by writing this in C++, but this only alleviates my problem to some extent, because I am working with matrices with a potentially huge number of rows.
I guess there must be a smarter way of doing this.
EDIT:
Additional, more general example (sorting row and pasting to character string not possible):
m2 <- rbind(c(1,112),
c(11,12),
c(12,11),
c(112,1),
c(6,3))
flag2 <- c(1, 2, 2, 1, 3) # desired output
FlagSymmetric(m2) # works
[1] 1 2 2 1 3
This assumes you only have numeric data in your matrix.
First, convert the matrix to a data frame:
m <- data.frame(m)
We can sort every row and paste its values together, then convert the result to a factor and then to numeric to get a unique number for every combination:
m$flag <- as.numeric(factor(apply(m, 1, function(x) paste0(sort(x), collapse = ""))))
m
# X1 X2 flag
#1 1 2 1
#2 3 6 3
#3 5 1 2
#4 2 1 1
#5 6 3 3
EDIT
The above solution does not work for every combination, as the new example shows: with collapse = "" the rows (1, 112) and (11, 12) both paste to "1112". To keep the numbers distinguishable, as @d.b commented, we can use any non-empty collapse argument. For the updated example:
as.numeric(factor(apply(m2, 1, function(x) paste0(sort(x), collapse = "-"))))
#[1] 1 2 2 1 3
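The factor-based flags above are numbered by the sort order of the keys, which is why the first example gives 1 3 2 1 3 rather than the 1 2 3 1 2 asked for. If the numbering should follow first appearance, one possible variation is to match each key against the unique keys. This is only a sketch: flag_rows is a hypothetical name, and sorting treats each row as a multiset, so it assumes rows have no repeated values where that distinction would matter.

```r
m <- rbind(c(1,2), c(3,6), c(5,1), c(2,1), c(6,3))
m2 <- rbind(c(1,112), c(11,12), c(12,11), c(112,1), c(6,3))

# flag_rows is a hypothetical helper
flag_rows <- function(x) {
  # One key per row: sorted values joined with a non-empty separator
  key <- apply(x, 1, function(r) paste(sort(r), collapse = "-"))
  # unique() keeps keys in order of first appearance, so flags do too
  match(key, unique(key))
}

flag_rows(m)
#[1] 1 2 3 1 2
flag_rows(m2)
#[1] 1 2 2 1 3
```

Building the keys is a single pass over the rows, and match() uses hashing, so this avoids the pairwise comparison of the nested loop.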

How to retrieve even indices from a vector and divide them by two using R in one line?

The problem I face is:
Using the vector x below, divide every element in an even index by two, without modifying the elements in odd indices.
The vector is:
x <- 1:10
x[c(FALSE, TRUE)] <- x[c(FALSE, TRUE)]/2
Or maybe:
x[seq(2, length(x), 2)] <- x[seq(2, length(x), 2)]/2
Since dividing by 1 is the same as doing nothing, you could just rely on vector recycling and do:
x / 1:2
#[1] 1 1 3 2 5 3 7 4 9 5
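The recycling approach is clean here because length(x) is even. For an odd-length vector, R still recycles 1:2 but warns that the longer length is not a multiple of the shorter one. A small sketch that builds a divisor of exactly the right length with rep_len, which avoids the warning:

```r
x <- 1:11

# Divisor alternates 1, 2, 1, 2, ... and is cut to length(x)
y <- x / rep_len(1:2, length(x))
print(y)
# values: 1 1 3 2 5 3 7 4 9 5 11
```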

Named arrays, dataframes and matrices

If I split my data matrix into rows according to class labels in another vector y like this, the result is something with 'names' like this:
> X <- matrix(c(1,2,3,4,5,6,7,8),nrow=4,ncol=2)
> y <- c(1,3,1,3)
> X_split <- split(as.data.frame(X),y)
$`1`
V1 V2
1 1 5
3 3 7
$`3`
V1 V2
2 2 6
4 4 8
I want to loop through the results and do some operations on each matrix, for example sum the elements or sum the columns. How do I access each matrix in a loop so I can do that?
labels = names(X_split)
for (k in labels) {
# How do I get X_split[k] as a matrix?
sum_class = sum(X_split[k]) # Doesn't work
}
In fact, I don't really want to deal with dataframes and named arrays at all. Is there a way I can call split without as.data.frame and get a list of matrices or something similar?
To split without converting to a data frame
X_split <- list(X[c(1, 3), ], X[c(2, 4), ])
More generally, to write it in terms of a vector y of length nrow(X), indicating the group to which each row belongs, you can write this as
X_split <- lapply(unique(y), function(i) X[y == i, ])
To sum the results
X_sum <- lapply(X_split, sum)
# [[1]]
# [1] 16
# [[2]]
# [1] 20
(or use sapply if you want the result as a vector)
Another option is not to split in the first place and just sum per y. Here's a possible data.table approach
library(data.table)
as.data.table(X)[, sum(sapply(.SD, sum)), by = y]
# y V1
# 1: 1 16
# 2: 3 20
Pretty sure operating directly on the matrix is most efficient:
tapply(rowSums(X),y,sum)
# 1 3
# 16 20
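For the loop in the question itself, the issue is that X_split[k] with single brackets returns a list of length one, while X_split[[k]] returns the element. A sketch that also keeps the class labels as names and keeps every piece a matrix, by splitting the row indices rather than the data (the variable names are illustrative):

```r
X <- matrix(c(1,2,3,4,5,6,7,8), nrow = 4, ncol = 2)
y <- c(1, 3, 1, 3)

# Split row indices by class, then subset X; drop = FALSE keeps each
# piece a matrix even when a class has only one row
X_split <- lapply(split(seq_len(nrow(X)), y), function(i) X[i, , drop = FALSE])

for (k in names(X_split)) {
  sum_class <- sum(X_split[[k]])      # [[k]] extracts the matrix itself
  col_sums <- colSums(X_split[[k]])
  cat("class", k, "sum:", sum_class, "column sums:", col_sums, "\n")
}

# Or without an explicit loop, as a named vector
sums <- sapply(X_split, sum)
```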

Easy Way to Get Averages Based on Names in List

Is there any easy way to get the averages of items in a list based on their names? Example dataset:
sampleList <- list("a.1"=c(1,2,3,4,5), "b.1"=c(3,4,1,4,5), "a.2"=c(5,7,2,8,9), "b.2"=c(6,8,9,0,6))
sampleList
$a.1
[1] 1 2 3 4 5
$b.1
[1] 3 4 1 4 5
$a.2
[1] 5 7 2 8 9
$b.2
[1] 6 8 9 0 6
What I am trying to do is get column averages between similarly but not identically named rows, outputting a list with the column averages for the a's and b's. Currently I can do the following:
y <- names(sampleList)
y <- gsub("\\.1", "", y)
y <- gsub("\\.2", "", y)
y <- sort(unique(y))
sampleList <- t(as.matrix(as.data.frame(sampleList)))
t <- list()
for (i in 1:length(y)){
temp <- sampleList[grep(y[i], rownames(sampleList)),]
t[[i]] <- apply(temp, 2, mean)
}
t
[[1]]
[1] 3.0 4.5 2.5 6.0 7.0
[[2]]
[1] 4.5 6.0 5.0 2.0 5.5
As I have a large dataset with a large number of sets of similar names, is there an easier way to go about this?
EDIT: I've broken out the name issue into a separate question. It can be found here
Well, this is shorter. You didn't say exactly how big your actual data is, so I'm not going to make any promises, but the performance of this shouldn't be terrible:
dat <- do.call(rbind,sampleList)
grp <- substr(rownames(dat),1,1)
aggregate(dat,by = list(group = grp),FUN = mean)
(Edited to remove the unnecessary conversion to a data frame, which will incur a significant performance hit, probably.)
If your data is crazy big, or even just medium-big but the number of groups is fairly large so there are a small number of vectors in each group, the standard recommendation would be to investigate data.table once you've rbinded the data into a matrix.
I might do something like this:
# A *named* vector of patterns you want to group by
patterns <- c(start.a="^a",start.b="^b",start.c="^c")
# Find the locations of those patterns in your list
inds <- lapply(patterns, grep, x=names(sampleList))
# Calculate the mean of each list element that matches the pattern
out <- lapply(inds, function(i)
if(l <- length(i)) Reduce("+",sampleList[i])/l else NULL)
# Set the names of the output
names(out) <- names(patterns)
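If the groups should be derived from the names rather than listed by hand, a possible variation is to strip the numeric suffix and split the list on what remains. This is a sketch under the assumption that every name has the form group.number (as in a.1, b.2); grp and avg are illustrative names.

```r
sampleList <- list("a.1"=c(1,2,3,4,5), "b.1"=c(3,4,1,4,5),
                   "a.2"=c(5,7,2,8,9), "b.2"=c(6,8,9,0,6))

# Assumed naming scheme: drop a trailing ".<digits>" to get the group
grp <- sub("\\.[0-9]+$", "", names(sampleList))

# Elementwise mean of the vectors within each group
avg <- lapply(split(sampleList, grp), function(g) Reduce(`+`, g) / length(g))
avg
```

For the sample data this gives the same averages as the loop in the question, as a list named a and b.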
