Extract max values from a matrix in R (random selection)

Given a matrix, extracting the row names of the columns holding the max values is a common problem.
mat <- matrix(list(20,0,0,80,80,0,
                   20,0,40,0,40,20,
                   40,0,40,20,20,0,
                   0,80,40,20,20,20), ncol=6, byrow=T)
rownames(mat) <- c("A","C","G","T")
apply(mat, 2, which.max)
But here, some columns have two equal max values (in the example matrix, col 3 and 4). By default the script chooses "A" as the row with the max column value in col 3 and 4. I am having trouble writing a script that randomly selects between the tied row names (A and T), both of which hold the max value in col 3 and 4.
Any help with the scripting is appreciated.

The rank function comes in handy:
> apply(mat,2,function(x) which(rank(-unlist(x), ties.method="random") == 1))
[1] 3 4 4 1 1 2
> apply(mat,2,function(x) which(rank(-unlist(x), ties.method="random") == 1))
[1] 3 4 3 1 1 2
> apply(mat,2,function(x) which(rank(-unlist(x), ties.method="random") == 1))
[1] 3 4 4 1 1 4
The ties.method="random" part is crucial for resolving the ties in a random fashion.
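If you want the row names rather than the column indices, you can index rownames(mat) with that result (a small sketch, assuming mat and its row names are defined as in the question):
set.seed(42)  # any seed; only for reproducibility of the random tie-breaking
idx <- apply(mat, 2, function(x) which(rank(-unlist(x), ties.method="random") == 1))
rownames(mat)[idx]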

Consider reading the documentation for which.max, which suggests using which.is.max from nnet. Either borrow that algorithm or use that package.
> library(nnet)
> which.is.max
function (x)
{
    y <- seq_along(x)[x == max(x)]
    if (length(y) > 1L)
        sample(y, 1L)
    else y
}
<bytecode: 0x0000000013fda7c8>
<environment: namespace:nnet>
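Applied column-wise, it breaks ties at random. For example (a sketch using a plain numeric copy of the matrix, since which.is.max expects a numeric vector rather than a list; mat_num is just an illustrative name):
library(nnet)
mat_num <- matrix(c(20,0,0,80,80,0,
                    20,0,40,0,40,20,
                    40,0,40,20,20,0,
                    0,80,40,20,20,20), ncol=6, byrow=T)
rownames(mat_num) <- c("A","C","G","T")
rownames(mat_num)[apply(mat_num, 2, which.is.max)]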

You could sample from those rownames which have values equal to the max value in that column:
mat <- matrix(c(20,0,0,80,80,0,
                20,0,40,0,40,20,
                40,0,40,20,20,0,
                0,80,40,20,20,20), ncol=6, byrow=T)
rownames(mat) <- c("A","C","G","T")
set.seed(123)
apply(mat, 2, function(x) sample(rownames(mat)[which(x == max(x))], 1))
#[1] "G" "T" "G" "A" "A" "C"
set.seed(1234)
apply(mat, 2, function(x) sample(rownames(mat)[which(x == max(x))], 1))
#[1] "G" "T" "G" "A" "A" "T"
p.s. I'm not sure why you construct the matrix data using a list object - matrices are just vectors with a dim attribute.

Related

Creating a CDF by Maximizing Prevalence for Increasing Groupings of Dummy Variables (Columns) in R data.table

I have prevalence data by non-exclusive categories/classifications. (e.g., a story could be 'amazing', 'boring', 'charming', 'dark', or any combination of the four.) Illustrative:
library(data.table)
set.seed(0)
results = as.data.table( expand.grid( rep( list(0:1) , 4 ) ) )
names(results) = c('a', 'b', 'c', 'd')
results$prevalence = runif( n = 16 )
results$prevalence = results$prevalence/sum(results$prevalence)
I'd like to be able to answer the question(s):
(trivial) What is the population coverage that is not in any category (a = b = c = d = 0)?
What is the one category that covers the largest percent of the population?
What are the two categories that cover the largest percent of the population?
... and so on...
Effectively, I'd like to create a quasi-CDF where:
I know that for data in the none category (i.e., a = b = c = d = 0) I cover 10% of the population.
I know that for data in either one or no categories, I can cover 21% of the population by limiting myself to category c.
That is:
results[ ( a == 0 & b == 0 & d == 0 ) & rowSums( results[ , -'prevalence' ] ) <= 1 , sum(prevalence) ]
I know that for data in either two, one, or no categories, I can cover 36% of the population by limiting myself to categories b and c.
That is:
results[ ( a == 0 & d == 0 ) & rowSums( results[ , -'prevalence' ] ) <= 2 , sum(prevalence) ]
I know that for data in either three, two, one, or no categories, I can cover 59% of the population by limiting myself to categories a, b, and c.
That is:
results[ ( d == 0 ) & rowSums( results[ , -'prevalence' ] ) <= 3 , sum(prevalence) ]
And, trivially, I know that for data in either four, three, two, one, or no categories, I can cover 100% of the population by limiting myself to each of the four categories (a, b, c, d).
In this limited example, I just checked all possible categories to find the largest prevalence by grouping of allowable non-zero categories (actually, as you see by my code snippets, I was doing the inverse and finding prevalence by grouping categories that were restricted to zero).
How can I do this in a data.table way so that I don't have to brute force through the many combinations of dummy variables (columns) in my real summary data set?
I have suspicions that it might involve some clever use of .EACHI or lapply that I haven't been able to think of.
Try this:
#' @param dat 'data.frame' (or derivative), with only binary indicator columns
#' @param prev 'numeric', the prevalence indicator to be summed
#' @param n 'integer', number of categories for limiting coverage
#' @return numeric, with attribute "columns" indicating the selected combination of columns
func <- function(dat, prev, n) {
  stopifnot(ncol(dat) >= n)
  if (n == ncol(dat)) {
    out <- sum(prev) # ideally 1
    attr(out, "columns") <- colnames(dat)
  } else {
    com <- t(combn(ncol(dat), ncol(dat) - n))
    vec <- apply(com, 1, function(ind) {
      sum(prev[rowSums(sapply(subset(dat, select = ind), `>`, 0)) < 1])
    })
    out <- max(vec)
    attr(out, "columns") <- colnames(dat)[-com[which.max(vec), ]]
  }
  out
}
In action:
func(results[,1:4], results$prevalence, 0)
# [1] 0.1038405
# attr(,"columns")
# character(0)
func(results[,1:4], results$prevalence, 1)
# [1] 0.2090139
# attr(,"columns")
# [1] "c"
func(results[,1:4], results$prevalence, 2)
# [1] 0.3561435
# attr(,"columns")
# [1] "b" "c"
func(results[,1:4], results$prevalence, 3)
# [1] 0.5859805
# attr(,"columns")
# [1] "a" "b" "c"
func(results[,1:4], results$prevalence, 4)
# [1] 1
# attr(,"columns")
# [1] "a" "b" "c" "d"
That is not data.table-syntax, but it is compatible:
results[, func(.SD, prevalence, 2), .SDcols = a:d]
# [1] 0.3561435
# attr(,"columns")
# [1] "b" "c"
or all numbers at once:
results[, sapply(c(0L, seq_along(.SD)), func, dat = .SD, prev = prevalence), .SDcols = a:d]
# [1] 0.1038405 0.2090139 0.3561435 0.5859805 1.0000000
The point of separating the data into separate "category columns" (dat) and a prevalence object is to simplify the column-selection with combn and to not hard-code column names, counts, or positions in the function.
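If you also want the chosen columns alongside the coverage for every n, a small sketch (not part of the original answer; it just loops over func and collects the "columns" attribute):
res <- lapply(0:4, function(n) func(results[, 1:4], results$prevalence, n))
data.frame(n = 0:4,
           coverage = sapply(res, as.numeric),
           columns = sapply(res, function(r) paste(attr(r, "columns"), collapse = ",")))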

How do you get a pre-defined size/length of a variable in R? [duplicate]

I want to find the function in R that does the same as the function size in Matlab.
In Matlab, if A = [1 2 3 4 5], then size(A) = 1 5.
If A = [1 2 3; 4 5 6], then size(A) = 2 3.
In R, I found that the function dim gives the size of a matrix, but it doesn't apply to vectors.
Please help me solve this problem.
Thanks a lot.
Try dim(A); it's the equivalent of Matlab's size(A) function.
As you noted, dim doesn't work on vectors. You can use this function, which takes any number of vectors, matrices, data.frames or lists and finds their dimension or length:
DIM <- function(...) {
  args <- list(...)
  lapply(args, function(x) {
    if (is.null(dim(x)))
      return(length(x))
    dim(x)
  })
}
# length 10 vector
a <- 1:10
# 3x3 matrix
b <- matrix(1:9,3,3)
# length 2 list
c <- list( 1:2 , 1:100 )
# 1 row, 2 column data.frame
d <- data.frame( a =1 , b = 2 )
DIM(a,b,c,d)
#[[1]]
#[1] 10
#[[2]]
#[1] 3 3
#[[3]]
#[1] 2
#[[4]]
#[1] 1 2
Vectors are dimensionless in R; they only have a length.
If you want to treat a vector as a dimensioned object (and work on it as such), use t() (transpose), which in essence turns it into a 1 x n matrix.
dim(1:10) # NULL
length(1:10) # 10
dim(t(1:10)) # 1 10
Belated answer, but note that NROW and NCOL give the dimensions of both vectors and matrices/data.frames. So, for example:
> a<-c(1,2,3,4)
> NROW(a);NCOL(a)
[1] 4
[1] 1
If you don't know whether the variable A is a matrix or not, then try:
if (is.matrix(A) || is.data.frame(A)) {
  size <- dim(A)
} else {
  size <- length(A)
}
size
(Testing with is.matrix()/is.data.frame() is safer than comparing class(A), which can return more than one class, e.g. c("matrix", "array") in current R.)
This should work for your case.
You can use the following command:
c(NROW(w), NCOL(w))
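If you want something closer to a drop-in replacement for Matlab's size, a minimal sketch (the helper name size is only illustrative):
size <- function(x) c(NROW(x), NCOL(x))
size(1:5)                    # 5 1 (note: Matlab reports 1 5 for a row vector)
size(matrix(1:6, nrow = 2))  # 2 3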

How to concisely deal with subsets when their lengths become zero?

To exclude elements from a vector x,
x <- c(1, 4, 3, 2)
we can subtract a vector of positions:
excl <- c(2, 3)
x[-excl]
# [1] 1 2
This also works dynamically,
(excl <- which(x[-which.max(x)] > quantile(x, .25)))
# [1] 2 3
x[-excl]
# [1] 1 2
until excl is of length zero:
excl.nolength <- which(x[-which.max(x)] > quantile(x, .95))
length(excl.nolength)
# [1] 0
x[-excl.nolength]
# integer(0)
I could kind of reformulate that, but I have many objects to which excl is applied, say:
letters[1:4][-excl.nolength]
# character(0)
I know I could use setdiff, but that's rather long and hard to read:
x[setdiff(seq(x), excl.nolength)]
# [1] 1 4 3 2
letters[1:4][setdiff(seq(letters[1:4]), excl.nolength)]
# [1] "a" "b" "c" "d"
Now, I could exploit the fact that nothing is excluded if the negative index refers to a position beyond the number of elements:
length(x)
# [1] 4
x[-5]
# [1] 1 4 3 2
To generalize that I should probably use .Machine$integer.max:
tmp <- which(x[-which.max(x)] > quantile(x, .95))
excl <- if (!length(tmp) == 0) tmp else .Machine$integer.max
x[-excl]
# [1] 1 4 3 2
Wrapped into a function,
e <- function(x) if (!length(x) == 0) x else .Machine$integer.max
that's quite handy and clear:
x[-e(excl)]
# [1] 1 2
x[-e(excl.nolength)]
# [1] 1 4 3 2
letters[1:4][-e(excl.nolength)]
# [1] "a" "b" "c" "d"
But it seems a little fishy to me...
Is there a better equally concise way to deal with a subset of length zero in base R?
Edit
excl is the dynamic result of an earlier function call (as shown with which above) and may or may not have length zero. If length(excl) == 0, nothing should be excluded. Subsequent lines of code, e.g. x[-excl], should ideally not have to be changed, or only as little as possible.
You can overwrite [ with your own function.
"[" <- function(x,y) {if(length(y)==0) x else .Primitive("[")(x,y)}
x <- c(1, 4, 3, 2)
excl <- c(2, 3)
x[-excl]
#[1] 1 2
excl <- integer()
x[-excl]
#[1] 1 4 3 2
rm("[") #Go back to normal mode
I would argue this is somewhat opinion-based.
For example, I find:
x <- x[-if(length(excl <- which(x[-which.max(x)] > quantile(x, .95))) == 0) .Machine$integer.max else excl]
very unreadable, but some people like one-liners. Reading package code, you'll often find that this is instead split up, as in one of the many suggestions you gave:
excl <- which(x[-which.max(x)] > quantile(x, .95))
if(length(excl) != 0)
  x <- x[-excl]
Alternatively, you could avoid which and simply use the logical vector for subsetting; most would likely consider this cleaner:
x <- x[!x[-which.max(x)] > quantile(x, .95)]
This avoids the zero-length index problem, at the cost of some loss of efficiency.
As a side note, the example used above and in the question seems somewhat off. First, which.max only returns the first index equal to the max value, and in addition the indices will be offset by every value removed. More likely the intended example would be
x <- x[!(x > quantile(x, .95))[-which(x == max(x))]]
How about this?
a <- letters[1:3]
excl1 <- c(1,3)
excl2 <- c()
a[!(seq_along(a) %in% excl1)]
a[!(seq_along(a) %in% excl2)]
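Wrapped into a small helper, this handles the zero-length case without any special-casing (a sketch; the name drop_idx is just illustrative):
drop_idx <- function(x, excl) x[!(seq_along(x) %in% excl)]
drop_idx(c(1, 4, 3, 2), c(2, 3))    # 1 2
drop_idx(c(1, 4, 3, 2), integer())  # 1 4 3 2
drop_idx(letters[1:4], integer())   # "a" "b" "c" "d"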

Find closest value with condition

I have a function that finds the nearest values for each row in a matrix and reports a list with the indices of the nearest rows. However, I want it to exclude values that are +1 in the first AND +1 in the second column away from a particular set of values (-1 in the first and -1 in the second column should also be removed). Likewise, +1 in the first column and -1 in the second column with respect to the values of interest should be avoided.
As an example, if I want things closest to c(2, 1), it should accept c(3, 1) or c(2, 2) or c(1, 1), but NOT c(3, 2) and not c(1, 0).
Basically, for an output to be reported, either column 1 or column 2 should be a value of 1 away from the row of interest, but not both.
The input looks like this:
x
v1 v2
[1,] 3 1
[2,] 2 1
[3,] 3 2
[4,] 1 2
[5,] 8 5
myfunc(x)
The output looks like this. Notice that the rows reported as closest to row 2 ($V2 in the output) are rows 1, 3 and 4. The answer should only be 1, though.
$V1
[1] 2 3
$V2
[1] 1 3 4
$V3
[1] 1 2
$V4
[1] 2
$V5
integer(0)
Here is myfunc
myfunc = function(t){
  d1 <- dist(t[,1])
  d2 <- dist(t[,2])
  dF <- as.matrix(d1) <= 1 & as.matrix(d2) <= 1
  diag(dF) <- NA
  colnames(dF) <- NULL
  dF2 <- lapply(as.data.frame(dF), which)
  return(dF2)
}
Basically, the rows that you want to find should differ from your reference element by +1 or -1 in one column and be identical in the other column. That means that the sum over the absolute values of the differences is exactly one. For your example c(2, 1), this works as follows:
c(3, 1): difference is c(1, 0), thus sum(abs(c(1, 0))) = 1 + 0 = 1
c(1, 1): difference is c(-1, 0), thus sum(abs(c(-1, 0))) = 1 + 0 = 1
etc.
The following function checks exactly this:
myfunc <- function(x) {
  do_row <- function(r) {
    r_mat <- matrix(rep(r, length = length(x)), ncol = ncol(x), byrow = TRUE)
    abs_dist <- abs(r_mat - x)
    return(which(rowSums(abs_dist) == 1))
  }
  return(apply(x, 1, do_row))
}
do_row() does the job for a single row, and then apply() is used to do this with each row. For your example, I get:
myfunc(x)
## [[1]]
## [1] 2 3
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] 1
##
## [[4]]
## integer(0)
##
## [[5]]
## integer(0)
Using sweep(), one can write a shorter function:
myfunc2 <- function(x) {
  apply(x, 1, function(r) which(rowSums(abs(sweep(x, 2, r))) == 1))
}
But this seems harder to understand, and it turns out to be slower by about a factor of two for your matrix x. (I have also tried it with a large matrix, and there the efficiency seems about the same.)
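For completeness, a reproducible construction of the example input x (reconstructed from the printout in the question), which can be passed to either version:
x <- matrix(c(3, 1,
              2, 1,
              3, 2,
              1, 2,
              8, 5), ncol = 2, byrow = TRUE,
            dimnames = list(NULL, c("v1", "v2")))
myfunc(x)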

R : randomly divide 10 data values into a group of 5 and a group of 5

I would like to find all the possibilities to divide 10 data values into 2 groups of 5.
If I'm right, there are 252 possibilities:
choose(10, 5)
# [1] 252
How can I do it with R?
Thanks!
Here's one possibility:
a <- letters[1:10]
split1 <- combn(a, 5)
split2 <- apply(split1, 2, function(x) a[!a %in% x])
Pick a random one:
set.seed(1)
rnd <- sample(1:ncol(split1), size=1)
split1[, rnd]; split2[, rnd]
# [1] "a" "c" "d" "g" "i"
# [1] "b" "e" "f" "h" "j"
So I will explain in detail what I have to do:
I have 2 sets of data:
cellular_wt = c(1.1656,0.9577,1.3655,0.9016,0.9336)
cellular_mutant = c(2.8896,5.7018,3.595,1.6998,1.8893)
secreted_wt = c(7.8491,6.1546,5.1972,6.1607,5.928)
secreted_mutant = c(4.6801,3.2418,3.6651,3.0678,2.3221)
mean_cellular_wt <- mean(cellular_wt)
mean_cellular_mutant <- mean(cellular_mutant)
mean_secreted_wt <- mean(secreted_wt)
mean_secreted_mutant <- mean(secreted_mutant)
mean_secreted_wt / mean_cellular_wt                  # 5.877085
mean_secreted_mutant / mean_cellular_mutant          # 1.076156
mean_ratio <- (mean_secreted_wt / mean_cellular_wt) / (mean_secreted_mutant / mean_cellular_mutant)  # 5.46
I want to run a randomization test on these data to test the significance of the mean ratio.
To do so, I would like to randomly divide these 10 values (cellular_wt + cellular_mutant, and correspondingly secreted_wt + secreted_mutant) into 2 groups of 5 (as in the initial data sets), and calculate the mean ratio each time.
In this way, I can see whether the observed ratio of 5.46 is unusually large by comparing it to the 252 ratios that could have been obtained by random assignment alone. Does that make sense?
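A sketch of that randomization test, under the assumption that each of the 10 samples keeps its cellular and secreted measurements paired and only the wt/mutant labels are permuted (object names are illustrative; it uses the vectors and mean_ratio defined above):
cellular <- c(cellular_wt, cellular_mutant)   # 10 cellular values
secreted <- c(secreted_wt, secreted_mutant)   # the 10 matching secreted values
combos <- combn(10, 5)                        # all 252 ways to label 5 of the 10 samples "wt"
ratios <- apply(combos, 2, function(wt) {
  mut <- setdiff(1:10, wt)
  (mean(secreted[wt]) / mean(cellular[wt])) / (mean(secreted[mut]) / mean(cellular[mut]))
})
mean(ratios >= mean_ratio)   # proportion of assignments at least as extreme as the observed 5.46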
