Selecting rows in R based on threshold

In R, I have a matrix with N columns of all numbers. (Each row has a name, but that's irrelevant.) I'd like to return the rows where at least one column has a value greater than some threshold. Right now, I'm doing something like this:
THRESHOLD <- 10
# my_matrix[,1] can be ignored
my_matrix <- subset (my_matrix, my_matrix[,1] > THRESHOLD | my_matrix[,2] > THRESHOLD | ... )
It seems odd to have to manually list each column. Also, if the number of input columns changes, I have to rewrite this.
There has to be a better way, but I can't figure out what I should be looking for.
I can convert my matrix to a data frame, if that is easier... Any suggestions would be appreciated!

Find the rows with any value greater than the threshold using apply(), then use the resulting logical vector to extract those rows from the matrix:
mat[apply( mat2, 1, function( x ) any( x > threshold ) ), ]
EDIT:
Breakdown of the above one-liner.
# create sample data by simulating samples from standard normal distribution
set.seed(1L) # set random number generator for consistent data simulation
mat <- matrix( data = c(letters[1:3], as.character( rnorm(9, mean = 0, sd = 1))),
               byrow = FALSE,
               nrow = 3,
               ncol = 4 ) # create simulated data matrix
threshold <- 0 # set threshold
mat2 <- apply( mat[, 2:ncol(mat) ], 2, as.numeric ) # extract columns 2 to end and convert to numeric
# Get the logical indices (true or false) if any row has values greater than 0 (threshold)
row_indices <- apply( mat2, 1, function( x ) any( x > threshold ) )
mat[row_indices, ] # extract the matrix rows that have TRUE in row_indices
# [,1] [,2] [,3] [,4]
# [1,] "a" "-0.626453810742332" "1.59528080213779" "0.487429052428485"
# [2,] "b" "0.183643324222082" "0.329507771815361" "0.738324705129217"
# [3,] "c" "-0.835628612410047" "-0.820468384118015" "0.575781351653492"
Note:
In your question you mentioned that the first column is character and the rest are numbers. A matrix can hold only one data type, so given this information I assume that your data matrix is stored as character. You can check this with typeof(mat) or is.character(mat). If it is a character matrix, extract columns 2 to the end and convert them to numeric, then use the result in the apply() loop to check for any values greater than the threshold.
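A minimal sketch applying this to the original my_matrix, assuming (as in the note above) that the first column should be skipped and that THRESHOLD is defined as in the question; num_part is just an illustrative name:
num_part <- apply( my_matrix[, 2:ncol(my_matrix)], 2, as.numeric ) # convert columns 2 to end to numeric
my_matrix[ apply( num_part, 1, function( x ) any( x > THRESHOLD ) ), ] # keep rows with any value above THRESHOLD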

Related

counting N occurrences within a ceiling range of a matrix by-row

I would like to tally each time a value lies within a given range in a matrix by-row, and then sum these logical outcomes to derive a "measure of consistency" for each row.
Reproducible example:
m1 <- matrix(c(1,2,1,6,3,7,4,2,6,8,11,15), ncol=4, byrow = TRUE)
# expected outcome, given a range of +/-1 either side
exp.outcome <- matrix(c(TRUE,TRUE,TRUE,FALSE,
                        TRUE,FALSE,TRUE,TRUE,
                        FALSE,FALSE,FALSE,FALSE),
                      ncol=4, byrow=TRUE)
Above I've indicated the expected outcome, in the case where each value lies within a +/- 1 range of any other value within that row.
Within the first row of m1 the first value (1) is within +/-1 of any other value in that row hence equals TRUE, and so on.
By contrast, none of the values in row 3 of m1 are within +/- 1 of any other value and hence each is assigned FALSE.
Any pointers would be much appreciated!
Update:
Thanks to the help provided I can now count the unique pairs of values which meet the ceiling criteria for any arbitrarily large matrix (using the binomial coefficient, k draws from n, without replacement).
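For example, a minimal sketch of such a pair count, assuming m1 and the +/- 1 range from above; combn() enumerates the choose(n, 2) unique pairs in each row:
apply( m1, 1, function(x) sum( combn(x, 2, FUN = function(p) abs(p[1] - p[2]) <= 1) ) )
# [1] 3 2 0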
Before progressing with the answer I just wanted to clarify that in your question you have said:
Within the first row of m1 the first value (1) is within +/-1 of any
other value in that row hence equals TRUE, and so on.
However,
> m1[1,4]
[1] 6
6 is not within +/- 1 of 1, and indeed your expected outcome correctly has FALSE at that position.
Solution
This solution should get you to the desired results:
t(apply(
    X = m1,
    # Take each row from the matrix
    MARGIN = 1,
    FUN = function(x) {
        sapply(
            X = x,
            # Now go through each element of that row
            FUN = function(y) {
                # Your conditions
                y %in% c(x - 1) | y %in% c(x + 1)
            }
        )
    }
))
Results
[,1] [,2] [,3] [,4]
[1,] TRUE TRUE TRUE FALSE
[2,] TRUE FALSE TRUE TRUE
[3,] FALSE FALSE FALSE FALSE
Check
For results stored as res.
> identical(res, exp.outcome)
[1] TRUE
Here is a kind of neat base R method that uses an array:
The first two lines are setup: they store a three-dimensional array of acceptable values and a matrix that will hold the desired output. The structure of the array is as follows: the columns correspond to the acceptable values for the matrix element in the same column, and the third dimension corresponds to the rows of the matrix.
Pre-allocation in this way should cut down on repeated computations.
# construct array of all +1/-1 values
valueArray <- sapply(1:nrow(m1), function(i) rbind(m1[i,]-1, m1[i,], m1[i,]+1),
                     simplify="array")
# get logical matrix of correct dimensions
exp.outcome <- matrix(TRUE, nrow(m1), ncol(m1))
# get desired values
for(i in 1:nrow(m1)) {
  exp.outcome[i, ] <- sapply(1:ncol(m1), function(j) m1[i, j] %in% c(valueArray[, -j, i]))
}
Which returns
exp.outcome
[,1] [,2] [,3] [,4]
[1,] TRUE TRUE TRUE FALSE
[2,] TRUE FALSE TRUE TRUE
[3,] FALSE FALSE FALSE FALSE

R: Expand a vector of matrix column numbers into a matrix with those columns filled

I have two vectors in R and want to generate a new matrix based on them.
a=c(1,2,1,2,3) # a[1] is 1: thus row 1, column 1 should be equal to...
b=c(10,20,30,40,50) # ...b[1], or 10.
I want to produce matrix 'v' BUT without my 'for' loop through columns of v and my multiplication:
v = as.data.frame(matrix(0,nrow=length(a),ncol=length(unique(a))))
for(i in 1:ncol(v)) v[[i]][a==i] <- 1 # looping through columns of 'v'
v <- v*b
I am sure there is a fast/elegant way to do it in R, at least for expanding 'a' into the earlier version of 'v' (before its multiplication by 'b').
Thanks a lot!
This is one way that sparse matrices can be defined.
Matrix::sparseMatrix(i = seq_along(a), j = a, x = b)
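As a quick illustrative check, assuming the small vectors a and b from the question, converting the sparse result back to a dense matrix shows the expected layout (v_sparse is just an illustrative name):
library(Matrix)
v_sparse <- sparseMatrix(i = seq_along(a), j = a, x = b) # row from position, column from 'a', value from 'b'
as.matrix(v_sparse)
#      [,1] [,2] [,3]
# [1,]   10    0    0
# [2,]    0   20    0
# [3,]   30    0    0
# [4,]    0   40    0
# [5,]    0    0   50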
# Setup the problem:
set.seed(4242)
a <- sample(1:100, 1000000, replace = TRUE)
b <- sample(1:500, length(a), replace = TRUE)
# Start the timer
start.time <- proc.time()[3]
# Actual code
# We use a matrix instead of a data.frame
# The number of columns matches the largest column index in vector "a"
v <- matrix(0,nrow=length(a), ncol= max(a))
v[cbind(seq_along(a), a)] <- b
# Show elapsed time
stop.time <- proc.time()[3]
cat("elapsed time is: ", stop.time - start.time, "seconds.\n")
# For a million rows and a hundred columns, my prehistoric
# ... laptop says: elapsed time is: 2.597 seconds.
# these checks take much longer to run than the function itself
# Make sure the modified column in each row matches vector "a"
stopifnot(TRUE == all.equal(a, apply(v!=0, 1, which)))
# Make sure the modified value in each row equals vector "b"
stopifnot(TRUE == all.equal(rowSums(v), b))

How can I efficiently generate a dataframe of simulated values?

I'm trying to generate a data frame of simulated values based on existing distribution parameters. My main data frame contains the mean and standard deviation for each observation, like so:
example.data <- data.frame(country=c("a", "b", "c"),
                           score_mean=c(0.5, 0.4, 0.6),
                           score_sd=c(0.1, 0.1, 0.2))
# country score_mean score_sd
# 1 a 0.5 0.1
# 2 b 0.4 0.1
# 3 c 0.6 0.2
I can use sapply() and a custom function to use the score_mean and score_sd parameters to randomly draw from a normal distribution:
score.simulate <- function(score.mean, score.sd) {
  return(mean(rnorm(100, mean=score.mean, sd=score.sd)))
}
simulated.scores <- sapply(example.data$score_mean,
                           FUN=score.simulate,
                           score.sd=example.data$score_sd)
# [1] 0.4936432 0.3753853 0.6267956
This will generate one round (or column) of simulated values. However, I'd like to generate a lot of columns (like 100 or 1,000). The only way I've found to do this is to wrap my sapply() function inside a generic function inside lapply() and then convert the resulting list into a data frame with ldply() in plyr:
results.list <- lapply(1:5, FUN=function(x) sapply(example.data$score_mean, FUN=score.simulate, score.sd=example.data$score_sd))
library(plyr)
simulated.scores <- as.data.frame(t(ldply(results.list)))
# V1 V2 V3 V4 V5
# V1 0.5047807 0.4902808 0.4857900 0.5008957 0.4993375
# V2 0.3996402 0.4128029 0.3875678 0.4044486 0.3982045
# V3 0.6017469 0.6055446 0.6058766 0.5894703 0.5960403
This works, but (1) it seems really convoluted, especially with the as.data.frame(t(ldply(lapply(... FUN=function(x) sapply ...)))) approach, (2) it is really slow when using large numbers of iterations or bigger data—my actual dataset has 3,000 rows, and running 1,000 iterations takes 1–2 minutes.
Is there a more efficient way to create a data frame of simulated values like this?
The quickest way I can think of is to take advantage of the vectorisation built into rnorm. Both the mean and sd arguments are vectorised; however, you can only supply a single integer for the number of draws. If you supply a vector to the mean and sd arguments, R will cycle through them until it has completed the required number of draws. Therefore, just make the argument n to rnorm a multiple of the length of your mean vector. The multiplier will be the number of replicates for each row of your data.frame. In the function below this is n.
I can't think of a faster way than using base::rnorm on its own.
Worked example
# example data
df <- data.frame(country=c("a", "b", "c"),
                 mean=c(1, 10, 100),
                 sd=c(1, 2, 10))
# function which returns a matrix, and takes column vectors as arguments for mean and sd
normv <- function( n , mean , sd ){
    out <- rnorm( n*length(mean) , mean = mean , sd = sd )
    return( matrix( out , , ncol = n , byrow = FALSE ) )
}
#reproducible result (note order of magnitude of rows and input sample data)
set.seed(1)
normv( 5 , df$mean , df$sd )
# [,1] [,2] [,3] [,4] [,5]
#[1,] 0.3735462 2.595281 1.487429 0.6946116 0.3787594
#[2,] 10.3672866 10.659016 11.476649 13.0235623 5.5706002
#[3,] 91.6437139 91.795316 105.757814 103.8984324 111.2493092
This can be done very quickly if you remember that rnorm(1, mean, sd) is the same as rnorm(1)*sd + mean so using your data frame df, you can generate sim simulations of your obs observations like:
obs = nrow(df)
sim = 1000
mat = data.frame(matrix(rnorm(obs*sim), obs, sim) * df$sd + df$mean)
You can check that this has the desired means by using rowMeans(mat) and check the standard deviation for, say, row 1 as sd(mat[1,]).
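A minimal sketch of those checks, assuming df and mat as defined above (since mat is a data.frame, apply() is used here to get per-row standard deviations):
rowMeans(mat)       # should be close to df$mean
apply(mat, 1, sd)   # should be close to df$sd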

which.min() with random sampling

I have a vector of variables:
x<-runif(1000,0,1)
I would like to select the element with the lowest value:
x[which.min(x)]
By default which.min(x) will return the first element that satisfies this condition, however, it can happen that there are multiple elements that are equally low.
Is there a way to sample from those values and return only one?
Use which to find the indices of all those elements which are equal to the minimum of the vector and randomly sample one (unless the minimum value appears once - then we can just return it).
# Find indices of minima of vector
ids <- which( x == min(x) )
# If the minimum value appears multiple times, pick one index at random; otherwise just return its position in the vector
if( length( ids ) > 1 )
    ids <- sample( ids , 1 )
# You can use 'ids' to subset as per usual
x[ids]
Another similar approach, one which does not need if, is to sample from seq_along() of the matched values.
Here are two examples. x1 has multiple min values. x2 has just one.
## Make some sample data
set.seed(1)
x1 <- x2 <- sample(100, 1000, replace = TRUE)
x2[x2 == 1][-1] <- 2 ## Make x2 have just one min value
## Identify the minimum values, and extract just one of them.
y <- which(x1 == min(x1))
y[sample(seq_along(y), 1)]
# [1] 721
z <- which(x2 == min(x2))
z[sample(seq_along(z), 1)]
# [1] 463

How to write linearly dependent column in a matrix in terms of linearly independent columns?

I have a large mxn matrix, and I have identified the linearly dependent columns. However, I want to know if there's a way in R to write the linearly dependent columns in terms of the linearly independent ones. Since it's a large matrix, it's not possible to do based on inspection.
Here's a toy example of the type of matrix I have.
> mat <- matrix(c(1,1,0,1,0,1,1,0,0,1,1,0,1,1,0,1,0,1,0,1), byrow=TRUE, ncol=5, nrow=4)
> mat
[,1] [,2] [,3] [,4] [,5]
[1,] 1 1 0 1 0
[2,] 1 1 0 0 1
[3,] 1 0 1 1 0
[4,] 1 0 1 0 1
Here it's obvious that x3 = x1-x2, x5=x1-x4. I want to know if there's an automated way to get that for a larger matrix.
Thanks!
I'm sure there is a better way, but I felt like playing around with this. I basically check at the beginning whether the input matrix is of full column rank, to avoid unnecessary computation when it is. After that I start with the first two columns and check whether that submatrix has full column rank; if it does, I check the first three columns, and so on. Once we find a submatrix that isn't of full column rank, I regress its last column on the previous columns, which tells us how to construct the linear combination of the earlier columns that gives the last column.
My function isn't very clean right now and could do some additional checking but at least it's a start.
mat <- matrix(c(1,1,0,1,0,1,1,0,0,1,1,0,1,1,0,1,0,1,0,1), byrow=TRUE, ncol=5, nrow=4)
linfinder <- function(mat){
    # If the matrix is full rank then we're done
    if(qr(mat)$rank == ncol(mat)){
        print("Matrix is of full rank")
        return(invisible(seq(ncol(mat))))
    }
    m <- ncol(mat)
    # cols keeps track of which columns are linearly independent
    cols <- 1
    for(i in seq(2, m)){
        ids <- c(cols, i)
        mymat <- mat[, ids]
        if(qr(mymat)$rank != length(ids)){
            # Regress the column of interest on the previous
            # columns to figure out the relationship
            o <- lm(mat[,i] ~ mat[,cols] + 0)
            # Construct the output message
            start <- paste0("Column_", i, " = ")
            # Which coefs are nonzero
            nz <- !(abs(coef(o)) <= .Machine$double.eps^0.5)
            tmp <- paste("Column", cols[nz], sep = "_")
            vals <- paste(coef(o)[nz], tmp, sep = "*", collapse = " + ")
            message <- paste0(start, vals)
            print(message)
        }else{
            # If the matrix subset was of full rank
            # then the newest column is linearly independent
            # so add it to the cols list
            cols <- ids
        }
    }
    return(invisible(cols))
}
linfinder(mat)
which gives
> linfinder(mat)
[1] "Column_3 = 1*Column_1 + -1*Column_2"
[1] "Column_5 = 1*Column_1 + -1*Column_4"
