Find indices of values within tolerance range in R

Say I have a vector x:
x <- c(1, 1, 1.1, 2, 1, 2.1, 2.6)
tol <- 0.4
How do I get the indices of the groups of elements that are 'unique' within the tolerance range (tol), as in the list below? I don't know beforehand how many of these groups there are.
[[1]]
[1] 1 2 3 5
[[2]]
[1] 4 6
[[3]]
[1] 7
thanks

Not 100% reliable, since it uses unique on lists, but you can try:
unique(apply(outer(x,x,function(a,b) abs(a-b)<tol),1,which))
#[[1]]
#[1] 1 2 3 5
#
#[[2]]
#[1] 4 6
#
#[[3]]
#[1] 7
The point @Roland raised in the comments shows that there is some ambiguity in your requirements. For instance, if x <- c(1, 1.3, 1.6), my line gives three groups: 1-2, 2-3 and 1-2-3. This is because, from the point of view of 1, it is similar only to 1.3, but from the point of view of 1.3, it is similar to both 1 and 1.6.
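A minimal sketch of that ambiguity, using the same outer() comparison. Iterating over the rows with lapply (a substitution for apply, so that the result is always a list) is an assumption of this sketch, not part of the original line:

```r
x <- c(1, 1.3, 1.6)
tol <- 0.4

# near[i, j] is TRUE when x[i] and x[j] differ by less than tol
near <- outer(x, x, function(a, b) abs(a - b) < tol)

# One neighbour set per element; lapply always returns a list,
# whereas apply may simplify its result when all sets have equal length
groups <- unique(lapply(seq_along(x), function(i) which(near[i, ])))
groups
```

Each element contributes its own neighbour set, so 1 and 1.6 end up in separate but overlapping groups.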

An alternative using nn2 from RANN to find nearest neighbors within radius for clustering:
library(RANN)
x <- c(1, 1, 1.1, 2, 1, 2.1, 2.6)
tol=0.4
nn <- nn2(x,x,k=length(x),searchtype="radius",radius=tol)
m <- unique(apply(nn$nn.idx,1,sort), MARGIN=2)
sapply(seq_len(ncol(m)), function(i) m[which(m[,i] > 0),i])
##[[1]]
##[1] 1 2 3 5
##
##[[2]]
##[1] 4 6
##
##[[3]]
##[1] 7
x <- c(1, 1.3, 1.6)
nn <- nn2(x,x,k=length(x),searchtype="radius",radius=tol)
m <- unique(apply(nn$nn.idx,1,sort), MARGIN=2)
sapply(seq_len(ncol(m)), function(i) m[which(m[,i] > 0),i])
##[[1]]
##[1] 1 2
##
##[[2]]
##[1] 1 2 3
##
##[[3]]
##[1] 2 3
Notes:
The call to nn2 finds, for each element of x, all neighbors among the elements of x that lie within a radius equal to tol. The result nn$nn.idx is a matrix whose rows contain the indices of the neighbors of each element in x. The matrix is dense and padded with zeroes as needed.
Clustering is performed by sorting each row so that unique rows can be extracted. The output m is a matrix where each column contains the indices in a cluster. Again, this matrix is dense and filled with zeroes as needed.
The resulting list is extracted by subsetting each column to remove the zero entries.
This is likely more efficient for large x because nn2 uses a KD-Tree, but it suffers from the same issue for elements that overlap (with respect to the tolerance) as pointed out by nicola.

Maybe it's a hammer to kill a mosquito, but I thought of univariate density clustering. The dbscan library enables you to do exactly that:
library(dbscan)
groups <- dbscan(as.matrix(x), eps=tol, minPts=1)$cluster
#### [1] 1 1 1 2 1 2 3
You don't need to know the number of groups in advance.
The output gives you the cluster number, but if you prefer, you can take the group means and round them to the closest integer. Once you've got this, you can generate the list, for instance like this:
split(seq_along(x), groups)
#### $`1`
#### [1] 1 2 3 5
#### ...
Edit: Behaviour with overlapping:
This algorithm assigns the same group to all elements that are within the tolerance range of one another (it works by proximity). So you might end up with fewer groups than expected if there is overlap.
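To illustrate the chaining without the dbscan package, here is a base R sketch for the one-dimensional case. chain_groups is a hypothetical helper, not part of dbscan, and it assumes that two values are linked when they differ by at most tol; a new group starts wherever the gap between consecutive sorted values exceeds tol:

```r
# Hypothetical helper: label 1-D values so that values linked through a
# chain of neighbours (gap <= tol) share one group
chain_groups <- function(x, tol) {
  ord <- order(x)
  # TRUE marks the first sorted value and every gap larger than tol
  new_group <- c(TRUE, diff(x[ord]) > tol)
  groups <- integer(length(x))
  groups[ord] <- cumsum(new_group)
  groups
}

tol <- 0.4
chain_groups(c(1, 1, 1.1, 2, 1, 2.1, 2.6), tol)  # three groups
chain_groups(c(1, 1.3, 1.6), tol)                # one group: 1 and 1.6 are chained via 1.3
```

The group labels follow the sorted order of x, so they need not match the numbering a clustering package would assign.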

Here is another attempt with the cut function from base R. We first create the vector of range boundaries, named sq, and then find the elements of x that fall within each range.
sq <- seq(min(x)-tol,max(x)+tol*2,tol*2)
# [1] 0.6 1.4 2.2 3.0
sapply(1:(length(sq)-1), function(i) which(!is.na(cut(x, breaks =c(sq[i], sq[i+1])))))
# [[1]]
# [1] 1 2 3 5
# [[2]]
# [1] 4 6
# [[3]]
# [1] 7
It does not produce any duplicates (no need to use unique, as is the case for @nicola's answer).
It works as follows: in sapply, we first search for elements within the range (0.6, 1.4], then (1.4, 2.2] and finally (2.2, 3.0].
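As a sketch of the same binning in a single call, cut can take the whole vector of breaks and split can then group the indices by bin. This assumes the default right-closed intervals of cut; the object names bins and groups are mine:

```r
x <- c(1, 1, 1.1, 2, 1, 2.1, 2.6)
tol <- 0.4

sq <- seq(min(x) - tol, max(x) + tol * 2, tol * 2)

# Assign every element to an interval (sq[i], sq[i + 1]]
bins <- cut(x, breaks = sq)

# Group the indices of x by interval
groups <- split(seq_along(x), bins)
groups
```

Because the bins sit on a fixed grid, two values closer than tol can still fall on either side of a break and land in different groups.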

Related

Algorithm to extract elements with largest sum from a matrix without repeating rows or columns?

I have a numeric matrix, and I need to extract the set of elements with the largest possible sum, subject to the constraint that no 2 elements can come from the same row or the same column. Is there any efficient algorithm for this, and is there an implementation of that algorithm for R?
For example, if the matrix is (using R's matrix notation):
     [,1] [,2] [,3]
[1,]    7    1    9
[2,]    8    4    2
[3,]    3    6    5
then the unique solution is [1,3], [2,1], [3,2], which extracts the numbers 9, 8, and 6 for a total of 23. However, if the matrix is:
     [,1] [,2] [,3]
[1,]    6    2    1
[2,]    4    9    5
[3,]    8    7    3
then there are 3 equally good solutions: 1,8,9; 3,6,9; and 5,6,7. These all add up to 18.
Additional notes:
If there are multiple equally good solutions, I need to find all of them. (Being able to find additional solutions that are almost as good would be useful as well, but not essential.)
The matrix elements are all non-negative, and many of them will be zero. Each row and column will contain at least 1 element that is nonzero.
The matrix can contain repeated elements.
The matrix need not be square. It might have more rows than columns or vice versa, but the constraint is always the same: no row or column may be used twice.
This problem could also be reformulated as finding a maximal-scoring set of edges between the 2 halves of a bipartite graph without re-using any node.
If it helps, you may assume that there is some small fixed k such that no row or column contains more than k non-zero values.
If anyone is curious, the rows of the matrix represent items to be labeled, the columns represent the labels, and each matrix element represents the "consistency score" for assigning a label to an item. I want to assign each label to exactly one item in the way that maximizes the total consistency.
My suggestion would be to (1) find all the combinations of elements such that, in each combination, no two elements come from the same row or the same column, (2) calculate the sum of the elements in each combination, and (3) find the maximum sum and the corresponding combinations.
Here I only show the square matrix case; the non-square case would follow a similar idea.
(1) Suppose the matrix is n*n. Keeping the row order as 1 to n, all I need to do is find all the permutations of the column indices (1:n). Combining the row indices with one permutation of the column indices gives the positions of the elements in one combination that follows the rule. In this way I can identify the positions of the elements in all the combinations.
library(combinat) ## provides permn()
matrix_data <- matrix(c(6,2,1,4,9,5,8,7,3), byrow=TRUE, nrow = 3)
## example matrix
n_length <- dim(matrix_data)[1]
## row length
all_permutation <- permn(c(1:n_length))
## list of all the permutations of columns index
(2) Find sum of elements in each combination
index_func <- function(x){ ## x will be a permutation from the list all_permutation
  matrix_indexs <- matrix(data = c(c(1:n_length),x),
                          byrow = FALSE, nrow = n_length)
  ## combine row index and column index to construct the positions of the elements in the matrix
  matrix_elements <- matrix_data[matrix_indexs]
  ## extract the elements based on their position
  matrix_combine <- cbind(matrix_indexs,matrix_elements)
  ## combine the above two matrices
  return(matrix_combine)
}
results <- sapply(all_permutation, function(x) sum(index_func(x)[,"matrix_elements"]))
## find the sums of all the combination
(3) Find the maximum sum and corresponding combination
max(results) ## 18 maximum sum is 18
max_index <- which(results==max(results)) ## 1 2 4 there are three combinations
## if you want the complete position index
lapply(all_permutation[max_index], index_func)
## output, first column is row index, second column is column index, last column is the corresponding matrix elements
[[1]]
matrix_elements
[1,] 1 1 6
[2,] 2 2 9
[3,] 3 3 3
[[2]]
matrix_elements
[1,] 1 1 6
[2,] 2 3 5
[3,] 3 2 7
[[3]]
matrix_elements
[1,] 1 3 1
[2,] 2 2 9
[3,] 3 1 8
Here are 2 options:
1) Approach this as an optimization problem: the objective is to maximize the sum of the chosen elements, subject to the constraint that each row and each column is selected at most once.
sample data:
set.seed(0L)
m <- matrix(sample(12), nrow=4)
#m <- matrix(sample(16), nrow=4)
m
[,1] [,2] [,3]
[1,] 9 2 6
[2,] 4 5 11
[3,] 7 3 12
[4,] 1 8 10
code:
library(lpSolve)
nr <- nrow(m)
nc <- ncol(m)
#create the indicator matrix for column indexes
colmat <- data.table::shift(c(rep(1, nr), rep(0, (nc-1)*nr)), seq(0, by=nr, length.out=nc), fill=0)
#create indicator matrix for row indexes
rowmat <- data.table::shift(rep(c(1, rep(0, nr-1)), nc), 0:(nr-1), fill=0)
A <- do.call(rbind, c(colmat, rowmat))
#call lp solver
res <- lp("max",
as.vector(m),
A,
rep("<=", nrow(A)),
rep(1, nrow(A)),
all.bin=TRUE,
num.bin.solns=3)
sample output:
which(matrix(res$solution[1:ncol(A)], nrow=nr)==1L, arr.ind=TRUE)
row col
[1,] 1 1
[2,] 4 2
[3,] 3 3
2) The above leads to a greedy heuristic: pick the largest element, eliminate the chosen row and column, and then repeat on the smaller matrix:
v <- integer(min(nc, nr))
allix <- matrix(0, nrow=length(v), ncol=2)
for (k in seq_along(v)) {
ix <- which(m == max(m), arr.ind=TRUE)
allix[k,] <- ix
v[k] <- m[ix]
m <- m[-ix[1], -ix[2], drop=FALSE]
}
v
#[1] 12 9 8
But this does not yield multiple solutions, so I have not developed it further to extract the indices.
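For completeness, a sketch of how the greedy loop could keep track of the original row and column indices. greedy_assign is a hypothetical helper name; ties are broken by taking the first maximum, and, being a heuristic, it is not guaranteed to find the optimum (for matrix(c(10, 9, 9, 0), 2) it picks 10 + 0 instead of 9 + 9):

```r
# Hypothetical helper: greedy selection that records original indices
greedy_assign <- function(m) {
  rows <- seq_len(nrow(m))  # original row indices still available
  cols <- seq_len(ncol(m))  # original column indices still available
  picks <- matrix(0L, nrow = min(dim(m)), ncol = 2,
                  dimnames = list(NULL, c("row", "col")))
  for (k in seq_len(nrow(picks))) {
    # position of the (first) largest remaining element
    ix <- which(m == max(m), arr.ind = TRUE)[1, ]
    picks[k, ] <- c(rows[ix[1]], cols[ix[2]])
    # drop the chosen row and column from the bookkeeping and the matrix
    rows <- rows[-ix[1]]
    cols <- cols[-ix[2]]
    m <- m[-ix[1], -ix[2], drop = FALSE]
  }
  picks
}

# The sample matrix printed above, typed in by hand
m <- matrix(c(9, 4, 7, 1, 2, 5, 3, 8, 6, 11, 12, 10), nrow = 4)
picks <- greedy_assign(m)
picks
sum(m[picks])
```

On this sample matrix the greedy picks are the same cells as the lp solution shown earlier, but that agreement is a property of this particular example.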

Count within multiple nested lists R

Suppose I have a list of length 2, within which is another list of length 2, within which there is a data frame of numbers coded as either 0, 1 or 2 (bear with me!):
set.seed(42)
l1<-data.frame(sample(0:2, 5, replace = TRUE))
l2<-data.frame(sample(0:2, 5, replace = TRUE))
l<-list(l1,l2)
ll<-list(list(l,l), list(l,l))
I need to count the number of times either 1 or 2 appears within each data frame. I then need to sum these counts across all counts at the level above.
So for ll[[1]][[1]][[1]] the count would be 1, for ll[[1]][[1]][[2]] the count would be 4. Across those two dataframes the sum would be 5.
To give a more plain-English description of the real data I'm working with: the top level is the number of species (in this example, 2 species), the level below that is the year when data was recorded (in this example, data is collected in 2 different years). Below that is a location within which data are recorded. I need to know that, within years, how many times 1 or 2 appears across all locations (within that year).
There is perhaps a better way to describe this but so far it's eluding me. Any help would be appreciated.
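We can use purrr functions. Note that transpose() regroups the inner lists by position (location), so the sums below are taken per location across years: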
library(purrr)
map(ll, function(x) transpose(x) %>% map(~sum(unlist(.x) != 0)))
#[[1]]
#[[1]][[1]]
#[1] 2
#[[1]][[2]]
#[1] 8
#[[2]]
#[[2]][[1]]
#[1] 2
#[[2]][[2]]
#[1] 8
A bit nested, but the solution should work:
lapply(ll,
function(l)
lapply(l,
function(li) sum(unlist(li) %in% 1:2)))
# [[1]]
# [[1]][[1]]
# [1] 5
#
# [[1]][[2]]
# [1] 5
#
#
# [[2]]
# [[2]][[1]]
# [1] 5
#
# [[2]][[2]]
# [1] 5
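As a sketch, the same nested counting can be collapsed into a matrix with sapply, which may be easier to read than a list of lists. The layout (years in rows, species in columns) and the name counts are my choices, not part of the answer above:

```r
set.seed(42)
l1 <- data.frame(sample(0:2, 5, replace = TRUE))
l2 <- data.frame(sample(0:2, 5, replace = TRUE))
l <- list(l1, l2)
ll <- list(list(l, l), list(l, l))

# For each species (outer list) and each year (inner list), count the
# values equal to 1 or 2 across all locations of that year
counts <- sapply(ll, function(species) {
  sapply(species, function(year) sum(unlist(year) %in% 1:2))
})
counts
```

Since every year in this toy example reuses the same list l, all four entries of counts are equal.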

Removing specific element from a list of vectors in R

Suppose I have a list of indices and values.
indx_list <- list(1,2,c(3,4),5,c(6,7,8))
val_list <- list(0.1,0.6,c(0.8,0.9),0.3,c(0.4,0.8,0.5))
I then want to update both lists by removing indices c(4,7) and the corresponding values c(0.9,0.5). This is pretty easily done using lapply and setdiff. For example:
indx_list_new <- lapply(indx_list,function(x) setdiff(x,c(4,7)))
val_list_new <- lapply(val_list,function(x) setdiff(x,c(0.9,0.5)))
However, I don't know beforehand what indices and corresponding values I will be removing.
set.seed(1234)
indx_flag <- sample(seq(8),2)
You can also see that some values are repeated (i.e. 0.8) so using setdiff might actually remove values at the wrong position.
Questions
1) I can still use lapply and setdiff to update indx_list, but how can I update the values in val_list?
2) Is lapply the most efficient solution here? I will have lists with thousands of elements, and each element can be a vector of hundreds of indices/values.
Edit
Each element in the list (highest level) actually has a particular meaning, so I'd like to keep the list structure.
Instead, arrange your data into a 'tidy' representation
df = data.frame(
indx = unlist(indx_list),
val = unlist(val_list),
grp = factor(rep(seq_along(indx_list), lengths(indx_list)))
)
where the operation is more-or-less transparent
base::subset(df, !indx %in% c(4, 7))
indx val grp
1 1 0.1 1
2 2 0.6 2
3 3 0.8 3
5 5 0.3 4
6 6 0.4 5
8 8 0.5 5
Using subset() is similar to df[!df$indx %in% c(4, 7), , drop = FALSE]. (I used factor() to allow for empty groups, i.e., levels with no corresponding values).
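If the list structure has to be kept, as the edit to the question asks, the filtered data frame can be split back by grp. A sketch under that assumption; kept, indx_list_new and val_list_new are names I chose:

```r
indx_list <- list(1, 2, c(3, 4), 5, c(6, 7, 8))
val_list <- list(0.1, 0.6, c(0.8, 0.9), 0.3, c(0.4, 0.8, 0.5))

df <- data.frame(
  indx = unlist(indx_list),
  val = unlist(val_list),
  grp = factor(rep(seq_along(indx_list), lengths(indx_list)))
)

# Drop the unwanted indices, then rebuild one list element per group
kept <- df[!df$indx %in% c(4, 7), , drop = FALSE]
indx_list_new <- unname(split(kept$indx, kept$grp))
val_list_new <- unname(split(kept$val, kept$grp))
val_list_new
```

Because grp is a factor, a list element whose indices are all removed comes back as an empty vector instead of disappearing.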
Here's an attempt using relist and Map to remove the same points:
Map(`[`, val_list, relist(!unlist(indx_list) %in% c(4,7), indx_list))
#[[1]]
#[1] 0.1
#
#[[2]]
#[1] 0.6
#
#[[3]]
#[1] 0.8
#
#[[4]]
#[1] 0.3
#
#[[5]]
#[1] 0.4 0.5
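A variant of the same idea that walks both lists in parallel, so no relist is needed. This is a sketch; drop_idx and keep are names I introduced:

```r
indx_list <- list(1, 2, c(3, 4), 5, c(6, 7, 8))
val_list <- list(0.1, 0.6, c(0.8, 0.9), 0.3, c(0.4, 0.8, 0.5))
drop_idx <- c(4, 7)

# keep[[i]] flags the positions of element i that survive
keep <- lapply(indx_list, function(i) !i %in% drop_idx)

# Apply the same flags to both lists
indx_list_new <- Map(`[`, indx_list, keep)
val_list_new <- Map(`[`, val_list, keep)
val_list_new
```

Values are removed by position, not by value, so repeated values such as 0.8 are handled correctly.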

seq and seq_along, best of both worlds?

If I want to number all elements in two vectors so that vector 1 gets all the odd numbers and vector 2 gets all the even numbers, I can do this, assuming the vectors are of length 10.
seq(1, 10, by=2)
[1] 1 3 5 7 9
seq(2, 11, by=2)
[1] 2 4 6 8 10
but if my vector has only one element I will run into problems:
seq(2)
[1] 1 2
so I use:
seq_along(2)
[1] 1
BUT I can't use by= in seq_along(). How do I get the reliability of seq_along with the functionality of seq()?
This example might clear things.
Imagine I have two lists:
list1 <- list(4)
list2 <- list(4)
list1 must get even names along the element of the list.
list2 must get odd names along the element of the list.
I don't know how long the list elements will be.
seq_along(list1[[1]]) # this knows to give only one name, but I can't make it even
seq(list2[[1]]) # this knows to give 1 name
#and
seq(2, list1[[1]], by=2) # this gives me even names, but too many
Here's a function that adds a 'by' argument to seq_along:
seq_along_by = function(x, by=1L, from = 1L) (seq_along(x) - 1L) * by + from
and some test cases
> seq_along_by(integer(), 2L)
integer(0)
> seq_along_by(1, 2L)
[1] 1
> seq_along_by(1:4, 2L)
[1] 1 3 5 7
> seq_along_by(1:4, 2.2)
[1] 1.0 3.2 5.4 7.6
> seq_along_by(1:4, -2.2)
[1] 1.0 -1.2 -3.4 -5.6
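Applied to the question, a short sketch with the function above: from = 1L yields odd numbers and from = 2L yields even ones, whatever the length of the input. The example vectors v1 and v2 are made up for illustration:

```r
seq_along_by = function(x, by=1L, from = 1L) (seq_along(x) - 1L) * by + from

v1 <- c("a", "b", "c")  # made-up example data
v2 <- 4                 # a single element, the problematic case for seq()

odd_names <- seq_along_by(v1, by = 2L, from = 1L)
even_names <- seq_along_by(v2, by = 2L, from = 2L)
odd_names   # 1 3 5
even_names  # 2
```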
one way i just found is:
y <- seq_along(1:20)
y[y %% 2 == 0 ]
[1] 2 4 6 8 10 12 14 16 18 20
y[ !y %% 2 == 0 ]
[1] 1 3 5 7 9 11 13 15 17 19
But this will only work when my vectors are even. Must be able to do better.
I'm not sure what you are trying to do, but if you want to split odd and even elements in a vector, you can do just that:
x <- 1:19
split(x,x%%2)
$`0`
[1] 2 4 6 8 10 12 14 16 18
$`1`
[1] 1 3 5 7 9 11 13 15 17 19
To extract the odd and even numbered elements, use lapply on this list using seq_along to enumerate the element numbers:
x <- rep(c("odd","even"),times=4)
lapply(split(seq_along(x),seq_along(x)%%2),function(y) "["(x,y))
$`0`
[1] "even" "even" "even" "even"
$`1`
[1] "odd" "odd" "odd" "odd"
This can of course be made into a function:
split_oe <- function(x) lapply(split(seq_along(x),seq_along(x)%%2),function(y) "["(x,y))
split_oe(1:10)
$`0`
[1] 2 4 6 8 10
$`1`
[1] 1 3 5 7 9
> split_oe(2)
$`1`
[1] 2
I'm adding another answer to address what may be your intent of the question rather than the question as you've stated it.
Let's assume you have a couple arrays, A1 and A2, with values, and you want to link an index to those values, so you can say index[n] and get a corresponding value from A1[n/2 + 1] if n is odd and A2[n/2] if n is even.
We would build a new vector, index, like so:
# Sample arrays
A1 <- sample(LETTERS, 5, rep=TRUE)
A2 <- sample(LETTERS, 5, rep=TRUE)
n_Max <- length(c(A1,A2))
index <- integer(n_Max)
index[seq(1,n_Max,by=2)] <- A1
index[seq(2,n_Max,by=2)] <- A2
Now, index[n] returns A1 values when n is odd, and returns A2 values when n is even. This breaks if length(A2) is not equal to or one less than length(A1).
If I understand correctly, what you really want is to get the seq function to return only the odd or even numbers in 1..max or 2..max, respectively. You would write that like so:
seq(1, max, by=2) # Odd numbers
seq(2, max, by=2) # Even numbers
Where max is the top number in your series. The only time this will break is if max is less than 2.
Update 1: There seems to be a bit of discussion about what the OP is requesting. If we assume there are two existing vectors to be numbered, we can obtain the total number of vector items using max <- length(c(vector1, vector2)) to obtain the maximum number being used. Then, the indices would be assigned like so:
vector1 <- seq(1, max, by=2)
vector2 <- seq(2, max, by=2)
And this will work for any set EXCEPT when one vector does not have any elements at all.
Update 2: There is one final approach, which you can take if your vectors do not represent all values between 1 and max. This is how it would work:
vector1 <- seq(1, length(vector1) * 2, by=2)
vector2 <- seq(2, length(vector2) * 2, by=2)
This independently assigns the values of vector1 and vector2 according to their own lengths.
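The edge cases mentioned in this answer (max less than 2, or a vector with no elements) come from seq() refusing an upper bound below the start when by is positive. A sketch that sidesteps this with seq_len(); odd_ids and even_ids are hypothetical helper names:

```r
# Hypothetical helpers: one odd (or even) number per element of x
odd_ids <- function(x) seq_len(length(x)) * 2L - 1L
even_ids <- function(x) seq_len(length(x)) * 2L

odd_ids(c("a", "b", "c"))  # 1 3 5
even_ids(4)                # 2
even_ids(character(0))     # integer(0)

# For comparison, seq() fails when the upper bound is below the start
res <- try(seq(2, 1, by = 2), silent = TRUE)
inherits(res, "try-error")  # TRUE
```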

"replace" function examples

I don't find the help page for the replace function from the base package to be very helpful. Worst part, it has no examples which could help understand how it works.
Could you please explain how to use it? An example or two would be great.
If you look at the function (by typing its name at the console) you will see that it is just a simple functionalized version of the [<- function, which is described at ?"[". [ is a rather basic function in R, so you would be well-advised to look at that page for further details. Especially important is learning that the index argument (the second argument in replace) can be logical, numeric or character valued. Recycling will occur when the second and third arguments have differing lengths.
You should "read" the function call as: "within the first argument, use the second argument as an index for placing the values of the third argument into the first":
> replace( 1:20, 10:15, 1:2)
[1] 1 2 3 4 5 6 7 8 9 1 2 1 2 1 2 16 17 18 19 20
Character indexing for a named vector:
> replace(c(a=1, b=2, c=3, d=4), "b", 10)
a b c d
1 10 3 4
Logical indexing:
> replace(x <- c(a=1, b=2, c=3, d=4), x>2, 10)
a b c d
1 2 10 10
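To make the first point concrete, the body of replace amounts to an indexed assignment followed by returning the modified copy. The sketch below re-creates it under the made-up name my_replace and checks it against the base function:

```r
# Re-creation for illustration; base::replace does essentially this
my_replace <- function(x, list, values) {
  x[list] <- values  # indexed assignment on a local copy
  x                  # return the modified copy; the caller's object is untouched
}

identical(my_replace(1:20, 10:15, 1:2), replace(1:20, 10:15, 1:2))

v <- c(a = 1, b = 2, c = 3, d = 4)
my_replace(v, "b", 10)
v  # unchanged
```

As with replace itself, the result has to be assigned back if the original object should change.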
You can also use logical tests
x <- data.frame(a = c(0,1,2,NA), b = c(0,NA,1,2), c = c(NA, 0, 1, 2))
x
x$a <- replace(x$a, is.na(x$a), 0)
x
x$b <- replace(x$b, x$b==2, 333)
Here are two simple examples
> x <- letters[1:4]
> replace(x, 3, 'Z') #replacing 'c' by 'Z'
[1] "a" "b" "Z" "d"
>
> y <- 1:10
> replace(y, c(4,5), c(20,30)) # replacing 4th and 5th elements by 20 and 30
[1] 1 2 3 20 30 6 7 8 9 10
Be aware that in the examples given above, the third parameter (values) is a constant (e.g. 'Z' or c(20,30)).
Defining the third parameter using values from the data frame itself can lead to confusion.
E.g. with a simple data frame such as this (using dplyr::data_frame):
tmp <- data_frame(a=1:10, b=sample(LETTERS[24:26], 10, replace=T))
This will create something like this:
a b
(int) (chr)
1 1 X
2 2 Y
3 3 Y
4 4 X
5 5 Z
..etc
Now suppose what you wanted to do was to multiply the values in column 'a' by 2, but only where column 'b' is "X". My immediate thought would be something like this:
with(tmp, replace(a, b=="X", a*2))
That will not provide the desired outcome, however. The a*2 will be evaluated as a fixed vector rather than a reference to the 'a' column. The vector 'a*2' will thus be
[1] 2 4 6 8 10 12 14 16 18 20
at the start of the 'replace' operation. Thus, in the first row where 'b' equals "X", the value in 'a' will be replaced by 2. The second time, it will be replaced by 4, etc. It will not be replaced by two times the value of 'a' in that particular row.
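A sketch of one way around this: subset the replacement values with the same condition, so that each selected row receives twice its own value. The vectors a and b below are made-up stand-ins for the columns of tmp:

```r
a <- 1:10
b <- c("X", "Y", "Y", "X", "Z", "X", "Y", "Z", "X", "Y")  # made-up data

# Pitfall: a * 2 is a full-length vector, so its first elements are used
# (R also warns that the lengths do not match)
wrong <- suppressWarnings(replace(a, b == "X", a * 2))

# Fix: index the values with the same condition as the positions
right <- replace(a, b == "X", a[b == "X"] * 2)

wrong
right
```

In wrong, the rows where b is "X" receive 2, 4, 6 and 8 in turn; in right, they receive twice their own value.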
Here's an example where I found the replace( ) function helpful for giving me insight. The problem required a long integer vector be changed into a character vector and with its integers replaced by given character values.
## figuring out replace( )
(test <- c(rep(1,3),rep(2,2),rep(3,1)))
which looks like
[1] 1 1 1 2 2 3
and I want to replace every 1 with an A and 2 with a B and 3 with a C
letts <- c("A","B","C")
so in my own secret little "dirty-verse" I used a loop
for(i in 1:3)
{test <- replace(test,test==i,letts[i])}
which did what I wanted
test
[1] "A" "A" "A" "B" "B" "C"
In the first sentence I purposefully left out that the real objective was to make the big vector of integers a factor vector and assign the integer values (levels) some names (labels).
So another way of doing the replace( ) application here would be
(test <- factor(test,labels=letts))
[1] A A A B B C
Levels: A B C
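Since the integers here are exactly the positions in letts, a sketch of a third option is plain vector indexing, which avoids both the loop and the factor. This assumes the values of test are whole numbers between 1 and length(letts); test_chr is a name I chose:

```r
test <- c(rep(1, 3), rep(2, 2), rep(3, 1))
letts <- c("A", "B", "C")

# Use each integer as a position in letts
test_chr <- letts[test]
test_chr
```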

Resources