Find indices of 5 closest samples in distance matrix - r

I have a distance matrix dMat and want to find the 5 nearest samples to the first one. What function can I use in R? I know how to find the closest sample (cf. 3rd line of code), but can't figure out how to get the other 4 samples.
The code:
Mat <- replicate(10, rnorm(10))
dMat <- as.matrix(dist(Mat))
which(dMat[,1]==min(dMat[,1]))
The 3rd line of code finds the index of the closest sample to the first sample.
Thanks for any help!
Best,
Chega

You can use order to do this:
head(order(dMat[-1,1]),5)+1
[1] 10 3 4 8 6
Note that I removed the first one, as you presumably don't want to include the fact that your reference point is 0 distance away from itself.
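For a run that others can repeat, here is a minimal sketch of the same order() idea with a fixed seed. The seed value and the helper name nearest_k are assumptions made for illustration, not part of the original question.

```r
set.seed(42)                          # arbitrary seed, only for repeatability
Mat  <- replicate(10, rnorm(10))
dMat <- as.matrix(dist(Mat))

# Hypothetical helper: indices of the k samples nearest to sample i,
# excluding sample i itself (its distance to itself is 0).
nearest_k <- function(d, i, k = 5) {
  others <- setdiff(seq_len(nrow(d)), i)
  others[order(d[others, i])][seq_len(min(k, length(others)))]
}

nn <- nearest_k(dMat, 1)
nn             # indices of the 5 nearest samples
dMat[nn, 1]    # their distances to sample 1, in increasing order
```

Indexing through `others` avoids the manual `+ 1` offset and works for any reference sample, not just the first.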

Alternative using sort:
sort(dMat[,1], index.return = TRUE)$ix[1:6]
It would be nice to add a set.seed(.) when using random numbers in matrix so that we could show the results are identical. I will skip the results here.
Edit (general solution): The solutions above assume that the first element of the column is the smallest. That holds for a real distance matrix, whose diagonal is 0, but not for a column of raw values. For that general case, this gives the 5 values closest to the first element of the column:
> sort(abs(dMat[-1,1] - dMat[1,1]), index.return=TRUE)$ix[1:5] + 1
Example:
> dMat <- matrix(c(70,4,2,1,6,80,90,100,3), ncol=1)
# James' solution
> head(order(dMat[-1,1]),5) + 1
[1] 4 3 9 2 5 # values are 1,2,3,4,6 (wrong)
# old sort solution
> sort(dMat[,1], index.return = TRUE)$ix[1:6]
[1] 4 3 9 2 5 1 # values are 1,2,3,4,6,70 (wrong)
# Correct solution
> sort(abs(dMat[-1,1] - dMat[1,1]), index.return=TRUE)$ix[1:5] + 1
[1] 6 7 8 5 2 # values are 80,90,100,6,4 (right)

Related

How to calculate most frequent occurring terms/words in a document collection/corpus using R?

First I create a document term matrix like below
dtm <- DocumentTermMatrix(docs)
Then I take the sum of the occurrences of each word as below
totalsums <- colSums(as.matrix(dtm))
My totalsums (R says type 'double') looks like below for first 7 elements.
aaab aabb aabc aacc abbb abbc abcc ...
9 2 10 4 7 3 12 ...
I managed to sort this with the following command
sorted.sums <- sort(totalsums, decreasing=T)
Now I want to extract the first 4 terms/words with the highest sums which are greater than value 5.
I could get the first 4 highest with sorted.sums[1:4] but how can I set a threshold value?
I managed to do this with the order function as shown below, but is there a way to do this other than with the sort function, and without using the findFreqTerms function?
ord.totalsums <- order(totalsums)
findFreqTerms(dtm, lowfreq=5)
Appreciate your thoughts on this.
You can use
sorted.sums[sorted.sums > 5][1:4]
But if you have at least 4 values that are greater than 5, sorted.sums[1:4] alone should work as well.
To get the words you can use names.
names(sorted.sums[sorted.sums > 5][1:4])
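As a small sketch using the seven sample values shown in the question: head() may be a safer choice than [1:4], because indexing past the end of the filtered vector pads the result with NA, while head() just returns what is there.

```r
# The seven example sums from the question, as a named numeric vector
totalsums <- c(aaab = 9, aabb = 2, aabc = 10, aacc = 4,
               abbb = 7, abbc = 3, abcc = 12)

sorted.sums <- sort(totalsums, decreasing = TRUE)

# Keep sums above the threshold, then take at most the first 4
top <- head(sorted.sums[sorted.sums > 5], 4)
names(top)
# [1] "abcc" "aabc" "aaab" "abbb"

# With a stricter threshold only two terms qualify; head() returns
# just those two, whereas [1:4] would add two NA entries.
few <- head(sorted.sums[sorted.sums > 9], 4)
names(few)
# [1] "abcc" "aabc"
```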

Looping through items on a list in R

This may be a simple question, but I'm fairly new to R.
What I want to do is perform some kind of addition on the indexes of a list, but once the index passes the maximum value it should go back to the first value in that list and start over from there.
for example:
x <-2
data <- c(0,1,2,3,4,5,6,7,8,9,10,11)
data[x]
1
data[x+12]
1
data[x+13]
2
or something functionally equivalent. In the end I want to be able to do something like
v=6
x=8
y=9
z=12
values <- c(v,x,y,z)
data <- c(0,1,2,3,4,5,6,7,8,9,10,11)
set <- c(data[values[1]],data[values[2]], data[values[3]],data[values[4]])
set
5 7 8 11
values <- values + 8
set
1 3 4 7
I've tried some things with addition and subtraction relative to the length of my list, but it does not work well on the lower numbers.
I hope this was a clear enough explanation,
thanks in advance!
We don't need a loop here, as a vector can be indexed by another vector of length >= 1
data[values]
#[1] 5 7 8 11
NOTE: Both objects are vectors, not lists
If we need to reset the index
values <- values + 8
data[ifelse(values > length(data), values - length(data), values)]
#[1] 1 3 4 7
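If the index can run past the end more than once, modular arithmetic may be a more general option than a single subtraction. A minimal sketch follows; wrap_index is a hypothetical helper name, not something from the answer above.

```r
data   <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
values <- c(6, 8, 9, 12)

# Hypothetical helper: map any positive index onto 1..n, wrapping around.
# Subtracting 1 before %% and adding 1 after accounts for 1-based indexing.
wrap_index <- function(i, n) (i - 1) %% n + 1

data[wrap_index(values, length(data))]
# [1]  5  7  8 11
data[wrap_index(values + 8, length(data))]
# [1] 1 3 4 7
data[wrap_index(values + 32, length(data))]   # wraps around more than once
# [1] 1 3 4 7
```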

Add a percent chance of something happening in R

I have a vector of numbers stored in R.
my_vector <- c(1,2,3,4,5)
I want to add two to each number.
my_vector + 2
[1] 3 4 5 6 7
However, I want there to only be a twenty percent chance of adding two to the numbers in my vector each time I run the code. Is there a way to code this in R?
What I mean is, if I run the code, the output could be:
[1] 3 4 5 6 9
Or perhaps
[1] 5 4 5 6 7
i.e. there is only a 20% chance that any one number in the vector will get two added to it.
my_vector + 2*sample(c(TRUE,FALSE), length(my_vector), prob=c(0.2,0.8), replace=TRUE)
That adds 2 to a variable number of elements (which is what you were asking for), but sometimes people want exactly 20% of the elements to have 2 added, in which case it would be:
my_vector + 2*sample(c(TRUE,rep(FALSE,4)))
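Both variants can also be sketched for a vector of any length. This version swaps in rbinom() for the independent 20% draw instead of sample() with prob, and uses sample.int() to pick an exact share; the seed and the variable names hit, n_hit and exact are assumptions for illustration.

```r
set.seed(1)   # arbitrary seed, only so the run can be repeated
my_vector <- c(1, 2, 3, 4, 5)

# Independent 20% chance per element: each draw is 0 or 1
hit <- rbinom(length(my_vector), size = 1, prob = 0.2)
my_vector + 2 * hit

# Exactly 20% of the elements (rounded) get 2 added
n_hit <- round(0.2 * length(my_vector))
idx   <- sample.int(length(my_vector), n_hit)
exact <- my_vector
exact[idx] <- exact[idx] + 2
exact
```

With the first form, any number of elements from none to all can change on a given run; with the second, the count is fixed and only the positions vary.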

finding set of multinomial combinations

Let's say I have a vector of integers 1:6
w=1:6
I am attempting to obtain a matrix of 90 rows and 6 columns that contains the multinomial combinations from these 6 integers taken as 3 groups of size 2.
6!/(2!*2!*2!)=90
So, columns 1 and 2 of the matrix would represent group 1, columns 3 and 4 would represent group 2 and columns 5 and 6 would represent group 3. Something like:
1 2 3 4 5 6
1 2 3 5 4 6
1 2 3 6 4 5
1 2 4 5 3 6
1 2 4 6 3 5
...
Ultimately, I would want to expand this to other multinomial combinations of limited size (because the numbers get large rather quickly) but I am having trouble getting things to work. I've found several functions that do binomial combinations (only 2 groups) but I could not locate any functions that do this when the number of groups is greater than 2.
I've tried two approaches to this:
Building up the matrix from nothing using for loops, and attempting things with the reshape package (thinking there might be something for this with melt())
Working backwards from the permutation matrix (720 rows) by attempting to retain unique rows within groups and/or removing duplicated rows within groups
Neither worked for me.
The permutation matrix can be obtained with
library(gtools)
dat=permutations(6, 6, set=TRUE, repeats.allowed=FALSE)
I think working backwards from the full permutation matrix is a bit excessive, but I'm trying anything at this point.
Is there a package with a prebuilt function for this? Does anyone have any ideas on how I should proceed?
Here is how you can implement your "working backwards" approach:
gps <- list(1:2, 3:4, 5:6)
get.col <- function(x, j) x[, j]
is.ordered <- function(x) !colSums(diff(t(x)) < 0)
is.valid <- Reduce(`&`, Map(is.ordered, Map(get.col, list(dat), gps)))
dat <- dat[is.valid, ]
nrow(dat)
# [1] 90
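Filtering all 720 permutations works for 6 items but grows quickly. As an alternative sketch, the rows can be built directly by choosing the first group with combn() and recursing on the remaining items. The function name split_groups is hypothetical, and this is only one possible approach, using base R alone.

```r
# Hypothetical helper: enumerate all ways to split `items` into ordered
# groups of the given sizes. Each row lists group 1, then group 2, etc.,
# with the members of each group in increasing position order.
split_groups <- function(items, sizes) {
  stopifnot(sum(sizes) == length(items))
  # Base case: no groups left, one empty row
  if (length(sizes) == 0) return(matrix(items[0], nrow = 1, ncol = 0))
  # Each column of `firsts` holds the positions of one candidate first group
  firsts <- combn(length(items), sizes[1])
  rows <- lapply(seq_len(ncol(firsts)), function(j) {
    idx  <- firsts[, j]
    rest <- split_groups(items[-idx], sizes[-1])
    # Repeat the chosen first group on every row of the remaining splits
    cbind(matrix(items[idx], nrow = nrow(rest), ncol = length(idx),
                 byrow = TRUE),
          rest)
  })
  do.call(rbind, rows)
}

res <- split_groups(1:6, c(2, 2, 2))
nrow(res)
# [1] 90
head(res, 3)
```

Because the group sizes are passed in as a vector, the same sketch should cover other splits such as c(3, 2, 1).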

R how to produce a vector trimming another vector by choosing a fixed value of its components

This is an elementary question; I apologize for it.
Let x <- c(1,2,3,4,5). I would like to produce a vector z of length 5 s.t. its components are all those x satisfying the condition
if x[i]>2 then write 2.
The result should look like
z <- c(1,2,2,2,2)
I know that
z <- which(x>2)
gives me
3 4 5
but I cannot find a good way to implement it to arrive at the result.
I thank you all for your support.
EDIT. If instead of considering a vector x I have a matrix M with columns x and y and I want to apply the above trimming to the column x leaving y untouched, how should I proceed?
You can use pmin:
pmin(x, 2)
# [1] 1 2 2 2 2
For example:
y <- x
y[x>2] <- 2
y
# [1] 1 2 2 2 2
If you have a matrix M with two columns and want to cap only the values in the first column that exceed 2, leaving the second column untouched, then do:
M[,1][M[,1]>2] <- 2
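The pmin approach carries over to the matrix case as well. A small sketch, assuming the columns are named x and y as in the question's edit (the y values here are made up for illustration):

```r
# Example matrix; the y values are invented for this illustration
M <- cbind(x = c(1, 2, 3, 4, 5), y = c(10, 20, 30, 40, 50))

# Cap column x at 2; column y is not touched
M[, "x"] <- pmin(M[, "x"], 2)
M
```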
