I have a matrix with a hundred rows.
Is there a way to obtain a subset of the ten rows that are most similar to the first row? Here is a sketch of what I tried; condition1 and condition are placeholders, because I don't know what to compare or order by:
res2 <- matrix(rexp(200, rate=.1), ncol=10, nrow=100)  # the 200 values are recycled to fill the 100 x 10 matrix
set1 <- subset(res2, res2 > condition1)                # 'condition1' is a placeholder
set1[with(set1, order(condition)), ]                   # so is 'condition'
set2 <- head(set1, 10)
Perhaps:
Generate data:
set.seed(101)
res2 <- matrix(rexp(200, rate=.1), ncol=10, nrow=100)
Calculate the distance matrix. This is wasteful in principle, because we compute all of the pairwise distances when we only need the distances to row 1, but dist is efficiently coded, easy to use, and gives you lots of choices of distance metric (see ?dist and look for method). For a problem of this size it's very quick.
dd <- dist(res2)
rr <- rank(as.matrix(dd)[1,])
You'll notice that the rank of the first element of the first row (which is the distance between row 1 and itself) is 1, and its value (as.matrix(dd)[1,1]) is zero. So all we need now are the rows with the next ten smallest distances ...
res2[rr>1 & rr<=11,]
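If the full pairwise matrix ever becomes a bottleneck, a more direct sketch (assuming Euclidean distance, which is dist's default) computes only the distances from row 1:
# distances from row 1 to every row (row 1's own distance is 0)
d1 <- sqrt(colSums((t(res2) - res2[1, ])^2))
# the ten nearest rows, skipping row 1 itself
res2[order(d1)[2:11], ]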
I need to get the k nearest neighbors from a distance matrix. Example:
I have two "training" vectors, a <- c(1,1) and b <- c(2,2), which are two-dimensional. I have to classify c(3,3), but I can't use a regular distance metric because the numbers are codes for characteristics, and in my metric distance(2,3) > distance(1,3), so c(3,3) has a as its nearest neighbor. Later I'll have to generalize and output the n nearest neighbors, but only for one vector at a time.
One function I found, k.nearest.neighbors, looked promising at first, but when I looked into its documentation I realized it won't help me. I can't do this with Python's scikit-learn, but I have some hope for an R implementation; any suggestions?
I need speed here, so if I'm going to implement it in a high-level language I need to do it with some library. I could easily code this up with Python's numpy, but it would almost certainly be too slow.
EDIT:
library(FNN)
distance_matrix <- matrix( rep( 0, len=9), nrow = 3)
distance_matrix[1,3] <- 2
distance_matrix[3,1] <- 2
distance_matrix[2,3] <- 3
distance_matrix[3,2] <- 3
train <- rbind(c(1,1), c(2,2))
test <- rbind(c(3,3))
y <- c("one", "two")
fit <- knn(train, test, y, distance_matrix, k=1, prob=TRUE)
result <- data.frame(test, pred=fit, prob=attr(fit, "prob"))
But when I look at the data frame result, I see a result based on the Euclidean metric or something similar, not on my distance matrix.
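For what it's worth, FNN::knn computes its own Euclidean distances from train and test and has no argument for a user-supplied distance matrix, which would explain the result above. A minimal sketch of doing it by hand, assuming dist_to_train is a precomputed (test points x training points) distance matrix and y holds the training labels:
# labels of the k nearest training points for each test point
nearest_labels <- function(dist_to_train, y, k = 1) {
  apply(dist_to_train, 1, function(d) y[order(d)[1:k]])
}
# for the example: distances from c(3,3) to "a" and "b" are 2 and 3
nearest_labels(rbind(c(2, 3)), c("one", "two"), k = 1)  # "one", so "a" wins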
Let's assume I have done several operations and created the cluster vectors of correlation values shown below:
D <- matrix(rexp(10*10,rate=.1), ncol=10) #create a randomly filled 10x10 matrix
C <- matrix(rexp(10*10,rate=.1),ncol=10)
DCor <- cor(D) # generate correlation matrix
CCor <- cor(C)
DUpper<- DCor[upper.tri(DCor)] # extract upper triangle
CUpper<- CCor[upper.tri(CCor)]
ClusterD <- kmeans(DUpper,3) # cluster correlations
ClusterC <- kmeans(CUpper,3)
ClusterC <- cbind(c(1:45),matrix(ClusterC$cluster)) # add row numbers as column
ClusterD <- cbind(c(1:45),matrix(ClusterD$cluster))
I would like to generate a matrix that shows the intersection of each pair of cluster groups; for instance, an entry of the matrix would show that 5 of the correlations belong to both group C1 and group D2.
How can I generate a matrix like this?
Before the cbind lines, you could do:
table(ClusterC$cluster, ClusterD$cluster)
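table cross-tabulates the two membership vectors: entry (i, j) counts how many of the 45 correlations fall in cluster i of C and cluster j of D. A quick self-contained check (a sketch, with an assumed seed):
set.seed(1)
DCor <- cor(matrix(rexp(100, rate = .1), ncol = 10))
CCor <- cor(matrix(rexp(100, rate = .1), ncol = 10))
ClusterD <- kmeans(DCor[upper.tri(DCor)], 3)
ClusterC <- kmeans(CCor[upper.tri(CCor)], 3)
table(ClusterC$cluster, ClusterD$cluster)  # 3 x 3 matrix of counts, summing to 45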
I'm trying to do leave-one-out cross-validation on a relatively small dataset (n = 22, p = 17) for a linear regression fitted with the LARS algorithm. Essentially I need to create n matrices of standardized data (each column consists of entries centered by the column's mean and scaled by its SD).
I've never used lists before, but would be open to making lists as long as columns of the different matrices can be manipulated/standardized.
Here's what I tried in R:
for (i in 1:n)
{
  x.standardized.i <- matrix(data = NA, nrow = n - 1, ncol = p)  # one (n-1) x p matrix per iteration
  for (j in 1:p)
  {
    x.standardized.i[, j] <- (x[-i, j] - mean(x[-i, j])) / sd(x[-i, j])  # standardize the p variables with the ith row missing
  }
}
I'm not sure if I can share the data, since it's related to grades from a class, but when I run the code the loop completes and x.standardized.i ends up holding only the final matrix (the one standardized with the last row missing), since each iteration overwrites the previous one.
You can do this quite simply with sapply and scale:
# Create dummy data
m <- matrix(runif(200), ncol=10)
# Leave each row out in turn, and scale each column
A <- sapply(seq_len(nrow(m)), function(i) scale(m[-i, ]), simplify='array')
By default, scale centres each column on its mean, and divides by its sd.
For the example above, you'll end up with an array with 19 rows, 10 columns and 20 slices.
To access particular slices (i.e. cross-validation training folds), you can subset like this:
A[,, 1] # all rows, all cols, first slice
A[,, 10] # all rows, all cols, tenth slice
To confirm that columns are centred on their mean and standardised by one sd:
apply(A, c(2, 3), mean)  # all effectively 0
apply(A, c(2, 3), sd)    # all 1
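Since the question mentions being open to lists: the same idea with lapply gives a list of matrices instead of a 3-D array (a minor variation on the above):
A.list <- lapply(seq_len(nrow(m)), function(i) scale(m[-i, ]))
A.list[[1]]  # the training fold that leaves out row 1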
I am trying to smooth a matrix by attributing to each column the mean value of a window covering n columns around it. I've managed to do it, but I'd like to see 'the R way' of doing it, as I am making use of for loops. Is there a way to do this using apply or some function of the same family?
Example:
# create a toy matrix
mat <- matrix(nrow = 0, ncol = 200)  # start with zero rows so no all-NA row is left behind
for (i in 1:100) { mat <- rbind(mat, sample(1:200, 200)) }
# quick visualization
image(t(mat))
This is the matrix before smoothing:
I wrote the function smooth_row_mat, which takes a matrix and the length of the smoothing kernel:
smooth_row_mat <- function(k, k.d = 5){
  k.range <- (k.d + 2):(ncol(k) - k.d - 1)
  k.smooth <- matrix(nrow = nrow(k))  # starts as a single column of NAs
  for (i in k.range){
    if (i %% 10 == 0) cat('\r', round(i / length(k.range), 2))  # crude progress indicator
    # mean over the window from k.d + 1 columns left of i to k.d + 1 columns right of i
    k.smooth <- cbind(k.smooth, rowMeans(k[, (i - 1 - k.d):(i + 1 + k.d)]))
  }
  return(k.smooth)
}
Now we use smooth_row_mat() with mat:
mat.smooth <- smooth_row_mat(mat)
And we have successfully smoothed, on a row basis, the content of the matrix.
This is the matrix after:
This method is fine for such a small matrix, and it still works on my real matrices, which are around 40,000 x 400, but I'd like to improve my R skills.
Thanks!
You can apply a filter (running mean) across each row of your matrix as follows:
apply(k, 1, filter, rep(1/k.d, k.d))
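Two usage caveats with this one-liner: apply returns its per-row results as columns, so the output needs transposing, and stats::filter leaves NAs at both edges of each row where the window doesn't fit:
# transpose to restore the original row-per-series orientation
k.smooth <- t(apply(k, 1, filter, rep(1/k.d, k.d)))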
Here's how I'd do it, with the raster package.
First, create a matrix filled with random data and coerce it to a raster object.
library(raster)
r <- raster(matrix(sample(200, 200*200, replace=TRUE), nc=200))
plot(r)
Then use the focal function to calculate a neighbourhood mean, where the neighbourhood spans n cells centred on the focal cell. The values in the matrix of weights you provide to focal determine how much the value of each cell contributes to the focal summary. For a mean, we want each cell to contribute 1/n, so we fill a 1 x n matrix with the value 1/n. Note that n must be an odd number, and the cell in the centre of the matrix is treated as the focal cell.
n <- 3
smooth_r <- focal(r, matrix(1/n, nc=n))
plot(smooth_r)
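If you need a plain matrix back after smoothing (an assumption about your downstream use), raster can convert the result:
smooth_m <- as.matrix(smooth_r)  # back to a base R matrix of smoothed values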
I have a (k x n) matrix. I have so far managed to linearly regress (using the lm function) column 1 on each of the other columns, extracting only the coefficients.
fore.choose <- matrix(0, 1, NCOL(assets))
for(i in seq(1, NCOL(assets), 1))
{
abc <- lm(assets[,1]~assets[,i])$coefficients
fore.choose[1,i] <- abc[2:length(abc)]
}
The coefficients are placed in the fore.choose matrix.
What I now need to do is to linearly regress column 2 with each and every other column, and then column 3 and so on and so forth and extract only the coefficients.
The output will be a square matrix of OLS univariate coefficients. Kind of similar to a correlation matrix, but it is the beta coefficients I am interested in.
fore.choose <- matrix(0, 1, NCOL(assets))
will initially need to become
fore.choose <- matrix(0, NCOL(assets), NCOL(assets))
I'd just compute the coefficients directly from the correlation matrix, using the identity that the slope from regressing y on x is beta = cor(x, y) * sd(y) / sd(x), like this:
# set up some sample data
set.seed(1)
d <- matrix(rnorm(50), ncol=5)
# get the coefficients
s <- apply(d, 2, sd)
cor(d)*outer(s, s, "/")
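Entry (i, j) of this matrix is the slope from regressing column i on column j. A quick spot-check against lm (not in the original answer, just a sanity check):
B <- cor(d) * outer(s, s, "/")
B[2, 3]                       # slope for column 2 regressed on column 3
coef(lm(d[, 2] ~ d[, 3]))[2]  # should match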
You could also use lsfit to get the coefficients of one term on all the others at once and then only have one loop to do:
sapply(1:ncol(d), function(i) {
coef(lsfit(d[,i], d))[2,]
})
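Here column i of the result holds the slopes from regressing every column on column i, so rows index the response and columns the predictor, matching the correlation-based matrix above.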
I'm sure there must be a more elegant way than nested loops, but here is the straightforward version:
fore.choose <- matrix(NA, NCOL(assets), NCOL(assets))
abc <- NULL
for(i in seq_len(ncol(assets))){ # loop over "dependent" columns
for(j in seq_len(ncol(assets))){ # loop over "independent" columns
abc <- lm(assets[,i]~assets[,j])$coefficients
fore.choose[i,j] <- abc[-1]
}
}
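As a sanity check (assuming assets is the same matrix d from the earlier answer), the loop should reproduce the correlation-based result:
assets <- d
# ... after running the nested loop above ...
all.equal(fore.choose, cor(d) * outer(s, s, "/"))  # TRUE, up to numerical noise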