Using sapply instead of a for loop in R

Working on a project where we need to average each entry of a matrix with the entries around it. For example, imagine a 3x3 matrix such as
[(1,2,3),
(4,5,6),
(7,8,9)].
Step 1 is to add padding around the matrix. Let's say we add 1 layer of padding, giving a 5x5 matrix
[[0,0,0,0,0],
[0,1,2,3,0],
[0,4,5,6,0],
[0,7,8,9,0],
[0,0,0,0,0]].
matrix(c(0,0,0,0,0,0,1,2,3,0,0,4,5,6,0,0,7,8,9,0,0,0,0,0,0), nrow=5, ncol=5, byrow=T)
Then we average over each 3x3 window to get the final 3x3 matrix. The entry in the first row, first column of this matrix should be (1+2+4+5)/9 = 1.33.
Right now my code works and looks like
for(row in (k+1):(nrow(pad.m) - k)) {
  for(col in (k+1):(ncol(pad.m) - k)) {
    y <- pad.m[seq(row-k, row+k), seq(col-k, col+k)]
    filter.m[row-k, col-k] <- mean(y)
  }
}
where k is the number of layers of padding and pad.m is our padded matrix. Unfortunately, my professor says that this is too unwieldy and prefers sapply over two for loops. I was wondering how I could subset and iterate through the matrix with sapply.

Use tensorflow. You can use either a convolutional layer or a pooling layer. Example:
library(tensorflow)
mymat <- matrix(c(0,0,0,0,0,0,1,2,3,0,0,4,5,6,0,0,7,8,9,0,0,0,0,0,0), nrow=5, ncol=5, byrow=T) # Your padded matrix
matrix1 <- tf$constant( array(mymat, dim=c(1,nrow(mymat),ncol(mymat),1)), dtype="float64" ) # reshape to a 4D tensor: batch x height x width x channels
pool1 <- tf$nn$avg_pool(matrix1, c(1L,2L,2L,1L), c(1L,1L,1L,1L), "SAME") # 2x2 average pooling with stride 1
sess <- tf$Session() # TensorFlow 1.x session API
sess$run(tf$global_variables_initializer())
res <- pool1$eval(session=sess)
sess$close()
The above takes the average over 2x2 regions, i.e. it divides each 2x2 sum by 4. But you added up the 2x2 regions and then divided by 9 (the full 3x3 window size), which is weird, but okay. So you can rescale by 4/9 and get your results like this:
res <- res[1,,,]
(res * 4/9)[-1,][,-1][-(3:4),][,-(3:4)]
[,1] [,2]
[1,] 1.333333 1.777778
[2,] 2.666667 3.111111
The above just reshapes the array output back into a matrix and keeps only the window positions that fall fully inside the original data.
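As for the original sapply question, here is a minimal base-R sketch (assuming pad.m and k are defined as in the question) that reproduces the double loop:
# outer sapply walks the columns, inner sapply walks the rows;
# each output cell is the mean of the (2k+1)x(2k+1) window around it
filter.m <- sapply((k+1):(ncol(pad.m) - k), function(col)
  sapply((k+1):(nrow(pad.m) - k), function(row)
    mean(pad.m[(row-k):(row+k), (col-k):(col+k)])))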

Related

Implementing KNN with different distance metrics using R

I am working on a dataset in order to compare the effect of different distance metrics. I am using the KNN algorithm.
The KNN algorithm in R uses the Euclidean distance by default, so I wrote my own. I would like to find the number of correct class label matches between the nearest neighbor and the target.
I prepared the data first. Then I called the data (wdbc_n) and chose K=1. I used the Euclidean distance as a test.
library(philentropy)
knn <- function(xmat, k, method){
  n <- nrow(xmat)
  if (n <= k) stop("k can not be more than n-1")
  neigh <- matrix(0, nrow = n, ncol = k)
  for(i in 1:n) {
    ddist <- distance(xmat, method)
    neigh[i, ] <- order(ddist)[2:(k + 1)]
  }
  return(neigh)
}
wdbc_nn <- knn(wdbc_n, 1, method="euclidean")
I am hoping to get a result similar to the paper "On the Surprising Behavior of Distance Metrics in High Dimensional Space" (https://bib.dbvis.de/uploadedFiles/155.pdf, page 431, table 3).
My question is:
Am I right or wrong with the codes?
Any suggestions or reference that will guide me will be highly appreciated.
EDIT
My data (breast-cancer-wisconsin, wdbc) has dimension
569 32
After normalizing and removing the id and target columns, the dimension is
dim(wdbc_n)
569 30
The train and test split is given by
wdbc_train<-wdbc_n[1:469,]
wdbc_test<-wdbc_n[470:569,]
Am I right or wrong with the codes?
Your code is wrong.
The call to the distance function took about 3 seconds every time on my rather recent PC, so I only did the first 30 rows for k=3, and noticed that every row of the neigh matrix was identical. Why is that? Take a look at this line:
ddist<- distance(xmat, method)
Each loop iteration feeds the whole xmat matrix to the distance function, then uses only the first row of the resulting matrix. This calculates the distances between all the training set rows, and does that n times, discarding every row except the first each time. That is not what you want: the knn algorithm is supposed to calculate, for each row in the test set, the distance to each row in the training set.
Let's take a look at the documentation for the distance function:
distance(x, method = "euclidean", p = NULL, test.na = TRUE, unit = "log", est.prob = NULL)
x: a numeric data.frame or matrix (storing probability vectors) or a numeric data.frame or matrix storing counts (if est.prob is specified).
(...)
in case nrow(x) = 2: a single distance value. in case nrow(x) > 2: a distance matrix storing distance values for all pairwise probability vector comparisons.
In your specific case (knn classification), you want to use the 2-row version.
One last thing: you used order, which returns the positions (indices) of the distances sorted in ascending order, not the distances themselves. I think what you want here are the distances, so you need to use sort instead of order.
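A quick illustration of the difference, with a made-up distance vector:
d <- c(0.5, 0.1, 0.9, 0.3)
order(d)  # 2 4 1 3  -- the indices that would sort d in ascending order
sort(d)   # 0.1 0.3 0.5 0.9  -- the sorted distances themselves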
Based on your code and the example in Lantz (2013) that your code seems to be based on, here is a complete working solution. I took the liberty of adding a few lines to make it a standalone program.
Standalone working solution(s)
library(philentropy)
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
knn <- function(train, test, k, method){
  n.test <- nrow(test)
  n.train <- nrow(train)
  if (n.train + n.test <= k) stop("k can not be more than n-1")
  neigh <- matrix(0, nrow = n.test, ncol = k)
  ddist <- NULL
  for(i in 1:n.test) {
    for(j in 1:n.train) {
      # make a 2-row matrix combining the current test and train rows
      xmat <- rbind(test[i,], train[j,])
      # then calculate the distance and append it to the ddist vector
      ddist[j] <- distance(as.data.frame(xmat), method)
    }
    # note: [2:(k + 1)] skips the closest neighbour; since test and train
    # are disjoint here, [1:k] may be what is actually intended
    neigh[i, ] <- sort(ddist)[2:(k + 1)]
  }
  return(neigh)
}
wbcd <- read.csv("https://resources.oreilly.com/examples/9781784393908/raw/ac9fe41596dd42fc3877cfa8ed410dd346c43548/Machine%20Learning%20with%20R,%20Second%20Edition_Code/Chapter%2003/wisc_bc_data.csv")
rownames(wbcd) <- wbcd$id
wbcd$id <- NULL
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
wbcd_train<-wbcd_n[1:469,]
wbcd_test<-wbcd_n[470:569,]
wbcd_nn <- knn(wbcd_train, wbcd_test, 3, method="euclidean")
Do note that this solution might be slow because of the numerous (100 times 469) calls to the distance function. However, since we only feed 2 rows at a time into the distance function, the execution time stays manageable.
Now does that work?
The first two test rows using the custom knn function:
[,1] [,2] [,3]
[1,] 0.3887346 0.4051762 0.4397497
[2,] 0.2518766 0.2758161 0.2790369
Let us compare with the equivalent function in the FNN package:
library(FNN)
alt.class <- get.knnx(wbcd_train, wbcd_test, k=3, algorithm = "brute")
alt.class$nn.dist
[,1] [,2] [,3]
[1,] 0.3815984 0.3887346 0.4051762
[2,] 0.2392102 0.2518766 0.2758161
Conclusion: not too shabby.

diagonal replacement in an R correlogram

I am rather new to R. I am trying to replace the main diagonal of a correlogram (which obviously consists of ones). I have created the vectors for the correlogram and used the cor() function to create it. I also created a list with the values that I want instead of the ones on the diagonal, consisting of the internal reliabilities of the correlogram vectors.
library(cocron)
library(fmsb)
# defining correlated variables
JOB_ins = subset(df,select=c("q9","Rq10_new","q11","q12"))
INT_to_quit = subset(df,select=c("q13","q14","Rq15_new","q16"))
Employability = subset(df,select=c("q17","q18","q19","q20"))
Mobility_pref = subset(df,select=c("Rq21","Rq22","Rq23","Rq24","Rq25"))
Career_self_mgmt = subset(df,select=c("q26","q27","q28","q29","q30"
,"q31","q32","q33"))
# subsetting dataframes
x = subset(df,select=c(JOB_ins, INT_to_quit, Employability
,Mobility_pref,Career_self_mgmt))
#creating a correlation matrix
corrmat = cor(x)
#creating Cronbach Alpha reliabilities vector for diagonal replacement
dlist=list(round(CronbachAlpha(JOB_ins),2),round(CronbachAlpha(INT_to_quit),2)
,round(CronbachAlpha(Employability),2)
,round(CronbachAlpha(Mobility_pref),2)
,round(CronbachAlpha(Career_self_mgmt),2))
#replacing the main diagonal
diag(corrmat)=dlist
Doing that I do replace the main diagonal, but it seems I also turn my correlogram from a matrix into a vector. Any idea how I can keep that from happening or reverse it?
First, you can use a vector instead of a list: replace list(round(CronbachAlpha(JOB_ins),2), ...) with c(round(CronbachAlpha(JOB_ins),2), ...).
Second, you can convert a vector to a matrix easily. Example:
matrix(c(1,2,3,4), nrow = 2) will convert the c(1,2,3,4) vector into the following 2x2 matrix:
[,1] [,2]
[1,] 1 3
[2,] 2 4
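Putting both points together, here is a minimal sketch (with made-up data and assumed reliability values) showing that assigning a numeric vector keeps the correlation matrix a matrix:
set.seed(42)
corrmat <- cor(matrix(rnorm(30), ncol = 3))  # hypothetical 3-variable correlation matrix
dvec <- c(0.81, 0.75, 0.90)                  # assumed Cronbach alpha values
diag(corrmat) <- dvec                        # vector assignment: corrmat stays a matrix
is.matrix(corrmat)                           # TRUE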

how to calculate mean of multiple matrices

I have 2000 covariance matrices of size 27x27, and I want to get the mean covariance matrix over all 2000 matrices. The result I want is one 27x27 matrix in which position [1,1] is the mean of position [1,1] across all 2000 matrices.
I could see from other posts that I should make an array and use the apply function, but it does not work!
My code:
a<-array(ml.1[c(1:2000)])
apply(a,c(1,2),mean)
I get this error message:
Error in if (d2 == 0L) { : missing value where TRUE/FALSE needed
I would appreciate it if anyone could help me solve this problem.
First, @eipi10 is right: your question is not reproducible. But the key here is in how you set up your array.
#Make some fake data 10 matrices 10x10
m <- lapply(1:10, function(x) matrix(rnorm(100), nrow = 10))
#bind the matrices together
a <- do.call(cbind, m)
#recast the matrix into three dimensions
dim(a) <- c(10,10,10)
#now apply should work
apply(a, c(1,2), mean)
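Applied to the question's setup, and assuming ml.1 is a list of 2000 27x27 matrices (the original data isn't shown, so this is a guess at its structure), the same idea becomes:
a <- array(unlist(ml.1), dim = c(27, 27, 2000))  # stack the matrices along a third dimension
mean.mat <- apply(a, c(1, 2), mean)              # element-wise mean across all 2000 matrices
dim(mean.mat)                                    # 27 27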

create a random non-singular matrix reliably

How can I create a matrix of pseudo-random values that is guaranteed to be non-singular? I tried the code below, but it failed. I suppose I could just loop until I get one by chance, but I would prefer a more elegant, "R-like" solution if anyone has an idea.
library(matrixcalc)
exampledf<- matrix(ceiling(runif(16,0,50)), ncol=4)
is.singular.matrix(exampledf) #this may or may not return false
Using a while loop:
library(matrixcalc)
exampledf <- matrix(0, nrow = 4, ncol = 4)  # start from a singular matrix so the loop runs
while(is.singular.matrix(exampledf)){
  exampledf <- matrix(ceiling(runif(16,0,50)), ncol=4)
}
I suppose one method that actually guarantees (rather than just makes it likely) that the matrix is non-singular is to start from a known non-singular matrix and apply the basic row operations used, for example, in Gaussian elimination: 1. add/subtract a multiple of one row to/from another row, or 2. multiply a row by a non-zero constant.
Depending on how "random" and how dense you want your matrix to be, you can start from the identity matrix and multiply all elements by a random non-zero constant. Afterwards, apply a randomly selected set of the operations above; the result will still be non-singular. You can even apply a predefined set of operations, but with a randomly selected constant at each step.
An alternative is to start from an upper triangular matrix in which the product of the main diagonal entries is non-zero, because the determinant of a triangular matrix is the product of the elements on its main diagonal. This effectively boils down to generating N non-zero random numbers, placing them on the main diagonal, and setting the entries above the diagonal to whatever you like. If you want the matrix to be fully dense, add the first row to every other row of the matrix (see the sketch below).
Of course this approach (like probably any other) assumes the matrix is numerically well-behaved, so that precision errors do not push it into singularity (the precision of floating-point types is limited in every programming language). You would do well to avoid very small or very large values, which can make the method numerically unstable.
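Here is a minimal sketch of the triangular construction described above (the function name and bounds are mine, for illustration):
random_nonsingular <- function(n, lo = 1, hi = 50) {
  m <- matrix(ceiling(runif(n * n, lo, hi)), ncol = n)  # strictly positive entries
  m[lower.tri(m)] <- 0  # upper triangular: det is the product of the diagonal
  m[-1, ] <- m[-1, ] + rep(m[1, ], each = n - 1)  # densify; row additions keep det non-zero
  m
}
det(random_nonsingular(4))  # non-zero by construction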
It should be fairly unlikely that this will produce a singular matrix (for continuous random entries, the probability is zero in exact arithmetic):
Mat1 <- matrix(rnorm(100), ncol=4)
Mat2 <- matrix(rnorm(100), ncol=4)
crossprod(Mat1,Mat2)
[,1] [,2] [,3] [,4]
[1,] 0.8138 5.112 2.945 -5.003
[2,] 4.9755 -2.420 1.801 -4.188
[3,] -3.8579 8.791 -2.594 3.340
[4,] 7.2057 6.426 2.663 -1.235
solve( crossprod(Mat1,Mat2) )
[,1] [,2] [,3] [,4]
[1,] -0.11273 0.15811 0.05616 0.07241
[2,] 0.03387 0.01187 0.07626 0.02881
[3,] 0.19007 -0.60377 -0.40665 0.17771
[4,] -0.07174 -0.31751 -0.15228 0.14582
inv1000 <- replicate(1000, {
Mat1 <- matrix(rnorm(100), ncol=4)
Mat2 <- matrix(rnorm(100), ncol=4)
try(solve( crossprod(Mat1,Mat2)))} )
str(inv1000)
#num [1:4, 1:4, 1:1000] 0.1163 0.0328 0.3424 -0.227 0.0347 ...
max(inv1000)
#[1] 451.6
> inv100000 <- replicate(100000, {Mat1 <- matrix(rnorm(100), ncol=4)
+ Mat2 <- matrix(rnorm(100), ncol=4)
+ is.singular.matrix( crossprod(Mat1,Mat2))} )
> sum(inv100000)
[1] 0

Determining if a matrix is diagonalizable in the R Programming Language

I have a matrix and I would like to know if it is diagonalizable. How do I do this in the R programming language?
If you have a given matrix, m, then one way is to take the matrix of eigenvectors times the diagonal matrix of eigenvalues times the inverse of the eigenvector matrix. That should give us back the original matrix. In R that looks like:
m <- matrix(1:16, nrow = 4)
p <- eigen(m)$vectors      # eigenvector matrix
d <- diag(eigen(m)$values) # diagonal matrix of eigenvalues
p %*% d %*% solve(p)
m
so in that example p %*% d %*% solve(p) should be the same as m, up to floating-point error
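One quick way to check the reconstruction numerically (my own check, not from the answer) is to look at the largest absolute difference:
max(abs(p %*% d %*% solve(p) - m))  # effectively zero, i.e. only floating-point noise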
You can implement the full algorithm to check whether the matrix reduces to a Jordan form or a diagonal one (see e.g. this document). Or you can take the quick and dirty way: for an n-dimensional square matrix, use eigen(M)$values and check that there are n distinct values. For random matrices this always suffices: degeneracy has probability 0.
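A one-line sketch of that quick check (M assumed square; rounding as a crude tolerance is my choice, not from the answer):
length(unique(round(eigen(M)$values, 10))) == nrow(M)  # TRUE if there are n distinct eigenvalues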
P.S.: based on a simple observation by JD Long below, I recalled that a necessary and sufficient condition for diagonalizability is that the eigenvectors span the original space. To check this, just verify that the eigenvector matrix has full rank (no zero eigenvalue). So here is the code:
diagflag = function(m, tol = 1e-10){
  x = eigen(m)$vectors           # eigenvector matrix
  y = min(abs(eigen(x)$values))  # smallest eigenvalue magnitude of x
  return(y > tol)                # TRUE if x has full rank, i.e. m is diagonalizable
}
# nondiagonalizable matrix
m1 = matrix(c(1,1,0,1),nrow=2)
# diagonalizable matrix
m2 = matrix(c(-1,1,0,1),nrow=2)
> m1
[,1] [,2]
[1,] 1 0
[2,] 1 1
> diagflag(m1)
[1] FALSE
> m2
[,1] [,2]
[1,] -1 0
[2,] 1 1
> diagflag(m2)
[1] TRUE
You might want to check out this page for some basic discussion and code. You'll need to search for "diagonalized" which is where the relevant portion begins.
All symmetric matrices are diagonalizable by orthogonal matrices. In fact, if you want diagonalizability specifically by orthogonal matrix conjugation, i.e. D = PAP' where P' just stands for transpose, then symmetry across the diagonal, i.e. A_{ij} = A_{ji}, is exactly equivalent to such diagonalizability.
If the matrix is not symmetric, then diagonalizability means not D = PAP' but merely D = PAP^{-1}, and we do not necessarily have P' = P^{-1}, which is the condition of orthogonality.
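A quick numerical illustration of the symmetric case, using a made-up 2x2 matrix:
A <- matrix(c(2, 1, 1, 2), nrow = 2)  # symmetric
P <- eigen(A)$vectors                 # orthogonal when A is symmetric
all.equal(t(P), solve(P))             # P' equals P^{-1}, i.e. P is orthogonal
all.equal(P %*% diag(eigen(A)$values) %*% t(P), A)  # A = PDP'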
You need to do something more substantial here; there is probably a better way, but you could just compute the eigenvectors and check that their rank equals the dimension of the matrix.
See this discussion for a more detailed explanation.
