KNN used to create new synthetic examples in oversampling - r

I am working in R with an imbalanced dataset, and I need to know how to get the k nearest neighbours of each observation because I need them to create new synthetic examples.
set.seed(123)
test <- 1:100
train.gc <- gc.subset[-test,]
test.gc <- gc.subset[test,]
train.def <- gc$Default[-test]
test.def <- gc$Default[test]
library(class)
knn.5 <- knn(train.gc, test.gc, train.def, k=5)
# how can I get the five nearest neighbours?

Although it doesn't seem to be documented, the help for knn hints that the attributes may store something:
train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])
cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))
k = knn(train, test, cl, k = 3, prob=TRUE)
names(attributes(k))
# [1] "levels" "class" "prob" "nn.index" "nn.dist"
and I'd hazard a guess that nn.index is the index of the neighbours:
> head(attr(k,"nn.index"))
[,1] [,2] [,3]
[1,] 10 2 13
[2,] 24 8 18
[3,] 1 18 8
[4,] 1 18 8
I'd guess those are the 3 nearest neighbours of the first four data points.
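Building on that, here is a minimal sketch of how neighbour indices could be used to create SMOTE-style synthetic examples. It uses the FNN package to get the indices directly (an extra dependency, not part of the answer above), and `minority` stands for a numeric matrix holding only the minority-class rows of your training data (an assumption for illustration):
library(FNN)    # get.knn returns the neighbour indices directly

k  <- 5
nn <- get.knn(minority, k = k)$nn.index    # row i holds the k neighbour row indices of minority[i, ]

set.seed(123)
synthetic <- t(sapply(seq_len(nrow(minority)), function(i) {
  j   <- nn[i, sample(k, 1)]               # pick one of the k neighbours at random
  gap <- runif(1)                          # random interpolation weight between 0 and 1
  minority[i, ] + gap * (minority[j, ] - minority[i, ])
}))
Each synthetic row lies on the segment between a minority example and one of its neighbours, which is the core idea behind SMOTE-style oversampling.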

How can I use SOM algorithm for classification prediction

I would like to see if the SOM algorithm can be used for classification prediction.
I used the code below, but I see that the classification results are far from right. For example, on the test dataset I get many more than the 3 class values that I have in the training target variable. How can I create a prediction model that is in alignment with the training target variable?
library(kohonen)
library(HDclassif)
data(wine)
set.seed(7)
training <- sample(nrow(wine), 120)
Xtraining <- scale(wine[training, ])
Xtest <- scale(wine[-training, ],
               center = attr(Xtraining, "scaled:center"),
               scale = attr(Xtraining, "scaled:scale"))
som.wine <- som(Xtraining, grid = somgrid(5, 5, "hexagonal"))
som.prediction$pred <- predict(som.wine, newdata = Xtest,
                               trainX = Xtraining,
                               trainY = factor(Xtraining$class))
And the result:
$unit.classif
[1] 7 7 1 7 1 11 6 2 2 7 7 12 11 11 12 2 7 7 7 1 2 7 2 16 20 24 25 16 13 17 23 22
[33] 24 18 8 22 17 16 22 18 22 22 18 23 22 18 18 13 10 14 15 4 4 14 14 15 15 4
This might help:
SOM is an unsupervised classification algorithm, so you shouldn't expect it to be trained on a dataset that contains a classifier label (if you do that, it will need this information to work and will be useless with unlabelled datasets)
The idea is that it will kind of "convert" an input numeric vector to a network unit number (try to run your code again with a 1-by-3 grid and you'll get the output you expected)
You'll then need to convert those network unit numbers back into the categories you are looking for (that is the key part missing in your code)
The reproducible example below outputs a classification accuracy. It includes one implementation option for the "convert back" part missing in your original post.
Though, for this particular dataset, the model overfits pretty quickly: 3 units give the best results.
#Set and scale a training set (-1 to drop the classes)
data(wine)
set.seed(7)
training <- sample(nrow(wine), 120)
Xtraining <- scale(wine[training, -1])
#Scale a test set (-1 to drop the classes)
Xtest <- scale(wine[-training, -1],
               center = attr(Xtraining, "scaled:center"),
               scale = attr(Xtraining, "scaled:scale"))
#Set 2D grid resolution
#WARNING: it overfits pretty quickly
#Accuracy is 36% for 1 unit, 63% for 2, 93% for 3, 89% for 4
som_grid <- somgrid(xdim = 1, ydim=3, topo="hexagonal")
#Create a trained model
som_model <- som(Xtraining, som_grid)
#Make a prediction on test data
som.prediction <- predict(som_model, newdata = Xtest)
#Put together original classes and SOM classifications
error.df <- data.frame(real = wine[-training, 1],
                       predicted = som.prediction$unit.classif)
#For each training unit, return the category number that has the strongest
#association with that unit (0 stands for ambiguous); the mapping is built
#from the training data passed in as df
switch <- sapply(unique(som_model$unit.classif), function(x, df){
  cat <- as.numeric(names(which.max(table(df[df$predicted == x, 1]))))
  if(length(cat) < 1){
    cat <- 0
  }
  return(c(x, cat))
}, df = data.frame(real = wine[training, 1], predicted = som_model$unit.classif))
#Translate unit numbers into classes
error.df$corrected <- apply(error.df, MARGIN = 1, function(x, switch){
  cat <- switch[2, which(switch[1, ] == x["predicted"])]
  if(length(cat) < 1){
    cat <- 0
  }
  return(cat)
}, switch = switch)
#Compute classification accuracy (fraction of test observations classified correctly)
sum(error.df$corrected == error.df$real)/length(error.df$real)
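For reference, the same unit-to-class mapping can be written more compactly with tapply; this is just a sketch reusing the objects defined above (test units never seen in training come back as NA rather than 0):
#Majority training class per SOM unit, as a named lookup vector
unit_to_class <- tapply(wine[training, 1], som_model$unit.classif,
                        function(cl) as.numeric(names(which.max(table(cl)))))
#Translate test unit numbers into classes and compute accuracy
pred_class <- unit_to_class[as.character(som.prediction$unit.classif)]
mean(pred_class == wine[-training, 1], na.rm = TRUE)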

How can I get the contribution by each predictor to the final regression prediction in lm

In R, when I use rlm or lm, I would like to get the contribution of each predictor to the final prediction of the model.
The problem occurs when I have interaction terms, as I think they are not stored in the lm object.
Below is sample data (I am looking for a way that generalizes to any number of predictors).
Sample data:
set.seed(1)
y <- rnorm(10)
m <- data.frame(v1=rnorm(10), v2=rnorm(10), v3=rnorm(10))
lmObj <- lm(formula=y~0+v1*v3+v2*v3, data=m)
betaHat <- coefficients(lmObj)
betaHat
v1 v3 v2 v1:v3 v3:v2
0.03455 -0.50224 -0.57745 0.58905 -0.65592
# How do I get the data.frame or matrix with columns (v1, v3, v2, v1:v3, v3:v2)
# equal to [m$v1*betaHat["v1"], ... , (m$v3*m$v2)*betaHat["v3:v2"]]?
At first I thought that by "contribution" you wanted the explained variance of each term (which an ANOVA table would give), while actually you want the term-wise prediction:
predict(lmObj, type = "terms")
See ?predict.lm.
Actually, I got it from lm itself; the trick is to ask for x=TRUE:
lmObj <- lm(formula=y~0+v1*v3+v2*v3, data=m, x=TRUE)
lmObj$x %*% diag(lmObj$coefficients)
[,1] [,2] [,3] [,4] [,5]
1 0.0522305 -0.68238 -0.53066 1.20993 -0.81898
2 0.0134687 0.05162 -0.45164 -0.02360 0.05273
3 -0.0214632 -0.19470 -0.04306 -0.14187 -0.01896
4 -0.0765156 0.02702 1.14875 0.07019 -0.07021
5 0.0388652 0.69161 -0.35792 -0.91250 0.55985
6 -0.0015524 0.20843 0.03241 0.01098 -0.01528
7 -0.0005594 0.19803 0.08996 0.00376 -0.04029
8 0.0326086 0.02979 0.84928 -0.03298 -0.05722
9 0.0283723 -0.55248 0.27611 0.53213 0.34500
10 0.0205187 -0.38330 -0.24134 0.26699 -0.20921
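As a quick sanity check (a small addition, not from the original answer): since the model has no intercept, the rows of this term-wise matrix should add up to the fitted values:
contrib <- lmObj$x %*% diag(lmObj$coefficients)
colnames(contrib) <- names(lmObj$coefficients)                 # label the columns v1, v3, v2, v1:v3, v3:v2
all.equal(unname(rowSums(contrib)), unname(fitted(lmObj)))     # should be TRUE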

parallel k-means in R

I am trying to understand how to parallelize some of my code using R. So, in the following example I want to use k-means to cluster data using 2,3,4,5,6 centers, while using 20 iterations.
Here is the code:
library(parallel)
library(BLR)
data(wheat)
parallel.function <- function(i) {
kmeans( X[1:100,100], centers=?? , nstart=i )
}
out <- mclapply( c(5, 5, 5, 5), FUN=parallel.function )
How can we parallelize over both the iterations and the numbers of centers simultaneously?
How can I keep track of the outputs, assuming I want to keep all of the k-means output across all iterations and centers, just to learn how?
This looked very simple to me at first ... and then I tried it. After a lot of monkey typing and face palming during my lunch break, however, I arrived at this:
library(parallel)
library(BLR)
data(wheat)
# each value of 2:6 fills the unmatched 'centers' argument; the data matrix is supplied via x=X
mc = mclapply(2:6, function(x, centers) kmeans(x, centers), x = X)
It looks right, though I didn't check how sensible the clustering was.
> summary(mc)
Length Class Mode
[1,] 9 kmeans list
[2,] 9 kmeans list
[3,] 9 kmeans list
[4,] 9 kmeans list
[5,] 9 kmeans list
On reflection the command syntax seems sensible, although a lot of other stuff that failed seemed reasonable too... The examples in the help documentation are maybe not that great.
Hope it helps.
EDIT
As requested, here is the same idea run over two variables, nstart and centers:
(pars = expand.grid(i=1:3, cent=2:4))
i cent
1 1 2
2 2 2
3 3 2
4 1 3
5 2 3
6 3 3
7 1 4
8 2 4
9 3 4
L=list()
# zikes horrible
pars2=apply(pars,1,append, L)
mc = mclapply(pars2, function(x,pars)kmeans(x, centers=pars$cent,nstart=pars$i ), x=X)
> summary(mc)
Length Class Mode
[1,] 9 kmeans list
[2,] 9 kmeans list
[3,] 9 kmeans list
[4,] 9 kmeans list
[5,] 9 kmeans list
[6,] 9 kmeans list
[7,] 9 kmeans list
[8,] 9 kmeans list
[9,] 9 kmeans list
How'd you like them apples?
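A less contorted way to cover the same parameter grid (a sketch I'm adding, not part of the original answer) is mcmapply from parallel, which iterates over the two parameter vectors directly without building the intermediate list:
pars <- expand.grid(i = 1:3, cent = 2:4)
mc <- mcmapply(function(centers, nstart) kmeans(X, centers = centers, nstart = nstart),
               centers = pars$cent, nstart = pars$i,
               SIMPLIFY = FALSE, mc.cores = 2)   # note: mc.cores > 1 is not supported on Windows
length(mc)   # 9 kmeans fits, one per (nstart, centers) combination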
There's a CRAN package called knor that is based on a research paper and improves performance using a memory-efficient variant of Elkan's pruning algorithm. It's an order of magnitude faster than everything in these answers.
install.packages("knor")
require(knor)
iris.mat <- as.matrix(iris[,1:4])
k <- length(unique(iris[, dim(iris)[2]])) # Number of unique classes
nthread <- 4
kms <- Kmeans(iris.mat, k, nthread=nthread)
You may use parallel to try k-means from different random starting points on multiple cores.
The code below is an example (K = the k in k-means, N = number of random starting points, C = number of cores you would like to use):
suppressMessages( library("Matrix") )
suppressMessages( library("irlba") )
suppressMessages( library("stats") )
suppressMessages( library("cluster") )
suppressMessages( library("fpc") )
suppressMessages( library("parallel") )
#Calculate k-means results
calcKMeans <- function(matrix, K, N, C){
  #Run k-means in parallel from several random starting points (using C cores)
  results <- mclapply(rep(N %/% C, C), FUN=function(nstart) kmeans(matrix, K, iter.max=15, nstart=nstart), mc.cores=C)
  #Find the solution with the smallest total within-cluster sum of squares
  tmp <- sapply(results, function(r){r[['tot.withinss']]})
  km <- results[[which.min(tmp)]]
  #Returns cluster, centers, totss, withinss, tot.withinss, betweenss, size
  return(km)
}

runKMeans <- function(fin_uf, K, N, C,
                      #fout_center, fout_label, fout_size,
                      fin_record=NULL, fout_prediction=NULL){
  uf = read.table(fin_uf)
  km = calcKMeans(uf, K, N, C)
  rm(uf)
  #write.table(km$cluster, file=fout_label, row.names=FALSE, col.names=FALSE)
  #write.table(km$centers, file=fout_center, row.names=FALSE, col.names=FALSE)
  #write.table(km$size, file=fout_size, row.names=FALSE, col.names=FALSE)
  str(km)
  return(km$centers)
}
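To tie this back to the question's wheat data, a call might look like the following (an illustrative sketch; the parameter values here are arbitrary assumptions):
library(BLR)
data(wheat)                                 # provides the marker matrix X
km <- calcKMeans(X, K = 5, N = 20, C = 4)   # 5 centers, 20 random starts split across 4 cores
km$size                                     # cluster sizes of the best solution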
Hope it helps!

Pairwise Correlation Table

I'm new to R, so I apologize if this is a straightforward question, however I've done quite a bit of searching this evening and can't seem to figure it out. I've got a data frame with a whole slew of variables, and what I'd like to do is create a table of the correlations among a subset of these, basically the equivalent of "pwcorr" in Stata, or "correlations" in SPSS. The one key to this is that not only do I want the r, but I also want the significance associated with that value.
Any ideas? This seems like it should be very simple, but I can't seem to figure out a good way.
Bill Venables offers this solution in an answer on the R mailing list, to which I've made some slight modifications:
cor.prob <- function(X, dfr = nrow(X) - 2) {
  R <- cor(X)
  above <- row(R) < col(R)
  r2 <- R[above]^2
  Fstat <- r2 * dfr / (1 - r2)
  R[above] <- 1 - pf(Fstat, 1, dfr)    # p-value of the F test for each correlation
  cor.mat <- t(R)
  cor.mat[upper.tri(cor.mat)] <- NA    # keep the p-values in the lower triangle, NA above
  cor.mat
}
So let's test it out:
set.seed(123)
data <- matrix(rnorm(100), 20, 5)
cor.prob(data)
[,1] [,2] [,3] [,4] [,5]
[1,] 1.0000000 NA NA NA NA
[2,] 0.7005361 1.0000000 NA NA NA
[3,] 0.5990483 0.6816955 1.0000000 NA NA
[4,] 0.6098357 0.3287116 0.5325167 1.0000000 NA
[5,] 0.3364028 0.1121927 0.1329906 0.5962835 1
The entry in row 3, column 2 is the p-value for the correlation between columns 2 and 3. Does that line up with cor.test?
cor.test(data[,2], data[,3])
Pearson's product-moment correlation
data: data[, 2] and data[, 3]
t = 0.4169, df = 18, p-value = 0.6817
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3603246 0.5178982
sample estimates:
cor
0.09778865
Seems to work ok.
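If a packaged solution is acceptable, the rcorr function in the Hmisc package (my addition, not part of the original answer) returns the correlation matrix and the matching p-value matrix in one call; a minimal sketch using the simulated data above:
# install.packages("Hmisc")
library(Hmisc)
rc <- rcorr(as.matrix(data), type = "pearson")
rc$r   # pairwise correlations
rc$P   # corresponding p-values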
Here is something that I just made; I stumbled on this post because I was looking for a way to take every pair of variables and get a tidy n x 4 data frame: columns 1 and 2 are the variable names, and columns 3 and 4 are the absolute and signed correlation. Just pass the function a data frame of numeric and integer values.
pairwiseCor <- function(dataframe){
  pairs <- combn(names(dataframe), 2, simplify=FALSE)
  df <- data.frame(Variable1=rep(0,length(pairs)), Variable2=rep(0,length(pairs)),
                   AbsCor=rep(0,length(pairs)), Cor=rep(0,length(pairs)))
  for(i in 1:length(pairs)){
    df[i,1] <- pairs[[i]][1]
    df[i,2] <- pairs[[i]][2]
    df[i,3] <- round(abs(cor(dataframe[,pairs[[i]][1]], dataframe[,pairs[[i]][2]])),4)
    df[i,4] <- round(cor(dataframe[,pairs[[i]][1]], dataframe[,pairs[[i]][2]]),4)
  }
  pairwiseCorDF <- df
  pairwiseCorDF <- pairwiseCorDF[order(pairwiseCorDF$AbsCor, decreasing=TRUE),]
  row.names(pairwiseCorDF) <- 1:length(pairs)
  pairwiseCorDF <<- pairwiseCorDF   # also assigns the result to the global environment
  pairwiseCorDF
}
This is what the output is:
> head(pairwiseCorDF)
Variable1 Variable2 AbsCor Cor
1 roll_belt accel_belt_z 0.9920 -0.9920
2 gyros_dumbbell_x gyros_dumbbell_z 0.9839 -0.9839
3 roll_belt total_accel_belt 0.9811 0.9811
4 total_accel_belt accel_belt_z 0.9752 -0.9752
5 pitch_belt accel_belt_x 0.9658 -0.9658
6 gyros_dumbbell_z gyros_forearm_z 0.9491 0.9491
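Since the original question also asks for significance, here is a base-R sketch (my addition, not from the answers above) that extends the same pairwise idea to include the p-value from cor.test for each pair:
pairwiseCorTest <- function(dataframe){
  pairs <- combn(names(dataframe), 2, simplify = FALSE)
  do.call(rbind, lapply(pairs, function(p){
    ct <- cor.test(dataframe[[p[1]]], dataframe[[p[2]]])
    data.frame(Variable1 = p[1], Variable2 = p[2],
               Cor = round(unname(ct$estimate), 4),
               Pvalue = round(ct$p.value, 4))
  }))
}
head(pairwiseCorTest(your_dataframe))   # your_dataframe: any data frame of numeric columns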
I've found that the R package picante does a nice job dealing with the problem that you have. You can easily pass your dataset to the cor.table function and get a table of correlations and p-values for all of your variables. You can specify Pearson's r or Spearman in the function. See this link for help:
http://www.inside-r.org/packages/cran/picante/docs/cor.table
Also remember to remove any non-numeric columns from your dataset prior to running the function. Here's an example piece of code:
install.packages("picante")
library(picante)
#Insert the name of your dataset in the code below
cor.table(dataset, cor.method="pearson")
You can use the sjt.corr function of the sjPlot-package, which gives you a nicely formatted correlation table, ready for use in your Office application.
Simplest function call is just to pass the data frame:
sjt.corr(df)
See examples here.
