Let me start by saying I have no experience with R, KNN or data science in general. I recently found Kaggle and have been playing around with the Digit Recognition competition/tutorial.
In this tutorial they provide some sample code to get you started with a basic submission:
# makes the KNN submission
library(FNN)
train <- read.csv("c:/Development/data/digits/train.csv", header=TRUE)
test <- read.csv("c:/Development/data/digits/test.csv", header=TRUE)
labels <- train[,1]
train <- train[,-1]
results <- (0:9)[knn(train, test, labels, k = 10, algorithm="cover_tree")]
write(results, file="knn_benchmark.csv", ncolumns=1)
My questions are:
1) How can I view the nearest neighbors that have been selected for a particular test row?
2) How can I modify which of those ten is selected for my results?
These questions may be too broad. If so, I would welcome any links that could point me down the right road.
It is very possible that I have said something that doesn't make sense here. If this is the case, please correct me.
1) You can get the nearest neighbors of a given row like so:
k <- knn(train, test, labels, k = 10, algorithm="cover_tree")
indices <- attr(k, "nn.index")
Then if you want the indices (into the training set) of the 10 nearest neighbors to row 20 of the test set:
print(indices[20, ])
(You'll get the 10 nearest neighbors because you selected k=10). For example, if you run with only the first 1000 rows of the training and testing set (to make it computationally easier):
train <- read.csv("train.csv", header=TRUE)[1:1000, ]
test <- read.csv("test.csv", header=TRUE)[1:1000, ]
labels <- train[,1]
train <- train[,-1]
k <- knn(train, test, labels, k = 10, algorithm="cover_tree")
indices = attr(k, "nn.index")
print(indices[20, ])
# output:
# [1] 829 539 784 487 293 882 367 268 201 277
Those are the indices within the training set of 1000 that are closest to the 20th row of the test set.
2) It depends what you mean by "modify". For starters, you can get the labels of the 10 closest training points for each test row like this:
closest.labels = apply(indices, 2, function(col) labels[col])
You can then see the labels of the 10 closest points to the 20th test point like this:
closest.labels[20, ]
# [1] 0 0 0 0 0 0 0 0 0 0
This indicates that all 10 of the closest points to test row 20 belong to the group labeled 0. knn simply chooses the label by majority vote (with ties broken at random), but you could choose some kind of weighting scheme if you prefer.
ETA: If you're interested in weighting the closer elements more heavily in your voting scheme, note that you can also get the distances to each of the k neighbors like this:
dists = attr(k, "nn.dist")
dists[20, ]
# output:
# [1] 1238.777 1243.581 1323.538 1398.060 1503.371 1529.660 1538.128 1609.730
# [9] 1630.910 1667.014
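As a rough illustration of such a weighting scheme, here is a minimal sketch of an inverse-distance weighted vote. The small closest.labels and dists matrices are made-up stand-ins for the ones extracted above, and weighted.vote is a hypothetical helper, not part of FNN:

```r
# Hypothetical neighbour labels and distances for 2 test rows with k = 3
# (stand-ins for closest.labels and dists computed from the knn result)
closest.labels <- matrix(c(0, 1, 1,
                           7, 7, 2), nrow = 2, byrow = TRUE)
dists <- matrix(c(1, 5, 5,
                  2, 3, 10), nrow = 2, byrow = TRUE)

# Hypothetical helper: each neighbour votes with weight 1/distance,
# so closer neighbours count for more than distant ones
weighted.vote <- function(lab, d) {
  w <- tapply(1 / (d + 1e-8), lab, sum)  # total weight per label
  as.integer(names(w)[which.max(w)])     # label with the largest total weight
}

pred <- sapply(seq_len(nrow(dists)),
               function(i) weighted.vote(closest.labels[i, ], dists[i, ]))
pred
```

In the first row a plain majority vote would pick 1 (two of three neighbours), while the weighted vote picks 0 because that neighbour is much closer. Inverse distance is only one possible choice of weight.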
I'm working on the Wisconsin Breast Cancer Dataset. My aim is to build a model with good accuracy and 100% sensitivity. I know that to achieve this I have to work with the classification threshold. The problem is that I don't understand how thresholds work or how to choose them properly.
I'm studying from the well-known Introduction to Statistical Learning (with Applications in R) book, but I can't find an explanation of how to choose the threshold in chapter 4.
Here is the code I've written so far:
df <- subset(df, select = -c(X, id)) # Selecting features
set.seed(4)
# Train and test
nrows <- NROW(df)
index <- sample(1:nrows, 0.7 * nrows)
traindf <- df[index,]
testdf <- df[-index,]
glm.fit=glm(diagnosis~., data=traindf ,family=binomial)
glm.probs=predict(glm.fit,testdf,type="response")
glm.pred=rep("B",dim(testdf)[1]) # start with every test row predicted as "B"
glm.pred[glm.probs >.5]="M"
table(glm.pred, testdf[,1])
Now, this gives me
glm.pred   B   M
       B 108   3
       M   4  56
What I want is to put 0 in the top right of the table, but changing the thresholds doesn't work.
How can I fix the problem?
The same happens with the lda function (which I have omitted here).
Thanks
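For reference, the threshold is just the cutoff applied to the predicted probability of "M": lowering it turns more rows into "M" predictions, which moves false negatives (the top-right cell) into the bottom-right cell at the cost of more false positives. A minimal sketch on simulated data (not the Wisconsin dataset; the variable names and the classify helper are assumptions for illustration):

```r
# Simulated stand-in data: one predictor x and a B/M outcome
set.seed(1)
n <- 200
x <- rnorm(n)
y <- factor(ifelse(runif(n) < plogis(2 * x), "M", "B"), levels = c("B", "M"))

fit <- glm(y ~ x, family = binomial)
probs <- predict(fit, type = "response")  # estimated P(y == "M")

# Hypothetical helper: predict "M" whenever the probability exceeds the threshold
classify <- function(probs, threshold) {
  factor(ifelse(probs > threshold, "M", "B"), levels = c("B", "M"))
}

# Confusion table at the default threshold
table(pred = classify(probs, 0.5), actual = y)

# Largest threshold that still flags every actual "M":
# just below the smallest probability given to a true "M"
threshold <- min(probs[y == "M"]) - 1e-8
conf <- table(pred = classify(probs, threshold), actual = y)
sensitivity <- conf["M", "M"] / sum(conf[, "M"])
conf
```

Note that this picks the threshold on the same data it is evaluated on, so the 100% sensitivity is only guaranteed for that data. In practice the threshold would be chosen on training or validation data and then checked on the test set.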
To understand my problem, you will need the whole dataset: https://pastebin.com/82paf0G8
Pre-processing: I had a list of orders with 696 unique item numbers and wanted to cluster the items based on how frequently each pair is ordered together. For each pair of items I counted how often both occur within the same order; the highest count was 489, between two items. I then "calculated" the similarity/correlation as frequency / maximum frequency over all pairs (489). This gives the dataset that I have uploaded.
Similarity/correlation: I don't know if my similarity approach is the best in this case. I also tried something called "Jaccard’s coefficient/index", but got almost the same results.
The dataset: The dataset contains the material numbers V1 and V2; N is the correlation between the two material numbers, between 0 and 1.
With help from someone else, I managed to create a distance matrix and use PAM clustering.
Why PAM clustering? A data scientist suggested this: more than 95% of the pairs carry no information, which puts all these materials at the same distance and produces a single, very dispersed cluster. This problem can be addressed with the PAM algorithm, but you will still get one very concentrated group. Another solution is to increase the weight of the distances other than one.
Problem 1: The matrix is only 567x567. I think for clustering I need the full 696x696 matrix, even though a lot of the entries are zeros. But I'm not sure.
Problem 2: The clustering does not work very well. I get very concentrated clusters, with a lot of items assigned to the first cluster. Also, judging by the usual ways of validating PAM clusters, my clustering results are poor. Is it due to the similarity analysis? What else should I use? Is it due to 95% of the data being zeros? Should I change the zeros to something else?
The whole code and results:
# Suppose X is the dataset
library(data.table) # provides data.table() and dcast()
df <- data.table(X)
ss <- dcast(rbind(df, df[, .(V1 = V2, V2 = V1, N)]), V1~V2, value.var = "N")[, -1]
ss <- ss/max(ss, na.rm = TRUE)
ss[is.na(ss)] <- 0
diag(ss) <- 1
Now using the PAM clustering
library(cluster) # provides pam()
dd2 <- as.dist(1 - sqrt(ss))
pam2 <- pam(dd2, 4)
summary(as.factor(pam2$clustering))
But I get very concentrated clusters, as:
  1   2   3   4
382 100  23  62
I'm not sure where you get the 696 number from. After the rbind, you have a data frame with 567 unique values for V1 and V2; you then perform the dcast and end up with a 567 x 567 matrix, as expected. As for the clustering itself, I see no issue with your clusters.
dim(df) # [1] 7659 3
test <- rbind(df, df[, .(V1 = V2, V2 = V1, N)])
dim(test) # [1] 15318 3
length(unique(test$V1)) # 567
length(unique(test$V2)) # 567
test2 <- dcast(test, V1~V2, value.var = "N")[,-1]
dim(test2) # [1] 567 567
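If the other items exist only in a separate list (they never appear in any pair), they cannot show up in the dcast result. A minimal base R sketch of building the full matrix from such a list follows; the tiny pairs table and the all_items vector are made up for illustration and are not taken from the posted data:

```r
# Made-up pair similarities; items "d" and "e" never appear in any pair
pairs <- data.frame(V1 = c("a", "a", "b"),
                    V2 = c("b", "c", "c"),
                    N  = c(1.0, 0.4, 0.2),
                    stringsAsFactors = FALSE)
all_items <- c("a", "b", "c", "d", "e")  # hypothetical complete item list

# Start from an all-zero similarity matrix covering every item
sim <- matrix(0, nrow = length(all_items), ncol = length(all_items),
              dimnames = list(all_items, all_items))

# Fill both triangles using (row name, column name) index matrices
sim[cbind(pairs$V1, pairs$V2)] <- pairs$N
sim[cbind(pairs$V2, pairs$V1)] <- pairs$N
diag(sim) <- 1

dim(sim)
```

Items without any pair end up with similarity 0 to everything else, so under a 1 - similarity distance they all sit at the maximum distance from every other item. Whether including them helps the clustering is a separate question.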
@Mayo, forget what the data scientist said about PAM. Since you've mentioned this work is for a thesis, from an academic viewpoint your current justification for why PAM is required does not hold up. Essentially, you need to either prove or justify why PAM is necessary for your case study. And given the nature of the (continuous) variables in the dataset, V1, V2, N, I do not see why PAM is applicable here (as I mentioned in the comments, PAM works best for mixed variables).
Continuing further, see this post on correlation detection in R:
# Objective: Detect Highly Correlated variables, visualize them and remove them
data("mtcars")
my_data <- mtcars[, c(1,3,4,5,6,7)]
# print the first 6 rows
head(my_data, 6)
# compute correlation matrix using the cor()
res<- cor(my_data)
round(res, 2) # Unfortunately, the function cor() returns only the correlation coefficients between variables.
# Visualize the correlation
# install.packages("corrplot")
library(corrplot)
corrplot(res, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
# Positive correlations are displayed in blue and negative correlations in red color. Color intensity and the size of the circle are proportional to the correlation coefficients. In the right side of the correlogram, the legend color shows the correlation coefficients and the corresponding colors.
# tl.col (for text label color) and tl.srt (for text label string rotation) are used to change text colors and rotations.
#Apply correlation filter at 0.80,
#install.packages("caret", dependencies = TRUE)
library(caret)
highlyCor <- colnames(my_data)[findCorrelation(res, cutoff = 0.80, verbose = TRUE)]
# show highly correlated variables
highlyCor
# [1] "disp" "mpg"
removeHighCor<- findCorrelation(res, cutoff = 0.80) # returns indices of highly correlated variables
# remove highly correlated variables from the dataset
my_data<- my_data[,-removeHighCor]
dim(my_data)
# [1] 32 4
Hope this helps.
Problem
I have a data frame that consists of more than 5 variables at any time, and I am trying to run K-Means on it. Because K-Means is greatly affected by outliers, I've spent a few hours looking for a way to calculate and remove multivariate outliers. Most of the examples I found use only 2 variables.
Possible Solutions Explored
mvoutlier - Kind user here noted that mvoutlier may be what I need.
Another Outlier Detection Method - Poster here commented with a mix of R functions to generate an ordered list of outliers.
Issues thus Far
Regarding mvoutlier, I was unable to generate a result because it noted my dataset contained negatives and it could not work because of that. I'm not sure how to alter my data to only positive since I need negatives in the set I am working with.
Regarding Another Outlier Detection Method I was able to come up with a list of outliers, but am unsure how to exclude them from the current data set. Also, I do know that these calculations are done after K-Means, and thus I probably will apply the math prior to doing K-Means.
Minimal Verifiable Example
Unfortunately, the dataset I'm using cannot be shown to anyone, so you'll need any random data set with more than 3 variables. The code below is adapted from the Another Outlier Detection Method post to work with my data. It should also work with a random data set, as long as it has enough rows for 5 cluster centers to be reasonable.
clusterAmount <- 5
cluster <- kmeans(clusterDataFrame, centers = clusterAmount, nstart = 20)
# Center of the cluster each row was assigned to
centers <- cluster$centers[cluster$cluster, ]
# Euclidean distance from each row to its cluster center
distances <- sqrt(rowSums((clusterDataFrame - centers)^2))
# Scale each distance by the mean distance within its cluster
m <- tapply(distances, cluster$cluster, mean)
d <- distances/(m[cluster$cluster])
# 1% outliers
outliers <- d[order(d, decreasing = TRUE)][1:(nrow(clusterDataFrame) * .01)]
Output: A list of outliers ordered by their distance away from the center they reside in I believe. The issue then is getting these results paired up to the respective rows in the data frame and removing them so I can start my K-Means procedure. (Note, while in the example I used K-Means prior to removing outliers, I'll make sure to take the necessary steps and remove outliers before K-Means upon solution).
Question
With Another Outlier Detection Method example in place, how do I pair the results with the information in my current data frame to exclude those rows before doing K-Means?
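One way to keep the link to the original rows is to work with the row positions returned by order() instead of the sorted distance values. A minimal sketch, using the numeric iris columns as a stand-in for the private data (the 3 centers and the 5% cutoff are arbitrary choices for this example):

```r
set.seed(42)
clusterDataFrame <- iris[, 1:4]  # stand-in for the real data

cluster <- kmeans(clusterDataFrame, centers = 3, nstart = 20)
centers <- cluster$centers[cluster$cluster, ]

# Distance of each row to its own cluster center, relative to the cluster mean
distances <- sqrt(rowSums((clusterDataFrame - centers)^2))
m <- tapply(distances, cluster$cluster, mean)
d <- distances / m[cluster$cluster]

# Row positions of the most extreme 5% (order() returns positions, not values)
nOutliers <- ceiling(nrow(clusterDataFrame) * 0.05)
outlierRows <- order(d, decreasing = TRUE)[seq_len(nOutliers)]

# Drop those rows; K-Means can then be rerun on the cleaned data
cleaned <- clusterDataFrame[-outlierRows, ]
nrow(cleaned)
```

Because outlierRows holds positions in clusterDataFrame, the same vector can be used to inspect the flagged rows (clusterDataFrame[outlierRows, ]) before removing them.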
I don't know if this is exactly helpful, but if your data is multivariate normal you may want to try out a Wilks (1963) based method. Wilks showed that the Mahalanobis distances of multivariate normal data follow a Beta distribution. We can take advantage of this (iris Sepal data used as an example):
test.dat <- iris[, c(1, 2)] # Sepal.Length and Sepal.Width
Wilks.function <- function(dat){
  n <- nrow(dat)
  p <- ncol(dat)
  # beta distribution
  u <- n * mahalanobis(dat, center = colMeans(dat), cov = cov(dat))/(n-1)^2
  w <- 1 - u
  F.stat <- ((n-p-1)/p) * (1/w-1) # computing F statistic
  p.value <- 1 - round( pf(F.stat, p, n-p-1), 3) # p value for each row
  cbind(w, F.stat, p = p.value)
}
plot(test.dat,
col = "blue",
pch = c(15,16,17)[as.numeric(iris$Species)])
dat.rows <- Wilks.function(test.dat); head(dat.rows)
# w F.stat p
#[1,] 0.9888813 0.8264127 0.440
#[2,] 0.9907488 0.6863139 0.505
#[3,] 0.9869330 0.9731436 0.380
#[4,] 0.9847254 1.1400985 0.323
#[5,] 0.9843166 1.1710961 0.313
#[6,] 0.9740961 1.9545687 0.145
Then we can simply find which rows of our multivariate data differ significantly from what the Beta distribution predicts, that is, the rows whose p-value falls below the chosen significance level:
outliers <- which(dat.rows[,"p"] < 0.05)
points(test.dat[outliers,],
col = "red",
pch = c(15,16,17)[as.numeric(iris$Species[outliers])])
I am trying to use the NbClust method in R to determine the best number of clusters in a cluster analysis following the approach in the book from Manning.
However, I get an error message saying:
Error in hclust(md, method = "average"): must have n >= 2 objects to cluster.
Even though the hclust method appears to work. Therefore, I assume that the problem is (which is also stated by the error message), that NbClust tries to create groups with only one object inside.
Here is my code:
mydata = read.table("PLR_2016_WM_55_5_Familienstand_aufbereitet.csv", skip = 0, sep = ";", header = TRUE)
mydata <- mydata[-1] # Drop the first column (int)
data.transformed <- t(mydata) # Transpose the matrix
data.scale <- scale(data.transformed) # Scaling of table
data.dist <- dist(data.scale) # Calculates distances between points
fit.average <- hclust(data.dist, method = "average")
plot(fit.average, hang = -1, cex = .8, main = "Average Linkage Clustering")
library(NbClust)
nc <- NbClust(data.scale, distance="euclidean",
min.nc=2, max.nc=15, method="average")
I found a similar problem here, but I was not able to adapt the code.
There are some problems in your dataset.
The last 4 rows do not contain data and must be deleted.
mydata <- read.table("PLR_2016_WM_55_5_Familienstand_aufbereitet.csv", skip = 0, sep = ";", header = TRUE)
mydata <- mydata[1:(nrow(mydata)-4),]
mydata[,1] <- as.numeric(mydata[,1])
Now rescale the dataset:
data.transformed <- t(mydata) # Transformation of matrix
data.scale <- scale(data.transformed) # Scaling of table
For some reason data.scale is not a full rank matrix:
dim(data.scale)
# [1] 72 447
qr(data.scale)$rank
# [1] 71
Hence, we delete a row from data.scale and transpose it:
data.scale <- t(data.scale[-72,])
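One likely contributor to the missing rank (an assumption about this dataset, not something checked against the file) is the scaling itself: scale() centers every column, so the rows of the result sum to the zero vector, and a matrix with fewer rows than columns loses one rank. A minimal sketch with random data:

```r
set.seed(1)
m <- matrix(rnorm(5 * 8), nrow = 5, ncol = 8)  # 5 rows, 8 columns, like 72 x 447 in miniature

rank.raw    <- qr(m)$rank         # full row rank for random data
rank.scaled <- qr(scale(m))$rank  # one less after column centering

c(raw = rank.raw, scaled = rank.scaled)

# Each centered column sums to (numerically) zero, so the rows are linearly dependent
max(abs(colSums(scale(m))))
```

This only shows that a wide matrix cannot have full row rank after column centering. It does not rule out additional dependencies in the real data.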
Now the dataset is ready for NbClust.
library(NbClust)
nc <- NbClust(data=data.scale, distance="euclidean",
min.nc=2, max.nc=15, method="average")
The output is
[1] "Frey index : No clustering structure in this data set"
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 8 proposed 2 as the best number of clusters
* 4 proposed 3 as the best number of clusters
* 8 proposed 4 as the best number of clusters
* 1 proposed 5 as the best number of clusters
* 1 proposed 8 as the best number of clusters
* 1 proposed 11 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 2
*******************************************************************
I have decided to learn R and am going through the book Introduction to Scientific Programming and Simulation Using R (http://www.ms.unimelb.edu.au/spuRs/).
I am currently stuck on chapter 7 question 3 of the book, the question is:
Consider the following very simple genetic model. A population consists of
equal numbers of two sexes: male and female. At each generation men and
women are paired at random, and each pair produces exactly two offspring,
one male and one female. We are interested in the distribution of height
from one generation to the next. Suppose that the height of both children
is just the average of the height of their parents, how will the distribution
of height change across generations?
Represent the heights of the current generation as a dataframe with two
variables, m and f, for the two sexes. The command rnorm(100, 160, 20)
will generate a vector of length 100, according to the normal distribution
with mean 160 and standard deviation 20 (see Section 16.5.1). We use it to
randomly generate the population at generation 1:
pop <- data.frame(m = rnorm(100, 160, 20), f = rnorm(100, 160, 20))
The command sample(x, size = length(x)) will return a random sample
of size size taken from the vector x (without replacement). (It will also
sample with replacement, if the optional argument replace is set to TRUE.)
The following function takes the dataframe pop and randomly permutes the
ordering of the men. Men and women are then paired according to rows,
and heights for the next generation are calculated by taking the mean of
each row. The function returns a dataframe with the same structure, giving
the heights of the next generation.
next.gen <- function(pop) {
pop$m <- sample(pop$m)
pop$m <- apply(pop, 1, mean)
pop$f <- pop$m
return(pop)
}
Use the function next.gen to generate nine generations, then use the lattice
function histogram to plot the distribution of male heights in each
generation, as in Figure 7.7. The phenomenon you see is called regression
to the mean.
Hint: construct a dataframe with variables height and generation, where
each row represents a single man.
I have constructed a blank data frame:
generations <- data.frame(gen="", height="")
For now I am trying to get just the first generation height information into it, so I run:
next.gen(pop)
generations$height <- pop$m
and I get the following error:
Error in `$<-.data.frame`(`*tmp*`, "height", value = c(165.208323681597, :
replacement has 100 rows, data has 1
I understand that I'm trying to squeeze the 100 values of pop$m into the single row of generations$height, and that this is causing the problem, but I do not know how to fix it. I thought that a blank data frame would be flexible enough to add rows as they are copied from the pop data frame.
I tried then to run this code:
generations <- pop$m
And I get 100 values but that just turns my generations dataframe into a vector I think and running
generations
Just lists the values copied in a vector only.
I think I am approaching the first step wrong, is my dataframe definition correct? Why can't I copy row information from 1 data frame into an empty one and just adjust the size of the empty data frame as needed?
Thank you
I'm unsure of the exact output you are looking for, but here is an approach that should be simple enough to follow. Note that there are plenty of other workable approaches.
pop <- data.frame(m = rnorm(100, 160, 20), f = rnorm(100, 160, 20))
next.gen <- function(pop) {
pop$m <- sample(pop$m)
pop$m <- apply(pop, 1, mean)
pop$f <- pop$m
return(pop)
}
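# the code
test <- list()
gen <- pop # generation 1 is the initial population
for (i in 1:9) {
test[[i]] <- gen["m"]
test[[i]]$generation <- paste0("g", i)
gen <- next.gen(gen) # advance so each generation builds on the previous one
}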
library(data.table)
test2 <- rbindlist(test)
# result
m generation
1: 174.6558 g1
2: 143.2617 g1
3: 185.2829 g1
4: 168.9719 g1
5: 151.6948 g1
---
896: 159.6091 g9
897: 161.4546 g9
898: 171.8679 g9
899: 138.4982 g9
900: 152.7390 g9
Try:
> generations <- data.frame(gen="", height="", stringsAsFactors=F)
> for(i in 1:length(pop$m)) generations[i,] = c("",pop$m[i])
> generations
gen height
1 136.70042632318
2 153.985392293761
3 122.077485676327
4 166.582538529591
5 170.751368839498
6 190.8894492681
...
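The original error arises because a data frame created with one row cannot receive a 100-element column; assigning a longer vector does not grow it. A base R alternative is to build one data frame per generation and bind them at the end, which also keeps height numeric. A minimal sketch (the seed and the names n.gen and rows are arbitrary choices for this example):

```r
set.seed(123)

next.gen <- function(pop) {
  pop$m <- sample(pop$m)        # randomly re-pair the men
  pop$m <- apply(pop, 1, mean)  # children get the mean height of their parents
  pop$f <- pop$m
  return(pop)
}

pop <- data.frame(m = rnorm(100, 160, 20), f = rnorm(100, 160, 20))

n.gen <- 9
rows <- vector("list", n.gen)
for (g in seq_len(n.gen)) {
  # one row per man in generation g
  rows[[g]] <- data.frame(generation = g, height = pop$m)
  pop <- next.gen(pop)          # move on to the next generation
}
generations <- do.call(rbind, rows)

# The spread of heights shrinks from one generation to the next
sds <- tapply(generations$height, generations$generation, sd)
round(sds, 2)
```

The resulting data frame has the height and generation variables the hint asks for, so it can be passed to the lattice function named in the exercise, for example histogram(~ height | factor(generation), data = generations).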