I'm having trouble predicting cluster labels for test data from a DBSCAN clustering model fitted on the training data.
I used a Gower distance matrix when creating the model:
> gowerdist_train <- daisy(analdata_train,
metric = "gower",
stand = FALSE,
type = list(asymm = c(5,6)))
Using this gowerdist matrix, the dbscan clustering model created was:
> sb <- dbscan(gowerdist_train, eps = .23, minPts = 50)
Then I try to use predict to label a test dataset using the above dbscan object:
> predict(sb, newdata = analdata_test, data = analdata_train)
But I receive the following error:
Error in frNN(rbind(data, newdata), eps = object$eps, sort = TRUE,
...) : x has to be a numeric matrix
My guess is that the error arises because predict has no Gower distance matrix for the test data.
My question is: should I create a Gower distance matrix for all the data (analdata_train + analdata_test) and feed it into predict? How else would the algorithm know the distance from each test point to the training data in order to label it?
In that case, would the newdata parameter be the new Gower distance matrix that contains ALL (train + test) data, and would the data parameter in predict be the training distance matrix, gowerdist_train?
What I am not sure about is how the predict algorithm would distinguish between the test and train data in the newly created gowerdist_all matrix.
The two matrices (the new gowerdist for all data and gowerdist_train) would obviously not have the same dimensions. Also, it doesn't make sense to me to create a Gower distance matrix only for the test data, because the distances must be measured from the test points to the training data, not among the test points themselves.
Edit:
I tried using a Gower distance matrix for all data (train + test) as newdata and received an error when I passed it to predict:
> gowerdist_all <- daisy(rbind(analdata_train, analdata_test),
metric = "gower",
stand = FALSE,
type = list(asymm = c(5,6)))
> test_sb_label <- predict(sb, newdata = gowerdist_all, data = gowerdist_train)
ERROR: Error in 1:nrow(data) : argument of length 0 In addition:
Warning message: In rbind(data, newdata) : number of columns of
result is not a multiple of vector length (arg 1)
So, my suggested solution doesn't work.
I decided to write code that uses the kNN function in dbscan to predict cluster labels from a Gower distance matrix. The code is not very pretty and definitely not computationally efficient, but it works. Happy to hear any suggestions that would improve it.
The pseudocode is:
1) calculate a new Gower distance matrix for all data, including test and train
2) use the above distance matrix in the kNN function (dbscan package) to determine the k nearest neighbours of each test data point
3) determine the cluster labels of those nearest points for each test point. Some of them will have no cluster label because they are test points themselves
4) create a count matrix to count the frequency of each cluster among the k nearest points of each test point
5) use a very simple likelihood calculation to choose the cluster for the test point based on its neighbours' clusters (the maximum frequency). This part also considers the neighbouring test points: a cluster is chosen only if its count still beats every other cluster's count after the number of unlabeled neighbouring test points is added to each of those other clusters. Otherwise, no cluster is decided for that test point and it waits for the next iteration, when hopefully more of its neighbouring test points have had their cluster labels decided based on their own neighbours
6) repeat the above (steps 2-5) until all clusters have been decided
** Note: this algorithm doesn't always converge (once you do the math, it's obvious why). So in the code I break out of the loop when the number of unclustered test points hasn't changed for a while. Then I repeat 2-6 with a new kNN (change the number of nearest neighbours and run the code again). This ensures that more points are involved in the decision in the next round. I've tried both larger and smaller k values and both work. It would be good to know which one is better. So far I haven't had to run the code more than twice to decide the clusters for the test data points.
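As a simplified illustration of steps 1-5, here is a single-pass sketch (not the iterative code that follows): it votes among training neighbours only, so no test point ever waits on another test point. The toy data, the names `d_cross` and `predicted`, and the use of base R `dist()` as a stand-in for the Gower matrix from `daisy()` are all assumptions made for the example. With real DBSCAN output, noise points carry label 0 and would be treated here like any other label.

```r
# Toy data: two well-separated groups stand in for two clusters
set.seed(1)
train <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
               matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2))
train_labels <- rep(c(1, 2), each = 10)   # assumed cluster labels of the training rows
test <- rbind(c(0, 0), c(5, 5), c(0.2, -0.1))

n_test <- nrow(test)

# Step 1: one distance matrix over test + train (test rows first).
# dist() is only a stand-in; a Gower "dist" object would be used the same way.
d_all <- as.matrix(dist(rbind(test, train)))

# Keep only the test-to-train block, so neighbours are always labelled points
d_cross <- d_all[seq_len(n_test), -seq_len(n_test), drop = FALSE]

# Steps 2-5: k nearest training points, then a majority vote over their labels
k <- 5
predicted <- apply(d_cross, 1, function(d) {
  nearest <- order(d)[seq_len(k)]
  votes <- table(train_labels[nearest])
  as.integer(names(votes)[which.max(votes)])
})
predicted
```

Restricting the vote to training neighbours trades the iterative propagation described above for simplicity; it is only reasonable when the training set is dense enough around every test point.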
Here is the code:
#calculate gower distance for all data (test + train)
gowerdist_test <- daisy(all_data[rangeofdataforgowerdist],
metric = "gower",
stand = FALSE,
type = list(asymm = listofasymmvars),
weights = Weights)
summary(gowerdist_test)
Then use the code below to label clusters for test data.
#library(dbscan)
# find the k nearest neighbours for each point and order them by distance
iteration_MAX <- 50
iteration_current <- 0
maxUnclusterRepeatNum <- 10
repeatedUnclustNum <- 0
unclusteredNum <- sum(is.na(all_data$Cluster))
previousUnclustereNum <- sum(is.na(all_data$Cluster))
nn_k <- 30 #number of nearest neighbours
while (anyNA(all_data$Cluster) & iteration_current < iteration_MAX)
{
if (repeatedUnclustNum >= maxUnclusterRepeatNum) {
print(paste("Max number of repetition (", maxUnclusterRepeatNum ,") for same unclustered data has reached. Clustering terminated unsuccessfully."))
invisible(gc())
break;
}
nn_test <- kNN(gowerdist_test, k = nn_k, sort = TRUE)
# for the TEST points in all data, find the closest TRAIN points and decide statistically which cluster they could belong to, based on the clusters of the nearest TRAIN points
test_matrix <- nn_test$id[1: nrow(analdata_test),] #create matrix of test data knn id's
numClusts <- nlevels(as.factor(sb_train$cluster))
NameClusts <- as.character(levels(as.factor(sb_train$cluster)))
count_clusters <- matrix(0, nrow = nrow(analdata_test), ncol = numClusts + 1) #create a count matrix that would count number of clusters + NA
colnames(count_clusters) <- c("NA", NameClusts) #name each column of the count matrix to cluster numbers
# get the cluster number of each k nearest neighbour of each test point
for (i in 1:nrow(analdata_test))
for (j in 1:nn_k)
{
test_matrix[i,j] <- all_data[nn_test$id[i,j], "Cluster"]
}
# populate the count matrix for the total clusters of the neighbours for each test point
for (i in 1:nrow(analdata_test))
for (j in 1:nn_k)
{
if (!is.na(test_matrix[i,j]))
count_clusters[i, c(as.character(test_matrix[i,j]))] <- count_clusters[i, c(as.character(test_matrix[i,j]))] + 1
else
count_clusters[i, c("NA")] <- count_clusters[i, c("NA")] + 1
}
# add NA's (TEST points) to the other clusters for comparison
count_clusters_withNA <- count_clusters
for (i in 2:ncol(count_clusters))
{
count_clusters_withNA[,i] <- t(rowSums(count_clusters[,c(1,i)]))
}
# This block of code decides the maximum count of cluster for each row considering the number other test points (NA clusters) in the neighbourhood
max_col_countclusters <- apply(count_clusters,1,which.max) #get the column that corresponds to the maximum value of each row
for (i in 1:length(max_col_countclusters)) #insert the maximum value of each row in its associated column in count_clusters_withNA
count_clusters_withNA[i, max_col_countclusters[i]] <- count_clusters[i, max_col_countclusters[i]]
max_col_countclusters_withNA <- apply(count_clusters_withNA,1,which.max) #get the column that corresponds to the maximum value of each row with NA added
compareCountClust <- max_col_countclusters_withNA == max_col_countclusters #compare the two count matrices
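test_rows <- 1:nrow(analdata_test) #test points occupy the first rows of all_data
new_labels <- c(NA, NameClusts)[max_col_countclusters] #column 1 is the NA column, so it maps to NA and the labels stay aligned with the rows
all_data$Cluster[test_rows] <- ifelse(compareCountClust, new_labels, all_data$Cluster[test_rows])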
iteration_current <- iteration_current + 1
unclusteredNum <- sum(is.na(all_data$Cluster))
if (previousUnclustereNum == unclusteredNum)
repeatedUnclustNum <- repeatedUnclustNum + 1
else {
repeatedUnclustNum <- 0
previousUnclustereNum <- unclusteredNum
}
print(paste("Iteration: ", iteration_current, " - Number of remaining unclustered:", sum(is.na(all_data$Cluster))))
if (unclusteredNum == 0)
print("Cluster labeling successfully Completed.")
invisible(gc())
}
I guess you can use this with any other type of clustering algorithm. It doesn't matter how you decided the cluster labels for the train data, as long as they are in your all_data before running the code.
Hope this helps.
It is not the most efficient or rigorous code, so I'm happy to see suggestions on how to improve it.
*Note: I used t-SNE to compare the clustering of the train data with that of the test data, and it looks impressively clean, so it seems to be working.
I tried to use princomp() and principal() to do PCA in R with the data set USArrests. However, I got two different results for the loadings/rotation and scores.
First, I centered and normalised the original data frame so that it is easier to compare the outputs.
library(psych)
trans_func <- function(x){
x <- (x-mean(x))/sd(x)
return(x)
}
A <- USArrests
USArrests <- apply(USArrests, 2, trans_func)
princompPCA <- princomp(USArrests, cor = TRUE)
principalPCA <- principal(USArrests, nfactors=4 , scores=TRUE, rotate = "none",scale=TRUE)
Then I got the results for the loadings and scores using the following commands:
princompPCA$loadings
principalPCA$loadings
Could you please help me understand why there is a difference, and how these results should be interpreted?
At the very end of the help document of ?principal:
"The eigen vectors are rescaled by the sqrt of the eigen values to produce the component loadings more typical in factor analysis."
So principal returns the scaled loadings. In fact, principal produces a factor model estimated by the principal component method.
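A minimal base-R sketch of that rescaling, assuming the built-in USArrests data and the correlation matrix (the name `scaled_loadings` is introduced only for this illustration and is not a psych object):

```r
R <- cor(USArrests)
ev <- eigen(R)
pc <- princomp(USArrests, cor = TRUE)

# princomp() loadings are the unit-length eigenvectors (signs may differ)
max(abs(abs(unclass(pc$loadings)) - abs(ev$vectors)))

# Rescaling each eigenvector by sqrt(eigenvalue) gives factor-analysis-style loadings
scaled_loadings <- ev$vectors %*% diag(sqrt(ev$values))

colSums(ev$vectors^2)       # each column has squared length 1
colSums(scaled_loadings^2)  # each column now sums to its eigenvalue
```

The two sets of loadings therefore point in the same directions and differ only in column length.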
Four years later, I would like to provide a more accurate answer to this question. I use the iris data as an example.
data = iris[, 1:4]
First, do PCA by the eigen-decomposition
eigen_res = eigen(cov(data))
l = eigen_res$values
q = eigen_res$vectors
Then the eigenvector corresponding to the largest eigenvalue gives the loadings of the first principal component:
q[,1]
We can treat this as a reference, i.e. the correct answer. Now we check the results from different R functions.
First, by function 'princomp'
res1 = princomp(data)
res1$loadings[,1]
# compare with
q[,1]
No problem: this function returns the same result as 'eigen' (possibly up to sign). Now move on to 'principal':
library(psych)
res2 = principal(data, nfactors=4, rotate="none")
# the loadings of the first PC is
res2$loadings[,1]
# compare it with the results by eigendecomposition
sqrt(l[1])*q[,1] # re-scale the eigen vector by sqrt of eigen value
You may find they are still different. The problem is that the 'principal' function performs the eigendecomposition on the correlation matrix by default. Note that PCA is not invariant to rescaling the variables. If you modify the code as
res2 = principal(data, nfactors=4, rotate="none", cor="cov")
# the loadings of the first PC is
res2$loadings[,1]
# compare it with the results by eigendecomposition
sqrt(l[1])*q[,1] # re-scale the eigen vector by sqrt of eigen value
Now you will get the same results as 'eigen' and 'princomp', up to the rescaling by the square root of the eigenvalue.
To summarize:
If you want to do PCA, you are better off using the 'princomp' function.
PCA is a special case of the factor model, or a simplified version of it. It is equivalent to an eigendecomposition.
We can apply PCA to get an approximation of a factor model. It ignores the specific factors, i.e. the epsilons in a factor model. So if you change the number of factors in your model, you will get the same estimates of the loadings. This differs from maximum likelihood estimation.
If you are estimating a factor model, you are better off using the 'principal' function, since it provides more features, such as rotation and several methods for calculating the scores.
Rescaling the loadings of a PCA model doesn't affect the results much, since you still project the data onto the same optimal directions, i.e. those that maximize the variance of the resulting PCs.
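The point that PCA is not invariant to rescaling the variables can be checked with base R alone. This is a small sketch on the same iris columns; the names `q_cov`, `q_cor` and `q_std` are made up for the illustration:

```r
data <- iris[, 1:4]

# First eigenvector from the covariance matrix and from the correlation matrix
q_cov <- eigen(cov(data))$vectors[, 1]
q_cor <- eigen(cor(data))$vectors[, 1]

# Standardizing the variables first turns the covariance matrix into the correlation matrix
q_std <- eigen(cov(scale(data)))$vectors[, 1]

round(rbind(covariance = abs(q_cov),
            correlation = abs(q_cor),
            standardized = abs(q_std)), 3)
```

The covariance-based direction differs from the other two, which agree with each other; that is why the default in 'principal' (correlation) and the default in 'princomp' (covariance) give different loadings on unscaled data.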
ev <- eigen(R) # R is a correlation matrix of DATA
ev$vectors %*% diag(ev$values) %*% t(ev$vectors)
pc <- princomp(scale(DATA, center = F, scale = T),cor=TRUE)
p <-principal(DATA, rotate="none")
#eigen values
ev$values^0.5
pc$sdev
p$values^0.5
#eigen vectors - loadings
ev$vectors
pc$loadings
p$weights %*% diag(p$values^0.5)
pc$loadings %*% diag(pc$sdev)
p$loadings
#weights
ee <- diag(0,2)
for (j in 1:2) {
for (i in 1:2) {
ee[i,j] <- ev$vectors[i,j]/p$values[j]^0.5
}
};ee
#scores
s <- as.matrix(scale(DATA, center = T, scale = T)) %*% ev$vectors
scale(s)
p$scores
scale(pc$scores)
I am working with the cumulative emergence of flies over time (taken at irregular intervals) over many summers (though first I am just trying to make one year work). The cumulative emergence follows a sigmoid pattern and I want to create a maximum likelihood estimation of a 3-parameter Weibull cumulative distribution function. The three-parameter models I've been trying to use in the fitdistrplus package keep giving me an error. I think this must have something to do with how my data is structured, but I cannot figure it out. Obviously I want it to read each point as an x (degree days) and a y (emergence) value, but it seems to be unable to read two columns. The main error I'm getting says "Non-numeric argument to mathematical function" or (with slightly different code) "data must be a numeric vector of length greater than 1". Below is my code including added columns in the df_dd_em dataframe for cumulative emergence and percent emergence in case that is useful.
degree_days <- c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94)
emergence <- c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0)
cum_em <- cumsum(emergence)
df_dd_em <- data.frame (degree_days, emergence, cum_em)
df_dd_em$percent <- ave(df_dd_em$emergence, FUN = function(df_dd_em) 100*(df_dd_em)/46)
df_dd_em$cum_per <- ave(df_dd_em$cum_em, FUN = function(df_dd_em) 100*(df_dd_em)/46)
x <- pweibull(df_dd_em[c(1,3)],shape=5)
dframe2.mle <- fitdist(x, "weibull",method='mle')
Here's my best guess at what you're after:
Set up data:
dd <- data.frame(degree_days=c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94),
emergence=c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0))
dd <- transform(dd,cum_em=cumsum(emergence))
We're actually going to fit an "interval-censored" distribution, i.e. we model the probability of emergence between successive degree-day observations. This version assumes that each count refers to the interval ending at its degree-day value (so the first count covers everything before the first observation); using c(x,Inf) instead would make each count refer to the interval starting at its degree-day value.
library(bbmle)
## y*log(p) allowing for 0/0 occurrences:
y_log_p <- function(y,p) ifelse(y==0 & p==0,0,y*log(p))
NLLfun <- function(scale,shape,x=dd$degree_days,y=dd$emergence) {
prob <- pmax(diff(pweibull(c(-Inf,x), ## or (c(x,Inf))
shape=shape,scale=scale)),1e-6)
## multinomial probability
-sum(y_log_p(y,prob))
}
library(bbmle)
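To see what the `diff(pweibull(c(-Inf, x), ...))` term inside NLLfun produces, here is a small sketch with made-up break points and made-up parameter values (shape = 5, scale = 1700 are illustrative assumptions, not fitted values):

```r
# Hypothetical observation times (degree days)
x <- c(1000, 1500, 2000, 2500)

# Probability mass falling in (-Inf, 1000], (1000, 1500], (1500, 2000], (2000, 2500]
p <- diff(pweibull(c(-Inf, x), shape = 5, scale = 1700))
p

# The interval probabilities add up to the CDF at the last observation;
# whatever is left over is the mass beyond the last observation
sum(p)
1 - sum(p)
```

Each emergence count is then paired with one of these interval probabilities in the multinomial log-likelihood.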
I should probably have used something more systematic like the method of moments (i.e. matching the mean and variance of a Weibull distribution with the mean and variance of the data), but I just hacked around a bit to find plausible starting values:
## preliminary look (method of moments would be better)
scvec <- 10^(seq(0,4,length=101))
plot(scvec,sapply(scvec,NLLfun,shape=1))
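For reference, the method-of-moments idea mentioned above could look roughly like the sketch below. It treats every emergence as if it happened exactly at the recorded degree-day value (so it ignores the interval censoring) and is only meant to produce starting values; the names `cv2_weibull`, `shape0` and `scale0` are introduced here for illustration.

```r
degree_days <- c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
                 1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
                 1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
                 2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
                 2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
                 2707.36,2773.82,2816.39,2863.94)
emergence <- c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
               0,0,0,0,1,0,0,0,0,0)

# Mean and variance of degree days, weighted by the emergence counts
w <- emergence / sum(emergence)
m <- sum(w * degree_days)
v <- sum(w * (degree_days - m)^2)

# The squared coefficient of variation of a Weibull depends on the shape only
cv2_weibull <- function(shape) gamma(1 + 2 / shape) / gamma(1 + 1 / shape)^2 - 1

# Match the CV to get the shape, then match the mean to get the scale
shape0 <- uniroot(function(k) cv2_weibull(k) - v / m^2,
                  interval = c(0.5, 50))$root
scale0 <- m / gamma(1 + 1 / shape0)
c(scale = scale0, shape = shape0)
```

These values could be passed as `start = list(scale = scale0, shape = shape0)` to mle2 in place of hand-picked ones.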
It's important to use parscale to let R know that the parameters are on very different scales:
startvals <- list(scale=1000,shape=1)
m1 <- mle2(NLLfun,start=startvals,
control=list(parscale=unlist(startvals)))
Now try with a three-parameter Weibull (as originally requested) -- requires only a slight modification of what we already have:
library(FAdist)
NLLfun2 <- function(scale,shape,thres,
x=dd$degree_days,y=dd$emergence) {
prob <- pmax(diff(pweibull3(c(-Inf,x),shape=shape,scale=scale,thres)),
1e-6)
## multinomial probability
-sum(y_log_p(y,prob))
}
startvals2 <- list(scale=1000,shape=1,thres=100)
m2 <- mle2(NLLfun2,start=startvals2,
control=list(parscale=unlist(startvals2)))
Looks like the three-parameter fit is much better:
library(emdbook)
AICtab(m1,m2)
## dAIC df
## m2 0.0 3
## m1 21.7 2
And here's the graphical summary:
with(dd,plot(cum_em~degree_days,cex=3))
with(as.list(coef(m1)),curve(sum(dd$emergence)*
pweibull(x,shape=shape,scale=scale),col=2,
add=TRUE))
with(as.list(coef(m2)),curve(sum(dd$emergence)*
pweibull3(x,shape=shape,
scale=scale,thres=thres),col=4,
add=TRUE))
(could also do this more elegantly with ggplot2 ...)
These don't seem like spectacularly good fits, but they're sane. (You could in principle do a chi-squared goodness-of-fit test based on the expected number of emergences per interval, and accounting for the fact that you've fitted a three-parameter model, although the values might be a bit low ...)
Confidence intervals on the fit are a bit of a nuisance; your choices are (1) bootstrapping; (2) parametric bootstrapping (resample parameters assuming a multivariate normal sampling distribution of the parameter estimates); (3) delta method.
Using bbmle::mle2 makes it easy to do things like get profile confidence intervals:
confint(m1)
## 2.5 % 97.5 %
## scale 1576.685652 1777.437283
## shape 4.223867 6.318481
dd <- data.frame(degree_days=c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94),
emergence=c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0))
dd$cum_em <- cumsum(dd$emergence)
dd$percent <- ave(dd$emergence, FUN = function(dd) 100*(dd)/46)
dd$cum_per <- ave(dd$cum_em, FUN = function(dd) 100*(dd)/46)
dd <- transform(dd)
#start 3 parameter model
library(bbmle) #for mle2()
library(FAdist) #for pweibull3()
## y*log(p) allowing for 0/0 occurrences:
y_log_p <- function(y,p) ifelse(y==0 & p==0,0,y*log(p))
NLLfun2 <- function(scale,shape,thres,
x=dd$degree_days,y=dd$percent) {
prob <- pmax(diff(pweibull3(c(-Inf,x),shape=shape,scale=scale,thres)),
1e-6)
## multinomial probability
-sum(y_log_p(y,prob))
}
startvals2 <- list(scale=1000,shape=1,thres=100)
m2 <- mle2(NLLfun2,start=startvals2,
control=list(parscale=unlist(startvals2)))
summary(m2)
#graphical summary
windows(5,5)
with(dd,plot(cum_per~degree_days,cex=3))
with(as.list(coef(m2)),curve(sum(dd$percent)*
pweibull3(x,shape=shape,
scale=scale,thres=thres),col=4,
add=TRUE))
In the following code I use bootstrapping to calculate the C.I. and the p-value under the null hypothesis that two different fertilizers applied to tomato plants do not differ in their effect on plant yield (the alternative being that the "improved" fertilizer is better). The first random sample (x) comes from plants where a standard fertilizer was used, while the second sample (y) comes from plants where an "improved" one was used.
x <- c(11.4,25.3,29.9,16.5,21.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
library(boot)
diff <- function(x,i) mean(x[i[6:11]]) - mean(x[i[1:5]])
b <- boot(total, diff, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
What I don't like about the code above is that the resampling is done as if there were only one sample of 11 values (treating the first 5 as belonging to sample x and the rest as belonging to sample y).
Could you show me how this code should be modified in order to draw resamples of size 5 with replacement from the first sample and separate resamples of size 6 from the second sample, so that bootstrap resampling would mimic the “separate samples” design that produced the original data?
EDIT2 :
Hack deleted as it was a wrong solution. Instead, one has to use the strata argument of the boot function:
total <- c(x,y)
id <- as.factor(c(rep("x",length(x)),rep("y",length(y))))
b <- boot(total, diff, strata=id, R = 10000)
...
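To make explicit what resampling within strata amounts to, here is a base-R sketch that does the two-sample resampling by hand, without the boot package. The seed, the name `boot_diffs` and the percentile interval are choices made for this illustration only:

```r
set.seed(42)
x <- c(11.4, 25.3, 29.9, 16.5, 21.1)
y <- c(23.7, 26.6, 28.5, 14.2, 17.9, 24.3)

# Each replicate draws 5 values from x and 6 values from y, with replacement,
# so the two groups are never mixed
boot_diffs <- replicate(10000,
  mean(sample(y, replace = TRUE)) - mean(sample(x, replace = TRUE)))

# Simple percentile interval for the difference in means
quantile(boot_diffs, c(0.025, 0.975))
```

This reproduces the "separate samples" design, which is appropriate for a confidence interval on the difference but, as discussed next, not for a p-value under the null hypothesis.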
Be aware you're not going to get even close to a correct estimate of your p.value :
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
b <- boot(total, diff, strata=id, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
> p.value
[1] 0.5162
How would you explain a p-value of 0.51 for two samples where all values of the second are higher than the highest value of the first?
The above code is fine to get a -biased- estimate of the confidence interval, but the significance testing about the difference should be done by permutation over the complete dataset.
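A permutation test of that kind can be sketched in base R. With only 11 observations every split into groups of 5 and 6 can be enumerated, so this version is exact rather than random; the names `idx`, `perm_diffs` and `p_value` are introduced for the illustration:

```r
x <- c(1.4, 2.3, 2.9, 1.5, 1.1)
y <- c(23.7, 26.6, 28.5, 14.2, 17.9, 24.3)
total <- c(x, y)

observed <- mean(y) - mean(x)

# Every way of choosing which 5 of the 11 values are labelled "x"
idx <- combn(length(total), length(x))

# Difference in means under each relabelling
perm_diffs <- apply(idx, 2, function(i) mean(total[-i]) - mean(total[i]))

# One-sided p-value: share of relabellings at least as extreme as the observed one
# (small tolerance guards against floating-point ties)
p_value <- mean(perm_diffs >= observed - 1e-12)
p_value
```

Because every y value exceeds every x value, only the observed labelling reaches the observed difference, so the p-value is 1 out of choose(11, 5) = 462 relabellings, in line with what the data suggest.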
Following John, I think the appropriate way to use bootstrap to test if the sums of these two different populations are significantly different is as follows:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
b_x <- boot(x, sum, R = 10000)
b_y <- boot(y, sum, R = 10000)
z<-(b_x$t0-b_y$t0)/sqrt(var(b_x$t[,1])+var(b_y$t[,1]))
pnorm(z)
So we can clearly reject the null that they are the same population. I may have missed a degree of freedom adjustment, I am not sure how bootstrapping works in that regard, but such an adjustment will not change your results drastically.
While the actual soil beds could be considered a stratified variable in some instances, this is not one of them. You only have the one manipulation, between the groups of plants. Therefore, your null hypothesis is that they really do come from the exact same population. Treating the items as if they're from a single set of 11 samples is the correct way to bootstrap in this case.
If you had two plots, and in each plot tried the different fertilizers over different seasons in a counterbalanced fashion, then the plots would be stratified samples and you'd want to treat them as such. But that isn't the case here.