Cluster center mean of DBSCAN in R?

Using dbscan in package fpc I am able to get an output of:
dbscan Pts=322 MinPts=20 eps=0.005
        0   1
seed    0 233
border 87   2
total  87 235
but I need to find the cluster center (mean of cluster with most seeds). Can anyone show me how to proceed with this?

You need to understand that as DBSCAN looks for arbitrarily shaped clusters, the mean can be well outside of the cluster. Looking at means of DBSCAN clusters therefore is not really sensible.
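A minimal sketch of that point, using a synthetic ring-shaped cluster rather than the asker's data (all names and values here are illustrative assumptions). The arithmetic mean falls in the hole of the ring, while a medoid is by construction one of the cluster's own points:

```r
# Illustration only: a synthetic ring-shaped "cluster" (not the asker's data).
set.seed(1)
theta  <- runif(500, 0, 2 * pi)
radius <- rnorm(500, mean = 1, sd = 0.02)
ring   <- cbind(x = radius * cos(theta), y = radius * sin(theta))

# The arithmetic mean lands near the origin, in the hole of the ring.
centre <- colMeans(ring)

# Distance from the mean to the closest actual cluster member.
gap <- min(sqrt(rowSums(sweep(ring, 2, centre)^2)))

# A medoid (the member with the smallest total distance to all others)
# is always a real point of the cluster.
d      <- as.matrix(dist(ring))
medoid <- ring[which.min(rowSums(d)), ]

print(centre)
print(gap)
print(medoid)
```

If a single representative point is needed for an arbitrarily shaped cluster, a medoid like this is usually a safer choice than the mean.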

Just index back into the original data using the cluster ID of your choice. Then you can easily do whatever further processing you want to the subset. Here is an example:
library(fpc)
n = 100
set.seed(12345)
data = matrix(rnorm(n*3), nrow=n)
data.ds = dbscan(data, 0.5)
> data.ds
dbscan Pts=100 MinPts=5 eps=0.5
        0 1 2 3
seed    0 1 3 1
border 83 4 4 4
total  83 5 7 5
> colMeans(data[data.ds$cluster==0, ])  # note: ID 0 is fpc's noise group; substitute the cluster ID you need
[1] 0.28521404 -0.02804152 -0.06836167
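To pick the cluster with the most seeds, as the question asks, the seed flags can be tallied per cluster ID first. The sketch below assumes the fitted fpc object exposes `cluster` and `isseed` as described in the fpc documentation; to keep it runnable without fpc, hypothetical stand-in vectors are used in place of `data.ds$cluster` and `data.ds$isseed`:

```r
# Stand-ins for data.ds$cluster and data.ds$isseed from fpc::dbscan
# (hypothetical values, chosen only so the sketch runs without fpc).
set.seed(42)
pts     <- matrix(rnorm(24), ncol = 2)
cluster <- c(0, 1, 1, 2, 2, 2, 0, 1, 2, 2, 0, 1)
isseed  <- c(FALSE, TRUE, FALSE, TRUE, TRUE, TRUE,
             FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)

# Count seed points per cluster, ignoring the noise label 0.
keep <- cluster != 0
seeds_per_cluster <- tapply(isseed[keep], cluster[keep], sum)

# ID of the cluster with the most seeds (first one on ties).
best <- as.integer(names(which.max(seeds_per_cluster)))

# Mean of that cluster's points; drop = FALSE keeps a matrix for 1-row clusters.
centre <- colMeans(pts[cluster == best, , drop = FALSE])

print(seeds_per_cluster)
print(best)
print(centre)
```

With a real fit, replace `pts`, `cluster` and `isseed` by `data`, `data.ds$cluster` and `data.ds$isseed`.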

Related

Errors using powerSim and powerCurve for a clmm in R

I'm new to clmm and run into the following problem:
I want to obtain the optimal sample size for my study with R using powerSim and powerCurve. Because my data is ordinal, I'm using a clmm. Study participants (VPN) should evaluate three sentence types (SH1, SM1, and SP1) on a 5-point Likert scale (evaluation.likert). I need to account for my participants as a random factor while the sentence types and the evaluation are my fixed factors.
Here's a glimpse of my data (count of VPN goes up to 40 for each of the parameters, I just shortened it here):
   VPN parameter evaluation.likert
1    1       SH1                 2
2    2       SH1                 4
3    3       SH1                 5
4    4       SH1                 3
...
5    1       SM1                 4
6    2       SM1                 2
7    3       SM1                 2
8    4       SM1                 5
...
9    1       SP1                 1
10   2       SP1                 1
11   3       SP1                 3
12   4       SP1                 5
...
Now, with some help I created the following model:
clmm(likert~parameter+(1|VPN), data=dfdata)
With this model, I'm doing the simulation:
ps1 <- powerSim(power, test=fixed("likert:parameter", "anova"), nsim=40)
Warning:
In observedPowerWarning(sim) :
This appears to be an "observed power" calculation
print(ps1)
Power for predictor 'likert:parameter', (95% confidence interval):
0.00% ( 0.00, 8.81)
Test: Type-I F-test
Based on 40 simulations, (0 warnings, 40 errors)
alpha = 0.05, nrow = NA
Time elapsed: 0 h 0 m 0 s
nb: result might be an observed power calculation
In the above example, I tried it with 40 participants but I already also ran a simulation with 2000000 participants to check if I just need a huge amount of people. The results were the same though: 0.0%.
lastResult()$errors tells me that I'm using a method which is not applicable for clmm:
not applicable method for'simulate' on object of class "clmm"
But besides the anova I'm doing here, I've also already tried z, t, f, chisq, lr, sa, kr, pb. (And instead of test=fixed, I've also already tried test=compare, test=fcompare, test=rcompare, and even test=random())
So I guess there must be something wrong with my model? Or are really none of these methods applicable for clmms?
Many thanks in advance, your help is already very much appreciated!

Adjusted survival curve based on weighted cox regression

I'm trying to make an adjusted survival curve based on a weighted cox regression performed on a case cohort data set in R, but unfortunately, I can't make it work. I was therefore hoping that some of you may be able to figure out why it isn't working.
In order to illustrate the problem, I have used (and adjusted a bit) the example from the "Package 'survival'" document, which means I'm working with:
data("nwtco")
subcoh <- nwtco$in.subcohort
selccoh <- with(nwtco, rel==1|subcoh==1)
ccoh.data <- nwtco[selccoh,]
ccoh.data$subcohort <- subcoh[selccoh]
ccoh.data$age <- ccoh.data$age/12 # Age in years
fit.ccSP <- cch(Surv(edrel, rel) ~ stage + histol + age,
data =ccoh.data,subcoh = ~subcohort, id=~seqno, cohort.size=4028, method="LinYing")
The data set is looking like this:
   seqno instit histol stage study rel edrel      age in.subcohort subcohort
4      4      2      1     4     3   0  6200 2.333333         TRUE      TRUE
7      7      1      1     4     3   1   324 3.750000        FALSE     FALSE
11    11      1      2     2     3   0  5570 2.000000         TRUE      TRUE
14    14      1      1     2     3   0  5942 1.583333         TRUE      TRUE
17    17      1      1     2     3   1   960 7.166667        FALSE     FALSE
22    22      1      1     2     3   1    93 2.666667        FALSE     FALSE
Then, I'm trying to illustrate the effect of stage in an adjusted survival curve, using the ggadjustedcurves-function from the survminer package:
library(survminer)
ggadjustedcurves(fit.ccSP, variable = ccoh.data$stage, data = ccoh.data)
#Error in survexp(as.formula(paste("~", variable)), data = ndata, ratetable = fit) :
# Invalid rate table
But unfortunately, this is not working. Can anyone figure out why? And can this somehow be fixed or done in another way?
Essentially, I'm looking for a way to graphically illustrate the effect of a continuous variable in a weighted cox regression performed on a case cohort data set, so I would, generally, also be interested in hearing if there are other alternatives than the adjusted survival curves?
There are two reasons it is throwing errors.
The ggadjustedcurves function is not being given a coxph object, which its help page indicates is the intended first argument.
The specification of the variable argument is incorrect. The correct way to specify a column is with a length-1 character vector that matches one of the names in the formula. You gave it a vector of length 1154.
This code succeeds:
fit.ccSP <- coxph(Surv(edrel, rel) ~ stage + histol + age,
data =ccoh.data)
ggadjustedcurves(fit.ccSP, variable = 'stage', data = ccoh.data)
It might not answer your desires, but it does answer the "why-error" part of your question. You might want to review the methods used by Therneau, Cynthia S Crowson, and Elizabeth J Atkinson in their paper on adjusted curves:
https://cran.r-project.org/web/packages/survival/vignettes/adjcurve.pdf

Clustering with Mclust results in an empty cluster

I am trying to cluster my empirical data using Mclust. When using the following, very simple code:
library(reshape2)
library(mclust)
data <- read.csv(file.choose(), header=TRUE, check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)
R gives me the following result:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust E (univariate, equal variance) model with 4 components:
 log-likelihood    n df       BIC       ICL
      -20504.71 3258  8 -41074.13 -44326.69
Clustering table:
   1    2    3    4
   0 2271  896   91
Mixing probabilities:
        1         2         3         4
0.2807685 0.4342499 0.2544305 0.0305511
Means:
       1        2        3        4
1381.391 1381.715 1574.335 1851.667
Variances:
       1        2        3        4
7466.189 7466.189 7466.189 7466.189
Edit: Here is my data for download: https://www.file-upload.net/download-14320392/example.csv.html
I do not readily understand why Mclust gives me an empty cluster (0), especially with nearly identical mean values to the second cluster. This only appears when specifically looking for a univariate, equal variance model. Using for example modelNames="V", or leaving it at the default, does not produce this problem.
This thread: Cluster contains no observations has a similar problem, but if I understand correctly, this appeared to be due to randomly generated data?
I am somewhat clueless as to where my problem is or if I am missing anything obvious.
Any help is appreciated!
As you noted, the means of clusters 1 and 2 are extremely similar, and it so happens that there's quite a lot of data there (see the spike in the histogram):
set.seed(111)
data <- read.csv("example.csv", header=TRUE, check.names = FALSE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
hist(data$value,br=50)
abline(v=fit$parameters$mean,
col=c("#FF000080","#0000FF80","#BEBEBE80","#BEBEBE80"),lty=8)
Briefly, mclust (like any GMM) fits a probabilistic model: it estimates the mean and variance of each cluster and also the probability of each point belonging to each cluster. This is unlike k-means, which provides a hard assignment. The likelihood of the model is built, for every data point, from the sum over clusters of the mixing proportion times that cluster's density at the point; you can also check this in mclust's publication.
In this model, the means of cluster 1 and cluster 2 are near but their expected proportions are different:
fit$parameters$pro
[1] 0.28565736 0.42933294 0.25445342 0.03055627
This means that a data point near the means of clusters 1 and 2 will consistently be assigned to cluster 2. For example, let's predict data points from 1350 to 1400:
head(predict(fit,1350:1400)$z)
             1         2          3            4
[1,] 0.3947392 0.5923461 0.01291472 2.161694e-09
[2,] 0.3945941 0.5921579 0.01324800 2.301397e-09
[3,] 0.3944456 0.5919646 0.01358975 2.450108e-09
[4,] 0.3942937 0.5917661 0.01394020 2.608404e-09
[5,] 0.3941382 0.5915623 0.01429955 2.776902e-09
[6,] 0.3939790 0.5913529 0.01466803 2.956257e-09
The $classification is obtained by taking the column with the maximum probability. So, same example, everything is assigned to 2:
head(predict(fit,1350:1400)$classification)
[1] 2 2 2 2 2 2
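The same mechanism can be reproduced by hand without mclust. The sketch below is a rough illustration: it copies the mixing proportions, means and shared variance from the summary output in the question (so the numbers differ slightly from the refit above) and computes the posterior membership directly. `posterior` is a hypothetical helper name, not part of mclust:

```r
# Parameters copied from the summary(fit) output in the question.
pro   <- c(0.2807685, 0.4342499, 0.2544305, 0.0305511)
means <- c(1381.391, 1381.715, 1574.335, 1851.667)
sdev  <- sqrt(7466.189)  # one shared variance under model "E"

# Posterior membership probabilities for a single value x:
# mixing proportion times component density, normalised to sum to 1.
posterior <- function(x) {
  w <- pro * dnorm(x, mean = means, sd = sdev)
  w / sum(w)
}

# Hard assignment = component with the largest posterior.
grid   <- 1000:2200
labels <- sapply(grid, function(x) which.max(posterior(x)))

print(round(posterior(1350), 4))
print(table(labels))
```

Because components 1 and 2 share a variance and have almost the same mean, their densities are nearly identical everywhere, so the larger mixing proportion of component 2 decides the assignment and component 1 ends up with no members over this range.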
To answer your question: no, you did not do anything wrong; it's a drawback, at least with this implementation of GMM. I would say it's a bit of overfitting, but you can simply keep only the clusters that have members.
If you use modelNames="V", I see the solution is equally problematic:
fitv <- Mclust(data$value, modelNames="V", G = 1:7)
plot(fitv,what="classification")
Using scikit-learn's GMM I don't see a similar issue. So if you need something like a Gaussian mixture with equal variances, consider using a fuzzy k-means:
library(ClusterR)
# the fitting call is missing from the original post; this line is an assumed reconstruction (3 clusters, to match the axis labels below)
fit_kmeans <- KMeans_rcpp(as.matrix(data$value), clusters = 3, fuzzy = TRUE)
plot(NULL,xlim=range(data$value),ylim=c(0,4),ylab="cluster",yaxt="n",xlab="values")
points(data$value,fit_kmeans$clusters,pch=19,cex=0.1,col=factor(fit_kmeans$clusters))
axis(2,1:3,as.character(1:3))
If you don't need equal variance, you can use the GMM function in the ClusterR package too.

Stata twoway graph of means with confidence intervals

Using
clear
score group test
2 0 A
3 0 B
6 0 B
8 0 A
2 0 A
2 0 A
10 1 B
7 1 B
8 1 A
5 1 A
10 1 A
11 1 B
end
I want to scatter plot mean score by group for each test (same graph) with confidence intervals (the real data has thousands of observations). The resulting graph would have two sets of two dots. One set of dots for test==a (group==0 vs group==1) and one set of dots for test==b (group==0 vs group==1).
My current approach works but it is laborious. I compute all of the needed statistics using egen: the mean, number of observations, standard deviations...for each group by test. I then collapse the data and plot.
There has to be another way, no?
I assumed that Stata would be able to take as its input the score group and test variables and then compute and present this pretty standard graph.
After spending a lot of time on Google, I had to ask.
Although there are user-written programs, I lean towards statsby as a basic approach here. Discussion is accessible in this paper.
This example takes your data example (almost executable code). Some choices depend on the large confidence intervals implied. Note that if your version of Stata is not up-to-date, the syntax of ci will be different. (Just omit means.)
clear
input score group str1 test
2 0 A
3 0 B
6 0 B
8 0 A
2 0 A
2 0 A
10 1 B
7 1 B
8 1 A
5 1 A
10 1 A
11 1 B
end
save cj12 , replace
* test A
statsby mean=r(mean) ub=r(ub) lb=r(lb) N=r(N), by(group) clear : ///
ci means score if test == "A"
gen test = "A"
save cj12results, replace
* test B
use cj12
statsby mean=r(mean) ub=r(ub) lb=r(lb) N=r(N), by(group) clear : ///
ci means score if test == "B"
gen test = "B"
append using cj12results
* graph; show sample sizes too, but where to show them is empirical
set scheme s1color
gen where = -20
scatter mean group, ms(O) mcolor(blue) || ///
rcap ub lb group, lcolor(blue) ///
by(test, note("95% confidence intervals") legend(off)) ///
subtitle(, fcolor(ltblue*0.2)) ///
ytitle(score) xla(0 1) xsc(r(-0.25 1.25)) yla(-10(10)10, ang(h)) || ///
scatter where group, ms(none) mla(N) mlabpos(12) mlabsize(*1.5)
We can't compare your complete code or your graph, because you show neither.
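For readers who want to check the plotted numbers outside Stata, here is a rough R sketch of the per-cell calculation. It assumes a t-based interval, which is my understanding of what ci means reports by default; `ci_mean` is a hypothetical helper name. It also shows why the intervals are so wide: the cells hold only 2 to 4 observations each.

```r
# Data copied from the question.
score <- c(2, 3, 6, 8, 2, 2, 10, 7, 8, 5, 10, 11)
group <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
test  <- c("A", "B", "B", "A", "A", "A", "B", "B", "A", "A", "A", "B")

# Mean and t-based confidence interval for one cell (assumed method).
ci_mean <- function(x, level = 0.95) {
  n    <- length(x)
  m    <- mean(x)
  se   <- sd(x) / sqrt(n)
  half <- qt(1 - (1 - level) / 2, df = n - 1) * se
  c(N = n, mean = m, lb = m - half, ub = m + half)
}

# One row per test-by-group cell; row names look like "A.0", "B.0", ...
cells  <- split(score, list(test = test, group = group))
result <- t(sapply(cells, ci_mean))
print(round(result, 2))
```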

Working with Self Organizing Maps - How do I interpret the results?

I have this data set that I thought would be a good candidate for making a SOM.
So, I converted it to text thusly:
10
12 1 0 0
13 3 0 0
14 21 0 0
19 1983 15 0
24 5329 48 0
29 4543 50 0
34 3164 32 0
39 1668 22 1
44 459 4 0
49 17 0 0
I'm using Octave, so I transformed the data with these commands:
dataIn = fopen('data.txt','r');
n = fscanf(dataIn,'%d',1);
D = fscanf(dataIn,'%f'); % D is a 40 x 1 column vector (all remaining values)
D = D'; % transpose: D is now a 1 x 40 row vector
D = reshape(D, 4, []); % reshape column-wise into a 4 x 10 matrix (one column per input row)
D = D(2:4, :); % the dimensions to be used for the SOM will come from the bottom three rows
Now, I'm applying an SOM script to produce a map using D.
The script is here
and it's using findBMU defined as:
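% finds the best matching unit (BMU) in SOM O for input vector iv
function [r c] = findBMU(iv, O)
  dist = zeros(size(O));
  for i = 1:3
    dist(:,:,i) = O(:,:,i) - iv(i);  % per-dimension difference to every node
  end
  dist = sum(dist.^2, 3);            % squared Euclidean distance per node
  [v r] = min(min(dist, [], 2));     % row of the closest node
  [v c] = min(min(dist, [], 1));     % column of the closest node
end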
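For comparison, the same best-matching-unit search can be sketched in R. This is only an illustration under assumed inputs: the 4 x 5 map `O` and the input vector `iv` below are made-up stand-ins, not the map produced by the script.

```r
# Hypothetical 4 x 5 map with 3 weights per node (stand-in for the SOM O).
set.seed(7)
O  <- array(runif(4 * 5 * 3), dim = c(4, 5, 3))
iv <- c(0.2, 0.5, 0.9)

# Squared Euclidean distance from iv to every node's weight vector.
d2 <- apply(O, c(1, 2), function(w) sum((w - iv)^2))

# Row and column of the closest node (first match on ties).
bmu <- which(d2 == min(d2), arr.ind = TRUE)[1, ]
print(bmu)
```

The node at that row and column is the one whose weight vector is closest to the input, which is all that "best matching unit" means.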
In the end, it starts with a random map that looks like this:
and it becomes:
The thing is, I don't know what my SOM is saying. How do I read it?
Firstly, you should be aware that Octave provides at best an approximation to the SOM methodology. The main methodological advantage of the SOM is the potential transparent access of (all) the implied parameters, and those cannot be accessed in Octave any more.
Secondly, considering your data, it does not make much sense to first destroy information by summarizing it and then feed the summary to a SOM. Basically you have four variables in the table shown above: age, total N, single N, twin N. What you have destroyed is the information about the region.
So you put three distributions into the SOM. The only thing you could expect is clusters. Yet, the SOM is not built for building clusters. Instead, SOM is used for diagnostic and predictive modeling, in order to find the most accurate model and the most relevant variables. Note the term "best matching unit"!
In your example, however, you find just a distribution in the SOM. Basically, there is no interpretation, as there are no variables and no predictive/diagnostic purpose.
You could build a model, for instance, determining the similarity of distributions. Yet, for that you should use a goodness-of-fit test (non-parametric, Kolmogorov-Smirnov), not the SOM.
