I've recently been trying to evaluate the output of k-modes (a cluster label) against a so-called true cluster label (called 'class' below).
In other words, I've been trying to externally validate the clustering output. However, when I tried the external validation measures from the 'fpc' package, I was unsuccessful (the error is posted below the script).
I've attached my code for the mushroom dataset. I would appreciate it if anyone could show me how to successfully run these external validation measures on categorical data.
Any help appreciated.
# LIBRARIES
install.packages('klaR')
install.packages('fpc')
library(klaR)
library(fpc)
#MUSHROOM DATA
mushrooms <- read.csv(file = "https://raw.githubusercontent.com/miachen410/Mushrooms/master/mushrooms.csv", header = FALSE)
names(mushrooms) <- c("edibility", "cap-shape", "cap-surface", "cap-color",
"bruises", "odor", "gill-attachment", "gill-spacing",
"gill-size", "gill-color", "stalk-shape", "stalk-root",
"stalk-surface-above-ring", "stalk-surface-below-ring",
"stalk-color-above-ring", "stalk-color-below-ring", "veil-type",
"veil-color", "ring-number", "ring-type", "spore-print-color",
"population", "habitat")
names(mushrooms)[names(mushrooms)=="edibility"] <- "class"
indexes <- apply(mushrooms, 2, function(x) any(is.na(x) | is.infinite(x)))
colnames(mushrooms)[indexes]
table(mushrooms$class)
str(mushrooms)
#REMOVING CLASS VARIABLE
mushroom.df <- subset(mushrooms, select = -c(class))
#KMODES ANALYSIS
result.kmode <- kmodes(mushroom.df, 2, iter.max = 50, weighted = FALSE)
#EXTERNAL VALIDATION ATTEMPT
mushrooms$class <- as.factor(mushrooms$class)
class <- as.numeric(mushrooms$class)
clust_stats <- cluster.stats(d = dist(mushroom.df),
class, result.kmode$cluster)
#ERROR TERM
Error in silhouette.default(clustering, dmatrix = dmat) :
NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In dist(mushroom.df) : NAs introduced by coercion
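The warning in the output above ("NAs introduced by coercion") suggests that dist() tried to turn the categorical columns into numbers, so the distance matrix is already full of NAs before cluster.stats sees it. One possible way round this, sketched below in base R, is to build a simple-matching dissimilarity (the share of attributes on which two rows disagree) and use that as the distance instead. The helper name simple_matching_dist and the toy data are made up for illustration and are not from the original post; on the full mushroom data an n x n matrix built this way may be slow and memory-hungry.

```r
# Simple-matching dissimilarity for categorical data (illustrative sketch).
# simple_matching_dist is a hypothetical helper, not part of fpc or klaR.
simple_matching_dist <- function(df) {
  m <- as.matrix(df)   # factor/character columns become a character matrix
  n <- nrow(m)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) {
    # share of columns in which each row differs from row i
    row_i <- matrix(m[i, ], nrow = n, ncol = ncol(m), byrow = TRUE)
    d[i, ] <- rowMeans(m != row_i)
  }
  as.dist(d)
}

# Toy data standing in for the mushroom attributes (assumed values)
toy <- data.frame(
  odor  = c("a", "a", "n", "n"),
  color = c("w", "w", "b", "w"),
  stringsAsFactors = TRUE
)
toy_class   <- c(1, 1, 2, 2)   # stand-in for the true label
toy_cluster <- c(1, 1, 2, 1)   # stand-in for result.kmode$cluster

d_toy <- simple_matching_dist(toy)

# A cross-tabulation is the simplest external check
agreement <- table(class = toy_class, cluster = toy_cluster)
print(agreement)
```

With fpc loaded, a call along the lines of cluster.stats(d = d_toy, clustering = toy_cluster, alt.clustering = toy_class) should then report the external indices in $corrected.rand and $vi. Note that the call in the question passes the true label as the clustering and the k-modes result as the alternative clustering, which is probably the wrong way round for the internal statistics.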
I am trying to fit an SEIR model for the UK to evaluate the containment measures that were implemented, and found some code using the pomp package here: https://kingaa.github.io/clim-dis/parest/parest.html
I tried to adapt it to my case, which adds one stage (E) and three more variables. In the end I want to do a least-squares estimation to find the optimal beta.
Data_UK_beta0 consists of the variables date (an integer from 0 to 165) and new_cases (from the Johns Hopkins University dataset).
Data_UK_pomp_beta0 <- pomp(
data= Data_UK_beta0,
times ="date", t0=0,
skeleton = vectorfield(
Csnippet("
DS=-beta1*S*I/N;
DE= beta1*S*I/N-delta1*E;
DI=delta1*E-(gamma1+eta1)*I;
DR=gamma1*I;")),
rinit = Csnippet("
S=S_0;
E=E_0;
I=I_0;
R=N-S_0-E_0-I_0;"),
statenames = c("S","E","I","R"),
paramnames = c("beta1","delta1","gamma1","eta1","N","S_0","E_0","I_0"))
sse_UK_beta0 <- function (params) {
x <- trajectory(Data_UK_pomp_beta0,params=params)
discrep <- x["I",,]-obs(Data_UK_pomp_beta0)
sum(discrep^2)
}
install.packages("apricom")
library(apricom)
beta_reg <- function (beta0) {
params <- c(beta1=beta0, delta1=1/5.1, gamma1=1, eta1=0.012649, N=67886004, S_0=67886004, E_0=5, I_0=2)
sse(params)
}
beta0 <- seq(from=1,to=40,by=1)
SSE <- sapply(beta0, beta_reg)
I got the following error and traceback (translated from German):
Error in evaluating the argument 'x' in selecting a method for function 'as.matrix': argument "dataset" is missing, with no default
7.
h(simpleError(msg, call))
6.
.handleSimpleError(function (cond)
.Internal(C_tryCatchHelper(addr, 1L, cond)), "argument \"dataset\" is missing, with no default",
base::quote(as.matrix(dataset)))
5.
as.matrix(dataset)
4.
sse(params)
3.
FUN(X[[i]], ...)
2.
lapply(X = X, FUN = FUN, ...)
1.
sapply(beta0, beta_reg)
What did I do wrong?
The sse function you've imported from apricom has nothing (as far as I can see) to do with this problem. (This also doesn't have anything to do with C(++) code compilation, so the [compiler-errors] tag in your question is a little misleading.)
You haven't given us a way to get your Data_UK_beta0 data set so I can't reproduce this, but I assume that you actually want something like:
beta_reg <- function (beta0) {
params <- c(beta1=beta0, delta1=1/5.1, gamma1=1,
eta1=0.012649, N=67886004, S_0=67886004, E_0=5, I_0=2)
sse_UK_beta0(params)
}
beta0 <- seq(from=1,to=40,by=1)
SSE <- sapply(beta0, beta_reg)
You should also be aware that there are a bunch of other potentially tricky issues in what you're doing. One that jumps out is that you are likely comparing the prevalence of infection in the model (the current number in the I compartment) to case report data, which is a (lagged, biased, imperfect) measure of the incidence of disease, i.e. the number of new cases per unit time ...
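To make the prevalence/incidence distinction concrete, here is a rough base-R sketch (not pomp code, and not a fit to the poster's data) that integrates the same SEIR equations with a crude Euler scheme and records both quantities per day. The function name seir_euler, the step size, the value beta1 = 2 and the choice of E to I transitions as "new cases" are all assumptions made for illustration.

```r
# Crude Euler integration of the SEIR equations from the question.
# seir_euler is a hypothetical helper; dt and the definition of
# "new cases" (E -> I transitions) are illustrative assumptions.
seir_euler <- function(beta1, delta1, gamma1, eta1, N, E0, I0, days, dt = 0.1) {
  steps_per_day <- round(1 / dt)
  S <- N - E0 - I0
  E <- E0
  I <- I0
  prevalence <- numeric(days)
  incidence  <- numeric(days)
  for (day in seq_len(days)) {
    new_cases <- 0
    for (k in seq_len(steps_per_day)) {
      infections <- beta1 * S * I / N * dt     # S -> E
      onsets     <- delta1 * E * dt            # E -> I
      removals   <- (gamma1 + eta1) * I * dt   # I -> removed
      S <- S - infections
      E <- E + infections - onsets
      I <- I + onsets - removals
      new_cases <- new_cases + onsets
    }
    prevalence[day] <- I           # current number in the I compartment
    incidence[day]  <- new_cases   # new cases accumulated over the day
  }
  data.frame(day = seq_len(days), prevalence = prevalence, incidence = incidence)
}

out <- seir_euler(beta1 = 2, delta1 = 1/5.1, gamma1 = 1, eta1 = 0.012649,
                  N = 67886004, E0 = 5, I0 = 2, days = 30)
print(head(out))
```

In a least-squares fit against new_cases, the incidence column (possibly combined with a reporting fraction and a delay) would probably be the more natural quantity to compare with the data than I itself.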
I am using the PC algorithm function pc(), which takes a conditional independence test as one of its arguments, and I get an error from the following code. Note that 'data' here is the data I have been using, and 1, 6, 2 in gaussCItest are node positions (x, y and the conditioning set) in the adjacency matrix of the data.
Code:
library(pcalg)
suffstat <- list(C = cor(data), n = nrow(data))
pc.data <- pc(suffstat,
indepTest=gaussCItest(1,6,2,suffstat),
p=ncol(data),alpha=0.01)
Error:
Error in indepTest(x, y, nbrs[S], suffStat) :
could not find function "indepTest"
Below is the code that worked. I removed the arguments to gaussCItest: pc() expects the function itself, so it should be passed by name rather than called.
library(pcalg)
suffstat <- list(C = cor(data), n = nrow(data))
pc.data <- pc(suffstat,indepTest=gaussCItest, p=ncol(data),alpha=0.01)
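The difference between passing a function and passing the result of calling it can be reproduced without pcalg. In this sketch run_test is a made-up stand-in for pc() and toy_test a made-up stand-in for gaussCItest; neither reflects pcalg's internals beyond the fact, visible in the error above, that pc() calls indepTest(x, y, ...) itself.

```r
# run_test is a hypothetical stand-in for pc(): it calls the test it is given
run_test <- function(indepTest, x, y) {
  indepTest(x, y)
}

# toy_test is a hypothetical stand-in for gaussCItest
toy_test <- function(x, y) x + y

# Passing the function itself works: run_test calls it internally
ok <- run_test(indepTest = toy_test, x = 1, y = 6)

# Passing the result of a call hands over a number, not a function,
# so the internal call indepTest(x, y) has nothing to call
msg <- tryCatch(
  run_test(indepTest = toy_test(1, 6), x = 1, y = 6),
  error = function(e) conditionMessage(e)
)

print(ok)
print(msg)
```

The second call should fail in the same way as the original pc() call, because R looks for a function named indepTest and finds only a numeric value.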
I am analyzing the ratio of (biomass of one part of a plant community) vs. (total plant community biomass) across different treatments in time (i.e. repeated measures) in R. Hence, it seems natural to use beta regression with a mixed component (available with the glmmTMB package) in order to account for repeated measures.
My problem is with computing post hoc comparisons across my treatments using the function lsmeans from the lsmeans package. glmmTMB objects are not handled by the lsmeans function, so Ben Bolker recommended adding the following code before loading the packages {glmmTMB} and {lsmeans}:
recover.data.glmmTMB <- function(object, ...) {
fcall <- getCall(object)
recover.data(fcall,delete.response(terms(object)),
attr(model.frame(object),"na.action"), ...)}
lsm.basis.glmmTMB <- function (object, trms, xlev, grid, vcov.,
mode = "asymptotic", component="cond", ...) {
if (mode != "asymptotic") stop("only asymptotic mode is available")
if (component != "cond") stop("only tested for conditional component")
if (missing(vcov.))
V <- as.matrix(vcov(object)[[component]])
else V <- as.matrix(.my.vcov(object, vcov.))
dfargs = misc = list()
if (mode == "asymptotic") {
dffun = function(k, dfargs) NA
}
## use this? misc = .std.link.labels(family(object), misc)
contrasts = attr(model.matrix(object), "contrasts")
m = model.frame(trms, grid, na.action = na.pass, xlev = xlev)
X = model.matrix(trms, m, contrasts.arg = contrasts)
bhat = fixef(object)[[component]]
if (length(bhat) < ncol(X)) {
kept = match(names(bhat), dimnames(X)[[2]])
bhat = NA * X[1, ]
bhat[kept] = fixef(object)[[component]]
modmat = model.matrix(trms, model.frame(object), contrasts.arg = contrasts)
nbasis = estimability::nonest.basis(modmat)
}
else nbasis = estimability::all.estble
list(X = X, bhat = bhat, nbasis = nbasis, V = V, dffun = dffun,
dfargs = dfargs, misc = misc)
}
Here is my code and data:
trt=c(rep("T5",13),rep("T4",13),
rep("T3",13),rep("T1",13),rep("T2",13),rep("T1",13),
rep("T2",13),rep("T3",13),rep("T5",13),rep("T4",13))
year=rep(2005:2017,10)
plot=rep(LETTERS[1:10],each=13)
ratio=c(0.0046856237844411,0.00100861922394448,0.032516291436091,0.0136507743972955,0.0940240065096705,0.0141337428305094,0.00746709315018945,0.437009092691189,0.0708021091805216,0.0327952505849285,0.0192685194751524,0.0914696394299481,0.00281889216102303,
0.0111928453399615,0.00188119596836005,NA,0.000874623692966351,0.0181192859074754,0.0176635391424644,0.00922358069727823,0.0525280029990213,0.0975006760149882,0.124726170684951,0.0187132600944396,0.00672592365451266,0.106399234215126,
0.0401776844073239,0.00015382736648373,0.000293356424756535,0.000923659501144292,0.000897412901472504,0.00315930225856196,0.0636501228611642,0.0129422445492391,0.0143526630252398,0.0136775931834926,0.00159292971508751,0.0000322313783211749,0.00125352390811532,
0.0000288862579879126,0.00590690336494395,0.000417043974238875,0.0000695808216192379,0.001301299696752,0.000209355138230326,0.000153151660178623,0.0000646279598274632,0.000596704590065324,9.52943306579156E-06,0.000113476446629278,0.00825405312309618,0.0001025984082064,
0.000887617767039489,0.00273668802742924,0.00469409165130462,0.00312377000134233,0.0015579322817235,0.0582615988387306,0.00146933878743163,0.0405139497779372,0.259097955479886,0.00783997376383192,0.110638003652979,0.00454029511918275,0.00728290246595241,
0.00104674197030363,0.00550563937846687,0.000121380392484705,0.000831904606687671,0.00475778829159394,0.000402799910756391,0.00259524300745195,0.000210249875492504,0.00550104485802363,0.000272849546913495,0.0025389089622392,0.00129370075116459,0.00132810234020792,
0.00523285954007915,0.00506230599388357,0.00774104695265855,0.00098348404576587,0.174079173227248,0.0153486840317039,0.351820365452281,0.00347674458928481,0.147309225196026,0.0418825705903947,0.00591271021100856,0.0207139520537443,0.0563647804012055,
0.000560012457272534,0.00191564842393647,0.01493480083524,0.00353400674061077,0.00771828473058641,0.000202009136938048,0.112695841130448,0.00761492172670762,0.038797330459115,0.217367765362878,0.0680958660605668,0.0100870294641921,0.00493875324236991,
0.00136539944656238,0.00264262100866192,0.0847732305020654,0.00460985241335143,0.235802638543116,0.16336020383325,0.225776236687456,0.0204568107372349,0.0455390585228863,0.130969863489582,0.00679523322812889,0.0172325334280024,0.00299970176999806,
0.00179347656925317,0.00721658257996989,0.00822443690003783,0.00913096724026346,0.0105920192618379,0.0158013204589482,0.00388803567197835,0.00366268607026078,0.0545418725650633,0.00761485067129418,0.00867583194858734,0.0188232707241144,0.018652666214789)
dat=data.frame(trt,year,plot,ratio)
require(glmmTMB)
require(lsmeans)
mod=glmmTMB(ratio~trt*scale(year)+(1|plot),family=list(family="beta",link="logit"),data=dat)
summary(mod)
ls=lsmeans(mod,pairwise~trt)
Finally, I get the following error message that I've never encountered and on which I could find no information:
In model.matrix.default(trms, m, contrasts.arg = contrasts) :
variable 'plot' is absent, its contrast will be ignored
Could anyone shed some light on this? Thanks!
This is not an error message, it's a (harmless) warning message. It occurs because the hacked-up method I wrote doesn't exclude factor variables that are only used in the random effects. You should worry more about this output:
NOTE: Results may be misleading due to involvement in interactions
which is warning you that you are evaluating main effects in a model that contains interactions; you have to think about this carefully to make sure you're doing it right.
I am using a glmulti wrapper for glmer (binomial) and the summary is:
This is glmulti 1.0.7, Apr. 2013.
Length Class Mode
0 NULL NULL
Following what was done in this thread (which is for lmer rather than glmer),
"glmulti runs indefinitely when using genetic algorithm with lme4", I get the same result as above. Could it be that the versions have changed since then and the wrapping has to be done differently? The following is the dummy code (lifted from the link above):
x = as.factor(round(runif(30),1))# dummy grouping factor
yind = runif(30,0,10) # mock dependent variable
a = runif(30) # dummy covariate
b = runif(30) # another dummy covariate
c = runif(30) # an another one
d = runif(30)
tmpdata <- data.frame(x=x,yind=yind,a=a,b=b,c=c,d=d)
lmer.glmulti <- function (formula, data, random = "", ...) {
lmer(paste(deparse(formula), random), data = data, REML=F, ...)
}
summary(glmulti(formula = yind~a*b*c*d,
data = tmpdata,
random = '+(1|x)',
level = 2,
method = 'h',
crit = 'aicc',
marginality = TRUE,
fitfunc = lmer.glmulti))
lme4 version: 1.1.5
glmulti version: 1.0.7
"R version 3.0.2 (2013-09-25)"
SOLUTION
This works:
lmer.glmulti <- function (formula, data, random, ...) {
lmer(paste(deparse(formula), random), data = data)
}
glmulti(y = yind~a*b*c*d,
data = tmpdata,
random = '+(1|x)',
level = 2,
method = 'h',
crit = 'aicc',
marginality = TRUE,
fitfunc = lmer.glmulti)
packageVersion('lme4')
‘1.1.5’
packageVersion('glmulti')
‘1.0.7’
R.version: 3.1.0
FYI: From the package maintainer:
"fitfunc must be the name of a function so your other call including the function definition in the glmulti call cannot work."
"you named the first argument to glmulti 'formula', where it must be unnamed or 'y'... Sorry. But y is a formula (if passing a string it is the dependent variable only). "
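For anyone wondering what the wrapper actually hands to lmer, the following base-R sketch builds the same kind of string that paste(deparse(formula), random) produces and turns it back into a formula. build_mixed_formula is a hypothetical helper name; the width.cutoff and collapse arguments are my own addition, on the assumption that deparse() may split a long candidate formula into several strings.

```r
# build_mixed_formula is a hypothetical helper mirroring the wrapper's
# paste(deparse(formula), random) step; it needs neither lme4 nor glmulti.
build_mixed_formula <- function(formula, random) {
  # collapse guards against deparse() returning more than one string
  fixed_part <- paste(deparse(formula, width.cutoff = 500L), collapse = " ")
  as.formula(paste(fixed_part, random))
}

full <- build_mixed_formula(yind ~ a * b * c * d, "+(1|x)")
print(full)
print(all.vars(full))
```

The fixed-effects part is what glmulti varies between candidate models, while the random string is appended unchanged each time.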