I am experimenting with the cpt and bcp packages in R. I import the following data from a CSV file:
Here is the CSV file link: http://www.filedropper.com/cpttest
Running bcp:
p.bcp=bcp(p$Rate);
plot(p.bcp)
Point 28 is a change point according to bcp.
When running cpt, I get no indication that a change point exists:
p.cpt=cpt.mean(p$Rate,method="AMOC")
p.cpt
I'm wondering if anyone could advise why cpt.mean() does not detect a change point here?
See if a different search method in the changepoint package satisfies your purpose: replace "AMOC" with "PELT", "BinSeg", or "SegNeigh" in the method argument.
library(changepoint)
p.cpt <- changepoint::cpt.mean(p$Rate, method = "PELT")  # or "BinSeg", "SegNeigh"
cpts(p.cpt)
There are many change point packages in R and you could try others. I've compiled a list here. Disclosure: I am the developer of the mcp package.
To use mcp for your problem, do:
library(mcp)

model = list(
  Rate ~ 1,  # Intercept
  ~ 1        # Another intercept
)

fit = mcp(model, p)
plot(fit)
See the estimate of the change point (cp_1) as well as the other parameters of the model. You could also see it as a narrow blue distribution in the plot above.
summary(fit)
Family: gaussian(link = 'identity')
Iterations: 9000 from 3 chains.
Segments:
1: Rate ~ 1
2: Rate ~ 1 ~ 1
Population-level parameters:
name mean lower upper Rhat n.eff
cp_1 30.5009 3.0e+01 3.1e+01 1 6250
int_1 0.0132 1.3e-02 1.3e-02 1 5501
int_2 0.0071 6.8e-03 7.3e-03 1 5514
sigma_1 0.0008 6.8e-04 9.4e-04 1 5056
I am trying to cluster my empirical data using Mclust. When using the following very simple code:
library(reshape2)
library(mclust)
data <- read.csv(file.choose(), header=TRUE, check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)
R gives me the following result:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust E (univariate, equal variance) model with 4 components:
log-likelihood n df BIC ICL
-20504.71 3258 8 -41074.13 -44326.69
Clustering table:
1 2 3 4
0 2271 896 91
Mixing probabilities:
1 2 3 4
0.2807685 0.4342499 0.2544305 0.0305511
Means:
1 2 3 4
1381.391 1381.715 1574.335 1851.667
Variances:
1 2 3 4
7466.189 7466.189 7466.189 7466.189
Edit: here is my data for download: https://www.file-upload.net/download-14320392/example.csv.html
I do not readily understand why Mclust gives me an empty cluster (0), especially one whose mean is nearly identical to that of the second cluster. This only appears when specifically looking for a univariate, equal-variance model; using, for example, modelNames="V" or leaving it at the default does not produce this problem.
This thread: Cluster contains no observations has a similar problem, but if I understand correctly, that appeared to be due to randomly generated data?
I am somewhat clueless as to where my problem is or if I am missing anything obvious.
Any help is appreciated!
As you noted, the means of clusters 1 and 2 are extremely similar, and it so happens that there is quite a lot of data there (see the spike in the histogram):
library(mclust)

set.seed(111)
data <- read.csv("example.csv", header=TRUE, check.names = FALSE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)

hist(data$value, br=50)
abline(v=fit$parameters$mean,
       col=c("#FF000080","#0000FF80","#BEBEBE80","#BEBEBE80"), lty=8)
Briefly, mclust fits a Gaussian mixture model (GMM), a probabilistic model that estimates the mean and variance of each cluster as well as the probability of each point belonging to each cluster. This is unlike k-means, which gives a hard assignment. The per-point likelihood sums the weighted probabilities of that point under each cluster; you can check the details in mclust's publication.
In this model the means of cluster 1 and cluster 2 are close, but their expected proportions are different:
fit$parameters$pro
[1] 0.28565736 0.42933294 0.25445342 0.03055627
This means that a data point lying around the means of clusters 1 and 2 will consistently be assigned to cluster 2. For example, let's predict data points from 1350 to 1400:
head(predict(fit,1350:1400)$z)
1 2 3 4
[1,] 0.3947392 0.5923461 0.01291472 2.161694e-09
[2,] 0.3945941 0.5921579 0.01324800 2.301397e-09
[3,] 0.3944456 0.5919646 0.01358975 2.450108e-09
[4,] 0.3942937 0.5917661 0.01394020 2.608404e-09
[5,] 0.3941382 0.5915623 0.01429955 2.776902e-09
[6,] 0.3939790 0.5913529 0.01466803 2.956257e-09
The $classification is obtained by taking the column with the maximum probability, so in the same example everything is assigned to cluster 2:
head(predict(fit,1350:1400)$classification)
[1] 2 2 2 2 2 2
To answer your question: no, you did not do anything wrong; it is a pitfall of this implementation of GMM, at least. I would say it is a bit of overfitting, but you can simply keep only the clusters that actually receive members.
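A quick way to see which clusters actually receive observations, reusing the fit object from above (an empty cluster shows up as a zero count):
# cluster sizes from the hard assignments; cluster 1 is empty here
table(factor(fit$classification, levels = seq_len(fit$G)))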
If you use modelNames="V", I see the solution is equally problematic:
fitv <- Mclust(data$value, modelNames="V", G = 1:7)
plot(fitv,what="classification")
Using scikit-learn's GMM I don't see a similar issue. So if you need a Gaussian mixture with equal (spherical) variances, consider using a fuzzy k-means:
library(ClusterR)

# example fit (assumed here): fuzzy k-means with 3 clusters via KMeans_rcpp
fit_kmeans <- KMeans_rcpp(as.matrix(data$value), clusters = 3, fuzzy = TRUE)

plot(NULL, xlim=range(data$value), ylim=c(0,4),
     ylab="cluster", yaxt="n", xlab="values")
points(data$value, fit_kmeans$clusters, pch=19, cex=0.1,
       col=factor(fit_kmeans$clusters))
axis(2, 1:3, as.character(1:3))
If you don't need equal variance, you can use the GMM function in the ClusterR package too.
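For instance, a minimal sketch, assuming three components and converting the column to a matrix as GMM expects:
library(ClusterR)

# fit a 3-component Gaussian mixture (component count chosen for illustration)
fit_gmm <- GMM(as.matrix(data$value), gaussian_comps = 3)
fit_gmm$centroids  # component means
fit_gmm$weights    # mixing proportions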
When computing profile confidence intervals using confint(m1), where m1 is a glmer() model, there is a term (or a few) at the top labelled .sig01, .sig02, etc. I can't find any documentation that explains what these mean, though.
You probably didn't find the documentation for the merMod method of confint. In ?lme4::confint.merMod, the following can be read under the oldNames parameter:
oldNames: (logical) use old-style names for variance-covariance parameters,
e.g. ".sig01", rather than newer (more informative) names such as
"sd_(Intercept)|Subject"? (See signames argument to profile).
The default option for oldNames is TRUE. Setting it to FALSE will give you a clearer output.
Example
library(lme4)
(gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
data = cbpp, family = binomial))
confint(gm1, oldNames=FALSE)
# 2.5 % 97.5 %
# sd_(Intercept)|herd 0.3460732 1.0999743
# (Intercept) -1.9012119 -0.9477540
# period2 -1.6167830 -0.4077632
# period3 -1.8010241 -0.5115362
# period4 -2.5007502 -0.8007554
I am currently trying to fit ODE functional responses using the Levenberg-Marquardt routine (nls.lm) from the minpack.lm package, following the tutorial here (http://www.r-bloggers.com/learning-r-parameter-fitting-for-models-involving-differential-equations/).
In the example, the author fits the data by first setting up a rate function rxnrate, which I modified as shown below:
library(ggplot2) #library for plotting
library(reshape2) # library for reshaping data (tall-narrow <-> short-wide)
library(deSolve) # library for solving differential equations
library(minpack.lm) # library for least squares fit using levenberg-marquart algorithm
# prediction of concentration
# rate function
rxnrate=function(t,c,parms){
# rate constant passed through a list called parms
k1=parms$k1
k2=parms$k2
k3=parms$k3
# c is the concentration of species
# derivatives dc/dt are computed below
r=rep(0,length(c))
r[1]=-k1*c["A"] #dcA/dt
r[2]=k1*c["A"]-k2*c["B"]+k3*c["C"] #dcB/dt
r[3]=k2*c["B"]-k3*c["C"] #dcC/dt
# the computed derivatives are returned as a list
# order of derivatives needs to be the same as the order of species in c
return(list(r))
}
My problem is that the initial condition of each state can also be treated as an estimated parameter. However, it does not work properly at the moment.
Below is my code:
# function that calculates residual sum of squares
ssq=function(myparms){
# inital concentration
cinit=c(A=myparms[4],B=0,C=0)
# time points for which conc is reported
# include the points where data is available
t=c(seq(0,5,0.1),df$time)
t=sort(unique(t))
# parms from the parameter estimation routine
k1=myparms[1]
k2=myparms[2]
k3=myparms[3]
# solve ODE for a given set of parameters
out=ode(y=cinit,times=t,func=rxnrate,parms=list(k1=k1,k2=k2,k3=k3))
# Filter data that contains time points where data is available
outdf=data.frame(out)
outdf=outdf[outdf$time %in% df$time,]
# Evaluate predicted vs experimental residual
preddf=melt(outdf,id.var="time",variable.name="species",value.name="conc")
expdf=melt(df,id.var="time",variable.name="species",value.name="conc")
ssqres=preddf$conc-expdf$conc
# return predicted vs experimental residual
return(ssqres)
}
# parameter fitting using levenberg marquart algorithm
# initial guess for parameters
myparms=c(k1=0.5,k2=0.5,k3=0.5,A=1)
# fitting
fitval=nls.lm(par=myparms,fn=ssq)
Once I run this, the following error comes up:
Error in chol.default(object$hessian) :
the leading minor of order 1 is not positive definite
The problem with your code is the following: in the line cinit=c(A=myparms[4],B=0,C=0) you gave A both the value of myparms[4] AND the name of myparms[4]. Let's see:
myparms=c(k1=0.5,k2=0.5,k3=0.5,A=1)
cinit=c(A=myparms[4],B=0,C=0)
print(cinit)
A.A B C
1 0 0
to solve this problem, you can do this:
myparms=c(k1=0.5,k2=0.5,k3=0.5,A=1)
cinit=c(A=unname(myparms[4]),B=0,C=0)
print(cinit)
A B C
1 0 0
or this:
myparms=c(k1=0.5,k2=0.5,k3=0.5,1)
cinit=c(A=unname(myparms[4]),B=0,C=0)
print(cinit)
A B C
1 0 0
Then your code will work!
Best regards,
J_F
Is it possible to change the estimator within a region to something other than the mean, like the median or geometric mean, using the rpart library in R (or another library)?
I believe my tree partitioning is highly affected by extreme values and I would like to build trees showing other estimators.
Thanks!
One of the usual tricks for right-skewed responses would be to take logs. In many applications this makes the response distribution more symmetric and then you don't need to switch from the usual mean predictions.
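For instance, a minimal sketch using the cars data that appears later in this answer (purely illustrative); note that back-transforming the mean of the logs amounts to a geometric-mean prediction:
library("rpart")
data("cars", package = "datasets")

# fit on the log scale, then exponentiate predictions back to the original scale
rp_log <- rpart(log(dist) ~ speed, data = cars)
exp(predict(rp_log, data.frame(speed = c(10, 15, 20))))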
Another solution for changing the learning of the tree would be to use more robust scores, e.g., ranks. The ctree() function from the partykit package offers a nonparametric inference framework for this.
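A minimal sketch with ctree(), again on the cars data and with default settings:
library("partykit")
data("cars", package = "datasets")

ct <- ctree(dist ~ speed, data = cars)
predict(ct, data.frame(speed = c(10, 15, 20)))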
Finally, the partykit package also allows you to compute predictions other than the mean from the terminal nodes. You can easily transform rpart trees to party trees via as.party(). A very simple example would be to learn an rpart tree for the cars data:
library("rpart")
data("cars", package = "datasets")
rp <- rpart(dist ~ speed, data = cars)
And then transform it to party:
library("partykit")
pr <- as.party(rp)
The tree structure remains unchanged, but you get enhanced plotting and predictions; the default plot method yields a display with boxplots of the response in the terminal nodes.
Furthermore, the default predictions on both objects are the same.
nd <- data.frame(speed = c(10, 15, 20))
predict(rp, nd)
## 1 2 3
## 18.20000 39.75000 65.26316
predict(pr, nd)
## 1 2 3
## 18.20000 39.75000 65.26316
However, the latter allows you to specify a FUN argument giving the function to be applied in each of the nodes. It must be of the form function(y, w), where y is the response and w are the case weights. As we haven't used any weights here, we can simply ignore that argument and do:
predict(pr, nd, FUN = function(y, w) mean(y))
## 1 2 3
## 18.20000 39.75000 65.26316
predict(pr, nd, FUN = function(y, w) median(y))
## 1 2 3
## 18 35 64
predict(pr, nd, FUN = function(y, w) quantile(y, 0.9))
## 1 2 3
## 28.0 57.0 92.2
And so on... See the package vignettes for more details.
I am using the mlogit package in R. I have converted my data from its original wide format to long format. Here is a sample of the converted data.frame, which I refer to as 'long_perp'. All of the independent variables are individual-specific. I have 4258 unique observations in the dataset.
date_id act2 grp.bin pdist ship sea avgknots shore day location chid alt
4.dive 40707_004 TRUE 2 2.250 second light 14.06809 2.30805 12 Lower 4 dive
4.fly 40707_004 FALSE 2 2.250 second light 14.06809 2.30805 12 Lower 4 fly
4.none 40707_004 FALSE 2 2.250 second light 14.06809 2.30805 12 Lower 4 none
5.dive 40707_006 FALSE 2 0.000 second light 15.12650 2.53312 12 Lower 5 dive
5.fly 40707_006 TRUE 2 0.000 second light 15.12650 2.53312 12 Lower 5 fly
5.none 40707_006 FALSE 2 0.000 second light 15.12650 2.53312 12 Lower 5 none
6.dive 40707_007 FALSE 1 1.995 second light 14.02101 2.01680 12 Lower 6 dive
6.fly 40707_007 TRUE 1 1.995 second light 14.02101 2.01680 12 Lower 6 fly
6.none 40707_007 FALSE 1 1.995 second light 14.02101 2.01680 12 Lower 6 none
'act2' is the dependent variable and consists of the choices a bird floating on the water could make when approached by a ship: fly, dive, or none. I am interested in how these probabilities relate to the remaining independent variables in the data.frame, i.e. perpendicular distance to the ship's path (pdist), sea conditions (sea), speed (avgknots), distance to shore (shore), etc. The independent variables are a mix of dichotomous, factor, and continuous variables.
I ran two multinomial logit models, one including all the choice options and another including only a subset. I then compared these models with the hmftest() function to test the IIA assumption. The results were confusing to say the least. I will include the code for the two models and the test output (in case I am misspecifying the models in the code).
# model including all choice options (fly, dive, none)
mod.1 <- mlogit(act2 ~ 1 | pdist + as.factor(grp.bin) +
as.factor(sea) + avgknots + shore + as.factor(location),long_perp ,
reflevel = 'none')
# model including only a subset of choice options (fly, dive)
mod.alt <- mlogit(act2 ~ 1 | pdist + as.factor(grp.bin) +
as.factor(sea) + avgknots + shore + as.factor(location),long_perp ,
reflevel = 'none', alt.subset = c("fly","dive"))
# IIA test
hmftest(mod.1, mod.alt)
# output
Hausman-McFadden test
data: long_perp
chisq = -968.7303, df = 7, p-value = 1
alternative hypothesis: IIA is rejected
As you can see, the chi-square statistic is negative! I assume I am either 1. doing something wrong, or 2. IIA is violated. This result holds for the choice subset (fly, dive), but the IIA assumption is upheld with the choice subset (none, dive)? This confuses me.
Next I tried to formulate a nested model as a way to relax the IIA assumption. I nested the choices as nest1 = none, nest2 = fly, dive. This makes sense to me as a logical break: the bird decides whether to react, then decides which reaction to make.
I am unclear on how to run the nested logit models (even after reading the two vignettes for mlogit, Croissant vignette and Train vignette).
When I run my analysis following the example in the Croissant vignette I get the following error.
nested.1 <- mlogit(act2 ~ 0 | pdist + as.factor(grp.bin) + as.factor(ship) +
as.factor(sea) + avgknots + shore + as.factor(location),
long_perp , reflevel="none",nests = list(noact = "none",
react = c("dive","fly")), unscaled = TRUE)
# Error in solve.default(crossprod(attr(x, "gradi")[, !fixed])) :
Lapack routine dgesv: system is exactly singular: U[19,19] = 0
I have read a bit about this error message and it may occur because of complete separation. I have looked at some tables of the data and do not believe this is happening as I have 4,000+ observations and only one factor variable with more than 2 levels (it has 3).
Help on these specific problems is greatly appreciated, but I am also open to alternative analyses that I can use to answer my question. I am mainly interested in the probability of flying as a function of perpendicular distance to the ship's path.
Thanks, Tim
To get a positive chi-sq, change the code as follows:
alt.subset = c("none", "fly")
That is, the reference level should be included in the subset too. It may help, though the p-value may not change much.
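Applied to the subset model from the question, that would be (a sketch; everything except alt.subset is unchanged):
mod.alt <- mlogit(act2 ~ 1 | pdist + as.factor(grp.bin) +
    as.factor(sea) + avgknots + shore + as.factor(location), long_perp,
    reflevel = "none", alt.subset = c("none", "fly"))
hmftest(mod.1, mod.alt)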