I have binary (1 or 0) time series of an event and I want to calculate its ACF. The problem is that I need to split the TS into clusters according to their duration and to calculate ACF of each subset.
Let me show you an example:
TS : (1,1,1,0,0,1,1,0,0,0,1)
I'd like to have an ACF that is a sum of :
ACF of cluster 1 : (1,1,1,0,0,0,0,0,0,0,0)
ACF of cluster 2 : (1,1,0,0,0,0,0,0,0,0,0)
ACF of cluster 3 : (1,0,0,0,0,0,0,0,0,0,0)
and then average these 3 vectors to get the result I need. The number of clusters is arbitrary, and the lengths of the time series vary between roughly 1k and 10k observations.
It's not clear to me at all what you're trying to do.
In agreement with @OttoKässi, I don't understand the logic behind the subsets. Why three? Why those three? What is the (mathematical) rationale for constructing those subsets?
More fundamentally, averaging correlation coefficients makes little sense to me. In autocorrelation, you calculate Pearson's product-moment correlation coefficients of the vector with different lagged versions of that same vector. Then you want to do that for three different (orthogonal) vectors, and average the coefficients? Why? That makes no statistical sense to me.
That aside, to calculate the autocorrelation for the three vectors you can do the following:
# Your sample vectors
v <- list(
v1 = c(1,1,1,0,0,0,0,0,0,0,0),
v2 = c(1,1,0,0,0,0,0,0,0,0,0),
v3 = c(1,0,0,0,0,0,0,0,0,0,0));
# Calculate acf for lag = 0 ... 10 and store as columns in dataframe
# The rows correspond to lag = 0 ... 10
acf <- as.data.frame(lapply(v, function(x) as.numeric(acf(x, plot = FALSE)$acf)));
acf;
# v1 v2 v3
#1 1.00000000 1.00000000 1.000000000
#2 0.63257576 0.47979798 -0.009090909
#3 0.26515152 -0.04040404 -0.018181818
#4 -0.10227273 -0.06060606 -0.027272727
#5 -0.13636364 -0.08080808 -0.036363636
#6 -0.17045455 -0.10101010 -0.045454545
#7 -0.20454545 -0.12121212 -0.054545455
#8 -0.23863636 -0.14141414 -0.063636364
#9 -0.27272727 -0.16161616 -0.072727273
#10 -0.18181818 -0.18181818 -0.081818182
#11 -0.09090909 -0.09090909 -0.090909091
If you now insist, you could calculate average correlation coefficients for different lags by taking the row averages. Mind you, I don't see how this makes statistical sense though.
rowMeans(acf);
#[1] 1.00000000 0.36776094 0.06885522 -0.06338384 -0.08451178 -0.10563973
#[7] -0.12676768 -0.14789562 -0.16902357 -0.14848485 -0.09090909
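Since the real series run to 1k–10k observations, you would not want to type those subset vectors by hand. Here is a minimal sketch (assuming a "cluster" means a maximal run of consecutive 1s, zero-padded back to the original length) of extracting them with rle():
ts <- c(1,1,1,0,0,1,1,0,0,0,1)
# run-length encode the series and keep the lengths of the runs of 1s
run_lengths <- with(rle(ts), lengths[values == 1])   # 3, 2, 1
# one zero-padded indicator vector per run, matching the question's example
clusters <- lapply(run_lengths, function(len) c(rep(1, len), rep(0, length(ts) - len)))
These vectors can then be passed to the acf() call above in place of the hand-typed list v.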
I am trying to cluster my empirical data using Mclust. When using the following, very simple code:
library(reshape2)
library(mclust)
data <- read.csv(file.choose(), header=TRUE, check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)
R gives me the following result:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust E (univariate, equal variance) model with 4 components:
log-likelihood n df BIC ICL
-20504.71 3258 8 -41074.13 -44326.69
Clustering table:
1 2 3 4
0 2271 896 91
Mixing probabilities:
1 2 3 4
0.2807685 0.4342499 0.2544305 0.0305511
Means:
1 2 3 4
1381.391 1381.715 1574.335 1851.667
Variances:
1 2 3 4
7466.189 7466.189 7466.189 7466.189
Edit: Here is my data for download: https://www.file-upload.net/download-14320392/example.csv.html
I do not readily understand why Mclust gives me an empty cluster (0 observations), especially one whose mean is nearly identical to that of the second cluster. This only appears when specifically looking for a univariate, equal-variance model. Using, for example, modelNames="V", or leaving it at the default, does not produce this problem.
This thread: Cluster contains no observations has a similar problem, but if I understand correctly, that appeared to be due to randomly generated data?
I am somewhat clueless as to where my problem is or if I am missing anything obvious.
Any help is appreciated!
As you noted, the means of clusters 1 and 2 are extremely similar, and it so happens that there is quite a lot of data there (see the spike in the histogram):
library(mclust)
set.seed(111)
data <- read.csv("example.csv", header=TRUE, check.names = FALSE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
hist(data$value, br=50)
abline(v=fit$parameters$mean,
       col=c("#FF000080","#0000FF80","#BEBEBE80","#BEBEBE80"), lty=8)
Briefly, mclust (or any GMM) is a probabilistic model: it estimates the mean and variance of each cluster and also the probability of each point belonging to each cluster, unlike k-means, which gives a hard assignment. The likelihood of the model then combines, for every data point, its probabilities under the clusters; you can also check this in mclust's publication.
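A rough sketch of that computation (using the fit and data objects above; for modelNames = "E" the fit stores a single shared variance):
pro  <- fit$parameters$pro
mu   <- fit$parameters$mean
sdev <- sqrt(fit$parameters$variance$sigmasq)   # one variance for the "E" model
# mixture density of each point, then the model log-likelihood
mix_dens <- rowSums(sapply(seq_along(pro), function(k) pro[k] * dnorm(data$value, mu[k], sdev)))
sum(log(mix_dens))   # should be close to fit$loglik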
In this model, the means of cluster 1 and cluster 2 are close to each other, but their expected proportions are different:
fit$parameters$pro
[1] 0.28565736 0.42933294 0.25445342 0.03055627
This means that if you have a data point around the means of clusters 1 or 2, it will consistently be assigned to cluster 2. For example, let's predict data points from 1350 to 1400:
head(predict(fit,1350:1400)$z)
1 2 3 4
[1,] 0.3947392 0.5923461 0.01291472 2.161694e-09
[2,] 0.3945941 0.5921579 0.01324800 2.301397e-09
[3,] 0.3944456 0.5919646 0.01358975 2.450108e-09
[4,] 0.3942937 0.5917661 0.01394020 2.608404e-09
[5,] 0.3941382 0.5915623 0.01429955 2.776902e-09
[6,] 0.3939790 0.5913529 0.01466803 2.956257e-09
The $classification is obtained by taking the column with the maximum probability. So, in the same example, everything is assigned to cluster 2:
head(predict(fit,1350:1400)$classification)
[1] 2 2 2 2 2 2
To answer your question: no, you did not do anything wrong; it is a pitfall of, at least, this implementation of GMM. I would say it is a bit of overfitting, but you can basically keep only the clusters that actually have members.
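For instance, a small sketch (using the fit object above) of keeping only the components that actually receive observations:
occupied <- sort(unique(fit$classification))   # components with at least one member
fit$parameters$mean[occupied]                  # means of the non-empty clusters
fit$parameters$pro[occupied]                   # and their mixing proportions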
If you use modelNames="V", I see the solution is equally problematic:
fitv <- Mclust(data$value, modelNames="V", G = 1:7)
plot(fitv, what="classification")
Using scikit-learn's GMM I don't see a similar issue. So if you need to use a Gaussian mixture with spherical (equal-variance) components, consider using a fuzzy k-means:
library(ClusterR)
# the k-means fit itself was missing from the original snippet; assuming a fuzzy k-means via KMeans_rcpp
fit_kmeans <- KMeans_rcpp(as.matrix(data$value), clusters = 3, fuzzy = TRUE)
plot(NULL, xlim=range(data$value), ylim=c(0,4), ylab="cluster", yaxt="n", xlab="values")
points(data$value, fit_kmeans$clusters, pch=19, cex=0.1, col=factor(fit_kmeans$clusters))
axis(2, 1:3, as.character(1:3))
If you don't need equal variance, you can use the GMM function in the ClusterR package too.
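A minimal sketch of that route (the number of components and the iteration setting are my assumptions, not taken from the question):
library(ClusterR)
gmm_fit <- GMM(as.matrix(data$value), gaussian_comps = 3, em_iter = 10)
gmm_fit$centroids              # component means
gmm_fit$covariance_matrices    # component variances (not forced to be equal)
# responsibilities and hard labels for the same data
pr <- predict_GMM(as.matrix(data$value), gmm_fit$centroids, gmm_fit$covariance_matrices, gmm_fit$weights)
table(pr$cluster_labels)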
How do I determine the optimal number of clusters when using hierarchical clustering? I only have the distance matrix, since I am measuring just pairwise (Levenshtein) distances; how do I find the optimal number of clusters in that case? The posts I referred to all use k-means or hierarchical clustering on numeric data, not string data like that shown below. Any suggestions on how to use R to find the number of clusters?
set.seed(1)
rstr <- function(n,k){ # vector of n random char(k) strings
sapply(1:n,function(i) {do.call(paste0,as.list(sample(letters,k,replace=T)))})
}
str<- c(paste0("aa",rstr(10,3)),paste0("bb",rstr(10,3)),paste0("cc",rstr(10,3)))
# Levenshtein Distance
d <- adist(str)
rownames(d) <- str
hc <- hclust(as.dist(d))
Several statistics can be used.
Look, for example, at the WeightedCluster package, which can compute and plot a series of such statistics.
To illustrate, you get the optimal number of groups for each available statistic as follows:
library("WeightedCluster")
hcRange <- as.clustrange(hc, diss=as.dist(d), ncluster=6)
summary(hcRange)
## 1. N groups 1. stat
## PBC 3 0.8799136
## HG 3 1.0000000
## HGSD 3 0.9987651
## ASW 3 0.4136550
## ASWw 3 0.4722895
## CH 3 8.3605263
## R2 6 0.4734561
## CHsq 3 20.6538462
## R2sq 6 0.6735039
## HC 3 0.0000000
You can also plot the statistics (here we show the average silhouette width ASWw, Huber's Gamma HG, and the point-biserial correlation PBC) for all the computed solutions:
plot(hcRange, stat = c("ASWw", "HG", "PBC"), lwd = 2)
The best solution seems to be the three-group solution.
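Once you settle on three groups, the actual cluster membership comes from cutree() on the same hclust object:
clusters <- cutree(hc, k = 3)
table(clusters)                                     # size of each of the three groups
head(data.frame(string = str, cluster = clusters))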
I'm a beginner in R, and am working through data where I need to randomly sample from a von Mises distribution 10 times per row. I have already calculated the concentration parameter (kappa) of my data, and am using rvm() from the package CircStats to generate random samples. For each real observation, I have a von Mises mean (in degrees, "Example" column below):
Obs Example
1 1 69.43064
2 2 -41.80749
3 3 133.83900
4 4 -12.82486
5 5 -137.57358
6 6 -19.27882
Therefore if I were to calculate a random sample of 10 from a von Mises distribution with a concentration parameter (kappa) of .44, my code for the first observation would look like:
rvm(10,rad(69.43064),.44)
[1] 0.7695183 5.9182905 2.6174674 5.6028430 2.4213428 5.4660423 6.1753582
[8] 2.6910969 4.2964024 5.4806146
I want to refer to the observed mean in the data, so:
rvm(10,rad(BearEx$Example), .44)
I'm looking to calculate 10 random values per observation. My ideal output would look like:
Obs Random
1 0.7695183
1 5.9182905
1 2.6174674
1 5.602843
1 2.4213428
1 5.4660423
1 6.1753582
1 2.6910969
1 4.2964024
1 5.4806146
And so on with each observation. I feel like this is a pretty basic problem, but I'm having trouble coding that loop with the observation number.
Thanks for your time!
If you have a data frame of the observations and their means, for example:
obs_plus_mean = data.frame(obs = 1:6, mean = c(69.43064, -41.80749, 133.83900, -12.82486, -137.57358, -19.27882))
Then, for a somewhat robust solution, you can start by making a function that takes the sample mean, the sample size n, and kappa as input and generates a sample of size n for each observation. For convenience you can even set your default values for kappa and the sample size.
library(CircStats)

von_mis_sample <- function(mean, size = 10, kappa = .44) {
  rvm(size, rad(mean), kappa)
}
Then you can compute your samples with the call
samples <- sapply(obs_plus_mean$mean, von_mis_sample)
(In your case I think you want BearEx$Example instead of obs_plus_mean$mean here)
That should work, please let me know otherwise.
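If you also want the long Obs/Random layout from the question, here is a small sketch (assuming the samples matrix produced by the sapply call above, with one column per observation):
samples_long <- data.frame(
  Obs    = rep(obs_plus_mean$obs, each = nrow(samples)),
  Random = as.vector(samples))   # columns (observations) are stacked in order
head(samples_long, 10)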
I have survival data from an experiment in flies which examines rates of aging in various genotypes. The data is available to me in several layouts so the choice of which is up to you, whichever suits the answer best.
One dataframe (wide.df) looks like this, where each genotype (Exp, of which there are ~640) has a row, and the days run in sequence horizontally from day 4 to day 98, with counts of new deaths every two days.
Exp Day4 Day6 Day8 Day10 Day12 Day14 ...
A 0 0 0 2 3 1 ...
I make the example using this:
wide.df2<-data.frame("A",0,0,0,2,3,1,3,4,5,3,4,7,8,2,10,1,2)
colnames(wide.df2)<-c("Exp","Day4","Day6","Day8","Day10","Day12","Day14","Day16","Day18","Day20","Day22","Day24","Day26","Day28","Day30","Day32","Day34","Day36")
Another version is like this, where each day has a row for each 'Exp' and the number of deaths on that day are recorded.
Exp Deaths Day
A 0 4
A 0 6
A 0 8
A 2 10
A 3 12
.. .. ..
To make this example:
df2<-data.frame(c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"),c(0,0,0,2,3,1,3,4,5,3,4,7,8,2,10,1,2),c(4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36))
colnames(df2)<-c("Exp","Deaths","Day")
What I would like to do is perform a Gompertz Analysis (See second paragraph of "the life table" here). The equation is:
μ_x = α·e^{βx}
where μ_x is the probability of death at a given time x, α is the initial mortality rate, and β is the rate of aging.
I would like to be able to get a dataframe which has α and β estimates for each of my ~640 genotypes for further analysis later.
I need help going from the above dataframes to an output of these values for each of my genotypes in R.
I have looked through the package flexsurv which may house the answer but I have failed in attempts to find and implement it.
This should get you started...
Firstly, for the flexsurvreg function to work, you need to specify your input data as a Surv object (from package:survival). This means one row per observation.
The first thing is to re-create the 'raw' data from the summary tables you provide.
(I know rbind is not efficient, but you can always switch to data.table for large sets).
### get rows with >1 death
df3 <- df2[df2$Deaths>1, 2:3]
### expand to give one row per death per time
df3 <- sapply(df3, FUN=function(x) rep(df3[, 2], df3[, 1]))
### each death is 1 (occurs once)
df3[, 1] <- 1
### add this to the rows with <=1 death
df3 <- rbind(df3, df2[!df2$Deaths>1, 2:3])
### convert to Surv object
library(survival)
s1 <- with(df3, Surv(Day, Deaths))
### get parameters for Gompertz distribution
library(flexsurv)
f1 <- flexsurvreg(s1 ~ 1, dist="gompertz")
giving
> f1$res
est L95% U95%
shape 0.165351912 0.1281016481 0.202602176
rate 0.001767956 0.0006902161 0.004528537
Note that this is an intercept-only model as all your genotypes are A.
You can loop this over multiple survival objects once you have re-created the per-observation data as above.
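As a sketch of that loop (assuming a long per-observation data frame, here called df_long, with columns Exp, Day and Deaths built as above for all genotypes):
library(survival)
library(flexsurv)
# fit one intercept-only Gompertz model per genotype and collect the parameters
fit_one <- function(d) {
  f <- flexsurvreg(Surv(Day, Deaths) ~ 1, data = d, dist = "gompertz")
  c(alpha = unname(f$res["rate",  "est"]),   # initial mortality rate
    beta  = unname(f$res["shape", "est"]))   # rate of aging
}
params <- do.call(rbind, lapply(split(df_long, df_long$Exp), fit_one))
head(params)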
From the flexsurv docs:
The Gompertz distribution with shape parameter a and rate parameter b has hazard function
h(x; a, b) = b·e^{ax}
So it appears your alpha is b, the rate, and beta is a, the shape.
I am trying to obtain survival estimates for different people at a specific time.
My code is as follows:
s = Surv(outcome.[,1], outcome.[,2])
survplot= (survfit(s ~ person.list[,1]))
plot(survplot, mark.time=FALSE)
summary(survplot[1], times=4)[1]
This code creates the survival object, creates a survival curve for each of the 11 people, plots each of the curves, and, with the summary function, obtains the survival estimate for person 1 at time = 4.
I am trying to create a list of the survival estimates for each person at a specified time (time = 4).
Any help would be appreciated.
Thanks,
Matt
If all that you say is true, then this is a typical way of generating a list using indices as arguments:
t4list <- lapply(1:11, function(x) summary(survplot[x], times=4)[1] )
t4list
If you really meant that you wanted a vector of survival estimates at that time, then sapply would attempt to simplify the result to an atomic form such as a numeric vector, or a matrix in the case where the results were "multidimensional". I would have thought that you could have gotten a useful result with just:
summary(survplot, times=4)[1]
That should have succeeded in giving you a vector of survival estimates at that time (assuming such estimates exist). If you get too greedy and push the 'times' value out past where there are estimates, you will get an error. Ironically, that error is not thrown if there is at least one time at which all levels of the covariates have an estimate. Using the example in the help page as a starting point:
fit <- survfit(Surv(time, status) ~ x, data = aml)
summary(fit, times=c(10, 20, 60) )[1]
#$surv
#[1] 0.9090909 0.7159091 0.1840909 0.6666667 0.5833333
# not very informative about which times and covariates were estimated
# and which are missing
# this is more informative
as.data.frame( summary(fit, times=c(10, 20, 60) )[c("surv", "time", "strata")])
surv time strata
1 0.9090909 10 x=Maintained
2 0.7159091 20 x=Maintained
3 0.1840909 60 x=Maintained
4 0.6666667 10 x=Nonmaintained
5 0.5833333 20 x=Nonmaintained
Whereas if you just use 60 you get an error message:
> summary(fit, times=c( 60) )[1]
Error in factor(rep(1:nstrat, scount), labels = names(fit$strata)) :
invalid labels; length 2 should be 1 or 1
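One way around that error (my assumption, not part of the original answer) is the extend argument of summary.survfit, which reports the requested times even when a stratum has no subjects left:
# report survival at time 60 for every stratum, extending past the last follow-up
summary(fit, times = 60, extend = TRUE)[c("surv", "time", "strata")]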