Congruence among different values within samples - r

I'd like to test the congruence among different scores within each sampled site. These scores were calculated with five different methods of measuring species diversity (http://en.wikipedia.org/wiki/Diversity_index). For instance, if the value of index "a" is high, are the values of indices b, c, d, and e high as well? In this way, I'd like to quantify that congruence within each sampled site.
Could you suggest any method to test this congruence? I've tried calculating the coefficient of variation within each site, but that does not make sense to me because the indices are on different scales. I've provided an example of the dataset below.
Thank you in advance.
Sample data
df <- data.frame(a=rnorm(11, 5, 2),
                 b=rnorm(11, 1, 1),
                 c=rnorm(11, 2, 1),
                 d=rnorm(11, 0, 1),
                 e=rnorm(11, 3, 2))
rownames(df) <- paste("site", 1:11, sep="")
df

A classification tree would automatically optimize your congruence index. The rpart package in R offers the Gini index and the information index (I think that is the same as the entropy index). You would need to stack your data (using the reshape2 package here). In this example I assumed you were trying to classify species by the numeric observation and the site location.
Also, if you have a more statistics-inspired question with a bit of R, feel free to try https://stats.stackexchange.com/
require(rpart)
require(reshape2)
df$site = rownames(df)
stackDF = melt(df, variable.name="species", value.name="observation")
str(stackDF)
classTree <- rpart(species ~ observation + site,data=stackDF, parms=list(split="gini"))
# classTree <- rpart(species ~ site + observation,data=stackDF, parms=list(split="information"))
printcp(classTree)
table(actual=stackDF$species, predicted=predict(classTree,type="class"))
plot(classTree,compress=T,uniform=T,branch=0.4,margin=0.1)
text(classTree)
Roland makes a good suggestion to use principal components. You can use pck = princomp(stackDF[,-which(colnames(stackDF)=="species"),drop=F]) and then change the formula in your tree to be stackDF$species ~ pck +.... You can check the cross-validation with printcp and prune the tree with prune.
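A minimal sketch of one way to do that: compute the components on the wide numeric data and attach a score column to the stacked data (the column handling here is my own, not Roland's exact formula):
## principal components on the wide numeric data from the question
pck <- princomp(df[, c("a", "b", "c", "d", "e")])
summary(pck)  # variance explained by each component
## attach each site's first component score to the stacked data
stackDF$Comp.1 <- pck$scores[match(stackDF$site, rownames(df)), "Comp.1"]
classTree2 <- rpart(species ~ observation + Comp.1, data=stackDF,
                    parms=list(split="gini"))
printcp(classTree2)           # cross-validation table
# prune(classTree2, cp=...)   # prune at a chosen complexity parameter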
> table(actual=stackDF$species, predicted=predict(classTree,type="class"))
      predicted
actual  a  b  c  d  e
     a 10  1  0  0  0
     b  0  6  3  2  0
     c  0  1  9  1  0
     d  0  3  0  8  0
     e  9  0  0  2  0
Of course none of the classifications in the example make sense because they are random.

Clustering with Mclust results in an empty cluster

I am trying to cluster my empirical data using Mclust. When using the following, very simple code:
library(reshape2)
library(mclust)
data <- read.csv(file.choose(), header=TRUE, check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)
R gives me the following result:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust E (univariate, equal variance) model with 4 components:
log-likelihood n df BIC ICL
-20504.71 3258 8 -41074.13 -44326.69
Clustering table:
1 2 3 4
0 2271 896 91
Mixing probabilities:
1 2 3 4
0.2807685 0.4342499 0.2544305 0.0305511
Means:
1 2 3 4
1381.391 1381.715 1574.335 1851.667
Variances:
1 2 3 4
7466.189 7466.189 7466.189 7466.189
Edit: here is my data for download: https://www.file-upload.net/download-14320392/example.csv.html
I do not readily understand why Mclust gives me an empty cluster (0), especially with a mean nearly identical to that of the second cluster. This only appears when specifically looking for a univariate, equal-variance model. Using, for example, modelNames="V", or leaving it at the default, does not produce this problem.
This thread: Cluster contains no observations describes a similar problem, but if I understand correctly, that appeared to be due to randomly generated data?
I am somewhat clueless as to where my problem is or if I am missing anything obvious.
Any help is appreciated!
As you noted, the means of clusters 1 and 2 are extremely similar, and it so happens that there's quite a lot of data there (see the spike on the histogram):
set.seed(111)
data <- read.csv("example.csv", header=TRUE, check.names = FALSE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
hist(data$value,br=50)
abline(v=fit$parameters$mean,
col=c("#FF000080","#0000FF80","#BEBEBE80","#BEBEBE80"),lty=8)
Briefly, mclust (like any GMM) is a probabilistic model, which estimates the mean and variance of each cluster and also the probability of each point belonging to each cluster. This is unlike k-means, which gives a hard assignment. The model's likelihood is then built from each data point's weighted probabilities of belonging to the clusters; you can check it out in mclust's publication as well.
In this model, the means of clusters 1 and 2 are close, but their expected mixing proportions are different:
fit$parameters$pro
[1] 0.28565736 0.42933294 0.25445342 0.03055627
This means that a data point lying around the means of clusters 1 and 2 will consistently be assigned to cluster 2. For example, let's predict data points from 1350 to 1400:
head(predict(fit,1350:1400)$z)
1 2 3 4
[1,] 0.3947392 0.5923461 0.01291472 2.161694e-09
[2,] 0.3945941 0.5921579 0.01324800 2.301397e-09
[3,] 0.3944456 0.5919646 0.01358975 2.450108e-09
[4,] 0.3942937 0.5917661 0.01394020 2.608404e-09
[5,] 0.3941382 0.5915623 0.01429955 2.776902e-09
[6,] 0.3939790 0.5913529 0.01466803 2.956257e-09
The $classification is obtained by taking the column with the maximum probability. So, in the same example, everything is assigned to cluster 2:
head(predict(fit,1350:1400)$classification)
[1] 2 2 2 2 2 2
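A quick way to confirm this by hand:
## hand-rolled version of $classification: index of the max-probability column
apply(predict(fit, 1350:1400)$z, 1, which.max)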
To answer your question: no, you did not do anything wrong; it's a quirk of this implementation of GMM, at least. I would say it's a bit of overfitting, but you can basically keep only the clusters that actually have members.
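For instance, a small sketch of keeping only the occupied clusters (names are mine):
## clusters that actually received observations (here 2, 3 and 4)
occupied <- sort(unique(fit$classification))
fit$parameters$mean[occupied]  # means of the non-empty clusters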
If you use modelNames="V", I see the solution is equally problematic:
fitv <- Mclust(data$value, modelNames="V", G = 1:7)
plot(fitv,what="classification")
Using scikit-learn's GMM I don't see a similar issue. So if you need to use a Gaussian mixture with spherical (equal-variance) components, consider using a fuzzy k-means:
library(ClusterR)
# the original snippet omitted the fit itself; KMeans_rcpp with 3 clusters
# and fuzzy membership is an assumption on my part
fit_kmeans <- KMeans_rcpp(as.matrix(data$value), clusters = 3, fuzzy = TRUE)
plot(NULL, xlim=range(data$value), ylim=c(0,4), ylab="cluster", yaxt="n", xlab="values")
points(data$value, fit_kmeans$clusters, pch=19, cex=0.1, col=factor(fit_kmeans$clusters))
axis(2, 1:3, as.character(1:3))
If you don't need equal variance, you can use the GMM function in the ClusterR package too.
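A minimal sketch of that route (the choice of 3 components is my assumption):
## unequal-variance Gaussian mixture via ClusterR
gmm_fit <- GMM(as.matrix(data$value), gaussian_comps = 3)
gmm_fit$centroids             # component means
gmm_fit$covariance_matrices   # per-component variances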

How can I decide whether a series of 0-1 values is stochastically distributed?

I have a series composed of 0s and 1s, and the 0s show up without specific order (as far as I can tell). How can I decide whether the 0s are stochastically distributed?
Please find a toy sample below for reference:
library(magrittr)
# parentheses needed: %>% binds tighter than *, so the pipe must apply to runif(10)*10
s1 <- (runif(10)*10) %>% mod(10) %>% round(0) %>% `>`(5) %>% ifelse(1,0)
s2 <- c(0,0,1,0,1,1,1,0,1,0)
The runs test is what you want:
The Wald–Wolfowitz runs test (or simply runs test), named after statisticians Abraham Wald and Jacob Wolfowitz, is a non-parametric statistical test that checks a randomness hypothesis for a two-valued data sequence. More precisely, it can be used to test the hypothesis that the elements of the sequence are mutually independent.
It is implemented in the snpar package.
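For instance, a minimal sketch on the toy series above (assuming snpar's runs.test accepts a numeric 0/1 vector directly):
## runs test on the toy 0/1 series from the question
library(snpar)
s2 <- c(0,0,1,0,1,1,1,0,1,0)
runs.test(s2)  # a small p-value suggests the sequence is not random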
Are you looking for rbinom? This function simulates a Bernoulli process where each draw is a success (1) with some probability p and a failure (0) otherwise.
The usage of rbinom is rbinom(n, size, prob), where n is the number of random numbers to generate, size is the number of trials, and prob is the probability of getting a success. So to generate a bunch of binomial random numbers with equal probability of 1 or 0, use:
set.seed(100) # for reproducibility
rbinom(n = 10, size = 1, prob = 0.5)
[1] 0 0 1 0 0 0 1 0 1 0

McNemar test in R - sparse data

I'm attempting to run a good sized dataset through R, using the McNemar test to determine whether I have a difference in the proportion of objects detected by one method over another on paired samples. I've noticed that the test works fine when I have a 2x2 table of
  test1
   y  n
y 34  2
n 12 16
but if I try and run something more like:
34 0
12 0
it errors, telling me "'x' and 'y' must have the same number of levels (minimum 2)".
I should clarify that I've tried converting the wide data to a 2x2 matrix using the table function on my wide data set, where rather than appearing as above, it drops the final column, giving me:
  test1
   y
y 34
n 12
I've also run mcnemar.test using the factor object option, which gives me the same error, so I'm assuming it does something similar. I'm wondering whether there is either a way to force the table function to generate the 2nd column despite there being no observations which would fall under either of those categories, or whether there is a way to make the test overlook this missing data?
Perhaps there's a better way to do this, but you can force R to construct a sparse contingency table by ensuring that the tabulated factors have the same levels attribute and that there are exactly 2 distinct levels specified.
# Example data
x1 <- c(rep("y", 34), rep("n", 12))
x2 <- rep("n", 46)
# Set levels explicitly
x1 <- factor(x1, levels = c("y", "n"))
x2 <- factor(x2, levels = c("y", "n"))
table(x1, x2)
# x2
# x1 y n
# y 0 34
# n 0 12
mcnemar.test(table(x1, x2))
#
# McNemar's Chi-squared test with continuity correction
#
# data: table(x1, x2)
# McNemar's chi-squared = 32.0294, df = 1, p-value = 1.519e-08

Applying ezANOVA error work-around to Long Format data

I have a similar problem as described here:
https://stats.stackexchange.com/questions/58435/repeated-measures-error-in-r-ezanova-using-more-levels-than-subjects-balanced-d
Here is an example of what my dataframe looks like:
Participant Visual Audio StimCondition Accuracy
1 Bottom Circle 1st 2 Central Beeps AO2 0.92
1 SIM Circle Left Beep AO2 0.86
2 Bottom Circle 1st 2 Central Beeps CT4 0.12
2 SIM Circle Left Beep CT4 0.56
I have 3 Visual conditions, 5 Audio conditions & 5 StimConditions & 12 participants exposed to all conditions.
When I run the following ezANOVA:
Model <- ezANOVA(data = Shaped.means, dv = .(Accuracy), wid = .(Participant), within = .(Visual, Audio, StimCondition), type = 3, detailed = TRUE)
I get the same error as the linked question above. I have tried changing type to 1, and that does return output, but without the sphericity tests.
I've tried to apply the solution to the linked question to my dataset but as mine is in Long Format I'm a bit lost as to what exactly I need to do to achieve the desired stats.
I'll keep playing with it on my end, but if anyone could help in the meantime it would be much appreciated.
Thanks.
Following the linked question, you don't have to change much. Assuming your dataset is exactly as you describe, the following should work for you.
Let's first create a dataset to reflect your description
set.seed(123) ## make reproducible
N <- 12 ## number of Participants
S <- 5 ## number of StimCondition groups
V <- 3 ## number of Visual groups
A <- 5 ## number of Audio groups
Accuracy <- abs(round(runif(N*V*S*A), 2)) ## one accuracy value per Participant x condition combination
init.Df <- expand.grid(Participant=gl(N,1),
                       Visual=gl(V, 1),
                       Audio=gl(A, 1),
                       StimCondition=gl(S,1))
df <- cbind(init.Df, Accuracy)
Now we have a dataframe with 3 Visual conditions, 5 Audio conditions & 5 StimConditions & 12 participants exposed to all conditions. This should be at the stage you are currently at. We can do the between-subjects call easily.
# If you just read in the data set and don't know how many subjects
# N <- length(unique(df$Participant))
fit <- lm(matrix(df[,c("Accuracy")], nrow=N) ~ 1)
For the factor component, this is the only real change. If you simply generate your model design, you can pass it to anova.
library(car)
# You can create your within design table
# You can get these values from your dataset as well
# V <- nlevels(df$Visual)
# A <- nlevels(df$Audio)
# S <- nlevels(df$StimCondition)
# If you want the labels with gl, you can use the levels function (e.g. labels=levels(df$Visual))
inDf <- expand.grid(Visual=gl(V, 1),
                    Audio=gl(A, 1),
                    StimCondition=gl(S,1))
# Test for Visual
anova(fit, M=~Visual, X=~1, idata=inDf, test="Spherical")
# Test for Audio
anova(fit, M=~Visual+Audio, X=~Visual, idata=inDf, test="Spherical")
# Test for Visual:Audio interaction
anova(fit, M=~Visual+Audio+Visual:Audio, X=~Visual+Audio, idata=inDf, test="Spherical")
#etc...
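# For instance, the StimCondition main effect would follow the same
# nesting of M and X as the calls above (a sketch):
anova(fit, M=~Visual+Audio+StimCondition, X=~Visual+Audio, idata=inDf, test="Spherical")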

Gompertz Aging analysis in R

I have survival data from an experiment in flies that examines rates of aging in various genotypes. The data are available to me in several layouts, so the choice is up to you; use whichever suits the answer best.
One dataframe (wide.df) looks like this, where each genotype (Exp, of which there are ~640) has a row, and the days run in sequence horizontally from day 4 to day 98, with counts of new deaths every two days.
Exp Day4 Day6 Day8 Day10 Day12 Day14 ...
A 0 0 0 2 3 1 ...
I make the example using this:
wide.df2<-data.frame("A",0,0,0,2,3,1,3,4,5,3,4,7,8,2,10,1,2)
colnames(wide.df2)<-c("Exp","Day4","Day6","Day8","Day10","Day12","Day14","Day16","Day18","Day20","Day22","Day24","Day26","Day28","Day30","Day32","Day34","Day36")
Another version is like this, where each day has a row for each 'Exp' and the number of deaths on that day are recorded.
Exp Deaths Day
A 0 4
A 0 6
A 0 8
A 2 10
A 3 12
.. .. ..
To make this example:
df2<-data.frame(c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"),c(0,0,0,2,3,1,3,4,5,3,4,7,8,2,10,1,2),c(4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36))
colnames(df2)<-c("Exp","Deaths","Day")
What I would like to do is perform a Gompertz Analysis (See second paragraph of "the life table" here). The equation is:
μ_x = α * e^{β * x}
where μ_x is the probability of death at a given time, α is the initial mortality rate, and β is the rate of aging.
I would like to be able to get a dataframe which has α and β estimates for each of my ~640 genotypes for further analysis later.
I need help going from the above dataframes to an output of these values for each of my genotypes in R.
I have looked through the package flexsurv which may house the answer but I have failed in attempts to find and implement it.
This should get you started...
Firstly, for the flexsurvreg function to work, you need to specify your input data as a Surv object (from package:survival). This means one row per observation.
The first thing is to re-create the 'raw' data from the summary tables you provide.
(I know rbind is not efficient, but you can always switch to data.table for large sets).
### get rows with >1 death
df3 <- df2[df2$Deaths>1, 2:3]
### expand to give one row per death per time
df3 <- sapply(df3, FUN=function(x) rep(df3[, 2], df3[, 1]))
### each death is 1 (occurs once)
df3[, 1] <- 1
### add this to the rows with <=1 death
df3 <- rbind(df3, df2[!df2$Deaths>1, 2:3])
### convert to Surv object
library(survival)
s1 <- with(df3, Surv(Day, Deaths))
### get parameters for Gompertz distribution
library(flexsurv)
f1 <- flexsurvreg(s1 ~ 1, dist="gompertz")
giving
> f1$res
est L95% U95%
shape 0.165351912 0.1281016481 0.202602176
rate 0.001767956 0.0006902161 0.004528537
Note that this is an intercept-only model as all your genotypes are A.
You can loop this over multiple survival objects once you have re-created the per-observation data as above.
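For example, a sketch of that loop, assuming the expanded per-observation data for every genotype are stacked in a long data frame df.all with columns Exp, Deaths, Day (the names here are mine):
## fit one Gompertz model per genotype and collect the parameters
fits <- lapply(split(df.all, df.all$Exp), function(d)
  flexsurvreg(Surv(Day, Deaths) ~ 1, data = d, dist = "gompertz")$res)
params <- data.frame(Exp   = names(fits),
                     alpha = sapply(fits, function(r) r["rate",  "est"]),
                     beta  = sapply(fits, function(r) r["shape", "est"]))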
From the flexsurv docs:
Gompertz distribution with shape parameter a and rate parameter
b has hazard function
h(x; a, b) = b * e^{a * x}
So it appears your alpha is b, the rate, and beta is a, the shape.
