I have a large data set comprised of thousands of unique tests (TestNum) and their associated responses (Response), with responses varying in lengths across tests. Tests are dropped out based on some criteria, hence the lack of sequence in TestNum values. Here a simplified example:
dat <- data.frame(Response=c(rlnorm(10, 2.9, 0.3), rlnorm(14, 2.88, 0.38), rlnorm(19, 2.44, 0.08)),TestNum=rep(c(1,4,9), times=c(10,14,19)))
dat$TestNum<-factor(dat$TestNum)
dat
I am fitting a lnorm distribution to each TestNum and extracting coefficients
dat_fit1 <- with(dat,
by(dat[,1], TestNum, fitdist, "lnorm"))
dat_fit2 <-t(sapply(dat_fit1, coef))
I want to test other distributions, but would need the Goodness-of-fit statistics (gofstat; for example "chi"chisqpvalue", "cvm", "ad", "ks", "aic", "bic") from each fitted curve by TestNum. I can get the "aic" and "bic" with the code below, but not the rest of the statistics.
gof_f<-do.call(rbind, dat_fit1)
gof_f<-gof_f[,7:8]
Any suggestions would be appreciated!
dat.lnorm <- with(dat,
by(dat[,1], dat[,2],
function(x){
fit<-fitdist(x,"lnorm", method="mme")
coef_meanlog <-fit[[1]][[1]]
coef_sdlog <-fit[[1]][[2]]
ks <-ks.test(jitter(x),"plnorm", meanlog=coef_meanlog, sdlog=coef_sdlog)$p.value
ad <-ad.test(plnorm(x, meanlog=coef_meanlog, sdlog=coef_sdlog))$p.value
return(list(cbind(rbind(fit[7:8]), ks, ad)))
}))
Related
R function prcomp() for performing principal components analysis provides a handy summary() method which (as is expected) offers a summary of the results at a glance:
data(cars)
pca <- prcomp(cars)
summary(pca)
Importance of components:
PC1 PC2
Standard deviation 26.1252 3.08084
Proportion of Variance 0.9863 0.01372
Cumulative Proportion 0.9863 1.00000
I would like a similar summary for displaying the results of a principal coordinates analysis, usually performed using the cmdscale() function:
pco <- cmdscale(dist(cars), eig=TRUE)
But this function does not provide a summary() method and therefore there is no direct way of displaying the results as percent variances, etc.
Has anyone out there already developed some simple method for summarizing such results from PCO?
Thanks in advance for any assistance you can provide!
With best regards,
Instead of the basic cmdscale, one can use capscale or dbRda from package vegan instead. These functions generalize PCO as they "also perform unconstrained principal coordinates analysis, optionally using extended dissimilarities" (citation from ?capscale help page). This is much more than a workaround.
The following gives an example:
library(vegan)
A <- data.frame(
x = c(-2.48, -4.03, 1.15, 0.94, 5.33, 4.72),
y = c(-3.92, -3.4, 0.92, 0.78, 3.44, 0.5),
z = c(-1.11, -2.18, 0.21, 0.34, 1.74, 1.12)
)
pco <- capscale(dist(A) ~ 1) # uses euclidean distance
# or
pco <- capscale(A ~ 1) # with same result
pco # shows some main results
summary(pco) # shows more results, including eigenvalues
The summary function has several arguments, documented in ?summary.cca. Parts of the result can be extracted and formated by the user, the eigenvalues for example with pco$CA$eig.
we are trying to create a distribution that estimates pathogens presence on vegetables. This was done using different methods, each providing a distribution:
- method S (from sludge concentration) is best fitted by weibull(1.55, 8.57)
- method SO (from soil) is best fitted by logN(0.68, 0.63)
- method F (from field data) PERT(0.093, 0.34, 0.52)
Theoretically the 3 methods should estimate the same value. What would be the best way to combine them?
I have searched online but I could only find & understand how to do it using normal distributions. The posterior normal distribution would have a mean that is a weighted average (see page 3 on https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading15a.pdf)
How to update different types of distributions?
Thank you for your help.
library(mc2d)
soil.df <- matrix(data=0, nrow=10000, ncol=3)
colnames(soil.df) <- c("from sludge","soil sample","field data")
for (i in 1:10000) {
migration <- 0.27
application <- rpert(1,0.01,0.02,0.25)
C <- rweibull(1,1.57,85.79)
soil.df[i,1] <- C*application*migration ##from sludge
soil.df[i,2]<- 10^rnorm(1,0.68,0.63)*migration ## from soil concentration
soil.df[i,3] <- rpert(1,0.093, 0.34, 0.52) ##from field data
}
par(mfrow=c(1,1))
plot(density(soil.df[,1]), col="red", xlim=c(0,15), ylim=c(0,1), main="Ova/gr soil")
lines(density(soil.df[,2]), col="black")
lines(density(soil.df[,3]), col="green")
I have three factors: word, type and register. In SPSS, it is very easy to conduct a pairwise comparison (or simple comparison) in SPSS, the syntax is:
/EMMEANS=TABLES(word*register*type) COMPARE(type) ADJ (BONFERRONI)
And it will give me a result like this:
But how can I achieve this in R with Multcomp package? Say if I have a anova result like this:
library (multcomp)
anova <- aov (data, min~word*register*type)
summary (anova)
How can I do SPSS-like pairwise comparisons?
hsb2<-read.table("http://www.ats.ucla.edu/stat/data/hsb2.csv", sep=",", header=T)
attach(hsb2) # attach is a horrible practice, but just roll with it for this example
pairwise.t.test(write, ses, p.adj = "bonf")
That gives you the Bonferonni pairwise comparison that you see in SPSS.
This may help further and in general UCLA provides some good resources that relate commands in SAS, SPSS, Stata, Mplus and R:
http://statistics.ats.ucla.edu/stat/r/faq/posthoc.htm
http://www.ats.ucla.edu/stat/dae/
So, that's for pairwise comparisons. You can also use p.adjust with multiple comparisons (multi-way). See this manual page "Adjust P-values for Multiple Comparisons".
The example they give showing the different adjustments which can be made, including Bonferroni, is:
require(graphics)
set.seed(123)
x <- rnorm(50, mean = c(rep(0, 25), rep(3, 25)))
p <- 2*pnorm(sort(-abs(x)))
round(p, 3)
round(p.adjust(p), 3)
round(p.adjust(p, "BH"), 3)
## or all of them at once (dropping the "fdr" alias):
p.adjust.M <- p.adjust.methods[p.adjust.methods != "fdr"]
p.adj <- sapply(p.adjust.M, function(meth) p.adjust(p, meth))
p.adj.60 <- sapply(p.adjust.M, function(meth) p.adjust(p, meth, n = 60))
stopifnot(identical(p.adj[,"none"], p), p.adj <= p.adj.60)
round(p.adj, 3)
## or a bit nicer:
noquote(apply(p.adj, 2, format.pval, digits = 3))
## and a graphic:
matplot(p, p.adj, ylab="p.adjust(p, meth)", type = "l", asp = 1, lty = 1:6,
main = "P-value adjustments")
legend(0.7, 0.6, p.adjust.M, col = 1:6, lty = 1:6)
## Can work with NA's:
pN <- p; iN <- c(46, 47); pN[iN] <- NA
pN.a <- sapply(p.adjust.M, function(meth) p.adjust(pN, meth))
## The smallest 20 P-values all affected by the NA's :
round((pN.a / p.adj)[1:20, ] , 4)
This page also gives some nice pairwise comparison examples with ANOVA where they have mental, physical, and medical factors and adjust the p-values in various ways.
This is another excellent resource by Bretz.
Finally, here's a SO question on N-way ANOVA in R where they show how to do 3-way and up to 100-way.
I am new to R and I have a request that I am not sure is possible. We have a number of retail locations that my boss would like to use affinity propagation to group into clusters. We will not be clustering based on geographic location. Once he has found a configuration he likes, he wants to be able to input other locations to determine which of those set clusters they should fall into.
The only solution I have been able to come up with is to use the same options and re-cluster with the original points and the new ones added in, however I believe that this might change the outcome.
Am I understanding this right, or are there other options?
Sorry for the late answer, I just incidentally stumbled over your question.
I agree with Anony-Mousse's answer that clustering is the first step and classification is the second. However, I'm not sure whether this is the best option here. Elena601b is obviously talking about a task with truly spatial data, so my impression is that the best approach is to cluster first and then to "classify" new points/samples/locations by looking for the closest cluster exemplar. Here is some code for synthetic data:
## if not available, run the following first:
## install.packages("apcluster")
library(apcluster)
## create four synthetic 2D clusters
cl1 <- cbind(rnorm(30, 0.3, 0.05), rnorm(30, 0.7, 0.04))
cl2 <- cbind(rnorm(30, 0.7, 0.04), rnorm(30, 0.4, .05))
cl3 <- cbind(rnorm(20, 0.50, 0.03), rnorm(20, 0.72, 0.03))
cl4 <- cbind(rnorm(25, 0.50, 0.03), rnorm(25, 0.42, 0.04))
x <- rbind(cl1, cl2, cl3, cl4)
## run apcluster() (you may replace the Euclidean distance by a different
## distance, e.g. driving distance, driving time)
apres <- apcluster(negDistMat(r=2), x, q=0)
## create new samples
xNew <- cbind(rnorm(10, 0.3, 0.05), rnorm(10, 0.7, 0.04))
## auxiliary predict() function
predict.apcluster <- function(s, exemplars, newdata)
{
simMat <- s(rbind(exemplars, newdata),
sel=(1:nrow(newdata)) + nrow(exemplars))[1:nrow(exemplars), ]
unname(apply(simMat, 2, which.max))
}
## assign new data samples to exemplars
predict.apcluster(negDistMat(r=2), x[apres#exemplars, ], xNew)
## ... the result is a vector of indices to which exemplar/cluster each
## data sample is assigned
I will probably add such a predict() method in a future release of the package (I am the maintainer of the package). I hope that helps.
Clustering is not a drop-in replacement for classification.
Few clustering algorithms can meaningfully integrate new information.
The usual approach for your problem however is simple:
Do clustering.
use the cluster labels as class labels
train a classifier
apply the classifier to the new data
Ok straight to the question. I have a database with lots and lots of categorical variable.
Sample database with a few variables as below
gender <- as.factor(sample( letters[6:7], 100, replace=TRUE, prob=c(0.2, 0.8) ))
smoking <- as.factor(sample(c(0,1),size=100,replace=T,prob=c(0.6,0.4)))
alcohol <- as.factor(sample(c(0,1),size=100,replace=T,prob=c(0.3,0.7)))
htn <- as.factor(sample(c(0,1),size=100,replace=T,prob=c(0.2,0.8)))
tertile <- as.factor(sample(c(1,2,3),size=100,replace=T,prob=c(0.3,0.3,0.4)))
df <- as.data.frame(cbind(gender,smoking,alcohol,htn,tertile))
I want to test the hypothesis, using a chi square test, that there is a difference in the portion of smokers, alcohol use, hypertension (htn) etc by tertile (3 factors). I then want to extract the p values for each variable.
Now i know i can test each individual variable using a 2 by 3 cross tabulation but is there a more efficient code to derive the test statistic and p-value across all variables in one go and extract the p value across each variable
Thanks in advance
Anoop
If you want to do all the comparisons in one statement, you can do
mapply(function(x, y) chisq.test(x, y)$p.value, df[, -5], MoreArgs=list(df[,5]))
# gender smoking alcohol htn
# 0.4967724 0.8251178 0.5008898 0.3775083
Of course doing tests this way is somewhat statistically inefficient since you are doing multiple tests here so some correction is required to maintain an appropriate type 1 error rate.
You can run the following code chunk if you want to get the test result in details:
lapply(df[,-5], function(x) chisq.test(table(x,df$tertile), simulate.p.value = TRUE))
You can get just p-values:
lapply(df[,-5], function(x) chisq.test(table(x,df$tertile), simulate.p.value = TRUE)$p.value)
This is to get the p-values in the data frame:
data.frame(lapply(df[,-5], function(x) chisq.test(table(x,df$tertile), simulate.p.value = TRUE)$p.value))
Thanks to RPub for inspiring.
http://www.rpubs.com/kaz_yos/1204