The R function prcomp() for performing principal components analysis provides a handy summary() method which, as expected, offers a summary of the results at a glance:
data(cars)
pca <- prcomp(cars)
summary(pca)
Importance of components:
                           PC1     PC2
Standard deviation     26.1252 3.08084
Proportion of Variance  0.9863 0.01372
Cumulative Proportion   0.9863 1.00000
I would like a similar summary for displaying the results of a principal coordinates analysis, usually performed using the cmdscale() function:
pco <- cmdscale(dist(cars), eig=TRUE)
But this function does not provide a summary() method and therefore there is no direct way of displaying the results as percent variances, etc.
Has anyone out there already developed some simple method for summarizing such results from PCO?
Thanks in advance for any assistance you can provide!
With best regards,
Instead of the basic cmdscale, one can use capscale or dbrda from the package vegan. These functions generalize PCO, as they "also perform unconstrained principal coordinates analysis, optionally using extended dissimilarities" (quoted from the ?capscale help page). This is much more than a workaround.
The following gives an example:
library(vegan)
A <- data.frame(
x = c(-2.48, -4.03, 1.15, 0.94, 5.33, 4.72),
y = c(-3.92, -3.4, 0.92, 0.78, 3.44, 0.5),
z = c(-1.11, -2.18, 0.21, 0.34, 1.74, 1.12)
)
pco <- capscale(dist(A) ~ 1) # uses euclidean distance
# or
pco <- capscale(A ~ 1) # with same result
pco # shows some main results
summary(pco) # shows more results, including eigenvalues
The summary function has several arguments, documented in ?summary.cca. Parts of the result can be extracted and formatted by the user, for example the eigenvalues with pco$CA$eig.
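If installing vegan is not an option, a prcomp-like table can also be computed directly from the eigenvalues that cmdscale(..., eig=TRUE) returns. The following is only a minimal sketch: the helper name summary_pco and its tol argument are made up for illustration, and it assumes Euclidean distances, so that non-positive eigenvalues are just numerical noise and can be dropped.

```r
# Hypothetical helper (not part of base R or vegan): summarise a
# cmdscale(..., eig = TRUE) result in the style of summary(prcomp(...)).
summary_pco <- function(pco, tol = 1e-8) {
  n <- nrow(pco$points)            # number of observations
  eig <- pco$eig
  eig <- eig[eig > tol * max(eig)] # keep clearly positive eigenvalues only
  prop <- eig / sum(eig)
  out <- rbind(
    "Standard deviation"     = sqrt(eig / (n - 1)),
    "Proportion of Variance" = prop,
    "Cumulative Proportion"  = cumsum(prop)
  )
  colnames(out) <- paste0("PCO", seq_along(eig))
  out
}

res <- summary_pco(cmdscale(dist(cars), eig = TRUE))
print(round(res, 5))
```

With Euclidean distances, PCO and PCA are equivalent, so the numbers should agree with summary(prcomp(cars)). For non-Euclidean dissimilarities negative eigenvalues can be substantial, and simply dropping them is a simplification.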
A previous employee from my organization performed all of their analyses on a different statistical program than R (with no documentation), and no one currently employed knows which program was used. Looking at the model output table and comparing it to Google search results, I think they used Statistica. In an effort to be transparent with other organizations who work with us, I'm trying to replicate their work and potentially reevaluate it.
Model: They modeled the relationship between three variables which I will call A, B, C. Variables were chosen based on exploratory analyses (i.e., correlation matrices and GLM modeling). Parameter estimates are used for prediction purposes. From what I can tell, they used a GLM with a log-link function to model C as a function of A and B.
Data:
A <- c(0.937918714, 1.277501774, 34.46428571, 3.843879361, 5.135520685, 0.324675325, 1.038421599, 0.333333333, 0.058139535, 0.09009009, 0.080515298, 5.174234424, 10.625, 21.9047619, 0.162074554, 2.372881356, 1.084430674, 18.53658537, 6.438631791, 0.172413793, 0.291120815, 9.090909091, 5.882352941)
B <- c(0.416666667, 0.555555556, 0.833333333, 0.4, 0.833333333, 0.4, 0.625, 0.625, 0.294117647, 0.37037037, 0.285714286, 1.111111111, 0.588235294, 0.476190476, 0.555555556, 0.833333333, 0.666666667, 0.476190476, 0.208333333, 0.163934426, 0.163934426, 0.3125, 0.454545455)
C <- c(0.009533367, 0.020812183, 0.056208054, 0.015002587, 0.042735043, 0.013661202, 0.004377736, 0.00635324, 0.001345895, 0.001940492, 0.00446144, 0.043768997, 0.021134594, 0.004471772, 0.023488256, 0.029441118, 0.052287582, 0.003526093, 0.030984508, 0.010891089, 0.020812686, 0.016032064, 0.018145161)
My Approach:
I combined each vector into a data frame (dat) and modeled using the following:
glm(formula = C ~ A + B, family = binomial(link = logit), data = dat)
The Question:
I notice we have different parameter estimates; in fact, their analysis includes 'Scale' as a factor, with an associated parameter estimate and standard error (see below). I haven't figured out how to include a separate 'Scale' factor. My parameter estimates are close to theirs, but obviously differ because of the additional term.
Anyone familiar with this [Statistica] output and how I could replicate it in R? Primarily, how would I incorporate the Scale factor into my analyses?
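One thing that may be worth checking: the binomial family fixes the dispersion at 1, whereas families such as Gamma or gaussian estimate a dispersion (scale) parameter, which R reports via summary(). The sketch below uses simulated data and a Gamma family with a log link; both the data and the choice of family are assumptions for illustration, and how Statistica defines its 'Scale' row is not confirmed here.

```r
# Simulated stand-in data (NOT the data from the question)
set.seed(123)
n <- 23
sim <- data.frame(A = runif(n, 0, 30), B = runif(n, 0.2, 1.1))
mu <- exp(-5 + 0.02 * sim$A + 1.5 * sim$B)
sim$C <- rgamma(n, shape = 2, rate = 2 / mu)

# GLM with a log link; the Gamma family is an assumption, not a known match
fit <- glm(C ~ A + B, family = Gamma(link = "log"), data = sim)

coef(summary(fit))                     # estimates and standard errors
scale_est <- summary(fit)$dispersion   # estimated dispersion parameter
scale_est
```

Other packages may report the square root or the reciprocal of this quantity, so a numerical match with the old output table would need to be checked against each convention.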
Side-note
I've also posted this to Reddit (r/rstats - Replicating an analysis performed in different software).
Much appreciated!
We are trying to create a distribution that estimates pathogen presence on vegetables. This was done using different methods, each providing a distribution:
- method S (from sludge concentration) is best fitted by weibull(1.55, 8.57)
- method SO (from soil) is best fitted by logN(0.68, 0.63)
- method F (from field data) PERT(0.093, 0.34, 0.52)
Theoretically the 3 methods should estimate the same value. What would be the best way to combine them?
I have searched online but I could only find & understand how to do it using normal distributions. The posterior normal distribution would have a mean that is a weighted average (see page 3 on https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading15a.pdf)
How to update different types of distributions?
Thank you for your help.
library(mc2d)
soil.df <- matrix(data=0, nrow=10000, ncol=3)
colnames(soil.df) <- c("from sludge","soil sample","field data")
for (i in 1:10000) {
  migration <- 0.27
  application <- rpert(1, 0.01, 0.02, 0.25)
  C <- rweibull(1, 1.57, 85.79)
  soil.df[i,1] <- C*application*migration  ## from sludge
  soil.df[i,2] <- 10^rnorm(1, 0.68, 0.63)*migration  ## from soil concentration
  soil.df[i,3] <- rpert(1, 0.093, 0.34, 0.52)  ## from field data
}
par(mfrow=c(1,1))
plot(density(soil.df[,1]), col="red", xlim=c(0,15), ylim=c(0,1), main="Ova/gr soil")
lines(density(soil.df[,2]), col="black")
lines(density(soil.df[,3]), col="green")
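One generic way to combine distributions that are not normal is a grid approximation: evaluate each density on a common grid, multiply them pointwise, and renormalise. The sketch below is only an illustration under several assumptions: the three methods are treated as independent evidence about the same quantity, the logN parameters are read as base-10 (as in the simulation code above), and dpert_base is a hypothetical helper that writes the PERT density as a rescaled beta density with shape 4, so that no extra package is needed.

```r
# Hypothetical helper: PERT density expressed through a beta density
dpert_base <- function(x, min, mode, max, shape = 4) {
  a <- 1 + shape * (mode - min) / (max - min)
  b <- 1 + shape * (max - mode) / (max - min)
  dbeta((x - min) / (max - min), a, b) / (max - min)  # 0 outside [min, max]
}

grid <- seq(0.001, 15, length.out = 20000)
step <- grid[2] - grid[1]

d_s  <- dweibull(grid, shape = 1.55, scale = 8.57)           # method S
d_so <- dnorm(log10(grid), 0.68, 0.63) / (grid * log(10))    # method SO (log10-normal)
d_f  <- dpert_base(grid, 0.093, 0.34, 0.52)                  # method F

# Pointwise product, renormalised to integrate to 1 on the grid
post <- d_s * d_so * d_f
post <- post / sum(post * step)

post_mean <- sum(grid * post * step)
cdf <- cumsum(post * step)
post_q <- sapply(c(0.05, 0.5, 0.95), function(p) grid[which(cdf >= p)[1]])
post_mean
post_q
```

Because the PERT density is zero outside (0.093, 0.52), the combined distribution is confined to that interval. If that is too restrictive, a weighted mixture of the three densities would be a different, more conservative choice.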
I have a large data set comprised of thousands of unique tests (TestNum) and their associated responses (Response), with the number of responses varying across tests. Tests are dropped based on some criteria, hence the gaps in the TestNum values. Here is a simplified example:
dat <- data.frame(Response=c(rlnorm(10, 2.9, 0.3), rlnorm(14, 2.88, 0.38), rlnorm(19, 2.44, 0.08)),TestNum=rep(c(1,4,9), times=c(10,14,19)))
dat$TestNum<-factor(dat$TestNum)
dat
I am fitting a lnorm distribution to each TestNum and extracting coefficients
library(fitdistrplus)  # provides fitdist() and gofstat()
dat_fit1 <- with(dat,
  by(dat[,1], TestNum, fitdist, "lnorm"))
dat_fit2 <- t(sapply(dat_fit1, coef))
I want to test other distributions, but would need the goodness-of-fit statistics (gofstat; for example "chisqpvalue", "cvm", "ad", "ks", "aic", "bic") from each fitted curve by TestNum. I can get the "aic" and "bic" with the code below, but not the rest of the statistics.
gof_f<-do.call(rbind, dat_fit1)
gof_f<-gof_f[,7:8]
Any suggestions would be appreciated!
dat.lnorm <- with(dat,
  by(dat[,1], dat[,2],
    function(x) {
      fit <- fitdist(x, "lnorm", method="mme")
      coef_meanlog <- fit[[1]][[1]]
      coef_sdlog <- fit[[1]][[2]]
      ks <- ks.test(jitter(x), "plnorm", meanlog=coef_meanlog, sdlog=coef_sdlog)$p.value
      ad <- ad.test(plnorm(x, meanlog=coef_meanlog, sdlog=coef_sdlog))$p.value
      return(list(cbind(rbind(fit[7:8]), ks, ad)))
    }))
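An alternative that stays within fitdistrplus is to call gofstat() on each fitted object and pull out the statistics by name. This is a rough sketch assuming the package is installed; the object names fits and gof_tab are illustrative only, and the example data are regenerated with a fixed seed so that the block runs on its own.

```r
library(fitdistrplus)

set.seed(42)
dat <- data.frame(
  Response = c(rlnorm(10, 2.9, 0.3), rlnorm(14, 2.88, 0.38), rlnorm(19, 2.44, 0.08)),
  TestNum  = factor(rep(c(1, 4, 9), times = c(10, 14, 19)))
)

# One fitdist object per TestNum
fits <- lapply(split(dat$Response, dat$TestNum), fitdist, distr = "lnorm")

# One row of goodness-of-fit statistics per TestNum
gof_tab <- t(sapply(fits, function(f) {
  g <- gofstat(f)
  c(ks = unname(g$ks), cvm = unname(g$cvm), ad = unname(g$ad),
    aic = unname(g$aic), bic = unname(g$bic))
}))
gof_tab
```

Note that these are the test statistics, not p-values. Swapping "lnorm" for another distribution name gives a second table that can be compared row by row.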
I am new to R and I have a request that I am not sure is possible. We have a number of retail locations that my boss would like to use affinity propagation to group into clusters. We will not be clustering based on geographic location. Once he has found a configuration he likes, he wants to be able to input other locations to determine which of those set clusters they should fall into.
The only solution I have been able to come up with is to use the same options and re-cluster with the original points and the new ones added in, however I believe that this might change the outcome.
Am I understanding this right, or are there other options?
Sorry for the late answer, I just incidentally stumbled over your question.
I agree with Anony-Mousse's answer that clustering is the first step and classification is the second. However, I'm not sure whether this is the best option here. Elena601b is obviously talking about a task with truly spatial data, so my impression is that the best approach is to cluster first and then to "classify" new points/samples/locations by looking for the closest cluster exemplar. Here is some code for synthetic data:
## if not available, run the following first:
## install.packages("apcluster")
library(apcluster)
## create four synthetic 2D clusters
cl1 <- cbind(rnorm(30, 0.3, 0.05), rnorm(30, 0.7, 0.04))
cl2 <- cbind(rnorm(30, 0.7, 0.04), rnorm(30, 0.4, .05))
cl3 <- cbind(rnorm(20, 0.50, 0.03), rnorm(20, 0.72, 0.03))
cl4 <- cbind(rnorm(25, 0.50, 0.03), rnorm(25, 0.42, 0.04))
x <- rbind(cl1, cl2, cl3, cl4)
## run apcluster() (you may replace the Euclidean distance by a different
## distance, e.g. driving distance, driving time)
apres <- apcluster(negDistMat(r=2), x, q=0)
## create new samples
xNew <- cbind(rnorm(10, 0.3, 0.05), rnorm(10, 0.7, 0.04))
## auxiliary predict() function
predict.apcluster <- function(s, exemplars, newdata)
{
    simMat <- s(rbind(exemplars, newdata),
                sel=(1:nrow(newdata)) + nrow(exemplars))[1:nrow(exemplars), ]
    unname(apply(simMat, 2, which.max))
}
## assign new data samples to exemplars
predict.apcluster(negDistMat(r=2), x[apres@exemplars, ], xNew)
## ... the result is a vector of indices to which exemplar/cluster each
## data sample is assigned
I will probably add such a predict() method in a future release of the package (I am the maintainer of the package). I hope that helps.
Clustering is not a drop-in replacement for classification.
Few clustering algorithms can meaningfully integrate new information.
The usual approach for your problem, however, is simple:
1. Do the clustering.
2. Use the cluster labels as class labels.
3. Train a classifier.
4. Apply the classifier to the new data.
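The steps above could look roughly like this. It is only a sketch: kmeans() stands in for affinity propagation so that the example runs without extra packages, a k-nearest-neighbour classifier from the class package (shipped with R) is used as one possible classifier, and all data are synthetic.

```r
library(class)  # recommended package shipped with R; provides knn()

set.seed(1)
# Two well-separated synthetic 2D clusters
x <- rbind(cbind(rnorm(30, 0.3, 0.05), rnorm(30, 0.7, 0.04)),
           cbind(rnorm(30, 0.7, 0.04), rnorm(30, 0.4, 0.05)))

# Steps 1 and 2: cluster, then treat the cluster ids as class labels
labels <- factor(kmeans(x, centers = 2, nstart = 10)$cluster)

# New locations, drawn near the first cluster
x_new <- cbind(rnorm(10, 0.3, 0.05), rnorm(10, 0.7, 0.04))

# Steps 3 and 4: knn() trains and predicts in one call
pred <- knn(train = x, test = x_new, cl = labels, k = 5)
table(pred)
```

The original clustering is left untouched, so adding new locations cannot change the existing groups.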
I have generated some data which is effectively a cumulative distribution; the code below gives an example of X and Y from my data:
X<- c(0.09787761, 0.10745590, 0.11815422, 0.15503521, 0.16887488, 0.18361325, 0.22166727,
0.23526786, 0.24198808, 0.25432602, 0.26387961, 0.27364063, 0.34864672, 0.37734113,
0.39230736, 0.40699061, 0.41063824, 0.42497043, 0.44176913, 0.46076456, 0.47229330,
0.53134509, 0.56903577, 0.58308938, 0.58417653, 0.60061901, 0.60483849, 0.61847521,
0.62735245, 0.64337353, 0.65783302, 0.67232004, 0.68884473, 0.78846000, 0.82793293,
0.82963446, 0.84392010, 0.87090024, 0.88384044, 0.89543314, 0.93899033, 0.94781219,
1.12390279, 1.18756693, 1.25057774)
Y<- c(0.0090, 0.0210, 0.0300, 0.0420, 0.0580, 0.0700, 0.0925, 0.1015, 0.1315, 0.1435,
0.1660, 0.1750, 0.2050, 0.2450, 0.2630, 0.2930, 0.3110, 0.3350, 0.3590, 0.3770, 0.3950,
0.4175, 0.4475, 0.4715, 0.4955, 0.5180, 0.5405, 0.5725, 0.6045, 0.6345, 0.6585, 0.6825,
0.7050, 0.7230, 0.7470, 0.7650, 0.7950, 0.8130, 0.8370, 0.8770, 0.8950, 0.9250, 0.9475,
0.9775, 1.0000)
plot(X,Y)
I would like to obtain the median, mean and some quantile information (say for example 5%, 95%) from this data. The way I was thinking of doing this was to fit a defined distribution to it and then integrate to get my quantiles, mean and median values.
The question is how to fit the most appropriate cumulative distribution function to this data (I expect this may well be the Normal Cumulative Distribution Function).
I have seen lots of ways to fit a PDF but I can't find anything on fitting a CDF.
(I realise this may seem a basic question to many of you but it has me struggling!!)
Thanks in advance
Perhaps you could use nlm to find the parameters that minimize the squared differences between your observed Y values and the values expected under a normal distribution. Here is an example using your data:
fn <- function(x) {
  mu <- x[1]
  sigma <- exp(x[2])  # exp() keeps sigma positive
  sum((Y - pnorm(X, mu, sigma))^2)
}
est <- nlm(fn, c(1,1))$estimate
plot(X, Y)
curve(pnorm(x, est[1], exp(est[2])), add=TRUE)
Unfortunately, I don't know an easy way with this method to constrain sigma > 0 without applying the exp transformation to the variable. But the fit seems reasonable.
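One way to keep sigma positive without the exp() transformation is box-constrained optimisation with optim(method = "L-BFGS-B"). Once mu and sigma are estimated, the median, mean and quantiles follow directly from qnorm(), with no integration needed. The sketch below uses noise-free synthetic data with known parameters (an assumption made so that the block is self-contained); with the real X and Y, only the first two lines would change.

```r
# Synthetic stand-in for the observed CDF points (true mu = 0.6, sigma = 0.3)
X <- seq(0.1, 1.25, length.out = 45)
Y <- pnorm(X, mean = 0.6, sd = 0.3)

# Sum of squared differences between observed and normal CDF
sse <- function(p) sum((Y - pnorm(X, p[1], p[2]))^2)

# Lower bound on the second parameter keeps sigma strictly positive
fit <- optim(c(1, 1), sse, method = "L-BFGS-B", lower = c(-Inf, 1e-6))
mu <- fit$par[1]
sigma <- fit$par[2]

# For a normal distribution the mean and median both equal mu
q <- qnorm(c(0.05, 0.5, 0.95), mean = mu, sd = sigma)
names(q) <- c("5%", "50%", "95%")
c(mean = mu, sd = sigma)
q
```

If the normal CDF turns out to fit poorly, the same pattern works with another distribution function (for example plnorm and qlnorm) in place of pnorm and qnorm.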