PCA scaling not applied to individuals? - r

This issue applies to my own data, but for the sake of reproducibility my issue/question is also present in the FactoExtra vignette, or here, so I'll use that for the sake of simplicity.
To start, a simple PCA was generated (scale = TRUE) and the variable coordinates for the first 4 axes were extracted:
head(var$coord) # coordinates of variables
> Dim.1 Dim.2 Dim.3 Dim.4
> Sepal.Length 0.8901688 -0.36082989 0.27565767 0.03760602
> Sepal.Width -0.4601427 -0.88271627 -0.09361987 -0.01777631
> Petal.Length 0.9915552 -0.02341519 -0.05444699 -0.11534978
> Petal.Width 0.9649790 -0.06399985 -0.24298265 0.07535950
This was also done for the "individuals." Here is the output:
head(ind$coord) # coordinates of individuals
> Dim.1 Dim.2 Dim.3 Dim.4
> 1 -2.257141 -0.4784238 0.12727962 0.024087508
> 2 -2.074013 0.6718827 0.23382552 0.102662845
> 3 -2.356335 0.3407664 -0.04405390 0.028282305
> 4 -2.291707 0.5953999 -0.09098530 -0.065735340
> 5 -2.381863 -0.6446757 -0.01568565 -0.035802870
> 6 -2.068701 -1.4842053 -0.02687825 0.006586116
Since the PCA was generated with scale = TRUE, I'm highly confused as to why the individual coordinates are not scaled (to -1 to 1?). For instance, individual 1 has a Dim.1 score of -2.257141, but I have no basis for comparing this against the variable coordinates, which range from -0.46 to 0.99. How can a score of -2.25 be interpreted against a scaled PCA range of -1 to 1?
Am I missing something?
Thanks for your time!
Updated with all relevant code gaps filled:
> data(iris)
> library(factoextra)  # provides get_pca_ind() and get_pca_var()
> res.pca <- prcomp(iris[, -5], scale = TRUE)
> ind <- get_pca_ind(res.pca)
> print(ind)
> var <- get_pca_var(res.pca)
> print(var)

I asked the author of FactoExtra this question. Here was his reply:
Scale = TRUE will normalize the variables to make them comparable. This is particularly recommended when variables are measured in different scales (e.g. kilograms, kilometers, centimeters, …); see http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/112-pca-principal-component-analysis-essentials/.
In this case, the correlation between a variable and a principal component (PC) is used as the coordinates of the variable on the PC. The representation of variables differs from the plot of the observations: The observations are represented by their projections, but the variables are represented by their correlations.
So, the coordinates of individuals are not expected to be between -1 and 1, even if scale = TRUE.
It’s only possible to interpret the relative position of individuals and variables by creating a biplot as described at: http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/112-pca-principal-component-analysis-essentials/.
A biplot isn't ideal for me, but I have tried rescaling and it works. Also, I suppose I could take an individual and project it into the PCA to see where it falls.
Anyways, that's the end of that. Thanks for your help #Hack-r!
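Regarding the projection idea above, here is a minimal sketch (my own addition, not from the original exchange; the new observation's values are hypothetical) of dropping a new individual into the existing PC space with predict(), which applies the same centering, scaling, and rotation used for the original individuals:
new.obs <- data.frame(Sepal.Length = 5.0, Sepal.Width = 3.5,
                      Petal.Length = 1.5, Petal.Width = 0.2)  # hypothetical measurements
predict(res.pca, newdata = new.obs)[, 1:4]  # coordinates of the new individual on Dim.1-Dim.4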

The scaling that is done by prcomp(..., scale = TRUE) is scaling of the input variables to unit variance.
I don't think it does anything about range standardization of the individual coordinates, unless perhaps center = ... is used. However, it would be easy to do post hoc (or beforehand). Here's a related post:
Range standardization (0 to 1) in R
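If you do want the individual coordinates on a fixed range, a minimal post-hoc sketch (my own addition, assuming the ind object from the code above) is to rescale each dimension to [0, 1]:
range01 <- function(z) (z - min(z)) / (max(z) - min(z))  # simple 0-1 range standardization
ind.scaled <- apply(ind$coord, 2, range01)
head(ind.scaled)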

Related

R: Find cutoff point for continuous variable to assign observations to two groups

I have the following data
Species <- c(rep('A', 47), rep('B', 23))
Value<- c(3.8711, 3.6961, 3.9984, 3.8641, 4.0863, 4.0531, 3.9164, 3.8420, 3.7023, 3.9764, 4.0504, 4.2305,
4.1365, 4.1230, 3.9840, 3.9297, 3.9945, 4.0057, 4.2313, 3.7135, 4.3070, 3.6123, 4.0383, 3.9151,
4.0561, 4.0430, 3.9178, 4.0980, 3.8557, 4.0766, 4.3301, 3.9102, 4.2516, 4.3453, 4.3008, 4.0020,
3.9336, 3.5693, 4.0475, 3.8697, 4.1418, 4.0914, 4.2086, 4.1344, 4.2734, 3.6387, 2.4088, 3.8016,
3.7439, 3.8328, 4.0293, 3.9398, 3.9104, 3.9008, 3.7805, 3.8668, 3.9254, 3.7980, 3.7766, 3.7275,
3.8680, 3.6597, 3.7348, 3.7357, 3.9617, 3.8238, 3.8211, 3.4176, 3.7910, 4.0617)
D<-data.frame(Species,Value)
I have the two species A and B and want to find out the best cutoff point for Value to determine the species.
I found the following question:
R: Determine the threshold that maximally separates two groups based on a continuous variable?
and followed the accepted answer to find the best value with the dose.p function from the MASS package. I have several similar datasets and it worked for them, but not for the one given above (which is also the reason why I needed to include all 70 observations here).
D$Species_b<-ifelse(D$Species=="A",0,1)
my.glm<-glm(Species_b~Value, data = D, family = binomial)
dose.p(my.glm,p=0.5)
gives me 3.633957 as threshold:
Dose SE
p = 0.5: 3.633957 0.1755291
This results in 45 correct assignments. However, if I look at the data, it is obvious that this is not the best value. By trial and error I found that 3.8 gives me 50 correct assignments, which is obviously better.
Why does the function work for other values but not for this one? Am I missing an obvious mistake? Or is there maybe a different/better approach to solving my problem? I have several datasets I need to do this for, so I really do not want to just randomly test values until I find the best one.
Any help would be greatly appreciated.
I would typically use a receiver operating characteristic (ROC) curve for this type of analysis. It gives a visual and numerical assessment of how the sensitivity and specificity of your cutoff change as you adjust the threshold, and it lets you select the threshold at which overall accuracy is highest. For example, using pROC:
library(pROC)
species_roc <- roc(D$Species, D$Value)
We can get a measure of how good a discriminator Value is for predicting Species by examining the area under the curve:
auc(species_roc)
#> Area under the curve: 0.778
plot(species_roc)
and we can find out the optimum cut-off threshold like this:
coords(species_roc, x = "best")
#> threshold specificity sensitivity
#> 1 3.96905 0.6170213 0.9130435
We see that this threshold correctly identifies 50 cases:
table(Actual = D$Species, Predicted = c("A", "B")[1 + (D$Value < 3.96905)])
#> Predicted
#> Actual A B
#> A 29 18
#> B 2 21
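If you want to sanity-check the ROC result without pROC, here is a small brute-force sketch (my own addition, using the D data frame above and the rule "predict B when Value is below the threshold", as in the table above):
thresholds <- sort(unique(D$Value))
acc <- sapply(thresholds, function(t) mean((D$Value < t) == (D$Species == "B")))
thresholds[which.max(acc)]  # threshold with the highest raw accuracy
max(acc) * nrow(D)          # number of correct assignments at that threshold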

different output for PR AUC for different R packages

I get different numeric values for the Area Under the Precision Recall Curve (PR AUC) on the dataset I am working with, depending on which of two R packages computes it: yardstick or caret.
I am afraid I was not able to reproduce this mismatch with synthetic data, only with my dataset (which is strange as well).
In order to make this reproducible, I am sharing the prediction output of my model, you can download it here https://drive.google.com/open?id=1LuCcEw-RNRcdz6cg0X5bIEblatxH4Rdz (don't worry, it's a small csv).
The csv contains a dataframe with 4 columns:
yes: probability estimate of being in class yes
no: 1 - yes
obs: actual class label
pred: predicted class label (with .5 threshold)
Here is the code that produces the two values of PR AUC:
require(data.table)
require(yardstick)
require(caret)
pr <- fread('pred_sample.csv')
# transform to factors
# put the positive class in the first level
pr[, obs := factor(obs, levels = c('yes', 'no'))]
pr[, pred := factor(pred, levels = c('yes', 'no'))] # this is actually not needed
# compute yardstick PRAUC
pr_auc(pr, obs, yes) # 0.315
# compute caret PRAUC
prSummary(pr, lev = c('yes', 'no')) # 0.2373
I could understand a small difference due to approximation when computing the area (interpolating the curve), but this seems far too large.
I even tried a third package, PRROC, and the result is still different, namely around .26.
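For reference, the PRROC computation would look roughly like this (a sketch, assuming the same pr table as above; PRROC takes the positive-class and negative-class scores as separate vectors):
library(PRROC)
pos_scores <- pr$yes[pr$obs == 'yes']  # scores of the actual positives
neg_scores <- pr$yes[pr$obs == 'no']   # scores of the actual negatives
pr.curve(scores.class0 = pos_scores, scores.class1 = neg_scores)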

Removing Multivariate Outliers With mvoutlier

Problem
I have a dataframe made up of more than 5 variables at any time and am trying to run K-Means on it. Because K-Means is greatly affected by outliers, I've spent a few hours looking for how to calculate and remove multivariate outliers. Most examples demonstrated use only 2 variables.
Possible Solutions Explored
mvoutlier - A kind user here noted that mvoutlier may be what I need.
Another Outlier Detection Method - Poster here commented with a mix of R functions to generate an ordered list of outliers.
Issues thus Far
Regarding mvoutlier, I was unable to generate a result because it reported that my dataset contained negative values and could not proceed. I'm not sure how I could transform my data to be positive only, since I need the negative values in the set I am working with.
Regarding Another Outlier Detection Method, I was able to come up with a list of outliers but am unsure how to exclude them from the current data set. Also, I do know that these calculations are done after K-Means, and I will probably have to apply the math prior to doing K-Means.
Minimal Verifiable Example
Unfortunately, the dataset I'm using is off-limits and can't be shown to anyone, so what you'll need is any random data set with more than 3 variables. The code below is converted from the Another Outlier Detection Method post to work with my data. It should work dynamically with a random data set as well, as long as there is enough data for 5 cluster centers to be reasonable.
clusterAmount <- 5
cluster <- kmeans(dataFrame, centers = clusterAmount, nstart = 20)
centers <- cluster$centers[cluster$cluster, ]
distances <- sqrt(rowSums((dataFrame - centers)^2))  # Euclidean distance of each row from its cluster center
m <- tapply(distances, cluster$cluster, mean)
d <- distances/(m[cluster$cluster])
# 1% outliers
outliers <- d[order(d, decreasing = TRUE)][1:(nrow(dataFrame) * .01)]  # top 1% largest relative distances
Output: a list of outliers ordered by their distance from the center of the cluster they reside in, I believe. The issue then is pairing these results up with the respective rows in the data frame and removing them so I can start my K-Means procedure. (Note, while in the example I used K-Means prior to removing outliers, I'll make sure to take the necessary steps and remove outliers before K-Means upon solution.)
Question
With Another Outlier Detection Method example in place, how do I pair the results with the information in my current data frame to exclude those rows before doing K-Means?
I don't know if this is exactly helpful, but if your data are multivariate normal you may want to try a Wilks (1963) based method. Wilks showed that the Mahalanobis distances of multivariate normal data follow a Beta distribution. We can take advantage of this (iris Sepal data used as an example):
test.dat <- iris[, -c(3, 4, 5)]  # keep only the two Sepal columns
Wilks.function <- function(dat){
  n <- nrow(dat)
  p <- ncol(dat)
  # beta distribution
  u <- n * mahalanobis(dat, center = colMeans(dat), cov = cov(dat))/(n-1)^2
  w <- 1 - u
  F.stat <- ((n-p-1)/p) * (1/w-1) # computing F statistic
  p <- 1 - round( pf(F.stat, p, n-p-1), 3) # p value for each row
  cbind(w, F.stat, p)
}
plot(test.dat,
col = "blue",
pch = c(15,16,17)[as.numeric(iris$Species)])
dat.rows <- Wilks.function(test.dat); head(dat.rows)
# w F.stat p
#[1,] 0.9888813 0.8264127 0.440
#[2,] 0.9907488 0.6863139 0.505
#[3,] 0.9869330 0.9731436 0.380
#[4,] 0.9847254 1.1400985 0.323
#[5,] 0.9843166 1.1710961 0.313
#[6,] 0.9740961 1.9545687 0.145
Then we can simply find which rows of our multivariate data are significantly different from the beta distribution.
outliers <- which(dat.rows[,"p"] < 0.05)
points(test.dat[outliers,],
col = "red",
pch = c(15,16,17)[as.numeric(iris$Species[outliers])])
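To tie this back to the original question, a short sketch (my own addition, not part of the quoted answer): outliers is a vector of row indices, so the flagged rows can simply be dropped before running K-Means.
clean.dat <- if (length(outliers) > 0) test.dat[-outliers, ] else test.dat  # drop flagged rows
cluster.clean <- kmeans(clean.dat, centers = 3, nstart = 20)  # centers = 3 is just a plausible choice for iris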

Chi squared goodness of fit for a geometric distribution

As an assignment I had to develop an algorithm and generate samples from a geometric distribution with PMF P(X = n) = (1 - p)^(n-1) p, n = 1, 2, …, where p = 0.3.
Using the inverse transform method, I came up with the following expression for generating the values: X = ceiling(log(U) / log(1 - p)),
where U represents a value, or n values depending on the size of the sample, drawn from a Unif(0,1) distribution and p is 0.3 as stated in the PMF above.
I have the algorithm, the implementation in R, and I have already generated QQ plots to visually assess how well the empirical values match the theoretical ones (generated with R), i.e., whether the generated sample does indeed follow the geometric distribution.
Now I wanted to submit the generated sample to a goodness of fit test, namely the Chi-square, yet I'm having trouble doing this in R.
[I think this was moved a little hastily, in spite of your response to whuber's question, since I think before solving the 'how do I write this algorithm in R' problem, it's probably more important to deal with the 'what you're doing is not the best approach to your problem' issue (which certainly belongs where you posted it). Since it's here, I will deal with the 'doing it in R' aspect, but I would urge you to go back and ask about the second question (as a new post).]
Firstly the chi-square test is a little different depending on whether you test
H0: the data come from a geometric distribution with parameter p
or
H0: the data come from a geometric distribution with parameter 0.3
If you want the second, it's quite straightforward. First, with the geometric, if you want to use the chi-square approximation to the distribution of the test statistic, you will need to group adjacent cells in the tail. The 'usual' rule - much too conservative - suggests that you need an expected count in every bin of at least 5.
I'll assume you have a nice large sample size. In that case, you'll have many bins with substantial expected counts and you don't need to worry so much about keeping it so high, but you will still need to choose how you will bin the tail (whether you just choose a single cut-off above which all values are grouped, for example).
I'll proceed as if n were say 1000 (though if you're testing your geometric random number generation, that's pretty low).
First, compute your expected counts:
dgeom(0:20,.3)*1000
[1] 300.0000000 210.0000000 147.0000000 102.9000000 72.0300000 50.4210000
[7] 35.2947000 24.7062900 17.2944030 12.1060821 8.4742575 5.9319802
[13] 4.1523862 2.9066703 2.0346692 1.4242685 0.9969879 0.6978915
[19] 0.4885241 0.3419669 0.2393768
Warning: dgeom and friends go from x=0, not x=1; while you can shift the inputs and outputs to the R functions, it's much easier if you subtract 1 from all your geometric values and test that. I will proceed as if your sample has had 1 subtracted so that it goes from 0.
I'll cut that off at the 15th term (x=14), and group 15+ into its own group (a single group in this case). If you wanted to follow the 'greater than five' rule of thumb, you'd cut it off after the 12th term (x=11). In some cases (such as smaller p), you might want to split the tail across several bins rather than one.
> expec <- dgeom(0:14,.3)*1000
> expec <- c(expec, 1000-sum(expec))
> expec
[1] 300.000000 210.000000 147.000000 102.900000 72.030000 50.421000
[7] 35.294700 24.706290 17.294403 12.106082 8.474257 5.931980
[13] 4.152386 2.906670 2.034669 4.747562
The last cell is the "15+" category. We also need the probabilities.
Now we don't yet have a sample; I'll just generate one:
y <- rgeom(1000,0.3)
but now we want a table of observed counts:
(x <- table(factor(y,levels=0:14),exclude=NULL))
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 <NA>
292 203 150 96 79 59 47 25 16 10 6 7 0 2 5 3
Now you could compute the chi-square directly and then calculate the p-value:
> (chisqstat <- sum((x-expec)^2/expec))
[1] 17.76835
> (pval <- pchisq(chisqstat,15,lower.tail=FALSE))
[1] 0.2750401
but you can also get R to do it:
> chisq.test(x,p=expec/1000)
Chi-squared test for given probabilities
data: x
X-squared = 17.7683, df = 15, p-value = 0.275
Warning message:
In chisq.test(x, p = expec/1000) :
Chi-squared approximation may be incorrect
The case for unspecified p is similar, but (to my knowledge) you can no longer get chisq.test to do it directly; you have to do it the first way, estimating the parameter from the data (by maximum likelihood or minimum chi-square), and then test as above but with one fewer degree of freedom to account for the estimated parameter.
See the example of doing a chi-square for a Poisson with estimated parameter here; the geometric follows much the same approach as above, with the adjustments as at the link (dealing with the unknown parameter, including the loss of 1 degree of freedom).
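For concreteness, a sketch of the unspecified-p case (my own addition, following the recipe above and assuming the y sample and x table from earlier, i.e. values starting at 0): estimate p by maximum likelihood, recompute the expected counts, and drop one extra degree of freedom.
p.hat <- 1 / (1 + mean(y))                 # ML estimate of p for a geometric starting at 0
expec2 <- dgeom(0:14, p.hat) * 1000
expec2 <- c(expec2, 1000 - sum(expec2))    # lump 15+ into a single tail bin, as before
chisq2 <- sum((x - expec2)^2 / expec2)
pchisq(chisq2, df = 16 - 1 - 1, lower.tail = FALSE)  # 16 bins, minus 1, minus 1 estimated parameter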
Let us assume you've got your randomly-generated variates in a vector x. You can do the following:
x <- rgeom(1000,0.2)
x_tbl <- table(x)
x_val <- as.numeric(names(x_tbl))
x_df <- data.frame(count=as.numeric(x_tbl), value=x_val)
# Expand to fill in "gaps" in the values caused by 0 counts
all_x_val <- data.frame(value = 0:max(x_val))
x_df <- merge(all_x_val, x_df, by="value", all.x=TRUE)
x_df$count[is.na(x_df$count)] <- 0
# Get theoretical probabilities
x_df$eprob <- dgeom(x_df$value, 0.2)
# Chi-square test: once with asymptotic dist'n,
# once with bootstrap evaluation of chi-sq test statistic
chisq.test(x=x_df$count, p=x_df$eprob, rescale.p=TRUE)
chisq.test(x=x_df$count, p=x_df$eprob, rescale.p=TRUE,
simulate.p.value=TRUE, B=10000)
There's a "goodfit" function described as "Goodness-of-fit Tests for Discrete Data" in package "vcd".
library(vcd)
G.fit <- goodfit(x, type = "nbinomial", par = list(size = 1))  # size = 1 makes the negative binomial a geometric
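The fitted object can then be inspected; a short usage sketch (assuming the default ML fit in vcd::goodfit):
summary(G.fit)  # likelihood-ratio goodness-of-fit test for the fitted model
plot(G.fit)     # rootogram of observed versus fitted counts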
I was going to use the code you had posted in an earlier question, but it now appears that you have deleted that code. I find that offensive. Are you using this forum to gather homework answers and then defacing it to remove the evidence? (Deleted questions can still be seen by those of us with sufficient rep, and the interface prevents deletion of questions with upvoted answers, so you should not be able to delete this one.)
Generate a QQ Plot for testing a geometrically distributed sample
--- question---
I have a sample of n elements generated in R with
sim.geometric <- function(nvals)
{
  p <- 0.3
  u <- runif(nvals)
  ceiling(log(u)/log(1-p))
}
for which I want to test the distribution, specifically whether it indeed follows a geometric distribution. I want to generate a QQ plot but have no idea how to.
--------reposted answer----------
A QQ plot should be a straight line when compared to a "true" sample drawn from a geometric distribution with the same probability parameter. One gives two vectors to the function, which essentially compares their inverse ECDFs at each quantile. (Your attempt is not particularly successful:)
sim.res <- sim.geometric(100)
sim.rgeom <- rgeom(100, 0.3)
qqplot(sim.res, sim.rgeom)
Here I follow the lead of the authors of qqplot's help page (which results in flipping that upper curve around the line of identity):
png("QQ.png")
qqplot(qgeom(ppoints(100),prob=0.3), sim.res,
main = expression("Q-Q plot for" ~~ {G}[n == 100]))
dev.off()
---image not included---
You can add a "line of good fit" by plotting a line through the 25th and 75th percentile points for each distribution. (I added jittering to get a better idea of where the "probability mass" was located:)
sim.res <- sim.geometric(500)
qqplot(jitter(qgeom(ppoints(500),prob=0.3)), jitter(sim.res),
main = expression("Q-Q plot for" ~~ {G}[n == 100]), ylim=c(0,max( qgeom(ppoints(500),prob=0.3),sim.res )),
xlim=c(0,max( qgeom(ppoints(500),prob=0.3),sim.res )))
qqline(sim.res, distribution = function(p) qgeom(p, 0.3),
prob = c(0.25, 0.75), col = "red")

Bootstrapping to compare two groups

In the following code I use bootstrapping to calculate the C.I. and the p-value under the null hypothesis that two different fertilizers applied to tomato plants have no effect on plant yields (the alternative being that the "improved" fertilizer is better). The first random sample (x) comes from plants where a standard fertilizer has been used, while an "improved" one has been used on the plants where the second sample (y) comes from.
x <- c(11.4,25.3,29.9,16.5,21.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
library(boot)
diff <- function(x,i) mean(x[i[6:11]]) - mean(x[i[1:5]])
b <- boot(total, diff, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
What I don't like about the code above is that resampling is done as if there was only one sample of 11 values (separating the first 5 as belonging to sample x leaving the rest to sample y).
Could you show me how this code should be modified in order to draw resamples of size 5 with replacement from the first sample and separate resamples of size 6 from the second sample, so that bootstrap resampling would mimic the “separate samples” design that produced the original data?
EDIT 2:
Hack deleted as it was a wrong solution. Instead, one has to use the strata argument of the boot function:
total <- c(x,y)
id <- as.factor(c(rep("x",length(x)),rep("y",length(y))))
b <- boot(total, diff, strata=id, R = 10000)
...
Be aware you're not going to get even close to a correct estimate of your p-value:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
b <- boot(total, diff, strata=id, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
> p.value
[1] 0.5162
How would you explain a p-value of 0.51 for two samples where all values of the second are higher than the highest value of the first?
The above code is fine for getting a (biased) estimate of the confidence interval, but the significance testing of the difference should be done by permutation over the complete dataset.
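For completeness, a permutation-test sketch of that last point (my own addition, assuming the x, y, and total vectors defined above): permute the group labels over the pooled data and recompute the mean difference each time.
set.seed(42)                                        # for reproducibility of the sketch
obs.diff <- mean(total[6:11]) - mean(total[1:5])    # observed y minus x difference
perm.diff <- replicate(10000, {
  s <- sample(total)                                # shuffle all 11 values, i.e. permute labels
  mean(s[6:11]) - mean(s[1:5])
})
mean(perm.diff >= obs.diff)                         # one-sided permutation p-value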
Following John, I think the appropriate way to use the bootstrap to test whether the sums of these two different populations are significantly different is as follows:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
b_x <- boot(x, function(d, i) sum(d[i]), R = 10000)  # statistic must take (data, indices)
b_y <- boot(y, function(d, i) sum(d[i]), R = 10000)
z<-(b_x$t0-b_y$t0)/sqrt(var(b_x$t[,1])+var(b_y$t[,1]))
pnorm(z)
So we can clearly reject the null that they are the same population. I may have missed a degrees-of-freedom adjustment; I am not sure how bootstrapping works in that regard, but such an adjustment would not change your results drastically.
While the actual soil beds could be considered a stratified variable in some instances, this is not one of them. You only have the one manipulation, between the groups of plants. Therefore, your null hypothesis is that they really do come from the exact same population. Treating the items as if they're from a single set of 11 samples is the correct way to bootstrap in this case.
If you had two plots, and in each plot tried the different fertilizers over different seasons in a counterbalanced fashion, then the plots would be stratified samples and you'd want to treat them as such. But that isn't the case here.
