PCA - All variables with the same sign on PC1 coordinates - R
I am analyzing a dataset of 160 observations and 20 variables and performing a PCA. The data come from patients affected by a disease; the variables are antibody levels measured in the same experiment and expressed in the same units (u/mL). All of these variables take only positive values, so I can't understand how I end up with samples on the positive side of PC1 without any variable contributing to that side.
As potential confounding factors I have the patients' age, gender, and duration of infection, but these three were not included in the PCA.
What I am having trouble understanding is the following: when I use the R package factoextra's function fviz_pca_biplot() to show both the sample distribution and each variable's contribution to PCs 1 and 2, all 20 of my variables have strongly negative coordinates on PC1.
The images below were generated from a small sample of my original data; even though the variables' contributions are not identical to those from the full dataset, they are still strongly negative on PC1. This is understandable when I do not center the data in prcomp() (image 1): all of my samples lie on the negative side of PC1, and that component explains most of the data's inertia.
library(factoextra)
# small sample of the antibody data: rows = patients, columns = antibody levels
PCAf <- read.table("PCA_small_sample.csv", sep = ";", header = TRUE, row.names = 1)
# PCA with scaling but without centering
res.pca <- prcomp(PCAf, scale = TRUE, center = FALSE)
fviz_pca_biplot(res.pca)
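A quick numeric check of this, using the res.pca object from the uncentered run above:
# sign of every sample's score on PC1 in the uncentered run
table(sign(res.pca$x[, 1]))
# proportion of total inertia captured by PC1
summary(res.pca)$importance["Proportion of Variance", 1]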
However, I have been taught that it is necessary to center the data when performing PCA, and with centering the image becomes this:
res.pca <- prcomp(PCAf, scale = TRUE)  # center = TRUE is the default
fviz_pca_biplot(res.pca)
This decreases the variance explained by PC1 and increases that of PC2, but even though the variables' coordinates change, none of them has a positive coordinate on PC1.
res.var <- get_pca_var(res.pca)  # variable results from factoextra
res.var$coord                    # variable coordinates on each PC
These are the values for the non-centered PCA:
And for the centered PCA:
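For reference, as far as I understand, factoextra's variable coordinates are the prcomp loadings (res.pca$rotation) scaled by each PC's standard deviation, so they can be recomputed by hand; a minimal sketch using the centered res.pca and res.var from above:
# variable coordinates = loadings scaled by the standard deviation of each PC
coord.manual <- sweep(res.pca$rotation, 2, res.pca$sdev, FUN = "*")
head(coord.manual[, 1:2])   # compare with factoextra's values for the first two dimensions
head(res.var$coord[, 1:2])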
Am I doing something wrong? Should I really present my analysis with the second image even though the vectors do not match what we are seeing?
My main question is: when presenting the PCA, it is better to do so with centered data, right? If so, should I apply some sort of correction to the variables' coordinates/contributions to the PCs? This second image does not seem reliable to me, though that may just be my lack of experience. Since all the variable vectors point toward the left side of the plot, what would be pulling some of the samples (e.g. 7, 10, 8, 4, 20) toward the right side of the plot (positive PC1)? It seems counterintuitive that not a single vector points to the right.
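To make what I am seeing concrete, the sample scores can be reconstructed by hand from the centered, scaled data and the loadings; a minimal sketch, using res.pca from the centered run above:
# prcomp(scale = TRUE) works on centered and scaled data, so each sample's PC1 score
# is the sum over variables of loading * standardized value
scores.manual <- scale(PCAf) %*% res.pca$rotation
all.equal(unname(scores.manual), unname(res.pca$x))   # should be TRUE
head(sort(scores.manual[, 1], decreasing = TRUE))     # samples with the largest positive PC1 scores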
This also raises the question: should I include confounding factors when performing a PCA? I ran linear regressions to account for them but did not include them in the PC analysis.
Anyway, thank you all so much in advance.
PS: I uploaded a file containing a sample of my data, the code, and the images to GitHub.
PS2: When I do the same thing with a generic dataset, I do not see this issue. Without centering the same thing happens at first, but once the data are centered there are vectors in all four quadrants, and I can extract some rationale from them.
# simulate a generic count dataset: 100 genes x 10 samples (5 "wt", 5 "ko")
data.matrix <- matrix(nrow = 100, ncol = 10)
colnames(data.matrix) <- c(
  paste("wt", 1:5, sep = ""),
  paste("ko", 1:5, sep = ""))
rownames(data.matrix) <- paste("gene", 1:100, sep = "")
for (i in 1:100) {
  wt.values <- rpois(5, lambda = sample(x = 10:1000, size = 1))
  ko.values <- rpois(5, lambda = sample(x = 10:1000, size = 1))
  data.matrix[i, ] <- c(wt.values, ko.values)
}
PCAf <- t(data.matrix)  # samples as rows, genes as columns
res.pca_NC <- prcomp(PCAf, scale = TRUE, center = FALSE)
res.pca_C  <- prcomp(PCAf, scale = TRUE, center = TRUE)
fviz_pca_biplot(res.pca_NC)
fviz_pca_biplot(res.pca_C)
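To quantify how centering shifts the explained variance between the two runs, a quick check using the objects above:
# proportion of variance explained by the first two PCs, uncentered vs centered
summary(res.pca_NC)$importance["Proportion of Variance", 1:2]
summary(res.pca_C)$importance["Proportion of Variance", 1:2]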
Not centered - generic PCA:
Centered - generic PCA: