I have a dataset that looks like this:
India China Brasil Russia SAfrica Kenya States Indonesia States Argentina Chile Netherlands HongKong
0.0854026763 0.1389383234 0.1244184371 0.0525460881 0.2945586244 0.0404562539 0.0491597968 0 0 0.0618342901 0.0174891774 0.0634064181 0
0.0519483159 0.0573851759 0.0756806292 0.0207164181 0.0409872092 0.0706355932 0.0664503936 0.0775285039 0.008545575 0.0365674701 0.026595575 0.064280902 0.0338135148
0 0 0 0 0 0 0 0 0 0 0 0 0
0.0943708876 0 0 0.0967733329 0 0.0745076688 0 0 0 0.0427047276 0 0.0583873189 0
0.0149521013 0.0067569437 0.0108914448 0.0229991162 0.0151678343 0.0413174214 0 0.0240999375 0 0.0608951432 0.0076549109 0 0.0291972756
0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0.0096710124 0.0095669967 0 0.0678582869 0 0 0.0170707337 0.0096565543 0.0116698364 0.0122773071
0.1002690681 0.0934563916 0.0821680095 0.1349534369 0.1017157777 0.1113249348 0.1713480649 0.0538715423 0.4731833978 0.1956743964 0.6865919069 0.2869189344 0.5364034876
1.5458338337 0.2675380321 0.6229046372 0.5059107039 0.934209603 0.4933799388 0.4259769181 0.3534169521 14.4134845836 4.8817632117 13.4034293299 3.7849346739 12.138551171
0.4625375671 0.320258205 0.4216459567 0.4992764309 0.4115887595 0.4783677078 0.4982410179 0.2790259278 0.3804405781 0.2594924212 0.4542162376 0.3012339384 0.3450847892
0.357614592 0.3932670219 0.3803417257 0.4615355254 0.3807061655 0.4122433346 0.4422282977 0.3053712842 0.297943232 0.2658160167 0.3244018409 0.2523836582 0.3106600754
0.359953567 0.3958391813 0.3828293473 0.4631507073 0.3831961707 0.4138590365 0.4451206879 0.3073685624 0.2046559772 0.2403036541 0.2326305393 0.2269373716 0.2342962436
0.7887404662 0.6545878236 0.7443676393 0.7681244767 0.5938002158 0.5052305973 0.4354571648 0.40511005 0.8372481106 0.5971130339 0.8025313223 0.5708610817 0.8556609579
0.5574207497 1.2175251783 0.8797484259 0.952685465 0.4476585005 1.1919229479 1.03612509 0.5490564488 0.2407034171 0.5675492645 0.4994121344 0.5460544861 0.3779468604
0.5632651223 1.0181714714 1.1253803155 1.228293512 0.6949993291 1.0346288085 0.5955221073 0.5212567091 1.1674901423 1.2442735568 1.207624867 1.3854352274 0.7557131826
0.6914760031 0.7831502333 1.0282730148 0.750270567 0.7072739935 0.8041764647 0.8918512571 0.6998554585 2.3448306081 1.2905783367 2.4295927684 1.3029766224 1.9310763864
0.3459898177 0.7474525109 0.7253451876 0.7182493014 0.3081791886 0.7462088907 0.5950509439 0.4443221541 3.6106852374 2.7647504885 3.3698608994 2.6523062395 1.8016571476
0.4629523517 0.6549211677 0.6158018856 0.7637088814 0.4951554309 0.6277236471 0.6227669055 0.383909839 2.9502307101 1.803480973 2.3083113522 1.668759497 1.7130459012
0.301548861 0.5961888126 0.4027007075 0.5540290853 0.4078662541 0.5108773106 0.4610682726 0.3712800134 0.3813402422 0.7391417247 1.0935364978 0.691857974 0.4416304953
2.5038287529 3.2005148394 2.9181517373 3.557918333 1.8868234768 2.9369926312 0.4117894127 0.3074815035 3.9187777037 7.3161555954 6.9586996112 5.7096144353 2.7007439732
2.5079707359 3.2058093222 2.9229791182 3.563804054 1.8899447728 2.9418511798 0.4124706194 0.269491388 3.9252603798 7.3282584169 6.9702111077 5.7190596205 2.7052117051
2.6643724791 1.2405320493 2.0584120188 2.2354369334 1.7199730388 2.039829709 1.7428132997 0.9977029725 8.9650886611 4.6035139163 8.1430131464 5.2450639988 6.963309864
0.5270581435 0.8222128903 0.7713479951 0.8785815313 0.624993821 0.7410405193 0.5350834321 0.4797121891 1.3753525725 1.2219267886 1.397221881 1.2433155977 0.8647136903
0.2536079475 0.5195514789 0.0492623195 0.416102668 0.2572670724 0.4805482899 0.4866090738 0.4905212099 0.2002506403 0.5508609827 0.3808572148 0.6276294938 0.3191452919
0.3499009885 0.5837491529 0.4914807442 0.5851537888 0.3638549977 0.537655052 0.5757185943 0.4730102035 0.9098072064 0.6197285737 0.7781825654 0.6424684366 0.6424429128
0.6093076876 0.9456457011 0.8518013605 1.1360347777 0.511960743 0.9038104168 0.5048413575 0.2777622235 0.2915840525 0.6628516415 0.4600364351 0.7996524113 0.3765721177
0.9119207879 1.2363073271 1.3285269752 1.4027039939 0.9250782309 2.1599381031 1.312307839 0 0 0.8253250513 0 0 0.8903632354
It is stored in a data.txt file.
I want to have a PCA multiplot that looks like this:
What I am doing:
d <- read.table("data.txt", header=TRUE, as.is=TRUE)
model <- prcomp(d, scale=TRUE)
After this I am lost.
How can I cluster the dataset according to the PCA projections and obtain plots similar to those above?
You are actually asking two different questions:
How to cluster the data after PCA projections.
How to obtain the above plots.
However, before getting to those, I would like to add that if your samples are in columns, you are not doing PCA correctly. You should run it on the transposed dataset instead, like so:
model <- prcomp(t(d), scale=TRUE)
But for that to work you would have to remove all the constant rows in your data.
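For example, a minimal sketch (assuming d is the data frame read from data.txt; dropping rows with a single unique value is one reasonable criterion):

d <- d[apply(d, 1, function(r) length(unique(r)) > 1), ]   # drop constant (zero-variance) rows
model <- prcomp(t(d), scale=TRUE)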
From here on I assume that you performed the PCA step as you intended.
prcomp returns the matrix of rotated scores when retx=TRUE (it is TRUE by default), so you will want to use model$x.
Your next step is clustering the data based on the principal components. This can be done in various ways; one is hierarchical clustering. If you want 5 groups in the end, here is one way:
fit <- hclust(dist(model$x[,1:3]), method="complete") # 1:3 -> based on 3 components
groups <- cutree(fit, k=5) # k=5 -> 5 groups
This step gives you the group assignments that will later be used for coloring.
The final step is plotting. Here I wrote a simple function to do all in one shot:
library(rgl)

plotPCA <- function(x, nGroup) {
    n <- ncol(x)
    if(!(n %in% c(2, 3))) {   # check if 2D or 3D
        stop("x must have either 2 or 3 columns")
    }
    fit <- hclust(dist(x), method="complete")   # cluster the scores
    groups <- cutree(fit, k=nGroup)
    if(n == 3) {   # 3D plot
        plot3d(x, col=groups, type="s", size=1, axes=F)
        axes3d(edges=c("x--", "y--", "z"), lwd=3, axes.len=2, labels=FALSE)
        grid3d("x")
        grid3d("y")
        grid3d("z")
    } else {   # 2D plot
        maxes <- apply(abs(x), 2, max)
        rangeX <- c(-maxes[1], maxes[1])
        rangeY <- c(-maxes[2], maxes[2])
        plot(x, col=groups, pch=19, xlab=colnames(x)[1], ylab=colnames(x)[2],
             xlim=rangeX, ylim=rangeY)
        lines(c(0, 0), rangeY*2)   # vertical line through the origin
        lines(rangeX*2, c(0, 0))   # horizontal line through the origin
    }
}
This function is simple. It takes two arguments: 1) a matrix of scores, with principal components in columns and your samples in rows (you can use, for example, model$x[,c(1,2,4)] if you want the 1st, 2nd and 4th components); 2) the number of groups for clustering.
It then clusters the data based on the passed principal components and plots them, in either 2D or 3D depending on the number of columns passed.
Here are a few examples:
plotPCA(model$x[,1:2], 5)
And a 3D example (based on the first 3 principal components):
plotPCA(model$x[,1:3], 5)
This last plot will be interactive, so you can rotate it or zoom in/out.
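If you want to keep a static copy of a particular 3D view, rgl can save one (a small optional addition, not part of the function above):

rgl.snapshot("pca_3d.png")   # write the current rgl view to a PNG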
Hope this helps.
I am looking at the answers to the Chapter 6, exercise 3 questions from the book Applied Predictive Modeling by Max Kuhn, and I am getting an error in the imputation prediction step (despite following their answer exactly). The reproducible code and problem are as follows:
library(AppliedPredictiveModeling)
library(caret)
library(RANN)
data(ChemicalManufacturingProcess)
predictors <- subset(ChemicalManufacturingProcess,select= -Yield)
yield <- subset(ChemicalManufacturingProcess,select="Yield")
# Impute
#Split data into training and test sets
set.seed(517)
trainingRows <- createDataPartition(yield$Yield,
                                    p = 0.7,
                                    list = FALSE)
trainPredictors <- predictors[trainingRows,]
trainYield <- yield[trainingRows,]
testPredictors <- predictors[-trainingRows,]
testYield <- yield[-trainingRows,]
#Pre-process trainPredictors and apply to trainPredictors and testPredictors
pp <- preProcess(trainPredictors,method=c("BoxCox","center","scale","knnImpute"))
ppTrainPredictors <- predict(pp,newdata=trainPredictors)
ppTestPredictors <- predict(pp,newdata=testPredictors) # This results in an error
The error it gives is: Error in RANN::nn2(old[, non_missing_cols, drop = FALSE], new[, non_missing_cols, : NA/NaN/Inf in foreign function call (arg 2)
When I use the YeoJohnson transformation instead, it seems to work (I read that it is able to handle non-positive numbers).
However, I don't understand why the Box-Cox version isn't working on the test data, seeing as it is just another subset of the same data, and it's only used for the imputation step of the question.
I was unable to find any answers to this, which seems strange, as surely other people who have followed the book would have noticed. Or am I being thick?
Thank you
You get that error because the Box-Cox transformation cannot handle zeros. If you check the help page for BoxCoxTrans, it says:
If any(y <= 0) or if length(unique(y)) < numUnique, lambda is not estimated and no transformation is applied.
So if your preProcess() is run on a training set whose columns contain no zeros, the Box-Cox transformation is estimated and applied, but it will then fail on a test set that does contain zeros in those columns.
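As a quick diagnostic, you can list the columns this logic flags (a sketch in plain base R, using the objects defined above):

# columns strictly positive in train (so a lambda is estimated) that contain
# zeros or negatives in test (where the transform then yields non-finite values)
pos_in_train <- sapply(trainPredictors, function(x) all(x > 0, na.rm=TRUE))
bad_in_test <- sapply(testPredictors, function(x) any(x <= 0, na.rm=TRUE))
names(trainPredictors)[pos_in_train & bad_in_test]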
In the book example, the seed was most likely set with an older version of R, so the split works out there; the default sampling algorithm changed in R 3.6.0, so on a newer version the same seed gives a different partition, and it doesn't work. Checking with your example:
cbind(colSums(trainPredictors==0,na.rm=TRUE),colSums(testPredictors==0,na.rm=TRUE))
[,1] [,2]
BiologicalMaterial01 0 0
BiologicalMaterial02 0 0
BiologicalMaterial03 0 0
BiologicalMaterial04 0 0
BiologicalMaterial05 0 0
BiologicalMaterial06 0 0
BiologicalMaterial07 0 0
BiologicalMaterial08 0 0
BiologicalMaterial09 0 0
BiologicalMaterial10 0 0
BiologicalMaterial11 0 0
BiologicalMaterial12 0 0
ManufacturingProcess01 1 2
ManufacturingProcess02 29 6
ManufacturingProcess03 0 0
ManufacturingProcess04 0 0
ManufacturingProcess05 0 0
ManufacturingProcess06 0 0
ManufacturingProcess07 0 0
ManufacturingProcess08 0 0
ManufacturingProcess09 0 0
ManufacturingProcess10 0 0
ManufacturingProcess11 0 0
ManufacturingProcess12 104 38
ManufacturingProcess13 0 0
ManufacturingProcess14 0 0
ManufacturingProcess15 0 0
ManufacturingProcess16 1 0
ManufacturingProcess17 0 0
ManufacturingProcess18 1 0
You can see that ManufacturingProcess16 and ManufacturingProcess18 will give you problems.
The Yeo-Johnson transformation can deal with zero and negative values, so it does not have this problem.
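For completeness, the Yeo-Johnson variant that you already found to work is a one-word swap in the method vector:

pp <- preProcess(trainPredictors, method=c("YeoJohnson","center","scale","knnImpute"))
ppTrainPredictors <- predict(pp, newdata=trainPredictors)
ppTestPredictors <- predict(pp, newdata=testPredictors)   # no longer errors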
If you would like to carry on with the worked example, you can try another seed (replacing 517 below with a different value):
set.seed(517)
trainingRows <- createDataPartition(yield$Yield,
                                    p = 0.7,
                                    list = FALSE)
trainPredictors <- predictors[trainingRows,]
trainYield <- yield[trainingRows,]
testPredictors <- predictors[-trainingRows,]
testYield <- yield[-trainingRows,]
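Alternatively, if the goal is to reproduce the book's original split exactly, a hedged workaround is to restore the pre-3.6.0 sampling algorithm before setting the seed (R will warn that this re-enables the older, slightly biased sample()):

RNGversion("3.5.0")   # revert to the pre-3.6.0 RNG behaviour
set.seed(517)
trainingRows <- createDataPartition(yield$Yield, p = 0.7, list = FALSE)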
With regard to this link Predicted probabilities in R ranger package, I have a question.
Imagine I have a mixed data frame, df (comprising factor and numeric variables), and I want to do classification using ranger. I split this data frame into train and test sets, Train_Set and Test_Set. BiClass is the factor variable I want to predict, and it comprises 0 and 1 (2 levels).
I want to calculate and attach class probabilities to the data frame using ranger using the following commands:
Biclass.ranger <- ranger(BiClass ~ ., data=Train_Set, num.trees = 500, importance="impurity", save.memory = TRUE, probability=TRUE)
probabilities <- as.data.frame(predict(Biclass.ranger, data = Test_Set, num.trees = 200, type='response', verbose = TRUE)$predictions)
The data frame probabilities is a data frame consisting of 2 columns (0 and 1) with number of rows equal to the number of rows in Test_Set.
Does this mean that if I append this probabilities data frame to Test_Set as the last two columns, it shows the probability of each row being either 0 or 1? Is my understanding correct?
My second question: when I attempt to calculate the confusion matrix through
pred = predict(Biclass.ranger, data=Test_Set, num.trees = 500, type='response', verbose = TRUE)
table(Test_Set$BiClass, pred$predictions)
I get the following error:
Error in table(Test_Set$BiClass, pred$predictions) :
all arguments must have the same length
What am I doing wrong?
For your first question: yes, it shows the probability of each row being 0 or 1. Using the example below:
library(ranger)
idx = sample(nrow(iris),100)
data = iris
data$Species = factor(ifelse(data$Species=="versicolor",1,0))
Train_Set = data[idx,]
Test_Set = data[-idx,]
mdl <- ranger(Species ~ ., data=Train_Set, importance="impurity", save.memory = TRUE, probability=TRUE)
probabilities <- as.data.frame(predict(mdl, data = Test_Set,type='response', verbose = TRUE)$predictions)
We can always check whether they agree:
par(mfrow=c(1,2))
boxplot(probabilities[,"0"] ~ Test_Set$Species,ylab="Prob 0",xlab="Actual label")
boxplot(probabilities[,"1"] ~ Test_Set$Species,ylab="Prob 1",xlab="Actual label")
Not the best plot, but sometimes if the labels are flipped you will see something weird. Next we need to find the column with the maximum probability and assign the corresponding label; for this we do:
max.col(probabilities) - 1
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0
[39] 0 0 0 0 0 0 0 0 0 0 0 0
This goes through each row of probabilities and returns 1 or 2 depending on which column holds the maximum probability; subtracting 1 gives the 0/1 labels. For the confusion matrix:
caret::confusionMatrix(table(max.col(probabilities) - 1,Test_Set$Species))
Confusion Matrix and Statistics
0 1
0 31 2
1 0 17
Accuracy : 0.96
95% CI : (0.8629, 0.9951)
No Information Rate : 0.62
P-Value [Acc > NIR] : 2.048e-08
As for the error in your second question: with probability=TRUE, pred$predictions is a two-column matrix of class probabilities rather than a vector of predicted labels, so its length does not match Test_Set$BiClass and table() complains. In your case, you can just do:
confusionMatrix(table(max.col(probabilities)-1,Test_Set$BiClass))
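Alternatively, if all you need for the confusion matrix are hard labels, a sketch is to fit a second model without probability=TRUE, in which case pred$predictions is a factor of predicted classes and your original table() call works:

mdl_cls <- ranger(BiClass ~ ., data=Train_Set, num.trees=500, importance="impurity")
pred_cls <- predict(mdl_cls, data=Test_Set)
table(Test_Set$BiClass, pred_cls$predictions)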
I have a logistic model fit, say myfit, that I've saved. The data frame I'm using is in the following format (the first column is the outcome):
medical10 age female nonwhite bmi smoked condxs insuredd smi2d
1 0 60 0 1 29.97 0 0 1 0
2 0 42 0 1 25.85 1 3 1 1
3 0 62 1 0 25.06 0 1 1 0
4 0 62 0 0 36.27 0 2 0 0
5 0 32 0 0 33.36 0 0 1 0
6 0 41 0 0 21.70 1 0 0 0
...
What I would like to do is to make a logistic plot (in this form: http://ww2.coastal.edu/kingw/statistics/R-tutorials/logistic.html) for each combination of variables.
Since there are 8 variables, there are 8 such plots, each having one variable on the x-axis while holding the other 7 constant. Is there a way I can automate the plotting using ggplot2?
For instance, if 'x' was age, I would get the mean of bmi, and then pick 0 for female, 0 for nonwhite, 0 for smoked, 0 for condxs, 0 for insuredd and 0 for smi2d. I would then do a prediction and make a ggplot of x vs y.
However, this is quite tedious, and I was hoping there was a better way.
I don't know of anything in particular in ggplot2 that will make this easy, but I did find a way (though it was more work than I was expecting; perhaps others can improve on it). Anyway, first let's define a more useful set of sample data:
N<-100
set.seed(15)
invlogit <- function(x) exp(x)/(exp(x)+1)
dd <- transform(data.frame(
        age=runif(N, 30, 60),
        female=sample(0:1, N, replace=T),
        white=sample(c("Y","N"), N, replace=T),
        bmi=rnorm(N, 30, 2)),
    medical=as.numeric(invlogit((-60 + 2*age - 1.5*bmi + 3*female)/10) > runif(N)))
fit <- glm(medical ~ ., dd, family=binomial)
So now we have some data and a model. Next I'll define a helper function that predicts the response for a single variable while holding the others at their mean value (or at a fixed level for factors).
predictone <- function(fit, var, xlim=NULL, fix=list(), n=101,
                       xname=var, type="response") {
    tt <- terms(fit)
    vv <- as.list(attr(tt, "variables"))[-c(1, attr(tt, "response") + 1)]
    vn <- sapply(vv, deparse)
    stopifnot(var %in% vn)
    others <- vn[vn != var]
    # hold each remaining variable at the value supplied in `fix`, or else at
    # its mean (numeric) / a single level (factor)
    def <- lapply(others, function(x) {
        val <- fit$data[[x]]   # look up the variable's values once, for both branches below
        if(x %in% names(fix)) {
            if(is.factor(val)) {
                stopifnot(fix[[x]] %in% levels(val))
                val[val == fix[[x]]][1]
            } else {
                fix[[x]]
            }
        } else {
            if(is.factor(val)) {
                val[val == names(sort(table(val))[1])][1]
            } else {
                mean(val)
            }
        }
    })
    if(is.factor(fit$data[[var]])) {
        newdata <- data.frame(def, unique(fit$data[[var]]))
    } else {
        if(is.null(xlim)) {
            xlim <- range(fit$data[[var]])
        }
        newdata <- data.frame(def, seq(min(xlim), max(xlim), length.out=n))
    }
    names(newdata) <- c(others, var)
    pp <- data.frame(newdata[[var]], predict(fit, newdata, type=type))
    names(pp) <- c(xname, type)
    attr(pp, "fixed") <- setNames(def, others)
    pp
}
Basically this function calculates the values at which to hold all the other variables, builds the corresponding newdata, and then does the actual prediction. We can use it with the test data to make a bunch of plots:
library(ggplot2)
plots <- lapply(names(dd)[1:4], function(x) {
    if(is.factor(dd[[x]])) {
        ggplot(predictone(fit, x), aes_string(x=x, y="response")) + geom_point()
    } else {
        ggplot(predictone(fit, x), aes_string(x=x, y="response")) + geom_line()
    }
})
require(gridExtra)
do.call(grid.arrange, plots)
which will draw the four plots in a grid.
Note that factors are treated differently than regular numeric values. When you code categorical variables as 0/1, R can't tell they are categorical, so it doesn't do a good job of inferring values that make sense. I would encourage you to convert 0/1 values into proper factor variables.
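A minimal sketch of that conversion, using the sample data above (the level labels are arbitrary):

dd$female <- factor(dd$female, levels=0:1, labels=c("no", "yes"))
fit <- glm(medical ~ ., dd, family=binomial)   # refit so the factor coding is used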
An update to the R rms package, to be posted on CRAN around 2015-01-01, includes a new function ggplot.Predict (called by ggplot()) that provides a general way to generate such curves using ggplot2, handling multiple moving variables, interactions, etc. You can see some example usage at https://github.com/harrelfe/rms/blob/master/man/ggplot.Predict.Rd . You can do all of this with the current version of rms using lattice graphics and the plot.Predict function.
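As a hedged sketch of that rms workflow on the sample data above (assuming the usual datadist/lrm setup; details may vary between rms versions):

library(rms)
ddist <- datadist(dd); options(datadist="ddist")
fit2 <- lrm(medical ~ age + female + white + bmi, data=dd)
ggplot(Predict(fit2, fun=plogis))   # one panel per predictor, the others held at reference values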
Suppose I have two matrices: A, a label matrix, and B, the corresponding matrix of predicted probabilities for A. I would like to calculate the AUPR (Area Under the Precision/Recall Curve) from A and B. For the common AUC (Area Under the ROC Curve) there are many packages in R, such as ROCR and pROC, that directly calculate the AUC value, but which packages in R can calculate the AUPR, or can you suggest a method to compute it?
Here are the two example matrices:
> pp
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.01792 0.00155 -0.00140 0.00522 0.01320 0.22506 0.00454
[2,] 0.05883 0.11256 0.82862 0.12406 0.08298 -0.00392 0.30724
[3,] 0.00743 0.06357 0.14500 0.00213 0.00545 0.03452 0.11189
[4,] 0.02571 0.01460 0.01108 0.00494 0.01246 0.11880 0.05504
[5,] 0.02407 0.00961 0.00720 0.00382 0.01039 0.10974 0.04512
> ll
D00040 D00066 D00067 D00075 D00088 D00094 D00105
hsa190 0 0 0 0 0 1 0
hsa2099 0 1 1 0 0 0 1
hsa2100 0 0 0 0 0 0 1
hsa2101 0 0 0 0 0 0 0
hsa2103 0 0 0 0 0 0 0
pp is the predicted probability matrix corresponding to the true label matrix ll.
Thanks in advance.
I would first convert the prediction scores and classes from matrices into vectors.
The "PRROC" package provides functions for generating ROC and PR curves similar to those in "ROCR", and it also gives the AUC of the PR curve.
Specifically, I'm using the ROCR.simple data from the "ROCR" package as an example.
library(PRROC)
library(ROCR)
data("ROCR.simple")
scores <- data.frame(ROCR.simple$predictions, ROCR.simple$labels)
pr <- pr.curve(scores.class0=scores[scores$ROCR.simple.labels == "1",]$ROCR.simple.predictions,
               scores.class1=scores[scores$ROCR.simple.labels == "0",]$ROCR.simple.predictions,
               curve=T)
Note that in this function "scores.class0" needs to be the scores for the positive class (which is a little confusing, because personally I consider 0 as negative and 1 as positive), which is why I switched the order of 0 and 1.
This way, the PR curve and AUC are all saved in the pr variable.
pr
Precision-recall curve
Area under curve (Integral):
0.7815038
Area under curve (Davis & Goadrich):
0.7814246
Curve for scores from 0.005422562 to 0.9910964
( can be plotted with plot(x) )
Then you can plot the PR curve with plot(pr) or with ggplot:
library(ggplot2)
y <- as.data.frame(pr$curve)
ggplot(y, aes(V1, V2)) + geom_path() + ylim(0, 1)
The resulting curve is the same as the one made by the ROCR package.
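Applied to your own pp and ll, a minimal sketch is to flatten both matrices into vectors first (assuming every cell is an independent score/label pair):

scores <- as.vector(as.matrix(pp))
labels <- as.vector(as.matrix(ll))
pr2 <- pr.curve(scores.class0=scores[labels == 1],   # scores of the positives
                scores.class1=scores[labels == 0],   # scores of the negatives
                curve=TRUE)
pr2$auc.integral   # the AUPR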
I have 3 variables on which I ran PCA using prcomp. I tried to reconstruct the variables using the loadings and factors, but the residuals are not zero. Statistically (I might be wrong here) I was expecting to be able to reconstruct the original data exactly. Am I missing something?
test = read.table(text='0.8728891 0.7403704 0.6655271
0.8697503 0.7447901 0.6629487
0.8569866 0.7321241 0.6493666
0.8824890 0.7405750 0.6505887
0.8912246 0.7334331 0.6508194
0.8930270 0.7381421 0.6448108
0.8721081 0.7173891 0.6355404
0.8649705 0.7326563 0.6493313
0.8976412 0.7249211 0.6437649
0.9233625 0.7406451 0.6454023',sep=' ')
pca = prcomp(test,center=T,scale=F)
pca$x %*% pca$rotation + matrix(1,nrow=nrow(test),ncol=1) %*% pca$center - test
V1 V2 V3
-0.0020186611 0.0071487188 -0.0240478838
-0.0004352159 -0.0005375912 -0.0262594828
0.0008042558 -0.0039840874 -0.0019352850
0.0009905100 -0.0053390749 -0.0067663626
-0.0008375576 0.0041104957 0.0016244986
0.0013586563 -0.0060476694 0.0036526104
0.0004278214 0.0009280342 0.0298641699
0.0005504918 -0.0026885505 -0.0009348334
-0.0011619165 0.0073130849 0.0185829183
0.0003216158 -0.0009033601 0.0062196504
I use the following function for reconstructing data from a prcomp object:
#This function reconstructs a data set using a defined set of principal components.
#arguments "pca" is the pca object from prcomp, "pcs" is a vector of principal components
#to be used for reconstruction (default includes all pcs)
prcomp.recon <- function(pca, pcs=NULL){
if(is.null(pcs)) pcs <- seq(pca$sdev)
recon <- as.matrix(pca$x[,pcs]) %*% t(as.matrix(pca$rotation[,pcs]))
if(pca$scale[1] != FALSE){
recon <- scale(recon , center=FALSE, scale=1/pca$scale)
}
if(pca$center[1] != FALSE){
recon <- scale(recon , center=-pca$center, scale=FALSE)
}
recon
}
I couldn't figure out exactly what was wrong with your code, but using the prcomp.recon function gives the right result:
> prcomp.recon(pca) - test
V1 V2 V3
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 0
10 0 0 0
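For what it's worth, the culprit in your original line appears to be a missing transpose: the scores have to be multiplied by t(pca$rotation), not pca$rotation, to map them back to the original variable space. A one-line check using your own objects:

pca$x %*% t(pca$rotation) + matrix(1, nrow(test), 1) %*% pca$center - test   # ~0 up to rounding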