Suppose I have two matrices: A, a label matrix, and B, the corresponding matrix of predicted probabilities for A. I would like to calculate the AUPR (Area Under the Precision/Recall Curve) from matrices A and B. For the common AUC (Area Under the ROC Curve) there are many packages in R, such as ROCR and pROC, that can directly calculate the value, but which R packages can calculate the AUPR? Or can you suggest a method to compute it?
Here are the two example matrices:
> pp
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.01792 0.00155 -0.00140 0.00522 0.01320 0.22506 0.00454
[2,] 0.05883 0.11256 0.82862 0.12406 0.08298 -0.00392 0.30724
[3,] 0.00743 0.06357 0.14500 0.00213 0.00545 0.03452 0.11189
[4,] 0.02571 0.01460 0.01108 0.00494 0.01246 0.11880 0.05504
[5,] 0.02407 0.00961 0.00720 0.00382 0.01039 0.10974 0.04512
> ll
D00040 D00066 D00067 D00075 D00088 D00094 D00105
hsa190 0 0 0 0 0 1 0
hsa2099 0 1 1 0 0 0 1
hsa2100 0 0 0 0 0 0 1
hsa2101 0 0 0 0 0 0 0
hsa2103 0 0 0 0 0 0 0
pp is the matrix of predicted probabilities corresponding to the true label matrix ll.
Thanks in advance.
I would first convert the prediction scores and the class labels from matrices into vectors.
The "PRROC" package provides functionality similar to "ROCR" for generating ROC and PR curves, and it also gives the AUC of the PR curve.
Specifically, I'm using the ROCR.simple dataset from the "ROCR" package as an example.
library(PRROC)
library(ROCR)
data("ROCR.simple")
scores <- data.frame(ROCR.simple$predictions, ROCR.simple$labels)
pr <- pr.curve(scores.class0 = scores[scores$ROCR.simple.labels == "1", ]$ROCR.simple.predictions,
               scores.class1 = scores[scores$ROCR.simple.labels == "0", ]$ROCR.simple.predictions,
               curve = TRUE)
Note that in this function, "scores.class0" needs to be the scores for the positive class (which is a little confusing, because I personally think of 0 as negative and 1 as positive), so I switched the order of 0 and 1.
This way, the PR curve and its AUC are both saved in the pr variable.
pr
Precision-recall curve
Area under curve (Integral):
0.7815038
Area under curve (Davis & Goadrich):
0.7814246
Curve for scores from 0.005422562 to 0.9910964
( can be plotted with plot(x) )
Then you can plot the PR curve with plot(pr), or with ggplot2:
library(ggplot2)
y <- as.data.frame(pr$curve)
ggplot(y, aes(V1, V2)) + geom_path() + ylim(0, 1)
The resulting curve is the same as the one produced by the ROCR package.
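Applied back to the matrices in the question, the same recipe works after flattening them into vectors (a minimal sketch, assuming ll is a numeric 0/1 matrix; wrap it in as.matrix() first if it is a data frame):
library(PRROC)
# Flatten scores and labels the same way so positions still correspond
scores <- as.vector(as.matrix(pp))
labels <- as.vector(as.matrix(ll))
pr <- pr.curve(scores.class0 = scores[labels == 1],  # scores of the positives
               scores.class1 = scores[labels == 0],  # scores of the negatives
               curve = TRUE)
pr$auc.integral  # the AUPR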
I am calculating the p-values of categorical variables in R for a multinomial logistic regression.
How do I interpret the results for VEHICLE_TYPE6, VEHICLE_TYPE7, and VEHICLE_TYPE8? The output is a single zero, which seems a bit odd. I'm not sure if it's telling me that the result is highly significant or if it's some kind of error. Can anyone help?
> p_values <- (1 - pnorm(abs(z_stats)))*2
> p_values
(Intercept) SURFACE_COND2 SURFACE_COND3 SURFACE_COND4 SURFACE_COND5 SURFACE_COND9 VEHICLE_TYPE2 VEHICLE_TYPE3
2 0 1.732286e-08 0.1979489 7.654519e-07 1.634368e-02 7.505911e-03 3.917932e-04 0.024577610
3 0 4.137271e-07 0.1505311 0.000000e+00 1.776133e-10 3.497501e-05 1.126374e-05 0.001057668
VEHICLE_TYPE4 VEHICLE_TYPE5 VEHICLE_TYPE6 VEHICLE_TYPE7 VEHICLE_TYPE8 VEHICLE_TYPE9 VEHICLE_TYPE10 VEHICLE_TYPE11
2 0.1873876 6.542883e-04 0 0 0 1.866588e-08 1.678066e-01 0.02322403
3 0.1458607 7.277671e-05 0 0 0 4.074296e-12 5.167644e-09 0.01476668
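One possible explanation (an observation about the formula rather than your data, since the z statistics aren't shown): for large |z|, 1 - pnorm(abs(z)) underflows to exactly 0 in double precision, so genuinely tiny p-values print as 0. Computing the upper tail directly preserves precision:
z <- 10                                # hypothetical large z statistic
(1 - pnorm(abs(z))) * 2                # 0: subtracting from 1 loses all precision
2 * pnorm(abs(z), lower.tail = FALSE)  # ~1.5e-23: the actual tiny p-value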
I'm trying to implement an EM algorithm for family data where I'm assuming my observations have a multivariate t distribution. I have only two siblings per family, so all of the family groups have only two observations. Basically I'm trying to follow the E(C)M steps in this article:
https://pdfs.semanticscholar.org/9445/ef865c4eb1431f9cb2abdb5efc1c361172cc.pdf
However, now I'm not sure if EM is doable for this kind of data, since my correlation matrix Psi should be block diagonal for families.
So here's an R example of how my families are structured:
fam_id <- sort(rep(1:5, 2))
Z <- matrix(0, nrow = length(fam_id), ncol = length(unique(fam_id)))
colnames(Z) <- unique(fam_id)
k <- 1
i <- 1
# Random-effects dummy matrix: one column per family
while (k <= ncol(Z)) {
  Z[i:(i + 1), k] <- c(1, 1)
  k <- k + 1
  i <- i + 2
}
> Z
1 2 3 4 5
[1,] 1 0 0 0 0
[2,] 1 0 0 0 0
[3,] 0 1 0 0 0
[4,] 0 1 0 0 0
...
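For reference, the block-diagonal Psi implied by this structure could be built like so (a sketch; rho is a hypothetical within-family correlation):
library(Matrix)
rho <- 0.3                                 # hypothetical within-family correlation
block <- matrix(c(1, rho, rho, 1), 2, 2)   # one 2x2 block per family
Psi <- as.matrix(bdiag(replicate(5, block, simplify = FALSE)))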
The EM algorithm chokes after the 5th iteration, complaining that the correlation matrix Psi is computationally singular:
Error in solve.default(psi_hat) :
system is computationally singular
If anyone could shed some light to this, I'd be very happy!
Please check this answer on Cross Validated (stats.stackexchange.com):
https://stats.stackexchange.com/questions/76488/error-system-is-computationally-singular-when-running-a-glm
You are probably ending up with a non-invertible matrix in your 5th iteration.
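A minimal diagnostic you could run inside the EM loop before inverting (a sketch, assuming psi_hat holds the current estimate of Psi):
rc <- rcond(psi_hat)  # reciprocal condition number; near 0 means near-singular
if (rc < .Machine$double.eps) stop("psi_hat is numerically singular")
# A common workaround is to regularize slightly before inverting:
psi_inv <- solve(psi_hat + diag(1e-8, nrow(psi_hat)))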
I'm using the mgcv package to fit some polynomial splines to some data via:
x.gam <- gam(cts ~ s(time, bs = "ad"), data = x.dd,
family = poisson(link = "log"))
I'm trying to extract the functional form of the fit. x.gam is a gamObject, and I've been reading the documentation but haven't found enough information in order to manually reconstruct the fitted function.
x.gam$smooth contains information about where the knots have been placed;
x.gam$coefficients gives the spline coefficients, but I don't know what order of polynomial splines is used, and looking at the code has not revealed anything.
Is there a neat way to extract the knots, coefficients and basis used so that one can manually reconstruct the fit?
I don't have your data, so I'll take the following example from ?adaptive.smooth to show you where you can find the information you want. Note that although this example uses Gaussian data rather than Poisson data, only the link function differs; everything else is standard.
library(mgcv)
x <- 1:1000/1000  # data on [0, 1]
mu <- exp(-400*(x-.6)^2) + 5*exp(-500*(x-.75)^2)/3 + 2*exp(-500*(x-.9)^2)
y <- mu + 0.5*rnorm(1000)
b <- gam(y ~ s(x, bs = "ad", k = 40, m = 5))
Now, all information on smooth construction is stored in b$smooth; let's take it out:
smooth <- b$smooth[[1]] ## extract smooth object for first smooth term
knots:
smooth$knots gives you location of knots.
> smooth$knots
[1] -0.081161 -0.054107 -0.027053 0.000001 0.027055 0.054109 0.081163
[8] 0.108217 0.135271 0.162325 0.189379 0.216433 0.243487 0.270541
[15] 0.297595 0.324649 0.351703 0.378757 0.405811 0.432865 0.459919
[22] 0.486973 0.514027 0.541081 0.568135 0.595189 0.622243 0.649297
[29] 0.676351 0.703405 0.730459 0.757513 0.784567 0.811621 0.838675
[36] 0.865729 0.892783 0.919837 0.946891 0.973945 1.000999 1.028053
[43] 1.055107 1.082161
Note that three exterior knots are placed beyond each end of [0, 1] to construct the spline basis.
basis class
attr(smooth, "class") tells you the type of spline. As you can read from ?adaptive.smooth, for bs = "ad" mgcv uses P-splines, hence you get "pspline.smooth".
mgcv uses a 2nd-order P-spline penalty; you can verify this by checking the difference matrix smooth$D. Below is a snapshot:
> smooth$D[1:6,1:6]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 -2 1 0 0 0
[2,] 0 1 -2 1 0 0
[3,] 0 0 1 -2 1 0
[4,] 0 0 0 1 -2 1
[5,] 0 0 0 0 1 -2
[6,] 0 0 0 0 0 1
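You can reproduce this pattern yourself; applying diff() twice to an identity matrix gives exactly the 2nd-order difference operator (a quick check, not part of mgcv):
diff(diag(6), differences = 2)  # rows are (1, -2, 1) shifted along, as in smooth$D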
coefficients
You already know that b$coefficients contains the model coefficients:
beta <- b$coefficients
Note this is a named vector:
> beta
(Intercept) s(x).1 s(x).2 s(x).3 s(x).4 s(x).5
0.37792619 -0.33500685 -0.30943814 -0.30908847 -0.31141148 -0.31373448
s(x).6 s(x).7 s(x).8 s(x).9 s(x).10 s(x).11
-0.31605749 -0.31838050 -0.32070350 -0.32302651 -0.32534952 -0.32767252
s(x).12 s(x).13 s(x).14 s(x).15 s(x).16 s(x).17
-0.32999553 -0.33231853 -0.33464154 -0.33696455 -0.33928755 -0.34161055
s(x).18 s(x).19 s(x).20 s(x).21 s(x).22 s(x).23
-0.34393354 -0.34625650 -0.34857906 -0.05057041 0.48319491 0.77251118
s(x).24 s(x).25 s(x).26 s(x).27 s(x).28 s(x).29
0.49825345 0.09540020 -0.18950763 0.16117012 1.10141701 1.31089436
s(x).30 s(x).31 s(x).32 s(x).33 s(x).34 s(x).35
0.62742937 -0.23435309 -0.19127140 0.79615752 1.85600016 1.55794576
s(x).36 s(x).37 s(x).38 s(x).39
0.40890236 -0.20731309 -0.47246357 -0.44855437
basis matrix / model matrix / linear predictor matrix (lpmatrix)
You can get the model matrix from:
mat <- predict.gam(b, type = "lpmatrix")
This is an n-by-p matrix, where n is the number of observations and p is the number of coefficients. The matrix has column names:
> head(mat[,1:5])
(Intercept) s(x).1 s(x).2 s(x).3 s(x).4
1 1 0.6465774 0.1490613 -0.03843899 -0.03844738
2 1 0.6437580 0.1715691 -0.03612433 -0.03619157
3 1 0.6384074 0.1949416 -0.03391686 -0.03414389
4 1 0.6306815 0.2190356 -0.03175713 -0.03229541
5 1 0.6207361 0.2437083 -0.02958570 -0.03063719
6 1 0.6087272 0.2688168 -0.02734314 -0.02916029
The first column is all 1s, giving the intercept, while s(x).1 is the first basis function for s(x). If you want to see what an individual basis function looks like, plot a column of mat against your variable. For example:
plot(x, mat[, "s(x).20"], type = "l", main = "20th basis")
linear predictor
If you want to manually construct the fit, you can do:
pred.linear <- mat %*% beta
Note that this is exactly what you can get from b$linear.predictors or
predict.gam(b, type = "link")
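As a quick sanity check (a sketch reusing the objects defined above), all three versions should agree up to floating-point noise:
all.equal(as.numeric(pred.linear), as.numeric(b$linear.predictors))
all.equal(as.numeric(pred.linear), as.numeric(predict.gam(b, type = "link")))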
response / fitted values
For non-Gaussian data, if you want the fitted values on the response scale, apply the inverse link function to the linear predictor to map it back to the original scale.
Family information is stored in gamObject$family, and gamObject$family$linkinv is the inverse link function. The example above uses the identity link, but for your fitted object x.gam you can do:
x.gam$family$linkinv(x.gam$linear.predictors)
Note this is the same as x.gam$fitted.values, or
predict.gam(x.gam, type = "response").
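Putting it all together for your Poisson model with a log link, where linkinv is just exp(), the full manual reconstruction would look like this (a sketch, assuming x.gam is your fitted object):
Xp <- predict.gam(x.gam, type = "lpmatrix")
eta <- Xp %*% x.gam$coefficients  # linear predictor
mu <- x.gam$family$linkinv(eta)   # equals exp(eta) for the log link
all.equal(as.numeric(mu), as.numeric(x.gam$fitted.values))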
Other links
I have just realized that there are quite a few similar questions already.
This answer by Gavin Simpson is great, for predict.gam( , type = 'lpmatrix').
This answer is about predict.gam(, type = 'terms').
But anyway, the best reference is always ?predict.gam, which includes extensive examples.
I have a dataset that looks like this:
India China Brasil Russia SAfrica Kenya States Indonesia States Argentina Chile Netherlands HongKong
0.0854026763 0.1389383234 0.1244184371 0.0525460881 0.2945586244 0.0404562539 0.0491597968 0 0 0.0618342901 0.0174891774 0.0634064181 0
0.0519483159 0.0573851759 0.0756806292 0.0207164181 0.0409872092 0.0706355932 0.0664503936 0.0775285039 0.008545575 0.0365674701 0.026595575 0.064280902 0.0338135148
0 0 0 0 0 0 0 0 0 0 0 0 0
0.0943708876 0 0 0.0967733329 0 0.0745076688 0 0 0 0.0427047276 0 0.0583873189 0
0.0149521013 0.0067569437 0.0108914448 0.0229991162 0.0151678343 0.0413174214 0 0.0240999375 0 0.0608951432 0.0076549109 0 0.0291972756
0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0.0096710124 0.0095669967 0 0.0678582869 0 0 0.0170707337 0.0096565543 0.0116698364 0.0122773071
0.1002690681 0.0934563916 0.0821680095 0.1349534369 0.1017157777 0.1113249348 0.1713480649 0.0538715423 0.4731833978 0.1956743964 0.6865919069 0.2869189344 0.5364034876
1.5458338337 0.2675380321 0.6229046372 0.5059107039 0.934209603 0.4933799388 0.4259769181 0.3534169521 14.4134845836 4.8817632117 13.4034293299 3.7849346739 12.138551171
0.4625375671 0.320258205 0.4216459567 0.4992764309 0.4115887595 0.4783677078 0.4982410179 0.2790259278 0.3804405781 0.2594924212 0.4542162376 0.3012339384 0.3450847892
0.357614592 0.3932670219 0.3803417257 0.4615355254 0.3807061655 0.4122433346 0.4422282977 0.3053712842 0.297943232 0.2658160167 0.3244018409 0.2523836582 0.3106600754
0.359953567 0.3958391813 0.3828293473 0.4631507073 0.3831961707 0.4138590365 0.4451206879 0.3073685624 0.2046559772 0.2403036541 0.2326305393 0.2269373716 0.2342962436
0.7887404662 0.6545878236 0.7443676393 0.7681244767 0.5938002158 0.5052305973 0.4354571648 0.40511005 0.8372481106 0.5971130339 0.8025313223 0.5708610817 0.8556609579
0.5574207497 1.2175251783 0.8797484259 0.952685465 0.4476585005 1.1919229479 1.03612509 0.5490564488 0.2407034171 0.5675492645 0.4994121344 0.5460544861 0.3779468604
0.5632651223 1.0181714714 1.1253803155 1.228293512 0.6949993291 1.0346288085 0.5955221073 0.5212567091 1.1674901423 1.2442735568 1.207624867 1.3854352274 0.7557131826
0.6914760031 0.7831502333 1.0282730148 0.750270567 0.7072739935 0.8041764647 0.8918512571 0.6998554585 2.3448306081 1.2905783367 2.4295927684 1.3029766224 1.9310763864
0.3459898177 0.7474525109 0.7253451876 0.7182493014 0.3081791886 0.7462088907 0.5950509439 0.4443221541 3.6106852374 2.7647504885 3.3698608994 2.6523062395 1.8016571476
0.4629523517 0.6549211677 0.6158018856 0.7637088814 0.4951554309 0.6277236471 0.6227669055 0.383909839 2.9502307101 1.803480973 2.3083113522 1.668759497 1.7130459012
0.301548861 0.5961888126 0.4027007075 0.5540290853 0.4078662541 0.5108773106 0.4610682726 0.3712800134 0.3813402422 0.7391417247 1.0935364978 0.691857974 0.4416304953
2.5038287529 3.2005148394 2.9181517373 3.557918333 1.8868234768 2.9369926312 0.4117894127 0.3074815035 3.9187777037 7.3161555954 6.9586996112 5.7096144353 2.7007439732
2.5079707359 3.2058093222 2.9229791182 3.563804054 1.8899447728 2.9418511798 0.4124706194 0.269491388 3.9252603798 7.3282584169 6.9702111077 5.7190596205 2.7052117051
2.6643724791 1.2405320493 2.0584120188 2.2354369334 1.7199730388 2.039829709 1.7428132997 0.9977029725 8.9650886611 4.6035139163 8.1430131464 5.2450639988 6.963309864
0.5270581435 0.8222128903 0.7713479951 0.8785815313 0.624993821 0.7410405193 0.5350834321 0.4797121891 1.3753525725 1.2219267886 1.397221881 1.2433155977 0.8647136903
0.2536079475 0.5195514789 0.0492623195 0.416102668 0.2572670724 0.4805482899 0.4866090738 0.4905212099 0.2002506403 0.5508609827 0.3808572148 0.6276294938 0.3191452919
0.3499009885 0.5837491529 0.4914807442 0.5851537888 0.3638549977 0.537655052 0.5757185943 0.4730102035 0.9098072064 0.6197285737 0.7781825654 0.6424684366 0.6424429128
0.6093076876 0.9456457011 0.8518013605 1.1360347777 0.511960743 0.9038104168 0.5048413575 0.2777622235 0.2915840525 0.6628516415 0.4600364351 0.7996524113 0.3765721177
0.9119207879 1.2363073271 1.3285269752 1.4027039939 0.9250782309 2.1599381031 1.312307839 0 0 0.8253250513 0 0 0.8903632354
It is stored in a data.txt file.
I want to have a PCA multiplot that looks like this:
What I am doing:
d <- read.table("data.txt", header=TRUE, as.is=TRUE)
model <- prcomp(d, scale=TRUE)
After this I am lost.
How can I cluster the dataset according to the PCA projections and obtain pictures similar to those above?
You are actually asking two different questions:
How to cluster the data after PCA projections.
How to obtain the above plots.
However, before getting to those, I would like to add that if your samples are in columns, then you are not doing PCA correctly. You should do it on the transposed dataset instead, like so:
model <- prcomp(t(d), scale=TRUE)
But for that to work you would have to remove all the constant rows in your data.
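A minimal way to do that (a sketch; it keeps only the rows with nonzero variance before transposing):
d <- d[apply(d, 1, var) > 0, ]        # drop constant (e.g. all-zero) rows
model <- prcomp(t(d), scale. = TRUE)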
Now I assume that you did your PCA step how you wanted.
prcomp returns the rotated data when you specify retx=TRUE (it's TRUE by default), so you will want to use model$x.
Your next step is clustering the data based on the principal components. This can be done in various ways. One is hierarchical clustering; if you want 5 groups in the end, here is one way:
fit <- hclust(dist(model$x[,1:3]), method="complete") # 1:3 -> based on 3 components
groups <- cutree(fit, k=5) # k=5 -> 5 groups
This step will get you groups that will be later used for coloring.
The final step is plotting. Here I wrote a simple function to do all in one shot:
library(rgl)
plotPCA <- function(x, nGroup) {
  n <- ncol(x)
  if (!(n %in% c(2, 3))) {  # check if 2d or 3d
    stop("x must have either 2 or 3 columns")
  }
  fit <- hclust(dist(x), method = "complete")  # cluster
  groups <- cutree(fit, k = nGroup)
  if (n == 3) {  # 3d plot
    plot3d(x, col = groups, type = "s", size = 1, axes = FALSE)
    axes3d(edges = c("x--", "y--", "z"), lwd = 3, axes.len = 2, labels = FALSE)
    grid3d("x")
    grid3d("y")
    grid3d("z")
  } else {  # 2d plot
    maxes <- apply(abs(x), 2, max)
    rangeX <- c(-maxes[1], maxes[1])
    rangeY <- c(-maxes[2], maxes[2])
    plot(x, col = groups, pch = 19, xlab = colnames(x)[1], ylab = colnames(x)[2],
         xlim = rangeX, ylim = rangeY)
    lines(c(0, 0), rangeY * 2)  # vertical axis line through the origin
    lines(rangeX * 2, c(0, 0))  # horizontal axis line through the origin
  }
}
This function is simple. It takes two arguments: 1) a matrix of scores, with principal components in columns and your samples in rows (you can use, for example, model$x[, c(1, 2, 4)] if you want the 1st, 2nd and 4th components), and 2) the number of groups for clustering.
It then clusters the data based on the passed principal components and plots them (in either 2D or 3D, depending on the number of columns passed).
Here are a few examples:
plotPCA(model$x[,1:2], 5)
And 3D example (based on 3 first principal components):
plotPCA(model$x[,1:3], 5)
This last plot will be interactive, so you can rotate it and zoom in/out.
Hope this helps.
Suppose I have a matrix like so:
data=matrix(c(1,0,0,0,0,0,1,0,0.6583,0,0,0,1,0,0,0,0.6583,0,1,0,0,0,0,0,1),nrow=5,ncol=5)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0.0000 0 0.0000 0
[2,] 0 1.0000 0 0.6583 0
[3,] 0 0.0000 1 0.0000 0
[4,] 0 0.6583 0 1.0000 0
[5,] 0 0.0000 0 0.0000 1
How do I create another matrix, say "data2", that has the same number of off-diagonal nonzero elements as "data", but in different locations? The new nonzero values should be drawn from a uniform distribution (so runif).
Here is a somewhat clumsy way to do this. It works well for small matrices but would be too slow if you're going to use this for some very high-dimensional problems.
# Current matrix:
data <- matrix(c(1,0,0,0,0,0,1,0,0.6583,0,0,0,1,0,0,0,0.6583,0,1,0,0,0,0,0,1), nrow=5, ncol=5)
# Number of nonzero elements in the upper triangle:
no.nonzero <- sum(upper.tri(data)*data > 0)
# Generate the same number of new nonzero correlations:
new.cor <- runif(no.nonzero, -1, 1)
# Create a new diagonal matrix:
p <- dim(data)[1]
data2 <- diag(1, p, p)
### Insert nonzero correlations: ###
# Step 1. Identify the places where the nonzero elements can go:
pairs <- (p^2 - p)/2  # number of elements in the upper triangle
combinations <- matrix(NA, pairs, 2)  # indices of those elements, i.e. (1,2), (1,3), ..., (2,3), and so on
k <- 0
for (i in 1:(p - 1)) {
  for (j in (i + 1):p) {
    k <- k + 1
    combinations[k, ] <- c(i, j)
  }
}
# Step 2. Randomly pick indices:
places <- sample(1:k, no.nonzero)
# Step 3. Insert the nonzero correlations symmetrically:
for (i in 1:no.nonzero) {
  data2[combinations[places[i], 1], combinations[places[i], 2]] <-
    data2[combinations[places[i], 2], combinations[places[i], 1]] <- new.cor[i]
}
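For what it's worth, here is a more compact variant of the same idea (a sketch; it samples linear indices of the upper triangle directly and then mirrors them):
p <- nrow(data)
no.nonzero <- sum(data[upper.tri(data)] != 0)
places <- sample(which(upper.tri(data)), no.nonzero)  # random upper-triangle positions
data2 <- diag(1, p, p)
data2[places] <- runif(no.nonzero, -1, 1)
data2[lower.tri(data2)] <- t(data2)[lower.tri(data2)]  # mirror to keep symmetry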
I haven't really understood the question. There are two off-diagonal nonzero elements (0.6583) in the example, right? Is a matrix with those two elements the result you want in this case?
data=matrix(c(1,0,0,0,0,0,1,0,0.6583,0,0,0,1,0,0,0,0.6583,0,1,0,0,0,0,0,1),nrow=5,ncol=5)
# Convert to vector
data2 <- as.numeric(data)
# Remove diagonal
data2 <- data2[as.logical(upper.tri(data) | lower.tri(data))]
# Remove 0 elements
data2 <- data2[data2 != 0]
data2