When using the knn() function from the class package in R, there is an argument called "prob". If I set it to TRUE, I get the probability of each observation belonging to the class it was assigned.
I have a dataset where the classifier has 9 levels. Is there any way in which I can get the probability of a particular observation for all the 9 levels?
As far as I know the knn() function in class only returns the highest probability.
However, you can use the knnflex package, which can return all class probabilities via knn.probability (see here, pages 9-10).
This question still requires a proper answer.
If you only need the probability of the most probable class, the class package is still well suited. The key is to set the argument prob to TRUE and k higher than the default 1 - class::knn(train, test, cl, k = 5, prob = TRUE). k has to be higher than 1, otherwise every observation gets a probability of 100%.
However, if you want probabilities for all of the classes, I recommend the caret::knn3 function combined with predict.
data(iris3)
train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])
cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))
# class package
# note: k must be greater than 1 and prob set to TRUE
model <- class::knn(train, test, cl, k = 5, prob = TRUE)
tail(attributes(model)$prob, 10)
#> [1] 1.0 1.0 1.0 1.0 1.0 1.0 0.8 1.0 1.0 0.8
# caret package
model2 <- predict(caret::knn3(train, cl, k = 3), test)
tail(model2, 10)
#> c s v
#> [66,] 0.0000000 0 1.0000000
#> [67,] 0.0000000 0 1.0000000
#> [68,] 0.0000000 0 1.0000000
#> [69,] 0.0000000 0 1.0000000
#> [70,] 0.0000000 0 1.0000000
#> [71,] 0.0000000 0 1.0000000
#> [72,] 0.3333333 0 0.6666667
#> [73,] 0.0000000 0 1.0000000
#> [74,] 0.0000000 0 1.0000000
#> [75,] 0.3333333 0 0.6666667
Created on 2021-07-20 by the reprex package (v2.0.0)
I know there is an answer already marked here, but this is possible without another function or package.
What you can do instead is build your knn model knn_model and inspect its attributes for the "prob" output, like so:
attributes(knn_model)$prob
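For completeness, a minimal runnable sketch (reusing the iris3 setup from the earlier answer) showing that this attribute holds only one value per observation, the winning class's share of the k votes:

```r
library(class)

# same train/test split as the earlier answer
train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
test  <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])
cl <- factor(c(rep("s", 25), rep("c", 25), rep("v", 25)))

knn_model <- knn(train, test, cl, k = 5, prob = TRUE)
prob <- attributes(knn_model)$prob

length(prob)  # 75: one value per test row, not a per-class matrix
```

So this attribute answers "how confident was the vote", not "what is the probability of each of the 9 levels"; for the full per-class matrix you still need something like caret::knn3.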
Issues with evaluating ranger: in both attempts below I am unable to subset the data (I want the first column of rf.trnprob).
rangerModel = ranger(outcome~., data=traindata, num.trees=200, probability=TRUE)
rf.trnprob = predict(rangerModel, traindata, type='prob')
trainscore <- subset(traindata, select=c("outcome"))
trainscore$score<-rf.trnprob[, 1]
Error:
incorrect number of dimensions
table(pred = rf.trnprob, true=traindata$outcome)
Error:
all arguments must have the same length
It seems the predict function is called incorrectly: type = 'prob' is not a valid option for predict.ranger. With probability = TRUE the forest already predicts class probabilities, and they are stored in the $predictions slot of the returned object. Using an example dataset:
library(ranger)
traindata = iris
traindata$Species = factor(as.numeric(traindata$Species=="versicolor"))
rangerModel = ranger(Species~., data=traindata, probability=TRUE)
rf.trnprob = predict(rangerModel, traindata)
Probability is stored here, one column for each class:
head(rf.trnprob$predictions)
0 1
[1,] 1.0000000 0.000000000
[2,] 0.9971786 0.002821429
[3,] 1.0000000 0.000000000
[4,] 1.0000000 0.000000000
[5,] 1.0000000 0.000000000
[6,] 1.0000000 0.000000000
But seems like you want to do a confusion matrix, so you can get the predictions by doing:
pred = levels(traindata$Species)[max.col(rf.trnprob$predictions)]
Then:
table(pred,traindata$Species)
pred 0 1
0 100 2
1 0 48
I have a dataset with five categorical variables. I ran a multinomial logistic regression with the multinom function in package nnet, then derived p-values from the coefficients. But I do not know how to interpret the results.
The p values were derived according to UCLA's tutorial: https://stats.idre.ucla.edu/r/dae/multinomial-logistic-regression/ .
Just like this:
z <- summary(test)$coefficients/summary(test)$standard.errors
p <- (1 - pnorm(abs(z), 0, 1)) * 2
p
And I got this:
(Intercept) Age1 Age2 Age3 Age4 Unit1 Unit2 Unit3 Unit4 Unit5 Level1 Level2 Area1 Area2
Not severe 0.7388029 9.094373e-01 0 0.000000e+00 0.000000e+00 0 0.75159758 0 0 0.0000000 0.8977727 0.9333862 0.6285447 0.4457171
Very severe 0.0000000 1.218272e-09 0 6.599380e-06 7.811761e-04 0 0.00000000 0 0 0.0000000 0.7658748 0.6209889 0.0000000 0.0000000
Severe 0.0000000 8.744405e-08 0 1.052835e-06 3.299770e-04 0 0.00000000 0 0 0.0000000 0.8843606 0.4862364 0.0000000 0.0000000
Just so so 0.0000000 1.685045e-07 0 5.507560e-03 2.973261e-06 0 0.08427447 0 NaN 0.3010429 0.5552963 0.7291180 0.0000000 0.0000000
Not severe at all 0.0000000 0.000000e+00 0 0.000000e+00 0.000000e+00 0 NaN NaN 0 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
But how should I interpret these p-values? Does this mean Age3 is significantly related to Very severe? I am new to statistics and have no idea. Help me understand the results please. Thank you in advance.
I suggest using the stargazer package to display coefficients and p-values; I believe it is a more convenient and common way.
Regarding the interpretation of the results: in a multinomial model you can say that, keeping all other variables constant, if Age3 is higher by one unit, the log odds of Very severe relative to the reference category change by the amount indicated by the coefficient. The p-value just shows you whether the association between the predictor and the response is significant or not. Interpretation is the same as for other models.
Note: for these p-values the null hypothesis is always that the coefficient equals zero (no effect at all). When the p-value is less than 0.05, you can reject the null hypothesis and conclude that the predictor has an effect on the response variable.
I hope I could give you some hints.
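As a concrete illustration of the z/p computation quoted in the question, here is a hedged sketch using iris as a stand-in for the actual dataset (the model formula is my own choice):

```r
library(nnet)

# Wald z-tests for multinom coefficients, following the UCLA approach
# cited in the question; iris stands in for the real data
fit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris, trace = FALSE)
z <- summary(fit)$coefficients / summary(fit)$standard.errors
p <- 2 * (1 - pnorm(abs(z)))  # two-sided p-values
round(p, 4)
```

Each row of p is one non-reference outcome level and each column one predictor, so a small value in, say, the Sepal.Length column of the versicolor row is read exactly as described above.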
I've tried to fit the following data to an ADBUDG model using the nls function in R, but the singular gradient error keeps coming up and I don't really know how to proceed...
nprice nlv2
[1,] 0.6666667 1.91666667
[2,] 0.7500000 1.91666667
[3,] 0.8333333 1.91666667
[4,] 0.9166667 1.44444444
[5,] 1.0000000 1.00000000
[6,] 1.0833333 0.58333333
[7,] 1.1666667 0.22222222
[8,] 1.2500000 0.08333333
[9,] 1.3333333 0.02777778
code:
fit <- nls(f=nprice~a+b*nlv2^c/(nlv2^c+d),start=list(a=0.083,b=1.89,c=-10.95,d=0.94))
Error in nls(f = nprice ~ a + b * nlv2^c/(nlv2^c + d), start = list(a = 0.083, :
singular gradient
Package nlsr provides an updated version of nls through the function nlxb, which in most cases avoids the "singular gradient" error.
library(nlsr)
# data from the question
df <- data.frame(
  nprice = c(0.6666667, 0.7500000, 0.8333333, 0.9166667, 1.0000000,
             1.0833333, 1.1666667, 1.2500000, 1.3333333),
  nlv2 = c(1.91666667, 1.91666667, 1.91666667, 1.44444444, 1.00000000,
           0.58333333, 0.22222222, 0.08333333, 0.02777778)
)
fit <- nlxb(f = nprice ~ a + b*nlv2^c/(nlv2^c + d),
            data = df,
            start = list(a = 0.083, b = 1.89, c = -10.95, d = 0.94))
## vn:[1] "nprice" "a" "b" "nlv2" "c" "d"
## no weights
fit$coefficients
## a b c d
## -2.1207e+04 2.1208e+04 -7.4083e-01 1.6236e-05
The fitted coefficients are far from the starting values and quite large, which suggests the problem is ill-posed.
I am stuck on how to proceed in RStudio with the Bonferroni correction and the raw p-values for the Pearson correlation matrix. I am a student and new to R. When I calculated the Pearson correlation matrix I only got the r values, not the raw p-values, and I am not sure how to code that. I then tried to calculate the Bonferroni correction and received an error saying a list object cannot be coerced to type double. How do I fix my code so this goes away? I also tried to create a table of the mean, SD, and n for the data and became stuck on how to proceed.
My data is as follows:
Tree Height DBA Leaf Diameter
45.3 14.9 0.76
75.2 26.6 1.06
70.1 22.9 1.19
95 31.8 1.59
107.8 35.5 0.43
93 26.2 1.49
91.5 29 1.19
78.5 29.2 1.1
85.2 30.3 1.24
50 16.8 0.67
47.1 12.8 0.98
73.2 28.4 1.2
Packages I have installed: dplyr, tidyr, multcomp, multcompview.
I read the data in from an Excel CSV (comma-delimited) file; this creates dataHW8_1, 12 obs. of 3 variables.
summary(dataHW8_1)
I then created Scatterplots of the data
plot(dataHW8_1$Tree_Height,dataHW8_1$DBA,main="Scatterplot Tree Height Vs Trunk Diameter at Breast Height (DBA)",xlab="Tree Height (cm)",ylab="DBA (cm)")
plot(dataHW8_1$Tree_Height,dataHW8_1$Leaf_Diameter,main="Scatterplot Tree Height Vs Leaf Diameter",xlab="Tree Height (cm)",ylab="Leaf Diameter (cm)")
plot(dataHW8_1$DBA,dataHW8_1$Leaf_Diameter,main="Scatterplot Trunk Diameter at Breast Height (DBA) Vs Leaf Diameter",xlab="DBA (cm)",ylab="Leaf Diameter (cm)")
I then noticed that the data was not linear, so I transformed it using the log() function
dataHW8_1log = log(dataHW8_1)
I then re-created my Scatterplots using the transformed data
plot(dataHW8_1log$Tree_Height,dataHW8_1log$DBA,main="Scatterplot of Transformed (log) Tree Height Vs Trunk Diameter at Breast Height (DBA)",xlab="Tree Height (cm)",ylab="DBA (cm)")
plot(dataHW8_1log$Tree_Height,dataHW8_1log$Leaf_Diameter,main="Scatterplot of Transformed (log) Tree Height Vs Leaf Diameter",xlab="Tree Height (cm)",ylab="Leaf Diameter (cm)")
plot(dataHW8_1log$DBA,dataHW8_1log$Leaf_Diameter,main="Scatterplot of Transformed (log) Trunk Diameter at Breast Height (DBA) Vs Leaf Diameter",xlab="DBA (cm)",ylab="Leaf Diameter (cm)")
I then created a matrix plot of Scatterplots
pairs(dataHW8_1log)
I then calculated the correlation coefficients using the Pearson method; this does not give an uncorrected matrix of p-values. How do you get that?
cor(dataHW8_1log,method="pearson")
I am stuck on what to do to get a matrix of the raw probabilities (uncorrected P values) of the data
I then tried to calculate the Bonferroni correction. How do you do that?
Data$Bonferroni = p.adjust(dataHW8_1log, method = "bonferroni")
Doing this gave me the following error:
Error in p.adjust(dataHW8_1log, method = "bonferroni") :
(list) object cannot be coerced to type 'double'
I tried to fix this using lapply, but that did not solve my problem.
I then tried to make a table of mean, SD, and n, but I only got as far as the following code and became stuck on where to go from there. How do you do that?
(,data = dataHW8_1log,
FUN = function(x) c(Mean = mean(x, na.rm = T),
n = length(x),
sd = sd(x, na.rm = T))
I have tried following examples online, but none of them have helped me get the Bonferroni correction to work. If anyone can explain what I did wrong and how to make the matrices/table, I would greatly appreciate it.
Here is an example using a 50 rows by 10 columns sample dataframe.
# 50 rows x 10 columns sample dataframe
df <- as.data.frame(matrix(runif(500), ncol = 10));
We can show pairwise scatterplots.
# Pairwise scatterplot
pairs(df);
We can now use cor.test to get p-values for a single comparison. We use a convenience function cor.test.p to do this for all pairwise comparisons. To give credit where credit is due, the function cor.test.p has been taken from this SO post, and takes as an argument a dataframe whilst returning a matrix of uncorrected p-values.
# cor.test on dataframes
# From: https://stackoverflow.com/questions/13112238/a-matrix-version-of-cor-test
cor.test.p <- function(x) {
FUN <- function(x, y) cor.test(x, y)[["p.value"]];
z <- outer(
colnames(x),
colnames(x),
Vectorize(function(i,j) FUN(x[,i], x[,j])));
dimnames(z) <- list(colnames(x), colnames(x));
return(z);
}
# Uncorrected p-values from pairwise correlation tests
pval <- cor.test.p(df);
We now correct for multiple hypothesis testing by applying the Bonferroni correction to every row (or column, since the matrix is symmetric) and we're done. Note that p.adjust takes a vector of p-values as an argument.
# Multiple hypothesis-testing corrected p-values
# Note: pval is a symmetric matrix, so it doesn't matter if we correct
# by column or by row
padj <- apply(pval, 2, p.adjust, method = "bonferroni");
padj;
#V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#V1 0 1 1.0000000 1 1.0000000 1.0000000 1 1 1.0000000 1
#V2 1 0 1.0000000 1 1.0000000 1.0000000 1 1 1.0000000 1
#V3 1 1 0.0000000 1 0.9569498 1.0000000 1 1 1.0000000 1
#V4 1 1 1.0000000 0 1.0000000 1.0000000 1 1 1.0000000 1
#V5 1 1 0.9569498 1 0.0000000 1.0000000 1 1 1.0000000 1
#V6 1 1 1.0000000 1 1.0000000 0.0000000 1 1 0.5461443 1
#V7 1 1 1.0000000 1 1.0000000 1.0000000 0 1 1.0000000 1
#V8 1 1 1.0000000 1 1.0000000 1.0000000 1 0 1.0000000 1
#V9 1 1 1.0000000 1 1.0000000 0.5461443 1 1 0.0000000 1
#V10 1 1 1.0000000 1 1.0000000 1.0000000 1 1 1.0000000 0
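The question also asked for a table of mean, SD, and n. Using the same sample dataframe, here is one base-R sketch (no extra packages; the column names are my own choice):

```r
# 50 rows x 10 columns sample dataframe, as above
df <- as.data.frame(matrix(runif(500), ncol = 10))

# one row per variable: mean, standard deviation, and non-missing count
summ <- t(sapply(df, function(x) c(Mean = mean(x, na.rm = TRUE),
                                   SD   = sd(x, na.rm = TRUE),
                                   n    = sum(!is.na(x)))))
summ
```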
This question already has answers here:
Obtaining threshold values from a ROC curve
(5 answers)
Closed 7 years ago.
My data is somewhat irregular for applying ROC analysis to determine a threshold. To simplify, here is a demo; let x be
x<-c(0,0,0,12, 3, 4, 5, 15, 15.3, 20,18, 26)
Suppose x = 15.1 is the unknown true threshold, and the test outcome y is negative (0) if x == 0 or x > 15.1, and positive (1) otherwise, such that:
y<-c(0,0,0,1, 1, 1, 1, 1, 0,0,0,0)
Because x == 0 also yields a negative outcome, I'm wondering how I can best determine the threshold of x to predict y. I have tried the R packages pROC and ROCR; neither seems straightforward for this situation. Could somebody give me some suggestions?
You have a situation where you predict 0 for high values of x and predict 1 for low values of x, except you always predict 0 if x == 0. Standard packages like pROC and ROCR expect low values of x to be associated with predicting y=0. You could transform your data to this situation by:
Flipping the sign of all your predictions
Replacing the zeros in the flipped predictions with a large negative value, so they always predict 0
In code (using this answer to extract TPR and FPR for each cutoff):
x2 <- -x
x2[x2 == 0] <- -1000
library(ROCR)
pred <- prediction(x2, y)
perf <- performance(pred, "tpr", "fpr")
data.frame(cut=perf@alpha.values[[1]], fpr=perf@x.values[[1]],
           tpr=perf@y.values[[1]])
# cut fpr tpr
# 1 Inf 0.0000000 0.0
# 2 -3.0 0.0000000 0.2
# 3 -4.0 0.0000000 0.4
# 4 -5.0 0.0000000 0.6
# 5 -12.0 0.0000000 0.8
# 6 -15.0 0.0000000 1.0
# 7 -15.3 0.1428571 1.0
# 8 -18.0 0.2857143 1.0
# 9 -20.0 0.4285714 1.0
# 10 -26.0 0.5714286 1.0
# 11 -1000.0 1.0000000 1.0
Now you can select your favorite cutoff based on the true and false positive rates, remembering that the selected cutoff is the negative of the threshold on the original scale.
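If you want to pick the cutoff automatically, one common choice is Youden's J (TPR - FPR). A base-R sketch on the transformed data, assuming nothing beyond the setup above:

```r
x <- c(0, 0, 0, 12, 3, 4, 5, 15, 15.3, 20, 18, 26)
y <- c(0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0)

x2 <- -x
x2[x2 == 0] <- -1000          # same transform as above

cuts <- sort(unique(x2), decreasing = TRUE)
tpr <- sapply(cuts, function(ct) mean(x2[y == 1] >= ct))  # true positive rate
fpr <- sapply(cuts, function(ct) mean(x2[y == 0] >= ct))  # false positive rate

best <- cuts[which.max(tpr - fpr)]  # maximize Youden's J
-best                               # 15: threshold on the original scale
```

Here J reaches 1 at the cutoff -15, i.e. a threshold of 15 on the original scale, which separates the data perfectly and is consistent with the true threshold of 15.1.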