I have a dataset with five categorical variables. I ran a multinomial logistic regression with the multinom() function in the nnet package and then derived p-values from the coefficients, but I do not know how to interpret the results.
The p-values were derived following UCLA's tutorial (https://stats.idre.ucla.edu/r/dae/multinomial-logistic-regression/), like this:
z <- summary(test)$coefficients / summary(test)$standard.errors  # Wald z-statistics
p <- (1 - pnorm(abs(z), 0, 1)) * 2                               # two-tailed p-values
p
And I got this:
(Intercept) Age1 Age2 Age3 Age4 Unit1 Unit2 Unit3 Unit4 Unit5 Level1 Level2 Area1 Area2
Not severe 0.7388029 9.094373e-01 0 0.000000e+00 0.000000e+00 0 0.75159758 0 0 0.0000000 0.8977727 0.9333862 0.6285447 0.4457171
Very severe 0.0000000 1.218272e-09 0 6.599380e-06 7.811761e-04 0 0.00000000 0 0 0.0000000 0.7658748 0.6209889 0.0000000 0.0000000
Severe 0.0000000 8.744405e-08 0 1.052835e-06 3.299770e-04 0 0.00000000 0 0 0.0000000 0.8843606 0.4862364 0.0000000 0.0000000
Just so so 0.0000000 1.685045e-07 0 5.507560e-03 2.973261e-06 0 0.08427447 0 NaN 0.3010429 0.5552963 0.7291180 0.0000000 0.0000000
Not severe at all 0.0000000 0.000000e+00 0 0.000000e+00 0.000000e+00 0 NaN NaN 0 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
But how should I interpret these p-values? Does this mean Age3 is significantly related to Very severe? I am new to statistics and have no idea. Please help me understand the results. Thank you in advance.
I suggest using the stargazer package to display coefficients and p-values; I believe it is a more convenient and common way to present them.
Regarding interpretation: in a multinomial model you can say that, keeping all other variables constant, if Age3 is higher by one unit, the log odds of Very severe relative to the reference category change by the amount indicated by the coefficient. The p-value just tells you whether the association between the predictor and the response is significant; the interpretation is the same as in other regression models.
Note: for each p-value, the null hypothesis is that the coefficient equals zero (no effect at all). When the p-value is less than 0.05, you can reject the null hypothesis and state that the predictor has an effect on the response variable.
I hope these hints help.
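A common next step after checking the p-values is to exponentiate the coefficients to get relative risk ratios, which are easier to interpret than log odds. Here is a minimal sketch on a built-in dataset; the model, formula, and variable names are illustrative stand-ins, not the ones from the question.

```r
library(nnet)

# Hypothetical re-fit on iris, just to illustrate the workflow;
# substitute your own fitted model ("test" in the question)
fit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris, trace = FALSE)

z <- summary(fit)$coefficients / summary(fit)$standard.errors  # Wald z-statistics
p <- (1 - pnorm(abs(z))) * 2                                   # two-tailed p-values

# Exponentiated coefficients are relative risk ratios vs. the baseline category:
# a value of 2 would mean the relative risk doubles per one-unit increase
rrr <- exp(coef(fit))
```

Each row of rrr corresponds to one non-baseline outcome category, mirroring the rows of the p-value matrix in the question.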
I have an sf object containing the following columns of data:
HR60 HR70 HR80 HR90 HC60 HC70 HC80 HC90
0.000000 0.000000 8.855827 0.000000 0.0000000 0.0000000 0.3333333 0.0000000
0.000000 0.000000 17.208742 15.885624 0.0000000 0.0000000 1.0000000 1.0000000
1.863863 1.915158 3.450775 6.462453 0.3333333 0.3333333 1.0000000 2.0000000
...
How can I calculate the median of the HR60 to HR90 columns for all observations and place it in a new column, let's say HR-median? I tried apply(), but it seems to work on the whole dataset only, and I need only these four columns to be considered.
We can select those columns with subset() and apply median() row-wise (MARGIN = 1):
df1$HR_median <- apply(subset(df1, select = HR60:HR90), 1, median)
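A self-contained sketch using the first rows shown in the question (HR_median is used instead of HR-median, since a hyphen is not valid in a bare column name):

```r
# First three rows of the question's data, copied from the post
df1 <- data.frame(
  HR60 = c(0.000000, 0.000000, 1.863863),
  HR70 = c(0.000000, 0.000000, 1.915158),
  HR80 = c(8.855827, 17.208742, 3.450775),
  HR90 = c(0.000000, 15.885624, 6.462453)
)

# Row-wise median over just the HR60:HR90 columns
df1$HR_median <- apply(subset(df1, select = HR60:HR90), 1, median)
df1$HR_median
```

For an sf object the same call works, since the geometry column is outside the HR60:HR90 range being selected.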
I am having issues evaluating a ranger model. In both attempts below I am unable to subset the data (I want the first column of rf.trnprob):
rangermodel= ranger(outcome~., data=traindata, num.trees=200, probability=TRUE)
rf.trnprob= predict(rangerModel, traindata, type='prob')
trainscore <- subset(traindata, select=c("outcome"))
trainscore$score<-rf.trnprob[, 1]
Error:
incorrect number of dimensions
table(pred = rf.trbprob, true=traindata$outcome)
Error:
all arguments must have the same length
The problem is that ranger's predict() returns a ranger.prediction object, not a matrix, so rf.trnprob[, 1] fails; the probabilities are stored in its predictions element. Using an example dataset:
library(ranger)
traindata =iris
traindata$Species = factor(as.numeric(traindata$Species=="versicolor"))
rangerModel = ranger(Species~.,data=traindata,probability=TRUE)
rf.trnprob = predict(rangerModel, traindata)  # probability = TRUE was set at training time
Probability is stored here, one column for each class:
head(rf.trnprob$predictions)
0 1
[1,] 1.0000000 0.000000000
[2,] 0.9971786 0.002821429
[3,] 1.0000000 0.000000000
[4,] 1.0000000 0.000000000
[5,] 1.0000000 0.000000000
[6,] 1.0000000 0.000000000
But it seems you want a confusion matrix, so you can get the class predictions with:
pred = levels(traindata$Species)[max.col(rf.trnprob$predictions)]
Then:
table(pred,traindata$Species)
pred 0 1
0 100 2
1 0 48
I am stuck on how to code the Bonferroni correction and the raw p-values for a Pearson correlation matrix in RStudio. I am a student and new to R. When I calculated the Pearson correlation matrix I got only the r values, not the raw p-values, and I am not sure how to obtain those. I then tried to calculate the Bonferroni correction and received an error message saying a list object cannot be coerced to type double. How do I fix my code so this goes away? I am also lost on how to create a table of the mean, SD, and n for the data.
My data is as follows:
Tree_Height  DBA   Leaf_Diameter
45.3 14.9 0.76
75.2 26.6 1.06
70.1 22.9 1.19
95 31.8 1.59
107.8 35.5 0.43
93 26.2 1.49
91.5 29 1.19
78.5 29.2 1.1
85.2 30.3 1.24
50 16.8 0.67
47.1 12.8 0.98
73.2 28.4 1.2
Packages I have installed: dplyr, tidyr, multcomp, multcompView.
I read the data in from an Excel CSV (comma-delimited) file, which creates dataHW8_1: 12 obs. of 3 variables.
summary(dataHW8_1)
I then created scatterplots of the data:
plot(dataHW8_1$Tree_Height,dataHW8_1$DBA,main="Scatterplot Tree Height Vs Trunk Diameter at Breast Height (DBA)",xlab="Tree Height (cm)",ylab="DBA (cm)")
plot(dataHW8_1$Tree_Height,dataHW8_1$Leaf_Diameter,main="Scatterplot Tree Height Vs Leaf Diameter",xlab="Tree Height (cm)",ylab="Leaf Diameter (cm)")
plot(dataHW8_1$DBA,dataHW8_1$Leaf_Diameter,main="Scatterplot Trunk Diameter at Breast Height (DBA) Vs Leaf Diameter",xlab="DBA (cm)",ylab="Leaf Diameter (cm)")
I then noticed that the data was not linear, so I transformed it using the log() function:
dataHW8_1log = log(dataHW8_1)
I then re-created my scatterplots using the transformed data:
plot(dataHW8_1log$Tree_Height, dataHW8_1log$DBA, main = "Scatterplot of Transformed (log) Tree Height Vs Trunk Diameter at Breast Height (DBA)", xlab = "Tree Height (cm)", ylab = "DBA (cm)")
plot(dataHW8_1log$Tree_Height, dataHW8_1log$Leaf_Diameter, main = "Scatterplot of Transformed (log) Tree Height Vs Leaf Diameter", xlab = "Tree Height (cm)", ylab = "Leaf Diameter (cm)")
plot(dataHW8_1log$DBA, dataHW8_1log$Leaf_Diameter, main = "Scatterplot of Transformed (log) Trunk Diameter at Breast Height (DBA) Vs Leaf Diameter", xlab = "DBA (cm)", ylab = "Leaf Diameter (cm)")
I then created a matrix plot of Scatterplots
pairs(dataHW8_1log)
I then calculated the correlation coefficients using the Pearson method, but this does not give an uncorrected matrix of p-values. How do you do that?
cor(dataHW8_1log, method = "pearson")
I am stuck on what to do to get a matrix of the raw (uncorrected) p-values for the data.
I then tried to calculate the Bonferroni correction. How do you do that?
Data$Bonferroni =
p.adjust(dataHW8_1log,
method = "bonferroni")
Doing this gave me the following error:
Error in p.adjust(dataHW8_1log, method = "bonferroni") :
(list) object cannot be coerced to type 'double'
I tried to fix it using lapply, but that did not fix my problem.
I then tried to make a table of the mean, SD, and n, but I only got as far as the following code and became stuck on where to go from there. How do you do that?
(,data = dataHW8_1log,
FUN = function(x) c(Mean = mean(x, na.rm = T),
n = length(x),
sd = sd(x, na.rm = T))
I have tried following examples online, but none of them have helped me get the Bonferroni correction to work. If anyone can explain what I did wrong and how to make the matrices/table, I would greatly appreciate it.
Here is an example using a 50 rows by 10 columns sample dataframe.
# 50 rows x 10 columns sample dataframe
df <- as.data.frame(matrix(runif(500), ncol = 10));
We can show pairwise scatterplots.
# Pairwise scatterplot
pairs(df);
We can now use cor.test to get p-values for a single comparison. We use a convenience function cor.test.p to do this for all pairwise comparisons. To give credit where credit is due, the function cor.test.p has been taken from this SO post, and takes as an argument a dataframe whilst returning a matrix of uncorrected p-values.
# cor.test on dataframes
# From: https://stackoverflow.com/questions/13112238/a-matrix-version-of-cor-test
cor.test.p <- function(x) {
FUN <- function(x, y) cor.test(x, y)[["p.value"]];
z <- outer(
colnames(x),
colnames(x),
Vectorize(function(i,j) FUN(x[,i], x[,j])));
dimnames(z) <- list(colnames(x), colnames(x));
return(z);
}
# Uncorrected p-values from pairwise correlation tests
pval <- cor.test.p(df);
We now correct for multiple hypothesis testing by applying the Bonferroni correction to every row (or column, since the matrix is symmetric) and we're done. Note that p.adjust takes a vector of p-values as an argument.
# Multiple hypothesis-testing corrected p-values
# Note: pval is a symmetric matrix, so it doesn't matter if we correct
# by column or by row
padj <- apply(pval, 2, p.adjust, method = "bonferroni");
padj;
#V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#V1 0 1 1.0000000 1 1.0000000 1.0000000 1 1 1.0000000 1
#V2 1 0 1.0000000 1 1.0000000 1.0000000 1 1 1.0000000 1
#V3 1 1 0.0000000 1 0.9569498 1.0000000 1 1 1.0000000 1
#V4 1 1 1.0000000 0 1.0000000 1.0000000 1 1 1.0000000 1
#V5 1 1 0.9569498 1 0.0000000 1.0000000 1 1 1.0000000 1
#V6 1 1 1.0000000 1 1.0000000 0.0000000 1 1 0.5461443 1
#V7 1 1 1.0000000 1 1.0000000 1.0000000 0 1 1.0000000 1
#V8 1 1 1.0000000 1 1.0000000 1.0000000 1 0 1.0000000 1
#V9 1 1 1.0000000 1 1.0000000 0.5461443 1 1 0.0000000 1
#V10 1 1 1.0000000 1 1.0000000 1.0000000 1 1 1.0000000 0
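The question also asked for a table of the mean, SD, and n. One minimal way to build it, sketched here on a stand-in dataframe (replace df with dataHW8_1log from the question):

```r
# Stand-in dataframe; replace with dataHW8_1log from the question
df <- as.data.frame(matrix(runif(500), ncol = 10))

# Mean, SD, and n (non-missing count) for each column, as a small table
summary_stats <- t(sapply(df, function(x) {
  c(Mean = mean(x, na.rm = TRUE),
    SD   = sd(x, na.rm = TRUE),
    n    = sum(!is.na(x)))
}))
summary_stats
```

sapply() returns one named vector per column; transposing gives one row per variable with Mean, SD, and n as columns.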
I am having some trouble with a cluster analysis that I am trying to do with the pvclust package.
Specifically, I have a data matrix composed of species (rows) and sampling stations (columns). I want to perform a CA in order to group my sampling stations according to my species abundances (which I have previously log(x+1) transformed).
Once I had prepared my matrix adequately, I tried to run a CA with the pvclust package, using Ward's clustering method and Bray-Curtis as the distance index. However, every time I get the following error message:
Error in hclust(distance, method = method.hclust) :
  invalid clustering method
I then tried to perform the same analysis using another clustering method, and I had no problem. I also tried to perform the same analysis using the hclust function, and again there was no problem at all; the analysis ran without issues.
To better understand my problem, I'll display part of my matrix and the script that I used to perform the analysis:
P1 P2 P3 P4 P5 P6
1 10.8750000 3.2888889 2.0769231 1.4166667 3.2395833 5.333333
3 0.3645833 0.3027778 0.3212038 0.7671958 0.4993676 0.000000
4 0.0000000 0.0000000 2.3500000 0.0000000 0.0000000 0.264000
5 0.0000000 0.7333333 0.2692308 0.0000000 0.2343750 0.000000
6 0.0000000 0.9277778 0.0000000 0.2936508 0.7291667 0.000000
7 0.4166667 6.3500000 1.0925463 0.5476190 0.1885169 0.000000
8 1.6250000 0.0000000 0.0000000 0.0000000 5.2187500 0.000000
9 0.0000000 0.8111111 0.0000000 0.0000000 0.0000000 0.000000
10 2.6770833 0.6666667 2.3304890 4.5906085 2.9652778 0.000000
15 1.8020833 0.9666667 1.4807137 3.3878968 0.1666667 0.000000
16 17.8750000 4.9555556 1.4615385 6.5000000 7.8593750 7.666667
19 4.5312500 1.0555556 3.5766941 6.7248677 2.3196181 0.000000
20 0.0000000 0.6777778 0.5384615 0.0000000 0.0000000 0.000000
21 0.0000000 0.9777778 0.0000000 0.2500000 0.0000000 0.000000
24 1.2500000 3.0583333 0.1923077 0.0000000 4.9583333 0.000000
25 0.0000000 0.0000000 2.5699634 0.0000000 0.0000000 0.000000
26 6.6666667 2.2333333 24.8730020 55.9980159 17.6239583 0.000000
where P1-P6 are my sampling stations and the leftmost row numbers are my different species. I'll refer to this example matrix simply as "platforms".
Afterwards, I've used the following code lines:
dist <- function(x, ...){
vegdist(x, ...)
}
result<-pvclust(platforms,method.dist = "bray",method.hclust = "ward")
Note that I ran the first three lines of code because the Bray-Curtis index isn't natively available in the pvclust package; overriding dist was my attempt to make the Bray-Curtis index usable in the pvclust function.
Does anyone know why it doesn't work with the pvclust package?
Any help will be much appreciated.
Kind regards,
Marie
There are two related issues:
method.hclust needs an hclust-compatible method name. In theory pvclust checks for "ward" and converts it to "ward.D", but you should pass the correct name directly: either "ward.D" or "ward.D2".
You cannot over-write dist in that fashion. However, you can pass a custom function to pvclust.
For instance, this should work:
library(vegan)
library(pvclust)
sample.data <- "P1 P2 P3 P4 P5 P6
10.8750000 3.2888889 2.0769231 1.4166667 3.2395833 5.3333330
0.3645833 0.3027778 0.3212038 0.7671958 0.4993676 0.0000000
0.0000000 0.0000000 2.3500000 0.0000000 0.0000000 0.2640000
0.0000000 0.7333333 0.2692308 0.0000000 0.2343750 0.0000000
0.0000000 0.9277778 0.0000000 0.2936508 0.7291667 0.0000000
0.4166667 6.3500000 1.0925463 0.5476190 0.1885169 0.0000000
1.6250000 0.0000000 0.0000000 0.0000000 5.2187500 0.0000000
0.0000000 0.8111111 0.0000000 0.0000000 0.0000000 0.0000000
2.6770833 0.6666667 2.3304890 4.5906085 2.9652778 0.0000000
1.8020833 0.9666667 1.4807137 3.3878968 0.1666667 0.0000000
17.8750000 4.9555556 1.4615385 6.5000000 7.8593750 7.6666670
4.5312500 1.0555556 3.5766941 6.7248677 2.3196181 0.0000000
0.0000000 0.6777778 0.5384615 0.0000000 0.0000000 0.0000000
0.0000000 0.9777778 0.0000000 0.2500000 0.0000000 0.0000000
1.2500000 3.0583333 0.1923077 0.0000000 4.9583333 0.0000000
0.0000000 0.0000000 2.5699634 0.0000000 0.0000000 0.0000000
6.6666667 2.2333333 24.8730020 55.9980159 17.6239583 0.0000000"
platforms <- read.table(text = sample.data, header = TRUE)
result <- pvclust(platforms,
method.dist = function(x){
vegdist(x, "bray")
},
method.hclust = "ward.D")
When using the knn() function in package class in R, there is an argument called "prob". If I make this true, I get the probability of that particular value being classified to whatever it is classified as.
I have a dataset where the classifier has 9 levels. Is there any way in which I can get the probability of a particular observation for all the 9 levels?
As far as I know the knn() function in class only returns the highest probability.
However, you can use the knnflex package which allows you to return all probability levels using knn.probability (see here, page 9-10).
This question still requires a proper answer.
If only the probability of the most probable class is needed, then the class package is still suitable. The trick is to set the argument prob to TRUE and k higher than the default of 1, e.g. class::knn(train, test, cl, k = 5, prob = TRUE). k has to be higher than the default of 1, otherwise you always get 100% probability for each observation.
However, if you want probabilities for each of the classes, I recommend the caret::knn3 function together with its predict method.
data(iris3)
train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])
cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))
# class package
# take into account k higher than 1 and prob equal TRUE
model <- class::knn(train, test, cl, k = 5, prob = TRUE)
tail(attributes(model)$prob, 10)
#> [1] 1.0 1.0 1.0 1.0 1.0 1.0 0.8 1.0 1.0 0.8
# caret package
model2 <- predict(caret::knn3(train, cl, k = 3), test)
tail(model2, 10)
#> c s v
#> [66,] 0.0000000 0 1.0000000
#> [67,] 0.0000000 0 1.0000000
#> [68,] 0.0000000 0 1.0000000
#> [69,] 0.0000000 0 1.0000000
#> [70,] 0.0000000 0 1.0000000
#> [71,] 0.0000000 0 1.0000000
#> [72,] 0.3333333 0 0.6666667
#> [73,] 0.0000000 0 1.0000000
#> [74,] 0.0000000 0 1.0000000
#> [75,] 0.3333333 0 0.6666667
Created on 2021-07-20 by the reprex package (v2.0.0)
I know there is an answer already marked here, but this is possible to do without another function or package.
Instead, build your knn model knn_model and check out its attributes for the "prob" output, as such:
attributes(knn_model)$prob
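For completeness, a self-contained sketch of that approach, reusing the iris3 split from the earlier answer. Note the caveat already raised in this thread: prob gives only the vote share of the winning class, not a full per-class probability matrix.

```r
library(class)

# Same train/test split as the earlier answer in this thread
data(iris3)
train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
test  <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])
cl <- factor(c(rep("s", 25), rep("c", 25), rep("v", 25)))

knn_model <- knn(train, test, cl, k = 3, prob = TRUE)

# Proportion of the k nearest neighbours that voted for the winning class
attributes(knn_model)$prob
```

With k = 3 and three classes, each value is the winning class's share of the three votes, so it is always at least 1/3.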