Ranger predict: incorrect number of dimensions in R

I'm having issues evaluating a ranger model. In both cases below, I am unable to subset the output (I want the first column of rf.trnprob):
rangerModel = ranger(outcome~., data=traindata, num.trees=200, probability=TRUE)
rf.trnprob= predict(rangerModel, traindata, type='prob')
trainscore <- subset(traindata, select=c("outcome"))
trainscore$score<-rf.trnprob[, 1]
Error:
incorrect number of dimensions
table(pred = rf.trnprob, true = traindata$outcome)
Error:
all arguments must have the same length

It seems the predict output is being used incorrectly: predict() on a ranger model returns a ranger.prediction object rather than a matrix, so it cannot be subset with [, 1] directly. The probabilities are stored in its $predictions element. Using an example dataset:
library(ranger)
traindata = iris
traindata$Species = factor(as.numeric(traindata$Species=="versicolor"))
rangerModel = ranger(Species~.,data=traindata,probability=TRUE)
rf.trnprob = predict(rangerModel, data = traindata) # a probability forest returns class probabilities by default
Probability is stored here, one column for each class:
head(rf.trnprob$predictions)
             0           1
[1,] 1.0000000 0.000000000
[2,] 0.9971786 0.002821429
[3,] 1.0000000 0.000000000
[4,] 1.0000000 0.000000000
[5,] 1.0000000 0.000000000
[6,] 1.0000000 0.000000000
But it seems you want a confusion matrix, so you can get the class predictions by doing:
pred = levels(traindata$Species)[max.col(rf.trnprob$predictions)]
Then:
table(pred,traindata$Species)
pred   0   1
   0 100   2
   1   0  48
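One caveat about max.col: it breaks ties at random by default, so repeated runs can assign tied rows to different classes. Passing ties.method = "first" makes the labels reproducible:
pred = levels(traindata$Species)[max.col(rf.trnprob$predictions, ties.method = "first")]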

How does predict.cv.glmnet() calculate the link value for binomial models?

I'm thinking this is a coding error on my part, but I can't figure out what I'm doing wrong. I'm trying to create marginal effects plots for x when x + x^2 are in a binomial glmnet model. When I made predictions using predict.cv.glmnet() on a new dataset, it almost seemed as if the signs of the model coefficients were switched. So I've been comparing predictions when hardcoding the model formula vs. using the predict function. I'm getting different predictions on the new data, but not on the training data, and I'm not sure why.
#############example############
library(tidyverse)
library(glmnet)
data(BinomialExample)
#head(x)
#head(y)
#training data
squareit <- function(x) { x^2 }
x %>%
  as.data.frame() %>%
  dplyr::select(V11, V23, V30) %>% # select three variables
  mutate_all(list(SQ = squareit)) %>% # square them
  as.matrix() %>% { . ->> x1 } # back to a matrix, assigned to x1 in the global environment
# x1[1:5,]
# V11 V23 V30 V11_SQ V23_SQ V30_SQ
# [1,] 0.5706039 0.01473546 0.9080937 0.3255888 0.0002171337 0.8246342
# [2,] 0.4138668 -0.81898245 -2.0492817 0.1712857 0.6707322586 4.1995554
# [3,] -0.4876853 0.03927982 -1.2089006 0.2378369 0.0015429039 1.4614407
# [4,] 0.4644753 -0.98728378 1.7796744 0.2157373 0.9747292699 3.1672408
# [5,] -0.5239595 -0.13288456 -1.1970731 0.2745336 0.0176583076 1.4329841
#example model + coefficients
set.seed(4484)
final.glmnet<-cv.glmnet(x1, y, type.measure="auc",alpha=1,family="binomial")
coef(final.glmnet,s="lambda.min")#print model coefficients
# (Intercept) 0.4097381
# V11 -0.4487949
# V23 0.3322128
# V30 -0.2924600
# V11_SQ -0.3334376
# V23_SQ -0.2867963
# V30_SQ 0.3864748
#get model formula - I'm sure there is a better way
mc<-data.frame(as.matrix(coef(final.glmnet,s="lambda.min")))%>%
rownames_to_column(.,var="rowname")%>%mutate(rowname=recode(rowname,`(Intercept)`="1"))
paste("(",mc$rowname,"*",mc$X1,")",sep="",collapse = "+")
#model formula
# (1*0.409738094215867)+(V11*-0.44879489356345)+(V23*0.332212780157358)+
# (V30*-0.292459974168585)+(V11_SQ*-0.333437624504966)+
# (V23_SQ*-0.286796292157334)+(V30_SQ*0.386474813542322)
#####predict on training data -- same predictions!
#1 predict using predict.cv.glmnet()
pred_cv.glmnet<-predict(final.glmnet, s=final.glmnet$lambda.min, newx=x1[1:5,], type='link')
round(pred_cv.glmnet,3)
#2 predict using model formula
pred.model.formula<-data.frame(x1[1:5,])%>%
mutate(link=(1*0.409738094215867)+(V11*-0.44879489356345)+
(V23*0.332212780157358)+(V30*-0.292459974168585)+
(V11_SQ*-0.333437624504966)+(V23_SQ*-0.286796292157334)+
(V30_SQ*0.386474813542322))%>%
dplyr::select(link)
round(pred.model.formula,3)
# 1 0.103
# 2 1.925
# 3 1.480
# 4 0.225
# 5 1.408
#new data - set up for marginal effects plot
colnames(x1)
test<-data.frame(V23=seq(min(x1[,2]),max(x1[,2]),0.1))%>%#let V23 vary from min-max by .1
mutate(V23_SQ=V23^2, V11=0, V11_SQ=0, V30=0, V30_SQ=0)%>%#square V23 and set others to 0
as.matrix()
test[1:5,]
# > test[1:5,]
# V23 V23_SQ V11 V11_SQ V30 V30_SQ
# [1,] -2.071573 4.291414 0 0 0 0
# [2,] -1.971573 3.887100 0 0 0 0
# [3,] -1.871573 3.502785 0 0 0 0
# [4,] -1.771573 3.138470 0 0 0 0
# [5,] -1.671573 2.794156 0 0 0 0
#####predict on new data -- different predictions!
#1 predict using predict.cv.glmnet()
pred_cv.glmnet<-predict(final.glmnet, s=final.glmnet$lambda.min, newx=test[1:5,], type='link')
round(pred_cv.glmnet,3)
# [1,] 2.765
# [2,] 2.586
# [3,] 2.413
# [4,] 2.247
# [5,] 2.088
#2 predict using model formula
pred.model.formula<-data.frame(test[1:5,])%>%
mutate(link.longhand=(1*0.409738094215867)+(V11*-0.44879489356345)+
(V23*0.332212780157358)+(V30*-0.292459974168585)+
(V11_SQ*-0.333437624504966)+(V23_SQ*-0.286796292157334)+
(V30_SQ*0.386474813542322))%>%
dplyr::select(link.longhand)
round(pred.model.formula,3)
# 1 -1.509
# 2 -1.360
# 3 -1.217
# 4 -1.079
# 5 -0.947
Your understanding of how the calculation works is correct. However, predict() matches the columns of newx by position, not by name, so the newx matrix needs its columns in the same order as the training matrix. When the columns are mixed up, the predict results go haywire, as you see in your example.
If you add a quick adjustment to put your test matrix in the same order as your x1 matrix (a handy shortcut is to use the dimnames from your original matrix in a dplyr::select, as in the third line below)...
test <- data.frame(V23 = seq(min(x1[, 2]), max(x1[, 2]), 0.1)) %>% #let V23 vary from min-max by .1
mutate(V23_SQ = V23^2, V11 = 0, V11_SQ = 0, V30 = 0, V30_SQ = 0) %>% #square V23 and set others to 0
select(all_of(dimnames(x1)[[2]])) %>% #use the dimnames from x1 to select test in proper order!
as.matrix()
...you will find that you get the proper results which align as expected.
#1 predict using predict.cv.glmnet()
pred_cv.glmnet <- predict(final.glmnet, s = 'lambda.min', newx = test[1:5,], type = 'link')
round(pred_cv.glmnet,3)
# [1,] -1.509
# [2,] -1.360
# [3,] -1.217
# [4,] -1.079
# [5,] -0.947
#2 predict using model formula
pred.model.formula <- data.frame(test[1:5,]) %>%
mutate(link.longhand = ((1 * 0.409738094215867) +
(V11 * -0.44879489356345) +
(V23 * 0.332212780157358) +
(V30 * -0.292459974168585) +
(V11_SQ * -0.333437624504966) +
(V23_SQ * -0.286796292157334) +
(V30_SQ * 0.386474813542322))) %>%
dplyr::select(link.longhand)
round(pred.model.formula, 3)
# link.longhand
# 1 -1.509
# 2 -1.360
# 3 -1.217
# 4 -1.079
# 5 -0.947
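Because the matching is purely positional, it can be worth guarding against this in general. A minimal sketch (align_newx is a hypothetical helper, not part of glmnet) that reorders new data to the training layout before predicting:
align_newx <- function(newdata, train_matrix) {
  # check that every training column exists, then reorder to match
  stopifnot(all(colnames(train_matrix) %in% colnames(newdata)))
  as.matrix(as.data.frame(newdata)[, colnames(train_matrix), drop = FALSE])
}
test_aligned <- align_newx(test, x1)
identical(colnames(test_aligned), colnames(x1)) # TRUE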

ADBUG model nls singular gradient in R

I've tried to fit the following data to an ADBUG model using the nls function in R, but the singular gradient error keeps repeating and I don't really know how to proceed...
nprice nlv2
[1,] 0.6666667 1.91666667
[2,] 0.7500000 1.91666667
[3,] 0.8333333 1.91666667
[4,] 0.9166667 1.44444444
[5,] 1.0000000 1.00000000
[6,] 1.0833333 0.58333333
[7,] 1.1666667 0.22222222
[8,] 1.2500000 0.08333333
[9,] 1.3333333 0.02777778
code:
fit <- nls(f=nprice~a+b*nlv2^c/(nlv2^c+d),start=list(a=0.083,b=1.89,c=-10.95,d=0.94))
Error in nls(f = nprice ~ a + b * nlv2^c/(nlv2^c + d), start = list(a = 0.083, :
singular gradient
Package nlsr provides an updated version of nls through the function nlxb, which in most cases avoids the "singular gradient" error (it uses analytic derivatives where possible together with a Levenberg-Marquardt stabilization).
library(nlsr)
fit <- nlxb(f = nprice~a+b*nlv2^c/(nlv2^c+d),
data = df,
start = list(a=0.083,b=1.89,c=-10.95,d=0.94))
## vn:[1] "nprice" "a" "b" "nlv2" "c" "d"
## no weights
fit$coefficients
## a b c d
## -2.1207e+04 2.1208e+04 -7.4083e-01 1.6236e-05
The fitted coefficients are far away from the starting values and very large in magnitude, indicating that the problem is poorly conditioned: the model is likely over-parameterized for these nine points.
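To see how poorly grounded the fit is, you can compute fitted values directly from the returned coefficients and compare them with the data (a quick sketch, using the same data frame df passed to nlxb):
cf <- fit$coefficients
df$fitted <- cf["a"] + cf["b"] * df$nlv2^cf["c"] / (df$nlv2^cf["c"] + cf["d"])
plot(df$nlv2, df$nprice) # observed
lines(sort(df$nlv2), df$fitted[order(df$nlv2)]) # fitted curve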

Unexpected output containing plus, minus, and letters produced by subtracting one column of numbers from another in R

I have a data.frame containing a vector of numeric values (prcp_log).
waterdate PRCP prcp_log
<date> <dbl> <dbl>
1 2007-10-01 0 0
2 2007-10-02 0.02 0.0198
3 2007-10-03 0.31 0.270
4 2007-10-04 1.8 1.03
5 2007-10-05 0.03 0.0296
6 2007-10-06 0.19 0.174
I then pass this data through the Christiano-Fitzgerald band pass filter using the following command from the mFilter package.
library(mFilter)
US1ORLA0076_cffilter <- cffilter(US1ORLA0076$prcp_log,pl=180,pu=365,root=FALSE,drift=FALSE,
type=c("asymmetric"),
nfix=NULL,theta=1)
Which creates an S3 object containing, among other things, a vector of "trend" values and a vector of "cycle" values, like so:
head(US1ORLA0076_cffilter$trend)
[,1]
[1,] 0.05439408
[2,] 0.07275321
[3,] 0.32150292
[4,] 1.07958965
[5,] 0.07799329
[6,] 0.22082246
head(US1ORLA0076_cffilter$cycle)
[,1]
[1,] -0.05439408
[2,] -0.05295058
[3,] -0.05147578
[4,] -0.04997023
[5,] -0.04843449
[6,] -0.04686915
Plotted:
plot(US1ORLA0076_cffilter)
I then apply the following mathematical operation in an attempt to remove the trend and seasonal components from the original numeric vector:
US1ORLA0076$decomp <- ((US1ORLA0076$prcp_log - US1ORLA0076_cffilter$trend) - US1ORLA0076_cffilter$cycle)
Which creates output that includes unexpected elements such as dashes and letters.
head(US1ORLA0076$decomp)
[,1]
[1,] 0.000000e+00
[2,] 0.000000e+00
[3,] 1.387779e-17
[4,] -2.775558e-17
[5,] 0.000000e+00
[6,] 6.938894e-18
What has happened here? What do these additional characters signify? How can I perform this mathematical operation and achieve the desired output of simply $prcp_log minus both the $trend and $cycle values?
I am happy to provide any additional info that will help; just ask.
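As for what happened: the "dashes and letters" are simply R's scientific notation for numbers extremely close to zero. cffilter decomposes the series so that trend + cycle reproduces it almost exactly, so the subtraction leaves only floating-point noise on the order of 1e-17. A quick illustration:
x <- 1.387779e-17               # scientific notation for 1.387779 x 10^-17
format(x, scientific = FALSE)   # prints the same value written out without the e-17 shorthand
all.equal(x, 0)                 # TRUE: the value is zero up to floating-point noise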

R: How to Fix Error Coding for a Bonferroni Correction

I am stuck on how to proceed with coding the Bonferroni correction and the raw p-values for the Pearson correlation matrix in RStudio. I am a student and new to R. When I calculated the Pearson correlation matrix, I only got the r values, not the raw probability values, and I am not sure how to code for those. I then tried to calculate the Bonferroni correction and received an error saying a list object cannot be coerced to type double. How do I fix my code so this goes away? I am also lost on how to get a table of the mean, SD, and n for the data; I tried to create one and became stuck on how to proceed.
My data is as follows:
Tree_Height   DBA   Leaf_Diameter
45.3 14.9 0.76
75.2 26.6 1.06
70.1 22.9 1.19
95 31.8 1.59
107.8 35.5 0.43
93 26.2 1.49
91.5 29 1.19
78.5 29.2 1.1
85.2 30.3 1.24
50 16.8 0.67
47.1 12.8 0.98
73.2 28.4 1.2
Packages I have installed: dplyr, tidyr, multcomp, multcompview
I read in the data from an Excel CSV (comma delimited) file, which creates dataHW8_1: 12 obs. of 3 variables
summary(dataHW8_1)
I then created Scatterplots of the data
plot(dataHW8_1$Tree_Height,dataHW8_1$DBA,main="Scatterplot Tree Height Vs Trunk Diameter at Breast Height (DBA)",xlab="Tree Height (cm)",ylab="DBA (cm)")
plot(dataHW8_1$Tree_Height,dataHW8_1$Leaf_Diameter,main="Scatterplot Tree Height Vs Leaf Diameter",xlab="Tree Height (cm)",ylab="Leaf Diameter (cm)")
plot(dataHW8_1$DBA,dataHW8_1$Leaf_Diameter,main="Scatterplot Trunk Diameter at Breast Height (DBA) Vs Leaf Diameter",xlab="DBA (cm)",ylab="Leaf Diameter (cm)")
I then noticed that the data was not linear so I transformed it using the log() function
dataHW8_1log = log(dataHW8_1)
I then re-created my Scatterplots using the transformed data
plot(dataHW8_1log$Tree_Height, dataHW8_1log$DBA,
     main = "Scatterplot of Transformed (log) Tree Height Vs Trunk Diameter at Breast Height (DBA)",
     xlab = "Tree Height (cm)", ylab = "DBA (cm)")
plot(dataHW8_1log$Tree_Height, dataHW8_1log$Leaf_Diameter,
     main = "Scatterplot of Transformed (log) Tree Height Vs Leaf Diameter",
     xlab = "Tree Height (cm)", ylab = "Leaf Diameter (cm)")
plot(dataHW8_1log$DBA, dataHW8_1log$Leaf_Diameter,
     main = "Scatterplot of Transformed (log) Trunk Diameter at Breast Height (DBA) Vs Leaf Diameter",
     xlab = "DBA (cm)", ylab = "Leaf Diameter (cm)")
I then created a matrix plot of Scatterplots
pairs(dataHW8_1log)
I then calculated the correlation coefficient using the Pearson method. This does not give a matrix of the uncorrected p-values. How do you do that?
cor(dataHW8_1log,method="pearson")
I am stuck on what to do to get a matrix of the raw probabilities (uncorrected P values) of the data
I then tried to calculate the Bonferroni correction. How do you do that?
Data$Bonferroni = p.adjust(dataHW8_1log, method = "bonferroni")
Doing this gave me the following error:
Error in p.adjust(dataHW8_1log, method = "bonferroni") :
(list) object cannot be coerced to type 'double'
I tried to fix this using lapply, but that did not solve my problem.
I then tried to make a table of the mean, SD, and n, but I was only able to create the following code and became stuck on where to go from there. How do you do that?
(,data = dataHW8_1log,
FUN = function(x) c(Mean = mean(x, na.rm = T),
n = length(x),
sd = sd(x, na.rm = T))
I have tried following examples online, but none of them have helped me get the Bonferroni correction coded correctly. If anyone can explain what I did wrong and how to make the matrices/table, I would greatly appreciate it.
Here is an example using a 50-row by 10-column sample dataframe.
# 50 rows x 10 columns sample dataframe
df <- as.data.frame(matrix(runif(500), ncol = 10));
We can show pairwise scatterplots.
# Pairwise scatterplot
pairs(df);
We can now use cor.test to get p-values for a single comparison. We use a convenience function cor.test.p to do this for all pairwise comparisons. To give credit where credit is due, the function cor.test.p has been taken from this SO post, and takes as an argument a dataframe whilst returning a matrix of uncorrected p-values.
# cor.test on dataframes
# From: https://stackoverflow.com/questions/13112238/a-matrix-version-of-cor-test
cor.test.p <- function(x) {
FUN <- function(x, y) cor.test(x, y)[["p.value"]];
z <- outer(
colnames(x),
colnames(x),
Vectorize(function(i,j) FUN(x[,i], x[,j])));
dimnames(z) <- list(colnames(x), colnames(x));
return(z);
}
# Uncorrected p-values from pairwise correlation tests
pval <- cor.test.p(df);
We now correct for multiple hypothesis testing by applying the Bonferroni correction to every row (or column, since the matrix is symmetric) and we're done. Note that p.adjust takes a vector of p-values as an argument.
# Multiple hypothesis-testing corrected p-values
# Note: pval is a symmetric matrix, so it doesn't matter if we correct
# by column or by row
padj <- apply(pval, 2, p.adjust, method = "bonferroni");
padj;
#V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#V1 0 1 1.0000000 1 1.0000000 1.0000000 1 1 1.0000000 1
#V2 1 0 1.0000000 1 1.0000000 1.0000000 1 1 1.0000000 1
#V3 1 1 0.0000000 1 0.9569498 1.0000000 1 1 1.0000000 1
#V4 1 1 1.0000000 0 1.0000000 1.0000000 1 1 1.0000000 1
#V5 1 1 0.9569498 1 0.0000000 1.0000000 1 1 1.0000000 1
#V6 1 1 1.0000000 1 1.0000000 0.0000000 1 1 0.5461443 1
#V7 1 1 1.0000000 1 1.0000000 1.0000000 0 1 1.0000000 1
#V8 1 1 1.0000000 1 1.0000000 1.0000000 1 0 1.0000000 1
#V9 1 1 1.0000000 1 1.0000000 0.5461443 1 1 0.0000000 1
#V10 1 1 1.0000000 1 1.0000000 1.0000000 1 1 1.0000000 0
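For the mean, SD, and n table the question also asks about, here is a minimal base-R sketch on the same sample df (this part was outside the scope of the answer above):
# Mean, SD, and n (non-missing count) for every column
summ <- t(sapply(df, function(x) c(Mean = mean(x, na.rm = TRUE),
                                   SD = sd(x, na.rm = TRUE),
                                   n = sum(!is.na(x)))));
summ;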

Probabilities of all classifications in knn in R

When using the knn() function in package class in R, there is an argument called "prob". If I set it to TRUE, I get the probability of each observation belonging to the class it was assigned.
I have a dataset where the classifier has 9 levels. Is there any way in which I can get the probability of a particular observation for all the 9 levels?
As far as I know the knn() function in class only returns the highest probability.
However, you can use the knnflex package which allows you to return all probability levels using knn.probability (see here, page 9-10).
This question still requires a proper answer.
If only the probability of the most probable class is needed, then the class package is still suitable. The key is to set the argument prob to TRUE and k higher than the default of 1 - class::knn(train, test, cl, k = 5, prob = TRUE). k has to be higher than 1, otherwise every observation will always get a probability of 100%.
However, if you want the probabilities for each of the classes, I recommend the caret::knn3 function combined with predict.
data(iris3)
train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])
cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))
# class package
# take into account k higher than 1 and prob equal TRUE
model <- class::knn(train, test, cl, k = 5, prob = TRUE)
tail(attributes(model)$prob, 10)
#> [1] 1.0 1.0 1.0 1.0 1.0 1.0 0.8 1.0 1.0 0.8
# caret package
model2 <- predict(caret::knn3(train, cl, k = 3), test)
tail(model2, 10)
#> c s v
#> [66,] 0.0000000 0 1.0000000
#> [67,] 0.0000000 0 1.0000000
#> [68,] 0.0000000 0 1.0000000
#> [69,] 0.0000000 0 1.0000000
#> [70,] 0.0000000 0 1.0000000
#> [71,] 0.0000000 0 1.0000000
#> [72,] 0.3333333 0 0.6666667
#> [73,] 0.0000000 0 1.0000000
#> [74,] 0.0000000 0 1.0000000
#> [75,] 0.3333333 0 0.6666667
Created on 2021-07-20 by the reprex package (v2.0.0)
I know there is an answer already marked here, but this is possible to complete without utilizing another function or package.
What you can do instead is build your knn model knn_model and check out its attributes for the "prob" output, as such:
attributes(knn_model)$prob
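For example, reusing the train, test, and cl objects from the earlier answer (keep in mind this attribute holds only the winning class's vote share, not the probabilities for all nine levels):
knn_model <- class::knn(train, test, cl, k = 5, prob = TRUE)
head(attributes(knn_model)$prob) # fraction of the k neighbours voting for the winning class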
