How to interpret H2O's confusion matrix?

I am using h2o version 3.10.4.8.
library(magrittr)
library(h2o)
h2o.init(nthreads = -1, max_mem_size = "6g")
data.url <- "https://raw.githubusercontent.com/DarrenCook/h2o/bk/datasets/"
iris.hex <- paste0(data.url, "iris_wheader.csv") %>%
h2o.importFile(destination_frame = "iris.hex")
y <- "class"
x <- setdiff(names(iris.hex), y)
model.glm <- h2o.glm(x, y, iris.hex, family = "multinomial")
preds <- h2o.predict(model.glm, iris.hex)
h2o.confusionMatrix(model.glm)
h2o.table(preds["predict"])
This is the output of h2o.confusionMatrix(model.glm):
Confusion Matrix: vertical: actual; across: predicted
Iris-setosa Iris-versicolor Iris-virginica Error Rate
Iris-setosa 50 0 0 0.0000 = 0 / 50
Iris-versicolor 0 48 2 0.0400 = 2 / 50
Iris-virginica 0 1 49 0.0200 = 1 / 50
Totals 50 49 51 0.0200 = 3 / 150
Since it says across:predicted, I interpret this to mean that the model made 50 (0 + 48 + 2) predictions that are Iris-versicolor.
This is the output of h2o.table(preds["predict"]):
predict Count
1 Iris-setosa 50
2 Iris-versicolor 49
3 Iris-virginica 51
This tells me that the model made 49 predictions that are Iris-versicolor.
Is the confusion matrix incorrectly labelled or did I make a mistake in interpreting the results?

Row names (vertical) are the actual labels.
Column names (across) are the predicted labels.

You did not make a mistake; the labels are confusing (and causing people to think that the rows and columns were switched). This was fixed recently and will be included in the next release of H2O.
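One way to check the orientation is to rebuild the table from the counts printed in the question and compare its margins with the h2o.table() output. This is a minimal base-R sketch that does not call H2O at all; the object names classes and cm are made up for illustration.

```r
# Counts copied from the h2o.confusionMatrix() output shown in the question.
classes <- c("Iris-setosa", "Iris-versicolor", "Iris-virginica")
cm <- matrix(
  c(50,  0,  0,
     0, 48,  2,
     0,  1, 49),
  nrow = 3, byrow = TRUE,
  dimnames = list(actual = classes, predicted = classes)
)

# Row sums: how many observations truly belong to each class.
rowSums(cm)

# Column sums: how many times each class was predicted.
# These agree with the "Totals" row and with h2o.table(preds["predict"]).
colSums(cm)
```

The 49 Iris-versicolor predictions are the column sum (0 + 48 + 1), whereas 0 + 48 + 2 = 50 is the row sum, i.e. the number of flowers that actually are Iris-versicolor.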

Related

"variable lengths differ" error while running regressions in a loop

I am trying to run a regression loop based on code from a previous answer (How to Loop/Repeat a Linear Regression in R), but I keep getting an error. My outcomes (dependent variables) are 940 metabolites, and my exposures (independent variables) are "bmi", "Age", "sex", "lpa2c", and "smoking", where BMI and Age are continuous. BMI is the main exposure; the other variables are covariates that I am controlling for.
So I'm testing the effect of BMI on 940 metabolites.
Also, I would like to know how I can extract the coefficient, p-value, standard error, and confidence interval for BMI only, and only when it is significant.
This is the code I have used:
y<- c(1653:2592) # response
x1<- c("bmi","Age", "sex","lpa2c", "smoking") # predictor
for (i in x1){
model <- lm(paste("y ~", i[[1]]), data= QBB_clean)
print(summary(model))
}
And this is the error:
Error in model.frame.default(formula = paste("y ~", i[[1]]), data = QBB_clean, :
variable lengths differ (found for 'bmi').
y1 y2 y3 y4 bmi age sex lpa2c smoking
1 0.2875775201 0.59998896 0.238726027 0.784575267 24 18 1 0.470681834 1
2 0.7883051354 0.33282354 0.962358936 0.009429905 12 20 0 0.365845473 1
3 0.4089769218 0.48861303 0.601365726 0.779065883 18 15 0 0.121272054 0
4 0.8830174040 0.95447383 0.515029727 0.729390652 16 21 0 0.046993681 0
5 0.9404672843 0.48290240 0.402573342 0.630131853 18 28 1 0.262796304 1
6 0.0455564994 0.89035022 0.880246541 0.480910830 13 13 0 0.968641168 1
7 0.5281054880 0.91443819 0.364091865 0.156636851 11 12 0 0.488495482 1
8 0.8924190444 0.60873498 0.288239281 0.008215520 21 23 0 0.477822030 0
9 0.5514350145 0.41068978 0.170645235 0.452458394 18 17 1 0.748792881 0
10 0.4566147353 0.14709469 0.172171746 0.492293329 20 15 1 0.667640231 1

If you want to loop over responses you will want something like this:
respvars <- names(QBB_clean[1653:2592])
predvars <- c("bmi","Age", "sex","lpa2c", "smoking")
results <- list()
for (v in respvars) {
form <- reformulate(predvars, response = v)
results[[v]] <- lm(form, data = QBB_clean)
}
You can then print the results with something like lapply(results, summary), extract coefficients, etc. (I have a little trouble seeing how it's going to be useful to just print the results of 940 regressions ... are you really going to inspect them all?)
If you want coefficients etc. for BMI, I think this should work (not tested):
t(sapply(results, function(m) coef(summary(m))["bmi",]))
Or for confidence intervals:
t(sapply(results, function(m) confint(m)["bmi",]))
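To keep only the outcomes where BMI is significant, the two sapply() calls above can be combined into one table and then filtered. The sketch below runs on simulated data standing in for QBB_clean (three outcomes instead of 940, and fewer covariates), so every value in it is made up. The Benjamini-Hochberg adjustment is an illustrative choice for handling many tests, not something the question asked for.

```r
set.seed(1)

# Simulated stand-in for QBB_clean (assumption: not the real data).
n <- 200
dat <- data.frame(
  bmi = rnorm(n, 25, 4),
  Age = rnorm(n, 40, 10),
  sex = rbinom(n, 1, 0.5)
)
dat$y1 <- 0.5 * dat$bmi + rnorm(n)    # truly depends on bmi
dat$y2 <- rnorm(n)                    # pure noise
dat$y3 <- 0.02 * dat$Age + rnorm(n)   # depends on Age only

respvars <- c("y1", "y2", "y3")
predvars <- c("bmi", "Age", "sex")

# Fit one model per outcome, keeping the outcome name as the list name.
results <- lapply(setNames(respvars, respvars), function(v) {
  lm(reformulate(predvars, response = v), data = dat)
})

# One row per outcome: estimate, SE, t value, p-value and 95% CI for bmi.
bmi_tab <- t(sapply(results, function(m) {
  c(coef(summary(m))["bmi", ], confint(m)["bmi", ])
}))

# Adjust the bmi p-values for multiple testing, then keep significant rows.
p_adj <- p.adjust(bmi_tab[, "Pr(>|t|)"], method = "BH")
sig <- bmi_tab[p_adj < 0.05, , drop = FALSE]
sig
```

Using drop = FALSE keeps sig a matrix even when only one outcome passes the threshold.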

How to read the indexes from the prediction output of predict.ranger, R

Using the ranger package I run the following script:
rf <- ranger(Surv(time, Y) ~ ., data = train_frame[1:50000, ], write.forest = TRUE, num.trees = 100)
test_frame <- train_frame[50001:100000, ]
preds <- predict(rf, test_frame)
chfs <- preds$chf
plot(chfs[1, ])
The cumulative hazard function has indexes 1 - 36 on the X-axis. Obviously this corresponds with time, but I'm not sure how: my time of observation variable ranges from a minimum of 0 to a maximum of 399. What is the mapping between the original data and the predicted output from predict.ranger, and how can I operationalize this to quantify degree of risk for a given subject after a given length of time?
Here's a sample of what my time/event data looks like:
Y time
<int> <dbl>
1 1 358
2 0 90
3 0 162
4 0 35
5 0 307
6 0 69
7 0 184
8 0 24
9 0 366
10 0 33
And here's what the CHF of the first subject looks like:
Can anyone help me connect the dots? There are no row or columns names on the "matrix" object that is preds$chf.

The prediction object contains a vector called unique.death.times, which holds the time points at which the CHF and survival estimates are computed. The chf matrix has observations in the rows and these time points in the columns; the same holds for survival.
Reproducible example:
library(survival)
library(ranger)
## Split the data
n <- nrow(veteran)
idx <- sample(n, 2/3*n)
train <- veteran[idx, ]
test <- veteran[-idx, ]
## Grow RF and predict
rf <- ranger(Surv(time, status) ~ ., train, write.forest = TRUE)
preds <- predict(rf, test)
## Example CHF plot
plot(preds$unique.death.times, preds$chf[1, ])
## Example survival plot
plot(preds$unique.death.times, preds$survival[1, ])
Setting importance = "impurity" for survival forests should throw an error.
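To read off the risk for one subject at an arbitrary time, one option is to treat the CHF as a step function over unique.death.times. The sketch below uses made-up vectors in place of preds$unique.death.times and one row of preds$chf, so it runs without ranger; chf_at() is a hypothetical helper, not part of the package.

```r
# Toy stand-ins for preds$unique.death.times and preds$chf[1, ]
# (made-up numbers, for illustration only).
death_times <- c(5, 20, 60, 150, 399)
chf_row     <- c(0.01, 0.05, 0.12, 0.30, 0.55)

# Hypothetical helper: take the CHF value at the last death time <= t,
# and 0 for any t before the first death time.
chf_at <- function(times, chf, t) {
  idx <- findInterval(t, times)
  ifelse(idx == 0, 0, chf[pmax(idx, 1)])
}

# Cumulative hazard at days 3, 60 and 100.
chf_at(death_times, chf_row, c(3, 60, 100))

# Survival probability at day 100 via the standard relation S(t) = exp(-H(t)).
exp(-chf_at(death_times, chf_row, 100))
```

With a real prediction object, the same lookup would be applied to preds$unique.death.times and a row of preds$chf; preds$survival gives the survival estimates directly.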

How to use predict on a test set?

I am eventually going to do multivariate regression with a very large set of predictors. To make sure that I am putting the data in correctly and getting the expected results, I started with a toy model. However, when I try to use predict, it does not predict on the new data, and since the size of the new data differs from that of the training set, it also gives me an error.
I have looked online and tried various things, and none have worked. I am almost ready to give up and write my own functions, but I am also building models with the pls package, which I am guessing calls this internally, so I want to be consistent. Here is the short script I wrote:
x1<-c(1.1,3.4,5.6,1.2,5,6.4,0.9,7.2,5.4,3.1) # Original Variables
x2<-c(10,21,25,15.2,18.9,19,16.2,22.1,18.6,22)
y<-2.0*x1+1.12*x2+rnorm(10,mean=0,sd=0.2) # Define output variable
X<-data.frame(x1,x2)
lfit<-lm(y~.,X) # fit model
n_fit<-lfit$coefficients
xg1<-runif(15,1,10) # define new data
xg2<-runif(15,10,30)
X<-data.frame(xg1,xg2)# put into data frame
y_guess<-predict(lfit,newdata=X) #Predict based on fit
y_actual<-2.0*xg1+1.12*xg2 # actual values because I know the coefficients
y_pred=n_fit[1]+n_fit[2]*xg1+n_fit[3]*xg2 # What predict should give me based on fit
print(y_guess-y_actual) #difference check
print(y_guess-y_pred)
These are the values I am getting and the error message:
[1] -4.7171499 -16.9936498 6.9181074 -6.1964788 -11.1852816 0.9257043 -13.7968731 -6.6624086 15.5365141 -8.5009428
[11] -22.8866505 2.0804016 -1.8728602 -18.7670797 1.2251849
[1] -4.582645 -16.903164 7.038968 -5.878723 -11.149987 1.162815 -13.473351 -6.483111 15.731694 -8.456738
[11] -22.732886 2.390507 -1.662446 -18.627342 1.431469
Warning messages:
1: 'newdata' had 15 rows but variables found have 10 rows
2: In y_guess - y_actual :
longer object length is not a multiple of shorter object length
3: In y_guess - y_pred :
longer object length is not a multiple of shorter object length
The fitted coefficients are 1.97 and 1.13, with an intercept of -0.25. The intercept should be 0, but I added noise; that alone would not cause such a big discrepancy. How do I predict on an independent test set?

From the help documentation, ?predict.lm:
"Variables are first looked for in newdata and then searched for in the usual way (which will include the environment of the formula used in the fit)."
The data frame created by X <- data.frame(xg1, xg2) has different column names (xg1, xg2). predict() cannot find the original names (x1, x2) in newdata, so it falls back to searching the environment of the formula, where it finds the original x1 and x2. The result is that you obtain the fitted values for your original data.
Solve this by making the names in newdata consistent with the original ones:
X <- data.frame(x1=xg1, x2=xg2)
x1 <- c(1.1, 3.4, 5.6, 1.2, 5, 6.4, 0.9, 7.2, 5.4, 3.1) # Original Variables
x2 <- c(10, 21, 25, 15.2, 18.9, 19, 16.2, 22.1, 18.6, 22)
y <- 2.0*x1 + 1.12*x2 + rnorm(10, mean=0, sd=0.2) # Define output variable
X <- data.frame(x1, x2)
lfit <- lm(y~., X) # fit model
n_fit <- lfit$coefficients
xg1 <- runif(15, 1, 10) # define new data
xg2 <- runif(15, 10, 30)
X <- data.frame(x1=xg1, x2=xg2) # put into data frame
y_guess <- predict(lfit, newdata=X) #Predict based on fit
y_actual <- 2.0*xg1 + 1.12*xg2 # actual values because I know the coefficients
y_pred = n_fit[1] + n_fit[2]*xg1 + n_fit[3]*xg2 # What predict should give me based on fit
> print(y_guess - y_actual) #difference check
1 2 3 4 5 6 7 8 9 10 11 12 13
-0.060223916 -0.047790535 -0.018274280 -0.096190467 -0.079490487 -0.063736231 -0.047506981 -0.009523583 -0.047774006 -0.084276807 -0.106322290 -0.030876942 -0.067232989
14 15
-0.023060651 -0.041264431
> print(y_guess - y_pred)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
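A quick guard against this kind of silent fallback is to compare the predictor names stored in the fitted model with the column names of newdata before calling predict(). The sketch below is one way to do that; missing_predictors() is a hypothetical helper written for this example, and the data are simulated along the lines of the question.

```r
set.seed(42)

# Training data with the response stored in the data frame itself.
train <- data.frame(x1 = runif(10, 1, 10), x2 = runif(10, 10, 30))
train$y <- 2.0 * train$x1 + 1.12 * train$x2 + rnorm(10, sd = 0.2)
fit <- lm(y ~ ., data = train)

# Hypothetical helper: predictor names the model needs but newdata lacks.
missing_predictors <- function(model, newdata) {
  needed <- all.vars(delete.response(terms(model)))
  setdiff(needed, names(newdata))
}

bad  <- data.frame(xg1 = runif(15, 1, 10), xg2 = runif(15, 10, 30))
good <- data.frame(x1 = bad$xg1, x2 = bad$xg2)

missing_predictors(fit, bad)    # names absent from `bad`
missing_predictors(fit, good)   # empty when the names line up

# With matching names, predict() returns one value per row of newdata.
length(predict(fit, newdata = good))
```

Keeping the response and predictors inside one data frame, as done here, also means predict() has no same-named vectors in the workspace to fall back on, so a mismatch surfaces as an error rather than as wrong numbers.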

Simulating an interaction effect in a lmer() model in R

Is there an R package with a function that can:
(1) simulate the different values of an interaction variable,
(2) plot a graph that demonstrates the effect of the interaction on Y for different values of the terms in interaction, and
(3) works well with the models fitted with the lmer() function of the lme4 package?
I have looked in arm, ez, coefplot2, and fanovaGraph packages, but could not find what I was looking for.

I'm not sure about a package, but you can simulate data varying the terms in the interaction, and then graph it. Here is an example for a treatment by wave (i.e. longitudinal) interaction and the syntax to plot. I think the story behind the example is a treatment to improve oral reading fluency in school age children. The term of the interaction is modified by changing the function value for bX.
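library(arm)     # attaches lme4, which provides lmer()
library(lattice) # provides xyplot(), used for the plot at the end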
sim1 <- function (b0=50, bGrowth=4.672,bX=15, b01=.770413, b11=.005, Vint=771, Vslope=2.24, Verror=40.34) {
#observation ID
oID<-rep(1:231)
#participant ID
ID<-rep(1:77, each=3)
tmp2<-sample(0:1,77,replace=TRUE,prob=c(.5,.5))
ITT<-tmp2[ID]
#longitudinal wave: for example 0, 4, and 7 months after treatment
wave <-rep(c(0,4,7), 77)
bvaset<-rnorm(77, 0, 11.58)
bva<-bvaset[ID]
#random effect intercept
S.in <- rnorm(77, 0, sqrt(Vint))
#random effect for slope
S.sl<-rnorm(77, 0, sqrt(Vslope))
#observation level error
eps <- rnorm(3*77, 0, sqrt(Verror))
#Create Outcome as product of specified model
ORFset <- b0 + b01*bva+ bGrowth*wave +bX*ITT*wave+ S.in[ID]+S.sl[ID]*wave+eps[oID]
#if else statement to eliminate ORF values below 0
ORF<-ifelse(ORFset<0,0,ORFset)
#Put into a data frame
mydata <- data.frame( oID,ID,ITT, wave,ORF,bva,S.in[ID],S.sl[ID],eps)
#run the model
fit1<-lmer(ORF~1+wave+ITT+wave:ITT+(1+wave|ID),data=mydata)
fit1
#grab variance components
vc<-VarCorr(fit1)
#Select Tau and Sigma to select in the out object
varcomps=c(unlist(lapply(vc,diag)),attr(vc,"sc")^2)
#Produce object to output
out<-c(coef(summary(fit1))[4,"t value"],coef(summary(fit1))[4,"Estimate"],as.numeric(varcomps[2]),varcomps[3])
#outputs T Value, Estimate of Effect, Tau, Sigma Squared
out
mydata
}
mydata<-sim1(b0=50, bGrowth=4.672, bX=1.25, b01=.770413, b11=.005, Vint=771, Vslope=2.24, Verror=40.34)
xyplot(ORF~wave,groups=interaction(ITT),data=mydata,type=c("a","p","g"))

Try plotLMER.fnc() from the languageR package, or the effects package.

The merTools package has some functionality to make this easier, though it only applies to working with lmer and glmer objects. Here's how you might do it:
library(merTools)
# fit an interaction model
m1 <- lmer(y ~ studage * service + (1|d) + (1|s), data = InstEval)
# select an average observation from the model frame
examp <- draw(m1, "average")
# create a modified data.frame by changing one value
simCase <- wiggle(examp, var = "service", values = c(0, 1))
# modify again for the studage variable
simCase <- wiggle(simCase, var = "studage", values = c(2, 4, 6, 8))
After this, we have our simulated data which looks like:
simCase
y studage service d s
1 3.205745 2 0 761 564
2 3.205745 2 1 761 564
3 3.205745 4 0 761 564
4 3.205745 4 1 761 564
5 3.205745 6 0 761 564
6 3.205745 6 1 761 564
7 3.205745 8 0 761 564
8 3.205745 8 1 761 564
Next, we need to generate prediction intervals, which we can do with merTools::predictInterval (or, if you do not need intervals, with the predict() method that lme4 provides for fitted models):
preds <- predictInterval(m1, level = 0.9, newdata = simCase)
Now we get a preds object, which is a 3 column data.frame:
preds
fit lwr upr
1 3.312390 1.2948130 5.251558
2 3.263301 1.1996693 5.362962
3 3.412936 1.3096006 5.244776
4 3.027135 1.1138965 4.972449
5 3.263416 0.6324732 5.257844
6 3.370330 0.9802323 5.073362
7 3.410260 1.3721760 5.280458
8 2.947482 1.3958538 5.136692
We can then put it all together to plot:
library(ggplot2)
plotdf <- cbind(simCase, preds)
ggplot(plotdf, aes(x = service, y = fit, ymin = lwr, ymax = upr)) +
geom_pointrange() + facet_wrap(~studage) + theme_bw()
Unfortunately the data here results in a rather uninteresting, but easy to interpret plot.

Logistic Regression : not actual out put with predict function

I am new to R. When I fit a logistic model using glm(), predict() does not seem to predict the response; it returns an unexpected output, such as 1, for every input I give it.
Code:
ex2data1R <- read.csv("/media/ex2data1R.txt")
x <-ex2data1R$x
y <-ex2data1R$y
z <-ex2data1R$z
logisticmodel <- glm(z~x+y,family=binomial(link = "logit"),data=ex2data1R)
newdata = data.frame(x=c(10),y=(10))
predict(logisticmodel, newdata, type="response")
Output:
> predict(logisticmodel, newdata, type="response")
1
1.181875e-11
Data(ex2data1R.txt) :
"x","y","z"
34.62365962451697,78.0246928153624,0
30.28671076822607,43.89499752400101,0
35.84740876993872,72.90219802708364,0
60.18259938620976,86.30855209546826,1
79.0327360507101,75.3443764369103,1
45.08327747668339,56.3163717815305,0
61.10666453684766,96.51142588489624,1
75.02474556738889,46.55401354116538,1
76.09878670226257,87.42056971926803,1
84.43281996120035,43.53339331072109,1
95.86155507093572,38.22527805795094,0
75.01365838958247,30.60326323428011,0
82.30705337399482,76.48196330235604,1
69.36458875970939,97.71869196188608,1
39.53833914367223,76.03681085115882,0
53.9710521485623,89.20735013750205,1
69.07014406283025,52.74046973016765,1
67.94685547711617,46.67857410673128,0
70.66150955499435,92.92713789364831,1
76.97878372747498,47.57596364975532,1
67.37202754570876,42.83843832029179,0
89.67677575072079,65.79936592745237,1
50.534788289883,48.85581152764205,0
34.21206097786789,44.20952859866288,0
77.9240914545704,68.9723599933059,1
62.27101367004632,69.95445795447587,1
80.1901807509566,44.82162893218353,1
93.114388797442,38.80067033713209,0
61.83020602312595,50.25610789244621,0
38.78580379679423,64.99568095539578,0
61.379289447425,72.80788731317097,1
85.40451939411645,57.05198397627122,1
52.10797973193984,63.12762376881715,0
52.04540476831827,69.43286012045222,1
40.23689373545111,71.16774802184875,0
54.63510555424817,52.21388588061123,0
33.91550010906887,98.86943574220611,0
64.17698887494485,80.90806058670817,1
74.78925295941542,41.57341522824434,0
34.1836400264419,75.2377203360134,0
83.90239366249155,56.30804621605327,1
51.54772026906181,46.85629026349976,0
94.44336776917852,65.56892160559052,1
82.36875375713919,40.61825515970618,0
51.04775177128865,45.82270145776001,0
62.22267576120188,52.06099194836679,0
77.19303492601364,70.45820000180959,1
97.77159928000232,86.7278223300282,1
62.07306379667647,96.76882412413983,1
91.56497449807442,88.69629254546599,1
79.94481794066932,74.16311935043758,1
99.2725269292572,60.99903099844988,1
90.54671411399852,43.39060180650027,1
34.52451385320009,60.39634245837173,0
50.2864961189907,49.80453881323059,0
49.58667721632031,59.80895099453265,0
97.64563396007767,68.86157272420604,1
32.57720016809309,95.59854761387875,0
74.24869136721598,69.82457122657193,1
71.79646205863379,78.45356224515052,1
75.3956114656803,85.75993667331619,1
35.28611281526193,47.02051394723416,0
56.25381749711624,39.26147251058019,0
30.05882244669796,49.59297386723685,0
44.66826172480893,66.45008614558913,0
66.56089447242954,41.09209807936973,0
40.45755098375164,97.53518548909936,1
49.07256321908844,51.88321182073966,0
80.27957401466998,92.11606081344084,1
66.74671856944039,60.99139402740988,1
32.72283304060323,43.30717306430063,0
64.0393204150601,78.03168802018232,1
72.34649422579923,96.22759296761404,1
60.45788573918959,73.09499809758037,1
58.84095621726802,75.85844831279042,1
99.82785779692128,72.36925193383885,1
47.26426910848174,88.47586499559782,1
50.45815980285988,75.80985952982456,1
60.45555629271532,42.50840943572217,0
82.22666157785568,42.71987853716458,0
88.9138964166533,69.80378889835472,1
94.83450672430196,45.69430680250754,1
67.31925746917527,66.58935317747915,1
57.23870631569862,59.51428198012956,1
80.36675600171273,90.96014789746954,1
68.46852178591112,85.59430710452014,1
42.0754545384731,78.84478600148043,0
75.47770200533905,90.42453899753964,1
78.63542434898018,96.64742716885644,1
52.34800398794107,60.76950525602592,0
94.09433112516793,77.15910509073893,1
90.44855097096364,87.50879176484702,1
55.48216114069585,35.57070347228866,0
74.49269241843041,84.84513684930135,1
89.84580670720979,45.35828361091658,1
83.48916274498238,48.38028579728175,1
42.2617008099817,87.10385094025457,1
99.31500880510394,68.77540947206617,1
55.34001756003703,64.9319380069486,1
74.77589300092767,89.52981289513276,1
Am I doing something wrong?

I'm not seeing any problem. Here are predictions for x,y = 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100:
newdata = data.frame(x=seq(30, 100, 5) ,y=seq(30, 100, 5))
predict(logisticmodel, newdata, type="response")
1 2 3 4 5 6
2.423648e-06 1.861140e-05 1.429031e-04 1.096336e-03 8.357794e-03 6.078786e-02
7 8 9 10 11 12
3.320041e-01 7.923883e-01 9.670066e-01 9.955766e-01 9.994218e-01 9.999247e-01
13 14 15
9.999902e-01 9.999987e-01 9.999998e-01
You were predicting at x=10, y=10, which is far outside the range of your x and y values (30 - 100), but the prediction was essentially zero (1.181875e-11), which fits these results. When x and y are low (30 - 55), the prediction for z is close to zero. When x and y are high (75 - 100), the prediction is one (or nearly one). It may be easier to interpret the results if you round them to a few decimals:
round(predict(logisticmodel, newdata, type="response") , 5)
1 2 3 4 5 6 7 8 9 10
0.00000 0.00002 0.00014 0.00110 0.00836 0.06079 0.33200 0.79239 0.96701 0.99558
11 12 13 14 15
0.99942 0.99992 0.99999 1.00000 1.00000
Here is a simple way to predict a category and compare the results with your data:
predict <- ifelse(predict(logisticmodel, type="response")>.5, 1, 0)
xtabs(~predict+ex2data1R$z)
ex2data1R$z
predict 0 1
0 34 5
1 6 55
We used predict() on your original data and then created a rule that picks 1 if the probability is greater than .5 and 0 if it is not. Then we use xtabs() to compare the predictions to the data. When z is 0, we correctly predict zero 34 times and incorrectly predict one 6 times. When z is 1 we correctly predict one 55 times and incorrectly predict zero 5 times. We are correct 89% of the time (34+55)/100*100. You could explore the accuracy of prediction if you use .45 or .55 as the cutoff instead of .5.
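The cutoff comparison suggested above can be sketched as follows. Simulated data stand in for ex2data1R here (the coefficients used to generate z are arbitrary assumptions), so the accuracies will differ from the 89% computed on the real file.

```r
set.seed(123)

# Simulated stand-in for ex2data1R: two scores in 30-100, binary outcome z.
n <- 100
d <- data.frame(x = runif(n, 30, 100), y = runif(n, 30, 100))
d$z <- rbinom(n, 1, plogis(-12.5 + 0.1 * d$x + 0.1 * d$y))

fit <- glm(z ~ x + y, family = binomial(link = "logit"), data = d)

# Fitted probabilities on the training data.
p <- predict(fit, type = "response")

# Share of correct classifications at each cutoff.
cutoffs <- c(0.45, 0.50, 0.55)
accuracy <- sapply(cutoffs, function(k) mean((p > k) == d$z))
names(accuracy) <- cutoffs
accuracy
```

Because the accuracy is measured on the same rows used for fitting, it is an optimistic estimate; a held-out set would give a fairer comparison between cutoffs.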

In my opinion everything is correct, as you can read in the R manual:
newdata - optionally, a data frame in which to look for variables with
which to predict. If omitted, the fitted linear predictors are used.
If you pass a data frame with 1 record, it will produce a prediction only for that one record.
For more details, see the R manual entry for predict.glm, or type the following in the R console (glm is part of the stats package, which is loaded by default, so no library call is needed):
?predict.glm

You can also use the following command to make the confusion matrix:
predict <- ifelse(predict(logisticmodel, type="response")>.5, 1, 0)
table(predict,ex2data1R$z)
