How to use predict on a test set? - r

I am eventually going to do multivariate regression with a very large set of predictors. To make sure that I am putting the data in correctly and getting the expected results, I started with a toy model. However, when I try to use predict it does not predict on the new data, and since the size of the new data is different from the training set it also gives me an error.
I have looked on the Internet and tried various things, and none have worked. I am almost ready to give up and write my own functions, but I am also building models with the pls package, which I am guessing probably calls this internally already, so I want to be consistent. Here is the short script I wrote:
x1<-c(1.1,3.4,5.6,1.2,5,6.4,0.9,7.2,5.4,3.1) # Original Variables
x2<-c(10,21,25,15.2,18.9,19,16.2,22.1,18.6,22)
y<-2.0*x1+1.12*x2+rnorm(10,mean=0,sd=0.2) # Define output variable
X<-data.frame(x1,x2)
lfit<-lm(y~.,X) # fit model
n_fit<-lfit$coefficients
xg1<-runif(15,1,10) # define new data
xg2<-runif(15,10,30)
X<-data.frame(xg1,xg2)# put into data frame
y_guess<-predict(lfit,newdata=X) #Predict based on fit
y_actual<-2.0*xg1+1.12*xg2 # actual values because I know the coefficients
y_pred=n_fit[1]+n_fit[2]*xg1+n_fit[3]*xg2 # What predict should give me based on fit
print(y_guess-y_actual) #difference check
print(y_guess-y_pred)
These are the values I am getting and the error message:
[1] -4.7171499 -16.9936498 6.9181074 -6.1964788 -11.1852816 0.9257043 -13.7968731 -6.6624086 15.5365141 -8.5009428
[11] -22.8866505 2.0804016 -1.8728602 -18.7670797 1.2251849
[1] -4.582645 -16.903164 7.038968 -5.878723 -11.149987 1.162815 -13.473351 -6.483111 15.731694 -8.456738
[11] -22.732886 2.390507 -1.662446 -18.627342 1.431469
Warning messages:
1: 'newdata' had 15 rows but variables found have 10 rows
2: In y_guess - y_actual :
longer object length is not a multiple of shorter object length
3: In y_guess - y_pred :
longer object length is not a multiple of shorter object length
The fitted coefficients are 1.97 and 1.13, with an intercept of -0.25. The intercept should be 0, but I added noise, and that alone would not cause a discrepancy this large. How do I get predict to work on an independent test set?

From the help documentation, ?predict.lm:
"Variables are first looked for in newdata and then searched for in the usual way (which will include the environment of the formula used in the fit)."
The data frame created by X <- data.frame(xg1, xg2) has different column names (xg1, xg2). predict() cannot find the original names (x1, x2) in newdata, so it falls back to searching the environment of the formula, where the original x1 and x2 vectors still exist. The result is that you obtain the fitted values for your original data, which is also why the warning reports 10 rows instead of 15.
Solve this by making your names in the newdata consistent with the original:
X <- data.frame(x1=xg1, x2=xg2) :
x1 <- c(1.1, 3.4, 5.6, 1.2, 5, 6.4, 0.9, 7.2, 5.4, 3.1) # Original Variables
x2 <- c(10, 21, 25, 15.2, 18.9, 19, 16.2, 22.1, 18.6, 22)
y <- 2.0*x1 + 1.12*x2 + rnorm(10, mean=0, sd=0.2) # Define output variable
X <- data.frame(x1, x2)
lfit <- lm(y~., X) # fit model
n_fit <- lfit$coefficients
xg1 <- runif(15, 1, 10) # define new data
xg2 <- runif(15, 10, 30)
X <- data.frame(x1=xg1, x2=xg2) # put into data frame
y_guess <- predict(lfit, newdata=X) #Predict based on fit
y_actual <- 2.0*xg1 + 1.12*xg2 # actual values because I know the coefficients
y_pred = n_fit[1] + n_fit[2]*xg1 + n_fit[3]*xg2 # What predict should give me based on fit
> print(y_guess - y_actual) #difference check
1 2 3 4 5 6 7 8 9 10 11 12 13
-0.060223916 -0.047790535 -0.018274280 -0.096190467 -0.079490487 -0.063736231 -0.047506981 -0.009523583 -0.047774006 -0.084276807 -0.106322290 -0.030876942 -0.067232989
14 15
-0.023060651 -0.041264431
> print(y_guess - y_pred)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
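One way to guard against this kind of silent fallback is to keep every model variable inside a data frame and to check the column names of the new data before predicting. The sketch below is a minimal illustration with made-up data; the names train, test, needed, and missing_vars are assumptions for the example, not part of the question.

```r
set.seed(1)

# Training data: response and predictors all live in one data frame
train <- data.frame(x1 = runif(10, 1, 10), x2 = runif(10, 10, 30))
train$y <- 2.0 * train$x1 + 1.12 * train$x2 + rnorm(10, mean = 0, sd = 0.2)
fit <- lm(y ~ x1 + x2, data = train)

# New data: the columns must carry the same names as the predictors in the fit
test <- data.frame(x1 = runif(15, 1, 10), x2 = runif(15, 10, 30))

# Predictor names the model expects, and any that the new data lacks
needed <- all.vars(delete.response(terms(fit)))
missing_vars <- setdiff(needed, names(test))
stopifnot(length(missing_vars) == 0)

pred <- predict(fit, newdata = test)

# Same prediction computed by hand from the fitted coefficients
manual <- coef(fit)[1] + coef(fit)[2] * test$x1 + coef(fit)[3] * test$x2
print(length(pred))
print(max(abs(pred - manual)))
```

Because y, x1, and x2 exist only inside train here, a misnamed column in test should raise an "object not found" error rather than quietly returning fitted values for the training rows.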

Related

Why do results of matching depend on order of data (MatchIt package)?

When using the matchit-function for full matching, the results differ by the order of the input dataframe. That is, if the order of the data is changed, results change, too. This is surprising, because in my understanding, the optimal full algorithm should yield only one single best solution.
Am I missing something or is this an error?
Similar differences occur with the optimal algorithm.
Below you find a reproducible example. Subclasses should be identical for the two data sets, which they are not.
Thank you for your help!
library(MatchIt)
# create data
nr <- c(1:100)
x1 <- rnorm(100, mean=50, sd=20)
x2 <- c(rep("a", 20),rep("b", 60), rep("c", 20))
x3 <- rnorm(100, mean=230, sd=2)
outcome <- rnorm(100, mean=500, sd=20)
group <- c(rep(0, 50),rep(1, 50))
df <- data.frame(x1=x1, x2=x2, outcome=outcome, group=group, row.names=nr, nr=nr)
df_neworder <- df[order(outcome),] # re-order data.frame
# perform matching
model_oldorder <- matchit(group~x1, data=df, method="full", distance ="logit")
model_neworder <- matchit(group~x1, data=df_neworder, method="full", distance ="logit")
# store matching results
matcheddata_oldorder <- match.data(model_oldorder, distance="pscore")
matcheddata_neworder <- match.data(model_neworder, distance="pscore")
# Results based on original data.frame
head(matcheddata_oldorder[order(nr),], 10)
x1 x2 outcome group nr pscore weights subclass
1 69.773776 a 489.1769 0 1 0.5409943 1.0 27
2 63.949637 a 529.2733 0 2 0.5283582 1.0 32
3 52.217666 a 526.7928 0 3 0.5028106 0.5 17
4 48.936397 a 492.9255 0 4 0.4956569 1.0 9
5 36.501507 a 512.9301 0 5 0.4685876 1.0 16
# Results based on re-ordered data.frame
head(matcheddata_neworder[order(matcheddata_neworder$nr),], 10)
x1 x2 outcome group nr pscore weights subclass
1 69.773776 a 489.1769 0 1 0.5409943 1.0 25
2 63.949637 a 529.2733 0 2 0.5283582 1.0 31
3 52.217666 a 526.7928 0 3 0.5028106 0.5 15
4 48.936397 a 492.9255 0 4 0.4956569 1.0 7
5 36.501507 a 512.9301 0 5 0.4685876 2.0 14
Apparently, the assignment of objects to subclasses differs. In my understanding, this should not be the case.
The developers of the optmatch package (which the matchit function calls) provided useful help:
I think what we're seeing here is the result of the tolerance argument
that fullmatch has. The matching algorithm requires integer distances,
so we have to scale then truncate floating point distances. For a
given set of integer distances, there may be multiple matchings that
achieve the minimum, so the solver is free to pick among these
non-unique solutions.
Developing your example a little more:
> library(optmatch)
> nr <- c(1:100)
> x1 <- rnorm(100, mean=50, sd=20)
> outcome <- rnorm(100, mean=500, sd=20)
> group <- c(rep(0, 50),rep(1, 50))
> df_oldorder <- data.frame(x1=x1, outcome=outcome, group=group, row.names=nr, nr=nr)
> df_neworder <- df_oldorder[order(outcome),] # re-order data.frame
> glm_oldorder <- match_on(glm(group~x1, data=df_oldorder), data = df_oldorder)
> glm_neworder <- match_on(glm(group~x1, data=df_neworder), data = df_neworder)
> fm_old <- fullmatch(glm_oldorder, data=df_oldorder)
> fm_new <- fullmatch(glm_neworder, data=df_neworder)
> mean(sapply(matched.distances(fm_old, glm_oldorder), mean))
## 0.06216174
> mean(sapply(matched.distances(fm_new, glm_neworder), mean))
## 0.062058
> mean(sapply(matched.distances(fm_old, glm_oldorder), mean)) -
+ mean(sapply(matched.distances(fm_new, glm_neworder), mean))
## 0.00010373
which we can see is smaller than the default tolerance of 0.001. You can
always decrease the tolerance level, which may require increased run time,
in order to get closer to the true floating point minimum. We found 0.001
seemed to work well in practice, but there is nothing special about this value.
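The point about non-unique solutions can be seen in a tiny base R example. This is a simplified sketch of the idea only: the distances are made up, and the scale-then-floor step is an assumption standing in for whatever optmatch does internally.

```r
# Made-up distances between two treated units (rows) and two controls (columns)
d <- matrix(c(0.1004, 0.2002, 0.1001, 0.2003), nrow = 2,
            dimnames = list(c("t1", "t2"), c("c1", "c2")))

# Assumed resolution, chosen to mirror the default tolerance quoted above
resolution <- 0.001
d_int <- floor(d / resolution)   # scale, then truncate to integers

# Two possible pair matchings and their costs on the original scale
cost_A <- d["t1", "c1"] + d["t2", "c2"]   # t1-c1 and t2-c2
cost_B <- d["t1", "c2"] + d["t2", "c1"]   # t1-c2 and t2-c1

# The same two matchings evaluated on the integer scale
int_A <- d_int["t1", "c1"] + d_int["t2", "c2"]
int_B <- d_int["t1", "c2"] + d_int["t2", "c1"]

print(c(cost_A = cost_A, cost_B = cost_B))
print(c(int_A = int_A, int_B = int_B))
```

On the integer scale the two matchings tie, so a solver may return either one, and which one it returns can depend on the order of the rows. On the original scale they differ by less than the resolution. In optmatch the tolerance is, as far as I know, set through the tol argument of fullmatch(); check ?fullmatch for your installed version.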

Prediction with lm

I have the following data frame:
lm    mean resids  sd resids     resid 1      resid 2     resid 3   intercept        beta
 1   0.000000e+00  6.2806844  -3.6261548    7.2523096  -3.6261548   103.62615   24.989340
 2  -2.960595e-16  8.7515899  -5.0527328   10.1054656  -5.0527328   141.96786   -1.047323
 3  -2.960595e-16  5.9138984  -3.4143908    6.8287817  -3.4143908   206.29046  -26.448694
 4   3.700743e-17  0.5110845   0.2950748   -0.5901495   0.2950748   240.89801  -35.806642
 5   7.401487e-16  6.6260504   3.8255520   -7.6511040   3.8255520   187.03479  -23.444762
 6   5.921189e-16  8.7217431   5.0355007  -10.0710014   5.0355007    41.43239    3.138396
 7   0.000000e+00  5.5269434   3.1909823   -6.3819645   3.1909823  -119.90628   27.817845
 8  -1.480297e-16  1.0204260  -0.5891432    1.1782864  -0.5891432  -180.33773   35.623363
 9  -5.921189e-16  6.9488186  -4.0119023    8.0238046  -4.0119023   -64.72245   21.820226
10  -8.881784e-16  8.6621512  -5.0010953   10.0021906  -5.0010953   191.65339   -5.218767
Each row represents an estimated linear model with window length 3. I used rollapply on a separate dataframe with the function lm(y~t) to extract the coefficients and intercepts into a new dataframe, which I have combined with the residuals from the same model and their corresponding means and standard deviations.
Since the window length is 3, it implies that there are 3 residuals as shown, per model, in resid 1, resid 2 and resid 3. The mean and sd of these are included accordingly.
I am seeking to predict the next observation, in essence, k+1, where k is the window length, using the intercept and beta.
Recall that lm1 takes observations 1,2,3 to estimate the intercept and the beta, and lm2 takes 2,3,4, lm3 takes 3,4,5, etc. The function for the prediction should be:
predict_lm1 = intercept_lm1 + beta_lm1*(k+1)
Where k+1 = 4. For lm2:
predict_lm2 = intercept_lm2 + beta_lm2*(k+1)
Where k+1 = 5.
Clearly, k increases by 1 every time I move down one row in the dataset. This is because the explanatory variable is time, t, which is a sequence increasing by one per observation.
Should I use a for loop, or an apply function here?
How can I make a function that iterates down the rows and calculates the predictions accordingly with the information found in that row?
Thanks.
EDIT:
I managed to find a possible solution by writing the following:
n=nrow(dataset)
for(i in n){
predictions = dataset$Intercept + dataset$beta*(k+1)
}
However, k does not increase by 1 per iteration. Thus, k+1 is always = 4.
How can I make sure k increases by 1 accordingly?
EDIT 2
I managed to add 1 to k by writing the following:
n=nrow(dataset)
for(i in n){
x = 0
x[i] = k + 1
preds = dataset$`(Intercept)` + dataset$t*(x[i])
}
However, the first prediction is overestimated. It should be 203, whereas it is estimated as 228, implying that it sets the explanatory variable as 1 too high.
Yet, the second prediction is correct. I am not sure what I am doing wrong. Any advice?
EDIT 3
I managed to find a solution as follows:
n=nrow(dataset)
for(i in n){
x = k + 1
preds = dataset$`(Intercept)` + dataset$t*(x)
x = x + 1
}
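Your loop is not iterating: for(i in n) runs the body once, with i equal to n. Loop over 1:n (or seq_len(n)) and index each row with i: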
dataset <- read.table(text="lm meanresids sdresids resid1 resid2 resid3 intercept beta
1 0.000000e+00 6.2806844 -3.6261548 7.2523096 -3.6261548 103.62615 24.989340
2 -2.960595e-16 8.7515899 -5.0527328 10.1054656 -5.0527328 141.96786 -1.047323
3 -2.960595e-16 5.9138984 -3.4143908 6.8287817 -3.4143908 206.29046 -26.448694
4 3.700743e-17 0.5110845 0.2950748 -0.5901495 0.2950748 240.89801 -35.806642
5 7.401487e-16 6.6260504 3.8255520 -7.6511040 3.8255520 187.03479 -23.444762
6 5.921189e-16 8.7217431 5.0355007 -10.0710014 5.0355007 41.43239 3.138396
7 0.000000e+00 5.5269434 3.1909823 -6.3819645 3.1909823 -119.90628 27.817845
8 -1.480297e-16 1.0204260 -0.5891432 1.1782864 -0.5891432 -180.33773 35.623363
9 -5.921189e-16 6.9488186 -4.0119023 8.0238046 -4.0119023 -64.72245 21.820226
10 -8.881784e-16 8.6621512 -5.0010953 10.0021906 -5.0010953 191.65339 -5.218767", header=T)
n <- nrow(dataset)
predictions <- data.frame()
for(i in 1:n){
k <- i + 2 ##per the question, k+1 should be 4 for row 1, 5 for row 2, and so on
predictions <- rbind(predictions, dataset$intercept[i] + dataset$beta[i]*(k+1))
}
predictions
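The same calculation can also be written without a loop, since R arithmetic is vectorised over rows. A minimal sketch, assuming the window length is 3 and that row i was fitted on t = i, i+1, i+2, as the question describes; the names window, t_next, and prediction are made up for the example, and only the first three rows of the data are used:

```r
# First three rows of the intercept and beta columns from the question
dataset <- data.frame(
  intercept = c(103.62615, 141.96786, 206.29046),
  beta      = c(24.989340, -1.047323, -26.448694)
)

window <- 3                                  # window length k for the first model
t_next <- seq_len(nrow(dataset)) + window    # 4, 5, 6: the time point each row predicts

# One-step-ahead prediction for every row at once
dataset$prediction <- dataset$intercept + dataset$beta * t_next
print(dataset)
```

The first prediction comes out at about 203.58, which matches the value of 203 that the question expects for the first model.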

Order confusion matrix in R

I've created a confusion matrix from the observations and its predictions in 3 classes.
classes=c("Underweight", "Normal", "Overweight")
When I compute the confusion matrix, it arranges the classes in the table alphabetically. Here is my code.
# Confusion matrix
Observations <- bmi_classification(cross.m$bmi)
Predicted <- bmi_classification(cross.m$cvpred)
conf <- table(Predicted, Observations)
library(caret)
f.conf <- confusionMatrix(conf)
print(f.conf)
This produces this output:
Confusion Matrix and Statistics
Observations
Predicted Normal Overweight Underweight
Normal 17 0 1
Overweight 1 4 0
Underweight 1 0 1
So, I would like it to show Underweight first, then Normal, and finally Overweight. I've tried to pass the order to the matrix as an argument, but had no luck with that.
EDIT:
I tried reordering it,
conf <- table(Predicted, Observations)
reorder = matrix(c(9, 7, 8, 3, 1, 2, 6, 4, 5), nrow=3, ncol=3)
conf.reorder <- conf[reorder]
but I'm getting, [1] 1 1 0 1 17 1 0 0 4
Convert both vectors to factors with the levels in the order you want, then redo your code:
Observations <- factor( Observations,
levels=c("Underweight","Normal","Overweight") )
Predicted <- factor( Predicted,
levels=c("Underweight","Normal","Overweight") )
conf <- table(Predicted, Observations)
library(caret)
f.conf <- confusionMatrix(conf)
print(f.conf)
Ordinary matrix methods would probably not work since a caret confusion matrix object is a list.
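For the plain table (before it goes into caret), the order can be controlled either through the factor levels or by indexing the table with the class names. The sketch below uses a handful of made-up labels rather than the BMI data, and the names obs, pred, conf_alpha, conf_by_name, and conf_factor are assumptions for the example.

```r
classes <- c("Underweight", "Normal", "Overweight")

# Made-up observed and predicted labels
obs  <- c("Normal", "Overweight", "Underweight", "Normal", "Normal")
pred <- c("Normal", "Overweight", "Normal", "Normal", "Underweight")

# Character vectors are tabulated in alphabetical order
conf_alpha <- table(Predicted = pred, Observations = obs)

# Option 1: reorder an existing table by row and column names
conf_by_name <- conf_alpha[classes, classes]

# Option 2: fix the level order before tabulating
conf_factor <- table(Predicted    = factor(pred, levels = classes),
                     Observations = factor(obs,  levels = classes))

print(conf_by_name)
print(conf_factor)
```

Indexing with a matrix of positions, as in conf[reorder], returns a plain vector, which is why the attempt in the question lost the table layout.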

Simulating an interaction effect in a lmer() model in R

Is there an R package with a function that can:
(1) simulate the different values of an interaction variable,
(2) plot a graph that demonstrates the effect of the interaction on Y for different values of the terms in interaction, and
(3) works well with the models fitted with the lmer() function of the lme4 package?
I have looked in arm, ez, coefplot2, and fanovaGraph packages, but could not find what I was looking for.
I'm not sure about a package, but you can simulate data varying the terms in the interaction, and then graph it. Here is an example for a treatment by wave (i.e. longitudinal) interaction and the syntax to plot. I think the story behind the example is a treatment to improve oral reading fluency in school age children. The term of the interaction is modified by changing the function value for bX.
library(arm) # attaches lme4, which provides lmer()
library(lattice) # provides xyplot()
sim1 <- function (b0=50, bGrowth=4.672,bX=15, b01=.770413, b11=.005, Vint=771, Vslope=2.24, Verror=40.34) {
#observation ID
oID<-rep(1:231)
#participant ID
ID<-rep(1:77, each=3)
tmp2<-sample(0:1,77,replace=TRUE,prob=c(.5,.5))
ITT<-tmp2[ID]
#longitudinal wave: for example 0, 4, and 7 months after treatment
wave <-rep(c(0,4,7), 77)
bvaset<-rnorm(77, 0, 11.58)
bva<-bvaset[ID]
#random effect intercept
S.in <- rnorm(77, 0, sqrt(Vint))
#random effect for slope
S.sl<-rnorm(77, 0, sqrt(Vslope))
#observation level error
eps <- rnorm(3*77, 0, sqrt(Verror))
#Create Outcome as product of specified model
ORFset <- b0 + b01*bva+ bGrowth*wave +bX*ITT*wave+ S.in[ID]+S.sl[ID]*wave+eps[oID]
#if else statement to eliminate ORF values below 0
ORF<-ifelse(ORFset<0,0,ORFset)
#Put into a data frame
mydata <- data.frame( oID,ID,ITT, wave,ORF,bva,S.in[ID],S.sl[ID],eps)
#run the model
fit1<-lmer(ORF~1+wave+ITT+wave:ITT+(1+wave|ID),data=mydata)
fit1
#grab variance components
vc<-VarCorr(fit1)
#Select Tau and Sigma to select in the out object
varcomps=c(unlist(lapply(vc,diag)),attr(vc,"sc")^2)
#Produce object to output
out<-c(coef(summary(fit1))[4,"t value"],coef(summary(fit1))[4,"Estimate"],as.numeric(varcomps[2]),varcomps[3])
#outputs T Value, Estimate of Effect, Tau, Sigma Squared
out
mydata
}
mydata<-sim1(b0=50, bGrowth=4.672, bX=1.25, b01=.770413, b11=.005, Vint=771, Vslope=2.24, Verror=40.34)
xyplot(ORF~wave,groups=interaction(ITT),data=mydata,type=c("a","p","g"))
Try plotLMER.fnc() from the languageR package, or the effects package.
The merTools package has some functionality to make this easier, though it only applies to working with lmer and glmer objects. Here's how you might do it:
library(merTools)
# fit an interaction model
m1 <- lmer(y ~ studage * service + (1|d) + (1|s), data = InstEval)
# select an average observation from the model frame
examp <- draw(m1, "average")
# create a modified data.frame by changing one value
simCase <- wiggle(examp, var = "service", values = c(0, 1))
# modify again for the studage variable
simCase <- wiggle(simCase, var = "studage", values = c(2, 4, 6, 8))
After this, we have our simulated data which looks like:
simCase
y studage service d s
1 3.205745 2 0 761 564
2 3.205745 2 1 761 564
3 3.205745 4 0 761 564
4 3.205745 4 1 761 564
5 3.205745 6 0 761 564
6 3.205745 6 1 761 564
7 3.205745 8 0 761 564
8 3.205745 8 1 761 564
Next, we need to generate prediction intervals, which we can do with merTools::predictInterval (or, without intervals, you could use the predict() method that lme4 provides for fitted models)
preds <- predictInterval(m1, level = 0.9, newdata = simCase)
Now we get a preds object, which is a 3 column data.frame:
preds
fit lwr upr
1 3.312390 1.2948130 5.251558
2 3.263301 1.1996693 5.362962
3 3.412936 1.3096006 5.244776
4 3.027135 1.1138965 4.972449
5 3.263416 0.6324732 5.257844
6 3.370330 0.9802323 5.073362
7 3.410260 1.3721760 5.280458
8 2.947482 1.3958538 5.136692
We can then put it all together to plot:
library(ggplot2)
plotdf <- cbind(simCase, preds)
ggplot(plotdf, aes(x = service, y = fit, ymin = lwr, ymax = upr)) +
geom_pointrange() + facet_wrap(~studage) + theme_bw()
Unfortunately the data here results in a rather uninteresting, but easy to interpret plot.
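The same idea (build a grid of values for the interacting variables, predict on it, then plot) can be sketched in base R. The example below is only an illustration under assumptions: it uses simulated data and a plain lm() as a stand-in for lmer() so that it runs without extra packages, and the names d, grid, gain_ctrl, and gain_trt are made up.

```r
set.seed(42)
n <- 200

# Simulated data with a treatment-by-wave interaction (no random effects here)
d <- data.frame(wave = sample(c(0, 4, 7), n, replace = TRUE),
                ITT  = sample(0:1, n, replace = TRUE))
d$ORF <- 50 + 4.672 * d$wave + 15 * d$ITT * d$wave + rnorm(n, mean = 0, sd = 6)

fit <- lm(ORF ~ wave * ITT, data = d)

# Every combination of the two interacting variables
grid <- expand.grid(wave = c(0, 4, 7), ITT = 0:1)
grid$fit <- predict(fit, newdata = grid)
print(grid)

# Predicted gain from wave 0 to wave 7 in each group
gain_ctrl <- grid$fit[grid$ITT == 0 & grid$wave == 7] -
             grid$fit[grid$ITT == 0 & grid$wave == 0]
gain_trt  <- grid$fit[grid$ITT == 1 & grid$wave == 7] -
             grid$fit[grid$ITT == 1 & grid$wave == 0]
print(c(control = gain_ctrl, treated = gain_trt))
```

The grid data frame can then be handed to xyplot() or ggplot() with one line per ITT group, so the interaction shows up as a difference in slopes.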

Logistic Regression : not actual out put with predict function

I am new to R. When I estimate a logistic model using glm(), predict() does not seem to return the actual response. It gives output that looks wrong, for example something like 1 for every input I pass to predict().
Code:
ex2data1R <- read.csv("/media/ex2data1R.txt")
x <-ex2data1R$x
y <-ex2data1R$y
z <-ex2data1R$z
logisticmodel <- glm(z~x+y,family=binomial(link = "logit"),data=ex2data1R)
newdata = data.frame(x=c(10),y=(10))
predict(logisticmodel, newdata, type="response")
Output:
> predict(logisticmodel, newdata, type="response")
1
1.181875e-11
Data(ex2data1R.txt) :
"x","y","z"
34.62365962451697,78.0246928153624,0
30.28671076822607,43.89499752400101,0
35.84740876993872,72.90219802708364,0
60.18259938620976,86.30855209546826,1
79.0327360507101,75.3443764369103,1
45.08327747668339,56.3163717815305,0
61.10666453684766,96.51142588489624,1
75.02474556738889,46.55401354116538,1
76.09878670226257,87.42056971926803,1
84.43281996120035,43.53339331072109,1
95.86155507093572,38.22527805795094,0
75.01365838958247,30.60326323428011,0
82.30705337399482,76.48196330235604,1
69.36458875970939,97.71869196188608,1
39.53833914367223,76.03681085115882,0
53.9710521485623,89.20735013750205,1
69.07014406283025,52.74046973016765,1
67.94685547711617,46.67857410673128,0
70.66150955499435,92.92713789364831,1
76.97878372747498,47.57596364975532,1
67.37202754570876,42.83843832029179,0
89.67677575072079,65.79936592745237,1
50.534788289883,48.85581152764205,0
34.21206097786789,44.20952859866288,0
77.9240914545704,68.9723599933059,1
62.27101367004632,69.95445795447587,1
80.1901807509566,44.82162893218353,1
93.114388797442,38.80067033713209,0
61.83020602312595,50.25610789244621,0
38.78580379679423,64.99568095539578,0
61.379289447425,72.80788731317097,1
85.40451939411645,57.05198397627122,1
52.10797973193984,63.12762376881715,0
52.04540476831827,69.43286012045222,1
40.23689373545111,71.16774802184875,0
54.63510555424817,52.21388588061123,0
33.91550010906887,98.86943574220611,0
64.17698887494485,80.90806058670817,1
74.78925295941542,41.57341522824434,0
34.1836400264419,75.2377203360134,0
83.90239366249155,56.30804621605327,1
51.54772026906181,46.85629026349976,0
94.44336776917852,65.56892160559052,1
82.36875375713919,40.61825515970618,0
51.04775177128865,45.82270145776001,0
62.22267576120188,52.06099194836679,0
77.19303492601364,70.45820000180959,1
97.77159928000232,86.7278223300282,1
62.07306379667647,96.76882412413983,1
91.56497449807442,88.69629254546599,1
79.94481794066932,74.16311935043758,1
99.2725269292572,60.99903099844988,1
90.54671411399852,43.39060180650027,1
34.52451385320009,60.39634245837173,0
50.2864961189907,49.80453881323059,0
49.58667721632031,59.80895099453265,0
97.64563396007767,68.86157272420604,1
32.57720016809309,95.59854761387875,0
74.24869136721598,69.82457122657193,1
71.79646205863379,78.45356224515052,1
75.3956114656803,85.75993667331619,1
35.28611281526193,47.02051394723416,0
56.25381749711624,39.26147251058019,0
30.05882244669796,49.59297386723685,0
44.66826172480893,66.45008614558913,0
66.56089447242954,41.09209807936973,0
40.45755098375164,97.53518548909936,1
49.07256321908844,51.88321182073966,0
80.27957401466998,92.11606081344084,1
66.74671856944039,60.99139402740988,1
32.72283304060323,43.30717306430063,0
64.0393204150601,78.03168802018232,1
72.34649422579923,96.22759296761404,1
60.45788573918959,73.09499809758037,1
58.84095621726802,75.85844831279042,1
99.82785779692128,72.36925193383885,1
47.26426910848174,88.47586499559782,1
50.45815980285988,75.80985952982456,1
60.45555629271532,42.50840943572217,0
82.22666157785568,42.71987853716458,0
88.9138964166533,69.80378889835472,1
94.83450672430196,45.69430680250754,1
67.31925746917527,66.58935317747915,1
57.23870631569862,59.51428198012956,1
80.36675600171273,90.96014789746954,1
68.46852178591112,85.59430710452014,1
42.0754545384731,78.84478600148043,0
75.47770200533905,90.42453899753964,1
78.63542434898018,96.64742716885644,1
52.34800398794107,60.76950525602592,0
94.09433112516793,77.15910509073893,1
90.44855097096364,87.50879176484702,1
55.48216114069585,35.57070347228866,0
74.49269241843041,84.84513684930135,1
89.84580670720979,45.35828361091658,1
83.48916274498238,48.38028579728175,1
42.2617008099817,87.10385094025457,1
99.31500880510394,68.77540947206617,1
55.34001756003703,64.9319380069486,1
74.77589300092767,89.52981289513276,1
Please let me know: am I doing something wrong?
I'm not seeing any problem. Here are predictions for x,y = 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100:
newdata = data.frame(x=seq(30, 100, 5) ,y=seq(30, 100, 5))
predict(logisticmodel, newdata, type="response")
1 2 3 4 5 6
2.423648e-06 1.861140e-05 1.429031e-04 1.096336e-03 8.357794e-03 6.078786e-02
7 8 9 10 11 12
3.320041e-01 7.923883e-01 9.670066e-01 9.955766e-01 9.994218e-01 9.999247e-01
13 14 15
9.999902e-01 9.999987e-01 9.999998e-01
You were predicting at x=10, y=10, which is far outside the range of your x and y values (30 - 100), but the prediction was essentially zero (1.18e-11), which fits these results. When x and y are low (30 - 55), the predicted probability for z is near zero. When x and y are high (75 - 100), the prediction is one (or nearly one). It may be easier to interpret the results if you round them to a few decimals:
round(predict(logisticmodel, newdata, type="response") , 5)
1 2 3 4 5 6 7 8 9 10
0.00000 0.00002 0.00014 0.00110 0.00836 0.06079 0.33200 0.79239 0.96701 0.99558
11 12 13 14 15
0.99942 0.99992 0.99999 1.00000 1.00000
Here is a simple way to predict a category and compare the results with your data:
predict <- ifelse(predict(logisticmodel, type="response")>.5, 1, 0)
xtabs(~predict+ex2data1R$z)
ex2data1R$z
predict 0 1
0 34 5
1 6 55
We used predict() on your original data and then created a rule that picks 1 if the probability is greater than .5 and 0 if it is not. Then we use xtabs() to compare the predictions to the data. When z is 0, we correctly predict zero 34 times and incorrectly predict one 6 times. When z is 1 we correctly predict one 55 times and incorrectly predict zero 5 times. We are correct 89% of the time (34+55)/100*100. You could explore the accuracy of prediction if you use .45 or .55 as the cutoff instead of .5.
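To make the link between the coefficients, the predicted probability, and the cutoff concrete, here is a small self-contained sketch. It uses simulated data in place of ex2data1R.txt, so the numbers will not match the output above, and names such as d, p_manual, and accuracy are assumptions for the example.

```r
set.seed(7)
n <- 100

# Simulated stand-in for the data: x and y between 30 and 100, binary z
d <- data.frame(x = runif(n, 30, 100), y = runif(n, 30, 100))
d$z <- rbinom(n, 1, plogis(-13 + 0.1 * d$x + 0.1 * d$y))

fit <- glm(z ~ x + y, family = binomial(link = "logit"), data = d)

# type = "response" is the inverse logit of the linear predictor
newdata <- data.frame(x = c(10, 100), y = c(10, 100))
p_predict <- predict(fit, newdata, type = "response")
eta <- coef(fit)[1] + coef(fit)[2] * newdata$x + coef(fit)[3] * newdata$y
p_manual <- plogis(eta)
print(cbind(p_predict, p_manual))

# Share of training rows classified correctly at a few cutoffs
p_train <- predict(fit, type = "response")
cutoffs <- c(0.45, 0.50, 0.55)
accuracy <- sapply(cutoffs, function(cut) mean(ifelse(p_train > cut, 1, 0) == d$z))
names(accuracy) <- cutoffs
print(accuracy)
```

A very small probability at x=10, y=10 is what the inverse logit gives when the linear predictor is strongly negative, so a value such as 1.18e-11 is a valid prediction and not a sign that predict() failed.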
In my opinion all is correct, as you can read from R manual:
newdata - optionally, a data frame in which to look for variables with
which to predict. If omitted, the fitted linear predictors are used.
If you have a data frame with 1 record, it will produce a prediction only for that one.
For more details see the R manual pages for glm and predict.glm,
or, in the R console (glm() is in the stats package, which is loaded by default), type:
?glm
You can also use the following command to make the confusion matrix:
predict <- ifelse(predict(logisticmodel, type="response")>.5, 1, 0)
table(predict,ex2data1R$z)

Resources