predict.MCMCglmm error -- giving broad, non-granular values

I'm using the MCMCglmm function from the MCMCglmm package to fit a mixed-effects model for a three-level categorical response. The goal is to get probabilities of each of the three classes in response for each row of the 5000-row test_week test set.
prior6 <- list(R=list(V=diag(2), nu=0.0001),
               G=list(G1=list(V=diag(4), nu=0.0001, alpha.mu=rep(0,4), alpha.V=diag(4) * 25),
                      G2=list(V=diag(1), nu=0.0001, alpha.mu=rep(0,1), alpha.V=diag(1) * 25),
                      G3=list(V=diag(2), nu=0.0001, alpha.mu=rep(0,2), alpha.V=diag(2) * 25),
                      G4=list(V=diag(1), nu=0.0001, alpha.mu=rep(0,1), alpha.V=diag(1) * 25),
                      G5=list(V=diag(2), nu=0.0001, alpha.mu=rep(0,2), alpha.V=diag(2) * 25),
                      G6=list(V=diag(1), nu=0.0001, alpha.mu=rep(0,1), alpha.V=diag(1) * 25)))
mix_mod_fit6 <- MCMCglmm(response ~ 1 + x + y + z,
                         random=~us(1 + x + y + z):a + us(1):d + us(1 + x):b +
                                 us(1):e + us(1 + z):c + us(1):f,
                         rcov=~us(trait):units, prior=prior6, family='categorical',
                         data=train_weeks_sample1, nitt=3850, thin=45, burnin=2500)
mixed_final_predictions6 <- predict.MCMCglmm(mix_mod_fit6, test_week,
                                             type='response', interval='prediction')
The issue arises with the predict function, which returns a 10000x1 matrix of numbers that only very roughly mirror the probabilities of the 2nd and 3rd levels of response (the first 5000 rows correspond to the 2nd level and the following 5000 to the 3rd level). After splitting them into a 5000x2 matrix (see the sketch after the table below), I can backsolve to get the predictions for the 1st level, but the predictions are problematic: they only take values in multiples of 1/30, as shown below (and also illustrated in the histograms in the picture).
temp6
value  0.000  0.033  0.067  0.100  0.133  0.167  0.200  0.233  0.267  0.300  0.333
count   7935   3914   2199   1901   1883   2173   2257   2198   1991   1703   1465
value  0.367  0.400  0.433  0.467  0.500  0.533  0.567  0.600  0.633  0.667  0.700
count   1184    987    756    527    410    268    164     98     35     24      6
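For reference, the split-and-backsolve step described above looks roughly like this (a hedged sketch; the stacking order and object names are taken from the description and code above):
# assumed stacking: rows 1-5000 = P(level 2), rows 5001-10000 = P(level 3)
p23 <- matrix(mixed_final_predictions6[, 1], nrow = 5000, ncol = 2)
p1  <- 1 - rowSums(p23)  # backsolved probability of the 1st level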
Any insight into how the predict.MCMCglmm function works, how to obtain predictions more granular than thirtieths, and how to get predictions for all three levels rather than just two would be greatly appreciated! Thank you!
Note: The prior's V values were specifically selected to match the element sizes in the MCMCglmm call, and cannot be changed. However, altering the value of nu has no effect on the predictions, nor does omitting alpha.mu and alpha.V.
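A hedged observation (my own arithmetic, worth verifying against the MCMCglmm documentation): the sampler settings above retain only (nitt - burnin) / thin posterior samples, and that count matches the 1/30 granularity exactly:
nitt <- 3850; burnin <- 2500; thin <- 45
(nitt - burnin) / thin
# [1] 30  -- averaging over 30 stored draws can only yield multiples of 1/30
If that is the cause, increasing nitt (or lowering thin) should produce finer-grained predicted probabilities.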
Current results vs ideal results distribution example:

Related

Error: variables were specified with different types from the fit

Using the car package, I am trying to predict the response variable prestige in the Prestige dataset from income, education, and the factor type with the lm function. Before fitting, I want to scale education and income. If you copy the code below and run it in RStudio, the console says: Error: variables ‘income’, ‘I(income^2)’, ‘education’, ‘I(education^2)’ were specified with different types from the fit
library(car)
summary(Prestige)
Prestige$education <- scale(Prestige$education)
Prestige$income <- scale(Prestige$income)
fit <- lm(prestige ~ income + I(income^2) + education + I(education^2)
          + income:education + type + type:income + type:I(income^2)
          + type:education + type:I(education^2) + type:income:education, Prestige)
summary(fit)
pred <- expand.grid(income = c(1000, 20000), education = c(10, 20), type = levels(Prestige$type))
pred$prestige.pred <- predict(fit, newdata = pred)
pred
Without scaling the predictors it works fine, so the error is evidently caused by scaling before prediction. How can I fix this?
Note that scale() actually changes the class of your columns. See
class(car::Prestige$education)
# [1] "numeric"
class(scale(car::Prestige$education))
# [1] "matrix"
You would be safe simplifying them back to numeric vectors. You can use the dimension-stripping property of c() for this:
Prestige$education <- c(scale(Prestige$education))
Prestige$income <- c(scale(Prestige$income))
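You can verify that the columns are plain numeric vectors again:
class(Prestige$education)
# [1] "numeric"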
Then I was able to run your model with
fit <- lm(prestige ~ income + I(income^2) + education + I(education^2)
          + income:education + type + type:income + type:I(income^2)
          + type:education + type:I(education^2) + type:income:education,
          Prestige, na.action="na.omit")
and the prediction returned
income education type prestige.pred
1 1000 10 bc -1352364.5
2 20000 10 bc -533597423.4
3 1000 20 bc -1382361.7
4 20000 20 bc -534229639.3
5 1000 10 prof 398464.2
6 20000 10 prof 155567014.1
7 1000 20 prof 409271.3
8 20000 20 prof 155765754.7
9 1000 10 wc -7661464.3
10 20000 10 wc -3074382169.9
11 1000 20 wc -7634693.8
12 20000 20 wc -3073902696.6
Also note you can simplify your formula somewhat with
fit <- lm(prestige ~ (income + I(income^2) + education + I(education^2))*type +
          income:education + type:income:education, Prestige, na.action="na.omit")
This uses * to create many of the interaction terms.
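For instance, you can inspect the terms that * expands to (a quick illustrative check, not from the original answer):
attr(terms(~ (income + I(income^2)) * type), "term.labels")
# [1] "income"            "I(income^2)"       "type"
# [4] "income:type"       "I(income^2):type"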
scale() adds attributes that seem to create problems with lm(). Using
Prestige$education <- as.numeric(scale(Prestige$education))
Prestige$income <- as.numeric(scale(Prestige$income))
makes everything work.

Neural net accuracy is too low

I just wrote my first attempt at a neural network for household classification using energy-consumption features. So far I can make it run, but the output seems questionable.
As you can see, I'm using 18 features (maybe too many?) to predict whether it's a single or non-single household.
I got 3488 rows like this:
c_day c_weekend c_evening c_morning c_night c_noon c_max c_min r_mean_max r_min_mean r_night_day r_morning_noon
12 14 1826 9 765 3 447 2 878 0 7338 4
r_evening_noon t_above_1kw t_above_2kw t_above_mean t_daily_max single
3424 1 695 0 174319075712881 1
My neural network uses these parameters:
net.nn <- neuralnet(single ~ c_day + c_weekend + c_weekday + c_evening + c_morning
                    + c_night + c_noon + c_max + c_min + r_mean_max + r_min_mean
                    + r_night_day + r_morning_noon + r_evening_noon + t_above_1kw
                    + t_above_2kw + t_above_mean + t_daily_max,
                    train, hidden=15, threshold=0.01, linear.output=F)
1 repetition was calculated.
Error Reached Threshold Steps
1 126.3425379 0.009899229932 4091
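As an aside, a formula this long can be built programmatically; here is a hedged sketch (assuming train contains the target single plus the 18 feature columns):
# build "single ~ c_day + c_weekend + ..." from the column names
feats <- setdiff(names(train), "single")
f <- reformulate(feats, response = "single")
net.nn <- neuralnet(f, train, hidden = 15, threshold = 0.01, linear.output = FALSE)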
I normalized the data beforehand using the min-max formula (note the parentheses around both the numerator and the denominator; without them, operator precedence computes something entirely different):
for(i in names(full_data)){
  x <- as.numeric(full_data[,i])
  full_data[,i] <- (x - min(x)) / (max(x) - min(x))
}
I got 3488 rows of data and split them into a training set and a test set.
half <- nrow(full_data)/2
train <- full_data[1:half,]
test <- full_data[(half+1):3488,]  # (half+1) so train and test don't overlap at row `half`
net.results <- compute(net.nn, test)
I used the prediction output and bound it to the actual "single [y/no]" column to compare the results:
predict <- net.results$net.result
cleanoutput <- cbind(predict, full_data$single[(half+1):3488])
colnames(cleanoutput) <- c("predicted","actual")
So when I print it, this is my classification result for the first 10 rows:
predicted actual
1701 0.1661093405 0
1702 0.1317067578 0
1703 0.1677147708 1
1704 0.2051188618 1
1705 0.2013035634 0
1706 0.2088726723 0
1707 0.2683753128 1
1708 0.1661093405 0
1709 0.2385537285 1
1710 0.1257108821 0
If I understand it right, when I round the predicted output it should be either a 0 or a 1, but it always ends up being a 0!
Am I using the wrong parameters? Is my data simply not suitable for nn prediction? Is the normalization wrong?
It means your model performance is still not good. Once you have tuned the model to good performance, you should see the expected behavior. Neural network techniques are very susceptible to scale differences between columns, so standardization of the data [mean = 0, sd = 1] is good practice. As the OP pointed out, scale() does the job.
Using scale(full_data) on the entire data set did the trick. Now the data is standardized to zero mean and unit variance, and the output looks much more reliable.
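One caveat worth adding (my note, not from the original answer): scale(full_data) also rescales the 0/1 target column, so you may prefer to standardize only the feature columns, e.g.:
# standardize the predictors; leave the binary target untouched
feats <- setdiff(names(full_data), "single")
full_data[feats] <- scale(full_data[feats])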

Simulating an interaction effect in a lmer() model in R

Is there an R package with a function that can:
(1) simulate the different values of an interaction variable,
(2) plot a graph that demonstrates the effect of the interaction on Y for different values of the terms in interaction, and
(3) works well with the models fitted with the lmer() function of the lme4 package?
I have looked in the arm, ez, coefplot2, and fanovaGraph packages, but could not find what I was looking for.
I'm not sure about a package, but you can simulate data varying the terms in the interaction and then graph it. Here is an example for a treatment-by-wave (i.e. longitudinal) interaction, with syntax to plot it. The story behind the example is a treatment to improve oral reading fluency in school-age children. The interaction term is modified by changing the function argument bX.
library(arm)
sim1 <- function(b0=50, bGrowth=4.672, bX=15, b01=.770413, b11=.005,
                 Vint=771, Vslope=2.24, Verror=40.34) {
  # observation ID
  oID <- rep(1:231)
  # participant ID
  ID <- rep(1:77, each=3)
  tmp2 <- sample(0:1, 77, replace=TRUE, prob=c(.5, .5))
  ITT <- tmp2[ID]
  # longitudinal wave: for example 0, 4, and 7 months after treatment
  wave <- rep(c(0, 4, 7), 77)
  bvaset <- rnorm(77, 0, 11.58)
  bva <- bvaset[ID]
  # random effect for intercept
  S.in <- rnorm(77, 0, sqrt(Vint))
  # random effect for slope
  S.sl <- rnorm(77, 0, sqrt(Vslope))
  # observation-level error
  eps <- rnorm(3*77, 0, sqrt(Verror))
  # create outcome as product of specified model
  ORFset <- b0 + b01*bva + bGrowth*wave + bX*ITT*wave + S.in[ID] + S.sl[ID]*wave + eps[oID]
  # eliminate ORF values below 0
  ORF <- ifelse(ORFset < 0, 0, ORFset)
  # put into a data frame
  mydata <- data.frame(oID, ID, ITT, wave, ORF, bva, S.in[ID], S.sl[ID], eps)
  # run the model
  fit1 <- lmer(ORF ~ 1 + wave + ITT + wave:ITT + (1 + wave | ID), data=mydata)
  fit1
  # grab variance components
  vc <- VarCorr(fit1)
  # select Tau and Sigma for the out object
  varcomps <- c(unlist(lapply(vc, diag)), attr(vc, "sc")^2)
  # produce object to output: t value, estimate of effect, Tau, Sigma squared
  out <- c(coef(summary(fit1))[4, "t value"], coef(summary(fit1))[4, "Estimate"],
           as.numeric(varcomps[2]), varcomps[3])
  out
  mydata
}
mydata <- sim1(b0=50, bGrowth=4.672, bX=1.25, b01=.770413, b11=.005, Vint=771, Vslope=2.24, Verror=40.34)
library(lattice)  # xyplot() comes from lattice
xyplot(ORF ~ wave, groups=interaction(ITT), data=mydata, type=c("a","p","g"))
Try plotLMER.fnc() from the languageR package, or the effects package.
The merTools package has some functionality to make this easier, though it only applies to working with lmer and glmer objects. Here's how you might do it:
library(merTools)
# fit an interaction model
m1 <- lmer(y ~ studage * service + (1|d) + (1|s), data = InstEval)
# select an average observation from the model frame
examp <- draw(m1, "average")
# create a modified data.frame by changing one value
simCase <- wiggle(examp, var = "service", values = c(0, 1))
# modify again for the studage variable
simCase <- wiggle(simCase, var = "studage", values = c(2, 4, 6, 8))
After this, we have our simulated data which looks like:
simCase
y studage service d s
1 3.205745 2 0 761 564
2 3.205745 2 1 761 564
3 3.205745 4 0 761 564
4 3.205745 4 1 761 564
5 3.205745 6 0 761 564
6 3.205745 6 1 761 564
7 3.205745 8 0 761 564
8 3.205745 8 1 761 564
Next, we need to generate prediction intervals, which we can do with merTools::predictInterval (or, if you don't need intervals, with the predict() method that lme4 provides for fitted models).
preds <- predictInterval(m1, level = 0.9, newdata = simCase)
Now we get a preds object, which is a 3 column data.frame:
preds
fit lwr upr
1 3.312390 1.2948130 5.251558
2 3.263301 1.1996693 5.362962
3 3.412936 1.3096006 5.244776
4 3.027135 1.1138965 4.972449
5 3.263416 0.6324732 5.257844
6 3.370330 0.9802323 5.073362
7 3.410260 1.3721760 5.280458
8 2.947482 1.3958538 5.136692
We can then put it all together to plot:
library(ggplot2)
plotdf <- cbind(simCase, preds)
ggplot(plotdf, aes(x = service, y = fit, ymin = lwr, ymax = upr)) +
geom_pointrange() + facet_wrap(~studage) + theme_bw()
Unfortunately the data here results in a rather uninteresting, but easy to interpret plot.

Probability heatmap in ggplot

I asked this question a year ago and got code for this "probability heatmap":
numbet <- 32
numtri <- 1e5
prob=5/6
#Fill a matrix
xcum <- matrix(NA, nrow=numtri, ncol=numbet+1)
for (i in 1:numtri) {
  x <- sample(c(0,1), numbet, prob=c(prob, 1-prob), replace = TRUE)
  xcum[i, ] <- c(i, cumsum(x)/cumsum(1:numbet))
}
colnames(xcum) <- c("trial", paste("bet", 1:numbet, sep=""))
mxcum <- reshape(data.frame(xcum), varying=1+1:numbet,
                 idvar="trial", v.names="outcome", direction="long", timevar="bet")
library(plyr)
mxcum2 <- ddply(mxcum, .(bet, outcome), nrow)
mxcum3 <- ddply(mxcum2, .(bet), summarize,
                ymin=c(0, head(seq_along(V1)/length(V1), -1)),
                ymax=seq_along(V1)/length(V1),
                fill=(V1/sum(V1)))
head(mxcum3)
library(ggplot2)
p <- ggplot(mxcum3, aes(xmin=bet-0.5, xmax=bet+0.5, ymin=ymin, ymax=ymax)) +
  geom_rect(aes(fill=fill), colour="grey80") +
  scale_fill_gradient("Outcome", low="red", high="blue", labels=scales::percent) +
  scale_y_continuous(labels=scales::percent) +
  xlab("Bet")
print(p)
(The original used the old formatter="percent" argument, which later ggplot2 versions replaced with labels = scales::percent, as above.)
This is almost exactly what I want, except that each vertical shaft should have a different number of bins: the first should have 2, the second 3, the third 4 (N+1). In the graph, shafts 6 and 7 have the same number of bins (7), whereas shaft 7 should have 8 (N+1).
If I'm right, the code does this because it plots the observed data, and if I ran more trials we would get more bins. I don't want to rely on the number of trials to get the correct number of bins.
How can I adapt this code to give the correct number of bins?
I have used R's dbinom to generate the frequency of heads for n=1:32 trials and plotted the graph; it should be what you expect. I have read some of your earlier posts here on SO and on math.stackexchange, but I still don't understand why you'd want to simulate the experiment rather than generate from a binomial random variable. If you could explain, that would be great! I'll try to work on the simulated solution from @Andrie to see if I can match the output shown below. For now, here's something you might be interested in.
set.seed(42)
numbet <- 32
numtri <- 1e5
prob=5/6
require(plyr)
out <- ldply(1:numbet, function(idx) {
  outcome <- dbinom(idx:0, size=idx, prob=prob)
  bet <- rep(idx, length(outcome))
  N <- round(outcome * numtri)
  ymin <- c(0, head(seq_along(N)/length(N), -1))
  ymax <- seq_along(N)/length(N)
  data.frame(bet, fill=outcome, ymin, ymax)
})
require(ggplot2)
p <- ggplot(out, aes(xmin=bet-0.5, xmax=bet+0.5, ymin=ymin, ymax=ymax)) +
  geom_rect(aes(fill=fill), colour="grey80") +
  scale_fill_gradient("Outcome", low="red", high="blue") +
  xlab("Bet")
The plot:
Edit: Explanation of how your old code from Andrie works and why it doesn't give what you intend.
Basically, what Andrie did (or rather, one way to look at it) is to use the fact that if you have two binomial distributions, X ~ B(n, p) and Y ~ B(m, p), where n, m = size and p = probability of success, then their sum X + Y ~ B(n + m, p) (1). So the purpose of xcum is to obtain the outcome for all n = 1:32 tosses, but to explain it better, let me construct the code step by step. Along with the explanation, the code for xcum will also become obvious, and it can be constructed in no time (without any need for a for-loop or computing a cumsum every time).
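Property (1) is easy to sanity-check by simulation (a quick illustrative aside, not part of the original answer):
set.seed(1)
x <- rbinom(1e5, size = 2, prob = 1/6)  # X ~ B(2, 1/6)
y <- rbinom(1e5, size = 3, prob = 1/6)  # Y ~ B(3, 1/6)
round(table(x + y) / 1e5, 3)            # empirical distribution of X + Y
round(dbinom(0:5, 5, prob = 1/6), 3)    # theoretical B(5, 1/6)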
If you have followed me so far, our idea is first to create a numtri * numbet matrix, with each column (length = numtri) having 0's and 1's with probability = 5/6 and 1/6 respectively. That is, if you have numtri = 1000, then you'll have ~834 0's and ~166 1's in each of the numbet columns (= 32 here). Let's construct and test this first.
numtri <- 1e3
numbet <- 32
set.seed(45)
xcum <- t(replicate(numtri, sample(0:1, numbet, prob=c(5/6,1/6), replace = TRUE)))
# check for count of 1's
> apply(xcum, 2, sum)
 [1] 169 158 166 166 160 182 164 181 168 140 154 142 169 168 159 187 176 155 151 151 166
[22] 163 164 176 162 160 177 157 163 166 146 170
# So the counts of 1's are "approximately" what we expect (around 166).
Now, each of these columns holds numtri samples of a binomial distribution with size n = 1. If we add the first two columns and replace the second column with this sum, then from (1), since the probabilities are equal, we end up with a binomial distribution with n = 2. Similarly, if you add the first three columns and replace the 3rd column with this sum, you obtain a binomial distribution with n = 3, and so on...
The concept is that if you cumulatively add each column, then you end up with numbet number of binomial distributions (1 to 32 here). So, let's do that.
xcum <- t(apply(xcum, 1, cumsum))
# you can verify that the second column has similar probabilities by this:
# calculate the frequency of all values in 2nd column.
> table(xcum[,2])
0 1 2
694 285 21
> round(numtri * dbinom(2:0, 2, prob=5/6))
[1] 694 278 28
# more or less identical, good!
If you divide the xcum we have generated thus far by cumsum(1:numbet) over each row, in this manner:
xcum <- xcum/matrix(rep(cumsum(1:numbet), each=numtri), ncol = numbet)
it will be identical to the xcum matrix that comes out of the for-loop (if you generate it with the same seed). However, I don't quite understand the reason for this division by Andrie, as it is not necessary to generate the graph you require. I suppose it has something to do with the frequency values you talked about in an earlier post on math.stackexchange.
Now on to why you have difficulty obtaining the graph I attached (with n+1 bins):
For a binomial distribution with n=1:32 trials, 5/6 as the probability of tails (failure) and 1/6 as the probability of heads (success), the probability of k heads is given by:
nCk * (1/6)^k * (5/6)^(n-k) # where nCk is n choose k
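As a quick check of this formula against R's dbinom (illustrative, using n = 7 and k = 2):
n <- 7; k <- 2
choose(n, k) * (1/6)^k * (5/6)^(n - k)
# [1] 0.2344286
dbinom(k, n, prob = 1/6)
# [1] 0.2344286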
For the test data we've generated, the observed proportions of k heads for n=7 and n=8 trials come out as:
# n=7
0 1 2 3 4 5
.278 .394 .233 .077 .016 .002
# n=8
0 1 2 3 4 5
.229 .375 .254 .111 .025 .006
Why do they both have 6 bins, and not 8 and 9 bins? Of course, this has to do with the value of numtri=1000. Let's see what the probabilities of each of these 8 and 9 bins are by generating them directly from the binomial distribution using dbinom, to understand why this happens.
# n = 7
dbinom(7:0, 7, prob=5/6)
# output rounded to 3 decimal places
[1] 0.279 0.391 0.234 0.078 0.016 0.002 0.000 0.000
# n = 8
dbinom(8:0, 8, prob=5/6)
# output rounded to 3 decimal places
[1] 0.233 0.372 0.260 0.104 0.026 0.004 0.000 0.000 0.000
You see that the probabilities corresponding to k=6,7 (for n=7) and k=6,7,8 (for n=8) are ~0. They are very low in value; the minimum here is (1/6)^8 ≈ 5.95e-7 (n=8, k=8). This means you'd expect to see such a value about once per 1/(5.95e-7) ≈ 1.7 million simulations. If you check the same for n=32 and k=32, the value is ≈1.256e-25, so you'd have to simulate on that order of magnitude to get even one result where all 32 outcomes are heads.
This is why your results had no values for certain bins: the probability of observing them is very low for the given numtri. For the same reason, generating the probabilities directly from the binomial distribution overcomes this limitation.
I hope I've managed to write with enough clarity for you to follow. Let me know if you have trouble going through it.
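These tail probabilities are easy to reproduce directly (an illustrative check, not part of the original answer):
dbinom(8, 8, prob = 1/6)    # [1] 5.953741e-07
dbinom(32, 32, prob = 1/6)  # [1] 1.256493e-25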
Edit 2:
When I ran the code edited above with numtri=1e6 and counted the number of heads for k=0:7 (n=7) and k=0:8 (n=8), I got:
# n = 7
0 1 2 3 4 5 6 7
279347 391386 233771 77698 15763 1915 117 3
# n = 8
0 1 2 3 4 5 6 7 8
232835 372466 259856 104116 26041 4271 392 22 1
Note that there are now counts for k=6 and k=7 for n=7 and n=8. Also, for n=8, you have a count of 1 for k=8. With increasing numtri you'll fill more of the missing bins, but it'll require a huge amount of time/memory (if it's feasible at all).

Logistic regression: not the expected output from the predict function

I am new to R. When I estimate a logistic model using glm(), it does not seem to predict the response; calling the predict function returns what looks like 1 for every input.
Code:
ex2data1R <- read.csv("/media/ex2data1R.txt")
x <- ex2data1R$x
y <- ex2data1R$y
z <- ex2data1R$z
logisticmodel <- glm(z ~ x + y, family=binomial(link="logit"), data=ex2data1R)
newdata = data.frame(x=c(10), y=c(10))
predict(logisticmodel, newdata, type="response")
Output:
> predict(logisticmodel, newdata, type="response")
1
1.181875e-11
Data (ex2data1R.txt):
"x","y","z"
34.62365962451697,78.0246928153624,0
30.28671076822607,43.89499752400101,0
35.84740876993872,72.90219802708364,0
60.18259938620976,86.30855209546826,1
79.0327360507101,75.3443764369103,1
45.08327747668339,56.3163717815305,0
61.10666453684766,96.51142588489624,1
75.02474556738889,46.55401354116538,1
76.09878670226257,87.42056971926803,1
84.43281996120035,43.53339331072109,1
95.86155507093572,38.22527805795094,0
75.01365838958247,30.60326323428011,0
82.30705337399482,76.48196330235604,1
69.36458875970939,97.71869196188608,1
39.53833914367223,76.03681085115882,0
53.9710521485623,89.20735013750205,1
69.07014406283025,52.74046973016765,1
67.94685547711617,46.67857410673128,0
70.66150955499435,92.92713789364831,1
76.97878372747498,47.57596364975532,1
67.37202754570876,42.83843832029179,0
89.67677575072079,65.79936592745237,1
50.534788289883,48.85581152764205,0
34.21206097786789,44.20952859866288,0
77.9240914545704,68.9723599933059,1
62.27101367004632,69.95445795447587,1
80.1901807509566,44.82162893218353,1
93.114388797442,38.80067033713209,0
61.83020602312595,50.25610789244621,0
38.78580379679423,64.99568095539578,0
61.379289447425,72.80788731317097,1
85.40451939411645,57.05198397627122,1
52.10797973193984,63.12762376881715,0
52.04540476831827,69.43286012045222,1
40.23689373545111,71.16774802184875,0
54.63510555424817,52.21388588061123,0
33.91550010906887,98.86943574220611,0
64.17698887494485,80.90806058670817,1
74.78925295941542,41.57341522824434,0
34.1836400264419,75.2377203360134,0
83.90239366249155,56.30804621605327,1
51.54772026906181,46.85629026349976,0
94.44336776917852,65.56892160559052,1
82.36875375713919,40.61825515970618,0
51.04775177128865,45.82270145776001,0
62.22267576120188,52.06099194836679,0
77.19303492601364,70.45820000180959,1
97.77159928000232,86.7278223300282,1
62.07306379667647,96.76882412413983,1
91.56497449807442,88.69629254546599,1
79.94481794066932,74.16311935043758,1
99.2725269292572,60.99903099844988,1
90.54671411399852,43.39060180650027,1
34.52451385320009,60.39634245837173,0
50.2864961189907,49.80453881323059,0
49.58667721632031,59.80895099453265,0
97.64563396007767,68.86157272420604,1
32.57720016809309,95.59854761387875,0
74.24869136721598,69.82457122657193,1
71.79646205863379,78.45356224515052,1
75.3956114656803,85.75993667331619,1
35.28611281526193,47.02051394723416,0
56.25381749711624,39.26147251058019,0
30.05882244669796,49.59297386723685,0
44.66826172480893,66.45008614558913,0
66.56089447242954,41.09209807936973,0
40.45755098375164,97.53518548909936,1
49.07256321908844,51.88321182073966,0
80.27957401466998,92.11606081344084,1
66.74671856944039,60.99139402740988,1
32.72283304060323,43.30717306430063,0
64.0393204150601,78.03168802018232,1
72.34649422579923,96.22759296761404,1
60.45788573918959,73.09499809758037,1
58.84095621726802,75.85844831279042,1
99.82785779692128,72.36925193383885,1
47.26426910848174,88.47586499559782,1
50.45815980285988,75.80985952982456,1
60.45555629271532,42.50840943572217,0
82.22666157785568,42.71987853716458,0
88.9138964166533,69.80378889835472,1
94.83450672430196,45.69430680250754,1
67.31925746917527,66.58935317747915,1
57.23870631569862,59.51428198012956,1
80.36675600171273,90.96014789746954,1
68.46852178591112,85.59430710452014,1
42.0754545384731,78.84478600148043,0
75.47770200533905,90.42453899753964,1
78.63542434898018,96.64742716885644,1
52.34800398794107,60.76950525602592,0
94.09433112516793,77.15910509073893,1
90.44855097096364,87.50879176484702,1
55.48216114069585,35.57070347228866,0
74.49269241843041,84.84513684930135,1
89.84580670720979,45.35828361091658,1
83.48916274498238,48.38028579728175,1
42.2617008099817,87.10385094025457,1
99.31500880510394,68.77540947206617,1
55.34001756003703,64.9319380069486,1
74.77589300092767,89.52981289513276,1
Am I doing something wrong?
I'm not seeing any problem. Here are predictions for x,y = 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100:
newdata = data.frame(x=seq(30, 100, 5) ,y=seq(30, 100, 5))
predict(logisticmodel, newdata, type="response")
1 2 3 4 5 6
2.423648e-06 1.861140e-05 1.429031e-04 1.096336e-03 8.357794e-03 6.078786e-02
7 8 9 10 11 12
3.320041e-01 7.923883e-01 9.670066e-01 9.955766e-01 9.994218e-01 9.999247e-01
13 14 15
9.999902e-01 9.999987e-01 9.999998e-01
You were predicting at x=10, y=10, which is far outside the range of your x, y values (30 - 100), but the predicted probability was essentially zero, which fits these results. When x and y are low (30 - 55), the prediction for z is zero. When x and y are high (75 - 100), the prediction is one (or nearly one). It may be easier to interpret the results if you round them to a few decimals:
round(predict(logisticmodel, newdata, type="response") , 5)
1 2 3 4 5 6 7 8 9 10
0.00000 0.00002 0.00014 0.00110 0.00836 0.06079 0.33200 0.79239 0.96701 0.99558
11 12 13 14 15
0.99942 0.99992 0.99999 1.00000 1.00000
Here is a simple way to predict a category and compare the results with your data:
predict <- ifelse(predict(logisticmodel, type="response")>.5, 1, 0)
xtabs(~predict+ex2data1R$z)
ex2data1R$z
predict 0 1
0 34 5
1 6 55
We used predict() on your original data and then created a rule that picks 1 if the probability is greater than .5 and 0 otherwise. Then we use xtabs() to compare the predictions to the data. When z is 0, we correctly predict zero 34 times and incorrectly predict one 6 times. When z is 1, we correctly predict one 55 times and incorrectly predict zero 5 times. We are correct 89% of the time: (34+55)/100*100. You could explore the accuracy of prediction using .45 or .55 as the cutoff instead of .5, as in the sketch below.
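A small sketch of that exploration (my addition; accuracy here is the share of predictions matching ex2data1R$z):
# try a few cutoffs and report accuracy for each
for (cut in c(0.45, 0.50, 0.55)) {
  pred <- ifelse(predict(logisticmodel, type = "response") > cut, 1, 0)
  cat("cutoff:", cut, " accuracy:", mean(pred == ex2data1R$z), "\n")
}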
In my opinion all is correct. As you can read in the R manual:
newdata - optionally, a data frame in which to look for variables with
which to predict. If omitted, the fitted linear predictors are used.
If you pass a data frame with one record, it will produce a prediction only for that record.
For more details, see the R manual entry for predict.glm, or just type in the R console:
?predict.glm
You can also use the following command to make the confusion matrix:
predict <- ifelse(predict(logisticmodel, type="response")>.5, 1, 0)
table(predict,ex2data1R$z)
