I'm trying to create a function that has set arguments, and in the body of the function it has a set formula but a random number of variables in the formula, determined only when the data is received. How do I write the function body so that it can adjust itself for an unknown number of variables?
Here's the backstory: I'm using the nls.lm function from the minpack.lm package to optimize a function for a set of parameters in the function. nls.lm requires a function that returns a vector of residuals. This function is pretty simple: observed minus predicted values. However, I also need to create a function to actually get the predicted values. This is where it gets tricky, since the predicted-value formula contains the parameters that need to be regressed and optimized.
This is my general formula I am trying to perform non-linear regression on:
Y = A + (B - A)/(1 + 10^(X - C - N))
Here A and B are global parameters shared across the entire data set and N is some constant. C stands for anywhere from 1 to 8 parameters that need to be determined individually, depending on the data set associated with each parameter.
Right now my working function contains the formula and 8 parameters to be estimated.
getPredictors <- function(params, xvalues) {
  (params$B) + ((params$A - params$B)/(1 + (10^(xvalues -
    params$`1`*Indicator[1,] - params$`2`*Indicator[2,] -
    params$`3`*Indicator[3,] - params$`4`*Indicator[4,] -
    params$`5`*Indicator[5,] - params$`6`*Indicator[6,] -
    params$`7`*Indicator[7,] - params$`8`*Indicator[8,] -
    constant))))
}
params is a list of parameters with an initial value. Indicator is a table where each row consists of 1's and 0's that act as an indicator variable to correctly pair each individual parameter with its associated data points. In its simplest form, with only one data point per parameter, it would look like a square identity matrix.
When I pair this function with nls.lm() I am successful in my regression:
residFun <- function(p, observed, xx) {
  observed - getPredictors(p, xx)
}
nls.out<- nls.lm(parameterslist, fn = residFun, observed = Yavg, xx = Xavg)
> summary(nls.out)
Parameters:
Estimate Std. Error t value Pr(>|t|)
1 -6.1279 0.1857 -32.997 <2e-16 ***
2 -6.5514 0.1863 -35.174 <2e-16 ***
3 -6.2077 0.1860 -33.380 <2e-16 ***
4 -6.4275 0.1863 -34.495 <2e-16 ***
5 -6.4805 0.1863 -34.783 <2e-16 ***
6 -6.1777 0.1859 -33.235 <2e-16 ***
7 -6.3098 0.1862 -33.882 <2e-16 ***
8 -7.7044 0.1865 -41.303 <2e-16 ***
A 549.7203 11.5413 47.631 <2e-16 ***
B 5.9515 25.4343 0.234 0.816
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 82.5 on 86 degrees of freedom
Number of iterations to termination: 7
Reason for termination: Relative error in the sum of squares is at most `ftol'.
Now, the problem comes in when the data I receive does not contain 8 parameters. I can't just substitute 0 for these values, since I am certain the degrees of freedom change with fewer parameters. Therefore I will need some way to create a getPredictors function on the fly, depending on the data I receive.
I've tried a couple things. I've tried combining all the parameters into a list of strings like so: (It's still 8 parameters, for comparison reasons, but it can be anywhere from 1-7 parameters.)
paramsandindics <- c()
for (i in 1:length(data$subjects)) {
  paramsandindics[i] <- paste0("params$", i, "*", "Indicator[", i, ",]")
}
combined<-paste0(paramsandindics, collapse="-")
> combined
[1] "params$1*Indicator[1,]-params$2*Indicator[2,]-
params$3*Indicator[3,]-params$4*Indicator[4,]-
params$5*Indicator[5,]-params$6*Indicator[6,]-
params$7*Indicator[7,]-params$8*Indicator[8,]"
Which appears to get me what I need. So I try dropping it into a new equation
getPredictors2 <- function(params, xvalues) {
  (params$B) + ((params$A - params$B)/
    (1 + (10^(xvalues - parse(text = combined) - constant))))
}
But I get the error "non-numeric argument to binary operator". Makes sense: parse() returns an unevaluated expression, which can't be subtracted from a number. So I switch to:
getPredictors2 <- function(params, xvalues) {
  (params$B) + ((params$A - params$B)/
    (1 + (10^(xvalues - eval(parse(text = combined)) - constant))))
}
Which immediately evaluates the whole thing, producing only 1 parameter, which breaks my regression.
Ultimately I'd like a function that is written to accept a variable or dynamic number of variables to be filled in in the body of the function. These variables need to be written as-is and not evaluated immediately, because the Levenberg-Marquardt algorithm employed by nls.lm (part of the minpack.lm package) requires an equation, in addition to initial parameter guesses and residuals, to minimize.
A simple example should suffice. I'm sorry if none of my stuff is reproducible; the data set is quite specific and too large to properly upload.
Sorry if this is long-winded. This is my first time trying any of this (coding, nonlinear regression, Stack Overflow), so I am a little lost. I'm not sure I am even asking the right question. Thank you for your time and consideration.
EDIT
I've included a smaller sample involving 2 parameters as an example. I hope it can help.
Subjects<-c("K1","K2")
#Xvalues
Xvals<-c(-11, -10, -9, -11, -10, -9)
#YValues, Observed
Yobs<-c(467,330,220,567,345,210)
#Indicator matrix for individual parameters
Indicators<-matrix(nrow = 2, ncol = 6)
Indicators[1,]<-c(1,1,1,0,0,0)
Indicators[2,]<-c(0,0,0,1,1,1)
#Setting up the parameters and functions needed for nls.lm
library(minpack.lm)
parameternames<-c("K1","K2","A","B")
#Starting values that nls.lm will iterate on
startingestimates<-c(-7,-7,0,500)
C<-.45
parameterlist<-as.list(setNames(startingestimates, parameternames))
getPredictors <- function(params, xx) {
  (params$A) + ((params$B - params$A)/
    (1 + (10^(xx - params$K1*Indicators[1,] - params$K2*Indicators[2,] - C))))
}

residFunc <- function(p, observed, xx) {
  observed - getPredictors(p, xx)
}

nls.output <- nls.lm(parameterlist, fn = residFunc, observed = Yobs, xx = Xvals)

#Latest attempt at creating a dynamic getPredictor function
combinationtext <- c()
combination <- c()
for (i in 1:length(Subjects)) {
  combinationtext[i] <- paste0("params$K", i, "*", "Indicators[", i, ",]")
}
combination <- paste0(combinationtext, collapse = "-")

getPredictorsDynamic <- function(params, xx) {
  (params$A) + ((params$B - params$A)/
    (1 + (10^(xx - (parse(text = combination)) - C))))
}

residFunc2 <- function(p, observed, xx) {
  observed - getPredictorsDynamic(p, xx)
}

nls.output2 <- nls.lm(parameterlist, fn = residFunc2, observed = Yobs, xx = Xvals)
#Does not work
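For what it's worth, one way to make the body adjust itself without building strings at all is to treat the per-subject parameters as a vector and let matrix arithmetic absorb however many rows Indicators has. This is only a sketch against the two-subject example above; it assumes the subject parameters are the ones named K1, K2, ... in parameterlist, and the names getPredictorsVectorized, residFunc3 and nls.output3 are illustrative:
library(minpack.lm)
getPredictorsVectorized <- function(params, xx) {
  # collect the per-subject parameters in row order (assumes names K1, K2, ...)
  kvals <- unlist(params[paste0("K", seq_len(nrow(Indicators)))])
  # colSums(kvals * Indicators) reproduces
  # params$K1*Indicators[1,] + params$K2*Indicators[2,] + ...
  offset <- colSums(kvals * Indicators)
  params$A + (params$B - params$A) / (1 + 10^(xx - offset - C))
}
residFunc3 <- function(p, observed, xx) observed - getPredictorsVectorized(p, xx)
nls.output3 <- nls.lm(parameterlist, fn = residFunc3, observed = Yobs, xx = Xvals)
The same body then works unchanged whether Indicators has 2 rows or 8, as long as parameterlist carries one K-named entry per row.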
I am using the BradleyTerry2 package in R to analyse my data. When using the BTm function to calculate ability scores, the first item in the dataset is removed as a reference, given a score of 0 and then other ability scores are calculated relative to this reference.
Is there a way to use a null hypothesis as a reference, rather than using the first item in the dataset?
This is the code I am using. The "ID" field is player id. This code calculates an ability score for each "Matchup," relative to the first matchup in the dataset.
BTv1 <- BTm(player1=winner,player2=loser,id="ID",formula=~Matchup+(1|ID),data=btmdata)
I am trying to test against the null hypothesis that matchup has no effect on match outcomes, but currently I don't know what ability score corresponds to the null hypothesis. I would like to use this null hypothesis as a reference, rather than using the first matchup in the dataset.
For those wanting to reproduce my results, you can find my files on my university onedrive.
You can test the significance of terms in the model for ability using the anova function, i.e.
anova(BTv1, test = "Chisq")
Using the example data and script that you shared, we get the following result:
Sequential Wald Tests
Model: binomial, link: logit
Response: NULL
Predictor: ~Characters + (1 | ID)
Terms added sequentially (first to last)
Statistic Df P(>|Chi|)
NULL
Characters 46.116 26 0.008853 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Edit: For the model BTv2 with log-ability modelled by ~ Matchup+(1|ID)
Before investigating individual matchups, we should check the significance of the term overall. Unfortunately the anova() method for BTm objects doesn't currently work for terms with inestimable parameters, as in this case. So we'll compute this directly:
cf <- coef(BTv2)[!is.na(coef(BTv2))]   # keep only the estimable (non-NA) coefficients
V <- vcov(BTv2)                        # covariance matrix of those estimates
ind <- grep("Matchup", names(cf))      # indices of the Matchup parameters
chisq <- c(t(cf[ind]) %*% chol2inv(chol(V[ind, ind])) %*% cf[ind])  # Wald chi-squared statistic
df <- length(ind)
c(chisq = chisq, df = df)
# chisq df
# 107.5667 167.0000
The Chi-squared statistic is less than the degrees of freedom, so the Matchup term is not significant - the model is over-fitting and it's not a good idea to investigate matchup-specific effects.
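As a quick supplementary check (not part of the original computation), the p-value corresponding to that Wald statistic can be obtained directly:
# p-value for a chi-squared statistic of 107.57 on 167 degrees of freedom
pchisq(107.5667, df = 167, lower.tail = FALSE)
# far above any conventional threshold, consistent with the conclusion above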
All the same, let's look at the model when fitted to the matches involving just 3 of the characters, for illustration.
summary(BTv2)$fixef
# Estimate Std. Error z value Pr(>|z|)
# MatchupCaptainFalcon;Falco -0.1327177 0.3161729 -0.4197632 0.6746585
# MatchupCaptainFalcon;Peach 0.1464518 0.3861823 0.3792297 0.7045173
# MatchupFalco;Peach -0.4103029 0.3365761 -1.2190496 0.2228254
In this case only 3 parameters are estimable, the rest are fixed to zero. Under model BTv2 for players i and j playing characters c and d respectively, we have
logit(p(i playing c beats j playing d))
= log_ability_i - log_ability_j + U_i - U_j
= Matchup_{c;d} - Matchup_{d;c} + U_i - U_j
where U_i and U_j are random player effects. So for players of the same baseline ability we have for example,
logit(p(CaptainFalcon beats Falco)) = -0.1327177 - 0 = -0.1327177
logit(p(Falco beats CaptainFalcon)) = 0 - (-0.1327177) = 0.1327177
So this tells you whether one character is favoured over another in a particular pairwise matchup.
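If you prefer probabilities to log-odds, the estimates can be pushed through the inverse logit; a small added illustration, not from the original answer:
plogis(-0.1327177)  # P(CaptainFalcon beats Falco)  ~ 0.467
plogis( 0.1327177)  # P(Falco beats CaptainFalcon) ~ 0.533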
Let's return to the BTv1 model based on all the data. In this model, for players of the same baseline ability we have
logit(p(i playing c beats j playing d)) = log_ability_i - log_ability_j
= Characters_c - Characters_d
The effect for "CharactersBowser" is set to zero, the rest are estimable. So e.g.
summary(BTv1)$fixef[c("CharactersFalco", "CharactersPeach"),]
# Estimate Std. Error z value Pr(>|z|)
# CharactersFalco 2.038925 0.9576332 2.129130 0.03324354
# CharactersPeach 2.119304 0.9508804 2.228781 0.02582845
means that
logit(p(Bowser beats Peach)) = 0 - 2.119304 = -2.119304
logit(p(Falcon beats Peach)) = 2.038925 - 2.119304 = -0.080379
So we can still compare characters in a particular matchup. We can use quasi-variances to compare the character effects
# add in character with fixed effect set to zero (Bowser)
V <- cbind(XCharactersBowser = 0, rbind(XCharactersBowser = 0,
vcov(BTv1)))
cf <- c(CharactersBowser = 0, coef(BTv1))
# compute quasi-variances (requires the qvcalc package)
library(qvcalc)
qv <- qvcalc(V, "XCharacters", estimates = cf,
             labels = sub("Characters", "", names(cf)))
# plot and compare
# (need to set ylim because some estimates are essentially infinite)
par(mar = c(7, 4, 3, 1))
plot(qv, ylim = c(-5, 5))
See e.g. https://doi.org/10.1093/biomet/91.1.65 for more on quasi-variances.
I am asked to compute a test statistic using the t.test() function, but I need to reduce the type I error. My prof showed us how to change a confidence level for this function, but not the acceptable type I error for null hypothesis testing. The goal is for the argument to automatically compute a p-value based on a .01 error rate rather than the normal .05.
The R code below involves a data set that I have downloaded.
t.test(mid$log_radius_area, mu=8.456)
I feel like I've answered this somewhere, but can't seem to find it on SO or CrossValidated.
Similarly to this question, the answer is that t.test() doesn't specify any threshold for rejecting/failing to reject the null hypothesis; it reports a p-value, and you get to decide whether to reject or not. (The conf.level argument is for adjusting which confidence interval the output reports.)
From ?t.test:
t.test(1:10, y = c(7:20))
Welch Two Sample t-test
data: 1:10 and c(7:20)
t = -5.4349, df = 21.982, p-value = 1.855e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.052802 -4.947198
sample estimates:
mean of x mean of y
5.5 13.5
Here the p-value is reported as 1.855e-05, so the null hypothesis would be rejected for any (pre-specified) alpha level >1.855e-05. Note that the output doesn't say anywhere "the null hypothesis is rejected at alpha=0.05" or anything like that. You could write your own function to do that, using the $p.value element that is saved as part of the test results:
report_test <- function(tt, alpha=0.05) {
  cat("the null hypothesis is ")
  if (tt$p.value > alpha) {
    cat("**NOT** ")
  }
  cat("rejected at alpha=", alpha, "\n")
}
tt <- t.test(1:10, y = c(7:20))
report_test(tt)
## the null hypothesis is rejected at alpha= 0.05
Most R package/function writers don't bother to do this, because they figure that it should be simple enough for users to do for themselves.
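Applied to the original call, that amounts to something like the sketch below; mid$log_radius_area is the questioner's own data, so this is illustrative only:
# request a 99% confidence interval; the p-value itself is unaffected
tt <- t.test(mid$log_radius_area, mu = 8.456, conf.level = 0.99)
tt$p.value < 0.01   # TRUE means reject the null at the 1% (alpha = 0.01) level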
I'm doing a beta regression in R, which requires values between 0 and 1, endpoints excluded, i.e. (0,1) instead of [0,1].
I have some 0 and 1 values in my dataset, so I'd like to convert them to the smallest possible neighbor, such as 0.0000...0001 and 0.9999...9999. I've used .Machine$double.xmin (which gives me 2.225074e-308), but betareg() still gives an error:
invalid dependent variable, all observations must be in (0, 1)
If I use 0.000001 and 0.999999, I got a different set of errors:
1: In betareg.fit(X, Y, Z, weights, offset, link, link.phi, type, control) :
failed to invert the information matrix: iteration stopped prematurely
2: In sqrt(wpp) :
Error in chol.default(K) :
the leading minor of order 4 is not positive definite
Only if I use 0.0001 and 0.9999 can I run without errors. Is there any way I can improve on these minimum values with betareg? Or should I just be happy with that?
Try it with eps (displacement from 0 and 1) first equal to 1e-4 (as you have here) and then with 1e-3. If the results of the models don't differ in any way you care about, that's great. If they do, you need to be very careful, because it suggests your answers will be very sensitive to assumptions.
In the example below the dispersion parameter phi changes a lot, but the intercept and slope parameters don't change very much.
If you do find that the parameters change by a worrying amount for your particular data, then you need to think harder about the process by which zeros and ones arise, and model that process appropriately, e.g.
- a censored-data model: zeros/ones arise through a minimum/maximum detection threshold; model the zero/one values as actually lying somewhere in the tails, or
- a hurdle/zero-one inflation model: zeros and ones arise through a separate process from the rest of the data; use a binomial or multinomial model to characterize zero vs. (0,1) vs. one, then use a Beta regression on the (0,1) component. A sketch of this route appears after the worked example below.
Questions about these steps are probably more appropriate for CrossValidated than for SO.
sample data
set.seed(101)
library(betareg)
dd <- data.frame(x=rnorm(500))
rbeta2 <- function(n, prob=0.5, d=1) {
rbeta(n, shape1=prob*d, shape2=(1-prob)*d)
}
dd$y <- rbeta2(500,plogis(1+5*dd$x),d=1)
dd$y[dd$y<1e-8] <- 0
trial fitting function
ss <- function(eps) {
  dd <- transform(dd,
                  y = pmin(1 - eps, pmax(eps, y)))
  m <- try(betareg(y ~ x, data = dd))
  if (inherits(m, "try-error")) return(rep(NA, 3))
  return(coef(m))
}
ss(0) ## fails
ss(1e-8) ## fails
ss(1e-4)
## (Intercept) x (phi)
## 0.3140810 1.5724049 0.7604656
ss(1e-3) ## also fails
ss(1e-2)
## (Intercept) x (phi)
## 0.2847142 1.4383922 1.3970437
ss(5e-3)
## (Intercept) x (phi)
## 0.2870852 1.4546247 1.2029984
try it for a range of values
evec <- seq(-4,-1,length=51)
res <- t(sapply(evec, function(e) ss(10^e)) )
library(ggplot2)
ggplot(data.frame(e=10^evec,reshape2::melt(res)),
aes(e,value,colour=Var2))+
geom_line()+scale_x_log10()
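Returning to the hurdle/zero-one-inflation suggestion above: a hedged sketch only, assuming the gamlss package and its zero-inflated beta family BEINF0 (the sample data dd contain exact zeros but no exact ones; BEINF covers inflation at both endpoints). This is not part of the original answer:
library(gamlss)   # inflated-beta families come from gamlss / gamlss.dist
# model the exact zeros directly instead of nudging them into (0,1)
m_infl <- gamlss(y ~ x, family = BEINF0, data = dd)
summary(m_infl)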
I want to run MLR on my data using the lm function in R. However, I am using the data-splitting cross-validation method to assess the reliability of the model. I intend to use the "sample" function to randomly split the data into calibration and validation datasets in an 80:20 ratio. This I want to repeat, say, 100 times. Without setting a seed, I believe the models from the different samplings will differ. I came across the function in a previous post here and it solves the first part;
lst <- lapply(1:100, function(repetition) {
mod <- lm(...)
# Replace this with the code you need to train your model
return(mod)
})
save(lst, file="myfile.RData")
The concern now is how do I validate each of these 100 models and obtain reliability test statistics like RMSE, ME, R-squared for each of the models, and hopefully obtain the confidence interval.
If I can get an output in the form of dataframe containing the predicted values for all the 100 models then I should proceed from there.
Any help please?
Thanks
To quickly recap your question: it seems that you want to fit an MLR model to a large training set and then use this model to make predictions on the remaining validation set. You want to repeat this process 100 times and afterwards you want to be able to analyze the characteristics and predictions of the individual models.
To accomplish this you could just store temporary model information in a data structure during the model generation and prediction process. You can then re-obtain and process all the information afterwards. You did not provide your own dataset in the description, so I will use one of R's built-in datasets in order to demonstrate how this might work:
> library(car)
> Prestige <- Prestige[,c("prestige","education","income","women")]
> Prestige[,c("income")] <- log2(Prestige[,c("income")])
> head(Prestige,n=5)
prestige education income women
gov.administrators 68.8 13.11 -0.09620212 11.16
general.managers 69.1 12.26 -0.04955335 4.02
accountants 63.4 12.77 -0.11643822 15.70
purchasing.officers 56.8 11.42 -0.11972061 9.11
chemists 73.5 14.62 -0.12368966 11.68
We start by initializing some variables. Let's say you want to create 100 models and use 80% of your data for training purposes:
nrIterations=100
totalSize <- nrow(Prestige)
trainingSize <- floor(0.80*totalSize)
We also want to create the data structure that will be used to hold the intermediate model information. R is quite a generic high-level language in this regard, so we will just create a list of lists. This means that every list entry can itself hold another list of information. This gives us the flexibility to add whatever we need:
trainTestTuple <- vector(mode = "list", length = nrIterations)
We are now ready to create our models and predictions. During every loop iteration a different random training subset is created while using the remaining data for testing purposes. Next, we fit our model to the training data and we then use this obtained model to make predictions on the test data. Note that we explicitly use the independent variables in order to predict the dependent variable:
for (i in 1:nrIterations)
{
  trainIndices <- sample(seq_len(totalSize), size = trainingSize)
  trainSet <- Prestige[trainIndices,]
  testSet <- Prestige[-trainIndices,]

  trainingFit <- lm(prestige ~ education + income + women, data = trainSet)

  # Perform predictions on the test data
  testingForecast <- predict(trainingFit,
                             newdata = data.frame(education = testSet$education,
                                                  income = testSet$income,
                                                  women = testSet$women),
                             interval = "confidence", level = 0.95)

  # Do whatever else you want to do (compare with actual values, calculate other stuff/metrics ...)
  # ...

  # add your training and test data to a tuple and add it to a list
  tuple <- list(trainingFit, testingForecast) # Add whatever else you need ..
  trainTestTuple[[i]] <- tuple # Add this list to the "list of lists"
}
Now, the relevant part: At the end of the iteration we put both the fitted model and the out of sample prediction results in a list. This list contains all the intermediate information that we want to save for the current iteration. We finish by putting this list in our list of lists.
Now that we are done with the modeling, we still have access to all the information we need and we can process and analyze it any way we want. We will take a look at the modeling and prediction results of model 50. First, we extract both the model and the prediction results from the list of lists:
> tuple_50 <- trainTestTuple[[50]]
> trainingFit_50 <- tuple_50[[1]]
> testingForecast_50 <- tuple_50[[2]]
We take a look at the model summary:
> summary(trainingFit_50)
Call:
lm(formula = prestige ~ education + log2(income) + women, data = trainSet)
Residuals:
Min 1Q Median 3Q Max
-15.9552 -4.6461 0.5016 4.3196 18.4882
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -287.96143 70.39697 -4.091 0.000105 ***
education 4.23426 0.43418 9.752 4.3e-15 ***
log2(income) 155.16246 38.94176 3.984 0.000152 ***
women 0.02506 0.03942 0.636 0.526875
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.308 on 77 degrees of freedom
Multiple R-squared: 0.8072, Adjusted R-squared: 0.7997
F-statistic: 107.5 on 3 and 77 DF, p-value: < 2.2e-16
We then explicitly obtain the model R-squared and RMSE:
> summary(trainingFit_50)$r.squared
[1] 0.8072008
> summary(trainingFit_50)$sigma
[1] 7.308057
We take a look at the out of sample forecasts:
> testingForecast_50
fit lwr upr
1 67.38159 63.848326 70.91485
2 74.10724 70.075823 78.13865
3 64.15322 61.284077 67.02236
4 79.61595 75.513602 83.71830
5 63.88237 60.078095 67.68664
6 71.76869 68.388457 75.14893
7 60.99983 57.052282 64.94738
8 82.84507 78.145035 87.54510
9 72.25896 68.874070 75.64384
10 49.19994 45.033546 53.36633
11 48.00888 46.134464 49.88329
12 20.14195 8.196699 32.08720
13 33.76505 27.439318 40.09079
14 24.31853 18.058742 30.57832
15 40.79585 38.329835 43.26187
16 40.35038 37.970858 42.72990
17 38.38186 35.818814 40.94491
18 40.09030 37.739428 42.44117
19 35.81084 33.139461 38.48223
20 43.43717 40.799715 46.07463
21 29.73700 26.317428 33.15657
And finally, we obtain some more detailed results about the 2nd forecasted value and the corresponding confidence intervals:
> testingPredicted_2ndprediction <- testingForecast_50[2,1]
> testingLowerConfidence_2ndprediction <- testingForecast_50[2,2]
> testingUpperConfidence_2ndprediction <- testingForecast_50[2,3]
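To get the reliability statistics the question asks about for all 100 fits at once (rather than just model 50), you can sweep over the stored list. A sketch using the trainTestTuple structure built above; the quantile-based interval is just one rough way to summarize spread across resamples:
# in-sample R-squared and residual standard error for every stored fit
rsq_all <- sapply(trainTestTuple, function(tp) summary(tp[[1]])$r.squared)
rse_all <- sapply(trainTestTuple, function(tp) summary(tp[[1]])$sigma)
mean(rsq_all)
quantile(rsq_all, c(0.025, 0.975))   # rough 95% interval over the 100 resamples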
EDIT
After rereading, it occurred to me that you are obviously not splitting up the exact same dataset each time. You are using completely different partitions of data during each iteration and they should be split up in an 80/20 fashion. However, the same solution can still be applied with minor modifications.
Also: for cross-validation purposes you should probably take a look at cv.lm() from the DAAG package.
Description from the R help:
This function gives internal and cross-validation measures of predictive accuracy for multiple linear regression. (For binary logistic regression, use the CVbinary function.) The data are randomly assigned to a number of ‘folds’. Each fold is removed, in turn, while the remaining data is used to re-fit the regression model and to predict at the deleted observations.
EDIT: Reply to comment.
You can just take the means of the relevant performance metrics that you saved. For example, you can use an sapply on the trainTestTuple in order to extract the relevant elements from each sublist. sapply will return these elements as a vector from which you can calculate the mean. This should work:
mean_ME <- mean(sapply(trainTestTuple,"[[",2))
mean_MAD <- mean(sapply(trainTestTuple,"[[",3))
mean_MSE <- mean(sapply(trainTestTuple,"[[",4))
mean_RMSE <- mean(sapply(trainTestTuple,"[[",5))
mean_adjRsq <- mean(sapply(trainTestTuple,"[[",6))
Another small edit: The calculation of your MAD looks rather strange. It might be a good thing to double check if this is exactly what you want.
I am trying to use the cross-validation function cv.glm from the boot library in R to determine the number of misclassifications when a glm logistic regression is applied.
The function has the following signature:
cv.glm(data, glmfit, cost, K)
with the first two denoting the data and model and K specifies the k-fold.
My problem is the cost parameter which is defined as:
cost: A function of two vector arguments specifying the cost function
for the crossvalidation. The first argument to cost should correspond
to the observed responses and the second argument should correspond to
the predicted or fitted responses from the generalized linear model.
cost must return a non-negative scalar value. The default is the
average squared error function.
I guess for classification it would make sense to have a function which returns the rate of misclassification, something like:
nrow(subset(data, (predict >= 0.5 & data$response == "no") |
(predict < 0.5 & data$response == "yes")))
which is of course not even syntactically correct.
Unfortunately, my limited R knowledge let me waste hours and I was wondering if someone could point me in the correct direction.
It sounds like you might do well to just use the cost function (i.e. the one named cost) defined further down in the "Examples" section of ?cv.glm. Quoting from that section:
# [...] Since the response is a binary variable an
# appropriate cost function is
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
This does essentially what you were trying to do with your example. Replacing your "no" and "yes" with 0 and 1, let's say you have two vectors, predict and response. Then cost() is nicely designed to take them and return the mean misclassification rate:
## Simulate some reasonable data
set.seed(1)
predict <- seq(0.1, 0.9, by=0.1)
response <- rbinom(n=length(predict), prob=predict, size=1)
response
# [1] 0 0 0 1 0 0 0 1 1
## Demonstrate the function 'cost()' in action
cost(response, predict)
# [1] 0.3333333 ## Which is right, as 3/9 elements (4, 6, & 7) are misclassified
## (assuming you use 0.5 as the cutoff for your predictions).
I'm guessing the trickiest bit of this will be just getting your mind fully wrapped around the idea of passing a function in as an argument. (At least that was for me, for the longest time, the hardest part of using the boot package, which requires that move in a fair number of places.)
Added on 2016-03-22:
The function cost() given above is, in my opinion, unnecessarily obfuscated; the following alternative does exactly the same thing but in a more expressive way:
cost <- function(r, pi = 0) {
  mean((pi < 0.5) & r == 1 | (pi > 0.5) & r == 0)
}
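The two formulations are interchangeable for 0/1 responses. As a quick check against the vectors simulated earlier (the names cost_orig and cost_expr are just labels for the two versions above):
cost_orig <- function(r, pi = 0) mean(abs(r - pi) > 0.5)
cost_expr <- function(r, pi = 0) mean((pi < 0.5) & r == 1 | (pi > 0.5) & r == 0)
cost_orig(response, predict) == cost_expr(response, predict)   # TRUE: 0.3333333 either way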
I will try to explain the cost function in simple words. Let's take the
cv.glm(data, glmfit, cost, K) arguments step by step:
data
The data consist of many observations; think of them as a series of numbers.
glmfit
It is the generalized linear model fitted to the series above. The catch is that cv.glm splits the data into K parts: each part is held out in turn as the test set, the model is re-fitted on the remaining parts (the training set), and that re-fitted model produces predictions for the held-out part. The output is a series with the same number of elements as the held-out part.
cost
The cost function. It takes two arguments: first the observed responses in the held-out part (the test set), and second the corresponding predictions from the re-fitted model. The default is the average squared error function.
The default sums the squared differences between observed and predicted data points: conceptually, a loop runs over the test set (observed and predicted should have the same number of elements), calculates each difference, squares it, and adds it to an output variable.
K
The number of groups into which the data should be split. The default gives leave-one-out cross-validation.
Judging from your cost function description, your input (x) would be a set of numbers between 0 and 1 (0-0.5 = no and 0.5-1 = yes) and your output (y) is 'yes' or 'no'. So the error (e) between prediction (x) and observation (y) would be:
cost <- function(x, y) {
  e <- 0
  for (i in 1:length(x)) {
    if (x[i] > 0.5) {
      if (y[i] == 'yes') err <- 0 else err <- x[i] - 0.5
    } else {
      if (y[i] == 'no')  err <- 0 else err <- 0.5 - x[i]
    }
    e <- e + err * err   # accumulate the squared error
  }
  e <- e / length(x)     # mean square error
  return(e)
}
Sources : http://www.cs.cmu.edu/~schneide/tut5/node42.html
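A quick usage check of the function above; illustrative only, since it reuses the predict/response vectors simulated in an earlier answer and recodes the 0/1 responses to the 'no'/'yes' labels this function expects (note its argument order: predictions first, labels second, the reverse of what cv.glm passes):
cost(predict, ifelse(response == 1, "yes", "no"))
# ~ 0.0067: the mean squared distance past the 0.5 cutoff for the misclassified points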
The cost function can optionally be defined if there is one you prefer over the default average squared error. If you wanted to do so, then you would write a function that returns the cost you want to minimize using two inputs: (1) the vector of known labels that you are predicting, and (2) the vector of predicted probabilities from your model for those corresponding labels. So for the cost function that (I think) you described in your post, you are looking for a function that will return the proportion of accurate classifications, which would look something like this:
cost <- function(labels, pred){
  mean(labels == ifelse(pred > 0.5, 1, 0))
}
With that function defined you can then pass it into your cv.glm() call. Although I wouldn't recommend using your own cost function over the default one unless you have a reason to. Your example isn't reproducible, so here is another example:
> library(boot)
>
> cost <- function(labels,pred){
+ mean(labels==ifelse(pred > 0.5, 1, 0))
+ }
>
> #make model
> nodal.glm <- glm(r ~ stage+xray+acid, binomial, data = nodal)
> #run cv with your cost function
> (nodal.glm.err <- cv.glm(nodal, nodal.glm, cost, nrow(nodal)))
$call
cv.glm(data = nodal, glmfit = nodal.glm, cost = cost, K = nrow(nodal))
$K
[1] 53
$delta
[1] 0.8113208 0.8113208
$seed
[1] 403 213 -2068233650 1849869992 -1836368725 -1035813431 1075589592 -782251898
...
The cost function defined in the example for cv.glm clearly assumes that the predictions are probabilities, which would require the type="response" argument in the predict function. The documentation from library(boot) should state this explicitly. I would otherwise be forced to assume that the default type="link" is used inside the cv.glm function, in which case the cost function would not work as intended.
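If in doubt, it is easy to inspect the internals yourself rather than rely on the documentation; a small suggestion, not a claim about what the source actually contains:
library(boot)
# print the function body and check how the held-out predictions are generated,
# i.e. whether the internal predict() call uses type = "response"
print(cv.glm)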