Logistic Regression using for loops in R [duplicate]

This question already has answers here:
How to debug "contrasts can be applied only to factors with 2 or more levels" error?
(3 answers)
Closed 5 years ago.
I am trying to run a binary logistic regression using for loops in R.
My code is as follows:
mydata5<-read.table(file.choose(),header=T,sep=",")
colnames(mydata5)
Class <- 1:16
Countries <- 1:5
Months <- 1:7
DayDiff <- 1:28
mydata5$CT <- factor(mydata5$CT)
mydata5$CC <- factor(mydata5$CC)
mydata5$C <- factor(mydata5$C)
mydata5$DD <- factor(mydata5$DD)
mydata5$UM <- factor(mydata5$UM)
for(i in seq(along = Class)) {
  mydata5$C = mydata5$C[i]
  for(i2 in seq(along = Countries)) {
    mydata5$CC = mydata5$CC[i2]
    for(i3 in seq(along = Months)) {
      mydata5$UM = mydata5$UM[i3]
      for(i4 in seq(along = DayDiff)) {
        mydata5$DD = mydata5$DD[i4]
        lrfit5 <- glm(CT ~ C + CC + UM + DD, family = binomial(link = "logit"), data = mydata5)
        summary(lrfit5)
        library(lattice)
        in_frame <- data.frame(C = "mydata5$C[i]", CC = "mydata5$CC[i2]", UM = "mydata5$UM[i3]", DD = "mydata5$DD[i4]")
        predict(lrfit5, in_frame, type = "response", se.fit = FALSE)
      }
    }
  }
}
However, I'm getting the following error:
Error in contrasts<-(*tmp*, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
Why is this error occurring? Also, the dataset "mydata5" has around 50,000 rows. Please help.
Thanks in advance.

You have tried to run a regression with a factor that has only one level. Since you haven't given us your data we can't reproduce your analysis, but I can reproduce your error message:
> d = data.frame(x=runif(10),y=factor("M",levels=c("M","F")))
> d
            x y
1  0.07104688 M
2  0.11948466 M
3  0.20807068 M
4  0.24049508 M
5  0.44251492 M
6  0.69775646 M
7  0.44479983 M
8  0.64814971 M
9  0.75151207 M
10 0.38810621 M
> glm(x~y,data=d)
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
By setting one of the factor values to "F", I no longer get the error message:
> d$y[5]="F"
> glm(x~y,data=d)
Call: glm(formula = x ~ y, data = d)
Coefficients:
(Intercept)           yF
    0.39660      0.04591
Degrees of Freedom: 9 Total (i.e. Null); 8 Residual
Null Deviance: 0.5269
Residual Deviance: 0.525 AIC: 4.91
So somewhere in your loops (which we cannot run because we don't have your data) you are doing the same thing. In fact, an assignment like mydata5$C = mydata5$C[i] recycles a single value over the whole column, so glm sees a factor with only one used level.
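If you do want to loop over model fits, a rough guard is to check the number of used levels before calling glm. This is a minimal sketch, assuming the column names from the question:
# skip the fit when any predictor collapses to a single used level
preds <- c("C", "CC", "UM", "DD")
ok <- all(vapply(mydata5[preds],
                 function(f) nlevels(droplevels(f)) >= 2,
                 logical(1)))
if (ok) {
  lrfit5 <- glm(CT ~ C + CC + UM + DD,
                family = binomial(link = "logit"), data = mydata5)
}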

Related

R Nonlinear Least Squares (nls) function: Using indexed vectors as inputs?

I am trying to run the nls function in R using indexed vectors as inputs, however I am getting an error:
> a=c(1,2,3,4,5,6,7,8,9,10)
> b=c(6,7,9,11,14,18,23,30,38,50) #make some example data
>
> nls(b[1:6]~s+k*2^(a[1:6]/d),start=list(s=2,k=3,d=2.5)) #try running nls on first 6 elements of a and b
Error in parse(text = x, keep.source = FALSE) :
<text>:2:0: unexpected end of input
1: ~
^
I can run it on the full vectors:
> nls(b~s+k*2^(a/d),start=list(s=2,k=3,d=2.5))
Nonlinear regression model
model: b ~ s + k * 2^(a/d)
data: parent.frame()
s k d
1.710 3.171 2.548
residual sum-of-squares: 0.3766
Number of iterations to convergence: 3
Achieved convergence tolerance: 1.2e-07
I am fairly certain that the indexed vectors have the same variable type as the full vectors:
> a
[1] 1 2 3 4 5 6 7 8 9 10
> typeof(a)
[1] "double"
> class(a)
[1] "numeric"
> a[1:6]
[1] 1 2 3 4 5 6
> typeof(a[1:6])
[1] "double"
> class(a[1:6])
[1] "numeric"
I can run nls if I save the indexed vectors in new variables:
> a_part=a[1:6]
> b_part=b[1:6]
> nls(b_part~s+k*2^(a_part/d),start=list(s=2,k=3,d=2.5))
Nonlinear regression model
model: b_part ~ s + k * 2^(a_part/d)
data: parent.frame()
s k d
2.297 2.720 2.373
residual sum-of-squares: 0.06569
Number of iterations to convergence: 3
Achieved convergence tolerance: 1.274e-07
Furthermore, lm accepts both full and indexed vectors:
> lm(b~a)
Call:
lm(formula = b ~ a)
Coefficients:
(Intercept) a
-4.667 4.594
> lm(b[1:6]~a[1:6])
Call:
lm(formula = b[1:6] ~ a[1:6])
Coefficients:
(Intercept) a[1:6]
2.533 2.371
Is there a way to run nls on indexed vectors without saving them in new variables?
Use subset. (It would also be possible to use the weights argument, giving a weight of 1 to each of the first 6 observations and 0 to the rest.)
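For example, keeping the original three-parameter formula and simply restricting the fit to the first six observations:
# same a, b, and starting values as in the question
nls(b ~ s + k * 2^(a/d), subset = 1:6,
    start = list(s = 2, k = 3, d = 2.5))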
You might also want to use the plinear algorithm to avoid having to give starting values for the two parameters that enter linearly. In that case, provide a matrix on the right-hand side with column names s and k, such that its first column multiplies s and its second column multiplies k.
nls(b ~ cbind(s = 1, k = 2^(a/d)), subset = 1:6, start = list(d = 2.5),
    algorithm = "plinear")
giving:
Nonlinear regression model
model: b ~ cbind(s = 1, k = 2^(a/d))
data: parent.frame()
d .lin.s .lin.k
2.373 2.297 2.720
residual sum-of-squares: 0.06569
Number of iterations to convergence: 3
Achieved convergence tolerance: 7.186e-08

How do I use vector values as variables in R

I have a data frame called repay, and I have created a vector called variables containing the names of the variables I am interested in.
variables<-names(repay)[22:36]
I want to write a for loop that does some univariate analysis on each of the variables in variables. For example:
for (i in 1:length(variables)) {
  model <- glm(Successful ~ variables[i],
               data = repay,
               family = binomial(link = 'logit'))
}
However it doesn't recognize variables[i] as a variable, giving the following error message:
Error in model.frame.default(formula = Successful ~ variables[i], data
= repay, : variable lengths differ (found for 'variables[i]')
Try using the formula function in R. It will allow correct interpretation of the model, as below:
for (i in 1:length(variables)) {
  myglm <- glm(formula(paste("Successful", "~", variables[i])),
               data = repay, family = binomial(link = 'logit'))
}
Alternatively you can use assign, yielding as many models as there are variables.
Let us consider:
library(data.table)
repay <- data.table(Successful = runif(10), a = sample(10), b = sample(10), c = runif(10))
variables <- names(repay)[2:4]
yielding:
> repay
Successful a b c
1: 0.8457686 7 9 0.2930537
2: 0.4050198 6 6 0.5948573
3: 0.1994583 2 8 0.4198423
4: 0.1471735 1 5 0.5906494
5: 0.7765083 8 10 0.7933327
6: 0.6503692 9 4 0.4262896
7: 0.2449512 4 1 0.7311928
8: 0.6754966 3 3 0.4723299
9: 0.7792951 10 7 0.9101495
10: 0.6281890 5 2 0.9215107
Then you can perform the loop:
for (i in 1:length(variables)) {
  assign(paste0("model", i),
         eval(parse(text = paste("glm(Successful~", variables[i],
                                 ",data=repay,family=binomial(link='logit'))"))))
}
resulting in 3 objects: model1, model2 and model3.
> model1
Call: glm(formula = Successful ~ a, family = binomial(link = "logit"),
data = repay)
Coefficients:
(Intercept) a
-0.36770 0.05501
Degrees of Freedom: 9 Total (i.e. Null); 8 Residual
Null Deviance: 5.752
Residual Deviance: 5.69 AIC: 17.66
The same goes for model2 and model3.
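As an aside, collecting the fits in a named list is usually easier to work with than assign-created objects. A minimal sketch with reformulate, assuming the same repay and variables as above:
models <- lapply(variables, function(v) {
  glm(reformulate(v, response = "Successful"),
      data = repay, family = binomial(link = "logit"))
})
names(models) <- variables
models[["a"]]  # the fit for predictor "a"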
You could create a language object from a string:
var = "cyl"
lm(as.formula(sprintf("mpg ~ %s", var)), data=mtcars)
# alternative (see also substitute)
lm(bquote(mpg~.(as.name(var))), data=mtcars)
A small workaround that might help: extract the predictor into its own named column so the formula can refer to it.
for (i in 22:36) {
  ivar <- repay[[i]]  # choose the i-th variable for running the model
  # create a new two-variable data frame, naming the column 'ivar'
  # so the formula below can find it
  repay2 <- data.frame(Successful = repay$Successful, ivar = ivar)
  # run the model on the new data frame repay2
  model <- glm(Successful ~ ivar,
               data = repay2,
               family = binomial(link = 'logit'))
}

Support Vector Machine - R code - Predict Residual error of Time Series

I'm trying to predict the residual error of a time series using R. My dataset has the following two columns (a sample of the first 10 rows):
Observation Residuals
1 -0,087527458
2 -0,06907199
3 -0,066604145
4 -0,07796713
5 -0,081723932
6 -0,094046868
7 -0,101535816
8 -0,101884203
9 -0,11131246
10 -0,092548176
For the prediction I'm building a Support Vector Machine using R:
# Load the data from the csv file
dataDirectory <- "C://"
data <- read.csv(paste(dataDirectory, "Data_SVM_Test.csv", sep=""),sep=";", header = TRUE)
head(data)
# Plot the data
plot(data, pch=16)
# Create a linear regression model
model <- lm(Residuals ~ Observation, data)
# Add the fitted line
abline(model)
predictedY <- predict(model, data)
# display the predictions
points(data$Observation, predictedY, col = "blue", pch=4)
# This function will compute the RMSE
rmse <- function(error)
{
sqrt(mean(error^2))
}
error <- model$residuals # same as data$Y - predictedY
predictionRMSE <- rmse(error) # 5.70377
plot(data, pch=16)
plot.new()
# svr model ==============================================
if(require(e1071)){
model <- svm(Residuals ~ Observation , data)
predictedY <- predict(model, data)
points(data$Observation, predictedY, col = "red", pch=4)
# /!\ this time svrModel$residuals is not the same as data$Y - predictedY
# so we compute the error like this
error <- data$Residuals - predictedY
svrPredictionRMSE <- rmse(error) # 3.157061
}
When I execute the above code I get the following warning message and no output:
Warning message:
In Ops.factor(data$Residuals, predictedY) : ‘-’ not meaningful for factors
Does anyone have an idea how I can solve this?
Many thanks!
When using svm for classification, the output is of type factor. This is from the documentation:
Output of svm: A vector of predicted values (for classification: a vector of labels, for density estimation: a logical vector).
This can be seen from the following example:
library(e1071)
model <- svm(Species ~ ., data = iris)
> str( predict(model, iris))
Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
It is the same for your data. The Levels line shows that predictedY is a factor:
> predictedY <- predict(model, df)
> predictedY
1 2 3 4 5 6 7 8 9 10
-0,087527458 -0,06907199 -0,066604145 -0,07796713 -0,081723932 -0,094046868 -0,101535816 -0,101884203 -0,11131246 -0,092548176
Levels: -0,066604145 -0,06907199 -0,07796713 -0,081723932 -0,087527458 -0,092548176 -0,094046868 -0,101535816 -0,101884203 -0,11131246
In your line of code predictedY <- predict(model, data), predictedY is of type factor. If you try to subtract a factor from a number (or vice versa) you get your error:
> 1:10 - as.factor(1:10)
[1] NA NA NA NA NA NA NA NA NA NA
Warning message:
In Ops.factor(1:10, as.factor(1:10)) : ‘-’ not meaningful for factors
If you want to make it work, you need to convert the factor into numbers, e.g. 1:10 - as.numeric(as.factor(1:10)). Note that as.numeric on a factor returns the level codes, not the labels, so for number-like labels such as yours you should go through character first: as.numeric(as.character(f)).
I don't know what your full data looks like, but the comma decimal marks in your sample explain why Residuals was read in as a factor; and judging from the title of the question, svm is probably not a good idea for time series anyway.
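A sketch of fixing this at the source, assuming the CSV really uses ';' separators and ',' decimal marks as the sample suggests:
# read with dec = "," so Residuals arrives as numeric, not as a factor
data <- read.csv(paste(dataDirectory, "Data_SVM_Test.csv", sep = ""),
                 sep = ";", dec = ",", header = TRUE)
str(data)  # Residuals should now be numeric

# alternatively, convert an already-read factor via character
# (as.numeric alone would return the level codes):
# data$Residuals <- as.numeric(as.character(data$Residuals))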

Order of predictions from merTools predictInterval()

I'm encountering an issue with predictInterval() from merTools. The predictions seem to be out of order when compared to the data and midpoint predictions using the standard predict() method for lme4. I can't reproduce the problem with simulated data, so the best I can do is show the lmerMod object and some of my data.
> # display input data to the model
> head(inputData)
id y x z
1 calibration19 1.336 0.531 001
2 calibration20 1.336 0.433 001
3 calibration22 0.042 0.432 001
4 calibration23 0.042 0.423 001
5 calibration16 3.300 0.491 001
6 calibration17 3.300 0.465 001
> sapply(inputData, class)
id y x z
"factor" "numeric" "numeric" "factor"
>
> # fit mixed effects regression with random intercept on z
> lmeFit = lmer(y ~ x + (1 | z), inputData)
>
> # display lmerMod object
> lmeFit
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ x + (1 | z)
Data: inputData
REML criterion at convergence: 444.245
Random effects:
Groups Name Std.Dev.
z (Intercept) 0.3097
Residual 0.9682
Number of obs: 157, groups: z, 17
Fixed Effects:
(Intercept) x
-0.4291 5.5638
>
> # display new data to predict in
> head(predData)
id x z
1 29999900108 0.343 001
2 29999900207 0.315 001
3 29999900306 0.336 001
4 29999900405 0.408 001
5 29999900504 0.369 001
6 29999900603 0.282 001
> sapply(predData, class)
id x z
"factor" "numeric" "factor"
>
> # estimate fitted values using predict()
> set.seed(1)
> preds_mid = predict(lmeFit, newdata=predData)
>
> # estimate fitted values using predictInterval()
> set.seed(1)
> preds_interval = predictInterval(lmeFit, newdata=predData, n.sims=1000) # wrong order
>
> # estimate fitted values just for the first observation to confirm that it should be similar to preds_mid
> set.seed(1)
> preds_interval_first_row = predictInterval(lmeFit, newdata=predData[1,], n.sims=1000)
>
> # display results
> head(preds_mid) # correct prediction
1 2 3 4 5 6
1.256860 1.101074 1.217913 1.618505 1.401518 0.917470
> head(preds_interval) # incorrect order
fit upr lwr
1 1.512410 2.694813 0.133571198
2 1.273143 2.521899 0.009878347
3 1.398273 2.785358 0.232501376
4 1.878165 3.188086 0.625161201
5 1.605049 2.813737 0.379167003
6 1.147415 2.417980 -0.108547846
> preds_interval_first_row # correct prediction
fit upr lwr
1 1.244366 2.537451 -0.04911808
> preds_interval[round(preds_interval$fit,3)==round(preds_interval_first_row$fit,3),] # the correct prediction ends up as observation 1033
fit upr lwr
1033 1.244261 2.457012 -0.0001299777
>
To put this into words, the first observation of my data frame predData should have a fitted value around 1.25 according to the predict() method, but it has a value around 1.5 using the predictInterval() method. This does not seem to be simply due to differences in the prediction approaches, because if I restrict the newdata argument to the first row of predData, the resulting fitted value is around 1.25, as expected.
The fact that I can't reproduce the problem with simulated data leads me to believe it has to do with an attribute of my input or prediction data. I've tried reclassifying the factor variables as characters and enforcing the order of the rows, both prior to fitting the model and between fitting and predicting, but without success.
Is this a known issue? What can I do to avoid it?
I have attempted to make a minimal reproducible example of this issue, but have been unsuccessful.
library(merTools)
d <- data.frame(x = rnorm(1000), z = sample(1:25L, 1000, replace = TRUE),
                id = sample(LETTERS, 1000, replace = TRUE))
d$z <- as.factor(d$z)
d$id <- factor(d$id)
d$y <- simulate(~ x + (1 | z), family = gaussian,
                newdata = d,
                newparams = list(beta = c(2, -1.1), theta = c(.25),
                                 sigma = c(.23)), seed = 463)[[1]]
lmeFit <- lmer(y ~ x + (1|z), data = d)
predData <- data.frame(x = rnorm(25), z = sample(1:25L, 25, replace=TRUE),
id = sample(LETTERS, 25, replace = TRUE))
predData$z <- as.factor(predData$z)
predData$id <- factor(predData$id)
predict(lmeFit, predData)
predictInterval(lmeFit, predData)
predictInterval(lmeFit, predData[1, ])
But, playing around with this code, I was not able to recreate the error observed above. Can you post a synthetic example that reproduces it?
Or can you test the issue by first coercing the factors to characters and seeing whether the same re-ordering occurs?
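For reference, a sketch of that check using the objects from the question (assuming lmeFit and predData as defined above):
# coerce the factor columns to character before predicting
predData2 <- predData
predData2$z <- as.character(predData2$z)
predData2$id <- as.character(predData2$id)
set.seed(1)
head(predictInterval(lmeFit, newdata = predData2, n.sims = 1000))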

R: cannot predict specific value [duplicate]

This question already has answers here:
Predict() - Maybe I'm not understanding it
(4 answers)
Closed 6 years ago.
> age <- c(23,19,25,10,9,12,11,8)
> steroid <- c(27.1,22.1,21.9,10.7,7.4,18.8,14.7,5.7)
> sample <- data.frame(age,steroid)
> fit2 <- lm(sample$steroid~poly(sample$age,2,raw=TRUE))
> fit2
Call:
lm(formula = sample$steroid ~ poly(sample$age, 2, raw = TRUE))
Coefficients:
(Intercept) -27.7225
poly(sample$age, 2, raw = TRUE)1 5.1819
poly(sample$age, 2, raw = TRUE)2 -0.1265
> (newdata=data.frame(age=15))
age
1 15
> predict(fit2,newdata,interval="predict")
fit lwr upr
1 24.558395 17.841337 31.27545
2 25.077825 17.945550 32.21010
3 22.781034 15.235782 30.32628
4 11.449490 5.130638 17.76834
5 8.670526 2.152853 15.18820
6 16.248596 9.708411 22.78878
7 13.975514 7.616779 20.33425
8 5.638620 -1.398279 12.67552
Warning message:
'newdata' had 1 rows but variable(s) found have 8 rows
Why is the predict function unable to predict for age=15?
Instead of lm(data$y ~ data$x), use the form lm(y ~ x, data); predict() can then match the age column of newdata by name. That should solve your problem.
EDIT: the problem is not only with the call to lm, but also with the use of poly(*, raw=TRUE). If you remove the raw=TRUE bit, it should then work. I'm not sure why raw=TRUE breaks here.
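Putting both suggestions together, a minimal sketch of the corrected call:
# reference columns through `data` and drop raw=TRUE
fit2 <- lm(steroid ~ poly(age, 2), data = sample)
predict(fit2, newdata = data.frame(age = 15), interval = "prediction")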
