I'm using the biglm package to run a regression on a data set. The regression runs fine using the following code:
chunkStart <- seq(1,150000000,1000000)
chunkEnd <- seq(1000000,151000000,1000000)
ff <- price ~ factor(Var1) + factor(Var2)
#for(i in 1:length(chunkStart)){
for(i in 1:5){
startRow <- chunkStart[i]
endRow <- chunkEnd[i]
curchunk <- data.frame( price=x[startRow:endRow,1]
,Var1=factor( x[startRow:endRow,6], levels=1:3), Var2= factor( x[startRow:endRow,7], levels=1:3 ) )
if(i == 1){
a <- biglm(ff,curchunk )
}
if(i != 1){
a <- update(a,curchunk )
}
rm(curchunk )
gc()
print(paste(i, " | ",startRow ," | ",endRow ," | ", sep=""))
flush.console()
}
> summary(a)
Large data regression model: biglm(ff, curchunk)
Sample size = 5000000
Coef (95% CI) SE p
(Intercept) 0.0457 0.0454 0.0461 2e-04 0
factor(Var1)2 0.0189 0.0184 0.0194 2e-04 0
factor(Var1)3 0.0148 0.0142 0.0155 3e-04 0
factor(Var2)2 -0.0331 -0.0335 -0.0326 2e-04 0
factor(Var2)3 -0.0417 -0.0426 -0.0408 4e-04 0
The problems come when I try to predict using the biglm object, 'a'.
> df1 <- data.frame(y[1:1000,])
> pred1 <- predict(a, df1)
Error in eval(expr, envir, enclos) : object 'price' not found
Why is the predict function looking for price, the dependent variable? Any suggestions?
EDIT:
> head(df1)
Var1 Var2
1 3 3
2 3 1
3 3 2
4 2 1
5 2 2
6 1 1
> str(df1)
'data.frame': 1000 obs. of 2 variables:
$ Var1: Factor w/ 3 levels "1","2","3": 3 3 3 2 2 1 2 1 2 1 ...
$ Var2: Factor w/ 3 levels "1","2","3": 3 1 2 1 2 1 1 1 2 1 ...
> pred1 <- predict(a, df1)
Error in eval(expr, envir, enclos) : object 'price' not found
The reason it is looking for the dependent variable is that the predict method uses a call to model.frame from the stats package, and that function requires all the variables to be present in the new data. This is indicated on the model.frame help page without explanation for the motivation behind it.
All you actually need to do about this is create a variable in your new data that has the same name as the dependent variable, then fill it with zeroes (or any non-missing value). So it should work if you run this:
df1$price <- 0
pred1 <- predict(a, df1)
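The behaviour can be reproduced in base R without biglm. This is a minimal sketch, assuming the failing step is a model.frame call on the full formula, as described above. The toy data frame newdata is made up for illustration and is not the poster's data.

```r
# Same formula as in the question; the response 'price' is on the left-hand side.
ff <- price ~ factor(Var1) + factor(Var2)

# Hypothetical new data containing only the predictors.
newdata <- data.frame(
  Var1 = factor(c(3, 2, 1), levels = 1:3),
  Var2 = factor(c(3, 1, 2), levels = 1:3)
)

# model.frame evaluates every variable in the formula, including the response,
# so it fails when 'price' is absent from the data.
res <- try(model.frame(ff, newdata), silent = TRUE)
failed_without_price <- inherits(res, "try-error")

# Adding a placeholder response lets the model frame and design matrix be built.
newdata$price <- 0
mf <- model.frame(ff, newdata)
X <- model.matrix(ff, mf)
colnames(X)
```

The placeholder value does not enter the prediction; it only satisfies the variable lookup.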
I have a large dataset that I subsetted and created a new dataset.
I used the following code that works perfectly
require(sjPlot);require(coxme)
tab_model(coxme(Surv(comp2_years, comp2)~FEMALE+(1|TRIAL), data))
But when I used the subsetted data set with the following code,
www<- subset(data, (data$TRIAL != 5 & data$Sex.standerd.BMI.gpM1F2 >=1))
tab_model(coxme(Surv(comp2_years, comp2)~FEMALE+(1|TRIAL), www))
it gave me the following error:
Error in coxme.fit(X, Y, strats, offset, init, control, weights = weights, :
No starting estimate was successful
This is my new data structure
str(www)
Classes ‘data.table’ and 'data.frame': 7576 obs. of 79 variables:
$ TRIAL : num 1 1 1 1 1 1 1 1 1 1 ...
$ FEMALE : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ type_comp2 : chr "0" "0" "Revasc" "0" ...
$ comp2 : num 0 0 1 0 0 0 0 0 0 1 ...
$ comp2_years : num 10 10 9.77 10 10 ...
$ Sex.standerd.BMI.gpM1F2 : num 1 1 1 1 1 1 1 1 1 1 ...
$ Trial1_4.MiddleBMI : num 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, ".internal.selfref")=<externalptr>
I saw this post but I could not solve my current problem.
Any advice will be greatly appreciated.
Add the droplevels() command to your subset.
This happened to me too, and I found that using droplevels() to forget about the levels you did not include in the subset solved it:
library(survival)
library(coxme)
Change ph.ecog from number to categorical to make this point:
lung$ph.ecog <- as.factor(lung$ph.ecog)
(fit <- coxme(Surv(time, status) ~ ph.ecog + age + (1|inst), lung))
Works well for the full data set. Subset out some levels of ph.ecog, and it gives this error:
lunga <- subset(lung, !ph.ecog %in% c(2, 3))
(fita <- coxme(Surv(time, status) ~ ph.ecog + age + (1|inst), lunga))
Error in coxme.fit(X, Y, strats, offset, init, control, weights = weights, :
No starting estimate was successful
Using droplevels() to forget about empty levels allows coxme to fit again:
lungb <- droplevels(subset(lung, !ph.ecog %in% c(2, 3)))
(fitb <- coxme(Surv(time, status) ~ ph.ecog + age + (1|inst), lungb))
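Before refitting, it can help to check which factor levels actually survive the subset. The sketch below uses a small made-up data frame (d, sex and bmi are hypothetical names), not the poster's data, and only base R.

```r
# Toy data: 'sex' has two levels, but only level "1" has bmi >= 1.
d <- data.frame(
  sex = factor(c(1, 1, 2, 2)),
  bmi = c(1, 1, 0, 0)
)

sub <- subset(d, bmi >= 1)

# subset() keeps the original level set, even for levels with no rows left.
levels(sub$sex)   # both levels are still declared
table(sub$sex)    # level "2" has a count of zero

# droplevels() removes the unused levels from every factor in the data frame.
sub_clean <- droplevels(sub)
levels(sub_clean$sex)
```

If a factor is left with a single level after droplevels(), its effect cannot be estimated in the subset, so the term would have to be dropped from the model as well.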
I am working on a two-way mixed ANOVA using the data below, with one dependent variable, one between-subjects variable and one within-subjects variable. When I tested the residuals of the dependent variable for normality, I found that they are not normally distributed, but at that point I was still able to run the two-way ANOVA. However, when I apply a log10 transformation and run the script again using the log-transformed variable, I get the error "contrasts can be applied only to factors with 2 or more levels".
> str(m_runjumpFREQ)
'data.frame': 564 obs. of 8 variables:
$ ID1 : int 1 2 3 4 5 6 7 8 9 10 ...
$ ID : chr "ID1" "ID2" "ID3" "ID4" ...
$ Group : Factor w/ 2 levels "II","Non-II": 1 1 1 1 1 1 1 1 1 1 ...
$ Pos : Factor w/ 3 levels "center","forward",..: 2 1 2 3 2 2 1 3 2 2 ...
$ Match_outcome : Factor w/ 2 levels "W","L": 2 2 2 2 2 2 2 2 2 1 ...
$ time : Factor w/ 8 levels "runjump_nADJmin_q1",..: 1 1 1 1 1 1 1 1 1 1 ...
$ runjump : num 0.0561 0.0858 0.0663 0.0425 0.0513 ...
$ log_runjumpFREQ: num -1.25 -1.07 -1.18 -1.37 -1.29 ...
Some answers on StackOverflow to this error have mentioned that one or more factors in the data set used for the ANOVA have fewer than two levels. But as seen above, that is not the case here.
Another explanation I have read is that missing values (NAs) may be the issue. There are some:
m1_nasum <- sum(is.na(m_runjumpFREQ$log_runjumpFREQ))
> m1_nasum
[1] 88
However, I get the same error even after removing the rows including NA's as follows.
> m_runjumpFREQ <- na.omit(m_runjumpFREQ)
> m1_nasum <- sum(is.na(m_runjumpFREQ$log_runjumpFREQ))
> m1_nasum
[1] 0
I can run the same script without the log transformation and it works, but with it I get the same error. The factors are the same either way, and the missing values do not make a difference. Either I am making a crucial mistake, or the issue is in the log transformation lines below.
log_runjumpFREQ <- log10(m_runjumpFREQ$runjump)
m_runjumpFREQ <- cbind(m_runjumpFREQ, log_runjumpFREQ)
I appreciate the help.
It is not good enough that the factors have 2 levels. In addition those levels must actually be present. For example, below f has 2 levels but only 1 is actually present.
y <- (1:6)^2
x <- 1:6
f <- factor(rep(1, 6), levels = 1:2)
nlevels(f) # f has 2 levels
## [1] 2
lm(y ~ x + f)
## Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
## contrasts can be applied only to factors with 2 or more levels
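To find out which factor is affected, one option is to count the levels that are actually present after incomplete rows are dropped. This is a sketch on made-up data (d, g and t are hypothetical names); str() and nlevels() report declared levels, so they cannot reveal the problem on their own.

```r
# Toy data: the only rows with g == "b" have a missing response.
d <- data.frame(
  y = c(1.2, NA, 0.7, 1.9, NA, 1.1),
  g = factor(c("a", "b", "a", "a", "b", "a")),
  t = factor(c("q1", "q1", "q2", "q2", "q1", "q2"))
)

d_complete <- na.omit(d)

# Declared levels are unchanged by na.omit().
nlevels(d_complete$g)

# Count the levels that still occur in each factor column.
present <- sapply(
  Filter(is.factor, d_complete),
  function(f) nlevels(droplevels(f))
)
present
```

Any factor with a count below 2 in present would trigger the contrasts error when used in a model formula.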
I am using the CHAID method from caret. I am getting the same error as this post that x is not a factor when all the x's are factors.
I'm using R 3.3.3 and caret_6.0-78
Here is a toy example:
library(datasets)
library(caret)
library(CHAID)
testDat<-data.frame(HairEyeColor, stringsAsFactors=T)[,1:3]
str(testDat)
'data.frame': 32 obs. of 3 variables:
$ Hair: Factor w/ 4 levels "Black","Brown",..: 1 2 3 4 1 2 3 4 1 2 ...
$ Eye : Factor w/ 4 levels "Brown","Blue",..: 1 1 1 1 2 2 2 2 3 3 ...
$ Sex : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ...
control <- trainControl(method="repeatedcv", number=10, repeats=3,
savePredictions="final", summaryFunction=twoClassSummary, classProbs=TRUE)
fit.chaid <- train(Sex~Hair+Eye, data=testDat, method="chaid", metric="ROC", trControl=control)
Error: is.factor(x) is not TRUE
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Timing stopped at: 0.02 0 0.02
warnings()
Warning messages:
1: model fit failed for Fold01.Rep1: alpha2=0.05, alpha3=-1, alpha4=0.05 Error : is.factor(x) is not TRUE
.
.
I know this question is already old, but I found an answer by experimenting:
For CHAID modeling, try the x/y interface of train() rather than the formula interface, like this:
fit.chaid <- train(x=testDat[,c(1,2)], #Hair and Eye Variable
y=testDat[,c(3)], #Sex Variable
method="chaid",
metric="ROC",
trControl=control)
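A likely explanation, offered as an assumption rather than something confirmed in this thread, is that the formula interface expands factor predictors into numeric dummy columns before the model sees them, whereas the x/y interface passes the factors through unchanged. The base R sketch below shows what that expansion does to the toy data; it does not call caret or CHAID.

```r
library(datasets)

testDat <- data.frame(HairEyeColor, stringsAsFactors = TRUE)[, 1:3]

# In the original data frame both predictors are factors.
predictors_are_factors <- sapply(testDat[, c("Hair", "Eye")], is.factor)

# A model matrix built from the same formula contains only numeric 0/1 columns.
X <- model.matrix(Sex ~ Hair + Eye, data = testDat)
colnames(X)
is.numeric(X)
```

A method that insists on factor inputs would reject columns like these, which matches the "is.factor(x) is not TRUE" message.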
I tried to use glm to estimate the strengths of soccer teams.
# data is dataframe (structure on bottom).
model <- glm(Goals ~ Home + Team + Opponent, family=poisson(link=log), data=data)
but get the error:
Error in if (any(y < 0)) stop("negative values not allowed for the 'Poisson' family") :
missing value where TRUE/FALSE needed
In addition: Warning message:
In Ops.factor(y, 0) : ‘<’ not meaningful for factors
data:
> data
Team Opponent Goals Home
1 5a51f2589d39c31899cce9d9 5a51f2579d39c31899cce9ce 3 1
2 5a51f2579d39c31899cce9ce 5a51f2589d39c31899cce9d9 0 0
3 5a51f2589d39c31899cce9da 5a51f2579d39c31899cce9cd 3 1
4 5a51f2579d39c31899cce9cd 5a51f2589d39c31899cce9da 0 0
> is.factor(data$Goals)
[1] TRUE
From the "Details" section of the documentation for the glm() function:
A typical predictor has the form response ~ terms where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response.
So you want to make sure your Goals column is numeric:
df <- data.frame( Team= c("5a51f2589d39c31899cce9d9", "5a51f2579d39c31899cce9ce", "5a51f2589d39c31899cce9da", "5a51f2579d39c31899cce9cd"),
Opponent=c("5a51f2579d39c31899cce9ce", "5a51f2589d39c31899cce9d9", "5a51f2579d39c31899cce9cd", "5a51f2589d39c31899cce9da"),
Goals=c(3,0,3,0),
Home=c(1,0,1,0))
str(df)
#'data.frame': 4 obs. of 4 variables:
# $ Team : Factor w/ 4 levels "5a51f2579d39c31899cce9cd",..: 3 2 4 1
# $ Opponent: Factor w/ 4 levels "5a51f2579d39c31899cce9cd",..: 2 3 1 4
# $ Goals : num 3 0 3 0
# $ Home : num 1 0 1 0
model <- glm(Goals ~ Home + Team + Opponent, family=poisson(link=log), data=df)
Then here is the output:
> model
Call: glm(formula = Goals ~ Home + Team + Opponent, family = poisson(link = log),
data = df)
Coefficients:
(Intercept) Home Team5a51f2579d39c31899cce9ce
-2.330e+01 2.440e+01 -3.089e-14
Team5a51f2589d39c31899cce9d9 Team5a51f2589d39c31899cce9da Opponent5a51f2579d39c31899cce9ce
-6.725e-15 NA NA
Opponent5a51f2589d39c31899cce9d9 Opponent5a51f2589d39c31899cce9da
NA NA
Degrees of Freedom: 3 Total (i.e. Null); 0 Residual
Null Deviance: 8.318
Residual Deviance: 3.033e-10 AIC: 13.98
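If Goals is already stored as a factor in an existing data frame, it has to be converted with care. This is a small sketch on a made-up vector (goals is a hypothetical name): calling as.numeric() directly on a factor returns the internal level codes, not the original numbers.

```r
# A factor holding goal counts, as in is.factor(data$Goals) above.
goals <- factor(c("3", "0", "3", "0"))

# Direct conversion gives the level codes (levels are "0" and "3").
codes <- as.numeric(goals)

# Going through character recovers the actual values.
values <- as.numeric(as.character(goals))

codes
values
```

With the question's data frame, the same conversion would be applied to the Goals column before calling glm().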
I used the gbm() function to create the model and I want to get the accuracy. Here is my code:
df<-read.csv("http://freakonometrics.free.fr/german_credit.csv", header=TRUE)
str(df)
F=c(1,2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20,21)
for(i in F) df[,i]=as.factor(df[,i])
library(caret)
set.seed(1000)
intrain<-createDataPartition(y=df$Creditability, p=0.7, list=FALSE)
train<-df[intrain, ]
test<-df[-intrain, ]
install.packages("gbm")
library("gbm")
df_boosting<-gbm(Creditability~.,distribution = "bernoulli", n.trees=100, verbose=TRUE, interaction.depth=4,
shrinkage=0.01, data=train)
summary(df_boosting)
yhat.boost<-predict (df_boosting ,newdata =test, n.trees=100)
mean((yhat.boost-test$Creditability)^2)
However, when using the summary function, an error appears. The error message is as follows.
Error in plot.window(xlim, ylim, log = log, ...) :
need finite 'xlim' values
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
And when measuring the MSE with the mean function, the following warning also appears:
Warning message:
In Ops.factor(yhat.boost, test$Creditability) :
‘-’ not meaningful for factors
Do you know why these two errors appear? Thank you in advance.
In your code the problem is in the definition of the (binary) response variable Creditability. You declare it as a factor, but gbm needs a numeric response variable, so column 1 is left out of the list of columns converted to factors below.
Here is the code:
df <- read.csv("http://freakonometrics.free.fr/german_credit.csv", header=TRUE)
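# Columns to convert to factors; column 1 (Creditability) stays numeric.
# Named factor_cols rather than F to avoid masking the built-in F (FALSE).
factor_cols <- c(2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20,21)
for(i in factor_cols) df[,i] <- as.factor(df[,i])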
str(df)
Creditability now is a binary numerical variable:
'data.frame': 1000 obs. of 21 variables:
$ Creditability : int 1 1 1 1 1 1 1 1 1 1 ...
$ Account.Balance : Factor w/ 4 levels "1","2","3","4": 1 1 2 1 1 1 1 1 4 2 ...
$ Duration.of.Credit..month. : int 18 9 12 12 12 10 8 6 18 24 ...
$ Payment.Status.of.Previous.Credit: Factor w/ 5 levels "0","1","2","3",..: 5 5 3 5 5 5 5 5 5 3 ...
$ Purpose : Factor w/ 10 levels "0","1","2","3",..: 3 1 9 1 1 1 1 1 4 4 ...
...
... and the remaining part of the code works nicely:
library(caret)
set.seed(1000)
intrain <- createDataPartition(y=df$Creditability, p=0.7, list=FALSE)
train <- df[intrain, ]
test <- df[-intrain, ]
library("gbm")
df_boosting <- gbm(Creditability~., distribution = "bernoulli",
n.trees=100, verbose=TRUE, interaction.depth=4,
shrinkage=0.01, data=train)
par(mar=c(3,14,1,1))
summary(df_boosting, las=2)
##########
var rel.inf
Account.Balance Account.Balance 36.8578980
Credit.Amount Credit.Amount 12.0691120
Duration.of.Credit..month. Duration.of.Credit..month. 10.5359895
Purpose Purpose 10.2691646
Payment.Status.of.Previous.Credit Payment.Status.of.Previous.Credit 9.1296524
Value.Savings.Stocks Value.Savings.Stocks 4.9620662
Instalment.per.cent Instalment.per.cent 3.3124252
...
##########
yhat.boost <- predict(df_boosting , newdata=test, n.trees=100)
mean((yhat.boost-test$Creditability)^2)
[1] 0.2719788
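Since the question asks for accuracy, note that predict() on a gbm object returns values on the link (log-odds) scale by default for the bernoulli distribution, unless type = "response" is requested. The sketch below uses made-up scores and labels (link_scores and observed are hypothetical, not output from the model above) to show the conversion to probabilities and a simple accuracy at a 0.5 cutoff, which is an arbitrary choice.

```r
# Hypothetical link-scale predictions and observed 0/1 labels.
link_scores <- c(-1.2, 0.3, 2.0, -0.4, 0.9)
observed    <- c(0, 1, 1, 1, 0)

# The inverse logit maps log-odds to probabilities.
prob <- plogis(link_scores)

# Classify at 0.5 and compare with the observed labels.
predicted_class <- as.integer(prob > 0.5)
accuracy <- mean(predicted_class == observed)

# Mean squared error on the probability scale (the Brier score).
brier <- mean((prob - observed)^2)

accuracy
brier
```

For the fitted model, prob would come from the predict() call with type = "response" (or plogis() applied to the default output), and observed from test$Creditability.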
Hope this can help you.