Day-ahead prediction using a GLM model in R

I have the following code to get a day-ahead prediction for load consumption at 15-minute intervals, using outside air temperature and TOD (time of day, a categorical variable with 96 levels). When I run the code below, I get the following errors.
i = 97:192
formula = as.formula(load[i] ~ load[i-96] + oat[i])
model = glm(formula, data = train.set, family=Gamma(link=vlog()))
I get the following error after the last line using glm(),
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
And the following error shows up after the last line using predict(),
Warning messages:
1: In if (!se.fit) { :
the condition has length > 1 and only the first element will be used
2: 'newdata' had 96 rows but variable(s) found have 1 rows
3: In predict.lm(object, newdata, se.fit, scale = residual.scale, type = ifelse(type == :
prediction from a rank-deficient fit may be misleading
4: In if (se.fit) list(fit = predictor, se.fit = se, df = df, residual.scale = sqrt(res.var)) else predictor :
the condition has length > 1 and only the first element will be used

You're doing things in a rather roundabout fashion, and one that doesn't translate well to making out-of-sample predictions. If you want to model on a subset of rows, then either subset the data argument directly, or use the subset argument.
train.set$load_lag <- c(rep(NA, 96), head(train.set$load, -96))
mod <- glm(load ~ load_lag*TOD, data=train.set[97:192, ], ...)
You also need to rethink exactly what you're doing with TOD. If it has 96 levels, then you're fitting (at least) 96 degrees of freedom on 96 observations which won't give you a sensible outcome.
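A minimal sketch of the lag-and-subset approach described above, using simulated data in place of the real train.set. The column names load and oat follow the question; the simulated values, the three-day length, and the new_day data frame are assumptions made for illustration. TOD is left out here so that the example stays well-determined.

```r
# Simulated stand-in for train.set: three days of 15-minute data.
set.seed(1)
n <- 96 * 3
train.set <- data.frame(
  load = rgamma(n, shape = 20, rate = 2),   # positive loads, as a Gamma GLM requires
  oat  = rnorm(n, mean = 15, sd = 5)        # outside air temperature
)

# Load observed 96 intervals (one day) earlier; the first day has no lag.
train.set$load_lag <- c(rep(NA, 96), head(train.set$load, -96))

# Rows with NA in load_lag are dropped by the default na.action (na.omit).
mod <- glm(load ~ load_lag + oat, data = train.set,
           family = Gamma(link = "log"))

# Day-ahead prediction: the last observed day supplies the lagged load.
new_day <- data.frame(
  load_lag = tail(train.set$load, 96),
  oat      = rnorm(96, mean = 15, sd = 5)   # placeholder forecast temperatures
)
pred <- predict(mod, newdata = new_day, type = "response")
length(pred)
```

Because the formula uses plain column names, predict() can look up load_lag and oat in newdata, which is what the original load[i] ~ load[i-96] + oat[i] formula could not do.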

Related

Estimating Dynamic Difference in Difference in R

I've been trying to estimate the above regression using a dataset of repeated cross-sections, and I tried the did library without success. I have a large dataset and have already formatted it so that it has an event-time dummy, but att_gt() gives an error. Treatment occurs in 2018, the outcome is emp, and the base period should be 2017.
I tried:
df4 <- df1[complete.cases(df1$treat), ]
df4 <- df4[complete.cases(df4$emp), ]
df4 <- df4[(df4$year >= 2014), ]
df4$g <- ifelse(df4$treat == 1, 2018, 0)
att1 <- att_gt(yname = "emp",
               tname = "period",
               gname = "G",
               xformla = ~treat + factor(month) + factor(year),
               data = df4,
               panel = FALSE
)
and it gives me
Error in DRDID::drdid_rc(y = Y, post = post, D = G, covariates = covariates, :
Outcome regression model coefficients have NA components.
Multicollinearity (or lack of variation) of covariates is a likely reason.
In addition: Warning messages:
1: glm.fit: algorithm did not converge
2: In DRDID::drdid_rc(y = Y, post = post, D = G, covariates = covariates, :
glm algorithm did not converge
I also ran a regression using lm only, but it gave insignificant results, which should not be the case for my assignment:
ols1 <- lm(emp ~ relevel(factor(year), ref = "2017") * treat + factor(month),
           data = df4)
summary(ols1)

Problem predicting glmnet(): "contrasts can be applied only to factors with 2 or more levels"

I trained a penalized regression model using R's glmnet package, and X constructed using a sparse.model.matrix with a formula of "~ . * (var1)" to get every term from my data and an interaction with var1:
X3 <- sparse.model.matrix(object = ~.*(var1), data = X)[,-1]
cv_lasso <- cv.glmnet(x = X3, y = Y3,
alpha = 1,
nfold = 10,
family = "binomial",
nlambda = 100,
lambda.min.ratio=0.001,
type.measure="auc",
keep = TRUE,
parallel = TRUE)
Now I'm trying to predict on a couple of data points, converting the new X to a model matrix for use with predict.glmnet() as below:
X_pred <- sparse.model.matrix(object = ~.*(var1), data = X_holdout)
predict(object = cv_lasso,
newx = X_pred,
s = "lambda.min")
But I get the following error:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
I believe this might be caused by a couple of columns from X_holdout that are basically constant (which is correct since I'm trying to predict now, I already trained successfully).
How can I avoid this problem? My understanding is that, since I trained my model using interactions, I have to create a model matrix with the same interactions in my predictions.
Found the root of my problem: some of the prediction X columns were constant, since the holdout data is significantly smaller than the training data.
To fix this, I needed to use the "xlev" argument when creating the sparse matrices for both the training data and the prediction data, passing the same xlev to both.
In case you don't know what "xlev" is, it's a named list of character vectors that give the levels to use when expanding factor variables into dummy/one-hot columns. This way, even if a column contains only one value, sparse.model.matrix() knows that more levels exist and are simply absent from the data. This argument also helps ensure that the training and prediction matrices have the same columns, which is important for predict.glmnet().
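A minimal sketch of the xlev fix on toy data. It uses base R's model.matrix() so that it runs without extra packages; Matrix::sparse.model.matrix() accepts the same xlev argument. The data frames train and holdout and the columns grp and num are hypothetical; only var1 and the formula come from the question.

```r
# Toy training data with two factors and one numeric column.
train <- data.frame(
  var1 = factor(c("a", "b", "a", "b")),
  grp  = factor(c("x", "y", "z", "x")),
  num  = c(1, 2, 3, 4)
)
# A one-row holdout: every factor has a single observed level.
holdout <- data.frame(var1 = factor("a"), grp = factor("y"), num = 5)

# Without xlev this reproduces the "contrasts" error.
res <- try(model.matrix(~ . * var1, data = holdout), silent = TRUE)
inherits(res, "try-error")

# Record the full level sets once, from the training data.
xlev <- lapply(Filter(is.factor, train), levels)

# Build both matrices with the same xlev.
X_train <- model.matrix(~ . * var1, data = train,   xlev = xlev)
X_pred  <- model.matrix(~ . * var1, data = holdout, xlev = xlev)

identical(colnames(X_train), colnames(X_pred))
```

Storing xlev alongside the fitted model makes it possible to rebuild a compatible matrix for any later batch, however small.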

R: Confusion matrix in RF model returns error: data` and `reference` should be factors with the same levels

I am new to R and want to solve a binary classification task.
The dataset has a factor variable LABELS with 2 classes: the first is 0, the second is 1. The next image shows the actual head of it:
The TimeDate column is just an index.
The class distribution is computed as:
print("the number of values with % in factor variable - LABELS:")
percentage <- prop.table(table(dataset$LABELS)) * 100
cbind(freq=table(dataset$LABELS), percentage=percentage)
Result of class distribution:
I also know that the Slot2 column is calculated with the formula:
Slot2 = Var3 - Slot3 + Slot4
The features Var1, Var2, Var3, Var4 were selected after analysing the correlation matrix.
Before modeling, I divided the dataset into train and test parts.
I tried to build a random forest model for the binary classification task using the following code:
rf2 <- randomForest(LABELS ~ Var1 + Var2 + Var3 + Var4,
data=train, ntree = 100,
mtry = 4, importance = TRUE)
print(rf2)
The result is:
Call:
randomForest(formula = LABELS ~ Var1 + Var2 + Var3 + Var4,
data = train, ntree = 100, mtry = 4, importance = TRUE)
Type of random forest: classification
Number of trees: 100
No. of variables tried at each split: 4
OOB estimate of error rate: 0.16%
Confusion matrix:
       0      1 class.error
0 164957    341 0.002062941
1    280 233739 0.001196484
When I tried to do predict:
# Prediction & Confusion Matrix - train data
p1 <- predict(rf2, train, type="prob")
print("Prediction & Confusion Matrix - train data")
confusionMatrix(p1, train$LABELS)
# # Prediction & Confusion Matrix - test data
p2 <- predict(rf2, test, type="prob")
print("Prediction & Confusion Matrix - test data")
confusionMatrix(p2, test$LABELS)
I received an error in R:
[1] "Prediction & Confusion Matrix - train data"
Error: `data` and `reference` should be factors with the same levels.
Traceback:
1. confusionMatrix(p1, train$LABELS)
2. confusionMatrix.default(p1, train$LABELS)
3. stop("`data` and `reference` should be factors with the same levels.",
. call. = FALSE)
I have also tried to fix it using ideas from the following questions:
Error in ConfusionMatrix the data and reference factors must have the same number of levels R CARET
Error in Confusion Matrix : the data and reference factors must have the same number of levels
but they don't help in my case.
Could you please help me with this error?
I'd appreciate any ideas and comments. Thank you in advance.
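The error in R:
Error: `data` and `reference` should be factors with the same levels.
was fixed by changing the type argument of the predict function from "prob" to "response", so that it returns predicted class labels (a factor) rather than a matrix of class probabilities. The corrected code: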
# Prediction & Confusion Matrix - train data
p1 <- predict(rf2, train, type="response")
print("Prediction & Confusion Matrix - train data")
confusionMatrix(p1, train$LABELS)
@Camille, thank you so much!
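To illustrate why type="prob" triggers the error, here is a small base-R sketch. The matrix p is a hand-made stand-in for the output of predict(rf2, newdata, type = "prob"), which has one row per observation and one column per class; the numbers and the reference vector are invented for illustration. If the probabilities are needed, they can be turned into a factor with the same levels as the reference before building a confusion matrix.

```r
# Stand-in for a class-probability matrix: columns are named after the classes.
p <- matrix(c(0.9, 0.1,
              0.2, 0.8,
              0.6, 0.4,
              0.3, 0.7),
            ncol = 2, byrow = TRUE,
            dimnames = list(NULL, c("0", "1")))

# Stand-in for train$LABELS.
reference <- factor(c("0", "1", "1", "1"), levels = c("0", "1"))

# A matrix is not a factor, which is what the error message points out.
is.factor(p)

# Pick the most probable class per row and give it the reference's levels.
pred_class <- factor(colnames(p)[max.col(p, ties.method = "first")],
                     levels = levels(reference))

# Both arguments are now factors with identical levels.
table(predicted = pred_class, actual = reference)
```

The same pred_class could be passed to confusionMatrix() in place of p1; base table() is used here only to keep the sketch free of package dependencies.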

R: Using a variable with less observations in a regression (plm)

I have been trying to deal with this for a while now with no luck. Essentially, I am running a two-stage least squares (2SLS) regression on some panel data using the plm package. What I want to do is:
1. Do a 2SLS
2. Get the residuals from the 2SLS in step 1
3. Use these residuals as an instrument in a different 2SLS
The issue I have is that the first 2SLS uses fewer observations than the dataset contains, so my residuals vector is too short and I get the following error
Error in model.frame.default(terms(formula, lhs = lhs, rhs = rhs, data = data, :
variable lengths differ (found for 'ivreg.2.a$residuals')
Here is the code I am trying to run for reference, let me know if you need any more details. I really just need my residual vector to be the same length as the data used in the first 2SLS. For reference my data has 1713 observations, however, only 1550 get used in the regression and as a result my residuals vector is length 1550. My code for the two 2SLS regressions is below.
ivreg.2.a = plm(formula = diff(loda) ~ factor(year)+diff(lgdp) | index_g_l + diff(lcru_l) + diff(lcru_l_sq) + factor(year), index = c("country", "year"), model = "within", data = panel[complete.cases(panel[, c(1,2,3,4,5,7)]),])
ivreg.2.a = plm(formula = diff(lgdp) ~ factor(year)+index_g_l + diff(lcru_l) + diff(lcru_l_sq) + diff(loda)| index_g_l + diff(lcru_l) + diff(lcru_l_sq) + factor(year) + ivreg.2.a$residuals, index = c("country", "year"), model = "within", data = panel[complete.cases(panel[, c(1,2,3,4,5,7)]),])
Let me know if you need anything else.
I assume the 163 observations are dropped because they have NA in one of the relevant variables. Most *lm functions in R have a na.action argument, which can be used to pad the residuals to correct length. E.g., when missing observation 3,
residuals(lm(formula, data, na.action=na.omit)) # 1 2 4
residuals(lm(formula, data, na.action=na.exclude)) # 1 2 NA 4
Documentation of plm, however, says that this argument is "currently not fully supported", so it would be simpler if you just filter those 1550 rows to a new dataframe first, and do all subsequent work on that.
BTW, if plm behaves like lm, you shouldn't need to specify complete.cases for it to work, as it should just skip anything with NAs.
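The difference between na.omit and na.exclude, and the simpler filter-first alternative, can be seen on a toy lm fit. The data frame d is invented for illustration, and plm itself is not used here since its na.action support is documented as incomplete.

```r
# Five observations, with x missing in row 3.
d <- data.frame(x = c(1, 2, NA, 4, 5),
                y = c(2.1, 3.9, 6.2, 8.1, 9.8))

# na.omit: the residual vector is shorter than the data.
r_omit <- residuals(lm(y ~ x, data = d, na.action = na.omit))
length(r_omit)

# na.exclude: residuals are padded with NA back to the original length.
r_excl <- residuals(lm(y ~ x, data = d, na.action = na.exclude))
length(r_excl)
is.na(r_excl[3])

# Filter-first alternative: keep only complete rows, then attach residuals.
d_cc <- d[complete.cases(d), ]
d_cc$res <- residuals(lm(y ~ x, data = d_cc))
nrow(d_cc)
```

With the filtered data frame, the residuals become an ordinary column of the same length as every other variable, so they can be used in a later formula without a length mismatch.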

predict.glm() with three new categories in the test data (r)(error)

I have a data set called data which has 481 092 rows.
I split data into two equal halves:
The first half (rows 1 : 240 546) is called train and was used for the glm();
the second half (rows 240 547 : 481 092) is called test and should be used to validate the model;
Then I started the regression:
testreg <- glm(train$returnShipment ~ train$size + train$color + train$price +
train$manufacturerID + train$salutation + train$state +
train$age + train$deliverytime,
family=binomial(link="logit"), data=train)
Now the prediction:
prediction <- predict.glm(testreg, newdata=test, type="response")
gives me an Error:
Error in model.frame.default(Terms, newdata, na.action=na.action, xlev=object$xlevels):
Factor 'train$manufacturerID' has new levels 125, 136, 137
Now I know that these levels were omitted in the regression because it doesn't show any coefficients for these levels.
I have tried this: predict.lm() with an unknown factor level in test data. But it somehow doesn't work for me, or maybe I just don't understand how to implement it. I want to predict the dependent binary variable, but of course only with the existing coefficients. The link above suggests telling R that rows with new levels should just be treated as NA.
How can I proceed?
Edit-Suggested approach by Z. Li
I got problem in the first step:
xlevels <- testreg$xlevels$manufacturerID
mID125 <- xlevels[1]
but mID125 is NULL! What have I done wrong?
It is impossible to estimate effects for new factor levels in fixed-effects modelling, including linear models and generalized linear models. glm (as well as lm) keeps a record of which factor levels were present and used during model fitting; these can be found in testreg$xlevels.
Your model formula for model estimation is:
returnShipment ~ size + color + price + manufacturerID + salutation +
state + age + deliverytime
predict then complains about new factor levels 125, 136, 137 for manufacturerID. This means these levels are not in testreg$xlevels$manufacturerID and therefore have no associated coefficient for prediction. In this case, we have to drop this factor variable and use a prediction formula:
returnShipment ~ size + color + price + salutation +
state + age + deliverytime
However, the standard predict routine cannot take a customized prediction formula. There are commonly two solutions:
1. Extract the model matrix and model coefficients from testreg, and manually predict the model terms we want by matrix-vector multiplication. This is what the link given in your post suggests;
2. Reset the factor levels in test to any one level that appears in testreg$xlevels$manufacturerID, for example testreg$xlevels$manufacturerID[1]. That way we can still use the standard predict for prediction.
For the second approach, first pick a factor level that was used for model fitting:
xlevels <- testreg$xlevels$manufacturerID
mID125 <- xlevels[1]
Then we assign this level to your prediction data:
replacement <- factor(rep(mID125, length = nrow(test)), levels = xlevels)
test$manufacturerID <- replacement
And we are ready to predict:
pred <- predict(testreg, test, type = "link") ## don't use type = "response" here!!
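Then we adjust this linear predictor by subtracting the estimate for that factor level:
est <- coef(testreg)[paste0("manufacturerID", mID125)]
if (is.na(est)) est <- 0  ## the reference level has no coefficient under treatment contrasts
pred <- pred - est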
Finally, if you want prediction on the original scale, you apply the inverse of link function:
testreg$family$linkinv(pred)
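The level-replacement steps above can be checked on a small simulated example. Everything here (the data frames train and test, the variables x, f, y, and the names fit and stand_in) is hypothetical. The sketch picks the second level as the stand-in so that a coefficient is guaranteed to exist under the default treatment contrasts, then compares the adjusted linear predictor with a manual calculation.

```r
set.seed(42)
# Training data: one numeric predictor and a factor with levels a, b, c.
train <- data.frame(
  x = rnorm(60),
  f = factor(sample(c("a", "b", "c"), 60, replace = TRUE))
)
train$y <- rbinom(60, 1, plogis(0.5 * train$x + (train$f == "b")))
fit <- glm(y ~ x + f, family = binomial, data = train)

# Test data containing a level "d" that the model has never seen.
test <- data.frame(x = c(-1, 0, 1), f = factor(c("d", "a", "d")))

# Replace the factor with a stand-in level known to the model.
lev      <- fit$xlevels$f
stand_in <- lev[2]
test$f   <- factor(rep(stand_in, nrow(test)), levels = lev)

# Predict on the link scale, then remove the stand-in level's contribution.
eta <- predict(fit, newdata = test, type = "link")
eta <- eta - coef(fit)[paste0("f", stand_in)]

# Manual check: intercept plus the x term only.
manual <- coef(fit)["(Intercept)"] + coef(fit)["x"] * test$x
all.equal(unname(eta), unname(manual))

# Back to the probability scale.
prob <- fit$family$linkinv(eta)
```

Note that the result is the prediction with the factor's contribution set to zero, which under treatment contrasts coincides with the prediction for the reference level.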
update:
You reported running into various problems when trying the above solutions. Here is why.
Your code:
testreg <- glm(train$returnShipment~ train$size + train$color +
train$price + train$manufacturerID + train$salutation +
train$state + train$age + train$deliverytime,
family=binomial(link="logit"), data=train)
is a very bad way to specify your model formula. Terms such as train$returnShipment tie the variable lookup strictly to the data frame train, so you will have trouble later when predicting with other data sets, like test.
As a simple example of this drawback, we simulate some toy data and fit a GLM:
set.seed(0); y <- rnorm(50, 0, 1)
set.seed(0); a <- sample(letters[1:4], 50, replace = TRUE)
foo <- data.frame(y = y, a = factor(a))
toy <- glm(foo$y ~ foo$a, data = foo) ## bad style
> toy$formula
foo$y ~ foo$a
> toy$xlevels
$`foo$a`
[1] "a" "b" "c" "d"
Now, we see everything comes with a prefix foo$. During prediction:
newdata <- foo[1:2, ] ## take first 2 rows of "foo" as "newdata"
rm(foo) ## remove "foo" from R session
predict(toy, newdata)
we get an error:
Error in eval(expr, envir, enclos) : object 'foo' not found
Good style is to write the formula with bare variable names and supply the data frame through the data argument of the function:
foo <- data.frame(y = y, a = factor(a))
toy <- glm(y ~ a, data = foo)
then foo$ goes away.
> toy$formula
y ~ a
> toy$xlevels
$a
[1] "a" "b" "c" "d"
This would explain two things:
1. You mentioned in the comments that testreg$xlevels$manufacturerID returns NULL;
2. The prediction error you posted
Error in model.frame.default(Terms, newdata, na.action=na.action, xlev=object$xlevels):
Factor 'train$manufacturerID' has new levels 125, 136, 137
complains about train$manufacturerID instead of test$manufacturerID.
As you have divided your train and test samples based on row numbers, some factor levels of your variables are not equally represented in both the train and test samples.
You need to do stratified sampling to ensure that both the train and test samples contain every factor level. Use stratified from the splitstackshape package.
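For readers who prefer not to add a package, a stratified split can also be sketched in base R. This is an illustrative alternative rather than the splitstackshape implementation; the data frame d and its columns are made up. Each factor level contributes at least one row to the training half, so the test half cannot contain a level the model has never seen.

```r
set.seed(1)
# Toy data with an unbalanced factor.
d <- data.frame(
  id = 1:40,
  f  = factor(rep(c("a", "b", "c", "d"), times = c(20, 10, 6, 4)))
)

# Row indices grouped by factor level.
idx_by_level <- split(seq_len(nrow(d)), d$f)

# Sample (roughly) half of each level's rows for training.
# sample.int() on the group size avoids the sample(x) pitfall for length-1 groups.
train_idx <- unlist(lapply(idx_by_level, function(i) {
  i[sample.int(length(i), ceiling(length(i) / 2))]
}))

train <- d[train_idx, ]
test  <- d[-train_idx, ]

# Every level seen in test also appears in train.
all(unique(test$f) %in% unique(train$f))
```

With several factor variables, the same idea can be applied to their interaction, at the cost of smaller groups.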
