Why doesn't predict like the dimensions of my newdata? - r

I want to perform a multiple regression in R and make predictions based on the trained model. Below is the example code I am using:
price = c(10,18,18,11,17)
predictors = cbind(c(5,6,3,4,5),c(2,1,8,5,6))
predict(lm(price ~ predictors), data.frame(predictors=matrix(c(3,5),nrow=1)))
So, based on the 2-variate regression model trained on 5 samples, I want to make a prediction for the test data point where the first variate is 3 and the second variate is 5. But I get a warning from the code above saying that 'newdata' had 1 rows but variable(s) found have 5 rows. How can I correct the code above?
The code below works fine, where I give the variables separately in the model formula. But since I will have hundreds of variates, I have to pass them as a matrix, since it would be infeasible to append hundreds of columns using the + sign.
price = c(10,18,18,11,17)
predictor1 = c(5,6,3,4,5)
predictor2 = c(2,1,8,5,6)
predict(lm(price ~ predictor1 + predictor2), data.frame(predictor1=3,predictor2=5))
Thanks in advance!

The easiest way to get past the issue of matching up variable names from a matrix of covariates to newdata data.frame column names is to put your input data into a data.frame as well. Try this:
price = c(10,18,18,11,17)
predictors = cbind(c(5,6,3,4,5),c(2,1,8,5,6))
indata<-data.frame(price,predictors=predictors)
predict(lm(price ~ ., indata), data.frame(predictors=matrix(c(3,5),nrow=1)))
Here we combine price and predictors into a data.frame so that its columns are named the same way as in the newdata data.frame. We use the . in the formula to mean "all other columns" so we don't have to specify them explicitly.
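To see why this works, it can help to compare the column names on both sides. The sketch below reuses the toy data from above; the names fit, newdata, pred and pred_by_hand are only illustrative. It also checks that the result agrees with the two-variable formula from the question.

```r
price <- c(10, 18, 18, 11, 17)
predictors <- cbind(c(5, 6, 3, 4, 5), c(2, 1, 8, 5, 6))

# A matrix passed to data.frame() under the name `predictors` is expanded
# into the columns predictors.1, predictors.2, ...
indata  <- data.frame(price, predictors = predictors)
newdata <- data.frame(predictors = matrix(c(3, 5), nrow = 1))
names(indata)   # "price" "predictors.1" "predictors.2"
names(newdata)  # "predictors.1" "predictors.2"

fit  <- lm(price ~ ., data = indata)
pred <- predict(fit, newdata)   # one prediction for the one new row

# The same model written out by hand, for comparison
predictor1 <- predictors[, 1]
predictor2 <- predictors[, 2]
pred_by_hand <- predict(lm(price ~ predictor1 + predictor2),
                        data.frame(predictor1 = 3, predictor2 = 5))
all.equal(unname(pred), unname(pred_by_hand))
```

Because the names are generated from the matrix, the same two lines should work unchanged for a matrix with hundreds of columns.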

Need to build the model first, then predict from it:
mod1 <- lm(price ~ predictor1 + predictor2)
predict( mod1 , data.frame(predictor1=3,predictor2=5))

Related

How to use data frames to conduct ANOVAs in R

I am currently learning R and am playing around with a dataset that has four nominal variables (Hour.Of.Arrival, Mode, Unit, Weekday), and a continuous dependent variable (Overall). This is all imported from a .csv in a data frame named basic. What I am trying to do is run an ANOVA just using this data frame, without creating separate vectors (e.g. Mode<-basic$Mode). "Fit" holds the results of the ANOVA. Here is the code that I wrote:
Fit<-aov(basic["Overall"],basic["Unit"],data=basic)
However, I keep getting the error:
Error in terms.default(formula, "Error", data = data) : no terms component nor attribute
I hope this question isn't too basic!!
Thanks :)
I think you want something more like Fit<-aov(Overall ~ Unit,data=basic). The formula Overall ~ Unit tells R to treat Overall as an outcome being predicted by Unit, and data=basic tells it which data frame to look in for these variables.
Here's an example to show you how it works:
> y <- rnorm(100)
> x <- factor(rep(c('A', 'B', 'C', 'D'), each = 25))
> dat <- data.frame(x, y)
> aov(y ~ x, data = dat)
Call:
   aov(formula = y ~ x, data = dat)
Terms:
                      x Residuals
Sum of Squares  2.72218 114.54631
Deg. of Freedom       3        96
Residual standard error: 1.092333
Estimated effects may be unbalanced
Note that you don't need to use the data argument; you could also use aov(dat$y ~ dat$x). Either way, the first argument to the function should be a formula.
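The same pattern carries over to the data frame from the question. The sketch below simulates a stand-in for basic (the real CSV is not available, so the values and factor levels are made up) and passes column names in a formula instead of basic["Overall"].

```r
set.seed(123)
# Simulated stand-in for the question's `basic` data frame (values are made up)
basic <- data.frame(
  Unit    = factor(rep(c("U1", "U2", "U3"), each = 20)),
  Mode    = factor(rep(c("Car", "Bus"), times = 30)),
  Overall = rnorm(60, mean = 50, sd = 10)
)

# basic["Overall"] is a one-column data frame, not a formula
class(basic["Overall"])

# One-way and two-way fits that refer to the columns by name
Fit  <- aov(Overall ~ Unit, data = basic)
Fit2 <- aov(Overall ~ Unit + Mode, data = basic)
summary(Fit2)
```

No separate vectors such as Mode<-basic$Mode are needed; the data argument tells aov() where to look up the names used in the formula.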

R: Using a variable with less observations in a regression (plm)

I have been trying to deal with this for a while now with no luck. Essentially, what I am doing is a two-stage least squares on some panel data. To do this I am using the plm package. What I want to do is
1. Do a 2SLS.
2. Get the residuals from the 2SLS in step 1.
3. Use these residuals as an instrument in a different 2SLS.
The issue I have is that in the first 2SLS the number of observations used is less than the total observations in the dataset, so my residuals vector is short and I get the following error
Error in model.frame.default(terms(formula, lhs = lhs, rhs = rhs, data = data, :
variable lengths differ (found for 'ivreg.2.a$residuals')
Here is the code I am trying to run for reference, let me know if you need any more details. I really just need my residual vector to be the same length as the data used in the first 2SLS. For reference my data has 1713 observations, however, only 1550 get used in the regression and as a result my residuals vector is length 1550. My code for the two 2SLS regressions is below.
ivreg.2.a = plm(formula = diff(loda) ~ factor(year)+diff(lgdp) | index_g_l + diff(lcru_l) + diff(lcru_l_sq) + factor(year), index = c("country", "year"), model = "within", data = panel[complete.cases(panel[, c(1,2,3,4,5,7)]),])
ivreg.2.a = plm(formula = diff(lgdp) ~ factor(year)+index_g_l + diff(lcru_l) + diff(lcru_l_sq) + diff(loda)| index_g_l + diff(lcru_l) + diff(lcru_l_sq) + factor(year) + ivreg.2.a$residuals, index = c("country", "year"), model = "within", data = panel[complete.cases(panel[, c(1,2,3,4,5,7)]),])
Let me know if you need anything else.
I assume the 163 observations are dropped because they have NA in one of the relevant variables. Most *lm functions in R have an na.action argument, which can be used to pad the residuals to the correct length. E.g., when observation 3 is missing,
residuals(lm(formula, data, na.action=na.omit)) # 1 2 4
residuals(lm(formula, data, na.action=na.exclude)) # 1 2 NA 4
Documentation of plm, however, says that this argument is "currently not fully supported", so it would be simpler if you just filter those 1550 rows to a new dataframe first, and do all subsequent work on that.
BTW, if plm behaves like lm, you shouldn't need to specify complete.cases for it to work, as it should just skip anything with NAs.
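Here is a small runnable version of both suggestions. It is a sketch that uses lm rather than plm (since plm's na.action support is described above as incomplete), and the data frame dat and its values are made up for illustration.

```r
# Toy data with a missing predictor in row 3 (values are made up)
dat <- data.frame(x = c(1, 2, NA, 4, 5),
                  y = c(2.1, 3.9, 6.2, 8.1, 9.8))

res_omit <- residuals(lm(y ~ x, data = dat, na.action = na.omit))
res_excl <- residuals(lm(y ~ x, data = dat, na.action = na.exclude))

length(res_omit)  # 4: row 3 is dropped
length(res_excl)  # 5: row 3 is kept as NA

# With na.exclude the residuals line up with the original rows
dat$res <- res_excl

# Alternative: filter to complete rows first and work only on that subset
used <- dat[complete.cases(dat[, c("x", "y")]), ]
used$res <- residuals(lm(y ~ x, data = used))
```

With the filtered approach, every later model is fitted on used, so the residual column and the data always have the same number of rows.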

use cox model to estimate survival

I first establish a cox model in R:
test1<- test[1:20,]
model.1 <- coxph(Surv(test1$days,test1$status==1) ~ test1$MTT+test1$ADC,data=test1)
and when I tried to predict the next patient's survival like this:
covs1 <- data.frame(test[21,]$MTT,test[21,]$ADC)
summary(survfit(model.1, newdata= covs1, type ="aalen"))
it gave me too many survival results, and the warning is
"'newdata' had 1 row but variables found have 20 rows"
FYI, there are 20 events and the output contains 20 survival results.
The data frame given as the basis for a prediction must have the same column names as the variables on the RHS of the model formula. I don't think yours will qualify unless you do something like this:
test1<- test[1:20,]
model.1 <- coxph( Surv(days, status==1) ~ MTT + ADC, data=test1)
covs1 <- test[21, c("MTT", "ADC")]
# then do your prediction
You should not use $ to supply arguments to Surv. It is important that the model be constructed in the environment of the dataframe.
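A complete version of this pattern might look like the sketch below. Since the question's test data is not available, it uses the lung dataset that ships with the survival package as a stand-in, so the column names (time, status, age, sex) and the 200/1 split are assumptions made for illustration only.

```r
library(survival)

# `lung` ships with the survival package; status == 2 marks a death there
train <- lung[1:200, ]
fit <- coxph(Surv(time, status == 2) ~ age + sex, data = train)

# newdata carries the same column names as the right-hand side of the formula
new_patient <- lung[201, c("age", "sex")]
sf <- survfit(fit, newdata = new_patient)

NCOL(sf$surv)  # 1: a single survival curve for the single new row
summary(sf, times = c(100, 200, 300))
```

Because the formula uses bare column names, survfit() can match them against new_patient instead of falling back to the 200 training rows.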

Running a GLM with Poisson distribution with combined columns In R

Is it possible to run a GLM with a poisson distribution with a variable that has combined columns in R?
I am looking at the effects of different species, the cage density and the day that eggs are laid on how many eggs were laid and how many hatched, so I have linked the hatched and unhatched columns. My data are count data. The code works ok with family = binomial but I want to test if poisson is a better model.
My code is as follows:
attach(EggV)
density <- as.factor(Density)
day <- as.factor(Day)
Y <- cbind (Hatched, Unhatched)
model.pois <- glm(Y ~ Species + density + day, data = EggV, family = poisson)
But once I run the code it gives me an error:
Error in x[good, , drop = FALSE] : (subscript) logical subscript too long
If I run the same code with only the variables "Hatched" or "Unhatched" it works but this is not sufficient for my data analysis.
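A possible reading of the error, offered as a sketch rather than a confirmed answer: a two-column cbind(successes, failures) response is a convention of the binomial family, whereas a Poisson GLM expects a single column of counts. The example below uses simulated data in place of EggV (all values are made up) and a hypothetical Total column to show one form each family accepts.

```r
set.seed(42)
# Simulated stand-in for EggV (all values are made up)
EggV <- data.frame(
  Species   = factor(rep(c("A", "B"), each = 10)),
  Density   = factor(rep(c(10, 20), times = 10)),
  Day       = factor(rep(1:5, times = 4)),
  Hatched   = rpois(20, 8),
  Unhatched = rpois(20, 3)
)

# Binomial: two-column response models the proportion of eggs that hatched
model.binom <- glm(cbind(Hatched, Unhatched) ~ Species + Density + Day,
                   data = EggV, family = binomial)

# Poisson: single count response, here the total number of eggs laid
# (Total is a hypothetical helper column, not part of the original data)
EggV$Total <- EggV$Hatched + EggV$Unhatched
model.pois <- glm(Total ~ Species + Density + Day,
                  data = EggV, family = poisson)
```

Note that these two models answer different questions (hatching proportion versus number of eggs), so they are not directly competing fits for the same response. Keeping the factors inside the data frame also avoids the need for attach().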

How to update a glm model that contains NA's after fitting? error Number of observation not equal

I have a dataset that contains some missing values (on independent variables). I’m fitting a glm model :
f.model=glm(data = data, formula = y~x1 +x2, "binomial", na.action =na.omit )
After this model I want the ‘null’ model , so I used update:
n.model=update(f.model, . ~ 1)
This seems to work, but the number of observations in both models differ (f.model n=234; n.model n=235). So when I try to estimate a likelihood ratio I get an error: Number of observation not equal!!.
Q: How to update the model so that it accounts for the missing values?
Although it is a bit strange that na.action = na.omit did not solve the NA problem, I decided to filter the data instead.
library(epicalc) # for lrtest
vars=c("y", "x1", "x2") #variables in the model
n.data=na.omit(data[,vars]) #keep only rows that are complete on the model variables
f.model=glm(data = n.data, formula = y~x1 +x2, family = binomial)
n.model=update(f.model, . ~ 1)
LR= lrtest(n.model,f.model)
If someone has a better solution, or an explanation of why na.action in combination with update results in unequal numbers of observations, your answer is more than welcome!
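One possible explanation, offered as a sketch: update() refits the model on the original data, and because the null formula y ~ 1 no longer mentions x1 or x2, na.omit has no reason to drop rows that are missing only in those predictors. Refitting the null model on model.frame(f.model) keeps both fits on the same rows. The data frame dat is simulated (made-up values), and base R's anova(..., test = "Chisq") stands in for lrtest here.

```r
set.seed(1)
# Simulated data with one missing predictor value (values are made up)
dat <- data.frame(y = rbinom(50, 1, 0.5), x1 = rnorm(50), x2 = rnorm(50))
dat$x1[3] <- NA

f.model <- glm(y ~ x1 + x2, data = dat, family = binomial, na.action = na.omit)

# Plain update: y has no NA, so all 50 rows are used for the null model
n.model.bad <- update(f.model, . ~ 1)

# Refit on the rows the full model actually used
n.model <- update(f.model, . ~ 1, data = model.frame(f.model))

nobs(f.model)      # 49
nobs(n.model.bad)  # 50
nobs(n.model)      # 49

# Likelihood ratio test on matching observations
anova(n.model, f.model, test = "Chisq")
```

This keeps the original data untouched and works without listing the model variables by hand.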
