Running a GLM with a Poisson distribution with combined columns in R

Is it possible to run a GLM with a Poisson distribution on a response variable made of combined columns in R?
I am looking at the effects of species, cage density and the day that eggs are laid on how many eggs were laid and how many hatched, so I have linked the hatched and unhatched columns. My data are count data. The code works fine with family = binomial, but I want to test whether Poisson is a better model.
My code is as follows:
attach(EggV)
density <- as.factor(Density)
day <- as.factor(Day)
Y <- cbind (Hatched, Unhatched)
model.pois <- glm(Y ~ Species + density + day, data = EggV, family = poisson)
But once I run the code it gives me an error:
Error in x[good, , drop = FALSE] : (subscript) logical subscript too long
If I run the same code with only the variables "Hatched" or "Unhatched" it works but this is not sufficient for my data analysis.
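A likely reason for the error is that glm() with family = poisson expects a single vector of counts as the response; the two-column cbind(successes, failures) form is specific to the binomial family. The following is a minimal sketch, not a definitive answer, of one way to split the analysis: a Poisson model for the total number of eggs laid and a binomial model for the proportion hatched. The data frame egg_demo and the Laid column are made up for illustration; only Species, Density, Day, Hatched and Unhatched come from the question.

```r
# Simulated stand-in for EggV (hypothetical values, for illustration only)
set.seed(123)
egg_demo <- expand.grid(Species = c("A", "B"),
                        Density = c(10, 20, 40),
                        Day = 1:4)
# Convert inside the data frame instead of using attach()
egg_demo$Density <- factor(egg_demo$Density)
egg_demo$Day <- factor(egg_demo$Day)
egg_demo$Laid <- rpois(nrow(egg_demo), lambda = 30)
egg_demo$Hatched <- rbinom(nrow(egg_demo), size = egg_demo$Laid, prob = 0.7)
egg_demo$Unhatched <- egg_demo$Laid - egg_demo$Hatched

# Poisson GLM: the response is one count column (total eggs laid)
model_laid <- glm(Laid ~ Species + Density + Day,
                  data = egg_demo, family = poisson)

# Binomial GLM: the two-column response models the proportion hatched
model_hatch <- glm(cbind(Hatched, Unhatched) ~ Species + Density + Day,
                   data = egg_demo, family = binomial)

summary(model_laid)
summary(model_hatch)
```

Keeping the factor conversions inside the data frame also avoids the mix of attached and data = EggV variables in the original code.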

Related

Stratified Cox Model, Unusual Output For Stratified Factor Variable

I'm estimating a Cox PH model in R with time-varying stratification on a few of the variables. I'm using the following code to create the dataset and run the estimation
sepsis_working_fac_strat7 <- survSplit(Surv(LOS, facility) ~ ., data = sepsis_working, cut = 7, episode = "tgroup", id = "id")
cox_facility2 <- coxph(Surv(tstart, LOS, facility) ~ Age_10 + Gender + Insurance + LowInc + AbxBeforeCulture+ MorningDC+ AttendingAffilGrp+
CentralLine:strata(tgroup)+ Consults+ CountIVAbx:strata(tgroup)+ DischargeUnit+ FSSAdmit+ OralAbxBeforeDC+
OrderedToDischarge+ TimeToAbx+ TimeToBC+ UrineCulture+ Vasopressor+ Ventilator+
Behavioral_Dx + BradenGroup+ CCI+ Diabetes_Dx+ Dialysis+ PVD_Dx+ SUD_Dx, data = sepsis_working_fac_strat7, cluster = PersPersObjId)
In the output, there are four rows for the covariate "CentralLine" where there should only be two:
I have run the same estimation for other event types in the data and have not encountered this problem for CentralLine, which is a factor variable with two levels, where the base level is set to "No". Normally I see two rows for the strata, one for above and below the cut point. In the data, there are 184 events with a fairly even distribution between CentralLine == "Yes" and CentralLine == "No".
I'd like to know what is causing this output to appear as it does.
EDIT: This may be the issue: Why coxph() results some of the coefficient as NA when using survSplit() in R?
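One plausible cause, in line with the linked question, is how R codes a factor inside an interaction when the marginal terms are not in the formula: a two-level factor crossed with two strata can expand to four dummy columns, some of which are redundant. A commonly suggested workaround is to interact a numeric 0/1 indicator with the strata instead. The sketch below uses the veteran data that ships with the survival package rather than the sepsis data; prior_f and prior01 are names made up for this example, and the exact rows printed for the factor version may differ between survival versions.

```r
library(survival)

# veteran ships with survival; 'prior' is coded 0 (no) or 10 (yes)
vet <- veteran
vet$prior_f <- factor(vet$prior, levels = c(0, 10), labels = c("No", "Yes"))
vet$prior01 <- as.numeric(vet$prior_f == "Yes")  # numeric 0/1 indicator

# Split follow-up at day 90, giving two time groups
vet2 <- survSplit(Surv(time, status) ~ ., data = vet, cut = 90,
                  episode = "tgroup", id = "id")

# Factor interacted with the strata: may yield one row per level per stratum
fit_fac <- coxph(Surv(tstart, time, status) ~ trt + prior_f:strata(tgroup),
                 data = vet2)

# Numeric indicator interacted with the strata: one coefficient per stratum
fit_num <- coxph(Surv(tstart, time, status) ~ trt + prior01:strata(tgroup),
                 data = vet2)

names(coef(fit_fac))
names(coef(fit_num))
```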

Panel regression in R with variable coefficients

We have tried to do two (similar) panel regressions in R.
1) One with time and individual fixed effects (the usual intercept dummies) using plm(). However, we are mostly interested in a "slope coefficient" or beta for each individual, not a single one shared by all individuals:
Regression 1
Where alpha_i is the individual fixed effect and gamma_t is the time fixed effect. The sum of X is the variable X and three lags:
Sum X variable
We have already included the lagged X variables as new columns in our dataset so in our specification in the code we simply treat them as four different variables:
This is an attempt at using plm() and include our own dummy variables for each individual Beta
plm(income ~ (factor(firmid)-1)*(expense_rate + lag1 + lag2 + lag3), data = data1,
effect = c("time"), model = c("within"), index = c("name", "date"))
lag1, lag2 and lag3 are the lagged variables of expense_rate.
Data1 is in the form of a data frame.
“(factor(firmid)-1)” is an attempt at introducing dummies to get Betas for each individual instead of one for all individuals.
2) The second (and simpler) regression is:
Regression 2
This is an example of our attempt at using pvcm
pvcm1 <- pvcm(income ~ expense_rate + lag1 + lag2 + lag3, data = data1,
effect = "individual", model = "within")
Our question is which specific code, packages or functions would be suitable for these regressions. We have tried pvcm() to no avail, running into errors such as:
“Error in table(index[1], index[2], useNA = "ifany") :
attempt to make a table with >= 2^31 elements”
and
“Error: cannot allocate vector of size 599.7 Gb”
Furthermore, pvcm() does not seem to be able to cope with both individual and time fixed effects as in 1).
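Note that the pvcm() call above passes no index argument, in which case plm treats the first two columns of the data as the individual and time identifiers; if those are not name and date, that may explain the oversized table() error. As a rough base-R sketch of regression 1), individual slopes with both firm and time fixed effects can be written as an interaction between the firm factor and the regressor. The data here are simulated and every name other than income, expense_rate, firmid and date is an assumption; for a large number of firms the dummy-variable approach gets expensive.

```r
# Simulated balanced panel: 5 firms, 30 periods (illustrative values only)
set.seed(1)
n_firms <- 5
n_periods <- 30
panel <- expand.grid(date = seq_len(n_periods), firmid = seq_len(n_firms))
true_beta <- c(0.5, 1.0, 1.5, 2.0, 2.5)   # one slope per firm
firm_fe <- rnorm(n_firms)                 # alpha_i
time_fe <- rnorm(n_periods)               # gamma_t
panel$expense_rate <- rnorm(nrow(panel))
panel$income <- firm_fe[panel$firmid] + time_fe[panel$date] +
  true_beta[panel$firmid] * panel$expense_rate +
  rnorm(nrow(panel), sd = 0.1)

# Firm dummies + time dummies + a separate expense_rate slope for each firm.
# No common expense_rate term is included, so every firm gets its own beta.
fit <- lm(income ~ 0 + factor(firmid) + factor(date) +
            factor(firmid):expense_rate, data = panel)

# Extract the firm-specific slopes
slopes <- coef(fit)[grepl(":expense_rate$", names(coef(fit)))]
slopes
```

The lagged regressors from the question could be added the same way, for example factor(firmid):lag1.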

What is the correct way to use weights in a logistic regression in R?

My data comes from a survey of car buyers. It has a weight column that I used in SPSS to get sample sizes; the weights are driven by demographic factors and vehicle sales. I am now trying to build a logistic regression model for a car segment that includes a few vehicles. I want to use the weight column in the model, and I tried to do so through the weights argument of glm(). But the results are poor: the deviances are too high and McFadden's R-squared is too low. My dependent variable is binary and the independent variables are on a 1 to 5 scale. The weight column is numeric, ranging from 32 to 197. Could that be the reason the results are poor? Do the values in the weight column need to be below 1?
The format of the input file to R is:
WGT output I1 I2 I3 I4 I5
67 1 1 3 1 5 4
I1, I2, I3 being independent variables
logr<-glm(output~1,data=data1,weights=WGT,family="binomial")
logrstep<-step(logr,direction = "both",scope = formula(data1))
logr1<-glm(output~ (formula from final iteration),weights = WGT,data=data1,family="binomial")
hl <- hoslem.test(data1$output,fitted(logr1),g=10)
I want a logistic regression model with better accuracy, and I want to gain a better understanding of how to use weights with logistic regression.
I would check out the survey package. This will allow you to specify weights for the survey design using the svydesign function. Additionally, you can use the svyglm function to perform your weighted logistic regression. See http://r-survey.r-forge.r-project.org/survey/
Something like the following assuming your data is in a dataframe called df
my_svy <- svydesign(ids = ~1, weights = ~WGT, data = df)
Then you can do the following:
my_fit <- svyglm(output ~1, my_svy, family = "binomial")
For a full reprex check out the below example
library(survey)
# Generate Some Random Weights
mtcars$wts <- rnorm(nrow(mtcars), 50, 5)
# Make vs a factor just for illustrative purposes
mtcars$vs <- as.factor(mtcars$vs)
# Build the Complete survey Object
svy_df <- svydesign(data = mtcars, ids = ~1, weights = ~wts)
# Fit the logistic regression
fit <- svyglm(vs ~ gear + disp, svy_df, family = "binomial")
# Store the summary object
(fit_sumz <- summary(fit))
# Look at the AIC if desired
AIC(fit)
# Pull out the deviance if desired
fit_sumz$deviance
As for the stepwise regression, it typically isn't a great methodology from a statistical point of view. It yields R2 values that are biased upward and causes other problems for inference (see https://www.stata.com/support/faqs/statistics/stepwise-regression-problems/).
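On the question of whether the weights need to be below 1: in glm() the weights multiply each observation's contribution to the log-likelihood, so rescaling them by a constant changes the reported deviance but not the coefficients or the ratio of residual to null deviance. The sketch below illustrates this on simulated data (the data frame survey_demo, the column WGT_scaled and the data-generating values are made up); quasibinomial is used only to avoid the non-integer warning that binomial gives with fractional weights.

```r
# Simulated survey-like data (illustrative values only)
set.seed(7)
n <- 500
survey_demo <- data.frame(
  I1 = sample(1:5, n, replace = TRUE),
  I2 = sample(1:5, n, replace = TRUE),
  WGT = sample(32:197, n, replace = TRUE)
)
survey_demo$output <- rbinom(
  n, 1, plogis(-2 + 0.5 * survey_demo$I1 - 0.1 * survey_demo$I2)
)

# Rescale the weights so that they average 1
survey_demo$WGT_scaled <- survey_demo$WGT / mean(survey_demo$WGT)

fit_raw <- glm(output ~ I1 + I2, data = survey_demo,
               weights = WGT, family = quasibinomial)
fit_scaled <- glm(output ~ I1 + I2, data = survey_demo,
                  weights = WGT_scaled, family = quasibinomial)

# Same coefficients (up to numerical tolerance) ...
cbind(raw = coef(fit_raw), scaled = coef(fit_scaled))

# ... but the raw-weight deviance is larger by a factor of mean(WGT)
deviance(fit_raw) / deviance(fit_scaled)
mean(survey_demo$WGT)
```

So large weights inflate the deviance, but a low pseudo-R-squared would not be repaired by rescaling alone.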

Pooling sandwich variance estimator over multiply imputed datasets

I am running a Poisson regression on multiply imputed data to predict a common binary outcome. After running mice, I have obtained a stacked data frame comprising the raw data and five imputed datasets. Here is a toy example:
df <- mice::nhanes
imp <- mice(df) #impute data
com <- complete(imp, "long", TRUE) #creates data frame
I now want to:
Run the regression on each imputed dataset
Calculate robust standard errors using a sandwich variance estimator
Combine / pool the results of both analyses
I can run the regression on the mids object using the with and pool commands:
fit.pois.mids <- with(imp, glm(hyp ~ age + bmi + chl, family = poisson))
summary(pool(fit.pois.mids))
I can also run the regression on each of the imputed datasets before combining them:
# Create a list of data frames, one per dataset (raw data plus five imputations)
imp.df <- split(com, com$.imp)
names(imp.df) <- c("raw", "imp1", "imp2", "imp3", "imp4", "imp5")
fit.pois <- lapply(imp.df, function(x) {
fit <- glm(hyp ~ age + bmi + chl, data = x, family = poisson)
fit
})
summary(MIcombine(fit.pois))
Similarly, I can calculate the standard errors for each imputed dataset:
sand <- lapply(fit.pois, function(x) {
se <- coeftest(x, vcov = sandwich)
se
})
Unfortunately, MIcombine does not seem to return p-values. This post suggests using Zelig, but for that matter, I may as well just use mice. Further it does not appear to be possible to combine the estimates of the standard errors:
summary(MIcombine(sand))
Error in UseMethod("vcov") :
no applicable method for 'vcov' applied to an object of class "coeftest"
For the sake of simplicity, it seems that mice is a better option for pooling the results of the regression; however, I am wondering how I would go about updating (i.e., pooling and combining) the standard errors. What are some ways this could be addressed?
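One way to approach this is to pool the coefficient vectors and the robust covariance matrices directly with Rubin's rules, rather than pooling coeftest objects. The base-R sketch below is an illustration under stated assumptions, not a definitive implementation: the completed datasets are simulated with a crude fill-in instead of coming from mice, and hc0_poisson() and pool_rubin() are helper names made up here. The hand-written HC0 estimator is only valid for a Poisson GLM with log link and no prior weights. Note also that the "raw" dataset (.imp = 0) should be left out of the pooling.

```r
# Toy stand-in for m completed datasets; with mice they would come from
# lapply(seq_len(imp$m), function(i) mice::complete(imp, i))
set.seed(42)
n <- 200
base <- data.frame(x = rnorm(n))
base$y <- rbinom(n, 1, plogis(-0.5 + 0.4 * base$x))
miss <- sample(n, 40)                      # rows whose x is treated as missing
completed <- lapply(1:5, function(i) {
  d <- base
  d$x[miss] <- sample(base$x[-miss], length(miss), replace = TRUE)  # crude fill-in
  d
})

# Step 1: fit the Poisson model on each completed dataset
fits <- lapply(completed, function(d) glm(y ~ x, data = d, family = poisson))

# Step 2: HC0 sandwich covariance for a Poisson GLM with log link
hc0_poisson <- function(fit) {
  X <- model.matrix(fit)
  mu <- fitted(fit)
  bread <- solve(crossprod(X * sqrt(mu)))  # (X' diag(mu) X)^-1
  meat <- crossprod(X * (fit$y - mu))      # X' diag((y - mu)^2) X
  bread %*% meat %*% bread
}

# Step 3: Rubin's rules applied to estimates and robust variances
pool_rubin <- function(fits, vcov_fun) {
  m <- length(fits)
  est <- sapply(fits, coef)                                   # p x m matrix
  u_bar <- rowMeans(sapply(fits, function(f) diag(vcov_fun(f))))  # within
  b <- apply(est, 1, var)                                     # between
  total <- u_bar + (1 + 1 / m) * b
  q_bar <- rowMeans(est)
  se <- sqrt(total)
  df <- (m - 1) * (1 + u_bar / ((1 + 1 / m) * b))^2
  data.frame(estimate = q_bar, robust_se = se, df = df,
             p_value = 2 * pt(-abs(q_bar / se), df))
}

pooled <- pool_rubin(fits, hc0_poisson)
pooled
```

With the packages from the question, the same idea can be expressed as MIcombine(results = lapply(fits, coef), variances = lapply(fits, sandwich)), which uses the default method of mitools::MIcombine.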

Why doesn't predict like the dimensions of my newdata?

I want to perform a multiple regression in R and make predictions based on the trained model. Below is an example code I am using:
price = c(10,18,18,11,17)
predictors = cbind(c(5,6,3,4,5),c(2,1,8,5,6))
predict(lm(price ~ predictors), data.frame(predictors=matrix(c(3,5),nrow=1)))
So, based on the two-variable regression model trained on 5 samples, I want to make a prediction for the test data point where the first variable is 3 and the second is 5. But the code above gives a warning saying that 'newdata' had 1 rows but variable(s) found have 5 rows. How can I correct it? The code below works fine when I give the variables separately in the model formula. But since I will have hundreds of variables, I have to pass them as a matrix, because it would be infeasible to append hundreds of columns using the + sign.
price = c(10,18,18,11,17)
predictor1 = c(5,6,3,4,5)
predictor2 = c(2,1,8,5,6)
predict(lm(price ~ predictor1 + predictor2), data.frame(predictor1=3,predictor2=5))
Thanks in advance!
The easiest way to get past the issue of matching up variable names from a matrix of covariates to newdata data.frame column names is to put your input data into a data.frame as well. Try this
price = c(10,18,18,11,17)
predictors = cbind(c(5,6,3,4,5),c(2,1,8,5,6))
indata<-data.frame(price,predictors=predictors)
predict(lm(price ~ ., indata), data.frame(predictors=matrix(c(3,5),nrow=1)))
Here we combine price and predictors into a data.frame so that its columns are named the same way as in the newdata data.frame. We use the . in the formula to mean "all other columns" so we don't have to specify them explicitly.
You need to build the model first, then predict from it:
mod1 <- lm(price ~ predictor1 + predictor2)
predict( mod1 , data.frame(predictor1=3,predictor2=5))
