How does the weights argument in glm work in R?

I'm really puzzled by the weights argument in glm. I realise that this question has been asked before, but I'm still confused about what the weights argument does or how it works. For example, in the code below my dependent variable PCL_Sum2 is binary and highly imbalanced. I would like both levels to be equally weighted. How would I accomplish this?
Final_Frame.df <- read.csv("no_subset.csv")
Omitted_Nas.df <- na.omit(Final_Frame.df)
This yields 278 remaining observations. Then when I go ahead and perform the regression:
prelim_model<-glm(PCL_Sum2~Mean_social_combined +
Mean_traditional_time+
Mean_Passive_Use_Updated+
factor(Gender)+
factor(Ethnicity)+
factor(Age)+
factor(Location)+
factor(Income)+
factor(Education)+
factor(Working_Home)+
Perceived_Fin_Risk+
Anxiety_diagnosed+
Depression_diagnosed+
Lived_alone+
Mean_Active_Use_Updated, data = Omitted_Nas.df, weights = ???, family = binomial())
summary(prelim_model)
I've tried setting weights = c(0.5, 0.5), but I always get the following error:
Error in model.frame.default(formula = PCL_Sum2 ~ Mean_social_combined + : variable lengths differ (found for '(weights)')
Any help would be greatly appreciated!
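A minimal sketch of one way to do this (my suggestion, not part of the original question): glm() expects one weight per row of the data, not one per class, which is why a length-2 vector triggers the "variable lengths differ" error. Inverse class-frequency weights built with ifelse() give both levels equal total weight; the 0/1 coding of PCL_Sum2 is assumed here, and R will warn about non-integer weights with family = binomial().
Omitted_Nas.df <- na.omit(Final_Frame.df)
n1 <- sum(Omitted_Nas.df$PCL_Sum2 == 1)                        # count of 1s (assumed coding)
n0 <- sum(Omitted_Nas.df$PCL_Sum2 == 0)                        # count of 0s
w  <- ifelse(Omitted_Nas.df$PCL_Sum2 == 1, 0.5 / n1, 0.5 / n0)
w  <- w * nrow(Omitted_Nas.df)                                 # rescale so the weights average 1

prelim_model <- glm(PCL_Sum2 ~ Mean_social_combined + Mean_traditional_time +
                      Mean_Passive_Use_Updated,                # ... remaining covariates as above
                    data = Omitted_Nas.df, weights = w, family = binomial())
summary(prelim_model)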

Related

Can't run glm due to the following error: "variable lengths differ (found for 'data')"

I'm trying to run a regression using the glm function; however, I keep getting the same error message: "variable lengths differ (found for 'data')". I can't see how my data does not have the same length, as I use a sample of 1000 for both my dependent and independent variables. The reason I take a sample of my total data is that I have more than a million observations and I want to see if the model works properly. (Running it with all the data takes a very long time.) This is the code I use:
sample = sample(1:nrow(agg), 1000, replace = FALSE)
y=agg$TO_DEFAULT_IN_12M_INDICATOR[sample]
test <- glm(as.factor(y) ~., data = as.factor(agg[sample,]), family = binomial)
#coef(full.model)
Here agg contains all my data, and my y is an indicator function of 0's and 1's. Does anyone know how I could fix this problem?
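For what it's worth, a minimal sketch of one likely fix (my reading of the error, not from the thread): as.factor() applied to a whole data frame does not return a data frame, so glm() ends up pulling variables of different lengths. Subsetting agg directly and converting only the response avoids that; column names follow the question.
idx   <- sample(seq_len(nrow(agg)), 1000, replace = FALSE)
train <- agg[idx, ]
train$TO_DEFAULT_IN_12M_INDICATOR <- as.factor(train$TO_DEFAULT_IN_12M_INDICATOR)

test_model <- glm(TO_DEFAULT_IN_12M_INDICATOR ~ ., data = train, family = binomial)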

How do you compute average marginal effects for glm.cluster models?

I am looking for a way to compute average marginal effects with clustered standard errors, which I seem to be having a few problems with. My model is as follows:
cseLogit <- miceadds::glm.cluster(data = data_long,
formula = follow ~ f1_distance + f2_distance + PolFol + MediaFol,
cluster = "id",
family = binomial(link = "logit"))
Where the dependent variable is binary (0/1) and all explanatory variables are numeric. I've tried two different ways of getting average marginal effects. The first one is:
marginaleffects <- margins(cseLogit, vcov = your_matrix)
Which gives me the following error:
Error in find_data.default(model, parent.frame()) :
'find_data()' requires a formula call
I've also tried this:
marginaleffects <- with(cseLogit, margins(glm_res, vcov=vcov))
which gives me this error:
Error in eval(predvars, data, env) :
object 'f1_distance' was not found
In addition: Warning messages:
1: In dydx.default(X[[i]], ...) :
Class of variable, f1_distance, is unrecognized. Returning NA.
2: In dydx.default(X[[i]], ...) :
Class of variable, f2_distance, is unrecognized. Returning NA.
Can you tell me what I'm doing wrong? If I haven't provided enough information, please let me know. Thanks in advance.
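One workaround worth sketching (an assumption on my part, not from the thread): margins() cannot locate the data inside a glm.cluster object, but refitting the same model with plain glm() and passing a cluster-robust covariance matrix from the sandwich package gives average marginal effects with clustered standard errors.
library(margins)
library(sandwich)

fit <- glm(follow ~ f1_distance + f2_distance + PolFol + MediaFol,
           data = data_long, family = binomial(link = "logit"))
vc  <- vcovCL(fit, cluster = ~ id)   # same clustering variable as in glm.cluster
ame <- margins(fit, vcov = vc)       # AMEs using the clustered covariance matrix
summary(ame)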

How do I solve an error in x : non-conformable arguments (R)?

I am trying to implement ordinal logistic regression on my dataset in R. I use the function 'polr' for this, but cannot seem to find much information regarding its implementation.
The following errors are the ones I'm stuck on:
> dat.polr <- polr(as.factor(relevance)~allterms+idf.title, data=dat.one)
Warning message:
In polr(as.factor(relevance) ~ allterms + idf.title + idf.desc + :
design appears to be rank-deficient, so dropping some coefs
> dat.pred <- predict(dat.polr,dat.test,type="class")
Error in X %*% object$coefficients : non-conformable arguments
I want to train my model to guess the relevance of a new dataset. dat.one is the dataset I'm using to train the model and dat.test is the dataset I'm using to test it. I believe the error from predict is caused by the warning in polr. However, I have no clue how to resolve this. Any help would be appreciated :)
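A minimal sketch of the usual diagnosis (my suggestion, using the variable names from the question): the rank-deficiency warning means polr() dropped coefficients, so the coefficient vector no longer conforms to the model matrix that predict() builds from dat.test. Checking the design matrix for linearly dependent columns, and making sure dat.test has the same columns and factor levels as dat.one, usually resolves the non-conformable error.
library(MASS)

X <- model.matrix(~ allterms + idf.title, data = dat.one)
qr(X)$rank == ncol(X)   # FALSE means some columns are linearly dependent

dat.polr <- polr(as.factor(relevance) ~ allterms + idf.title, data = dat.one)
dat.test <- droplevels(dat.test)   # drop unused factor levels in the test data
dat.pred <- predict(dat.polr, newdata = dat.test, type = "class")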

Error in data.frame(data, source = namelist) : arguments imply differing number of rows: 3, 4 - in predict()

I'm trying to predict from a fitted gamlss model and have an annoying issue that I can't deal with.
Error in data.frame(data, source = namelist) :
arguments imply differing number of rows: 3, 4
Code & data
library('gamlss')
asfr=c(0.0000000000,0.0001818271,0.0001818271,0.0228344684,0.0228344684)
ages=c(12:16)
data=data.frame(y=asfr,x=ages)
model=gamlss(y~x,data=data,method=mixed(1,20))
test=data.frame(x=c(12,13,14))
predict(model,newdata=test, type = "response")
I searched for some similar issues, but answers using reshape2 didn't work.
Also, as an example I used the code on p. 89 here.
I had the same problem, and while adding the initial model data in the predict function sometimes helped, more often than not it didn't.
So I contacted Mikis Stasinopoulos, who as usual was very helpful. It turns out the problem is that the dataset I was using was called "data"; while that's fine for estimation, it isn't for prediction. Renaming the dataset "mydata" throughout solved the problem.
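A minimal sketch of that rename, applied to the example from this question (everything else unchanged):
library(gamlss)

asfr   <- c(0.0000000000, 0.0001818271, 0.0001818271, 0.0228344684, 0.0228344684)
ages   <- 12:16
mydata <- data.frame(y = asfr, x = ages)   # renamed: the data frame is no longer called "data"

model <- gamlss(y ~ x, data = mydata, method = mixed(1, 20))
test  <- data.frame(x = c(12, 13, 14))
predict(model, newdata = test, type = "response")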
I had the same error fitting a BEOI family and trying to predict it in gamlss. I do not know why, but adding the data source of my initial model to the predict call helped resolve it. Hope it helps!
predy <- predict(mod, what= "mu", newdata= data.frame(x= predx), type= "response", data= data)

Error with ZIP and ZINB models upon subsetting and factoring data

I am trying to run ZIP and ZINB models to look at some of the factors which might help explain disease (orf) distribution within 8 geographical regions. The models work fine for some regions but not others. However, upon factoring and running the model in R, I get the error message below.
How can I solve this problem, or is there a model that might work better with this subset? The analysis will only make sense when it's uniform across all regions.
zinb3 = zeroinfl(Cases2012 ~ Precip+ Altitude +factor(Breed)+ factor(Farming.Practise)+factor(Lambing.Management)+ factor(Thistles) ,data=orf3, dist="negbin",link="logit")
Error in solve.default(as.matrix(fit$hessian)) :
system is computationally singular: reciprocal condition number = 2.99934e-24
Results after fitting zerotrunc and glm as suggested by @Achim Zeileis: how do I interpret the zerotrunc output given that there are no p values? Also, how can I correct the error with glm?
zerotrunc(Cases2012 ~ Flock2012+Stocking.Density2012+ Precip+ Altitude +factor(Breed)+ factor(Farming.Practise)+factor(Lambing.Management)+ factor(Thistles),data=orf1, subset = Cases2012> 0)
Call:
zerotrunc(formula = Cases2012 ~ Flock2012 + Stocking.Density2012 + Precip + Altitude +
factor(Breed) + factor(Farming.Practise) + factor(Lambing.Management) + factor(Thistles),
data = orf1, subset = Cases2012 > 0)
Coefficients (truncated poisson with log link):
(Intercept) Flock2012 Stocking.Density2012
14.1427130 -0.0001318 -0.0871504
Precip Altitude factor(Breed)2
-0.1467075 -0.0115919 -3.2138767
factor(Farming.Practise)2 factor(Lambing.Management)2 factor(Thistles)3
1.3699477 -2.9790725 2.0403543
factor(Thistles)4
0.8685876
glm(factor(Cases2012 ~ 0) ~ Precip+ Altitude +factor(Breed)+ factor(Farming.Practise)+factor(Lambing.Management)+ factor(Thistles) +Flock2012+Stocking.Density2012 ,data=orf1, family = binomial)
Error in unique.default(x, nmax = nmax) :
unique() applies only to vectors
It's hard to say exactly what is going on based on the information provided. However, I would suspect that the data in some regions does not allow fitting the model specified. For example, there might be some regions where certain factor levels (of Breed, Farming.Practise, Lambing.Management or Thistles) only have zero values (or only non-zero values, but that is less frequent in practice). Then the coefficient estimates often degenerate so that the associated zero-inflation probability goes to 1 and the count coefficient cannot be estimated.
It's typically easier to separate these effects by using the hurdle rather than the zero-inflation model. Then the two parts of the model can also be fitted separately by glm(factor(y > 0) ~ ..., ..., family = binomial) and zerotrunc(y ~ ..., ..., subset = y > 0). The latter function is essentially the same code as pscl uses, but it has been factored into a standalone function in the countreg package on R-Forge (not yet on CRAN).
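A minimal sketch of that two-part fit, assuming the orf1 data frame and variables from the question. Note that the binary part uses factor(Cases2012 > 0); factor(Cases2012 ~ 0) in the question passes a formula to factor(), which is what triggers the "unique() applies only to vectors" error.
library(countreg)   # provides zerotrunc(); installed from R-Forge, not yet on CRAN

# Binary hurdle part: does a farm have any cases at all?
zero_part <- glm(factor(Cases2012 > 0) ~ Precip + Altitude + factor(Breed) +
                   factor(Farming.Practise) + factor(Lambing.Management) +
                   factor(Thistles) + Flock2012 + Stocking.Density2012,
                 data = orf1, family = binomial)

# Count part: how many cases, given that there is at least one
count_part <- zerotrunc(Cases2012 ~ Precip + Altitude + factor(Breed) +
                          factor(Farming.Practise) + factor(Lambing.Management) +
                          factor(Thistles) + Flock2012 + Stocking.Density2012,
                        data = orf1, subset = Cases2012 > 0, dist = "negbin")

summary(zero_part)    # Wald z tests and p values for the hurdle part
summary(count_part)   # p values for the truncated count coefficients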
