I am still new to R and still struggling. I am trying to fit a logistic regression using categorical and continuous variables, and I am supposed to select the right variables for my model. There are 27 variables and 8,000 observations.
I have gone through a couple of articles online, including stepwise regression by AIC, and I only confuse myself more. I was also told to select my variables from the correlation matrix, but when I compute the correlations I cannot get them, especially for the categorical variables. I also tried fitting the full model, and I get some variables with p-values less than 0.05. This is the code:
d4 <- d3[,c('SW','MOI','YOI','DOI_CMC','RMOB','RYOB','RDOB_CMC',
'RCA','Region','TPR','DPR','NV','HEL','Has_Radio','Has_TV',
'Religion','WI','MOFB','YOB','DOB_CMC','DOFB_CMC','AOR','MTFBI',
'DSOUOM_CMC','RW','RH','RBMI')]
d5 <- cor(d4)   # correlation matrix of the selected columns
round(d5, 2)    # rounded for easier reading
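cor() only accepts numeric input, which is why the categorical columns (e.g. Region or Religion, if they are stored as factors) break the correlation matrix. A quick numeric-only screen looks like this (a sketch, not a full variable-selection method):

num_cols <- sapply(d4, is.numeric)   # keep only the numeric columns
round(cor(d4[, num_cols], use = "pairwise.complete.obs"), 2)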
When I select the significant variables and try to apply logistic regression, all the p-values come out between 0.9 and 1. See the code:
# Logistic regression needs glm() with family = "binomial"; lm() has no
# family argument. TPR is the response, so it is dropped from the right-hand
# side, and the fit is stored in a new object instead of overwriting d3.
fit <- glm(TPR ~ SW + MOI + RMOB + RYOB + RCA + Region + DPR +
             NV + HEL + Has_Radio + Has_TV + Religion + WI + MOFB +
             YOB + DOB_CMC + DOFB_CMC + AOR + MTFBI + DSOUOM_CMC +
             RW + RH + RBMI,
           data = d3, family = "binomial")
summary(fit)
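Since stepwise regression by AIC came up above, here is a minimal sketch using base R's step() on the full fit (an automatic screen only, not a substitute for substantive variable choice):

fit_step <- step(fit, direction = "both", trace = FALSE)  # AIC-based search
summary(fit_step)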
I need help with this please!!
Here is a sample of d3:
I'm building a structural equation model that incorporates 4 latent variables: physical lifestyle, social lifestyle, trauma score, and the DV (well-being).
We have a 7-question survey of just well-being, but I think it would be more sound (less measurement error) to combine three surveys of well-being, depression, and anxiety into a single latent dependent variable. I received a warning that the covariance matrix was not positive definite when using just the scaled scores from the surveys, so I decided to incorporate the individual questions from the surveys themselves. However, when I do this and then look at the modification indices, the output suggests that the residuals are not currently correlated. I thought correlated residuals were the default for any latent variable, which is why I am wondering whether I am specifying the well-being latent variable correctly (whether it is just a matter of adding in all the questions that will ultimately comprise this latent variable).
Below is the entire model. The latent variable "well-being" (wb) currently only has questions from the PHQ-9 depression survey and the GAD-7 general anxiety survey (I will also be adding in the well-being survey). I've added the output for the modification indices below that.
I've included some data here: https://drive.google.com/file/d/1AX50DFNik30Qsyiyp6XnPMETNfVXK83r/view?usp=sharing
Thanks much!
fit.latent_wb <- '
#factor loadings; measurement model portion
pl =~ exercisescore + mindfulnessscore + promistscore
sl =~ family_support + friendshipcount + friendshipnet +
sense_of_community + sesscore + ethnicity
trauma =~ neglectscore + abusescore + exposure + family_support + age
wb =~ phq9_1 + phq9_2 + phq9_3 + phq9_4 + phq9_5 + phq9_6 +
phq9_7 + phq9_8 + phq9_9 + gad7_1 + gad7_2 + gad7_3 + gad7_4 +
gad7_5 + gad7_6 + gad7_7
#regressions: structural model
wb ~ age + gender + ethnicity + sesscore + resiliencescore +
pl + emotionalsupportscore + trauma
resiliencescore ~ age + sesscore + emotionalsupportscore + sl
emotionalsupportscore ~ sl + gender
friendshipnet~~age
exercisescore~~sense_of_community
'
fit.latent_wb <- sem(fit.latent_wb, data = total, meanstructure = TRUE, std.lv = TRUE)
summary(fit.latent_wb, fit.measures = TRUE, standardized = TRUE, rsquare = TRUE, estimates = FALSE)
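For what it's worth, lavaan fixes residual covariances between indicators to zero by default (they are not freed automatically), which is why they appear in the modification indices. A minimal sketch for pulling out the suggested residual covariances:

mi <- modindices(fit.latent_wb)              # one row per candidate parameter
mi <- mi[order(mi$mi, decreasing = TRUE), ]  # largest indices first
head(subset(mi, op == "~~"), 10)             # ~~ rows are residual covariances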
(Output for the modification indices was posted as an image and is not reproduced here.)
I am running a GAM based on a large dataset with many variables. My response variable is the level of "recruitment" in a herd each fall/autumn, calculated as the fawn:female ratio each fall over a 60-year period.
My problem is that in many years and study sites only between 1 and 10 females are recorded, so the ratio is not robust. For example, if one female and one fawn are seen, recruitment is 100%, but if one more female is seen, it drops to 50%!
I need to tell the model that years/study sites with smaller sample sizes should be weighted less than those with larger sample sizes, since these small samples are no doubt affecting the results.
(A table of the females observed every year and a histogram of the same were attached here.)
My model is as follows:
gamFIN <- gam(Fw.FratioFall
~ s(year)
+ s(percentage_woody_coverage)
+ s(kmRoads.km2)
+ s(WELLS_ACTIVEinsideD)
+ s(d3)
+ s(WT_DEER_springsurveys)
+ s(BadlandsCoyote.1000_mi)
+ s(Average_mintemp_winter, BadlandsCoyote.1000_mi)
+ s(BadlandsCoyote.1000_mi, WELLS_ACTIVEinsideD)
+ s(BadlandsCoyote.1000_mi, d3)
+ s(YEAR, bs = "re") + s(StudyArea, bs = "re"), method = "REML", select = T, data = mydata)
How might I tell the model to weight my response variable by the sample sizes it is based on?
Do not model this as a ratio for your outcome. Instead, model the fawn counts as your outcome and bring the female counts in via an offset() term, using logged values, on the RHS of the formula. That is, you offset with the log of the female counts. So the formula would look like this:
Fawns
~ s(year)
+ all_those_smooth_terms
+ offset(lnFemale_counts)
With a log link, the linear predictor models the log of the expected count, which is why the female counts enter the formula as logs.
Edit (Gavin is correct: the default family for gam() is gaussian with an identity link, not a log link, so the Poisson family has to be specified explicitly):
gamFIN <- gam(FawnFall ~ s(year) + s(percentage_woody_coverage) + s(kmRoads.km2) +
                s(WELLS_ACTIVEinsideD) + s(d3) + s(WT_DEER_springsurveys) +
                s(BadlandsCoyote.1000_mi) + s(Average_mintemp_winter, BadlandsCoyote.1000_mi) +
                s(BadlandsCoyote.1000_mi, WELLS_ACTIVEinsideD) + s(BadlandsCoyote.1000_mi, d3) +
                s(YEAR, bs = "re") + s(StudyArea, bs = "re") + offset(log(FemaleFall)),
              family = "poisson", method = "REML", select = TRUE, data = mydata)
First of all, I am relatively new to R and haven't used lavaan (or growth models) before, so please excuse my ignorance.
I am writing my thesis, analyzing the U.S. financial industry during the financial crisis of 2007. I have individual banks and several variables for each bank across time (2007-2013); some are time-variant (such as ROA or capital adequacy) and some are time-invariant (such as size or age). Some variables are also time-variant but not multilevel, since they apply to all firms (such as the average ROA of the U.S. financial industry).
First of all, can I use lavaan's growth curve model ("growth") in this instance? The example given in the tutorial has either time-varying variables (c) that influence the outcome (DV) or time-invariant variables (x1 & x2) that influence the slope (s) and intercept (i). What about time-varying variables that influence the slope and intercept? I couldn't find an example of this syntax.
Also, how do I specify the "groups" (i.e., the different banks) in my analysis? Is it actually possible to fit a multilevel growth curve model in lavaan (or R, for that matter)?
Last but not least, I could not find how to import a multilevel dataset into R. My dataset is basically a 3-dimensional array (different variables for different firms across time), so how do I input that via SPSS (or Notepad)?
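On the data layout: lavaan's growth() expects wide data, one row per bank and one column per variable-year combination, so the "3-dimensional" structure has to be flattened first. A sketch with tidyr (all column names hypothetical):

library(tidyr)
# long_data has one row per bank-year: bank, year, ROA, cap_adeq, ...
wide <- pivot_wider(long_data,
                    id_cols     = bank,
                    names_from  = year,
                    values_from = c(ROA, cap_adeq))
# yields one row per bank with columns ROA_2007, ..., ROA_2013, cap_adeq_2007, ...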
Any help is much appreciated, I am basically lost on how to implement this model and sincerely need some assistance...
Thank you all in advance for your time!
Harry
edit: Here is the syntax that I have come up with so far. Do you think it makes sense?
ETHthesismodel <- '
# intercept and slope with fixed coefficients
i =~ 1*t1 + 1*t2 + ... + 1*t7
s =~ 0*t1 + 1*t2 + ... + 6*t7
#regressions (independent variables that influence the slope & intercept)
i ~ high_constr_2007 + high_constr_2008 + ... + low_constr_2007 + low_constr_2008 + ... + diff_2013
s ~ high_constr_2007 + high_constr_2008 + ... + low_constr_2007 + low_constr_2008 + ... + diff_2013
# time-varying covariates (control variables)
t1 ~ size_2007 + cap_adeq_2007 + brand_2007 +... + acquisitions_2007
t2 ~ size_2008 + cap_adeq_2008 + brand_2008 + ... + acquisitions_2008
...
t7 ~ size_2013 + cap_adeq_2013 + brand_2013 + ... + acquisitions_2013
'
fit <- growth(ETHthesismodel, data = inputdata,
group = "bank")
summary(fit)
Can anyone help me implement a ROC curve for a Bayesian logistic regression? I have been trying DPpackage, but I cannot get it to work.
The two models I want to compare using ROC curves are shown below:
bayes_mod <- MCMClogit(Default ~ ACTIVITY + CIF + MAN + STA + PIA + COL +
                         CurrLiq + DebtCov + GDPgr,
                       data = mydata, burnin = 500000, mcmc = 10000, tune = 0.6,
                       b0 = coef(mylogit.reduced), B0 = information2,
                       subset = c(-1772, -2064, -655))
bayes_mod1 <- MCMClogit(Default ~ ACTIVITY + CIF + MAN + STA + PIA + COL +
                          CurrLiq + DebtCov + GDPgr,
                        data = mydata, burnin = 500000, mcmc = 10000, tune = 0.6,
                        subset = c(-1772, -2064, -655))
where Default ~ ACTIVITY + CIF + MAN + STA + PIA + COL + CurrLiq + DebtCov + GDPgr is the model formula; mydata is the dataset; mylogit.reduced is a (non-Bayesian) logistic regression estimated beforehand; B0 is the prior precision matrix; and subset = c(-1772, -2064, -655) removes those observations.
I don't know this package well, but it probably provides a predict() function (I just can't find documentation confirming one exists for MCMClogit models). You could then pass the predictions to a ROC function such as roc() from pROC:
library(pROC)
# predict() is called on the fitted model object, not on the data frame:
predictions <- predict(bayes_mod, newdata = mytestdata)
roc(mytestdata$Default, predictions)
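If it turns out there is no predict() method for MCMClogit objects (MCMCpack returns the posterior draws as a plain mcmc matrix), the predicted probabilities can be computed by hand from the posterior means. A sketch, assuming mytestdata contains the same covariates as mydata:

# Design matrix for the test data (first column is the intercept):
X <- model.matrix(~ ACTIVITY + CIF + MAN + STA + PIA + COL +
                    CurrLiq + DebtCov + GDPgr, data = mytestdata)
beta_hat <- colMeans(as.matrix(bayes_mod))   # posterior mean coefficients
pred <- as.vector(plogis(X %*% beta_hat))    # inverse logit gives probabilities
roc(mytestdata$Default, pred)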
We are trying to reproduce in R the results of a model that has been coded in SAS. The model looks as follows: ln(Duration) = X'B + S*e, where X is the matrix of 10 independent variables, B is a vector of coefficients, S is the scale parameter, and e is the error term.
The data set we use is here
There you can find the SAS code as well.
The first try looked as follows:
Dur <- survreg(Surv(Duration, Censor == 0) ~ Acq_Expense + Acq_Expense_SQ +
                 Ret_Expense + Ret_Expense_SQ + Crossbuy + Frequency +
                 Frequency_SQ + Industry + Revenue + Employees,
               dist = "weibull", data = daten[daten$Acquisition == 1, ])
summary(Dur)
But the coefficients in this model are not correct. On the following picture you see the R output on the left and the correct SAS output on the right:
We detected a problem with the squared terms (Acq_Expense_SQ, Ret_Expense_SQ), because when we exclude those terms all other estimates come much closer to the correct values. Therefore, we tried scaling the squared terms down by multiplying them by 0.001.
# Rescale the squared terms by 0.001 and keep only the Acquisition == 1 rows:
daten$Acq_Expense_SQ2 <- 0.001 * daten$Acq_Expense_SQ
daten$Ret_Expense_SQ2 <- 0.001 * daten$Ret_Expense_SQ
date3 <- subset(daten, Acquisition == 1)
Dur <- survreg(Surv(Duration, Censor == 0, type = "right") ~ Acq_Expense +
                 Acq_Expense_SQ2 + Ret_Expense + Ret_Expense_SQ2 + Crossbuy +
                 Frequency + Frequency_SQ + Industry + Revenue + Employees,
               dist = "weibull", scale = 0, data = date3)
summary(Dur)
Now the coefficients are much closer to the correct ones, but I do not know why.
Is there a possible explanation for this problem?
Or do you see another problem with our code?
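For reference, the same rescaling can be written inline with I() instead of creating extra columns; a sketch that should be equivalent to the model above:

Dur2 <- survreg(Surv(Duration, Censor == 0, type = "right") ~
                  Acq_Expense + I(0.001 * Acq_Expense_SQ) +
                  Ret_Expense + I(0.001 * Ret_Expense_SQ) +
                  Crossbuy + Frequency + Frequency_SQ +
                  Industry + Revenue + Employees,
                dist = "weibull", scale = 0, data = subset(daten, Acquisition == 1))
summary(Dur2)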