Binomial GLM function for specific values vs all other values in R

I am doing a study on the specific needs of kinship caregivers. I want to look at county vs. needs to see whether certain counties have significantly greater needs than others; this will hopefully be used to guide funding allocation or policy. The data have one column per need and a single county variable coded 1-39 and 99. Some counties only have a few participants, so I only want to compare the counties with more than 50 respondents (6, 17, 27, 31, 32), e.g. county 6 vs. the other large counties. I was using a similar function for my other calculations, which were binomial 0/1 outcomes, so I want to figure out what to use for this comparison.
model1 <- glm(need_1 ~ factor(county), family = "binomial", data = Clean.Data)
coef(summary(model1))
exp(coef(model1))
exp(confint.default(model1))
I have also tried
model1 <- glm(need_1 ~ factor(county == 6), family = "binomial", data = Clean.Data)
coef(summary(model1))
exp(coef(model1))
exp(confint.default(model1))
I really want a binary comparison of the chosen county vs. the others, but restricted to the counties listed above rather than including the smaller counties.
Best,
Adrienne
With the factor(county) model above, my exponentiated coefficients come out as 0 to Inf, which is not what I expect.
I also tried to create a new binomial variable using:
Clean.Data$Clark <- ifelse(Clean.Data$live == 6, 1, 0)
but this is not quite what I want, because it then compares county 6 to all counties rather than only to the ones with at least 50 respondents.
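For what it's worth, a minimal sketch of the "county 6 vs. the other large counties" comparison (column names follow the model code above; the >50-respondent counties are assumed to be those listed):
# Keep only the counties with more than 50 respondents, then code county 6
# against the remaining large counties.
big_counties <- c(6, 17, 27, 31, 32)
big <- subset(Clean.Data, county %in% big_counties)
big$county6 <- ifelse(big$county == 6, 1, 0)
model_big <- glm(need_1 ~ county6, family = "binomial", data = big)
coef(summary(model_big))
exp(coef(model_big))
exp(confint.default(model_big))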

Related

Use of svyglm and svydesign with R for multistage stratified cluster design

I have a complicated data set collected under a multistage stratified cluster design. I had originally analysed it with glm, but now realise I need to use svyglm. I'm not sure how best to model the data with svyglm and was wondering if anyone could shed some light.
I am attempting to see the effect that a variety of covariates taken at time 1 have on a binary outcome taken at time 2.
The sampling strategy was as follows: state -> urban/rural -> district -> subdistrict -> village. Within each village, individuals were randomly selected, with each of these having an id (uniqid).
I have a variable in the df for each of these stages of the sampling strategy. I also have the following variables: outcome, age, sex, income, marital_status, urban_or_rural_area, uniqid, weights. The formula that I want for my regression equation is outcome ~ age + sex + income + marital_status + urban_or_rural_area . Weights are coded by the weights variable. I had set the family to binomial(link = logit).
If anyone has any idea how such an approach could be coded in R with svyglm I would be most appreciative. I'm quite confused as to what should be inputted as ID, fpc and nest. Do I have to specify all levels of the stratified design or just some?
Any direction, or resources which explain this well would be massively appreciated.
You don't really give enough information about the design: which of the geographical units are strata and which are clusters. For example, my guess is that you sample both urban and rural in all states, and you don't sample all villages, but I don't know whether you sample all districts or subdistricts. I also don't know whether your overall sampling fraction is large or small (so whether the with-replacement approximation is OK).
Let's pretend you sample just some districts, so districts are your Primary Sampling Units, and that the overall sampling fraction of people is small. The design command is
your_design <- svydesign(id = ~district, weights = ~weights,
                         strata = ~interaction(state, urban_rural, drop = TRUE),
                         data = your_data_frame)
That is, the strata are combinations of state and urban/rural and any combinations that aren't in your data set don't exist in the population (maybe some states are all-rural or all-urban). Within each stratum you have districts, and only some of these appear in the sample. In your geographical hierarchy, districts are then the first level that is sampled rather than exhaustively enumerated.
You don't need fpc unless you want to specify the full multistage design without replacement.
The nest option is not about how the survey was done but is about how variables are coded. The US National Center for Health Statistics (bless their hearts) set up a lot of designs that have many strata and two primary sampling units per stratum. They call these primary sampling units 1 and 2; that is, they reuse the names 1 and 2 in every stratum. The svydesign function is set up to expect different sampling unit names in different strata, and to verify that each sampling unit name appears in just one stratum, as a check against data errors. This check has to be disabled for NCHS surveys and perhaps some others that also reuse sampling unit names. You can always leave out the nest option at first; svydesign will tell you if it might be needed.
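For illustration, the same design call with the check disabled (only needed if svydesign complains about sampling-unit names being reused across strata):
# Identical to the design above, but nest = TRUE tells svydesign to treat
# repeated PSU names in different strata as distinct units.
your_design <- svydesign(id = ~district, weights = ~weights,
                         strata = ~interaction(state, urban_rural, drop = TRUE),
                         nest = TRUE, data = your_data_frame)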
Finally, the models:
svyglm(outcome ~ age + sex + income + marital_status + urban_or_rural_area,
       design = your_design, family = quasibinomial)
Using binomial or quasibinomial will give identical answers, but using binomial will give you a harmless warning about non-integer weights. If you use quasibinomial, the harmless warning is suppressed.
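If it helps, a minimal sketch of storing and inspecting the fit (the odds-ratio summary is an assumption about what you want to report):
fit <- svyglm(outcome ~ age + sex + income + marital_status + urban_or_rural_area,
              design = your_design, family = quasibinomial)
summary(fit)                               # design-based standard errors and tests
exp(cbind(OR = coef(fit), confint(fit)))   # odds ratios with confidence intervals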

Model predicted values around mean using training data

I originally asked this as an imputation question, but I want to see whether it can be done with predictive modelling instead. I am trying to use information from 2003-2004 NHANES to predict future NHANES cycles. For some context, in 2003-2004 NHANES measured blood contaminants in individual people's blood. In this cycle, they also measured things such as triglycerides, cholesterol, etc. that influence the concentration of these blood contaminants.
The first step in my workflow is to impute missing blood contaminant concentrations in 2003-2004 using the measured values of triglycerides, cholesterol, etc. This is an easy and straightforward step. This will be my training dataset.
For future NHANES cycles (for example 2005-2006), they took individual blood samples, combined (pooled) them, and then measured the blood contaminants. I need to figure out what the individual concentrations were in these cycles. I have individual measurements for triglycerides, cholesterol, etc., and the pooled value is treated as the mean. Could I use the pool mean together with the 2003-2004 data to un-pool, i.e. predict, the individual values? For example, if a pool contains 8 individuals, we know the mean, the distribution (from 2003-2004), and the other parameters (triglycerides, etc.), which we can use in a regression to estimate the blood contaminants for those 8 individuals. This would be my test dataset, where I have the same contaminants as in the training dataset, plus a column for the number of individuals in each pool and the mean value. Alternatively, I can create rows of empty values for the contaminants and add the mean values separately.
I can easily run MICE, but I need to make sure that the distribution of the imputed data matches 2003-2004 and that the average of the 8 imputed individuals in each pool equals the measured pool value. So the 8 values for each pool need to average to the measured pool value, while the distribution has to match 2003-2004.
Does that make sense? Happy to provide more context if needed. Outline code is below.
library(mice)
library(tidyverse)
library(VIM)
#Papers detailing these functions can be found in MICE Cran package
df <- read.csv('2003_2004_template.csv', stringsAsFactors = TRUE, na.strings = c("", NA))
#Checking out the NA's that we are working with
non_detect_summary <- as.data.frame(df %>% summarize_all(~ sum(is.na(.))))
#helpful representation of ND
aggr_plot <- aggr(df[, 7:42], col = c('navyblue', 'red'),
                  numbers = TRUE,
                  sortVars = TRUE,
                  labels = names(df[, 7:42]),
                  cex.axis = .7,
                  gap = 3,
                  ylab = c("Histogram of Missing Data", "Pattern"))
#Mice time, m is the number of imputed datasets (you can think of this as # of cycles)
#You can check out the available regression methods in the console
methods(mice)
#Pick Method based on what you think is the best method. Read up.
#Now apply the right method
imputed_data <- mice(df, m = 30)
summary(imputed_data)
#if you want to see imputed values
imputed_data$imp
#finish the dataset
finished_imputed_data <- complete(imputed_data)
#Check for any missing values
sapply(finished_imputed_data, function(x) sum(is.na(x))) #All features should have a value of zero
#A helpful plot is the density plot. The density of the imputed data for each imputed dataset is shown
#in magenta while the density of the observed data is shown in blue.
#Again, under our previous assumptions we expect the distributions to be similar.
densityplot(x = imputed_data, data = ~ LBX028LA+LBX153LA+LBX189LA)
#Print off finished dataset
write_csv(finished_imputed_data, "finished_imputed_data.csv")
#This is where I need to use the finished_imputed_data to impute the values in the future years.
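A very rough sketch of the "add empty rows" idea described above (the file name and workflow are assumptions, and this does not yet enforce the pool-mean constraint):
# Append the 2005-2006 individuals with missing contaminant values so mice()
# can impute them from the 2003-2004 relationships. Column names must match df.
future <- read.csv("2005_2006_pools.csv", stringsAsFactors = TRUE, na.strings = c("", NA))  # hypothetical file
future[, c("LBX028LA", "LBX153LA", "LBX189LA")] <- NA  # contaminants unknown at the individual level
stacked <- dplyr::bind_rows(df, future)
stacked_imputed <- mice(stacked, m = 30)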

GLMM: Needing overall advice on selecting model terms for glmm modelling in R

I would like to create a model to understand how habitat type affects the abundance of bats found, but I am struggling to understand which terms I should include. I wish to use lme4 to fit a GLMM; I have chosen a GLMM because the response is a count following a Poisson-type distribution - you can't have half a bat - and the distribution is heavily skewed, with lots of single bats.
My dataset is very large and comprises abundance counts recorded by individual observers on bat surveys (the survey number is not included as it's public data). It includes abundance, year, month, day, environmental variables (temperature, humidity, etc.), recorded_habitat, surrounding_habitat, latitude and longitude, and is structured like the example below. (Occurrence is an anonymised recording made by an observer at a set location, where a number of bats are recorded; it comes from a larger dataset and is not directly relevant.)
occurrence  abundance  latitude  longitude  year  month  day  (environmental variables)  surrounding_hab  recorded_hab
3456        45         53.56     3.45       2000  5      3    34.6                        A                B
Recorded habitat and surrounding habitat take letter values (A-I), each corresponding to a habitat type.
The models shown below are the ones I think are a good choice.
rhab1 <- glmer(individual_count ~ recorded_hab + (1|year) + latitude + longitude + sun_duration2, family = poisson, data = BLE)
summary(rhab1)
rhab2 <- glmer(individual_count ~ surrounding_hab + (1|year) + latitude + longitude + sun_duration2, family = poisson, data = BLE)
summary(rhab2)
I'll now explain my questions with regard to the models I have chosen, with my current thinking/justification.
Firstly, I am confused about the mix of categorical and numeric variables: is it wise to include the environmental variables given that they are numeric? My current thinking is that scaling the environmental variables allowed the model to converge, so including them is okay?
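For what it's worth, a minimal sketch of that scaling step (assuming sun_duration2 is one of the continuous environmental variables, as in the models above):
# Scaling continuous environmental predictors often helps glmer() converge and
# does not change what the model estimates, only the scale of the coefficients.
BLE$sun_duration2_sc <- as.numeric(scale(BLE$sun_duration2))
rhab1_sc <- glmer(individual_count ~ recorded_hab + latitude + longitude + sun_duration2_sc + (1|year),
                  family = poisson, data = BLE)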
Secondly, I am confused about the mix of spatial and temporal variables, primarily whether I should include temporal variables as predictors. I'd like to include year as a random effect, since bat populations in one year directly affect populations in the next year, along with latitude and longitude - does this seem wise?
I am also unsure whether latitude and longitude should be random effects; the confusion arises because latitude and longitude do have some effect on land use.
Additionally, is it wise to include recorded_habitat and surrounding_habitat in the same model? When I have tried this it produces a massive output with a huge correlation matrix, so I'm thinking I should run two models - one with recorded_hab and one with surrounding_hab - and discuss them separately, hence the two models above.
Sorry this question is so broad! Any help or thinking is appreciated, including on data restructuring or model term choice. I'm also new to Stack Overflow, so please do advise on question layout/rules etc. if there are glaringly obvious mistakes.

Run a regression on two categorical factors

I would like to run an LM on a time series data set that I have collated.
One of the X variables is categorical: geographic Region (Middle East, Eastern Europe, North Africa, etc.).
MyModel <- lm(Y ~ X1 + X2 + X3, data = mydataset)
Currently, I have been able to run my model as separate regressions for each level of the categorical Region variable (see data below) using the following code,
Model1MEAST <- lm(Y ~ X1 + X2 + X3 + factor(name) + factor(Year),
                  data = subset(mydataset, Region == "Middle East"))
Which works fine.
Now I would like to run a regression on two or more regions combined while still leaving out all other regions, so, for example, a regression on Middle East and Eastern Europe countries only.
I have tried using '+', c(), and list() with the above code, but none of these seem to work.
Can anyone provide the code for running a regression on two categorical factor levels combined, not just one?
I have included a link to an image of a small random sample (only 4 variables) of my dataset, which comes from a time series study covering every country over 35 years with 50-plus economic and development indicator variables such as GDP; the categorical variable for which I would like to combine two regions in a subset regression is shown in bold.
Dummy Variables!
What I would do first is create dummy-variable columns using the fastDummies package.
Example: df <- dummy_cols(df, select_columns = "Region")
If you wish to leave one of the dummy columns out so the regression runs more cleanly, you can add an extra argument (it removes the dummy for the most frequent category):
df <- dummy_cols(df, select_columns = "Region", remove_most_frequent_dummy = TRUE)
Subset
If you wish to subset your data, you can first find the relevant row indices
ind <- which(df$Region == "Middle East" | df$Region == "East Europe")
and then create a new data frame consisting solely of the Middle East and East Europe rows.
newdf <- df[ind, ]
Once you have done this you can run a regression remembering to exclude one of your dummy variables.
Example: lm(Y ~ Xi + `Region_Middle East`, data = newdf)
Alternatively
Instead of splitting your data you can run all of your dummied variables (except one) in your regression such that you only have one model. Determining which way is better will come down to a combination of sample size availability as well as logical arguments for splitting/not splitting e.g. do we expect GDP to influence unemployment rates the same way across the world? If yes, then it might be good to have a full model. If not, then splitting may be the better direction.
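For completeness, a minimal sketch (with assumed column names) of the subset-then-regress route, letting lm() build the dummy coding from factor() instead of creating columns by hand:
# Keep only the two regions of interest; factor(Region) then generates the
# dummies and R drops one level automatically as the reference.
newdf <- subset(mydataset, Region %in% c("Middle East", "East Europe"))
fit <- lm(Y ~ X1 + X2 + X3 + factor(Region), data = newdf)
summary(fit)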

Zero-Inflated Negative Binomial Regression - R - Census Data

I am fairly new to the R world. I need some help with the correct syntax for running a negative binomial regression/zero-inflated negative binomial regression on some data.
I am trying to run a regression that looks at how race/ethnicity influences the number of summer meal sites in a census tract.
I have parsed all the necessary data into one "MASTER.csv" file and the format is as follows:
Column headers: GEO ID - number of Summer Meal Sites - Census Tract Name - Total Population - White - Black - Indian - Asian - Other
So an example row would look like: 48001950100 - 4 - Census Tract 9501, Anderson County, Texas - 5477 - 4400 - 859 - 14 - 21 - 0
And so on, I have a total of 5266 rows each in the same format. Race/ethnicity is reported as a count of how many individuals in that certain census tract are of a respective race/ethnicity.
I am using a zero-inflated negative binomial model because the dependent variable is a count and therefore prone to a skewed distribution.
My dependent variable is the number of summer meal sites in each census tract. (ex. in this case, the second column, 4).
My independent variable would be the race/ethnicities. Black, White etc.. I also need to set White as my omitted ( or reference) variable since I am running a regression on nominal variables.
How would I go about doing this? Would it look similar to the code posted below?
require(MASS)
require(pscl)
zeroinfl(formula = MASTER$num_summer_meal_sites ~ .| MASTER$White + MASTER$Black + MASTER$Other, data = "MASTER", dist = "negbin")
Would this do what I need? Also, I am unclear as to how I should set "White" as the reference/omitted variable.
As pointed out above, you have a few problems with your formula. It probably should be re-written as
zeroinfl(num_summer_meal_sites ~ Black + Indian + Asian + Other,
         data = MASTER, dist = "negbin")
Here we pass the data= parameter the actual data.frame object, not a character string naming it. This lets you use the column names without prefixing them all with a data.frame name; R looks in data= first when resolving names.
Also, rather than using "." to indicate all variables, it would be useful in this case to explicitly list the covariates you want, since some seem like they may be inappropriate for regression.
And as pointed out above, it's best not to include correlated variables in a regression model. So leaving out White will help to prevent that. Since you have summary data, you don't really have a reference category like you would if you had individual data.
zeroinfl uses | to separate the regressors for the count part from those for the zero-inflation part. Did you only want to model the inflation with the race variables? If so, that aspect of your formulation was appropriate.
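To make the | point concrete, here is an illustrative two-part version with the same regressors in both parts (whether that is what you want is a modelling decision):
# Count-model regressors go before "|", zero-inflation regressors after it.
library(pscl)
fit <- zeroinfl(num_summer_meal_sites ~ Black + Indian + Asian + Other |
                  Black + Indian + Asian + Other,
                data = MASTER, dist = "negbin")
summary(fit)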
