I would like to run a linear model on a time series data set that I have collated.
One of the X variables is categorical: geographic Region (Middle East, Eastern Europe, North Africa, etc.).
MyModel <- lm(Y ~ X1 + X2 + X3, data = mydataset)
Currently, I have been able to run my model as a separate regression for each level of the categorical variable Region (see data below) using the following code.
Model1MEast <- lm(Y ~ X1 + X2 + X3 + factor(name) + factor(Year),
                  data = subset(mydataset, Region == "Middle East"))
Which works fine.
Now I would like to run a regression on two or more Regions combined while still leaving out all other regions - for example, a regression on the Middle East and Eastern Europe countries only.
I have tried using the '+' operator and the c() and list() functions with the above code, but that does not seem to work.
Can anyone provide the code for running a regression on two levels of the categorical factor combined, not just one?
I have included a link to an image of just a small random sample (only 4 variables) of my dataset, taken from my time series study covering every country over 35 years with 50-plus economic and development indicator variables, such as GDP. The categorical variable whose regions I would like to combine in a subset regression is marked in bold.
Dummy Variables!
What I would do first is create dummy variable columns using the fastDummies package.
Example: df <- dummy_cols(df, select_columns = "Region")
If you wish to leave one of the dummy columns out for the sake of neater regression running, you can add an extra argument (it removes the dummy for the most frequent category, which then serves as the baseline):
df <- dummy_cols(df, select_columns = "Region", remove_most_frequent_dummy = TRUE)
Subset
If you wish to subset your data, you can do so by first finding the relevant row indices
ind <- which(df$Region == "Middle East" | df$Region == "East Europe")
and then creating a new data frame consisting solely of the Middle East and East Europe rows:
newdf <- df[ind, ]
Once you have done this you can run the regression, remembering to exclude one of your dummy variables.
Example: lm(Y ~ Xi + `Region_Middle East`, data = newdf)
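Putting the pieces together for the Middle East plus East Europe case (a sketch; X1, X2, X3 stand in for your actual predictors, as in the question's model):

library(fastDummies)

# Keep only the two regions of interest
newdf <- df[df$Region %in% c("Middle East", "East Europe"), ]

# One dummy per region; with only two regions left, one dummy is enough
newdf <- dummy_cols(newdf, select_columns = "Region")

# East Europe is the baseline, so the dummy's coefficient is the
# Middle East effect relative to East Europe
fit <- lm(Y ~ X1 + X2 + X3 + `Region_Middle East`, data = newdf)
summary(fit)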
Alternatively
Instead of splitting your data, you can include all of your dummy variables (except one) in your regression so that you have only one model. Determining which way is better comes down to a combination of sample size availability and the logical arguments for splitting or not splitting, e.g. do we expect GDP to influence unemployment rates the same way across the world? If yes, then it might be good to have a full model. If not, then splitting may be the better direction.
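A sketch of the single-model version; if Region is passed as a factor, R creates the dummies itself and drops one level as the baseline:

# R expands Region into dummy variables automatically, using the
# first factor level as the reference category
fit_all <- lm(Y ~ X1 + X2 + X3 + factor(Region), data = df)
summary(fit_all)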
Related
I am doing a study on specific needs of kinship caregivers. I want to look at county vs. needs to see whether certain counties have more significant needs than others; this will hopefully be used to allocate funding or policies. I am looking at need (which has a separate column for each need) and county (one variable with counties 1-39, 99). Some counties only have a few participants, so I only want to compare the counties with more than 50 respondents (6, 17, 27, 31, 32) - so, county "6" vs. the other counties. I was using a similar function for my other calculations, which were binary 1/0 outcomes, so I want to figure out what to use for this comparison.
model1 <-glm(need_1 ~ factor(county), family="binomial", data=Clean.Data)
coef(summary(model1))
exp(coef(model1))
exp(confint.default(model1))
I have also tried
model1 <-glm(need_1 ~ factor(county==6), family="binomial", data=Clean.Data)
coef(summary(model1))
exp(coef(model1))
exp(confint.default(model1))
I really want to use a binomial comparison of the chosen county vs. the others, restricted to the counties listed above rather than the smaller counties.
Best,
Adrienne
With the factor(county) model above, my exponentiated coefficients run from 0 to Inf, which is not what I expect.
I also tried to create a new binary variable using:
Clean.Data$Clark <- ifelse(Clean.Data$live == 6, 1, 0)
but this is not quite what I want, because it then compares county 6 to all counties rather than only to those with at least 50 respondents.
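A minimal sketch of one way to get the comparison described above (assuming county is stored numerically in Clean.Data): subset to the large counties, then set county 6 as the reference level.

# Keep only the counties with at least 50 respondents
big <- subset(Clean.Data, county %in% c(6, 17, 27, 31, 32))

# Make county a factor with county 6 as the reference level, so each
# coefficient compares one large county against county 6
big$county <- relevel(factor(big$county), ref = "6")

model2 <- glm(need_1 ~ county, family = "binomial", data = big)
coef(summary(model2))
exp(coef(model2))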
I'm relatively new to R, particularly to this package. I am running N-mixture models assessing detection probabilities and abundance. I have abundance data, site covariates and observation covariates. There are three repeated observations (rounds) per site. The observation covariates are set up as columns (three columns per covariate, one for each round), and the rows are individual sites. The abundance data are formatted similarly, with each column heading representing a different round. I've copied my code below.
y.abun2 <- COYE[2:4]
obsCovs.ss <- list(temp = Covariate2021[3:5], Date = Covariate2021[13:15],
                   Cloud = Covariate2021[17:19], Wind = Covariate2021[21:23],
                   Observ = Covariate2021[25:27])
siteCovs.ss <- Covariate2021[c(29, 30, 31, 32)]
coyeabund <- unmarkedFramePCount(y = y.abun2, siteCovs = siteCovs.ss,
                                 obsCovs = obsCovs.ss)
After this I scale using this code:
coyeabund@siteCovs$TreeCover <- scale(coyeabund@siteCovs$TreeCover)
Moving on to my model, I use this code:
abun.coye.full <- pcount(~ TreeCover + temp + Date + Cloud + Wind + Observ
                           ~ HHSDI + ProportionNH + Quality,
                         coyeabund, mixture = "NB", K = 132, se = TRUE)
Is the model matching the observation covariates to the abundance measurements for each round? (i.e., can it tell that the temp column in Covariate2021[5] corresponds to the third round of abundance measurements?)
The models seem fine so far but I am so new at this I want to confirm that I haven't gone astray.
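One quick sanity check, since unmarked matches observation covariates to the count matrix by column position: print the round columns side by side and confirm they are in the same order (a sketch using the objects above):

# Round j of each obsCov matrix should sit in the same column
# position as round j of the counts
names(COYE[2:4])           # count columns, rounds 1-3
names(Covariate2021[3:5])  # temp columns, rounds 1-3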
I would like to create a model to understand how habitat type affects the abundance of bats found; however, I am struggling to understand which terms I should include. I wish to use lme4 to fit a GLMM. I have chosen a GLMM because the distribution is Poisson - you can't have half a bat - and also right-skewed, with lots of single bats.
My dataset is very big and comprises abundance counts recorded by an individual on a bat survey (the survey number is not included as it's public data). It includes abundance, year, month, day, environmental variables (temp, humidity, etc.), recorded_habitat, surrounding_habitat, latitude and longitude, and is structured like the sample below. P.S. An occurrence is an anonymous recording made by an observer at a set location, where a number of bats will be recorded; it comes from a greater dataset and is not otherwise relevant.
occurrence | abundance | latitude | longitude | year | month | day | (environmental variables) | surrounding_hab | recorded_hab
3456       | 45        | 53.56    | 3.45      | 2000 | 5     | 3   | 34.6                       | A               | B
Recorded habitat and surrounding habitat range over the letters A-I, each corresponding to a habitat type.
The models shown below are the ones I think are a good choice.
rhab1 <- glmer(individual_count ~ recorded_hab + (1|year) + latitude + longitude + sun_duration2, family = poisson, data = BLE)
summary(rhab1)
rhab2 <- glmer(individual_count ~ surrounding_hab + (1|year) + latitude + longitude + sun_duration2, family = poisson, data = BLE)
summary(rhab2)
I'll now explain my questions in regard to the models I have chosen, with my current thinking/justification.
Firstly, I am confused about the mix of categorical and numeric variables: is it wise to include the environmental variables given that they are numeric? My current thinking is that scaling the environmental variables allowed the model to converge, so including them is okay?
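By scaling I mean the usual standardisation, something like this (a sketch; sun_duration2, latitude and longitude as in the models above):

# Centre and scale numeric covariates so glmer converges more easily;
# scale() returns a one-column matrix, so [, 1] keeps it a plain vector
BLE$sun_duration2 <- scale(BLE$sun_duration2)[, 1]
BLE$latitude <- scale(BLE$latitude)[, 1]
BLE$longitude <- scale(BLE$longitude)[, 1]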
Secondly, I am confused about the mix of spatial and temporal variables, primarily whether I should include temporal variables given that the predictor is a temporal variable. I'd like to include year as a random effect, since bat populations in one year directly affect bat populations the next year, and also latitude and longitude; does this seem wise?
I am also unsure whether latitude and longitude should be random effects. The confusion arises because latitude and longitude do have some effect on land use.
Additionally, is it wise to include recorded_habitat and surrounding_habitat in the same model? When I have tried this, it produces a massive output with a huge correlation matrix, so I'm thinking I should run two models, one with recorded_hab and one with surrounding_hab, and discuss them separately - hence the two models above.
Sorry this question is so broad! Any help or thinking is appreciated, including on data restructuring or model term choice. I'm also new to Stack Overflow, so please do advise on question layout/rules etc. if there are glaringly obvious mistakes.
I am trying to set up a multivariable linear regression model using R, but the model keeps creating new variables in the output.
Essentially I am trying to find correlations between air quality and different factors such as population, time of day, weather readings, and a few others. For this example, I am looking at multiple different sensor locations over a month's time. I have data on the actual AQI and the weather, and I assumed the population in the area surrounding each sensor doesn't change over time (which might be my problem). Therefore, the population varies between the different sensors but remains constant over the month. I then combined each sensor's data into one data frame to run the regression. The code for my model is below:
model = lm(AQI ~ Time.of.Day + Temp + Humidity + Pressure + pop + ind + rd_dist, data = Krakdata)
The output is given in the picture below. I do not know why it doesn't treat population as a single variable; instead, it outputs each population reading as another factor level. Thanks!
Linear Model Output:
Krakdata example. Note how the population will not change until the next sensor comes up:
pop is a categorical variable. You need to convert it to a number; otherwise each value will be treated as a separate category and therefore a separate variable.
pop is a categorical variable, hence R treats it as such: it turns pop into dummy variables, which is the output you are seeing. You have to convert it to numeric if this variable is supposed to be numeric in nature/in your analysis.
As to how to convert it:
Krakdata$pop <- as.numeric(as.character(Krakdata$pop))
As to why pop is read as a factor while it resembles numbers, you need to look into your previous code or at the data itself.
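A quick way to confirm what R thinks the column is, before and after the conversion:

class(Krakdata$pop)  # "factor" (or "character") before the conversion
Krakdata$pop <- as.numeric(as.character(Krakdata$pop))
class(Krakdata$pop)  # "numeric" afterwards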
I am fairly new to the R world. I need some help with the correct syntax for running a negative binomial regression/zero-inflated negative binomial regression on some data.
I am trying to run a regression that looks at how race/ethnicity influences the number of summer meal sites in a census tract.
I have parsed all the necessary data into one "MASTER.csv" file and the format is as follows:
Column headers: GEO ID - number of Summer Meal Sites - Census Tract Name - Total Population - White - Black - Indian - Asian - Other
So an example row would look like: 48001950100 - 4 - Census Tract 9501, Anderson County, Texas - 5477 - 4400 - 859 - 14 - 21 - 0
And so on; I have a total of 5266 rows, each in the same format. Race/ethnicity is reported as a count of how many individuals in a given census tract are of the respective race/ethnicity.
I am using a zero-inflated negative binomial model because the dependent variable is a count and therefore susceptible to a skewed distribution.
My dependent variable is the number of summer meal sites in each census tract. (ex. in this case, the second column, 4).
My independent variables would be the race/ethnicities: Black, White, etc. I also need to set White as my omitted (or reference) variable since I am running a regression on nominal variables.
How would I go about doing this? Would it look similar to the code posted below?
require(MASS)
require(pscl)
zeroinfl(formula = MASTER$num_summer_meal_sites ~ .| MASTER$White + MASTER$Black + MASTER$Other, data = "MASTER", dist = "negbin")
Would this do what I need? Also, I am unclear as to how I should set "White" as the reference/omitted variable.
As pointed out above, you have a few problems with your formula. It probably should be re-written as
zeroinfl(num_summer_meal_sites ~ Black + Indian + Asian + Other,
data = MASTER, dist = "negbin")
Here we pass the data= parameter the actual data.frame, not a character string naming it. This lets you use column names in the formula without prefixing them with the data.frame name; they are looked up in data= first.
Also, rather than using "." to indicate all variables, it is useful in this case to explicitly list the covariates you want, since some seem like they may be inappropriate for regression.
And as pointed out above, it's best not to include correlated variables in a regression model. So leaving out White will help to prevent that. Since you have summary data, you don't really have a reference category like you would if you had individual data.
zeroinfl uses the | to delimit the regressors for the count part from those for the zero-inflation part. Did you only want to model the inflation with the race variables? If so, your formulation was appropriate.
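For completeness, a sketch with both parts spelled out, using the same covariates on each side of the |; either side can be changed independently:

library(pscl)

# count part | zero-inflation part; White is left out, as discussed above
fit <- zeroinfl(num_summer_meal_sites ~ Black + Indian + Asian + Other |
                  Black + Indian + Asian + Other,
                data = MASTER, dist = "negbin")
summary(fit)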