Find the average of a variable compared to a fixed variable - r

So I have a short table of demographic data from a survey. Age, income, race, etc.
My HW question is as follows:
I would like you to determine, from your Tulsa data, whether age and income are
significantly related. If so, what is the expected income of a 50-year-old person?
I have the first part; I just need to know how to find the mean income of a 50-year-old person according to my data.

Step one: measure the correlation with cor(x, y).
Step two: fit a linear regression (assuming the simplest scenario) with lm(y ~ x).
The result contains the intercept and slope, which let you evaluate the fitted function y = f(x) at any value of x, such as 50 in your question.
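A minimal sketch of both steps, assuming your data frame is called tulsa with columns age and income (adjust the names to match your survey data):

# step one: correlation (cor.test also gives a p-value for the "significantly related" part)
cor(tulsa$age, tulsa$income)
cor.test(tulsa$age, tulsa$income)

# step two: simple linear regression income = f(age)
fit <- lm(income ~ age, data = tulsa)
coef(fit)                                    # intercept and slope

# expected income of a 50-year-old, i.e. f(50)
predict(fit, newdata = data.frame(age = 50))

predict() simply evaluates intercept + slope * 50 for you.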


R won't include an important random term within my glm due to high correlation - but I need to account for it

I have a big data frame of bat abundance per year, and I would like to model the population trend over those years in R. I also need to include year as a random effect, because my data points aren't independent: the bat population in one year directly affects the population of the next year (if there are 10 bats one year, they will likely still be alive the next year). I have a big dataset, but I have used the group_by() function to create a simpler data frame, laid out as below. In my bigger dataset I also have month and day.
year    total individuals
2000    39
2001    84
etc.    etc.
Here is the model I wish to use with lme4.
BLE_glm6 <- glm(total_indv ~ year + (year|year), data = BLE_total, family = poisson)
Because year is the predictor variable, R does not like it when I add year again, because it's highly correlated. So I am wondering: how do I account for the individuals in one year directly affecting the number of individuals the next year, if I can't include year as a random effect within the model?
There are a few possibilities. The most obvious would be to fit a Poisson model with the number of bats in the previous year as an offset:
## set up lagged variable
BLE_total <- transform(BLE_total,
                       total_indv_prev = c(NA, total_indv[-length(total_indv)]))
## or use dplyr::lag() if you like the tidyverse

glm(total_indv ~ year + offset(log(total_indv_prev)), data = BLE_total,
    family = poisson)
This will fit the model
mu = total_indv_prev*exp(beta_0 + beta_1*year)
total_indv ~ Poisson(mu)
i.e. exp(beta_0 + beta_1*year) will be the predicted ratio between the current and previous year. (See here for further explanation of the log-offset in a Poisson model.)
If you want year as a random effect (sorry, read the question too fast), then
library(lme4)
glmer(total_indv ~ offset(log(total_indv_prev)) + (1|year), ...)
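For completeness, a self-contained sketch on made-up counts laid out like the table above (the object and column names follow the question; the numbers themselves are invented):

library(lme4)

BLE_total <- data.frame(year = 2000:2005,
                        total_indv = c(39, 84, 70, 95, 110, 102))

## previous year's count; the first year has no predecessor, hence NA
BLE_total$total_indv_prev <- c(NA, head(BLE_total$total_indv, -1))

## Poisson GLM with the lagged count as a log-offset (fixed year trend)
fit_glm <- glm(total_indv ~ year + offset(log(total_indv_prev)),
               data = BLE_total, family = poisson)

## year as a random intercept instead of a fixed trend; with one observation
## per year this is an observation-level random effect, and it may come back
## singular on a toy dataset this small
fit_glmer <- glmer(total_indv ~ offset(log(total_indv_prev)) + (1 | year),
                   data = BLE_total, family = poisson)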

Lmer for longitudinal design

I have a longitudinal dataset where I have the following variables for each subject:
IV: 4 factors (factorA, factorB, factorC, factorD), each measured twice, at the beginning and at the end of an intervention.
DV: one outcome variable (behavior), also measured twice, at the beginning and at the end of the intervention.
I would like to create a model that uses the change in factorA, factorB, factorC, factorD (change from beginning to end of the intervention) to predict the change in behavior (again from beginning to end).
I thought to use the delta values of factorA, factorB, factorC, factorD (from pre to post intervention) and use these delta values to predict the delta value of the DV. I would also like to covary out the absolute values of each factor (A, B, C and D) (e.g. using only the value at the beginning of the intervention for each factor) to make sure I account for the effect that the absolute values (rather than the change) of these IVs may have on the DV.
Here is my dataset:
[screenshot of the dataset omitted]
Here is my model so far:
Model <- lmer(Delta_behavior ~ Absolute_factorA + Absolute_factorB +
                Absolute_factorC + Absolute_factorD + Delta_factorA +
                Delta_factorB + Delta_factorC + Delta_factorD +
                (1 | Subject), data = a)
I think I am doing something wrong because I get this error:
Error: number of levels of each grouping factor must be < number of observations
What am I doing wrong? Is the data set structured weirdly? Should I not use the delta values? Should I use another test (not lmer)?
Because you have reduced your data to a single observation per subject, you don't need to use a multi-level/mixed model. The reason that lmer is giving you an error is that in this situation the between-subject variance is confounded with the residual variance.
You can probably go ahead and use a linear model (lm) for this analysis.
More technical detail
The equation for the distribution of the ith observation is something like y_i = [fixed-effect predictors] + eps(subject(i)) + eps(i), where eps(subject(i)) is the Normal error term of the subject associated with the ith observation and eps(i) is the Normal residual error associated with the ith observation. If we only have one observation per subject, then each observation has two error terms that are unique to it. The sum of two Normal variables with zero means and variances V1 and V2 is also Normal with mean zero and variance V1 + V2, so only the sum V1 + V2 can be estimated: V1 and V2 are jointly unidentifiable. You can use lmerControl to override the error if you really want to; lmer will then return some arbitrary combination of V1 and V2 estimates that sum to the total variance.
There's a similar example illustrated here.
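A minimal sketch of the lm() route, keeping the variable names from your model (your data frame a, one row per subject):

model_lm <- lm(Delta_behavior ~ Absolute_factorA + Absolute_factorB +
                 Absolute_factorC + Absolute_factorD +
                 Delta_factorA + Delta_factorB +
                 Delta_factorC + Delta_factorD,
               data = a)
summary(model_lm)

If you really do want lmer to fit the over-parameterised model anyway, something like lmerControl(check.nobs.vs.nlev = "ignore", check.nobs.vs.nRE = "ignore") should silence the check, with the caveat above that the two variance components will not be separately identifiable.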

Storing and interpreting results of lm() model

I've got data on satisfaction scores for 5 questions over a 3 year period (2016 to 2018). My objective is to determine which of the 5 questions experienced the most statistically significant upward and downward trend over this 3 year period.
My dummy dataframe looks like this:
df = data.frame(Question = c('Q1','Q1','Q1','Q2','Q2','Q2','Q3','Q3','Q3','Q4','Q4','Q4','Q5','Q5','Q5'),
                Year = c('2016','2017','2018','2016','2017','2018','2016','2017','2018','2016','2017','2018','2016','2017','2018'),
                Score = c(0.8,0.6,0.2,0.2,0.4,0.8,0.4,0.5,0.4,0.1,0.2,0.1,0.9,0.7,0.3),
                Count = c(226,117,200,323,311,380,411,408,407,222,198,201,665,668,670))
For this, I used the lm function in R to create a linear model.
lm(Score ~ Question * as.numeric(Year), data = df)
However, in order to determine the most significant upward- and downward-trending questions, I thought of storing the model coefficients in a dataframe and then considering the highest and lowest coefficients as my most significant upward and downward trending questions.
My first question - Am I using the right approach for what I want to achieve?
And my second question - If I am using the right approach, how can I store these coefficients in a dataframe, and filter out the top and bottom values?
Any help on this would be highly appreciated.
If you store your model, you can extract coefficients and other elements from it much as you would extract columns from a dataframe.
An example:
y = as.numeric(c("1","2","3","4","5"))
x = as.numeric(c("5","6","3","10","12"))
model = lm(y ~ x)
model$coefficients
(Intercept)           x 
  0.6350365   0.3284672 
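To get closer to what was actually asked, here is a sketch (mine, using the df and model from the question) that stores the coefficient table in a dataframe and picks out the year-trend terms. Note that with R's default treatment contrasts the interaction coefficients are differences in slope relative to the reference question Q1, not per-question slopes:

fit <- lm(Score ~ Question * as.numeric(Year), data = df)

# coefficient table (estimate, std. error, t value, p value) as a dataframe
coef_df <- as.data.frame(summary(fit)$coefficients)
coef_df$term <- rownames(coef_df)

# keep the year slope and the Question:year interaction terms
trend_df <- coef_df[grepl("as.numeric(Year)", coef_df$term, fixed = TRUE), ]

trend_df[which.max(trend_df$Estimate), ]   # strongest upward trend
trend_df[which.min(trend_df$Estimate), ]   # strongest downward trend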

Cox Regression Hazard Ratio in Percentiles

I computed a Cox proportional hazards regression in R.
cox.model <- coxph(Surv(time, dead) ~ A + B + C + X, data = df)
Now, I got the hazard ratio (HR, or exp(coef)) for all these covariates, but I'm really only interested in the effects of the continuous predictor X. The HR for X is 1.20. X is scaled to the sample measurements, such that X has a mean of 0 and SD of 1. That is, an individual with a 1 SD increase in X has a 1.20 times higher hazard of mortality (the event) than someone with an average value of X (I believe).
I would like to be able to say these results in something that's a bit less awkward, and actually this article does exactly what I would like to. It says:
"In a Cox proportional hazards model adjusting for age, sex and
education, a higher level of total daily physical activity was
associated with a decreased risk of death (hazard ratio=0.71;
95%CI:0.63, 0.79). Thus, an individual with high total daily physical
activity (90th percentile) had about ¼ the risk of death as compared
to an individual with low total daily physical activity (10th
percentile)."
Assuming only the HR (i.e. 1.20) is needed, how does one compute this comparison statement? If you need any other information, please ask me for it.
If x1 is your 90th-percentile value of X and x2 is your 10th-percentile value, and p, q, r and s are your Cox regression coefficients (s = log(1.20), since the HR you quote is exp(coef)), then you need to find exp(p*A + q*B + r*C + s*x1) / exp(p*A + q*B + r*C + s*x2), where A, B and C can be average values of those variables. Because the A, B and C terms cancel, this reduces to exp(s*(x1 - x2)), i.e. HR^(x1 - x2). This ratio gives you the comparison statement.
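A small sketch of that calculation (my own; df$X is the scaled predictor from your model):

hr <- 1.20                    # exp(coef) for X from the Cox model
x1 <- quantile(df$X, 0.90)    # "high" X (90th percentile)
x2 <- quantile(df$X, 0.10)    # "low" X (10th percentile)
hr^(x1 - x2)                  # hazard ratio: 90th vs 10th percentile of X

If X is roughly normal after scaling, x1 - x2 is about 2.56 standard deviations, so the ratio comes out near 1.20^2.56, i.e. roughly 1.6.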
This question is actually for stats.stackexchange.com though.

plotting glm interactions: "newdata=" structure in predict() function

My problem is with the predict() function, its structure, and plotting the predictions.
Using the predictions coming from my model, I would like to visualize how my significant factors (and their interaction) affect the probability of my response variable.
My model:
m1 <- glm(mating ~ behv * pop +
            I(behv^2) * pop + condition,
          data = data1, family = binomial(logit))
mating: individual has mated or not (factor, binomial: 0,1)
pop: population (factor, 4 levels)
behv: behaviour (numeric, scaled & centered)
condition: relative fat content (numeric, scaled & centered)
Significant effects after running the glm:
pop1
condition
behv*pop2
behv^2*pop1
Although I have read the help pages, previous answers to similar questions, tutorials etc., I couldn't figure out how to structure the newdata= part in the predict() function. The effects I want to visualise (given above) might give a clue of what I want: For the "behv*pop2" interaction, for example, I would like to get a graph that shows how the behaviour of individuals from population-2 can influence whether they will mate or not (probability from 0 to 1).
Really the only thing that predict expects is that the names of the columns in newdata exactly match the column names used in the formula. And you must have values for each of your predictors. Here's some sample data.
# sample data
set.seed(16)
data <- data.frame(
  mating = sample(0:1, 200, replace = TRUE),
  pop = sample(letters[1:4], 200, replace = TRUE),
  behv = scale(rpois(200, 10)),
  condition = scale(rnorm(200, 5))
)
data1 <- data[1:150, ]     # for model fitting
data2 <- data[51:200, -1]  # for predicting
Then this will fit the model using data1 and predict into data2
model <- glm(mating ~ behv * pop +
               I(behv^2) * pop + condition,
             data = data1,
             family = binomial(logit))
predict(model, newdata = data2, type = "response")
Using type="response" will give you the predicted probabilities.
Now, to make predictions, you don't have to use a subset of the exact same data.frame. You can create a new one to investigate a particular range of values (just make sure the column names match up). So in order to explore behv*pop2 (or behv*popb in my sample data), I might create a data.frame like this
popbbehv <- data.frame(
  pop = "b",
  behv = seq(from = min(data$behv), to = max(data$behv), length.out = 100),
  condition = mean(data$condition)
)
Here I fix pop="b" so I'm only looking at that population, and since I have to supply condition as well, I fix it at the mean of the original data. (I could have just put in 0, since the data is centered and scaled.) Then I specify the range of behv values I'm interested in: I took the range of the original data and split it into 100 evenly spaced points, which gives me enough points to plot. So again I use predict to get
popbbehvpred<-predict(model, newdata=popbbehv, type="response")
and then I can plot that with
plot(popbbehvpred~behv, popbbehv, type="l")
Although nothing is significant in my fake data, we can see that higher behavior values seem to result in less mating for population B.
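If you also want to compare the populations on one plot, a sketch along the same lines (my own extension, not part of the answer above) is to predict over all four populations at once:

allpops <- expand.grid(
  pop = letters[1:4],
  behv = seq(from = min(data$behv), to = max(data$behv), length.out = 100)
)
allpops$condition <- mean(data$condition)
allpops$pred <- predict(model, newdata = allpops, type = "response")

# one curve per population
plot(pred ~ behv, data = subset(allpops, pop == "a"), type = "l",
     ylim = c(0, 1), ylab = "predicted probability of mating")
for (p in letters[2:4]) {
  lines(pred ~ behv, data = subset(allpops, pop == p), col = match(p, letters))
}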
