R - using predict function when one variable is a binary factor

My linear model is trying to predict the amount gambled based on the variables sex, income, verbal, and status. Sex is a binary variable that is "Male" or "Female" (they are factors), while the rest are all numeric.
lm3 <- lm(gamble ~ sex + status + income + verbal, data=teengamb)
That is my linear model. I'm having trouble predicting the amount gambled for a male with mean status, income, and verbal:
newdata <- c(as.factor("Male"), mean(teengamb$status), mean(teengamb$income), mean(teengamb$verbal))
newdata <- data.frame(newdata)
predict(lm3, newdata)
I'm not sure how to go about doing this.
Note that sex was originally coded as 0=male, 1=female, and I converted it to "Male" and "Female" like this:
teengamb$sex[teengamb$sex==0] <- "Male"
teengamb$sex[teengamb$sex==1] <- "Female"
teengamb$sex <- as.factor(teengamb$sex)

When you create your newdata data frame, you have to be sure each column has a name:
newdata <- data.frame(sex=0, status=mean(teengamb$status),
income=mean(teengamb$income),
verbal=mean(teengamb$verbal))
predict(lm3, newdata)
# 28.24252
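Also note that in the original teengamb data sex is represented as 0=male, 1=female (you can see this by doing help(teengamb)), so sex=0 above only works before the conversion.
Because you recoded sex as a factor with levels "Female" and "Male", newdata should instead be: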
newdata <- data.frame(sex=factor("Male", levels=c("Female", "Male")),
status=mean(teengamb$status),
income=mean(teengamb$income),
verbal=mean(teengamb$verbal))
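As a rough, self-contained sketch of the same idea: the toy data frame below is made up for illustration (it is not teengamb, and the names toy, fit and pred are hypothetical). It shows the pattern of building newdata with named columns and a factor that shares the levels used when fitting. The attempt in the question fails because c() flattens everything into one unnamed atomic vector, so predict() cannot match columns to model terms.

```r
# Made-up data standing in for teengamb (assumption: not the real data set)
set.seed(1)
toy <- data.frame(
  sex    = factor(rep(c("Female", "Male"), each = 10)),
  income = runif(20, min = 0, max = 15)
)
toy$gamble <- 5 + 2 * toy$income + 10 * (toy$sex == "Male") + rnorm(20)

fit <- lm(gamble ~ sex + income, data = toy)

# newdata needs one named column per predictor; the factor reuses the
# levels seen during fitting so "Male" is coded consistently
newdata <- data.frame(
  sex    = factor("Male", levels = levels(toy$sex)),
  income = mean(toy$income)
)

pred <- predict(fit, newdata)
print(pred)
```

Taking the levels from the fitted data with levels(toy$sex) avoids typing them by hand and getting the order or spelling wrong.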

Related

Cannot fit multilevel ordinal logit model using clmm

I'm trying to fit a multilevel (random effects) ordered logit model using the ordinal package, but I keep running into this error:
Error in region:country1 : NA/NaN argument
Here's my simplified model. I'm regressing an indicator of happiness on a number of variables, including class, gender, age, etc. There are two nested levels: regions within countries.
library(ordinal)
# Set as factor
data$happiness <- as.factor(data$happiness)
# Remove NA
missing_country1 <- is.na(data$country1)
data <- data[!missing_country1, ]
missing_region <- is.na(data$region)
data <- data[!missing_region, ]
# Model
model1 <- clmm(happiness ~ age + gender + class + (1 | country1 / region),
data = data,
na.action = na.omit
)
I have removed all NA and NaN from both country1 and region.
Thanks,
Figured it out: it was because ordinal doesn't automatically convert the grouping variables to factor, so you need to do it manually.
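A minimal sketch of that manual conversion, using a made-up data frame in place of the real one (the values below are hypothetical; only the column names come from the question):

```r
# Hypothetical toy data; the grouping variables start out as numeric codes
data <- data.frame(
  happiness = c(1, 3, 2, 3, 1, 2),
  country1  = c(10, 10, 20, 20, 30, 30),
  region    = c(1, 2, 1, 2, 1, 2)
)

data$happiness <- as.factor(data$happiness)

# Convert the grouping variables to factors before fitting
data$country1 <- factor(data$country1)
data$region   <- factor(data$region)

str(data)

# The model call from the question can then be run unchanged
# (left commented out here because it needs the ordinal package and real data):
# model1 <- clmm(happiness ~ age + gender + class + (1 | country1 / region),
#                data = data, na.action = na.omit)
```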

Prediction function in R

I am working on a data set of employed and unemployed people. After estimating the parameters for (log) wages, I am trying to predict (log) wages for the unemployed people, whose wage values are NA in the data set.
After using predict, I still get predictions only for employed people. Does anyone know how to get the full column of "predML"? Thank you in advance!
library(sampleSelection)  # provides selection()
ml_restricted <- selection(employed ~ schooling + age + agesq + married, logwage ~ schooling + age + agesq, data)
summary(ml_restricted)
# find predicted values
predML <- predict(ml_restricted)
data <- cbind(data, predML)
employed2 <- ifelse(data$employed == 1, "employed", "unemployed")
data <- cbind(data, employed2)

Survminer - include subset of variables in plot

Let's say I want to plot the survival curves using a model of the lung data, that controls for sex and a median split of the age variable (I could also control linearly for age and that would make my problem even worse).
I would like to make a plot of this model only showing the stratification between the levels of the sex factor. If I do what seems to be the standard, however, I get 4 instead of two survival curves.
library(survival)
library(survminer)
library(dplyr)  # for %>% and mutate()
reg_lung <- lung %>% mutate(age_cat = ifelse(age > 63, "old", "young"))
lung_fit <- survfit(Surv(time, status) ~ age_cat + sex, data = reg_lung)
ggsurvplot(lung_fit, data = reg_lung)
[Figure: resulting survival plot]
That is to say, I would like to see the difference sex makes while holding the influence of age fixed (either as factor old/young or linearly).
You can fit your model with coxph and define sex as strata:
lung_fit <- coxph(Surv(time, status) ~ age_cat + strata(sex), data = reg_lung)
ggsurvplot(survfit(lung_fit), data = reg_lung)
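To check the effect without plotting, a sketch like the following counts the curves each approach produces. It assumes only the survival package and its lung data; the names fit_km, fit_cox and cox_curves are made up for this illustration, and transform() is used instead of mutate() to avoid an extra dependency.

```r
library(survival)

# Same median split of age as in the question, done with base R
reg_lung <- transform(lung, age_cat = ifelse(age > 63, "old", "young"))

# Kaplan-Meier fit: one curve per age_cat x sex combination
fit_km <- survfit(Surv(time, status) ~ age_cat + sex, data = reg_lung)
n_km <- length(fit_km$strata)

# Cox model adjusting for age_cat, stratified by sex:
# survfit() on this model gives one curve per stratum
fit_cox <- coxph(Surv(time, status) ~ age_cat + strata(sex), data = reg_lung)
cox_curves <- survfit(fit_cox)
n_cox <- length(cox_curves$strata)

print(c(kaplan_meier = n_km, stratified_cox = n_cox))
```

With no newdata supplied, the curves from the Cox fit are evaluated at a reference value of the covariates rather than at a specific age group, so a newdata argument to survfit() may be preferable when a particular age group is of interest.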

R function with separating data and finding linear regression

I want to calculate the impact that height has on earnings given the gender. I divided my data into data for male and female, but when I run the lm(earnings~height+education+age, data = data_female) function it gives me this error:
Error in model.frame.default(formula = earnings ~ height + education + :
variable lengths differ (found for 'education')
Would you be able to help in either suggesting a better way to refine my model or helping to fix this particular error? Please let me know.
setwd("~/Google Drive/R Data")
data <- read.csv('data_ass5.csv')
height <- data$height
earnings <- data$earnings
gender <- data$sex
age <- data$age
education <- data$educ
multiple_regression <- lm(earnings~height+age+gender+education,data = data)
lm(earnings~height+age+gender+education,data = data)
summary(multiple_regression)
summary(linear_regression)
multiple_regression_redefined <- lm(earnings~age+gender+education,data = data)
# Now I wish to particularly assess the impact of gender on earnings
# therefore trying to refine my model doing the following:
# but the last lm line is causing an error. Would you be able to advise
# whether this is the correct way to refine it and/or why I am getting the error?
# I even tried putting na.rm=TRUE after the lm code, but error still.
data_female <- subset(data,gender==0)
data_male <- subset(data,gender==1)
lm(earnings~height+education+age, data = data_female)
Per docs of lm, the data argument handles variables in formula in two ways that are NOT mutually exclusive:
data
an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which lm is called.
Specifically, all your vector assignments are redundant and overlap with column names in the data frame except for gender and education:
height <- data$height
earnings <- data$earnings
gender <- data$sex
age <- data$age
education <- data$educ
multiple_regression <- lm(earnings~height+age+gender+education,data = data)
When the above is run, all referenced names except gender and education come from the data frame. But gender and education are pulled from the global environment, from the vectors you assigned above. Had you used sex and educ, the values would be pulled from the data frame like all the others.
Relatedly, your subset calls use the gender vector and not the sex column. Fortunately, the two are identical, so no errors or undesired results occurred.
data_female <- subset(data,gender==0)
data_male <- subset(data,gender==1)
Therefore, once you subset your data, lm pulls most values from the subsetted data frame but one variable, education, from the global environment. That education vector is based on the full data frame, so it is longer than the columns of the subsetted data frame.
Altogether, simply avoid assigning the redundant vectors and use the data frame's columns for both the full and the subsetted data frames.
# REMOVE these redundant assignments:
# height <- data$height
# earnings <- data$earnings
# gender <- data$sex
# age <- data$age
# education <- data$educ
# REPLACE gender WITH sex AND education WITH educ (RENAME COLS IF NEEDED)
multiple_regression <- lm(earnings ~ height + age + sex + educ, data = data)
# REPLACE gender WITH sex
data_female <- subset(data, sex==0)
data_male <- subset(data, sex==1)
# REPLACE education WITH educ
lm(earnings ~ height + educ + age, data = data_female)
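The mechanism can be reproduced with a small made-up data frame (the values below are invented; only the column names mirror the question). This sketch triggers the same error with a global vector and then shows the column-based call working on the subset:

```r
# Toy data standing in for data_ass5.csv (assumption: values are made up)
set.seed(1)
toy <- data.frame(
  earnings = rnorm(10, mean = 50, sd = 10),
  height   = rnorm(10, mean = 170, sd = 8),
  educ     = sample(10:18, 10, replace = TRUE),
  sex      = rep(c(0, 1), each = 5)
)

# A global vector whose name is NOT a column of toy
education <- toy$educ                 # length 10

toy_female <- subset(toy, sex == 0)   # 5 rows

# education is not found in toy_female, so lm() falls back to the
# length-10 global vector and the lengths no longer match
msg <- tryCatch(
  {
    lm(earnings ~ height + education, data = toy_female)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Using the column name keeps every variable inside the subset
fit <- lm(earnings ~ height + educ, data = toy_female)
print(nobs(fit))
```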

Adding a vector of dummy variables in logistic regression

I am currently trying to conduct logistic regression where one of the variables is a vector of 32 dummy variables. Each dummy represents a type of crime. For example:
narcotics <- ifelse(train$PRIMARY.DESCRIPTION == "NARCOTICS", 1,0)
Then the vector is created:
crime.type <- c(narcotics, theft, other.offense, burglary, motor.vehicle.theft, battery, robbery, assault, criminal.damage, deceptive.practice, kidnapping, etc.)
The logistic model is as follows:
logit.mod.train <- lm(street1 ~ BEAT+WARD+X.COORDINATE+Y.COORDINATE+LATITUDE+LONGITUDE+crime.type, data = train, family = "binomial")
It's important to note that street1 is actually a dummy variable for the location of the crime being on the street. So the column is LOCATION.DESCRIPTION and the element is street.
street1 <- ifelse(train$LOCATION.DESCRIPTION == "STREET", 1,0)
It yields this error:
Error in model.frame.default(formula = street1 ~ BEAT + WARD + X.COORDINATE + :
variable lengths differ (found for 'crime.type')
I thought this would work because they are derived from the same data set and the dummies represent each unique element of one of the columns. When I input each dummy variable separately it's successful but I want to condense the regression and make it more efficient.
Thank you in advance
If you intend for each type of crime to be its own predictor, you'll need to bind them to train, and then specify the variables in your lm formula. (Actually for logit it should be glm().)
For a more compact formula, subset train in the data= argument of glm() to include only your response variable and your intended design matrix. Then use street1 ~ . as your formula.
train <- cbind(train, narcotics, theft, street1)
model.vars <- c("narcotics", "theft", "street1")
logit.mod.train <- glm(street1 ~ ., data = train[,model.vars], family = "binomial")
More explanation:
Using ifelse as you've done produces a 1 or 0 for every element in train.
When you define crime.type as narcotics (which has one element per row of train) plus any additional dummies, crime.type becomes longer than the number of rows in train.
Then you're asking lm() to process a lopsided design matrix, where one predictor (crime.type) has more elements in it than the other predictors. That's why you're getting the error.
Here's a replication of the issue:
N <- 100
train <- data.frame(PRIMARY.DESCRIPTION=sample(c("A","B"), replace = TRUE, size = N),
response = rbinom(n=N, prob=0.7, size=1))
dim(train) # 100 2
narcotics <- ifelse(train$PRIMARY.DESCRIPTION == "A", 1, 0)
length(narcotics) # 100
theft <- ifelse(train$PRIMARY.DESCRIPTION == "B", 1, 0)
length(theft) # 100
crime.type <- c(narcotics, theft)
length(crime.type) # 200
logit.mod.train <- glm(response ~ PRIMARY.DESCRIPTION+crime.type, data = train, family = "binomial")
Error in model.frame.default(formula = response ~ PRIMARY.DESCRIPTION + :
variable lengths differ (found for 'crime.type')
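For contrast, here is a sketch of a version that fits, again on simulated data (the three categories "A", "B", "C" and the names fit and fit2 are made up for illustration). Each dummy is stored as its own column of train, and the response ~ . formula picks them up. The last call is an alternative, assuming the crime type is available as a single column: passing it as a factor lets glm() build the dummies itself.

```r
set.seed(42)
N <- 100
train <- data.frame(
  PRIMARY.DESCRIPTION = sample(c("A", "B", "C"), size = N, replace = TRUE),
  response = rbinom(n = N, size = 1, prob = 0.7)
)

# One dummy per crime type, each with exactly N elements,
# stored as columns of train rather than concatenated with c()
train$narcotics <- ifelse(train$PRIMARY.DESCRIPTION == "A", 1, 0)
train$theft     <- ifelse(train$PRIMARY.DESCRIPTION == "B", 1, 0)

# Keep only the response and the intended predictors, then use `.`
model.vars <- c("response", "narcotics", "theft")
fit <- glm(response ~ ., data = train[, model.vars], family = "binomial")
print(names(coef(fit)))

# Alternative: let glm() expand the factor into dummies on its own
fit2 <- glm(response ~ factor(PRIMARY.DESCRIPTION), data = train,
            family = "binomial")
print(length(coef(fit2)))
```

One level is left out as the reference category in both fits, which is why two dummies are enough for three crime types.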
