Do I need Indicators for Regression with Categorical Variables? - r

It is always said that we need to create predictor (indicator) variables for categorical values in order to run the regression. I made a test by first creating a predictor column coded 1, 2, 3, ... for a five-level categorical variable. Then I ran the same model without the predictor column, but with the categorical column itself.
In conclusion, the coefficients are different; however, their relative importance and effect on the y-value are the same. Moreover, the R-squared and p-value numbers are exactly the same in these two cases. So, do I have to create the predictor variable, or is R smart enough to do it automatically?
for(i in 1:74)
{
if(travel$accommodation[i] == "Hotel")
{
travel$pred_hotel[i] <- 1
}
if(travel$accommodation[i] == "Airbnb")
{
travel$pred_hotel[i] <- 2
}
if(travel$accommodation[i] == "Hostel")
{
travel$pred_hotel[i] <- 3
}
if(travel$accommodation[i] == "With friend/family")
{
travel$pred_hotel[i] <- 4
}
if(travel$accommodation[i] == "Other")
{
travel$pred_hotel[i] <- 5
}
}
travel$pred_hotel <- as.factor(travel$pred_hotel)
Then:
msf <- lm(ticket_events_money ~ museum_fee + nationality +
ticket_events_frequency + accommodation + line + activity_1 +
locals + vacation_days, data = travel[-1, ])
mm <- lm(ticket_events_money ~ museum_fee + nationality +
ticket_events_frequency + pred_hotel + line + activity_1 +
locals + vacation_days, data = travel[-1, ])
summary(msf)
summary(mm)

The point is that your original accommodation column is a character column, while your new variable pred_hotel is a factor. The function lm automatically converts a character covariate into a factor. In your test, the only difference is in the factor levels (their labels and order, and therefore which level serves as the reference, which is why the coefficients differ); all the rest is the same. If you want to see a difference, remove the as.factor line.
Another common pitfall is shown in the following minimal, reproducible example.
dat <- data.frame(y = rnorm(20), x = rep(letters[1:2], 10), stringsAsFactors = FALSE)
m1 <- lm(y ~ x, dat)
dat$x[dat$x == 'a'] <- 1
dat$x[dat$x == 'b'] <- 2
class(dat$x) # still a character column!!
m2 <- lm(y ~ x, dat)
But you will see a difference if you use a real numeric column:
dat$x <- as.numeric(dat$x)
m3 <- lm(y ~ x, dat)
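To see what lm does with each column type, it can help to compare the fits side by side. The following is a minimal sketch on simulated data, not the travel data from the question; the three-level predictor, the seed, and the object names (dat3, fit_chr, fit_fac, fit_num) are made up for illustration.

```r
set.seed(42)
# Simulated data: a character predictor with three levels
dat3 <- data.frame(y = rnorm(30), x = rep(c("a", "b", "c"), 10),
                   stringsAsFactors = FALSE)

fit_chr <- lm(y ~ x, data = dat3)          # character column, converted by lm
fit_fac <- lm(y ~ factor(x), data = dat3)  # explicit factor

# Inspect the indicator (dummy) columns that lm built on its own
colnames(model.matrix(fit_chr))

# The fitted values agree whether or not we convert to a factor first
all.equal(unname(fitted(fit_chr)), unname(fitted(fit_fac)))

# A truly numeric coding fits one slope instead of one coefficient per level
dat3$x_num <- as.numeric(factor(dat3$x))
fit_num <- lm(y ~ x_num, data = dat3)
length(coef(fit_chr))  # intercept plus two indicators
length(coef(fit_num))  # intercept plus one slope
```

With only two levels the numeric and factor codings give the same fitted values, so a third level is used here to make the difference visible.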

Related

How can I loop a list of models to get slope estimate

I have a list of models as specified by the following code:
varlist <- list("PRS_Kunkle", "PRS_Kunkle_e07",
"PRS_Kunkle_e06","PRS_Kunkle_e05", "PRS_Kunkle_e04",
"PRS_Kunkle_e03", "PRS_Kunkle_e02", "PRS_Kunkle_e01",
"PRS_Kunkle_e00", "PRS_Jansen", "PRS_deroja_KANSL")
PRS_age_pacc3 <- lapply(varlist, function(x) {
lmer(substitute(z_pacc3_ds ~ i*AgeAtVisit + i*I(AgeAtVisit^2) +
APOE_score + gender + EdYears_Coded_Max20 +
VisNo + famhist + X1 + X2 + X3 + X4 + X5 +
(1 |family/DBID),
list(i=as.name(x))), data = WRAP_all, REML = FALSE)
})
I want to obtain the slope of PRS at different age points in each of the models. How can I write code to achieve this goal? Without loop, the raw code should be:
test_stat1 <- simple_slopes(PRS_age_pacc3[[1]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat2 <- simple_slopes(PRS_age_pacc3[[2]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat3 <- simple_slopes(PRS_age_pacc3[[3]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat4 <- simple_slopes(PRS_age_pacc3[[4]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat5 <- simple_slopes(PRS_age_pacc3[[5]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat6 <- simple_slopes(PRS_age_pacc3[[6]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat7 <- simple_slopes(PRS_age_pacc3[[7]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat8 <- simple_slopes(PRS_age_pacc3[[8]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat9 <- simple_slopes(PRS_age_pacc3[[9]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat10 <- simple_slopes(PRS_age_pacc3[[10]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat11 <- simple_slopes(PRS_age_pacc3[[11]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
library(lme4)
library(reghelper)
set.seed(101)
## add an additional factor variable so we can use it for an interaction
sleepstudy$foo <- factor(sample(LETTERS[1:3], size = nrow(sleepstudy),
replace = TRUE))
m1 <- lmer(Reaction ~ Days*foo + I(Days^2)*foo + (1|Subject), data = sleepstudy)
s1 <- simple_slopes(m1, levels=list(Days = c(5, 10, 15)))
Looking at these results, s1 is a data frame with 6 rows (number of levels of foo × number of Days values specified) and 5 columns (Days, foo, estimate, std error, t value).
The simplest way to do this:
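## model_list is the list of fitted models (PRS_age_pacc3 in the question)
res <- list()
for (i in seq_along(varlist)) {
  ## adjust the levels argument as needed
  res[[i]] <- simple_slopes(model_list[[i]],
                            levels = list(AgeAtVisit = c(55, 60, 65, 70, 75, 80)))
}
## identifier column: repeat each model name once per row of its result
## (computed before collapsing, while res is still a list)
model <- rep(unlist(varlist), times = vapply(res, nrow, integer(1)))
## collapse elements to a single data frame and add the identifier
res_final <- data.frame(model = model, do.call("rbind", res))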
If you want to be fancier, you could replace the for loop with an appropriate lapply. If you want to be even fancier than that:
library(tidyverse)
(model_list
%>% setNames(varlist)
## map_dfr runs the function on each element, collapses results to
## a single data frame. `.id="model"` adds the names of the list elements
## (set in the previous step) as a `model` column
%>% purrr::map_dfr(simple_slopes,
levels = list(AgeAtVisit = c(55, 60, 65, 70, 75, 80)), ## adjust as needed
.id = "model")
)
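For the lapply variant mentioned above, here is a minimal sketch of the same pattern in base R. To keep it self-contained it uses stand-ins: plain lm() fits on the built-in mtcars data instead of the lmer() models, and coef(summary()) instead of simple_slopes(). The names demo_vars, demo_models and demo_res are hypothetical.

```r
# Stand-in predictors and models (not the PRS models from the question)
demo_vars <- c("wt", "hp", "disp")
demo_models <- lapply(demo_vars, function(v) {
  lm(reformulate(v, response = "mpg"), data = mtcars)
})
names(demo_models) <- demo_vars

# One data frame per model, each tagged with the model's name
demo_res <- lapply(names(demo_models), function(nm) {
  tab <- as.data.frame(coef(summary(demo_models[[nm]])))
  data.frame(model = nm, term = rownames(tab), tab,
             row.names = NULL, check.names = FALSE)
})

# Collapse to a single data frame
demo_final <- do.call(rbind, demo_res)
demo_final
```

Tagging each piece before binding avoids having to reconstruct the identifier column afterwards.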
By the way, I would be very careful with simple_slopes when you have a quadratic term in the model as well. The slopes calculated will (presumably) apply only in the case where any other continuous variables in the model are zero. You might want to center your variables as in Schielzeth 2010 Methods in Ecology and Evolution ("Simple means to improve ...")
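As a rough illustration of why centering matters when a quadratic term is present, the sketch below uses simulated data (the variable names, coefficients, and seed are invented, and it uses lm rather than lmer). In a model y ~ age + age^2 the slope at a given age is b1 + 2 * b2 * age, so the raw linear coefficient is the slope at age 0; after centering, the linear coefficient is the slope at the mean age.

```r
set.seed(1)
# Simulated data with a quadratic age effect
age <- runif(200, 55, 80)
y <- 2 + 0.5 * age - 0.004 * age^2 + rnorm(200, sd = 0.1)

fit_raw <- lm(y ~ age + I(age^2))

# Same model with age centered at its mean
age_c <- age - mean(age)
fit_ctr <- lm(y ~ age_c + I(age_c^2))

# Slope of the raw fit evaluated at the mean age: b1 + 2 * b2 * mean(age)
b <- coef(fit_raw)
slope_at_mean <- unname(b["age"] + 2 * b["I(age^2)"] * mean(age))

c(raw_linear_coef      = unname(b["age"]),              # slope at age = 0
  slope_at_mean        = slope_at_mean,
  centered_linear_coef = unname(coef(fit_ctr)["age_c"]))
```

The last two numbers should agree up to numerical error, while the raw linear coefficient describes a slope far outside the observed age range.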

Extracting and exchanging terms within an R formula object

I would like to create a graphing function in R which takes a formula as an argument, e.g.:
my.plot(sqrt(Sepal.Width) ~ Petal.Width + log(Petal.Length) + Species + Petal.Width * Petal.Length, .data = iris)
And then:
1) Perform a model fit with the first predictor term exchanged for another vector created within the function.
2) Use the outcome term and the first predictor term for an overlying plot.
3) Allow interaction and crossing terms, and use of the . symbol denoting all other variables in the data frame.
4) Handle the case where only 1 predictor term is provided - e.g. Sepal.Width ~ Petal.Width.
R pseudocode for a highly simplified example:
library("formula.tools")
my.plot <- function(.formula, .data) {
outcome.term <- lhs(.formula)
first.predictor.term <- rhs(.formula)[1]
new.formula <- outcome.term ~ 1:nrow(.data) + rhs(.formula)[-1]
my.fit <- lm(new.formula, data = .data)
my.predict <- predict(my.fit)
plot(first.predictor.term, outcome.term, data = .data)
lines(first.predictor.term, my.predict, data = .data)
}
You could accomplish the same using Base R:
my.plot <- function(.formula, .data) {
outcome.term <- deparse(.formula[[2]])
first.predictor.term <- .formula[[3]]
len <- length(first.predictor.term) > 1
if (len) first.predictor.term <- .formula[[3]][[2]]
if (len) .formula[[3]][[2]] <- substitute(new_variable)
else .formula[[3]] <- substitute(new_variable)
.data['new_variable'] <- 1:nrow(.data)
my.fit <- lm(.formula, data = .data)
my.predict <- predict(my.fit)
f <- reformulate(deparse(first.predictor.term), outcome.term)
plot(f, data = .data, ty = "p")
}
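Indexing the formula by position picks up the first term reliably only for simple right-hand sides. An alternative sketch, assuming it is acceptable to work with expanded term labels, uses terms() to expand `.` and `*` first. The helper name swap_first_term and the column name new_variable are hypothetical, and plotting is left out so the example stays short.

```r
swap_first_term <- function(.formula, .data, new_name = "new_variable") {
  # terms() expands `.` and `*` into individual term labels
  labs <- attr(terms(.formula, data = .data), "term.labels")
  outcome <- deparse(.formula[[2]])
  # Replace the first label, keep the rest unchanged
  rhs <- c(new_name, labs[-1])
  list(first = labs[1],
       formula = as.formula(paste(outcome, "~", paste(rhs, collapse = " + "))))
}

out <- swap_first_term(
  sqrt(Sepal.Width) ~ Petal.Width + log(Petal.Length) + Species, iris)
out$first    # label of the term that was swapped out
out$formula  # formula with new_variable in its place

# Fit with the replacement vector added to a copy of the data
iris2 <- transform(iris, new_variable = seq_len(nrow(iris)))
fit <- lm(out$formula, data = iris2)

# Also works with `.` and with a single predictor
swap_first_term(Sepal.Width ~ ., iris)$first
swap_first_term(Sepal.Width ~ Petal.Width, iris)$formula
```

Note that an interaction term which contains the first predictor (for example Petal.Width:Petal.Length) is kept as is in this sketch, so it still refers to the original variable.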

For loop not storing the results in for-purpose data frame columns

I have created the for loop below to deal with a prediction process on panel data. While each procedure on the data works like a charm, the storage of the predictions (last step) is unsuccessful. Due to incompetence on my part (fairly new to for loops), the for loop is not replacing the NAs in the data frame columns created for storage with the numeric predictions. What am I doing wrong?
There are 17 prefectures, each having 61 observations. Therefore, for each, I get 61 predictions.
Data set used: https://www.dropbox.com/scl/fi/v2xk34ac58h2kk7uxunat/dt1.xlsx?dl=0&rlkey=gf2e15z4gtuu83lxalzn91rai
#Data prep for modeling and predictions
mydata$...1 <- NULL #remove useless column
mydata$month_year <- as.factor(mydata$month_year) #time fixed-effects
mydata$ncve_relax_lag <- as.numeric(mydata$ncve_relax_lag) #make numeric
mydata$ncve_strict_lag <- as.numeric(mydata$ncve_strict_lag)
mydata <- mydata %>% drop_na()
mydata$population <- mydata$population/10000 #scaling
mydata$area <- mydata$area/10000 #scaling
mydata$no_troops <- mydata$no_troops/1000 #scaling
#Create data frame columns to store predictions
mydata$nbpred.core <- NA
mydata$nbpred.lit <- NA
mydata$nbpred.base <- NA
#Model fitting and predictions
runPredictions <- function(){
for(i in unique(mydata$prefecture)){
print(i)
#Define training and test sets
sptllearningSet <- mydata[mydata$prefecture != i,]
sptltestSet <- mydata[mydata$prefecture == i,]
#Train model
sptlnb_base <- glm.nb(ncve_relax ~ population +
capdist +
month_year,
data = sptllearningSet,
control = glm.control(maxit = 3000))
sptlnb_lit <- glm.nb(ncve_relax ~ population +
capdist +
multidim.poverty +
eth_frc_t13 +
eth_plr_t13 +
sp_lag_relax +
ncve_relax_lag +
month_year,
data = sptllearningSet,
control = glm.control(maxit = 3000))
sptlnb_core <- glm.nb(ncve_relax ~ population +
capdist +
multidim.poverty +
eth_frc_t13 +
eth_plr_t13 +
sp_lag_relax +
ncve_relax_lag +
no_troops +
unpol.dummy +
area +
ruggedness +
month_year,
data = sptllearningSet,
control = glm.control(maxit = 3000))
#Use coefficients to predict on test
mydata$nbpred.core[mydata$prefecture == i] = as.numeric(predict(sptlnb_core, newdata = mydata[mydata$prefecture == i,], type='response'))
mydata$nbpred.lit[mydata$prefecture == i] = as.numeric(predict(sptlnb_lit, newdata = mydata[mydata$prefecture == i,], type='response'))
mydata$nbpred.base[mydata$prefecture == i] = as.numeric(predict(sptlnb_base, newdata = mydata[mydata$prefecture == i,], type='response'))
}
}
Thank you for the help!
EDIT: I've added the initial part of my code to ensure it is fully reproducible.
You are dealing with a scoping issue. When you run your for loop inside the function, the predictions are assigned to a copy of mydata that is local to the function; the data frame in the global environment is not modified.
The most straightforward way to handle this is to pull the for loop out of the function: remove the runPredictions <- function(){ ... } wrapper (keeping the loop itself) and it should work fine.
Alternatively, you could have the function return the modified data frame and assign the result, force the function to assign to the global environment, or apply the individual steps across the prefectures (e.g. using pmap).
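The scoping behaviour can be reproduced on a toy data frame. This is a minimal sketch with invented names (toy, fill_local, fill_return) and a group mean standing in for the model predictions:

```r
toy <- data.frame(group = rep(c("a", "b"), each = 3),
                  x = 1:6,
                  pred = NA_real_)

# Modifies a local copy of `toy`; the global data frame is untouched
fill_local <- function() {
  for (g in unique(toy$group)) {
    toy$pred[toy$group == g] <- mean(toy$x[toy$group == g])
  }
}
fill_local()
anyNA(toy$pred)  # still TRUE: the NAs were only replaced inside the function

# Takes the data as an argument and returns the modified copy
fill_return <- function(d) {
  for (g in unique(d$group)) {
    d$pred[d$group == g] <- mean(d$x[d$group == g])
  }
  d
}
toy <- fill_return(toy)
toy$pred
```

Returning the data frame and reassigning it keeps the function free of side effects, which is usually easier to debug than assigning into the global environment.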

R dynamic data summary frequency with condition, map (n-1) variables to one

I found a function that provides frequencies with condition and I thought of creating a function
do.call(data.frame, aggregate(X1 ~ X2, data=dat, FUN=table))
I also managed to get the column names by their index number from this thread using name <- names(dataset)[index].
I want to get the frequency of Xn ~ Xstatic, where Xn are the n-1 variables and Xstatic is the variable of interest.
So far I made a for loop and here is my code:
library(prodlim)
NUM <- 100
dat1 <- SimSurv(NUM)
dat1$time <- sample(24:160,NUM,rep=TRUE)
dat1$X3 <- sample(0:1,NUM,rep=TRUE)
dat1$X4 <- sample(0:9,NUM,rep=TRUE)
dat1$X5 <- sample(c("a","b","c"),NUM,rep=TRUE)
dat1$X6 <- sample(c("was","que","koa","sim","sol"),NUM,rep=TRUE)
dat1$X7 <- sample(1:99,NUM,rep=TRUE)
dat1$X8 <- sample(1:200,NUM,rep=TRUE)
attach(dat1)
# EXAMPLE
# do.call(data.frame, aggregate(status ~ X6, data=dat1, FUN=table))
for( i in 1:ncol(dat1) ) {
name <- names(dat1)[i]
do.call(data.frame, aggregate(name ~ X6, data=dat1, FUN=table))
}
I get the error below and I am at a loss on how to solve this. All help is appreciated.
Error in model.frame.default(formula = name ~ X6, data = dat1) :
variable lengths differ (found for 'X6')
1) I would suggest not using attach;
2) it is not meaningful to make a frequency table of your variable of interest against some of these other variables, for instance the continuous ones, or the ones sampled from 99 and 200 possible values;
3) why would you want to combine your results into a data frame? Just print them or save them to a list:
mylist <- list()
for ( i in c('status','X2','X3','X4','X5','X7','X8') ) {
mylist[i] <- list(table(dat1[ ,i], dat1$X6))
}
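As for the error itself: in name ~ X6 the formula looks up a variable called name, finds the length-one character string, and compares its length with X6. One way around this is to build the formula from the string. The sketch below uses a small simulated data frame (the object names dat and tabs are illustrative, and xtabs is used here instead of aggregate):

```r
set.seed(1)
n <- 50
dat <- data.frame(status = sample(0:1, n, TRUE),
                  X3 = sample(0:1, n, TRUE),
                  X5 = sample(c("a", "b", "c"), n, TRUE),
                  X6 = sample(c("was", "que", "koa"), n, TRUE),
                  stringsAsFactors = FALSE)

tabs <- list()
for (name in setdiff(names(dat), "X6")) {
  # reformulate() turns the column names into a formula, e.g. ~status + X6
  f <- reformulate(c(name, "X6"))
  tabs[[name]] <- xtabs(f, data = dat)
}

tabs$status  # cross-tabulation of status against X6
```

Each element of tabs is a contingency table named after the column it was built from.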

R - interaction with only one factor level in regression

In a regression model is it possible to include an interaction with only one dummy variable of a factor? For example, suppose I have:
x: numerical vector taking 3 values (1, 2 and 3)
y: response variable
z: numerical vector
Is it possible to build a model like:
y ~ factor(x) + factor(x) : z
but only include the interaction with one level of X? I realize that I could create a separate dummy variable for each level of x, but I would like to simplify things if possible.
Really appreciate any input!!
One key point you're missing is that when you see a significant effect for something like x2:z, that doesn't mean that x interacts with z when x == 2, it means that the difference between x == 2 and x == 1 (or whatever your reference level is) interacts with z. It's not a level of x that is interacting with z, it's one of the contrasts that has been set for x.
So for a 3 level factor with default treatment contrasts:
df <- data.frame(x = sample(1:3, 10, TRUE), y = rnorm(10), z = rnorm(10))
df$x <- factor(df$x)
contrasts(df$x)
2 3
1 0 0
2 1 0
3 0 1
If you really think that only the first contrast is important, you can create a new variable that compares x == 2 to x == 1, and ignores x == 3:
df$x_1vs2 <- NA
df$x_1vs2[df$x == 1] <- 0
df$x_1vs2[df$x == 2] <- 1
df$x_1vs2[df$x == 3] <- NA
And then run your regression using that:
lm(y ~ x_1vs2 + x_1vs2:z, data = df)
X <- data.frame(x = sample(1:3, 10, TRUE), y = rnorm(10), z = rnorm(10))
lm(y ~ factor(x) + factor(x):z, data=X)
Is this what you want?
Something like this may be what you need:
y~factor(x)+factor(x=='SomeLevel'):z
If x is already coded as a factor in your data, something like
y ~ x + I(x=='some_level'):z
Or if x is of numeric type in your data frame, then
y ~ as.factor(x) + I(as.factor(x)=='some_level'):z
Or, to model only some subset of the data, try the following (within a single level of x the factor terms drop out, so only z remains as a predictor):
lm(y ~ z, data = subset(df, x == 'some_level'))
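Here is a small runnable sketch of the single-level interaction idea on simulated data (the names dat and fit_one, the level "2", and the true slope of 2 are assumptions made for the example). It uses a numeric indicator multiplied by z, which adds exactly one extra slope column. A logical indicator such as I(x == "2") is handled like a two-level factor by the model matrix code, so it is worth inspecting the design matrix to see which columns that form actually produces.

```r
set.seed(1)
dat <- data.frame(x = factor(sample(1:3, 60, TRUE)), z = rnorm(60))
# z only has an effect (slope 2) when x == "2"
dat$y <- 1 + 2 * (dat$x == "2") * dat$z + rnorm(60, sd = 0.1)

# Numeric indicator times z: one slope, applying only to level "2"
fit_one <- lm(y ~ x + I((x == "2") * z), data = dat)
names(coef(fit_one))
coef(fit_one)

# Compare with the columns generated by the logical-indicator form
colnames(model.matrix(~ x + I(x == "2"):z, data = dat))
```

In fit_one the slope of z is constrained to zero for the other levels, which is a modelling assumption rather than something the data guarantee.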
