recipes package cannot create interaction term in step_interact

I'm using a medical insurance data set to hone my modeling skills that looks like this:
> insur_dt
       age    sex    bmi children smoker    region   charges
   1:   19 female 27.900        0    yes southwest 16884.924
   2:   18   male 33.770        1     no southeast  1725.552
   3:   28   male 33.000        3     no southeast  4449.462
   4:   33   male 22.705        0     no northwest 21984.471
   5:   32   male 28.880        0     no northwest  3866.855
  ---
1334:   50   male 30.970        3     no northwest 10600.548
1335:   18 female 31.920        0     no northeast  2205.981
1336:   18 female 36.850        0     no southeast  1629.833
1337:   21 female 25.800        0     no southwest  2007.945
1338:   61 female 29.070        0    yes northwest 29141.360
I'm using recipes as part of the tidymodels meta-package to prepare my data for use in a model, and I have determined that bmi, age, and smoker form an interaction term.
insur_split <- initial_split(insur_dt)
insur_train <- training(insur_split)
insur_test <- testing(insur_split)
# we are going to do data processing and feature engineering with recipes
# below, we are going to predict charges using age, bmi, and smoker
insur_rec <- recipe(charges ~ age + bmi + smoker, data = insur_train) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_numeric()) %>%
  step_normalize(all_numeric()) %>%
  step_interact(~ bmi:smoker:age) %>%
  prep()
Per the tidymodels guide/documentation, I have to specify the interaction as a step in the recipe as step_interact. However, I am getting an error when I attempt to do so:
> insur_rec <- recipe(charges ~ age + bmi + smoker, data = insur_train) %>%
+ step_dummy(all_nominal()) %>%
+ step_zv(all_numeric()) %>%
+ step_normalize(all_numeric()) %>%
+ step_interact(~ bmi:smoker:age) %>%
+ prep()
Interaction specification failed for: ~bmi:smoker:age. No interactions will be created.
partial match of 'object' to 'objects'
I am new to modeling and am not quite sure why I am getting this error. I am simply trying to state that charges is explained by all other predictors, and that smoker (a yes/no factor), age (numeric), and bmi (double) all interact with each other to inform the outcome. What am I doing wrong?

From the documentation:
step_interact can create interactions between variables. It is primarily intended for numeric data; categorical variables should probably be converted to dummy variables using step_dummy() prior to being used for interactions.
step_dummy(all_nominal()) turned the variable smoker into smoker_yes. Below, you'll see that I just changed the name of smoker in the interaction term to smoker_yes.
insur_rec <- recipe(charges ~ bmi + age + smoker, data = insur_train) %>%
  step_dummy(all_nominal()) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%
  step_interact(terms = ~ bmi:age:smoker_yes) %>%
  prep(verbose = TRUE, log_changes = TRUE)
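Once prep() succeeds, you can confirm the interaction column was actually created. By default, step_interact() names new columns by joining the inputs with "_x_", so a quick check (a sketch; juice() returns the processed training set) is:

juice(insur_rec) %>% names()
# expect "bmi_x_age_x_smoker_yes" among the column names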

Related

Using the survey package to find SE's and crosstabulations

I am using the survey package by Thomas Lumley to create cross tabs and SE's. I am struggling to specify the denominator of the cross tabulation.
This is my data:
library(survey)
library(readr)  # read_table2() comes from readr
data <- read_table2("Q50_1 Q50_2 Q38 Q90 pov gender wgt id
yes 3 Yes NA High M 1.3 A
NA 4 No 2 Med F 0.4 B
no 2 NA 4 Low F 1.2 C
maybe 3 No 2 High M 0.5 D
yes NA No NA High M 0.7 E
no 2 Yes 3 Low F 0.56 F
maybe 4 Yes 2 Med F 0.9 G")
Create the design object:
design <- svydesign(id = ~id,
                    weights = ~wgt,
                    nest = FALSE,
                    data = data)
To find the cross tabulation of Q50_1 by Female:
svymean(~interaction(Q50_1,gender=="F"), design, na.rm = T)
This gives me:
                                                 mean     SE
interaction(Q50_1, gender == "F")maybe.FALSE 0.096899 0.1043
interaction(Q50_1, gender == "F")no.FALSE    0.000000 0.0000
interaction(Q50_1, gender == "F")yes.FALSE   0.387597 0.2331
interaction(Q50_1, gender == "F")maybe.TRUE  0.174419 0.1725
interaction(Q50_1, gender == "F")no.TRUE     0.341085 0.2233
interaction(Q50_1, gender == "F")yes.TRUE    0.000000 0.0000
This is not as useful to me because the denominator includes TRUE FALSE values for every combination, whereas I am only interested in the mean that is true. So, I could easily find the percentage of TRUE as follows:
library(dplyr)    # mutate(), filter(), group_by()
library(stringr)  # str_replace()

dat <- as.data.frame(svymean(~interaction(Q50_1, gender == "F"), design, na.rm = T)) %>%
  tibble::rownames_to_column("question")
dat %>%
  tidyr::separate(question, c("question", "response"), sep = "\\)", extra = "merge") %>%
  mutate(question = str_replace(question, "interaction\\(", " ")) %>%
  tidyr::separate(response, c("value", "bool"), sep = "\\.") %>%
  tidyr::separate(question, c("question", "group"), sep = "\\,") %>%
  tidyr::separate(group, c("group_level", "group"), sep = "\\==") %>%
  filter(bool == "TRUE") %>%
  group_by(question, group_level, group) %>%
  mutate(sum_true = sum(mean)) %>%
  mutate(mean = mean / sum_true)
This gives me:
question group_level group    value bool   mean    SE sum_true
<chr>    <chr>       <chr>    <chr> <chr> <dbl> <dbl>    <dbl>
" Q50_1" " gender "  " \"F\"" maybe TRUE  0.338 0.173    0.516
" Q50_1" " gender "  " \"F\"" no    TRUE  0.662 0.223    0.516
" Q50_1" " gender "  " \"F\"" yes   TRUE  0     0        0.516
The means are exactly what I want, but the SEs are associated with a different denominator and don't represent the manipulated mean. Is there a way to call the svymean to present the mean and SE of ONLY the TRUE values in the denominator?
I thought something like this might do (but it does not work):
svymean(~interaction(Q50_1,gender=="F"[TRUE]), design, na.rm = T)
My desired outcome (the SE's are fake):
                                                   mean     SE
interaction(Q50_1, gender == "F"[TRUE])maybe.TRUE 0.338 0.0725
interaction(Q50_1, gender == "F"[TRUE])no.TRUE    0.662 0.0233
interaction(Q50_1, gender == "F"[TRUE])yes.TRUE   0.000 0.0000
To get the percentage of women who gave each response, you want
svymean(~Q50_1, subset(design, gender== "F"),na.rm=TRUE)
or equivalently (because that's how svyby does it)
svyby(~Q50_1, ~gender, design, svymean, na.rm = TRUE)
If you want to get the empty category as well, you need to convert Q50_1 to a factor -- that's the point of factors (vs strings): they know what levels they have.
If you want to be able to extract parts of the output programmatically, use the coef and SE functions:
data$Q50_1<-factor(data$Q50_1)
design <- svydesign(id =~id,
weights = ~wgt,
nest = FALSE,
data = data)
svymean(~Q50_1, subset(design, gender== "F"),na.rm=TRUE)
svyby(~Q50_1, ~gender, design, svymean, na.rm = TRUE)[1,]
coef(svyby(~Q50_1, ~gender, design, svymean, na.rm = TRUE))
SE(svyby(~Q50_1, ~gender, design, svymean, na.rm = TRUE))
These don't agree with what you got using ~interaction, because what you got that way doesn't match what you said you want. The interaction analysis gives you the percentage of people who are women and also responded yes, not the percentage of yes responses among women. To put it another way, the 6 percentages you get with the interaction analysis add to 100%, not to 200%.
> sum(coef(svymean(~interaction(Q50_1,gender=="F"), design, na.rm = T)))
[1] 1
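A quick sanity check (a sketch): the within-women percentages from the subset analysis also sum to 1, but across only the three response categories:

sum(coef(svymean(~Q50_1, subset(design, gender == "F"), na.rm = TRUE)))
# [1] 1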

Weighted survey functions by group

I have a weighted survey dataset that involves age groups, incomes and expenditure. I want to find the average of spending within age groups and within income deciles.
So for example
DF:
Age   Income Spending1 Spending2 Weight
45-49   1000        50        35    100
30-39   2000        40        60    150
40-44   3434        30        55    120
Currently I have coded this:
library(grattan)    # weighted_ntile() comes from the grattan package
library(tidyverse)  # dplyr verbs plus purrr::reduce()

DF$hhdecile <- weighted_ntile(DF$Income, weights = DF$Weight, 5)
Result1 <- DF %>% group_by(Age, hhdecile) %>% dplyr::summarise(mean.exp = weighted.mean(x = Spending1, w = Weight))
Result2 <- DF %>% group_by(Age, hhdecile) %>% dplyr::summarise(mean.exp = weighted.mean(x = Spending2, w = Weight))
df.list <- list(Result1 = Result1,
                Result2 = Result2)
names(df.list$Result1)[names(df.list$Result1) == "mean.exp"] <- "Result1"
ResultJoined <- df.list %>% reduce(full_join, by = c('Age', 'hhdecile'))
That finds the quintile of people compared to the population of all ages, and I'm interested in their quintile compared to their age group.
Is there a way to use group_by or similar to perform the weighted percentile function on each age group individually?
(there are actually 15 categories of spending)
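One way to do this (a sketch, assuming weighted_ntile() from the grattan package and the DF columns shown above) is to compute the quantile inside each age group with group_by() + mutate(), and to handle all 15 spending categories at once with across():

library(dplyr)
library(grattan)

DF <- DF %>%
  group_by(Age) %>%  # quintiles computed within each age group
  mutate(hhdecile = weighted_ntile(Income, weights = Weight, 5)) %>%
  ungroup()

DF %>%
  group_by(Age, hhdecile) %>%  # one weighted mean per spending column
  summarise(across(starts_with("Spending"),
                   ~ weighted.mean(.x, w = Weight)),
            .groups = "drop")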

Regression by Groups [duplicate]

This question already has answers here:
Linear Regression and group by in R
(10 answers)
Closed 4 years ago.
I have a table:
CityData ->
City        Price Bathrooms Bedrooms Porch
Milwaukee    2300         2        3 yes
Chicago      3400         3        2 yes
Springfield  2300         1        1 no
Chicago      2390         2        1 yes
I would like to run a regression for each city (multiple rows per city) to give me coefficients for each city. I want to regress price on the other confounding variables (bathrooms, bedrooms, porch).
I tried the dplyr library:
library(dplyr)
fitted_models = CityData %>%
  group_by(CityData$City) %>%
  do(model = lm(CityData$Price ~ CityData$Bathrooms +
                  CityData$Porch + CityData$Bedrooms, data = CityData))
But the output is just
14 lm list
14 lm list
14 lm list
Any suggestions?
You might try something like this. Here I'll use the mtcars data as an example.
df <- mtcars
models <- df %>% group_by(cyl) %>% summarise(mod = list(lm(mpg ~ wt)))
This will give you a new variable mod that holds all the info for your model. You can call the coefficients like:
models$mod[[1]]$coefficients
(Intercept) wt
39.571196 -5.647025
You can get more complex with it too.
models <- df %>% group_by(cyl) %>% summarise(mod = list(lm(mpg ~ wt + hp)))
models$mod[[1]]$coefficients
(Intercept) wt hp
45.83607319 -5.11506233 -0.09052672
Of course, models will still also hold the info for the group
models$cyl
[1] 4 6 8
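If you'd rather have a tidy coefficient table than a list column, one option (a sketch, assuming the broom package) is:

library(dplyr)
library(broom)

mtcars %>%
  group_by(cyl) %>%
  group_modify(~ tidy(lm(mpg ~ wt, data = .x))) %>%  # one row per coefficient per group
  ungroup()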

Paired bar chart with conditional labeling based on multiple factors

I am trying to create a graphical output like the picture below for the following sample of data but the code I have included gives an error:
Error in mutate_impl(.data, dots) : Evaluation error: Column n must be length 43 (the number of rows) or one, not 42.
My goal is to plot all providers from the same location on the same chart, and then include only one name on the axis, so that each provider can see how they compare to others in their area without revealing the identity of the other providers. I have tried specifying that n = 43 (the length of the full dataset) but have not had any success. Additionally, I would like to make a paired bar chart to show how each provider compares to their previous month's rates.
Provider Month Payment Location
Andrew       2   32.62 OH
Dillard      2   40    OH
Henry        2   32.28 OH
Lewis        2   47.79 IL
Marcus       2   73.04 IL
Matthews     2   45.22 NY
Paul         2   65.73 NY
Reed         2   27.67 NY
Andrew       1   33.23 OH
Dillard      1   36.63 OH
Henry        1   42.68 OH
Lewis        1   71.45 IL
Marcus       1   39.51 IL
Matthews     1   59.11 NY
Paul         1   27.67 NY
Reed         1   28.78 NY
library(tidyverse)
library(purrr)
df <- 1:nrow(PaymentsFeb) %>%
  purrr::map(~PaymentsFeb) %>%
  set_names(PaymentsFeb$Provider) %>%
  bind_rows(.id = "ID") %>%
  nest(-ID) %>%
  mutate(Location = map2(data, ID, ~.x %>% filter(Provider == .y) %>% select(Location))) %>%
  mutate(data =
           map2(data, ID, ~.x %>%
                  mutate(n = paste0("#", sample(seq_len(n()), size = n())),
                         Provider = ifelse(Provider == .y, as.character(Provider), n),
                         Provider = factor(Provider, levels = c(.y, sample(n, n())))))) %>%
  mutate(plots = map2(data, Location,
                      ~ggplot(data = .x, aes(x = Provider, y = scores, fill = scores)) +
                        geom_col() +
                        geom_text(aes(label = Per.Visit.Bill.Rate), vjust = -.3) +
                        ggtitle("test scores by Location- February 2018", subtitle = .y$Location)
  ))
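One possible approach (a sketch using only the sample columns above; `payments` and the anonymize() helper are hypothetical names, not from the original post) is to relabel the peers once per location and dodge the bars by month:

library(tidyverse)

# hypothetical helper: relabel every provider except `who` as "#1", "#2", ...
anonymize <- function(df, who) {
  peers <- setdiff(unique(df$Provider), who)
  lookup <- setNames(paste0("#", seq_along(peers)), peers)
  df %>% mutate(Provider = ifelse(Provider == who, who,
                                  lookup[as.character(Provider)]))
}

payments %>%                      # `payments` holds the sample data above
  filter(Location == "OH") %>%
  anonymize("Andrew") %>%
  ggplot(aes(x = Provider, y = Payment, fill = factor(Month))) +
  geom_col(position = "dodge") +  # paired bars: one per month
  geom_text(aes(label = Payment),
            position = position_dodge(width = 0.9), vjust = -0.3) +
  labs(fill = "Month", title = "Payment rates by provider - OH")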

The intercept of a categorical multiple regression in R is not the mean value?

Let's say I have 2 (categorical) variables and one continuous:
library(tidyverse)
set.seed(123)
ds <- data.frame(
depression=rnorm(90,10,2),
schooling_dummy=c(0,1,2),
sex_dummy=c(0,1)
)
When I regress depression on sex (0 or 1), the intercept is 10.0436, which is the mean for sex = 0. OK!
ds %>% group_by(sex_dummy) %>%
+ summarise(formatC(mean(depression),format="f", digits=4))
# A tibble: 2 x 2
  sex_dummy `formatC(mean(depression), format = "f", digits = 4)`
      <dbl> <chr>
1      0    10.0436
2      1.00 10.1640
The same thing happens when I regress depression on schooling. The intercept value is 10.4398. The mean of schooling = 0 is the same.
ds %>% group_by(schooling_dummy) %>%
+ summarise(formatC(mean(depression),format="f", digits=4))
# A tibble: 3 x 2
  schooling_dummy `formatC(mean(depression), format = "f", digits = 4)`
            <dbl> <chr>
1            0    10.4398
2            1.00 9.7122
3            2.00 10.1593
Now, when I compute a model with both variables, why is the intercept not the mean when both groups = 0? The regression intercept is 10.3796, but the mean when sex = 0 and schooling = 0 is 10.32548:
ds %>% group_by(schooling_dummy,sex_dummy) %>%
+ summarise(formatC(mean(depression),format="f", digits=5))
# A tibble: 6 x 3
# Groups: schooling_dummy [?]
  schooling_dummy sex_dummy `formatC(mean(depression), format = "f", digits = 5)`
            <dbl>     <dbl> <chr>
1            0         0    10.32548
2            0         1.00 10.55404
3            1.00      0    9.59305
4            1.00      1.00 9.83139
5            2.00      0    10.21218
6            2.00      1.00 10.10648
When I predict the model when both are 0:
predict(mod3, data.frame(sex_dummy=0, schooling_dummy=0))
1
10.37956
This result is related to depression (of course...) but it is still not what I was expecting, since:
(Reference: https://www.theanalysisfactor.com/interpret-the-intercept/)
The same point is made in this previous forum post.
I am aware that my variables are categorical and I am adjusting my script; you can reproduce everything with the code below:
Thanks
library(tidyverse)
set.seed(123)
ds <- data.frame(
depression=rnorm(90,10,2),
schooling_dummy=c(0,1,2),
sex_dummy=c(0,1)
)
mod <- lm(data=ds, depression ~ relevel(factor(sex_dummy), ref = "0"))
summary(mod)
ds %>% group_by(sex_dummy) %>%
summarise(formatC(mean(depression),format="f", digits=4))
mod2 <- lm(data=ds, depression ~ relevel(factor(schooling_dummy), ref = "0"))
summary(mod2)
ds %>% group_by(schooling_dummy) %>%
summarise(formatC(mean(depression),format="f", digits=4))
mod3 <- lm(data=ds, depression ~ relevel(factor(sex_dummy), ref = "0") +
relevel(factor(schooling_dummy), ref = "0"))
summary(mod3)
ds %>% group_by(schooling_dummy,sex_dummy) %>%
summarise(formatC(mean(depression),format="f", digits=5))
predict(mod3, data.frame(sex_dummy=0, schooling_dummy=0))
Two errors in your thinking (although your R code works, so it's not a programming error).
First and foremost, you violated your own statement: you have not dummy-coded schooling. It does not have only zeroes and ones; it has 0, 1, and 2.
Second, you forgot the interaction effect in your lm model...
Try this...
library(tidyverse)
set.seed(123)
ds <- data.frame(
depression=rnorm(90,10,2),
schooling_dummy=c(0,1,2),
sex_dummy=c(0,1)
)
# if you explicitly make these variables factors not integers R will do the right thing with them
ds$schooling_dummy<-factor(ds$schooling_dummy)
ds$sex_dummy<-factor(ds$sex_dummy)
ds %>% group_by(schooling_dummy,sex_dummy) %>%
summarise(formatC(mean(depression),format="f", digits=5))
# you need an asterisk in your lm model to include the interaction term
lm(depression ~ schooling_dummy * sex_dummy, data = ds)
The results give you the mean(s) you were expecting...
Call:
lm(formula = depression ~ schooling_dummy * sex_dummy, data = ds)

Coefficients:
                (Intercept)             schooling_dummy1             schooling_dummy2
                  10.325482                    -0.732433                    -0.113305
                 sex_dummy1  schooling_dummy1:sex_dummy1  schooling_dummy2:sex_dummy1
                   0.228561                     0.009778                    -0.334254
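These coefficients reproduce the cell means from the grouped summary above once you add the relevant terms; for example, for schooling_dummy == 1 and sex_dummy == 1:

10.325482 - 0.732433 + 0.228561 + 0.009778
# [1] 9.831388  (matches the 9.83139 from group_by()/summarise())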
and FWIW you can avoid this sort of accidental misuse of categorical variables if your data is coded as characters to begin with... so if your data is coded this way:
ds <- data.frame(
depression=rnorm(90,10,2),
schooling=c("A","B","C"),
sex=c("Male","Female")
)
You're less likely to make the same mistake, plus the results are easier to read...
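For instance (a sketch extending the snippet above), the coefficient names then carry the labels directly:

set.seed(123)
ds2 <- data.frame(
  depression = rnorm(90, 10, 2),
  schooling = c("A", "B", "C"),
  sex = c("Male", "Female")
)
coef(lm(depression ~ schooling * sex, data = ds2))
# names like schoolingB, sexMale, and schoolingB:sexMale are self-describing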
