Testing for multicollinearity when there are factors

Is it possible to check for multicollinearity in a model with dummy variables? Assume the following example:
treatment <- factor(rep(c(1, 2), c(43, 41)), levels = c(1, 2), labels = c("placebo", "treated"))
improved <- factor(rep(c(1, 2, 3, 1, 2, 3), c(29, 7, 7, 13, 7, 21)), levels = c(1, 2, 3), labels = c("none", "some", "marked"))
numberofdrugs <- rpois(84, 5)+1
healthvalue <- rpois(84,5)
y <- data.frame(healthvalue,numberofdrugs, treatment, improved)
test <- lm(healthvalue~numberofdrugs+treatment+improved, y)
What am I supposed to do, when I want to check if multicollinearity occurs in such a model?

You can calculate the variance inflation factor (VIF) for your predictors to quantify the amount of multicollinearity. Because treatment and improved are factors, car::vif() reports generalized VIFs (GVIF) together with GVIF^(1/(2*Df)):
library(car)
vif(test)
                  GVIF Df GVIF^(1/(2*Df))
numberofdrugs 1.035653  1        1.017670
treatment     1.224984  1        1.106790
improved      1.193003  2         1.04510
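As a rough rule of thumb (an interpretation aid, not a hard cutoff): for terms with more than one degree of freedom, GVIF^(1/(2*Df)) is on the scale of a square-rooted VIF, so you can square it before applying whatever VIF threshold you normally use, for example:
# Square GVIF^(1/(2*Df)) to put it on the ordinary VIF scale, then flag
# anything above a conventional cutoff (5 and 10 are common choices)
v <- car::vif(test)
adjusted_vif <- v[, "GVIF^(1/(2*Df))"]^2
adjusted_vif[adjusted_vif > 5]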


Pooled average marginal effects from survey-weighted and multiple-imputed data

I am working with survey data and its associated weights, in addition to missing data that I imputed using mice(). The model I'm eventually running contains complex interactions between variables for which I want the average marginal effects.
This task seems trivial in Stata, but I'd rather stay in R since that's what I know best. It seems easy to retrieve AMEs for each separate imputed dataset and average the estimates; however, I need to make use of pool() (from mice) to make sure I'm getting the correct standard errors.
Here is a reproducible example:
library(tidyverse)
library(survey)
library(mice)
library(margins)
df <- tibble(
  y = c(0, 5, 0, 4, 0, 1, 2, 3, 1, 12),
  region = c(1, 1, 1, 1, 1, 3, 3, 3, 3, 3),
  weight = c(7213, 2142, 1331, 4342, 9843, 1231, 1235, 2131, 7548, 2348),
  x1 = c(1.14, 2.42, -0.34, 0.12, -0.9, -1.2, 0.67, 1.24, 0.25, -0.3),
  x2 = c(12, NA, 10, NA, NA, 12, 11, 8, 9, 9)
)
Using margins() on a simple (non-multiple) svyglm works without a hitch. Running svyglm on each imputation using with() and pooling the results also works well.
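For a self-contained example, assume imputed_df and surv_obj (used below but not defined above) are built along these lines, using mice() and mitools::imputationList(); the clustering and weighting choices here are assumptions:
# Assumed construction of the imputation and survey-design objects referenced below
library(mitools)
imputed_df <- mice(df, m = 5, seed = 123)
surv_obj <- svydesign(ids = ~ region, weights = ~ weight,
                      data = imputationList(complete(imputed_df, "all")))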
m <- with(surv_obj, svyglm(y ~ x1 * x2))
pool(m)
However, wrapping margins() inside with() returns an error: "Error in .svycheck(design) : argument "design" is missing, with no default"
with(surv_obj, margins(svyglm(y ~ x1 * x2), design = surv_obj))
If I specify the design in the svyglm call, I get "Error in UseMethod("svyglm", design) : no applicable method for 'svyglm' applied to an object of class "svyimputationList""
with(surv_obj, margins(svyglm(y ~ x1 * x2, design = surv_obj), design = surv_obj))
If I drop the survey layer and simply try to run margins() on each imputed set and then pool, I get a warning: "Warning in get.dfcom(object, dfcom) : Infinite sample size assumed."
m1 <- with(imputed_df, margins(lm(y ~ x1 * x2)))
pool(m1)
This worries me, given that pool() may use the sample size in its calculations.
Does anyone know of a method to either (a) use with(), margins() and pool() to retrieve the pooled average marginal effects, or (b) which elements of margins() I should pass to pool() (or pool.scalar()) to achieve the desired result?
Update following Vincent's comment
Wanted to update this post following Vincent's comment about the related marginaleffects package, which ended up fixing my issue. Hopefully this will be helpful to others stuck on similar problems.
I implemented the code in the vignette linked in Vincent's comment, adding a few steps that allow for survey weighting and modelling. It's worth noting that svydesign() will drop any observations with missing values on the clustering/weighting variables, so marginaleffects() can't predict values back onto the original "dat" data and will throw an error. Pooling my actual data still produces the "Infinite sample size assumed" warning, which (as noted) should be fine, but I'm still looking into fixes.
library(tidyverse)
library(survey)
library(mice)
library(marginaleffects)
fit_reg <- function(dat) {
  # clustering by region is specified through `ids`; `weights` takes the weight variable
  svy <- svydesign(ids = ~ region, weights = ~ weight, data = dat)
  mod <- svyglm(y ~ x1 + x2 * factor(x3), design = svy)
  out <- marginaleffects(mod, newdata = dat)
  class(out) <- c("custom", class(out))
  return(out)
}
tidy.custom <- function(x, ...) {
  # prepend the contrast to the term name so pool() treats each contrast as its own term
  out <- marginaleffects:::tidy.marginaleffects(x, ...)
  out$term <- paste(out$term, out$contrast)
  return(out)
}
df <- tibble(
  y = c(0, 5, 0, 4, 0, 1, 2, 3, 1, 12),
  region = c(1, 1, 1, 1, 1, 3, 3, 3, 3, 3),
  weight = c(7213, 2142, 1331, 4342, 9843, 1231, 1235, 2131, 7548, 2348),
  x1 = c(1.14, 2.42, -0.34, 0.12, -0.9, -1.2, 0.67, 1.24, 0.25, -0.3),
  x2 = c(12, NA, 10, NA, NA, 12, 11, 8, 9, 9),
  x3 = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
)
imputed_df <- mice(df, m = 2, seed = 123)
dat_mice <- complete(imputed_df, "all")
mod_imputation <- lapply(dat_mice, fit_reg)
mod_imputation <- pool(mod_imputation)
summary(mod_imputation)

R: Creating new column to represent hi/mid/low bins by mean and standard deviation

I've got a batch of survey data that I'd like to subset on a few specific columns holding 0-10 scale data (e.g. "Rank your attitude towards x from 0 to 10") so that I can plot using ggplot() + facet_grid(). Faceting will use three hi/med/low bins, calculated as one standard deviation above and below the mean. I have working code which splits the overall dataframe into three parts like so:
# Generate sample data:
structure(list(Q4 = c(2, 3, 3, 5, 4, 3), Q5 = c(1, 3, 3, 3, 2,
2), Q6 = c(4, 3, 3, 3, 4, 4), Q7 = c(4, 2, 3, 5, 5, 5), Q53_1 = c(5,
8, 4, 5, 4, 5)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
# Acquire Q53_1 data as factors
political_scale <- factor(climate_experience_data$Q53_1, levels = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
# Generate thresholds based on mean and standard deviation
# (na.rm belongs to mean()/sd(); note also that as.numeric() on a factor
#  returns level indices rather than the original 0-10 values)
low_threshold <- round(mean(as.numeric(political_scale), na.rm = TRUE) - sd(as.numeric(political_scale), na.rm = TRUE), digits = 0)
high_threshold <- round(mean(as.numeric(political_scale), na.rm = TRUE) + sd(as.numeric(political_scale), na.rm = TRUE), digits = 0)
# Generate low/med/high bins based on mean and SD
political_lr_low <- filter(climate_experience_data, Q53_1 <= low_threshold)
political_lr_mid <- filter(climate_experience_data, Q53_1 < high_threshold & Q53_1 > low_threshold)
political_lr_high <- filter(climate_experience_data, Q53_1 >= high_threshold)
What I've realised is that this approach really doesn't lend itself to faceting. What I suspect is that I need a combination of mutate(), across(), where() and group_by() to add a new column Q53_scale containing "hi", "med" or "low" depending on where the Q53_1 values fall relative to those low/high thresholds (i.e. one SD above and below the mean). My first few dozen attempts have fallen short; has anyone managed to use sd() to bin data for faceting in this way?
library(tidyverse)
climate_experience_data <- structure(list(
  Q4 = c(2, 3, 3, 5, 4, 3),
  Q5 = c(1, 3, 3, 3, 2, 2),
  Q6 = c(4, 3, 3, 3, 4, 4),
  Q7 = c(4, 2, 3, 5, 5, 5),
  Q53_1 = c(5, 8, 4, 5, 4, 5)
), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))

climate_experience_data %>%
  mutate(
    bin = case_when(
      Q53_1 > mean(Q53_1) + sd(Q53_1) ~ "high",
      Q53_1 < mean(Q53_1) - sd(Q53_1) ~ "low",
      TRUE ~ "medium"
    ) %>% factor(levels = c("low", "medium", "high"))
  ) %>%
  ggplot(aes(Q4, Q5)) +
  geom_point() +
  facet_grid(~bin)
Created on 2022-03-10 by the reprex package (v2.0.0)
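If the same binning is needed for several 0-10 columns (the across() idea mentioned in the question), a sketch along these lines should work; the helper name bin_by_sd and the chosen columns are illustrative:
# Helper that bins a numeric vector into low/medium/high around mean +/- 1 SD
bin_by_sd <- function(x) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  factor(
    case_when(
      x > m + s ~ "high",
      x < m - s ~ "low",
      TRUE ~ "medium"
    ),
    levels = c("low", "medium", "high")
  )
}
# Apply it to several columns at once, keeping the originals
climate_experience_data %>%
  mutate(across(c(Q4, Q5, Q53_1), bin_by_sd, .names = "{.col}_bin"))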

How to define score of popularity for list of tags in R?

There is a dataset where each object has a comma-separated list of category tags. I would like to have an aggregated category score per object, based on the categories' popularities. I can define the sum, min, and max of the popularities, but it's not clear to me how an aggregated score should be calculated.
library(tidyverse)
library(tibble)
library(stringr)
# 1. Data
df <- tribble(
~object, ~category,
1, "Software, Model, Cloud",
2, "Model",
3, "Cloud, Software",
4, "Train, Test, Model",
5, "Test, Model"
)
# 2. List of categories
list_category <- trimws(unlist(str_split(df$category, ",")))
# 3. Categories popularity
data.frame(category = list_category) %>%
  group_by(category) %>%
  summarise(n_count = n()) %>%
  arrange(-n_count) %>%
  ungroup()
# 4. Outcome with undefined 'score_category' feature that I'd like to know how to score
tribble(
~object, ~sum_category, ~min_category, ~max_category, ~score_category,
1, sum(c(2, 4, 2)), min(c(2, 4, 2)), max(c(2, 4, 2)), NA,
2, sum(c(4)), min(c(4)), max(c(4)), NA,
3, sum(c(2, 2)), min(c(2, 2)), max(c(2, 2)), NA,
4, sum(c(1, 2, 4)), min(c(1, 2, 4)), max(c(1, 2, 4)), NA,
5, sum(c(2, 4)), min(c(2, 4)), max(c(2, 4)), NA
)
Any ideas and code are welcome!
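For reference, the hand-typed sums, mins and maxes in step 4 can be derived programmatically by unnesting the tags and joining the popularity counts back to each object; a sketch (leaving score_category open, since defining it is the actual question):
# Popularity of each category across all objects
popularity <- df %>%
  separate_rows(category, sep = ",\\s*") %>%
  count(category, name = "popularity")
# Per-object aggregates of the popularities of its categories
df %>%
  separate_rows(category, sep = ",\\s*") %>%
  left_join(popularity, by = "category") %>%
  group_by(object) %>%
  summarise(sum_category = sum(popularity),
            min_category = min(popularity),
            max_category = max(popularity),
            .groups = "drop")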

Multiple imputations with MICE and Latent Profile Analysis with tidyLPA

I am trying to do multiple imputation with the mice package and later use those results for a latent profile analysis with the tidyLPA package. However, I am running into coding problems and I am not sure they can be solved. I have seen examples online where, after the imputation, people fit linear or logistic models and use the pool() function to pool, for example, the R-squared estimate, but not where they perform a latent profile analysis.
Simply averaging the results won't be a good idea: as I have read in numerous posts, that would not take into account the variability among the imputed datasets.
The code gives me the following error before conducting the latent profile analysis:
Error in df[, select_vars, drop = FALSE] : incorrect number of dimensions
I am attaching a small example here to check if anyone is having a solution / suggestions.
Thank you in advance.
library("mice")
library("tidyLPA")
data <- data.frame(ID = c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4),
                   var1 = c(1, 2, 5, 10, NA, 5, 23, NA, NA, 1),
                   var2 = c(1, NA, NA, 1, NA, 0, 1, 3, 23, 4))
imputation <- mice(data, m = 5,
                   method = c("", "pmm", "pmm"),
                   maxit = 20)
LPA <- with(imputation, estimate_profiles(imputation, n_profiles = 2, variances = "equal",
                                          covariances = "equal"))
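Not a pooled solution, but for comparison: estimate_profiles() can at least be run separately on each completed dataset (a sketch assuming var1 and var2 are the indicator variables); how to properly pool the resulting profile solutions is the open part of the question:
# Run the LPA separately on each completed dataset; pooling the resulting
# profile solutions across imputations is the unresolved step
completed <- complete(imputation, "all")
lpa_per_imp <- lapply(completed, function(d) {
  estimate_profiles(d[, c("var1", "var2")],
                    n_profiles = 2,
                    variances = "equal",
                    covariances = "equal")
})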

Adding rows to make a full long dataset for longitudinal data analysis

I am working with a long-format longitudinal dataset where each person has 1, 2 or 3 time points. In order to perform certain analyses I need to make sure that each person has the same number of rows, even if some of them consist of NAs because the person did not complete a given time point.
Here is a sample of the data before adding the rows:
structure(list(Values = c(23, 24, 45, 12, 34, 23), P_ID = c(1,
1, 2, 2, 2, 3), Event_code = c(1, 2, 1, 2, 3, 1), Site_code = c(1,
1, 3, 3, 3, 1)), class = "data.frame", row.names = c(NA, -6L))
This is the data I aim to get after adding the relevant rows:
structure(list(Values = c(23, 24, NA, 45, 12, 34, 23, NA, NA),
P_ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3), Event_code = c(1, 2,
3, 1, 2, 3, 1, 2, 3), Site_code = c(1, 1, 1, 3, 3, 3, 1,
1, 1)), class = "data.frame", row.names = c(NA, -9L))
I want to come up with code that automatically adds rows to the dataset depending on whether the participant had 1, 2 or 3 visits. Ideally it would fill the rest of the data with NAs while copying P_ID and Site_code, but if that is not possible I would be satisfied with just creating the right number of rows.
We could use fill() after doing a complete():
library(dplyr)
library(tidyr)
ExpandedDataset %>%
  complete(P_ID, Event_code) %>%
  fill(Site_code)
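Alternatively (a sketch, using long_df as a stand-in name for the posted data frame), Site_code can be carried along directly by treating it as part of the nesting, which avoids the separate fill() step:
# nesting() keeps only the (P_ID, Site_code) pairs that actually occur,
# so Site_code is copied onto the newly created rows automatically
long_df <- structure(list(Values = c(23, 24, 45, 12, 34, 23),
                          P_ID = c(1, 1, 2, 2, 2, 3),
                          Event_code = c(1, 2, 1, 2, 3, 1),
                          Site_code = c(1, 1, 3, 3, 3, 1)),
                     class = "data.frame", row.names = c(NA, -6L))
long_df %>%
  complete(nesting(P_ID, Site_code), Event_code)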
I came up with quite long code, but you could wrap it in a function to make it easier.
Here's your dataframe:
df <- data.frame(ID = c(rep("P1", 2), rep("P2", 3), "P3"),
                 Event = c("baseline", "visit 2", "baseline", "visit 2", "visit 3", "baseline"),
                 Event_code = c(1, 2, 1, 2, 3, 1),
                 Site_code = c(1, 1, 2, 2, 2, 1))
How many records do you have per ID?
# table() gives per-ID counts whether ID is a factor or a character vector
values <- table(df$ID)
What is the maximum number of records for a single patient?
target <- max(values)
Which specific patients have less records than the maximum?
uncompliant <- names(which(values<target))
And how many records do you have for those patients who have missing information?
rowcount <- values[which(values<target)]
So now, let's create the vectors of the data frame we will add to your original one. First, IDs:
IDs <- vector()
for (i in 1:length(rowcount)) {
  y <- rep(uncompliant[i], target - rowcount[i])
  IDs <- c(IDs, y)
}
And now, the sitecodes:
SC <- vector()
for (i in 1:length(rowcount)) {
  y <- rep(unique(df$Site_code[which(df$ID == uncompliant[i])]), target - rowcount[i])
  SC <- c(SC, y)
}
Finally, a data frame with the values we will introduce:
introduce <- data.frame(ID = IDs,
                        Event = rep(NA, length(IDs)),
                        Event_code = rep(NA, length(IDs)),
                        Site_code = SC)
Combine the original dataframe with the new values to be added and sort it so it looks nice:
final <- as.data.frame(rbind(df, introduce))
final <- final[order(final$ID), ]
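As suggested above, the same steps can be wrapped into a function; a sketch (pad_visits is a hypothetical name, and it assumes the ID / Event / Event_code / Site_code columns used in this answer):
pad_visits <- function(df) {
  counts <- table(df$ID)            # records per ID
  target <- max(counts)             # maximum number of visits observed
  short  <- counts[counts < target] # IDs with fewer visits than that
  add <- do.call(rbind, lapply(names(short), function(id) {
    n_missing <- target - short[[id]]
    data.frame(ID = rep(id, n_missing),
               Event = NA,
               Event_code = NA,
               Site_code = unique(df$Site_code[df$ID == id]))
  }))
  out <- rbind(df, add)
  out[order(out$ID), ]
}
pad_visits(df)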
