Error: Problem with `mutate()` input x `labels` must be unique - r

I am trying to recode some labelled variables to a 0 to 1 scale, as shown below. When I try to calculate the mean of the two variables using c_across(), I get this odd error: Error: Problem with `mutate()` input `market_liberalism`. x `labels` must be unique.
If I delete the value labels then it works. I don't understand what problem the value labels cause.
Thank you.
#Install car package if necessary
#install.packages('car')
library(tidyverse)
library(car)
structure(list(PESE15 = structure(c(3, 5, 5, 8, NA), label = "The Government Should Leave it Entirely to the Private Sector to Create Jobs", na_values = c(8, 9), format.spss = "F1.0", display_width = 0L, labels = c(`Strongly agree` = 1, `Somewhat agree` = 3, Somewhatdisagree = 5, Stronglydisagree = 7,D.K. = 8, Refused = 9), class = c("haven_labelled_spss", "haven_labelled", "vctrs_vctr", "double")), MBSA2 = structure(c(3, 8, 1, 1, NA), label = "People Who Do Not Get Ahead Should Blame Themselves Not the System", na_values = 8, format.spss = "F1.0", display_width = 0L, labels = c(`Strongly agree` = 1, Agree = 2, Disagree = 3, Stronglydisagree = 4, `No opinion` = 8), class = c("haven_labelled_spss", "haven_labelled", "vctrs_vctr", "double"))), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), label = "NSDstat generated file")->out
#use the car::Recode command to convert values to 0 to 1
out$market1<-Recode(out$PESE15, "1=1; 3=0.75; 5=0.25; 7=0; 8=0.5; else=NA")
out$market2<-Recode(out$MBSA2, "1=1; 2=0.75; 3=0.25; 4=0; 8=0.5; else=NA")
#Use dplyr to try to calculate the average
out %>%
rowwise() %>%
mutate(market_liberalism=mean(
c_across(market1:market2))) -> out2
#Setting the value labels to NULL makes it work (val_labels() comes from the labelled package).
library(labelled)
val_labels(out$market1)<-NULL
val_labels(out$market2)<-NULL
out %>%
rowwise() %>%
mutate(market_liberalism=mean(
c_across(market1:market2)))

For me, car::Recode gives an error and does not work with the haven_labelled class, but dplyr::recode does work if you have the labelled package loaded.
library(labelled)
library(dplyr)
out %>%
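mutate(PESE15 = recode(PESE15, `1` = 1, `3` = 0.75, `5` = 0.25, `7` = 0, `8` = 0.5),
MBSA2 = recode(MBSA2, `1` = 1, `2` = 0.75, `3` = 0.25, `4` = 0, `8` = 0.5),
# Average the recoded columns; inside mutate(), `.` would still be the original, unrecoded data
market_liberalism = rowMeans(cbind(as.numeric(PESE15), as.numeric(MBSA2)), na.rm = TRUE))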
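If only the 0 to 1 average is needed, another option is to drop the labelled class before combining anything. The following is a minimal sketch under that assumption, not the asker's method: the toy data uses plain haven_labelled vectors with a shortened label set, and rescale01() is a hypothetical helper written for this illustration.

```r
library(haven)

# Toy stand-ins for the two survey items (fewer labels than the real data).
out <- data.frame(id = 1:5)
out$PESE15 <- labelled(c(3, 5, 5, 8, NA), c("Strongly agree" = 1, "Somewhat agree" = 3))
out$MBSA2  <- labelled(c(3, 8, 1, 1, NA), c("Strongly agree" = 1, Agree = 2))

# Hypothetical helper: look each code up in a named vector.
# Codes that are not in the map, and NA, come back as NA.
rescale01 <- function(x, map) unname(map[as.character(as.numeric(x))])

out$market1 <- rescale01(out$PESE15, c(`1` = 1, `3` = 0.75, `5` = 0.25, `7` = 0, `8` = 0.5))
out$market2 <- rescale01(out$MBSA2,  c(`1` = 1, `2` = 0.75, `3` = 0.25, `4` = 0, `8` = 0.5))

# Both new columns are plain doubles, so no value labels are involved in the mean.
out$market_liberalism <- rowMeans(out[, c("market1", "market2")], na.rm = TRUE)
```

With na.rm = TRUE, a row where both items are missing yields NaN rather than NA, which may need handling afterwards.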

Related

How to use ggplot with Haven dataset

I am new to coding and R, but I have a Stata dataset and want to use ggplot to visualise my data. However, I get multiple errors such as
no applicable method for 'rescale' applied to an object of class "c('haven_labelled', 'vctrs_vctr', 'double')"
I don't know how to convert the variables so that I can plot them.
The code is as follows:
Data <- read_dta("longitudinal_td.dta")
Data <- Data %>%
select(pidp,wave,age_dv,sex_dv,ethn_dv,sf1_dv,bmi_dv,sf12pcs_dv,fihhmnnet1_dv,sf12mcs_dv) %>%
filter(wave == "1", age_dv<=50)%>%
mutate(pipd = row_number(),age=age_dv, sex=sex_dv, ethnicity = ethn_dv, general_health=sf1_dv,
bmi=bmi_dv, physical_component_score=sf12pcs_dv, mental_component_score=sf12mcs_dv, household_income=fihhmnnet1_dv)%>%
select(-pipd,-age_dv,-sex_dv,-ethn_dv,-sf1_dv,-bmi_dv,-sf12pcs_dv,-sf12mcs_dv,-fihhmnnet1_dv)
Essentially, I'm just trying to explore BMI, but I don't know whether I can plot these variables directly or have to map the numbers to labels, as the haven labels already do.
I hope this is correct; here is the dput:
dput(head(Data))
structure(list(pidp = structure(c(68001367, 68006127, 68008167,
68009527, 68010207, 68010887), label = "cross-wave person identifier (public release)", format.stata = "%12.0g"),
wave = structure(c(1, 1, 1, 1, 1, 1), label = "interview wave", format.stata = "%8.0g"),
age = structure(c(39, 39, 38, 31, 24, 45), label = "Age, derived from dob_dv and intdat_dv", format.stata = "%8.0g"),
sex = structure(c(1, 2, 2, 1, 2, 2), label = "Sex, derived", format.stata = "%8.0g", labels = c(Male = 1,
Female = 2), class = c("haven_labelled", "vctrs_vctr", "double"
)), ethnicity = structure(c(1, 1, 1, 1, 1, 1), label = "Ethnic group (derived from multiple sources)", format.stata = "%8.0g", labels = c(`white uk` = 1,
irish = 2, `gypsy or irish traveller` = 3, `any other white background` = 4,
`white and black caribbean` = 5, `white and black african` = 6,
`white and asian` = 7, `any other mixed background` = 8,
indian = 9, pakistani = 10, bangladeshi = 11, chinese = 12,
`any other asian background` = 13, caribbean = 14, african = 15,
`any other black background` = 16, arab = 17, `any other ethnic group` = 97
), class = c("haven_labelled", "vctrs_vctr", "double")),
general_health = structure(c(2, 4, 5, 3, 1, 1), label = "General health", format.stata = "%8.0g", labels = c(excellent = 1,
`very good` = 2, good = 3, fair = 4, `or Poor?` = 5), class = c("haven_labelled",
"vctrs_vctr", "double")), bmi = structure(c(29.6, 38.8, 21.5,
24.2, 25, 25.5), label = "Body Mass Index", format.stata = "%12.0g")
Thanks for posting an example of your data with dput(). The format of the data you have posted suggests that it has somehow become a list rather than a data frame. You need to convert it to a data frame - as you're using haven I would stick with the tidyverse and do it with as_tibble().
Similarly, you want the labels rather than the underlying integers. You can simply apply as_factor to the whole data frame to do this.
Your data is then ready to be piped to ggplot2. For example:
library(dplyr)
library(ggplot2)
library(haven)
Data |>
as_tibble() |>
as_factor() |>
ggplot() +
geom_boxplot(aes(x=sex, y=bmi))
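To see what as_factor() does before plotting, here is a small sketch on hypothetical toy data modelled on the sex and bmi columns above. It assumes the haven package is installed; the values are illustrative only.

```r
library(haven)

# A labelled vector like the `sex` column in the dput above.
sex <- labelled(c(1, 2, 2, 1), labels = c(Male = 1, Female = 2))

# as_factor() swaps the underlying codes for their value labels.
sex_f <- as_factor(sex)

# Applied to a whole data frame, only labelled columns are converted by default.
df <- data.frame(bmi = c(29.6, 38.8, 21.5, 24.2))
df$sex <- sex
df_f <- as_factor(df)
```

After this, df_f$sex is an ordinary factor that ggplot2 can use on a discrete axis, while df_f$bmi stays numeric.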

Removing Incorrect Labels within Tidyverse/ Limiting Actions of as_factor()

I'm working with British Election Study data. To be used in R, this first has to be converted from the .dta form provided, which I think puts labels on to a lot of variables. Most of the time this is useful, but I think a problem I've got is where this isn't the case.
Using as_factor() blindly converts all variables with labels to factors. Is there a way to specify that only certain columns are converted? For example:
new_df <- data %>%
as_factor(just_this_column)
Failing that, is there a good way to remove the labels of certain variables within a data frame? I've looked at the sjlabelled package, but this does something weird and the result is no longer a data frame:
example_data<- str(sjlabelled::remove_all_labels(example_data$generalElectionVoteW19))
The reason I'm trying to do all of this is to make a histogram of number of people voting for each party (the factor) at a certain age. In this dataset, the age variable has a label which is messing up the code.
Of course, I could just convert the factor to a numeric value at the end, but this seems like a messy way of achieving things!
Here is the dput:
structure(list(ageW19 = structure(c(72, 52, 39, 75, 26, 56), label = "Age", format.stata = "%8.0g", labels = c(`Not Asked` = -9,
Skipped = -8), class = c("haven_labelled", "vctrs_vctr", "double"
)), generalElectionVoteW19 = structure(c(1, 13, 3, 1, 2, 1), label = "General election vote intention (recalled vote in post-election waves)", format.stata = "%40.0g", labels = c(`I would/did not vote` = 0,
Conservative = 1, Labour = 2, `Liberal Democrat` = 3, `Scottish National Party (SNP)` = 4,
`Plaid Cymru` = 5, `United Kingdom Independence Party (UKIP)` = 6,
`Green Party` = 7, `British National Party (BNP)` = 8, Other = 9,
`Change UK- The Independent Group` = 11, `Brexit Party` = 12,
`An independent candidate` = 13, `Don't know` = 9999), class = c("haven_labelled",
"vctrs_vctr", "double"))), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"), na.action = c(`1` = 1L, `3` = 3L, `5` = 5L
))
To your first question: you need mutate() to convert a single column, e.g.
new_df <- data %>%
mutate(factor_column = as_factor(old_column))
However, as you said, you probably want to convert to a numeric type, so you might want to use as.numeric() instead of as_factor().
We may use base R
data$factor_column <- factor(data$old_column)
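Putting both ideas together, the sketch below converts only the vote column to a factor and strips the value labels from the age column. It is an illustration on a hypothetical three-row cut-down of the data above (the object name bes and the shortened label sets are assumptions), and it uses haven::zap_labels() for the label removal.

```r
library(dplyr)
library(haven)

# Hypothetical cut-down version of the BES columns shown in the dput.
bes <- tibble(
  ageW19 = labelled(c(72, 52, 39), c("Not Asked" = -9, Skipped = -8)),
  generalElectionVoteW19 = labelled(
    c(1, 13, 3),
    c(Conservative = 1, `Liberal Democrat` = 3, `An independent candidate` = 13)
  )
)

new_df <- bes %>%
  mutate(
    # Convert just this column to a factor, using its value labels as levels.
    generalElectionVoteW19 = as_factor(generalElectionVoteW19),
    # Drop the value labels from age so it becomes a plain numeric column.
    ageW19 = zap_labels(ageW19)
  )
```

Because the conversion happens inside mutate(), every other column in the data frame is left untouched.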

svyglm and interactions package showing error in R4.0.2 not R3.6.3

I have svyglm code that works in R 3.6.3 but not in R 4.0.2, which gives the following error messages.
For svyglm:
Error: Must subset elements with a valid subscript vector.
x Subscript has the wrong type omit.
i It must be logical, numeric, or character.
Backtrace:
survey::svyglm(...)
survey:::svyglm.survey.design(...)
survey:::[.survey.design2(design, -nas, )
base::[.data.frame(x$variables, i, ..1, drop = FALSE)
vctrs:::[.vctrs_vctr(xj, i)
vctrs:::vec_index(x, i, ...)
vctrs::vec_slice(x, i)
For interactions package:
Error: <labelled> - is not permitted
Backtrace:
interactions::probe_interaction(...)
interactions::sim_slopes(...)
interactions:::center_ss(...)
interactions:::center_ss_survey(...)
jtools::gscale(vars = ndfvars, data = design, center.only = TRUE)
...
vctrs:::-.vctrs_vctr(left, right)
haven:::vec_arith.haven_labelled.default("-", e1, e2)
vctrs::stop_incompatible_op(op, x, y)
vctrs:::stop_incompatible(...)
vctrs:::stop_vctrs(...)
My code is:
design <-
svydesign(
id = ~I_i ,
data = sm ,
weights = ~w ,
)
fit <- svyglm(sc ~ sex + school +sex*school, design)
summary(fit)
probe_interaction(fit, pred = school, modx = sex)
I wish to continue with R 4.0.2, as downgrading R creates a lot of problems with other R packages. The probe_interaction part did not work on the older version of R (on a PC) either.
Any clue what went wrong?
Data:
structure(list(I_i = structure(c(-54, 1000987, 1000166), label = "ID institution", format.spss = "F39.0", display_width = 39L, labels = c(Filtered = -99, `Don't know` = -98, Refused = -97, `Not in list` = -96, `Implausible value` = -95, `Not reached` = -94, `Does not apply` = -93, `Question erroneously not asked` = -92, `Survey aborted` = -91, `Unspecific missing` = -90, `Not participated` = -56, `Not determinable` = -55, `Missing by design` = -54, Anonymized = -53, `Implausible value removed` = -52, `No estimate in check module` = -51), class = c("haven_labelled", "vctrs_vctr", "double")), w = c(0.197917401820921, 0.186422150045312, 0.275986837527898), sc = c(NA, 7.66666666666667, 8.66666666666667), sex = structure(c(1, 1, 0), label = "Gender", format.spss = "F3.0", display_width = 9L, labels = c(`Implausible value` = -95, `Unspecific missing` = -90, `Missing by design` = -54, `… male?` = 1, `… female?` = 2), class = c("haven_labelled", "vctrs_vctr", "double")), school = c(NA, 1L, 1L)), row.names = c(NA, 3L), class = "data.frame")

Propensity Score Matching and subset the data by using a weighting factor in R

I'm doing propensity score matching and want to subset the data into treatment and control groups using weights. There are 5 variables: ID, Treatment (Yes/No), Outcome (Yes/No), Age and Weight. I tried to write a program in R, but I have problems doing this according to the weights. The survey package is used.
dput(dat2):
structure(list(ID = c(1, 2, 3, 4, 6, 7),
Weight = c(2.4740626, 2.4740626, 2.4740626, 2.4740626, 1.9548149, 1.9548149),
Age = c("35-44", "<15-24", "25-34", "35-44", ">45", "25-34"),
Treatment = c(1, 0, 0, 1, 0, 0),
Outcome = c(1, 1, 1, 0, 1, 1)),
row.names = c(NA, -6L),
class = c("tbl_df", "tbl", "data.frame"))
head(dat2):
data<-svydesign(ids = ~dat2$Id,
weights = ~dat2$Weight,
data = dat2)
treat<-subset(dat, dat2$treatment==1)
cont<-subset(dat, dat2$treatment==0)
I am sharing a sample of the data; the full data set has 1587 rows. When I compute the dimensions without weights, treat and cont are 877*5 and 710*5 respectively. With weights, they should be 803*5 and 784*5.
Please help me.
Thanks in advance.
One way to do this is as below:
Sample Data
dat2 <- structure(list(ID = c(1, 2, 3, 4, 6, 7),
Weight = c(2.4740626, 2.4740626, 2.4740626, 2.4740626, 1.9548149, 1.9548149),
Age = c("35-44", "<15-24", "25-34", "35-44", ">45", "25-34"),
Treatment = c(1, 0, 0, 1, 0, 0),
Outcome = c(1, 1, 1, 0, 1, 1)),
row.names = c(NA, -6L),
class = c("tbl_df", "tbl", "data.frame"))
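Script
library(survey)
# Refer to the columns by name in the formulas; svydesign() looks them up in `data`.
data<-svydesign(ids = ~ID,
weights = ~Weight,
data = dat2)
# Subset the design object itself, using the column name with its exact capitalisation.
treat<-subset(data, Treatment==1)
cont<-subset(data, Treatment==0)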
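Subsetting a design does not change how many rows fall in each group; the weights only change the estimated group sizes. If the 803 and 784 figures in the question are meant as weighted counts (an assumption on my part), a sketch like the following compares the two on the sample data. It uses survey::svytable() and a by-hand sum of weights as a cross-check; the object names are illustrative.

```r
library(survey)

# Sample data from the question (Age omitted because it is not needed here).
dat2 <- data.frame(
  ID = c(1, 2, 3, 4, 6, 7),
  Weight = c(2.4740626, 2.4740626, 2.4740626, 2.4740626, 1.9548149, 1.9548149),
  Treatment = c(1, 0, 0, 1, 0, 0),
  Outcome = c(1, 1, 1, 0, 1, 1)
)
design <- svydesign(ids = ~ID, weights = ~Weight, data = dat2)

# Unweighted group sizes: plain row counts.
unweighted_n <- table(dat2$Treatment)

# Weighted group sizes: the sum of the weights within each group.
weighted_n <- svytable(~Treatment, design)
weighted_by_hand <- tapply(dat2$Weight, dat2$Treatment, sum)
```

On the full data, the same two calls would show whether the weighted totals match the expected 803 and 784.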

How to create differences between several pairs of columns?

I have a panel (cross-sectional time series) dataset. For each group (defined by (NAICS2, occ_type) in time ym) I have many variables. For each variable, I would like to subtract the group's first (dplyr::first) value from every value of that group.
Ultimately, I am trying to take the Euclidean distance between each row's vector and the vector of its group's first entry, i.e. sqrt(c_1^2 + ... + c_k^2), where c_j is the difference in variable j.
I was able to create a column equal to the first entry for each group:
df2 <- df %>%
group_by(ym, NAICS2, occ_type) %>%
distinct(ym, NAICS2, occ_type, .keep_all = T) %>%
arrange(occ_type, NAICS2, ym) %>%
select(group_cols(), ends_with("_scf")) %>%
mutate_at(vars(-group_cols(), ends_with("_scf")),
list(first = dplyr::first))
I then tried to include variations of f.diff = . - dplyr::first(.) in the list, but none of those worked. I googled the dot notation for a while as well as first and lag in dplyr timeseries but have not been able to resolve this yet.
Ideally, I would unite all variables into a vector for each row first and then take the difference.
df2 <- df %>%
group_by(ym, NAICS2, occ_type) %>%
distinct(ym, NAICS2, occ_type, .keep_all = T) %>%
arrange(occ_type, NAICS2, ym) %>%
select(group_cols(), ends_with("_scf")) %>%
unite(vector, c(-group_cols(), ends_with("_scf")), sep = ',') %>%
# TODO: DISTANCE_BETWEEN_ENTRY_AND_FIRST
mutate(vector.diff = ???)
I expect the output to be a numeric column that contains a distance measure of how different each group's row vector is from its initial row vector.
Here is a sample of the data:
structure(list(ym = c("2007-01-01", "2007-02-01"), NAICS2 = c(0L,
0L), occ_type = c("is_middle_manager", "is_middle_manager"),
Administration_scf = c(344, 250), Agriculture..Horticulture..and.the.Outdoors_scf = c(11,
17), Analysis_scf = c(50, 36), Architecture.and.Construction_scf = c(57,
51), Business_scf = c(872, 585), Customer.and.Client.Support_scf = c(302,
163), Design_scf = c(22, 17), Economics..Policy..and.Social.Studies_scf = c(7,
7), Education.and.Training_scf = c(77, 49), Energy.and.Utilities_scf = c(25,
28), Engineering_scf = c(90, 64), Environment_scf = c(19,
19), Finance_scf = c(455, 313), Health.Care_scf = c(105,
71), Human.Resources_scf = c(163, 124), Industry.Knowledge_scf = c(265,
174), Information.Technology_scf = c(467, 402), Legal_scf = c(21,
17), Maintenance..Repair..and.Installation_scf = c(194, 222
), Manufacturing.and.Production_scf = c(176, 174), Marketing.and.Public.Relations_scf = c(139,
109), Media.and.Writing_scf = c(18, 20), Personal.Care.and.Services_scf = c(31,
16), Public.Safety.and.National.Security_scf = c(14, 7),
Religion_scf = c(0, 0), Sales_scf = c(785, 463), Science.and.Research_scf = c(52,
24), Supply.Chain.and.Logistics_scf = c(838, 455), total_scf = c(5599,
3877)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -2L), groups = structure(list(ym = c("2007-01-01",
"2007-02-01"), NAICS2 = c(0L, 0L), occ_type = c("is_middle_manager",
"is_middle_manager"), .rows = list(1L, 2L)), row.names = c(NA,
-2L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE))
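The computation described above can be sketched without unite(). This is only an illustration on a hypothetical toy panel (the columns A_scf and B_scf and their values are made up), and it assumes that groups should be followed over time, so ym is used for ordering rather than as a grouping variable.

```r
library(dplyr)

# Hypothetical toy panel: two groups, three months, two "_scf" columns.
toy <- tibble(
  ym = rep(c("2007-01-01", "2007-02-01", "2007-03-01"), times = 2),
  NAICS2 = rep(c(0L, 1L), each = 3),
  occ_type = "is_middle_manager",
  A_scf = c(3, 6, 3, 1, 1, 2),
  B_scf = c(4, 8, 4, 5, 5, 5)
)

dist_from_first <- toy %>%
  arrange(occ_type, NAICS2, ym) %>%
  group_by(NAICS2, occ_type) %>%
  # Subtract each group's first value from every value of that column.
  mutate(across(ends_with("_scf"), ~ .x - first(.x), .names = "{.col}_diff")) %>%
  ungroup()

# Euclidean norm of the difference vector in each row.
diffs <- as.matrix(select(dist_from_first, ends_with("_diff")))
dist_from_first$vector_diff <- sqrt(rowSums(diffs^2))
```

Grouping by ym as well would put every row in its own group, so each row would be its own "first" entry and every distance would be zero.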
