Can't use survey_mean in sapply - r

I'm using survey data with packages survey and srvyr and I have some trouble applying survey_mean() to all columns.
Here's an example:
library(survey)
library(srvyr)
data(api)
dstrata <- apistrat %>%
as_survey_design(strata = stype, weights = pw) %>%
mutate(api00 = ifelse(api00 == 467, NA, api00),
api99 = ifelse(api99 == 491, NA, api99))
sapply(dstrata$variables %>% select(api99, api00), function(x){
x <- enquo(x)
dstrata %>%
filter(!is.na(!!x)) %>%
summarise(stat = srvyr::survey_mean(!!x, na.rm = TRUE)[, 1])
})
Error: Assigned data x must be compatible with existing data.
x Existing data has 198 rows.
x Assigned data has 200 rows.
ℹ Only vectors of size 1 are recycled.
Run rlang::last_error() to see where the error occurred.
Note that:
dstrata %>%
select(api99, api00) %>%
summarise_all(.funs = srvyr::survey_mean, na.rm = T)
works with this example but not with my actual data so I would like to understand why the function above does not work.
I'm using srvyr_0.3.9 and survey_4.0

I don't see why you would need any kind of NSE here: sapply passes each column's values to the function, not an expression, so there is nothing to quote with enquo().
This seems to work:
library(dplyr)
sapply(dstrata$variables %>% select(api99, api00), function(x){
dstrata %>%
summarise(stat = srvyr::survey_mean(x, na.rm = TRUE))
})
# api99 api00
#stat 630.3107 663.4118
#stat_se 10.14777 9.566393
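If you still want to filter or refer to columns by name, one option is to iterate over the column names instead of the column values. The error in the question presumably arises because x holds all 200 values of the column while the filtered design has only 198 rows. The sketch below assumes that srvyr's summarise() accepts !! unquoting of a symbol built with sym(), which is my understanding of recent versions; the vars and result names are only illustrative.

```r
library(survey)
library(srvyr)
library(dplyr)

data(api)
dstrata <- apistrat %>%
  as_survey_design(strata = stype, weights = pw) %>%
  mutate(api00 = ifelse(api00 == 467, NA, api00),
         api99 = ifelse(api99 == 491, NA, api99))

# Iterate over column NAMES, then turn each name into a symbol
vars <- c("api99", "api00")
means <- lapply(vars, function(nm) {
  dstrata %>%
    summarise(stat = survey_mean(!!sym(nm), na.rm = TRUE)) %>%
    mutate(variable = nm)  # remember which column this row describes
})

# One row per variable, with columns stat, stat_se and variable
result <- bind_rows(means)
```

Because the column is looked up inside the design, its length always matches the design's row count.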

Related

How to use %>% in tidymodels in R?

I am trying to split a dataset from tidymodels in R.
library(tidymodels)
data(Sacramento, package = "modeldata")
data_split <- initial_split(Sacramento, prop = 0.75, strata = price)
Sac_train <- training(data_split)
I want to describe the distribution of the training dataset, but the following error occurs.
Sac_train %>%
select(price) %>%
summarize(min_sell_price = min(),
max_sell_price = max(),
mean_sell_price = mean(),
sd_sell_price = sd())
# Error: In min() : no non-missing arguments to min; returning Inf
However, the following code works.
Sac_train %>%
summarize(min_sell_price = min(price),
max_sell_price = max(price),
mean_sell_price = mean(price),
sd_sell_price = sd(price))
My question is: why does select(price) not work in the first example? Thanks.
Assuming your data are a data frame, despite having only one column selected, you still need to tell R/dplyr what column you want to summarize.
In other words, it doesn't treat a single-column data frame as a vector that you can pass through a function - i.e.:
Sac_train.vec <- 1:25
mean(Sac_train.vec)
# [1] 13
will calculate the mean, whereas
Sac_train.df <- data.frame(price = 1:25)
mean(Sac_train.df)
does not: it returns NA with a warning, because mean() expects a numeric vector, not a data frame.
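A small base R sketch of the difference, using a made-up price_df. In current versions of R, mean() on a data frame gives NA with a warning rather than a number, so the column has to be extracted first:

```r
price_df <- data.frame(price = 1:25)

# A data frame is not a numeric vector: mean() warns and returns NA
res_df <- suppressWarnings(mean(price_df))

# Extracting the column gives a vector that mean() can handle
res_vec  <- mean(price_df$price)
res_pull <- mean(price_df[["price"]])
```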
In the special case of only one column, this may be more parsimonious code:
# Example Data
Sac_train <- data.frame(price = 1:25, col2 = LETTERS[1:25])
Sac_train %>%
select(price) %>%
summarize(across(everything(),
list(min = min, max = max, mean = mean, sd = sd)))
Output:
# price_min price_max price_mean price_sd
# 1 1 25 13 7.359801

Pass variable name as argument dynamically on svydesign and dplyr::select functions

I'm a newbie with R. In the code below, I want the variable name wt_itvex_divided_by_4 to be replaced dynamically with wt_itvex_divided_by_3 or wt_itvex_divided_by_2.
df_odds_sv <- as_survey(
svydesign(id = ~psu+ID_fam,
strata = ~kstrata,
weights = ~wt_itvex_divided_by_4,
data = data_sd1013
)) %>%
dplyr::select(ID, ID_fam, psu, kstrata, wt_itvex_divided_by_4) %>%
subset(ID %in% df_odds$ID)
To make the name dynamic, I tried storing it in a temp variable as follows, but it didn't work:
temp='wt_itvex_divided_by_3'
df_odds_sv <- as_survey(
svydesign(id = ~psu+ID_fam,
strata = ~kstrata,
weights = ~temp,
data = data_sd1013
)) %>%
dplyr::select(ID, ID_fam, psu, kstrata, temp) %>%
subset(ID %in% df_odds$ID)
After some searching, I saw someone recommend get(), so I tried the following. It didn't raise an error, but the wt_itvex_divided_by_3 column wasn't selected:
s<-'wt_itvex_divided_by_3'
df_odds_sv <- as_survey(
svydesign(id = ~psu+ID_fam,
strata = ~kstrata,
weights = ~get(s),
data = data_sd1013
)) %>%
dplyr::select(ID, ID_fam, psu, kstrata, get(s)) %>%
subset(ID %in% df_odds$ID)
Following Ronak Shah's answer, I solved the issue with the following code (note that I pass the name differently to svydesign's weights and to dplyr::select):
temp='wt_itvex_divided_by_3'
df_odds_sv <- as_survey(
svydesign(id = ~psu+ID_fam,
strata = ~kstrata,
weights = ~data_sd1013[[temp]],
data = data_sd1013
)) %>%
dplyr::select(ID, ID_fam, psu, kstrata, temp) %>%
subset(ID %in% df_odds$ID)
You may try subsetting from the dataframe directly with [[.
Using apistrat data as an example.
library(survey)
library(srvyr)
data(api)
temp= "pw"
dstrat1 <- svydesign(id=~1,strata=~stype, weights= ~apistrat[[temp]],
data=apistrat, fpc=~fpc)

How to pipe a dataset into a pmap?

My tibble looks like this :
dataset <- tibble(country = sample(c(11,23,18,17,12,19,30,16,14,13,15),7679,replace = T),yrbirth = floor(runif(7679,1900,1970)))
and I have two helper vectors to check conditions
country_code <- c(11,23,18,17,12,19,30,16,14,13,15)
crit_year <- c(1947,1969,1957,1953,1948,1958,1958,1949,1959,1947,1947)
I have a function to do the mutation
f_g_treat <- function(dataset, country_code, crit_year){
dataset_new <- dataset %>%
filter(country == country_code) %>%
mutate(treatment = ifelse(yrbirth >=crit_year-7,'Treat','Contr'))
return(dataset_new$treatment)
}
Now I want to pipe dataset into pmap, but it throws an error. My idea was this:
dataset <- dataset %>%
mutate(treatment =
pmap(list(country_code, crit_year), ~f_g_treat(dataset = ., country_code = ..1, crit_year = ..2 )) %>%
unlist() )
Doing so throws the following error:
Error: Problem with `mutate()` input `treatment`.
x no applicable method for 'filter_' applied to an object of class "c('double', 'numeric')"
i Input `treatment` is `%>%`(...).
When I try :
dataset <- mutate(dataset, treatment =
pmap(list(country_code, crit_year), ~f_g_treat(dataset = dataset, country_code = ..1, crit_year = ..2 )) %>%
unlist() )
everything works fine and I get the expected vector. So I believe I am using the . placeholder incorrectly in this part. Can someone help me with that?
For your specific question, do() solves the problem:
do(dataset, mutate(., treatment =
pmap(list(country_code, crit_year), function(x, y) f_g_treat(dataset = ., country_code = x, crit_year = y )) %>%
unlist() ) )
Note that I had to make the parameters of the function in pmap explicit (x, y) to avoid overwriting the . that do() created.
I don't think the output is correct though, nor is it in your working example. The treatment column is pasted in the wrong order (namely, grouped in the order of country_code and crit_year) rather than in the row order of the original data frame.
A better way to do this is via a join:
country_crit_year <- tibble(country = country_code, crit_year = crit_year)
dataset %>%
left_join(country_crit_year, by = "country") %>%
mutate(treatment = if_else(yrbirth >= crit_year-7, "Treat", "Contr"))
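To check that the join keeps each row aligned with its own country's cutoff, here is a tiny hand-checkable sketch. The four rows and two countries are made up for illustration and are not the data from the question.

```r
library(dplyr)

# Hypothetical miniature data: countries deliberately interleaved
dataset <- tibble(country = c(23, 11, 23, 11),
                  yrbirth = c(1965, 1930, 1950, 1945))

country_crit_year <- tibble(country   = c(11, 23),
                            crit_year = c(1947, 1969))

out <- dataset %>%
  left_join(country_crit_year, by = "country") %>%
  mutate(treatment = if_else(yrbirth >= crit_year - 7, "Treat", "Contr"))

# Cutoffs are 1940 for country 11 and 1962 for country 23, so row by row:
# 1965 >= 1962, 1930 < 1940, 1950 < 1962, 1945 >= 1940
out$treatment
# [1] "Treat" "Contr" "Contr" "Treat"
```

The result stays in the original row order, which is what the pmap() plus unlist() approach does not guarantee.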
You can do the following:
dataset %>%
mutate(treatment = pmap(list(country_code, crit_year),
f_g_treat,
dataset = .) %>% unlist())

dplyr, dunn test, Error in dim(robj) <- c(dX, dY) : dims [product 0] do not match the length of object

I am trying to pass a dataset filtered by a variable value to the pairw.kw function from the "asbio" package in R.
example.df <- data.frame(
species = sample(c("primate", "non-primate"), 50, replace = TRUE),
treated = sample(c("Yes", "No"), 50, replace = TRUE),
gender = sample(c("male", "female"), 50, replace = TRUE),
var1 = rnorm(50, 100, 5)
)
library(dplyr)
library(asbio)
with(example.df, pairw.kw(var1, species, conf=0.95))
This code works. However,
example.df %>%
filter(treated=="No") %>%
{pairw.kw("var1", "species",conf = 0.95)}
gives me the error message
Error in dim(robj) <- c(dX, dY) :
dims [product 0] do not match the length of object [1]
I cannot understand what is causing this, other than to assume that the two vectors being compared become different lengths after the filter is applied.
Is there a way to fix this other than subsetting the data explicitly to a new dataframe and using that instead? I know that will work, but wanted to know if a more elegant solution exists.
First of all, the %>% pipe passes the data frame to pairw.kw as its first argument (and inside braces it is only available as .). Secondly, pairw.kw expects two vectors as input, whereas "var1" and "species" are just character strings. You can achieve what you want with the %$% pipe from the magrittr package, which works similarly to the with function.
library(magrittr)
example.df %>%
filter(treated=="No") %$%
pairw.kw(var1, species, conf = 0.95)
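To see the equivalence with with() without needing asbio, here is a minimal sketch that uses cor() as a stand-in for any function taking two vectors. The data frame df is made up for illustration.

```r
library(magrittr)

df <- data.frame(x = 1:10, y = (1:10)^2)

# with() exposes the columns of df to the expression
r_with <- with(df, cor(x, y))

# %$% does the same at the end of a pipeline
r_expo <- df %$% cor(x, y)

# It also chains after ordinary %>% steps
r_sub <- df %>% subset(x > 3) %$% cor(x, y)
```

r_with and r_expo are the same value; r_sub is computed only on the rows kept by subset().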
Answer to question in comment:
library(tidyverse)
library(magrittr)
library(asbio)
example.df %>%
group_by(treated) %>%
nest() %>%
mutate(
kw = map(
data,
~ .x %$% pairw.kw(var1, species, conf = 0.95)
),
p_val = map_dbl(kw, ~ .x$summary$`Adj. P-value`)
)

Using replace_na for multiple data subsets

I'm trying to replace the NAs in multiple column variables with randomly generated values from each student_id's subset row data:
[image: data snapshot]
So for student 3, systolic needs two NAs replaced. I used the min and max values for each variable within the student 3 subset to generate random values.
library(dplyr)
library(tidyr)
library(tibble)
library(tidyverse)
dplyr::filter(exercise, student_id == "3") %>%
  replace_na(list(systolic = round(sample(runif(1000, 125, 130), 2), 0),
                  diastolic = round(sample(runif(1000, 85, 85), 3), 0),
                  heart_rate = round(sample(runif(1000, 79, 86), 2), 0),
                  phys_score = round(sample(runif(1000, 8, 9), 2), 0)))
However, it works only when one NA needs replacing ([image: successfully replaced systolic NA values]). When I try to replace more than one NA, this error comes up:
Error: Replacement for `systolic` is length 2, not length 1
Is there a way to fix this? I tried converting the column variables to data frames instead of the vectors they are now, but it only returned the original data without any replacement changes.
Are there any simpler ways to do this? Any suggestions/comments would be appreciated. Thanks.
Here is a solution that makes things a little more automated, though it may be unnecessarily complex.
First, generate some grouped missing data from the mtcars dataset:
library(magrittr)
library(purrr)
library(dplyr)
library(stringr)
library(tidyr)
## Generate some missing data with a subset of car make
mtcars_miss <- mtcars %>%
as_tibble(rownames = "car") %>%
select(car) %>%
separate(car, c("make", "name"), " ") %>%
bind_cols(mtcars[, -1] %>%
map_df(~.[sample(c(TRUE, NA), prob = c(0.8, 0.2),
size = length(.), replace = TRUE)])) %>%
filter(make %in% c("Mazda", "Hornet", "Merc"))
Function to replace na values from a given variable by sampling within the min and max and depending on some group (here make).
replace_na_sample <- function(df_miss, var, group = "make") {
var <- enquo(var)
df_miss %>%
group_by(.dots = group) %>%
mutate(replace_var := round(runif(n(), min(!!var, na.rm = T),
max(!!var, na.rm = T)), 0)) %>%
rowwise %>%
mutate_at(.vars = vars(!!var),
.funs = funs(replace_na(., replace_var))) %>%
select(-replace_var) %>%
ungroup
}
Example replacing several missing values in multiple columns.
mtcars_replaced <- mtcars_miss %>%
replace_na_sample(cyl, group = "make") %>%
replace_na_sample(disp, group = "make") %>%
replace_na_sample(hp, group = "make")
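A shorter alternative, sketched under assumptions: replace_na() wants a single replacement value per column, which is why a length-2 replacement fails, whereas dplyr's coalesce() fills each NA from the matching position of a second vector. The exercise data and the fill_random() helper below are hypothetical stand-ins for the data in the question, and every group is assumed to have at least one non-missing value per column.

```r
library(dplyr)

set.seed(1)

# Hypothetical stand-in for the exercise data
exercise <- tibble(
  student_id = c("3", "3", "3", "3", "4", "4", "4"),
  systolic   = c(125, NA, 130, NA, 118, 121, NA),
  heart_rate = c(79, 86, NA, 80, NA, 70, 72)
)

# Hypothetical helper: fill NAs with random draws between the observed
# min and max of x; non-missing values are left untouched
fill_random <- function(x) {
  rng <- range(x, na.rm = TRUE)
  coalesce(x, round(runif(length(x), rng[1], rng[2])))
}

filled <- exercise %>%
  group_by(student_id) %>%  # min/max are taken within each student
  mutate(across(c(systolic, heart_rate), fill_random)) %>%
  ungroup()
```

Since the draws are generated per group, each student's replacements stay within that student's own observed range.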
