My tibble looks like this:
dataset <- tibble(country = sample(c(11,23,18,17,12,19,30,16,14,13,15),7679,replace = T),yrbirth = floor(runif(7679,1900,1970)))
and I have two helper vectors to check conditions:
country_code <- c(11,23,18,17,12,19,30,16,14,13,15)
crit_year <- c(1947,1969,1957,1953,1948,1958,1958,1949,1959,1947,1947)
I have a function to do the mutation
f_g_treat <- function(dataset, country_code, crit_year){
dataset_new <- dataset %>%
filter(country == country_code) %>%
mutate(treatment = ifelse(yrbirth >=crit_year-7,'Treat','Contr'))
return(dataset_new$treatment)
}
Now I want to pipe dataset into pmap, but it throws an error. My idea was this:
dataset <- dataset %>%
mutate(treatment =
pmap(list(country_code, crit_year), ~f_g_treat(dataset = ., country_code = ..1, crit_year = ..2 )) %>%
unlist() )
Doing so throws the following error:
Error: Problem with mutate() input treatment.
x no applicable method for 'filter_' applied to an object of class "c('double', 'numeric')"
i Input treatment is ``%>%(...).
When I try:
dataset <- mutate(dataset, treatment =
pmap(list(country_code, crit_year), ~f_g_treat(dataset = dataset, country_code = ..1, crit_year = ..2 )) %>%
unlist() )
everything works fine and I get the expected vector. So I believe I am using the . placeholder incorrectly in the piped version. Can someone help me with that?
For your specific question, do() solves your problem:
do(dataset, mutate(., treatment =
pmap(list(country_code, crit_year), function(x, y) f_g_treat(dataset = ., country_code = x, crit_year = y )) %>%
unlist() ) )
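Note that I had to make the parameters of the function in pmap explicit (x, y) to avoid overwriting the . that do created. Inside a ~ formula, purrr rebinds . to the lambda's first argument, which is why your original call handed a number to filter.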
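To illustrate the shadowing described above, here is a minimal sketch on a toy vector (the names x, shadowed and kept are made up for this example). It compares what . resolves to inside a formula lambda and inside an explicit function:

```r
library(magrittr)
library(purrr)

x <- c("a", "b", "c")

# Inside a ~ formula lambda, `.` is an alias for the first argument,
# so it hides the `.` supplied by the surrounding pipe.
shadowed <- x %>% { map_chr(1:2, ~ class(.)) }

# An explicit function does not rebind `.`, so the pipe's `.` (x) stays visible.
kept <- x %>% { map_chr(1:2, function(i) class(.)) }

print(shadowed)
print(kept)
```

In the first call . is the integer being iterated over; in the second it is still the piped character vector.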
I don't think the output is correct though, nor is it in your working example. The treatment column is pasted in the wrong order (namely the order of country_code, crit_year), rather than the order in the original data frame.
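The ordering problem can be reproduced on a tiny, made-up dataset. The sketch below reuses f_g_treat from the question with four hypothetical rows, and compares the pmap result with a row-wise lookup built with match():

```r
library(dplyr)
library(purrr)

# Toy data (hypothetical values), deliberately not sorted by country
dataset <- tibble(country = c(23, 11, 23, 11),
                  yrbirth = c(1965, 1939, 1950, 1945))
country_code <- c(11, 23)
crit_year <- c(1947, 1969)

f_g_treat <- function(dataset, country_code, crit_year) {
  dataset_new <- dataset %>%
    filter(country == country_code) %>%
    mutate(treatment = ifelse(yrbirth >= crit_year - 7, "Treat", "Contr"))
  dataset_new$treatment
}

# pmap() handles country 11 first, then 23, so the values come back
# grouped by country rather than in the original row order.
by_country <- unlist(pmap(list(country_code, crit_year),
                          function(x, y) f_g_treat(dataset, x, y)))

# Row-wise reference: look up each row's own critical year.
row_crit <- crit_year[match(dataset$country, country_code)]
row_wise <- ifelse(dataset$yrbirth >= row_crit - 7, "Treat", "Contr")

print(by_country)
print(row_wise)
```

The two vectors contain the same labels but in different positions, so assigning by_country as a column would mislabel rows.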
A better way to do this is via a join:
country_crit_year <- tibble(country = country_code, crit_year = crit_year)
dataset %>%
left_join(country_crit_year, by = "country") %>%
mutate(treatment = if_else(yrbirth >= crit_year-7, "Treat", "Contr"))
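As a small worked example of the join approach, here it is on made-up data. Country 99 is a hypothetical code that is absent from the lookup table, included to show what happens to unmatched rows:

```r
library(dplyr)

# Hypothetical rows; 99 has no entry in the lookup table
dataset <- tibble(country = c(23, 11, 23, 12, 99),
                  yrbirth = c(1965, 1939, 1950, 1941, 1960))
country_crit_year <- tibble(country = c(11, 23, 12),
                            crit_year = c(1947, 1969, 1948))

result <- dataset %>%
  left_join(country_crit_year, by = "country") %>%
  mutate(treatment = if_else(yrbirth >= crit_year - 7, "Treat", "Contr"))

print(result)
```

Row order is preserved, and a country without a critical year ends up with NA in treatment instead of being silently dropped.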
You can do the following:
dataset %>%
mutate(treatment = pmap(list(country_code, crit_year),
f_g_treat,
dataset = .) %>% unlist())
I'm new to R. In the code below, the variable name wt_itvex_divided_by_4 needs to be replaced dynamically with wt_itvex_divided_by_3 or wt_itvex_divided_by_2:
df_odds_sv <- as_survey(
svydesign(id = ~psu+ID_fam,
strata = ~kstrata,
weights = ~wt_itvex_divided_by_4,
data = data_sd1013
)) %>%
dplyr::select(ID, ID_fam, psu, kstrata, wt_itvex_divided_by_4) %>%
subset(ID %in% df_odds$ID)
To make the name dynamic, I tried storing it in a temp variable, as below, but it didn't work:
temp='wt_itvex_divided_by_3'
df_odds_sv <- as_survey(
svydesign(id = ~psu+ID_fam,
strata = ~kstrata,
weights = ~temp,
data = data_sd1013
)) %>%
dplyr::select(ID, ID_fam, psu, kstrata, temp) %>%
subset(ID %in% df_odds$ID)
After some searching, I saw someone recommend get(), so I tried the following. It didn't raise an error, but the wt_itvex_divided_by_3 column wasn't selected:
s<-'wt_itvex_divided_by_3'
df_odds_sv <- as_survey(
svydesign(id = ~psu+ID_fam,
strata = ~kstrata,
weights = ~get(s),
data = data_sd1013
)) %>%
dplyr::select(ID, ID_fam, psu, kstrata, get(s)) %>%
subset(ID %in% df_odds$ID)
Referencing Ronak Shah's answer, I solved the issue with the following code (note that the name is passed differently to svydesign's weights and to dplyr::select):
temp='wt_itvex_divided_by_3'
df_odds_sv <- as_survey(
svydesign(id = ~psu+ID_fam,
strata = ~kstrata,
weights = ~data_sd1013[[temp]],
data = data_sd1013
)) %>%
dplyr::select(ID, ID_fam, psu, kstrata, temp) %>%
subset(ID %in% df_odds$ID)
You may try subsetting from the dataframe directly with [[.
Using apistrat data as an example.
library(survey)
library(srvyr)
data(api)
temp= "pw"
dstrat1 <- svydesign(id=~1,strata=~stype, weights= ~apistrat[[temp]],
data=apistrat, fpc=~fpc)
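Another option, not taken from the answer above, is to build the one-sided formula from the string with base R's reformulate(). This is a sketch under the assumption that svydesign accepts any formula object for weights; w_formula and dstrat2 are names invented here:

```r
library(survey)

data(api)  # provides apistrat, as in the example above

temp <- "pw"

# reformulate() turns the string "pw" into the one-sided formula ~pw
w_formula <- reformulate(temp)

dstrat2 <- svydesign(id = ~1, strata = ~stype, weights = w_formula,
                     data = apistrat, fpc = ~fpc)

print(w_formula)
print(head(weights(dstrat2)))
```

Because the formula refers to the column by name, the weights are looked up in the data argument rather than in a data frame named inside the formula.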
The dataset below has columns with very similar names and some values which are NA.
library(tidyverse)
dat <- data.frame(
v1_min = c(1,2,4,1,NA,4,2,2),
v1_max = c(1,NA,5,4,5,4,6,NA),
other_v1_min = c(1,1,NA,3,4,4,3,2),
other_v1_max = c(1,5,5,6,6,4,3,NA),
y1_min = c(3,NA,2,1,2,NA,1,2),
y1_max = c(6,2,5,6,2,5,3,3),
other_y1_min = c(2,3,NA,1,1,1,NA,2),
other_y1_max = c(5,6,4,2,NA,2,NA,NA)
)
head(dat)
In this example, v1 and y1 are what I would consider the common "categories" among the columns. To get something similar with my current dataset, I had to use gsub to tease these out:
cats<-dat %>%
names() %>%
gsub("^(.*)_(min|max)", "\\1",.) %>%
gsub("^(.*)_(.*)", "\\2",.) %>%
unique()
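To make the two gsub steps easier to follow, here they are applied to the column names directly, with the intermediate results kept. The single-pass regex at the end is a hypothetical alternative and assumes every column name ends in _min or _max:

```r
col_names <- c("v1_min", "v1_max", "other_v1_min", "other_v1_max",
               "y1_min", "y1_max", "other_y1_min", "other_y1_max")

# Step 1: drop the _min / _max suffix
step1 <- gsub("^(.*)_(min|max)", "\\1", col_names)

# Step 2: keep only what follows the last remaining underscore;
# names without an underscore (e.g. "v1") are left unchanged
step2 <- gsub("^(.*)_(.*)", "\\2", step1)

cats <- unique(step2)

# Hypothetical one-pass version: capture the token just before _min/_max
cats_one_pass <- unique(sub("^(?:.*_)?([^_]+)_(min|max)$", "\\1",
                            col_names, perl = TRUE))

print(step1)
print(cats)
print(cats_one_pass)
```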
Now, my goal is to mutate a new min and a new max column for each of those categories. So far the code below works just fine.
dat %>%
rowwise() %>%
mutate(min_v1 = min(c_across(contains(cats[1])), na.rm=T)) %>%
mutate(max_v1 = max(c_across(contains(cats[1])), na.rm=T)) %>%
mutate(min_y1 = min(c_across(contains(cats[2])), na.rm=T)) %>%
mutate(max_y1 = max(c_across(contains(cats[2])), na.rm=T))
However, the number of categories in my current dataset is much larger than 2. Is there a quicker way to implement this?
I've tried a few of the suggestions on this post but haven't quite been able to extend them to this problem.
You can use one of the map functions here to loop over the common categories.
library(dplyr)
library(purrr)
result <- bind_cols(dat, map_dfc(cats,
~dat %>%
rowwise() %>%
transmute(!!paste('min', .x, sep = '_') := min(c_across(matches(.x)), na.rm = TRUE),
!!paste('max', .x, sep = '_') := max(c_across(matches(.x)), na.rm = TRUE))))
result
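If rowwise() turns out to be slow on a wide dataset, a vectorised variant with pmin() and pmax() may be worth trying. This is a base R sketch on a shortened, made-up version of dat, and row_range is a helper name introduced here, not part of the answer above. Like the original code, it takes the minimum and maximum over every column whose name contains the category:

```r
# Shortened toy data (hypothetical values)
dat <- data.frame(
  v1_min = c(1, NA), v1_max = c(3, 5),
  other_v1_min = c(2, 4), other_v1_max = c(NA, 6),
  y1_min = c(3, 2), y1_max = c(6, NA)
)
cats <- c("v1", "y1")

# Hypothetical helper: row-wise min and max over all columns containing `cat`
row_range <- function(cat, df) {
  cols <- as.list(df[grepl(cat, names(df), fixed = TRUE)])
  out <- data.frame(
    do.call(pmin, c(cols, na.rm = TRUE)),
    do.call(pmax, c(cols, na.rm = TRUE))
  )
  names(out) <- paste(c("min", "max"), cat, sep = "_")
  out
}

result <- cbind(dat, do.call(cbind, lapply(cats, row_range, df = dat)))
print(result)
```

One difference to be aware of: for a row where every value is NA, pmin() and pmax() return NA, whereas min() and max() with na.rm = TRUE return Inf and -Inf with a warning.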
I'm using survey data with packages survey and srvyr and I have some trouble applying survey_mean() to all columns.
Here's an example:
library(survey)
library(srvyr)
data(api)
dstrata <- apistrat %>%
as_survey_design(strata = stype, weights = pw) %>%
mutate(api00 = ifelse(api00 == 467, NA, api00),
api99 = ifelse(api99 == 491, NA, api99))
sapply(dstrata$variables %>% select(api99, api00), function(x){
x <- enquo(x)
dstrata %>%
filter(!is.na(!!x)) %>%
summarise(stat = srvyr::survey_mean(!!x, na.rm = TRUE)[, 1])
})
Error: Assigned data x must be compatible with existing data.
x Existing data has 198 rows.
x Assigned data has 200 rows.
ℹ Only vectors of size 1 are recycled.
Run rlang::last_error() to see where the error occurred.
Note that:
dstrata %>%
select(api99, api00) %>%
summarise_all(.funs = srvyr::survey_mean, na.rm = T)
works with this example but not with my actual data so I would like to understand why the function above does not work.
I'm using srvyr_0.3.9 and survey_4.0
I don't see why you would need any kind of NSE here, because sapply passes only the values of each column to the function, not an expression.
This seems to work:
library(dplyr)
sapply(dstrata$variables %>% select(api99, api00), function(x){
dstrata %>%
summarise(stat = srvyr::survey_mean(x, na.rm = TRUE))
})
# api99 api00
#stat 630.3107 663.4118
#stat_se 10.14777 9.566393
I have the following code that takes a dataframe called dft1 and produces a resulting dataframe called dfb1. I want to repeat the same code for multiple input dataframes (dft1, dft2, ...), all indexed by a number at the end, and store the results using the same pattern, i.e. dfb1, dfb2, ....
I have tried many methods such as using dapply or for loops but given the nature of the code inside I wasn't able to get the intended results.
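library(tidyverse)   # dplyr, purrr, tidyr
library(tibbletime)  # rollify()
library(broom)       # tidy()
# define the rolling regression function
window <- 24
rolling_lm <-
rollify(.f = function(R_excess, MKT_RF, SMB, HML) {
lm(R_excess ~ MKT_RF + SMB + HML)
}, window = window, unlist = FALSE)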
#rolling over the variable
dfb1 <-
dft1 %>%
mutate(rolling_ff =
rolling_lm(R_excess,
MKT_RF,
SMB,
HML)) %>%
mutate(tidied = map(rolling_ff,
tidy,
conf.int = T)) %>%
unnest(tidied) %>%
slice(-1:-23) %>%
select(date, term, estimate, conf.low, conf.high) %>%
filter(term != "(Intercept)") %>%
rename(beta = estimate, factor = term) %>%
group_by(factor)
Put the commands you want to apply to each dataframe in a function:
apply_fun <- function(df) {
df %>%
mutate(rolling_ff =
rolling_lm(R_excess,
MKT_RF,
SMB,
HML)) %>%
mutate(tidied = map(rolling_ff,
tidy,
conf.int = T)) %>%
unnest(tidied) %>%
slice(-1:-23) %>%
select(date, term, estimate, conf.low, conf.high) %>%
filter(term != "(Intercept)") %>%
rename(beta = estimate, factor = term) %>%
group_by(factor)
}
Now apply the function to each dataframe and store the results in a list
n <- 10
out <- setNames(lapply(mget(paste0("dft", 1:n)), apply_fun), paste0("dfb", 1:n))
Assuming you have input dataframes named dft1, dft2, ..., this returns a list of dataframes that you can access with out[['dfb1']], out[['dfb2']], and so on. Change n to match the number of dft dataframes you have.
If the data is already in a list (here called result), we can avoid mget:
setNames(lapply(result, apply_fun), paste0("dfb", 1:n))
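The mget/lapply/setNames pattern can be tried out without the rolling regression. In the sketch below, apply_fun is a trivial stand-in (it only adds a doubled column) and the two input dataframes are made up:

```r
# Two toy inputs following the dft<number> naming pattern
dft1 <- data.frame(x = 1:3)
dft2 <- data.frame(x = 4:6)

# Stand-in for the real apply_fun: adds a column y = 2 * x
apply_fun <- function(df) {
  transform(df, y = x * 2)
}

n <- 2
inputs <- mget(paste0("dft", 1:n))   # named list: dft1, dft2
out <- setNames(lapply(inputs, apply_fun), paste0("dfb", 1:n))

print(names(out))
print(out[["dfb2"]])
```

If separate objects named dfb1, dfb2, ... are really needed, list2env(out, envir = globalenv()) would create them, although keeping the results in a list is usually easier to work with.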
I am trying to wrap the dplyr computation below in a function so that I can change the column name and the dataset name. The code is:
sample_table <- function(byvar = TRUE, dataset = TRUE) {
tcount <-
df2 %>% group_by(.dots = byvar) %>% tally() %>% arrange(byvar) %>% rename(tcount = n) %>%
left_join(
select(
dataset %>% group_by(.dots = byvar) %>% tally() %>% arrange(byvar) %>% rename(scount = n), byvar, scount
), by = c("byvar")
) %>%
mutate_each(funs(replace(., is.na(.), 0)), -byvar) %>% mutate(
tperc = round(tcount / rcount, digits = 2), sperc = round(scount / samplesize, digits = 2),
absdiff = abs(sperc - tperc)
) %>%
select(byvar, tcount, tperc, scount, sperc, absdiff)
return(tcount)
}
category_Sample1 <- sample_table(byvar = "category", dataset = Sample1)
My function name is sample_table.
The error message is:
Error: All select() inputs must resolve to integer column positions.
The following do not:
* byvar
I know this is a repeat question, and I have gone through the links below:
Function writing passing column reference to group_by
Error when combining dplyr inside a function
I am not sure where I am going wrong. rcount is the number of rows in df2 and samplesize is the number of rows in "dataset" dataframe. I have to compute the same thing for another variable with three different "dataset" names.
You are mixing column references given as strings (byvar; standard evaluation) with bare column names (tcount, tperc, etc.; non-standard evaluation) in the same call.
Pick one style and use the matching function: select() or select_(). You can fix your issue by using
select(one_of(c(byvar,'tcount')))
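With a more recent dplyr (1.0.0 or later is assumed here), the same idea can be written with all_of() and across() instead of the underscore verbs. The sketch below is a simplified rewrite of the question's function, not the original: compare_counts is a made-up name, it takes both data frames as arguments, and it uses nrow() in place of the global rcount and samplesize:

```r
library(dplyr)

# Hypothetical rewrite: byvar is a column name passed as a string
compare_counts <- function(population, sample_df, byvar) {
  pop <- count(population, across(all_of(byvar)), name = "tcount")
  smp <- count(sample_df, across(all_of(byvar)), name = "scount")

  pop %>%
    left_join(smp, by = byvar) %>%
    mutate(
      scount  = coalesce(scount, 0L),                  # groups missing from the sample
      tperc   = round(tcount / nrow(population), 2),
      sperc   = round(scount / nrow(sample_df), 2),
      absdiff = abs(sperc - tperc)
    ) %>%
    select(all_of(byvar), tcount, tperc, scount, sperc, absdiff)
}

# Toy data standing in for df2 and Sample1
df2 <- data.frame(category = c("a", "a", "b", "c"))
Sample1 <- data.frame(category = c("a", "b"))

category_Sample1 <- compare_counts(df2, Sample1, byvar = "category")
print(category_Sample1)
```

all_of() tells select() and across() that byvar holds a column name as a string, so string references and bare names such as tcount can be mixed safely in one call.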