I want to subtract the values of the row labeled "baseline" from all of the other rows for each user in a long-format data frame. It is easy to do this in two steps using a left_join with the "baseline" subset. However, I could not figure out how to combine vas_1 and vas_diff into one chain.
library(dplyr)
# Create test data
n_users = 5
vas = tibble(
  user = rep(letters[1:n_users], each = 3),
  group = rep(c("baseline", "early", "late"), n_users),
  vas = round(rgamma(n_users * 3, 10, 1.4))
)
# The above data are given
# Assume some other operations are required
vas_1 = vas %>%
mutate(
vas = vas * 2
)
# I want to put the following into one
# chain with the above
# Use self-join to subtract baseline
vas_diff = vas_1 %>%
filter(group != "baseline") %>%
# Problem is vas_1 here. Using . gives error here
# Adding copy = TRUE does not help
# left_join(. %>% filter(group == "baseline") , by = c("user")) %>%
left_join(vas_1 %>% filter(group == "baseline") , by = c("user")) %>%
mutate(vas = vas.x - vas.y) %>% # compute offset
select(user, group.x, vas) # remove temporary variables
vas_diff
I use an anonymous function when . needs to be used multiple times:
... %>% (function(df) { ... }) %>% ...
Hence, in your case:
vas_diff = vas_1 %>%
filter(group != "baseline") %>%
(function(df) left_join(df, df %>% filter(group == "baseline") , by = c("user"))) %>%
mutate(vas = vas.x - vas.y) %>% # compute offset
select(user, group.x, vas)
(which is not going to produce the desired result, as described in the comments above, but it shows how to use an anonymous function)
but probably you want this:
vas_diff = vas_1 %>%
left_join(
x = filter(., group != "baseline")
, y = filter(., group == "baseline")
, by = c("user")
) %>%
mutate(vas = vas.x - vas.y) %>% # compute offset
select(user, group.x, vas) # remove temporary variables
Here's a similar option, and a demonstration that you can pass whole pipe chains as arguments to the join. Instead of moving the . inside filter, you can pass . as an argument to eval, then remove unwanted columns in the right-hand side. This is largely just to document this approach for my own purposes.
vas_diff = vas_1 %>%
left_join(x = eval(.) %>%
filter(group != "baseline"),
y = eval(.) %>%
filter(group == "baseline") %>%
select(-group),
by = c("user")) %>%
mutate(vas = vas.x - vas.y) %>% # compute offset
select(user, group, vas)
Does anybody know why you can't simply pass the ., as in x = . %>% filter(...)? Why do we need eval(.)?
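A likely explanation (my own reading of magrittr, not part of the answers above): a pipeline that starts with the dot, such as . %>% filter(...), is not evaluated to a data frame; magrittr turns it into a functional sequence, i.e. a function, so left_join() would receive a function where it expects a table. eval(.) forces the dot to be treated as an ordinary value first. Another common workaround is to wrap the whole call in curly braces, which also stops magrittr from inserting the piped data as an extra first argument; a minimal sketch:
vas_diff = vas_1 %>%
  {
    left_join(
      x = filter(., group != "baseline"),
      y = filter(., group == "baseline") %>% select(-group),
      by = c("user")
    )
  } %>%
  mutate(vas = vas.x - vas.y) %>% # compute offset
  select(user, group, vas)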
Simple question. Considering the data frame below, I want to count distinct IDs: one count for all records and one after filtering on status. However, the %>% doesn't seem to work here. I just want a single value as output (so for total this should be 10, for closed it should be 5), not a dataframe. Neither of the commented lines works:
dat <- data.frame (ID = as.factor(c(1:10)),
status = as.factor(rep(c("open","closed"))))
total <- n_distinct(dat$ID)
#closed <- dat %>% filter(status == "closed") %>% n_distinct(dat$ID)
#closed <- dat %>% filter(status == "closed") %>% n_distinct(ID)
n_distinct expects a vector as input; you are passing a dataframe. You can do:
library(dplyr)
dat %>%
filter(status == "closed") %>%
summarise(n = n_distinct(ID))
# n
#1 5
Or without using filter:
dat %>% summarise(n = n_distinct(ID[status == "closed"]))
You can add %>% pull(n) to the above if you want a vector back and not a dataframe.
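For example, a minimal sketch of the same pipeline with pull() appended:
dat %>%
  filter(status == "closed") %>%
  summarise(n = n_distinct(ID)) %>%
  pull(n)
# [1] 5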
An option with data.table
library(data.table)
setDT(dat)[status == "closed"][, .(n = uniqueN(ID))]
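Since the question asked for a single value rather than a data frame, a small variant of the data.table call (my own addition, not part of the original answer) returns a plain scalar by dropping the .() wrapper around the j expression:
setDT(dat)[status == "closed", uniqueN(ID)]
# [1] 5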
In data analysis applied to psychology, we often want to check all results for each subject. Therefore, let's say I have this dataset:
library(tidyverse)
set.seed(123)
ds <- data.frame(subject = rep(1:4, each=4),
metadata = c("congruent_1","congruent_2","incongruent_1", "incongruent_2"),
reaction_time = rnorm(16,mean = 0.1, sd=0.02))
I can get the mean and standard deviation grouped by subject:
#mean
ds %>%
group_by(subject) %>%
filter(metadata == "congruent_1" | metadata == "congruent_2") %>%
summarise(mean_cong = mean(reaction_time))
#sd
ds %>%
group_by(subject) %>%
filter(metadata =="incongruent_1" | metadata == "incongruent_2") %>%
summarise(sd_cong_incong = sd(reaction_time))
However, now I need to compute a variable with the result of mean_cong / sd_cong_incong. I'm sure this is possible via group_by or nest, but I can't get the right code to do it.
Some fake code would be:
ds %>%
group_by(subject) %>%
filter(metadata == "congruent_1" | metadata == "congruent_2") %>%
summarise(mean_cong = mean(reaction_time)) %>%
unfilter() %>% #<- I know this is not possible
filter(metadata =="incongruent_1" | metadata == "incongruent_2") %>%
summarise(sd_cong_incong = sd(reaction_time)) %>%
mutate(pooled = mean_cong/sd_cong_incong)
And a fake output would be a data frame with one row per subject and the columns mean_cong, sd_cong_incong and pooled.
I want to stay within the tidyverse.
Thank you.
You can include logic within the summarise expression like this:
ds %>%
dplyr::group_by(subject) %>%
dplyr::summarise(
mean_cong = mean(reaction_time[metadata == "congruent_1" | metadata == "congruent_2"]),
sd_cong = sd(reaction_time[metadata == "incongruent_1" | metadata == "incongruent_2"])
) %>%
dplyr::mutate(
new_var = mean_cong/sd_cong
)
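An equivalent way to write the same filters (just a stylistic variant of the answer above, using the same data) is with %in%:
ds %>%
  group_by(subject) %>%
  summarise(
    mean_cong = mean(reaction_time[metadata %in% c("congruent_1", "congruent_2")]),
    sd_cong   = sd(reaction_time[metadata %in% c("incongruent_1", "incongruent_2")])
  ) %>%
  mutate(new_var = mean_cong / sd_cong)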
What is the best way to automate the mutate step in one dplyr aggregation?
It is best if I demonstrate with an example.
So in the first part of the example I am creating new columns based on the values of the variable carb. However, imagine I need to automate this step so that it 'iterates' over all unique values of carb and creates a new column for each value.
Is there any way to do so?
library(tidyverse)
cr <-
mtcars %>%
group_by(gear) %>%
nest()
# This is the 'by-hand' approach of what I would like to do - how can it be automated? E.g. we do not know all values of 'carb' in advance
cr$data[[1]] %>%
mutate(VARIABLE1 =
case_when(carb == 1 ~ hp/mpg,
TRUE ~ 0)) %>%
mutate(VARIABLE2 =
case_when(carb == 2 ~ hp/mpg,
TRUE ~ 0)) %>%
mutate(VARIABLE4 =
case_when(carb == 4 ~ hp/mpg,
TRUE ~ 0))
# This is a pseudo-idea of what I need to do. Is there any way to change the iteration number in ONE dplyr call?
vals <- cr$data[[1]] %>% pull(carb) %>% sort %>% unique()
for (i in vals) {
message(i)
cr$data[[1]] %>%
mutate(paste('VARIABLE', i, sep = '') = case_when(carb == i ~ hp/mpg, # At this line, all i shall be first element of vals
TRUE ~ 0)) %>%
mutate(paste('VARIABLE', i, sep = '') = case_when(carb == i ~ hp/mpg, # At this line, all i shall be second element of vals
TRUE ~ 0)) %>%
mutate(paste('VARIABLE', i, sep = '') = case_when(carb == i ~ hp/mpg, # At this line, all i shall be third element of vals
TRUE ~ 0))
}
One way would be to use dummy_cols from the fastDummies package.
Doing it for one dataframe at a time:
library(fastDummies)
cr$data[[1]] %>%
  dummy_cols(select_columns = 'carb') %>%
  mutate_at(vars(starts_with('carb_')), ~ . * hp / mpg)
You can also do this first and then group by gear, since the gear value is not used in the calculation, so it wouldn't matter. For that:
cr_new = mtcars %>%
  dummy_cols(select_columns = 'carb') %>%
  mutate_at(vars(starts_with('carb_')), ~ . * hp / mpg) %>%
  group_by(gear) %>%
  nest()
Perhaps something like this would help -
library(dplyr)
library(purrr)
bind_cols(mtcars, map_dfc(unique(mtcars$carb),
                          ~ mtcars %>%
                            transmute(!!paste0('carb', .x) := case_when(carb == .x ~ hp/mpg, TRUE ~ 0))))
It sounds a lot like what's called "the XY-problem".
https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem
Please read about tidy data, and/or tidyr's pivot_longer/pivot_wider. Column names should not encode information.
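To illustrate that comment with a sketch of my own (the column names car and ratio are mine, and passing a single value to values_fill assumes tidyr >= 1.1): keep the data long, compute the ratio once, and only widen at the end if a wide layout is really needed.
library(tidyverse)
mtcars %>%
  rownames_to_column("car") %>%   # keep the car names as an ordinary column
  mutate(ratio = hp / mpg) %>%    # compute the value once, in long form
  pivot_wider(names_from = carb,
              names_prefix = "carb_",
              values_from = ratio,
              values_fill = 0)    # 0 where a car does not have that carb value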
I have the following dataset:
combined <- data.frame(
client = c('aaa','aaa','aaa','bbb','bbb','ccc','ccc','ddd','ddd','ddd'),
type = c('norm','reg','opt','norm','norm','reg','opt','opt','opt','reg'),
age = c('>50','>50','75+','<25','<25','>50','75+','25-50','25-50','75+'),
cases = c('1','2','2','1','0','1','2','0','3','2'),
IsActive = c('1','0','0','1','1','0','1','1','1','0')
)
And I have identified the unique variable combinations with:
# get unique variable combinations
unique_vars <- combined %>%
select(1:3,5) %>%
distinct()
I am trying to iterate this query, combined %>% anti_join(slice(unique_vars, 1)), using purrr, and to save both the output of the query and a summary of cases from each output back to the unique_vars table. The slice should iterate through each row of unique_vars, not be fixed at 1.
I tried:
qry <- combined %>% anti_join(slice(unique_vars,1))
map(.x = unique_vars %>%
slice(.),
~qry %>%
summarise(CaseCnt = sum(cases)) %>%
inner_join(.x))
My desired output would be two things:
The full output of each query
The new field CaseCnt added to the unique_vars dataframe
Is this possible?
Although I don't completely follow the intuition behind your query, it seems that for #1 you would want:
lapply(1:nrow(unique_vars), function(x) {
combined %>%
anti_join(slice(unique_vars, x), keep = TRUE)
})
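For #1 with purrr instead of lapply, a roughly equivalent sketch (my own addition; as in the answer above, anti_join joins by all common columns by default):
library(purrr)
map(seq_len(nrow(unique_vars)),
    ~ combined %>% anti_join(slice(unique_vars, .x)))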
And for #2 you would want:
unique_vars$CaseCnt <- lapply(1:nrow(unique_vars), function(x) {
combined %>%
anti_join(slice(unique_vars, x), keep = TRUE) %>%
summarise(CaseCnt = sum(cases %>% as.numeric))
}) %>% do.call(what = rbind.data.frame,
args = .)
Alternatively, for #2 with purrr::map_df():
unique_vars$CaseCnt <- map_df(c(1:nrow(unique_vars)), function(x) {
combined %>%
anti_join(slice(unique_vars, x), keep = TRUE) %>%
summarise(CaseCnt = sum(cases %>% as.numeric))
})
Just as an aside -- you could do this directly with:
combined %>%
mutate(cases = as.numeric(cases)) %>%
mutate(tot_cases = sum(cases)) %>% # sum total cases across unique_id's
group_by(client, type, age, IsActive) %>%
summarize(CaseCnt = mean(tot_cases) - sum(cases))
Or if what you were actually looking for is the sum of cases in that group:
combined %>%
mutate(cases = as.numeric(cases)) %>%
group_by(client, type, age, IsActive) %>%
summarize(CaseCnt = sum(cases))
I have the following code that takes a dataframe called dft1 and produces a resulting dataframe called dfb1. I want to repeat the same code for multiple input dataframes such as dft1, dft2, ..., all indexed by a number at the end, and then store the results using the same pattern, i.e. dfb1, dfb2, ....
I have tried many methods, such as lapply or for loops, but given the nature of the code inside I wasn't able to get the intended results.
# required packages: rollify() comes from tibbletime, tidy() from broom,
# map() from purrr and unnest() from tidyr (both loaded with the tidyverse)
library(tidyverse)
library(tibbletime)
library(broom)
#define the function for rolling
window <- 24
rolling_lm <-
rollify(.f = function(R_excess, MKT_RF, SMB, HML) {
lm(R_excess ~ MKT_RF + SMB + HML)
}, window = window, unlist = FALSE)
#rolling over the variable
dfb1 <-
dft1 %>%
mutate(rolling_ff =
rolling_lm(R_excess,
MKT_RF,
SMB,
HML)) %>%
mutate(tidied = map(rolling_ff,
tidy,
conf.int = T)) %>%
unnest(tidied) %>%
slice(-1:-23) %>%
select(date, term, estimate, conf.low, conf.high) %>%
filter(term != "(Intercept)") %>%
rename(beta = estimate, factor = term) %>%
group_by(factor)
Wrap the commands you want to apply to each dataframe in a function:
apply_fun <- function(df) {
df %>%
mutate(rolling_ff =
rolling_lm(R_excess,
MKT_RF,
SMB,
HML)) %>%
mutate(tidied = map(rolling_ff,
tidy,
conf.int = T)) %>%
unnest(tidied) %>%
slice(-1:-23) %>%
select(date, term, estimate, conf.low, conf.high) %>%
filter(term != "(Intercept)") %>%
rename(beta = estimate, factor = term) %>%
group_by(factor)
}
Now apply the function to each dataframe and store the results in a list
n <- 10
out <- setNames(lapply(mget(paste0("dft", 1:n)), apply_fun), paste0("dfb", 1:n))
Assuming you have input dataframes named dft1, dft2, ..., this will output a list of dataframes which you can access with out[['dfb1']], out[['dfb2']] and so on. Change the value of n based on the number of dft dataframes you have.
If the dataframes are already present in a list (called result, say), we can avoid mget by doing:
setNames(lapply(result, apply_fun), paste0("dfb", 1:n))
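As a follow-up sketch (my own addition): if you really need the results as separate objects dfb1, dfb2, ... in the global environment rather than as a list, base R's list2env() can push the named list out into the workspace, although keeping them in the list is usually easier to work with:
list2env(out, envir = .GlobalEnv)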