How to specify a column name in ddply via character variable? - r

I have a tibble/dataframe with
sample_id condition state
---------------------------------
sample1 case val1
sample1 case val2
sample1 case val3
sample2 control val1
sample2 control val2
sample2 control val3
The dataframe is generated within a for loop for different states. Hence, every dataframe has a different name for the state column.
I want to group the data by sample_id and calculate the median of the state column such that every unique sample_id has a single median value. The output should be like below...
sample_id condition state
---------------------------------
sample1 case median
sample2 control median
I am trying the command below. It works if I give the name of the column directly, but I am not able to pass the name via the character variable state. I tried ensym(state) and !!ensym(state), but both throw errors.
ddply(dat_state, .(sample_id), summarize, condition=unique(condition), state_exp=median(ensym(state)))

As camille notes above, this is easier in dplyr. Basic syntax (not yet addressing your question):
my_df %>%
group_by(sample_id, condition) %>%
summarize(state = median(state))
Note that this syntax will give you values for every unique sample_id-condition pair. That isn't an issue in your example, since every sample_id has the same condition, but it is something to be aware of.
On to your question... It's not quite clear to me how you're planning to pass the state name to your calculation, but there are a couple of ways you can handle this. One is to use dplyr's "rename" function:
x <- "Massachusetts"
my_df %>%
rename(state = all_of(x)) %>% # all_of() marks x as a character vector holding a column name
group_by(sample_id, condition) %>%
summarize(state = median(state))
The (probably more proper) way to do this is to write a function using dplyr's "tidyeval" syntax:
myfunc <- function(df, state_name) {
df %>%
group_by(sample_id, condition) %>%
summarize(state = median({{state_name}}))
}
myfunc(my_df, Massachusetts) # Note: Unquoted state name
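If the column name is already held in a character variable (as in the question's loop), one further option is dplyr's .data pronoun, which looks a column up by string. The following is only a minimal sketch under assumptions: the toy data and the column name "STAT1" are made up for illustration and are not the asker's real data.
```r
library(dplyr)

# Toy data (assumption): "STAT1" stands in for whichever state column the loop is on.
my_df <- data.frame(
  sample_id = rep(c("sample1", "sample2"), each = 3),
  condition = rep(c("case", "control"), each = 3),
  STAT1     = c(1, 2, 6, 10, 20, 60)
)

state <- "STAT1"  # column name stored as a string

# .data[[state]] fetches the column whose name is the value of `state`.
med <- my_df %>%
  group_by(sample_id, condition) %>%
  summarize(med = median(.data[[state]]), .groups = "drop")
```
Because the lookup happens by string, the same pipeline can sit inside a for loop over column names without any quoting or unquoting.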

Thank you all for putting effort into answering my question. With your suggestions, I have found the solution. Below is the code to what I was trying to achieve by grouping sample_id and condition and passing state through a variable.
state_mark <- c("pPCLg2", "STAT1", "STAT5", "AKT")
for(state in state_mark){
dat_state <- dat_clust_stim[,c("sample_id", "condition", state)]
# I had to use !!ensym() to convert a character to a symbol.
dat_med <- group_by(dat_state, sample_id, condition) %>%
summarise(med = median(!!ensym(state)))
dat_med <- ungroup(dat_med)
x <- dat_med[dat_med$condition == "case", "med"]
y <- dat_med[dat_med$condition == "control", "med"]
t_test <- t.test(x$med, y$med)
}

If you want to stay old-fashioned, you can use the eval(parse(text=expression)) idiom:
ddply(dat_state, .(sample_id), summarize,
state_exp = eval(parse(text = paste("median(",state,")"))))
No fancy operators but mind the parentheses!

Related

Combine character variable over rows and columns by group in R

I am a beginner in R and I am trying to solve a problem in R, which is I guess quite easy for experienced users.
The problem is the following: Customers (A, B, C) are coming in repeatedly using different programs (Prg). I would like to identify "typical sequences" of programs. Therefore, I identify the first program they consume, the second, and the third. In a next step, I would like to combine this information into sequences of programs by customer. For a customer first consuming Prg1, then Prg2, then Prg3, the final outcome should be "Prg1-Prg2-Prg3".
The code below produces a dataframe similar to the one I have. Prg is the program in the respective year, First is the first year the customer enters, Sec the second and Third the third.
The code produces columns that extract the program consumed in the first contract (Code_1_Prg), second contract (Code_2_Prg) and third contract (Code_3_Prg).
Unfortunately, I have not succeeded in combining these 3 columns into the required result. I tried to group by ID and save the first element of the sequence in a new column called "chain1". Here I get the error message "Error in df %>% group_by(ID) %>% df$chain1 = df[df$Code_1_Prg != "NA", :
could not find function "%>%<-", even though I am using the magrittr and dplyr packages.
detach(package:plyr)
library(dplyr)
library(magrittr)
df %>%
group_by(ID) %>%
df$chain1 = df[df$Code_1_Prg!="NA", "Code_1_Prg"]
Below, I share some code, which produces the dataframe and the starting point for extracting the character variable in Code_1_Prg by group.
I would be really grateful, if you could help me with this. Thank you very much in advance!
df <- data.frame("ID"=c("A","A","A","A","B", "B", "B","B","B","C","C", "C", "C","C","C","C"),
"Year_Contract" =c("2010", "2015", "2017","2017","2010","2010", "2015","2015","2020","2015","2015","2017","2017","2017","2018","2018"),
"Prg"=c("AIB","AIB","LLA","LLA","BBU","BBU", "KLU","KLU","DDI","CKN","CKN","BBU","BBU","BBU","KLU","KLU"),
"First"=c("2010","2010","2010","2010","2010","2010", "2010","2010","2010","2015","2015","2015","2015","2015","2015","2015"),
"Sec"=c("2015","2015","2015","2015","2015","2015", "2015","2015","2015","2017","2017","2017","2017","2017","2017","2017"),
"Third"=c("2017","2017","2017","2017","2020","2020", "2020","2020","2020","2018","2018","2018","2018","2018","2018","2018")
)
df$Code_1_Prg <- ifelse(df$Year_Contract == df$First, df$Code_1_Prg <- df$Prg, NA)
df$Code_2_Prg <- ifelse(df$Year_Contract == df$Sec, df$Code_2_Prg <- df$Prg, NA)
df$Code_3_Prg <- ifelse(df$Year_Contract == df$Third, df$Code_3_Prg <- df$Prg, NA)
detach(package:plyr)
library(dplyr)
library(magrittr)
df %>%
group_by(ID) %>%
df$chain1 = df[df$Code_1_Prg!="NA", "Code_1_Prg"]
#This is the final column, I am trying to create
df2 <- data.frame("ID"=c("A","B", "C"),
"Goal" =c("AIB-LLA", "BBU-KLU-DDI", "CKN-BBU-KLU")
)
df <- merge(df, df2, by="ID")
Are you looking for something like this?
library(dplyr)
df %>%
group_by(ID) %>%
arrange(Year_Contract, .by_group = TRUE) %>%
distinct() %>%
summarise(sequence = toString(Prg))
ID sequence
<chr> <chr>
1 A AIB, AIB, LLA
2 B BBU, KLU, DDI
3 C CKN, BBU, KLU
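To get the hyphen-separated form from the question's Goal column ("AIB-LLA" rather than "AIB, AIB, LLA"), one possible variation is to keep each program once per customer and collapse with "-". This is a sketch on a trimmed copy of the question's data, not a tested drop-in for the full dataframe.
```r
library(dplyr)

# Trimmed copy of the question's data (only the columns needed here).
df <- data.frame(
  ID            = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C"),
  Year_Contract = c("2010", "2015", "2017", "2017", "2010", "2015", "2015", "2020",
                    "2015", "2017", "2018"),
  Prg           = c("AIB", "AIB", "LLA", "LLA", "BBU", "KLU", "KLU", "DDI",
                    "CKN", "BBU", "KLU")
)

chains <- df %>%
  arrange(ID, Year_Contract) %>%   # chronological order within each customer
  group_by(ID) %>%
  # unique() keeps the first occurrence of each program, in order
  summarise(sequence = paste(unique(Prg), collapse = "-"))
```
One caveat: unique() also drops a later return to an earlier program, so a customer going Prg1, Prg2, Prg1 would be reported as "Prg1-Prg2".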

Creating a loop in R for a function

I would like to create for loop to repeat the same function for 150 variables. I am new to R and I am a bit stuck.
To give you an example of some commands I need to repeat:
N <- table(df$ var1 ==0)["TRUE"]
n <- table(df$ var1 ==1)["TRUE"]
PREV95 <- (svyciprop(~ var1 ==1, level=0.95, design= design, deff= "replace")*100)
I need to run the same functions for 150 columns. I know that I need to put all my cols in one vector = x but then I don't know how to write the loop to repeat the same command for all my variables.
Can anyone help me to write a loop?
A word in advance: loops in R can in most cases be replaced with a faster, more R-ish approach (various flavours of applying, mapping, walking ...).
applying a function to the columns of dataframe df:
a)
with base R, example dataset cars
my_function <- function(xs) max(xs)
lapply(cars, my_function)
b)
tidyverse-style:
cars %>%
summarise_all(my_function)
An anecdotal example: I came across an R-script which took about half an hour to complete and made abundant use of for-loops. Replacing the loops with vectorized functions and members of the apply family cut the execution time down to about 3 minutes. So while for-loops and related constructs might be more familiar when coming from another language, they might soon get in your way with R.
This chapter of Hadley Wickham's R for data science gives an introduction into iterating "the R-way".
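Applied to the counts from the question, the same idea could look roughly like this. It is a sketch on made-up 0/1 data; the names var1 to var3 and the vector vars are placeholders for the real 150 columns, and the svyciprop() step is left out because it needs the survey design object.
```r
# Toy 0/1 data (assumption) standing in for the real survey columns.
df <- data.frame(var1 = c(0, 1, 1, 0, 1),
                 var2 = c(1, 1, 1, 0, 1),
                 var3 = c(0, 0, 0, 0, 1))
vars <- c("var1", "var2", "var3")  # in practice: the 150 column names

# vapply() calls the function once per column name and checks that
# each result is a single integer.
counts <- data.frame(
  variable = vars,
  N = vapply(vars, function(v) sum(df[[v]] == 0), integer(1), USE.NAMES = FALSE),
  n = vapply(vars, function(v) sum(df[[v]] == 1), integer(1), USE.NAMES = FALSE)
)
```
The result is one row per variable, which is usually easier to work with than 150 separate objects.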
Here is an approach that doesn't use loops. I've created a data set called df with three factor variables to represent your dataset as you described it, and a function eval() that does all the work.
First, it filters out just the factors. Then it converts the factors to numeric variables so that they sum as 0 and 1; summing the factors' underlying integer codes would instead be based on 1 and 2. Within the function I create another function, neg(), to give you the number of negative values by subtracting the sum of the 1s from the total length of the vector. The function then creates the dataframes "n" (sum of the positives), "N" (sum of the negatives), and PREV95. I used pivot_longer to get the data into a long format so that each stat you are looking for will be in its own column when merged together.
Note I had to leave PREV95 out because I do not have a 'design' object to pass to the function. I commented it out, but you can remove the comment marks to add it back in. I then used left_join to combine these dataframes and return "results"; again, the version that includes PREV95 is commented out. I think the logic for PREV95 should work, but I cannot check it without a 'design' object.
The function eval() takes your original dataframe as input and returns a dataframe, not a list, which you'll likely find easier to work with.
library(dplyr)
library(tidyr)
set.seed(100)
df <- data.frame(Var1 = factor(sample(c(0,1), 10, TRUE)),
Var2 = factor(sample(c(0,1), 10, TRUE)),
Var3 = factor(sample(c(0,1), 10, TRUE)))
eval <- function(df){
df1 <- df %>%
select_if(is.factor) %>%
mutate_all(function(x) as.numeric(as.character(x)))
neg <- function(x){
length(x) - sum(x)
}
n<- df1 %>%
summarize(across(where(is.numeric), sum)) %>%
pivot_longer(everything(), names_to = "Var", values_to = "n")
N <- df1 %>%
summarize(across(where(is.numeric), function(x) neg(x))) %>%
pivot_longer(everything(), names_to = "Var", values_to = "N")
#PREV95 <- df1 %>%
# summarize(across(where(is.numeric), function(x) survey::svyciprop(~x == 1, design = design, level = 0.95, deff = "replace")*100)) %>%
# pivot_longer(everything(), names_to = "Var", values_to = "PREV95")
results <- n %>%
left_join(N, by = "Var")
#results <- n %>%
# left_join(N, by = "Var") %>%
# left_join(PREV95, by = "Var")
return(results)
}
eval(df)
Var n N
<chr> <dbl> <dbl>
1 Var1 2 8
2 Var2 5 5
3 Var3 4 6
If you really wanted to use a for loop, here is how to make it work. Again, I've left out the survey function due to a lack of info on the parameters to make it work.
set.seed(100)
df <- data.frame(Var1 = factor(sample(c(0,1), 10, TRUE)),
Var2 = factor(sample(c(0,1), 10, TRUE)),
Var3 = factor(sample(c(0,1), 10, TRUE)))
VarList <- names(df %>% select_if(is.factor))
results <- list()
for (var in VarList){
results[[var]][["n"]] <- sum(df[[var]] == 1)
results[[var]][["N"]] <- sum(df[[var]] == 0)
}
unlist(results)
Var1.n Var1.N Var2.n Var2.N Var3.n Var3.N
2 8 5 5 4 6

R function with list of variables of unknown length

Trying to branch out and learn some R. One thing I do often at my job is pull weighted means by some time-specific period variable. I figured out how to do that individually like this:
means_by_period <- df %>%
group_by(period) %>%
summarize(var1 = weighted.mean(var1, wgtvar),
var2 = weighted.mean(var2, wgtvar),
var3 = weighted.mean(var3, wgtvar),
var4 = weighted.mean(var4, wgtvar)
)
We do this all the time but I am not always going to know how many variables/what variables I am going to be pulling and it would be a pain to edit this code every time, so I built an excel sheet to do it for me, but this seems like a good opportunity to learn how to write a function to do it. Problem is I am not sure how to write it such that it will work. I know my arguments will be: 1. the current data set 2. the period 3. the weighted variable 4. a concatenated vector of my variables?
newfunction <- function(df, period, weight, variables)
{df %>%
group_by(period) %>%
summarize(var1 = weighted.mean(var1, weight),
var2 = weighted.mean(var2, weight),
var3 = weighted.mean(var3, weight),
var4 = weighted.mean(var4, weight) )
}
I am like 2 weeks into learning so if anyone could give me some pointers on what I'd need to do here that would be great. Thanks!
If 'var1', 'var2', 'var3', 'var4' are passed as a vector of column names (as strings in 'variables'), then we can convert each to a symbol and evaluate it (!!):
library(dplyr)
newfunction <- function(df, period, weight, variables) {
df %>%
group_by({{period}}) %>%
summarize(
!! variables[1] := weighted.mean( !! rlang::sym(variables[1]), {{weight}}),
!! variables[2] := weighted.mean( !! rlang::sym(variables[2]), {{weight}}),
!! variables[3] := weighted.mean( !! rlang::sym(variables[3]), {{weight}}),
!! variables[4] := weighted.mean( !! rlang::sym(variables[4]), {{weight}}) )
}
Here, the column names for 'period' and 'weight' are assumed to be passed unquoted, while 'variables' is a vector of strings.
As the OP mentioned that 'variables' can be of unknown length, we can loop over the vector of column names ('variables') with map:
library(purrr)
newfunction2 <- function(df, period, weight, variables) {
map(variables, ~ df %>%
group_by({{period}}) %>%
summarise(!! .x := weighted.mean(!! rlang::sym(.x), {{weight}}))) %>%
reduce(full_join)
}
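A variation that avoids the join step is to build one weighted.mean() call per column and splice the whole list into a single summarise() with !!!. This is only a sketch: weighted_means is a hypothetical helper name, and unlike the functions above it assumes that period and weight are also passed as strings.
```r
library(dplyr)
library(rlang)

# Hypothetical helper: every argument except df is a string / character vector.
weighted_means <- function(df, period, weight, variables) {
  # Build one unevaluated weighted.mean(<var>, <weight>) call per column.
  calls <- lapply(variables, function(v) {
    expr(weighted.mean(!!sym(v), !!sym(weight)))
  })
  names(calls) <- variables  # the names become the output column names

  df %>%
    group_by(!!sym(period)) %>%
    summarise(!!!calls, .groups = "drop")
}

# Toy data (assumption) to show the call.
toy <- data.frame(period = c(1, 1, 2, 2),
                  wgtvar = c(1, 3, 1, 1),
                  var1   = c(10, 20, 30, 50),
                  var2   = c(0, 4, 2, 6))
res <- weighted_means(toy, "period", "wgtvar", c("var1", "var2"))
```
The length of 'variables' does not matter here, since the list of calls grows with it.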

How to sum up a list of variables in a customized dplyr function?

Starting point:
I have a dataset (tibble) which contains a lot of Variables of the same class (dbl). They belong to different settings. A variable (column in the tibble) is missing. This is the rowSum of all variables belonging to one setting.
Aim:
My aim is to produce sub data sets with the same data structure for each setting including the "rowSum"-Variable (i call it "s1").
Problem:
In each setting there are a different number of variables (and of course they are named differently).
Because it should be the same structure with different variables it is a typical situation for a function.
Question:
How can I solve the problem using dplyr?
I wrote a function to
(1) subset the original dataset for the interesting setting (this works) and
(2) compute the rowSums of the setting's variables (this does not work; why?).
Because it is a function for a special designed dataset, the function includes two predefined variables:
day - which is any day of an investigation period
N - which is the Number of cases investigated on this special day
Thank you for any help.
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
#print(subvars)
# Summarize the variables belonging to the interessting setting
dfplot <- dataset %>%
dplyr::select(day,N,!!! subvars) %>%
dplyr::mutate(s1 = rowSums(!!! subvars,na.rm = TRUE))
return(dfplot)
}
We can convert the quosures to strings with as_name and subset the dataset with [ for the rowSums:
library(rlang)
library(purrr)
library(dplyr)
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
v1 <- map_chr(subvars, as_name)
#print(subvars)
# Summarize the variables belonging to the interessting setting
dfplot <- dataset %>%
dplyr::select(day, N, !!! subvars) %>%
dplyr::mutate(s1 = rowSums( .[v1],na.rm = TRUE))
return(dfplot)
}
out <- mkr.sumsetting(col1, col2, dataset = df1)
head(out, 3)
# day N col1 col2 s1
#1 1 20 -0.5458808 0.4703824 -0.07549832
#2 2 20 0.5365853 0.3756872 0.91227249
#3 3 20 0.4196231 0.2725374 0.69216051
Or another option would be select the quosure and then do the rowSums
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
#print(subvars)
# Summarize the variables belonging to the interessting setting
dfplot <- dataset %>%
dplyr::select(day, N, !!! subvars) %>%
dplyr::mutate(s1 = dplyr::select(., !!! subvars) %>%
rowSums(na.rm = TRUE))
return(dfplot)
}
mkr.sumsetting(col1, col2, dataset = df1)
data
set.seed(24)
df1 <- data.frame(day = 1:20, N = 20, col1 = rnorm(20),
col2 = runif(20))

Using variables as arguments in summarize()

I wish to pass user input variables to group_by() and summarize() functions.
The direct example of the data frame and code is below. Here I am 'hard-coding' the column names.
library(dplyr)
df <- data.frame('Category' = c('a','c','a','a','b','a','b','b'),
'Amt' = c(100,300,200,400,500,1000,350,250),
'Flag' = c(0,1,1,1,0,1,1,0))
rowCount <- nrow(df)
totalAmt <- sum(df$Amt)
g <- group_by(df, Category)
summ <- summarize(g, Count = n(), CountPercentage = n()*100/rowCount, TotalAmt = sum(Amt), AmtPercentage = sum(Amt)*100/totalAmt, FlagSum = sum(Flag))
summ
The output is below
In the application I am developing, the dataframe and hence the columns names will be user-defined. I will be reading the .csv file name, the column(s) to be grouped on and the columns to be summarized on from an Excel file.
I have searched far and wide and after spending much time reading and experimenting, I found the solution as shown below which worked for me. I have not used piping to make the steps clearer.
#The data frame df is read from the .csv file name
#Variables read from the Excel file
groupby <- 'Category'
sumBy1 <- 'Amt'
sumBy2 <- 'Flag'
rowCount <- nrow(df)
totalAmt <- sum(df[sumBy1])
g <- group_by_(df, groupby) #group by variable #grouping
summcount <- summarize(g, Count = n(), CountPercentage = n()*100/rowCount) #summarize counts #piece 1
summamt <- summarize_at(g, .vars = sumBy1, .funs=sum) #summarize by first variable
summamt <- summamt[-1] #remove first column to remove duplicate column
summamt$AmtPercentage <- summamt[sumBy1]*100/totalAmt #piece 2
summflag <- summarize_at(g, .vars = sumBy2, .funs=sum) #summarize by second variable
summflag <- summflag[-1] #remove first column to remove duplicate column #piece 3
summ <- cbind(summcount, summamt, summflag) #combine dataframes
summ
The result is the same as above. As you can see I am creating the final dataframe piecemeal and then binding them. The code is ugly. Also, how do I define the column headers in this syntax? I did consider summarize_all() but that requires creating a subset of the data frame. I have already read the following questions and they did not work for me
Passing arguments to dplyr summarize function
Summarizing data in table by group for each variable in r
Can you recommend a simpler and more elegant way to do this?
Above I have 'hardcoded' two types of summarization, viz. count and sum. To add another level of complication, what if the user wants to also define the type of summarization (viz. sum, mean, count, etc.) required? In the Excel file, I can capture the type of summarization needed against each variable.
Thanks for any suggestions.
That sounds like a job for Superman! Or at least quasi-quotations.
You want to insert variables using the bang-bang operator, !!.
You can do it like this
# Make a variable symbol from strings
make_var <- function(prefix, var, suffix)
as.symbol(paste0(prefix, var, suffix))
calc_summary <- function(df, groupby, sumBy1, sumBy2) {
totalSumBy1 <- make_var("Total", sumBy1, "")
sumBy1Percentage <- make_var("", sumBy1, "Percentage")
sumBy1 <- make_var("", sumBy1, "")
sumBy2Sum <- make_var("", sumBy2, "Sum")
sumBy2 <- make_var("", sumBy2, "")
group_by_(df, groupby) %>%
summarize(Count = n(),
CountPercentage = n()*100/nrow(df), # placeholder using nrow(df) instead of a global rowCount; recomputed in mutate() below
!!totalSumBy1 := sum(!!sumBy1),
!!sumBy2Sum := sum(!!sumBy2)) %>%
mutate(CountPercentage = Count/sum(Count),
!!sumBy1Percentage := 100 * !!totalSumBy1 / sum(!!totalSumBy1))
}
When you use !! you are inserting the value of a variable, so this is how you can parameterise expressions given to dplyr functions. You need them as symbols, which is why I use the make_var function. It can be done more elegantly, but this will give you the variables you used in your example.
Notice that when the variables we assign to are dynamic we must use the := assignment instead of =. Otherwise, the parser complains.
You can use this function as such:
> df %>% calc_summary("Category", "Amt", "Flag")
# A tibble: 3 x 6
Category Count CountPercentage TotalAmt FlagSum AmtPercentage
<fct> <int> <dbl> <dbl> <dbl> <dbl>
1 a 4 0.500 1700. 3. 54.8
2 b 3 0.375 1100. 1. 35.5
3 c 1 0.125 300. 1. 9.68
The order of columns is not the same as in your example, but you can fix that using select. I cleaned up the percentage calculations a bit by moving those to a mutate after the summary. It removes the need for the rowCount variable. If you prefer, you can easily use that variable and avoid the mutate call. Then you can also get the columns in the order you want in the summarise call.
Anyway, the important point is that you want the bang-bang operator for what you are doing here.
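For comparison, here is a sketch of the same idea without building symbols by hand, assuming a dplyr version that supports the .data pronoun, across() and glue-style names (1.0 or later). The function name calc_summary2 is made up for illustration, and the percentage columns are left out to keep it short.
```r
library(dplyr)

# Hypothetical variant of calc_summary(): all column names arrive as strings.
calc_summary2 <- function(df, groupby, sumBy1, sumBy2) {
  df %>%
    group_by(across(all_of(groupby))) %>%          # group by column name(s) given as strings
    summarize(Count = n(),
              "Total{sumBy1}" := sum(.data[[sumBy1]]),  # e.g. TotalAmt
              "{sumBy2}Sum"   := sum(.data[[sumBy2]]),  # e.g. FlagSum
              .groups = "drop")
}

# Data from the question.
df <- data.frame(Category = c("a", "c", "a", "a", "b", "a", "b", "b"),
                 Amt      = c(100, 300, 200, 400, 500, 1000, 350, 250),
                 Flag     = c(0, 1, 1, 1, 0, 1, 1, 0))
res <- calc_summary2(df, "Category", "Amt", "Flag")
```
The string on the left of := is a glue template, so the output column names are still derived from the user-supplied names.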
