I wish to pass user input variables to group_by() and summarize() functions.
The direct example of the data frame and code is below. Here I am 'hard-coding' the column names.
library(dplyr)
df <- data.frame('Category' = c('a','c','a','a','b','a','b','b'),
'Amt' = c(100,300,200,400,500,1000,350,250),
'Flag' = c(0,1,1,1,0,1,1,0))
rowCount <- nrow(df)
totalAmt <- sum(df$Amt)
g <- group_by(df, Category)
summ <- summarize(g, Count = n(), CountPercentage = n()*100/rowCount, TotalAmt = sum(Amt), AmtPercentage = sum(Amt)*100/totalAmt, FlagSum = sum(Flag))
summ
The output is below
In the application I am developing, the dataframe and hence the columns names will be user-defined. I will be reading the .csv file name, the column(s) to be grouped on and the columns to be summarized on from an Excel file.
I have searched far and wide and after spending much time reading and experimenting, I found the solution as shown below which worked for me. I have not used piping to make the steps clearer.
#The data frame df is read from the .csv file name
#Variables read from the Excel file
groupby <- 'Category'
sumBy1 <- 'Amt'
sumBy2 <- 'Flag'
rowCount <- nrow(df)
totalAmt <- sum(df[sumBy1])
g <- group_by_(df, groupby) #group by variable #grouping
summcount <- summarize(g, Count = n(), CountPercentage = n()*100/rowCount) #summarize counts #piece 1
summamt <- summarize_at(g, .vars = sumBy1, .funs=sum) #summarize by first variable
summamt <- summamt[-1] #remove first column to remove duplicate column
summamt$AmtPercentage <- summamt[sumBy1]*100/totalAmt #piece 2
summflag <- summarize_at(g, .vars = sumBy2, .funs=sum) #summarize by second variable
summflag <- summflag[-1] #remove first column to remove duplicate column #piece 3
summ <- cbind(summcount, summamt, summflag) #combine dataframes
summ
The result is the same as above. As you can see I am creating the final dataframe piecemeal and then binding them. The code is ugly. Also, how do I define the column headers in this syntax? I did consider summarize_all() but that requires creating a subset of the data frame. I have already read the following questions and they did not work for me
Passing arguments to dplyr summarize function
Summarizing data in table by group for each variable in r
Can you recommend a simpler and more elegant way to do this?
Above I have 'hardcoded' two types of summarization, viz. count and sum. To add another level of complication, what if the user wants to also define the type of summarization (viz. sum, mean, count, etc.) required? In the Excel file, I can capture the type of summarization needed against each variable.
Thanks for any suggestions.
That sounds like a job for Superman! Or at least quasi-quotations.
You want to insert variables using the bang-bang operator, !!.
You can do it like this
# Make a variable symbol from strings
make_var <- function(prefix, var, suffix)
as.symbol(paste0(prefix, var, suffix))
calc_summary <- function(df, groupby, sumBy1, sumBy2) {
totalSumBy1 <- make_var("Total", sumBy1, "")
sumBy1Percentage <- make_var("", sumBy1, "Percentage")
sumBy1 <- make_var("", sumBy1, "")
sumBy2Sum <- make_var("", sumBy2, "Sum")
sumBy2 <- make_var("", sumBy2, "")
group_by(df, !!as.symbol(groupby)) %>%
summarize(Count = n(),
!!totalSumBy1 := sum(!!sumBy1),
!!sumBy2Sum := sum(!!sumBy2)) %>%
mutate(CountPercentage = Count/sum(Count),
!!sumBy1Percentage := 100 * !!totalSumBy1 / sum(!!totalSumBy1))
}
When you use !! you are inserting the value of a variable, so this is how you can parameterise expressions given to dplyr functions. You need them as symbols, which is why I use the make_var function. It can be done more elegantly, but this will give you the variables you used in your example.
Notice that when the variables we assign to are dynamic we must use the := assignment instead of =. Otherwise, the parser complains.
You can use this function as such:
> df %>% calc_summary("Category", "Amt", "Flag")
# A tibble: 3 x 6
Category Count CountPercentage TotalAmt FlagSum AmtPercentage
<fct> <int> <dbl> <dbl> <dbl> <dbl>
1 a 4 0.500 1700. 3. 54.8
2 b 3 0.375 1100. 1. 35.5
3 c 1 0.125 300. 1. 9.68
The order of columns is not the same as in your example, but you can fix that using select. I cleaned up the percentage calculations a bit by moving those to a mutate after the summary. It removes the need for the rowCount variable. If you prefer, you can easily use that variable and avoid the mutate call. Then you can also get the columns in the order you want in the summarise call.
Anyway, the important point is that you want the bang-bang operator for what you are doing here.
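For readers on dplyr 1.0.0 or later, the same idea can probably be sketched without building symbols by hand, using the .data pronoun and glue-style names on the left of :=. This is only a sketch under assumptions: calc_summary2 is a made-up name, the column names are assumed to arrive as plain strings, and the percentages are scaled to 0-100 here.

```r
library(dplyr)

# Same example data as in the question
df <- data.frame(Category = c("a", "c", "a", "a", "b", "a", "b", "b"),
                 Amt  = c(100, 300, 200, 400, 500, 1000, 350, 250),
                 Flag = c(0, 1, 1, 1, 0, 1, 1, 0))

# calc_summary2 is a hypothetical name; all column names are passed as strings
calc_summary2 <- function(df, groupby, sumBy1, sumBy2) {
  total_name <- paste0("Total", sumBy1)
  df %>%
    group_by(across(all_of(groupby))) %>%
    summarize(Count = n(),
              "Total{sumBy1}" := sum(.data[[sumBy1]]),   # e.g. TotalAmt
              "{sumBy2}Sum"   := sum(.data[[sumBy2]]),   # e.g. FlagSum
              .groups = "drop") %>%
    mutate(CountPercentage = 100 * Count / sum(Count),
           "{sumBy1}Percentage" := 100 * .data[[total_name]] /
                                     sum(.data[[total_name]]))
}

summ2 <- calc_summary2(df, "Category", "Amt", "Flag")
summ2
```

The .data[[name]] form looks a column up by its string name, so no symbol conversion is needed, and the "{name}" strings are filled in from the function's arguments.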
I would like to create for loop to repeat the same function for 150 variables. I am new to R and I am a bit stuck.
To give you an example of some commands I need to repeat:
N <- table(df$var1 == 0)["TRUE"]
n <- table(df$var1 == 1)["TRUE"]
PREV95 <- (svyciprop(~ var1 == 1, level = 0.95, design = design, deff = "replace") * 100)
I need to run the same functions for 150 columns. I know that I need to put all my cols in one vector = x but then I don't know how to write the loop to repeat the same command for all my variables.
Can anyone help me to write a loop?
A word in advance: loops in R can in most cases be replaced with a faster, more idiomatic approach (various flavours of apply, mapping, walking ...).
applying a function to the columns of dataframe df:
a)
with base R, example dataset cars
my_function <- function(xs) max(xs)
lapply(cars, my_function)
b)
tidyverse-style:
cars %>%
summarise_all(my_function)
An anecdotal example: I came across an R-script which took about half an hour to complete and made abundant use of for-loops. Replacing the loops with vectorized functions and members of the apply family cut the execution time down to about 3 minutes. So while for-loops and related constructs might be more familiar when coming from another language, they might soon get in your way with R.
This chapter of Hadley Wickham's R for data science gives an introduction into iterating "the R-way".
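As a rough sketch of how this might apply to the counts in the question (N for zeros, n for ones), the helper below is run over every column with vapply(). The data, the column names and the helper name count_binary are all made up for illustration; the survey step is left out because it needs a design object.

```r
# Toy data standing in for the 150 binary columns (names are made up)
df <- data.frame(var1 = c(0, 1, 1, 0, 1),
                 var2 = c(1, 1, 1, 0, 1),
                 var3 = c(0, 0, 0, 0, 1))

# count_binary is a hypothetical helper: zeros (N) and ones (n) in one column
count_binary <- function(x) c(N = sum(x == 0), n = sum(x == 1))

# vapply() calls the helper on each column and checks that every result
# is an integer vector of length 2
counts <- vapply(df, count_binary, FUN.VALUE = c(N = 0L, n = 0L))

# Transpose so that each variable is a row and N / n are columns
counts <- as.data.frame(t(counts))
counts
```

To restrict this to a chosen set of columns, df[x] can be passed instead of df, where x is the character vector of column names mentioned in the question.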
Here is an approach that doesn't use loops. I've created a data set called df with three factor variables to represent your dataset as you described it, and a function eval() that does all the work.
First, it keeps only the factor columns. Then it converts those factors to numeric variables so that the values are summed as 0 and 1; summing the factors directly would use their internal codes, 1 and 2. Within the function I create another function, neg(), which gives the number of negative values by subtracting the sum of the 1s from the total length of the vector. The function then creates the data frames "n" (count of the positives), "N" (count of the negatives), and PREV95. I used pivot_longer to put the data in long format so that each statistic ends up in its own column when the pieces are merged.
Note that I had to leave PREV95 out because I do not have a 'design' object to pass to the function. I commented it out, but you can remove the comment markers to add it back in. I then used left_join to combine these data frames and return "results"; again, the version that includes PREV95 is commented out. I think the logic for PREV95 should work, but I cannot check it without a 'design' object.
The function eval() takes your original data frame as input and returns a data frame, not a list, which you'll likely find easier to work with.
library(dplyr)
library(tidyr)
set.seed(100)
df <- data.frame(Var1 = factor(sample(c(0,1), 10, TRUE)),
Var2 = factor(sample(c(0,1), 10, TRUE)),
Var3 = factor(sample(c(0,1), 10, TRUE)))
eval <- function(df){
df1 <- df %>%
select_if(is.factor) %>%
mutate_all(function(x) as.numeric(as.character(x)))
neg <- function(x){
length(x) - sum(x)
}
n<- df1 %>%
summarize(across(where(is.numeric), sum)) %>%
pivot_longer(everything(), names_to = "Var", values_to = "n")
N <- df1 %>%
summarize(across(where(is.numeric), function(x) neg(x))) %>%
pivot_longer(everything(), names_to = "Var", values_to = "N")
#PREV95 <- df1 %>%
# summarize(across(where(is.numeric), function(x) survey::svyciprop(~x == 1, design = design, level = 0.95, deff = "replace")*100)) %>%
# pivot_longer(everything(), names_to = "Var", values_to = "PREV95")
results <- n %>%
left_join(N, by = "Var")
#results <- n %>%
# left_join(N, by = "Var") %>%
# left_join(PREV95, by = "Var")
return(results)
}
eval(df)
Var n N
<chr> <dbl> <dbl>
1 Var1 2 8
2 Var2 5 5
3 Var3 4 6
If you really wanted to use a for loop, here is how to make it work. Again, I've left out the survey function due to a lack of info on the parameters to make it work.
set.seed(100)
df <- data.frame(Var1 = factor(sample(c(0,1), 10, TRUE)),
Var2 = factor(sample(c(0,1), 10, TRUE)),
Var3 = factor(sample(c(0,1), 10, TRUE)))
VarList <- names(df %>% select_if(is.factor))
results <- list()
for (var in VarList){
results[[var]][["n"]] <- sum(df[[var]] == 1)
results[[var]][["N"]] <- sum(df[[var]] == 0)
}
unlist(results)
Var1.n Var1.N Var2.n Var2.N Var3.n Var3.N
2 8 5 5 4 6
I have a data frame that has a binary variable for diagnosis (column 1) and 165 nutrient variables (columns 2-166) for n=237. Let’s call this dataset nutr_all. I need to create 165 new variables that take the natural log of each of the nutrient variables. So, I want to end up with a data frame that has 331 columns - column 1 = diagnosis, cols 2-166 = nutrient variables, cols 167-331 = log transformed nutrient variables. I would like these variables to take the name of the old variables but with "_log" at the end
I have tried using a for loop and the mutate command, but, I'm not very well versed in r, so, I am struggling quite a bit.
for (nutr in (nutr_all_nomiss[,2:166])){
nutr_all_log <- mutate(nutr_all, nutr_log = log(nutr) )
}
When I do this, it just creates a single new variable called nutr_log. I know I need to let r know that the "nutr" in "nutr_log" is the variable name in the for loop, but I'm not sure how.
For anyone encountering this page more recently, dplyr::across() was introduced in dplyr 1.0.0 (mid-2020) and is built for exactly this task: applying the same transformation to many columns at once.
A simple solution is below.
If you need to be selective about which columns you want to transform, check out the tidyselect helper functions by running ?tidyr_tidy_select in the R console.
library(tidyverse)
# create vector of column names
variable_names <- paste0("nutrient_variable_", 1:165)
# create random data for example
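# (replicate() is used because purrr::rerun() is deprecated as of purrr 1.0.0)
data_values <- replicate(n = 165,
sample(x = 100,
size = 237,
replace = TRUE),
simplify = FALSE)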
# set names of the columns, coerce to a tibble,
# and add the diagnosis column
nutr_all <- data_values %>%
set_names(variable_names) %>%
as_tibble() %>%
mutate(diagnosis = 1:237) %>%
relocate(diagnosis, .before = everything())
# use across to perform same transformation on all columns
# whose names contain the phrase 'nutrient_variable'
nutr_all_with_logs <- nutr_all %>%
mutate(across(
.cols = contains('nutrient_variable'),
.fns = list(log10 = log10),
.names = "{.col}_{.fn}"))
# print out a small sample of data to validate
nutr_all_with_logs[1:5, c(1, 2:3, 166:168)]
Personally, instead of adding all the columns to the data frame,
I would prefer to make a new data frame that contains only the
transformed values, and change the column names:
logs_only <- nutr_all %>%
mutate(across(
.cols = contains('nutrient_variable'),
.fns = log10)) %>%
rename_with(.cols = contains('nutrient_variable'),
.fn = ~paste0(., '_log10'))
logs_only[1:5, 1:3]
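The question asked for natural logs with a "_log" suffix rather than log10, so here is a small sketch of that variant. It assumes dplyr 1.0.0 or later; the data frame nutr_small and its column names are invented stand-ins for nutr_all.

```r
library(dplyr)

# Small stand-in for nutr_all: diagnosis plus three nutrient columns
# (column names and values are made up for illustration)
nutr_small <- data.frame(diagnosis = c(0, 1, 1),
                         iron      = c(1, 10, 100),
                         zinc      = c(2, 4, 8),
                         calcium   = c(5, 50, 500))

# Natural log of every column except diagnosis; the originals are kept
# and the new columns are named <old name>_log
nutr_small_log <- nutr_small %>%
  mutate(across(.cols = -diagnosis, .fns = log, .names = "{.col}_log"))

names(nutr_small_log)
```

Because the new columns are appended after the existing ones, the layout matches what the question describes: diagnosis first, then the nutrient columns, then their log-transformed versions.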
We can use mutate_at
library(dplyr)
nutr_all_log <- nutr_all_nomiss %>%
mutate_at(2:166, list(nutr_log = ~ log(.)))
In base R, we can do this directly on the data.frame
nm1 <- paste0(names(nutr_all_nomiss)[2:166], "_nutr_log")
nutr_all_nomiss[nm1] <- log(nutr_all_nomiss[2:166])
In base R, we can use lapply :
nutr_all_nomiss[paste0(names(nutr_all_nomiss)[2:166], "_log")] <- lapply(nutr_all_nomiss[2:166], log)
Here is a solution using only base R:
First I will create a dataset equivalent to yours:
nutr_all <- data.frame(
diagnosis = sample(c(0, 1), size = 237, replace = TRUE)
)
for(i in 2:166){
nutr_all[i] <- runif(n = 237, 1, 10)
names(nutr_all)[i] <- paste0("nutrient_", i-1)
}
Now let's create the new variables and append them to the data frame:
nutr_all_log <- cbind(nutr_all, log(nutr_all[, -1]))
And this takes care of the names:
names(nutr_all_log)[167:331] <- paste0(names(nutr_all[-1]), "_log")
The function below uses dplyr to log-transform all numeric variables in a dataset. It also checks whether a column contains zero or negative values; as written, it does not calculate the log for those columns.
logTransformation <- function(ds)
{
# this function creates log transformations of a data frame, only for variables that are strictly positive
# args:
# ds : Dataset
require(dplyr)
# is.data.frame() also accepts tibbles, whose class vector has more than one element
if (!is.data.frame(ds)) { stop("ds must be a data frame") }
ds <- ds %>%
dplyr::select_if(is.numeric)
# keep only variables whose minimum is positive
varList <- names(ds)[sapply(ds, function(x) min(x, na.rm = TRUE)) > 0]
ds <- ds %>%
dplyr::select(all_of(varList)) %>%
dplyr::mutate_at(
setNames(varList, paste0(varList, "_log")), log)
return(ds)
}
You can use it for your case as follows:
# assuming your binary variable has the name binaryVar
nutr_allTransformed <- nutr_all %>% dplyr::select(-binaryVar) %>% logTransformation()
If you want to include variables with negative values too, replace varList as below:
varList<- names(ds)
I have a tibble/dataframe with
sample_id condition state
---------------------------------
sample1 case val1
sample1 case val2
sample1 case val3
sample2 control val1
sample2 control val2
sample2 control val3
The dataframe is generated within a for loop for different states. Hence, every dataframe has a different name for the state column.
I want to group the data by sample_id and calculate the median of the state column such that every unique sample_id has a single median value. The output should be like below...
sample_id condition state
---------------------------------
sample1 case median
sample2 control median
I am trying the command below; it is working if give the name of the column, but I am not able to pass the name via the state character variable. I tried ensym(state) and !!ensym(state), but they all are throwing errors.
ddply(dat_state, .(sample_id), summarize, condition=unique(condition), state_exp=median(ensym(state)))
As camille notes above, this is easier in dplyr. Basic syntax (not yet addressing your question):
my_df %>%
group_by(sample_id, condition) %>%
summarize(state = median(state))
Note that syntax will give you values for every unique sample_id-condition pair. Which isn't an issue in your example, since every sample_id has the same condition, but just something to be aware of.
On to your question... It's not quite clear to me how you're planning to pass the state name to your calculation. But a couple ways you can handle this. One is to use dplyr's "rename" function:
x <- "Massachusetts"
my_df %>%
rename(state = all_of(x)) %>% # all_of() marks x as a variable holding a column name
group_by(sample_id, condition) %>%
summarize(state = median(state))
The (probably more proper) way to do this is to write a function using dplyr's "tidyeval" syntax:
myfunc <- function(df, state_name) {
df %>%
group_by(sample_id, condition) %>%
summarize(state = median({{state_name}}))
}
myfunc(my_df, Massachusetts) # Note: Unquoted state name
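If the column name is only available as a character string, as it is when looping over state names, one possible variant is to look the column up with the .data pronoun. This is a sketch under assumptions: the helper name median_by_sample, the toy data and the "STAT1" column are made up for illustration.

```r
library(dplyr)

# Toy data; "STAT1" stands in for one of the state columns
my_df <- data.frame(sample_id = rep(c("sample1", "sample2"), each = 3),
                    condition = rep(c("case", "control"), each = 3),
                    STAT1     = c(1, 2, 6, 10, 20, 60))

# median_by_sample is a hypothetical helper; state_col is a string
median_by_sample <- function(df, state_col) {
  df %>%
    group_by(sample_id, condition) %>%
    summarize(med = median(.data[[state_col]]), .groups = "drop")
}

state <- "STAT1"
res <- median_by_sample(my_df, state)
res
```

Here .data[[state_col]] reads the column whose name is stored in state_col, so the same helper can be called inside a loop over several state names.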
Thank you all for putting effort into answering my question. With your suggestions, I have found the solution. Below is the code to what I was trying to achieve by grouping sample_id and condition and passing state through a variable.
state_mark <- c("pPCLg2", "STAT1", "STAT5", "AKT")
for(state in state_mark){
dat_state <- dat_clust_stim[,c("sample_id", "condition", state)]
# I had to use !!ensym() to convert a character to a symbol.
dat_med <- group_by(dat_state, sample_id, condition) %>%
summarise(med = median(!!ensym(state)))
dat_med <- ungroup(dat_med)
x <- dat_med[dat_med$condition == "case", "med"]
y <- dat_med[dat_med$condition == "control", "med"]
t_test <- t.test(x$med, y$med)
}
If you want to stay old-fashioned, you can use the eval(parse(text=expression)) idiom:
ddply(dat_state, .(sample_id), summarize,
state_exp = eval(parse(text = paste("median(",state,")"))))
No fancy operators but mind the parentheses!
Starting point:
I have a dataset (tibble) which contains a lot of Variables of the same class (dbl). They belong to different settings. A variable (column in the tibble) is missing. This is the rowSum of all variables belonging to one setting.
Aim:
My aim is to produce sub data sets with the same data structure for each setting, including the "rowSum" variable (I call it "s1").
Problem:
In each setting there are a different number of variables (and of course they are named differently).
Because it should be the same structure with different variables it is a typical situation for a function.
Question:
How can I solve the problem using dplyr?
I wrote a function to
(1) subset the original dataset for the setting of interest (this works) and
(2) try to rowSums the variables of the setting (does not work; Why?).
Because it is a function for a special designed dataset, the function includes two predefined variables:
day - which is any day of an investigation period
N - which is the Number of cases investigated on this special day
Thank you for any help.
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
#print(subvars)
# Summarize the variables belonging to the setting of interest
dfplot <- dataset %>%
dplyr::select(day,N,!!! subvars) %>%
dplyr::mutate(s1 = rowSums(!!! subvars,na.rm = TRUE))
return(dfplot)
}
We can convert the quosures to strings with as_name and subset the dataset with [ for the rowSums
library(rlang)
library(purrr)
library(dplyr)
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
v1 <- map_chr(subvars, as_name)
#print(subvars)
# Summarize the variables belonging to the setting of interest
dfplot <- dataset %>%
dplyr::select(day, N, !!! subvars) %>%
dplyr::mutate(s1 = rowSums( .[v1],na.rm = TRUE))
return(dfplot)
}
out <- mkr.sumsetting(col1, col2, dataset = df1)
head(out, 3)
# day N col1 col2 s1
#1 1 20 -0.5458808 0.4703824 -0.07549832
#2 2 20 0.5365853 0.3756872 0.91227249
#3 3 20 0.4196231 0.2725374 0.69216051
Or another option would be select the quosure and then do the rowSums
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
#print(subvars)
# Summarize the variables belonging to the setting of interest
dfplot <- dataset %>%
dplyr::select(day, N, !!! subvars) %>%
dplyr::mutate(s1 = dplyr::select(., !!! subvars) %>%
rowSums(na.rm = TRUE))
return(dfplot)
}
mkr.sumsetting(col1, col2, dataset = df1)
data
set.seed(24)
df1 <- data.frame(day = 1:20, N = 20, col1 = rnorm(20),
col2 = runif(20))
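With dplyr 1.0.0 or later, the row sum can probably also be written with across(), which returns the selected columns as a data frame that rowSums() accepts. This is a sketch only: sum_setting is a made-up name, and the argument order differs from the original function (dataset first, then the unquoted columns).

```r
library(dplyr)

set.seed(24)
df1 <- data.frame(day = 1:20, N = 20, col1 = rnorm(20),
                  col2 = runif(20))

# sum_setting is a hypothetical name; ... takes unquoted column names
sum_setting <- function(dataset, ...) {
  dataset %>%
    select(day, N, ...) %>%
    # across(c(...)) picks the same columns again as a data frame
    mutate(s1 = rowSums(across(c(...)), na.rm = TRUE))
}

out <- sum_setting(df1, col1, col2)
head(out, 3)
```

Passing the dots straight through to select() and across() avoids the enquos() and !!! steps, since both functions already understand unquoted column names.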
I am trying to apply a custom function across all groups of my dataset.
I tried applying the custom function below across all groups, but it is being applied to the overall data set. If I break my data into multiple groups and pass each group to the function, it works well.
You almost had it. You just need to apply your function for each group. The purrr library makes this pretty easy.
Libraries:
library(purrr)
library(dplyr)
Main function:
Takes group name and full data set as arguments, then filters by that group. Then makes calculations and returns them as a dataframe.
width <- function(Group.Name, data){
# limit to rows for that group
df<-data %>% filter(Group == Group.Name)
i.mins <- which(diff(sign(diff(c(Inf, df$Value, Inf)))) == 2)
i.mx <- which.max(df$Value)
i <- sort(c(i.mins, i.mx))
ix <- i[which(i == i.mx) + c(-1, 1)]
# put results in dataframe
df <- data.frame("Group" = Group.Name, "Value_1" = ix[1], "Value_2" = ix[2])
# format Group Col
df$Group <- as.character(df$Group)
return(df)
}
Looping through groups with purrr
# unique group names we need to loop through for calcs
Group.Names <- unique(data$Group)
# take each name and use the width function
# then put results together in one data frame
Group.Names %>% map_df(~ width(Group.Name = .x, data = data))
Results:
Group Value_1 Value_2
1 Group1 16 22
2 Group2 4 12
3 Group3 2 15
Note: the .x notation tells map to pass each element of Group.Names, in turn, as the Group.Name argument of our width function
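As an alternative sketch that stays inside dplyr, group_by() followed by group_modify() calls a function once per group and binds the results. The example below uses a deliberately simpler per-group function than width: peak_position is a hypothetical helper and the data values are made up; only the Group and Value column names follow the answer above.

```r
library(dplyr)

# Toy data; the values are invented for illustration
data <- data.frame(Group = rep(c("Group1", "Group2"), each = 4),
                   Value = c(1, 5, 3, 2,   4, 1, 2, 9))

# peak_position is a hypothetical per-group function: it receives the rows
# of one group (and the group key) and must return a data frame
peak_position <- function(rows, key) {
  data.frame(Peak_Index = which.max(rows$Value),
             Peak_Value = max(rows$Value))
}

peaks <- data %>%
  group_by(Group) %>%
  group_modify(peak_position) %>%
  ungroup()

peaks
```

The indices are positions within each group, not within the full data set, which is the behaviour the question was after.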