Apply custom function to multiple groups in R

I am trying to apply a custom function across all groups of my dataset.
I tried applying the custom function below across all groups, but it is being applied to the overall data set. However, if I break my data into multiple groups and pass each group to the function separately, it works well.

You almost had it. You just need to apply your function for each group. The purrr library makes this pretty easy.
Libraries:
library(purrr)
library(dplyr)
Main function:
It takes a group name and the full data set as arguments, filters to that group, makes the calculations, and returns them as a dataframe.
width <- function(Group.Name, data){
  # limit to rows for that group
  df <- data %>% filter(Group == Group.Name)
  # indices of the local minima and of the global maximum of Value
  i.mins <- which(diff(sign(diff(c(Inf, df$Value, Inf)))) == 2)
  i.mx <- which.max(df$Value)
  i <- sort(c(i.mins, i.mx))
  # the local minima immediately before and after the maximum
  ix <- i[which(i == i.mx) + c(-1, 1)]
  # put results in a dataframe
  df <- data.frame("Group" = Group.Name, "Value_1" = ix[1], "Value_2" = ix[2])
  # format Group col as character
  df$Group <- as.character(df$Group)
  return(df)
}
Looping through groups with purrr
# unique group names we need to loop through for calcs
Group.Names <- unique(data$Group)
# take each name and use the width function
# then put results together in one dataframe
Group.Names %>% map_df(~ width(Group.Name = .x, data = data))
Results:
Group Value_1 Value_2
1 Group1 16 22
2 Group2 4 12
3 Group3 2 15
Note: the .x notation just tells map to pass each element of Group.Names as the first argument to our width function.
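If the formula shorthand is unfamiliar, it is equivalent to an explicit anonymous function; a small sketch using the same names as above:
# same as the ~ .x shorthand: each group name is passed to width()
Group.Names %>% map_df(function(g) width(Group.Name = g, data = data))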

Related

Creating a loop in R for a function

I would like to create a for loop to repeat the same function for 150 variables. I am new to R and I am a bit stuck.
To give you an example of some commands I need to repeat:
N <- table(df$var1 == 0)["TRUE"]
n <- table(df$var1 == 1)["TRUE"]
PREV95 <- (svyciprop(~var1 == 1, level = 0.95, design = design, deff = "replace")*100)
I need to run the same functions for 150 columns. I know that I need to put all my columns in one vector (x), but then I don't know how to write the loop to repeat the same command for all my variables.
Can anyone help me to write a loop?
A word in advance: loops in R can in most cases be replaced with a faster, more R-ish way (various flavours of apply, mapping, walking ...).
Applying a function to the columns of a dataframe:
a)
with base R, example dataset cars
my_function <- function(xs) max(xs)
lapply(cars, my_function)
b)
tidyverse-style:
cars %>%
  summarise_all(my_function)
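Note that summarise_all() has since been superseded; a hedged equivalent in recent dplyr (1.0 or later) uses across():
# assumes dplyr >= 1.0, where across() replaces the summarise_all() family
cars %>%
  summarise(across(everything(), my_function))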
An anecdotal example: I came across an R-script which took about half an hour to complete and made abundant use of for-loops. Replacing the loops with vectorized functions and members of the apply family cut the execution time down to about 3 minutes. So while for-loops and related constructs might be more familiar when coming from another language, they might soon get in your way with R.
This chapter of Hadley Wickham's R for Data Science gives an introduction to iterating "the R way".
Here is an approach that doesn't use loops. I've created a data set called df with three factor variables to represent your dataset as you described it.

The function eval() does all the work. First, it keeps only the factor columns. It then converts the factors to numeric so the values can be summed as 0 and 1; summing the factors directly would be based on their underlying codes 1 and 2. Within the function, a helper neg() gives the number of negative (0) values by subtracting the sum of the 1s from the length of the vector. The function then builds the dataframes "n" (count of the 1s), "N" (count of the 0s), and PREV95, using pivot_longer to get the data into long format so that each statistic ends up in its own column when merged. Note that I had to comment out PREV95 because I do not have a 'design' object to pass as a parameter; remove the hashes to add it back in. I think the logic for PREV95 should work, but I cannot check it without a 'design' object.

Finally, left_join combines these dataframes and returns "results"; again, the version that also joins PREV95 is commented out. eval() takes your original dataframe as input and returns a dataframe rather than a list, which you'll likely find easier to work with.
library(dplyr)
library(tidyr)
set.seed(100)
df <- data.frame(Var1 = factor(sample(c(0,1), 10, TRUE)),
                 Var2 = factor(sample(c(0,1), 10, TRUE)),
                 Var3 = factor(sample(c(0,1), 10, TRUE)))
# note: naming this function eval() masks base::eval() for the session
eval <- function(df){
  # keep only the factor columns and convert them to numeric 0/1
  df1 <- df %>%
    select_if(is.factor) %>%
    mutate_all(function(x) as.numeric(as.character(x)))
  # number of 0s = length of the vector minus the sum of the 1s
  neg <- function(x){
    length(x) - sum(x)
  }
  n <- df1 %>%
    summarize(across(where(is.numeric), sum)) %>%
    pivot_longer(everything(), names_to = "Var", values_to = "n")
  N <- df1 %>%
    summarize(across(where(is.numeric), function(x) neg(x))) %>%
    pivot_longer(everything(), names_to = "Var", values_to = "N")
  #PREV95 <- df1 %>%
  #  summarize(across(where(is.numeric), function(x) survey::svyciprop(~x == 1, design = design, level = 0.95, deff = "replace")*100)) %>%
  #  pivot_longer(everything(), names_to = "Var", values_to = "PREV95")
  results <- n %>%
    left_join(N, by = "Var")
  #results <- n %>%
  #  left_join(N, by = "Var") %>%
  #  left_join(PREV95, by = "Var")
  return(results)
}
eval(df)
Var n N
<chr> <dbl> <dbl>
1 Var1 2 8
2 Var2 5 5
3 Var3 4 6
If you really want to use a for loop, here is how to make it work. Again, I've left out the survey function due to a lack of info on its parameters.
set.seed(100)
df <- data.frame(Var1 = factor(sample(c(0,1), 10, TRUE)),
                 Var2 = factor(sample(c(0,1), 10, TRUE)),
                 Var3 = factor(sample(c(0,1), 10, TRUE)))
VarList <- names(df %>% select_if(is.factor))
results <- list()
for (var in VarList){
results[[var]][["n"]] <- sum(df[[var]] == 1)
results[[var]][["N"]] <- sum(df[[var]] == 0)
}
unlist(results)
Var1.n Var1.N Var2.n Var2.N Var3.n Var3.N
2 8 5 5 4 6
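If you'd rather end up with a dataframe than a named vector, one hedged option (assuming the results list built above) is to bind the per-variable entries row-wise:
# turn the per-variable lists into rows of a dataframe
do.call(rbind, lapply(names(results), function(v)
  data.frame(Var = v, n = results[[v]][["n"]], N = results[[v]][["N"]])))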

How to sum up a list of variables in a customized dplyr function?

Starting point:
I have a dataset (tibble) which contains a lot of variables of the same class (dbl). They belong to different settings. One variable (column in the tibble) is missing: the row sum of all variables belonging to one setting.
Aim:
My aim is to produce sub data sets with the same data structure for each setting, including the "rowSum" variable (I call it "s1").
Problem:
Each setting contains a different number of variables (and of course they are named differently).
Because it should be the same structure with different variables, this is a typical use case for a function.
Question:
How can I solve the problem using dplyr?
I wrote a function to
(1) subset the original dataset for the interesting setting (this works), and
(2) rowSums the variables of the setting (this does not work; why?).
Because it is a function for a specially designed dataset, the function includes two predefined variables:
day - which is any day of an investigation period
N - which is the Number of cases investigated on this special day
Thank you for any help.
mkr.sumsetting <- function(..., dataset){
  subvars <- rlang::enquos(...)
  #print(subvars)
  # Summarize the variables belonging to the interesting setting
  dfplot <- dataset %>%
    dplyr::select(day, N, !!!subvars) %>%
    dplyr::mutate(s1 = rowSums(!!!subvars, na.rm = TRUE))
  return(dfplot)
}
We can convert the quosures to strings with as_name and use those names to subset the dataset for the rowSums.
library(rlang)
library(purrr)
library(dplyr)
mkr.sumsetting <- function(..., dataset){
  subvars <- rlang::enquos(...)
  # column names of the selected variables as a character vector
  v1 <- map_chr(subvars, as_name)
  #print(subvars)
  # Summarize the variables belonging to the interesting setting
  dfplot <- dataset %>%
    dplyr::select(day, N, !!!subvars) %>%
    dplyr::mutate(s1 = rowSums(.[v1], na.rm = TRUE))
  return(dfplot)
}
out <- mkr.sumsetting(col1, col2, dataset = df1)
head(out, 3)
# day N col1 col2 s1
#1 1 20 -0.5458808 0.4703824 -0.07549832
#2 2 20 0.5365853 0.3756872 0.91227249
#3 3 20 0.4196231 0.2725374 0.69216051
Or another option would be to select the quosures and then do the rowSums:
mkr.sumsetting <- function(..., dataset){
  subvars <- rlang::enquos(...)
  #print(subvars)
  # Summarize the variables belonging to the interesting setting
  dfplot <- dataset %>%
    dplyr::select(day, N, !!!subvars) %>%
    dplyr::mutate(s1 = dplyr::select(., !!!subvars) %>%
                    rowSums(na.rm = TRUE))
  return(dfplot)
}
mkr.sumsetting(col1, col2, dataset = df1)
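For completeness, a sketch of how the same idea might look in more recent dplyr (1.0 or later), where across() can be used inside mutate(); this variant is mine, not part of the original answer:
# assumes dplyr >= 1.0; the dots are forwarded to select() and across()
mkr.sumsetting2 <- function(..., dataset){
  dataset %>%
    dplyr::select(day, N, ...) %>%
    dplyr::mutate(s1 = rowSums(dplyr::across(c(...)), na.rm = TRUE))
}
mkr.sumsetting2(col1, col2, dataset = df1)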
data
set.seed(24)
df1 <- data.frame(day = 1:20, N = 20, col1 = rnorm(20),
col2 = runif(20))

Using variables as arguments in summarize()

I wish to pass user input variables to group_by() and summarize() functions.
The direct example of the data frame and code is below. Here I am 'hard-coding' the column names.
library(dplyr)
df <- data.frame('Category' = c('a','c','a','a','b','a','b','b'),
'Amt' = c(100,300,200,400,500,1000,350,250),
'Flag' = c(0,1,1,1,0,1,1,0))
rowCount <- nrow(df)
totalAmt <- sum(df$Amt)
g <- group_by(df, Category)
summ <- summarize(g, Count = n(), CountPercentage = n()*100/rowCount, TotalAmt = sum(Amt), AmtPercentage = sum(Amt)*100/totalAmt, FlagSum = sum(Flag))
summ
The output is below
In the application I am developing, the dataframe and hence the column names will be user-defined. I will be reading the .csv file name, the column(s) to be grouped on and the columns to be summarized on from an Excel file.
I have searched far and wide and after spending much time reading and experimenting, I found the solution as shown below which worked for me. I have not used piping to make the steps clearer.
#The data frame df is read from the .csv file name
#Variables read from the Excel file
groupby <- 'Category'
sumBy1 <- 'Amt'
sumBy2 <- 'Flag'
rowCount <- nrow(df)
totalAmt <- sum(df[sumBy1])
g <- group_by_(df, groupby) #group by variable #grouping
summcount <- summarize(g, Count = n(), CountPercentage = n()*100/rowCount) #summarize counts #piece 1
summamt <- summarize_at(g, .vars = sumBy1, .funs=sum) #summarize by first variable
summamt <- summamt[-1] #remove first column to remove duplicate column
summamt$AmtPercentage <- summamt[sumBy1]*100/totalAmt #piece 2
summflag <- summarize_at(g, .vars = sumBy2, .funs=sum) #summarize by second variable
summflag <- summflag[-1] #remove first column to remove duplicate column #piece 3
summ <- cbind(summcount, summamt, summflag) #combine dataframes
summ
The result is the same as above. As you can see, I am creating the final dataframe piecemeal and then binding the pieces. The code is ugly. Also, how do I define the column headers in this syntax? I did consider summarize_all() but that requires creating a subset of the data frame. I have already read the following questions and they did not work for me:
Passing arguments to dplyr summarize function
Summarizing data in table by group for each variable in r
Can you recommend a simpler and more elegant way to do this?
Above I have 'hardcoded' two types of summarization, viz. count and sum. To add another level of complication, what if the user wants to also define the type of summarization (viz. sum, mean, count, etc.) required? In the Excel file, I can capture the type of summarization needed against each variable.
Thanks for any suggestions.
That sounds like a job for Superman! Or at least quasi-quotations.
You want to insert variables using the bang-bang operator, !!.
You can do it like this
# Make a variable symbol from strings
make_var <- function(prefix, var, suffix)
  as.symbol(paste0(prefix, var, suffix))

calc_summary <- function(df, groupby, sumBy1, sumBy2) {
  totalSumBy1 <- make_var("Total", sumBy1, "")
  sumBy1Percentage <- make_var("", sumBy1, "Percentage")
  sumBy1 <- make_var("", sumBy1, "")
  sumBy2Sum <- make_var("", sumBy2, "Sum")
  sumBy2 <- make_var("", sumBy2, "")
  group_by_(df, groupby) %>%
    summarize(Count = n(),
              CountPercentage = n()*100/rowCount, # rowCount is taken from the calling environment
              !!totalSumBy1 := sum(!!sumBy1),
              !!sumBy2Sum := sum(!!sumBy2)) %>%
    mutate(CountPercentage = Count/sum(Count),
           !!sumBy1Percentage := 100 * !!totalSumBy1 / sum(!!totalSumBy1))
}
When you use !! you are inserting the value of a variable, so this is how you can parameterise expressions given to dplyr functions. You need them as symbols, which is why I use the make_var function. It can be done more elegantly, but this will give you the variables you used in your example.
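To see what make_var produces, a quick interactive check (using the same names as in the example above):
# make_var pastes the pieces together and returns a symbol (class "name"), not a string
make_var("Total", "Amt", "")
# TotalAmt
class(make_var("Total", "Amt", ""))
# [1] "name"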
Notice that when the variables we assign to are dynamic we must use the := assignment instead of =. Otherwise, the parser complains.
You can use this function as such:
> df %>% calc_summary("Category", "Amt", "Flag")
# A tibble: 3 x 6
Category Count CountPercentage TotalAmt FlagSum AmtPercentage
<fct> <int> <dbl> <dbl> <dbl> <dbl>
1 a 4 0.500 1700. 3. 54.8
2 b 3 0.375 1100. 1. 35.5
3 c 1 0.125 300. 1. 9.68
The order of columns is not the same as in your example, but you can fix that using select. I cleaned up the percentage calculations a bit by moving those to a mutate after the summary. It removes the need for the rowCount variable. If you prefer, you can easily use that variable and avoid the mutate call. Then you can also get the columns in the order you want in the summarise call.
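On the column-order point, a hedged one-liner (column names taken from the output above):
# reorder the summary columns to match the original example
df %>%
  calc_summary("Category", "Amt", "Flag") %>%
  select(Category, Count, CountPercentage, TotalAmt, AmtPercentage, FlagSum)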
Anyway, the important point is that you want the bang-bang operator for what you are doing here.

Using magrittr and lapply to divide a column in each df in a list by a list of values

I have a list of dataframes containing different time series of different lengths. I want to summarize the count of a variable and then normalize it by the number of years of data that is contained in that particular dataset.
so with a sample dataframe:
data_list <- list(data.frame(temp_bin = rep(1:4, 2:5), value = runif(14)),
data.frame(temp_bin = rep(1:4, 3:6), value = runif(18)),
data.frame(temp_bin = rep(1:4, 4:7), value = runif(22)))
# this might be ~10 different data sets with ~ 100k observations each
count <- lapply(data_list, function(x) {nrow(x)/5} )
# for real data this would be divided by 8760 for the # of hours in a year.
Here is approximately what I want to do, but the n()/count doesn't work because count is a list.
data_bin <- data_list %>%
lapply(., group_by, temp_bin) %>%
lapply(., summarise, n = n()/count)
I tried doing an lapply or mapply within the definition of n, but that didn't seem to work. also tried doing it in two steps - create get a raw n value and then divide in the next step with mapply, but that didn't work either.
If you move the count step into your data_bin pipeline, I think it accomplishes what you want, though I am a little hazy on exactly what you mean. (Note that you can remove the . from the first argument of lapply; passing the piped object as the first argument is the default behavior of %>%.)
data_bin <- data_list %>%
  lapply(group_by, temp_bin) %>%
  # we need access to the whole data frame x, so summarize is wrapped in an anonymous function
  lapply(function(x){summarize(x, n = 5*n()/nrow(x))}) # move the 5 to the numerator
data_bin[[1]]
Source: local data frame [4 x 2]
temp_bin n
1 1 0.7142857
2 2 1.0714286
3 3 1.4285714
4 4 1.7857143
Is this what you wanted? You can double check that the summarize part is doing what you want by just returning the nrow(x) result.
data_bin <- data_list %>%
  lapply(group_by, temp_bin) %>%
  lapply(function(x){summarize(x, n = nrow(x))})
data_bin[[1]]
Source: local data frame [4 x 2]
temp_bin n
1 1 14
2 2 14
3 3 14
4 4 14
I would try to avoid calling lapply on every line of a dplyr pipeline. You could wrap the individual data.frame transformation in a function and then lapply that function to data_list:
library(dplyr)
ret_db <- function(df) {
  db <- df %>%
    group_by(temp_bin) %>%
    summarise(n = n()/(nrow(df)/5))
  return(db)
}
data_bin <- lapply(data_list, ret_db)
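If you do want to keep the precomputed count list from the question, a hedged sketch of the two-step idea using Map, pairing each data frame with its own divisor (assumes data_list and count as defined above):
# each data frame is matched with its own count value
data_bin <- Map(function(df, ct) {
  df %>%
    group_by(temp_bin) %>%
    summarise(n = n()/ct)
}, data_list, count)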

Iteratively create columns based on grouped variables

I've got some data (below) where I want to iteratively add columns based on sums of current columns by some grouping variable, and I want to name the columns a pasted value of the current name + "_tot". I'm thinking a combination of dplyr and lapply is the way to go about it but I can't get the structure correct.
set.seed(1234)
data <- data.frame(
biz = sample(c("telco","shipping","tech"), 50, replace = TRUE),
region = sample(c("mideast","americas"), 50, replace = TRUE),
june = sample(1:50, 50, replace=TRUE),
july = sample(100:150, 50, replace=TRUE)
)
So, what I want to do is group this data by "region", then add a new column for each month that is the sum of that month's values (in the real dataframe, there are many more periods that follow).
Basically, I want to apply this function
library(dplyr)
data %>% group_by(region) %>% mutate(june_tot = sum(june))
across every month, without having to specify "june" or "july". My initial take:
testfun <- function(df, col) {
  name <- paste(col, "_tot", sep="")
  data2 <- df %>% group_by(region) %>% summarise(name = sum(col))
  return(data2)
}
but lapplying this doesn't work, because I have to specify the columns to call into the initial function. Just removing the "col" argument from the initial function doesn't work either, of course.
Any ideas how to lapply this sort of argument?
Here are possible solutions to your problem using dplyr (first, since that is what you tried), followed by data.table and base R solutions:
dplyr:
cols <- lapply(names(data)[-(1:2)], as.name)
names(cols) <- paste0(names(data)[-(1:2)], "_tot")
data %>% group_by(region) %>% mutate_each_q(funs(sum), cols)
Assumes every column but the first two contains monthly data. An explanation by line:
1. we use as.name and lapply to generate a list of the column names we want to mutate, as symbols
2. we give the new names we want (i.e. month_tot) to the list of symbols from 1.
3. we use mutate_each_q (known as mutate_each_ in dplyr 0.3.0.2) to apply sum to the list of expressions created in 1. and 2.
This is the (sample) result:
Source: local data frame [50 x 6]
Groups: region
biz region june july june_tot july_tot
1 shipping mideast 17 124 780 3339
2 telco americas 11 101 465 2901
3 telco mideast 27 131 780 3339
4 tech americas 24 135 465 2901
... rows omitted
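Since mutate_each_q/mutate_each_ have long since been retired, here is a hedged equivalent for current dplyr (1.0 or later) using across() with a .names specification; this is not part of the original answer:
# assumes dplyr >= 1.0; sums every numeric column per region into a *_tot column
data %>%
  group_by(region) %>%
  mutate(across(where(is.numeric), sum, .names = "{.col}_tot")) %>%
  ungroup()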
data.table:
library(data.table)
new.names <- paste0(tail(names(data), 2L), "_tot")  # make the new column names
data.table(data)[,
  (new.names) := lapply(.SD, sum),  # `lapply` `sum` over the selected columns (those in .SD) and assign to the `new.names` columns
  by = region, .SDcols = -1         # group by `region`; exclude the first column from `.SD` (`region` is excluded as well by virtue of being in `by`)
][]                                 # extra `[]` just to force printing
Here, similar logic, except we use the special .SD object that represents every column in the data.table that we are not grouping by.
base:
do.call(
  cbind,
  list(
    data,
    setNames(
      lapply(data[-(1:2)], function(x) ave(x, data$region, FUN = sum)),
      paste0(names(data[-(1:2)]), "_tot")
    )
  )
)
Here we use ave to compute the per region sums, use lapply to apply ave to each column, and use do.call(cbind, ...) to reconstruct the final data frame.
Try:
> for(i in 3:4) print(tapply(data[[i]], data$region, sum))
americas mideast
563 768
americas mideast
2538 3802
You can get all outputs in a list if you want.
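For example, a small sketch collecting the per-region sums for each month column in a named list (same data as above):
# per-region sums for each month column, returned as a named list
lapply(data[3:4], function(x) tapply(x, data$region, sum))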
Restructuring the data works well for this.
require(tidyr)
# wide to long
d2 <- gather(data = data, key = month, value = monthval, -c(biz, region))
# get totals and rename month
month_tots <- aggregate(x = list(total = d2$monthval),
                        by = list(region = d2$region, month = d2$month), sum)
month_tots$month <- paste0(month_tots$month, '_tot')
# long to wide
month_tots <- spread(data = month_tots, key = month, value = total)
# recombine
merge(data, month_tots, by = 'region', all.x = T)
