Iteratively create columns based on grouped variables - r

I've got some data (below) where I want to iteratively add columns based on sums of current columns by some grouping variable, and I want to name the columns a pasted value of the current name + "_tot". I'm thinking a combination of dplyr and lapply is the way to go about it but I can't get the structure correct.
set.seed(1234)
data <- data.frame(
biz = sample(c("telco","shipping","tech"), 50, replace = TRUE),
region = sample(c("mideast","americas"), 50, replace = TRUE),
june = sample(1:50, 50, replace=TRUE),
july = sample(100:150, 50, replace=TRUE)
)
So, what I want to do is group this data by "region", then add a new column for each month that holds the per-region sum of that month's values (in the real dataframe, there are many more periods).
Basically, I want to apply this function
library(dplyr)
data %>% group_by(region) %>% mutate(june_tot = sum(june))
across every month, without having to specify "june" or "july". My initial take:
testfun <- function(df, col) {
name <- paste(col, "_tot", sep="")
data2 <- df %>% group_by(region) %>% summarise(name=sum(col))
return(data2)
}
but lapplying this doesn't work, because I have to specify the columns to call into the initial function. Just removing the "col" argument from the initial function doesn't work either, of course.
Any ideas how to lapply this sort of argument?

Here are possible solutions to your problem: dplyr first (since that is what you tried), followed by data.table and base R:
dplyr:
cols <- lapply(names(data)[-(1:2)], as.name)
names(cols) <- paste0(names(data)[-(1:2)], "_tot")
data %>% group_by(region) %>% mutate_each_q(funs(sum), cols)
This assumes every column except the first two holds monthly data. An explanation by line:
1. We use as.name and lapply to generate a list of the column names we want to mutate, as symbols.
2. We give the new names we want (i.e. month_tot) to the list of symbols from 1.
3. We use mutate_each_q (known as mutate_each_ in dplyr 0.3.0.2) to apply sum to the list of expressions we created in 1. and 2.
This is the (sample) result:
Source: local data frame [50 x 6]
Groups: region
biz region june july june_tot july_tot
1 shipping mideast 17 124 780 3339
2 telco americas 11 101 465 2901
3 telco mideast 27 131 780 3339
4 tech americas 24 135 465 2901
... rows omitted
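With more recent dplyr releases (1.0.0 and later, which is an assumption about your installed version), the same grouped totals can be sketched with across() and its .names argument. The name month_cols is made up for illustration, and the sketch keeps the assumption that every column after the first two is a month:

```r
library(dplyr)

set.seed(1234)
data <- data.frame(
  biz    = sample(c("telco", "shipping", "tech"), 50, replace = TRUE),
  region = sample(c("mideast", "americas"), 50, replace = TRUE),
  june   = sample(1:50, 50, replace = TRUE),
  july   = sample(100:150, 50, replace = TRUE)
)

# Assumption: every column after the first two holds monthly data
month_cols <- names(data)[-(1:2)]

result <- data %>%
  group_by(region) %>%
  # one new <month>_tot column per month, holding the per-region sum
  mutate(across(all_of(month_cols), sum, .names = "{.col}_tot")) %>%
  ungroup()

head(result)
```

The "{.col}_tot" template is what appends the suffix, so no list of symbols has to be built by hand.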
data.table:
library(data.table)
new.names <- paste0(tail(names(data), 2L), "_tot") # Make new names
data.table(data)[,
  (new.names) := lapply(.SD, sum), # `lapply` `sum` to the selected columns (those in .SD), and assign to `new.names` columns
  by = region, .SDcols = -1 # group by `region`, and exclude first column from `.SD` (note `region` is excluded as well by reason of being in `by`)
][] # extra `[]` just to force printing
Here the logic is similar, except we use the special .SD object, which holds the columns selected by .SDcols (by default, every column in the data.table that we are not grouping by).
base:
do.call(
cbind,
list(
data,
setNames(
lapply(data[-(1:2)], function(x) ave(x, data$region, FUN=sum)),
paste0(names(data[-(1:2)]), "_tot")
) ) )
Here we use ave to compute the per region sums, use lapply to apply ave to each column, and use do.call(cbind, ...) to reconstruct the final data frame.

Try:
> for(i in 3:4) print(tapply(data[[i]], data$region, sum))
americas mideast
563 768
americas mideast
2538 3802
You can get all outputs in a list if you want.
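To collect those per-region sums in a list instead of printing them, one option is to lapply over the month columns. This is only a sketch; month_cols and totals are names made up for illustration, and it assumes the months are columns 3 and 4 as in the loop above:

```r
set.seed(1234)
data <- data.frame(
  biz    = sample(c("telco", "shipping", "tech"), 50, replace = TRUE),
  region = sample(c("mideast", "americas"), 50, replace = TRUE),
  june   = sample(1:50, 50, replace = TRUE),
  july   = sample(100:150, 50, replace = TRUE)
)

# Assumption: the monthly data sit in columns 3 and 4
month_cols <- names(data)[3:4]

# One list element per month; each element is a named vector of per-region sums
totals <- lapply(data[month_cols], function(x) tapply(x, data$region, sum))

totals$june
```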

Restructuring the data works well for this.
require(tidyr)
# wide to long
d2 <- gather(data = data,key = month,value = monthval,-c(biz,region))
# get totals and rename month
month_tots <- aggregate(x = list(total = d2$monthval),by = list(region = d2$region,month = d2$month),sum)
month_tots$month <- paste0(month_tots$month,'_tot')
# long to wide
month_tots <- spread(data = month_tots,key = month,value = total)
# recombine
merge(data,month_tots,by = 'region',all.x = TRUE)

Related

Rearrange dataframe to fit longitudinal model in R

I have a dataframe where each entry relates to a job posting in the NHS specifying the week the job was posted, and what NHS Trust (and region) the job is in.
At the moment my dataframe looks something like this:
set.seed(1)
df1 <- data.frame(
NHS_Trust = sample(1:30,20,T),
Week = sample(1:10,20,T),
Region = sample(1:15,20,T))
And I would like to count the number of jobs for each week across each NHS Trust and assign that value to a new column 'jobs' so my dataframe looks like this:
set.seed(1)
df2 <- data.frame(
NHS_Trust = rep(1:30, each=10),
Week = rep(seq(1,10),30),
Region = rep(as.integer(runif(30,1,15)),1,each = 10),
Jobs = rpois(10*30, lambda = 2))
The dataframe may then be used to create a Poisson longitudinal multilevel model where I may model the number of jobs.
Using the data.table package you can group, count, and assign to a new column in a single expression. The general syntax for data.tables is dt[i, j, by]. Here i selects (or orders) rows; it is empty in this case, so all rows are used in their original order. j says what is to be done: here we count the rows in each group with .N and assign the result to the new column count using the assignment operator :=. by takes a list of variables, and the j operation is performed once per group.
library(data.table)
setDT(df1)
df1[, count := .N, by = .(NHS_Trust, Week, Region)]
A tidyverse approach would be
library(tidyverse)
df1 <- df1 %>%
group_by(NHS_Trust, Week, Region) %>%
count()
You can use count to count number of jobs across each Region, NHS_Trust and Week and use complete to fill in missing combinations.
library(dplyr)
df1 %>%
count(Region, NHS_Trust, Week, name = 'Jobs') %>%
tidyr::complete(Region, Week = 1:10, fill = list(Jobs = 0))
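A base R sketch of the same idea, in case it helps to see the steps spelled out: count the postings per observed combination, build a grid with every week for each observed trust/region pair, and fill the gaps with 0. Note that complete(Region, Week = 1:10) expands Region by Week only; the grid below keeps NHS_Trust in the expansion, which is my reading of the desired output (an assumption). The names counts, trusts, grid and full are made up for illustration:

```r
set.seed(1)
df1 <- data.frame(
  NHS_Trust = sample(1:30, 20, TRUE),
  Week      = sample(1:10, 20, TRUE),
  Region    = sample(1:15, 20, TRUE)
)

# Count postings per observed Trust/Week/Region combination
counts <- aggregate(Jobs ~ NHS_Trust + Week + Region,
                    data = transform(df1, Jobs = 1L), FUN = sum)

# Every observed trust/region pair crossed with weeks 1..10
# (merge() on data frames with no shared columns gives a Cartesian product)
trusts <- unique(df1[c("NHS_Trust", "Region")])
grid   <- merge(trusts, data.frame(Week = 1:10))

# Attach the counts and set missing combinations to 0
full <- merge(grid, counts, all.x = TRUE)
full$Jobs[is.na(full$Jobs)] <- 0L

head(full[order(full$NHS_Trust, full$Week), ])
```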
I guess I'm moving my comment to an answer:
df2 <- df1 %>% group_by(Region, NHS_Trust, Week) %>% count(); colnames(df2)[4] <- "Jobs"
df2$combo <- paste0(df2$Region, "_", df2$NHS_Trust, "_", df2$Week)
for (i in 1:length(unique(df2$Region))){
for (j in 1:length(unique(df2$NHS_Trust))){
for (k in 1:length(unique(df2$Week))){
curr_combo <- paste0(unique(df2$Region)[i], "_",
unique(df2$NHS_Trust)[j], "_",
unique(df2$Week)[k])
if(!curr_combo %in% df2$combo){
curdat <- data.frame(unique(df2$Region)[i],
unique(df2$NHS_Trust)[j],
unique(df2$Week)[k],
0,
curr_combo,
stringsAsFactors = FALSE)
#cat(curdat)
names(curdat) <- names(df2)
df2 <- rbind(as.data.frame(df2), curdat)
}
}
}
}
tail(df2)
# Region NHS_Trust Week Jobs combo
# 4495 15 1 4 0 15_1_4
# 4496 15 1 5 0 15_1_5
# 4497 15 1 8 0 15_1_8
# 4498 15 1 3 0 15_1_3
# 4499 15 1 6 0 15_1_6
# 4500 15 1 9 0 15_1_9
The for loop here checks which Region-NHS_Trust-Week combinations are missing from df2 and appends those to df2 with a corresponding Jobs value of 0. The check relies on the new variable combo, which is just the values of those three fields concatenated with underscores.
Edit: I am plenty sure the people here can come up with something more elegant than this.

How to sum up a list of variables in a customized dplyr function?

Starting point:
I have a dataset (tibble) which contains a lot of Variables of the same class (dbl). They belong to different settings. A variable (column in the tibble) is missing. This is the rowSum of all variables belonging to one setting.
Aim:
My aim is to produce sub data sets with the same data structure for each setting including the "rowSum"-Variable (i call it "s1").
Problem:
In each setting there are a different number of variables (and of course they are named differently).
Because it should be the same structure with different variables it is a typical situation for a function.
Question:
How can I solve the problem using dplyr?
I wrote a function to
(1) subset the original dataset for the setting of interest (this works) and
(2) compute the rowSums of the variables of that setting (this does not work; why?).
Because it is a function for a special designed dataset, the function includes two predefined variables:
day - which is any day of an investigation period
N - which is the Number of cases investigated on this special day
Thank you for any help.
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
#print(subvars)
# Summarize the variables belonging to the setting of interest
dfplot <- dataset %>%
dplyr::select(day,N,!!! subvars) %>%
dplyr::mutate(s1 = rowSums(!!! subvars,na.rm = TRUE))
return(dfplot)
}
We can convert the quosures to strings with as_name and subset the dataset with [ for the rowSums
library(rlang)
library(purrr)
library(dplyr)
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
v1 <- map_chr(subvars, as_name)
#print(subvars)
# Summarize the variables belonging to the setting of interest
dfplot <- dataset %>%
dplyr::select(day, N, !!! subvars) %>%
dplyr::mutate(s1 = rowSums( .[v1],na.rm = TRUE))
return(dfplot)
}
out <- mkr.sumsetting(col1, col2, dataset = df1)
head(out, 3)
# day N col1 col2 s1
#1 1 20 -0.5458808 0.4703824 -0.07549832
#2 2 20 0.5365853 0.3756872 0.91227249
#3 3 20 0.4196231 0.2725374 0.69216051
Another option is to select the columns with the quosures and then take the rowSums
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
#print(subvars)
# Summarize the variables belonging to the setting of interest
dfplot <- dataset %>%
dplyr::select(day, N, !!! subvars) %>%
dplyr::mutate(s1 = dplyr::select(., !!! subvars) %>%
rowSums(na.rm = TRUE))
return(dfplot)
}
mkr.sumsetting(col1, col2, dataset = df1)
data
set.seed(24)
df1 <- data.frame(day = 1:20, N = 20, col1 = rnorm(20),
col2 = runif(20))
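If passing the column names as character strings is acceptable, the same result can be sketched in base R without any tidy evaluation. This is a swapped-in technique, not the function above; sum_setting and its vars argument are hypothetical names, and the day and N columns are assumed to exist as described in the question:

```r
# Hypothetical helper: `vars` is a character vector of column names
sum_setting <- function(dataset, vars) {
  out <- dataset[c("day", "N", vars)]             # keep the fixed columns plus the setting
  out$s1 <- rowSums(dataset[vars], na.rm = TRUE)  # row-wise sum over the setting's columns
  out
}

set.seed(24)
df1 <- data.frame(day = 1:20, N = 20, col1 = rnorm(20),
                  col2 = runif(20))

out <- sum_setting(df1, c("col1", "col2"))
head(out, 3)
```

Because vars is an ordinary character vector, settings with different numbers of columns need no special handling.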

Using variables as arguments in summarize()

I wish to pass user input variables to group_by() and summarize() functions.
The direct example of the data frame and code is below. Here I am 'hard-coding' the column names.
library(dplyr)
df <- data.frame('Category' = c('a','c','a','a','b','a','b','b'),
'Amt' = c(100,300,200,400,500,1000,350,250),
'Flag' = c(0,1,1,1,0,1,1,0))
rowCount <- nrow(df)
totalAmt <- sum(df$Amt)
g <- group_by(df, Category)
summ <- summarize(g, Count = n(), CountPercentage = n()*100/rowCount, TotalAmt = sum(Amt), AmtPercentage = sum(Amt)*100/totalAmt, FlagSum = sum(Flag))
summ
The output is below
In the application I am developing, the dataframe and hence the columns names will be user-defined. I will be reading the .csv file name, the column(s) to be grouped on and the columns to be summarized on from an Excel file.
I have searched far and wide and after spending much time reading and experimenting, I found the solution as shown below which worked for me. I have not used piping to make the steps clearer.
#The data frame df is read from the .csv file name
#Variables read from the Excel file
groupby <- 'Category'
sumBy1 <- 'Amt'
sumBy2 <- 'Flag'
rowCount <- nrow(df)
totalAmt <- sum(df[sumBy1])
g <- group_by_(df, groupby) #group by variable #grouping
summcount <- summarize(g, Count = n(), CountPercentage = n()*100/rowCount) #summarize counts #piece 1
summamt <- summarize_at(g, .vars = sumBy1, .funs=sum) #summarize by first variable
summamt <- summamt[-1] #remove first column to remove duplicate column
summamt$AmtPercentage <- summamt[sumBy1]*100/totalAmt #piece 2
summflag <- summarize_at(g, .vars = sumBy2, .funs=sum) #summarize by second variable
summflag <- summflag[-1] #remove first column to remove duplicate column #piece 3
summ <- cbind(summcount, summamt, summflag) #combine dataframes
summ
The result is the same as above. As you can see I am creating the final dataframe piecemeal and then binding them. The code is ugly. Also, how do I define the column headers in this syntax? I did consider summarize_all() but that requires creating a subset of the data frame. I have already read the following questions and they did not work for me
Passing arguments to dplyr summarize function
Summarizing data in table by group for each variable in r
Can you recommend a simpler and more elegant way to do this?
Above I have 'hardcoded' two types of summarization, viz. count and sum. To add another level of complication, what if the user wants to also define the type of summarization (viz. sum, mean, count, etc.) required? In the Excel file, I can capture the type of summarization needed against each variable.
Thanks for any suggestions.
That sounds like a job for Superman! Or at least quasi-quotations.
You want to insert variables using the bang-bang operator, !!.
You can do it like this
# Make a variable symbol from strings
make_var <- function(prefix, var, suffix)
as.symbol(paste0(prefix, var, suffix))
calc_summary <- function(df, groupby, sumBy1, sumBy2) {
totalSumBy1 <- make_var("Total", sumBy1, "")
sumBy1Percentage <- make_var("", sumBy1, "Percentage")
sumBy1 <- make_var("", sumBy1, "")
sumBy2Sum <- make_var("", sumBy2, "Sum")
sumBy2 <- make_var("", sumBy2, "")
group_by_(df, groupby) %>%
summarize(Count = n(),
CountPercentage = n()*100/nrow(df), # recomputed in the mutate() below
!!totalSumBy1 := sum(!!sumBy1),
!!sumBy2Sum := sum(!!sumBy2)) %>%
mutate(CountPercentage = Count/sum(Count),
!!sumBy1Percentage := 100 * !!totalSumBy1 / sum(!!totalSumBy1))
}
When you use !! you are inserting the value of a variable, so this is how you can parameterise expressions given to dplyr functions. You need them as symbols, which is why I use the make_var function. It can be done more elegantly, but this will give you the variables you used in your example.
Notice that when the variables we assign to are dynamic we must use the := assignment instead of =. Otherwise, the parser complains.
You can use this function as such:
> df %>% calc_summary("Category", "Amt", "Flag")
# A tibble: 3 x 6
Category Count CountPercentage TotalAmt FlagSum AmtPercentage
<fct> <int> <dbl> <dbl> <dbl> <dbl>
1 a 4 0.500 1700. 3. 54.8
2 b 3 0.375 1100. 1. 35.5
3 c 1 0.125 300. 1. 9.68
The order of columns is not the same as in your example, but you can fix that using select. I cleaned up the percentage calculations a bit by moving those to a mutate after the summary. It removes the need for the rowCount variable. If you prefer, you can easily use that variable and avoid the mutate call. Then you can also get the columns in the order you want in the summarise call.
Anyway, the important point is that you want the bang-bang operator for what you are doing here.
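For comparison, here is a sketch of the same summary written with the .data pronoun and glue-style names on the left of :=, which avoids building symbols by hand. It assumes dplyr 1.0.0 or later; calc_summary2 and total_name are names made up for illustration, and CountPercentage is scaled to 0-100 here, unlike the fraction in the output above:

```r
library(dplyr)

df <- data.frame(Category = c("a", "c", "a", "a", "b", "a", "b", "b"),
                 Amt      = c(100, 300, 200, 400, 500, 1000, 350, 250),
                 Flag     = c(0, 1, 1, 1, 0, 1, 1, 0))

# Hypothetical variant: all column names arrive as strings
calc_summary2 <- function(df, groupby, sumBy1, sumBy2) {
  total_name <- paste0("Total", sumBy1)
  df %>%
    group_by(across(all_of(groupby))) %>%
    summarize(Count = n(),
              "Total{sumBy1}" := sum(.data[[sumBy1]]),  # e.g. TotalAmt
              "{sumBy2}Sum"   := sum(.data[[sumBy2]]),  # e.g. FlagSum
              .groups = "drop") %>%
    mutate(CountPercentage = 100 * Count / sum(Count),
           "{sumBy1}Percentage" := 100 * .data[[total_name]] / sum(.data[[total_name]]))
}

res <- calc_summary2(df, "Category", "Amt", "Flag")
res
```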

Sum by aggregating complex paired names in R

In R, I'm trying to aggregate a dataframe based on unique IDs, BUT I need to use some kind of wild card value for the IDs. Meaning I have paired names like this:
lion_tiger
elephant_lion
tiger_lion
And I need the lion_tiger and tiger_lion IDs to be summed together, because the order in the pair does not matter.
Using this dataframe as an example:
df <- data.frame(pair = c("1_3","2_4","2_2","1_2","2_1","4_2","3_1","4_3","3_2"),
value = c("12","10","19","2","34","29","13","3","14"))
So the values for pair IDs, "1_2" and "2_1" need to be summed in a new table. That new row would then read:
1_2 36
Any suggestions? While my example has numbers as the pair IDs, in reality I would need this to read in text (like the lion_tiger" example above).
We can split the 'pair' column on _, sort the two parts, and paste them back together so that both orderings of a pair give the same key; grouping by that key then gives the sum
tapply(as.numeric(as.character(df$value)),
sapply(strsplit(as.character(df$pair), '_'), function(x)
paste(sort(as.numeric(x)), collapse="_")), FUN = sum)
Or another option is gsubfn
library(gsubfn)
df$pair <- gsubfn('([0-9]+)_([0-9]+)', ~paste(sort(as.numeric(c(x, y))), collapse='_'),
as.character(df$pair))
df$value <- as.numeric(as.character(df$value))
aggregate(value~pair, df, sum)
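The snippets above convert the IDs with as.numeric, which suits the numeric example. For text IDs such as lion_tiger, a sketch that sorts the two parts as character strings instead could look like this (the data frame and the key column are made up for illustration):

```r
df <- data.frame(pair  = c("lion_tiger", "elephant_lion", "tiger_lion"),
                 value = c(1, 2, 3),
                 stringsAsFactors = FALSE)

# Order-independent key: split on "_", sort the parts alphabetically, re-join
df$key <- sapply(strsplit(df$pair, "_", fixed = TRUE),
                 function(x) paste(sort(x), collapse = "_"))

res <- aggregate(value ~ key, data = df, FUN = sum)
res
```

Both lion_tiger and tiger_lion map to the key lion_tiger, so their values are summed together.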
Using tidyverse and purrrlyr
df <- data.frame(name=c("lion_tiger","elephant_lion",
"tiger_lion"),value=c(1,2,3),stringsAsFactors=FALSE)
require(tidyverse)
require(purrrlyr)
df %>% separate(col = name, sep = "_", c("A", "B")) %>%
by_row(.collate = "rows",
..f = function(this_row) {
paste0(sort(c(this_row$A, this_row$B)), collapse = "_")
}) %>%
rename(sorted = ".out") %>%
group_by(sorted) %>%
summarize(sum(value))%>%show
## A tibble: 2 x 2
# sorted `sum(value)`
# <chr> <dbl>
#1 elephant_lion 2
#2 lion_tiger 4

apply function to grouped rows in dataframe [duplicate]

I have created a function that computes a number of biological statistics, such as species range edges. Here is a simplified version of the function:
range_stats <- function(rangedf, lat, lon, weighting, na.rm=T){
cent_lat <- weighted.mean(x=rangedf[,lat], w=rangedf[,weighting], na.rm=T)
cent_lon <- weighted.mean(x=rangedf[,lon], w=rangedf[,weighting], na.rm=T)
out <- data.frame(cent_lat, cent_lon)
return(out)
}
I would like to apply this to a large dataframe where every row is an observation of a species. As such, I want the function to group rows by a specified set of columns, and then compute these statistics for each group. Here is a test dataframe:
LATITUDE <- c(27.91977, 21.29066, 26.06340, 28.38918, 25.97517, 27.96313)
LONGITUDE <- c(-175.8617, -157.8645, -173.9593, -178.3571, -173.9679, -175.7837)
BIOMASS <- c(4.3540488, 0.2406332, 0.2406332, 2.1419699, 0.3451426, 1.0946017)
SPECIES <- c('Abudefduf abdominalis','Abudefduf abdominalis','Abudefduf abdominalis','Chaetodon lunulatus','Chaetodon lunulatus','Chaetodon lunulatus')
YEAR <- c('2005', '2005', '2014', '2009', '2009', '2015')
testdf <- data.table(LATITUDE, LONGITUDE, BIOMASS, SPECIES, YEAR)
I want to apply this function to every unique combination of species and year to calculate summary statistics, i.e., the following:
testresult <- testdf %>%
group_by(SPECIES, YEAR) %>%
range_stats(lat="LATITUDE",lon="LONGITUDE",weighting="BIOMASS",na.rm=T)
However, the code above does not work (I get a (list) object cannot be coerced to type 'double' error) and I am not sure how else to approach the problem.
Since you added the dplyr and purrr tags, I assume you are interested in a tidyverse solution, so that is what I will demonstrate below.
First, your range_stats is problematic. This is why you got the error message. The weighted.mean is expecting a vector for both the x and w argument. However, if rangedf is a tibble, the way you subset the tibble, such as rangedf[,lat] will still return a one-column tibble. A better way is to use pull from the dplyr package.
library(tidyverse)
range_stats <- function(rangedf, lat, lon, weighting, na.rm=T){
cent_lat <- weighted.mean(x = rangedf %>% pull(lat),
w = rangedf %>% pull(weighting), na.rm=T)
cent_lon <- weighted.mean(x = rangedf %>% pull(lon),
w = rangedf %>% pull(weighting), na.rm=T)
out <- data.frame(cent_lat, cent_lon)
return(out)
}
Next, the way you created the data frame is OK, but data.table is from the data.table package and you will create a data.table, not a tibble. I thought you want to use an approach from tidyverse, so I changed data.table to data_frame as follows.
LATITUDE <- c(27.91977, 21.29066, 26.06340, 28.38918, 25.97517, 27.96313)
LONGITUDE <- c(-175.8617, -157.8645, -173.9593, -178.3571, -173.9679, -175.7837)
BIOMASS <- c(4.3540488, 0.2406332, 0.2406332, 2.1419699, 0.3451426, 1.0946017)
SPECIES <- c('Abudefduf abdominalis','Abudefduf abdominalis','Abudefduf abdominalis','Chaetodon lunulatus','Chaetodon lunulatus','Chaetodon lunulatus')
YEAR <- c('2005', '2005', '2014', '2009', '2009', '2015')
testdf <- data_frame(LATITUDE, LONGITUDE, BIOMASS, SPECIES, YEAR)
Now, you said you want to apply the range_stats function to each combination of SPECIES and YEAR. One approach is to split the data frame to a list of data frames, and use lapply family function. But here I want to show you how to use the map family function to achieve this task as map is from the purrr package, which is part of the tidyverse.
We can first create a group indices based on SPECIES and YEAR.
testdf2 <- testdf %>%
mutate(Group = group_indices(., SPECIES, YEAR))
testdf2
# A tibble: 6 x 6
LATITUDE LONGITUDE BIOMASS SPECIES YEAR Group
<dbl> <dbl> <dbl> <chr> <chr> <int>
1 27.91977 -175.8617 4.3540488 Abudefduf abdominalis 2005 1
2 21.29066 -157.8645 0.2406332 Abudefduf abdominalis 2005 1
3 26.06340 -173.9593 0.2406332 Abudefduf abdominalis 2014 2
4 28.38918 -178.3571 2.1419699 Chaetodon lunulatus 2009 3
5 25.97517 -173.9679 0.3451426 Chaetodon lunulatus 2009 3
6 27.96313 -175.7837 1.0946017 Chaetodon lunulatus 2015 4
As you can see, Group is a new column showing the index number. Now we can split the data frame based on Group, and then use map_dfr to apply the range_stats function.
testresult <- testdf2 %>%
split(.$Group) %>%
map_dfr(range_stats, lat = "LATITUDE",lon = "LONGITUDE",
weighting = "BIOMASS", na.rm = TRUE, .id = "Group")
testresult
Group cent_lat cent_lon
1 1 27.57259 -174.9191
2 2 26.06340 -173.9593
3 3 28.05418 -177.7480
4 4 27.96313 -175.7837
Notice that map_dfr automatically binds the output list of data frames into a single data frame. .id = "Group" means we want to create a column called Group based on the names of the list elements.
I separated the process into two steps, but of course they can be all in one pipeline as follows.
testresult <- testdf %>%
mutate(Group = group_indices(., SPECIES, YEAR)) %>%
split(.$Group) %>%
map_dfr(range_stats, lat = "LATITUDE",lon = "LONGITUDE",
weighting = "BIOMASS", na.rm = TRUE, .id = "Group")
If you want, testresult can be merged with testdf using left_join, but I will stop here as testresult is probably already the desired output you want. I hope this helps.
Fundamentally, the main issue involves weighted.mean() where you are passing a dataframe object and not a vector that can be coerced to double. To fix within method, simply change:
x=rangedf[,lat]
To double brackets:
x=rangedf[[lat]]
Adjusted method:
range_stats <- function(rangedf, lat, lon, weighting, na.rm=T){
cent_lat <- weighted.mean(x=rangedf[[lat]], w=rangedf[[weighting]], na.rm=T)
cent_lon <- weighted.mean(x=rangedf[[lon]], w=rangedf[[weighting]], na.rm=T)
out <- data.frame(cent_lat, cent_lon)
return(out)
}
As for the overall group-by computation, forgive me for bypassing dplyr and data.table, which you use; consider instead base R's underutilized but useful function, by().
The challenge with your current setup is that range_stats returns a data.frame of two columns, whereas dplyr's group_by() workflow expects each aggregation to return a vector. by(), however, passes data frame subsets (split by the given factors) to a function you define and returns a list of data.frames, which you can then rbind into one final data frame:
df_List <- by(testdf, testdf[, c("SPECIES", "YEAR")], FUN=function(df)
data.frame(species=df$SPECIES[1],
year=df$YEAR[1],
range_stats(df,"LATITUDE","LONGITUDE","BIOMASS"))
)
finaldf <- do.call(rbind, df_List)
finaldf
# species year cent_lat cent_lon
# 1 Abudefduf abdominalis 2005 27.57259 -174.9191
# 2 Chaetodon lunulatus 2009 28.05418 -177.7480
# 3 Abudefduf abdominalis 2014 26.06340 -173.9593
# 4 Chaetodon lunulatus 2015 27.96313 -175.7837
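If you would rather stay in dplyr, newer versions (0.8.1 and later, an assumption about your setup) provide group_modify(), which applies a function returning a data frame to each group. A sketch, using the [[ ]] version of range_stats and a plain data.frame for the test data; here na.rm is passed through to weighted.mean rather than hard-coded:

```r
library(dplyr)

range_stats <- function(rangedf, lat, lon, weighting, na.rm = TRUE) {
  data.frame(
    cent_lat = weighted.mean(rangedf[[lat]], rangedf[[weighting]], na.rm = na.rm),
    cent_lon = weighted.mean(rangedf[[lon]], rangedf[[weighting]], na.rm = na.rm)
  )
}

testdf <- data.frame(
  LATITUDE  = c(27.91977, 21.29066, 26.06340, 28.38918, 25.97517, 27.96313),
  LONGITUDE = c(-175.8617, -157.8645, -173.9593, -178.3571, -173.9679, -175.7837),
  BIOMASS   = c(4.3540488, 0.2406332, 0.2406332, 2.1419699, 0.3451426, 1.0946017),
  SPECIES   = rep(c("Abudefduf abdominalis", "Chaetodon lunulatus"), each = 3),
  YEAR      = c("2005", "2005", "2014", "2009", "2009", "2015")
)

testresult <- testdf %>%
  group_by(SPECIES, YEAR) %>%
  # .x is the subset of rows for the current SPECIES/YEAR group
  group_modify(~ range_stats(.x, "LATITUDE", "LONGITUDE", "BIOMASS")) %>%
  ungroup()

testresult
```

The grouping columns are kept in the result automatically, so there is no need to add species and year by hand.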
