How to split by multiple columns when using multidplyr - R

tl;dr
How do I make partition() from multidplyr split on multiple columns?
Motivation:
I was unhappy with using only 1 of my 32 cores for a hard-working summarise, so I am trying to use multidplyr. I am grouping by multiple columns.
Example:
The vignette shows grouping by a single column, but when I do that, my other grouping column is not considered.
Code:
library(dplyr)
library(multidplyr)
library(nycflights13)
flights1 <- partition(flights, flight)
flights2 <- summarise(flights1, dep_delay = mean(dep_delay, na.rm = TRUE))
flights3 <- collect(flights2)
So how about splitting on year, month, and day?
This doesn't work for me:
flights1 <- partition(flights, list(year, month, day))
flights2 <- summarise(flights1, dep_delay = mean(dep_delay, na.rm = TRUE))
flights3 <- collect(flights2)
I can't seem to make this work. Can you point to a proper or at least effective way to do this?

According to ?partition, the usage for partition is
partition(.data, ..., cluster = get_default_cluster())
where ... are variables to partition by. Instead of passing in a list of variables, pass in each variable separately, i.e.
partition(flights, year, month, day)
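For context, a minimal sketch of the full pipeline with that call (this assumes the development-version multidplyr API that the vignette uses, where partition() takes grouping variables directly):

library(dplyr)
library(multidplyr)
library(nycflights13)

# Partition by all three grouping variables, summarise on the worker
# processes, then collect the results back into a local data frame.
flights1 <- partition(flights, year, month, day)
flights2 <- summarise(flights1, dep_delay = mean(dep_delay, na.rm = TRUE))
flights3 <- collect(flights2)

Note that later CRAN releases of multidplyr changed the interface: you create a cluster with new_cluster(), call group_by(year, month, day) on the data frame, and then pass the cluster to partition(); the summarise()/collect() steps stay the same.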

Related

Counting the rows based on two other column values, and manipulating the value in a loop through one of these column values in R

There are three columns: website, Date ("%Y %m"), and click_tracking (T/F). I would like to add a variable giving the number of websites whose click_tracking = T in each month divided by the number of all websites in that month.
I thought the steps would be something like:
aggregate(sum(df$click_tracking = TRUE), by=list(Category=df$Date), FUN = sum)
as.data.frame(table(Date))
Then somehow loop through Date and divide the two variables above, which would already have been grouped by Date. How can I achieve this? Many thanks!
If we are creating a column, then do a group by 'Date' and get the sum of 'click_tracking' (assuming it is a logical column - TRUE/FALSE) in mutate:
library(dplyr)
df %>%
group_by(Date) %>%
mutate(countTRUE = sum(click_tracking))
If the column is a factor, convert it to logical with as.logical:
df %>%
group_by(Date) %>%
mutate(countTRUE = sum(as.logical(click_tracking)))
If the goal is to create a summarised output:
df %>%
group_by(Date) %>%
summarise(countTRUE = sum(click_tracking))
In the OP's code, = (assignment) is used instead of == in sum(df$click_tracking = TRUE), and there is no need for a comparison on a logical column anyway:
aggregate(cbind(click_tracking = as.logical(click_tracking)) ~ Date, data = df, FUN = sum)
This will create the proportion of websites with click tracking (out of all websites) per month.
aggregate(click_tracking ~ Date, data = df, FUN = mean)
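The same proportion can also be computed directly in dplyr; a minimal sketch, assuming click_tracking is logical or coercible with as.logical() (prop_tracked is just an illustrative name):

library(dplyr)

# mean() of a logical vector is the share of TRUE values per Date
df %>%
  group_by(Date) %>%
  summarise(prop_tracked = mean(as.logical(click_tracking)))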

Dplyr solution for difference in row values based on two factor levels in separate columns

I am trying to use dplyr to calculate the difference between two row values based on factor levels in large data frame. In practical terms, I want the vote distance between two groups across each party within each country. For the data below, I would like to end up with a data frame with rows indicating the difference between the vote values for each group pair for each party level within each country level. The lag function does not seem to work with my data as the number of factor levels varies by country, meaning each country has a different total number of groups and parties. A small sample of the setup is below.
df1 <- data.frame(id = c(1:12),
country = c("a","a","a","a","a","a","b","b","b","b","b","b"),
group = c("x","y","z","x","y","z","x","y","z","x","y","z"),
party = c("d","d","d","e","e","e","d","d","d","e","e","e"),
vote = c(.15,.02,.7, .5, .6, .22,.47,.33,.09,.83,.77,.66))
This is how I would like the end product to look.
df2 <- data.frame(id= c(1:12),
country = c("a","a","a","b","b","b","a","a","a","b","b","b"),
group1 = c("x","x","y","x","x","y","x","x","y","x","x","y"),
group2 = c("y","z","z","y","z","z","y","z","z","y","z","z"),
party = c("d","d","d","d","d","d","e","e","e","e","e","e"),
dist = c(.13,-.5,-.68,.14,.38,.24,-.1,.28,.38,.06,.17,.11))
I have tried dcast previously, and if I fill with the column I want, it doesn't line up and produces NA or 0 where there should be values. The lag function doesn't work in my case because the number of parties and groups is unique to each country and not fixed. Whenever I have tried different intervals for the lag, the values end up compared across countries or across parties rather than across groups in some instances.
I have found solutions outside of dplyr but for parsimony in presenting code I am wondering if there is a way in dplyr. Also, the code I have is incredibly long and clunky, and uses six or seven packages just for this problem.
Thanks
We can use combn to create the difference
library(dplyr)
df1 %>%
group_by(country, party) %>%
mutate(dist = combn(vote, 2, FUN = function(x) x[1] - x[2]))
Another way is to use a self-join:
library(tidyverse)
df1 %>%
left_join(df1 %>% select(-id), by = c("country", "party"), suffix = c("1", "2")) %>%
filter(group1 != group2) %>%
mutate(dist = vote1 - vote2)
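If the group1/group2 labels from the desired df2 are needed as well, here is a sketch that extends the combn() idea, assuming dplyr >= 1.0 (where summarise() can return several rows per group):

library(dplyr)

df1 %>%
  group_by(country, party) %>%
  summarise(
    # first and second member of every within-group pair
    group1 = combn(as.character(group), 2)[1, ],
    group2 = combn(as.character(group), 2)[2, ],
    # difference in vote between the two members of each pair
    dist   = combn(vote, 2, FUN = function(x) x[1] - x[2]),
    .groups = "drop"
  )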

How to return a value from a variable based on a condition in another variable within a grouped data frame?

I am calculating some metrics on each of a set of variables within a grouped dataframe using the basic group_by() + summarize_at approach. Each group represents a small timeseries. One metric I would like to calculate is the initial value (in this case, day == 1) of each variable within each group. Thus, the generalized problem is to return a value of a variable based on a criterion in another variable, within groups of a grouped dataframe. Within the group_by() + summarize_at approach, I believe I need a custom function that summarize_at can then apply to each variable. I can successfully deploy other custom functions that depend only on the data variable at hand. I seem to be hung up on getting the function to go look in other columns of the dataframe.
I am not married to this approach, and welcome alternate recommendations. However, I am most comfortable with dplyr.
# a dataset
df <- data.frame(day = rep(c(1:5),3),
group = c(rep(1,5),rep(2,5),rep(3,5)),
var_a = seq(1:15),
var_b = seq(2,30, length.out = 15),
var_c = seq(3,45, length.out = 15))
# the logic of what I am going for, on a manually extracted example group:
# initial value (day == 1) of var_a for group 2
df_subset <- df %>%
filter(group == 2)
df_subset$var_a[which(df_subset$day == 1)]
# [1] 6
# my laughable attempt at a function
initial <- function(x){
ini <- which(.$day == 1)
x[ini]
}
# custom function deployed in dplyr pipe (which of course doesn't work)
df %>%
group_by(group) %>%
summarize_at(c("var_a","var_b","var_c"),
list(max = max, ini = initial))
Many thanks.
After the group_by step, specify the variables to select in summarise_at using one of the select helpers (here starts_with works fine), and within the list apply the different functions to each of the columns (~ is one way to prefix the anonymous call instead of explicitly writing function(x)). For the second function, 'day' is not one of the selected columns, but it can still be referenced by its unquoted column name:
library(dplyr)
df %>%
group_by(group) %>%
summarise_at(vars(starts_with('var')),
list(max = ~max(.), ini = ~ .[day == 1]))
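In more recent dplyr (>= 1.0), summarise_at() is superseded; a sketch of the same idea with across():

library(dplyr)

df %>%
  group_by(group) %>%
  summarise(across(starts_with("var"),
                   # 'day' is still reachable inside the lambda via data masking
                   list(max = max, ini = ~ .x[day == 1])))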

Dividing values in each cell by the group average in R

I am trying to generate a new column with values derived from the original data. I would like to calculate the group average for the same hotel and same date first, then use this group average to divide the original sales.
Here is my code: I tried to calculate the group average using group_by and summarise from the dplyr package; however, it did not generate my expected results.
hotel = c(rep("Hilton",3), rep("Caesar",3))
date1 = c(rep('2018-01-01',2), '2018-01-02', rep('2018-01-01',3))
dba = c(2,0,1,3,2,1)
sales = c(3,5,7,5,2,3)
df = data.frame(cbind(hotel, date1, dba, sales))
df1 = df %>%
group_by(date1, hotel) %>%
dplyr::summarise(avg = mean(sales)) %>%
acast(., date1~hotel)
Any suggestion would be highly appreciated!
Instead of summarise, we can use mutate. After grouping by 'date1' and 'hotel', divide 'sales' by the mean of 'sales' to create a new column:
library(tidyverse)
df %>%
group_by(date1, hotel) %>%
mutate(SalesDividedByMean = sales/mean(sales))
NOTE: When columns have different types, cbind() produces a matrix, and a matrix can hold only a single type, so one character vector turns all the data into character. Wrapping that matrix in data.frame() then propagates the change, giving either factor columns (by default, when stringsAsFactors = TRUE) or character columns.
data
df <- data.frame(hotel, date1, dba, sales)
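Putting the note and the answer together, a minimal sketch with the corrected construction (no cbind(), so dba and sales stay numeric):

library(dplyr)

hotel <- c(rep("Hilton", 3), rep("Caesar", 3))
date1 <- c(rep("2018-01-01", 2), "2018-01-02", rep("2018-01-01", 3))
dba   <- c(2, 0, 1, 3, 2, 1)
sales <- c(3, 5, 7, 5, 2, 3)
df    <- data.frame(hotel, date1, dba, sales)

# group average per hotel/date, then the ratio of each sale to that average
df %>%
  group_by(date1, hotel) %>%
  mutate(SalesDividedByMean = sales / mean(sales)) %>%
  ungroup()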

How do I aggregate certain columns from data frame by a Unique ID?

I have a list of Statcast data, per day, dating back to 2016. I am attempting to aggregate this data to find the mean for each pitcher ID.
I have the following code:
aggpitch <- aggregate(pitchingstat, by=list(pitchingstat$PitcherID),
FUN=mean, na.rm = TRUE)
This call aggregates every single column. I am looking to aggregate only certain columns.
How would I include only certain columns?
If you have more than one column that you'd like to summarize, you can use QAsena's approach and add the summarise_at function like so:
pitchingstat %>%
group_by(PitcherID) %>%
summarise_at(vars(col1:coln), mean, na.rm = TRUE)
Check out link below for more examples:
https://dplyr.tidyverse.org/reference/summarise_all.html
Replace the first argument (pitchingstat) with the name of the column you want to aggregate (or a vector thereof)
How about?:
library(tidyverse)
aggpitch <- pitchingstat %>%
group_by(PitcherID) %>%
summarise(pitcher_mean = mean(variable)) #replace 'variable' with your variable of interest here
or
library(tidyverse)
aggpitch <- pitchingstat %>%
select(PitcherID, var_1, var_2) %>%
group_by(PitcherID) %>%
summarise(pitcher_mean = mean(var_1),
pitcher_mean2 = mean(var_2))
I think this works but could use a dummy example of your data to play with.
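For completeness, the summarise_at() call above is superseded in dplyr >= 1.0; a sketch of the across() equivalent (col1:coln remain placeholders for the columns you actually want):

library(dplyr)

pitchingstat %>%
  group_by(PitcherID) %>%
  summarise(across(col1:coln, ~ mean(.x, na.rm = TRUE)))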
