Iterate over dplyr code using purrr::map2

I am relatively new to R, so my apologies if this question is too basic.
I have transactions that show quantity sold and revenue earned for different products. Because there are three products, there are 2^3 = 8 combinations in which these products can be sold together in a "basket." Each basket could be sold in any of three years (2016, 2017, 2018) and in either of two zones (East and West); I have three years' worth of transactions for both zones.
My objective is to analyze how much revenue is earned, how many quantities are sold, and how many transactions occurred for each combination of these products in a given year for a given zone.
I was able to do this (using purrr::map) by splitting the data by zone: I created a list of two data frames that hold the data grouped by year for each product combination described above. It works, but in my opinion the code is clunky, with a lot of repetitive statements. I want to be able to create a 2×3 list instead (2 zones × 3 years).
Here's my code using zone-wise splitting.
First Try
UZone <- unique(Input_File$Zone)
FYear <- unique(Input_File$Fiscal.Year)
# Split based on zone
a <- purrr::map(UZone, ~ dplyr::filter(Input_File, Zone == .)) %>%
  # Create combinations of products
  purrr::map(~ mutate_each(., funs(Exists = . > 0), L.Rev:I.Qty)) %>%
  # Group by fiscal year
  purrr::map(~ group_by_(., .dots = c("Fiscal.Year", grep("Exists", names(.), value = TRUE)))) %>%
  # Summarise, drop unwanted columns, and rename the "number of transactions" column
  purrr::map(~ summarise_each(., funs(sum(., na.rm = TRUE), count = n()), L.Rev:I.Qty)) %>%
  purrr::map(~ select(., Fiscal.Year:L.Rev_count)) %>%
  purrr::map(~ plyr::rename(., c("L.Rev_count" = "No.Trans")))
# Now do zone- and year-wise splitting: Try 1
EastList <- a[[1]] %>% split(.$Fiscal.Year)
WestList <- a[[2]] %>% split(.$Fiscal.Year)
write.xlsx(EastList, file = "East.xlsx", row.names = FALSE)
write.xlsx(WestList, file = "West.xlsx", row.names = FALSE)
As you can see, the above code is very clunky. With my limited knowledge of R, I researched https://blog.rstudio.org/2016/01/06/purrr-0-2-0/ and read the purrr::map2() manual, but I couldn't find many examples. After reading the solution at "How to add list of vector to list of data.frame objects as new slot by parallel?", I assumed I could use X = zone and Y = fiscal year to do what I did above.
Here's what I tried:
Second Try
#Now try Zone and Year-wise splitting : Try 2
purrr::map2(UZone,FYear, ~ dplyr::filter(Input_File, Zone == ., Fiscal.Year == .))
But this code doesn't work. I get this error message:
Error: .x (2) and .y (3) are different lengths
Question 1: Can I use map2 to do what I am trying to do? If not, is there any other better way?
Question 2: In case we are able to use map2, how can I generate the two Excel files with one command? As you can see above, I have two function calls; I'd like to have only one.
Question 3: Instead of the two statements below, is there any way to do the sum and the count in one statement? I am looking for a cleaner way to compute them.
purrr::map(~summarise_each(., funs(sum(., na.rm = TRUE), count = n()), L.Rev:I.Qty)) %>%
purrr::map(~select(., Fiscal.Year:L.Rev_count)) %>%
Can someone please help me?
Here's my data:
dput(Input_File)
structure(list(Zone = c("East", "East", "East", "East", "East",
"East", "East", "West", "West", "West", "West", "West", "West",
"West"), Fiscal.Year = c(2016, 2016, 2016, 2016, 2016, 2016,
2017, 2016, 2016, 2016, 2017, 2017, 2018, 2018), Transaction.ID = c(132,
133, 134, 135, 136, 137, 171, 171, 172, 173, 175, 176, 177, 178
), L.Rev = c(3, 0, 0, 1, 0, 0, 2, 1, 1, 2, 2, 1, 2, 1), L.Qty = c(3,
0, 0, 1, 0, 0, 1, 1, 1, 2, 2, 1, 2, 1), A.Rev = c(0, 0, 0, 1,
1, 1, 0, 0, 0, 0, 0, 1, 0, 0), A.Qty = c(0, 0, 0, 2, 2, 3, 0,
0, 0, 0, 0, 3, 0, 0), I.Rev = c(4, 4, 4, 0, 1, 0, 3, 0, 0, 0,
1, 0, 1, 1), I.Qty = c(2, 2, 2, 0, 1, 0, 3, 0, 0, 0, 1, 0, 1,
1)), .Names = c("Zone", "Fiscal.Year", "Transaction.ID", "L.Rev",
"L.Qty", "A.Rev", "A.Qty", "I.Rev", "I.Qty"), row.names = c(NA,
14L), class = "data.frame")
Output Format:
Here's the code to generate the output. I would love to see EastList.2016 and EastList.2017 as two sheets in one Excel file, and WestList.2016, WestList.2017, and WestList.2018 as three sheets in another.
#generate the output:
EastList.2016 <- EastList[[1]]
EastList.2017 <- EastList[[2]]
WestList.2016 <- WestList[[1]]
WestList.2017 <- WestList[[2]]
WestList.2018 <- WestList[[3]]

Two lists broken down by year with sums and counts for each?
In dplyr:
# df <- your data frame
df %>%
  group_by(Zone, Fiscal.Year) %>%
  summarise_at(vars(L.Rev:I.Qty), funs(sum = sum, cnt = n()))
Source: local data frame [5 x 14]
Groups: Zone [?]
Zone Fiscal.Year L.Rev_sum L.Qty_sum A.Rev_sum A.Qty_sum I.Rev_sum I.Qty_sum L.Rev_cnt L.Qty_cnt A.Rev_cnt A.Qty_cnt I.Rev_cnt I.Qty_cnt
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int> <int> <int> <int> <int>
1 East 2016 4 4 3 7 13 7 6 6 6 6 6 6
2 East 2017 2 1 0 0 3 3 1 1 1 1 1 1
3 West 2016 4 4 0 0 0 0 3 3 3 3 3 3
4 West 2017 3 3 1 3 1 1 2 2 2 2 2 2
5 West 2018 3 3 0 0 2 2 2 2 2 2 2 2
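To also cover Question 2 (one workbook per zone, one sheet per fiscal year), one option is the openxlsx package: its write.xlsx() accepts a named list of data frames and writes one sheet per element. A sketch under that assumption, reusing df and the summarise_at call from above:

```r
library(dplyr)
library(openxlsx)  # assumed available; write.xlsx() turns a named list into sheets

# Summarise as above, then write one workbook per zone
summary_df <- df %>%
  group_by(Zone, Fiscal.Year) %>%
  summarise_at(vars(L.Rev:I.Qty), funs(sum = sum, cnt = n())) %>%
  ungroup()

for (z in unique(summary_df$Zone)) {
  zone_df <- summary_df[summary_df$Zone == z, ]
  # Names "2016", "2017", ... become the sheet names
  sheets <- split(zone_df, zone_df$Fiscal.Year)
  write.xlsx(sheets, file = paste0(z, ".xlsx"))
}
```

This produces East.xlsx with the 2016 and 2017 sheets and West.xlsx with 2016, 2017, and 2018, matching the requested output format.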

Related

How to find sum of a column given the date and month is the same

I am wondering how I can find the sum of a column, (in this case it's the AgeGroup_20_to_24 column) for a month and year. Here's the sample data:
https://i.stack.imgur.com/E23Th.png
I essentially want to find the total number of cases per month/year. For example:
01/2020 = total sum of cases in the age group
02/2020 = total sum of cases in the age group
I tried doing this, however I get this:
https://i.stack.imgur.com/1eH0O.png
xAge20To24 <- covid %>%
  mutate(dates = mdy(Date), year = year(dates), month = month(dates)) %>%
  mutate(total = sum(AgeGroup_20_to_24)) %>%
  select(Date, year, month, AgeGroup_20_to_24) %>%
  group_by(year)
View(xAge20To24)
Any help will be appreciated.
structure(list(Date = c("3/9/2020", "3/10/2020", "3/11/2020",
"3/12/2020", "3/13/2020", "3/14/2020"), AgeGroup_0_to_19 = c(1,
0, 2, 0, 0, 2), AgeGroup_20_to_24 = c(1, 0, 2, 0, 2, 1), AgeGroup_25_to_29 = c(1,
0, 1, 2, 2, 2), AgeGroup_30_to_34 = c(0, 0, 2, 3, 4, 3), AgeGroup_35_to_39 = c(3,
1, 2, 1, 2, 1), AgeGroup_40_to_44 = c(1, 2, 1, 3, 3, 1), AgeGroup_45_to_49 = c(1,
0, 0, 2, 0, 1), AgeGroup_50_to_54 = c(2, 1, 1, 1, 0, 1), AgeGroup_55_to_59 = c(1,
0, 1, 1, 1, 2), AgeGroup_60_to_64 = c(0, 2, 2, 1, 1, 3), AgeGroup_70_plus = c(2,
0, 2, 0, 0, 0)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
I'm not sure your question and your data match up: you're asking for by-month summaries, but your data only includes March entries. I've provided two examples of summarizing your data below, one that uses the entire date and one that summarizes by day, since with a single month we can't meaningfully group by month. If your full data set has more months, you can just swap day for month. First, a quick summary by date can be done with this code:
#### Load Library ####
library(tidyverse)
library(lubridate)
#### Pivot and Summarise Data ####
covid %>%
  pivot_longer(cols = c(everything(), -Date),
               names_to = "AgeGroup",
               values_to = "Cases") %>%
  group_by(Date) %>%
  summarise(Sum_Cases = sum(Cases))
This pivots your data into long format, groups by the entire date, then summarizes the cases, which gives you this by-date sum of data:
# A tibble: 6 × 2
Date Sum_Cases
<chr> <dbl>
1 3/10/2020 6
2 3/11/2020 16
3 3/12/2020 14
4 3/13/2020 15
5 3/14/2020 17
6 3/9/2020 13
Using the same pivot_longer principle, you can mutate the data to date format like you already did, pivot to longer format, then group by day, thereafter summarizing the cases:
#### Theoretical Example ####
covid %>%
  mutate(Date = mdy(Date),
         Year = year(Date),
         Month = month(Date),
         Day = day(Date)) %>%
  pivot_longer(cols = c(everything(), -Date, -Year, -Month, -Day),
               names_to = "AgeGroup",
               values_to = "Cases") %>%
  group_by(Day) %>% # use by day instead of month
  summarise(Sum_Cases = sum(Cases))
Which you can see below. Here we can see the 14th had the most cases:
# A tibble: 6 × 2
Day Sum_Cases
<int> <dbl>
1 9 13
2 10 6
3 11 16
4 12 14
5 13 15
6 14 17
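If the full data does span multiple months, the per-month/year totals the question asks for follow the same pivot-and-summarise pattern, grouping by both Year and Month. A sketch assuming the same covid tibble and tidyverse/lubridate setup as above:

```r
library(tidyverse)
library(lubridate)

covid %>%
  mutate(Date = mdy(Date), Year = year(Date), Month = month(Date)) %>%
  pivot_longer(cols = c(-Date, -Year, -Month),
               names_to = "AgeGroup",
               values_to = "Cases") %>%
  group_by(Year, Month) %>%                       # one row per month/year
  summarise(Sum_Cases = sum(Cases), .groups = "drop")
```

With the sample data this yields a single row (March 2020), but it generalizes as soon as more months are present.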

Double splitting datasets in R

Is there a way to split a dataset into permutations of its original components? I realized just now that split() splits a dataset (and the columns selected) into mini-datasets for each element of a column. But suppose I had a dataset "championships" with a column "question" with elements
a, b, c
and a column "year" with elements
2018, 2019
(among other columns), and I wanted to create mini-datasets for all observations in "championships" that had "question" = a, year = "2018", and so on for every other combination. How would I do this?
EDIT: Additionally, the columns I am working with have many more elements than in these examples, so how would I create new objects for each combination?
My expected results are basically what I imagine what would happen if I applied the filter() function to each element of "question" and then to each element of "year" and then created objects for every single one of those outputs.
The dataset:
structure(list(id = structure(c(25, 25, 25, 25, 25, 25, 25, 25,
25, 25), format.stata = "%8.0g"), year = structure(c(2018, 2018,
2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018), format.stata = "%8.0g"),
round = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), format.stata = "%8.0g"),
question = structure(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), format.stata = "%8.0g"),
correct = structure(c(0, 0, 0, 0, 0, 0, 1, 0, 1, 0), format.stata = "%8.0g")), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
You can split the dataset for each question and year, assign names as desired, and use list2env to create the individual datasets in the global environment.
data <- split(df, list(df$question, df$year))
names(data) <- sub('(\\d+)\\.(\\d+)', 'question\\1_year\\2', names(data))
names(data)
# [1] "question1_year2018" "question2_year2018" "question3_year2018"
# [4] "question4_year2018" "question5_year2018" "question6_year2018"
# [7] "question7_year2018" "question8_year2018" "question9_year2018"
#[10] "question10_year2018"
list2env(data, .GlobalEnv)
question1_year2018
# A tibble: 1 x 5
# id year round question correct
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 25 2018 1 1 0
question2_year2018
# A tibble: 1 x 5
# id year round question correct
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 25 2018 1 2 0
It is not a good practice to create multiple datasets in global environment. You should keep them in lists, it is easier to manage them that way.
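Following that advice, a minimal sketch of working with the split results directly from the list, without list2env (the df here is rebuilt from the question's dput for self-containment):

```r
# Rebuild the question's data as a plain data frame
df <- data.frame(id = 25, year = 2018, round = 1,
                 question = 1:10,
                 correct = c(0, 0, 0, 0, 0, 0, 1, 0, 1, 0))

# Split by every question/year combination and rename as before
data <- split(df, list(df$question, df$year))
names(data) <- sub('(\\d+)\\.(\\d+)', 'question\\1_year\\2', names(data))

# Access one mini-dataset by name instead of creating top-level objects
data[["question1_year2018"]]

# Apply the same operation across every mini-dataset at once
sapply(data, nrow)
```

Keeping the pieces in the list makes it trivial to iterate over all combinations later (e.g. with lapply or purrr::map).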

How can I select only the dummy variable columns?

Is there a way to select only those columns in a dataframe where the values in the columns are either 0 or 1 at a time?
I have a data frame with various values, including various strings as well as numbers.
I tried to use dplyr select but could not find a way to evaluate the values contained in the columns.
Sample data is shown below.
data <- tribble(
  ~id, ~gender, ~height, ~smoking,
  1, 1, 170, 0,
  2, 0, 150, 0,
  3, 1, 160, 1
)
You can pass a function (or an rlang tilde-formula) to select_if and look for columns that contain only 0 or 1.
tribble(
  ~id, ~gender, ~height, ~smoking,
  1, 1, 170, 0,
  2, 0, 150, 0,
  3, 1, 160, 1
) %>%
  select_if(~ all(. %in% 0:1))
# # A tibble: 3 x 2
# gender smoking
# <dbl> <dbl>
# 1 1 0
# 2 0 0
# 3 1 1
If you may have NA in a dummy-variable column, you may want to instead use %in% c(0:1, NA) in the predicate.
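In dplyr 1.0.0 and later, select_if() is superseded; the same predicate can be written with select(where(...)). A sketch under that assumption:

```r
library(dplyr)

df <- tribble(
  ~id, ~gender, ~height, ~smoking,
  1, 1, 170, 0,
  2, 0, 150, 0,
  3, 1, 160, 1
)

# where() keeps columns for which the predicate returns TRUE;
# including NA in the allowed set tolerates missing dummy values
df %>% select(where(~ all(. %in% c(0:1, NA))))
```

This returns the same gender and smoking columns as the select_if version.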

How to create a new dataset based on multiple conditions in R?

I have a dataset called carcom that looks like this
carcom <- data.frame(household = c(173, 256, 256, 319, 319, 319, 422, 422, 422, 422), individuals= c(1, 1, 2, 1, 2, 3, 1, 2, 3, 4))
Where individuals refers to the father for "1", the mother for "2", and a child for "3" and "4". I would like to get two new columns: the first should indicate the number of children in that household, if any; the second assigns a weight to each individual, 1 for the father, 0.5 for the mother, and 0.3 for each child. My new dataset should look like this:
newcarcom <- data.frame(household = c(173, 256, 319, 422), child = c(0, 0, 1, 2), weight = c(1, 1.5, 1.8, 2.1))
I have been trying to find a solution for days. I would appreciate it if someone could help me. Thanks.
We can count the number of individuals with value 3 or 4 in each household. To calculate weight, we map the values 1:4 to their corresponding weights using recode and then take the sum.
library(dplyr)
newcarcom <- carcom %>%
  group_by(household) %>%
  summarise(child = sum(individuals %in% 3:4),
            weight = sum(recode(individuals, `1` = 1, `2` = 0.5, .default = 0.3)))
# household child weight
# <dbl> <int> <dbl>
#1 173 0 1
#2 256 0 1.5
#3 319 1 1.8
#4 422 2 2.1
Base R version suggested by #markus
newcarcom <- do.call(data.frame, aggregate(individuals ~ household, carcom, function(x)
  c(child = sum(x %in% 3:4), weight = sum(replace(y <- x^-1, y < 0.5, 0.3)))))
An option with data.table
library(data.table)
setDT(carcom)[, .(child = sum(individuals %in% 3:4),
                  weight = sum(recode(individuals, `1` = 1, `2` = 0.5, .default = 0.3))),
              household]
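If you'd rather not mix dplyr::recode into the data.table call, recent data.table versions provide fcase(), which can express the same mapping natively. A sketch:

```r
library(data.table)

carcom <- data.table(
  household = c(173, 256, 256, 319, 319, 319, 422, 422, 422, 422),
  individuals = c(1, 1, 2, 1, 2, 3, 1, 2, 3, 4)
)

# fcase() maps each code to its weight: 1 -> father, 2 -> mother, otherwise child
carcom[, .(child = sum(individuals %in% 3:4),
           weight = sum(fcase(individuals == 1, 1,
                              individuals == 2, 0.5,
                              default = 0.3))),
       by = household]
```

This keeps the whole pipeline within data.table, which matters if the rest of your workflow avoids loading dplyr.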

Apply filter by group without using join and by using data.table in R

The objective of my code is to apply a percentile-based cutoff on a specific column defined by a group.
I found several threads on SO such as:
Efficient way to filter one data frame by ranges in another
Subsetting data frame with multiple date conditions for ranges in between
How to filter cases in a data.table by multiple conditions defined in another data.table
Unfortunately, these threads either don't apply the filter based on a group, or don't use data.table or base R.
I am specifically looking for a method without a join. A base-R method would be fine, but I would really love a data.table method because my data is huge. I was able to do what I want with a join, but I am looking for an even better method that avoids it.
Here's my input data:
Input_File <- structure(list(Zone = c("East", "East", "East", "East", "East",
"East", "East", "West", "West", "West", "West", "West", "West",
"West"), Fiscal.Year = c(2016, 2016, 2016, 2016, 2016, 2016,
2017, 2016, 2016, 2016, 2017, 2017, 2018, 2018), Transaction.ID = c(132,
133, 134, 135, 136, 137, 171, 171, 172, 173, 175, 176, 177, 178
), L.Qty = c(3, 0, 0, 1, 0, 0, 1, 1, 1, 2, 2, 1, 2, 1), A.Qty = c(0,
0, 0, 2, 2, 3, 0, 0, 0, 0, 0, 3, 0, 0), I.Qty = c(2, 2, 2, 0,
1, 0, 3, 0, 0, 0, 1, 0, 1, 1)), .Names = c("Zone", "Fiscal.Year",
"Transaction.ID", "L.Qty", "A.Qty", "I.Qty"), row.names = c(NA,
-14L), class = "data.frame")
Here's my code (using join):
Input_File <- data.table::as.data.table(Input_File)
Q <- data.table::as.data.table(data.frame(Zone = c("East", "West"), Ten_percentile = c(2017, 2018)))
O <- Q[Input_File, on = "Zone"][Fiscal.Year >= Ten_percentile]
A brief explanation of my code: I am applying the Ten_percentile cutoff to Fiscal.Year, grouped by Zone.
Here's the cutoff table:
Q
Zone Ten_percentile
1: East 2017
2: West 2018
Here's the expected output:
O
Zone Ten_percentile Fiscal.Year Transaction.ID L.Qty A.Qty I.Qty
1: East 2017 2017 171 1 0 3
2: West 2018 2018 177 2 0 1
3: West 2018 2018 178 1 0 1
and here's the output in dput format
structure(list(Zone = structure(c(1L,2L,2L),
.Label = c("East","West"), class = "factor"),
Ten_percentile = c(2017,2018,2018),
Fiscal.Year = c(2017,2018,2018),
Transaction.ID = c(171,177,178), L.Qty = c(1,2,1),
A.Qty = c(0,0,0), I.Qty = c(3,1,1)),
.Names = c("Zone","Ten_percentile","Fiscal.Year","Transaction.ID",
"L.Qty","A.Qty","I.Qty"), class = "data.frame", row.names = c(NA,
-3L))
Thanks in advance for any help extended to me. I am a big fan of data.table. Hence, I want to learn different ways to solve the same problem and become well versed in data.table and base-R.
We can do a non-equi join
res <- as.data.table(Input_File)[Q, c(.SD, list(Ten_percentile = Ten_percentile)),
on = .(Zone, Fiscal.Year >= Ten_percentile)]
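For the base-R side of the question, the cutoff can also be applied without any join by looking up each row's zone cutoff with match(). A minimal sketch (the tables are rebuilt here as plain data frames, keeping only the columns needed to show the technique):

```r
# Input and cutoff tables from the question (plain data.frames)
Input_File <- data.frame(
  Zone = rep(c("East", "West"), each = 7),
  Fiscal.Year = c(2016, 2016, 2016, 2016, 2016, 2016, 2017,
                  2016, 2016, 2016, 2017, 2017, 2018, 2018),
  Transaction.ID = c(132, 133, 134, 135, 136, 137, 171,
                     171, 172, 173, 175, 176, 177, 178)
)
Q <- data.frame(Zone = c("East", "West"), Ten_percentile = c(2017, 2018))

# match() builds a per-row cutoff vector from the lookup table; no join needed
cutoff <- Q$Ten_percentile[match(Input_File$Zone, Q$Zone)]
O <- Input_File[Input_File$Fiscal.Year >= cutoff, ]
```

This is a plain vectorized lookup rather than a merge, so it keeps the row order of Input_File and allocates no intermediate joined table.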
