Double splitting datasets in R

Is there a way to split a dataset by every combination of the values in two of its columns? I just realized that split() divides a dataset into mini-datasets, one for each value of the selected column. Suppose I have a dataset "championships" with a column "question" containing the values
a, b, c
and a column "year" containing the values
2018, 2019
(among other columns). How would I create a mini-dataset of all observations in "championships" that have "question" = 1 and "year" = 2018, whatever the values in the other columns?
EDIT: Additionally, the columns I am working with have a lot more elements than these examples so how would I create new objects for each of them?
My expected result is basically what I imagine would happen if I applied the filter() function to each value of "question" and then to each value of "year", and created an object for every one of those outputs.
The dataset:
structure(list(id = structure(c(25, 25, 25, 25, 25, 25, 25, 25,
25, 25), format.stata = "%8.0g"), year = structure(c(2018, 2018,
2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018), format.stata = "%8.0g"),
round = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), format.stata = "%8.0g"),
question = structure(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), format.stata = "%8.0g"),
correct = structure(c(0, 0, 0, 0, 0, 0, 1, 0, 1, 0), format.stata = "%8.0g")), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))

You can split the dataset by every combination of question and year, give the pieces names of your choice, and use list2env to create the individual datasets in the global environment.
data <- split(df, list(df$question, df$year))
names(data) <- sub('(\\d+)\\.(\\d+)', 'question\\1_year\\2', names(data))
names(data)
# [1] "question1_year2018" "question2_year2018" "question3_year2018"
# [4] "question4_year2018" "question5_year2018" "question6_year2018"
# [7] "question7_year2018" "question8_year2018" "question9_year2018"
#[10] "question10_year2018"
list2env(data, .GlobalEnv)
question1_year2018
# A tibble: 1 x 5
# id year round question correct
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 25 2018 1 1 0
question2_year2018
# A tibble: 1 x 5
# id year round question correct
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 25 2018 1 2 0
Creating many datasets in the global environment is not good practice. Keep them in a list instead; they are easier to manage that way.
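To illustrate the list-based approach, here is a minimal sketch on a small stand-in data frame (its values are made up for illustration, not taken from the question). It assumes `drop = TRUE` is wanted so that question/year combinations without rows are skipped.

```r
# Small stand-in for the question's data (values are illustrative only)
df <- data.frame(
  year     = c(2018, 2018, 2019, 2019),
  question = c(1, 2, 1, 2),
  correct  = c(0, 1, 1, 0)
)

# Split by every question/year combination; drop = TRUE skips empty combinations
pieces <- split(df, list(df$question, df$year), drop = TRUE)
names(pieces) <- sub("(\\d+)\\.(\\d+)", "question\\1_year\\2", names(pieces))

# Pull out one piece by name instead of creating a global object
q1_2018 <- pieces[["question1_year2018"]]

# Apply the same computation to every piece at once
share_correct <- sapply(pieces, function(d) mean(d$correct))
```

Because everything stays in `pieces`, the same code works unchanged when the columns have many more values.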

Related

How to find sum of a column given the date and month is the same

I am wondering how I can find the sum of a column (in this case the AgeGroup_20_to_24 column) for each month and year. Here's the sample data:
https://i.stack.imgur.com/E23Th.png
I essentially want to find the total number of cases per month/year.
For example: 01/2020 = total sum of cases for the AgeGroup
02/2020 = total sum of cases for the AgeGroup
I tried doing this, however I get this:
https://i.stack.imgur.com/1eH0O.png
xAge20To24 <- covid%>%
mutate(dates=mdy(Date), year = year(dates), month = month(dates))%>%
mutate(total = sum(AgeGroup_20_to_24))%>%
select(Date, year, month, AgeGroup_20_to_24)%>%
group_by(year)
View(xAge20To24)
Any help will be appreciated.
structure(list(Date = c("3/9/2020", "3/10/2020", "3/11/2020",
"3/12/2020", "3/13/2020", "3/14/2020"), AgeGroup_0_to_19 = c(1,
0, 2, 0, 0, 2), AgeGroup_20_to_24 = c(1, 0, 2, 0, 2, 1), AgeGroup_25_to_29 = c(1,
0, 1, 2, 2, 2), AgeGroup_30_to_34 = c(0, 0, 2, 3, 4, 3), AgeGroup_35_to_39 = c(3,
1, 2, 1, 2, 1), AgeGroup_40_to_44 = c(1, 2, 1, 3, 3, 1), AgeGroup_45_to_49 = c(1,
0, 0, 2, 0, 1), AgeGroup_50_to_54 = c(2, 1, 1, 1, 0, 1), AgeGroup_55_to_59 = c(1,
0, 1, 1, 1, 2), AgeGroup_60_to_64 = c(0, 2, 2, 1, 1, 3), AgeGroup_70_plus = c(2,
0, 2, 0, 0, 0)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
I'm not sure your question and your data match up: you ask for by-month summaries, but the sample data only contains March entries. Below are two ways of summarizing the data, one grouped by the full date and one grouped by day, since grouping by month is not informative here. If your full data set covers more months, swap day for month in the grouping. First, a quick summary by date can be done with this code:
#### Load Library ####
library(tidyverse)
library(lubridate)
#### Pivot and Summarise Data ####
covid %>%
  pivot_longer(cols = c(everything(),
                        -Date),
               names_to = "AgeGroup",
               values_to = "Cases") %>%
  group_by(Date) %>%
  summarise(Sum_Cases = sum(Cases))
This pivots your data into long format, groups by the entire date, then summarizes the cases, which gives you this by-date sum of data:
# A tibble: 6 × 2
Date Sum_Cases
<chr> <dbl>
1 3/10/2020 6
2 3/11/2020 16
3 3/12/2020 14
4 3/13/2020 15
5 3/14/2020 17
6 3/9/2020 13
Using the same pivot_longer principle, you can mutate the data to date format like you already did, pivot to longer format, then group by day, thereafter summarizing the cases:
#### Theoretical Example ####
covid %>%
  mutate(Date = mdy(Date),
         Year = year(Date),
         Month = month(Date),
         Day = day(Date)) %>%
  pivot_longer(cols = c(everything(),
                        -Date, -Year, -Month, -Day),
               names_to = "AgeGroup",
               values_to = "Cases") %>%
  group_by(Day) %>% # use by day instead of month
  summarise(Sum_Cases = sum(Cases))
The result is shown below; the 14th had the most cases:
# A tibble: 6 × 2
Day Sum_Cases
<int> <dbl>
1 9 13
2 10 6
3 11 16
4 12 14
5 13 15
6 14 17
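The attempt in the question calls sum() inside mutate() before any grouping, so it computes one grand total, and the final group_by() only marks groups without summarising. If the goal is the month/year total of a single column such as AgeGroup_20_to_24, a minimal base R sketch could look like the following. It assumes the dates are month/day/year strings as in the sample, and the `year_month` column name is made up for illustration.

```r
# Two columns copied from the sample data in the question
covid <- data.frame(
  Date = c("3/9/2020", "3/10/2020", "3/11/2020",
           "3/12/2020", "3/13/2020", "3/14/2020"),
  AgeGroup_20_to_24 = c(1, 0, 2, 0, 2, 1)
)

# Parse the month/day/year strings and build a "YYYY-MM" key (hypothetical name)
dates <- as.Date(covid$Date, format = "%m/%d/%Y")
covid$year_month <- format(dates, "%Y-%m")

# One row per month/year with the summed cases for this age group
monthly <- aggregate(AgeGroup_20_to_24 ~ year_month, data = covid, FUN = sum)
```

With the sample data there is only one month, so `monthly` has a single row; with a full year of data it would have one row per month.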

Create new column with if else in R

I have a database like this:
structure(list(code = c(1, 2, 3, 4), age = c(25, 30, 45, 50),
car = c(0, 1, 0, 1)), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
I want to create a column "drivers under 40" with these conditions:
0 if Age<40 & car==0
1 if Age<40 & car==1
How do I create the third column with these conditions?
I tried using ifelse to create the variable, but it doesn't work.
drivers <- ifelse((age <= 40) & (car==0), 0, ifelse((age<=40) & (car==1), 1))
Is the code written incorrectly?
Is there another way to do it? I am afraid of messing up the parentheses, so I would prefer a different, ideally quicker, method if there is one.
Here is a dplyr version with case_when
library(dplyr)
df %>%
  mutate(drivers_under_40 = case_when(age <= 40 & car == 0 ~ 0,
                                      age <= 40 & car == 1 ~ 1,
                                      TRUE ~ NA_real_))
code age car drivers_under_40
<dbl> <dbl> <dbl> <dbl>
1 1 25 0 0
2 2 30 1 1
3 3 45 0 NA
4 4 50 1 NA
A base R option
df1$drivers_under_40 <- with(df1, (age <= 40 & car == 1)* NA^(age> 40))
df1$drivers_under_40
[1] 0 1 NA NA
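The one-liner above is compact but easy to misread. The sketch below spells out its two parts, assuming `df1` is the data frame from the question: `NA^0` evaluates to 1 and `NA^1` to NA, so the second factor acts as a mask that leaves rows with age <= 40 untouched and turns the others into NA.

```r
# Data from the question
df1 <- data.frame(code = c(1, 2, 3, 4), age = c(25, 30, 45, 50), car = c(0, 1, 0, 1))

# NA^FALSE is NA^0, which is 1; NA^TRUE is NA^1, which is NA
mask <- NA^(df1$age > 40)

# The logical condition becomes 1/0 when multiplied; the mask adds NA for age > 40
flag <- (df1$age <= 40 & df1$car == 1) * mask
```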
Unless you work with dplyr, you have to tell ifelse which data frame the columns come from, for example with data$column. You also have to assign the result to a new column, and the final else value of the nested ifelse is missing.
Your ifelse statement should look like this:
data = structure(list(code = c(1, 2, 3, 4), age = c(25, 30, 45, 50),
car = c(0, 1, 0, 1)), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
data$drivers <- ifelse((data$age <= 40) & (data$car==0), 0, ifelse((data$age<=40) & (data$car==1), 1, "here you have to fill another 'else' value"))
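Note that a character placeholder as the final else value turns the whole column into character. As a sketch of one possible completion, assuming drivers over 40 should simply be missing (the question does not say), the final value can be NA_real_ so the column stays numeric:

```r
# Data from the question
data <- data.frame(code = c(1, 2, 3, 4), age = c(25, 30, 45, 50), car = c(0, 1, 0, 1))

# Assumption: rows that match neither condition (age > 40) get NA
data$drivers <- ifelse(data$age <= 40 & data$car == 0, 0,
                ifelse(data$age <= 40 & data$car == 1, 1, NA_real_))
```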

Using ddply in combo with weighted.mean in a for loop with dynamic variables

My dataset looks like this:
structure(list(GEOLEV2 = structure(c("768001001", "768001001",
"768001002", "768001002", "768001006", "768001006", "768001002",
"768001002", "768001002", "768001002", "768002016", "768002016"
), format.stata = "%9s"), DHSYEAR = structure(c(1988, 1988, 1988,
1988, 1998, 1998, 1998, 1998, 2013, 2013, 2013, 2013), format.stata = "%9.0g"),
v005 = structure(c(1e+06, 1e+06, 1e+06, 1e+06, 1815025, 1815025,
1517492, 1517492, 1350366, 1350366, 617033, 617033), format.stata = "%9.0g"),
age = structure(c(37, 22, 18, 46, 15, 29, 18, 42, 19, 15,
35, 16), format.stata = "%9.0g"), highest_year_edu = structure(c(2,
6, NA, NA, 5, NA, 2, 3, 2, NA, 5, 3), format.stata = "%9.0g")), row.names = c(NA,
-12L), class = c("tbl_df", "tbl", "data.frame"), label = "Written by R")
I want to collapse it by GEOLEV2 and DHSYEAR, using weighted.mean as the collapsing function. Each variable should keep its name.
I chose the function ddply, and it works when I try it on a single variable:
ddply(df1, ~ df1$GEOLEV2+ df1$DHSYEAR, summarise, age = weighted.mean(age, v005, na.rm = TRUE))
However, when I build the loop, the function returns an error. My attempt was:
df1_collapsed <- ddply(df1, ~ df1$GEOLEV2+ df1$DHSYEAR, summarise, age = weighted.mean(age, v005, na.rm = TRUE))
for (i in names(df1[4,5)) {
variable <- ddply(df1, ~ df1$GEOLEV2+ df1$DHSYEAR, summarise, i = weighted.mean(i, v005, na.rm = TRUE))
df1_collapsed <- left_join(df1_collapsed, variable, by = c("df1$GEOLEV2", "df1$DHSYEAR"))
}
and the error is
Error in weighted.mean.default(i, v005, na.rm = TRUE) :
'x' and 'w' must have the same length
How can I build the for loop, embedding the variable name in the loop?
In general in R you don't need loops for grouping and summarising (which you would call collapsing in Stata). You can use dplyr for this type of operation:
library(dplyr)
df1 %>%
  group_by(GEOLEV2, DHSYEAR) %>%
  summarise(
    across(age:highest_year_edu, ~ weighted.mean(.x, v005, na.rm = TRUE))
  )
# A tibble: 6 x 4
# Groups: GEOLEV2 [4]
# GEOLEV2 DHSYEAR age highest_year_edu
# <chr> <dbl> <dbl> <dbl>
# 1 768001001 1988 29.5 4
# 2 768001002 1988 32 NaN
# 3 768001002 1998 30 2.5
# 4 768001002 2013 17 2
# 5 768001006 1998 22 5
# 6 768002016 2013 25.5 4
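As for the error in the question: inside the loop, `i` holds a column name as a character string, so weighted.mean(i, v005, ...) receives a length-one string as `x`, which is why 'x' and 'w' have different lengths. If a loop over column names is still wanted, a minimal base R sketch could index the columns with `[[ ]]`. It runs on a shortened copy of the data and does not use plyr; `vars`, `groups` and `collapsed` are names chosen here for illustration.

```r
# Shortened copy of the question's data
df1 <- data.frame(
  GEOLEV2 = c("768001001", "768001001", "768001002", "768001002"),
  DHSYEAR = c(1988, 1988, 1998, 1998),
  v005 = c(1e6, 1e6, 1517492, 1517492),
  age = c(37, 22, 18, 42),
  highest_year_edu = c(2, 6, 2, 3),
  stringsAsFactors = FALSE
)

vars <- c("age", "highest_year_edu")  # columns to collapse

# One data frame per GEOLEV2/DHSYEAR combination that actually occurs
groups <- split(df1, list(df1$GEOLEV2, df1$DHSYEAR), drop = TRUE)

collapsed <- do.call(rbind, lapply(groups, function(g) {
  out <- g[1, c("GEOLEV2", "DHSYEAR")]
  for (v in vars) {
    # g[[v]] looks the column up by its name, so v can be a string
    out[[v]] <- weighted.mean(g[[v]], g$v005, na.rm = TRUE)
  }
  out
}))
rownames(collapsed) <- NULL
```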

Struggling to find the total number of rows that meet a certain variable grouped by another variable

I'm performing some light analysis on an NFL kickers dataset and am trying to find the total number of kicks made from 18-29 yards, grouped by kicker. Each row of the dataset is a made or missed field goal for a kicker, along with the distance and some other variables irrelevant to this issue. I'm using group_by() and then the sum function within summarise(), but it returns 1 for every kicker. I have tried different combinations, including filter(), but the results keep returning 1 for each kicker. Pics of my code are attached. Any help is appreciated :)
Some code I have tried:
kicks20to29 <- nfl_kicks1%>%
group_by(Kicker)%>%
count(filter(nfl_kicks1$`FG Length`>=18 & nfl_kicks1$`FG Length`<=29))
kicks20to29 <- nfl_kicks1%>%
group_by(Kicker)%>%
filter(`FG Length`>=18 & `FG Length`<=29)
dput output:
structure(list(Quarter = c(1, 2, 1, 2, 2, 4), `Possession Team` = c("NE",
"NE", "NE", "NE", "NE", "NE"), `Wind Speed` = c("6", "6", "12",
"12", "12", "12"), Down = c(4, 4, 4, 4, 4, 4), Distance = c(13,
7, 2, 6, 9, 12), YardLine = c(22, 20, 2, 6, 35, 25), `FG Length` = c(39,
37, 19, 23, 52, 42), `4Q to tie or take lead` = c(0, 0, 0, 0,
0, 0), Result = c("Miss", "Miss", "Good", "Good", "Good", "Miss"
), `Success Rate` = c(0, 0, 1, 1, 1, 0), Kicker = c("A.Vinatieri",
"A.Vinatieri", "A.Vinatieri", "A.Vinatieri", "A.Vinatieri", "A.Vinatieri"
), `# career kicks in study` = c(766, 766, 766, 766, 766, 766
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))
One approach is to use the tally function, which counts the number of rows per group.
library(tidyverse)
nfl_kicks1 %>%
group_by(Kicker) %>%
dplyr::filter(`FG Length` >= 18 & `FG Length` <= 29) %>%
tally(name = "Number of Kicks")
## A tibble: 1 x 2
# Kicker `Number of Kicks`
# <chr> <int>
#1 A.Vinatieri 2
You can also use group_by + summarise:
library(dplyr)
nfl_kicks1 %>%
group_by(Kicker) %>%
summarise(n_kicks = sum(`FG Length` >= 18 & `FG Length` <= 29))
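The two answers differ in one detail: filtering first drops kickers with no kicks in the range, whereas summing the logical condition keeps them with a count of 0. The base R sketch below shows both behaviours on a tiny made-up data frame; "B.Example" is a hypothetical kicker and `fg_length` a simplified column name.

```r
# Tiny illustrative data; "B.Example" is a hypothetical second kicker
kicks <- data.frame(
  Kicker = c("A.Vinatieri", "A.Vinatieri", "A.Vinatieri", "B.Example", "B.Example"),
  fg_length = c(39, 19, 23, 45, 52),
  stringsAsFactors = FALSE
)

# TRUE for kicks between 18 and 29 yards
in_range <- kicks$fg_length >= 18 & kicks$fg_length <= 29

# Summing a logical vector counts its TRUE values; every kicker is kept
n_kicks <- tapply(in_range, kicks$Kicker, sum)

# Filtering first only counts kickers that have at least one kick in range
n_filtered <- table(kicks$Kicker[in_range])
```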

Iterate over dplyr code using purrr::map2

I am relatively new to R, so my apologies if this question is too basic.
I have transactions that show quantity sold and revenue earned from different products. Because there are three products, there are 2^3 = 8 combinations for selling these products in a "basket." Each basket could be sold in any of the three given years (2016, 2017, 2018) and in any of the zones (East and West). [I have 3 years worth of transactions for the two zones: East and West.]
My objective is to analyze how much revenue is earned, how many quantities are sold, and how many transactions occurred for each combination of these products in a given year for a given zone.
I was able to do the above operation (using purrr::map) by splitting the data based on zones. I have created a list of two data frames that hold data grouped by "year" for each combination described above. This works well. However, the code is a little clunky in my opinion. There are a lot of repetitive statements. I want to be able to create a list of 2X3 (i.e. 2 zones and 3 years)
Here's my code using zone-wise splitting.
First Try
UZone <- unique(Input_File$Zone)
FYear <- unique(Input_File$Fiscal.Year)
#Split based on zone
a<-purrr::map(UZone, ~ dplyr::filter(Input_File, Zone == .)) %>%
#Create combinations of products
purrr::map(~mutate_each(.,funs(Exists = . > 0), L.Rev:I.Qty )) %>%
#group by Fiscal Year
purrr::map(~group_by_(.,.dots = c("Fiscal.Year", grep("Exists", names(.), value = TRUE)))) %>%
#Summarize, delete unwanted columns and rename the "number of transactions" column
purrr::map(~summarise_each(., funs(sum(., na.rm = TRUE), count = n()), L.Rev:I.Qty)) %>%
purrr::map(~select(., Fiscal.Year:L.Rev_count)) %>%
purrr::map(~plyr::rename(.,c("L.Rev_count" = "No.Trans")))
#Now do Zone and Year-wise splitting : Try 1
EastList<-a[[1]]
EastList <- EastList %>% split(.$Fiscal.Year)
WestList<-a[[2]]
WestList <- WestList %>% split(.$Fiscal.Year)
write.xlsx(EastList , file = "East.xlsx",row.names = FALSE)
write.xlsx(WestList , file = "West.xlsx",row.names = FALSE)
As you can see, the above code is very clunky. With limited knowledge of R, I researched https://blog.rstudio.org/2016/01/06/purrr-0-2-0/ and read purrr::map2() manual but I couldn't find too many examples. After reading the solution at How to add list of vector to list of data.frame objects as new slot by parallel?, I am assuming that I could use X = zone and Y= Fiscal Year to do what I have done above.
Here's what I tried:
Second Try
#Now try Zone and Year-wise splitting : Try 2
purrr::map2(UZone,FYear, ~ dplyr::filter(Input_File, Zone == ., Fiscal.Year == .))
But this code doesn't work. I get this error message:
Error: .x (2) and .y (3) are different lengths
Question 1: Can I use map2 to do what I am trying to do? If not, is there a better way?
Question 2: If map2 can be used, how can I generate the two Excel files with one command? As you can see above, I currently have two function calls; I would like to have only one.
Question 3: Instead of the two statements below, is there a way to do the sum and the count in one statement? I am looking for a cleaner way to do both.
purrr::map(~summarise_each(., funs(sum(., na.rm = TRUE), count = n()), L.Rev:I.Qty)) %>%
purrr::map(~select(., Fiscal.Year:L.Rev_count)) %>%
Can someone please help me?
Here's my data:
dput(Input_File)
structure(list(Zone = c("East", "East", "East", "East", "East",
"East", "East", "West", "West", "West", "West", "West", "West",
"West"), Fiscal.Year = c(2016, 2016, 2016, 2016, 2016, 2016,
2017, 2016, 2016, 2016, 2017, 2017, 2018, 2018), Transaction.ID = c(132,
133, 134, 135, 136, 137, 171, 171, 172, 173, 175, 176, 177, 178
), L.Rev = c(3, 0, 0, 1, 0, 0, 2, 1, 1, 2, 2, 1, 2, 1), L.Qty = c(3,
0, 0, 1, 0, 0, 1, 1, 1, 2, 2, 1, 2, 1), A.Rev = c(0, 0, 0, 1,
1, 1, 0, 0, 0, 0, 0, 1, 0, 0), A.Qty = c(0, 0, 0, 2, 2, 3, 0,
0, 0, 0, 0, 3, 0, 0), I.Rev = c(4, 4, 4, 0, 1, 0, 3, 0, 0, 0,
1, 0, 1, 1), I.Qty = c(2, 2, 2, 0, 1, 0, 3, 0, 0, 0, 1, 0, 1,
1)), .Names = c("Zone", "Fiscal.Year", "Transaction.ID", "L.Rev",
"L.Qty", "A.Rev", "A.Qty", "I.Rev", "I.Qty"), row.names = c(NA,
14L), class = "data.frame")
Output Format:
Here's the code to generate the output. I would love to see EastList.2016 and EastList.2017 as two sheets in one Excel file, and WestList.2016, WestList.2017 and WestList.2018 as 3 sheets in one Excel file.
#generate the output:
EastList.2016 <- EastList[[1]]
EastList.2017 <- EastList[[2]]
WestList.2016 <- WestList[[1]]
WestList.2017 <- WestList[[2]]
WestList.2018 <- WestList[[3]]
Two lists broken down by year with sums and counts for each?
In dplyr:
(df <- your dataframe)
df %>%
group_by(Zone, Fiscal.Year) %>%
summarise_at(vars(L.Rev:I.Qty), funs(sum = sum, cnt = n()))
Source: local data frame [5 x 14]
Groups: Zone [?]
Zone Fiscal.Year L.Rev_sum L.Qty_sum A.Rev_sum A.Qty_sum I.Rev_sum I.Qty_sum L.Rev_cnt L.Qty_cnt A.Rev_cnt A.Qty_cnt I.Rev_cnt I.Qty_cnt
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int> <int> <int> <int> <int>
1 East 2016 4 4 3 7 13 7 6 6 6 6 6 6
2 East 2017 2 1 0 0 3 3 1 1 1 1 1 1
3 West 2016 4 4 0 0 0 0 3 3 3 3 3 3
4 West 2017 3 3 1 3 1 1 2 2 2 2 2 2
5 West 2018 3 3 0 0 2 2 2 2 2 2 2 2
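On Question 1: map2 walks over its two inputs in parallel, pairing elements by position, which is why two zones and three years produce the length error. In the formula, `.` refers only to the first input, so the second would have to be written as `.y`. To get one data frame per zone/year combination, splitting on both columns is a simpler route. The sketch below uses base R on a shortened, made-up copy of the data; `by_zone_year` and `by_zone` are names chosen for illustration.

```r
# Shortened illustrative copy of Input_File
Input_File <- data.frame(
  Zone = c("East", "East", "West", "West", "West"),
  Fiscal.Year = c(2016, 2017, 2016, 2017, 2018),
  L.Rev = c(3, 2, 1, 2, 2),
  stringsAsFactors = FALSE
)

# Flat list: one data frame per zone/year combination that actually occurs
by_zone_year <- split(Input_File,
                      list(Input_File$Zone, Input_File$Fiscal.Year),
                      drop = TRUE)

# Nested list: one element per zone, each holding one data frame per year
by_zone <- lapply(split(Input_File, Input_File$Zone),
                  function(z) split(z, z$Fiscal.Year))
```

Each element of `by_zone` has the same shape as EastList and WestList in the question, so whatever is done with those two lists can be done in a single loop over `by_zone`.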
