Calculate a new dataframe via applying calculations to rows in R - r

I want to re-work the data that I have in a dataframe based on adding certain values up, and to do that to all of the (numerical) columns in the same way.
In code, I have created a dataframe that is structured a bit like this:
library(tibble)
df_in <- tribble(~names, ~a, ~a_pc, ~b, ~b_pc,
"Three star", 1L, 1, 2L, 1,
"Two star", 5L, 5, 12L, 6,
"One star", 6L, 6, 100L, 50,
"No star", 88L, 88, 86L, 43,
"Empty", 0L, 0, 0L, 0,
"Also empty", 0L, 0, 0L, 0)
In my output I want to have one row that contains the sums of three rows in the input dataframe, another row that contains the sum of two of them, and one that contains the contents of a row from the original (but renamed).
I also want to keep other rows if they have numbers but to drop them if they are empty. I would prefer to do that programmatically, but can do it manually with indexing if need be, so that's a bit less important.
My desired output would be a bit like this:
df_out <- tribble(~names, ~a, ~a_pc, ~b, ~b_pc,
"Any stars", 12L, 12, 114L, 57,
"... of which at least 2 stars", 6L, 6, 14L, 7,
"... of which 3 stars", 1L, 1, 2L, 1,
"No star", 88L, 88, 86L, 43)
For example, that 12L in the top left (meaning column a, "Any stars") is the sum of the 1L, 5L and 6L entries in the a column of the input.
I want to do this merging of rows at this stage in my processing because it's important to do it after I've already calculated the percentage columns (..._pc in the example). You'll see that in the output the percentage columns add to more than 100, which is correct because there is deliberately some 'double counting' - things can correctly show up in multiple rows if they meet the conditions.
Edit to add: Note that the labels I am using in the $names column of the test dataset df_in are not the real labels I have in my real situation. I imagine that a workable solution to this will be able to somehow take a collection of vectors that specify sets of rows and another collection of the same number of strings to label the new rows, and process through them. I might be able to define the sets of rows and the associated names like this:
set_1 <- c("Three star", "Two star", "One star")
set_2 <- c("Three star", "Two star")
set_3 <- "Three star"
set_4 <- "No star"
new_name_1 <- "Any stars"
new_name_2 <- "... of which at least 2 stars"
new_name_3 <- "... of which 3 stars"
new_name_4 <- "No star"

We may use imap to loop over patterns (as some cases are overlapping) and do the group by sum across those columns (after filtering the rows)
library(dplyr) # for filter, group_by, summarise, across and %>%
library(purrr)
library(stringr)
imap_dfr(setNames(c('(?<!No) star', 'Two|Three', 'Three', 'Empty|No'),
c("Any stars", "... of which at least 2 stars",
"... of which 3 stars", "No star" )), ~ df_in %>%
filter(str_detect(names, regex(.x, ignore_case = TRUE))) %>%
group_by(names = .y) %>%
summarise(across(everything(), sum)))
-output
# A tibble: 4 x 5
names a a_pc b b_pc
<chr> <int> <dbl> <int> <dbl>
1 Any stars 12 12 114 57
2 ... of which at least 2 stars 6 6 14 7
3 ... of which 3 stars 1 1 2 1
4 No star 88 88 86 43
OP's expected
> df_out
# A tibble: 4 x 5
names a a_pc b b_pc
<chr> <int> <dbl> <int> <dbl>
1 Any stars 12 12 114 57
2 ... of which at least 2 stars 6 6 14 7
3 ... of which 3 stars 1 1 2 1
4 No star 88 88 86 43
Update
If the OP is passing a custom set of names
map2_dfr(mget(ls(pattern = '^set_\\d+')),
mget(ls(pattern = '^new_name_\\d+')),
~ df_in %>%
filter(names %in% .x) %>%
group_by(names = .y) %>%
summarise(across(everything(), sum)))
# A tibble: 4 x 5
names a a_pc b b_pc
<chr> <int> <dbl> <int> <dbl>
1 Any stars 12 12 114 57
2 ... of which at least 2 stars 6 6 14 7
3 ... of which 3 stars 1 1 2 1
4 No star 88 88 86 43
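As a variation on the update above, the sets and their labels could live in one named list instead of separate set_1/new_name_1 variables, which avoids looking objects up by name pattern. The following is a minimal base R sketch under that assumption; row_sets, num_cols and sum_rows are illustrative names that do not appear in the original answer, and df_in is rebuilt here as a plain data.frame. Note that colSums returns doubles, so the integer columns come back as numeric.

```r
# Input mirroring df_in from the question, as a plain data.frame
df_in <- data.frame(
  names = c("Three star", "Two star", "One star", "No star", "Empty", "Also empty"),
  a     = c(1L, 5L, 6L, 88L, 0L, 0L),
  a_pc  = c(1, 5, 6, 88, 0, 0),
  b     = c(2L, 12L, 100L, 86L, 0L, 0L),
  b_pc  = c(1, 6, 50, 43, 0, 0)
)

# Hypothetical structure: name = new row label, value = input rows to add up
row_sets <- list(
  "Any stars"                     = c("Three star", "Two star", "One star"),
  "... of which at least 2 stars" = c("Three star", "Two star"),
  "... of which 3 stars"          = "Three star",
  "No star"                       = "No star"
)

# Every numeric column is summed in the same way
num_cols <- names(df_in)[sapply(df_in, is.numeric)]

# Build one output row: the label plus the column sums of the matching rows
sum_rows <- function(label) {
  keep <- df_in$names %in% row_sets[[label]]
  sums <- colSums(df_in[keep, num_cols, drop = FALSE])
  data.frame(names = label, as.list(sums), check.names = FALSE)
}

df_out <- do.call(rbind, lapply(names(row_sets), sum_rows))
df_out
```

Because a row may appear in several sets, the overlapping "double counting" described in the question is preserved.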

Related

How do I pivot and join 3 different tables by a ID column using dplyr?

Say I want to join these 3 dataframes with dplyr. How do I do it? I know I should use some combination of pivots and joins, but I can't figure out how to get it right.
My goal is to have the df something like this:
mpg_deciles mean_mpg mean_price production coefficient
1 13.5 12990 Foreign 12990
2 16 10874 Domestic 10874.8571428572
Here's the data
library(dplyr)
a <- tibble::tribble(
~mpg_deciles, ~mean_mpg,
1L, 13.5,
2L, 16,
3L, 17.75,
4L, 18.625,
5L, 19.7142857142857)
b <- tibble::tribble(
~coeff_foreign, ~mpg_deciles, ~mean_p_foreign, ~foreign,
12990, 2, 12990, "Foreign",
-2147.49999999997, 3, 10842.5, "Foreign",
-7180.99999999996, 4, 5809.00000000003, "Foreign",
-6777.49999999999, 6, 6212.5, "Foreign",
-6435.3333333333, 7, 6554.66666666669, "Foreign")
c <- tibble::tribble(
~coeff_domestic, ~mpg_deciles, ~mean_p_domestic, ~foreign,
10874.8571428572, 1L, 10874.8571428572, "Domestic",
-3697.73214285716, 2L, 7177.125, "Domestic",
-6031.19047619049, 3L, 4843.66666666666, "Domestic",
-6365.35714285716, 4L, 4509.5, "Domestic",
-4650.42857142859, 5L, 6224.42857142857, "Domestic")
I think you need to pre-process b and c and then use a left_join:
library(dplyr)
a %>%
left_join(
bind_rows(
b %>%
rename(coefficient = coeff_foreign, mean_price = mean_p_foreign, production = foreign),
c %>%
rename(coefficient = coeff_domestic, mean_price = mean_p_domestic, production = foreign)
),
by = "mpg_deciles"
)
This returns
# A tibble: 8 x 5
mpg_deciles mean_mpg coefficient mean_price production
<dbl> <dbl> <dbl> <dbl> <chr>
1 1 13.5 10875. 10875. Domestic
2 2 16 12990 12990 Foreign
3 2 16 -3698. 7177. Domestic
4 3 17.8 -2147. 10842. Foreign
5 3 17.8 -6031. 4844. Domestic
6 4 18.6 -7181. 5809. Foreign
7 4 18.6 -6365. 4510. Domestic
8 5 19.7 -4650. 6224. Domestic
The pre-processing changes the coeff_foreign and coeff_domestic (same for mean_p_) columns into columns of the same name. If now the two data.frames are appended to each other, all values with the same column names go into the respective (same) columns. Without this pre-processing the columns with different names (e.g. coeff_foreign and coeff_domestic) would not end in the same column, but two columns are created (coeff_foreign and coeff_domestic) where the values are stored. In this case left_join would not achieve the desired result.
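To see the effect of the renaming in isolation, here is a small sketch with toy tables. The values and the names foreign_tbl, domestic_tbl, unaligned and aligned are made up for illustration and are not the question's data.

```r
library(dplyr)

# Toy tables with differently named coefficient columns (hypothetical values)
foreign_tbl  <- tibble(mpg_deciles = 1:2, coeff_foreign  = c(10, 20))
domestic_tbl <- tibble(mpg_deciles = 1:2, coeff_domestic = c(30, 40))

# Without renaming: bind_rows keeps two separate columns and pads with NA
unaligned <- bind_rows(foreign_tbl, domestic_tbl)

# With renaming: both sets of values stack into one shared column
aligned <- bind_rows(
  rename(foreign_tbl,  coefficient = coeff_foreign),
  rename(domestic_tbl, coefficient = coeff_domestic)
)

unaligned
aligned
```

Here unaligned has three columns with NA in half of each coefficient column, while aligned has a single coefficient column with no NA.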
Updated version: Thanks to @Martin Gal's input:
We could use a nested left_join:
library(dplyr)
library(tidyr) # for pivot_longer
left_join(a, b, by='mpg_deciles') %>%
left_join(., c, by='mpg_deciles') %>%
select(-starts_with("foreign")) %>%
pivot_longer(-c("mpg_deciles", "mean_mpg"), names_pattern = "(coeff|mean_p)_(.*)", names_to = c(".value", "production"), values_drop_na = TRUE)
mpg_deciles mean_mpg production coeff mean_p
<dbl> <dbl> <chr> <dbl> <dbl>
1 1 13.5 domestic 10875. 10875.
2 2 16 foreign 12990 12990
3 2 16 domestic -3698. 7177.
4 3 17.8 foreign -2147. 10842.
5 3 17.8 domestic -6031. 4844.
6 4 18.6 foreign -7181. 5809.
7 4 18.6 domestic -6365. 4510.
8 5 19.7 domestic -4650. 6224.

Making a table that contains Mean and SD of a Dataset

I am using this dataset: http://www.openintro.org/stat/data/cdc.R
to create a table from a subset that only contains the means and standard deviations of male participants. The table should look like this:
Mean Standard Deviation
Age: 44.27 16.715
Height: 70.25 3.009219
Weight: 189.3 36.55036
Desired Weight: 178.6 26.25121
I created a subset for males and females with this code:
mdata <- subset(cdc, cdc$gender == ("m"))
fdata <- subset(cdc, cdc$gender == ("f"))
How should I create a table that only contains means and SDs of age, height, weight, and desired weight using these subsets?
The data frame you provided sucked up all the memory on my laptop, and it's not needed to provide that much data to solve your problem. Here's a dplyr/tidyr solution to create a summary table grouped by categories, using the starwars dataset available with dplyr:
library(dplyr)
library(tidyr)
starwars |>
group_by(sex) |>
summarise(across(
where(is.numeric),
.fns = list(Mean = mean, SD = sd), na.rm = TRUE,
.names = "{col}__{fn}"
)) |>
pivot_longer(-sex, names_to = c("var", ".value"), names_sep = "__")
# A tibble: 15 × 4
sex var Mean SD
<chr> <chr> <dbl> <dbl>
1 female height 169. 15.3
2 female mass 54.7 8.59
3 female birth_year 47.2 15.0
4 hermaphroditic height 175 NA
5 hermaphroditic mass 1358 NA
6 hermaphroditic birth_year 600 NA
7 male height 179. 36.0
8 male mass 81.0 28.2
9 male birth_year 85.5 157.
10 none height 131. 49.1
11 none mass 69.8 51.0
12 none birth_year 53.3 51.6
13 NA height 181. 2.89
14 NA mass 48 NA
15 NA birth_year 62 NA
Just make a data frame of colMeans and column sd. Note, that you may also select columns.
fdata <- subset(cdc, gender == "f", select=c("age", "height", "weight", "wtdesire"))
data.frame(mean=colMeans(fdata), sd=apply(fdata, 2, sd))
# mean sd
# age 45.79772 17.584420
# height 64.36775 2.787304
# weight 151.66619 34.297519
# wtdesire 133.51500 18.963014
You can also use by to do it simultaneously for both groups, it's basically a combination of split and lapply. (To avoid apply when calculating column SDs, you could also use sd=matrixStats::colSds(as.matrix(fdata)) which is considerably faster.)
res <- by(cdc[c("age", "height", "weight", "wtdesire")], cdc$gender, \(x) {
data.frame(mean=colMeans(x), sd=matrixStats::colSds(as.matrix(x)))
})
res
# cdc$gender: m
# mean sd
# age 44.27307 16.719940
# height 70.25165 3.009219
# weight 189.32271 36.550355
# wtdesire 178.61657 26.251215
# ------------------------------------------------------------------------------------------
# cdc$gender: f
# mean sd
# age 45.79772 17.584420
# height 64.36775 2.787304
# weight 151.66619 34.297519
# wtdesire 133.51500 18.963014
To extract only one of the data frames in the list-like object use e.g. res$m.
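The remark that by is basically a combination of split and lapply can be checked on a toy example. This is only a sketch; the data frame toy and its values are invented for illustration.

```r
# Toy data (hypothetical): one grouping column and two numeric columns
toy <- data.frame(
  g = c("m", "f", "m", "f"),
  x = c(1, 2, 3, 6),
  y = c(10, 20, 30, 40)
)

# Explicit split + lapply: one set of column means per group
via_split <- lapply(split(toy[c("x", "y")], toy$g), colMeans)

# The same computation expressed with by()
via_by <- by(toy[c("x", "y")], toy$g, colMeans)

via_split$m
via_by[["m"]]
```

Both calls return the column means per group; by mainly adds the grouped print method seen in the output above.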
Usually we use aggregate for this, which you also might consider:
aggregate(cbind(age, height, weight, wtdesire) ~ gender, cdc, \(x) c(mean=mean(x), sd=sd(x))) |>
do.call(what=data.frame)
# gender age.mean age.sd height.mean height.sd weight.mean weight.sd wtdesire.mean wtdesire.sd
# 1 m 44.27307 16.71994 70.251646 3.009219 189.32271 36.55036 178.61657 26.25121
# 2 f 45.79772 17.58442 64.367750 2.787304 151.66619 34.29752 133.51500 18.96301
The pipe |> do.call(what=data.frame) is just needed to get rid of matrix columns, which is useful in case you aim to further process the data.
Note: R >= 4.1 used.
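The matrix-column behaviour of aggregate mentioned above is easy to miss, so here is a small sketch on invented data (toy, agg and flat are illustrative names). It uses function(v) instead of the \(x) shorthand so that it also runs on older R versions.

```r
# Toy data (hypothetical values)
toy <- data.frame(g = c("a", "a", "b", "b"), x = c(1, 3, 10, 14))

# The summary function returns two values, so x becomes ONE matrix column
agg <- aggregate(x ~ g, toy, function(v) c(mean = mean(v), sd = sd(v)))
ncol(agg)       # g plus the matrix column x
is.matrix(agg$x)

# do.call(data.frame, ...) spreads the matrix into ordinary columns
flat <- do.call(data.frame, agg)
names(flat)
```

After flattening, the columns are named x.mean and x.sd and can be used like any other data frame column.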
Data:
source('https://www.openintro.org/stat/data/cdc.R')
or
cdc <- structure(list(genhlth = structure(c(3L, 3L, 1L, 5L, 3L, 3L), levels = c("excellent",
"very good", "good", "fair", "poor"), class = "factor"), exerany = c(0,
1, 0, 0, 1, 1), hlthplan = c(1, 1, 1, 1, 1, 1), smoke100 = c(1,
0, 0, 0, 0, 1), height = c(69, 66, 73, 65, 67, 69), weight = c(224L,
215L, 200L, 216L, 165L, 170L), wtdesire = c(224L, 140L, 185L,
150L, 165L, 165L), age = c(73L, 23L, 35L, 57L, 81L, 83L), gender = structure(c(1L,
2L, 1L, 2L, 2L, 1L), levels = c("m", "f"), class = "factor")), row.names = c("19995",
"19996", "19997", "19998", "19999", "20000"), class = "data.frame")

Assign max value of group to all rows in that group

I would like to assign the max value of a group to all rows within that group. How do I do that?
I have a dataframe containing the names of the group and the max number of credits that belongs to it.
course_credits <- aggregate(bsc_academic$Credits, by = list(bsc_academic$Course_code), max)
which gives
Course Credits
1 ABC1000 6.5
2 ABC1003 6.5
3 ABC1004 6.5
4 ABC1007 5.0
5 ABC1010 6.5
6 ABC1021 6.5
7 ABC1023 6.5
The main dataframe looks like this:
Appraisal.Type Resits Credits Course_code Student_ID
Final result 0 6.5 ABC1000 10
Final result 0 6.5 ABC1003 10
Grade supervisor 0 0 ABC1000 10
Grade supervisor 0 0 ABC1003 10
Final result 0 12 ABC1294 23
Grade supervisor 0 0 ABC1294 23
As you see, student 10 took course ABC1000, worth 6.5 credits. For each course (per student), however, two rows exist: Final result and Grade supervisor. In the end, Final result should be deleted, but the credits should be kept. Therefore, I want to assign the max value of 6.5 to the Grade supervisor row.
Likewise, student 23 has followed course ABC1294, worth 12 credits.
In the end, this should be the result:
Appraisal.Type Resits Credits Course_code Student_ID
Grade supervisor 0 6.5 ABC1000 10
Grade supervisor 0 6.5 ABC1003 10
Grade supervisor 0 12 ABC1294 23
How do I go about this?
An option would be to group by 'Student_ID', mutate the 'Credits' with max of 'Credits' and filter the rows with 'Appraisal.Type' as "Grade supervisor"
library(dplyr)
df1 %>%
group_by(Student_ID) %>%
dplyr::mutate(Credits = max(Credits)) %>%
ungroup %>%
filter(Appraisal.Type == "Grade supervisor")
# A tibble: 2 x 5
# Appraisal.Type Resits Credits Course_code Student_ID
# <chr> <int> <dbl> <chr> <int>
#1 Grade supervisor 0 6.5 ABC1000 10
#2 Grade supervisor 0 6.5 ABC1003 10
If we also need 'Course_code' to be included in the grouping
df2 %>%
group_by(Student_ID, Course_code) %>%
dplyr::mutate(Credits = max(Credits)) %>%
filter(Appraisal.Type == "Grade supervisor")
# A tibble: 3 x 5
# Groups: Student_ID, Course_code [3]
# Appraisal.Type Resits Credits Course_code Student_ID
# <chr> <int> <dbl> <chr> <int>
#1 Grade supervisor 0 6.5 ABC1000 10
#2 Grade supervisor 0 6.5 ABC1003 10
#3 Grade supervisor 0 12 ABC1294 23
NOTE: In case the plyr package is also loaded, some functions can be masked, especially summarise/mutate, which also exist in plyr. To prevent this, either run the code in a fresh session without loading plyr or explicitly specify dplyr::mutate.
data
df1 <- structure(list(Appraisal.Type = c("Final result", "Final result",
"Grade supervisor", "Grade supervisor"), Resits = c(0L, 0L, 0L,
0L), Credits = c(6.5, 6.5, 0, 0), Course_code = c("ABC1000",
"ABC1003", "ABC1000", "ABC1003"), Student_ID = c(10L, 10L, 10L,
10L)), class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(Appraisal.Type = c("Final result", "Final result",
"Grade supervisor", "Grade supervisor", "Final result", "Grade supervisor"
), Resits = c(0L, 0L, 0L, 0L, 0L, 0L), Credits = c(6.5, 6.5,
0, 0, 12, 0), Course_code = c("ABC1000", "ABC1003", "ABC1000",
"ABC1003", "ABC1294", "ABC1294"), Student_ID = c(10L, 10L, 10L,
10L, 23L, 23L)), class = "data.frame", row.names = c(NA, -6L))
Generate a sample dataset.
data <- as.data.frame(list(Appraisal.Type = c(rep("Final result", 2), rep("Grade supervisor", 2)),
Resits = rep(0, 4),
Credits = c(rep(6.5, 2), rep(0, 2)),
Course_code = rep(c("ABC1000", "ABC1003"), 2),
Student_ID = rep(10, 4)))
Assign the max value of a group to all rows in this group and then delete rows that contain "Final result".
##Reassign the values of "Credits" column
for (i in 1: nlevels(as.factor(data$Course_code))) {
Course_code <- unique(data$Course_code)[i]
data$Credits [data$Course_code == Course_code] <- max (data$Credits [data$Course_code == Course_code])
}
##New dataset without "Final result" rows
data <- data[data$Appraisal.Type != "Final result",]
Here is the result.
data
Appraisal.Type Resits Credits Course_code Student_ID
3 Grade supervisor 0 6.5 ABC1000 10
4 Grade supervisor 0 6.5 ABC1003 10
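The loop above can also be written without explicit iteration. As a sketch, base R's ave() applies a function within groups and returns a vector of the same length as its input, which is what is needed to broadcast the group maximum to every row. The sample data is the same as in this answer; result is an illustrative name.

```r
# Same sample data as above
data <- data.frame(
  Appraisal.Type = c("Final result", "Final result",
                     "Grade supervisor", "Grade supervisor"),
  Resits = 0,
  Credits = c(6.5, 6.5, 0, 0),
  Course_code = c("ABC1000", "ABC1003", "ABC1000", "ABC1003"),
  Student_ID = 10
)

# Replace each Credits value by the maximum within its Course_code group
data$Credits <- ave(data$Credits, data$Course_code, FUN = max)

# Drop the "Final result" rows, keeping the credits on the remaining rows
result <- data[data$Appraisal.Type != "Final result", ]
result
```

If credits should be matched per student and course, a second grouping vector such as data$Student_ID can be passed to ave() before FUN.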
Here's a data.table solution,
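library(data.table)
DT <- as.data.table(df1) # assumption: DT holds the question's data, e.g. df1 from the answer above
DT[,Credits := max(Credits),by=Student_ID]
Result <- DT[Appraisal.Type == "Grade supervisor"]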

Identifying Duplicate/Unique Teams (and Restructuring Data) in R

I have a data set that looks like this:
Person Team
1 30
2 30
3 30
4 30
11 40
22 40
1 50
2 50
3 50
4 50
15 60
16 60
17 60
1 70
2 70
3 70
4 70
11 80
22 80
My overall goal is to organize the team identification codes so that it is easy to see which teams are duplicates of one another and which teams are unique. I want to summarize the data so that it looks like this:
Team Duplicate1 Duplicate2
30 50 70
40 80
60
As you can see, teams 30, 50, and 70 have identical members, so they share a row. Similarly, teams 40 and 80 have identical members, so they share a row. Only team 60 (in this example) is unique.
In situations where teams are duplicated, I don't care which team id goes in which column. Also, there may be more than 2 duplicates of a team. Teams range in size from 2 members to 8 members.
This answer gives the output data format you asked for. I left the duplicate teams in a single variable because I think it's a better way to handle an arbitrary number of duplicates.
require(dplyr)
df %>%
arrange(Team, Person) %>% # this line is necessary in case the rest of your data isn't sorted
group_by(Team) %>%
summarize(players = paste0(Person, collapse = ",")) %>%
group_by(players) %>%
summarize(teams = paste0(Team, collapse = ",")) %>%
mutate(
# sub() works row by row, so team ids of different lengths are handled correctly
original_team = sub(",.*$", "", teams), # first team id of each group
dup_teams = ifelse(grepl(",", teams), sub("^[^,]*,", "", teams), NA) # remaining ids, NA if the team is unique
)
The result:
Source: local data frame [3 x 4]
players teams original_team dup_teams
1 1,2,3,4 30,50,70 30 50,70
2 11,22 40,80 40 80
3 15,16,17 60 60 NA
Not exactly the format you're wanting, but pretty useful:
# using MrFlick's data
library(dplyr)
dd %>% group_by(Team) %>%
arrange(Person) %>%
summarize(team.char = paste(Person, collapse = "_")) %>%
group_by(team.char) %>%
arrange(team.char, Team) %>%
mutate(duplicate = 1:n())
Source: local data frame [6 x 3]
Groups: team.char
Team team.char duplicate
1 40 11_22 1
2 80 11_22 2
3 60 15_16_17 1
4 30 1_2_3_4 1
5 50 1_2_3_4 2
6 70 1_2_3_4 3
(Edited in the arrange(Person) line in case the data isn't already sorted, got the idea from @Reed's answer.)
Using this for your sample data
dd<-structure(list(Person = c(1L, 2L, 3L, 4L, 11L, 22L, 1L, 2L, 3L,
4L, 15L, 16L, 17L, 1L, 2L, 3L, 4L, 11L, 22L), Team = c(30L, 30L,
30L, 30L, 40L, 40L, 50L, 50L, 50L, 50L, 60L, 60L, 60L, 70L, 70L,
70L, 70L, 80L, 80L)), .Names = c("Person", "Team"),
class = "data.frame", row.names = c(NA, -19L))
You could try a table()/interaction() to find duplicate groups. For example
tt <- with(dd, table(Team, Person))
grp <- do.call("interaction", c(data.frame(unclass(tt)), drop=TRUE))
split(rownames(tt), grp)
this returns
$`1.1.1.1.0.0.0.0.0`
[1] "30" "50" "70"
$`0.0.0.0.0.1.1.1.0`
[1] "60"
$`0.0.0.0.1.0.0.0.1`
[1] "40" "80"
so the group "names" are really just indicators for membership for each person. You could easily rename them if you like with setNames(). But here it collapses the appropriate teams.
Two more base R options (though not exactly the desired output):
DF2 <- aggregate(Person ~ Team, DF, toString)
> split(DF2$Team, DF2$Person)
$`1, 2, 3, 4`
[1] 30 50 70
$`11, 22`
[1] 40 80
$`15, 16, 17`
[1] 60
Or
( DF2$DupeGroup <- as.integer(factor(DF2$Person)) )
Team Person DupeGroup
1 30 1, 2, 3, 4 1
2 40 11, 22 2
3 50 1, 2, 3, 4 1
4 60 15, 16, 17 3
5 70 1, 2, 3, 4 1
6 80 11, 22 2
Note that the expected output as shown in the question would require adding NAs or empty strings to some of the column entries, because in a data.frame all columns must have the same number of rows. That is different for lists, as you can see in some of the answers.
The second option, but using data.table, since aggregate tends to be slow for large data:
library(data.table)
setDT(DF)[, toString(Person), by=Team][,DupeGroup := .GRP, by=V1][]
Team V1 DupeGroup
1: 30 1, 2, 3, 4 1
2: 40 11, 22 2
3: 50 1, 2, 3, 4 1
4: 60 15, 16, 17 3
5: 70 1, 2, 3, 4 1
6: 80 11, 22 2
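Following the note above about padding, the wide layout from the question (Team, Duplicate1, Duplicate2) could be built by padding each group of identical teams with NA. This is a base R sketch, not part of the original answers; members, groups, padded and wide are illustrative names, and the data is the sample from the question.

```r
# Sample data from the question
dd <- data.frame(
  Person = c(1L, 2L, 3L, 4L, 11L, 22L, 1L, 2L, 3L, 4L,
             15L, 16L, 17L, 1L, 2L, 3L, 4L, 11L, 22L),
  Team = c(30L, 30L, 30L, 30L, 40L, 40L, 50L, 50L, 50L, 50L,
           60L, 60L, 60L, 70L, 70L, 70L, 70L, 80L, 80L)
)

# One string per team describing its sorted members, e.g. "1, 2, 3, 4"
members <- sapply(split(dd$Person, dd$Team), function(p) toString(sort(p)))

# Teams with the same member string belong to the same group
groups <- split(as.integer(names(members)), members)

# Pad every group with NA up to the size of the largest group
width <- max(lengths(groups))
padded <- lapply(groups, function(g) c(g, rep(NA_integer_, width - length(g))))

wide <- as.data.frame(do.call(rbind, unname(padded)))
names(wide) <- c("Team", paste0("Duplicate", seq_len(width - 1)))
wide <- wide[order(wide$Team), ] # fixed row order, independent of locale
rownames(wide) <- NULL
wide
```

The number of Duplicate columns follows from the largest group, so more than two duplicates per team are handled without changes.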
Using uniquecombs from the mgcv package:
library(mgcv)
library(magrittr) # for the pipe %>%
# Using MrFlick's data
team_names <- sort(unique(dd$Team))
unique_teams <- with(dd, table(Team, Person)) %>% uniquecombs %>% attr("index")
printout <- unstack(data.frame(team_names, unique_teams))
> printout
$`1`
[1] 60
$`2`
[1] 40 80
$`3`
[1] 30 50 70
Now you could use something like this answer to print it in tabular form (note that the groups are column-wise, not row-wise as in your question):
attributes(printout) <- list(names = names(printout)
, row.names = 1:max(sapply(printout, length))
, class = "data.frame")
> printout
1 2 3
1 60 40 30
2 <NA> 80 50
3 <NA> <NA> 70
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
corrupt data frame: columns will be truncated or padded with NAs

Is it possible to combine summarise with summarise_at in a single group_by with dplyr

Edit: just realized the side column in the data isn't used at all, so please disregard it for the purposes of the example.
I have a large dataframe of play-by-play basketball data, and I would like to perform a group_by, summarise and summarise_at on my data. Below is a subset of my dataframe:
> dput(zed)
structure(list(side = c("right", "right", "right", "right", "right",
"right", "left", "right", "right", "right", "left", "right",
"left", "left", "left", "right", "right", "right", "left", "right"
), result = c("twopointmiss", "twopointmade", "twopointmade",
"twopointmiss", "twopointmade", "twopointmade", "twopointmiss",
"twopointmade", "twopointmade", "twopointmade", "twopointmade",
"twopointmade", "twopointmiss", "twopointmiss", "twopointmiss",
"twopointmiss", "twopointmade", "twopointmade", "twopointmiss",
"twopointmiss"), zonenumber = c(1, 1, 1, 1, 2, 3, 2, 3, 2, 3,
4, 4, 4, 1, 1, 2, 3, 2, 3, 4), team = c("Bos", "Bos", "Bos",
"Bos", "Bos", "Bos", "Bos", "Bos", "Bos", "Bos", "Min", "Min",
"Min", "Min", "Min", "Min", "Min", "Min", "Min", "Min")), row.names = c(3L,
5L, 8L, 14L, 17L, 23L, 28L, 30L, 39L, 41L, 42L, 43L, 47L, 52L,
54L, 58L, 60L, 63L, 69L, 72L), class = "data.frame")
> zed
side result zonenumber team
3 right twopointmiss 1 Bos
5 right twopointmade 1 Bos
8 right twopointmade 1 Bos
14 right twopointmiss 1 Bos
17 right twopointmade 2 Bos
23 right twopointmade 3 Bos
28 left twopointmiss 2 Bos
30 right twopointmade 3 Bos
39 right twopointmade 2 Bos
41 right twopointmade 3 Bos
42 left twopointmade 4 Min
43 right twopointmade 4 Min
47 left twopointmiss 4 Min
52 left twopointmiss 1 Min
54 left twopointmiss 1 Min
58 right twopointmiss 2 Min
60 right twopointmade 3 Min
63 right twopointmade 2 Min
69 left twopointmiss 3 Min
72 right twopointmiss 4 Min
In the example below, I only use summarise, as I'm currently not sure how to use summarise and summarise_at with the same group_by call:
> grouped.df <- zed %>%
+ dplyr::group_by(team) %>%
+ dplyr::summarise(
+ shotsMade = sum(result == "twopointmade"),
+ shotsAtt = n(),
+ shotsPct = round(shotsMade / shotsAtt),
+ points = 2 * shotsMade,
+
+ z1Made = sum(zonenumber == 1),
+ z2Made = sum(zonenumber == 2),
+ z3Made = sum(zonenumber == 3),
+ z4Made = sum(zonenumber == 4)
+ )
> grouped.df
# A tibble: 2 x 9
team shotsMade shotsAtt shotsPct points z1Made z2Made z3Made z4Made
<chr> <int> <int> <dbl> <dbl> <int> <int> <int> <int>
1 Bos 7 10 1 14 4 3 3 0
2 Min 4 10 0 8 2 2 2 4
In the example below, I'd like to create the first 4 columns (shotsMade, shotsAtt, shotsPct, points) in summarise, and create the z#made columns with a summarise_at. In my full data, there are ~30 unique-ish columns that I plan on creating with summarise, and ~80 similar-ish columns that I plan on creating with summarise_at.
For sake of a small example, I didn't want to bring my entire dataframe in for this example. If I am able to implement both summarise and summarise_at in the example above, then I'll be able to do it for my full data frame as well.
Any thoughts on this are greatly appreciated, as I am particularly keen on improving with the _at functions in dplyr. Thanks!
I don't think there is a way to actually use both summarise and summarise_at as clearly we wouldn't be able to execute the second one after losing many rows and columns.
So, instead we may use mutate, mutate_at, and then drop certain rows (and perhaps columns). The difference between this and somehow magically applying summarise and summarise_at is going to be that the former approach will not drop any variables. I guess it depends whether that's a good thing for you. Below I add an extra line of select(-one_of(setdiff(names(zed), "team"))) that will actually drop all the columns that the summarise combo would drop.
zed$zonenumber2 <- zed$zonenumber # Example
zed %>%
group_by(team) %>%
mutate(
shotsMade = sum(result == "twopointmade"),
shotsAtt = n(),
shotsPct = round(shotsMade / shotsAtt),
points = 2 * shotsMade) %>%
mutate_at(
vars(contains("zone")),
.funs = list(Made1 = ~sum(. == 1), Made2 = ~sum(. == 2), # list of formulas replaces the deprecated funs()
Made3 = ~sum(. == 3), Made4 = ~sum(. == 4))) %>%
filter(!duplicated(team)) %>%
select(-one_of(setdiff(names(zed), "team"))) # May want to remove
# A tibble: 2 x 13
# Groups: team [2]
# team shotsMade shotsAtt shotsPct points zonenumber_Made1 zonenumber2_Mad… zonenumber_Made2
# <chr> <int> <int> <dbl> <dbl> <int> <int> <int>
# 1 Bos 7 10 1 14 4 4 3
# 2 Min 4 10 0 8 2 2 2
# … with 5 more variables: zonenumber2_Made2 <int>, zonenumber_Made3 <int>,
# zonenumber2_Made3 <int>, zonenumber_Made4 <int>, zonenumber2_Made4 <int>
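In more recent dplyr versions (1.0.0 and later), across() can be used inside summarise(), which allows individually written columns and pattern-generated columns to share one call. This is a different technique from the mutate/mutate_at workaround above. The sketch below assumes dplyr >= 1.0.0 and uses a shortened, made-up version of zed; the names zed_small and out are illustrative.

```r
library(dplyr)

# Shortened, hypothetical version of the question's data
zed_small <- data.frame(
  result = c("twopointmiss", "twopointmade", "twopointmade", "twopointmade"),
  zonenumber = c(1, 1, 2, 2),
  team = c("Bos", "Bos", "Min", "Min")
)

out <- zed_small %>%
  group_by(team) %>%
  summarise(
    # individually written columns
    shotsMade = sum(result == "twopointmade"),
    shotsAtt = n(),
    # similar columns generated from a named list of functions
    across(
      zonenumber,
      list(z1 = ~sum(.x == 1), z2 = ~sum(.x == 2)),
      .names = "{.fn}Made"
    )
  )
out
```

Selecting columns explicitly in across(), as here, avoids accidentally picking up the columns created earlier in the same summarise() call, which a broad selector such as where(is.numeric) would include.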
