I have a data frame with ~150K rows and 77 categorical variables, in a form such as the one below. How do I find the mean score and the count for each category?
One numeric variable and 77 grouping variables
students<-data.frame(ID = c("A","B","C","D"), Gender = c("M","F","F","F"), Socioeconomic = c("Low","Low","Medium","High"), Subject = c("Maths","Maths","Science", "Science"),
Scores = c(45,98, 50,38))
That is, I do not want to go through each categorical column individually 77 times. Instead, I want a tibble (or a list of tibbles) that contains the output of each of the calls below:
students %>% group_by(Gender) %>% summarise(Mean.score = mean(Scores), Count = length(ID))
students %>% group_by(Socioeconomic) %>% summarise(Mean.score = mean(Scores), Count = length(ID))
students %>% group_by(Subject) %>% summarise(Mean.score = mean(Scores), Count = length(ID))
Here are two options:
library(tidyverse)
# map successively over each categorical column
map(students %>% select(-Scores, -ID) %>% names() %>% set_names(),
~ students %>%
group_by_at(.x) %>%
summarise(Mean.score = mean(Scores),
Count = n())
)
$Gender
# A tibble: 2 x 3
Gender Mean.score Count
<fct> <dbl> <int>
1 F 62 3
2 M 45 1
$Socioeconomic
# A tibble: 3 x 3
Socioeconomic Mean.score Count
<fct> <dbl> <int>
1 High 38 1
2 Low 71.5 2
3 Medium 50 1
$Subject
# A tibble: 2 x 3
Subject Mean.score Count
<fct> <dbl> <int>
1 Maths 71.5 2
2 Science 44 2
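group_by_at() is superseded in dplyr 1.0.0 and later. A minimal sketch of the same mapping idea with across() and all_of(), assuming dplyr >= 1.0.0 and purrr are installed. The names cat_cols and by_category are introduced here for illustration only:

```r
library(dplyr)
library(purrr)

students <- data.frame(
  ID = c("A", "B", "C", "D"),
  Gender = c("M", "F", "F", "F"),
  Socioeconomic = c("Low", "Low", "Medium", "High"),
  Subject = c("Maths", "Maths", "Science", "Science"),
  Scores = c(45, 98, 50, 38)
)

# cat_cols is a hypothetical name: every column except ID and Scores
cat_cols <- setdiff(names(students), c("ID", "Scores"))

# One summary tibble per categorical column, returned as a named list
by_category <- map(
  set_names(cat_cols),
  ~ students %>%
    group_by(across(all_of(.x))) %>%
    summarise(Mean.score = mean(Scores), Count = n(), .groups = "drop")
)

by_category$Gender
```

Selecting the columns by name with setdiff() keeps the call independent of column order, which may matter with 77 grouping columns.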
# Convert to long format, group, then summarize
students %>%
gather(key, value, -ID, -Scores) %>%
group_by(key, value) %>%
summarise(Count=n(),
Mean.score=mean(Scores))
key value Count Mean.score
<chr> <chr> <int> <dbl>
1 Gender F 3 62
2 Gender M 1 45
3 Socioeconomic High 1 38
4 Socioeconomic Low 2 71.5
5 Socioeconomic Medium 1 50
6 Subject Maths 2 71.5
7 Subject Science 2 44
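gather() is superseded in tidyr 1.0.0 and later. The long-format approach could be sketched with pivot_longer() instead, assuming tidyr >= 1.0.0 is installed (long_summary is an illustrative name):

```r
library(dplyr)
library(tidyr)

students <- data.frame(
  ID = c("A", "B", "C", "D"),
  Gender = c("M", "F", "F", "F"),
  Socioeconomic = c("Low", "Low", "Medium", "High"),
  Subject = c("Maths", "Maths", "Science", "Science"),
  Scores = c(45, 98, 50, 38)
)

# Stack all categorical columns into key/value pairs, then summarise each pair
long_summary <- students %>%
  pivot_longer(cols = -c(ID, Scores), names_to = "key", values_to = "value") %>%
  group_by(key, value) %>%
  summarise(Count = n(), Mean.score = mean(Scores), .groups = "drop")

long_summary
```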
DATA = data.frame("TRIMESTER" = c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3),
"STUDENT" = c(1,2,3,4,5,6,7,1,2,3,5,9,10,11,3,7,10,6,12,15,17,16,21))
WANT = data.frame("TRIMESTER" = c(1,2,3),
"NEW_ENROLL" = c(7,3,5),
"TOTAL_ENROLL" = c(7,10,15))
I have 'DATA' and want to make 'WANT', which has three columns. For every 'TRIMESTER', 'NEW_ENROLL' counts the 'STUDENT' values that appear for the first time, and 'TOTAL_ENROLL' is the running total of unique 'STUDENT' values seen up to and including that trimester.
My attempt only counts the number of rows for each TRIMESTER.
library(dplyr)
DATA %>%
group_by(TRIMESTER) %>%
count()
Here is a way.
suppressPackageStartupMessages(library(dplyr))
DATA <- data.frame("TRIMESTER" = c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3),
"STUDENT" = c(1,2,3,4,5,6,7,1,2,3,5,9,10,11,3,7,10,6,12,15,17,16,21))
DATA %>%
mutate(NEW_ENROLL = !duplicated(STUDENT)) %>%
group_by(TRIMESTER) %>%
summarise(NEW_ENROLL = sum(NEW_ENROLL)) %>%
ungroup() %>%
mutate(TOTAL_ENROLL = cumsum(NEW_ENROLL))
#> # A tibble: 3 × 3
#> TRIMESTER NEW_ENROLL TOTAL_ENROLL
#> <dbl> <int> <int>
#> 1 1 7 7
#> 2 2 3 10
#> 3 3 5 15
Created on 2022-08-14 by the reprex package (v2.0.1)
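Note that !duplicated(STUDENT) marks the first row on which each student appears, so it only identifies new enrolments correctly when the rows are already ordered by TRIMESTER, as they are in DATA. A small base R sketch, assuming the rows might arrive unordered (the names shuffled, ordered and new_per_trimester are illustrative):

```r
DATA <- data.frame(
  TRIMESTER = c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3),
  STUDENT = c(1,2,3,4,5,6,7,1,2,3,5,9,10,11,3,7,10,6,12,15,17,16,21)
)

# Simulate unordered input by reversing the rows
shuffled <- DATA[rev(seq_len(nrow(DATA))), ]

# Sort by TRIMESTER so the first occurrence of a student is the earliest one
ordered <- shuffled[order(shuffled$TRIMESTER), ]

# Count first appearances per trimester, then accumulate
new_per_trimester <- tapply(!duplicated(ordered$STUDENT), ordered$TRIMESTER, sum)

WANT <- data.frame(
  TRIMESTER = as.numeric(names(new_per_trimester)),
  NEW_ENROLL = as.vector(new_per_trimester),
  TOTAL_ENROLL = cumsum(as.vector(new_per_trimester))
)
WANT
```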
For variety we can use Base R aggregate with transform
transform(aggregate(. ~ TRIMESTER , DATA[!duplicated(DATA$STUDENT),] , length),
TOTAL_ENROLL = cumsum(STUDENT))
Output
TRIMESTER STUDENT TOTAL_ENROLL
1 1 7 7
2 2 3 10
3 3 5 15
We replace the duplicated elements in 'STUDENT' with NA, group by TRIMESTER, take the sum of the non-NA elements, and finally compute the cumulative sum (cumsum).
library(dplyr)
DATA %>%
mutate(STUDENT = replace(STUDENT, duplicated(STUDENT), NA)) %>%
group_by(TRIMESTER) %>%
summarise(NEW_ENROLL = sum(!is.na(STUDENT)), .groups= 'drop') %>%
mutate(TOTAL_ENROLL = cumsum(NEW_ENROLL))
Output
# A tibble: 3 × 3
TRIMESTER NEW_ENROLL TOTAL_ENROLL
<dbl> <int> <int>
1 1 7 7
2 2 3 10
3 3 5 15
Or with distinct
distinct(DATA, STUDENT, .keep_all = TRUE) %>%
group_by(TRIMESTER) %>%
summarise(NEW_ENROLL = n(), .groups = 'drop') %>%
mutate(TOTAL_ENROLL = cumsum(NEW_ENROLL))
# A tibble: 3 × 3
TRIMESTER NEW_ENROLL TOTAL_ENROLL
<dbl> <int> <int>
1 1 7 7
2 2 3 10
3 3 5 15
This data frame has a lot of separate observations that I need to add together per product.
Use dget():
I've tried a lot of different solutions like:
df %>%
group_by(product, price) %>%
summarise(
quantity = sum(quantity),
total = sum(total)
)
And:
df %>%
gather(key = variable, value = value, c(Quantity,Price,Total)) %>%
group_by(Product, variable) %>%
summarize(sum = sum(value)) %>%
spread(variable, sum)
And:
df %>%
group_by(Product) %>%
summarise(Quantity = sum(Quantity),
AveragePrice = sum(Total)/sum(Quantity),
Total = sum(Total))
But I just get:
> df
quantity total
1 61 1685
Expected output is something like this but obviously with all the products on the product column:
#> # Groups: product [2]
#> product price quantity total
#> <chr> <dbl> <dbl> <dbl>
#> 1 small cucumber 10 1 10
#> 2 tomatoes 1kg 16 2 32
I have asked this before but I wasn't nearly specific enough.
Thanks.
You're calling plyr::summarise instead of dplyr::summarise. You can call it explicitly with
df %>%
group_by(Product) %>%
dplyr::summarise(
Quantity = sum(Quantity),
AveragePrice = sum(Total)/sum(Quantity),
Total = sum(Total))
#> # A tibble: 27 x 4
#> Product Quantity AveragePrice Total
#> <chr> <dbl> <dbl> <dbl>
#> 1 asparagus 200g 4 45 180
#> 2 back bacon 200g 1 30 30
#> 3 beef fillet strips 500g 1 90 90
#> 4 beetroot 1kg 1 15 15
#> 5 broccoli head 1 25 25
#> 6 butter 500g 1 57 57
#> 7 butternut cubes 4 14 56
#> 8 calistos jalapeño salsa 1 40 40
#> 9 carrot 1 kg 4 14 56
#> 10 cauliflower whole head 2 25 50
#> # … with 17 more rows
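To check which package a bare summarise currently resolves to, something like the following may help. This is a sketch only: the question's df is not shown in full, so the small df below is a made-up stand-in that borrows the column names Product, Quantity and Total, and resolved and per_product are illustrative names.

```r
library(dplyr)

# Hypothetical stand-in for the question's data
df <- data.frame(
  Product = c("tomatoes 1kg", "tomatoes 1kg", "small cucumber"),
  Quantity = c(1, 1, 1),
  Total = c(16, 16, 10)
)

# Name of the package whose summarise() is found first on the search path.
# If plyr was attached after dplyr, this would be "plyr" rather than "dplyr".
resolved <- environmentName(environment(summarise))
resolved

# Objects that are defined in more than one attached package
conflicts(detail = FALSE)

# Namespacing the call makes the result independent of attach order
per_product <- df %>%
  group_by(Product) %>%
  dplyr::summarise(Quantity = sum(Quantity), Total = sum(Total))
per_product
```

Attaching plyr before dplyr, or not attaching plyr at all and calling its functions as plyr::fun(), avoids the masking in the first place.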
I'm still new to the group and R.
I had some really helpful feedback on my last query so hoping I can get
some more support with the following:
I am working on a horse racing database that at this stage has 4 variables:
horse number, race id, distance of the race and the rating (DaH) assigned for the horse's
performance in the race.
The dataset:
horse_ratings <- tibble(
horse=c(1,1,1,2,2,2,3,3,3),
raceid=c(1,2,3,1,2,3,1,2,3),
Dist=c(9.47,9.47,10,10.1,10.2,9,11,9.47,10.5),
DaH=c(101,99,103,101,94,87,102,96,62)
)
Giving:
> horse_ratings
# A tibble: 9 x 4
horse raceid Dist DaH
<dbl> <dbl> <dbl> <dbl>
1 1 1 9.47 101
2 1 2 9.47 99
3 1 3 10 103
4 2 1 10.1 101
5 2 2 10.2 94
6 2 3 9 87
7 3 1 11 102
8 3 2 9.47 96
9 3 3 10.5 62
I will perform a number of calculations on the dataset, such as mean rating, max rating etc.,
which I'd like to result in a number of vectors of equal length.
I'm using the filter function to look at the performance ratings achieved for different
race distances (i.e. distance greater than 10 to begin with). However, if one of the horses has not
run a race over that distance, then I've noticed that the result does not include that
horse in the output, i.e.:
> horse_ratings %>%
+ group_by(horse) %>%
+ filter(Dist>10) %>%
+ summarise(mean_rating=mean(DaH))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 2
horse mean_rating
<dbl> <dbl>
1 2 97.5
2 3 82
So horse 1 has disappeared as it has not run a race of distance greater than 10.
I need to keep the output vector of length 3 ideally so I can put all the calculations
in to a dataframe of same length (for my final data output/print out).
I'm hoping there's a way of assigning an NA or similar to an output for horse 1
Giving:
# A tibble: 3 x 2
horse mean_rating
<dbl> <dbl>
1 1 NA
2 2 97.5
3 3 82
Or a similar solution.
Help would be much appreciated!!
You can use the .drop = FALSE parameter in group_by():
horse_ratings %>%
group_by(horse, .drop = FALSE) %>%
filter(Dist > 10) %>%
summarise(mean_rating = mean(DaH))
horse mean_rating
<dbl> <dbl>
1 1 NaN
2 2 97.5
3 3 82
Don't filter first, do it in summarise so you don't drop groups (horse).
library(dplyr)
horse_ratings %>%
group_by(horse) %>%
summarise(mean_rating = mean(DaH[Dist>10], na.rm = TRUE))
# A tibble: 3 x 2
# horse mean_rating
# <dbl> <dbl>
#1 1 NaN
#2 2 97.5
#3 3 82
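Both answers above return NaN for horse 1, because mean() of an empty vector is NaN. If an NA is preferred, as the question asks, one possible sketch is to test for qualifying races inside summarise(). The column names n_races and max_rating and the object long_dist are illustrative additions, not part of the original answers:

```r
library(dplyr)

horse_ratings <- tibble(
  horse = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  raceid = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  Dist = c(9.47, 9.47, 10, 10.1, 10.2, 9, 11, 9.47, 10.5),
  DaH = c(101, 99, 103, 101, 94, 87, 102, 96, 62)
)

long_dist <- horse_ratings %>%
  group_by(horse) %>%
  summarise(
    # number of races over the distance threshold
    n_races = sum(Dist > 10),
    # return NA explicitly when a horse has no qualifying race
    mean_rating = if (any(Dist > 10)) mean(DaH[Dist > 10]) else NA_real_,
    max_rating = if (any(Dist > 10)) max(DaH[Dist > 10]) else NA_real_
  )

long_dist
```

The explicit check also matters for max() and min(), which return -Inf or Inf with a warning on an empty vector rather than NaN.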
library(tidyverse)
Method 1:
horse_stats <-
horse_ratings %>%
mutate(raceid = as.factor(raceid)) %>%
filter(Dist > 10) %>%
group_by(horse) %>%
summarise_if(is.numeric, c("sum", "mean", "max", "min")) %>%
ungroup() %>%
left_join(horse_ratings %>%
select(horse) %>%
distinct(),
., by = "horse")
Method 2 :
horse_stats <-
horse_ratings %>%
mutate(raceid = factor(raceid),
Dist = ifelse(Dist <= 10, 0, Dist),
DaH = ifelse(Dist == 0, 0, DaH)) %>%
group_by(horse) %>%
summarise_if(is.numeric, c("sum", "mean", "max", "min")) %>%
ungroup() %>%
mutate_if(is.numeric, list(~na_if(., 0)))
My data is something like this:
group <- c(21, 21, 21, 9, 9, 9, 25, 25, 25)
a <- c(8,3,5,6,8,3,3,9,3)
b <- c(4,9,0,1,3,5,6,1,1)
c <- c(1,7,2,5,6,8,4,8,6)
value <- c(23,34,43,52,65,21,12,89,76)
df <- data.frame(group,a,b,c,value)
I applied following function to it.
out <- df %>%
select(group, a, b, value) %>%
group_by(group = gl(n()/3, 3)) %>%
summarise(res = mean(value), a=a[1], b=b[1])
print(out)
Then I am getting following result.
group res a b
<fct> <dbl> <dbl> <dbl>
1 1 33.3 8 4
2 2 46 6 1
3 3 59 3 6
>
My question is how to keep the original values of group as they were, so that the output df looks like this:
group res a b
<fct> <dbl> <dbl> <dbl>
1 21 33.3 8 4
2 9 46 6 1
3 25 59 3 6
>
Thanks in advance!
The issue is that you are overwriting your group variable in the group_by call, so the original values are lost. You need to use a different name in group_by and then do the calculations.
We can use two options -
1) With summarise
library(dplyr)
df %>%
group_by(group1 = gl(n()/3, 3)) %>%
summarise(res = mean(value), a=a[1], b=b[1], group = group[1])
# group1 res a b group
# <fct> <dbl> <dbl> <dbl> <dbl>
#1 1 33.3 8 4 21
#2 2 46 6 1 9
#3 3 59 3 6 25
2) With mutate
df %>%
select(group, a, b, value) %>%
group_by(group1 = gl(n()/3, 3)) %>%
mutate(res = mean(value), a=a[1], b=b[1]) %>%
slice(1)
In both cases, if you are no longer interested in keeping the grouping variable, add ungroup() %>% select(-group1) to remove it.
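A third possibility, sketched here under the assumption that each value of group identifies exactly one block of rows (true for the sample data, not guaranteed in general), is to group on the original column directly and use a factor whose levels follow the order of first appearance, so the rows are not re-sorted numerically:

```r
library(dplyr)

df <- data.frame(
  group = c(21, 21, 21, 9, 9, 9, 25, 25, 25),
  a = c(8, 3, 5, 6, 8, 3, 3, 9, 3),
  b = c(4, 9, 0, 1, 3, 5, 6, 1, 1),
  c = c(1, 7, 2, 5, 6, 8, 4, 8, 6),
  value = c(23, 34, 43, 52, 65, 21, 12, 89, 76)
)

out <- df %>%
  # levels = unique(group) keeps the 21, 9, 25 order of the input
  group_by(group = factor(group, levels = unique(group))) %>%
  summarise(res = mean(value), a = first(a), b = first(b))

out
```

If the same group value could occur in two separate blocks, this would merge them, so the gl()-based grouping above remains the safer choice in that case.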
Consider the following dataset where id uniquely identifies a person, and name varies within id only to the extent of minor spelling issues. I want to aggregate to id level using dplyr:
df= data.frame(id=c(1,1,1,2,2,2),name=c('michael c.','mike', 'michael','','John',NA),var=1:6)
Using group_by(id) yields the correct computation, but I lose the name column:
df %>% group_by(id) %>% summarise(newvar=sum(var)) %>%ungroup()
# A tibble: 2 x 2
id newvar
<dbl> <int>
1 1 6
2 2 15
Using group_by(id,name) yields both name and id but obviously the "wrong" sums.
I would like to keep the last non-missing observation of the name within each group. I basically lack a dplyr version of Stata's lastnm() function:
df %>% group_by(id) %>% summarise(sum = sum(var), Name = lastnm(name))
id sum Name
1 1 6 michael
2 2 15 John
Is there a "keep last non missing"-option?
1) Use mutate like this:
df %>%
group_by(id) %>%
mutate(sum = sum(var)) %>%
ungroup
giving:
# A tibble: 6 x 4
id name var sum
<dbl> <fct> <int> <int>
1 1 michael c. 1 6
2 1 mike 2 6
3 1 michael 3 6
4 2 john 4 15
5 2 john 5 15
6 2 john 6 15
2) Another possibility is:
df %>%
group_by(id) %>%
summarize(name = name %>% unique %>% toString, sum = sum(var)) %>%
ungroup
giving:
# A tibble: 2 x 3
id name sum
<dbl> <chr> <int>
1 1 michael c., mike, michael 6
2 2 john 15
3) Another variation is to only report the first name in each group:
df %>%
group_by(id) %>%
summarize(name = first(name), sum = sum(var)) %>%
ungroup
giving:
# A tibble: 2 x 3
id name sum
<dbl> <fct> <int>
1 1 michael c. 6
2 2 john 15
I posted a feature request on dplyr's GitHub issue tracker, and the response there is actually the best answer. For the sake of completeness I repost it here:
df %>%
group_by(id) %>%
summarise(sum=sum(var), Name=last(name[!is.na(name)]))
#> # A tibble: 2 x 3
#> id sum Name
#> <dbl> <int> <chr>
#> 1 1 6 michael
#> 2 2 15 John
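The expression above skips NA but not empty strings, which happens to work here because "John" comes after "" within id 2. If empty strings should also count as missing, a small helper could be sketched as follows. last_nm is a hypothetical function written for this example, not part of dplyr:

```r
library(dplyr)

df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  name = c("michael c.", "mike", "michael", "", "John", NA),
  var = 1:6,
  stringsAsFactors = FALSE
)

# Hypothetical helper: last element that is neither NA nor an empty string
last_nm <- function(x) {
  keep <- x[!is.na(x) & x != ""]
  if (length(keep) == 0) NA_character_ else keep[length(keep)]
}

result <- df %>%
  group_by(id) %>%
  summarise(sum = sum(var), Name = last_nm(name))

result
```

Returning NA_character_ for a group with no usable name keeps the Name column a character vector in every case.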