Dplyr: Rename Tibble Output Columns With Factor Levels - r

I am trying to find a way to rename my factor levels (1, 2, 3) with girl, boy, other in the dplyr tibble output.
This is the code:
library(dplyr)
df1 %>%
dplyr::group_by(sex)%>%
dplyr::summarise(percent=100*n()/nrow(df1), n=n())
And my result is:
# A tibble: 3 x 3
sex percent n
<int> <dbl> <int>
1 1 52.1 731
2 2 47.1 661
3 NA 0.855 12
The desired result would be:
# A tibble: 3 x 3
sex percent n
<int> <dbl> <int>
Girl 1 52.1 731
Boy 2 47.1 661
Other NA 0.855 12

I happen to love the forcats package because when I'm done I can actually see what I did. Here is another solution that simply adds a step to the pipe before your existing code.
library(dplyr)
library(forcats)
sex <- sample(1:2, 100, replace = TRUE)
sex[[88]] <- NA
df1 <- data.frame(sex)
df1 %>%
mutate(newsex = fct_explicit_na(fct_recode(as_factor(sex),
Girl = "1",
Boy = "2" ),
na_level = "Other")) %>%
group_by(newsex, sex) %>%
summarise(percent = 100 * n() / nrow(df1), n=n())
#> # A tibble: 3 x 4
#> # Groups: newsex [3]
#> newsex sex percent n
#> <fct> <int> <dbl> <int>
#> 1 Girl 1 56 56
#> 2 Boy 2 43 43
#> 3 Other NA 1 1
Created on 2020-05-11 by the reprex package (v0.3.0)

When posting, please provide some sample data to work with; it helps others test and make sure everything is working properly. This problem is relatively simple, so it shouldn't be an issue here.
If you want to replace the NA with any other number, you can do this:
df1 %>%
dplyr::mutate(sex = ifelse(is.na(sex), 0, sex),
sex = factor(sex,
levels = c(1,2,0),
labels = c("Girl", "Boy", "Other"))) %>%
dplyr::group_by(sex)%>%
dplyr::summarise(percent=100*n()/nrow(df1), n=n())
Otherwise you can use case_when() to assign the labels and then convert the column to a factor with as_factor() (from the forcats package):
df1 %>%
dplyr::mutate(sex = case_when(
sex == 1 ~ "Girl",
sex == 2 ~ "Boy",
is.na(sex) ~ "Other") %>%
as_factor(.)) %>%
dplyr::group_by(sex)%>%
dplyr::summarise(percent=100*n()/nrow(df1), n=n())
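A base-R variation on the same idea, offered as a sketch rather than a definitive recipe: factor() can label the NA level directly when exclude = NULL is set, which avoids recoding NA to a placeholder number first. The sample data below is made up for illustration.

```r
library(dplyr)

# Hypothetical sample data: 1 = girl, 2 = boy, NA = not recorded
df1 <- data.frame(sex = c(1, 2, 1, NA, 2, 1))

result <- df1 %>%
  # exclude = NULL keeps NA as a level so it can receive the label "Other"
  mutate(sex = factor(sex,
                      levels = c(1, 2, NA),
                      labels = c("Girl", "Boy", "Other"),
                      exclude = NULL)) %>%
  group_by(sex) %>%
  summarise(percent = 100 * n() / nrow(df1), n = n())

print(result)
```

Because "Other" is now an ordinary factor level, the summary rows come out in level order (Girl, Boy, Other) rather than with NA sorted last.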

Related

applying weighted.mean for specific values in a column

I have a data frame named df with five columns :
age <- c(10,11,12,12,10,11,11,12,10,11,12)
time <- c(20,26,41,60,29,28,54,24,59,70,25)
weight <- c(123,330,445,145,67,167,190,104,209,146,201)
gender <- c(1,1,2,2,2,2,1,2,2,2,1)
Q2 <- c(112,119,114,120,121,117,116,114,121,122,124)
df <- data.frame(age, weight, time, gender, Q2)
What I want is to apply weighted.mean for each age, using only the rows that meet two conditions: 1) gender == 2 and 2) Q2 >= 114 & Q2 <= 121.
With the code below I can apply weighted.mean, but I do not know how to add my two conditions.
df1<-
df %>%
group_by(age) %>%
summarise(weighted_time = weighted.mean(time, weight))
Is the following what you are looking for?
library(tidyverse)
age <- c(10,11,12,12,10,11,11,12,10,11,12)
time <- c(20,26,41,60,29,28,54,24,59,70,25)
weight <- c(123,330,445,145,67,167,190,104,209,146,201)
gender <- c(1,1,2,2,2,2,1,2,2,2,1)
Q2 <- c(112,119,114,120,121,117,116,114,121,122,124)
df <- data.frame(age, weight, time, gender, Q2)
df %>%
group_by(age) %>%
filter(gender == 2 & Q2 >=114 & Q2 <= 121) %>%
summarise(weighted_time = weighted.mean(time, weight), .groups = "drop")
#> # A tibble: 3 × 2
#> age weighted_time
#> <dbl> <dbl>
#> 1 10 51.7
#> 2 11 28
#> 3 12 42.4
You can add a filter for those 2 (3) conditions:
df %>% filter(gender == 2 & Q2 >= 114 & Q2 <= 121) %>% group_by(age) %>% summarise(weighted_time = weighted.mean(time, weight))
This gives
# A tibble: 3 x 2
age weighted_time
<dbl> <dbl>
1 10 51.7
2 11 28
3 12 42.4
Using data.table:
age <- c(10,11,12,12,10,11,11,12,10,11,12)
time <- c(20,26,41,60,29,28,54,24,59,70,25)
weight <- c(123,330,445,145,67,167,190,104,209,146,201)
gender <- c(1,1,2,2,2,2,1,2,2,2,1)
Q2 <- c(112,119,114,120,121,117,116,114,121,122,124)
df <- data.frame(age, weight, time, gender, Q2)
library(data.table)
setDT(df)[gender == 2 & (Q2 >=114 & Q2 <= 121), list(res = weighted.mean(time, weight)), by = age
][order(age)]
#> age res
#> 1: 10 51.71739
#> 2: 11 28.00000
#> 3: 12 42.42219
Created on 2021-12-10 by the reprex package (v2.0.1)
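To see what the filtered weighted.mean() is computing, the age-10 value can be reproduced by hand. After filtering on gender == 2 and Q2 between 114 and 121, only two age-10 rows remain in the data above (time 29 and 59, with weights 67 and 209). The variable names below are only for illustration.

```r
# The two age-10 rows that survive the filter in the example data
time_10   <- c(29, 59)
weight_10 <- c(67, 209)

# A weighted mean is sum(value * weight) / sum(weight)
manual  <- sum(time_10 * weight_10) / sum(weight_10)
builtin <- weighted.mean(time_10, weight_10)

print(manual)   # about 51.72, matching the 51.7 reported above
print(builtin)
```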

Iterating over listed data frames within a piped purrr anonymous function call

Using purrr::map and the magrittr pipe, I am trying to generate a new column with values equal to a substring of an existing column's name.
I can illustrate what I'm trying to do with the following toy dataset:
library(tidyverse)
library(purrr)
test <- list(tibble(geoid_1970 = c(123, 456),
name_1970 = c("here", "there"),
pop_1970 = c(1, 2)),
tibble(geoid_1980 = c(234, 567),
name_1980 = c("here", "there"),
pop_1970 = c(3, 4))
)
Within each listed data frame, I want a column equal to the relevant year. Without iterating, the code I have is:
data <- map(test, ~ .x %>% mutate(year = as.integer(str_sub(names(test[[1]][1]), -4))))
Of course, this returns a year of 1970 in both listed data frames, which I don't want. (I want 1970 in the first and 1980 in the second.)
In addition, it's not piped, and my attempt to pipe it throws an error:
data <- test %>% map(~ .x %>% mutate(year = as.integer(str_sub(names(.x[[1]][1]), -4))))
# > Error: Problem with `mutate()` input `year`.
# > x Input `year` can't be recycled to size 2.
# > ℹ Input `year` is `as.integer(str_sub(names(.x[[1]][1]), -4))`.
# > ℹ Input `year` must be size 2 or 1, not 0.
How can I iterate over each listed data frame using the pipe?
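Index with single brackets, .x[1], which keeps a one-column tibble whose name is the column name (.x[[1]] returns a bare vector with no names, hence the size-0 input). Try: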
test %>% map(~.x %>% mutate(year = as.integer(str_sub(names(.x[1]), -4))))
[[1]]
# A tibble: 2 x 4
geoid_1970 name_1970 pop_1970 year
<dbl> <chr> <dbl> <int>
1 123 here 1 1970
2 456 there 2 1970
[[2]]
# A tibble: 2 x 4
geoid_1980 name_1980 pop_1970 year
<dbl> <chr> <dbl> <int>
1 234 here 3 1980
2 567 there 4 1980
We can get the 'year' with parse_number
library(dplyr)
library(purrr)
map(test, ~ .x %>%
mutate(year = readr::parse_number(names(.)[1])))
-output
#[[1]]
# A tibble: 2 x 4
# geoid_1970 name_1970 pop_1970 year
# <dbl> <chr> <dbl> <dbl>
#1 123 here 1 1970
#2 456 there 2 1970
#[[2]]
# A tibble: 2 x 4
# geoid_1980 name_1980 pop_1970 year
# <dbl> <chr> <dbl> <dbl>
#1 234 here 3 1980
#2 567 there 4 1980
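The difference between the failing and working versions comes down to `[[` versus `[`. A small base-R sketch, using a made-up data frame shaped like the second list element, shows why one returns no name and the other does:

```r
# Hypothetical data frame mirroring the second list element above
df <- data.frame(geoid_1980 = c(234, 567),
                 name_1980  = c("here", "there"))

# [[1]] extracts the column as a plain vector, so there are no names left
no_names <- names(df[[1]][1])

# [1] keeps a one-column data frame, so names() returns the column name
col_name <- names(df[1])

# Take the last four characters of the column name as the year
year <- as.integer(substr(col_name, nchar(col_name) - 3, nchar(col_name)))

print(no_names)  # NULL
print(col_name)  # "geoid_1980"
print(year)      # 1980
```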

Using Filter function in R. Need to assign NA and keep length of dataset the same for Horse Racing Database

I'm still new to the group and R.
I had some really helpful feedback on my last query, so I'm hoping I can get
some more support with the following:
I am working on a horse racing database that at this stage has 4 variables:
horse number, race id, distance of race and the rating (DaH) assigned for the horse's
performance in the race.
The dataset:
horse_ratings <- tibble(
horse=c(1,1,1,2,2,2,3,3,3),
raceid=c(1,2,3,1,2,3,1,2,3),
Dist=c(9.47,9.47,10,10.1,10.2,9,11,9.47,10.5),
DaH=c(101,99,103,101,94,87,102,96,62)
)
Giving:
> horse_ratings
# A tibble: 9 x 4
horse raceid Dist DaH
<dbl> <dbl> <dbl> <dbl>
1 1 1 9.47 101
2 1 2 9.47 99
3 1 3 10 103
4 2 1 10.1 101
5 2 2 10.2 94
6 2 3 9 87
7 3 1 11 102
8 3 2 9.47 96
9 3 3 10.5 62
I will perform a number of calculations on the dataset, such as mean rating, max rating etc.,
which I'd like to result in a number of vectors of equal length.
I'm using the filter function to look at the performance ratings achieved for different
race distances (i.e. distance greater than 10 to begin with). However, if one of the horses has not
run a race at that distance, then I've noticed that the result does not include that
horse in the output, i.e.:
> horse_ratings %>%
+ group_by(horse) %>%
+ filter(Dist>10) %>%
+ summarise(mean_rating=mean(DaH))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 2
horse mean_rating
<dbl> <dbl>
1 2 97.5
2 3 82
So horse 1 has disappeared as it has not run a race of distance greater than 10.
I need to keep the output vector of length 3 ideally so I can put all the calculations
in to a dataframe of same length (for my final data output/print out).
I'm hoping there's a way of assigning an NA or similar to the output for horse 1,
giving:
# A tibble: 3 x 2
horse mean_rating
<dbl> <dbl>
1 1 NA
2 2 97.5
3 3 82
Or a similar solution.
Help would be much appreciated!!
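You can use the .drop = FALSE argument of group_by(). Note that dplyr documents .drop as controlling groups formed by factor levels that don't appear in the data, so this is most dependable when horse is a factor: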
horse_ratings %>%
group_by(horse, .drop = FALSE) %>%
filter(Dist > 10) %>%
summarise(mean_rating = mean(DaH))
horse mean_rating
<dbl> <dbl>
1 1 NaN
2 2 97.5
3 3 82
Don't filter first, do it in summarise so you don't drop groups (horse).
library(dplyr)
horse_ratings %>%
group_by(horse) %>%
summarise(mean_rating = mean(DaH[Dist>10], na.rm = TRUE))
# A tibble: 3 x 2
# horse mean_rating
# <dbl> <dbl>
#1 1 NaN
#2 2 97.5
#3 3 82
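The summarise-based approach above returns NaN for a horse with no qualifying races, because the mean of an empty vector is NaN. If a plain NA is preferred, as the question asks, one possible follow-up step is sketched below; the extra mutate() is an addition for illustration, not part of the original answer.

```r
library(dplyr)

horse_ratings <- tibble(
  horse  = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  raceid = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  Dist   = c(9.47, 9.47, 10, 10.1, 10.2, 9, 11, 9.47, 10.5),
  DaH    = c(101, 99, 103, 101, 94, 87, 102, 96, 62)
)

result <- horse_ratings %>%
  group_by(horse) %>%
  # Subset inside summarise() so every horse keeps its row
  summarise(mean_rating = mean(DaH[Dist > 10])) %>%
  # mean() of an empty vector is NaN; turn that into NA
  mutate(mean_rating = ifelse(is.nan(mean_rating), NA_real_, mean_rating))

print(result)
```

The same pattern extends to other statistics, although functions such as max() warn and return -Inf on empty input, so those would need their own handling.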
library(tidyverse)
Method 1:
horse_stats <-
horse_ratings %>%
mutate(raceid = as.factor(raceid)) %>%
filter(Dist > 10) %>%
group_by(horse) %>%
summarise_if(is.numeric, c("sum", "mean", "max", "min")) %>%
ungroup() %>%
left_join(horse_ratings %>%
select(horse) %>%
distinct(),
., by = "horse")
Method 2 :
horse_stats <-
horse_ratings %>%
mutate(raceid = factor(raceid),
Dist = ifelse(Dist <= 10, 0, Dist),
DaH = ifelse(Dist == 0, 0, DaH)) %>%  # zero out ratings for races of 10 or less
group_by(horse) %>%
summarise_if(is.numeric, c("sum", "mean", "max", "min")) %>%
ungroup() %>%
mutate_if(is.numeric, list(~na_if(., 0)))

How to keep real values of grouped variable within dplyr package in R

My data is something like this:
group <- c(21, 21, 21, 9, 9, 9, 25, 25, 25)
a <- c(8,3,5,6,8,3,3,9,3)
b <- c(4,9,0,1,3,5,6,1,1)
c <- c(1,7,2,5,6,8,4,8,6)
value <- c(23,34,43,52,65,21,12,89,76)
df <- data.frame(group,a,b,c,value)
I applied following function to it.
out <- df %>%
select(group, a, b, value) %>%
group_by(group = gl(n()/3, 3)) %>%
summarise(res = mean(value), a=a[1], b=b[1])
print(out)
Then I am getting following result.
group res a b
<fct> <dbl> <dbl> <dbl>
1 1 33.3 8 4
2 2 46 6 1
3 3 59 3 6
>
My question is how to keep the original values of group as they were, so that the output looks like this:
group res a b
<fct> <dbl> <dbl> <dbl>
1 21 33.3 8 4
2 9 46 6 1
3 25 59 3 6
>
Thanks in advance!
The issue is that you are overwriting your group variable in the group_by() call, so the original values are lost. You need to use a different name in group_by() and then do the calculations.
We can use two options -
1) With summarise
library(dplyr)
df %>%
group_by(group1 = gl(n()/3, 3)) %>%
summarise(res = mean(value), a=a[1], b=b[1], group = group[1])
# group1 res a b group
# <fct> <dbl> <dbl> <dbl> <dbl>
#1 1 33.3 8 4 21
#2 2 46 6 1 9
#3 3 59 3 6 25
2) With mutate
df %>%
select(group, a, b, value) %>%
group_by(group1 = gl(n()/3, 3)) %>%
mutate(res = mean(value), a=a[1], b=b[1]) %>%
slice(1)
In both cases, if you are no longer interested in keeping the grouping variable, add ungroup() %>% select(-group1) to remove it.
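Another possibility, sketched here under the assumption that each value of group identifies exactly one block of rows (as it does in the sample data): group on the original column converted to a factor whose levels follow order of first appearance. This keeps the real values and the original ordering without a helper column.

```r
library(dplyr)

df <- data.frame(
  group = c(21, 21, 21, 9, 9, 9, 25, 25, 25),
  a     = c(8, 3, 5, 6, 8, 3, 3, 9, 3),
  b     = c(4, 9, 0, 1, 3, 5, 6, 1, 1),
  value = c(23, 34, 43, 52, 65, 21, 12, 89, 76)
)

out <- df %>%
  # levels = unique(group) keeps 21, 9, 25 in their original order
  group_by(group = factor(group, levels = unique(group))) %>%
  summarise(res = mean(value), a = a[1], b = b[1])

print(out)
```

If the same group value could occur in more than one block, this would merge those blocks, so the gl()-based grouping would still be needed in that case.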

Winners within pairs; or vector-valued group_by mutate?

I'm trying to assess which unit in a pair is the "winner". group_by() %>% mutate() is close to the right thing, but it's not quite there. In particular,
dat %>% group_by(pair) %>% mutate(winner = ifelse(score[1] > score[2], c(1, 0), c(0, 1))) doesn't work.
The below does, but is clunky with an intermediate summary data frame. Can we improve this?
library(tidyverse)
set.seed(343)
# units within pairs get scores
dat <-
tibble(pair = rep(1:3, each = 2),
unit = rep(1:2, 3),
score = rnorm(6))
# figure out who won in each pair
summary_df <-
dat %>%
group_by(pair) %>%
summarize(winner = which.max(score))
# merge back and determine whether each unit won
dat <-
left_join(dat, summary_df, "pair") %>%
mutate(won = as.numeric(winner == unit))
dat
#> # A tibble: 6 x 5
#> pair unit score winner won
#> <int> <int> <dbl> <int> <dbl>
#> 1 1 1 -1.40 2 0
#> 2 1 2 0.523 2 1
#> 3 2 1 0.142 1 1
#> 4 2 2 -0.847 1 0
#> 5 3 1 -0.412 1 1
#> 6 3 2 -1.47 1 0
Created on 2018-09-26 by the reprex package (v0.2.0).
maybe related to Weird group_by + mutate + which.max behavior
You could do:
dat %>%
group_by(pair) %>%
mutate(won = score == max(score),
winner = unit[won == TRUE])
# A tibble: 6 x 5
# Groups: pair [3]
pair unit score won winner
<int> <int> <dbl> <lgl> <int>
1 1 1 -1.40 FALSE 2
2 1 2 0.523 TRUE 2
3 2 1 0.142 TRUE 1
4 2 2 -0.847 FALSE 1
5 3 1 -0.412 TRUE 1
6 3 2 -1.47 FALSE 1
Using rank:
dat %>% group_by(pair) %>% mutate(won = rank(score) - 1)
More for fun (and slightly faster), using the outcome of the comparison (score[1] > score[2]) to index a vector with 'won alternatives' :
dat %>% group_by(pair) %>%
mutate(won = c(0, 1, 0)[1:2 + (score[1] > score[2])])
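As for why the ifelse() attempt in the question does not behave as hoped: ifelse() returns a result the same length as its test, and the test here has length 1, so only the first element of the chosen vector comes back and mutate() then recycles it across the pair. A plain if/else returns the whole vector. The sketch below uses the first pair's scores from the output above.

```r
# Scores for the first pair, taken from the printed output above
score <- c(-1.40, 0.523)

# ifelse() is vectorised over the test, which has length 1 here,
# so only one element of the selected vector is returned
from_ifelse <- ifelse(score[1] > score[2], c(1, 0), c(0, 1))

# if/else evaluates the condition once and returns the full vector
from_if <- if (score[1] > score[2]) c(1, 0) else c(0, 1)

print(from_ifelse)  # 0
print(from_if)      # 0 1
```

Inside a grouped mutate(), replacing ifelse() with the if/else form should therefore give one value per unit in each pair.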
