I have a tibble with student test data, and I wish to convert these to percentiles using dplyr. For the sake of having a minimal example, imagine the following setup of three students.
require(tidyverse)
tbl <- tibble(Name = c("Alice", "Bob", "Cat"), Test = c(16, 13, 15))
The following code works and yields the desired output.
tbl %>% mutate(TestPercentile = cume_dist(Test) * 100)
# A tibble: 3 x 3
Name Test TestPercentile
<chr> <dbl> <dbl>
1 Alice 16 100
2 Bob 13 33.3
3 Cat 15 66.7
However, I actually want to do it programmatically because there are many such columns.
colname <- "Test"
percname <- str_c(colname, "Percentile")
tbl %>% mutate({{percname}} := cume_dist({{colname}}) * 100)
# A tibble: 3 x 3
Name Test TestPercentile
<chr> <dbl> <dbl>
1 Alice 16 100
2 Bob 13 100
3 Cat 15 100
Why does cume_dist make the percentile 100 for all students when I try to use tidy evaluation like this? (And ideally, if I can be permitted a second question, how can I fix it?)
If by programmatically you mean you want to write your own function, you can do it like this:
calculate_percentile <- function(data, colname) {
  data %>%
    # multiply the result of cume_dist() by 100, not its input
    mutate("{{colname}}Percentile" := cume_dist({{colname}}) * 100)
}
tbl %>%
  calculate_percentile(Test)
# A tibble: 3 x 3
Name Test TestPercentile
<chr> <dbl> <dbl>
1 Alice 16 100
2 Bob 13 33.3
3 Cat 15 66.7
Edit for multiple columns
New Data
tbl <- tibble(Name = c("Alice", "Bob", "Cat"), Test = c(16, 13, 15), Test_math = c(16, 30, 55), Test_music = c(3, 78, 34))
calculate_percentile <- function(data, colnames) {
data %>%
mutate(across({{colnames}}, ~cume_dist(.) * 100, .names = "{col}Percentile"))
}
test_columns <- c("Test_math", "Test_music")
tbl %>%
calculate_percentile(test_columns)
# A tibble: 3 x 6
Name Test Test_math Test_music Test_mathPercentile Test_musicPercentile
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Alice 16 16 3 33.3 33.3
2 Bob 13 30 78 66.7 100
3 Cat 15 55 34 100 66.7
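Because test_columns is a character vector, wrapping it in all_of() makes the selection explicit; recent tidyselect versions may otherwise warn about using an external vector in a selection. The following is a minimal sketch of that variant, not the original answer's code. The function name calculate_percentile_chr and the object res are made up for this example, and dplyr 1.0 or later is assumed.

```r
library(dplyr)

tbl <- tibble(
  Name = c("Alice", "Bob", "Cat"),
  Test_math = c(16, 30, 55),
  Test_music = c(3, 78, 34)
)

# Hypothetical variant that expects column names as a character vector
calculate_percentile_chr <- function(data, colnames) {
  data %>%
    mutate(across(all_of(colnames), ~ cume_dist(.x) * 100,
                  .names = "{.col}Percentile"))
}

res <- calculate_percentile_chr(tbl, c("Test_math", "Test_music"))
res
```

With all_of(), a misspelled column name raises an error instead of being silently ignored.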
Why does your solution not work? Because it applies cume_dist literally to the string "Test" rather than to the column of that name:
tbl %>% mutate({{percname}} := print({{colname}}))
[1] "Test"
# A tibble: 3 x 5
Name Test Test_math Test_music TestPercentile
<chr> <dbl> <dbl> <dbl> <chr>
1 Alice 16 16 3 Test
2 Bob 13 30 78 Test
3 Cat 15 55 34 Test
Why does this give a TestPercentile value of 100? Because cume_dist of a single string such as "Test" is 1, and that one value is recycled to every row:
cume_dist("Test")
#[1] 1
So we need to tell R not to evaluate the string "Test" itself but to look for a variable with this name, which we can do like this:
tbl %>% mutate({{percname}} := cume_dist(!!parse_quo(colname, env = global_env())) * 100)
# A tibble: 3 x 5
Name Test Test_math Test_music TestPercentile
<chr> <dbl> <dbl> <dbl> <dbl>
1 Alice 16 16 3 100
2 Bob 13 30 78 33.3
3 Cat 15 55 34 66.7
#Check that this uses the values of "Test" and not "Test" per se:
tbl %>% mutate({{percname}} := print(!!parse_quo(colname, env = global_env())))
[1] 16 13 15
# A tibble: 3 x 5
Name Test Test_math Test_music TestPercentile
<chr> <dbl> <dbl> <dbl> <dbl>
1 Alice 16 16 3 16
2 Bob 13 30 78 13
3 Cat 15 55 34 15
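For a column name stored in a string, the .data pronoun is another way to say "look up the column with this name", without parsing any code. A minimal sketch, assuming dplyr 1.0 or later; the glue-style name on the left of := and the object name result are choices made for this example:

```r
library(dplyr)

tbl <- tibble(Name = c("Alice", "Bob", "Cat"), Test = c(16, 13, 15))

colname <- "Test"
percname <- paste0(colname, "Percentile")

# .data[[colname]] fetches the column whose name is stored in colname
result <- tbl %>%
  mutate("{percname}" := cume_dist(.data[[colname]]) * 100)
result
```

Unlike parse_quo(), this does not depend on which environment the string is evaluated in.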
Passing column name as string:
library(dplyr)
library(rlang)
return_percentile <- function(data, colname) {
percname <- paste0(colname, "Percentile")
data %>% mutate({{percname}} := cume_dist(!!sym(colname)) * 100)
}
tbl %>% return_percentile("Test")
# A tibble: 3 x 3
# Name Test TestPercentile
# <chr> <dbl> <dbl>
#1 Alice 16 100
#2 Bob 13 33.3
#3 Cat 15 66.7
Passing column name unquoted:
return_percentile <- function(data, colname) {
percname <- paste0(deparse(substitute(colname)), "Percentile")
data %>% mutate({{percname}} := cume_dist({{colname}}) * 100)
}
tbl %>% return_percentile(Test)
# A tibble: 3 x 3
# Name Test TestPercentile
# <chr> <dbl> <dbl>
#1 Alice 16 100
#2 Bob 13 33.3
#3 Cat 15 66.7
Let's say I have a dataframe like this:
df <- tibble(ID = c(1, 1, 1, 1, 1), v1 = c(3, 5, 1, 0, 1), v2 = c(10, 6, 1, 20, 23), Time = c(as.POSIXct("1900-01-01 10:00:00"), as.POSIXct("1900-01-01 11:00:00"), as.POSIXct("1900-01-01 13:00:00"), as.POSIXct("1900-01-01 16:00:00"), as.POSIXct("1900-01-01 20:00:00"))) %>% group_by(ID)
# A tibble: 5 x 4
# Groups: ID [1]
ID v1 v2 Time
<dbl> <dbl> <dbl> <dttm>
1 1 3 10 1900-01-01 10:00:00
2 1 5 6 1900-01-01 11:00:00
3 1 1 1 1900-01-01 13:00:00
4 1 0 20 1900-01-01 16:00:00
5 1 1 23 1900-01-01 20:00:00
In words, this is a simple timeseries of a specific ID with two values v1 and v2 per time.
As is quite common in machine learning, I want to aggregate the last n timesteps into one feature vector. For all previous timesteps there should be a time reference in hours giving when this data point occurred. For the first row, where no previous timestep is available, the data should be filled with zeros.
Let's make an example. In this case n=2, that is, I want to aggregate the current time step (t2) and the previous one (t1) together:
# A tibble: 5 x 6
ID v1_t1 v2_t1 time_t1 v1_t2 v2_t2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 0 NA 3 10
2 1 3 10 1 5 6
3 1 5 6 2 1 1
4 1 1 1 3 0 20
5 1 0 20 4 1 23
I want to keep this as generic as possible, so that both n and the number of data columns can change. Any idea how to do this?
Thanks :)
Using dplyr::lag and dplyr::across you could do:
library(dplyr, warn=FALSE)
library(lubridate, warn=FALSE)
df %>%
group_by(ID) %>%
mutate(time_t1 = lubridate::hour(Time) - lag(lubridate::hour(Time))) %>%
mutate(across(c(v1, v2), .fns = list(t2 = ~.x, t1 = ~lag(.x, default = 0)))) %>%
select(-v1, -v2, -Time)
#> # A tibble: 5 × 6
#> # Groups: ID [1]
#> ID time_t1 v1_t2 v1_t1 v2_t2 v2_t1
#> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 NA 3 0 10 0
#> 2 1 1 5 3 6 10
#> 3 1 2 1 5 1 6
#> 4 1 3 0 1 20 1
#> 5 1 4 1 0 23 20
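UPDATE: Here is a more generic approach that uses function factories to create lists of functions, which can then be passed to the .fns argument of across. I haven't tested the more general case, but it should work for any n (the number of lags to include) and for any number of data columns. Note that in this version the suffixes of the value columns count back from the current step, so v1_t1 is the current value and v1_t2 the previous one, the reverse of the naming in the question.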
library(dplyr, warn=FALSE)
library(lubridate, warn=FALSE)
fun_factory1 <- function(n) {
function(x) {
lubridate::hour(x) - lag(lubridate::hour(x), n = n)
}
}
fun_factory2 <- function(n) {
function(x) {
lag(x, n = n, default = 0)
}
}
n <- 2
fns1 <- lapply(seq(n - 1), fun_factory1)
names(fns1) <- paste0("t", seq(n - 1))
fns2 <- lapply(seq(n) - 1, fun_factory2)
names(fns2) <- paste0("t", seq(n))
df %>%
group_by(ID) %>%
mutate(across(Time, .fns = fns1)) %>%
mutate(across(c(v1, v2), .fns = fns2)) %>%
select(-v1, -v2, -Time)
#> # A tibble: 5 × 6
#> # Groups: ID [1]
#> ID Time_t1 v1_t1 v1_t2 v2_t1 v2_t2
#> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 NA 3 0 10 0
#> 2 1 1 5 3 6 10
#> 3 1 2 1 5 1 6
#> 4 1 3 0 1 20 1
#> 5 1 4 1 0 23 20
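One caveat about both versions above: subtracting lubridate::hour() values only works while all timestamps fall on the same day. From 22:00 to 01:00 the next day, for example, the hour difference is -21 rather than 3. A minimal sketch of an alternative using base R's difftime(); the small data set and the name res are made up for this illustration:

```r
library(dplyr)

# Illustrative data that crosses midnight
df_times <- tibble(
  ID = c(1, 1, 1),
  Time = as.POSIXct(
    c("1900-01-01 22:00:00", "1900-01-02 01:00:00", "1900-01-02 03:00:00"),
    tz = "UTC"
  )
)

res <- df_times %>%
  group_by(ID) %>%
  # elapsed time since the previous row, in hours; NA for the first row
  mutate(time_t1 = as.numeric(difftime(Time, lag(Time), units = "hours"))) %>%
  ungroup()
res
```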
I have a tibble with columns named as numbers (e.g. 1). I created a function to compute differences between columns, but I don't know how to do it with that type of columns:
library(tidyverse)
df <- tibble(`1` = c(1,2,3), `2` = c(2,4,6))
# This works
df %>%
mutate(diff = `1` - `2`)
#> # A tibble: 3 x 3
#> `1` `2` diff
#> <dbl> <dbl> <dbl>
#> 1 1 2 -1
#> 2 2 4 -2
#> 3 3 6 -3
# But this doesn't
calc_diffs <- function(x, y){
df %>%
mutate(diff := !!x - !!y)
}
calc_diffs(1, 2)
#> # A tibble: 3 x 3
#> `1` `2` diff
#> <dbl> <dbl> <dbl>
#> 1 1 2 -1
#> 2 2 4 -1
#> 3 3 6 -1
Created on 2020-10-14 by the reprex package (https://reprex.tidyverse.org) (v0.3.0)
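With !!x where x is the number 1, the value 1 itself is inlined, so the function computes 1 - 2 = -1 for every row. Instead, we can convert a string to a symbol and evaluate it: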
calc_diffs <- function(x, y){
df %>%
mutate(diff := !! rlang::sym(x) - !!rlang::sym(y))
}
Then, we just pass a string as argument
calc_diffs("1", "2")
# A tibble: 3 x 3
# `1` `2` diff
# <dbl> <dbl> <dbl>
#1 1 2 -1
#2 2 4 -2
#3 3 6 -3
Column names are strings. We could pass an index to subset the column, but here the column name is an unusual one that starts with a number. So we can either wrap it in backticks using paste, or just pass a string, convert it to a symbol and evaluate it (!!).
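If the function should also accept plain numbers such as 1 and 2, one possibility is to convert the arguments to strings and look the columns up with the .data pronoun. This is only a sketch under that assumption; passing the data frame as an explicit data argument is a change from the function above, made here so the example is self-contained.

```r
library(dplyr)

df <- tibble(`1` = c(1, 2, 3), `2` = c(2, 4, 6))

calc_diffs <- function(data, x, y) {
  # accept either 1 or "1" as a column name
  x <- as.character(x)
  y <- as.character(y)
  data %>%
    mutate(diff = .data[[x]] - .data[[y]])
}

res <- calc_diffs(df, 1, 2)
res
```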
Does this work:
> df <- tibble(`1` = c(1,2,3), `2` = c(2,4,6))
> df
# A tibble: 3 x 2
`1` `2`
<dbl> <dbl>
1 1 2
2 2 4
3 3 6
> calc_diffs <- function(x, y){
+ df %>%
+ mutate(diff = {{x}} - {{y}})
+ }
> calc_diffs(`1`,`2`)
# A tibble: 3 x 3
`1` `2` diff
<dbl> <dbl> <dbl>
1 1 2 -1
2 2 4 -2
3 3 6 -3
I am using the newly minted dplyr 1.0.0 and the digest package to generate a hash of every row in a tibble.
I am aware of the question "adding hash to each row using dplyr and digest in R", but I would like to use the revamped rowwise() in dplyr 1.0.0.
See the example below. Does anyone have any idea why it fails? I should be allowed to digest a row where the entries are of different types.
library(dplyr)
library(digest)
df <- tibble(
student_id = letters[1:4],
student_id2 = letters[9:12],
test1 = 10:13,
test2 = 20:23,
test3 = 30:33,
test4 = 40:43
)
df
#> # A tibble: 4 x 6
#> student_id student_id2 test1 test2 test3 test4
#> <chr> <chr> <int> <int> <int> <int>
#> 1 a i 10 20 30 40
#> 2 b j 11 21 31 41
#> 3 c k 12 22 32 42
#> 4 d l 13 23 33 43
dd <- df %>%
rowwise(student_id) %>%
mutate(hash = digest(c_across(everything()))) %>%
ungroup
#> Error: Problem with `mutate()` input `hash`.
#> ✖ Can't combine `student_id2` <character> and `test1` <integer>.
#> ℹ Input `hash` is `digest(c_across(everything()))`.
#> ℹ The error occured in row 1.
### but digest should not care too much about the type of the input
Created on 2020-06-04 by the reprex package (v0.3.0)
The problem is that c_across() has to combine the selected columns into a single vector, and it refuses to combine character and integer values. One option is to first convert all columns to a single type and then do the rowwise operation:
library(dplyr)
library(digest)
df %>%
mutate(across(everything(), as.character)) %>%
rowwise %>%
mutate(hash = digest(c_across(everything())))
# A tibble: 4 x 7
# Rowwise:
# student_id student_id2 test1 test2 test3 test4 hash
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 a i 10 20 30 40 2638067de6dcfb3d58b83a83e0cd3089
#2 b j 11 21 31 41 21162fc0c528a6550b53c87ca0c2805e
#3 c k 12 22 32 42 8d7539eacff61efbd567b6100227523b
#4 d l 13 23 33 43 9739997605aa39620ce50e96f1ff4f70
Or another option is to unite the columns to a single one and then do the digest on that column
library(tidyr)
df %>%
unite(new, everything(), remove = FALSE) %>%
rowwise %>%
mutate(hash = digest(new)) %>%
select(-new)
# A tibble: 4 x 7
# Rowwise:
# student_id student_id2 test1 test2 test3 test4 hash
# <chr> <chr> <int> <int> <int> <int> <chr>
#1 a i 10 20 30 40 a9e4cafdfbc88f17b7593dfd684eb2a1
#2 b j 11 21 31 41 a67a5df8186972285bd7be59e6fdab38
#3 c k 12 22 32 42 9c20bd87a50642631278b3e6d28ecf68
#4 d l 13 23 33 43 3f4f373d1969dcf0c8f542023a258225
Another option is pmap, where we concatenate the elements of each row into a single vector with c(). Because an atomic vector can hold only one type, the integers are coerced to character inside c(), while the columns of df themselves are left unchanged:
library(purrr)
df %>%
mutate(hash = pmap_chr(., ~ digest(c(...))))
# A tibble: 4 x 7
# student_id student_id2 test1 test2 test3 test4 hash
# <chr> <chr> <int> <int> <int> <int> <chr>
#1 a i 10 20 30 40 f0fb4100907570ef9bda073b78dc44a6
#2 b j 11 21 31 41 754b09e8d4d854aa5e40aa88d1edfc66
#3 c k 12 22 32 42 5f3a699caff833e900fd956232cf61dd
#4 d l 13 23 33 43 4d31c65284e5db36c37461126a9eb63c
The advantage here is that we are not changing the column types.
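If the hash should also reflect the type of each entry (so that the integer 10 and the string "10" hash differently), one variation is to digest a list of the row's values instead of a coerced vector. This is a sketch, not part of the answer above; it uses a shortened version of the example data, and the name hashed is made up here.

```r
library(dplyr)
library(purrr)
library(digest)

df <- tibble(
  student_id = letters[1:4],
  test1 = 10:13,
  test2 = 20:23
)

hashed <- df %>%
  # list(...) keeps each value's original type, so nothing is coerced
  mutate(hash = pmap_chr(df, function(...) digest(list(...))))
hashed
```

digest() serializes whatever R object it is given, so a list of mixed types is valid input. The resulting hashes differ from those produced by the c(...) version.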
I have some data in the format below, where all columns are of type chr.
#> # A tibble: 3 x 4
#> id age name income
#> <chr> <chr> <chr> <chr>
#> 1 1 18 jim 100
#> 2 2 21 bob 200
#> 3 3 16 alice 300
I'd like to use as.numeric() on only some columns. Preferably, I'd like to define a vector of column names and then use purrr::map to map as.numeric() to only those columns:
numeric_variables <- c("id", "age", "income")
How can I map that?
My desired output would look like:
df
#> # A tibble: 3 x 4
#> id age name income
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 18 jim 100
#> 2 2 21 bob 200
#> 3 3 16 alice 300
Code for data entry below.
library(purrr)
df <- data.frame(stringsAsFactors=FALSE,
id = c(1, 2, 3),
age = c(18, 21, 16),
name = c("jim", "bob", "alice"),
income = c(100, 200, 300)
)
df <- map_df(df, as.character)
df
Created on 2020-02-15 by the reprex package (v0.3.0)
We can use mutate_at
library(dplyr)
df %>%
mutate_at(vars(numeric_variables), as.numeric) %>%
as_tibble
# A tibble: 3 x 4
# id age name income
# <dbl> <dbl> <chr> <dbl>
#1 1 18 jim 100
#2 2 21 bob 200
#3 3 16 alice 300
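In dplyr 1.0 and later, mutate_at() is superseded by across(). A minimal sketch of the same conversion with across() and all_of(), assuming the all-character data from the question (rebuilt here as a tibble, with the result stored under the made-up name converted):

```r
library(dplyr)

df <- tibble(
  id = c("1", "2", "3"),
  age = c("18", "21", "16"),
  name = c("jim", "bob", "alice"),
  income = c("100", "200", "300")
)
numeric_variables <- c("id", "age", "income")

# all_of() selects exactly the columns named in the character vector
converted <- df %>%
  mutate(across(all_of(numeric_variables), as.numeric))
converted
```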
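Or more easily, if every column that looks numeric should be converted (type.convert is not restricted to numeric_variables, and whole numbers come back as integer rather than double):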
df %>%
type.convert(as.is = TRUE)
Or with map
library(purrr)
df %>%
map_if(names(.) %in% numeric_variables, as.numeric) %>%
bind_cols
# A tibble: 3 x 4
# id age name income
# <dbl> <dbl> <chr> <dbl>
#1 1 18 jim 100
#2 2 21 bob 200
#3 3 16 alice 300
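Or, with the compound assignment operator (%<>%) from magrittr, the result can be assigned back to df in place. Only the first pipe in the chain should be %<>%; the later steps use the ordinary %>%:
library(magrittr)
df %<>%
  map_if(names(.) %in% numeric_variables, as.numeric) %>%
  bind_cols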
str(df)
#tibble [3 × 4] (S3: tbl_df/tbl/data.frame)
# $ id : num [1:3] 1 2 3
# $ age : num [1:3] 18 21 16
# $ name : chr [1:3] "jim" "bob" "alice"
# $ income: num [1:3] 100 200 300
You can use map_at
df[] <- purrr::map_at(df, numeric_variables, as.numeric)
df
# A tibble: 3 x 4
# id age name income
# <dbl> <dbl> <chr> <dbl>
#1 1 18 jim 100
#2 2 21 bob 200
#3 3 16 alice 300
Consider the following:
df <- data.frame(
Name = c("Alan", "Bob", "Christine", "David", "Erica"),
Gender = c("M", "M", "F", "M", "F"),
Star_Sign = c("Aquarius", "Capricorn", "Aquarius", "Libra", "Leo"),
City = c("London", "Paris", "Berlin", "London", "Paris"),
Blood_Group = c("A", "AB", "B", "O", "A"),
Hours_Worked = c(2000, 1600, 0, 100, 200),
Salary = c(100000, 20000, 0, 500, 4000)
)
Name_Summary <- df %>% group_by(Name) %>% summarise(Hours_Worked = sum(Hours_Worked), Average_Salary = mean(Salary))
Gender_Summary <- df %>% group_by(Gender) %>% summarise(Hours_Worked = sum(Hours_Worked), Average_Salary = mean(Salary))
Star_Sign_Summary <- df %>% group_by(Star_Sign) %>% summarise(Hours_Worked = sum(Hours_Worked), Average_Salary = mean(Salary))
City_Summary <- df %>% group_by(City) %>% summarise(Hours_Worked = sum(Hours_Worked), Average_Salary = mean(Salary))
Blood_Group_Summary <- df %>% group_by(Blood_Group) %>% summarise(Hours_Worked = sum(Hours_Worked), Average_Salary = mean(Salary))
Obviously this works fine for a small number of fields. If, however, I've got 100 different fields (say) to do this for, it becomes very unwieldy.
I'd like to think that there is a way to loop through the list of fields and produce these summaries for each field, using some code to generate (and name) the summaries, but I don't know how to do this. Can anyone help, please?
Thanks
Alan
If you have a list of the columns you want to group by as a character vector:
vars_to_group_by <- names(df)[1:5]
You could iterate over them (I'm using purrr::map() but you could use lapply() or a loop), and use this rlang pattern to convert strings >> symbols >> properly evaluated variables.
library(tidyverse)
map(vars_to_group_by, sym) %>%
map(~ df %>%
group_by(!!.x) %>%
summarise(avg_salary = mean(Salary),
avg_hours = mean(Hours_Worked),
avg_hourly_wage = avg_salary / avg_hours))
You get an unnamed list back, because the vector going in was unnamed.
[[1]]
# A tibble: 5 x 4
Name avg_salary avg_hours avg_hourly_wage
<fct> <dbl> <dbl> <dbl>
1 Alan 100000 2000 50
2 Bob 20000 1600 12.5
3 Christine 0 0 NaN
4 David 500 100 5
5 Erica 4000 200 20
[[2]]
# A tibble: 2 x 4
Gender avg_salary avg_hours avg_hourly_wage
<fct> <dbl> <dbl> <dbl>
1 F 2000 100 20
2 M 40167. 1233. 32.6
[[3]]
# A tibble: 4 x 4
Star_Sign avg_salary avg_hours avg_hourly_wage
<fct> <dbl> <dbl> <dbl>
1 Aquarius 50000 1000 50
2 Capricorn 20000 1600 12.5
3 Leo 4000 200 20
4 Libra 500 100 5
[[4]]
# A tibble: 3 x 4
City avg_salary avg_hours avg_hourly_wage
<fct> <dbl> <dbl> <dbl>
1 Berlin 0 0 NaN
2 London 50250 1050 47.9
3 Paris 12000 900 13.3
[[5]]
# A tibble: 4 x 4
Blood_Group avg_salary avg_hours avg_hourly_wage
<fct> <dbl> <dbl> <dbl>
1 A 52000 1100 47.3
2 AB 20000 1600 12.5
3 B 0 0 NaN
4 O 500 100 5
You could add names based on vars_to_group_by either before or after the map() calls.
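As a sketch of the "before" option: purrr::set_names() called without further arguments names a character vector after its own values, and map() carries those names through to the result. Only two grouping columns and one summary column are used here to keep the example short; the object name summaries is made up.

```r
library(dplyr)
library(purrr)

df <- data.frame(
  Name = c("Alan", "Bob", "Christine", "David", "Erica"),
  Gender = c("M", "M", "F", "M", "F"),
  City = c("London", "Paris", "Berlin", "London", "Paris"),
  Salary = c(100000, 20000, 0, 500, 4000)
)
vars_to_group_by <- c("Gender", "City")

summaries <- vars_to_group_by %>%
  set_names() %>%   # names become "Gender" and "City"
  map(~ df %>%
        group_by(!!rlang::sym(.x)) %>%
        summarise(avg_salary = mean(Salary)))

names(summaries)
summaries$Gender
```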
We could use group_by_at, which can take a string as input:
library(purrr)
library(dplyr)
map(names(df)[1:5], ~ df %>%
group_by_at(.x) %>%
summarise(avg_salary = mean(Salary)))
#[[1]]
# A tibble: 5 x 2
# Name avg_salary
# <fct> <dbl>
#1 Alan 100000
#2 Bob 20000
#3 Christine 0
#4 David 500
#5 Erica 4000
#[[2]]
# A tibble: 2 x 2
# Gender avg_salary
# <fct> <dbl>
#1 F 2000
#2 M 40167.
#[[3]]
# A tibble: 4 x 2
# Star_Sign avg_salary
# <fct> <dbl>
#1 Aquarius 50000
#2 Capricorn 20000
#3 Leo 4000
#4 Libra 500
#[[4]]
# A tibble: 3 x 2
# City avg_salary
# <fct> <dbl>
#1 Berlin 0
#2 London 50250
#3 Paris 12000
#[[5]]
# A tibble: 4 x 2
# Blood_Group avg_salary
# <fct> <dbl>
#1 A 52000
#2 AB 20000
#3 B 0
#4 O 500
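group_by_at() is superseded in dplyr 1.0 and later; the equivalent there is group_by(across(all_of(...))). The sketch below assumes that version and uses the summary columns from the question (sum of hours, mean salary). The names grouping_columns and summaries are made up for the example.

```r
library(dplyr)
library(purrr)

df <- data.frame(
  Name = c("Alan", "Bob", "Christine", "David", "Erica"),
  Gender = c("M", "M", "F", "M", "F"),
  Star_Sign = c("Aquarius", "Capricorn", "Aquarius", "Libra", "Leo"),
  City = c("London", "Paris", "Berlin", "London", "Paris"),
  Blood_Group = c("A", "AB", "B", "O", "A"),
  Hours_Worked = c(2000, 1600, 0, 100, 200),
  Salary = c(100000, 20000, 0, 500, 4000)
)
grouping_columns <- c("Gender", "City")

summaries <- map(grouping_columns, ~ df %>%
  group_by(across(all_of(.x))) %>%   # .x is a column name as a string
  summarise(Hours_Worked = sum(Hours_Worked),
            Average_Salary = mean(Salary)))
summaries
```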