Reverse quantile for scores in tidyverse - r

I want to assign quantile ranks to my data, but the highest score currently gets the highest rank. I want the highest score to get the lowest rank (i.e., 1).
The following code gives the highest score a high rank, which is not what I want:
M %>%
mutate(quantile = ntile(Score, 10))
I have tried reverse = TRUE and include.lowest = TRUE, but neither worked. Can you help me? I felt that it was not necessary to provide a sample of data.

Just negate the values:
M %>% mutate(quantile = ntile(-Score, 10))
Example:
df1 %>%
mutate(quantile = ntile(-score, 3))
# # A tibble: 6 x 4
# ntile date score quantile
# <int> <date> <dbl> <int>
# 1 1 2005-08-31 -2.39 3
# 2 1 2005-09-30 0.573 2
# 3 1 2005-10-31 -1.61 3
# 4 1 2005-11-30 5.43 1
# 5 1 2005-12-31 0.106 2
# 6 1 2006-01-31 6.66 1
Toy Data:
df1 <- structure(list(ntile = c(1L, 1L, 1L, 1L, 1L, 1L),
date = structure(c(13026, 13056, 13087, 13117, 13148, 13179),
class = "Date"),
score = c(-2.38916419707325, 0.572675136581781, -1.61130358515631,
5.42706994951004, 0.105533424368025, 6.65697289481407)),
row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
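As a small check of the negation trick, the sketch below applies ntile() to a plain vector holding the toy scores (rounded, which is an assumption made for brevity) and compares it with dplyr's desc(), which expresses the same intent. The variable names q_neg and q_desc are made up for illustration.

```r
library(dplyr)

# Scores rounded from the toy data above
score <- c(-2.39, 0.573, -1.61, 5.43, 0.106, 6.66)

# Negating the values reverses the ranking, so the top scores land in bucket 1
q_neg <- ntile(-score, 3)

# desc() is another way to ask for descending order
q_desc <- ntile(desc(score), 3)

print(q_neg)
print(identical(q_neg, q_desc))
```

Either spelling should work inside mutate(); desc() may read more clearly when the column is not obviously numeric.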

Related

Returning group maximum and NA using dplyr

I need a function I can use that returns both the group maximum and any NA values. Here is toy data:
df <- data.frame(id = rep(1:5,
each = 3),
score = rnorm(15))
df$score[c(3,7,10,14)] <- NA
# id score
# 1 1 -1.4666164
# 2 1 0.4392647
# 3 1 NA
# 4 2 -0.6010311
# 5 2 1.9845774
# 6 2 0.1749082
# 7 3 NA
# 8 3 -0.3089731
# 9 3 0.4427471
# 10 4 NA
# 11 4 1.7156319
# 12 4 -0.2354253
# 13 5 1.1781350
# 14 5 NA
# 15 5 0.0642082
I can use slice_max to get the maximum in each group:
df %>%
group_by(id) %>%
slice_max(score)
# id score
# <int> <dbl>
# 1 1 0.439
# 2 2 1.98
# 3 3 0.443
# 4 4 1.72
# 5 5 1.18
But how do I get the maximum plus any NAs returned?
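We can group_by the id column, then use summarize to output the summaries with max. Here max is called twice: once with na.rm = T, which returns the largest non-missing value, and once without, which returns NA whenever the group contains an NA. union() then combines the two results and drops duplicates, so a group without NAs yields a single value.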
library(dplyr)
df %>%
group_by(id) %>%
summarize(score = union(max(score, na.rm = T), max(score)))
UPDATE: The above code only works if you have at most one NA per ID, because union() keeps a single NA however many the group contains. Thanks @KU99 for the reminder.
If you have more than one NA per ID, you need to combine the result of max with the NA records found by is.na().
df %>%
group_by(id) %>%
summarize(score = c(max(score, na.rm = T), score[is.na(score)]))
Result
# A tibble: 9 × 2
# Groups: id [5]
id score
<int> <dbl>
1 1 0.735
2 1 NA
3 2 0.314
4 3 0.994
5 3 NA
6 4 0.847
7 4 NA
8 5 1.95
9 5 NA
Data
df <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L, 5L), score = c(-1.05089006245306, 0.734652105895187,
NA, -1.31427279695036, -0.250038722057874, 0.314204596436828,
NA, 0.994420599790523, 0.855768431757766, NA, 0.834325037545013,
0.846790152407738, 1.95410525460771, NA, 0.971120269710021)), row.names = c(NA,
-15L), class = "data.frame")
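To see the difference between the two summaries on a single group, here is a minimal base R sketch. The vector score is made-up illustration data with two NAs, not taken from the data above.

```r
# One hypothetical group containing two missing values
score <- c(0.5, NA, 1.2, NA)

# union() removes duplicates, so only one NA survives
with_union <- union(max(score, na.rm = TRUE), max(score))

# c() with score[is.na(score)] keeps every NA
with_c <- c(max(score, na.rm = TRUE), score[is.na(score)])

print(with_union)
print(with_c)
```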
One option would be to use slice and | to create a logical condition with is.na to return the NA rows and the max rows.
library(dplyr)
df %>%
group_by(id) %>%
slice(which(score == max(score, na.rm = T)|is.na(score)))
Another option would be to use slice_max as you did and then use bind_rows to add the NA rows back to the data frame.
library(dplyr)
df %>%
group_by(id) %>%
slice_max(score) %>%
bind_rows(df %>% filter(is.na(score))) %>%
arrange(id)
Output
id score
<int> <dbl>
1 1 -0.161
2 1 NA
3 2 1.49
4 3 -0.451
5 3 NA
6 4 0.878
7 4 NA
8 5 -0.0652
9 5 NA
Data
df <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L, 5L), score = c(-0.161217942983375, -0.456571996252207,
NA, 0.540071362460494, 1.49325799630099, -0.17985218510166, NA,
-0.451301758592, -0.839100876399644, NA, -0.0432130218441599,
0.87779273806634, -0.339260854059069, NA, -0.065177224102029)), row.names = c(NA,
-15L), class = "data.frame")
Using a custom function you could do:
library(dplyr)
set.seed(123)
slice_max_na <- function(.data, order_by, ..., n, prop, with_ties = TRUE) {
bind_rows(
slice_max(.data, order_by = {{order_by}}, ..., n = n, prop = prop, with_ties = with_ties),
filter(.data, is.na({{order_by}})),
)
}
df %>%
group_by(id) %>%
slice_max_na(score)
#> # A tibble: 9 × 2
#> # Groups: id [5]
#> id score
#> <int> <dbl>
#> 1 1 -0.230
#> 2 2 1.72
#> 3 3 -0.687
#> 4 4 1.22
#> 5 5 0.401
#> 6 1 NA
#> 7 3 NA
#> 8 4 NA
#> 9 5 NA
Here is another dplyr version, using rank:
library(dplyr)
df %>%
group_by(id) %>%
mutate(rank = rank(-score, ties.method = "random")) %>%
filter(rank == 1 | is.na(score)) %>%
select(-rank)
id score
<int> <dbl>
1 1 0.505
2 1 NA
3 2 -0.109
4 3 NA
5 3 1.45
6 4 NA
7 4 0.355
8 5 NA
9 5 -0.298
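The rank-based filter relies on two base R behaviours: rank() leaves NA in place by default (na.last = "keep"), and NA | TRUE evaluates to TRUE. A minimal sketch on a made-up vector, using the default ties method rather than "random" so the result is deterministic:

```r
# Hypothetical scores for one group
score <- c(0.5, NA, 1.2)

# Rank in descending order; the NA keeps an NA rank
r <- rank(-score)

# Keep the top-ranked value and any missing values
keep <- r == 1 | is.na(score)

print(r)
print(score[keep])
```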

Coalescing multiple columns from both the left and right side

Given the following data
df1 <- structure(list(ID = 1:3, alpha_1 = c(2L, 2L, 3L),
alpha_2 = c(1L, 2L,
3L), alpha_3 = c(4L, 4L, 2L), alpha_4 = c(3L, NA, NA), beta_1 = c(NA,
2L, NA), beta_2 = c(3L, NA, 2L), charlie_1 = c(1L, NA, 1L), charlie_2 = c(NA,
2L, NA)), class = "data.frame", row.names = c(NA, -3L))
I'm trying to coalesce all columns sharing the same initial prefix name (i.e. coalesce alpha_1, alpha_2, alpha_3, alpha_4, and coalesce beta_1 beta_2, etc.), but from both the left and right sides. That is, I want to generate two new variables, say 'alpha_left' and 'alpha_right', whose columns would be, in this example, (2, 2, 3) and (3, 4, 2) respectively (first non-missing elements from the left and right side of the dataframe).
User @akrun offered a great solution for the coalescing part here, but I'm unsure how to create two new variables from both the left and right coalesces.
Here is an option in tidyverse:
1. Reshape to 'long' format with pivot_longer
2. Group by 'ID'
3. summarise across the columns 'alpha' to 'charlie'
4. Get the column name with cur_column()
5. Create a tibble with the first non-NA element from the left and from the right
6. Rename the columns by prepending 'nm1' as a prefix
7. Finally, unnest the list columns created in summarise
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
pivot_longer(cols = contains("_"),
names_to = c( ".value", "grp"), names_sep = "_") %>%
group_by(ID) %>%
summarise(across(alpha:charlie, ~ {
nm1 <- cur_column()
tbl1 <- tibble(left= .[complete.cases(.)][1],
right = rev(.)[complete.cases(rev(.))][1]);
names(tbl1) <- str_c(nm1, "_", names(tbl1))
list(tbl1)})) %>%
unnest(c(alpha, beta, charlie))
-output
# A tibble: 3 x 7
ID alpha_left alpha_right beta_left beta_right charlie_left charlie_right
<int> <int> <int> <int> <int> <int> <int>
1 1 2 3 3 3 1 1
2 2 2 4 2 2 2 2
3 3 3 2 2 2 1 1
Or using base R
lst1 <- lapply(split.default(df1[-1], sub("_\\d+$", "", names(df1)[-1])),
function(x) {
x1 <- apply(x, 1, function(y) {
y1 <- na.omit(y)
if(length(y1) > 1 ) y1[c(1, length(y1))] else y1[1]
})
if(is.vector(x1)) as.data.frame(matrix(x1)) else as.data.frame(t(x1))
})
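The idea behind these answers, taking the first non-NA value of each row read left to right and then right to left, can be sketched in base R for the alpha columns alone. The helper name first_non_na is hypothetical, and the data frame repeats only the alpha columns of df1.

```r
# The alpha columns from the question's data
alpha <- data.frame(alpha_1 = c(2L, 2L, 3L),
                    alpha_2 = c(1L, 2L, 3L),
                    alpha_3 = c(4L, 4L, 2L),
                    alpha_4 = c(3L, NA, NA))

# Hypothetical helper: first non-missing element of a vector
first_non_na <- function(x) x[!is.na(x)][1]

# Coalesce from the left, then from the right by reversing each row
alpha_left  <- unname(apply(alpha, 1, first_non_na))
alpha_right <- unname(apply(alpha, 1, function(x) first_non_na(rev(x))))

print(alpha_left)
print(alpha_right)
```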
You could also do:
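library(tidyverse)
df1[-1] %>%
split.default(sub("_\\d+", "", names(.))) %>%
# coalesce(!!!.x) takes the first non-NA value from the left; reversing the columns takes it from the right
imap_dfc(~data.frame(left = coalesce(!!!.x),
right = coalesce(!!!rev(.x))) %>%
set_names(paste(.y, names(.), sep="_")))
alpha_left alpha_right beta_left beta_right charlie_left charlie_right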
1 2 3 3 3 1 1
2 2 4 2 2 2 2
3 3 2 2 2 1 1
One more approach, not as elegant as @Onyambu's:
library(tidyverse)
df1[-1] %>%
split.default(sub("_\\d+", "", names(.))) %>%
imap_dfc(~ .x %>% rowwise() %>%
mutate(!!paste0(.y, '_left') := head(na.omit(c_across(everything())),1),
!!paste0(.y, '_right') := tail(na.omit(c_across(!last_col())),1),
.keep = 'none' )
)
#> # A tibble: 3 x 6
#> # Rowwise:
#> alpha_left alpha_right beta_left beta_right charlie_left charlie_right
#> <int> <int> <int> <int> <int> <int>
#> 1 2 3 3 3 1 1
#> 2 2 4 2 2 2 2
#> 3 3 2 2 2 1 1
Created on 2021-06-19 by the reprex package (v2.0.0)
Another option
library(tidyverse)
df1 <- structure(list(ID = 1:3, alpha_1 = c(2L, 2L, 3L),
alpha_2 = c(1L, 2L,
3L), alpha_3 = c(4L, 4L, 2L), alpha_4 = c(3L, NA, NA), beta_1 = c(NA,
2L, NA), beta_2 = c(3L, NA, 2L), charlie_1 = c(1L, NA, 1L), charlie_2 = c(NA,
2L, NA)), class = "data.frame", row.names = c(NA, -3L))
df1 %>%
pivot_longer(cols = -ID, names_sep = "_", names_to = c(".value", "set")) %>%
group_by(ID) %>%
fill(alpha:charlie, .direction = "updown") %>%
filter(set %in% range(set)) %>%
mutate(set = c("left", "right")) %>%
pivot_wider(id_cols = ID, names_from = set, values_from = alpha:charlie)
#> # A tibble: 3 x 7
#> # Groups: ID [3]
#> ID alpha_left alpha_right beta_left beta_right charlie_left charlie_right
#> <int> <int> <int> <int> <int> <int> <int>
#> 1 1 2 3 3 3 1 1
#> 2 2 2 4 2 2 2 2
#> 3 3 3 2 2 2 1 1
Created on 2021-06-20 by the reprex package (v2.0.0)

Pivot Wide with Custom Names, Original Values in the cell

I have data that is set up like the following - the CODE variable is character and needs to remain as it is because the numbers have meaning.
ID CODE
1 1.0
1 0.00
1 9.99
2 40.56
3 33.54
3 0.00
How would I use pivot_wider to rearrange it like the following, so that there are 4 CODE columns and, if an ID has no fourth code, the cell is just left blank (shown here as ".")?
ID CODE_1 CODE_2 CODE_3 CODE_4
1 1.0 0.00 9.99 "."
2 40.56 "." "." "."
3 33.54 0.00 "." "."
Thank you!
This approach gets close to what you want. You can use the tidyr function complete() to add the levels that are not present in your original values. Here is the code:
library(tidyverse)
#Code
df <- df %>% group_by(ID) %>% mutate(Var=factor(paste0('CODE_',row_number()),
levels = paste0('CODE_',1:4),
labels = paste0('CODE_',1:4),ordered = T,
exclude = F)) %>%
complete(Var = Var) %>%
pivot_wider(names_from = Var,values_from=CODE)
Output:
# A tibble: 3 x 5
# Groups: ID [3]
ID CODE_1 CODE_2 CODE_3 CODE_4
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1 0 9.99 NA
2 2 40.6 NA NA NA
3 3 33.5 0 NA NA
Some data used:
#Data
df <- structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L), CODE = c(1, 0,
9.99, 40.56, 33.54, 0)), class = "data.frame", row.names = c(NA,
-6L))
If you really want dots for missing values, you have to convert the variables to character and then replace the NAs, like this:
#Code 2
df <- df %>% group_by(ID) %>% mutate(Var=factor(paste0('CODE_',row_number()),
levels = paste0('CODE_',1:4),
labels = paste0('CODE_',1:4),ordered = T,
exclude = F)) %>%
complete(Var = Var) %>%
pivot_wider(names_from = Var,values_from=CODE) %>%
mutate(across(CODE_1:CODE_4,~as.character(.))) %>%
replace(is.na(.),'.')
Output:
# A tibble: 3 x 5
# Groups: ID [3]
ID CODE_1 CODE_2 CODE_3 CODE_4
<int> <chr> <chr> <chr> <chr>
1 1 1 0 9.99 .
2 2 40.56 . . .
3 3 33.54 0 . .
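Since the question says CODE is character and must stay that way, a variation is to keep it as character from the start, so that values such as "1.0" and "0.00" are not reformatted. The sketch below is one possible way, assuming a tidyr version whose pivot_wider() accepts a scalar values_fill. The data frame recreates the question's table with CODE as strings, and the loop that adds any missing CODE_ column is an illustrative addition, not part of the answer above.

```r
library(dplyr)
library(tidyr)

# The question's data, with CODE kept as character
df <- data.frame(ID = c(1L, 1L, 1L, 2L, 3L, 3L),
                 CODE = c("1.0", "0.00", "9.99", "40.56", "33.54", "0.00"),
                 stringsAsFactors = FALSE)

wide <- df %>%
  group_by(ID) %>%
  mutate(name = paste0("CODE_", row_number())) %>%  # CODE_1, CODE_2, ... within each ID
  ungroup() %>%
  pivot_wider(names_from = name, values_from = CODE, values_fill = ".")

# Add any of CODE_1..CODE_4 that no ID reached, filled with "."
for (nm in setdiff(paste0("CODE_", 1:4), names(wide))) {
  wide[[nm]] <- "."
}

print(wide)
```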
We can use dcast from data.table
library(data.table)
dcast(setDT(df), ID ~ paste0("CODE_", rowid(ID)), value.var = 'CODE')
# ID CODE_1 CODE_2 CODE_3
#1: 1 1.00 0 9.99
#2: 2 40.56 NA NA
#3: 3 33.54 0 NA
data
df <- structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L), CODE = c(1, 0,
9.99, 40.56, 33.54, 0)), class = "data.frame", row.names = c(NA,
-6L))

Pmax of columns ending with a given string

I would like to conditionally mutate a new column representing the pmax() of columns ending with "_n" for a given row. I know I can do this by explicitly specifying the column names, but I would prefer to have this be the result of a call to ends_with() or similar.
I have tried mutate_at() and plain mutate(). My general thought is that I need to pass a vars(ends_with("_n")) to something, but I'm just missing that something.
Thanks in advance.
library(dplyr)
library(tidyr)
mtcars %>%
group_by(vs, gear) %>%
summarize(mean = mean(disp),
sd = sd(disp),
n = n()) %>%
mutate_if(is.double, round, 1) %>%
mutate(mean_sd = paste0(mean, " (", sd, ")")) %>%
select(-mean, -sd) %>%
group_by(vs, gear) %>%
nest(n, mean_sd, .key = "summary") %>%
spread(key = vs, value = summary) %>%
unnest(`0`, `1`, .sep = "_")
gear `0_n` `0_mean_sd` `1_n` `1_mean_sd`
<dbl> <int> <chr> <int> <chr>
1 3 12 357.6 (71.8) 3 201 (72)
2 4 2 160 (0) 10 115.6 (38.5)
3 5 4 229.3 (113.9) 1 95.1 (NA)
edit: both answers are much appreciated. Cheers!
Here's one way using the unquote-splice operator. We can select columns that we want to compare and then splice them as vectors into pmax:
library(tidyverse)
tbl <- structure(list(gear = c(3, 4, 5), `0_n` = c(12L, 2L, 4L), `0_mean_sd` = c("357.6 (71.8)", "160 (0)", "229.3 (113.9)"), `1_n` = c(3L, 10L, 1L), `1_mean_sd` = c("201 (72)", "115.6 (38.5)", "95.1 (NA)")), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))
tbl %>%
mutate(pmax = pmax(!!!select(., ends_with("_n"))))
#> # A tibble: 3 x 6
#> gear `0_n` `0_mean_sd` `1_n` `1_mean_sd` pmax
#> <dbl> <int> <chr> <int> <chr> <int>
#> 1 3 12 357.6 (71.8) 3 201 (72) 12
#> 2 4 2 160 (0) 10 115.6 (38.5) 10
#> 3 5 4 229.3 (113.9) 1 95.1 (NA) 4
Created on 2019-04-23 by the reprex package (v0.2.1)
A base R version, just as an alternative:
tbl <- structure(list(gear = c(3, 4, 5), `0_n` = c(12L, 2L, 4L), `0_mean_sd` = c("357.6 (71.8)", "160 (0)", "229.3 (113.9)"), `1_n` = c(3L, 10L, 1L), `1_mean_sd` = c("201 (72)", "115.6 (38.5)", "95.1 (NA)")), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))
# Select the columns whose names end in "_n" and pass them to pmax as separate arguments
tbl$pmax <- do.call(pmax, as.list(tbl[, grepl("_n$", names(tbl))]))
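Both answers amount to calling pmax() with each selected column as its own argument. The base R sketch below makes that equivalence explicit on a cut-down copy of the table containing only the two "_n" columns. The names n_cols, explicit and spliced are made up for illustration.

```r
# Only the count columns from the example table
tbl <- data.frame(`0_n` = c(12L, 2L, 4L),
                  `1_n` = c(3L, 10L, 1L),
                  check.names = FALSE)

# Select columns by suffix
n_cols <- tbl[endsWith(names(tbl), "_n")]

# Naming the columns by hand ...
explicit <- pmax(tbl[["0_n"]], tbl[["1_n"]])

# ... gives the same result as passing the selected columns as an argument list
spliced <- do.call(pmax, unname(as.list(n_cols)))

print(spliced)
print(identical(explicit, spliced))
```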

Merge two tables based on 2 conditions and output the average as result column

I have the following two tables:
Table_1
ID Interval
1 10
1 11
2 11
and
Table_2
ID Interval Rating
1 10 0.5
1 10 0.3
1 11 0.1
2 11 0.1
2 11 0.2
The output table should look like this:
ID Interval Mean Ratings
1 10 0.4
1 11 0.1
2 11 0.15
My goal is to join both tables based on the two conditions/columns ID and interval. Given that I have several ratings for the same ID and interval, I want to compute the mean of the ratings. Whereas the IDs are unique (~9500), the interval repeats for different IDs (as seen in the table above). My current approach is the join function with 2 by arguments. How can I create a final table in which Table_1 and Table_2 are joined based on the condition ID and interval, and receive the average rating in the result column?
left_join(Table_1, Table_2, by = c("ID" = "ID", "Interval" = "Interval"))
First, summarize the second table DT2, and then perform a right join with the first table DT1.
library(data.table)
DT1[DT2[, .(Mean_Rating = mean(Rating)), .(ID, Interval)], on = c(ID = "ID", Interval = "Interval")]
which gives
ID Interval Mean_Rating
1: 1 10 0.40
2: 1 11 0.10
3: 2 11 0.15
Sample data:
DT1 <- structure(list(ID = c(1L, 1L, 2L), Interval = c(10L, 11L, 11L
)), .Names = c("ID", "Interval"), class = c("data.table", "data.frame"
), row.names = c(NA, -3L))
DT2 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L), Interval = c(10L,
10L, 11L, 11L, 11L), Rating = c(0.5, 0.3, 0.1, 0.1, 0.2)), .Names = c("ID",
"Interval", "Rating"), class = c("data.table", "data.frame"), row.names = c(NA,
-5L))
You can achieve it with dplyr's left_join, group_by and then summarise.
library(dplyr)
table1 %>%
left_join(table2, by = c("ID", "Interval")) %>%
group_by(ID, Interval) %>%
summarise("Mean Ratings" = mean(Rating))
## A tibble: 3 x 3
## Groups: ID [?]
# ID Interval `Mean Ratings`
# <int> <int> <dbl>
#1 1 10 0.4
#2 1 11 0.1
#3 2 11 0.15
data
table1 <- read.table(header = T, text="ID Interval
1 10
1 11
2 11")
table2 <- read.table(header = T, text = "ID Interval Rating
1 10 0.5
1 10 0.3
1 11 0.1
2 11 0.1
2 11 0.2")
You don't need a join. Instead, bind your tables and use group_by and summarise from dplyr. The following achieves what you asked for:
library(dplyr)
table_1 <- data.frame("ID"= c(1,1,2),"Interval"=c (10,11,11),"Rating"= c(NA,NA,NA))
table_2 <- data.frame("ID"= c(1,1,1,2,2),"Interval"= c(10,10,11,11,11),"Rating"= c(0.5,0.3,0.1,0.1,0.2))
df1 <- bind_rows(table_1,table_2) %>% group_by(ID,Interval) %>% summarise("Mean Ratings" = mean(Rating,na.rm = TRUE))
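For comparison, the summarize-then-join idea can also be sketched in base R with aggregate() and merge(). One difference worth noting, under the assumption that Table_2 might contain ID/Interval pairs absent from Table_1: binding the tables would keep those extra pairs, whereas merging with all.x = TRUE keeps only the pairs listed in Table_1. The column name Mean_Rating is borrowed from the data.table answer.

```r
table_1 <- data.frame(ID = c(1, 1, 2), Interval = c(10, 11, 11))
table_2 <- data.frame(ID = c(1, 1, 1, 2, 2),
                      Interval = c(10, 10, 11, 11, 11),
                      Rating = c(0.5, 0.3, 0.1, 0.1, 0.2))

# Average the ratings per ID and Interval first
means <- aggregate(Rating ~ ID + Interval, data = table_2, FUN = mean)
names(means)[names(means) == "Rating"] <- "Mean_Rating"

# Then attach the averages to the rows of table_1
result <- merge(table_1, means, by = c("ID", "Interval"), all.x = TRUE)

print(result)
```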

Resources