Calculating a value in R to tie the next-highest ranking value - r

Given the following tibble, I'd like to mutate a new column indicating the additional count necessary to tie the next-highest in rank.
v <- tribble(
~rank, ~name, ~count,
1, "Mary", 100,
2, "Fred", 96,
3, "Sue", 90,
3, "Michelle", 90,
4, "Tom", 72
)
I've tried dplyr's lag function (v %>% mutate(toTie = lag(count) - count)). This works, but not when there are ties, as any tied observation after the first is compared to an observation with the same value. For example, after the mutate I have this:
rank name count toTie
<dbl> <chr> <dbl> <dbl>
1 1 Mary 100 NA
2 2 Fred 96 4
3 3 Sue 90 6
4 3 Michelle 90 0
5 4 Tom 72 18
This output correctly says Sue, who is ranked third, needs 6 to tie Fred, who is second. But because it compares Michelle to Sue (and not Fred), it says Michelle needs none to tie Sue. This is true, but not the intent. Michelle, like Sue, needs 6 to tie second-place Fred.
Any thoughts on a better approach would be most appreciated.

We can get the difference on the lag of the 'distinct' values of 'count' and do a right_join
library(dplyr)
v %>%
distinct(count) %>%
mutate(ToTie = lag(count)- count) %>%
right_join(v) %>%
select(names(v), ToTie)
-output
# A tibble: 5 x 4
# rank name count ToTie
# <dbl> <chr> <dbl> <dbl>
#1 1 Mary 100 NA
#2 2 Fred 96 4
#3 3 Sue 90 6
#4 3 Michelle 90 6
#5 4 Tom 72 18
Or another option is fill
library(tidyr)
v %>%
mutate(toTie = lag(count) - count,
toTie = na_if(toTie, 0)) %>%
fill(toTie)

You can use match() to index the difference at first occurrence.
library(dplyr)
v %>%
mutate(toTie = c(NA, diff(-count))[match(count, count)])
# A tibble: 5 x 4
rank name count toTie
<dbl> <chr> <dbl> <dbl>
1 1 Mary 100 NA
2 2 Fred 96 4
3 3 Sue 90 6
4 3 Michelle 90 6
5 4 Tom 72 18

Related

fuzzyjoin based on relative difference

I have understood that fuzzyjoin::difference will join two tables based on absolute difference between columns. Is there an R function that will join tables based on relative/percentage differences? I could do so using a full_join() + filter() but I suspect there is a more straightforward way.
Minimal example as follows:
library(tidyverse)
library(fuzzyjoin)
df_1 <- tibble(id = c("wombat", "jerry", "akow"), scores = c(10, 50, 75))
df_2 <- tibble(id= c("wombat", "jerry", "akow"), scores = c(14, 45, 82))
# joining based on absolute difference
difference_full_join(df_1, df_2,
by=c("scores"),
max_dist= 5,
distance_col = "abs_diff" )
# A tibble: 4 x 5
id.x scores.x id.y scores.y abs_diff
<chr> <dbl> <chr> <dbl> <dbl>
1 wombat 10 wombat 14 4
2 jerry 50 jerry 45 5
3 akow 75 NA NA NA
4 NA NA akow 82 NA
## joining based on relative difference (setting 10% as a threshold)
full_join(df_1, df_2, "id") %>%
dplyr::filter( (abs(scores.x - scores.y)/scores.x) <=0.10)
# A tibble: 2 x 3
id scores.x scores.y
<chr> <dbl> <dbl>
1 jerry 50 45
2 akow 75 82

For every category I want to make numeric calculations

I want to add a column in r to calculate the difference of scores for every equal characters in the same column (e.g:names)
I tried to group the data using group_by function in dplyr but it didn't work.
Something like this?
library(dplyr)
df %>% group_by(name) %>% mutate(score_diff = c(0, diff(score)))
## A tibble: 6 x 3
## Groups: name [4]
# name score score_diff
# <fct> <dbl> <dbl>
#1 James 83 0
#2 Andrew 84 0
#3 James 87 4
#4 Sonya 40 0
#5 Catherine 50 0
#6 Sonya 55 15
Sample data
df <- data.frame(
name = c("James", "Andrew", "James", "Sonya", "Catherine", "Sonya"),
score = c(83,84,87,40,50,55))

dplyr equivalent of sql row_number() over (partition by group order by value)

Initial situation
I have a data set of the following form:
library(dplyr)
dat <- tribble(
~name, ~iq,
"ben", 100,
"alex", 98,
"mia", 110,
"paco", 124,
"mia", 112,
"mia", 120,
"paco", 112,
"ben", 90,
"alex", 107
)
I'd like to create a new column which ranks, grouped by name, the values iq in descending order. In SQL one could write
select
name,
iq,
row_number() over (partition by name order by iq desc) as rank
from
dat;
which would produce the following expected output (already ordered for simplicity):
#> name iq rank
#> <chr> <dbl> <int>
#> 1 alex 107 1
#> 2 alex 98 2
#> 3 ben 100 1
#> 4 ben 90 2
#> 5 mia 120 1
#> 6 mia 112 2
#> 7 mia 110 3
#> 8 paco 124 1
#> 9 paco 112 2
Questions
With my data, one can achieve the desired result with:
dat %>%
group_by(name) %>%
mutate(rank = with_order(order_by = iq,
fun = row_number,
x = desc(iq)
)
) %>%
arrange(name, rank)
#> # A tibble: 9 x 3
#> # Groups: name [4]
#> name iq rank
#> <chr> <dbl> <int>
#> 1 alex 107 1
#> 2 alex 98 2
#> 3 ben 100 1
#> 4 ben 90 2
#> 5 mia 120 1
#> 6 mia 112 2
#> 7 mia 110 3
#> 8 paco 124 1
#> 9 paco 112 2
However, I don't understand why the code works. When reading the documentation of dplyr::with_order(), it says the arguments are
order_by = the vector to order by
fun = window function
x, ... = arguments to f
Given the description in the documentation and the working code, I have two questions I cannot answer:
What is the purpose of the argument x? Why not just specify the vector to order by and the window function (like in sql)? What is meant by f?
Why don't I have to write order_by = desc(iq)? To get the result I expect I have to write x = desc(iq) and set order_by = iq. This seems to contradict the documentation, which states that order_by = the vector to order by.

How can I create an incremental ID column based on whenever one of two variables are encountered?

My data came to me like this (but with 4000+ records). The following is data for 4 patients. Every time you see surgery OR age reappear, it is referring to a new patient.
col1 = c("surgery", "age", "weight","albumin","abiotics","surgery","age", "weight","BAPPS", "abiotics","surgery", "age","weight","age","weight","BAPPS","albumin")
col2 = c("yes","54","153","normal","2","no","65","134","yes","1","yes","61","210", "46","178","no","low")
testdat = data.frame(col1,col2)
So to say again, every time surgery or age appear (surgery isn't always there, but age is), those records and the ones after pertain to the same patient until you see surgery or age appear again.
Thus I somehow need to add an ID column with this data:
ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,4,4,4,4)
testdat$ID = ID
I know how to transpose and melt and all that to put the data into regular format, but how can I create that ID column?
Advice on relevant tags to use is helpful!
Assuming that surgery and age will be the first two pieces of information for each patient and that each patient will have a information that is not age or surgery afterward, this is a solution.
col1 = c("surgery", "age", "weight","albumin","abiotics","surgery","age", "weight","BAPPS", "abiotics","surgery", "age","weight","age","weight","BAPPS","albumin")
col2 = c("yes","54","153","normal","2","no","65","134","yes","1","yes","61","210", "46","178","no","low")
testdat = data.frame(col1,col2)
# Use a tibble and get rid of factors.
dfTest = as_tibble(testdat) %>%
mutate_all(as.character)
# A little dplyr magic to see find if the start of a new patient, then give them an id.
dfTest = dfTest %>%
mutate(couldBeStart = if_else(col1 == "surgery" | col1 == "age", T, F)) %>%
mutate(isStart = couldBeStart & !lag(couldBeStart, default = FALSE)) %>%
mutate(patientID = cumsum(isStart)) %>%
select(-couldBeStart, -isStart)
# # A tibble: 17 x 3
# col1 col2 patientID
# <chr> <chr> <int>
# 1 surgery yes 1
# 2 age 54 1
# 3 weight 153 1
# 4 albumin normal 1
# 5 abiotics 2 1
# 6 surgery no 2
# 7 age 65 2
# 8 weight 134 2
# 9 BAPPS yes 2
# 10 abiotics 1 2
# 11 surgery yes 3
# 12 age 61 3
# 13 weight 210 3
# 14 age 46 4
# 15 weight 178 4
# 16 BAPPS no 4
# 17 albumin low 4
# Get the data to a wide workable format.
dfTest %>% spread(col1, col2)
# # A tibble: 4 x 7
# patientID abiotics age albumin BAPPS surgery weight
# <int> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 2 54 normal NA yes 153
# 2 2 1 65 NA yes no 134
# 3 3 NA 61 NA NA yes 210
# 4 4 NA 46 low no NA 178
Using dplyr:
library(dplyr)
testdat = testdat %>%
mutate(patient_counter = cumsum(col1 == 'surgery' | (col1 == 'age' & lag(col1 != 'surgery'))))
This works by checking whether the col1 value is either 'surgery' or 'age', provided 'age' is not preceded by 'surgery'. It then uses cumsum() to get the cumulative sum of the resulting logical vector.
You can try the following
keywords <- c('surgery', 'age')
lgl <- testdat$col1 %in% keywords
testdat$ID <- cumsum(c(0, diff(lgl)) == 1) + 1
col1 col2 ID
1 surgery yes 1
2 age 54 1
3 weight 153 1
4 albumin normal 1
5 abiotics 2 1
6 surgery no 2
7 age 65 2
8 weight 134 2
9 BAPPS yes 2
10 abiotics 1 2
11 surgery yes 3
12 age 61 3
13 weight 210 3
14 age 46 4
15 weight 178 4
16 BAPPS no 4
17 albumin low 4

Better way to summarize percentages by subgroup using dplyr?

After consulting similar questions on SO, like here I finally have the output I want, but I can't help wonder if there is a better way to get there. In addition, I am wondering if there is a way to use pipe operators to chain the last step which eliminates repeats of manager and title combinations.
Reproducible example:
library(dplyr)
# Sample data frame
employee = LETTERS[1:18]
manager = c(rep("Tom", 3), rep("Sue", 4), rep("Mike", 4), rep("Jack", 7))
title = c(rep("Entry", 2), rep("Mid", 3), rep("Junior", 7), rep("Senior", 6))
mydata <- data.frame(employee, manager, title)
# Code gives me output I want, but wondering if there is a better way
org2 <- mydata %>%
group_by(manager, title) %>%
mutate(title_count = n()) %>% # Total number of people with given title by manager
ungroup() %>%
group_by(manager) %>% # Total number of people in manager's group
mutate(mgr_total = n()) %>%
group_by(title, add = TRUE) %>%
mutate(title_pctg = round(title_count/mgr_total*100, 1)) %>% # Percent of people with given title by manager
select(-employee)
# Remove duplicates of manager and title to summarize data wanted
org2 <- org2[!duplicated(org2[2:4]), ]
arrange(org2, manager, title)
# A tibble: 7 x 5
# Groups: manager, title [7]
# manager title title_count mgr_total title_pctg
# <fctr> <fctr> <int> <int> <dbl>
#1 Jack Junior 1 7 14.3
#2 Jack Senior 6 7 85.7
#3 Mike Junior 4 4 100.0
#4 Sue Junior 2 4 50.0
#5 Sue Mid 2 4 50.0
#6 Tom Entry 2 3 66.7
#7 Tom Mid 1 3 33.3
Thanks in advance for the thoughts and help.
You can simplify it as the following by switching the order of group_by (i.e. group by manager first then manger + title instead of the other way);
mydata %>%
group_by(manager) %>%
mutate(mgr_count = n()) %>%
group_by(title, mgr_count, add=TRUE) %>%
summarise(
title_count = n(),
title_pctg = round(title_count / first(mgr_count) * 100, 1)
)
# A tibble: 7 x 5
# Groups: manager, title [?]
# manager title mgr_count title_count title_pctg
# <fctr> <fctr> <int> <int> <dbl>
#1 Jack Junior 7 1 14.3
#2 Jack Senior 7 6 85.7
#3 Mike Junior 4 4 100.0
#4 Sue Junior 4 2 50.0
#5 Sue Mid 4 2 50.0
#6 Tom Entry 3 2 66.7
#7 Tom Mid 3 1 33.3

Resources