Mutate percentile rank based on two columns - r

I previously asked the following question: "Mutate new column over a large list of tibbles", and the solutions given were perfect. I now have a follow-up question.
I now have the following dataset:
df1:
name  group  competition  metric    value
A     A      comp A       distance  10569
B     A      comp B       distance  12939
C     A      comp C       distance  11532
A     A      comp B       psv-99    29.30
B     A      comp A       psv-99    30.89
C     A      comp C       psv-99    32.00
I now want to find the percentile rank of all the values in df1, but based only on the group and one of the competitions: comp A.

We could slice the rows where competition is 'comp A', then group by the 'group' column and create a new percentile column with percent_rank:
library(dplyr)
df1 <- df1 %>%
  slice(which(competition %in% "comp A")) %>%
  group_by(group) %>%
  mutate(percentile = percent_rank(value))

Maybe just change metric to competition in the code from the previous question? It gives you the percentile rank within every competition, including comp A.
library(dplyr)
library(tidyr)  # unnest()
library(purrr)  # map()
df1 %>%
  group_nest(group, competition) %>%
  mutate(percentile = map(data, ~percent_rank(.$value))) %>%
  unnest(c(data, percentile))

You can filter on competition and then group_by group:
library(dplyr)
df1 %>%
  filter(competition == "comp A") %>%
  group_by(group) %>%
  mutate(percentile = percent_rank(value))
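
As a minimal sketch of what the filter-then-group approach returns on the sample data, the snippet below rebuilds the question's df1 with data.frame() so it runs on its own (`res` is a name chosen here for illustration). percent_rank() rescales min_rank() to the range 0 to 1, i.e. (rank - 1) / (n - 1). Note that only two rows belong to comp A, and they come from different metrics, so their values are ranked against each other.

```r
library(dplyr)

# Rebuild the sample data from the question
df1 <- data.frame(
  name        = c("A", "B", "C", "A", "B", "C"),
  group       = "A",
  competition = c("comp A", "comp B", "comp C", "comp B", "comp A", "comp C"),
  metric      = rep(c("distance", "psv-99"), each = 3),
  value       = c(10569, 12939, 11532, 29.30, 30.89, 32.00)
)

# Keep only comp A, then rank values within each group
res <- df1 %>%
  filter(competition == "comp A") %>%
  group_by(group) %>%
  mutate(percentile = percent_rank(value)) %>%
  ungroup()

res$percentile
# 1 0  (10569 is the larger of the two comp A values, 30.89 the smaller)
```

If the ranking should stay within each metric as well, adding metric to group_by() would be one way to do it; that is an assumption about the intent, not something stated in the question.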

Related

How to count the number of occurrences in a table through filtering in summarise in R?

I have a data frame like this:
df <- data.frame(Identifier = c("A","B","C"),
                 Year = c("2020","2020","2019"), Sex = c("Male","Male","Female"))
I want to then filter this, and count the number of each sex. I thought this would work with n() but:
df %>% group_by(year) %>% summarise(Number_males = n(Sex =="Male"))
Does not work. I would like the following output:
Year Number_males
1 2020 2
2 2019 0
Note: my real data frame is considerably more complicated than this one, so I cannot afford to just filter by Sex == "Male" separately.
We need to sum the logical vector, because TRUE counts as 1 and FALSE as 0:
library(dplyr)
df %>%
  group_by(Year) %>%
  summarise(Number_males = sum(Sex == "Male"))
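
For reference, here is a self-contained sketch of the same idea on the question's sample data (`counts` is a name chosen here for illustration). Groups with no matching rows still appear, with a count of 0, because the sum runs over a vector of FALSE values rather than over filtered-out rows.

```r
library(dplyr)

df <- data.frame(
  Identifier = c("A", "B", "C"),
  Year       = c("2020", "2020", "2019"),
  Sex        = c("Male", "Male", "Female")
)

# Sex == "Male" is a logical vector; sum() counts its TRUE values per Year
counts <- df %>%
  group_by(Year) %>%
  summarise(Number_males = sum(Sex == "Male"))

counts
# Year  Number_males
# 2019             0
# 2020             2
```

The rows come back sorted by Year, so 2019 is listed before 2020.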

Method for grouping in R for unique values in a list?

I have a dataframe of patients who underwent one or more surgical procedures and am interested in grouping them by procedure type for analysis of outcomes. The procedures are represented by numbers (1-5). To avoid having to create a new column in the dataframe for each procedure type to identify whether the patient had that unique procedure performed, I'm basically looking for a way to do aggregate grouping and summarizing for each unique value in a list.
A representative df would look like this...
id <- c(1,2,3,4,5,6,7,8,9,10)
procedures <- list(2, 3, c(1,5), 1, c(3,4), c(1,3), 5, 2, c(1,2,5), 4)
df <- as.data.frame(cbind(id, procedures))
Say I wanted to count the number of patients who had each type of procedure. The following would obviously count each unique list as a separate object.
df %>%
  group_by(procedures) %>%
  summarise(n = n())
What I'm trying to accomplish would be a count of times each unique procedure appears in the list of lists. The below is oversimplified but an example of this.
df %>%
  group_by(unique(procedures)) %>%
  summarise(n = n())
We may unnest the list column and use that in group_by:
library(dplyr)
library(tidyr)
df %>%
  unnest(everything()) %>%
  group_by(procedures) %>%
  summarise(n = n())
We could use separate_rows with count:
library(dplyr)
library(tidyr)
df %>%
  separate_rows("procedures", sep = " ,") %>%
  count(procedures)
  procedures     n
       <dbl> <int>
1          1     4
2          2     3
3          3     3
4          4     2
5          5     3
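
A minimal sketch of the unnest idea, under the assumption that the data is built as a tibble with a genuine list column instead of via as.data.frame(cbind(...)); the names `patients` and `counts` are chosen here for illustration. Each procedure gets its own row after unnest(), so counting rows per procedure counts patients per procedure.

```r
library(dplyr)
library(tidyr)

# Assumption: a tibble with a list column, one vector of procedures per patient
patients <- tibble(
  id = 1:10,
  procedures = list(2, 3, c(1, 5), 1, c(3, 4), c(1, 3), 5, 2, c(1, 2, 5), 4)
)

# One row per (patient, procedure) pair, then count rows per procedure
counts <- patients %>%
  unnest(procedures) %>%
  count(procedures)

counts$n
# 4 3 3 2 3  (for procedures 1 to 5)
```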

Mutate new column over a large list of tibbles

So I have used the following code to split the below dataframe (df1) into multiple dataframes/tibbles based on the filters so that I can work out the percentile rank of each metric.
df1:
name  group  metric    value
A     A      distance  10569
B     A      distance  12939
C     A      distance  11532
A     A      psv-99    29.30
B     A      psv-99    30.89
C     A      psv-99    28.90
split <- lapply(unique(df1$metric), function(x){
  filter <- df1 %>% filter(group == "A" & metric == x)
})
This then gives me a large list of tibbles. I want to now mutate a new column for each tibble to work out the percentile rank of the value column which I can do using the following code:
df2 <- split[[1]] %>% mutate(percentile = percent_rank(value))
I could do this for each metric and then bind the rows together with bind_rows, but that seems very messy. Could anyone suggest a better way of doing this?
No need to split the data here. You can use group_by to do the calculation for each metric separately.
library(dplyr)
df1 %>%
  filter(group == "A") %>%
  group_by(metric) %>%
  mutate(percentile = percent_rank(value))
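We can also use base R's subset() and ave() (percent_rank itself still comes from dplyr):
library(dplyr)  # only for percent_rank()
df_a <- subset(df1, group == 'A')
df_a$percentile <- with(df_a, ave(value, metric, FUN = percent_rank))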
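Another option is to nest by group and metric, then map over the nested data:
library(dplyr)
library(tidyr)  # unnest()
library(purrr)  # map()
df1 %>%
  group_nest(group, metric) %>%
  mutate(percentile = map(data, ~percent_rank(.x$value))) %>%
  unnest(cols = c("data", "percentile"))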
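
To make the per-metric ranking concrete, here is a small self-contained sketch that rebuilds the sample data and applies the group_by approach (`res` is a name chosen here for illustration). Within each metric, percent_rank() returns (rank - 1) / (n - 1), so with three values per metric the results are 0, 0.5 and 1.

```r
library(dplyr)

# Rebuild the sample data from the question
df1 <- data.frame(
  name   = rep(c("A", "B", "C"), 2),
  group  = "A",
  metric = rep(c("distance", "psv-99"), each = 3),
  value  = c(10569, 12939, 11532, 29.30, 30.89, 28.90)
)

# Rank values separately within each metric
res <- df1 %>%
  filter(group == "A") %>%
  group_by(metric) %>%
  mutate(percentile = percent_rank(value)) %>%
  ungroup()

res$percentile
# 0.0 1.0 0.5 0.5 1.0 0.0
```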

Reclassify attributes that are less than x% of total as 'other'

Okay so I have data as so:
ID Name Job
001 Bill Carpenter
002 Wilma Lawyer
003 Greyson Lawyer
004 Eddie Janitor
I want to group these together for analysis, so that any job that makes up less than x percent of the whole is grouped into "Other".
How can I do this? Here is what I tried:
df %>%
  group_by(Job) %>%
  summarize(count = n()) %>%
  mutate(pct = count/sum(count)) %>%
  arrange(desc(count)) %>%
  drop_na()
Now I know what the percentages are, but how do I integrate this into the original data to make everything below x "Other"? (Let's say less than or equal to 25% is "Other".)
Maybe there's a more straightforward way...
You can try this:
library(dplyr)
df %>%
  count(Job) %>%
  mutate(n = n/sum(n)) %>%
  left_join(df, by = 'Job') %>%
  mutate(Job = replace(Job, n <= 0.25, 'Other'))
To integrate our calculation into the original data, we do a left_join and then replace the values.
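
As a hedged alternative sketch (not the method from the answer above), dplyr's add_count() attaches the per-job count to every original row, which avoids the join and keeps the original row order. The names `threshold`, `job_n` and `res` are chosen here for illustration.

```r
library(dplyr)

df <- data.frame(
  ID   = c("001", "002", "003", "004"),
  Name = c("Bill", "Wilma", "Greyson", "Eddie"),
  Job  = c("Carpenter", "Lawyer", "Lawyer", "Janitor")
)

threshold <- 0.25  # share at or below which a job becomes "Other"

res <- df %>%
  add_count(Job, name = "job_n") %>%                              # rows per job
  mutate(Job = if_else(job_n / n() <= threshold, "Other", Job)) %>% # n() = total rows
  select(-job_n)

res$Job
# "Other" "Lawyer" "Lawyer" "Other"
```

Carpenter and Janitor each make up 25% of the rows, so both fall at the threshold and are relabelled.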

Sampling from a subset of a dataframe where the subset is conditional on a value from another dataframe in R

I have two data frames in R. One contains a row for each individual person and the area they live in. E.g.
df1 = data.frame(Person_ID = seq(1,10,1), Area = c("A","A","A","B","B","C","D","A","D","C"))
The other data frame contains demographic information for each Area.
E.g. for gender:
df2 = data.frame(Area = c("A","A","B","B","C","C","D","D"), gender = c("M","F","M","F","M","F","M","F"), probability = c(0.4,0.6,0.55,0.45,0.6,0.4,0.5,0.5))
In df1 I want to create a gender column where for each row of df1 I sample a gender from the appropriate subset of df2.
For example, for row 1 of df1 I would sample a gender from df2 %>% filter(Area == "A")
The question is how do I do this for all rows without a for loop as in practice df1 could have up to 5 million rows?
Try using the following:
library(dplyr)
library(tidyr)
library(purrr)  # map()
out <- df1 %>%
  nest(data = -Area) %>%
  left_join(df2, by = 'Area') %>%
  group_by(Area) %>%
  summarise(data = map(data, ~.x %>%
                         mutate(gender = sample(gender, n(),
                                                prob = probability, replace = TRUE)))) %>%
  distinct(Area, .keep_all = TRUE) %>%
  unnest(data)
We first nest df1 and join it with df2 by Area. For each Area we sample gender values based on the probabilities in df2, then unnest to get a long data frame.
There are not enough rows in df1 to verify the result, but if we increase the number of rows in df1, the proportions should be similar to the probabilities in df2.
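
One way to check that claim is sketched below with a larger, simulated df1. This uses a simpler grouped mutate rather than the nest/join pipeline above, and the names `big`, `probs`, `out` and `share_m` as well as the seed and row count are assumptions made for illustration.

```r
library(dplyr)

set.seed(1)

# Assumption: a simulated stand-in for a large df1
big <- data.frame(
  Person_ID = 1:20000,
  Area      = sample(c("A", "B", "C", "D"), 20000, replace = TRUE)
)

df2 <- data.frame(
  Area        = c("A", "A", "B", "B", "C", "C", "D", "D"),
  gender      = c("M", "F", "M", "F", "M", "F", "M", "F"),
  probability = c(0.4, 0.6, 0.55, 0.45, 0.6, 0.4, 0.5, 0.5)
)

# For each Area, draw one gender per person using that Area's probabilities
out <- big %>%
  group_by(Area) %>%
  mutate(gender = {
    probs <- df2[df2$Area == Area[1], ]
    sample(probs$gender, n(), replace = TRUE, prob = probs$probability)
  }) %>%
  ungroup()

# Share of "M" in Area A; should land close to the 0.4 given in df2
share_m <- mean(out$gender[out$Area == "A"] == "M")
share_m
```

Because the draw is random, the share will only approximate 0.4, and it gets closer as the number of rows grows.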
