Calculating percentiles and showing them as stacked bars for benchmarking - r

This is a follow-up question to Calculate proportions of multiple columns.
I have the following data:
location <- rep(c("A", "B", "C", "D"),
                times = c(4, 6, 3, 7))
ID <- 1:20
Var1 <- rep(c(0, 2, 1, 1, 0), times = 4)
Var2 <- rep(c(2, 1, 1, 0, 2), times = 4)
Var3 <- rep(c(1, 1, 0, 2, 0), times = 4)
df <- as.data.frame(cbind(location, ID, Var1, Var2, Var3))
And with some help I counted occurrences and proportions of the different scores (0, 1, 2) in the different Vars like this:
library(tidyverse)

df %>%
  pivot_longer(starts_with("Var"), values_to = "score") %>%
  type_convert() %>%
  group_by(location, name) %>%
  count(score) %>%
  mutate(frac = n / sum(n)) -> dfmut
Now I have a data frame that looks like this, which I called dfmut:
# A tibble: 36 × 5
# Groups:   location, name [12]
   location name  score     n  frac
   <chr>    <chr> <dbl> <int> <dbl>
 1 A        Var1      0     2   0.4
 2 A        Var1      1     2   0.4
 3 A        Var1      2     1   0.2
 4 A        Var2      0     1   0.2
 5 A        Var2      1     2   0.4
 6 A        Var2      2     2   0.4
 7 A        Var3      0     2   0.4
 8 A        Var3      1     2   0.4
 9 A        Var3      2     1   0.2
10 B        Var1      0     2   0.4
Now what I would like to do is get the 10th, 25th, 75th and 90th percentiles of the scores that are not 0 and turn them into a stacked bar chart. I'll give you an example story: location (A, B, etc.) is gardens of different people where they grew different kinds of vegetables (Var1, Var2, etc.). We scored how well the vegetables turned out with score 0 = optimal, score 1 = suboptimal, score 2 = failure.
The goal is to get a stacked bar chart that shows how high the proportion of non-optimal (score 1 and 2) vegetables is in the 10% best, the 25% best gardens, etc. Then I want to indicate to each gardener where they lie in the ranking regarding each Var.
This could look something like in the image with dark green: best 10% to dark pink-purplish: worst 10% with the dot indicating garden A.
I started making a new data frame with the quantiles, which is probably not very elegant, so feel free to point out how I could do this more efficiently:
dfmut %>%
  subset(name == "Var1") %>%
  subset(score == "1" | score == "2") -> Var1_12
Percentiles <- c("10", "25", "75", "90")
Var1 <- quantile(Var1_12$frac, probs = c(0.1, 0.25, 0.75, 0.9))
data <- data.frame(Percentiles, Var1)

dfmut %>%
  subset(name == "Var2") %>%
  subset(score == "1" | score == "2") -> Var2_12
data$Var2 <- quantile(Var2_12$frac, probs = c(0.1, 0.25, 0.75, 0.9))

library(reshape2)
data_tidy <- melt(data, id.vars = "Percentiles")
I can't get any further than this. Probably because I'm on an entirely wrong path...
Thank you for your help!
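For what it's worth, the repetition per Var can be avoided by splitting frac by name and applying quantile() to each piece. A base-R sketch, using a toy stand-in for dfmut (the column names match the tibble above; the frac values are made up for illustration):

```r
# Toy stand-in for dfmut, reduced to the columns we need
dfmut <- data.frame(
  name  = rep(c("Var1", "Var2", "Var3"), each = 4),
  score = rep(c(1, 2), times = 6),
  frac  = c(0.4, 0.2, 0.4, 0.4, 0.4, 0.2, 0.6, 0.2, 0.3, 0.1, 0.5, 0.2)
)
probs <- c(0.10, 0.25, 0.75, 0.90)
nonzero <- dfmut[dfmut$score != 0, ]          # drop score-0 rows (none in the toy data)
data <- sapply(split(nonzero$frac, nonzero$name),
               quantile, probs = probs)       # one column per Var, one row per percentile
data
```

The row names are the percentiles ("10%" etc.), so the melt step reduces to data.frame(Percentiles = rownames(data), data) followed by one pivot.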

Related

How to calculate the difference between values in one column based on another column?

I'm trying to calculate the difference between abundance at the time points C1 and C0. I'd like to do this for the different genes, so I've used group_by for the genes, but can't figure out how to find the difference in abundance at the different time points.
Here is one of my attempts:
IgH_CDR3_post_challenge_unique_vv <- IgH_CDR3_post_challenge_unique_v %>%
  group_by(gene) %>%
  mutate(increase_in_abundance = abundance[Timepoint == 'C1'] - abundance[Timepoint == 'C0']) %>%
  ungroup()
My data looks something like this:
gene  Timepoint  abundance
1     C0         5
2     C1         3
1     C1         6
3     C0         2
Assuming (!) you will have one entry per gene and timepoint (as opposed to the table posted in the question), you can pivot_wider your data and then calculate the difference for every gene. The current example, of course, isn't very helpful with mostly missings.
df <- data.frame(gene = c(1, 2, 1, 3),
                 Timepoint = c("c0", "c1", "c1", "c0"),
                 abundance = c(5, 3, 6, 2))
library(tidyverse)
df %>%
  pivot_wider(names_from = Timepoint,
              values_from = abundance,
              id_cols = gene) %>%
  mutate(increase_in_abundance = c1 - c0)
# A tibble: 3 x 4
   gene    c0    c1 increase_in_abundance
  <dbl> <dbl> <dbl>                 <dbl>
1     1     5     6                     1
2     2    NA     3                    NA
3     3     2    NA                    NA
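For reference, the same wide-then-subtract idea can be written in base R with stats::reshape (no packages needed), using the df defined above:

```r
df <- data.frame(gene = c(1, 2, 1, 3),
                 Timepoint = c("c0", "c1", "c1", "c0"),
                 abundance = c(5, 3, 6, 2))
# Wide format: one row per gene, one abundance column per timepoint
wide <- reshape(df, idvar = "gene", timevar = "Timepoint",
                direction = "wide")
wide$increase_in_abundance <- wide$abundance.c1 - wide$abundance.c0
wide   # NAs propagate for genes missing a timepoint, as in the tibble above
```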

Compute grouped mean while retaining single-row group in R (dplyr)

I'm trying to compute the mean + standard deviation for a dataset. I have a list of organizations, but one organization has just one single row for the column "cpue". When I try to compute the grouped mean for each organization and another variable (scientific name), this organization is removed and yields an NA. However, I would like to retain the single-row value, and for it to be in the "mean" column so that I can plot it (without sd). Is there a way to tell dplyr to retain groups with a single row when calculating the mean? Data below:
l <- data.frame(organization = c("A", "B", "B", "A", "B", "A", "C"),
                species = c("turtle", "shark", "turtle", "bird", "turtle", "shark", "bird"),
                cpue = c(1, 2, 1, 5, 6, 1, 3))
l2 <- l %>%
  group_by(organization, species) %>%
  summarize(mean = mean(cpue),
            sd = sd(cpue))
Any help would be much appreciated!
We can create an if/else condition in sd to check for the number of rows i.e. if n() ==1 then return the 'cpue' or else compute the sd of 'cpue'
library(dplyr)
l1 <- l %>%
  group_by(organization, species) %>%
  summarize(mean = mean(cpue),
            sd = if (n() == 1) cpue else sd(cpue),
            .groups = 'drop')
Output:
l1
# A tibble: 6 x 4
#  organization species  mean    sd
#* <chr>        <chr>   <dbl> <dbl>
#1 A            bird      5    5
#2 A            shark     1    1
#3 A            turtle    1    1
#4 B            shark     2    2
#5 B            turtle    3.5   3.54
#6 C            bird      3    3
If the condition is based on the value of grouping variable 'organization', then create the condition in if/else by extracting the grouping variable with cur_group()
l %>%
  group_by(organization, species) %>%
  summarise(mean = mean(cpue),
            sd = if (cur_group()$organization == 'A') cpue else sd(cpue),
            .groups = 'drop')
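A design note: if the goal is simply that single-row groups survive into the plot with no error bar, returning NA for their sd is often preferable to returning the raw cpue, and NA is what sd() of a single value gives anyway. A base-R check with tapply, using the data from the question, shows single-row groups are retained with an NA sd:

```r
l <- data.frame(organization = c("A", "B", "B", "A", "B", "A", "C"),
                species = c("turtle", "shark", "turtle", "bird", "turtle", "shark", "bird"),
                cpue = c(1, 2, 1, 5, 6, 1, 3))
grp <- interaction(l$organization, l$species, drop = TRUE)
means <- tapply(l$cpue, grp, mean)   # single-row groups are kept
sds   <- tapply(l$cpue, grp, sd)     # NA where n == 1
cbind(mean = means, sd = sds)
```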

Finding the first non-zero year in data frame for multiple variables using tidyverse

I have the following data:
library(tidyverse)
set.seed(1)
test <- data.frame(id = c(rep(1, 3), rep(2, 4), rep(3, 5)),
                   Year = 2000 + c(1, 3, 5, 2, 3, 5, 6, 1, 2, 3, 4, 5),
                   var1 = sample(0:2, replace = TRUE, size = 12, prob = c(0.6, 0.3, 0.1)),
                   var2 = sample(0:2, replace = TRUE, size = 12, prob = c(0.6, 0.3, 0.1)))
I need to find the first year that each variable (var1 and var2) is non-zero within each id group.
I know how to find the row number of the first non-zero row:
temp <- function(a) ifelse(length(head(which(a > 0), 1)) == 0, 0, head(which(a > 0), 1))
test2 <- test %>%
  group_by(id) %>%
  mutate_at(vars(var1:var2), funs(temp)) %>%
  filter(row_number() == 1) %>%
  select(-Year)
  id var1 var2
1  1    0    1
2  2    1    2
3  3    1    1
However, I am not sure how to match the row number back to the year variable so that I will know exactly when did the var1 and var2 turn non-zero, instead of only having the row numbers.
This is what I want:
  id var1 var2
1  1    0 2001
2  2 2002 2003
3  3 2001 2001
We may do the following:
test %>%
  group_by(id) %>%
  summarise_at(vars(var1:var2), funs(Year[. > 0][1]))
# A tibble: 3 x 3
#      id  var1  var2
#   <dbl> <dbl> <dbl>
# 1     1    NA  2001
# 2     2  2002  2003
# 3     3  2001  2001
That is, . > 0 gives a logical vector with TRUE whenever a value is positive, then we select all the corresponding years, and lastly pick only the first one.
That's very similar to your approach. Notice that due to using summarise I no longer need filter(row_number() == 1) %>% select(-Year). Also, my function corresponding to temp is more concise.
A slightly different approach gathering everything into a big long file first:
test %>%
  gather(var, value, var1:var2) %>%
  filter(value != 0) %>%
  group_by(id, var) %>%
  summarise(Year = min(Year)) %>%
  spread(var, Year)
## A tibble: 3 x 3
## Groups:   id [3]
#     id  var1  var2
#* <dbl> <dbl> <dbl>
#1  1.00    NA  2001
#2  2.00  2002  2003
#3  3.00  2001  2001
And a base R version for fun:
tmp <- cbind(test[c("id", "Year")], stack(test[c("var1", "var2")]))
tmp <- tmp[tmp$values != 0, ]
tmp <- aggregate(Year ~ id + ind, data = tmp, FUN = min)
reshape(tmp[c("id", "ind", "Year")], idvar = "id", timevar = "ind", direction = "wide")

R - Group by a value and calculate the percentage of the whole group

EDIT: My question was not clear enough. I apologize.
The problem was to define groups and assign values of a column of a dataframe to it.
I solved the question myself with a chain of ifelse and the comments here. Thanks for that. I then did it manually for each column separately.
data %>%
  mutate(group = ifelse(richness <= -0.6, "1",
                 ifelse(richness > -0.6 & richness <= -0.2, "2",
                 ifelse(richness > -0.2 & richness <= 0.2, "3",
                 ifelse(richness > 0.2 & richness <= 0.6, "4",
                 ifelse(richness > 0.6, "5", NA)))))) %>%
  group_by(group) %>%
  summarise(percentage = n() * 100 / nrow(data))
Using the carb variable from the mtcars data set as an example:
prop.table(table(mtcars$carb)) * 100

     1      2      3      4      6      8
21.875 31.250  9.375 31.250  3.125  3.125

If you want to define the groups yourself you can use the cut function:
groups <- c(0, 2, 6, 8) # interval values for the groups
prop.table(table(cut(mtcars$carb, breaks = groups))) * 100

 (0,2]  (2,6]  (6,8]
53.125 43.750  3.125
Workflow:
Add a dummy column;
Group by the dummy column;
Count the subgroups.
Here is some sample code:
require(dplyr)
# generate fake data
set.seed(123456)
sample <- data.frame(Nums = rep(NA, 100))
sample$Nums <- sample(-100:100, 100, replace = TRUE) / 100
size <- length(sample$Nums)
# add dummy column
sample <- sample %>%
  # change the dummy column condition as needed
  mutate(dummy = ifelse(Nums < 0, "A", "B")) %>%
  # group by the dummy
  group_by(dummy) %>%
  # calculate count and percentage
  summarise(count = n(),
            percentage = n() * 100 / size)
head(sample)
# A tibble: 2 x 3
  dummy count percentage
  <chr> <int>      <dbl>
1 A        50         50
2 B        50         50

Ignore value conditionally within group_by in dplyr

Please consider the following.
Background
In a data.frame I have patient IDs (id), the day at which patients are admitted to a hospital (day), a code for the diagnostic activity they received that day (code), a price for that activity (price) and a frequency for that activity (freq).
Activities with code b and c are registered at the same time but mean more or less the same thing and should not be double counted.
Problem
What I want is: if code "b" and "c" are registered for the same day, code "b" should be ignored.
The example data.frame looks like this:
x <- data.frame(id = c(rep("a", 4), rep("b", 3)),
                day = c(1, 1, 1, 2, 1, 2, 3),
                price = c(500, 10, 100, rep(10, 3), 100),
                code = c("a", "b", "c", rep("b", 3), "c"),
                freq = c(rep(1, 5), rep(2, 2)))
> x
  id day price code freq
1  a   1   500    a    1
2  a   1    10    b    1
3  a   1   100    c    1
4  a   2    10    b    1
5  b   1    10    b    1
6  b   2    10    b    2
7  b   3   100    c    2
So the costs for patient "a" for day 1 would be 600 and not 610 as I can compute with the following:
x %>%
  group_by(id, day) %>%
  summarise(res = sum(price * freq))
# A tibble: 5 x 3
# Groups:   id [?]
  id      day   res
  <fct> <dbl> <dbl>
1 a        1.  610.
2 a        2.   10.
3 b        1.   10.
4 b        2.   20.
5 b        3.  200.
Possible approaches
Either I delete observation code "b" when "c" is present on that same day or I set freq of code "b" to 0 in case code "c" is present on the same day.
All my attempts with ifelse and mutate failed so far.
Every help is much appreciated. Thank you very much in advance!
You can add a filter line to remove the offending b values like this...
x %>%
  group_by(id, day) %>%
  filter(!(code == "b" & "c" %in% code)) %>%
  summarise(res = sum(price * freq))
  id      day   res
  <fct> <dbl> <dbl>
1 a        1.  600.
2 a        2.   10.
3 b        1.   10.
4 b        2.   20.
5 b        3.  200.
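The same drop-b-when-c-is-present filter can be expressed in base R with ave(), which asks per id/day group whether any "c" row exists (x as defined in the question):

```r
x <- data.frame(id = c(rep("a", 4), rep("b", 3)),
                day = c(1, 1, 1, 2, 1, 2, 3),
                price = c(500, 10, 100, rep(10, 3), 100),
                code = c("a", "b", "c", rep("b", 3), "c"),
                freq = c(rep(1, 5), rep(2, 2)))
# TRUE for every row whose id/day group contains a "c"
has_c <- as.logical(ave(x$code == "c", x$id, x$day, FUN = any))
x2 <- x[!(x$code == "b" & has_c), ]
aggregate(cbind(res = price * freq) ~ id + day, data = x2, FUN = sum)
```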
You could create a new column like this:
mutate(code_day = paste0(ifelse(code %in% c("b", "c"), "z", code), day))
Then all your Bs and Cs will become Zs (without losing the original code column that helps you tell them apart). You can then arrange by code descending and remove duplicate values in the code_day column:
arrange(desc(code)) %>% # Bs will come after Cs
  distinct(code_day, .keep_all = TRUE)
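Assembled in base R, the same order-then-dedupe idea might look like the sketch below. Note one adjustment: id is added to the dedupe key alongside code_day, so rows from different patients with the same code/day are not collapsed together:

```r
x <- data.frame(id = c(rep("a", 4), rep("b", 3)),
                day = c(1, 1, 1, 2, 1, 2, 3),
                price = c(500, 10, 100, rep(10, 3), 100),
                code = c("a", "b", "c", rep("b", 3), "c"),
                freq = c(rep(1, 5), rep(2, 2)))
# Merge b and c into one pseudo-code "z", keeping the original code column
x$code_day <- paste0(ifelse(x$code %in% c("b", "c"), "z", x$code), x$day)
# Sort so "c" rows come before "b" rows, then keep the first row per id/code_day
ord <- x[order(x$code, decreasing = TRUE), ]
dedup <- ord[!duplicated(ord[c("id", "code_day")]), ]
aggregate(cbind(res = price * freq) ~ id + day, data = dedup, FUN = sum)
```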