R - Group by a value and calculate the percentage of the whole group

EDIT: My question was not clear enough. I apologize.
The problem was to define groups and assign the values of a data frame column to them.
I solved it myself with a chain of ifelse() calls and the comments here. Thanks for that. I then did it manually for each column separately.
data %>%
  mutate(group = ifelse(richness <= -0.6, "1",
                 ifelse(richness > -0.6 & richness <= -0.2, "2",
                 ifelse(richness > -0.2 & richness <= 0.2, "3",
                 ifelse(richness > 0.2 & richness <= 0.6, "4",
                 ifelse(richness > 0.6, "5", NA)))))) %>%
  group_by(group) %>%
  summarise(percentage = n() * 100 / nrow(data))  # divide by the total number of rows

Using carb variable from mtcars data set as example:
prop.table(table(mtcars$carb)) * 100
     1      2      3      4      6      8
21.875 31.250  9.375 31.250  3.125  3.125
If you want to define the groups yourself you can use the cut function:
groups <- c(0,2,6,8) # interval values for the groups
prop.table(table(cut(mtcars$carb, breaks=groups))) * 100
 (0,2]  (2,6]  (6,8]
53.125 43.750  3.125
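cut() also drops straight into the dplyr pipeline from the question. A minimal sketch, assuming a data frame data with a numeric richness column as above:
library(dplyr)

data %>%
  mutate(group = cut(richness,
                     breaks = c(-Inf, -0.6, -0.2, 0.2, 0.6, Inf),
                     labels = c("1", "2", "3", "4", "5"))) %>%  # right-closed intervals match the <= boundaries above
  count(group) %>%                                              # rows per group
  mutate(percentage = n * 100 / sum(n))                         # share of the whole data set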

Workflow:
Add a dummy column;
Group by the dummy column;
Count the subgroups.
Here is some sample code:
require(dplyr)
# generate fake data
set.seed(123456)
sample <- data.frame(Nums = rep(NA, 100))
sample$Nums <- sample(-100:100, 100, replace = TRUE) / 100
size <- length(sample$Nums)
# add dummy column
sample <- sample %>%
  # change the dummy column condition as needed
  mutate(dummy = ifelse(Nums < 0, "A", "B")) %>%
  # group by the dummy column
  group_by(dummy) %>%
  # count the subgroups and calculate the percentage
  summarise(count = n(), percentage = n() * 100 / size)
head(sample)
# A tibble: 2 x 3
dummy count percentage
<chr> <int> <dbl>
1 A 50 50
2 B 50 50

Related

Calculating percentiles and showing them as stacked bars for benchmarking

This is a follow up question to Calculate proportions of multiple columns.
I have the following data:
location = rep(c("A", "B", "C", "D"),
               times = c(4, 6, 3, 7))
ID = (1:20)
Var1 = rep(c(0, 2, 1, 1, 0), times = 4)
Var2 = rep(c(2, 1, 1, 0, 2), times = 4)
Var3 = rep(c(1, 1, 0, 2, 0), times = 4)
df = as.data.frame(cbind(location, ID, Var1, Var2, Var3))
And with some help I counted occurrences and proportions of the different scores (0, 1, 2) in the different Vars like this:
df %>%
  pivot_longer(starts_with("Var"), values_to = "score") %>%
  type_convert() %>%
  group_by(location, name) %>%
  count(score) %>%
  mutate(frac = n / sum(n)) -> dfmut
Now I have a data frame that looks like this, which I called dfmut:
# A tibble: 36 × 5
# Groups: location, name [12]
location name score n frac
<chr> <chr> <dbl> <int> <dbl>
1 A Var1 0 2 0.4
2 A Var1 1 2 0.4
3 A Var1 2 1 0.2
4 A Var2 0 1 0.2
5 A Var2 1 2 0.4
6 A Var2 2 2 0.4
7 A Var3 0 2 0.4
8 A Var3 1 2 0.4
9 A Var3 2 1 0.2
10 B Var1 0 2 0.4
Now what I like to do is get the 10th, 25th, 75th and 90th percentiles of the scores that are not 0 and turn them into a stacked bar chart. I'll give you an example story: location (A, B, etc.) is gardens of different people where they grew different kinds of vegetables (Var1, Var 2, etc.). We scored how well the vegetables turned out with score 0 = optimal, score 1 = suboptimal, score 2 = failure.
The goal is to get a stacked bar chart that shows how high the proportion of non-optimal (score 1 and 2) vegetables is in the 10% best, the 25% best gardens, etc. Then I want to indicate to each gardener where they lie in the ranking regarding each Var.
This could look something like in the image with dark green: best 10% to dark pink-purplish: worst 10% with the dot indicating garden A.
I started making a new data frame with the quantiles, which is probably not very elegant, so feel free to point out how I could do this more efficiently:
dfmut %>%
  subset(name == "Var1") %>%
  subset(score == "1" | score == "2") -> Var1_12
Percentiles <- c("10", "25", "75", "90")
Var1 <- quantile(Var1_12$frac, probs = c(0.1, 0.25, 0.75, 0.9))
data <- data.frame(Percentiles, Var1)
dfmut %>%
  subset(name == "Var2") %>%
  subset(score == "1" | score == "2") -> Var2_12
data$Var2 <- quantile(Var2_12$frac, probs = c(0.1, 0.25, 0.75, 0.9))
data_tidy <- melt(data, id.vars = "Percentiles")  # melt() comes from the reshape2 package
I can't get any further than this. Probably because I'm on an entirely wrong path...
Thank you for your help!
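One way to avoid repeating the quantile step per Var is to filter and group dfmut once. A minimal sketch, assuming dfmut as built above and dplyr >= 1.0 (where summarise() may return several rows per group):
library(dplyr)

probs <- c(0.10, 0.25, 0.75, 0.90)

dfmut %>%
  ungroup() %>%
  filter(score %in% c(1, 2)) %>%                  # keep only non-optimal scores
  group_by(name) %>%
  summarise(Percentile = paste0(probs * 100, "th"),
            value = quantile(frac, probs = probs),
            .groups = "drop")
This returns the quantiles in long format (one row per Var and percentile), which is already the shape needed for a stacked bar chart.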

Counting Peaks in R per group(s) of a data.frame

This is a followup question to Counting peaks in r per group.
Reproducible data:
set.seed(949494)
Happiness <- round(runif(100, -100, 100))
ID <- rep(c("ID1", "ID2", "ID3", "ID4", "ID5"), 20)
Stimuli <- rep(1:4, 1)
DF <- data.frame(ID, Stimuli, Happiness)
Function to calculate the Happiness threshold per ID by using each ID's unique sd():
# 1SD
f.SD1 <- function(y) {
  SD1_thresh <- mean(y) + (1 * sd(y))
  return(SD1_thresh)
}
Function to identify when Happiness is above (TRUE = 1) or below (FALSE = 0) the threshold:
# SD1
f.Peaks_SD1 <- function(X, thresh) {
  H_peaks_1 <- ifelse(X >= thresh, 1, 0)
  return(H_peaks_1)
}
Now I want to group by ID and Stimuli so that I can determine average peaks per stimuli:
H_peaks_1_df <- DF %>%
  group_by(Stimuli, ID) %>%
  summarise(thresh_SD1 = f.SD1(Happiness),
            ttime = sum(Happiness > thresh_SD1),
            nP_H_SD1 = sum(diff(c(f.Peaks_SD1(Happiness, thresh = thresh_SD1), 0)) < 0))
H_peaks_1_df
summary(H_peaks_1_df)
Output:
The problem with this output is that the thresholds for the same ID are different because the sd() was calculated per ID per Stimuli. I want to calculate the sd() across all Stimuli per ID and then count peaks per Stimuli.
So, the output H_peaks_1_df here is perfect (group_by(Stimuli, ID)), just the column "thresh_SD1" should be the same value for ID1, namely "58.5" which is correctly calculated when grouping only by ID.
Is it possible in dplyr to execute the "thresh_SD1" calculation via group_by(ID) and then count peaks and total time via group_by(Stimuli, ID) in simple code?
Thanks in advance!
Yes, it is possible. Use head(), mean(), or a similar function to retrieve only one element from thresh_SD1.
H_peaks_1_df <- DF %>%
  group_by(ID) %>%
  mutate(thresh_SD1 = f.SD1(Happiness)) %>%
  group_by(ID, Stimuli) %>%
  summarise(thresh_SD1 = head(thresh_SD1, 1),
            ttime = sum(Happiness > thresh_SD1),
            nP_H_SD1 = sum(diff(c(f.Peaks_SD1(Happiness, thresh = thresh_SD1), 0)) < 0))
H_peaks_1_df
ID Stimuli thresh_SD1 ttime nP_H_SD1
<chr> <int> <dbl> <int> <int>
1 ID1 1 58.5 0 0
2 ID1 2 58.5 2 2
3 ID1 3 58.5 0 0
4 ID1 4 58.5 1 1
5 ID2 1 71.3 1 1
6 ID2 2 71.3 1 1
7 ID2 3 71.3 3 1
8 ID2 4 71.3 0 0
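An equivalent way that makes the per-ID threshold explicit is to compute it once and join it back; a hedged sketch, assuming the same DF and helper functions as above:
library(dplyr)

# one threshold per ID, computed across all Stimuli
thresholds <- DF %>%
  group_by(ID) %>%
  summarise(thresh_SD1 = f.SD1(Happiness), .groups = "drop")

H_peaks_1_df <- DF %>%
  left_join(thresholds, by = "ID") %>%
  group_by(ID, Stimuli) %>%
  summarise(thresh_SD1 = first(thresh_SD1),
            ttime = sum(Happiness > thresh_SD1),
            nP_H_SD1 = sum(diff(c(f.Peaks_SD1(Happiness, thresh = thresh_SD1), 0)) < 0),
            .groups = "drop")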

findInterval by group with dplyr [duplicate]

This question already has answers here:
How to quickly form groups (quartiles, deciles, etc) by ordering column(s) in a data frame
In this example I have a tibble with two variables:
a group variable gr
the variable of interest val
set.seed(123)
df <- tibble(gr = rep(1:3, each = 10),
             val = gr + rnorm(30))
Goal
I want to produce a discretized version of val using the function findInterval, but the breakpoints should be gr-specific, since in my actual data as well as in this example the distribution of val depends on gr. The breakpoints are determined within each group using the quartiles of val.
What I did
I first construct a nested tibble containing the vectors of breakpoints for each value of gr:
df_breakpoints <- bind_cols(gr = 1:3,
                            purrr::map_dfr(1:3, function(gr) {
                              c(-Inf, quantile(df$val[df$gr == gr], c(0.25, 0.5, 0.75)), Inf)
                            })) %>%
  nest(bp = -gr) %>%
  mutate(bp = purrr::map(.$bp, unlist))
Then I join it with df:
df <- inner_join(df, df_breakpoints, by = "gr")
My first guess to define the discretized variable lvl was
df %>% mutate(lvl = findInterval(x = val, vec = bp))
It produces the error
Error : Problem with `mutate()` input `lvl2`.
x 'vec' must be sorted non-decreasingly and not contain NAs
ℹ Input `lvl` is `findInterval(x = val, vec = bp)`.
Then I tried
df$lvl <- purrr::imap_dbl(1:nrow(df),
                          ~ findInterval(x = df$val[.x], vec = df$bp[[.x]]))
or
df %>% mutate(lvl = purrr::map2_int(df$val, df$bp, findInterval))
It does work. However, it is highly inefficient. With my actual data (1.2 million rows) it takes several minutes to run. I guess there is a much better way of doing this than iterating over rows. Any ideas?
You can do this in a group_by + mutate step:
library(dplyr)
df %>%
  group_by(gr) %>%
  mutate(breakpoints = findInterval(val,
                                    c(-Inf, quantile(val, c(0.25, 0.5, 0.75)), Inf))) %>%
  ungroup
# gr val breakpoints
# <int> <dbl> <int>
# 1 1 0.440 1
# 2 1 0.770 2
# 3 1 2.56 4
# 4 1 1.07 3
# 5 1 1.13 3
# 6 1 2.72 4
# 7 1 1.46 4
# 8 1 -0.265 1
# 9 1 0.313 1
#10 1 0.554 2
# … with 20 more rows
findInterval is applied for each gr separately.
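The duplicate target covers more general recipes; for plain quartile groups, dplyr's ntile() is a shorter alternative, though it assigns equal-sized bins by rank rather than cutting at the quantile breakpoints, so ties can land differently:
library(dplyr)

df %>%
  group_by(gr) %>%
  mutate(lvl = ntile(val, 4)) %>%  # 1-4: roughly equal-sized quartile bins within each gr
  ungroup()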

Select a number of top groups from data frame

Is there an efficient way to grab some number of top groups from a data frame in R?
For example:
exampleDf <- data.frame(
  subchar = c("facebook", "twitter", "snapchat", "male", "female", "18", "20"),
  superchar = c("social media", "social media", "social media", "gender", "gender", "age", "age"),
  cweight = c(.2, .4, .4, .7, .3, .8, .6),
  groupWeight = c(10, 10, 10, 20, 20, 70, 70)
)
So with dplyr I can group them and sort by group weight with:
sortedDf <- exampleDf %>%
  group_by(superchar) %>%
  arrange(desc(groupWeight))
But is there any way to select the 'top' groups, like age and gender in this case? Kind of like the slice() dplyr function, but for the whole group rather than for rows within the group.
dplyr has a group_indices function that can be used to assign a consecutive group number. Then filter by that new number. In the example below, I filter/keep the first 2 groups.
library(dplyr)
Top <- 2
sortedDf <- exampleDf %>%
  group_by(superchar) %>%
  arrange(desc(groupWeight)) %>%
  mutate(new_id = group_indices()) %>%
  filter(new_id <= Top) %>%
  select(-new_id)
sortedDf
## A tibble: 4 x 4
## Groups: superchar [2]
# subchar superchar cweight groupWeight
# <fct> <fct> <dbl> <dbl>
#1 18 age 0.8 70
#2 20 age 0.6 70
#3 male gender 0.7 20
#4 female gender 0.3 20
Here are two other approaches using dplyr:
We calculate the sum of groupWeight for each superchar, select the top 2 records, and do a left_join with the original dataframe to select all the rows.
n <- 2
library(dplyr)
exampleDf %>%
  group_by(superchar) %>%
  summarise(sum_gr = sum(groupWeight)) %>%
  top_n(n, sum_gr) %>%
  left_join(exampleDf)
# A tibble: 4 x 5
# superchar sum_gr subchar cweight groupWeight
# <fct> <dbl> <fct> <dbl> <dbl>
#1 age 140 18 0.8 70
#2 age 140 20 0.6 70
#3 gender 40 male 0.7 20
#4 gender 40 female 0.3 20
Another approach is to sum groupWeight by superchar and use dense_rank to select top groups.
exampleDf %>%
  group_by(superchar) %>%
  mutate(sum_gr = sum(groupWeight)) %>%
  ungroup() %>%
  filter(dense_rank(-sum_gr) <= n)
The first approach can be written in base R as :
temp <- aggregate(groupWeight~superchar, exampleDf, sum)
temp <- temp[order(temp$groupWeight, decreasing = TRUE), ][1:n, ]
merge(temp, exampleDf, all.x = TRUE, by = 'superchar')
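If you are on dplyr 1.0 or later, top_n() has been superseded by slice_max(); a minimal sketch of the first approach with it:
library(dplyr)

n <- 2
exampleDf %>%
  group_by(superchar) %>%
  summarise(sum_gr = sum(groupWeight), .groups = "drop") %>%
  slice_max(sum_gr, n = n) %>%               # keep the n groups with the largest total weight
  left_join(exampleDf, by = "superchar")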

Iterate through columns and row values (list) in R dplyr

This question is based on the following post with additional requirements (Iterate through columns in dplyr?).
The original code is as follows:
df <- data.frame(col1 = rep(1, 15),
                 col2 = rep(2, 15),
                 col3 = rep(3, 15),
                 group = c(rep("A", 5), rep("B", 5), rep("C", 5)))
for(col in c("col1", "col2", "col3")){
  filt.df <- df %>%
    filter(group == "A") %>%
    select_(.dots = c('group', col))
  # do other things, like ggplotting
  print(filt.df)
}
My objective is to output a frequency table for each unique COL by GROUP combination. The current example specifies a dplyr filter based on a GROUP value A, B, or C. In my case, I want to iterate (loop) through a list of values in GROUP (list <- c("A", "B", "C")) and generate a frequency table for each combination.
The frequency table is based on counts. For Col1 the result would look something like the table below. The example data set is simplified. My real dataset is more complex with multiple 'values' per 'group'. I need to iterate through Col1-Col3 by group.
group value n prop
A 1 5 .1
B 2 5 .1
C 3 5 .1
A better example of the frequency table is here: How to use dplyr to generate a frequency table
I struggled with this for a couple of days, and I could have done better with my example. Thanks for the posts. Here is what I ended up doing to solve this. The result is a series of frequency tables for each column and each unique value found in group. I had 3 columns (col1, col2, col3) and 3 unique values in group (A, B, C), so 3x3. The result is 9 frequency tables, plus a nonsensical frequency table for each group value itself. I am sure there is a better way to do this. The output generates some labeling, which is useful.
# Build unique group list
group <- unique(df$group)
# Generate frequency tables via a loop
# freq() here comes from an external package (e.g. summarytools or descr)
iterate_by_group <- function(x) {
  for (i in 1:length(group)) {
    filt.df <- df[df$group == group[i], ]
    print(lapply(filt.df, freq))
  }
}
# Run
iterate_by_group(df)
We could gather into long format and then get the frequency (n()) by group
library(tidyverse)
gather(df, value, val, col1:col3) %>%
  group_by(group, value = parse_number(value)) %>%
  summarise(n = n(), prop = n / nrow(.))
# A tibble: 9 x 4
# Groups: group [?]
# group value n prop
# <fct> <dbl> <int> <dbl>
#1 A 1 5 0.111
#2 A 2 5 0.111
#3 A 3 5 0.111
#4 B 1 5 0.111
#5 B 2 5 0.111
#6 B 3 5 0.111
#7 C 1 5 0.111
#8 C 2 5 0.111
#9 C 3 5 0.111
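gather() is superseded by pivot_longer() in current tidyr; the same frequency table with it, as a hedged sketch:
library(dplyr)
library(tidyr)
library(readr)   # parse_number()

df %>%
  pivot_longer(col1:col3, names_to = "name", values_to = "val") %>%
  group_by(group, value = parse_number(name)) %>%  # col1 -> 1, col2 -> 2, ...
  summarise(n = n(), .groups = "drop") %>%
  mutate(prop = n / sum(n))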
Is this what you want?
df %>%
  group_by(group) %>%
  summarise_all(funs(freq = sum))  # funs() is soft-deprecated; list(freq = sum) works in newer dplyr
