Computing Percentages of each Subgroup - r

This question has been answered before, but the solutions are not working for my particular situation.
col1 | col2
A | 0
B | 1
A | 0
A | 1
B | 0
I'm basically looking for this:
col1 | col2 | Percentage
A | 0 | 0.67
A | 1 | 0.33
B | 0 | 0.50
B | 1 | 0.50
Both columns are factors. The following solution is what I keep finding on other threads:
df %>% group_by(col1, col2) %>% summarise(n=n()) %>% mutate(freq = n / sum(n))
or something along those lines.
In fact, group_by doesn't really seem to be doing anything at all. It's not giving me an 'n' or 'freq' column. Don't know what I'm doing wrong. Is it because I'm working with factors? Also, if it's not obvious, the values provided in the columns are hypothetical.

One option is to get the frequency count after grouping by 'col1', then add 'col2' as a further grouping column and divide each subgroup's count by the count already computed for 'col1':
library(dplyr)
df %>%
group_by(col1) %>%
mutate(n = n()) %>%
group_by(col2, .add = TRUE) %>% # '.add' replaces the deprecated 'add' argument (dplyr >= 1.0.0)
summarise(freq = n()/n[1])
# A tibble: 4 x 3
# Groups: col1 [2]
# col1 col2 freq
# <chr> <int> <dbl>
#1 A 0 0.667
#2 A 1 0.333
#3 B 0 0.5
#4 B 1 0.5
data
df <- structure(list(col1 = c("A", "B", "A", "A", "B"), col2 = c(0L,
1L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA, -5L
))
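For comparison, the same table can be sketched in base R with table() and prop.table(). This is an alternative, not the approach of the answer above; the names tab, prop and res are illustrative only, and the data are the hypothetical values from the question.

```r
# Hypothetical data from the question
df <- data.frame(
  col1 = c("A", "B", "A", "A", "B"),
  col2 = c(0L, 1L, 0L, 1L, 0L)
)

# Cross-tabulate col1 (rows) against col2 (columns)
tab <- table(df$col1, df$col2)

# margin = 1 divides each count by its row total,
# i.e. by the size of the corresponding col1 group
prop <- prop.table(tab, margin = 1)

# Back to long form: one row per (col1, col2) combination
res <- as.data.frame(prop, responseName = "Percentage")
names(res)[1:2] <- c("col1", "col2")
res <- res[order(res$col1, res$col2), ]
rownames(res) <- NULL
res
```

Note that a combination that never occurs in the data would still appear here, with a Percentage of 0, whereas the grouped dplyr version simply omits it.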

Related

how to group_by one variable and count based on another variable?

Is it possible to use group_by to group one variable and count the target variable based on another variable?
For example,
x1 | x2 | x3
A | 1 | 0
B | 2 | 1
C | 3 | 0
B | 1 | 1
A | 1 | 1
I want to count the 0s and 1s of x3 within each group of x1:
x1 | x3=0 | x3=1
A | 1 | 1
B | 0 | 2
C | 1 | 0
Is it possible to use group_by and add something to summarize? I tried group_by both x1 and x3, but that gives x3 as the second column which is not what we are looking for.
If it's not possible to just use group_by, I was thinking we could group_by both x1 and x3, then split by x3 and cbind them, but the two dataframes after split have different lengths of rows, and there's no cbind_fill. What should I do to cbind them and fill the extra blanks?
Using the data.table package:
library(data.table)
dat <- as.data.table(dataset)
dat[, x3:= paste0("x3=", x3)]
result <- dcast(dat, x1~x3, value.var = "x3", fun.aggregate = length)
A tidyverse approach to achieve your desired result using dplyr::count + tidyr::pivot_wider:
library(dplyr)
library(tidyr)
df %>%
count(x1, x3) %>%
pivot_wider(names_from = "x3", values_from = "n", names_prefix = "x3=", values_fill = 0)
#> # A tibble: 3 × 3
#> x1 `x3=0` `x3=1`
#> <chr> <int> <int>
#> 1 A 1 1
#> 2 B 0 2
#> 3 C 1 0
DATA
df <- data.frame(
x1 = c("A", "B", "C", "B", "A"),
x2 = c(1L, 2L, 3L, 1L, 1L),
x3 = c(0L, 1L, 0L, 1L, 1L)
)
Yes, it is possible. Here is an example:
library(dplyr)
library(tidyr)
dat = read.table(text = "x1 x2 x3
A 1 0
B 2 1
C 3 0
B 1 1
A 1 1", header = TRUE)
dat %>% group_by(x1) %>%
count(x3) %>%
pivot_wider(names_from = x3,
names_glue = "x3 = {x3}",
values_from = n) %>%
replace(is.na(.),0)
# A tibble: 3 x 3
# Groups: x1 [3]
# x1 `x3 = 0` `x3 = 1`
# <chr> <int> <int>
#1 A 1 1
#2 B 0 2
#3 C 1 0
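If no extra packages are wanted, a contingency table gives the same counts. The following is only a sketch in base R, using the data frame from the answers above; counts and wide are illustrative names.

```r
df <- data.frame(
  x1 = c("A", "B", "C", "B", "A"),
  x2 = c(1L, 2L, 3L, 1L, 1L),
  x3 = c(0L, 1L, 0L, 1L, 1L)
)

# Count every x1/x3 combination; combinations that never occur get 0
counts <- table(df$x1, df$x3)

# Turn the contingency table into a wide data frame (one column per x3 value)
wide <- as.data.frame.matrix(counts)
names(wide) <- paste0("x3=", names(wide))

# Move the x1 labels from the row names into a regular column
wide <- cbind(x1 = rownames(wide), wide)
rownames(wide) <- NULL
wide
```

Because table() fills in missing combinations with 0, no cbind_fill-style padding or NA replacement is needed.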

Finding maximum difference between columns of same name in R

I have the following table in R. I have 2 A columns, 3 B columns and 1 C column. I need to calculate the maximum difference possible between any columns of the same name and return the column name as output.
For row 1
The max difference between A is 2
The max difference between B is 4
I need the output as B
For row 2
The max difference between A is 3
The max difference between B is 2
I need the output as A
A | A | B | B | B | C
2 | 4 | 5 | 2 | 1 | 0
-3 | 0 | 2 | 3 | 4 | 2
First of all, it's a bit dangerous (and not allowed in some cases) to have non-unique column names, so the first thing I did was to uniqueify the names using base::make.unique(). From there, I used tidyr::pivot_longer() so that the grouping information contained in the column names could be accessed more easily. Here I use a regex inside names_pattern to discard the differentiating parts of the column names so they will be the same again. Then we use dplyr::group_by() followed by dplyr::summarize() to get the largest difference in each id and grp which corresponds to your rows and similar columns in the original data. Finally we use dplyr::slice_max() to return only the largest difference per group.
library(tidyverse)
d <- structure(list(A = c(2L, -3L), A = c(4L, 0L), B = c(5L, 2L), B = 2:3, B = c(1L, 4L), C = c(0L, 2L)), row.names = c(NA, -2L), class = "data.frame")
# give unique names
names(d) <- make.unique(names(d), sep = "_")
d %>%
mutate(id = row_number()) %>%
pivot_longer(-id, names_to = "grp", names_pattern = "([A-Z])*") %>%
group_by(id, grp) %>%
summarise(max_diff = max(value) - min(value)) %>%
slice_max(order_by = max_diff, n = 1, with_ties = FALSE)
#> `summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
#> # A tibble: 2 x 3
#> # Groups: id [2]
#> id grp max_diff
#> <int> <chr> <int>
#> 1 1 B 4
#> 2 2 A 3
Created on 2022-02-14 by the reprex package (v2.0.1)
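The renaming and regrouping steps described in this answer can be checked in isolation. The sketch below uses only base R and assumes the original column names contain no underscore, so that stripping the suffix recovers the group; nms, u, grp, m, diffs and winner are illustrative names.

```r
# Duplicate column names as in the question
nms <- c("A", "A", "B", "B", "B", "C")

# make.unique() appends a numeric suffix to every repeat
u <- make.unique(nms, sep = "_")
u
#> [1] "A"   "A_1" "B"   "B_1" "B_2" "C"

# Dropping everything from the underscore onwards recovers the group label
grp <- sub("_.*$", "", u)

# The two rows of the question as a plain matrix
m <- rbind(c(2, 4, 5, 2, 1, 0),
           c(-3, 0, 2, 3, 4, 2))

# For each group of columns, max - min within every row
diffs <- sapply(split(seq_along(grp), grp), function(idx) {
  apply(m[, idx, drop = FALSE], 1, function(v) diff(range(v)))
})

# Group with the largest difference in each row
winner <- colnames(diffs)[max.col(diffs, ties.method = "first")]
winner
```

A single-column group such as C always has a difference of 0, so it can only win a row when every other group is also constant in that row.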
Here is a base R option using aggregate + range + diff + which.max:
df$max_diff <- with(
p <- aggregate(
. ~ id,
cbind(id = names(df), as.data.frame(t(df))),
function(v) diff(range(v))
),
id[sapply(p[-1],which.max)]
)
which gives
> df
A A B B B C max_diff
1 2 4 5 2 1 0 B
2 -3 0 2 3 4 2 A
data
> dput(df)
structure(list(A = c(2L, -3L), A = c(4L, 0L), B = c(5L, 2L),
B = 2:3, B = c(1L, 4L), C = c(0L, 2L), max_diff = c("B",
"A")), row.names = c(NA, -2L), class = "data.frame")
We may also use split.default to split the columns by name and then use max.col to find the index of the maximum difference in each row:
m1 <- sapply(split.default(df, names(df)), \(x)
apply(x, 1, \(u) diff(range(u))))
df$max_diff <- colnames(m1)[max.col(m1, "first")]
df$max_diff
[1] "B" "A"

Conditions based on dynamic column names in R

I have the following data:
name | product1_flag1 | product1_flag2 | product1_flag3 | product2_flag1 | product2_flag2 | product2_flag3
lmn | 0 | 1 | 0 | 1 | 0 | 1
Here, the product names and the number of products are dynamic. I want to create a new column Product1_Final_Flag based on multiple flag values for each name, e.g. if ((flag1 = 1 or flag2 = 0) and flag3 = 1) then "1" else "0".
The expected output is as follows:
name | Product1_final_Flag | Product2_final_Flag
lmn | 0 | 1
How should I achieve the same?
Using DF shown reproducibly in the Note at the end, convert to long form having columns name, product, flag and value. Then convert to wide form having columns name, product, flag1, flag2 and flag3. Compute flag, append "_final_flag" to the product and select desired columns. Finally use pivot_wider to produce products along the top.
library(dplyr)
library(tidyr)
DF %>%
pivot_longer(-1, names_to = c("product", "flag"), names_sep = "_") %>%
pivot_wider(names_from = "flag") %>%
mutate(flag = (flag1 | !flag2) * flag3,
product = paste0(product, "_final_flag")) %>%
select(name, product, flag) %>%
pivot_wider(names_from = "product", values_from = "flag")
## # A tibble: 1 x 3
## name product1_final_flag product2_final_flag
## <chr> <int> <int>
## 1 lmn 0 1
Note
DF is shown in a reproducible manner here:
DF <- structure(list(name = "lmn", product1_flag1 = 0L, product1_flag2 = 1L,
product1_flag3 = 0L, product2_flag1 = 1L, product2_flag2 = 0L,
product2_flag3 = 1L), class = "data.frame", row.names = c(NA, -1L))
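The same rule can also be applied without reshaping, by looping over the product prefixes found in the column names. This is a rough base R sketch, assuming every flag column is named <product>_flag<k> and that the condition is (flag1 = 1 or flag2 = 0) and flag3 = 1, as in the answer above; flag_cols, products, final and out are illustrative names.

```r
DF <- data.frame(
  name = "lmn",
  product1_flag1 = 0L, product1_flag2 = 1L, product1_flag3 = 0L,
  product2_flag1 = 1L, product2_flag2 = 0L, product2_flag3 = 1L
)

# Discover the product prefixes from the column names
# (assumes the pattern <product>_flag<k>)
flag_cols <- grep("_flag[0-9]+$", names(DF), value = TRUE)
products <- unique(sub("_flag[0-9]+$", "", flag_cols))

# Apply (flag1 == 1 or flag2 == 0) and flag3 == 1 to each product
final <- lapply(products, function(p) {
  f1 <- DF[[paste0(p, "_flag1")]]
  f2 <- DF[[paste0(p, "_flag2")]]
  f3 <- DF[[paste0(p, "_flag3")]]
  as.integer((f1 == 1 | f2 == 0) & f3 == 1)
})
names(final) <- paste0(products, "_final_flag")

# One row per name, one final flag column per product
out <- cbind(DF["name"], as.data.frame(final))
out
```

Since the products are read from the column names, adding a product3_flag1 to product3_flag3 triple would produce a product3_final_flag column without any change to the code.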

Is there an R function to reshape this data from long to wide?

How the data looks now:
Coach ID | Student | score |
---------------------------------
1 | A | 8 |
1 | B | 3 |
2 | A | 5 |
2 | B | 4 |
2 | C | 7 |
To look like this:
Coach ID | Student | score | student_2|score_2| student_3|score_3
------------------------------------------------------------------
1 | A | 8 | B | 3 |
2 | A | 5 | B | 4 | C | 7
Is there any way to reshape the data from long to wide?
Thanks!
You could create a new identifier column with a unique value for every student and then use pivot_wider to cast multiple columns to wide format.
library(dplyr)
df %>%
mutate(name = as.integer(factor(Student))) %>%
tidyr::pivot_wider(names_from = name, values_from = c(Student, score))
# CoachID Student_1 Student_2 Student_3 score_1 score_2 score_3
# <int> <fct> <fct> <fct> <int> <int> <int>
#1 1 A B NA 8 3 NA
#2 2 A B C 5 4 7
data
df <- structure(list(CoachID = c(1L, 1L, 2L, 2L, 2L), Student = structure(c(1L,
2L, 1L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"),
score = c(8L, 3L, 5L, 4L, 7L)), class = "data.frame", row.names = c(NA, -5L))
In base R, you could use the reshape function:
reshape(transform(df, time = as.numeric(factor(Student))), idvar = "CoachID", direction = "wide", sep = "_")
CoachID Student_1 score_1 Student_2 score_2 Student_3 score_3
1 1 A 8 B 3 <NA> NA
3 2 A 5 B 4 C 7
library(tidyverse)
mydf <- tribble(
~`Coach ID` , ~Student, ~score,
# ---------------------------------
1 ,"A" , 8,
1 , "B" , 3,
2 , "A" , 5,
2 , "B" , 4,
2 , "C" , 7
)
mydf <- mydf %>%
group_by(`Coach ID`) %>%
mutate(index = row_number())
#----- Reshape to wide
mydf %>%
pivot_wider(
id_cols = `Coach ID`,
names_from = index,
values_from = Student:score
)
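The answers above build the column index in two different ways: from the student label (factor(Student)) or from the row position within each coach (row_number()). Both give the same result on this data, but they would differ for a coach whose students do not start at "A". As a sketch only, the position-based index can also be combined with base R's reshape(); the column name time and the object wide are illustrative.

```r
df <- data.frame(
  CoachID = c(1L, 1L, 2L, 2L, 2L),
  Student = c("A", "B", "A", "B", "C"),
  score   = c(8L, 3L, 5L, 4L, 7L)
)

# Running position of each row within its coach: 1, 2, 1, 2, 3
df$time <- ave(seq_along(df$CoachID), df$CoachID, FUN = seq_along)

# One row per coach; Student and score are spread over the positions
wide <- reshape(df, idvar = "CoachID", timevar = "time",
                direction = "wide", sep = "_")
wide
```

With a position-based index, the first student of every coach always lands in Student_1, whichever student that is.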

R data frame: Create weighted average group-wise

I'm working on an R data frame with the columns GROUP_COL | TIME | VALUE. TIME is in order, VALUE is numerical, and GROUP_COL is a categorical variable I want to group the data by.
My goal is to:
1. group by the GROUP_COL variable,
2. order by TIME,
3. calculate a weighted mean for the values in each group using the formula value = 0.1 * previous_value + 0.9 * value for each row. If there is no previous value, leave the value as it is.
This weighted value should be stored in a separate column WEIGHTED.
What I tried so far: using dplyr, I created a vector of previous values using lag():
weighted_avg_with_previous <- function(.data, lag_weight=0.1) {
# get previous values
lag_val <- lag(.data$VALUE, n = 1L, default = 0, order_by = .data$TIME)
# give each value a weight 0.9 for current value and 0.1 for previous value
weighted = (1 -lag_weight) * .data$VALUE + lag_weight * lag_val
return (weighted)
}
data <- data %>%
group_by(SALES_RESPONSIBILITY, PRODUCT_AREA, CURRENCY, FORECAST_TYPE) %>%
arrange(HORIZON, .by_group=TRUE) %>%
mutate(WEIGHTED_VALUE = weighted_avg_with_previous(0.1))
However, the mutate statement throws an error. How can I get my weighted_avg_with_previous function to run on each group separately?
Example:
GROUP | TIME| VALUE | WEIGHTED VALUE
_____________________________________
A | 1 | 1 | 1
A | 2 | 2 | 1.9
A | 3 | 3 | 2.9
A | 4 | 4 | 3.9
B | 1 | 3 | 3
B | 2 | 7 | 6.6
B | 3 | -4 | -2.9
...
Best,
Julia
library(tidyverse)
df <- structure(list(GROUP = c("A", "A", "A", "A", "B", "B", "B"),
TIME = c(1L, 2L, 3L, 4L, 1L, 2L, 3L), VALUE = c(1L, 2L, 3L,
4L, 3L, 7L, -4L)), row.names = c(NA, -7L), class = c("tbl_df",
"tbl", "data.frame"))
df %>%
group_by(GROUP) %>%
mutate(previous.value = lag(VALUE)) %>%
mutate(weighted.value = ifelse(is.na(previous.value),VALUE, 0.1*previous.value + 0.9*VALUE)) %>%
select(-previous.value)
The first mutate() statement creates a new variable holding the lagged value, and the second creates weighted.value, which equals 0.1*previous.value + 0.9*VALUE, or VALUE itself if previous.value is NA (the first row of each group).
Output:
# A tibble: 7 x 4
# Groups: GROUP [2]
GROUP TIME VALUE weighted.value
<chr> <int> <int> <dbl>
1 A 1 1 1
2 A 2 2 1.9
3 A 3 3 2.9
4 A 4 4 3.9
5 B 1 3 3
6 B 2 7 6.6
7 B 3 -4 -2.9
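As for the error in the question: the call weighted_avg_with_previous(0.1) binds 0.1 to the function's first parameter, .data, so .data$VALUE is evaluated on a number rather than on a data frame. One possible way around this, sketched here in base R under the assumption that the helper should receive the values of a single group, is to pass a plain vector and let ave() do the grouping. The names weighted_with_previous and WEIGHTED are illustrative, and the data are those of the example.

```r
df <- data.frame(
  GROUP = c("A", "A", "A", "A", "B", "B", "B"),
  TIME  = c(1, 2, 3, 4, 1, 2, 3),
  VALUE = c(1, 2, 3, 4, 3, 7, -4)
)

# Takes the values of ONE group, already ordered by time
weighted_with_previous <- function(value, lag_weight = 0.1) {
  prev <- c(NA, head(value, -1))  # previous value; NA for the first row
  ifelse(is.na(prev), value,
         lag_weight * prev + (1 - lag_weight) * value)
}

# Order by group and time so that "previous" means the previous time step
df <- df[order(df$GROUP, df$TIME), ]

# ave() applies the function to VALUE separately within each GROUP
df$WEIGHTED <- ave(df$VALUE, df$GROUP,
                   FUN = function(v) weighted_with_previous(v, lag_weight = 0.1))
df
```

Sorting explicitly before taking the lag covers the "order by TIME" step of the question, which matters whenever the rows do not already arrive in time order. The sketch assumes VALUE itself contains no missing values.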

Resources