How to find all combinations in column and count occurrences in data - r

I am trying to find all actual combinations within my data of values in column 1.
I then want to count all occurrences of these by column 2.
It feels like R should be able to do this fairly quickly. I tried reading up on combn and expand.grid, but with no success. The main problem was I could not find any guidance on how to generate combinations within a column.
My data looks like:
Animal (n=57) | Person ID (n=1000)
Dog | 0001
Cat | 0004
Bird | 0001
Snake | 0002
Spider | 0002
Cat | 0003
Dog | 0004
Expected output is:
AnimalComb | CountbyID
Cat | 1
DogBird | 1
SnakeSpider | 1
CatDog | 1
EDIT deleted an erroneous entry for cat

If I have understood you correctly, you need to group by PersonID, paste together the unique Animal values in each group, and count how often that combination occurs. The count is the number of rows in the group (n()) divided by the number of distinct values (n_distinct()).
library(dplyr)
df %>%
group_by(PersonID) %>%
summarise(AnimalComb = paste(unique(Animal), collapse = ""),
CountbyID = n() / n_distinct(Animal))
# PersonID AnimalComb CountbyID
# <int> <chr> <dbl>
#1 1 DogBird 1
#2 2 SnakeSpider 1
#3 3 Cat 1
#4 4 CatDog 1
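If the aim is instead one row per distinct combination, with a count of how many people share it, a possible sketch is to build the combination per person first and then tally identical combinations. This assumes the order of animals within a person should not matter, hence the sort(), which is an addition and not part of the answer above; combo_counts is just an illustrative name.

```r
library(dplyr)

# Same data as in the answers' data section
df <- data.frame(
  Animal   = c("Dog", "Cat", "Bird", "Snake", "Spider", "Cat", "Dog"),
  PersonID = c(1L, 4L, 1L, 2L, 2L, 3L, 4L)
)

combo_counts <- df %>%
  group_by(PersonID) %>%
  # one combination string per person; sort() makes "DogBird" and "BirdDog" identical
  summarise(AnimalComb = paste(sort(unique(Animal)), collapse = ""),
            .groups = "drop") %>%
  # number of people sharing each combination
  count(AnimalComb, name = "CountbyID")

combo_counts
```

With the sample data every combination belongs to exactly one person, so each count is 1.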

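An option using data.table (note that toString() separates the animals with ", ", so the result reads "Dog, Bird" rather than "DogBird")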
library(data.table)
setDT(df)[, .(AnimalComb = toString(unique(Animal)),
CountbyID = .N/uniqueN(Animal)), by = PersonID]
data
df <- structure(list(Animal = c("Dog", "Cat", "Bird", "Snake", "Spider",
"Cat", "Dog"), PersonID = c(1L, 4L, 1L, 2L, 2L, 3L, 4L)),
class = "data.frame", row.names = c(NA, -7L))

Related

Finding maximum difference between columns of same name in R

I have the following table in R. I have 2 A columns, 3 B columns and 1 C column. I need to calculate the maximum difference possible between any columns of the same name and return the column name as output.
For row 1
The max difference between A is 2
The max difference between B is 4
I need the output as B
For row 2
The max difference between A is 3
The max difference between B is 2
I need the output as A
| A | A | B | B | B | C |
| 2 | 4 |5 |2 |1 |0 |
| -3 |0 |2 |3 |4 |2 |
First of all, it's a bit dangerous (and not allowed in some cases) to have non-unique column names, so the first thing I did was to uniqueify the names using base::make.unique(). From there, I used tidyr::pivot_longer() so that the grouping information contained in the column names could be accessed more easily. Here I use a regex inside names_pattern to discard the differentiating parts of the column names so they will be the same again. Then we use dplyr::group_by() followed by dplyr::summarize() to get the largest difference in each id and grp which corresponds to your rows and similar columns in the original data. Finally we use dplyr::slice_max() to return only the largest difference per group.
library(tidyverse)
d <- structure(list(A = c(2L, -3L), A = c(4L, 0L), B = c(5L, 2L), B = 2:3, B = c(1L, 4L), C = c(0L, 2L)), row.names = c(NA, -2L), class = "data.frame")
# give unique names
names(d) <- make.unique(names(d), sep = "_")
d %>%
mutate(id = row_number()) %>%
pivot_longer(-id, names_to = "grp", names_pattern = "([A-Z])*") %>%
group_by(id, grp) %>%
summarise(max_diff = max(value) - min(value)) %>%
slice_max(order_by = max_diff, n = 1, with_ties = FALSE)
#> `summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
#> # A tibble: 2 x 3
#> # Groups: id [2]
#> id grp max_diff
#> <int> <chr> <int>
#> 1 1 B 4
#> 2 2 A 3
Created on 2022-02-14 by the reprex package (v2.0.1)
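To see what the renaming step does on its own, here is a small base R sketch. It only illustrates make.unique() and how the suffix can be stripped again; the sub() call is an assumed alternative to the names_pattern regex used in the answer, not part of it.

```r
# Duplicate column names as in the question
nms <- c("A", "A", "B", "B", "B", "C")

# make.unique() appends a numbered suffix to repeated names
unique_nms <- make.unique(nms, sep = "_")
unique_nms

# Strip the suffix to recover the group each column belongs to
grp <- sub("_[0-9]+$", "", unique_nms)
grp
```

The first occurrence of each name is left untouched and later ones get "_1", "_2", and so on, which is why the grouping information can be recovered by dropping the suffix.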
Here is base R option using aggregate + range + diff + which.max
df$max_diff <- with(
p <- aggregate(
. ~ id,
cbind(id = names(df), as.data.frame(t(df))),
function(v) diff(range(v))
),
id[sapply(p[-1],which.max)]
)
which gives
> df
A A B B B C max_diff
1 2 4 5 2 1 0 B
2 -3 0 2 3 4 2 A
data
> dput(df)
structure(list(A = c(2L, -3L), A = c(4L, 0L), B = c(5L, 2L),
B = 2:3, B = c(1L, 4L), C = c(0L, 2L), max_diff = c("B",
"A")), row.names = c(NA, -2L), class = "data.frame")
We may also use split.default to split based on the column names similarity and then with max.col find the index of the max diff
m1 <- sapply(split.default(df, names(df)), \(x)
apply(x, 1, \(u) diff(range(u))))
df$max_diff <- colnames(m1)[max.col(m1, "first")]
df$max_diff
[1] "B" "A"
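It may help to look at the intermediate matrix m1 that the split.default approach builds. The sketch below rebuilds the sample data (without the max_diff column) and uses function(x) instead of the \(x) shorthand so that it also runs on R versions before 4.1; otherwise it follows the same steps.

```r
# Sample data with duplicated column names
df <- data.frame(c(2L, -3L), c(4L, 0L), c(5L, 2L), 2:3, c(1L, 4L), c(0L, 2L))
names(df) <- c("A", "A", "B", "B", "B", "C")

# One column per name group, one row per original row:
# each cell is max - min over the columns sharing that name
m1 <- sapply(split.default(df, names(df)), function(x)
  apply(x, 1, function(u) diff(range(u))))
m1

# Name of the group with the largest range in each row
winner <- colnames(m1)[max.col(m1, "first")]
winner
```

A group with a single column, such as C, always has a range of 0, so it can only win when every other group is also 0.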

Conditions based on dynamic column names in R

I have the following data:
name | product1_flag1 | product1_flag2 | product1_flag3 | product2_flag1 | product2_flag2 | product2_flag3
lmn | 0 | 1 | 0 | 1 | 0 | 1
Here, product names and the number of products are dynamic. I want to create a new column Product1_Final_Flag based on multiple flag values for each name, like if((flag1=1 or flag2=0) and flag3=1) then "1" else "0".
Expected output is as follows:
name | Product1_final_Flag | Product2_final_Flag
lmn | 0 | 1
How should I achieve the same?
Using DF shown reproducibly in the Note at the end, convert to long form having columns name, product, flag and value. Then convert to wide form having columns name, product, flag1, flag2 and flag3. Compute flag, append "_final_flag" to the product and select desired columns. Finally use pivot_wider to produce products along the top.
library(dplyr)
library(tidyr)
DF %>%
pivot_longer(-1, names_to = c("product", "flag"), names_sep = "_") %>%
pivot_wider(names_from = "flag") %>%
mutate(flag = (flag1 | !flag2) * flag3,
product = paste0(product, "_final_flag")) %>%
select(name, product, flag) %>%
pivot_wider(names_from = "product", values_from = "flag")
## # A tibble: 1 x 3
## name product1_final_flag product2_final_flag
## <chr> <int> <int>
## 1 lmn 0 1
Note
DF is shown in a reproducible manner here:
DF <- structure(list(name = "lmn", product1_flag1 = 0L, product1_flag2 = 1L,
product1_flag3 = 0L, product2_flag1 = 1L, product2_flag2 = 0L,
product2_flag3 = 1L), class = "data.frame", row.names = c(NA, -1L))
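The flag expression inside mutate() relies on R treating 0/1 integers as logical values. A minimal sketch of that arithmetic on plain vectors, using the two products from the sample row (the vector names here are only for illustration):

```r
# Flags for product1 and product2 from the sample row
flag1 <- c(0L, 1L)
flag2 <- c(1L, 0L)
flag3 <- c(0L, 1L)

# (flag1 is 1 OR flag2 is 0) AND flag3 is 1, written as logical arithmetic:
# the OR yields TRUE/FALSE, and multiplying by flag3 turns it back into 0/1
final <- (flag1 | !flag2) * flag3
final

# The same rule spelled out with explicit comparisons
final_explicit <- as.integer((flag1 == 1 | flag2 == 0) & flag3 == 1)
final_explicit
```

Both forms give 0 for product1 and 1 for product2, matching the expected output in the question.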

Is there an R function to reshape this data from long to wide?

How the data looks now:
Coach ID | Student | score |
---------------------------------
1 | A | 8 |
1 | B | 3 |
2 | A | 5 |
2 | B | 4 |
2 | C | 7 |
To look like this:
Coach ID | Student | score | student_2|score_2| student_3|score_3
------------------------------------------------------------------
1 | A | 8 | B | 3 |
2 | A | 5 | B | 4 | C | 7
Is there any way to reshape data from long to wide?
Thanks!
You could create a new identifier column with a unique value for every student and then use pivot_wider to cast multiple columns to wide.
library(dplyr)
df %>%
mutate(name = as.integer(factor(Student))) %>%
tidyr::pivot_wider(names_from = name, values_from = c(Student, score))
# CoachID Student_1 Student_2 Student_3 score_1 score_2 score_3
# <int> <fct> <fct> <fct> <int> <int> <int>
#1 1 A B NA 8 3 NA
#2 2 A B C 5 4 7
data
df <- structure(list(CoachID = c(1L, 1L, 2L, 2L, 2L), Student = structure(c(1L,
2L, 1L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"),
score = c(8L, 3L, 5L, 4L, 7L)), class = "data.frame", row.names = c(NA, -5L))
In base R, you could use the reshape function:
reshape(transform(df,time = as.numeric(factor(Student))),idvar = "CoachID",dir = "wide",sep="_")
CoachID Student_1 score_1 Student_2 score_2 Student_3 score_3
1 1 A 8 B 3 <NA> NA
3 2 A 5 B 4 C 7
library(tidyverse)
mydf <- tribble(
~`Coach ID` , ~Student, ~score,
# ---------------------------------
1 ,"A" , 8,
1 , "B" , 3,
2 , "A" , 5,
2 , "B" , 4,
2 , "C" , 7
)
mydf <- mydf %>%
group_by(`Coach ID`) %>%
mutate(index = row_number())
#----- Reshape to wide
mydf %>%
pivot_wider(
id_cols = `Coach ID`,
names_from = index,
values_from = Student:score
)
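The answers above number the wide columns in two different ways: by student (as.integer(factor(Student))) or by position within each coach (row_number() after group_by()). On the sample data both give the same result, but they can differ. The base R sketch below uses a hypothetical variation of the data, in which coach 2 has only students B and C, to show the difference; df2 and the index names are made up for this illustration.

```r
# Hypothetical variation: coach 2 has students B and C only
df2 <- data.frame(
  CoachID = c(1L, 1L, 2L, 2L),
  Student = c("A", "B", "B", "C"),
  score   = c(8L, 3L, 4L, 7L)
)

# Index tied to the student: B is always 2, C is always 3
idx_by_student <- as.integer(factor(df2$Student))
idx_by_student

# Index tied to the position within each coach: restarts at 1 per coach
idx_within_coach <- ave(seq_along(df2$CoachID), df2$CoachID, FUN = seq_along)
idx_within_coach
```

With the student-based index, coach 2 would get an empty Student_1 column; with the per-coach index the columns are filled from the left, which is closer to the layout asked for in the question.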

R data frame: Create weighted average group-wise

I'm working on an R data frame with columns GROUP_COL | TIME | VALUE. TIME is in order, VALUE is numerical, and GROUP_COL is a categorical variable I want to group the data by.
My goal is to
first group by the GROUP_COL variable
then, order by TIME
and then calculate a weighted mean for the values in each group using the formula value = 0.1 * previous_value + 0.9 * value for each row. If there is no previous value, leave the value as it is.
this weighted value should be stored in a separate column WEIGHTED.
What I tried so far is: Using dplyr, I created a vector of previous values using lag()
weighted_avg_with_previous <- function(.data, lag_weight=0.1) {
# get previous values
lag_val <- lag(.data$VALUE, n = 1L, default = 0, order_by = .data$TIME)
# give each value a weight 0.9 for current value and 0.1 for previous value
weighted = (1 -lag_weight) * .data$VALUE + lag_weight * lag_val
return (weighted)
}
data <- data %>%
group_by(SALES_RESPONSIBILITY, PRODUCT_AREA, CURRENCY, FORECAST_TYPE) %>%
arrange(HORIZON, .by_group=TRUE) %>%
mutate(WEIGHTED_VALUE = weighted_avg_with_previous(0.1))
However, the mutate statement throws an error. How can I get my weighted_avg_with_previous functions to run on the single groups?
Example:
GROUP | TIME| VALUE | WEIGHTED VALUE
_____________________________________
A | 1 | 1 | 1
A | 2 | 2 | 1.9
A | 3 | 3 | 2.9
A | 4 | 4 | 3.9
B | 1 | 3 | 3
B | 2 | 7 | 6.6
B | 3 | -4 | -2.9
...
Best,
Julia
library(tidyverse)
df <- structure(list(GROUP = c("A", "A", "A", "A", "B", "B", "B"),
TIME = c(1L, 2L, 3L, 4L, 1L, 2L, 3L), VALUE = c(1L, 2L, 3L,
4L, 3L, 7L, -4L)), row.names = c(NA, -7L), class = c("tbl_df",
"tbl", "data.frame"))
df %>%
group_by(GROUP) %>%
mutate(previous.value = lag(VALUE)) %>%
mutate(weighted.value = ifelse(is.na(previous.value),VALUE, 0.1*previous.value + 0.9*VALUE)) %>%
select(-previous.value)
The first mutate() statement creates a new variable for the lagged value, and the second one creates weighted.value, which equals 0.1*previous.value + 0.9*VALUE, or VALUE itself if previous.value is NA (that is, for the first row of each group).
Output:
# A tibble: 7 x 4
# Groups: GROUP [2]
GROUP TIME VALUE weighted.value
<chr> <int> <int> <dbl>
1 A 1 1 1
2 A 2 2 1.9
3 A 3 3 2.9
4 A 4 4 3.9
5 B 1 3 3
6 B 2 7 6.6
7 B 3 -4 -2.9
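A slightly more compact variant is sketched below. It is an assumption-laden alternative rather than the answer's method: it sorts by TIME within each group explicitly, as the question asks, and passes the group's first value as the default to lag(), so the first row becomes 0.9 * VALUE + 0.1 * VALUE = VALUE without an ifelse(). The column name WEIGHTED follows the question.

```r
library(dplyr)

df <- data.frame(
  GROUP = c("A", "A", "A", "A", "B", "B", "B"),
  TIME  = c(1, 2, 3, 4, 1, 2, 3),
  VALUE = c(1, 2, 3, 4, 3, 7, -4)
)

res <- df %>%
  group_by(GROUP) %>%
  # make the ordering explicit instead of relying on the input order
  arrange(TIME, .by_group = TRUE) %>%
  # the first row of each group falls back to its own value
  mutate(WEIGHTED = 0.9 * VALUE + 0.1 * lag(VALUE, default = first(VALUE))) %>%
  ungroup()

res
```

Because lag() is evaluated inside the grouped mutate(), the previous value never leaks across groups.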

Computing Percentages of each Subgroup

This question has been answered before, but the solutions are not working for my particular situation.
col1 | col2
A | 0
B | 1
A | 0
A | 1
B | 0
I'm basically looking for this:
col1 | col2 | Percentage
A | 0 | 0.67
A | 1 | 0.33
B | 0 | 0.50
B | 1 | 0.50
Both columns are factors. The following solution is what I keep finding on other threads:
df %>% group_by(col1, col2) %>% summarise(n=n()) %>% mutate(freq = n / sum(n))
or something along those lines.
In fact, group_by doesn't really seem to be doing anything at all. It's not giving me an 'n' or 'freq' column. Don't know what I'm doing wrong. Is it because I'm working with factors? Also, if it's not obvious, the values provided in the columns are hypothetical.
An option would be to get the frequency count after grouping by 'col1', then with the 'col2' also as grouping column, divide that frequency by the already created frequency
library(dplyr)
df %>%
group_by(col1) %>%
mutate(n = n()) %>%
group_by(col2, .add = TRUE) %>%
summarise(freq = n()/n[1])
# A tibble: 4 x 3
# Groups: col1 [2]
# col1 col2 freq
# <chr> <int> <dbl>
#1 A 0 0.667
#2 A 1 0.333
#3 B 0 0.5
#4 B 1 0.5
data
df <- structure(list(col1 = c("A", "B", "A", "A", "B"), col2 = c(0L,
1L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA, -5L
))
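For comparison, the same within-group shares can be computed in base R without dplyr, which can be a useful cross-check when a pipeline does not behave as expected. This is a sketch using table() and prop.table(); the names shares and out are only illustrative.

```r
df <- data.frame(col1 = c("A", "B", "A", "A", "B"),
                 col2 = c(0L, 1L, 0L, 1L, 0L))

# Cross-tabulate, then divide each row (each col1 level) by its row total
shares <- prop.table(table(col1 = df$col1, col2 = df$col2), margin = 1)
shares

# Long format with one row per col1/col2 pair
out <- as.data.frame(shares, responseName = "Percentage")
out
```

margin = 1 normalises within each level of col1, so the shares for A sum to 1 and the shares for B sum to 1.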
