I'm working on an R data frame with the columns GROUP_COL | TIME | VALUE. TIME is in order, VALUE is numeric, and GROUP_COL is a categorical variable I want to group the data by.
My goal is to:
first, group by the GROUP_COL variable,
then order by TIME,
and then calculate a weighted mean of the values within each group using the formula weighted = 0.1 * previous_value + 0.9 * value for each row. If there is no previous value, leave the value as it is.
The weighted value should be stored in a separate column, WEIGHTED.
What I tried so far: using `dplyr`, I created a vector of previous values with `lag()`:
weighted_avg_with_previous <- function(.data, lag_weight = 0.1) {
  # get previous values
  lag_val <- lag(.data$VALUE, n = 1L, default = 0, order_by = .data$TIME)
  # weight the current value with 0.9 and the previous value with 0.1
  weighted <- (1 - lag_weight) * .data$VALUE + lag_weight * lag_val
  return(weighted)
}
data <- data %>%
  group_by(SALES_RESPONSIBILITY, PRODUCT_AREA, CURRENCY, FORECAST_TYPE) %>%
  arrange(HORIZON, .by_group = TRUE) %>%
  mutate(WEIGHTED_VALUE = weighted_avg_with_previous(0.1))
However, the mutate() statement throws an error. How can I get my weighted_avg_with_previous() function to run on the individual groups?
Example:

GROUP | TIME | VALUE | WEIGHTED
------|------|-------|---------
A     | 1    | 1     | 1
A     | 2    | 2     | 1.9
A     | 3    | 3     | 2.9
A     | 4    | 4     | 3.9
B     | 1    | 3     | 3
B     | 2    | 7     | 6.6
B     | 3    | -4    | -2.9
...
Best,
Julia
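The error comes from how the helper is called: mutate(WEIGHTED_VALUE = weighted_avg_with_previous(0.1)) passes 0.1 as .data, so there is no VALUE column to look at. One minimal fix (a sketch, reusing the question's own column names) is to let the helper take plain vectors and pass the columns in explicitly; the grouped mutate() then applies it within each group. Alternatively, inline the computation with lag(), as in the answer below.

library(dplyr)

# sketch: the helper now takes plain vectors instead of a data frame
weighted_avg_with_previous <- function(value, time, lag_weight = 0.1) {
  lag_val <- lag(value, n = 1L, order_by = time)
  # keep the original value where there is no previous row
  ifelse(is.na(lag_val),
         value,
         (1 - lag_weight) * value + lag_weight * lag_val)
}

data <- data %>%
  group_by(SALES_RESPONSIBILITY, PRODUCT_AREA, CURRENCY, FORECAST_TYPE) %>%
  arrange(HORIZON, .by_group = TRUE) %>%
  mutate(WEIGHTED_VALUE = weighted_avg_with_previous(VALUE, HORIZON)) %>%
  ungroup()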
library(tidyverse)
df <- structure(list(GROUP = c("A", "A", "A", "A", "B", "B", "B"),
TIME = c(1L, 2L, 3L, 4L, 1L, 2L, 3L), VALUE = c(1L, 2L, 3L,
4L, 3L, 7L, -4L)), row.names = c(NA, -7L), class = c("tbl_df",
"tbl", "data.frame"))
df %>%
  group_by(GROUP) %>%
  mutate(previous.value = lag(VALUE)) %>%
  mutate(weighted.value = ifelse(is.na(previous.value), VALUE, 0.1 * previous.value + 0.9 * VALUE)) %>%
  select(-previous.value)
The first mutate() creates a new variable holding the lagged value, and the second creates weighted.value, which equals either 0.1 * previous.value + 0.9 * VALUE, or VALUE itself when previous.value is NA (that is, when there is no previous row in the group).
Output:
# A tibble: 7 x 4
# Groups: GROUP [2]
GROUP TIME VALUE weighted.value
<chr> <int> <int> <dbl>
1 A 1 1 1
2 A 2 2 1.9
3 A 3 3 2.9
4 A 4 4 3.9
5 B 1 3 3
6 B 2 7 6.6
7 B 3 -4 -2.9
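A more compact variant of the same logic (a sketch): lag() has a default argument, and weighting the first value with itself leaves it unchanged, so the helper column can be skipped entirely.

library(dplyr)

df %>%
  group_by(GROUP) %>%
  mutate(weighted.value = 0.9 * VALUE + 0.1 * lag(VALUE, default = first(VALUE))) %>%
  ungroup()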
I have the following table in R. I have 2 A columns, 3 B columns and 1 C column. I need to calculate the maximum difference possible between any columns of the same name and return the column name as output.
For row 1
The max difference between A is 2
The max difference between B is 4
I need the output as B
For row 2
The max difference between A is 3
The max difference between B is 2
I need the output as A
|  A |  A |  B |  B |  B |  C |
|  2 |  4 |  5 |  2 |  1 |  0 |
| -3 |  0 |  2 |  3 |  4 |  2 |
First of all, it's a bit dangerous (and in some cases not allowed) to have non-unique column names, so the first thing to do is make the names unique with base::make.unique(). From there, tidyr::pivot_longer() makes the grouping information contained in the column names easier to access. A regex in names_pattern discards the differentiating suffixes of the column names so they become the same again. Then dplyr::group_by() followed by dplyr::summarise() gets the largest difference for each id and grp, which correspond to the rows and same-named columns of the original data. Finally, dplyr::slice_max() returns only the largest difference per group.
library(tidyverse)
d <- structure(list(A = c(2L, -3L), A = c(4L, 0L), B = c(5L, 2L), B = 2:3, B = c(1L, 4L), C = c(0L, 2L)), row.names = c(NA, -2L), class = "data.frame")
# give unique names
names(d) <- make.unique(names(d), sep = "_")
d %>%
  mutate(id = row_number()) %>%
  pivot_longer(-id, names_to = "grp", names_pattern = "([A-Z])*") %>%
  group_by(id, grp) %>%
  summarise(max_diff = max(value) - min(value)) %>%
  slice_max(order_by = max_diff, n = 1, with_ties = FALSE)
#> `summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
#> # A tibble: 2 x 3
#> # Groups: id [2]
#> id grp max_diff
#> <int> <chr> <int>
#> 1 1 B 4
#> 2 2 A 3
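If you want the winning name attached back to the original rows, as in the base R answers below, a possible follow-up (a sketch building on the pipeline above; it assumes the id produced by row_number() lines up with the row order of d, which holds here):

res <- d %>%
  mutate(id = row_number()) %>%
  pivot_longer(-id, names_to = "grp", names_pattern = "([A-Z])*") %>%
  group_by(id, grp) %>%
  summarise(max_diff = max(value) - min(value), .groups = "drop_last") %>%
  slice_max(order_by = max_diff, n = 1, with_ties = FALSE)

d$max_diff <- res$grp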
Here is a base R option using aggregate() + range() + diff() + which.max():
df$max_diff <- with(
  # transpose so same-named columns become grouped rows, then take the
  # spread (max - min) of each original row within each name group
  p <- aggregate(
    . ~ id,
    cbind(id = names(df), as.data.frame(t(df))),
    function(v) diff(range(v))
  ),
  # for each original row (the columns of p other than id), pick the
  # name whose spread is largest
  id[sapply(p[-1], which.max)]
)
which gives
> df
A A B B B C max_diff
1 2 4 5 2 1 0 B
2 -3 0 2 3 4 2 A
data
> dput(df)
structure(list(A = c(2L, -3L), A = c(4L, 0L), B = c(5L, 2L),
    B = 2:3, B = c(1L, 4L), C = c(0L, 2L)), row.names = c(NA, -2L),
    class = "data.frame")
We may also use split.default() to split the data based on column-name similarity, and then find the index of the maximum difference with max.col():
m1 <- sapply(split.default(df, names(df)), \(x)
apply(x, 1, \(u) diff(range(u))))
df$max_diff <- colnames(m1)[max.col(m1, "first")]
df$max_diff
[1] "B" "A"
I am trying to find the minimum value across different columns, by group.
A small sample of my data looks something like this:
group cut group_score_1 group_score_2
1 a 1 3 5.0
2 b 2 2 4.0
3 a 0 2 2.5
4 b 3 5 4.0
5 a 2 3 6.0
6 b 1 5 1.0
I want to group by the groups and, for each group, find the row which contains the minimum group score among both group scores, and then also get the name of the column which contains that minimum (group_score_1 or group_score_2).
So basically my result should be something like this:
group cut group_score_1 group_score_2
1 a 0 2 2.5
2 b 1 5 1.0
I tried a few ideas and eventually came up with dividing the data into several new data frames, filtering by group, selecting the relevant columns, and then using which.min(), but I'm sure there's a much more efficient way to do it. Not sure what I am missing.
We can use data.table methods
library(data.table)
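# do.call(pmin, .SD) takes the row-wise minimum across the selected
# group_score columns; .I[which.min(...)] records the global row index of
# each group's smallest value, and the outer df[...] subsets to those rows.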
setDT(df)[df[, .I[which.min(do.call(pmin, .SD))],
group, .SDcols = patterns('^group_score')]$V1]
# group cut group_score_1 group_score_2
#1: a 0 2 2.5
#2: b 1 5 1.0
For each group, you can calculate the minimum value and select the row in which that value exists in one of the columns:
library(dplyr)
df %>%
  group_by(group) %>%
  filter({tmp = min(group_score_1, group_score_2);
          group_score_1 == tmp | group_score_2 == tmp})
# group cut group_score_1 group_score_2
# <chr> <int> <int> <dbl>
#1 a 0 2 2.5
#2 b 1 5 1
The above works well when you have only two group_score columns. With many such columns it isn't practical to spell out each one as group_score_1 == tmp | group_score_2 == tmp and so on. In that case, get the data in long format, find the cut value corresponding to the minimum, and join back to the original data. This assumes cut is unique within each group.
df %>%
  tidyr::pivot_longer(cols = starts_with('group_score')) %>%
  group_by(group) %>%
  summarise(cut = cut[which.min(value)]) %>%
  left_join(df, by = c("group", "cut"))
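On dplyr 1.0 or later, slice_min() combined with pmin() gives a shorter equivalent (a sketch, still assuming exactly these two score columns):

library(dplyr)

df %>%
  group_by(group) %>%
  slice_min(pmin(group_score_1, group_score_2), n = 1, with_ties = FALSE)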
Here is a base R option using pmin + ave + subset
subset(
df,
as.logical(ave(
do.call(pmin, df[grep("group_score_\\d+", names(df))]),
group,
FUN = function(x) x == min(x)
))
)
which gives
group cut group_score_1 group_score_2
3 a 0 2 2.5
6 b 1 5 1.0
Data
> dput(df)
structure(list(group = c("a", "b", "a", "b", "a", "b"), cut = c(1L,
2L, 0L, 3L, 2L, 1L), group_score_1 = c(3L, 2L, 2L, 5L, 3L, 5L
), group_score_2 = c(5, 4, 2.5, 4, 6, 1)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
How the data looks now:
Coach ID | Student | score |
---------------------------------
1 | A | 8 |
1 | B | 3 |
2 | A | 5 |
2 | B | 4 |
2 | C | 7 |
To look like this:
Coach ID | Student | score | student_2 | score_2 | student_3 | score_3
-----------------------------------------------------------------------
1        | A       | 8     | B         | 3       |           |
2        | A       | 5     | B         | 4       | C         | 7
Is there any way to reshape the data from long to wide?
Thanks!
You could create a new identifier column with a unique value for every student and then use pivot_wider() to cast multiple columns to wide:
library(dplyr)
df %>%
  mutate(name = as.integer(factor(Student))) %>%
  tidyr::pivot_wider(names_from = name, values_from = c(Student, score))
# CoachID Student_1 Student_2 Student_3 score_1 score_2 score_3
# <int> <fct> <fct> <fct> <int> <int> <int>
#1 1 A B NA 8 3 NA
#2 2 A B C 5 4 7
data
df <- structure(list(CoachID = c(1L, 1L, 2L, 2L, 2L), Student = structure(c(1L,
2L, 1L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"),
score = c(8L, 3L, 5L, 4L, 7L)), class = "data.frame", row.names = c(NA, -5L))
In base R, you could use the reshape function:
reshape(transform(df, time = as.numeric(factor(Student))),
        idvar = "CoachID", dir = "wide", sep = "_")
CoachID Student_1 score_1 Student_2 score_2 Student_3 score_3
1 1 A 8 B 3 <NA> NA
3 2 A 5 B 4 C 7
library(tidyverse)
mydf <- tribble(
~`Coach ID` , ~Student, ~score,
# ---------------------------------
1 ,"A" , 8,
1 , "B" , 3,
2 , "A" , 5,
2 , "B" , 4,
2 , "C" , 7
)
mydf <- mydf %>%
  group_by(`Coach ID`) %>%
  mutate(index = row_number())
#----- Reshape to wide
mydf %>%
  pivot_wider(
    id_cols = `Coach ID`,
    names_from = index,
    values_from = Student:score
  )
This question has been answered before, but the existing solutions are not working for my particular situation.
col1 | col2
A | 0
B | 1
A | 0
A | 1
B | 0
I'm basically looking for this:
col1 | col2 | Percentage
A | 0 | 0.67
A | 1 | 0.33
B | 0 | 0.50
B | 1 | 0.50
Both columns are factors. The following solution is what I keep finding on other threads:
df %>% group_by(col1, col2) %>% summarise(n=n()) %>% mutate(freq = n / sum(n))
or something along those lines.
In fact, group_by() doesn't really seem to be doing anything at all: it's not giving me an n or freq column. I don't know what I'm doing wrong. Is it because I'm working with factors? Also, if it's not obvious, the values in the columns are hypothetical.
One option is to compute the group size after grouping by 'col1', and then, with 'col2' added as a grouping column, divide each combination's count by that group size:
library(dplyr)
df %>%
  group_by(col1) %>%
  mutate(n = n()) %>%
  group_by(col2, add = TRUE) %>%
  summarise(freq = n() / n[1])
# A tibble: 4 x 3
# Groups: col1 [2]
# col1 col2 freq
# <chr> <int> <dbl>
#1 A 0 0.667
#2 A 1 0.333
#3 B 0 0.5
#4 B 1 0.5
data
df <- structure(list(col1 = c("A", "B", "A", "A", "B"), col2 = c(0L,
1L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA, -5L
))
I have a data set that looks like this:
id   a    b
1    AA   2
1    AB   5
1    AA   1
2    AB   2
2    AB   4
3    AB   4
3    AB   3
3    AA   1
3    AA   4
I need to calculate the cumulative mean for each record within each group, excluding the cases where a == 'AA'. So the sample output should be:
id a b mean
1 AA 2 -
1 AB 5 5
1 AA 1 5
2 AB 2 2
2 AB 4 (4+2)/2
3 AB 4 4
3 AB 3 (4+3)/2
3 AA 1 (4+3)/2
3 AA 4 (4+3)/2
I tried to achieve it using dplyr and cummean but got an error.
df <- df %>%
  group_by(id) %>%
  mutate(mean = cummean(b[a != 'AA']))
Error: incompatible size (123), expecting 147 (the group size) or 1
Can you suggest a better way to achieve this in R?
The trick here is to reconstruct the cummean by dividing the adjusted cumulative sum by the adjusted count. As a one-liner:
df %>% group_by(id) %>% mutate(mean = cumsum(b * (a != 'AA')) / cumsum(a != 'AA'))
We can make this a little nicer (the "multiply by a != 'AA'" trick is the ugly part, to my mind) by pulling a != 'AA' out as its own column. Note that rows before a group's first non-'AA' value compute 0 / 0 and therefore yield NaN, which corresponds to the '-' placeholder in the expected output.
df %>%
  group_by(id) %>%
  mutate(relevance = 0 + (a != 'AA'),
         mean = cumsum(relevance * b) / cumsum(relevance))
There may be an easier approach. Here, we group by 'id' and create a new column 'Mean'. First, the elements of 'b' that correspond to 'AA' in 'a' are converted to NA via b * NA^(a == 'AA'): NA^(a == 'AA') evaluates to NA for 'AA' rows and 1 otherwise, so multiplying by 'b' keeps the original values while the 'AA' rows become NA. na.aggregate() then replaces each NA with the mean of the non-NA elements in its group, and cummean() wraps this to give the cumulative mean. Finally, if the first value of a group in 'a' is 'AA', multiplying by NA^(row_number() == 1 & a == 'AA') turns that cell into NA.
library(zoo)
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(Mean = cummean(na.aggregate(b * NA^(a == 'AA'))) *
           NA^(row_number() == 1 & a == 'AA'))
# Source: local data frame [9 x 4]
#Groups: id [3]
# id a b Mean
# (int) (chr) (int) (dbl)
#1 1 AA 2 NA
#2 1 AB 5 5.0
#3 1 AA 1 5.0
#4 2 AB 2 2.0
#5 2 AB 4 3.0
#6 3 AB 4 4.0
#7 3 AB 3 3.5
#8 3 AA 1 3.5
#9 3 AA 4 3.5
data
df <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
a = c("AA",
"AB", "AA", "AB", "AB", "AB", "AB", "AA", "AA"), b = c(2L, 5L,
1L, 2L, 4L, 4L, 3L, 1L, 4L)), .Names = c("id", "a", "b"),
class = "data.frame", row.names = c(NA, -9L))