Create lag onto next group in R - r

Hi I would like to create a lag by one whole group on a dataframe in R.
So lets say the value for the group A is 11 I would like to make a lag where all values of group B are 11 and so on. Below is an example of what I would like to do.
Letter = c('A', 'A', 'A', 'B', 'C', 'C', 'D')
Value = c(11, 11, 11, 4, 8, 8, 10)
data.frame(Letter, Value)
Letter Value
1 A 11
2 A 11
3 A 11
4 B 4
5 C 8
6 C 8
7 D 10
And then have it become after the lag:
Lag = c(NA, NA, NA, 11, 4, 4, 8)
data.frame(Letter, Value, Lag)
Letter Value Lag
1 A 11 NA
2 A 11 NA
3 A 11 NA
4 B 4 11
5 C 8 4
6 C 8 4
7 D 10 8
(One thing to note is all values of the group will be the same)

Get the unique rows of the data, lag Value and then left join the original data frame with that.
library(dplyr)
DF %>%
left_join(mutate(distinct(.), Lag = lag(Value), Value = NULL), by = "Letter")
giving:
Letter Value Lag
1 A 11 NA
2 A 11 NA
3 A 11 NA
4 B 4 11
5 C 8 4
6 C 8 4
7 D 10 8

You can do the following (see below)
We first group by LETTER and assign an id to each group member.
Next we ungroup and assign the lag value if something is the first group member.
And the final step is to fill the missings.
NOTE: all of this assumes your data set is sorted to your needs so that it would be correct to take the lag value from the last group.
library(tidyverse)
data.frame(Letter, Value) |>
group_by(Letter) |>
mutate(id = 1:n()) |>
ungroup() |>
mutate(Lag = ifelse(id == 1, lag(Value), NA)) |>
fill(Lag) |>
select(-id)
# A tibble: 7 × 3
Letter Value Lag
<chr> <dbl> <dbl>
1 A 11 NA
2 A 11 NA
3 A 11 NA
4 B 4 11
5 C 8 4
6 C 8 4
7 D 10 8

Related

fill NA values per group based on first value of a group

I am trying to fill NA values of my dataframe. However, I would like to fill them based on the first value of each group.
#> df = data.frame(
group = c(rep("A", 4), rep("B", 4)),
val = c(1, 2, NA, NA, 4, 3, NA, NA)
)
#> df
group val
1 A 1
2 A 2
3 A NA
4 A NA
5 B 4
6 B 3
7 B NA
8 B NA
#> fill(df, val, .direction = "down")
group val
1 A 1
2 A 2
3 A 2 # -> should be 1
4 A 2 # -> should be 1
5 B 4
6 B 3
7 B 3 # -> should be 4
8 B 3 # -> should be 4
Can I do this with tidyr::fill()? Or is there another (more or less elegant) way how to do this? I need to use this in a longer chain (%>%) operation.
Thank you very much!
Use tidyr::replace_na() and dplyr::first() (or val[[1]]) inside a grouped mutate():
library(dplyr)
library(tidyr)
df %>%
group_by(group) %>%
mutate(val = replace_na(val, first(val))) %>%
ungroup()
#> # A tibble: 8 × 2
#> group val
#> <chr> <dbl>
#> 1 A 1
#> 2 A 2
#> 3 A 1
#> 4 A 1
#> 5 B 4
#> 6 B 3
#> 7 B 4
#> 8 B 4
PS - #richarddmorey points out the case where the first value for a group is NA. The above code would keep all NA values as NA. If you'd like to instead replace with the first non-missing value per group, you could subset the vector using !is.na():
df %>%
group_by(group) %>%
mutate(val = replace_na(val, first(val[!is.na(val)]))) %>%
ungroup()
Created on 2022-11-17 with reprex v2.0.2
This should work, which uses dplyr's case_when
library(dplyr)
df %>%
group_by(group) %>%
mutate(val = case_when(
is.na(val) ~ val[1],
TRUE ~ val
))
Output:
group val
<chr> <dbl>
1 A 1
2 A 2
3 A 1
4 A 1
5 B 4
6 B 3
7 B 4
8 B 4

R Lag Variable And Skip Value Between

DATA = data.frame(STUDENT = c(1,1,1,2,2,2,3,3,4,4),
SCORE = c(6,4,8,10,9,0,2,3,3,7),
CLASS = c('A', 'B', 'C', 'A', 'B', 'C', 'B', 'C', 'A', 'B'),
WANT = c(NA, NA, 2, NA, NA, -10, NA, NA, NA, NA))
I have DATA and wish to create 'WANT' which is calculate by:
For each STUDENT, find the SCORE where SCORE equals to SCORE(CLASS = C) - SCORE(CLASS = A)
EX: SCORE(STUDENT = 1, CLASS = C) - SCORE(STUDENT = 1, CLASS = A) = 8-6=2
Assuming at most one 'C' and 'A' CLASS per each 'STUDENT', just subset the 'SCORE' where the CLASS value is 'C', 'A', do the subtraction and assign the value only to position where CLASS is 'C' by making all other positions to NA (after grouping by 'STUDENT')
library(dplyr)
DATA <- DATA %>%
group_by(STUDENT) %>%
mutate(WANT2 = (SCORE[CLASS == 'C'][1] - SCORE[CLASS == 'A'][1]) *
NA^(CLASS != "C")) %>%
ungroup
-output
# A tibble: 10 × 5
STUDENT SCORE CLASS WANT WANT2
<dbl> <dbl> <chr> <dbl> <dbl>
1 1 6 A NA NA
2 1 4 B NA NA
3 1 8 C 2 2
4 2 10 A NA NA
5 2 9 B NA NA
6 2 0 C -10 -10
7 3 2 B NA NA
8 3 3 C NA NA
9 4 3 A NA NA
10 4 7 B NA NA
Here is a solution with the data organized in a wider format first, then a longer format below. This solution works regardless of the order of the "CLASS" column (for instance, if there is one instance in which the CLASS order is CBA or BCA instead os ABC, this solution will work).
Solution
library(dplyr)
library(tidyr)
wider <- DATA %>% select(-WANT) %>%
pivot_wider( names_from = "CLASS", values_from = "SCORE") %>%
rowwise() %>%
mutate(WANT = C-A) %>%
ungroup()
output wider
# A tibble: 4 × 5
STUDENT A B C WANT
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 6 4 8 2
2 2 10 9 0 -10
3 3 NA 2 3 NA
4 4 3 7 NA NA
If you really want like your output example, then we can reorganize the wider data this way:
Reorganizing wider to long format
wider %>%
pivot_longer(A:C, values_to = "SCORE", names_to = "CLASS") %>%
relocate(WANT, .after = SCORE) %>%
mutate(WANT = if_else(CLASS == "C", WANT, NA_real_))
Final Output
# A tibble: 12 × 4
STUDENT CLASS SCORE WANT
<dbl> <chr> <dbl> <dbl>
1 1 A 6 NA
2 1 B 4 NA
3 1 C 8 2
4 2 A 10 NA
5 2 B 9 NA
6 2 C 0 -10
7 3 A NA NA
8 3 B 2 NA
9 3 C 3 NA
10 4 A 3 NA
11 4 B 7 NA
12 4 C NA NA

Subsetting dataframe in grouped data

I have a dataframe including a column of factors that I would like to subset to select every nth row, after grouping by factor level. For example,
my_df <- data.frame(col1 = c(1:12), col2 = rep(c("A","B", "C"), 4))
my_df
col1 col2
1 1 A
2 2 B
3 3 C
4 4 A
5 5 B
6 6 C
7 7 A
8 8 B
9 9 C
10 10 A
11 11 B
12 12 C
Subsetting to select every 2nd row should yield my_new_df as,
col1 col2
1 4 A
2 10 A
3 5 B
4 11 B
5 6 C
6 12 C
I tried in dplyr:
my_df %>% group_by(col2) %>%
my_df[seq(2, nrow(my_df), 2), ] -> my_new_df
I get an error:
Error: Can't subset columns that don't exist.
x Locations 4, 6, 8, 10, and 12 don't exist.
ℹ There are only 2 columns.
To see if the nrow function was a problem, I tried using the number directly. So,
my_df %>% group_by(col2) %>%
my_df[seq(2, 4, 2), ] -> my_new_df
Also gave an error,
Error: Can't subset columns that don't exist.
x Location 4 doesn't exist.
ℹ There are only 2 columns.
Run `rlang::last_error()` to see where the error occurred.
My expectation was that it would run the subsetting on each group of data and then combine them into 'my_new_df'. My understanding of how group_by works is clearly wrong but I am stuck on how to move past this error. Any help would much appreciated.
Try:
my_df %>%
group_by(col2)%>%
slice(seq(from = 2, to = n(), by = 2))
# A tibble: 6 x 2
# Groups: col2 [3]
col1 col2
<int> <chr>
1 4 A
2 10 A
3 5 B
4 11 B
5 6 C
6 12 C
You might want to ungroup after slicing if you want to do other operations not based on col2.
Here is a data.table option:
library(data.table)
data <- as.data.table(my_df)
data[(rowid(col2) %% 2) == 0]
col1 col2
1: 4 A
2: 5 B
3: 6 C
4: 10 A
5: 11 B
6: 12 C
Or base R:
my_df[as.logical(with(my_df, ave(col1, col2, FUN = function(x)
seq_along(x) %% 2 == 0))), ]
col1 col2
4 4 A
5 5 B
6 6 C
10 10 A
11 11 B
12 12 C

R - Compare values in a dataframe to aggregated dataframe

I'm trying to figure out how I can compare the values in a dataframe row-by-row that correspond to those given by the aggregate() function.
For example:
#create data frame
df <- data.frame(team=c('a', 'a', 'b', 'b', 'b', 'c', 'c'),
pts=c(5, 8, 14, 18, 5, 7, 7),
rebs=c(8, 8, 9, 3, 8, 7, 4))
#view data frame
df
team pts rebs
1 a 5 8
2 a 8 8
3 b 14 9
4 b 18 3
5 b 5 8
6 c 7 7
7 c 7 4
#find mean points scored by team
agg_df = aggregate(df$pts, list(df$team), FUN=mean)
Group.1 x
1 a 6.50000
2 b 12.33333
3 c 7.00000
What I want to do is create a new column in df using a logic similar to the following pseudo-code:
df$pts[i] > agg_df$x[i] then df$performance = 'overperformed' else df$performance = 'underperformed'.
But this is not exactly what I want. I want to compare row 1 and 2's points to the mean points for team a in agg_df. Similarly, rows 3-5 in df should be compared to the mean points for group b in agg_df.
The final result would look like:
> df
team pts rebs performance
1 a 5 8 under
2 a 8 8 over
3 b 14 9 over
4 b 18 3 over
5 b 5 8 under
6 c 7 7 average
7 c 7 4 average
I am a little puzzled as to how to achieve this, or if it is even achievable, so any help is very much appreciated.
You can do:
library(tidyverse)
df %>%
group_by(team) %>%
mutate(performance = case_when(pts > mean(pts) ~ "over",
pts == mean(pts) ~ "average",
pts < mean(pts) ~ "under")) %>%
ungroup()
which gives:
# A tibble: 7 x 4
team pts rebs performance
<chr> <dbl> <dbl> <chr>
1 a 5 8 under
2 a 8 8 over
3 b 14 9 over
4 b 18 3 over
5 b 5 8 under
6 c 7 7 average
7 c 7 4 average
Or in base way with merge().
# Merge data
db <- merge(df, agg_df, by.x = "team", by.y = 'Group.1')
db$performance <- ifelse(db$pts == db$x, 'average',
ifelse(db$pts > db$x, 'over', 'under'))
db$x <- NULL
db

Expanding a data.frame based on (group) values from the data.frame

Lets say I have the following data frame:
tibble(user = c('A', 'B'), first = c(1,4), last = c(6, 9))
# A tibble: 2 x 3
user first last
<chr> <dbl> <dbl>
1 A 1 6
2 B 4 9
And want to create a tibble that looks like:
bind_rows(tibble(user = 'A', weeks = 1:6),
tibble(user = 'B', weeks = 4:9))
# A tibble: 12 x 2
user weeks
<chr> <int>
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 B 4
8 B 5
9 B 6
10 B 7
11 B 8
12 B 9
How could I go about doing this? I have tried:
tibble(user = c('A', 'B'), first = c(1,4), last = c(6, 9)) %>%
group_by(user) %>%
mutate(weeks = first:last)
I wonder if I should try a combination of complete map or nest?
One option is unnest after creating a sequence
library(dplyr)
library(purrr)
df1 %>%
transmute(user, weeks = map2(first, last, `:`)) %>%
unnest(weeks)
# A tibble: 12 x 2
# user weeks
# <chr> <int>
# 1 A 1
# 2 A 2
# 3 A 3
# 4 A 4
# 5 A 5
# 6 A 6
# 7 B 4
# 8 B 5
# 9 B 6
#10 B 7
#11 B 8
#12 B 9
Or another option is rowwise
df1 %>%
rowwise %>%
transmute(user, weeks = list(first:last)) %>%
unnest(weeks)
Or without any packages
stack(setNames(Map(`:`, df1$first, df1$last), df1$user))
Or otherwise written as
stack(setNames(do.call(Map, c(f = `:`, df1[-1])), df1$user))
data
df1 <- tibble(user = c('A', 'B'), first = c(1,4), last = c(6, 9))
One option involving dplyr and tidyr could be:
df %>%
uncount(last - first + 1) %>%
group_by(user) %>%
transmute(weeks = first + 1:n() - 1)
user weeks
<chr> <dbl>
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 B 4
8 B 5
9 B 6
10 B 7
11 B 8
12 B 9

Resources