Sum most recent scores in 3 unique areas in R

I have a dataset of areas and scores in those areas.
I want to maintain an aggregated score (agg_score) that is equal to the sum of the most recent scores for A, B, and C.
For instance, you will see in my expected_output that row 4 is 7, because the value of C is now 2 while the most recent values of A and B are still 1 and 4.
All I have been able to do so far is sum the three most recent scores, which at times produces agg_score values that equal the sum of C, C, and B. It is important that I have an accurate agg_score at each possible date.
library(dplyr)
ds <-
  tibble(
    area = c("A", "B", "C", "C", "B", "A", "A", "B", "C"),
    score = c(1, 4, 5, 2, 6, 3, 4, 6, 3),
    scoring_date =
      seq.Date(
        from = as.Date("2019-01-01"),
        to = as.Date("2019-01-09"),
        by = "days"
      ),
    expected_output = c(NA, NA, 10, 7, 9, 11, 12, 12, 13)
  ) %>%
  arrange(scoring_date)

# Inadequate code for summing last three scores
ds %>%
  mutate(agg_score = score + lag(score) + lag(score, 2))

Using dplyr::last we can find the most recent value for each area, then sum those values once all three areas have one.
# small helper function: sum the most recent score per area over rows 1..x
sum_fun <- function(x) {
  lc_vec <- ds[1:x, ] %>%
    group_by(area) %>%
    summarise(lc = last(score)) %>%
    pull(lc)
  # only return a sum once all three areas have been seen
  if (length(lc_vec) == 3) sum(lc_vec) else NA_real_
}

library(dplyr)
ds %>% mutate(Output = sapply(1:nrow(.), sum_fun)) # instead of sapply we can use purrr::map_dbl
# A tibble: 9 x 5
area score scoring_date expected_output Output
<chr> <dbl> <date> <dbl> <dbl>
1 A 1. 2019-01-01 NA NA
2 B 4. 2019-01-02 NA NA
3 C 5. 2019-01-03 10. 10.
4 C 2. 2019-01-04 7. 7.
5 B 6. 2019-01-05 9. 9.
6 A 3. 2019-01-06 11. 11.
7 A 4. 2019-01-07 12. 12.
8 B 6. 2019-01-08 12. 12.
9 C 3. 2019-01-09 13. 13.
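As the comment notes, purrr can stand in for sapply here; a minimal sketch, reusing the sum_fun defined above:
library(purrr)
ds %>% mutate(Output = map_dbl(seq_len(n()), sum_fun))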

There might be a data.table self-merge option out there, but I couldn't quite figure it out. Here's an idea implementing your fill approach in data.table; it should be flexible for more "area"s:
library(data.table)
setDT(ds) # ds must be a data.table for := to work

for (a in unique(ds$area)) {
  # carry each area's most recent score forward (last observation carried forward)
  ds[, paste0("val_", a) := zoo::na.locf0(ifelse(area == a, score, NA))]
}
ds[, agg_score := rowSums(.SD), .SDcols = paste0("val_", unique(ds$area))][
  , paste0("val_", unique(ds$area)) := NULL]
ds
# area score scoring_date agg_score
#1 A 1 2019-01-01 NA
#2 B 4 2019-01-02 NA
#3 C 5 2019-01-03 10
#4 C 2 2019-01-04 7
#5 B 6 2019-01-05 9
#6 A 3 2019-01-06 11
#7 A 4 2019-01-07 12
#8 B 6 2019-01-08 12
#9 C 3 2019-01-09 13
Original solution:
Alternatively you could try sapply. The function is a little long, but that's because we have a lot of work to do! If you wanted to do this for more areas you wouldn't have to manually fill each one, so that could be a benefit:
ds <- as.data.frame(ds) # assumes a plain data.frame, so single-bracket indexing returns vectors
ds$agg_score <- sapply(1:nrow(ds), function(i) {
  other_areas <- setdiff(unique(ds$area), ds[i, "area"])
  # locate rows of other areas that come before the current index
  f_idxs <- Filter(function(x) x < i, which(ds$area %in% other_areas))
  if (length(f_idxs) == 0) return(NA)
  # split by area so we can take the maximum (most recent) index per area
  idxs <- sapply(split(f_idxs, ds[f_idxs, "area"]), max)
  if (length(idxs) < length(other_areas)) return(NA)
  sum(ds[c(idxs, i), "score"]) # sum up our scores
})

So I found a way to do this using fill() to ensure the most recent value is always carried forward until replaced by a more recent value.
library(tidyr)
ds %>%
  select(area, score, scoring_date) %>%
  spread(area, score) %>%
  fill(A, .direction = "down") %>%
  fill(B, .direction = "down") %>%
  fill(C, .direction = "down") %>%
  rowwise() %>%
  mutate(agg_score = sum(A, B, C))
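If the set of areas can grow, a variant that avoids naming each column explicitly might look like the sketch below; it assumes every column other than scoring_date produced by pivot_wider is an area column:
library(dplyr)
library(tidyr)
ds %>%
  select(area, score, scoring_date) %>%
  pivot_wider(names_from = area, values_from = score) %>%
  fill(-scoring_date, .direction = "down") %>%
  rowwise() %>%
  mutate(agg_score = sum(c_across(-scoring_date))) %>% # stays NA until every area has scored once
  ungroup()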

nuevoDs <- ds %>% arrange(desc(scoring_date)) %>% as.data.frame()
# get the number of rows of the data frame
longitud <- nrow(nuevoDs)
# iterate over each row up to (longitud - 2) and save the results to a vector
elVector <- vector()
for (i in 1:(longitud - 2)) {
  elVector[i] <- nuevoDs[i, "score"] + nuevoDs[i + 1, "score"] + nuevoDs[i + 2, "score"]
}
# before cbinding we need to make the vector the same length as the data frame
elVector[longitud - 1] <- 0
elVector[longitud] <- 0
elVector
cbind(nuevoDs, elVector)
area score scoring_date elVector
1 C 3 2019-01-09 13
2 B 6 2019-01-08 13
3 A 4 2019-01-07 13
4 A 3 2019-01-06 11
5 B 6 2019-01-05 13
6 C 2 2019-01-04 11
7 C 5 2019-01-03 10
8 B 4 2019-01-02 0
9 A 1 2019-01-01 0

Another possible data.table approach.
ds[, output :=
     ds[,
        ds[.(area = unique(area), scd = .BY$scoring_date),
           sum(score),
           on = .(area = area, scoring_date <= scd),
           mult = "last"],
        by = .(area, scoring_date)]$V1
]
output:
area score scoring_date output
1: A 1 2019-01-01 NA
2: B 4 2019-01-02 NA
3: C 5 2019-01-03 10
4: C 2 2019-01-04 7
5: B 6 2019-01-05 9
6: A 3 2019-01-06 11
7: A 4 2019-01-07 12
8: B 6 2019-01-08 12
9: C 3 2019-01-09 13
data:
library(data.table)
ds <- data.table(
  area = c("A", "B", "C", "C", "B", "A", "A", "B", "C"),
  score = c(1, 4, 5, 2, 6, 3, 4, 6, 3),
  scoring_date = seq.Date(from = as.Date("2019-01-01"), to = as.Date("2019-01-09"), by = "days"))
Explanation:
The gist of the above code is:
ds[.(area=unique(area), scd=.BY$scoring_date),
sum(score),
on=.(area=area, scoring_date<=scd),
mult="last"]
It means that for each date (scd = .BY$scoring_date), we perform a non-equi self-join to find the latest (mult = "last") score for every area (area = unique(area)), and then sum those scores.
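To see that inner join in isolation, you can run it for a single date; a sketch against the ds defined below, which should return 7 for 2019-01-04 (the most recent scores being A = 1, B = 4, C = 2):
# latest score per area on or before 2019-01-04, summed
ds[.(area = unique(ds$area), scd = as.Date("2019-01-04")),
   sum(score),
   on = .(area = area, scoring_date <= scd),
   mult = "last"]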

Related

Filter groups based on the difference between the two highest values

I have the following dataframe called df (dput below):
> df
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 B 8
6 B 2
7 B 2
8 B 3
9 C 10
10 C 1
11 C 1
12 C 8
I would like to filter groups based on the difference between their highest value (max) and second-highest value. The difference should be smaller than or equal to 2 (<= 2). This means that group B should be removed, because its highest value is 8 and its second-highest value is 3, a difference of 5. The desired output should look like this:
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
So I was wondering if anyone knows how to filter groups based on the difference between their highest and second-highest value?
dput of df:
df<-structure(list(group = c("A", "A", "A", "A", "B", "B", "B", "B",
"C", "C", "C", "C"), value = c(5, 1, 1, 5, 8, 2, 2, 3, 10, 1,
1, 8)), class = "data.frame", row.names = c(NA, -12L))
Using dplyr
library(dplyr)
df %>%
  group_by(group) %>%
  filter(abs(diff(sort(value, decreasing = TRUE)[1:2])) <= 2) %>%
  ungroup()
# A tibble: 8 × 2
group value
<chr> <int>
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
A base R alternative
grp <- na.omit(aggregate(. ~ group, df, function(x)
  abs(diff(sort(x, decreasing = TRUE)[1:2])) <= 2))
do.call(rbind, c(mapply(function(g, v)
  list(df[df$group == g & v, ]), grp$group, grp$value), make.row.names = FALSE))
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
One possibility would be to first create a vector with the groups that meet your condition and then filter on it in the original data.frame. Here is how I thought about it:
library(dplyr)
group_to_keep <-
  df %>%
  group_by(group) %>%
  slice_max(value, n = 2) %>%
  filter(abs(diff(value)) <= 2) %>%
  pull(group) %>%
  unique()
df %>%
  filter(group %in% group_to_keep)
You can use ave.
df[ave(df$value, df$group, FUN=\(x) diff(sort(c(-x, Inf)))[1]) <= 2,]
# group value
#1 A 5
#2 A 1
#3 A 1
#4 A 5
#9 C 10
#10 C 1
#11 C 1
#12 C 8
In case you can be sure that every group always has at least two values, you can use:
df[ave(df$value, df$group, FUN=\(x) diff(tail(sort(x), 2))) <= 2,]
df[ave(df$value, df$group, FUN=\(x) diff(sort(-x)[1:2])) <= 2,]
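To unpack the first ave() call: negating x turns the two largest values of x into the two smallest of -x, and the padded Inf gives single-value groups an infinite difference, so they fail the filter. A quick check against group B's values from the question:
x <- c(8, 2, 2, 3)        # group B
diff(sort(c(-x, Inf)))[1] # (-3) - (-8) = 5, so B is dropped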

Find rolling min and max occurrences from two columns

I would like to track min and max occurrences of IDs in two columns. This should be done in a rolling fashion from the beginning of the data, so we can track how many times each ID has appeared as of each date. It doesn't matter in which column an ID is present.
The result should be as follows. In row 1, neither B nor C has occurred yet, so min_appearances is 0, but max_appearances is 1 because A and D were present. By row 5, A and D have been present 3 times, but B and C only 2. I'm not concerned with which ID is present, only with the min and max counts. The real data is more complicated, so the pairs are not static: A could face C, and so on.
# A tibble: 8 x 5
date id1 id2 min_appearances max_appearances
<date> <chr> <chr> <dbl> <dbl>
1 2020-01-01 A D 0 1
2 2020-01-02 B C 1 1
3 2020-01-03 C B 1 2
4 2020-01-04 D A 2 2
5 2020-01-05 A D 2 3
6 2020-01-06 B C 3 3
7 2020-01-07 C B 3 4
8 2020-01-08 D A 4 4
DATA:
library(dplyr)
date <- seq(as.Date("2020/1/1"), by = "day", length.out = 8)
id1 <- rep(c("A", "B", "C", "D"), 2)
id2 <- rep(c("D", "C", "B", "A"), 2)
dt <- tibble(date = date,
             id1 = id1,
             id2 = id2)
Here's a way to do it using functions from the tidyverse. First, pivot_longer to handle the data more easily. Then compute the cumulative count of every unique id. Compute the min and max for each row over the "count" columns. Finally, take the last min and max values for each pair of ids, and pivot back to wide:
library(tidyverse)
dt %>%
  pivot_longer(cols = -date, values_to = "id") %>%
  mutate(map_dfc(unique(id), ~ tibble("count_{.x}" := cumsum(id == .x)))) %>%
  mutate(min_appearances = do.call(pmin, select(., starts_with("count"))),
         max_appearances = do.call(pmax, select(., starts_with("count")))) %>%
  group_by(date) %>%
  mutate(across(min_appearances:max_appearances, last),
         n = row_number()) %>%
  pivot_wider(c(date, min_appearances, max_appearances),
              names_from = n, values_from = id, names_prefix = "id") %>%
  relocate(order(colnames(.)))
date id1 id2 max_appearances min_appearances
<date> <chr> <chr> <int> <int>
1 2020-01-01 A D 1 0
2 2020-01-02 B C 1 1
3 2020-01-03 C B 2 1
4 2020-01-04 D A 2 2
5 2020-01-05 A D 3 2
6 2020-01-06 B C 3 3
7 2020-01-07 C B 4 3
8 2020-01-08 D A 4 4
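The same cumulative-count idea can also be sketched more compactly in base R, assuming there are exactly two id columns:
library(dplyr)
ids <- c(rbind(dt$id1, dt$id2)) # interleave id1/id2 so counts follow row order
counts <- sapply(unique(ids), function(x) cumsum(ids == x))
# keep every second row: the state after both ids of a date have been seen
state <- counts[seq(2, length(ids), by = 2), , drop = FALSE]
dt %>%
  mutate(min_appearances = apply(state, 1, min),
         max_appearances = apply(state, 1, max))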

Create frequency data frame and transfer columns from old data frame

I am using the map function to create frequency tables from a list of data frames. I would like to carry the name column over from the original data frame. For example, when I enter df_freq$C I want to see three columns, value, n, and name, with every name value equal to "C".
# load packages and define variables
rm(list = ls())
library(purrr)
library(dplyr)

## load data
df_raw <- data.frame(name = c("C", "A", "B", "A", "B", "C"),
                     start = c(2, 1, 3, 4, 5, 2),
                     end = c(7, 6, 7, 8, 10, 9))
df <- df_raw %>%
  split(.$name) %>% # split data by name
  imap(function(x, x_name) {
    data.frame(value = Map(seq.int, x$start, x$end) %>% unlist(),
               name = x_name) })

## create frequency table with name column
df_freq <- df %>%
  map(., ~ count(.x, value))
It can be done more directly in the tidyverse. Create a rowwise attribute, then use transmute to return the name along with a list of the sequence from 'start' to 'end' for each row, unnest the list column, and do the count:
library(dplyr)
library(tidyr)
df_raw %>%
  rowwise() %>%
  transmute(name, value = list(start:end)) %>%
  unnest(c(value)) %>%
  count(name, value)
-output
# A tibble: 24 x 3
# name value n
# <chr> <int> <int>
# 1 A 1 1
# 2 A 2 1
# 3 A 3 1
# 4 A 4 2
# 5 A 5 2
# 6 A 6 2
# 7 A 7 1
# 8 A 8 1
# 9 B 3 1
#10 B 4 1
# … with 14 more rows
Or, instead of rowwise, use map2:
library(purrr)
df_raw %>%
  transmute(name, value = map2(start, end, `:`)) %>%
  unnest(c(value)) %>%
  count(name, value)
In the OP's code, the count needs the name column as well:
df %>%
  map(., ~ count(.x, name, value))
Here is a data.table option (applied to df_raw, since df is the split list):
library(data.table)
setDT(df_raw)[, .(value = unlist(Map(seq, start, end)), n = 1), by = .(name)][
  , .(n = sum(n)), by = .(name, value)]
which gives
name value n
1: C 2 2
2: C 3 2
3: C 4 2
4: C 5 2
5: C 6 2
6: C 7 2
7: C 8 1
8: C 9 1
9: A 1 1
10: A 2 1
11: A 3 1
12: A 4 2
13: A 5 2
14: A 6 2
15: A 7 1
16: A 8 1
17: B 3 1
18: B 4 1
19: B 5 2
20: B 6 2
21: B 7 2
22: B 8 1
23: B 9 1
24: B 10 1

Identify Two-Way Combinations of Levels in a Column for Each ID

I want to identify the two-way combinations of levels in one column grouped by the id and Date variables. Basically, I want the daily unique letter pairs for each person.
I have a dataframe that looks like this:
in_df <- data.frame(id = c(1, 1, 1, 1, 1, 2, 2, 3),
                    Date = as.Date(c("2019-01-01", "2019-01-01", "2019-01-01", "2019-01-02",
                                     "2019-01-02", "2019-01-01", "2019-01-01", "2019-01-01")),
                    letter = c("A", "B", "C", "A", "B", "A", "D", "B"))
in_df
id Date letter
1 1 2019-01-01 A
2 1 2019-01-01 B
3 1 2019-01-01 C
4 1 2019-01-02 A
5 1 2019-01-02 B
6 2 2019-01-01 A
7 2 2019-01-01 D
8 3 2019-01-01 B
And I want one that looks like this:
out_df
id Date letter_1 letter_2
1 1 2019-01-01 A B
2 1 2019-01-01 A C
3 1 2019-01-01 B C
4 1 2019-01-02 A B
5 2 2019-01-01 A D
6 3 2019-01-01 B NA
So the first id and the first Date have letters A, B, and C. I want every unique pair from the three. Order doesn't matter, so swapping what goes into letter_1 and letter_2 would be the same thing.
I have played around with expand.grid and combn, but neither seems quite appropriate for this task.
EDIT
I also have cases where there is only one row per id/Date, so using combn gives me Error in combn(letter, m = 2) : n < m. How can I add an if case such that letter_2 gets an NA? (I also updated the dfs above to address this.)
Using data.table (note that combn will still error for single-row groups such as id 3; see the answers below for handling that case):
require(data.table); setDT(in_df)
dt = in_df[, data.table(t(combn(letter, m = 2))), .(id, Date)]
Output:
> dt
id Date V1 V2
1: 1 2019-01-01 A B
2: 1 2019-01-01 A C
3: 1 2019-01-01 B C
4: 1 2019-01-02 A B
5: 2 2019-01-01 A D
We can use split and combn:
do.call('rbind',
        lapply(split(in_df, list(in_df$id, in_df$Date), drop = TRUE),
               FUN = function(d)
                 cbind.data.frame(unique(d[c('id', 'Date')]),
                                  data.frame(t(
                                    if (length(d$letter) > 1) {
                                      combn(d$letter, 2)
                                    } else {
                                      matrix(c(d$letter, NA), nrow = 2)
                                    })))))
# id Date X1 X2
# 1.2019-01-01.1 1 2019-01-01 A B
# 1.2019-01-01.2 1 2019-01-01 A C
# 1.2019-01-01.3 1 2019-01-01 B C
# 2.2019-01-01 2 2019-01-01 A D
# 1.2019-01-02 1 2019-01-02 A B
It might be helpful to step through this. Investigate the output of:
(ss <- split(in_df, list(in_df$id, in_df$Date), drop = TRUE))
Then check out:
lapply(ss, FUN = function(d) data.frame(t(combn(d$letter, 2))))
The rest of the way, we're just combining the data. You might want to adjust the column names a bit.
I think the following code works (for groups with at least two rows):
library("dplyr")
in_df %>%
group_by(id, Date) %>%
mutate(
letter_1 = combn(letter, 2)[1, ],
letter_2 = combn(letter, 2)[2, ]
) %>%
distinct(letter_1, letter_2)
# # A tibble: 5 x 4
# # Groups: id, Date [3]
# letter_1 letter_2 id Date
# <fct> <fct> <dbl> <date>
# 1 A B 1 2019-01-01
# 2 A C 1 2019-01-01
# 3 B C 1 2019-01-01
# 4 A B 1 2019-01-02
# 5 A D 2 2019-01-01
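If you also need the single-row case from the EDIT to yield an NA partner, a guarded variant of the combn idea might look like this; a sketch assuming dplyr >= 1.0 (where summarise() may return multi-row data frames), letter as character (the R >= 4.0 data.frame default), and a hypothetical helper pair_fun:
library(dplyr)
# hypothetical helper: all pairs, or (letter, NA) for single-letter groups
pair_fun <- function(x) {
  if (length(x) < 2) return(data.frame(letter_1 = x, letter_2 = NA_character_))
  m <- t(combn(x, 2))
  data.frame(letter_1 = m[, 1], letter_2 = m[, 2])
}
in_df %>%
  group_by(id, Date) %>%
  summarise(pair_fun(letter), .groups = "drop")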

dplyr Rolling Conditional Counts

I have a data frame as follows:
df <- data.frame(
  Item = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
  Date = c("2018-1-1", "2018-2-1", "2018-3-1", "2018-4-1", "2018-5-1",
           "2018-1-1", "2018-2-1", "2018-3-1", "2018-4-1", "2018-5-1"),
  Value = rnorm(10))
I want to mutate a new column, grouped by Item, that counts the number of values higher than 0 within a window of 3 (or any other integer I specify).
I am familiar with tidyverse, therefore, a dplyr solution would be most welcome.
Think of the zoo package if you want to roll anything:
df$new <- zoo::rollsum(df$Value > 0, 3, fill = NA)
# Item Date Value new
#1 A 2018-1-1 0.5852699 NA
#2 A 2018-2-1 -0.7383377 1
#3 A 2018-3-1 -0.3157693 1
#4 A 2018-4-1 1.2475237 1
#5 A 2018-5-1 -1.5479757 1
#6 B 2018-1-1 -0.6913331 0
#7 B 2018-2-1 -0.2423809 0
#8 B 2018-3-1 -1.6363024 0
#9 B 2018-4-1 -0.3256263 1
#10 B 2018-5-1 0.3563144 NA
You also have a choice of window position; have a closer look at the argument align = c("center", "left", "right").
So as a dplyr chain:
df %>%
  group_by(Item) %>%
  dplyr::mutate(new = zoo::rollsum(Value > 0, 3, fill = NA))
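If you want a trailing window instead (the current row plus the two before it), align = "right" does that; a small sketch:
df %>%
  group_by(Item) %>%
  dplyr::mutate(new = zoo::rollsum(Value > 0, 3, fill = NA, align = "right"))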
You could use the RcppRoll package.
require(RcppRoll)
df$new <- RcppRoll::roll_sum(df$Value > 0, 3, fill = NA)
Using the tidyverse:
df %>%
  group_by(Item) %>%
  dplyr::mutate(new = RcppRoll::roll_sum(Value > 0, 3, fill = NA))
Speedwise this is faster than the zoo package:
n <- 10000
df <- data.frame(
  Item = sample(LETTERS, n, replace = TRUE),
  Value = rnorm(n))
df_grouped <- df %>%
  group_by(Item)
microbenchmark::microbenchmark(
  RcppRoll = df_grouped <- df_grouped %>%
    dplyr::mutate(new_RcppRoll = RcppRoll::roll_sum(Value > 0, 3, fill = NA)),
  zoo = df_grouped <- df_grouped %>%
    dplyr::mutate(new_zoo = zoo::rollsum(Value > 0, 3, fill = NA))
)
Results in:
Unit: milliseconds
expr min lq mean median uq max neval
RcppRoll 2.509003 2.741993 2.929227 2.83913 2.983726 5.832962 100
zoo 11.172920 11.785113 13.288970 12.43320 13.607826 25.879754 100
And
all.equal(df_grouped$new_RcppRoll, df_grouped$new_zoo)
TRUE
Changed the rnorm example to sample(-5:5) for clarity:
   Item  Date       Value
   <fct> <date>     <int>
 1 A     2018-01-01     3
 2 B     2018-01-01     2
 3 B     2018-02-01    -5
 4 A     2018-02-01    -3
 5 A     2018-03-01     4
 6 B     2018-03-01    -2
 7 A     2018-04-01     5
 8 B     2018-04-01     0
 9 A     2018-05-01     1
10 B     2018-05-01    -4
df <- df %>%
  mutate(greater_than = (Value > 0) * Value) %>%
  group_by(Item) %>%
  arrange(Date) %>%
  mutate(greater_than = zoo::rollapplyr(greater_than, 3, sum, partial = TRUE))
df %>% arrange(Item) %>% head(10)
Should look like this:
   Item  Date       Value greater_than
 1 A     2018-01-01     3            3
 2 A     2018-02-01    -3            3
 3 A     2018-03-01     4            7
 4 A     2018-04-01     5            9
 5 A     2018-05-01     1           10
 6 B     2018-01-01     2            2
 7 B     2018-02-01    -5            2
 8 B     2018-03-01    -2            2
 9 B     2018-04-01     0            0
10 B     2018-05-01    -4            0
