'Stretch' a grouped data frame using dplyr - r

I have a grouped dataframe with multiple IDs with a date and value column.
id <- c("a", "a", "a", "b", "b", "b", "c")
date <- c("2020-01-01", "2020-01-02", "2020-01-03",
"2020-01-01", "2020-01-02", "2020-01-03",
"2020-01-01")
value <- rnorm(n = length(id))
df <- cbind.data.frame(id, date, value)
However, some IDs have fewer than 3 dates. I want to "stretch" those IDs and add an NA in the value column for the new dates. In this dataframe, the "c" ID would have two new dates added ("2020-01-02" and "2020-01-03").

Perhaps this approach would suit?
library(tidyverse)
id <- c("a", "a", "a", "b", "b", "b", "c")
date <- c("2020-01-01", "2020-01-02", "2020-01-03",
"2020-01-01", "2020-01-02", "2020-01-03",
"2020-01-01")
value <- rnorm(n = length(id))
df <- cbind.data.frame(id, date, value)
df %>%
right_join(df %>% expand(id, date))
#> Joining, by = c("id", "date")
#> id date value
#> 1 a 2020-01-01 -1.5371474
#> 2 a 2020-01-02 0.9001098
#> 3 a 2020-01-03 0.1523491
#> 4 b 2020-01-01 0.8194577
#> 5 b 2020-01-02 1.2005270
#> 6 b 2020-01-03 0.1158812
#> 7 c 2020-01-01 -0.8676445
#> 8 c 2020-01-02 NA
#> 9 c 2020-01-03 NA
Created on 2022-09-05 by the reprex package (v2.0.1)

In base R, you can process each id with by() and merge it with a data frame built from the full sequence of dates in the range. First, convert the column to a proper date format with df$date <- as.Date(df$date).
by(df, df$id, \(x)
merge(x,
data.frame(id=el(x$id),
date=do.call(seq.Date, c(as.list(range(df$date)), 'day'))),
all=TRUE)) |>
do.call(what=rbind)
# id date value
# a.1 a 2020-01-01 1.3709584
# a.2 a 2020-01-02 -0.5646982
# a.3 a 2020-01-03 0.3631284
# b.1 b 2020-01-01 0.6328626
# b.2 b 2020-01-02 0.4042683
# b.3 b 2020-01-03 -0.1061245
# c.1 c 2020-01-01 1.5115220
# c.2 c 2020-01-02 NA
# c.3 c 2020-01-03 NA
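A shorter base R variant is sketched here, assuming it is enough to pair each id with the dates that actually occur somewhere in the data (rather than every day in the range, as the by() version does). The value column uses made-up numbers and the names grid and stretched are only illustrative.

```r
# Example data from the question, with `date` stored as Date.
df <- data.frame(
  id    = c("a", "a", "a", "b", "b", "b", "c"),
  date  = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03",
                    "2020-01-01", "2020-01-02", "2020-01-03",
                    "2020-01-01")),
  value = c(0.5, -1.2, 0.3, 1.1, 0.7, -0.4, 2.0),  # made-up values
  stringsAsFactors = FALSE
)

# Every id paired with every date that occurs anywhere in the data.
grid <- expand.grid(id = unique(df$id), date = unique(df$date),
                    stringsAsFactors = FALSE)

# all.x = TRUE keeps grid rows that have no match in df; their value is NA.
stretched <- merge(grid, df, by = c("id", "date"), all.x = TRUE)
stretched <- stretched[order(stretched$id, stretched$date), ]
```

If dates can be missing for every id at once, the grid would need a full seq() over the date range instead of unique(df$date).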

You could use complete() from tidyr.
library(tidyr)
df %>%
complete(id, date)
# # A tibble: 9 × 3
# id date value
# <chr> <chr> <dbl>
# 1 a 2020-01-01 1.12
# 2 a 2020-01-02 1.58
# 3 a 2020-01-03 1.26
# 4 b 2020-01-01 -2.30
# 5 b 2020-01-02 -1.45
# 6 b 2020-01-03 -0.212
# 7 c 2020-01-01 0.344
# 8 c 2020-01-02 NA
# 9 c 2020-01-03 NA

Related

Find rolling min and max occurrences from two columns

I would like to track min and max occurrences across two columns. This should be done in a rolling fashion from the beginning of the data, so we can track how many times each ID has appeared overall at each date. It doesn't matter which column an ID appears in.
The result should be as follows. In row 1, neither B nor C has occurred yet, so min_appearances is 0, but max_appearances is 1 because A and D were present. By row 5, A and D have been present 3 times but B and C only 2. I'm not concerned with which ID is present, only with the min and max counts. The real data is also more complicated: pairs are not static, so A could face C and so on.
# A tibble: 8 x 5
date id1 id2 min_appearances max_appearances
<date> <chr> <chr> <dbl> <dbl>
1 2020-01-01 A D 0 1
2 2020-01-02 B C 1 1
3 2020-01-03 C B 1 2
4 2020-01-04 D A 2 2
5 2020-01-05 A D 2 3
6 2020-01-06 B C 3 3
7 2020-01-07 C B 3 4
8 2020-01-08 D A 4 4
DATA:
library(dplyr)
date <- seq(as.Date("2020/1/1"), by = "day", length.out = 8)
id1 <- rep(c("A", "B", "C", "D"), 2)
id2 <- rep(c("D", "C", "B", "A"), 2)
dt <- tibble(date = date,
id1 = id1,
id2 = id2)
Here's a way to do it using functions from the tidyverse. First, use pivot_longer to make the data easier to handle. Then compute the cumulative count for every unique id. Compute the min and max for each row over the "count" columns. Finally, take the last min and max values for each pair, and pivot back to wide.
library(tidyverse)
dt %>%
pivot_longer(cols = -date, values_to = "id") %>%
mutate(map_dfc(unique(id), ~ tibble("count_{.x}" := cumsum(id == .x)))) %>%
mutate(min_appearances = do.call(pmin, select(., starts_with("count"))),
max_appearances = do.call(pmax, select(., starts_with("count")))) %>%
group_by(date) %>%
mutate(across(min_appearances:max_appearances, last),
n = row_number()) %>%
pivot_wider(id_cols = c(date, min_appearances, max_appearances), names_from = n, values_from = id, names_prefix = "id") %>%
relocate(order(colnames(.)))
date id1 id2 max_appearances min_appearances
<date> <chr> <chr> <int> <int>
1 2020-01-01 A D 1 0
2 2020-01-02 B C 1 1
3 2020-01-03 C B 2 1
4 2020-01-04 D A 2 2
5 2020-01-05 A D 3 2
6 2020-01-06 B C 3 3
7 2020-01-07 C B 4 3
8 2020-01-08 D A 4 4
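The same running tally can be sketched in base R with an explicit loop. This is only an illustration, and it assumes that an ID never appears in both id1 and id2 of the same row; the names ids and counts are not from the original answer.

```r
dt <- data.frame(
  date = seq(as.Date("2020-01-01"), by = "day", length.out = 8),
  id1  = rep(c("A", "B", "C", "D"), 2),
  id2  = rep(c("D", "C", "B", "A"), 2),
  stringsAsFactors = FALSE
)

# One running counter per ID, all starting at zero.
ids    <- unique(c(dt$id1, dt$id2))
counts <- setNames(integer(length(ids)), ids)

dt$min_appearances <- NA_integer_
dt$max_appearances <- NA_integer_

for (r in seq_len(nrow(dt))) {
  # Increment the counters of the two IDs present in this row.
  present <- c(dt$id1[r], dt$id2[r])
  counts[present] <- counts[present] + 1L
  # Record the smallest and largest tally seen so far.
  dt$min_appearances[r] <- min(counts)
  dt$max_appearances[r] <- max(counts)
}
```

Because the counters are created for every ID up front, IDs that have not appeared yet contribute a 0 to the minimum, which matches the expected first row.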

How to remove rows including sequential condition between two columns

I'm trying to remove the factors from my dataframe, but only those after a specific date. Here I made a toy example:
I have a test dataframe, and an inspection dataframe inspec. I would like to remove the letters that are in var1 that appear in inspec, but only the rows after the date in inspec. For example, consider
> test = data.frame(var1 = c("A", "B", "A", "B", "C","B", "A"), measure = c(6,7,8,6,10,1,0), date = as.Date(c("2021-01-02", "2021-01-03", "2021-01-04", "2021-01-05", "2021-01-06", "2021-01-07", "2021-01-12")))
> test
var1 measure date
1 A 6 2021-01-02
2 B 7 2021-01-03
3 A 8 2021-01-04
4 B 6 2021-01-05
5 C 10 2021-01-06
6 B 1 2021-01-07
7 A 0 2021-01-12
>
> inspec = data.frame(var1 = c("A", "C", "D", "A"), date = as.Date(c("2021-01-03", "2021-01-06", "2021-01-10", "2021-01-12")))
> inspec
var1 date
1 A 2021-01-03
2 C 2021-01-06
3 D 2021-01-10
4 A 2021-01-12
Then, as result, I'd like to obtain:
> test
var1 measure date
1 A 6 2021-01-02
2 B 7 2021-01-03
3 B 6 2021-01-05
4 B 1 2021-01-07
Note that the A rows in var1 were excluded only from the inspection date in inspec onwards. If I didn't want to keep the var1 rows from before an inspec date, I could just use test = test[!(test$var1 %in% inspec$var1),]
Any hints on how I can do that?
base R
## reduce `inspec` to the earliest date
inspec$date <- as.Date(inspec$date)
tmpinspec <- inspec[ave(as.integer(inspec$date), inspec$var1, FUN = function(z) z == min(z)) > 0,]
tmpinspec
# var1 date
# 1 A 2021-01-03
# 2 C 2021-01-06
# 3 D 2021-01-10
tmp <- merge(test, tmpinspec, by = "var1", all.x = TRUE, suffixes = c("", ".y"))
tmp
# var1 measure date date.y
# 1 A 6 2021-01-02 2021-01-03
# 2 A 8 2021-01-04 2021-01-03
# 3 A 0 2021-01-12 2021-01-03
# 4 B 7 2021-01-03 <NA>
# 5 B 6 2021-01-05 <NA>
# 6 B 1 2021-01-07 <NA>
# 7 C 10 2021-01-06 2021-01-06
tmp <- tmp[with(tmp, is.na(date.y) | date < date.y),]
tmp$date.y <- NULL
tmp
# var1 measure date
# 1 A 6 2021-01-02
# 4 B 7 2021-01-03
# 5 B 6 2021-01-05
# 6 B 1 2021-01-07
dplyr
library(dplyr)
group_by(inspec, var1) %>%
slice_min(date) %>%
left_join(test, ., by = "var1", suffix = c("", ".y")) %>%
filter(is.na(date.y) | date < date.y) %>%
select(-date.y)
# var1 measure date
# 1 A 6 2021-01-02
# 2 B 7 2021-01-03
# 3 B 6 2021-01-05
# 4 B 1 2021-01-07
We can join on var1, filter on the dates while the data is grouped by var1, and keep only the first match:
library(dplyr)
test %>%
left_join(inspec, by = "var1", suffix = c("", ".y")) %>%
group_by(var1) %>%
filter(is.na(date.y) | date < first(date.y)) %>%
select(-date.y) %>%
group_by_all() %>%
slice(1)
#> # A tibble: 4 x 3
#> # Groups: var1, measure, date [4]
#> var1 measure date
#> <fct> <dbl> <date>
#> 1 A 6 2021-01-02
#> 2 B 1 2021-01-07
#> 3 B 6 2021-01-05
#> 4 B 7 2021-01-03
This is a variation of r2evans's answer.
Using data.table, we can also merge and then filter. Reducing inspec to the earliest date per var1 first keeps the merge from duplicating rows:
library(data.table)
setDT(test); setDT(inspec)
first_inspec <- inspec[, .(date = min(date)), by = var1]
test <- merge(test, first_inspec, by = "var1", all.x = TRUE, suffixes = c("", ".y"))
test <- test[is.na(date.y) | date < date.y, .(var1, measure, date)]
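The rule used throughout these answers (keep a row if its var1 was never inspected, or if its date is before the first inspection) can also be sketched without a merge, by looking up a cutoff per row. This is a minimal base R illustration; first_inspec, cutoff and kept are illustrative names.

```r
test <- data.frame(
  var1    = c("A", "B", "A", "B", "C", "B", "A"),
  measure = c(6, 7, 8, 6, 10, 1, 0),
  date    = as.Date(c("2021-01-02", "2021-01-03", "2021-01-04", "2021-01-05",
                      "2021-01-06", "2021-01-07", "2021-01-12")),
  stringsAsFactors = FALSE
)
inspec <- data.frame(
  var1 = c("A", "C", "D", "A"),
  date = as.Date(c("2021-01-03", "2021-01-06", "2021-01-10", "2021-01-12")),
  stringsAsFactors = FALSE
)

# Earliest inspection per var1 as a named numeric vector (days since epoch).
first_inspec <- sapply(split(as.numeric(inspec$date), inspec$var1), min)

# Look up each row's cutoff; var1 values that were never inspected give NA.
cutoff <- first_inspec[test$var1]

# Keep rows never inspected, or dated strictly before the first inspection.
kept <- test[is.na(cutoff) | as.numeric(test$date) < cutoff, ]
```

Unlike the merge-based versions, this keeps the rows of test in their original order.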

How to calculate the ratio of a condition based on another condition

I have a simplified data frame like this
date state hour
2020-01-01 A 6
2020-01-01 B 3
2020-01-02 A 4
2020-01-02 B 3.5
2020-01-03 A 5
2020-01-03 B 2.5
For each date, there are two states. I want to calculate the ratio of hour for state A to hour for state B on each day.
For example,
date ratio
2020-01-01 2
2020-01-02 1.143
2020-01-03 2
How do I get this result? Thank you!
With the help of match you can do:
library(dplyr)
df %>%
group_by(date) %>%
summarise(ratio = hour[match('A', state)]/hour[match('B', state)])
# date ratio
# <chr> <dbl>
#1 2020-01-01 2
#2 2020-01-02 1.14
#3 2020-01-03 2
You can use xtabs:
tt <- xtabs(hour ~ date + state, x)
data.frame(dimnames(tt)[1], ratio = tt[,1] / tt[,2])
# date ratio
#2020-01-01 2020-01-01 2.000000
#2020-01-02 2020-01-02 1.142857
#2020-01-03 2020-01-03 2.000000
Data:
x <- data.frame(date = c("2020-01-01", "2020-01-01", "2020-01-02",
"2020-01-02", "2020-01-03", "2020-01-03"), state = c("A", "B",
"A", "B", "A", "B"), hour = c(6, 3, 4, 3.5, 5, 2.5))
A data.table option
> setDT(df)[, .(ratio = Reduce(`/`, hour[order(state)])), date]
date ratio
1: 2020-01-01 2.000000
2: 2020-01-02 1.142857
3: 2020-01-03 2.000000
You can also use the following solution, although it is to some extent similar to the one posted by @Ronak Shah.
library(dplyr)
library(tidyr)
df %>%
pivot_wider(names_from = state, values_from = hour) %>%
group_by(date) %>%
summarise(ratio = A/B)
# A tibble: 3 x 2
date ratio
<chr> <dbl>
1 2020-01-01 2
2 2020-01-02 1.14
3 2020-01-03 2
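For comparison, a base R sketch of the same reshaping idea: split the rows by state and join them back on date. It assumes each date has at most one row per state; dates missing either state are dropped by the inner join. The names a, b, wide and result are illustrative.

```r
x <- data.frame(
  date  = c("2020-01-01", "2020-01-01", "2020-01-02",
            "2020-01-02", "2020-01-03", "2020-01-03"),
  state = c("A", "B", "A", "B", "A", "B"),
  hour  = c(6, 3, 4, 3.5, 5, 2.5),
  stringsAsFactors = FALSE
)

# One data frame per state, keeping only the join key and the value.
a <- x[x$state == "A", c("date", "hour")]
b <- x[x$state == "B", c("date", "hour")]

# Inner join on date; the suffixes give columns hour_A and hour_B.
wide <- merge(a, b, by = "date", suffixes = c("_A", "_B"))
wide$ratio <- wide$hour_A / wide$hour_B
result <- wide[, c("date", "ratio")]
```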

2 Vectors under for loop

I have a dataframe df with an ID column. I am trying to make all possible combinations of it with another vector called dates.
df
ID
Asda
Dsaa
Fasd
Das
dates <- seq(as.Date("2019-12-27"), as.Date(" "), by=1)
I have written the following for loop, but I only get the result below.
for (i in df$ID) {
for (j in dates) {
new_value <- c(i, j)
}
}
Result below
new_value
ID Date
"Das" "18258"
Expected result: can I get all combinations of ID and dates?
You can use merge to get all combinations of your variables:
df <- data.frame(ID = c("Asda", "Dsaa", "Fasd", "Das"))
dates <- seq(as.Date("2019-12-27"), as.Date("2019-12-28"), by=1)
merge(df, dates, by = NULL)
Result:
ID y
1 Asda 2019-12-27
2 Dsaa 2019-12-27
3 Fasd 2019-12-27
4 Das 2019-12-27
5 Asda 2019-12-28
6 Dsaa 2019-12-28
7 Fasd 2019-12-28
8 Das 2019-12-28
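Edit: For loop only. Loop over the index of dates, because iterating over a Date vector directly strips its class and yields plain numbers.
new_df <- data.frame()
for (i in df$ID) {
  for (j in seq_along(dates)) {
    new_df <- rbind(new_df, data.frame(ID = i, Date = dates[j]))
  }
}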
All possible combinations of ID and dates
expand.grid(df$ID,dates)
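Edit: using for loops (temp is a character matrix, so the dates are stored as formatted strings)
k=1
temp=matrix(NA,length(df$ID)*length(dates),2)
for (i in df$ID) {
  for (j in seq_along(dates)){
    temp[k,]=c(i,format(dates[j]))
    k=k+1
  }
}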
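The loop in the question shows a day count such as "18258" because for (j in dates) iterates over the underlying numbers of the Date vector, and because new_value is overwritten on every pass. A minimal sketch that avoids both problems, assuming the same two-day range as the merge example (the original end date is not given); rows and combos are illustrative names.

```r
df    <- data.frame(ID = c("Asda", "Dsaa", "Fasd", "Das"),
                    stringsAsFactors = FALSE)
dates <- seq(as.Date("2019-12-27"), as.Date("2019-12-28"), by = 1)

# Pre-allocate one list slot per combination instead of overwriting a value.
rows <- vector("list", nrow(df) * length(dates))
k <- 1
for (i in df$ID) {
  for (j in seq_along(dates)) {
    # dates[j] keeps the Date class; a bare loop variable would not.
    rows[[k]] <- data.frame(ID = i, Date = dates[j], stringsAsFactors = FALSE)
    k <- k + 1
  }
}

# Bind once at the end rather than growing a data frame inside the loop.
combos <- do.call(rbind, rows)
```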
It is not a for loop, but for the sake of completeness (expand.grid and merge are nice base R answers), here is another solution using tidyr::crossing(). I am not sure why you would use a for loop here, since it is harder to optimise (Miles McBain has a nice article on for loops).
dfr <- data.frame(id = paste0("id", 1:10), dates = seq.Date(from = as.Date("2019-12-27"), by = 1, length.out = 10 ) )
dfr
#> id dates
#> 1 id1 2019-12-27
#> 2 id2 2019-12-28
#> 3 id3 2019-12-29
#> 4 id4 2019-12-30
#> 5 id5 2019-12-31
#> 6 id6 2020-01-01
#> 7 id7 2020-01-02
#> 8 id8 2020-01-03
#> 9 id9 2020-01-04
#> 10 id10 2020-01-05
tidyr::crossing(dfr$id, dfr$dates)
#> # A tibble: 100 x 2
#> `dfr$id` `dfr$dates`
#> <fct> <date>
#> 1 id1 2019-12-27
#> 2 id1 2019-12-28
#> 3 id1 2019-12-29
#> 4 id1 2019-12-30
#> 5 id1 2019-12-31
#> 6 id1 2020-01-01
#> 7 id1 2020-01-02
#> 8 id1 2020-01-03
#> 9 id1 2020-01-04
#> 10 id1 2020-01-05
#> # ... with 90 more rows

Compare date by group in two data frames in R

I have one data frame containing event date by id:
data.frame(id = c("a", "a", "a", "d", "d"),
date = as.Date(c("2018-01-03", "2018-02-02", "2018-02-22", "2018-02-13", "2018-05-01")))
id date
1 a 2018-01-03
2 a 2018-02-02
3 a 2018-02-22
4 d 2018-02-13
5 d 2018-05-01
And another one containing start and end of periods by id:
data.frame(id = c("a", "a", "d", "d", "d", "d"),
start = as.Date(c("2018-01-15", "2018-01-30", "2018-03-01", "2018-02-01", "2018-04-02", "2018-03-19")),
end = as.Date(c("2018-01-18", "2018-02-10", "2018-03-03", "2018-04-22", "2018-05-23", "2018-08-29")))
id start end
1 a 2018-01-15 2018-01-18
2 a 2018-01-30 2018-02-10
3 d 2018-03-01 2018-03-03
4 d 2018-02-01 2018-04-22
5 d 2018-04-02 2018-05-23
6 d 2018-03-19 2018-08-29
For each id, I need to count the number of periods from the second data frame to which each date in the first data frame belongs.
My desired dataframe would be:
id date n
1 a 2018-01-03 0 # does not belong to any period
2 a 2018-02-02 1 # belongs to [2018-01-30,2018-02-10]
3 a 2018-02-22 0 # does not belong to any period
4 d 2018-02-13 1 # belongs to [2018-02-01,2018-04-22]
5 d 2018-05-01 2 # belongs to [2018-04-02,2018-05-23] and [2018-03-19,2018-08-29]
My problem is not about date comparison and summing the results. My problem is about performing that analysis inside each id group. I guess there is a way using split and/or the apply family, but I did not find it.
How can I do it in base R? I work in a restrictive environment where I only have access to base R.
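Base R approach (df1 holds the event dates and df2 the periods). Matching on both id and date keeps the count inside each id group even if two ids share a date:
temp <- subset( merge(df1, df2), date >= start & date <= end, select = c("id", "date") )
df1$n <- sapply( seq_len(nrow(df1)), function(i) sum( temp$id == df1$id[i] & temp$date == df1$date[i] ))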
# id date n
# 1 a 2018-01-03 0
# 2 a 2018-02-02 1
# 3 a 2018-02-22 0
# 4 d 2018-02-13 1
# 5 d 2018-05-01 2
Another base R approach:
dates <- data.frame(id = c("a", "a", "a", "d", "d"),
date = as.Date(c("2018-01-03", "2018-02-02", "2018-02-22", "2018-02-13", "2018-05-01")))
periods <- data.frame(id = c("a", "a", "d", "d", "d", "d"),
start = as.Date(c("2018-01-15", "2018-01-30", "2018-03-01", "2018-02-01", "2018-04-02", "2018-03-19")),
end = as.Date(c("2018-01-18", "2018-02-10", "2018-03-03", "2018-04-22", "2018-05-23", "2018-08-29")))
df <- transform(merge(dates, periods), belongs = date >= start & date <= end)
aggregate(belongs ~ date + id, data = df, sum)
# date id belongs
# 1 2018-01-03 a 0
# 2 2018-02-02 a 1
# 3 2018-02-22 a 0
# 4 2018-02-13 d 1
# 5 2018-05-01 d 2
Or using data.table:
library(data.table)
dt <- as.data.table(merge(dates, periods))
dt[, .(n = sum(date >= start & date <= end)), by=c("id","date")]
# id date n
# 1: a 2018-01-03 0
# 2: a 2018-02-02 1
# 3: a 2018-02-22 0
# 4: d 2018-02-13 1
# 5: d 2018-05-01 2
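Since the question asks specifically about keeping the comparison inside each id group with the apply family, here is a minimal base R sketch that does so without a merge. It reuses the dates and periods names from the answer above and looks up each row's own periods by id.

```r
dates <- data.frame(
  id   = c("a", "a", "a", "d", "d"),
  date = as.Date(c("2018-01-03", "2018-02-02", "2018-02-22",
                   "2018-02-13", "2018-05-01")),
  stringsAsFactors = FALSE
)
periods <- data.frame(
  id    = c("a", "a", "d", "d", "d", "d"),
  start = as.Date(c("2018-01-15", "2018-01-30", "2018-03-01",
                    "2018-02-01", "2018-04-02", "2018-03-19")),
  end   = as.Date(c("2018-01-18", "2018-02-10", "2018-03-03",
                    "2018-04-22", "2018-05-23", "2018-08-29")),
  stringsAsFactors = FALSE
)

# For each event row, count the periods of the same id that contain its date.
dates$n <- sapply(seq_len(nrow(dates)), function(k) {
  p <- periods[periods$id == dates$id[k], ]
  sum(dates$date[k] >= p$start & dates$date[k] <= p$end)
})
```

Indexing by row number rather than iterating over the Date column keeps the Date class intact inside the function.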
