I have one data frame containing event dates by id:
df1 <- data.frame(id = c("a", "a", "a", "d", "d"),
                  date = as.Date(c("2018-01-03", "2018-02-02", "2018-02-22", "2018-02-13", "2018-05-01")))
id date
1 a 2018-01-03
2 a 2018-02-02
3 a 2018-02-22
4 d 2018-02-13
5 d 2018-05-01
And another one containing start and end of periods by id:
df2 <- data.frame(id = c("a", "a", "d", "d", "d", "d"),
                  start = as.Date(c("2018-01-15", "2018-01-30", "2018-03-01", "2018-02-01", "2018-04-02", "2018-03-19")),
                  end = as.Date(c("2018-01-18", "2018-02-10", "2018-03-03", "2018-04-22", "2018-05-23", "2018-08-29")))
id start end
1 a 2018-01-15 2018-01-18
2 a 2018-01-30 2018-02-10
3 d 2018-03-01 2018-03-03
4 d 2018-02-01 2018-04-22
5 d 2018-04-02 2018-05-23
6 d 2018-03-19 2018-08-29
For each id, I need to count the number of periods from the second data frame to which each date in the first data frame belongs.
My desired dataframe would be:
id date n
1 a 2018-01-03 0 # does not belong to any period
2 a 2018-02-02 1 # belongs to [2018-01-30,2018-02-10]
3 a 2018-02-22 0 # does not belong to any period
4 d 2018-02-13 1 # belongs to [2018-02-01,2018-04-22]
5 d 2018-05-01 2 # belongs to [2018-04-02,2018-05-23] and [2018-03-19,2018-08-29]
My problem is not about the date comparison or summing the results. My problem is about performing this analysis inside each id group. I guess there is a way using split and/or the apply family, but I have not found how.
How can I do it in base R? I work in a restrictive environment where I have access only to base R.
Base R approach:
# keep the id-wise date/period combinations where the date falls inside the period
temp <- subset(merge(df1, df2), date >= start & date <= end, select = "date")
# count matches per event date (assumes no two ids share the same event date)
df1$n <- sapply(df1$date, function(x) length(temp$date[temp$date == x]))
# id date n
# 1 a 2018-01-03 0
# 2 a 2018-02-02 1
# 3 a 2018-02-22 0
# 4 d 2018-02-13 1
# 5 d 2018-05-01 2
Another base R approach:
dates <- data.frame(id = c("a", "a", "a", "d", "d"),
date = as.Date(c("2018-01-03", "2018-02-02", "2018-02-22", "2018-02-13", "2018-05-01")))
periods <- data.frame(id = c("a", "a", "d", "d", "d", "d"),
start = as.Date(c("2018-01-15", "2018-01-30", "2018-03-01", "2018-02-01", "2018-04-02", "2018-03-19")),
end = as.Date(c("2018-01-18", "2018-02-10", "2018-03-03", "2018-04-22", "2018-05-23", "2018-08-29")))
# flag every id-wise date/period combination, then sum the flags per date and id
df <- transform(merge(dates, periods), belongs = date >= start & date <= end)
aggregate(belongs ~ date + id, data = df, sum)
# date id belongs
# 1 2018-01-03 a 0
# 2 2018-02-02 a 1
# 3 2018-02-22 a 0
# 4 2018-02-13 d 1
# 5 2018-05-01 d 2
Or using data.table:
library(data.table)
dt <- as.data.table(merge(dates, periods))
dt[, .(n = sum(date >= start & date <= end)), by = .(id, date)]
# id date n
# 1: a 2018-01-03 0
# 2: a 2018-02-02 1
# 3: a 2018-02-22 0
# 4: d 2018-02-13 1
# 5: d 2018-05-01 2
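If the intermediate merge() (a per-id Cartesian product) is a concern on larger data, the counting can also be done row by row without any join. A minimal base R sketch, assuming the dates and periods frames defined above:
# for each event row, count the periods of the same id that contain its date
dates$n <- mapply(function(i, d) sum(periods$id == i & periods$start <= d & periods$end >= d),
                  dates$id, dates$date)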
Related
I am doing a migration study indexed to a specific event. To create the dataset, I basically subset a larger dataset to a specific date, then made flags based on additional dates, then added in information. In total, this takes 7 scripts. Now I want to create a comparison dataset, with the same information but indexed to two years earlier.
My question is: is there an easy way to reuse the same script and simply tell R to treat all the code as two years earlier, or do I have to duplicate the code and edit each date in line to be two years earlier? Here's a very basic example of some of the code I'm using to generate the dataset from a larger framework:
#example of things I'd want shifted 2 years
df <- subset(df, DATE_AFTER > as.Date("2016-09-27"))
df$flag <- with(df,
as.numeric(DATE_BEFORE < as.Date("2016-09-28") &
DATE_AFTER > as.Date("2016-09-27")))
df
# ID DATE_BEFORE DATE_AFTER flag
# 1 A 2013-01-23 2018-01-23 1
# 3 C 2018-01-23 2020-01-23 0
# 5 E 2011-01-23 2019-01-23 1
# 6 F 2010-01-23 2019-01-23 1
# 7 G 2017-01-23 2018-01-23 0
Dummy data
df <- data.frame(ID=c("A", "B", "C", "D", "E", "F", "G"),
DATE_BEFORE=as.Date(c("2013-01-23", "2010-01-23", "2018-01-23",
"2014-01-23", "2011-01-23", "2010-01-23",
"2017-01-23")),
DATE_AFTER=as.Date(c("2018-01-23", "2016-01-23", "2020-01-23",
"2016-01-23", "2019-01-23", "2019-01-23",
"2018-01-23")))
Just wrap it in a function. To subtract years we may use as.POSIXlt, as shown in this answer.
my_df_subset <- \(date, subtract_yr=0L) {
  dt <- as.POSIXlt(date)
  # POSIXlt exposes the date parts, so whole years can be shifted directly
  dt$year <- dt$year - subtract_yr
  dt <- as.Date(dt)
  transform(subset(df, DATE_AFTER > dt),
            flag=as.numeric(DATE_BEFORE < dt + 1L &
                            DATE_AFTER > dt))
}
my_df_subset("2016-09-27")
# ID DATE_BEFORE DATE_AFTER flag
# 1 A 2013-01-23 2018-01-23 1
# 3 C 2018-01-23 2020-01-23 0
# 5 E 2011-01-23 2019-01-23 1
# 6 F 2010-01-23 2019-01-23 1
# 7 G 2017-01-23 2018-01-23 0
my_df_subset("2016-09-27", 2L) ## two years earlier
# ID DATE_BEFORE DATE_AFTER flag
# 1 A 2013-01-23 2018-01-23 1
# 2 B 2010-01-23 2016-01-23 1
# 3 C 2018-01-23 2020-01-23 0
# 4 D 2014-01-23 2016-01-23 1
# 5 E 2011-01-23 2019-01-23 1
# 6 F 2010-01-23 2019-01-23 1
# 7 G 2017-01-23 2018-01-23 0
Note: R >= 4.1 is used (for the lambda shorthand \(...)).
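An alternative for the year shift, if you prefer not to manipulate POSIXlt fields: seq.Date understands signed year steps. A sketch (shift_years is a hypothetical helper, not part of the answer above):
# shift a date back by n whole years via seq.Date
shift_years <- function(date, n) seq(as.Date(date), by = paste0(-n, " years"), length.out = 2)[2]
shift_years("2016-09-27", 2)
# [1] "2014-09-27"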
Data:
df <- structure(list(ID = c("A", "B", "C", "D", "E", "F", "G"), DATE_BEFORE = structure(c(15728,
14632, 17554, 16093, 14997, 14632, 17189), class = "Date"), DATE_AFTER = structure(c(17554,
16823, 18284, 16823, 17919, 17919, 17554), class = "Date")), class = "data.frame", row.names = c(NA,
-7L))
I'm struggling with how to calculate the number of days between occurrences, since I need to know how many days pass between maintenances on a piece of equipment.
I have a data frame with many pieces of equipment and dates indicating the maintenance, and I need to calculate the days between maintenances for each one. I will show a toy example:
test = data.frame(car = c("A", "A", "B", "B", "B", "C", "C", "D", "D", "D", "E"),
maintenance_date= c("20-09-2020", "25-09-2020", "14-05-2020", "20-05-2020", "20-05-2021", "11-01-2021", "13-01-2021", "13-01-2021", "15-01-2021", "15-01-2021", "13-01-2021"))
#test
# car maintenance_date
#1 A 20-09-2020
#2 A 25-09-2020
#3 B 14-05-2020
#4 B 20-05-2020
#5 B 20-05-2021
#6 C 11-01-2021
#7 C 13-01-2021
#8 D 13-01-2021
#9 D 15-01-2021
#10 D 15-01-2021
#11 E 13-01-2021
#for result, I'd like something like:
result
# car maintenance_date
#1 A 5
#2 B 6
#3 B 365
#4 C 2
#5 D 2
#6 D 0
I thought of using something like test %>% arrange(maintenance_date) %>% group_by(car) %>% ....
Any hint on how I can do that?
We need to convert to Date class before doing the arrange, and then do the group_by on 'car' and get the difference:
library(dplyr)
library(lubridate)
test %>%
mutate(maintenance_date = dmy(maintenance_date)) %>%
arrange(maintenance_date) %>%
group_by(car) %>%
summarise(maintenance_date = diff(maintenance_date), .groups = 'drop')
Output:
# A tibble: 6 × 2
car maintenance_date
<chr> <drtn>
1 A 5 days
2 B 6 days
3 B 365 days
4 C 2 days
5 D 2 days
6 D 0 days
A data.table approach:
library(data.table)
setDT(test)
test[, maintenance_date := as.Date(maintenance_date, format="%d-%m-%Y")
][, .(ndays = diff(maintenance_date)), by = car]
# car ndays
# <char> <difftime>
# 1: A 5 days
# 2: B 6 days
# 3: B 365 days
# 4: C 2 days
# 5: D 2 days
# 6: D 0 days
Another tidyverse-based solution:
library(tidyverse)
library(lubridate)
test = data.frame(car = c("A", "A", "B", "B", "B", "C", "C", "D", "D", "D", "E"),
                  maintenance_date = c("20-09-2020", "25-09-2020", "14-05-2020", "20-05-2020", "20-05-2021",
                                       "11-01-2021", "13-01-2021", "13-01-2021", "15-01-2021", "15-01-2021", "13-01-2021"))
test %>%
  group_by(car) %>%
  # -1 marks each car's first visit so the filter drops it (rows are already date-ordered within car)
  mutate(maintenance_date = c(-1, diff(dmy(maintenance_date)))) %>%
  filter(maintenance_date >= 0) %>%
  ungroup()
#> # A tibble: 6 × 2
#> car maintenance_date
#> <chr> <dbl>
#> 1 A 5
#> 2 B 6
#> 3 B 365
#> 4 C 2
#> 5 D 2
#> 6 D 0
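A base R sketch of the same logic, for environments without dplyr (assuming the test data above; cars with a single visit, such as E, drop out):
test$maintenance_date <- as.Date(test$maintenance_date, format = "%d-%m-%Y")
test <- test[order(test$maintenance_date), ]
# per car, diff() the sorted dates; single-row cars contribute nothing
result <- do.call(rbind, lapply(split(test, test$car), function(d) {
  if (nrow(d) > 1) data.frame(car = d$car[-1], ndays = as.numeric(diff(d$maintenance_date)))
}))
result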
I want to identify the two-way combinations of levels in one column grouped by the id and Date variables. Basically, I want the daily unique letter pairs for each person.
I have a dataframe that looks like this:
in_df <- data.frame(id = c(1,1,1,1,1,2,2,3),
Date = as.Date(c("2019-01-01", "2019-01-01", "2019-01-01", "2019-01-02", "2019-01-02", "2019-01-01", "2019-01-01", "2019-01-01")),
letter = c("A", "B", "C", "A", "B", "A", "D", "B"))
in_df
id Date letter
1 1 2019-01-01 A
2 1 2019-01-01 B
3 1 2019-01-01 C
4 1 2019-01-02 A
5 1 2019-01-02 B
6 2 2019-01-01 A
7 2 2019-01-01 D
8 3 2019-01-01 B
And I want one that looks like this:
out_df
id Date letter_1 letter_2
1 1 2019-01-01 A B
2 1 2019-01-01 A C
3 1 2019-01-01 B C
4 1 2019-01-02 A B
5 2 2019-01-01 A D
6 3 2019-01-01 B NA
So the first id and the first Date have letters A, B, and C. I want every unique pair from the three. Order doesn't matter so switching what goes to letter_1 and letter_2 would be the same thing.
I have played around with expand.grid and combn, but neither seems quite appropriate for this task.
EDIT
I also have cases where there is only one row per id/Date, so using combn gives me Error in combn(letter, m = 2) : n < m. How can I add an if clause so that letter_2 gets an NA? (I also updated the data frames above to address this.)
Using data.table:
require(data.table); setDT(in_df)
# pad single-row groups with an NA so combn() always has two elements to pair
dt = in_df[, {
  l <- if (.N > 1) letter else c(letter, NA)
  data.table(t(combn(l, m = 2)))
}, .(id, Date)]
Output:
> dt
   id       Date V1   V2
1:  1 2019-01-01  A    B
2:  1 2019-01-01  A    C
3:  1 2019-01-01  B    C
4:  1 2019-01-02  A    B
5:  2 2019-01-01  A    D
6:  3 2019-01-01  B <NA>
We can use split and combn:
do.call('rbind',
lapply(split(in_df, list(in_df$id, in_df$Date), drop = TRUE),
FUN = function(d)
cbind.data.frame(unique(d[c('id', 'Date')]),
data.frame(t(
if(length(d$letter) > 1){
combn(d$letter, 2)
}else{
matrix(c(d$letter, NA), nrow = 2)
})))))
#                id       Date X1   X2
# 1.2019-01-01.1  1 2019-01-01  A    B
# 1.2019-01-01.2  1 2019-01-01  A    C
# 1.2019-01-01.3  1 2019-01-01  B    C
# 2.2019-01-01    2 2019-01-01  A    D
# 3.2019-01-01    3 2019-01-01  B <NA>
# 1.2019-01-02    1 2019-01-02  A    B
It might be helpful to step through this. Investigate the output of:
(ss <- split(in_df, list(in_df$id, in_df$Date), drop = TRUE))
Then check out:
lapply(ss, FUN = function(d) data.frame(t(combn(d$letter, 2))))
The rest of the way, we're just combining the data. (This stripped-down lapply() will error on the single-row group for id 3; the if/else in the full answer is what guards against that.) You might want to adjust the column names a bit.
I think the following code works for groups with at least two rows (a single-row group such as id 3 will still trigger the combn error, so it needs a guard like the one above):
library("dplyr")
in_df %>%
group_by(id, Date) %>%
mutate(
letter_1 = combn(letter, 2)[1, ],
letter_2 = combn(letter, 2)[2, ]
) %>%
distinct(letter_1, letter_2)
# # A tibble: 5 x 4
# # Groups: id, Date [3]
# letter_1 letter_2 id Date
# <fct> <fct> <dbl> <date>
# 1 A B 1 2019-01-01
# 2 A C 1 2019-01-01
# 3 B C 1 2019-01-01
# 4 A B 1 2019-01-02
# 5 A D 2 2019-01-01
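For completeness, one way to fold the single-row guard into a dplyr pipeline: a sketch assuming dplyr >= 1.0 (where summarise() may return a multi-column data frame) and the in_df above; pairs_or_na() is a hypothetical helper, not from the answers:
library(dplyr)
# unique pairs as a two-column data frame, padded with NA for single-letter groups
pairs_or_na <- function(x) {
  m <- if (length(x) > 1) combn(x, 2) else matrix(c(x, NA), nrow = 2)
  data.frame(letter_1 = m[1, ], letter_2 = m[2, ])
}
out_df <- in_df %>%
  group_by(id, Date) %>%
  summarise(pairs_or_na(letter), .groups = "drop")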
I have a dataset of areas and scores in those areas.
I want to maintain an aggregated score (agg_score) that is equal to the sum of the most recent scores for A, B, and C.
For instance, you will see that my expected_output for row 4 is 7, because the value of C is now 2 while the most recent values of A and B are still 1 and 4.
All I have been able to do so far is sum the three most recent scores, which at times results in agg_score values that equal the sum of C, C, and B. It is important that I have an accurate agg_score at each possible date.
library(dplyr)
ds <-
tibble(
area = c("A", "B", "C", "C", "B", "A", "A", "B", "C"),
score = c(1,4,5,2,6,3,4,6,3),
scoring_date =
seq.Date(
from = as.Date("2019-01-01"),
to = as.Date("2019-01-09"),
by = "days"
),
expected_output = c(NA, NA, 10, 7, 9, 11, 12, 12, 13)
) %>%
arrange(scoring_date)
# Inadequate code for summing last three scores
ds %>%
mutate(agg_score = score + lag(score) + lag(score, 2))
Using dplyr::last we can find the most recent value for each area, then sum those values once all three areas have appeared.
# small function to clarify
sum_fun <- function(x){
  # most recent score per area among the first x rows
  lc_vec <- ds[1:x,] %>% group_by(area) %>% summarise(lc=last(score)) %>% pull(lc)
  # only a valid aggregate once all three areas have appeared
  if (length(lc_vec) == 3) sum(lc_vec) else NA
}
library(dplyr)
ds %>% mutate(Output=sapply(1:nrow(.),sum_fun)) # instead of sapply we could use purrr::map_dbl
# A tibble: 9 x 5
area score scoring_date expected_output Output
<chr> <dbl> <date> <dbl> <dbl>
1 A 1. 2019-01-01 NA NA
2 B 4. 2019-01-02 NA NA
3 C 5. 2019-01-03 10. 10.
4 C 2. 2019-01-04 7. 7.
5 B 6. 2019-01-05 9. 9.
6 A 3. 2019-01-06 11. 11.
7 A 4. 2019-01-07 12. 12.
8 B 6. 2019-01-08 12. 12.
9 C 3. 2019-01-09 13. 13.
There might be a data.table self-merge option out there, but I couldn't quite figure it out. Here's an idea implementing your fill() approach, but in data.table. It should be flexible for more areas:
library(data.table)
# build one carried-forward helper column per area (zoo::na.locf0 keeps leading NAs)
invisible(lapply(unique(ds$area), function(a){
  ds[, paste0("val_",a) := zoo::na.locf0(ifelse(area==a, score, NA))]
}))
ds[, agg_score := rowSums(.SD), .SDcols = paste0("val_", unique(ds$area))][, paste0("val_", unique(ds$area)) := NULL]
ds
# area score scoring_date agg_score
#1 A 1 2019-01-01 NA
#2 B 4 2019-01-02 NA
#3 C 5 2019-01-03 10
#4 C 2 2019-01-04 7
#5 B 6 2019-01-05 9
#6 A 3 2019-01-06 11
#7 A 4 2019-01-07 12
#8 B 6 2019-01-08 12
#9 C 3 2019-01-09 13
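If the zoo dependency is unwanted, the same carry-forward can be done with data.table's own nafill() (available from data.table 1.12.4). A sketch, assuming ds is already a data.table:
library(data.table)
# carry each area's last score forward, then sum the helper columns row-wise
for (a in unique(ds$area)) {
  ds[, paste0("val_", a) := nafill(fifelse(area == a, score, NA_real_), type = "locf")]
}
ds[, agg_score := rowSums(.SD), .SDcols = patterns("^val_")]
ds[, paste0("val_", unique(ds$area)) := NULL] # drop the helpers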
Original solution:
Alternatively you could try an sapply. The function is a little long, but that's because we have a lot of work to do! If you wanted to do this on more areas you wouldn't have to manually fill each one, so that could be a benefit:
# note: assumes ds is a plain data.frame, so single-column subsets drop to vectors
ds$agg_score <- sapply(1:nrow(ds), function(i) {
  other_areas <- setdiff(unique(ds$area), ds[i, "area"])
  # locate rows for other areas that come before the current index
  f_idxs <- Filter(function(x) x < i, which(ds$area %in% other_areas))
  if (length(f_idxs) == 0) return(NA)
  # split by area so we can get the maximum index per area before our date
  idxs <- sapply(split(f_idxs, ds[f_idxs, "area"]), max)
  if (length(idxs) < length(other_areas)) return(NA)
  # sum the latest score per other area plus the current row's score
  sum(ds[c(idxs, i), "score"])
})
So I found a way to do this using fill() to ensure the most recent value is always carried forward until replaced by a more recent value.
library(tidyr)
ds %>%
select(area, score, scoring_date) %>%
spread(area, score) %>%
fill(A, .direction = "down") %>%
fill(B, .direction = "down") %>%
fill(C, .direction = "down") %>%
rowwise() %>%
mutate(agg_score = sum(A, B, C))
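Since spread() has been superseded by pivot_wider(), and fill() accepts several columns at once, the same idea can be written a bit more compactly. A sketch, assuming the ds above:
library(dplyr)
library(tidyr)
ds %>%
  select(area, score, scoring_date) %>%
  pivot_wider(names_from = area, values_from = score) %>%
  fill(A, B, C, .direction = "down") %>% # carry each area's last score forward
  rowwise() %>%
  mutate(agg_score = sum(A, B, C)) %>%   # NA until all three areas have scored
  ungroup()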
nuevoDs <- ds %>% arrange(desc(scoring_date)) %>% as.data.frame
# getting the length of the data frame
longitud <- nrow(nuevoDs)
# we will iterate on each value up until (longitud - 2) and save results to a vector
elVector <- vector()
for (i in 1:(longitud - 2)) {
  elVector[i] <- nuevoDs[i, "score"] + nuevoDs[i + 1, "score"] + nuevoDs[i + 2, "score"]
}
# before cbinding we need to make the vector the same length as the data frame
elVector[longitud - 1] <- 0
elVector[longitud] <- 0
elVector
cbind(nuevoDs, elVector)
area score scoring_date elVector
1 C 3 2019-01-09 13
2 B 6 2019-01-08 13
3 A 4 2019-01-07 13
4 A 3 2019-01-06 11
5 B 6 2019-01-05 13
6 C 2 2019-01-04 11
7 C 5 2019-01-03 10
8 B 4 2019-01-02 0
9 A 1 2019-01-01 0
Another possible data.table approach.
ds[, output :=
ds[,
ds[.(area=unique(area), scd=.BY$scoring_date),
sum(score),
on=.(area=area, scoring_date<=scd),
mult="last"],
by=.(area, scoring_date)]$V1
]
Output:
area score scoring_date output
1: A 1 2019-01-01 NA
2: B 4 2019-01-02 NA
3: C 5 2019-01-03 10
4: C 2 2019-01-04 7
5: B 6 2019-01-05 9
6: A 3 2019-01-06 11
7: A 4 2019-01-07 12
8: B 6 2019-01-08 12
9: C 3 2019-01-09 13
Data:
library(data.table)
ds <- data.table(
area = c("A", "B", "C", "C", "B", "A", "A", "B", "C"),
score = c(1,4,5,2,6,3,4,6,3),
scoring_date = seq.Date(from = as.Date("2019-01-01"), to = as.Date("2019-01-09"), by = "days"))
Explanation:
The gist of the above code is:
ds[.(area=unique(area), scd=.BY$scoring_date),
sum(score),
on=.(area=area, scoring_date<=scd),
mult="last"]
It means that, for each date (scd=.BY$scoring_date), we perform a non-equi self join to find the latest (mult="last") score for every area (area=unique(area)) on or before that date. For example, for 2019-01-04 the join picks up the latest A (1), B (4), and C (2), so sum(score) returns 7.
I am working on a project that requires me to create a flag field indicating whether an item has n occurrences (on different dates) within a 30-day window (for each TYPE). If it occurred n times within the window, all of the relevant dates should be flagged 1; if it did not, they should be 0.
My data looks something like this:
a <- data.frame("TYPE" = c("A", "A", "B", "B",
"C", "C", "C", "C",
"D", "D", "D", "D"),
"DATE" = c("4/20/2018 11:47",
"4/25/2018 7:21",
"4/15/2018 6:11",
"4/19/2018 4:22",
"4/15/2018 17:46",
"4/16/2018 11:59",
"4/20/2018 7:50",
"4/26/2018 2:55",
"4/27/2018 11:46",
"4/27/2018 13:03",
"4/20/2018 7:31",
"4/22/2018 9:45"))
After sorting first by TYPE and then by DATE, and setting n <- 4, my expected output would flag every date that is part of at least n occurrences falling within a 30-day window.
It is important to note that a TYPE can have several identical dates, and this will need a rolling/moving window for each TYPE.
I am looking for assistance with how to approach this problem.
Would I be able to use the SlidingWindow function in R and then define my own function?
A dplyr::inner_join based approach could be:
library(dplyr)
a %>% mutate(DATE = as.POSIXct(DATE, format = "%m/%d/%Y %H:%M")) %>%
  inner_join(., ., by = "TYPE") %>%
  group_by(TYPE, DATE.x) %>%
  # count how many records of the same TYPE lie within 30 days of each record
  summarise(FLAG = as.integer(sum(abs(as.numeric(difftime(DATE.x, DATE.y, units = "days"))) <= 30) >= 4))
# # A tibble: 12 x 3
# # Groups: TYPE [?]
# TYPE DATE.x FLAG
# <fctr> <dttm> <int>
# 1 A 2018-04-20 11:47:00 0
# 2 A 2018-04-25 07:21:00 0
# 3 B 2018-04-15 06:11:00 0
# 4 B 2018-04-19 04:22:00 0
# 5 C 2018-04-15 17:46:00 1
# 6 C 2018-04-16 11:59:00 1
# 7 C 2018-04-20 07:50:00 1
# 8 C 2018-04-26 02:55:00 1
# 9 D 2018-04-20 07:31:00 1
# 10 D 2018-04-22 09:45:00 1
# 11 D 2018-04-27 11:46:00 1
# 12 D 2018-04-27 13:03:00 1
Note: the 4 records for TYPE = D are also within the 30-day range. Hence, FLAG is set to 1 for TYPE D as well.
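For reference, the same pairwise counting can be done in base R with ave(). A sketch assuming the data frame a above (each record counts itself, just like the join):
a$DATE <- as.POSIXct(a$DATE, format = "%m/%d/%Y %H:%M")
n <- 4
# within each TYPE, flag records having at least n records (incl. themselves) within 30 days
a$FLAG <- as.integer(ave(as.numeric(a$DATE), a$TYPE, FUN = function(d)
  sapply(d, function(x) sum(abs(d - x) <= 30 * 86400) >= n)))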