I have multiple large data frames that capture events that last a certain amount of time. This example gives a simplified version of my data set:
Data frame 1:
ID Days Date Value
1 10 80 30
1 10 85 30
2 20 75 20
2 10 80 20
3 5 90 30
Data frame 2:
ID Days Date Value
1 20 0 30
1 10 3 20
2 20 5 30
3 20 1 10
3 10 10 10
The same ID is used for the same person in all datasets.
Days specifies the length of the event (if Days has the value 10, the event lasts 10 days).
Date specifies the day on which the event starts. In this case, Date can be any number between 0 and 90 or 91 (the data represent days in a quarter).
Value is an attribute that is repeated for the number of Days specified. For example, for the first row in df1, the value 30 is repeated 10 times starting from day 80 (30 is repeated for 10 days).
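To illustrate with a tiny sketch (the helper below is hypothetical, just to show the semantics):

# Hypothetical helper: the days covered by an event starting on `date`
# and lasting `days` days; row 1 of df1 contributes Value 30 on each of these
event_days <- function(date, days) seq(date, date + days - 1)
event_days(80, 10)
#> [1] 80 81 82 83 84 85 86 87 88 89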
What I want is, for each ID in each data frame, the highest value reached on any single day. Keep in mind that multiple events can overlap, and their values then have to be summed.
The final data frame should look like this:
ID HighestValuedf1 HighestValuedf2
1 60 80
2 40 30
3 30 20
For example, for ID 1 three events overlapped and resulted in the highest value of 80 in data frame 2. There was no overlap between the events of df1 and df2 for ID 3, only an overlap within df2.
I would prefer a solution that avoids merging all data frames into one data frame because of the size of my files.
EDIT
I rearranged my data so that all events that overlap are in one data frame. I only need the highest overlap value for every data frame.
Code to reproduce the data frames:
ID = c(1,1,2,2,3)
Date = c(80,85,75,80,90)
Days = c(10,10,20,10,5)
Value = c(30,30,20,20,30)
df1 = data.frame(ID, Days, Date, Value)

ID = c(1,1,2,3,3)
Date = c(1,3,5,1,10)
Days = c(20,10,20,20,10)
Value = c(30,20,30,10,10)
df2 = data.frame(ID, Days, Date, Value)

ID = c(1,2,3)
HighestValuedf1 = c(60,40,30)
HighestValuedf2 = c(80,30,20)
df3 = data.frame(ID, HighestValuedf1, HighestValuedf2)
I am interpreting highest value per day to mean the highest value on a single day throughout the time period. This is probably not the most efficient solution, since I expect something can be done with map or apply functions (see the sketch after the output below), but I didn't see how at first glance. Using df1 and df2 as defined above:
EDIT: Modified code upon understanding that df1 and df2 are supposed to represent sequential quarters. I think the easiest way to do this is simply to stack the data frames, so anything that overlaps is automatically caught (i.e. day 1 of df2 is day 91 overall). You will probably need to adjust this code manually because quarters differ in length, or, preferably, convert days of quarters into actual dates of the year with a date format (df1 day 1 becomes January 1st 2017, for example). The code below just rearranges the data to achieve this and then produces the desired results for each quarter by filtering on days 1:90 and 91:180, as shown.
ID = c(1,1,2,2,3)
Date = c(80,85,75,80,90)
Days = c(10,10,20,10,5)
Value = c(30,30,20,20,30)
df1 = data.frame(ID, Days, Date, Value)

ID = c(1,1,2,3,3)
Date = c(1,3,5,1,10)
Days = c(20,10,20,20,10)
Value = c(30,20,30,10,10)
df2 = data.frame(ID, Days, Date, Value)
library(tidyverse)
df2 <- df2 %>%
  mutate(Date = Date + 90)

# Make a dataframe with complete set of day-ID combinations
df_completed <- df1 %>%
  mutate(day = factor(Date, levels = 1:180)) %>% # set to total day length
  complete(ID, day) %>%
  mutate(daysum = 0) %>%
  select(ID, day, daysum)

# Function to apply to each data frame containing events
# Should take each event and add value to the appropriate days
sum_df_daily <- function(df_complete, df){
  for (i in 1:nrow(df)){
    event_days <- seq(df[i, "Date"], df[i, "Date"] + df[i, "Days"] - 1)
    df_complete <- df_complete %>%
      mutate(
        to_add = case_when(
          ID == df[i, "ID"] & day %in% event_days ~ df[i, "Value"],
          !(ID == df[i, "ID"] & day %in% event_days) ~ 0
        ),
        daysum = daysum + to_add
      )
  }
  return(df_complete)
}

df_filled <- df_completed %>%
  sum_df_daily(df1) %>%
  sum_df_daily(df2) %>%
  mutate(
    quarter = case_when(
      day %in% 1:90 ~ "q1",
      day %in% 91:180 ~ "q2"
    )
  )

df_filled %>%
  group_by(quarter, ID) %>%
  summarise(maxsum = max(daysum))
#> # A tibble: 6 x 3
#> # Groups: quarter [?]
#> quarter ID maxsum
#> <chr> <dbl> <dbl>
#> 1 q1 1.00 60.0
#> 2 q1 2.00 40.0
#> 3 q1 3.00 30.0
#> 4 q2 1.00 80.0
#> 5 q2 2.00 30.0
#> 6 q2 3.00 40.0
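Following up on the map/apply remark above, here is a compact sketch of a vectorized alternative (my reconstruction, not the code above): expand each event into its days with map2(), sum the values per ID and day, and take the per-ID maximum. To reproduce the stacked-quarter results, shift df2's dates by 90 first, exactly as above.

library(dplyr)
library(purrr)
library(tidyr)

# Highest summed value reached on any single day, per ID
highest_per_id <- function(df) {
  df %>%
    mutate(day = map2(Date, Days, ~ seq(.x, .x + .y - 1))) %>% # expand events into days
    unnest(day) %>%
    count(ID, day, wt = Value, name = "daysum") %>%            # sum values per ID-day
    group_by(ID) %>%
    summarise(highest = max(daysum))
}

highest_per_id(df1)
#> expected: 60, 40, 30 for IDs 1, 2, 3

This avoids building the full ID-by-day grid, which should also help with larger files.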
Related
ID   Date
101  10-17-2021
101  10-19-2021
101  10-20-2021
101  10-31-2021
101  11-01-2021
For each ID I want to remove observations that are within 7 days of each other. I want to keep the earliest date of the dates that are within 7 days of each other. So in this case I would want to keep "10-17-2021" and "10-31-2021". This process would continue until I have unique dates for each ID that are at least 7 days apart and do not contain other dates in between.
You can do it using the group_by() and slice() functions. But first the Date column must be converted using the as.Date() function. Here is the code to remove observations within a 7-day interval and keep only the earliest date:
library(tidyverse)
df$Date <- as.Date(df$Date, format = "%m-%d-%Y")
df %>%
  group_by(ID) %>%
  slice(c(1, which(c(0, diff(Date)) >= 7)))
output
ID Date
101 2021-10-17
101 2021-10-31
In your example, you can't evaluate every observation independently, because whether a date is kept depends on which earlier dates were already removed. Perhaps I'm not thinking about it the right way, but I think you need a loop to do this. Here's what I came up with (note: I made the sequence of dates longer to make sure it works):
library(dplyr)
d <- tibble(
  ID = 101,
  Date = seq(lubridate::mdy("01-01-2023"),
             lubridate::mdy("02-07-2023"), by = "days")
)

i <- 1
while(i < nrow(d)){
  d <- d %>% mutate(diff = Date - d$Date[i])
  d <- d %>% filter(diff <= 0 | diff > 7)
  if(i < nrow(d)){
    i <- i + 1
  }
}
d <- d %>% select(-diff)
d
#> # A tibble: 5 × 2
#> ID Date
#> <dbl> <date>
#> 1 101 2023-01-01
#> 2 101 2023-01-09
#> 3 101 2023-01-17
#> 4 101 2023-01-25
#> 5 101 2023-02-02
Created on 2023-02-08 by the reprex package (v2.0.1)
Essentially, the loop initializes with the first observation and removes every observation within seven days of it. If more observations remain, it increments the counter, moves to the next remaining date, and evaluates all subsequent dates from there, keeping everything that came before.
These loops are difficult to do in the tidyverse, but you could split the data by group, run the loop on each group and then put the groups back together. Here's an example:
library(dplyr)
d <- tibble(
  ID = 101,
  Date = seq(lubridate::mdy("01-01-2023"),
             lubridate::mdy("02-07-2023"), by = "days")
)
d2 <- d %>% mutate(ID = 102)
alldat <- bind_rows(d, d2)

split_dat <- alldat %>%
  group_by(ID) %>%
  group_split()

result <- purrr::map(split_dat, function(d){
  i <- 1
  while(i < nrow(d)){
    d <- d %>% mutate(diff = Date - d$Date[i])
    d <- d %>% filter(diff <= 0 | diff > 7)
    if(i < nrow(d)){
      i <- i + 1
    }
  }
  d <- d %>% select(-diff)
  d
})
result <- bind_rows(result)
result
#> # A tibble: 10 × 2
#> ID Date
#> <dbl> <date>
#> 1 101 2023-01-01
#> 2 101 2023-01-09
#> 3 101 2023-01-17
#> 4 101 2023-01-25
#> 5 101 2023-02-02
#> 6 102 2023-01-01
#> 7 102 2023-01-09
#> 8 102 2023-01-17
#> 9 102 2023-01-25
#> 10 102 2023-02-02
Created on 2023-02-08 by the reprex package (v2.0.1)
You can try using a recursive function as in this answer.
f <- function(d, ind = 1) {
  ind.next <- dplyr::first(which(difftime(d, d[ind], units = "days") > 7))
  if (is.na(ind.next))
    return(ind)
  else
    return(c(ind, f(d, ind.next)))
}
After the first date, the function will get the next index ind.next where the date is more than 7 days away. Recursively, add that index and get the next date after that. In the end, just return all the row indexes.
The calling code can then group_by(ID) and slice() to retain the rows at the indexes returned.
library(dplyr)
df %>%
  group_by(ID) %>%
  slice(f(Date))
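For instance, applied to the example data from the question (reconstructed below as a data frame), f() keeps the 10-17 and 10-31 rows:

library(dplyr)

# Example data rebuilt from the question's table
df <- data.frame(
  ID = 101,
  Date = as.Date(c("2021-10-17", "2021-10-19", "2021-10-20",
                   "2021-10-31", "2021-11-01"))
)

df %>%
  group_by(ID) %>%
  slice(f(Date))
#> ID  Date
#> 101 2021-10-17
#> 101 2021-10-31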
I have two data sets. One has a year_month_week column and the other has a date column. I simplified them into two small data frames for demonstration purposes.
df1 <- data.frame(id = c(1,1,1,2,2,2,2),
                  year_month_week = c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
                  points = c(65,58,47,21,25,27,43))

df2 <- data.frame(id = c(1,1,1,2,2,2),
                  date = c(20220503,20220506,20220512,20220401,20220408,20220409),
                  temperature = c(36.1,36.3,36.6,34.3,34.9,35.3))
For df1, 2022051 means the 1st week of May 2022; likewise, 2022052 means the 2nd week of May 2022. For df2, 20220503 means May 3rd, 2022.
What I want to do now is merge df1 and df2 with respect to year_month_week. In this case, 20220503 and 20220506 are both in the 1st week of May 2022. If more than one date falls in the same year_month_week, I will just include the first of them. So my expected output is as follows:
df <- data.frame(id = c(1,1,2,2),
                 year_month_week = c(2022051,2022052,2022041,2022042),
                 points = c(65,58,21,25),
                 temperature = c(36.1,36.6,34.3,34.9))
One way of doing it is to extract the last two digits of the date column in df2 (the day of the month), divide them by 7, and round up; this gives the week number (this part is in the mutate() call).
Then group_by the year_month_week column so that only one record per year_month_week is output, and join with df1.
library(tidyverse)
library(stringr)

df <- df2 %>%
  mutate(year_month_week =
           as.integer(
             paste0(str_extract(df2$date, ".*(?=\\d\\d$)"),
                    ceiling(as.integer(str_extract(df2$date, "\\d\\d$")) / 7))
           )) %>%
  group_by(year_month_week) %>%
  slice_min(date) %>%
  left_join(df1,
            by = c("year_month_week", "id")) %>%
  select(-date)

df
# A tibble: 4 × 4
# Groups: year_month_week [4]
id temperature year_month_week points
<dbl> <dbl> <dbl> <dbl>
1 2 34.3 2022041 21
2 2 34.9 2022042 25
3 1 36.1 2022051 65
4 1 36.6 2022052 58
I'm trying to calculate the number of days that a patient spent during a given state in R.
An image of example data was included in the original post; I only have columns 1 to 3 (MRN, OBS_DTM, STATE) and I want to get the answer in column 5. I am thinking that if I can create a date column (column 4) holding the first recorded date for each state, I can subtract it from column 2 and get the days I am looking for.
I tried group_by(MRN, STATE), but the problem is that it groups the second set of 1's as part of the first set of 1's, and likewise for the 2's, which is not what I want.
Use mdy_hm to change OBS_DTM to POSIXct, and group_by ID and the rleid of STATE so that the first run of 1's is handled separately from the second. Then use difftime to calculate the difference, in days, between each OBS_DTM and the minimum OBS_DTM in its group.
If your data is called data:
library(dplyr)
data %>%
  mutate(OBS_DTM = lubridate::mdy_hm(OBS_DTM)) %>%
  group_by(MRN, grp = data.table::rleid(STATE)) %>%
  mutate(Answer = as.numeric(difftime(OBS_DTM, min(OBS_DTM), units = 'days'))) %>%
  ungroup %>%
  select(-grp) -> result
result
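Since the question's data was only shown as an image, here is a hypothetical mock (values invented, echoing the data frame used in the next answer) to make the pipeline reproducible:

# Hypothetical mock of the data shown in the question's image
data <- data.frame(
  MRN = 1,
  OBS_DTM = c("7-27-2020 8:44", "7-27-2020 8:56", "8-8-2020 20:12",
              "8-14-2020 10:13", "8-15-2020 13:32"),
  STATE = c(1, 1, 2, 2, 2)
)
# Answer column: 0, 0.008, 0, 5.58, 6.72 days since the first
# observation of each run of STATE values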
You could try the following:
library(dplyr)
df %>%
  group_by(ID, State) %>%
  mutate(priorObsDTM = lag(OBS_DTM)) %>%
  filter(!is.na(priorObsDTM)) %>%
  ungroup() %>%
  mutate(Answer = as.numeric(OBS_DTM - priorObsDTM, units = 'days'))
The dataframe I used for this example:
df <- data.frame(
  ID = 1,
  OBS_DTM = as.POSIXlt(
    c('2020-07-27 8:44', '2020-7-27 8:56', '2020-8-8 20:12',
      '2020-8-14 10:13', '2020-8-15 13:32')
  ),
  State = c(1, 1, 2, 2, 2),
  stringsAsFactors = FALSE
)
df
# A tibble: 3 x 5
# ID OBS_DTM State priorObsDTM Answer
# <dbl> <dttm> <dbl> <dttm> <dbl>
# 1 1 2020-07-27 08:56:00 1 2020-07-27 08:44:00 0.00833
# 2 1 2020-08-14 10:13:00 2 2020-08-08 20:12:00 5.58
# 3 1 2020-08-15 13:32:00 2 2020-08-14 10:13:00 1.14
Essentially, I have a dataset with variables indicating group, date, and value. I need to take the difference between the value and the end-of-previous-year value, per group. Since the data is balanced, I was trying to do that with dplyr::lag, setting the lag according to the month of the observation:
x <- x %>% group_by(g) %>% mutate(y = v - lag(v, n = month(d)))
This, however, does not work.
Mock dataset:

x <- data.frame(
  g = c('B','B','B','C','A','A','A','A','A','A'),
  d = c('2018-11-30','2018-12-31','2019-01-31','2019-12-31','2016-12-31',
        '2017-11-30','2017-12-31','2018-12-31','2019-01-31','2019-02-28'),
  v = c(300,200,250,100,400,150,200,500,400,500)
)

Desired variable:

y <- c(NA,NA,-50,NA,NA,-250,-200,300,-100,0)

New dataset:

cbind(x, y)
One idea via dplyr is to look for the end-of-year row within each group, get its index, use it for the subtraction, and then convert the leading rows to NAs (note this assumes a single end-of-year row per group), i.e.
library(dplyr)
x %>%
  group_by(g) %>%
  mutate(new = which(sub('^[0-9]+-([0-9]+-[0-9]+)$', '\\1', d) == '12-31'),
         y = v - v[new],
         y = replace(y, row_number() <= new, NA)) %>%
  select(-new)
which gives,
# A tibble: 7 x 4
# Groups: g [3]
g d v y
<fct> <fct> <dbl> <dbl>
1 B 2018-11-30 300 NA
2 B 2018-12-31 200 NA
3 B 2019-01-31 250 50
4 C 2017-12-31 400 NA
5 A 2018-12-31 500 NA
6 A 2019-01-31 400 -100
7 A 2019-02-28 500 0
In the end I decided to create an auxiliary variable ('eoy') to indicate, for each row, the row index of the corresponding end-of-year observation within its group. It requires a loop and is inefficient, but it facilitates the remaining computations that depend on it. The desired computation then becomes:
mutate(y = v - v[eoy])
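For completeness, here is a loop-free sketch of that idea (my reconstruction; it assumes at most one December 31 row per group and year, and eoy is the within-group index of the previous year's end-of-year row):

library(dplyr)
library(lubridate)

x %>%
  mutate(d = as.Date(d)) %>%
  group_by(g) %>%
  mutate(
    # within-group row index of the previous year's Dec 31 observation
    eoy = match(paste0(year(d) - 1, "-12-31"), as.character(d)),
    y = v - v[eoy] # NA wherever no previous end-of-year row exists
  ) %>%
  ungroup()

Note that, like the dplyr answer above, this yields 50 rather than -50 for group B on 2019-01-31.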
I have a dataset that looks something like this
df <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1970,1970,1971,1980,1980,1981,1982),
"Val" = c(2,3,-2,5,2,5,3,5))
I have multiple observations for each id and time identifier - e.g. I have 3 different Alpha 1970 values. I would like to retain only one observation per id/year, namely the last one that appears for each id/year.
The final dataset should look something like this:
final <- data.frame("id" = c("Alpha","Alpha","Beta","Beta","Beta"),
"Year" = c(1970,1971,1980,1981,1982),
"Val" = c(-2,5,5,3,5))
Does anyone know how I can approach the problem?
Thanks a lot in advance for your help
If you are open to a data.table solution, this can be done quite concisely:
library(data.table)
setDT(df)[, .SD[.N], by = c("id", "Year")]
#> id Year Val
#> 1: Alpha 1970 -2
#> 2: Alpha 1971 5
#> 3: Beta 1980 5
#> 4: Beta 1981 3
#> 5: Beta 1982 5
by = c("id", "Year") groups the data.table by id and Year, and .SD[.N] then returns the last row within each such group.
How about this?
library(tidyverse)
df <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1970,1970,1971,1980,1980,1981,1982),
"Val" = c(2,3,-2,5,2,5,3,5))
final <- df %>%
  group_by(id, Year) %>%
  slice(n()) %>%
  ungroup()

final
#> # A tibble: 5 x 3
#> id Year Val
#> <fct> <dbl> <dbl>
#> 1 Alpha 1970 -2
#> 2 Alpha 1971 5
#> 3 Beta 1980 5
#> 4 Beta 1981 3
#> 5 Beta 1982 5
Created on 2019-09-29 by the reprex package (v0.3.0)
slice(n()) translates to "within each id-Year group, take only the row whose row number equals the size of the group, i.e. the last row under the current ordering."
You could also use either filter(), e.g. filter(row_number() == n()), or distinct() (and then you wouldn't even have to group), e.g. distinct(id, Year, .keep_all = TRUE) - but the distinct functions keep the first distinct row, so you'd need to reverse the row ordering first, as in the sketch below.
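A sketch of that distinct() variant (reverse first, so distinct() keeps what was originally the last row):

library(dplyr)

df %>%
  slice(rev(seq_len(n()))) %>%             # reverse the row order
  distinct(id, Year, .keep_all = TRUE) %>% # keeps the first occurrence seen
  arrange(id, Year)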
An option with base R
aggregate(Val ~ ., df, tail, 1)
# id Year Val
#1 Alpha 1970 -2
#2 Alpha 1971 5
#3 Beta 1980 5
#4 Beta 1981 3
#5 Beta 1982 5
If we need to select the first row instead:
aggregate(Val ~ ., df, head, 1)