I have two dfs (dfA and dfB) that contain dates and I want to populate some columns in dfA with data from dfB based on some simple operations.
Say df A has the following structure:
Location Mass Date
A 0.18 10/05/2001
B 0.25 15/08/2006
C 0.50 17/12/2019
Df B contains
Date Event Time
Where Date covers a wide range of dates. I would like to look up the dates from dfA in dfB and retrieve the "Event" and "Time" data based on simple date operations, such as getting the data from one, two or three days before the date shown in "Date" in dfA, giving me something like:
Location Mass Date Event 1 Event 2 Event 3
A 0.18 10/05/2001 (w) (x) (y)
B 0.25 15/08/2006 (z) (z1) (z2)
Where (w) would be the data extracted from "Event" in dfB on "Date" (-1) day from "Date" specified in dfA (09/05/2001), then (x) would retrieve the data from "Event" in dfB on "Date" (-2) days from that in df A (08/05/2001) and so on.
I believe using dplyr and lubridate could sort this out.
You can add helper columns with lagged dates (day - 1, day - 2, etc.) and then use a series of left_join calls to get the intended result. Please see the code below:
library(lubridate)
library(tidyverse)
# Simulation
dfa <- tibble(location = LETTERS[1:4],
mass = c(0.18, 0.25, 0.5, 1),
date = dmy(c("10/05/2001", "15/08/2006", "15/07/2006", "17/12/2019")))
dfb <- tibble(date = dmy(c("9/05/2001", "13/08/2006", "13/07/2006", "14/12/2019")),
event = c("day-1a", "day-2a", "day-2b", "day-3"))
# Dplyr-ing, series of left_joins
dfc <- dfa %>%
mutate(date_1 = date - 1,
date_2 = date - 2,
date_3 = date - 3) %>%
left_join(dfb, by = c("date_1" = "date")) %>%
rename(event1 = event) %>%
left_join(dfb, by = c("date_2" = "date")) %>%
rename(event2 = event) %>%
left_join(dfb, by = c("date_3" = "date")) %>%
rename(event3 = event) %>%
select(-starts_with("date_"))
dfc
Output:
# A tibble: 4 x 6
location mass date event1 event2 event3
<chr> <dbl> <date> <chr> <chr> <chr>
1 A 0.18 2001-05-10 day-1a NA NA
2 B 0.25 2006-08-15 NA day-2a NA
3 C 0.5 2006-07-15 NA day-2b NA
4 D 1 2019-12-17 NA NA day-3
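If the number of lags grows, repeating left_join and rename by hand gets tedious. A possible sketch, assuming tidyr is available, builds one row per (row of dfa, lag) pair, joins once, and spreads the result back out. The names n_lags, k and lag_date are introduced here for illustration only; the data is the same simulated data as above.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Same simulated data as in the answer above
dfa <- tibble(location = LETTERS[1:4],
              mass = c(0.18, 0.25, 0.5, 1),
              date = dmy(c("10/05/2001", "15/08/2006", "15/07/2006", "17/12/2019")))
dfb <- tibble(date = dmy(c("9/05/2001", "13/08/2006", "13/07/2006", "14/12/2019")),
              event = c("day-1a", "day-2a", "day-2b", "day-3"))

n_lags <- 3  # assumed number of days to look back

dfc <- expand_grid(dfa, k = seq_len(n_lags)) %>%   # one row per dfa row and lag
  mutate(lag_date = date - k) %>%                  # the date k days earlier
  left_join(dfb, by = c("lag_date" = "date")) %>%  # single join against dfb
  select(-lag_date) %>%
  pivot_wider(names_from = k, values_from = event, names_prefix = "event")

dfc
```

This should give the same event1, event2 and event3 columns as the chained joins, and changing n_lags is the only edit needed to look further back.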
I'm trying to calculate the number of days that a patient spent during a given state in R.
The image of an example data is included below. I only have columns 1 to 3 and I want to get the answer in column 5. I am thinking if I am able to create a date column in column 4 which is the first recorded date for each state, then I can subtract that from column 2 and get the days I am looking for.
I tried a group_by(MRN, STATE) but the problem is, it groups the second set of 1's as part of the first set of 1's, so does the 2's which is not what I want.
Use mdy_hm to change OBS_DTM to POSIXct type, then group_by MRN and the rleid of STATE so that the first set of 1's is handled separately from the second set. Use difftime to calculate the difference, in days, between OBS_DTM and the minimum value in the group.
If your data is called data :
library(dplyr)
data %>%
mutate(OBS_DTM = lubridate::mdy_hm(OBS_DTM)) %>%
group_by(MRN, grp = data.table::rleid(STATE)) %>%
mutate(Answer = as.numeric(difftime(OBS_DTM, min(OBS_DTM),units = 'days'))) %>%
ungroup %>%
select(-grp) -> result
result
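If you would rather not depend on data.table, the run id can also be built with dplyr alone by counting how often STATE changes within each MRN. This is only a sketch: the data below is made up for illustration (the original example was only available as an image), and base as.POSIXct is used for parsing in place of mdy_hm.

```r
library(dplyr)

# Hypothetical data standing in for the example image
data <- tibble(
  MRN = c(1, 1, 1, 1, 1),
  OBS_DTM = c("7/27/2020 8:44", "7/28/2020 8:44", "7/29/2020 8:44",
              "7/30/2020 20:44", "7/31/2020 8:44"),
  STATE = c(1, 1, 2, 1, 1)
)

result <- data %>%
  mutate(OBS_DTM = as.POSIXct(OBS_DTM, format = "%m/%d/%Y %H:%M", tz = "UTC")) %>%
  group_by(MRN) %>%
  # grp increases by one every time STATE differs from the previous row
  mutate(grp = cumsum(STATE != lag(STATE, default = first(STATE)))) %>%
  group_by(MRN, grp) %>%
  mutate(Answer = as.numeric(difftime(OBS_DTM, min(OBS_DTM), units = "days"))) %>%
  ungroup() %>%
  select(-grp)

result
```

Here the second run of 1's restarts at zero instead of being measured from the first row.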
You could try the following:
library(dplyr)
df %>%
group_by(ID, State) %>%
mutate(priorObsDTM = lag(OBS_DTM)) %>%
filter(!is.na(priorObsDTM)) %>%
ungroup() %>%
mutate(Answer = as.numeric(OBS_DTM - priorObsDTM, units = 'days'))
The dataframe I used for this example:
df <- data.frame(
ID = 1,
OBS_DTM = as.POSIXlt(
c('2020-07-27 8:44', '2020-7-27 8:56', '2020-8-8 20:12',
'2020-8-14 10:13', '2020-8-15 13:32')
),
State = c(1, 1, 2, 2, 2),
stringsAsFactors = FALSE
)
df
# A tibble: 3 x 5
# ID OBS_DTM State priorObsDTM Answer
# <dbl> <dttm> <dbl> <dttm> <dbl>
# 1 1 2020-07-27 08:56:00 1 2020-07-27 08:44:00 0.00833
# 2 1 2020-08-14 10:13:00 2 2020-08-08 20:12:00 5.58
# 3 1 2020-08-15 13:32:00 2 2020-08-14 10:13:00 1.14
I've tried doing this using a combination of dplyr's filter and lag, but it isn't working. This Stack Overflow answer works for individual inputs, but not for an input data frame as far as I can tell.
I have a dataframe of stock names, prices and dates. I would like to be able to input a second dataframe of stock names and dates and return those observations from the first dataframe, plus the N above, let's say N=1 for example. The observation dates are uneven so relying on it being N days beforehand won't work.
So for example if I have this data:
stock.data <- data.frame(
stock_name = c("Walmart","Walmart","Walmart","Target","Target","Target"),
price = c(100,101,102,201,202,203),
date = as.Date(c("2012-01-01", "2012-03-01", "2012-04-01", "2012-01-01",
"2012-03-01","2012-04-01"))
)
And in the other data frame, I have
other <- data.frame(
stock_name = c("Walmart", "Target"),
date = as.Date(c("2012-03-01", "2012-04-01"))
)
N <- 1
I would like to get the rows with prices of 100, 101, 202 and 203.
Hopefully this makes sense and I'm happy to answer further questions.
I'll use a function I wrote in a different answer, https://stackoverflow.com/a/58716950/3358272, called leadlag. The premise for that function is similar to lead or lag (in dplyr-speak) but it has a cumulative effect.
Up front: I'm assuming that this "N prior" is per-group (per stock_name), not generally throughout all stock names.
For this data, I'll add a unique id to each row and find the rows to keep:
stock.data$rn <- seq_len(nrow(stock.data))
rownums <- merge(stock.data, other_data)$rn
From there, let's lead/lag the filtering:
stock.data %>%
group_by(stock_name) %>%
filter(leadlag(rn %in% rownums, bef=1, aft=0)) %>%
ungroup()
# # A tibble: 4 x 4
# stock_name price date rn
# <chr> <dbl> <date> <int>
# 1 Walmart 100 2012-01-01 1
# 2 Walmart 101 2012-03-01 2
# 3 Target 202 2012-03-01 5
# 4 Target 203 2012-04-01 6
and if you wanted N=2 before, then
stock.data %>%
group_by(stock_name) %>%
filter(leadlag(rn %in% rownums, bef=2, aft=0)) %>%
ungroup()
# # A tibble: 5 x 4
# stock_name price date rn
# <chr> <dbl> <date> <int>
# 1 Walmart 100 2012-01-01 1
# 2 Walmart 101 2012-03-01 2
# 3 Target 201 2012-01-01 4
# 4 Target 202 2012-03-01 5
# 5 Target 203 2012-04-01 6
Data
stock.data <- data.frame(
stock_name = c("Walmart","Walmart","Walmart","Target","Target","Target"),
price = c(100,101,102,201,202,203),
date = as.Date(c("2012-01-01", "2012-03-01", "2012-04-01", "2012-01-01",
"2012-03-01","2012-04-01"))
)
other_data <- data.frame(
stock_name = c("Walmart", "Target"),
date = as.Date(c("2012-03-01", "2012-04-01"))
)
A copy of the leadlag function defined in the other answer:
#' Lead/Lag a logical
#'
#' @param lgl logical vector
#' @param bef integer, number of elements to lead by
#' @param aft integer, number of elements to lag by
#' @return logical, same length as 'lgl'
#' @export
leadlag <- function(lgl, bef = 1, aft = 1) {
n <- length(lgl)
bef <- min(n, max(0, bef))
aft <- min(n, max(0, aft))
befx <- if (bef > 0) sapply(seq_len(bef), function(b) c(tail(lgl, n = -b), rep(FALSE, b)))
aftx <- if (aft > 0) sapply(seq_len(aft), function(a) c(rep(FALSE, a), head(lgl, n = -a)))
rowSums(cbind(befx, lgl, aftx), na.rm = TRUE) > 0
}
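To see what leadlag does on its own, here is a small illustration on a toy logical vector (the vector x is made up; the function body is copied from above so the snippet runs by itself):

```r
# Copy of leadlag from above so that this snippet is self-contained
leadlag <- function(lgl, bef = 1, aft = 1) {
  n <- length(lgl)
  bef <- min(n, max(0, bef))
  aft <- min(n, max(0, aft))
  befx <- if (bef > 0) sapply(seq_len(bef), function(b) c(tail(lgl, n = -b), rep(FALSE, b)))
  aftx <- if (aft > 0) sapply(seq_len(aft), function(a) c(rep(FALSE, a), head(lgl, n = -a)))
  rowSums(cbind(befx, lgl, aftx), na.rm = TRUE) > 0
}

x <- c(FALSE, FALSE, TRUE, FALSE, FALSE)

one_before <- leadlag(x, bef = 1, aft = 0)  # also flags the element before the TRUE
two_before <- leadlag(x, bef = 2, aft = 0)  # also flags the two elements before it
one_after  <- leadlag(x, bef = 0, aft = 1)  # also flags the element after it

one_before
# [1] FALSE  TRUE  TRUE FALSE FALSE
```

So bef widens each TRUE towards earlier positions and aft towards later ones, which is why bef = N, aft = 0 keeps the matched row plus the N rows before it.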
If you are looking for N = 1 or N = 2, I would do it like this:
library(dplyr)
stock.data %>% left_join(mutate(other, in_other = TRUE)) %>%
filter(in_other | lead(in_other, 1))
# Joining, by = c("stock_name", "date")
# stock_name price date in_other
# 1 Walmart 100 2012-01-01 NA
# 2 Walmart 101 2012-03-01 TRUE
# 3 Target 202 2012-03-01 NA
# 4 Target 203 2012-04-01 TRUE
But this solution doesn't scale well to additional values of N.
Using this "other" data frame:
other = tribble(
~stock_name, ~date,
"Walmart", "2012-03-01",
"Target", "2012-04-01"
) %>% mutate(date = as.Date(date))
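One way the same join idea might be extended to arbitrary N is to work with row positions inside each stock instead of chaining lead calls. This is a sketch only: keep_with_prior is a hypothetical helper introduced here, and it assumes the rows are already in date order within each stock_name.

```r
library(dplyr)

stock.data <- data.frame(
  stock_name = c("Walmart", "Walmart", "Walmart", "Target", "Target", "Target"),
  price = c(100, 101, 102, 201, 202, 203),
  date = as.Date(c("2012-01-01", "2012-03-01", "2012-04-01", "2012-01-01",
                   "2012-03-01", "2012-04-01"))
)
other <- data.frame(
  stock_name = c("Walmart", "Target"),
  date = as.Date(c("2012-03-01", "2012-04-01"))
)
N <- 1

# Hypothetical helper: TRUE for every matched position and the n positions before it
keep_with_prior <- function(hit, n) {
  hits <- which(hit)
  pos <- seq_along(hit)
  vapply(pos, function(p) any(p >= hits - n & p <= hits), logical(1))
}

result <- stock.data %>%
  left_join(mutate(other, in_other = TRUE), by = c("stock_name", "date")) %>%
  group_by(stock_name) %>%  # so "N prior" never crosses into another stock
  filter(keep_with_prior(coalesce(in_other, FALSE), N)) %>%
  ungroup() %>%
  select(-in_other)

result
```

With N <- 1 this keeps the rows with prices 100, 101, 202 and 203; setting N <- 2 also brings in the row with price 201.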
I have 2 dataframes. One is a list of occasional events. It has a date column and a column of values.
df1 = data.frame(date = c(as.Date('2020-01-01'), as.Date('2020-02-02'), as.Date('2020-03-01')),
value = c(1,5,9))
I have another data frame that is a daily record. It too has a date column and a column of values.
set.seed(1)
df2 = data.frame(date = seq.Date(from = as.Date('2020-01-01'), to = as.Date('2020-04-01'), by = 1),
value = rnorm(92))
I want to create a new column in df1 that is the mean of df2$value from the current row date to the subsequent date value (non inclusive of the second value, so in this example, the first new value would be the mean of values from df2 of row 1 through row 32, where row 33 is the row that matches df1$date[2]). The resultant data frame would look like the following:
date value value_new
1 2020-01-01 1 0.1165512
2 2020-02-02 5 0.0974052
3 2020-03-01 9 0.1241778
But I have no idea how to specify that. Also I would prefer the last value to be the mean of whatever data is beyond the last value of df1$date, but I would also accept an NA.
We can join df2 with df1, fill the NA values with the previous values, and take the mean of the value_new column within each group.
library(dplyr)
df2 %>%
rename(value_new = value) %>%
left_join(df1, by = 'date') %>%
tidyr::fill(value) %>%
group_by(value) %>%
summarise(date = first(date),
value_new = mean(value_new))
# A tibble: 3 x 3
# value date value_new
# <dbl> <date> <dbl>
#1 1 2020-01-01 0.117
#2 5 2020-02-02 0.0974
#3 9 2020-03-01 0.124
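Note that grouping by value assumes the values in df1 are unique. A base R sketch that avoids this assumption assigns each daily row to the most recent event date with findInterval and averages per interval; it assumes df1 is sorted by date, and idx is just an illustrative name.

```r
df1 <- data.frame(date = as.Date(c("2020-01-01", "2020-02-02", "2020-03-01")),
                  value = c(1, 5, 9))
set.seed(1)
df2 <- data.frame(date = seq.Date(from = as.Date("2020-01-01"),
                                  to = as.Date("2020-04-01"), by = 1),
                  value = rnorm(92))

# For every daily row, the index of the latest df1 date that is <= that day
idx <- findInterval(as.numeric(df2$date), as.numeric(df1$date))

# Mean of the daily values per interval; the fixed levels keep one slot per df1 row
df1$value_new <- as.numeric(
  tapply(df2$value, factor(idx, levels = seq_len(nrow(df1))), mean)
)

df1
```

The last row then holds the mean of everything from the last event date onwards, and an event date with no daily rows would come out as NA.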
I have taken a function from this post to efficiently generate a random sample of times and dates between two dates.
Here is my data set:
latemail <- function(N, st="2012/01/01", et="2012/12/31") {
st <- as.POSIXct(as.Date(st))
et <- as.POSIXct(as.Date(et))
dt <- as.numeric(difftime(et,st,unit="sec"))
ev <- sort(runif(N, 0, dt))
rt <- st + ev
}
set.seed(42);
a<- print(latemail(9))
a<- sort(a)
data <- data.frame(time= a, place=c("Start", "B", "O", "A", "A", "Start", "A", "O", "A"), ID=c(rep(1, 5), rep(2,4)))
The data looks like this:
time place ID
1 2012-02-19 04:40:45 Start 1
2 2012-04-14 12:34:56 B 1
3 2012-07-08 13:16:49 O 1
4 2012-08-22 07:41:26 A 1
5 2012-08-27 21:15:08 A 1
6 2012-09-14 10:22:03 Start 2
7 2012-09-25 22:30:49 B 2
8 2012-10-30 03:43:16 B 2
9 2012-11-29 22:42:03 O 2
I would like to take the time difference between the rows where place is "O" and "Start", within each group (ID).
Questions:
1) Does the structure of the data above conform to tidy data? I think it might make more sense to spread the data so that the time difference can be taken column-wise. If each ID had only one row, would that be tidy data (to distinguish the A's, one could call them A_1, A_2 if they have to be columns)? Which format counts as tidy?
2) Is there a better way to accomplish this than the code below?
data2 <- data %>%
filter(place %in% c("Start", "O")) %>%
group_by(ID) %>%
mutate(diff=difftime(lead(time), time, units="days")) %>%
filter(!is.na(diff))
Output:
# A tibble: 2 x 4
# Groups: ID [2]
time place ID diff
<dttm> <fct> <dbl> <time>
1 2012-02-19 04:40:45 Start 1 140.31671 days
2 2012-09-25 22:30:49 Start 2 " 65.04947 days"
We can keep the structure as it is but simplify the code a bit by using summarise (assuming you only have one "O" and one "Start" for each ID).
library(dplyr)
data %>%
group_by(ID) %>%
summarise(diff = difftime(time[place == "O"], time[place == "Start"]))
# ID diff
# <dbl> <time>
#1 1 140.31671 days
#2 2 " 65.04947 days"
If there are some ID's which do not have either "Start" or "O" we can return NA for them
data %>%
group_by(ID) %>%
summarise(diff = if (any(place == "O") & any(place == "Start"))
difftime(time[place == "O"], time[place == "Start"]) else NA)
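On the first question, the column-wise version the asker describes could look roughly like the sketch below, assuming tidyr is available and each ID has exactly one "Start" and one "O". The timestamps are made up for illustration and are not the simulated ones from the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with one "Start" and one "O" per ID
data <- tibble(
  ID = c(1, 1, 1, 2, 2),
  place = c("Start", "B", "O", "Start", "O"),
  time = as.POSIXct(c("2012-02-19 00:00:00", "2012-02-20 00:00:00",
                      "2012-02-21 12:00:00", "2012-09-14 00:00:00",
                      "2012-09-15 00:00:00"), tz = "UTC")
)

wide <- data %>%
  filter(place %in% c("Start", "O")) %>%
  pivot_wider(names_from = place, values_from = time) %>%  # one row per ID
  mutate(diff = as.numeric(difftime(O, Start, units = "days")))

wide
```

Whether the long or the wide layout is "tidy" depends on what you treat as the observation; the long layout has one row per visit, while this wide one has one row per ID.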
I have multiple large data frames that capture events that last a certain amount of time. This example gives a simplified version of my data set
Data frame 1:
ID Days Date Value
1 10 80 30
1 10 85 30
2 20 75 20
2 10 80 20
3 5 90 30
Data frame 2:
ID Days Date Value
1 20 0 30
1 10 3 20
2 20 5 30
3 20 1 10
3 10 10 10
The same ID is used for the same person in all datasets
Days specifies the length of the event (if Days has the value 10 then the event lasts 10 days)
Date specifies the date at which the event starts. In this case, Date can be any number between 0 and 90 or 91 (the data represent days in a quarter)
Value is an attribute that is repeated for the number of Days specified. For example, for the first row in df1, the value 30 is repeated for 10 times starting from day 80 ( 30 is repeated for 10 days)
What I am interested in is to give for each ID in each data frame the highest value per day. Keep in mind that multiple events can overlap and values then have to be summed.
The final data frame should look like this:
ID HighestValuedf1 HighestValuedf2
1 60 80
2 40 30
3 30 20
For example, for ID 1 three events overlapped and resulted in the highest value of 80 in data frame 2. There was no overlap between the events of df1 and df2 for ID 3, only an overlap within df2.
I would prefer a solution that avoids merging all data frames into one data frame because of the size of my files.
EDIT
I rearranged my data so that all events that overlap are in one data frame. I only need the highest overlap value for every data frame.
Code to reproduce the data frames:
ID = c(1,1,2,2,3)
Date = c(80,85,75,80,90)
Days = c(10,10,20,10,5)
Value = c(30,30,20,20,30)
df1 = data.frame(ID,Days, Date,Value)
ID = c(1,1,2,3,3)
Date = c(1,3,5,1,10)
Days = c(20,10,20,20,10 )
Value =c(30,20,30,10,10)
df2 = data.frame(ID,Days, Date,Value)
ID= c(1,2,3)
HighestValuedf1 = c(60,40,30)
HighestValuedf2 = c(80,30,20)
df3 = data.frame(ID, HighestValuedf1, HighestValuedf2)
I am interpreting highest value per day to mean highest value on a single day throughout the time period. This is probably not the most efficient solution, since I expect something can be done with map or apply functions, but I didn't see how on a first look. Using df1 and df2 as defined above:
EDIT: Modified the code after understanding that df1 and df2 are supposed to represent sequential quarters. I think the easiest way to do this is simply to stack the data frames so anything that overlaps will automatically be caught (i.e. day 1 of df2 is day 91 overall). You will probably need to either adjust this code manually because quarters differ in length or, preferably, convert days of quarters into actual dates of the year in a date format (day 1 of df1 becomes January 1st 2017, for example). The code below just rearranges the data to achieve this and then produces the desired results for each quarter by filtering on days 1:90 and 91:180, as shown.
ID = c(1,1,2,2,3)
Date = c(80,85,75,80,90)
Days = c(10,10,20,10,5)
Value = c(30,30,20,20,30)
df1 = data.frame(ID,Days, Date,Value)
ID = c(1,1,2,3,3)
Date = c(1,3,5,1,10)
Days = c(20,10,20,20,10 )
Value =c(30,20,30,10,10)
df2 = data.frame(ID,Days, Date,Value)
library(tidyverse)
df2 <- df2 %>%
mutate(Date = Date + 90)
# Make a dataframe with complete set of day-ID combinations
df_completed <- df1 %>%
mutate(day = factor(Date, levels = 1:180)) %>% # set to total day length
complete(ID, day) %>%
mutate(daysum = 0) %>%
select(ID, day, daysum)
# Function to apply to each data frame containing events
# Should take each event and add value to the appropriate days
sum_df_daily <- function(df_complete, df){
for (i in 1:nrow(df)){
event_days <- seq(df[i, "Date"], df[i, "Date"] + df[i, "Days"] - 1)
df_complete <- df_complete %>%
mutate(
to_add = case_when(
ID == df[i, "ID"] & day %in% event_days ~ df[i, "Value"],
!(ID == df[i, "ID"] & day %in% event_days) ~ 0
),
daysum = daysum + to_add
)
}
return(df_complete)
}
df_filled <- df_completed %>%
sum_df_daily(df1) %>%
sum_df_daily(df2) %>%
mutate(
quarter = case_when(
day %in% 1:90 ~ "q1",
day %in% 91:180 ~ "q2"
)
)
df_filled %>%
group_by(quarter, ID) %>%
summarise(maxsum = max(daysum))
#> # A tibble: 6 x 3
#> # Groups: quarter [?]
#> quarter ID maxsum
#> <chr> <dbl> <dbl>
#> 1 q1 1.00 60.0
#> 2 q1 2.00 40.0
#> 3 q1 3.00 30.0
#> 4 q2 1.00 80.0
#> 5 q2 2.00 30.0
#> 6 q2 3.00 40.0
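For the situation described in the question's EDIT, where all overlapping events already sit in one data frame, the row-by-row loop can possibly be replaced by expanding each event into one row per day. This is a sketch under that assumption: it treats each data frame in isolation (no spill-over between quarters), and highest_daily_value is a hypothetical helper name.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(ID = c(1, 1, 2, 2, 3),
                  Days = c(10, 10, 20, 10, 5),
                  Date = c(80, 85, 75, 80, 90),
                  Value = c(30, 30, 20, 20, 30))

# Hypothetical helper: highest summed Value on any single day, per ID
highest_daily_value <- function(df) {
  df %>%
    uncount(Days, .id = "offset") %>%   # repeat each event once per day it lasts
    mutate(day = Date + offset - 1) %>% # the actual day covered by this copy
    group_by(ID, day) %>%
    summarise(daysum = sum(Value), .groups = "drop") %>%  # sum overlapping events
    group_by(ID) %>%
    summarise(highest = max(daysum))
}

res1 <- highest_daily_value(df1)
res1
```

For df1 this gives 60, 40 and 30 for IDs 1 to 3, matching HighestValuedf1 in the question, and the same function can be applied to each data frame separately without merging them.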