I have to create a data frame with a single row for every day in the interval between the two dates (date_in to date_out).
I have to use R.
How can I do this?
My data:
id date_in date_out days
1 1 13May2022 0:00:00 03Jul2022 0:00:00 51
2 3 10Nov2020 0:00:00 15Nov2020 0:00:00 5
3 4 25Feb2020 0:00:00 05Apr2020 0:00:00 40
> dput(df)
structure(list(id = c(1L, 3L, 4L), date_in = c("13May2022 0:00:00",
"10Nov2020 0:00:00", "25Feb2020 0:00:00"), date_out = c("03Jul2022 0:00:00",
"15Nov2020 0:00:00", "05Apr2020 0:00:00"), days = c(51, 5, 40
)), class = "data.frame", row.names = c(NA, -3L))
Here is an option. First, convert the dates into Date objects (yours might already be dates), then map out all the dates from the start to the end, and lastly unnest.
library(tidyverse)
#data
df <- read.csv(textConnection("id, date_in, date_out, days,
1, 13May2022 0:00:00, 03Jul2022 0:00:00, 51,
3, 10Nov2020 0:00:00, 15Nov2020 0:00:00, 5,
4, 25Feb2020 0:00:00, 05Apr2020 0:00:00, 40")) |>
select(-X)
#solution
df |>
mutate(across(starts_with("date"), \(x) lubridate::dmy_hms(x) |>
lubridate::date()),
full_date = map2(date_in, date_out, \(x,y) seq(x, y, by = "1 day"))) |>
unnest_longer(full_date) |>
select(id, date = full_date)
#> # A tibble: 99 x 2
#> id date
#> <int> <date>
#> 1 1 2022-05-13
#> 2 1 2022-05-14
#> 3 1 2022-05-15
#> 4 1 2022-05-16
#> 5 1 2022-05-17
#> 6 1 2022-05-18
#> 7 1 2022-05-19
#> 8 1 2022-05-20
#> 9 1 2022-05-21
#> 10 1 2022-05-22
#> # ... with 89 more rows
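For comparison, here is a base R sketch of the same expansion (assuming date_in and date_out have already been converted to Date, as above):
M <- Map(seq, df$date_in, df$date_out, by = "1 day")  # one date sequence per row
data.frame(id = rep(df$id, lengths(M)),               # repeat each id by its stay length
           date = do.call(c, M))                      # concatenate, keeping the Date class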
Here is a similar approach to AndS.'s, but using summarize:
library(tidyverse)
library(lubridate)
# data
df <- read.csv(textConnection("id, date_in, date_out, days,
1, 13May2022 0:00:00, 03Jul2022 0:00:00, 51,
3, 10Nov2020 0:00:00, 15Nov2020 0:00:00, 5,
4, 25Feb2020 0:00:00, 05Apr2020 0:00:00, 40")) |>
select(-X)
# answer
df |>
mutate(across(c(date_in, date_out), ~date(dmy_hms(.x)))) |>
group_by(id) |>
summarize(date=seq(date_in, date_out, by="1 day"))
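Note that returning more than one row per group from summarize() is deprecated as of dplyr 1.1.0. Assuming a recent dplyr, reframe() is the drop-in replacement for the last step:
df |>
  mutate(across(c(date_in, date_out), ~date(dmy_hms(.x)))) |>
  group_by(id) |>
  reframe(date = seq(date_in, date_out, by = "1 day"))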
Related
I am trying to disaggregate monthly data and spread it into weekly data in two ways.
First: find the first Monday from the start date, create a sequence of Mondays up to the last month, and spread each month's value across the Mondays falling within that month.
Second: create a weekly sequence from the start date to the end date and spread each month's value across the weeks falling within that month.
The data which I am working with is given below:
structure(list(`Row Labels` = c("X6", "X7", "X8", "X9"), `2022-11-01` = c(100,
200, 300, 400), `2022-12-01` = c(160, 200, 300, 400), `2023-01-01` = c(500,
550, 600, 650)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-4L))
The expected output 1 spreads each monthly value evenly across the Mondays of that month (every date in the output is a Monday); the expected output 2 spreads it across a weekly sequence running from the start date to the end date.
Is this doable, or is it a bit too much to expect from R?
For Mondays, we can create a list of the Mondays between the dates in the data frame, join it with the data in long format, count the number of Mondays for each variable in each month, divide the values by that count, and then pivot back to wide format:
library(dplyr)
library(tidyr)
library(lubridate)
all_dates <- as.Date(names(df1)[-1])
MON <- seq(min(floor_date(all_dates, "month")),
max(ceiling_date(all_dates, "month")),
by="1 day") %>%
.[wday(.,label = TRUE) == "Mon"] %>%
data.frame("Mondays" = .) %>%
mutate(mmm = format(Mondays, "%Y-%m"))
df1 %>%
pivot_longer(cols = -`Row Labels`, names_to = "dates") %>%
mutate(dates = as.Date(dates),
mmm = format(dates, "%Y-%m")) %>%
right_join(MON, by = "mmm") %>%
arrange(mmm) %>%
group_by(`Row Labels`, dates) %>%
mutate(value = value / n()) %>%
ungroup() %>%
select(`Row Labels`, Mondays, value) %>%
pivot_wider(id_cols = `Row Labels`, names_from = "Mondays", values_from = "value")
#> # A tibble: 4 x 14
#> `Row Labels` `2022-11-07` `2022-11-14` `2022-11-21` `2022-11-28` `2022-12-05`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 X6 25 25 25 25 40
#> 2 X7 50 50 50 50 50
#> 3 X8 75 75 75 75 75
#> 4 X9 100 100 100 100 100
#> # ... with 8 more variables: 2022-12-12 <dbl>, 2022-12-19 <dbl>,
#> # 2022-12-26 <dbl>, 2023-01-02 <dbl>, 2023-01-09 <dbl>, 2023-01-16 <dbl>,
#> # 2023-01-23 <dbl>, 2023-01-30 <dbl>
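As a quick sanity check (a sketch reusing MON and the long format from above), summing the split values back up within each month should recover the original monthly totals in df1:
df1 %>%
  pivot_longer(cols = -`Row Labels`, names_to = "dates") %>%
  mutate(dates = as.Date(dates),
         mmm = format(dates, "%Y-%m")) %>%
  right_join(MON, by = "mmm") %>%
  group_by(`Row Labels`, dates) %>%
  summarise(monthly_total = sum(value / n()), .groups = "drop")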
The same principle applies to doing it weekly:
WKLY <- seq(min(floor_date(all_dates, "month")),
max(ceiling_date(all_dates, "month")),
by="week") %>%
data.frame("Weekly" = .) %>%
mutate(mmm = format(Weekly, "%Y-%m"))
df1 %>%
pivot_longer(cols = -`Row Labels`, names_to = "dates") %>%
mutate(dates = as.Date(dates),
mmm = format(dates, "%Y-%m")) %>%
right_join(WKLY, by = "mmm") %>%
arrange(mmm) %>%
group_by(`Row Labels`, dates) %>%
mutate(value = value / n()) %>%
ungroup() %>%
select(`Row Labels`, Weekly, value) %>%
pivot_wider(id_cols = `Row Labels`, names_from = "Weekly", values_from = "value")
#> # A tibble: 4 x 15
#> `Row Labels` `2022-11-01` `2022-11-08` `2022-11-15` `2022-11-22` `2022-11-29`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 X6 20 20 20 20 20
#> 2 X7 40 40 40 40 40
#> 3 X8 60 60 60 60 60
#> 4 X9 80 80 80 80 80
#> # ... with 9 more variables: 2022-12-06 <dbl>, 2022-12-13 <dbl>,
#> # 2022-12-20 <dbl>, 2022-12-27 <dbl>, 2023-01-03 <dbl>, 2023-01-10 <dbl>,
#> # 2023-01-17 <dbl>, 2023-01-24 <dbl>, 2023-01-31 <dbl>
Data:
df1 <- structure(list(`Row Labels` = c("X6", "X7", "X8", "X9"),
`2022-11-01` = c(100, 200, 300, 400),
`2022-12-01` = c(160, 200, 300, 400),
`2023-01-01` = c(500, 550, 600, 650)),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -4L))
I've been looking for answers and messing around with my code for a couple hours. I have a dataset that looks like the following for a specific ID:
# A tibble: 10 × 3
ID state orderDate
<dbl> <chr> <dttm>
1 4227631 1 2022-03-14 19:00:00
2 4227631 1 2022-03-14 20:00:00
3 4227631 1 2022-03-15 11:00:00
4 4227631 0 2022-03-15 11:00:00
5 4227631 1 2022-03-15 20:00:00
6 4227631 1 2022-03-16 04:00:00
7 4227631 0 2022-03-16 04:00:00
8 4227631 1 2022-03-16 05:00:00
9 4227631 0 2022-03-16 13:00:00
10 4227631 1 2022-03-16 15:00:00
This occurs for hundreds of IDs. For this example, I am using dplyr to group_by ID. I only care about when the state changes between values, not when it stays the same.
I want to calculate the cumulative time each ID remains in state 1. Instances where state 1 is repeated multiple times before it changes should be ignored. I plan to use lubridate and dplyr for the analysis.
Tibble I am using for this example:
structure(list(ID = c(4227631, 4227631, 4227631, 4227631, 4227631,
4227631, 4227631, 4227631, 4227631, 4227631), state = c("1",
"1", "1", "0", "1", "1", "0", "1", "0", "1"), orderDate = structure(c(1647284400,
1647288000, 1647342000, 1647342000, 1647374400, 1647403200, 1647403200,
1647406800, 1647435600, 1647442800), tzone = "UTC", class = c("POSIXct",
"POSIXt"))), row.names = c(NA, -10L), class = c("tbl_df", "tbl",
"data.frame"))
I've tried various solutions, such as Cumulative time with reset, but I'm having trouble with lag() and incorporating it into this specific analysis.
The expected output would add a column (say, statusOne) holding the elapsed hours for each run of state 1.
I would then sum all statusOne values together to figure out the cumulative time spent in this state.
I'd welcome more elegant solutions, or a link to a prior question.
EDIT
Using the solution below, I figured it out!
The solution didn't account for situations where state 0 immediately followed state 1, where we also want the time elapsed between these two states counted.
df %>%
group_by(ID) %>%
mutate(max = cumsum(ifelse(state == lag(state, default = "1"), 0, 1))) %>%
mutate(hours1 = ifelse(max == lag(max) &
state=="1", difftime(orderDate, lag(orderDate), units = "h"), NA)) %>%
mutate(hours2 = ifelse(state=="0" & lag(state)=="1",
difftime(orderDate, lag(orderDate), units = "h"), NA)) %>%
mutate(hours1 = replace_na(hours1, 0),
hours2 = replace_na(hours2, 0)) %>%
mutate(hours = hours1+hours2) %>%
select(-hours1, -hours2) %>%
summarise(total_hours = sum(hours, na.rm = TRUE)) %>%
filter(total_hours!=0)
This is far from elegant, but at least it appears to provide the correct answer:
library(tidyverse)
df <- structure(list(ID = c(4227631, 4227631, 4227631, 4227631, 4227631,
4227631, 4227631, 4227631, 4227631, 4227631),
state = c("1", "1", "1", "0", "1", "1", "0", "1", "0", "1"),
orderDate = structure(c(1647284400, 1647288000, 1647342000,
1647342000, 1647374400, 1647403200,
1647403200, 1647406800, 1647435600,
1647442800),
tzone = "UTC",
class = c("POSIXct", "POSIXt"))),
row.names = c(NA, -10L),
class = c("tbl_df", "tbl", "data.frame"))
df2 <- df %>%
group_by(ID) %>%
mutate(tmp = ifelse(state == lag(state, default = "1"), 0, 1),
max = cumsum(tmp)) %>%
mutate(hours = ifelse(max == lag(max), difftime(orderDate, lag(orderDate), units = "h"), NA)) %>%
select(-tmp)
df3 <- df2 %>%
group_by(max) %>%
summarise(max, statusOne = sum(hours, na.rm = TRUE))
df4 <- left_join(df2, df3, by = "max") %>%
distinct() %>%
select(-c(max, hours)) %>%
mutate(statusOne = ifelse(statusOne != 0 & lag(statusOne, default = 1) == statusOne, 0, statusOne))
df4
#> # A tibble: 10 × 4
#> # Groups: ID [1]
#> ID state orderDate statusOne
#> <dbl> <chr> <dttm> <dbl>
#> 1 4227631 1 2022-03-14 19:00:00 16
#> 2 4227631 1 2022-03-14 20:00:00 0
#> 3 4227631 1 2022-03-15 11:00:00 0
#> 4 4227631 0 2022-03-15 11:00:00 0
#> 5 4227631 1 2022-03-15 20:00:00 8
#> 6 4227631 1 2022-03-16 04:00:00 0
#> 7 4227631 0 2022-03-16 04:00:00 0
#> 8 4227631 1 2022-03-16 05:00:00 0
#> 9 4227631 0 2022-03-16 13:00:00 0
#> 10 4227631 1 2022-03-16 15:00:00 0
Created on 2022-04-04 by the reprex package (v2.0.1)
Edit
It's a lot more straightforward to get the total hours in state 1 for each ID:
df %>%
group_by(ID) %>%
mutate(max = cumsum(ifelse(state == lag(state, default = "1"), 0, 1))) %>%
mutate(hours = ifelse(max == lag(max), difftime(orderDate, lag(orderDate), units = "h"), NA)) %>%
summarise(total_hours = sum(hours, na.rm = TRUE))
#> # A tibble: 1 × 2
#> ID total_hours
#> <dbl> <dbl>
#> 1 4227631 24
Created on 2022-04-04 by the reprex package (v2.0.1)
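As an aside, dplyr 1.1.0 added consecutive_id(), which assigns one id per run of equal values and makes the run detection more direct. A sketch of the OP's edited logic (time from the start of each state-1 run to the start of the next run), assuming a recent dplyr:
df %>%
  group_by(ID) %>%
  mutate(run = consecutive_id(state)) %>%        # one id per run of equal states
  group_by(ID, run) %>%
  summarise(state = first(state),
            start = first(orderDate), .groups = "drop_last") %>%
  mutate(hours = as.numeric(difftime(lead(start), start, units = "hours"))) %>%
  filter(state == "1") %>%
  summarise(total_hours = sum(hours, na.rm = TRUE))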
My question extends this one: Generate rows between two dates into a data frame in R
I have a dataset on admissions, discharges and lengths of stay (Stay_in_days) of patients from a hospital. It looks like this:
ID Admission Discharge Stay_in_days
1 2020-08-20 15:25:03 2020-08-21 21:09:34 1.239
2 2020-10-04 21:53:43 2020-10-09 11:02:57 4.548
...
Dates are in POSIXct format so far.
I aim for this:
ID Date Stay_in_days
1 2020-08-20 15:25:03 0.357
1 2020-08-21 21:09:49 1.239
2 2020-10-04 21:53:43 0.087
2 2020-10-05 00:00:00 1.087
2 2020-10-06 00:00:00 2.087
2 2020-10-07 00:00:00 3.087
2 2020-10-08 00:00:00 4.087
2 2020-10-09 11:02:57 4.548
...
What I have done so far:
M <- Map(seq, patients$Admission, patients$Discharge, by = "day")
patients2 <- data.frame(
ID = rep.int(patients$ID, vapply(M, length, 1L)),
Date = do.call(c, M)
)
patients2 <- patients2 %>%
mutate(
Date2=as.Date(Date, format = "%Y-%m-%d"),
Dat2=Date2+1,
Diff=difftime(Date2, Date, units = "days")
)
but this gives me:
ID Date Date2 Diff
1 2020-08-20 17:25:03 2020-08-21 0.375
1 2020-08-21 17:25:03 2020-08-22 0.357
2 2020-10-04 23:53:43 2020-10-05 0.087
2 2020-10-05 23:53:43 2020-10-06 0.087
2 2020-10-06 23:53:43 2020-10-07 0.087
2 2020-10-07 23:53:43 2020-10-08 0.087
2 2020-10-08 23:53:43 2020-10-09 0.087
...
Strangely enough, it adds two hours to the Admission date but calculates the correct length of stay. Can someone explain?
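A likely explanation (an assumption, based on the UTC timestamps in the data below): combining the per-row sequences with do.call(c, M) can drop the tzone attribute, so the result prints in the session's local timezone; in a CEST (UTC+2) session the display shifts by two hours while the underlying instants, and therefore the computed differences, stay correct. A minimal check:
x <- as.POSIXct("2020-08-20 15:25:03", tz = "UTC")
format(x, tz = "UTC")            # "2020-08-20 15:25:03"
format(x, tz = "Europe/Berlin")  # "2020-08-20 17:25:03" -- the "extra" two hours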
Here is some data:
structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20), Admission = structure(c(1597937103.872,
1598717768.704, 1599060521.984, 1599758087.168, 1599815496.704,
1600702198.784, 1600719631.36, 1601065923.584, 1601119400.96,
1601215476.736, 1601236710.4, 1601416934.4, 1601499640.832, 1601545647.104,
1601587328, 1601644868.608, 1601741206.528, 1601848423.424, 1601901245.44,
1601913828.352), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
Discharge = structure(c(1598044189.696, 1598897337.344, 1599144670.208,
1599845118.976, 1599842366.464, 1602733683.712, 1603372135.424,
1601125168.128, 1601314173.952, 1605193905.152, 1602190259.2,
1601560720.384, 1601737143.296, 1602705634.304, 1602410460.16,
1602698425.344, 1601770566.656, 1602241377.28, 1602780476.416,
1602612048.896), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
Stay_in_days = c(1.239, 2.078, 0.974, 1.007, 0.311, 23.513,
30.7, 0.686, 2.254, 46.047, 11.036, 1.664, 2.749, 13.426,
9.527, 12.194, 0.34, 4.548, 10.176, 8.081)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
Thanks in advance for your help!
Though it is a bit crude, it'll work:
library(tidyverse)
library(lubridate)
df %>%
pivot_longer(cols = -c(ID, Stay_in_days), names_to = "Event", values_to = "DATE") %>%
group_by(ID) %>%
mutate(dummy = case_when(Event == "Admission" ~ 0,
Event == "Discharge" ~ max(floor(Stay_in_days),1))) %>%
complete(dummy = seq(min(dummy), max(dummy), 1)) %>%
mutate(Event = ifelse(is.na(Event), "Dummy", Event),
DATE = if_else(is.na(DATE), first(DATE)+dummy*24*60*60, DATE),
Stay_in_days = case_when(Event == "Admission" ~ as.numeric(difftime(ceiling_date(DATE, "day"), DATE, units = "days")),
Event == "Discharge" ~ Stay_in_days,
TRUE ~ dummy + as.numeric(difftime(ceiling_date(first(DATE), "day"), first(DATE), units = "days")))) %>%
select(ID, DATE, Stay_in_days)
# A tibble: 199 x 3
# Groups: ID [20]
ID DATE Stay_in_days
<dbl> <dttm> <dbl>
1 1 2020-08-20 15:25:03 0.358
2 1 2020-08-21 21:09:49 1.24
3 2 2020-08-29 16:16:08 0.322
4 2 2020-08-30 16:16:08 1.32
5 2 2020-08-31 18:08:57 2.08
6 3 2020-09-02 15:28:41 0.355
7 3 2020-09-03 14:51:10 0.974
8 4 2020-09-10 17:14:47 0.281
9 4 2020-09-11 17:25:18 1.01
10 5 2020-09-11 09:11:36 0.617
# ... with 189 more rows
Explanation of the logic: for the first date in every ID, Stay_in_days gives the duration from the admission date-time to the following midnight. For intermediate dates, it just adds 1 to the previous value. For the discharge date, it retains the stay value calculated prior to pivoting. Hope this is what you were after.
Explanation of the code: after pivoting longer, I used a dummy column to create the intermediate date-time rows. After that, I just mutate the columns to generate the output described above.
You can achieve this with pivot_longer from tidyr.
Edit: with comments:
df1 <- df %>%
select(ID = ID, date1 = Admission, date2 = Discharge, Stay_in_days) %>% # prepare for pivoting
pivot_longer(
cols = starts_with("date"),
names_to = "Date1",
values_to = "Date",
) %>% # pivot to longformat
select(-Date1) %>% # remove temporary Date1
relocate(Stay_in_days, .after = Date) %>% # change column order
group_by(ID) %>%
mutate(idgroup = row_number()) %>% # id for admission = 1 and for discharge = 2
mutate(Stay_in_days = replace(Stay_in_days, row_number() == 1, 0)) %>% # set Admission to zero
ungroup()
I have a dataframe as so
df <- structure(list(TIME = c("11:15:00", NA, "15:15:00", "12:00:00",
"18:40:00", "18:15:00", "7:10:00", "15:58:00", "10:00:00", "10:00:00"
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))
And I basically want to create a new variable which tells me if the time is in a certain group.
I wrote the following, but it's not correct; I tried changing to as.POSIXct, but no dice.
df <- df %>%
mutate(time_groups = ifelse(between(as.POSIXct(TIME),00:00, 5:59), 1,
ifelse(between(as.POSIXct(TIME),06:00, 8:59), 2,
ifelse(between(as.POSIXct(TIME),09:00,11:59), 3,
ifelse(between(as.POSIXct(TIME),12:00,14:59), 4,
ifelse(between(as.POSIXct(TIME),15:00,17:59), 5,
ifelse(between(as.POSIXct(TIME),18:00,23:59), 6,
), NA)
You could use the findInterval function:
library(tidyverse)
library(lubridate)
a <- c("00:00","5:59", "8:59", "11:59", "14:59", "17:59", "23:59")
b <- ymd_hm(paste(Sys.Date(), a))
df %>%
mutate(Interval = findInterval(ymd_hms(paste(Sys.Date(), TIME)), b))
TIME Interval
<chr> <int>
1 11:15:00 3
2 NA NA
3 15:15:00 5
4 12:00:00 4
5 18:40:00 6
6 18:15:00 6
7 7:10:00 2
8 15:58:00 5
9 10:00:00 3
10 10:00:00 3
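findInterval(x, b) returns, for each element of x, the index of the interval between consecutive sorted breaks b that contains it (NA in, NA out). For comparison, a similar sketch with base R's cut(), reusing the breaks b from above (cut.POSIXt uses left-closed intervals, matching findInterval here):
df %>%
  mutate(Interval = cut(ymd_hms(paste(Sys.Date(), TIME)), breaks = b, labels = FALSE))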
I am attempting to calculate correlation, grouped by MktDate, between all columns in a data frame and another column (Security Return).
I have attempted a number of dplyr solutions and can't quite get the correlation example to work, though I have no issues getting an analogous example using mean to work.
This works, to calculate the mean of the specified columns:
MyMeanTest <- MyDataTest %>%
filter(MktDate >='2009-12-31') %>%
group_by(MktDate) %>%
summarize_at(c('RtnVol_EM','OCFROI_EM'),mean,na.rm=TRUE)
This does not work. Essentially, I want the correlation of each specified column with the column FwdRet_12M, grouped by MktDate. I get the following error message:
Error in summarise_impl(.data, dots) :
Evaluation error: not all arguments have the same length.
MyCorTest <- MyDataTest %>%
group_by(MktDate) %>%
summarize_at(c('RtnVol_EM','OCFROI_EM'),funs(cor(.,MyDataTest$FwdRet_12M,use="pairwise.complete.obs", "spearman")))
With the code example above, I should end up with something like this:
MktDate,RtnVol_EM,OCFROI_EM...
Here is some sample code that should help to understand the structure of the data and end objective.
MyDataTest <- structure(list(MktDate = structure(c(17896, 17896, 17896, 17896,
17927, 17927, 17927, 17927), class = "Date"), FwdRet = c(2, 3,
4, 5, 5, 2, 1, 4), Fact1 = c(10, 30, 20, 15, 12, 25, 26, 28),
Fact2 = c(100, 500, 300, 400, 150, 400, 430, 420)), .Names = c("MktDate",
"FwdRet", "Fact1", "Fact2"), row.names = c(NA, -8L), class = "data.frame")
When running the pairwise correlation grouped by date on that data set, the following should be the result:
MktDate,Fact1,Fact2
12/31/18,.2,.4
1/31/19,.4,-.8
One possible approach would be to reshape your data so that you have the variable you always want in the correlation (FwdRet) in one column and the variable that changes in a separate column. Like so:
MyDataTest_reshape <- MyDataTest %>%
gather(factor, value, -MktDate, -FwdRet)
MyDataTest_reshape
MktDate FwdRet factor value
1 2018-12-31 2 Fact1 10
2 2018-12-31 3 Fact1 30
3 2018-12-31 4 Fact1 20
4 2018-12-31 5 Fact1 15
5 2019-01-31 5 Fact1 12
6 2019-01-31 2 Fact1 25
7 2019-01-31 1 Fact1 26
8 2019-01-31 4 Fact1 28
9 2018-12-31 2 Fact2 100
10 2018-12-31 3 Fact2 500
11 2018-12-31 4 Fact2 300
12 2018-12-31 5 Fact2 400
13 2019-01-31 5 Fact2 150
14 2019-01-31 2 Fact2 400
15 2019-01-31 1 Fact2 430
16 2019-01-31 4 Fact2 420
Then you can take that reshaped data and feed it into your correlation:
MyDataTest_reshape %>%
group_by(MktDate, factor) %>%
summarize(correlation = cor(FwdRet, value)) %>%
spread(factor, correlation)
# A tibble: 2 x 3
# Groups: MktDate [2]
MktDate Fact1 Fact2
<date> <dbl> <dbl>
1 2018-12-31 0.0756 0.529
2 2019-01-31 -0.627 -0.736
You can also do this all in one step, of course. (Note that cor() defaults to Pearson; pass method = "spearman" to reproduce the Spearman values in the expected output above.)
MyDataTest %>%
gather(factor, value, -MktDate, -FwdRet) %>%
group_by(MktDate, factor) %>%
summarize(correlation = cor(FwdRet, value)) %>%
spread(factor, correlation)
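As an aside, gather() and spread() are superseded in current tidyr; an equivalent one-step pipeline with pivot_longer()/pivot_wider() might look like this (a sketch under that assumption):
MyDataTest %>%
  pivot_longer(-c(MktDate, FwdRet), names_to = "factor") %>%
  group_by(MktDate, factor) %>%
  summarize(correlation = cor(FwdRet, value), .groups = "drop") %>%
  pivot_wider(names_from = factor, values_from = correlation)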
This works for me.
library(tidyverse)
MyDataTest <- structure(list(MktDate = structure(c(17896, 17896, 17896, 17896,
17927, 17927, 17927, 17927), class = "Date"), FwdRet = c(2, 3,
4, 5, 5, 2, 1, 4), Fact1 = c(10, 30, 20, 15, 12, 25, 26, 28),
Fact2 = c(100, 500, 300, 400, 150, 400, 430, 420)), .Names = c("MktDate",
"FwdRet", "Fact1", "Fact2"), row.names = c(NA, -8L), class = "data.frame")
MyDataTest %>%
group_by(MktDate) %>%
summarize_at(c("Fact1", "Fact2"), list(~cor(., FwdRet, use="pairwise.complete.obs", "spearman")))
#> # A tibble: 2 x 3
#> MktDate Fact1 Fact2
#> <date> <dbl> <dbl>
#> 1 2018-12-31 0.2 0.4
#> 2 2019-01-31 -0.4 -0.8
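For the record, summarize_at() is also superseded in current dplyr; an equivalent using across() might look like this (a sketch, expected to give the same output):
MyDataTest %>%
  group_by(MktDate) %>%
  summarize(across(c(Fact1, Fact2),
                   ~cor(.x, FwdRet, use = "pairwise.complete.obs", method = "spearman")))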