calculate time difference in the same group - r

I want to convert the column time to be in time decimal format and then find the time interval within each group of the user_id. I have tried the answer below, but I could not get it to work:
Days difference between two dates in same column in R
structure(list(question_id = c(5502L, 5502L, 5502L, 5502L, 5502L
), user_id = c(112197L, 112197L, 112197L, 114033L, 114033L),
time = structure(c(1603720173, 1603720388, 1603720702, 1603603115,
1603949442), class = c("POSIXct", "POSIXt"), tzone = ""),
prediction = c(0.9, 0.95, 0.9, 0.99, 0.94), log_score = c(0.84799690655495,
0.925999418556223, 0.84799690655495, 0.985500430304885, 0.910732661902913
)), row.names = 156182:156186, class = "data.frame")

Perhaps this is what you're looking for?
library(dplyr)
user_data %>%
group_by(user_id) %>%
summarise(day.interval = difftime(max(time), min(time),units = "days"))
# A tibble: 2 x 2
user_id day.interval
<int> <drtn>
1 112197 0.006122685 days
2 114033 4.008414352 days

library(tidyverse)
library(lubridate)
df <- tibble::tribble(
~question_id, ~user_id, ~time, ~prediction, ~log_score,
5502L, 112197L, "2020-10-26 14:49:33", 0.9, 0.84799690655495,
5502L, 112197L, "2020-10-26 14:53:08", 0.95, 0.925999418556223,
5502L, 112197L, "2020-10-26 14:58:22", 0.9, 0.84799690655495,
5502L, 114033L, "2020-10-25 06:18:35", 0.99, 0.985500430304885,
5502L, 114033L, "2020-10-29 06:30:42", 0.94, 0.910732661902913
)
df %>%
as_tibble() %>%
mutate(time = lubridate::ymd_hms(time)) %>%
group_by(user_id) %>%
mutate(diff = time - lag(time),
diff2 = hms::hms(seconds_to_period(diff)))
#> # A tibble: 5 x 7
#> # Groups: user_id [2]
#> question_id user_id time prediction log_score diff diff2
#> <int> <int> <dttm> <dbl> <dbl> <drtn> <time>
#> 1 5502 112197 2020-10-26 14:49:33 0.9 0.848 NA secs NA
#> 2 5502 112197 2020-10-26 14:53:08 0.95 0.926 215 secs 00:03:35
#> 3 5502 112197 2020-10-26 14:58:22 0.9 0.848 314 secs 00:05:14
#> 4 5502 114033 2020-10-25 06:18:35 0.99 0.986 NA secs NA
#> 5 5502 114033 2020-10-29 06:30:42 0.94 0.911 346327 secs 96:12:07
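For comparison, here is a minimal base R sketch of the same per-group lag difference. It assumes the question's data sits in a data frame called user_data (the name used in the first answer) and that "decimal format" means decimal hours. The column names diff_secs and diff_hours are made up for illustration, and the UTC time zone is an arbitrary choice that does not affect the differences.

```r
# Rebuild the two relevant columns from the question's dput() output.
user_data <- data.frame(
  user_id = c(112197L, 112197L, 112197L, 114033L, 114033L),
  time = as.POSIXct(
    c(1603720173, 1603720388, 1603720702, 1603603115, 1603949442),
    origin = "1970-01-01", tz = "UTC"  # assumption: UTC, only for display
  )
)

# Sort so that "previous row" means "previous time" within each user.
user_data <- user_data[order(user_data$user_id, user_data$time), ]

# Difference to the previous time within each user_id; the first row of
# every group has no predecessor and gets NA.
user_data$diff_secs <- ave(
  as.numeric(user_data$time), user_data$user_id,
  FUN = function(t) c(NA, diff(t))
)

# The same interval expressed as decimal hours.
user_data$diff_hours <- user_data$diff_secs / 3600

print(user_data)
```

The diff_secs column should agree with the diff column in the lubridate answer above.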

Related

Make connections between two datasets

I would like to link the x and df2 datasets. The dataset x has a percentage value, which is 0.1 for the day 03-01-2021 and 0.45 for the days 01-02-2021 and 01-01-2022. The value 0.1 for 03-01-2021 falls into category I of my dataset df2 (since its values range from 0.1 to 0.2). The days 01-02-2021 and 01-01-2022 correspond to category F of df2, since its values range from 0.4 to 0.5. So, I would like to generate an output table as follows:
library(dplyr)
df1<- structure(
list(date2= c("01-01-2022","01-01-2022","03-01-2021","03-01-2021","01-02-2021","01-02-2021"),
Category= c("ABC","CDE","ABC","CDE","ABC","CDE"),
coef= c(5,4,0,2,4,5)),
class = "data.frame", row.names = c(NA, -6L))
x<-df1 %>%
group_by(date2) %>%
summarize(across("coef", sum),.groups = 'drop')%>%
arrange(date2 = as.Date(date2, format = "%d-%m-%Y"))
number<-20
x$Percentage<-x$coef/number
date2 coef Percentage
<chr> <dbl> <dbl>
1 03-01-2021 2 0.1
2 01-02-2021 9 0.45
3 01-01-2022 9 0.45
df2 <- structure(
list(
Category = c("A", "B", "C", "D",
"E", "F", "G", "H", "I", "J"),
From = c(0.9,
0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0),
Until = c(
1,
0.8999,
0.7999,
0.6999,
0.5999,
0.4999,
0.3999,
0.2999,
0.1999,
0.0999
),
`1 Val` = c(
2222,
2017.8,
1793.6,
1621.5,
1522.4,
1457.3,
1325.2,
1229.15,
1223.1,
1177.05
),
`2 Val` = c(3200, 2220, 2560,
2200, 2220, 2080, 1220, 1240, 1720, 1620),
`3 Val` = c(
4665,
4122.5,
3732,
3498.75,
3265.5,
3032.25,
2799,
2682.375,
2565.75,
2449.125
),
`4 Val` = c(
6112,
5222.8,
4889.6,
4224,
4278.4,
3972.8,
3667.2,
3224.4,
3361.6,
3222.8
)
),
row.names = c(NA,-10L),
class = c("tbl_df",
"tbl", "data.frame")
)
Category From Until 1 Val 2 Val 3 Val 4 Val
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 0.9 1 2222 3200 4665 6112
2 B 0.8 0.900 2018 2220 4122 5223
3 C 0.7 0.800 1794 2560 3732 4890
4 D 0.6 0.700 1622 2200 3499 4224
5 E 0.5 0.600 1522 2220 3266 4278
6 F 0.4 0.500 1457 2080 3032 3973
7 G 0.3 0.400 1325 1220 2799 3667
8 H 0.2 0.300 1229 1240 2682 3224
9 I 0.1 0.200 1223 1720 2566 3362
10 J 0 0.0999 1177 1620 2449 3223
Using tidyverse, we do a rowwise on the 'x' dataset, slice the rows of 'df2' where the 'Percentage' falls between the 'From' and 'Until', and unpack the data.frame/tibble column
library(dplyr)
library(tidyr)
x %>%
rowwise %>%
mutate(out = df2 %>%
slice(which(Percentage>= From &
Percentage <= Until)[1]) %>%
select(-(1:3)) ) %>%
ungroup %>%
unpack(out)
-output
# A tibble: 3 × 7
date2 coef Percentage `1 Val` `2 Val` `3 Val` `4 Val`
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 03-01-2021 2 0.1 1223. 1720 2566. 3362.
2 01-02-2021 9 0.45 1457. 2080 3032. 3973.
3 01-01-2022 9 0.45 1457. 2080 3032. 3973.
Or this could be done with a non-equi join
library(data.table)
nm1 <- names(df2)[endsWith(names(df2), 'Val')]
setDT(x)[setDT(df2), (nm1) := mget(nm1),
on = .(Percentage >= From, Percentage <= Until)]
-output
> x
date2 coef Percentage 1 Val 2 Val 3 Val 4 Val
1: 03-01-2021 2 0.10 1223.1 1720 2565.75 3361.6
2: 01-02-2021 9 0.45 1457.3 2080 3032.25 3972.8
3: 01-01-2022 9 0.45 1457.3 2080 3032.25 3972.8
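If only the category label is needed, the band lookup can also be sketched in base R with findInterval. This is not part of the answer above; it assumes the bands in df2 are contiguous, so only the From column is used and the small gaps between Until and the next From (for example 0.0999 to 0.1) are ignored. The variable names are made up for illustration.

```r
# Lower bounds and labels copied from df2.
from     <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0)
category <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")

# findInterval() needs increasing break points, so re-sort both vectors.
ord    <- order(from)
breaks <- from[ord]
labels <- category[ord]

# Percentages from x: coef / number, with number = 20.
percentage <- c(2, 9, 9) / 20

# Index of the last break that is <= each percentage, mapped to its label.
matched <- labels[findInterval(percentage, breaks)]
print(data.frame(percentage, matched))
```

The matched label can then be used to pull the Val columns from df2, for example with match(matched, df2$Category).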

R: Expand rows according to start and end date and calculate hours between days

My question extends this one: Generate rows between two dates into a data frame in R
I have a dataset on admissions, discharges and lengths of stay (Stay_in_days) of patients from a hospital. It looks like this:
ID Admission Discharge Stay_in_days
1 2020-08-20 15:25:03 2020-08-21 21:09:34 1.239
2 2020-10-04 21:53:43 2020-10-09 11:02:57 4.548
...
Dates are in POSIXct format so far.
I aim for this:
ID Date Stay_in_days
1 2020-08-20 15:25:03 0.357
1 2020-08-21 21:09:49 1.239
2 2020-10-04 21:53:43 0.087
2 2020-10-05 00:00:00 1.087
2 2020-10-06 00:00:00 2.087
2 2020-10-07 00:00:00 3.087
2 2020-10-08 00:00:00 4.087
2 2020-10-09 11:02:57 4.548
...
What I have done so far:
M <- Map(seq, patients$Admission, patients$Discharge, by = "day")
patients2 <- data.frame(
ID = rep.int(patients$ID, vapply(M, length, 1L)),
Date = do.call(c, M)
)
patients <- patients %>%
mutate(
Date2=as.Date(Date, format = "%Y-%m-%d"),
Dat2=Date2+1,
Diff=difftime(Date2, Date, units = "days")
)
but this gives me:
ID Date Date2 Diff
1 2020-08-20 17:25:03 2020-08-21 0.375
1 2020-08-21 17:25:03 2020-08-22 0.357
2 2020-10-04 23:53:43 2020-10-05 0.087
2 2020-10-05 23:53:43 2020-10-06 0.087
2 2020-10-06 23:53:43 2020-10-07 0.087
2 2020-10-07 23:53:43 2020-10-08 0.087
2 2020-10-08 23:53:43 2020-10-09 0.087
...
Strangely enough, it adds two hours to the Admission date but calculates the correct length of stay. Can someone explain?
Here is some data:
structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20), Admission = structure(c(1597937103.872,
1598717768.704, 1599060521.984, 1599758087.168, 1599815496.704,
1600702198.784, 1600719631.36, 1601065923.584, 1601119400.96,
1601215476.736, 1601236710.4, 1601416934.4, 1601499640.832, 1601545647.104,
1601587328, 1601644868.608, 1601741206.528, 1601848423.424, 1601901245.44,
1601913828.352), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
Discharge = structure(c(1598044189.696, 1598897337.344, 1599144670.208,
1599845118.976, 1599842366.464, 1602733683.712, 1603372135.424,
1601125168.128, 1601314173.952, 1605193905.152, 1602190259.2,
1601560720.384, 1601737143.296, 1602705634.304, 1602410460.16,
1602698425.344, 1601770566.656, 1602241377.28, 1602780476.416,
1602612048.896), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
Stay_in_days = c(1.239, 2.078, 0.974, 1.007, 0.311, 23.513,
30.7, 0.686, 2.254, 46.047, 11.036, 1.664, 2.749, 13.426,
9.527, 12.194, 0.34, 4.548, 10.176, 8.081)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
Thanks in advance for your help!
It is a bit crude, but it works:
library(tidyverse)
library(lubridate)
df %>%
pivot_longer(cols = -c(ID, Stay_in_days), names_to = "Event", values_to = "DATE") %>%
group_by(ID) %>%
mutate(dummy = case_when(Event == "Admission" ~ 0,
Event == "Discharge" ~ max(floor(Stay_in_days),1))) %>%
complete(dummy = seq(min(dummy), max(dummy), 1)) %>%
mutate(Event = ifelse(is.na(Event), "Dummy", Event),
DATE = if_else(is.na(DATE), first(DATE)+dummy*24*60*60, DATE),
Stay_in_days = case_when(Event == "Admission" ~ as.numeric(difftime(ceiling_date(DATE, "day"), DATE, units = "days")),
Event == "Discharge" ~ Stay_in_days,
TRUE ~ dummy + as.numeric(difftime(ceiling_date(first(DATE), "day"), first(DATE), units = "days")))) %>%
select(ID, DATE, Stay_in_days)
# A tibble: 199 x 3
# Groups: ID [20]
ID DATE Stay_in_days
<dbl> <dttm> <dbl>
1 1 2020-08-20 15:25:03 0.358
2 1 2020-08-21 21:09:49 1.24
3 2 2020-08-29 16:16:08 0.322
4 2 2020-08-30 16:16:08 1.32
5 2 2020-08-31 18:08:57 2.08
6 3 2020-09-02 15:28:41 0.355
7 3 2020-09-03 14:51:10 0.974
8 4 2020-09-10 17:14:47 0.281
9 4 2020-09-11 17:25:18 1.01
10 5 2020-09-11 09:11:36 0.617
# ... with 189 more rows
Explanation of logic: For the first date in every ID, Stay_in_days gives the duration from the admission date-time to the following midnight. For intermediate dates, it adds 1 to the previous value. For the discharge date, it retains the stay value calculated prior to pivoting. I hope this is what you were after.
Explanation of code: After pivoting longer, I used a dummy column to create the intermediate date-time objects. After that, I mutate the columns to generate the output described above.
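The logic can be checked by hand for a single patient. The following base R sketch uses the admission time and stay of ID 2 from the sample data; it is a simplified re-statement of the rule, not the answer's pipeline, and the variable names are made up for illustration.

```r
# ID 2 from the sample data (times are stored in UTC).
admission    <- as.POSIXct(1598717768.704, origin = "1970-01-01", tz = "UTC")
stay_in_days <- 2.078

# Midnight following the admission.
next_midnight <- as.POSIXct(format(as.Date(admission, tz = "UTC") + 1), tz = "UTC")

# First row: fraction of a day from admission to the following midnight.
first_day <- as.numeric(difftime(next_midnight, admission, units = "days"))

# The discharge row gets dummy index max(floor(stay), 1), as in the answer;
# every index strictly between 0 and that one is an intermediate row.
k <- max(floor(stay_in_days), 1)
intermediate <- if (k > 1) first_day + seq_len(k - 1) else numeric(0)

# Admission row, intermediate rows, then the unchanged stay for the discharge row.
values <- c(first_day, intermediate, stay_in_days)
print(round(values, 3))
```

These values correspond to rows 3 to 5 of the output shown above, which the tibble prints with three significant digits.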
You can achieve this with pivot_longer from tidyr.
Edit: with comments:
df1 <- df %>%
select(ID = ID, date1 = Admission, date2 = Discharge, Stay_in_days) %>% # prepare for pivoting
pivot_longer(
cols = starts_with("date"),
names_to = "Date1",
values_to = "Date",
) %>% # pivot to long format
select(-Date1) %>% # remove temporary Date1
relocate(Stay_in_days, .after = Date) %>% # change column order
group_by(ID) %>%
mutate(idgroup = row_number()) %>% # id within each patient: admission = 1, discharge = 2
mutate(Stay_in_days = replace(Stay_in_days, row_number() == 1, 0)) %>% # set Admission to zero
ungroup()

Compute month on month difference in weights

I am working on some portfolio data and I'm stumped by this data manipulation. I have this sample data
df <- tibble(
date = as.Date(c("2020-01-31", "2020-01-31", "2020-01-31",
"2020-02-29", "2020-02-29", "2020-02-29",
"2020-03-31", "2020-03-31", "2020-03-31") ),
id = c("KO", "AAPL", "MSFT",
"KO", "AAPL", "GOOG",
"KO", "AAPL", "MSFT"),
weight = c(0.3, 0.4, 0.3,
0.5, 0.3, 0.2,
0.6, 0.2, 0.2),
`weight_change (desired column)` = c(NA, NA, NA,
0.2, -0.1, 0.2,
0.1, -0.1, 0.2)
)
These are the positions in a sample portfolio. The portfolio gets new weights every month. What I want to calculate is the change in weight for each item relative to the previous month's weight. In this example, at the end of February, KO's current weight is 0.5, which is up 0.2 from the previous month. AAPL is down 0.1, while GOOG replaces MSFT, so its change from the previous month is its entire current weight: 0.2. How can I set up a mutate that looks up the stock at the previous date and calculates the difference between the weights?
If the data is monthly for each 'id', we can use complete to fill in the missing months, treat the missing weights as 0, and then take the diff within each group:
library(dplyr)
library(tidyr)
library(zoo)
df %>%
mutate(yearmonth = as.Date(as.yearmon(date))) %>%
group_by(id) %>%
complete(yearmonth = seq(first(yearmonth), last(yearmonth), by = '1 month')) %>%
mutate(weight_change = if(n() == 1) weight else c(NA, diff(replace_na(weight, 0)))) %>%
ungroup %>%
select(names(df), weight_change) %>%
filter(!is.na(date))
# A tibble: 9 x 5
# date id weight `weight_change (desired column)` weight_change
# <date> <chr> <dbl> <dbl> <dbl>
#1 2020-01-31 AAPL 0.4 NA NA
#2 2020-02-29 AAPL 0.3 -0.1 -0.1
#3 2020-03-31 AAPL 0.2 -0.1 -0.100
#4 2020-02-29 GOOG 0.2 0.2 0.2
#5 2020-01-31 KO 0.3 NA NA
#6 2020-02-29 KO 0.5 0.2 0.2
#7 2020-03-31 KO 0.6 0.1 0.100
#8 2020-01-31 MSFT 0.3 NA NA
#9 2020-03-31 MSFT 0.2 0.2 0.2
Here is my not so compact solution. I just use some helper columns, which I leave in so that one can follow.
library(tidyverse)
library(lubridate)
df <- tibble(
date = c("2020-01-31", "2020-01-31", "2020-01-31",
"2020-02-29", "2020-02-29", "2020-02-29",
"2020-03-31", "2020-03-31", "2020-03-31"),
id = c("KO", "AAPL", "MSFT", "KO", "AAPL", "GOOG", "KO", "AAPL", "MSFT"),
weight = c(0.3, 0.4, 0.3, 0.5, 0.3, 0.2, 0.6, 0.2, 0.2),
`weight_change (desired_column)` = c(NA, NA, NA, 0.2, -0.1, 0.2, 0.1, -0.1, 0.2)
) %>% #new code starts here
mutate(
date = as_date(date),
date_ym = floor_date(date,
unit = "month"))%>%
group_by(id)%>%
arrange(date)%>%
mutate(id_n = row_number(),
prev_exist = case_when(lag(date_ym) == date_ym - months(1) ~ "immediate month", #if there is an immediate month
id_n == 1 & date != min(df$date)~ "new month", #if this is a new month
TRUE ~ "no immediate month"),
weight_change = case_when(prev_exist == "new month"~ weight,
prev_exist == "no immediate month" & id_n > 1~ weight,
TRUE ~ weight-lag(weight)),
date_ym = NULL,
id_n = NULL,
prev_exist = NULL)
A timetk approach:
library(timetk)
df %>%
mutate(Month = lubridate::floor_date(date, "month")) %>%
group_by(id) %>%
timetk::pad_by_time(.date_var = Month, .by="month") %>%
select(-Month) %>%
mutate(WC = if(n() == 1) weight else c(NA, diff(weight)))
# A tibble: 10 x 5
# Groups: id [4]
id date weight weight_change WC
<chr> <date> <dbl> <dbl> <dbl>
1 KO 2020-01-31 0.3 NA NA
2 KO 2020-02-29 0.5 0.2 0.2
3 KO 2020-03-31 0.6 0.1 0.100
4 AAPL 2020-01-31 0.4 NA NA
5 AAPL 2020-02-29 0.3 -0.1 -0.1
6 AAPL 2020-03-31 0.2 -0.1 -0.100
7 MSFT 2020-01-31 0.3 NA NA
8 MSFT NA NA NA NA
9 MSFT 2020-03-31 0.2 0.2 NA
10 GOOG 2020-02-29 0.2 0.2 0.2
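The rule described in the question (a position that was absent at the previous date counts as weight 0, and the first date has no change) can also be sketched in base R with a keyed lookup. This is a hedged illustration, not one of the answers above: it assumes "previous month" means the previous distinct date in the data, and the helper names prev_date, key and prev_key are made up.

```r
df <- data.frame(
  date = as.Date(c("2020-01-31", "2020-01-31", "2020-01-31",
                   "2020-02-29", "2020-02-29", "2020-02-29",
                   "2020-03-31", "2020-03-31", "2020-03-31")),
  id = c("KO", "AAPL", "MSFT", "KO", "AAPL", "GOOG", "KO", "AAPL", "MSFT"),
  weight = c(0.3, 0.4, 0.3, 0.5, 0.3, 0.2, 0.6, 0.2, 0.2),
  stringsAsFactors = FALSE
)

# Map every date to the previous distinct date (NA for the first one).
dates      <- sort(unique(df$date))
prev_dates <- c(as.Date(NA), head(dates, -1))
df$prev_date <- prev_dates[match(df$date, dates)]

# Look up each stock's weight at the previous date via an "id date" key.
key         <- paste(df$id, df$date)
prev_key    <- paste(df$id, df$prev_date)
prev_weight <- df$weight[match(prev_key, key)]

# A stock that was not held at the previous date starts from weight 0.
prev_weight[is.na(prev_weight)] <- 0

# No change is defined for the very first date.
df$weight_change <- ifelse(is.na(df$prev_date), NA, df$weight - prev_weight)
print(df)
```

Unlike the timetk output above, this gives MSFT a change of 0.2 in March, because its missing February weight is treated as 0.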

tidyverse gather multiple columns

I have the following data frame:
df <- structure(list(ID = 1:4, col1.date = structure(c(1546188000,
1272294300, 1087908540, 1512241620), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), col2.date = structure(c(1546237740, 1272928800,
1087966800, 1512277200), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
col3.date = structure(c(1546323000, 1272949200, 1088049600,
1512396000), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
col1.result = c(1.31, 0.95, 3.3, 0.55), col2.result = c(1.19,
1.57, 1.6, 0.59), col3.result = c(0.97, 2.13, 1.1, 0.57)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -4L))
I would like to have for each ID three rows and two columns: result and date.
This is what I have tried:
df_long <- df %>%
gather(v, value, col1.date:col3.result) %>%
separate(v, c("var", "col"))
However, the dates are being converted to numeric.
What am I doing wrong?
Since you ultimately want to reshape multiple columns at once, try pivot_longer, which is the "new way" as of tidyr 1.0.0. This answer is adapted directly from the example in the help page at ?pivot_longer:
df %>%
pivot_longer(
col1.date:col3.result,
names_to = c("set", ".value"),
names_pattern = "(.*)\\.(.*)"
)
# # A tibble: 12 x 4
# ID set date result
# <int> <chr> <dttm> <dbl>
# 1 1 col1 2018-12-30 16:40:00 1.31
# 2 1 col2 2018-12-31 06:29:00 1.19
# 3 1 col3 2019-01-01 06:10:00 0.97
# 4 2 col1 2010-04-26 15:05:00 0.95
# 5 2 col2 2010-05-03 23:20:00 1.57
# 6 2 col3 2010-05-04 05:00:00 2.13
# 7 3 col1 2004-06-22 12:49:00 3.3
# 8 3 col2 2004-06-23 05:00:00 1.6
# 9 3 col3 2004-06-24 04:00:00 1.1
# 10 4 col1 2017-12-02 19:07:00 0.55
# 11 4 col2 2017-12-03 05:00:00 0.59
# 12 4 col3 2017-12-04 14:00:00 0.570
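As a rough illustration of why the gather attempt turns the dates into numbers: a single value column can hold only one type, so stacking date-times and plain doubles together drops the date-time class. The base R analogy below is an assumption-laden sketch of that constraint, not a description of what gather does internally.

```r
# One date and one result taken from the first row of the question's data.
date_value   <- as.POSIXct(1546188000, origin = "1970-01-01", tz = "UTC")
result_value <- 1.31

# Stacking both into one vector keeps the underlying numbers but drops the
# POSIXct class, leaving seconds since 1970-01-01 next to the result.
stacked <- unlist(list(date_value, result_value))

print(class(date_value))
print(class(stacked))
print(stacked)
```

The ".value" sentinel in names_to avoids this by keeping date and result in separate columns, each with its own type.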

Error : Error in mutate_impl(.data, dots) : Column `three_month` must be length 1 (the group size), not 3766742 dplyr R

My weekly dataset has different state_id values associated with different cities. value1 and value2 need to be aggregated to the monthly level and then to the quarterly level, so I am trying the code below:
library(dplyr)
df <- dataset %>%
group_by(state_id,city_id) %>%
group_by(three_month = round_date(weekly_dt, "quarter")) %>%
summarise_at(vars(starts_with('value')), mean)
But it produces this error:
Error in mutate_impl(.data, dots) :
Column `three_month` must be length 1 (the group size), not 3766742
Note: Not all cities have the same amount of weekly data, which is why I used group_by first.
Can someone help me with this in R?
EDIT: my data
structure(list(city_id = c("B02", "B02", "B02",
"B02", "B02", "B02"), state_id = c(609L, 609L,
609L, 609L, 609L, 609L), weekly_dt = structure(c(17601,
17545, 17447, 17727, 17510, 17664), class = "Date"), value1 = c(0.194669883125,
0.35, 0.35, 0.124875972916667, 0.35, 0.140909438125), value2 = c(0.203018924883721,
0.35, 0.35, 0.35, 0.35, 0.35)), class = c("data.table", "data.frame"
), row.names = c(NA, -6L))
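Use mutate to add the quarter as a new column, then reference it in a single group_by together with state_id and city_id; by default, a second group_by call replaces the earlier grouping rather than adding to it. floor_date may be a better choice than round_date here, because it places every date within a quarter in that same quarter, whereas round_date rounds to the nearest quarter boundary.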
library(dplyr)
library(lubridate)
df <- dataset %>%
mutate(three_month = floor_date(weekly_dt, "quarter")) %>%
group_by(state_id, city_id, three_month) %>%
summarise_at(vars(starts_with('value')), mean)
# A tibble: 4 x 5
# Groups: state_id, city_id [?]
# state_id city_id three_month value1 value2
# <int> <chr> <date> <dbl> <dbl>
# 1 609 B02 2017-10-01 0.350 0.350
# 2 609 B02 2018-01-01 0.272 0.277
# 3 609 B02 2018-04-01 0.141 0.350
# 4 609 B02 2018-07-01 0.125 0.350
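The flooring step can be reproduced without lubridate. The sketch below is a base R approximation under stated assumptions: quarter_start is a hypothetical helper written for this illustration (it is not a lubridate function), and aggregate stands in for the group_by and summarise_at pipeline.

```r
# The six sample rows from the question, without the data.table attributes.
dataset <- data.frame(
  city_id   = rep("B02", 6),
  state_id  = rep(609L, 6),
  weekly_dt = structure(c(17601, 17545, 17447, 17727, 17510, 17664), class = "Date"),
  value1    = c(0.194669883125, 0.35, 0.35, 0.124875972916667, 0.35, 0.140909438125),
  value2    = c(0.203018924883721, 0.35, 0.35, 0.35, 0.35, 0.35),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first day of the quarter that contains each date.
quarter_start <- function(d) {
  lt <- as.POSIXlt(d)
  first_month <- (lt$mon %/% 3L) * 3L + 1L  # 1, 4, 7 or 10
  as.Date(sprintf("%04d-%02d-01", lt$year + 1900L, first_month))
}

dataset$three_month <- quarter_start(dataset$weekly_dt)

# Mean of value1 and value2 per state, city and quarter.
quarterly <- aggregate(
  cbind(value1, value2) ~ state_id + city_id + three_month,
  data = dataset, FUN = mean
)
print(quarterly)
```

On the sample rows this yields four quarters, matching the grouping in the output above.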
