This question already has answers here:
How to create a lag variable within each group?
(5 answers)
Closed 5 years ago.
I have repeat encounters with animals that have a unique IndIDII and a unique GPSSrl number. Each encounter has a FstCptr date.
dat <- structure(list(IndIDII = c("BHS_115", "BHS_115", "BHS_372", "BHS_372",
"BHS_372", "BHS_372"), GPSSrl = c("035665", "036052", "034818",
"035339", "036030", "036059"), FstCptr = structure(c(1481439600,
1450162800, 1426831200, 1481439600, 1457766000, 1489215600), class = c("POSIXct",
"POSIXt"), tzone = "")), .Names = c("IndIDII", "GPSSrl", "FstCptr"
), class = "data.frame", row.names = c(1L, 2L, 29L, 30L, 31L,
32L))
> dat
IndIDII GPSSrl FstCptr
1 BHS_115 035665 2016-12-11
2 BHS_115 036052 2015-12-15
29 BHS_372 034818 2015-03-20
30 BHS_372 035339 2016-12-11
31 BHS_372 036030 2016-03-12
32 BHS_372 036059 2017-03-11
For each IndID-GPSSrl grouping, I want to create a new field (NextCptr) that documents the date of the next encounter. For the last encounter, the new field would be NA, for example:
dat$NextCptr <- as.Date(c("2015-12-15", NA, "2016-12-11", "2016-03-12", "2017-03-11", NA))
> dat
IndIDII GPSSrl FstCptr NextCptr
1 BHS_115 035665 2016-12-11 2015-12-15
2 BHS_115 036052 2015-12-15 <NA>
29 BHS_372 034818 2015-03-20 2016-12-11
30 BHS_372 035339 2016-12-11 2016-03-12
31 BHS_372 036030 2016-03-12 2017-03-11
32 BHS_372 036059 2017-03-11 <NA>
I would like to work within dplyr and group_by(IndIDII, GPSSrl).
As always, many thanks!
Group by column IndIDII then use lead to shift FstCptr forward by one:
dat %>% group_by(IndIDII) %>% mutate(NextCptr = lead(FstCptr))
# A tibble: 6 x 4
# Groups: IndIDII [2]
# IndIDII GPSSrl FstCptr NextCptr
# <chr> <chr> <dttm> <dttm>
#1 BHS_115 035665 2016-12-11 02:00:00 2015-12-15 02:00:00
#2 BHS_115 036052 2015-12-15 02:00:00 NA
#3 BHS_372 034818 2015-03-20 02:00:00 2016-12-11 02:00:00
#4 BHS_372 035339 2016-12-11 02:00:00 2016-03-12 02:00:00
#5 BHS_372 036030 2016-03-12 02:00:00 2017-03-11 02:00:00
#6 BHS_372 036059 2017-03-11 02:00:00 NA
If you need to shift the column in the opposite direction, lag() could also be useful: dat %>% group_by(IndIDII) %>% mutate(NextCptr = lag(FstCptr)).
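Note that lead() operates on row order. If you instead want the next encounter in time (your example output follows row order, so this is optional), lead() accepts an order_by argument; a minimal sketch:
library(dplyr)

# order_by sorts within the lead() computation without reordering the rows
dat %>%
  group_by(IndIDII) %>%
  mutate(NextCptr = lead(FstCptr, order_by = FstCptr)) %>%
  ungroup()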
I am working in R. I have a data frame that consists of Sampling Date and water temperature. I have provided a sample dataframe below:
Date Temperature
2015-06-01 11
2015-08-11 13
2016-01-12 2
2016-07-01 12
2017-01-08 4
2017-08-13 14
2018-03-04 7
2018-09-19 10
2019-08-24 8
Due to the erratic nature of the sampling dates (a consequence of the samplers' ability to access the site), I am unable to classify years normally, January 1st to December 31st, and instead am using the beginning of the sampling period as the start of each year. In this case a year would start June 1st and end May 31st, so that I can accurately compare the years to one another. Thus I want the four years to have the following labels:
Year_One = "2015-06-01" - "2016-05-31"
Year_Two = "2016-06-01" - "2017-05-31"
Year_Three = "2017-06-01" - "2018-05-31"
Year_Four = "2018-06-01" - "2019-08-24"
My goal is to create an additional column with these labels but have thus far been unable to do so.
I create two columns, year1 and year2, with two different approaches. The year2 approach assumes that all the periods start June 1st and end May 31st (in your labels, Year_Four ends 2019-08-24), so it may not be exactly what you need:
library(tidyverse)
library(lubridate)
dt$Date <- as.Date(dt$Date)
dt %>%
mutate(year1= case_when(between(Date, as.Date("2015-06-01") , as.Date("2016-05-31")) ~ "Year_One",
between(Date, as.Date("2016-06-01") , as.Date("2017-05-31")) ~ "Year_Two",
between(Date, as.Date("2017-06-01") , as.Date("2018-05-31")) ~ "Year_Three",
between(Date, as.Date("2018-06-01") , as.Date("2019-08-24")) ~ "Year_Four",
TRUE ~ "0")) %>%
mutate(year2 = paste0(year(Date-months(5)),"/", year(Date-months(5))+1))
The output:
# A tibble: 9 x 4
Date Temperature year1 year2
<date> <dbl> <chr> <chr>
1 2015-06-01 11 Year_One 2015/2016
2 2015-08-11 13 Year_One 2015/2016
3 2016-01-12 2 Year_One 2015/2016
4 2016-07-01 12 Year_Two 2016/2017
5 2017-01-08 4 Year_Two 2016/2017
6 2017-08-13 14 Year_Three 2017/2018
7 2018-03-04 7 Year_Three 2017/2018
8 2018-09-19 10 Year_Four 2018/2019
9 2019-08-24 8 Year_Four 2019/2020
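One caveat with the year2 approach: subtracting months() from a Date gives NA when the resulting day does not exist (e.g. 2019-07-31 minus 5 months would be the nonexistent February 31st). lubridate's %m-% operator rolls back to the last valid day instead; a sketch of the year2 line with that fix:
library(dplyr)
library(lubridate)

dt %>%
  mutate(year2 = paste0(year(Date %m-% months(5)), "/",
                        year(Date %m-% months(5)) + 1))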
Using strftime to get the years, then make a factor with levels based on the unique values. I'd recommend numbers instead of words, because they can be generated automatically; otherwise, use labels=c("one", "two", ...).
d <- within(d, {
year <- strftime(Date, "%Y")
year <- paste("Year", factor(year, labels=seq(unique(year))), sep="_")
})
# Date temperature year
# 1 2017-06-01 1 Year_1
# 2 2017-09-01 2 Year_1
# 3 2017-12-01 3 Year_1
# 4 2018-03-01 4 Year_2
# 5 2018-06-01 5 Year_2
# 6 2018-09-01 6 Year_2
# 7 2018-12-01 7 Year_2
# 8 2019-03-01 8 Year_3
# 9 2019-06-01 9 Year_3
# 10 2019-09-01 10 Year_3
# 11 2019-12-01 11 Year_3
# 12 2020-03-01 12 Year_4
# 13 2020-06-01 13 Year_4
Data:
d <- structure(list(Date = structure(c(17318, 17410, 17501, 17591,
17683, 17775, 17866, 17956, 18048, 18140, 18231, 18322, 18414
), class = "Date"), temperature = 1:13), class = "data.frame", row.names = c(NA,
-13L))
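Note that %Y numbers calendar years, while the question defines years running June 1st to May 31st. A sketch of the same base-R idea aligned to those periods (it assumes strict June-May periods, so a date such as 2019-08-24 would start a new period; start_yr is a helper name introduced here):
d <- within(d, {
  # months before June belong to the period that started the previous year
  start_yr <- as.integer(strftime(Date, "%Y")) -
    (as.integer(strftime(Date, "%m")) < 6)
  year <- paste("Year", match(start_yr, sort(unique(start_yr))), sep = "_")
  start_yr <- NULL  # drop the helper column
})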
I have a dataset of several thousand ICU patients covering several years. Some patients (each with a unique identifier, ID) have had multiple ICU admissions. Each row covers a single ICU admission, and therefore an individual patient may have multiple rows of data.
For each patient, I want to determine whether their ICU admission was:
A readmission during the same hospital stay. This could be identified by an icu_adm time occurring prior to their previous hosp_dis time, or by multiple rows with the same hosp_dis time.
A transfer to a different ICU for management of the same illness. I am defining this as an icu_adm time occurring within 24 hours of the previous hosp_dis time. These patients' icu_dis and hosp_dis times should be the same, as their hospital discharge occurred from ICU.
A new admission of the same patient.
I am able to use lubridate to compare times without difficulty, but I am stuck on how to do the between-row comparisons, especially for patients with multiple ICU admissions (who have new admissions, readmissions, and transfers all in the time period of interest).
Some example data:
ID site icu_adm icu_dis hosp_adm hosp_dis
1 A 2016-02-02 15:38:00 2016-02-06 14:25:00 2016-02-02 15:17:00 2016-02-06 14:25:00
1 B 2016-02-06 16:17:00 2016-02-16 14:16:00 2016-02-06 16:16:00 2016-03-16 17:50:00
2 C 2009-08-09 14:27:00 2009-08-10 15:06:00 2009-08-03 02:51:00 2009-09-02 00:00:00
2 C 2009-08-18 20:32:00 2009-08-27 15:10:00 2009-08-03 02:51:00 2009-09-02 00:00:00
3 A 2010-02-20 21:00:00 2010-03-03 13:00:00 2010-02-18 03:00:00 2010-03-18 15:21:00
3 B 2010-05-05 17:00:00 2010-05-08 09:13:00 2010-05-03 11:21:00 2010-05-20 17:18:00
Desired output would be:
ID … readmission transferred new_adm
1 0 0 1
1 0 1 0
2 0 0 1
2 1 0 0
3 0 0 1
3 0 0 1
I'm not entirely sure this will work with all of your data, but thought this might be helpful.
Using tidyverse (or the dplyr package in this case), you can start by grouping by ID to look at transfers. Based on your definition, if the icu_adm time is within 24 hours of the previous row's discharge time (hosp_dis), then it is considered an ICU transfer. You can use lag() to compare with the previous row, assuming dates/times are in chronological order (if not, you can use arrange() to sort them first).
Next, you can group by ID, hosp_adm, and hosp_dis. This will help look at readmissions. After grouping, all rows of data after the first row (for the same hospital admission) will be considered ICU readmissions.
Then, everything left that is not a transfer or readmission could be considered a new ICU admission.
Let me know if this is what you had in mind.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(transfer = ifelse(abs(difftime(icu_adm, lag(hosp_dis, default = first(hosp_dis)), units = "hours")) < 24, 1, 0)) %>%
group_by(ID, hosp_adm, hosp_dis) %>%
mutate(readmission = ifelse(row_number() > 1, 1, 0),
new_adm = ifelse(transfer != 1 & readmission != 1, 1, 0))
Output
ID site icu_adm icu_dis hosp_adm hosp_dis transfer readmission new_adm
<int> <chr> <dttm> <dttm> <dttm> <dttm> <dbl> <dbl> <dbl>
1 1 A 2016-02-02 15:38:00 2016-02-06 14:25:00 2016-02-02 15:17:00 2016-02-06 14:25:00 0 0 1
2 1 B 2016-02-06 16:17:00 2016-02-16 14:16:00 2016-02-06 16:16:00 2016-03-16 17:50:00 1 0 0
3 2 C 2009-08-09 14:27:00 2009-08-10 15:06:00 2009-08-03 02:51:00 2009-09-02 00:00:00 0 0 1
4 2 C 2009-08-18 20:32:00 2009-08-27 15:10:00 2009-08-03 02:51:00 2009-09-02 00:00:00 0 1 0
5 3 A 2010-02-20 21:00:00 2010-03-03 13:00:00 2010-02-18 03:00:00 2010-03-18 15:21:00 0 0 1
6 3 B 2010-05-05 17:00:00 2010-05-08 09:13:00 2010-05-03 11:21:00 2010-05-20 17:18:00 0 0 1
Data
df <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 3L), site = c("A",
"B", "C", "C", "A", "B"), icu_adm = structure(c(1454427480, 1454775420,
1249828020, 1250627520, 1266699600, 1273078800), tzone = "", class = c("POSIXct",
"POSIXt")), icu_dis = structure(c(1454768700, 1455632160, 1249916760,
1251385800, 1267621200, 1273309980), tzone = "", class = c("POSIXct",
"POSIXt")), hosp_adm = structure(c(1454426220, 1454775360, 1249267860,
1249267860, 1266462000, 1272885660), tzone = "", class = c("POSIXct",
"POSIXt")), hosp_dis = structure(c(1454768700, 1458150600, 1251849600,
1251849600, 1268925660, 1274375880), tzone = "", class = c("POSIXct",
"POSIXt"))), class = "data.frame", row.names = c(NA, -6L))
This question already has answers here:
Collapsing data frame by selecting one row per group
(4 answers)
Remove duplicated rows using dplyr
(6 answers)
Filtering out duplicated/non-unique rows in data.table
(5 answers)
Closed 2 years ago.
I have a dataframe:
Date ID Type Value
2020-08-04 03:00:00 1 active 14
2020-08-04 03:00:00 1 active 15
2020-08-04 03:00:00 2 active 16
2020-08-04 03:00:00 2 passive 17
I want to remove rows which have the same values in the columns Date, ID and Type. The desired result is:
Date ID Type Value
2020-08-04 03:00:00 1 active 14
2020-08-04 03:00:00 2 active 16
2020-08-04 03:00:00 2 passive 17
As you can see, the second row disappeared. How could I do that?
I would suggest creating a global id with paste() and then using duplicated(); negating it keeps the first row of each Date/ID/Type combination:
#Code
mdf[!duplicated(paste(mdf$Date, mdf$ID, mdf$Type)), ]
Output:
                Date ID    Type Value
1 04/08/2020 3:00:00  1  active    14
3 04/08/2020 3:00:00  2  active    16
4 04/08/2020 3:00:00  2 passive    17
Some data used:
#Data
mdf <- structure(list(Date = c("04/08/2020 3:00:00", "04/08/2020 3:00:00",
"04/08/2020 3:00:00", "04/08/2020 3:00:00"), ID = c(1L, 1L, 2L,
2L), Type = c("active", "active", "active", "passive"), Value = 14:17), row.names = c(NA,
-4L), class = "data.frame")
If your goal is to keep the minimum value for a given ID, you can use this dplyr solution:
library(dplyr)

mdf %>%
  group_by(Date, ID, Type) %>%
  mutate(Value = min(Value)) %>%
  unique()
Which gives us:
Date ID Type Value
<chr> <int> <chr> <int>
1 04/08/2020 3:00:00 1 active 14
2 04/08/2020 3:00:00 2 active 16
3 04/08/2020 3:00:00 2 passive 17
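An equivalent with summarise(), which deduplicates as part of the aggregation (a sketch; the .groups argument assumes dplyr >= 1.0):
library(dplyr)

mdf %>%
  group_by(Date, ID, Type) %>%
  summarise(Value = min(Value), .groups = "drop")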
Using dplyr
tble = read.table(text='
S.no Date ID Type Value
1 2020-08-04 03:00:00 1 active 14
2 2020-08-04 03:00:00 1 active 15
3 2020-08-04 03:00:00 2 active 16
4 2020-08-04 03:00:00 2 passive 17')
library(dplyr)
tble %>% distinct(Date, ID, Type, .keep_all=TRUE)
#> S.no Date ID Type Value
#> 1 2020-08-04 03:00:00 1 active 14
#> 3 2020-08-04 03:00:00 2 active 16
#> 4 2020-08-04 03:00:00 2 passive 17
Created on 2020-09-04 by the reprex package (v0.3.0)
I am trying to use group_by and then summarise with a date-difference calculation. I am not sure if it is a runtime error or something wrong in what I am doing: sometimes when I run the code I get the output in days and other times in seconds, and I am not sure what causes the change, as I am not changing the dataset or the code. The dataset I am using is huge (2,304,433 rows and 40 columns). Both times the output values (digits) are the same; only the unit label changes (days to secs). I would like to see the output in days.
This is the code that I am using:
data %>%
group_by(PRODUCT,PERSON_ID) %>%
summarise(Freq = n(),
Revenue = max(TOTAL_AMT + 0.000001/QUANTITY),
No_Days = (max(ORDER_DT) - min(ORDER_DT) + 1)/n())
This is the output (screenshot not reproduced here): the No_Days column sometimes prints in days and sometimes in secs.
Can anyone please help me with this?
Use difftime() and specify the units explicitly. Subtracting two POSIXct values calls difftime() with units = "auto", which picks seconds, minutes, hours, or days depending on the size of the gap; that is why your output switches between days and secs.
set.seed(314)
data <- data.frame(PRODUCT = sample(1:10, size = 10000, replace = TRUE),
PERSON_ID = sample(1:10, size = 10000, replace = TRUE),
ORDER_DT = as.POSIXct(as.Date('2019/01/01') + sample(-300:+300, size = 10000, replace = TRUE)))
require(dplyr)
data %>%
group_by(PRODUCT,PERSON_ID) %>%
summarise(Freq = n(),
start = min(ORDER_DT),
end = max(ORDER_DT)) %>%
mutate(No_Days = (as.double(difftime(end, start, units = "days"), units = "days")+1)/Freq)
gives:
PRODUCT PERSON_ID Freq start end No_Days
<int> <int> <int> <dttm> <dttm> <dbl>
1 1 1 109 2018-03-21 01:00:00 2019-10-27 02:00:00 5.38
2 1 2 117 2018-03-23 01:00:00 2019-10-26 02:00:00 4.98
3 1 3 106 2018-03-19 01:00:00 2019-10-28 01:00:00 5.56
4 1 4 109 2018-03-07 01:00:00 2019-10-26 02:00:00 5.50
5 1 5 95 2018-03-07 01:00:00 2019-10-16 02:00:00 6.2
6 1 6 79 2018-03-09 01:00:00 2019-10-04 02:00:00 7.28
7 1 7 83 2018-03-09 01:00:00 2019-10-28 01:00:00 7.22
8 1 8 114 2018-03-09 01:00:00 2019-10-16 02:00:00 5.15
9 1 9 100 2018-03-09 01:00:00 2019-10-13 02:00:00 5.84
10 1 10 91 2018-03-11 01:00:00 2019-10-26 02:00:00 6.54
# ... with 90 more rows
Why is the value divided by n()?
A simple as.integer(max(ORDER_DT) - min(ORDER_DT)) should work, but if it doesn't, please be more specific and update the question with more information.
Also, when working with datetime values it's good to know the lubridate library.
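For example, a sketch of the summary above using lubridate, where time_length() always returns the requested unit regardless of how a difftime would print (the .groups argument assumes dplyr >= 1.0):
library(dplyr)
library(lubridate)

data %>%
  group_by(PRODUCT, PERSON_ID) %>%
  summarise(Freq = n(),
            No_Days = (time_length(interval(min(ORDER_DT), max(ORDER_DT)),
                                   unit = "days") + 1) / n(),
            .groups = "drop")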
I have a data frame with missing values for "SNAP_ID". I'd like to fill in the missing values with floating point values based on a sequence from the previous non-missing value (lag()?). I would really like to achieve this using just dplyr if possible.
Assumptions:
There will never be missing data in the first or last row; I'm generating the missing dates based on the missing days between the min and max of the data set.
There can be multiple gaps in the data set
Current data:
end SNAP_ID
1 2015-06-26 12:59:00 365
2 2015-06-26 13:59:00 366
3 2015-06-27 00:01:00 NA
4 2015-06-27 23:00:00 NA
5 2015-06-28 00:01:00 NA
6 2015-06-28 23:00:00 NA
7 2015-06-29 09:00:00 367
8 2015-06-29 09:59:00 368
What I want to achieve:
end SNAP_ID
1 2015-06-26 12:59:00 365.0
2 2015-06-26 13:59:00 366.0
3 2015-06-27 00:01:00 366.1
4 2015-06-27 23:00:00 366.2
5 2015-06-28 00:01:00 366.3
6 2015-06-28 23:00:00 366.4
7 2015-06-29 09:00:00 367.0
8 2015-06-29 09:59:00 368.0
As a data frame:
df <- structure(list(end = structure(c(1435323540, 1435327140, 1435363260,
1435446000, 1435449660, 1435532400, 1435568400, 1435571940), tzone = "UTC", class = c("POSIXct",
"POSIXt")), SNAP_ID = c(365, 366, NA, NA, NA, NA, 367, 368)), .Names = c("end",
"SNAP_ID"), row.names = c(NA, -8L), class = "data.frame")
This was my attempt at achieving this goal, but it only works for the first missing value:
df %>%
arrange(end) %>%
mutate(SNAP_ID=ifelse(is.na(SNAP_ID),lag(SNAP_ID)+0.1,SNAP_ID))
end SNAP_ID
1 2015-06-26 12:59:00 365.0
2 2015-06-26 13:59:00 366.0
3 2015-06-27 00:01:00 366.1
4 2015-06-27 23:00:00 NA
5 2015-06-28 00:01:00 NA
6 2015-06-28 23:00:00 NA
7 2015-06-29 09:00:00 367.0
8 2015-06-29 09:59:00 368.0
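(This only fills the first NA of each run because lag(SNAP_ID) is evaluated on the original column, so every NA after the first still sees an NA to its left.)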
The outstanding answer from @mathematical.coffee is below:
df %>%
arrange(end) %>%
group_by(tmp=cumsum(!is.na(SNAP_ID))) %>%
mutate(SNAP_ID=SNAP_ID[1] + 0.1*(0:(length(SNAP_ID)-1))) %>%
ungroup() %>%
select(-tmp)
EDIT: new version works for any number of NA runs.
This one doesn't need zoo, either.
First, notice that tmp=cumsum(!is.na(SNAP_ID)) groups the SNAP_IDs such that each group with the same tmp consists of one non-NA value followed by a run of NA values.
Then group by this variable and just add .1 to the first SNAP_ID to fill out the NAs:
df %>%
arrange(end) %>%
group_by(tmp=cumsum(!is.na(SNAP_ID))) %>%
mutate(SNAP_ID=SNAP_ID[1] + 0.1*(0:(length(SNAP_ID)-1)))
end SNAP_ID tmp
1 2015-06-26 12:59:00 365.0 1
2 2015-06-26 13:59:00 366.0 2
3 2015-06-27 00:01:00 366.1 2
4 2015-06-27 23:00:00 366.2 2
5 2015-06-28 00:01:00 366.3 2
6 2015-06-28 23:00:00 366.4 2
7 2015-06-29 09:00:00 367.0 3
8 2015-06-29 09:59:00 368.0 4
Then you can drop the tmp column afterwards (add %>% select(-tmp) to the end).
EDIT: this is the old version which doesn't work for subsequent runs of NAs.
If your aim is to fill each NA with the previous value + 0.1, you can use zoo's na.locf (which fills each NA with the previous value), along with cumsum(is.na(SNAP_ID))*0.1 to add the extra 0.1.
library(zoo)
df %>%
arrange(end) %>%
mutate(SNAP_ID=ifelse(is.na(SNAP_ID),
na.locf(SNAP_ID) + cumsum(is.na(SNAP_ID))*0.1,
SNAP_ID))
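For reference, a fixed variant of this zoo idea that does handle multiple runs, restarting the 0.1 counter at each non-NA value (a sketch; it reuses the cumsum grouping trick from above inside ave()):
library(dplyr)
library(zoo)

df %>%
  arrange(end) %>%
  # na.locf carries the last observation forward (no leading NAs,
  # per the question's assumptions); ave() counts position within each run
  mutate(SNAP_ID = na.locf(SNAP_ID) +
           0.1 * ave(as.numeric(is.na(SNAP_ID)),
                     cumsum(!is.na(SNAP_ID)), FUN = cumsum))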