My data set looks like this:
ID start.date end.date program
1 2016.05.05 2017.05.05 A
1 2017.05.06 2019.06.16 A
2 2012.06.05 2013.06.18 B
3 2014.09.09 2017.07.01 B
3 2017.09.09 2018.09.09 B
I want to identify the people who were present in a program (character variable) consecutively, and then calculate the time between each end.date and start.date (if the occurrence was consecutive).
So the resulting data should look like this:
ID start.date end.date program days
1 2016.05.05 2017.05.05 A NA
1 2017.05.06 2019.06.16 A 1
2 2012.06.05 2013.06.18 B NA
3 2014.09.09 2017.07.01 B NA
3 2017.09.09 2018.09.09 B 63
Don't know how to start on this!
library(dplyr)
dat %>%
group_by(ID, program) %>%
arrange(start.date) %>% # Added in case the data isn't sorted
mutate(days = start.date - lag(end.date))
I get slightly different results, though (for ID 3, the gap from 2017-07-01 to 2017-09-09 is 70 days, not 63):
# A tibble: 5 x 5
# Groups: ID, program [3]
ID start.date end.date program days
<int> <date> <date> <chr> <time>
1 1 2016-05-05 2017-05-05 A NA
2 1 2017-05-06 2019-06-16 A 1
3 2 2012-06-05 2013-06-18 B NA
4 3 2014-09-09 2017-07-01 B NA
5 3 2017-09-09 2018-09-09 B 70
To bring the data in, I converted to dates:
dat <- read.table(header = T, stringsAsFactors = F,
text = "ID start.date end.date program
1 2016.05.05 2017.05.05 A
1 2017.05.06 2019.06.16 A
2 2012.06.05 2013.06.18 B
3 2014.09.09 2017.07.01 B
3 2017.09.09 2018.09.09 B") %>%
mutate_at(vars(matches("date")), lubridate::ymd)
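One small follow-up note, not part of the original answer: days comes out as a difftime. If a plain numeric column is preferred, the same pipeline can convert it explicitly, e.g.:
dat %>%
  group_by(ID, program) %>%
  arrange(start.date) %>%
  # as.numeric() on a difftime with units = "days" gives a plain number
  mutate(days = as.numeric(start.date - lag(end.date), units = "days"))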
Thank you, experts, for previous answers (How to filter by range of dates in R?).
I am still having some problems dealing with the data.
Example:
id q date
a 1 01/01/2021
a 1 01/01/2021
a 1 21/01/2021
a 1 21/01/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
My idea is to eliminate the observations that have more than 3 "units" in a period of 30 days. That is, if "a" has a unit "q" on "12/02/2021" (dd/mm/yyyy): (a) if between 12/01/2021 and 12/02/2021 there are already 3 observations, it must be deleted; (b) if there are fewer than 3, this one must remain.
My expected result is:
id q date
a 1 01/01/2021
a 1 01/01/2021
a 1 21/01/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
With this code:
df <- df %>%
mutate(day = dmy(date)) %>%
group_by(id) %>%
arrange(day, .by_group = TRUE) %>%
mutate(diff = day - first(day)) %>%
mutate(row = row_number()) %>%
filter(row <= 3 | !diff < 30)
But the result is:
id q date day diff row
a 1 1/1/2021 1/1/2021 0 1
a 1 1/1/2021 1/1/2021 0 2
a 1 21/1/2021 21/1/2021 20 3
a 1 12/2/2021 12/2/2021 42 5
a 1 12/2/2021 12/2/2021 42 6
a 1 12/2/2021 12/2/2021 42 7
a 1 12/2/2021 12/2/2021 42 8
The main problem is that the diff variable must count days in periods of 30 days starting from the last day of the previous 30-day period, not from the first observation day.
Any help? Thanks
Using floor_date, it is quite straightforward:
library(lubridate)
library(dplyr)
df %>%
group_by(floor = floor_date(date, '30 days')) %>%
slice_head(n = 3) %>%
ungroup() %>%
select(-floor)
# A tibble: 6 x 3
id q date
<chr> <int> <date>
1 a 1 2021-01-01
2 a 1 2021-01-01
3 a 1 2021-01-21
4 a 1 2021-02-12
5 a 1 2021-02-12
6 a 1 2021-02-12
data
df <- read.table(header = T, text = "id q date
a 1 01/01/2021
a 1 01/01/2021
a 1 21/01/2021
a 1 21/01/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021")
df$date<-as.Date(df$date, format = "%d/%m/%Y")
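If you want to see which 30-day bin each row falls into before slicing (and, if there were several ids, to keep the bins separate per id), a quick inspection along these lines may help; it is only a sketch using the sample df above:
library(lubridate)
library(dplyr)
# Count rows per 30-day bin created by floor_date(); grouping on id as well
# keeps the bins separate for each id.
df %>%
  mutate(floor = floor_date(date, '30 days')) %>%
  count(id, floor)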
I have a big dataset of about 4 million rows.
The columns are:
Idx - dog serial number
date - date of event, YYYY-MM-DD (2016 to 2021)
Is_sterilized - 1 if the dog was sterilized, 0 if not sterilized.
Each dog can appear many times in a year; it can appear in 2016 and 2020 but not in 2017-2019.
I want to count how many dogs were sterilized each year: if a dog changes from Is_sterilized == 0 to Is_sterilized == 1 in a year, I count it as sterilized that year, and the first year it appears sterilized is counted as its year of sterilization.
The issue is that my database is not clean, and some dogs go from sterilized to not sterilized, which cannot happen since sterilization is a one-way surgery.
It can happen that a dog appears sterilized for 3 consecutive years, then one year by mistake appears unsterilized, and then sterilized for 2 years.
What I'm asking is whether there is a logic by which I can estimate/count how many dogs have records in the wrong direction.
And if so, how can I remove those dogs from my dataset?
In the example data, Idx = A and C make sense, but B and D do not make sense:
df_test <- data.frame(Idx = c('A', 'B', 'B', 'B', 'A', 'A', 'C', 'C', 'D', 'D', 'D', 'D', 'D', 'D', 'C', 'C', 'A'),
                      YEAR_date = as.Date(c("2016-01-01", "2016-01-29", "2017-01-01", "2016-05-01", "2016-05-06",
                                            "2016-05-01", "2016-03-03", "2016-04-22", "2018-05-05", "2017-02-01",
                                            "2021-11-12", "2019-09-13", "2019-11-12", "2019-08-17", "2011-09-01",
                                            "2011-07-05", "2021-01-05")),
                      Is_sterilized = c(0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1))
df_test[, c("Idx", "YEAR_date", "Is_sterilized")] %>% arrange(Idx, YEAR_date)
Idx YEAR_date Is_sterilized
1 A 2016-01-01 0
2 A 2016-05-01 1
3 A 2016-05-06 1
4 A 2021-01-05 1
5 B 2016-01-29 1
6 B 2016-05-01 1
7 B 2017-01-01 0
8 C 2011-07-05 1
9 C 2011-09-01 1
10 C 2016-03-03 1
11 C 2016-04-22 1
12 D 2017-02-01 1
13 D 2018-05-05 1
14 D 2019-08-17 1
15 D 2019-09-13 1
16 D 2019-11-12 0
17 D 2021-11-12 0
I have more columns; if you think anything else is relevant, please say so and I'll check whether I have it.
Any hint or idea will be helpful.
Thank you in advance.
Here's some dplyr code to identify instances where a dog's sterilization went from 1 to zero:
library(dplyr)
df_test %>%
group_by(Idx) %>%
mutate(change = Is_sterilized-lag(Is_sterilized, default = 0)) %>%
filter(change == -1) %>%
ungroup()
# A tibble: 3 x 4
Idx YEAR_date Is_sterilized change
<chr> <date> <dbl> <dbl>
1 B 2017-01-01 0 -1
2 D 2021-11-12 0 -1
3 D 2019-11-12 0 -1
If you want to count the number of dogs in that list, add %>% count(Idx) at the end.
df_test %>%
group_by(Idx) %>%
mutate(change = Is_sterilized-lag(Is_sterilized, default = 0)) %>%
filter(change == -1) %>%
ungroup() %>%
count(Idx, name = "times_desterilized")
# A tibble: 2 x 2
Idx times_desterilized
<chr> <int>
1 B 1
2 D 2
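The question also asks how to remove those dogs. A hedged sketch building on the same flag (this is not from the original answer; it also sorts by date within each dog first, which the code above leaves to row order):
library(dplyr)
# Dogs with at least one sterilized -> not-sterilized record, in date order
bad_dogs <- df_test %>%
  group_by(Idx) %>%
  arrange(YEAR_date, .by_group = TRUE) %>%
  mutate(change = Is_sterilized - lag(Is_sterilized, default = 0)) %>%
  filter(change == -1) %>%
  distinct(Idx)
# Keep only the dogs whose history is consistent
df_clean <- df_test %>% anti_join(bad_dogs, by = "Idx")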
If I had:
person_ID visit_date
1 2/25/2001
1 2/27/2001
1 4/2/2001
2 3/18/2004
3 9/22/2004
3 10/27/2004
3 5/15/2008
and I wanted another column to indicate the earliest recurring observation within 90 days, grouped by patient ID, with the desired output:
person_ID visit_date date
1 2/25/2001 2/27/2001
1 2/27/2001 4/2/2001
1 4/2/2001 NA
2 3/18/2004 NA
3 9/22/2004 10/27/2004
3 10/27/2004 NA
3 5/15/2008 NA
Thank you!
We convert 'visit_date' to Date class, then, grouped by 'person_ID', create a binary column that returns 1 if the difference between the next and current visit_date is less than 90 days and 0 otherwise; using this column, we get the corresponding next visit_date where the value is 1.
library(dplyr)
library(lubridate)
library(tidyr)
df1 %>%
mutate(visit_date = mdy(visit_date)) %>%
group_by(person_ID) %>%
mutate(i1 = replace_na(+(difftime(lead(visit_date),
visit_date, units = 'day') < 90), 0),
date = case_when(as.logical(i1)~ lead(visit_date)), i1 = NULL ) %>%
ungroup
-output
# A tibble: 7 x 3
# person_ID visit_date date
# <int> <date> <date>
#1 1 2001-02-25 2001-02-27
#2 1 2001-02-27 2001-04-02
#3 1 2001-04-02 NA
#4 2 2004-03-18 NA
#5 3 2004-09-22 2004-10-27
#6 3 2004-10-27 NA
#7 3 2008-05-15 NA
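If the i1 helper feels dense, here is a more explicit sketch of the same idea (my rewording, not the original answer; it assumes the data is in df1 as above), using only lead() and if_else():
library(dplyr)
library(lubridate)
df1 %>%
  mutate(visit_date = mdy(visit_date)) %>%
  group_by(person_ID) %>%
  mutate(next_visit = lead(visit_date),   # the following visit within the same person
         date = if_else(difftime(next_visit, visit_date, units = "days") < 90,
                        next_visit, as.Date(NA))) %>%
  select(-next_visit) %>%
  ungroup()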
I'm a complete beginner to R and I just need to do some quick cleaning of my data. But I ran into a problem I can't wrap my head around.
So I have a Postgres db with time series. The columns are ID, DATE and VALUE (temperature). Each ID is a measuring station, so I have a time series for each ID (around 2000 unique IDs, 4m rows). The dates span 1915-2016; some series overlap, some do not. If a measurement is missing for a week, I want to fill that week with an NA value (which I interpolate afterwards).
The problem I run into is that complete(DATE = seq(...)) creates NA rows for all weeks between 1915 and 2016, and I clearly understand why that happens. How can I make it fill values only between the actual start and end date of each specific time series? I want a min and max that depend on the start and end date of each specific ID, and then to fill the missing dates between the start and end date of each ID.
library("RpostgreSQL")
library("tidyverse")
library("lubridate")
con <- dbConnect(PostgreSQL(), user = "postgres",
dbname="", password = "", host = "localhost", port= "5432")
out <- dbGetQuery(con, "SELECT * FROM *******.Weekly_series")
out %>%
group_by(ID)%>%
mutate(DATE = as.Date(DATE)) %>%
complete(DATE = seq(ymd("1915-04-14"), ymd("2016-03-30"), by= "week"))
Ignore errors in the connect line.
Thanks in advance.
Edit1
Sample data
ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1
Expected output
ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-22 NA
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-08 NA
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-08 NA
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1
Using the data you provided, this works. I don't know why it works here and your whole code does not, but possibly in your code the data structure is not what is needed. If so, something like out <- tibble::as_tibble(out) might work. My other guess is that complete isn't drawing from the package you need; using tidyr::complete works on the sample.
library(lubridate)
library(dplyr)
library(tidyr)
a <- "ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1"
df <- read.table(text = a, header = TRUE)
big_df1 <- df %>%
filter(ID == 1)%>%
mutate(DATE = as.Date(DATE)) %>%
tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by= "week"))
big_df2 <- df %>%
filter(ID == 2)%>%
mutate(DATE = as.Date(DATE)) %>%
tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by= "week"))
big_df3 <- df %>%
filter(ID == 3)%>%
mutate(DATE = as.Date(DATE)) %>%
tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by= "week"))
big_df <- rbind(big_df1, big_df2, big_df3)
big_df
DATE ID VALUE
<date> <int> <int>
1 2015-10-01 1 1
2 2015-10-08 1 1
3 2015-10-15 1 1
4 2015-10-22 NA NA
5 2015-10-29 1 1
6 1956-01-01 2 1
7 1956-01-08 NA NA
8 1956-01-15 2 1
9 1956-01-22 2 1
10 1982-01-01 3 1
11 1982-01-08 NA NA
12 1982-01-15 3 1
13 1982-01-22 3 1
14 1982-01-29 3 1
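As a hedged alternative to filtering each ID by hand: tidyr::complete() respects dplyr groups, so grouping by ID first should give each series its own min/max date range. A sketch under that assumption, using the same sample df:
library(dplyr)
library(tidyr)
df %>%
  mutate(DATE = as.Date(DATE)) %>%
  group_by(ID) %>%
  # min() and max() are evaluated per ID, so each series is only
  # completed between its own first and last week
  complete(DATE = seq(min(DATE), max(DATE), by = "week")) %>%
  ungroup()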
I have a data set of this format
Order_Name Frequency Order_Dt
A 2 2016-01-20
A 2 2016-05-01
B 1 2016-02-12
C 3 2016-03-04
C 3 2016-07-01
C 3 2016-08-09
I need to find the average difference between the dates of those orders which have been placed more than once, i.e., frequency > 1.
require(dplyr)
# loading the data
df0 <- read.table(text =
'Order_Name Frequency Order_Dt
A 2 2016-01-20
A 2 2016-05-01
B 1 2016-02-12
C 3 2016-03-04
C 3 2016-07-01
C 3 2016-08-09',
stringsAsFactors = F,
header = T)
# putting the date in the right format
df0$Order_Dt <- as.Date(df0$Order_Dt)
# obtaining the averages (group by order before taking the lag,
# so the date differences are computed within each order)
df0 %>% filter(Frequency > 1) %>%
arrange(Order_Name, Order_Dt) %>%
group_by(Order_Name) %>%
mutate(diff_date = Order_Dt - lag(Order_Dt)) %>%
summarise(avg_days = mean(diff_date, na.rm = T))
# A tibble: 2 × 2
Order_Name avg_days
<chr> <time>
1 A 102 days
2 C 79 days