I have a big dataset of about 4 million rows.
The columns are:
Idx - dog serial number
date - date of event, YYYY-MM-DD (2016 through 2021)
Is_sterilized - 1 if the dog was sterilized, 0 if not
Each dog can appear many times in a year, and it may appear in 2016 and 2020 but not in 2017-2019.
I want to count how many dogs were sterilized each year. That is, if a dog changes from Is_sterilized == 0 to Is_sterilized == 1 in a year, I count it as sterilized in that year; the first year it appears sterilized counts as its year of sterilization.
The issue is that my database is not clean: some dogs go from sterilized to not sterilized, which cannot happen, since sterilization is a one-way surgery.
For example, a dog can appear sterilized for 3 consecutive years, then one year (by mistake) unsterilized, and then sterilized again for 2 years.
What I'm asking is whether there is a logic by which I can estimate/count how many dogs have records going in the wrong direction.
And if so, how can I identify and remove those dogs from my dataset?
In the example data, Idx = A and C make sense, but B and D do not.
library(dplyr)

df_test <- data.frame(
  Idx = c('A','B','B','B','A','A','C','C','D','D','D','D','D','D','C','C','A'),
  YEAR_date = as.Date(c("2016-01-01","2016-01-29","2017-01-01","2016-05-01","2016-05-06",
                        "2016-05-01","2016-03-03","2016-04-22","2018-05-05","2017-02-01",
                        "2021-11-12","2019-09-13","2019-11-12","2019-08-17","2011-09-01",
                        "2011-07-05","2021-01-05")),
  Is_sterilized = c(0,1,0,1,1,1,1,1,1,1,0,1,0,1,1,1,1)
)

df_test[, c("Idx", "YEAR_date", "Is_sterilized")] %>% arrange(Idx, YEAR_date)
Idx YEAR_date Is_sterilized
1 A 2016-01-01 0
2 A 2016-05-01 1
3 A 2016-05-06 1
4 A 2021-01-05 1
5 B 2016-01-29 1
6 B 2016-05-01 1
7 B 2017-01-01 0
8 C 2011-07-05 1
9 C 2011-09-01 1
10 C 2016-03-03 1
11 C 2016-04-22 1
12 D 2017-02-01 1
13 D 2018-05-05 1
14 D 2019-08-17 1
15 D 2019-09-13 1
16 D 2019-11-12 0
17 D 2021-11-12 0
I have more columns; if you think anything else is relevant, please say so and I'll check whether I have it.
Any hint or idea will be helpful.
Thank you in advance.
Here's some dplyr code to identify rows where a dog's sterilization status went from 1 back to 0. Note that the rows must be sorted by date within each dog first, otherwise lag() compares the wrong observations:
library(dplyr)
df_test %>%
  arrange(Idx, YEAR_date) %>%  # lag() relies on the rows being in date order
  group_by(Idx) %>%
  mutate(change = Is_sterilized - lag(Is_sterilized, default = 0)) %>%
  filter(change == -1) %>%
  ungroup()
# A tibble: 2 x 4
  Idx   YEAR_date  Is_sterilized change
  <chr> <date>             <dbl>  <dbl>
1 B     2017-01-01             0     -1
2 D     2019-11-12             0     -1
If you want to see which dogs those are and how many times each one flips back, add %>% count(Idx) at the end:
df_test %>%
  arrange(Idx, YEAR_date) %>%
  group_by(Idx) %>%
  mutate(change = Is_sterilized - lag(Is_sterilized, default = 0)) %>%
  filter(change == -1) %>%
  ungroup() %>%
  count(Idx, name = "times_desterilized")
# A tibble: 2 x 2
  Idx   times_desterilized
  <chr>              <int>
1 B                      1
2 D                      1
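From there, a minimal sketch of the follow-up steps, assuming you prefer to drop the contradictory dogs entirely rather than repair them: remove every dog that ever flips from 1 back to 0, then take each remaining dog's first sterilized record as its year of sterilization.
library(dplyr)
library(lubridate)

# keep only dogs whose records never go backwards (sterilized -> unsterilized)
clean <- df_test %>%
  arrange(Idx, YEAR_date) %>%
  group_by(Idx) %>%
  filter(!any(Is_sterilized < lag(Is_sterilized, default = 0))) %>%
  ungroup()

# first year each remaining dog appears sterilized, then dogs sterilized per year
clean %>%
  filter(Is_sterilized == 1) %>%
  group_by(Idx) %>%
  summarise(sterilization_year = year(min(YEAR_date))) %>%
  count(sterilization_year, name = "dogs_sterilized")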
Related
Thank you, experts, for previous answers (How to filter by range of dates in R?).
I am still having some problems dealing with the data.
Example:
id q date
a 1 01/01/2021
a 1 01/01/2021
a 1 21/01/2021
a 1 21/01/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
My idea is to eliminate the observations that have more than 3 "units" in a period of 30 days. That is, if "a" has a unit "q" on "12/02/2021" [dd/mm/yyyy]: (a) if between 12/01/2021 and 12/02/2021 there are already 3 observations, it must be deleted; (b) if there are fewer than 3, this one must remain.
My expected result is:
id q date
a 1 01/01/2021
a 1 01/01/2021
a 1 21/01/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
With this code:
library(dplyr)
library(lubridate)

df <- df %>%
  mutate(day = dmy(date)) %>%
  group_by(id) %>%
  arrange(day, .by_group = TRUE) %>%
  mutate(diff = day - first(day)) %>%
  mutate(row = row_number()) %>%
  filter(row <= 3 | !diff < 30)
But the result is:
id q date day diff row
a 1 1/1/2021 1/1/2021 0 1
a 1 1/1/2021 1/1/2021 0 2
a 1 21/1/2021 21/1/2021 20 3
a 1 12/2/2021 12/2/2021 42 5
a 1 12/2/2021 12/2/2021 42 6
a 1 12/2/2021 12/2/2021 42 7
a 1 12/2/2021 12/2/2021 42 8
The main problem is that the diff variable must count days in consecutive 30-day periods, each one starting where the previous 30-day period ended, not from the first observation day.
Any help? Thanks
Using floor_date it is quite straightforward:
library(lubridate)
library(dplyr)
df %>%
group_by(floor = floor_date(date, '30 days')) %>%
slice_head(n = 3) %>%
ungroup() %>%
select(-floor)
# A tibble: 6 x 3
id q date
<chr> <int> <date>
1 a 1 2021-01-01
2 a 1 2021-01-01
3 a 1 2021-01-21
4 a 1 2021-02-12
5 a 1 2021-02-12
6 a 1 2021-02-12
data
df <- read.table(header = T, text = "id q date
a 1 01/01/2021
a 1 01/01/2021
a 1 21/01/2021
a 1 21/01/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021")
df$date<-as.Date(df$date, format = "%d/%m/%Y")
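Note that floor_date() cuts the calendar into fixed blocks rather than counting 30 days from each id's first observation. If the periods should start at the first observation, as the question suggests, here is a sketch under that assumption:
library(dplyr)

df %>%
  group_by(id) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(block = as.integer(date - first(date)) %/% 30) %>%  # consecutive 30-day windows per id
  group_by(id, block) %>%
  slice_head(n = 3) %>%  # keep at most 3 observations per window
  ungroup() %>%
  select(-block)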
I am trying to count how many consecutive previous days a value appears (across 2 columns). In the example this means counting the consecutive days a team made an appearance (in either Hteam or Ateam) prior to that date. The aim is to produce additional columns for both the home and away teams showing these values.
Test data:
data<- data.frame(
Date= c("2018-01-01", "2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04", "2018-01-05"),
Hteam= c("A","D","B","A","C","A"),
Ateam= c("B","C","A","C","B","C"))
Date Hteam Ateam
1 2018-01-01 A B
2 2018-01-01 D C
3 2018-01-02 B A
4 2018-01-03 A C
5 2018-01-04 C B
6 2018-01-05 A C
The aim would end up looking like:
Date Hteam Ateam Hdays Adays
1 2018-01-01 A B 0 0
2 2018-01-01 D C 0 0
3 2018-01-02 B A 1 1
4 2018-01-03 A C 2 0
5 2018-01-04 C B 1 0
6 2018-01-05 A C 0 2
In my searching I haven't found an example close enough to adapt to this situation. I feel like I should be using rollapply or a dplyr grouping, but I can't get close to a solution.
Thanks.
Maybe the following gives what you want, assuming that the data is sorted by Date and missing days are not considered.
t1 <- unique(unlist(data[-1]))  # all team names
# for each date, a logical row over t1: which teams appeared that day
t2 <- do.call(rbind, lapply(split(data[-1], data$Date), function(x) t1 %in% unlist(x)))
# per team, running count of consecutive appearance days (reset at gaps), minus 1
t3 <- apply(t2, 2, function(x) ave(x, cumsum(!x), FUN = cumsum)) - 1
# look up each row's home and away streak by date and team
data.frame(data
  , Hdays = t3[cbind(match(data$Date, rownames(t3)), match(data$Hteam, t1))]
  , Adays = t3[cbind(match(data$Date, rownames(t3)), match(data$Ateam, t1))])
# Date Hteam Ateam Hdays Adays
#1 2018-01-01 A B 0 0
#2 2018-01-01 D C 0 0
#3 2018-01-02 B A 1 1
#4 2018-01-03 A C 2 0
#5 2018-01-04 C B 1 0
#6 2018-01-05 A C 0 2
I think your expected output is incorrect. Namely, row 5's "C" occurs twice above it, but has a 1.
Here's a tidyverse version:
library(dplyr)
library(tidyr)
data %>%
mutate(rn = row_number()) %>%
pivot_longer(-c(Date, rn), names_to = "x", values_to = "team") %>%
mutate(x = gsub("team$", "", x)) %>%
group_by(team) %>%
mutate(days = row_number() - 1) %>%
ungroup() %>%
pivot_wider(id_cols = c(Date, rn), names_from = x, values_from = c(team, days)) %>%
select(-rn)
# # A tibble: 6 x 5
# Date team_H team_A days_H days_A
# <chr> <chr> <chr> <dbl> <dbl>
# 1 2018-01-01 A B 0 0
# 2 2018-01-01 D C 0 0
# 3 2018-01-02 B A 1 1
# 4 2018-01-03 A C 2 1
# 5 2018-01-04 C B 2 2
# 6 2018-01-05 A C 3 3
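For completeness, here is a hedged sketch that counts strictly consecutive calendar days (a one-day gap resets the streak), which reproduces the asker's expected output, including the 1 for "C" in row 5:
library(dplyr)
library(tidyr)

# one row per team per day it appeared, with the length of its current run of days
streaks <- data %>%
  mutate(Date = as.Date(Date)) %>%
  pivot_longer(c(Hteam, Ateam), values_to = "team") %>%
  distinct(team, Date) %>%
  arrange(team, Date) %>%
  group_by(team) %>%
  mutate(run_id = cumsum(is.na(lag(Date)) | Date - lag(Date) > 1)) %>%  # a gap starts a new run
  group_by(team, run_id) %>%
  mutate(streak = row_number() - 1) %>%  # consecutive days appeared before this one
  ungroup() %>%
  select(team, Date, streak)

data %>%
  mutate(Date = as.Date(Date)) %>%
  left_join(streaks, by = c(Hteam = "team", "Date")) %>%
  rename(Hdays = streak) %>%
  left_join(streaks, by = c(Ateam = "team", "Date")) %>%
  rename(Adays = streak)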
I have these two toy example tables:
Table 1:
attendance_events <- data.frame(student_id = c("RA123","RB123","RC123","RA456","RB456","RC456","RA123","RB123","RC123","RA456","RB456","RC456"),
dates = c("2020-02-01","2020-02-01","2020-02-01","2020-02-01","2020-02-01","2020-02-01","2020-02-02","2020-02-02","2020-02-02","2020-02-02","2020-02-02","2020-02-02"),
attendance = c(1,1,1,0,1,1,0,0,1,0,0,1),
stringsAsFactors = F)
attendance_events
student_id dates attendance
1 RA123 2020-02-01 1
2 RB123 2020-02-01 1
3 RC123 2020-02-01 1
4 RA456 2020-02-01 0
5 RB456 2020-02-01 1
6 RC456 2020-02-01 1
7 RA123 2020-02-02 0
8 RB123 2020-02-02 0
9 RC123 2020-02-02 1
10 RA456 2020-02-02 0
11 RB456 2020-02-02 0
12 RC456 2020-02-02 1
Table2:
all_students <- data.frame(student_id = c("RA123","RB123","RC123","RA456","RB456",'RC456'),
school_id = c(1,1,1,1,1,2),
grade_level = c(10,10,9,9,11,11),
date_of_birth = c("1990-02-02","1990-02-02","1991-01-01","1991-02-01","1989-02-02","1989-02-02"),
hometown = c("farm","farm","farm","farm","farm","city"),
stringsAsFactors = F)
> all_students
student_id school_id grade_level date_of_birth hometown
1 RA123 1 10 1990-02-02 farm
2 RB123 1 10 1990-02-02 farm
3 RC123 1 9 1991-01-01 farm
4 RA456 1 9 1991-02-01 farm
5 RB456 1 11 1989-02-02 farm
6 RC456 2 11 1989-02-02 city
attendance in attendance_events is 0 if the student was absent that day.
My question is: what is the most efficient way in R to find the grade_level that had the largest drop-off in attendance between "2020-02-01" and "2020-02-02"?
My code is:
# Only include absences because it will be a smaller dataset
library(dplyr)
library(reshape2)  # for dcast()

att_ws_alt <- inner_join(attendance_events, all_students[, c("student_id", "grade_level")],
                         by = "student_id") %>%
  filter(attendance == 0)
# Set days to check between
date_from <- "2020-02-01"
date_to <- "2020-02-02"
# Continuously pipe to avoid storing and referencing intermediate objects
att_drop_alt <- att_ws_alt %>%
  filter(dates %in% c(date_from, date_to)) %>%
  group_by(grade_level, dates) %>%
  summarize(absence_bydate = n()) %>%
  dcast(grade_level ~ dates) %>%
  sapply(FUN = function(x) { x[is.na(x)] <- 0; x }) %>%
  as.data.frame() %>%
  mutate(absence_change = .[, 3] - .[, 2]) %>%
  select(grade_level, absence_change) %>%
  arrange(desc(absence_change))
>att_drop_alt
grade_level absence_change
1 10 2
2 11 1
3 9 0
However, this feels a bit complex for what seems like a reasonably simple question. I want to see other ways R programmers could answer this question, ideally for better performance, but better readability would also be good to see.
Thanks community!
With data.table
library(data.table)
setDT(attendance_events)[all_students, .SD[, .(sum(attendance)),
    .(grade_level, dates)], on = .(student_id)][,
    .(attendance_change = diff(rev(V1))), .(grade_level)]
#   grade_level attendance_change
#1:          10                 2
#2:           9                 0
#3:          11                 1
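The same idea in a possibly more readable data.table phrasing; a sketch assuming the two dates sort ascending, so the drop is the first day's total minus the second's:
library(data.table)

dt <- merge(as.data.table(attendance_events),
            as.data.table(all_students)[, .(student_id, grade_level)],
            by = "student_id")
present <- dt[, .(present = sum(attendance)), by = .(grade_level, dates)]
present[order(dates), .(attendance_change = first(present) - last(present)),
        by = grade_level][order(-attendance_change)]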
I guess this is a little more concise:
left_join(attendance_events, all_students, by = "student_id") %>%
group_by(grade_level, dates) %>%
summarise(attendance = sum(attendance)) %>%
group_by(grade_level) %>%
summarize(attendance_change = diff(attendance))
#> # A tibble: 3 x 2
#> grade_level attendance_change
#> <dbl> <dbl>
#> 1 9 0
#> 2 10 -2
#> 3 11 -1
Of course, if you want to count absences instead of attendances, just put a minus sign in front of the diff on the last line.
Sorry if this doesn't exactly answer your question, but I wouldn't want to unfairly accuse the students of being more absent than they were ;)
library(dplyr)
all_students %>%
left_join(attendance_events) %>%
mutate(dates = as.Date(dates)) %>%
group_by(grade_level, dates) %>%
summarise(NAbs = sum(ifelse(attendance == 0, 1, 0)),
N = n(),
pctAbs = NAbs / n() * 100) %>%
arrange(dates) %>%
mutate(change = pctAbs - lag(pctAbs)) %>%
ungroup() %>%
arrange(change)
# A tibble: 6 x 6
dates grade_level NAbs N pctAbs change
<date> <dbl> <dbl> <int> <dbl> <dbl>
1 2020-02-02 9 1 2 50 0
2 2020-02-02 11 1 2 50 50
3 2020-02-02 10 2 2 100 100
4 2020-02-01 9 1 2 50 NA
5 2020-02-01 10 0 2 0 NA
6 2020-02-01 11 0 2 0 NA
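Since the question explicitly invites alternative approaches, a base-R sketch with no extra packages (assuming only the two dates of interest are present in the data):
# cross-tabulate total attendance by grade and date, then take the day-over-day drop
m <- merge(attendance_events, all_students[, c("student_id", "grade_level")])
tab <- xtabs(attendance ~ grade_level + dates, data = m)
sort(tab[, "2020-02-01"] - tab[, "2020-02-02"], decreasing = TRUE)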
I have a data frame in the below format and I'm trying to find the time difference between each 'ASSIGNED' event and the last 'CREATED' event that comes before it.
AccountID TIME                     EVENT
1         2016-11-08T01:54:15.000Z CREATED
1         2016-11-09T01:54:15.000Z ASSIGNED
1         2016-11-10T01:54:15.000Z CREATED
1         2016-11-11T01:54:15.000Z CALLED
1         2016-11-12T01:54:15.000Z ASSIGNED
1         2016-11-12T01:54:15.000Z SLEEP
Currently my code is as follows; my difficulty is selecting the CREATED that comes just before the ASSIGNED event:
test <- timetable.filter %>%
group_by(AccountID) %>%
mutate(timeToAssign = ifelse(EVENT == 'ASSIGNED',
interval(ymd_hms(TIME), max(ymd_hms(TIME[EVENT == 'CREATED']))) %/% hours(1), NA))
I'm looking for the output to be:
AccountID TIME                     EVENT    timeToAssign
1         2016-11-08T01:54:15.000Z CREATED  NA
1         2016-11-09T01:54:15.000Z ASSIGNED 12
1         2016-11-10T01:54:15.000Z CREATED  NA
1         2016-11-11T01:54:15.000Z CALLED   NA
1         2016-11-12T01:54:15.000Z ASSIGNED 24
1         2016-11-12T01:54:15.000Z SLEEP    NA
With dplyr and tidyr:
library(dplyr); library(tidyr); library(anytime)
df %>%
group_by(AccountID) %>%
mutate(CREATED_INDEX = if_else(EVENT == 'CREATED', row_number(), NA_integer_),
TIME = anytime(TIME)) %>%
fill(CREATED_INDEX) %>%  # carry the row index of the most recent CREATED forward
mutate(TimeToAssign = if_else(EVENT == 'ASSIGNED',
as.numeric(TIME - TIME[CREATED_INDEX], units = 'hours'),
NA_real_)) %>%
select(-CREATED_INDEX)
# A tibble: 6 x 4
# Groups: AccountID [1]
# AccountID TIME EVENT TimeToAssign
# <int> <dttm> <fctr> <dbl>
#1 1 2016-11-08 01:54:15 CREATED NA
#2 1 2016-11-09 01:54:15 ASSIGNED 24
#3 1 2016-11-10 01:54:15 CREATED NA
#4 1 2016-11-11 01:54:15 CALLED NA
#5 1 2016-11-12 01:54:15 ASSIGNED 48
#6 1 2016-11-12 01:54:15 SLEEP NA
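An equivalent sketch that carries the last CREATED timestamp itself forward instead of its row index (assumption: each ASSIGNED has at least one CREATED before it within the account; otherwise TimeToAssign stays NA):
library(dplyr); library(tidyr); library(anytime)
df %>%
  group_by(AccountID) %>%
  mutate(TIME = anytime(TIME),
         last_created = replace(TIME, EVENT != 'CREATED', NA)) %>%  # keep only CREATED times
  fill(last_created) %>%                                            # carry them forward
  mutate(TimeToAssign = if_else(EVENT == 'ASSIGNED',
                                as.numeric(TIME - last_created, units = 'hours'),
                                NA_real_)) %>%
  select(-last_created)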
I have irregular time-series data representing a certain type of transaction for users. Each line of data is timestamped and represents a transaction at that time. Because of the irregular nature of the data, some users might have 100 rows in a day while others have 0 or 1 transactions in a day.
The data might look something like this:
data.frame(
id = c(1, 1, 1, 1, 1, 2, 2, 3, 4),
date = c("2015-01-01",
"2015-01-01",
"2015-01-05",
"2015-01-25",
"2015-02-15",
"2015-05-05",
"2015-01-01",
"2015-08-01",
"2015-01-01"),
n_widgets = c(1,2,3,4,4,5,2,4,5)
)
id date n_widgets
1 1 2015-01-01 1
2 1 2015-01-01 2
3 1 2015-01-05 3
4 1 2015-01-25 4
5 1 2015-02-15 4
6 2 2015-05-05 5
7 2 2015-01-01 2
8 3 2015-08-01 4
9 4 2015-01-01 5
Often I'd like to know some rolling statistics about users. For example: for this user on a certain day, how many transactions occurred in the previous 30 days, how many widgets were sold in the previous 30 days etc.
Corresponding to the above example, the data should look like:
id date n_widgets n_trans_30 total_widgets_30
1 1 2015-01-01 1 1 1
2 1 2015-01-01 2 2 3
3 1 2015-01-05 3 3 6
4 1 2015-01-25 4 4 10
5 1 2015-02-15 4 2 8
6 2 2015-05-05 5 1 5
7 2 2015-01-01 2 1 2
8 3 2015-08-01 4 1 4
9 4 2015-01-01 5 1 5
If the time window is daily then the solution is simple: data %>% group_by(id, date) %>% summarize(...)
Similarly if the time window is monthly this is also relatively simple with lubridate: data %>% group_by(id, year(date), month(date)) %>% summarize(...)
However the challenge I'm having is how to setup a time window for an arbitrary period: 5-days, 10-days etc.
There's also the RcppRoll library, but both RcppRoll and the rolling functions in zoo seem more set up for regular time series. As far as I can tell, these window functions work on a fixed number of rows rather than a specified time period; the key difference is that a given time period might contain a different number of rows depending on the date and user.
For example, it's possible for user 1, that the number of transactions in the 5 days previous of 2015-01-01 is equal to 100 transactions and for the same user the number of transactions in the 5 days previous of 2015-02-01 is equal to 5 transactions. Thus looking back a set number of rows will simply not work.
Additionally, there is another SO thread discussing rolling dates for irregular time series type data (Create new column based on condition that exists within a rolling date) however the accepted solution was using data.table and I'm specifically looking for a dplyr way of achieving this.
I suppose that at the heart of this issue, the problem can be solved by answering this question: how can I group_by arbitrary time periods in dplyr? Alternatively, if there's a different dplyr way to achieve the above without a complicated group_by, how can I do it?
EDIT: updated example to make nature of the rolling window more clear.
This can be done using SQL:
library(sqldf)
dd <- transform(data, date = as.Date(date))
sqldf("select a.*, count(*) n_trans30, sum(b.n_widgets) 'total_widgets30'
from dd a
left join dd b on b.date between a.date - 30 and a.date
and b.id = a.id
and b.rowid <= a.rowid
group by a.rowid")
giving:
id date n_widgets n_trans30 total_widgets30
1 1 2015-01-01 1 1 1
2 1 2015-01-01 2 2 3
3 1 2015-01-05 3 3 6
4 1 2015-01-25 4 4 10
5 1 2015-02-15 4 2 8
6 2 2015-05-05 5 1 5
7 2 2015-01-01 2 1 2
8 3 2015-08-01 4 1 4
9 4 2015-01-01 5 1 5
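Since the question asks for a dplyr-native way, the same windowed self-join can also be written with dplyr's non-equi joins; a sketch assuming dplyr >= 1.1.0 for join_by():
library(dplyr)

dw <- data %>%
  mutate(date = as.Date(date), row = row_number(), win_start = date - 30)

dw %>%
  left_join(dw,  # join each row to all earlier rows of the same id in its 30-day window
            by = join_by(id, y$date >= x$win_start, y$date <= x$date, y$row <= x$row),
            suffix = c("", ".b")) %>%
  group_by(row) %>%
  summarise(id = first(id), date = first(date), n_widgets = first(n_widgets),
            n_trans30 = n(), total_widgets30 = sum(n_widgets.b), .groups = "drop") %>%
  select(-row)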
Another approach is to expand your dataset to contain all possible days (using tidyr::complete), then use a rolling function (RcppRoll::roll_sum).
The fact that you have multiple observations per day will probably create an issue, though...
library(tidyr)
library(RcppRoll)
df2 <- df %>%
mutate(date=as.Date(date))
## create full dataset with all possible dates (go even 30 days back for first observation)
df_full<- df2 %>%
mutate(date=as.Date(date)) %>%
complete(id,
date=seq(from=min(.$date)-30,to=max(.$date), by=1),
fill=list(n_widgets=0))
## now use rolling function, and keep only original rows (left join)
df_roll <- df_full %>%
group_by(id) %>%
mutate(n_trans_30=roll_sum(x=n_widgets!=0, n=30, fill=0, align="right"),
total_widgets_30=roll_sum(x=n_widgets, n=30, fill=0, align="right")) %>%
ungroup() %>%
right_join(df2, by = c("date", "id", "n_widgets"))
The result is the same as yours (by chance)
id date n_widgets n_trans_30 total_widgets_30
<dbl> <date> <dbl> <dbl> <dbl>
1 1 2015-01-01 1 1 1
2 1 2015-01-01 2 2 3
3 1 2015-01-05 3 3 6
4 1 2015-01-25 4 4 10
5 1 2015-02-15 4 2 8
6 2 2015-05-05 5 1 5
7 2 2015-01-01 2 1 2
8 3 2015-08-01 4 1 4
9 4 2015-01-01 5 1 5
But as said, it will fail for some days, since it counts the last 30 observations, not the last 30 days. So you might want to first summarise the information by day, then apply this.
EDITED based on comment below.
You can try something like this for up to 5 days:
df %>%
arrange(id, date) %>%
group_by(id) %>%
filter(as.numeric(difftime(Sys.Date(), date, unit = 'days')) <= 5) %>%
summarise(n_total_widgets = sum(n_widgets))
In this case, no dates fall within five days of the current date, so it won't produce any output.
To get last five days for each ID, you can do something like this:
df %>%
arrange(id, date) %>%
group_by(id) %>%
filter(as.numeric(difftime(max(date), date, unit = 'days')) <= 5) %>%
summarise(n_total_widgets = sum(n_widgets))
Resulting output will be:
Source: local data frame [4 x 2]
id n_total_widgets
(dbl) (dbl)
1 1 4
2 2 5
3 3 4
4 4 5
I found a way to do this while working on this question
df <- data.frame(
id = c(1, 1, 1, 1, 1, 2, 2, 3, 4),
date = c("2015-01-01",
"2015-01-01",
"2015-01-05",
"2015-01-25",
"2015-02-15",
"2015-05-05",
"2015-01-01",
"2015-08-01",
"2015-01-01"),
n_widgets = c(1,2,3,4,4,5,2,4,5)
)
library(dplyr)
library(lubridate)

count_window <- function(df, date2, w, id2){
  min_date <- date2 - w
  df2 <- df %>% filter(id == id2, date >= min_date, date <= date2)
  out <- length(df2$date)
  return(out)
}
v_count_window <- Vectorize(count_window, vectorize.args = c("date2","id2"))
sum_window <- function(df, date2, w, id2){
min_date <- date2 - w
df2 <- df %>% filter(id == id2, date >= min_date, date <= date2)
out <- sum(df2$n_widgets)
return(out)
}
v_sum_window <- Vectorize(sum_window, vectorize.args = c("date2","id2"))
res <- df %>% mutate(date = ymd(date)) %>%
mutate(min_date = date - 30,
n_trans = v_count_window(., date, 30, id),
total_widgets = v_sum_window(., date, 30, id)) %>%
select(id, date, n_widgets, n_trans, total_widgets)
res
id date n_widgets n_trans total_widgets
1 1 2015-01-01 1 2 3
2 1 2015-01-01 2 2 3
3 1 2015-01-05 3 3 6
4 1 2015-01-25 4 4 10
5 1 2015-02-15 4 2 8
6 2 2015-05-05 5 1 5
7 2 2015-01-01 2 1 2
8 3 2015-08-01 4 1 4
9 4 2015-01-01 5 1 5
This version is fairly case specific but you could probably make a version of the functions that is more general.
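A sketch of one such generalization (a hypothetical helper, same assumptions as the code above): a single function parameterized by the column and the summary function, vectorized the same way.
# window_stat is a hypothetical generalization of count_window/sum_window
window_stat <- function(df, date2, w, id2, col, fun){
  df2 <- df %>% filter(id == id2, date >= date2 - w, date <= date2)
  fun(df2[[col]])
}
v_window_stat <- Vectorize(window_stat, vectorize.args = c("date2", "id2"))

df %>% mutate(date = ymd(date)) %>%
  mutate(n_trans = v_window_stat(., date, 30, id, "n_widgets", length),
         total_widgets = v_window_stat(., date, 30, id, "n_widgets", sum))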
For simplicity I recommend the runner package, which handles sliding-window operations. In the OP's request the window size is k = 30 and the windows depend on the date (idx = date). You can use the runner() function, which applies any R function over a given window, or a ready-made one such as sum_run():
library(runner)
library(dplyr)
df %>%
group_by(id) %>%
arrange(date, .by_group = TRUE) %>%
mutate(
n_trans30 = runner(n_widgets, k = 30, idx = date, function(x) length(x)),
n_widgets30 = sum_run(n_widgets, k = 30, idx = date),
)
# id date n_widgets n_trans30 n_widgets30
#<dbl> <date> <dbl> <dbl> <dbl>
# 1 2015-01-01 1 1 1
# 1 2015-01-01 2 2 3
# 1 2015-01-05 3 3 6
# 1 2015-01-25 4 4 10
# 1 2015-02-15 4 2 8
# 2 2015-01-01 2 1 2
# 2 2015-05-05 5 1 5
# 3 2015-08-01 4 1 4
# 4 2015-01-01 5 1 5
Important: idx = date should be in ascending order (handled by arrange() above).
For more, see the package documentation and vignettes.