How to merge observations with close dates in r - r

I have a database where animals in a herd are tested every 6 months (number of animals can change over the time). The issue is that all the animals in a herd are not tested on the same day but within a period of time of 2 months.
I would like to know how I can create a new column that merges all these close dates (grouping by herd), so I can calculate the number of times a herd has been tested.
This is an example of a herd that has been tested 8 times, but at different dates. Each dot represents an animal:
Here is an example of the data:
df <- data.frame(
animal = c("Animal1", "Animal2", "Animal3", "Animal4", "Animal5", "Animal6", "Animal1", "Animal2", "Animal3", "Animal4", "Animal5", "Animal6", "Animal7", "Animal8", "Animal9", "Animal10", "Animal11", "Animal12", "Animal7", "Animal8", "Animal9", "Animal10", "Animal11", "Animal12"),
herd = c("Herd1","Herd1","Herd1", "Herd1","Herd1","Herd1", "Herd1","Herd1","Herd1", "Herd1","Herd1","Herd1","Herd2","Herd2", "Herd2","Herd2","Herd2","Herd2", "Herd2","Herd2", "Herd2","Herd2","Herd2","Herd2"),
date = c("2017-01-01", "2017-01-01", "2017-01-17","2017-02-04", "2017-02-04", "2017-02-05", "2017-06-01" , "2017-06-03", "2017-07-01", "2017-06-21", "2017-06-01", "2017-06-15", "2017-02-01", "2017-02-01", "2017-02-15", "2017-02-21", "2017-03-05", "2017-03-01", "2017-07-01", "2017-07-01", "2017-07-15", "2017-07-21", "2017-08-05", "2017-08-01"))
So the desired outcome will be:
animal herd date testing
1 Animal1 Herd1 2017-01-01 1
2 Animal2 Herd1 2017-01-01 1
3 Animal3 Herd1 2017-01-17 1
4 Animal4 Herd1 2017-02-04 1
5 Animal5 Herd1 2017-02-04 1
6 Animal6 Herd1 2017-02-05 1
7 Animal1 Herd1 2017-06-01 2
8 Animal2 Herd1 2017-06-03 2
9 Animal3 Herd1 2017-07-01 2
10 Animal4 Herd1 2017-06-21 2
11 Animal5 Herd1 2017-06-01 2
12 Animal6 Herd1 2017-06-15 2
13 Animal7 Herd2 2017-02-01 1
14 Animal8 Herd2 2017-02-01 1
15 Animal9 Herd2 2017-02-15 1
16 Animal10 Herd2 2017-02-21 1
17 Animal11 Herd2 2017-03-05 1
18 Animal12 Herd2 2017-03-01 1
19 Animal7 Herd2 2017-07-01 2
20 Animal8 Herd2 2017-07-01 2
21 Animal9 Herd2 2017-07-15 2
22 Animal10 Herd2 2017-07-21 2
23 Animal11 Herd2 2017-08-05 2
24 Animal12 Herd2 2017-08-01 2
I would like to apply something like this, but treating dates that are close to each other as the same testing:
df %>%
group_by(herd) %>%
mutate(testing = dense_rank(date))
Thanks!

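You can bin the dates into 5-month periods and apply dense_rank within each herd. This works for the sample data because consecutive tests of the same animal are about 5 months apart and every testing round falls inside a single bin. Note that the bins are fixed calendar periods, so a round that straddles a bin boundary would be split in two.
library(dplyr)
library(lubridate)
df %>%
  group_by(herd) %>%
  mutate(testing = dense_rank(floor_date(ymd(date), unit = "5 months"))) %>%
  ungroup()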
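If calendar bins turn out to be too fragile, one alternative is to start a new testing round whenever the gap to the previous date in the same herd exceeds a threshold. The following is only a minimal base R sketch: the 60-day threshold (gap_days) is an assumption taken from the "within 2 months" description in the question, and the data are a shortened version of the example.

```r
# Shortened version of the example data (two herds, two rounds each)
df <- data.frame(
  herd = c("Herd1", "Herd1", "Herd1", "Herd1", "Herd2", "Herd2", "Herd2", "Herd2"),
  date = as.Date(c("2017-01-01", "2017-02-05", "2017-06-01", "2017-07-01",
                   "2017-02-01", "2017-03-05", "2017-07-01", "2017-08-05"))
)

gap_days <- 60  # assumed threshold: a larger gap starts a new testing round

# Sort by herd and date so that diff() compares neighbouring dates
df <- df[order(df$herd, df$date), ]

# Within each herd, increment the counter whenever the gap exceeds the threshold
df$testing <- ave(as.numeric(df$date), df$herd,
                  FUN = function(d) cumsum(c(TRUE, diff(d) > gap_days)))
df
```

The threshold has to be larger than the spread within a round and smaller than the interval between rounds, so it should be checked against the real data.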

Related

Create variable for day of the experiment

I have a large data set that spanned a month in time with the data stamped in a column called txn_date like the below. (this is a toy reproduction of it)
dat1 <- read.table(text = "var1 txn_date
5 2020-10-25
1 2020-10-25
3 2020-10-26
4 2020-10-27
1 2020-10-27
3 2020-10-31
3 2020-11-01
8 2020-11-02 ", header = TRUE)
Ideally I would like to get a column in my data frame for each date in the data which I think could be done by first getting a single column that is 1 for the first date that appears and then so on.
So something like this
dat1 <- read.table(text = "var1 txn_date day
5 2020-10-25 1
1 2020-10-25 1
3 2020-10-26 2
4 2020-10-27 3
1 2020-10-27 3
3 2020-10-31 7
3 2020-11-01 8
8 2020-11-02 9 ", header = TRUE)
I'm not quite sure how to get this. The txn_date column is as.Date in my actual data frame. I think if I could get the single day column like is listed above (then convert it to a factor) then I could always one hot encode the actual levels of that column if I need to. Ultimately I need to use the day of the experiment as a regressor in a regression I'm going to run.
Something along the lines of y ~ x + day_1 + day_2 +...+ error
Would this be suitable?
library(tidyverse)
dat1 <- read.table(text = "var1 txn_date
5 2020-10-25
1 2020-10-25
3 2020-10-26
4 2020-10-27
1 2020-10-27
3 2020-10-31
3 2020-11-01
8 2020-11-02 ", header = TRUE)
dat1$txn_date <- as.Date(dat1$txn_date)
dat1 %>%
mutate(days = txn_date - txn_date[1] + 1)
# var1 txn_date days
#1 5 2020-10-25 1 days
#2 1 2020-10-25 1 days
#3 3 2020-10-26 2 days
#4 4 2020-10-27 3 days
#5 1 2020-10-27 3 days
#6 3 2020-10-31 7 days
#7 3 2020-11-01 8 days
#8 8 2020-11-02 9 days
We can create a sequence of dates from the min to the max of 'txn_date' and match each date against it:
dates <- seq(min(as.Date(dat1$txn_date)),
max(as.Date(dat1$txn_date)), by = '1 day')
dat1$day <- with(dat1, match(as.Date(txn_date), dates))
dat1$day
#[1] 1 1 2 3 3 7 8 9
Or use the factor route:
with(dat1, as.integer(factor(txn_date, levels = as.character(dates))))
#[1] 1 1 2 3 3 7 8 9
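Since the end goal is a regression with one indicator per day, here is a minimal base R sketch of the remaining step. It computes the day number as a plain integer (so no "days" unit is attached) and lets model.matrix expand the factor into dummy columns. The names day_f and X are made up for this illustration.

```r
dat1 <- data.frame(
  var1 = c(5, 1, 3, 4, 1, 3, 3, 8),
  txn_date = as.Date(c("2020-10-25", "2020-10-25", "2020-10-26", "2020-10-27",
                       "2020-10-27", "2020-10-31", "2020-11-01", "2020-11-02"))
)

# Day of the experiment as an integer; min() avoids relying on row order
dat1$day <- as.integer(dat1$txn_date - min(dat1$txn_date)) + 1L

# Treat the day as categorical; only days that occur in the data become levels
dat1$day_f <- factor(dat1$day)

# Design matrix with an intercept and one dummy column per non-reference day
X <- model.matrix(~ day_f, data = dat1)
dim(X)
```

In practice lm(y ~ x + day_f, data = ...) performs the same expansion internally, so the explicit one-hot step is usually not needed.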

Creating Labels for Dates

I am working in R. I have a data frame that consists of Sampling Date and water temperature. I have provided a sample dataframe below:
Date Temperature
2015-06-01 11
2015-08-11 13
2016-01-12 2
2016-07-01 12
2017-01-08 4
2017-08-13 14
2018-03-04 7
2018-09-19 10
2019-08-24 8
Due to the erratic nature of the sampling dates (caused by the samplers' ability to access the site), I am unable to classify years in the usual way, January 1st to December 31st, and instead use the beginning of the sampling period as the start of a year. In this case a year starts June 1st and ends May 31st, so that I can accurately compare the years to one another. Thus I want 4 years with the following labels:
Year_One = "2015-06-01" - "2016-05-31"
Year_Two = "2016-06-01" - "2017-05-31"
Year_Three = "2017-06-01" - "2018-05-31"
Year_Four = "2018-06-01" - "2019-08-24"
My goal is to create an additional column with these labels but have thus far been unable to do so.
I created two columns, year1 and year2, using two different approaches. The year2 approach requires every period to start June 1st and end May 31st (in your example Year_Four ends 2019-08-24), so it may not be exactly what you need:
library(tidyverse)
library(lubridate)
dt$Date <- as.Date(dt$Date)
dt %>%
mutate(year1= case_when(between(Date, as.Date("2015-06-01") , as.Date("2016-05-31")) ~ "Year_One",
between(Date, as.Date("2016-06-01") , as.Date("2017-05-31")) ~ "Year_Two",
between(Date, as.Date("2017-06-01") , as.Date("2018-05-31")) ~ "Year_Three",
between(Date, as.Date("2018-06-01") , as.Date("2019-08-24")) ~ "Year_Four",
TRUE ~ "0")) %>%
mutate(year2 = paste0(year(Date %m-% months(5)), "/", year(Date %m-% months(5)) + 1)) # %m-% avoids NA when the shifted day does not exist (e.g. July 31st minus 5 months)
The output:
# A tibble: 9 x 4
Date Temperature year1 year2
<date> <dbl> <chr> <chr>
1 2015-06-01 11 Year_One 2015/2016
2 2015-08-11 13 Year_One 2015/2016
3 2016-01-12 2 Year_One 2015/2016
4 2016-07-01 12 Year_Two 2016/2017
5 2017-01-08 4 Year_Two 2016/2017
6 2017-08-13 14 Year_Three 2017/2018
7 2018-03-04 7 Year_Three 2017/2018
8 2018-09-19 10 Year_Four 2018/2019
9 2019-08-24 8 Year_Four 2019/2020
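Use strftime to get the years, then build a factor whose levels are the unique values. I'd recommend numbers instead of words, because they can be generated automatically. Otherwise, use labels=c("one", "two", ...). Note that this labels by calendar year; shift the dates first if the year should start on June 1st.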
d <- within(d, {
year <- strftime(Date, "%Y")
year <- paste("Year", factor(year, labels=seq(unique(year))), sep="_")
})
# Date temperature year
# 1 2017-06-01 1 Year_1
# 2 2017-09-01 2 Year_1
# 3 2017-12-01 3 Year_1
# 4 2018-03-01 4 Year_2
# 5 2018-06-01 5 Year_2
# 6 2018-09-01 6 Year_2
# 7 2018-12-01 7 Year_2
# 8 2019-03-01 8 Year_3
# 9 2019-06-01 9 Year_3
# 10 2019-09-01 10 Year_3
# 11 2019-12-01 11 Year_3
# 12 2020-03-01 12 Year_4
# 13 2020-06-01 13 Year_4
Data:
d <- structure(list(Date = structure(c(17318, 17410, 17501, 17591,
17683, 17775, 17866, 17956, 18048, 18140, 18231, 18322, 18414
), class = "Date"), temperature = 1:13), class = "data.frame", row.names = c(NA,
-13L))
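For labels that follow the June-to-May year from the question, without extra packages, a minimal base R sketch is shown below. It assumes every sampling year starts on June 1st. Under that assumption the last sample (2019-08-24) falls into a fifth year, whereas the question folds it into Year_Four. The names start_year and label are made up for this illustration.

```r
dates <- as.Date(c("2015-06-01", "2015-08-11", "2016-01-12", "2016-07-01",
                   "2017-01-08", "2017-08-13", "2018-03-04", "2018-09-19",
                   "2019-08-24"))

yr <- as.integer(format(dates, "%Y"))
mo <- as.integer(format(dates, "%m"))

# A sampling year is identified by the calendar year in which its June falls
start_year <- ifelse(mo >= 6, yr, yr - 1L)

# Number the sampling years consecutively, starting from the earliest one
label <- paste0("Year_", start_year - min(start_year) + 1L)
data.frame(Date = dates, label = label)
```

To reproduce the question's labelling exactly, the index could be capped, for example with pmin(start_year - min(start_year) + 1L, 4L).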

R aggregate second data to minutes more efficient

I have a data.table, allData, containing data on roughly every (POSIXct) second from different nights. Some nights however are on the same date since data is collected from different people, so I have a column nightNo as an id for every different night.
timestamp nightNo data1 data2
2018-10-19 19:15:00 1 1 7
2018-10-19 19:15:01 1 2 8
2018-10-19 19:15:02 1 3 9
2018-10-19 18:10:22 2 4 10
2018-10-19 18:10:23 2 5 11
2018-10-19 18:10:24 2 6 12
I'd like to aggregate the data to minutes (per night) and using this question I've come up with the following code:
aggregate_minute <- function(df){
df %>%
group_by(timestamp = cut(timestamp, breaks= "1 min")) %>%
summarise(data1= mean(data1), data2= mean(data2)) %>%
as.data.table()
}
allData <- allData[, aggregate_minute(allData), by=nightNo]
However my data.table is quite large and this code isn't fast enough. Is there a more efficient way to solve this problem?
allData <- data.table(timestamp = c(rep(Sys.time(), 3), rep(Sys.time() + 320, 3)),
nightNo = rep(1:2, c(3, 3)),
data1 = 1:6,
data2 = 7:12)
timestamp nightNo data1 data2
1: 2018-06-14 10:43:11 1 1 7
2: 2018-06-14 10:43:11 1 2 8
3: 2018-06-14 10:43:11 1 3 9
4: 2018-06-14 10:48:31 2 4 10
5: 2018-06-14 10:48:31 2 5 11
6: 2018-06-14 10:48:31 2 6 12
allData[, .(data1 = mean(data1), data2 = mean(data2)), by = .(nightNo, timestamp = cut(timestamp, breaks= "1 min"))]
nightNo timestamp data1 data2
1: 1 2018-06-14 10:43:00 2 8
2: 2 2018-06-14 10:48:00 5 11
> system.time(replicate(500, allData[, aggregate_minute(allData), by=nightNo]))
user system elapsed
3.25 0.02 3.31
> system.time(replicate(500, allData[, .(data1 = mean(data1), data2 = mean(data2)), by = .(nightNo, timestamp = cut(timestamp, breaks= "1 min"))]))
user system elapsed
1.02 0.04 1.06
You can use lubridate to 'round' the dates and then use data.table to aggregate the columns.
library(data.table)
library(lubridate)
Reproducible data:
text <- "timestamp nightNo data1 data2
'2018-10-19 19:15:00' 1 1 7
'2018-10-19 19:15:01' 1 2 8
'2018-10-19 19:15:02' 1 3 9
'2018-10-19 18:10:22' 2 4 10
'2018-10-19 18:10:23' 2 5 11
'2018-10-19 18:10:24' 2 6 12"
allData <- read.table(text = text, header = TRUE, stringsAsFactors = FALSE)
Create data.table:
setDT(allData)
Parse the timestamp and floor it to the minute:
allData[, timestamp := floor_date(ymd_hms(timestamp), "minutes")]
Change the type of the integer columns to numeric:
allData[, ':='(data1 = as.numeric(data1),
data2 = as.numeric(data2))]
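Replace the data columns with their means by nightNo group (note that this keeps one row per original observation and averages over the whole night; to get one row per minute, group by both nightNo and timestamp and return the means instead of assigning with ':='):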
allData[, ':='(data1 = mean(data1),
data2 = mean(data2)),
by = nightNo]
The result is:
timestamp nightNo data1 data2
1: 2018-10-19 19:15:00 1 2 8
2: 2018-10-19 19:15:00 1 2 8
3: 2018-10-19 19:15:00 1 2 8
4: 2018-10-19 18:10:00 2 5 11
5: 2018-10-19 18:10:00 2 5 11
6: 2018-10-19 18:10:00 2 5 11
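As a cross-check of what a per-minute, per-night aggregation should return, here is a minimal base R sketch. It floors the timestamps with integer arithmetic on the epoch seconds instead of cut() and then calls aggregate(). It is not a benchmarked replacement for the data.table code above; the column name minute and the UTC time zone are assumptions made for this illustration.

```r
allData <- data.frame(
  timestamp = as.POSIXct(c("2018-10-19 19:15:00", "2018-10-19 19:15:01",
                           "2018-10-19 19:15:02", "2018-10-19 18:10:22",
                           "2018-10-19 18:10:23", "2018-10-19 18:10:24"),
                         tz = "UTC"),
  nightNo = rep(1:2, each = 3),
  data1 = 1:6,
  data2 = 7:12
)

# Floor to the minute: drop the seconds from the epoch value
allData$minute <- as.POSIXct(floor(as.numeric(allData$timestamp) / 60) * 60,
                             origin = "1970-01-01", tz = "UTC")

# One row per night and minute, with the mean of each data column
per_minute <- aggregate(allData[c("data1", "data2")],
                        by = list(nightNo = allData$nightNo,
                                  minute = allData$minute),
                        FUN = mean)
per_minute
```

The same floored column can be used as a grouping key in data.table, which avoids converting the timestamps to a factor.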

Count how many cases exist per week given start and end dates of each case [closed]

I'm new here, so I apologize if I miss any conventions.
I have a ~2000 row dataset with data on unique cases happening in a three year period. Each case has a start date and an end date. I want to be able to get a new dataframe that shows how many cases occur per week in this three year period.
The structure of the dataset I have is like this:
ID Start_Date End_Date
1 2015-01-04 2017-11-02
2 2015-01-05 2015-10-26
3 2015-01-07 2015-03-04
4 2015-01-12 2016-05-17
5 2015-01-15 2015-04-08
6 2015-01-21 2016-07-31
7 2015-01-21 2015-07-16
8 2015-01-22 2015-03-03
This problem could be solved more easily with the sqldf package, but I decided to stick with dplyr.
The approach:
library(dplyr)
library(lubridate)
# First create a data frame having all weeks from chosen start date to end date.
# 2015-01-01 to 2017-12-31
df_week <- data.frame(weekStart = seq(floor_date(as.Date("2015-01-01"), "week"),
as.Date("2017-12-31"), by = 7))
df_week <- df_week %>%
mutate(weekEnd = weekStart + 7,
weekNum = format(weekStart, "%V-%Y"),
dummy = TRUE)
# The dummy column is only for joining purpose.
# Header looks like
#> head(df_week)
# weekStart weekEnd weekNum dummy
#1 2014-12-28 2015-01-04 52-2014 TRUE
#2 2015-01-04 2015-01-11 01-2015 TRUE
#3 2015-01-11 2015-01-18 02-2015 TRUE
#4 2015-01-18 2015-01-25 03-2015 TRUE
#5 2015-01-25 2015-02-01 04-2015 TRUE
#6 2015-02-01 2015-02-08 05-2015 TRUE
# Prepare the data as mentioned in OP
df <- read.table(text = "ID Start_Date End_Date
1 2015-01-04 2017-11-02
2 2015-01-05 2015-10-26
3 2015-01-07 2015-03-04
4 2015-01-12 2016-05-17
5 2015-01-15 2015-04-08
6 2015-01-21 2016-07-31
7 2015-01-21 2015-07-16
8 2015-01-22 2015-03-03", header = TRUE, stringsAsFactors = FALSE)
df$Start_Date <- as.Date(df$Start_Date)
df$End_Date <- as.Date(df$End_Date)
df <- df %>% mutate(dummy = TRUE) # just for joining
# Use dplyr to join, filter and then group on week to find number of cases
# in each week
df_week %>%
left_join(df, by = "dummy") %>%
select(-dummy) %>%
# keep every case whose [Start_Date, End_Date] interval overlaps the week
filter(Start_Date <= weekEnd & End_Date >= weekStart) %>%
group_by(weekStart, weekEnd, weekNum) %>%
summarise(cases = n())
# Result
# weekStart weekEnd weekNum cases
# <date> <date> <chr> <int>
# 1 2014-12-28 2015-01-04 52-2014 1
# 2 2015-01-04 2015-01-11 01-2015 3
# 3 2015-01-11 2015-01-18 02-2015 5
# 4 2015-01-18 2015-01-25 03-2015 8
# 5 2015-01-25 2015-02-01 04-2015 8
# 6 2015-02-01 2015-02-08 05-2015 8
# 7 2015-02-08 2015-02-15 06-2015 8
# 8 2015-02-15 2015-02-22 07-2015 8
# 9 2015-02-22 2015-03-01 08-2015 8
#10 2015-03-01 2015-03-08 09-2015 8
# ... with 139 more rows
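The same count can also be sketched in base R without a join, which makes the overlap rule explicit: a case is active in a week when it starts on or before the last day of the week and ends on or after the first day. In this sketch the weeks run Sunday to Saturday (week_end = week_start + 6), so a case that starts on a Sunday is not counted in the preceding week. That is a deliberate difference from the output above, where weekEnd is weekStart + 7.

```r
cases <- data.frame(
  ID = 1:8,
  Start_Date = as.Date(c("2015-01-04", "2015-01-05", "2015-01-07", "2015-01-12",
                         "2015-01-15", "2015-01-21", "2015-01-21", "2015-01-22")),
  End_Date = as.Date(c("2017-11-02", "2015-10-26", "2015-03-04", "2016-05-17",
                       "2015-04-08", "2016-07-31", "2015-07-16", "2015-03-03"))
)

# Weeks starting on Sunday 2014-12-28, covering the three-year period
week_start <- seq(as.Date("2014-12-28"), as.Date("2017-12-31"), by = "week")
week_end <- week_start + 6

# For each week, count the cases whose interval overlaps that week
active <- vapply(seq_along(week_start), function(i) {
  sum(cases$Start_Date <= week_end[i] & cases$End_Date >= week_start[i])
}, integer(1))

weekly <- data.frame(week_start, week_end, cases = active)
head(weekly)
```

With about 2000 cases and roughly 160 weeks this loop is cheap; for much larger inputs a join-based or interval-based approach may scale better.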
Welcome to SO!
Before solving the problem, make sure the required packages are installed by running
install.packages(c("readr","dplyr","lubridate"))
if you haven't installed them yet.
Here is a tidyverse solution.
This is a way to solve it:
library(readr)
library(dplyr)
library(lubridate)
raw_data <- 'id start_date end_date
1 2015-01-04 2017-11-02
2 2015-01-05 2015-10-26
3 2015-01-07 2015-03-04
4 2015-01-12 2016-05-17
5 2015-01-15 2015-04-08
6 2015-01-21 2016-07-31
7 2015-01-21 2015-07-16
8 2015-01-22 2015-03-03'
curated_data <- read_delim(raw_data, delim = "\t") %>%
mutate(start_date = as.Date(start_date)) %>% # convert column 2 to date format assuming the date is yyyy-mm-dd
mutate(weeks_lapse = as.integer((start_date - min(start_date))/dweeks(1))) # count how many weeks passed since the lowest date in the data
curated_data %>%
group_by(weeks_lapse) %>% # I group to count by week
summarise(cases_per_week = n()) # now count by group by week
The result, which counts cases by the week in which they start, is:
# A tibble: 3 x 2
weeks_lapse cases_per_week
<int> <int>
1 0 3
2 1 2
3 2 3

How can I implement a rolling function partitioned over another column?

I have some data that looks like
CustomerID InvoiceDate
<fctr> <dttm>
1 13313 2011-01-04 10:00:00
2 18097 2011-01-04 10:22:00
3 16656 2011-01-04 10:23:00
4 16875 2011-01-04 10:37:00
5 13094 2011-01-04 10:37:00
6 17315 2011-01-04 10:38:00
7 16255 2011-01-04 11:30:00
8 14606 2011-01-04 11:34:00
9 13319 2011-01-04 11:40:00
10 16282 2011-01-04 11:42:00
It tells me when a person makes a transaction. I would like to know the time between transactions for each customer, preferably in days. I do this in the following way:
d <- data %>%
arrange(CustomerID,InvoiceDate) %>%
group_by(CustomerID) %>%
mutate(delta.t = InvoiceDate - lag(InvoiceDate), #calculating the difference
delta.day = as.numeric(delta.t, unit = 'days')) %>%
na.omit() %>%
arrange(CustomerID) %>%
inner_join(Ntrans) %>% #Existing data.frame telling me the number of transactions per customer
filter(N>=10) %>% #only want people with at least 10 transactions
select(-N)
However, the result doesn't make sense (seen below)
CustomerID InvoiceDate delta.t delta.day
<fctr> <dttm> <time> <dbl>
1 12415 2011-01-10 09:58:00 5686 days 5686
2 12415 2011-02-15 09:52:00 51834 days 51834
3 12415 2011-03-03 10:59:00 23107 days 23107
4 12415 2011-04-01 14:28:00 41969 days 41969
5 12415 2011-05-17 15:42:00 66314 days 66314
6 12415 2011-05-20 14:13:00 4231 days 4231
7 12415 2011-06-15 13:37:00 37404 days 37404
8 12415 2011-07-13 15:30:00 40433 days 40433
9 12415 2011-07-13 15:31:00 1 days 1
10 12415 2011-07-19 10:51:00 8360 days 8360
The differences measured in days are way off. What I want is something close to SQL's rolling window function partitioned over customerID. How can I implement this?
If you just want to change the difference to days you can use the package lubridate.
> library('lubridate')
> library('dplyr')
>
> InvoiceDate <- c('2011-01-10 09:58:00', '2011-02-15 09:52:00', '2011-03-03 10:59:00')
> CustomerID <- c(111, 111, 111)
>
> dat <- data.frame('Invo' = InvoiceDate, 'ID' = CustomerID)
>
> dat %>% mutate('Delta' = as_date(Invo) - as_date(lag(Invo)))
Invo ID Delta
1 2011-01-10 09:58:00 111 NA days
2 2011-02-15 09:52:00 111 36 days
3 2011-03-03 10:59:00 111 16 days
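The example above uses a single customer and drops the time of day. To get the partitioned behaviour the question asks about (a lag within each customer), the difference has to be computed per group and with explicit units. Below is a minimal base R sketch: the second customer (222) and the column name delta_day are made up for the illustration, and UTC is assumed so that no daylight-saving shifts occur.

```r
tx <- data.frame(
  CustomerID = c(111, 111, 111, 222, 222),
  InvoiceDate = as.POSIXct(c("2011-01-10 09:58:00", "2011-02-15 09:52:00",
                             "2011-03-03 10:59:00", "2011-01-04 10:00:00",
                             "2011-01-05 22:00:00"), tz = "UTC")
)

# Order by customer and time so that diff() compares consecutive transactions
tx <- tx[order(tx$CustomerID, tx$InvoiceDate), ]

# Work on epoch seconds so the unit is unambiguous, then convert to days
secs <- as.numeric(tx$InvoiceDate)
tx$delta_day <- ave(secs, tx$CustomerID,
                    FUN = function(s) c(NA, diff(s))) / 86400
tx
```

In a dplyr pipeline the equivalent step would be difftime(InvoiceDate, lag(InvoiceDate), units = "days") inside a grouped mutate, which fixes the unit instead of letting it be chosen automatically.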

Resources