How to aggregate data based on consecutive row values? - r

I am doing data analysis of photos of animals from trail cameras. My data includes what camera a picture was taken with, the date and time the picture was taken, and the animal in the photo. I wish to aggregate my data based on the time animals spent in front of the camera. For our purposes, an encounter is anytime we photograph an animal more than 10 minutes after photographing another of the same species. Encounters can be more than 10 minutes long in some cases, such as if we took 3 pictures of the same animal 7 minutes apart from one another, a 21 minute encounter. I want my output to aggregate my data into individual encounters for all animals photographed, and include start times and end times for each encounter photo series.
My code thus far
library(dplyr)
#Data
df <- structure(list(camera_id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L), date = c("11-May-21", "11-May-21", "11-May-21",
"15-May-21", "15-May-21", "10-May-21", "10-May-21", "12-May-21",
"12-May-21", "12-May-21", "12-May-21", "12-May-21", "13-May-21",
"13-May-21"), time = c("5:23:46", "5:23:50", "5:32:34", "9:35:20",
"9:35:35", "23:11:16", "23:11:17", "11:06:08", "11:15:09", "11:24:10",
"2:04:01", "2:04:03", "1:15:00", "1:15:50"), organism = c("mouse",
"mouse", "bird", "squirrel", "squirrel", "mouse", "mouse", "woodchuck",
"woodchuck", "woodchuck", "mouse", "mouse", "mouse", "mouse")), class = "data.frame", row.names = c(NA,
-14L))
#Combining date and time
df$datetime <- as.POSIXct(paste(df$date, df$time), format ="%d-%B-%y %H:%M:%S")
#Time differences in minutes, based on organism
df <- df %>% group_by(organism) %>%
mutate(timediff = (datetime - lag(datetime))/60
)
#Round minutes to 2 decimal points
df$timediff <- round(df$timediff, digits=2)
#Make negative and NA values = 0. Negative values appear when going from one camera to the next. R thinks it is going back in time, rather than
#swapping cameras
df$timediff[df$timediff<0] <- 0
df$timediff[is.na(df$timediff)] <- 0
At this point, I want to use timediff as my condition for aggregation, and aggregate any subsequent rows of data with a timediff < 10, as long as the row has the same camera_id and organism. I've been trying different dplyr approaches but haven't been able to crack this. The output should look like this:
structure(list(camera_id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L), start_datetime = c("5/11/2021 5:23",
"5/11/2021 5:32", "5/15/2021 9:35", "5/10/2021 23:11", "5/10/2021 11:06",
"5/12/2021 2:04", "5/13/2021 1:15"), end_datetime = c("5/11/2021 5:23",
"5/11/2021 5:32", "5/15/2021 9:35", "5/10/2021 23:11", "5/10/2021 11:24",
"5/12/2021 2:04", "5/13/2021 1:15"), organism = c("mouse", "bird",
"squirrel", "mouse", "woodchuck", "mouse", "mouse"), encounter_time = c("0:00:04",
"0:00:00", "0:00:15", "0:00:01", "0:18:00", "0:00:02", "0:00:50"
)), class = "data.frame", row.names = c(NA, -7L))

I think this gets you your desired result:
A couple of key changes: when we calculate timediff, it makes sense to group by camera_id in addition to organism, since that grouping persists throughout.
Then we need to create some helper columns to generate our grouping based on the 10-minute condition.
under_10 is 0 for all values of timediff less than 10, and also when timediff is NA (when a row is the first within its group). under_10 is 1 when timediff is 10 or more, so despite its name it flags the rows that start a new encounter.
Then we create a grouping variable that increments each time under_10 is 1. Finally we summarize, calculating start and end from the min/max datetimes, and remove the grouping column.
library(tidyverse)
df$datetime <- as.POSIXct(paste(df$date, df$time), format ="%d-%B-%y %H:%M:%S")
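#Time differences in minutes, based on organism and camera
#(units are set explicitly; plain subtraction lets R pick secs/mins/hours per group)
df <- df %>% group_by(organism, camera_id) %>%
mutate(timediff = as.numeric(difftime(datetime, lag(datetime), units = "mins"))
)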
#Round minutes to 2 decimal points
df$timediff <- round(df$timediff, digits=2)
df %>% mutate(under_10 = ifelse(timediff < 10 | is.na(timediff), 0, 1)) %>%
arrange(camera_id, datetime) %>%
mutate(grouping = cumsum(under_10)) %>%
group_by(camera_id, organism, grouping) %>%
summarize(start_datetime = min(datetime), end_datetime = max(datetime),
encounter_time = end_datetime-start_datetime) %>%
select(-grouping)
camera_id organism start_datetime end_datetime encounter_time
<int> <chr> <dttm> <dttm> <drtn>
1 1 bird 2021-05-11 05:32:34 2021-05-11 05:32:34 0 secs
2 1 mouse 2021-05-11 05:23:46 2021-05-11 05:23:50 4 secs
3 1 squirrel 2021-05-15 09:35:20 2021-05-15 09:35:35 15 secs
4 2 mouse 2021-05-10 23:11:16 2021-05-10 23:11:17 1 secs
5 2 woodchuck 2021-05-12 11:06:08 2021-05-12 11:24:10 1082 secs
6 3 mouse 2021-05-12 02:04:01 2021-05-12 02:04:03 2 secs
7 3 mouse 2021-05-13 01:15:00 2021-05-13 01:15:50 50 secs
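To see the grouping step in isolation, here is a minimal base R sketch on a made-up vector of gaps. The names gap_min, new_encounter and encounter_id are hypothetical and are not columns from the code above.

```r
# Hypothetical gaps (in minutes) between consecutive photos of one species
# at one camera; NA marks the first photo, which has no predecessor.
gap_min <- c(NA, 7, 7, 25, 3, 40)

# A row starts a new encounter when the gap is known and at least 10 minutes.
new_encounter <- !is.na(gap_min) & gap_min >= 10

# The running count of "new encounter" flags labels each encounter.
encounter_id <- cumsum(new_encounter)
print(encounter_id)
```

Rows that share an encounter_id belong to the same encounter, which is why min() and max() of datetime within that id give the start and end times.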
Also, if you'd rather have the H:MM:SS format for encounter_time, you can get there by adding the following after the summarize call in the above code:
library(lubridate)
...
mutate(encounter_time = seconds_to_period(as.character(encounter_time))) %>%
select(-grouping) %>%
mutate(encounter_time = sprintf("%1i:%02i:%02i",
lubridate::hour(encounter_time),
lubridate::minute(encounter_time),
lubridate::second(encounter_time)))
camera_id organism start_datetime end_datetime encounter_time
<int> <chr> <dttm> <dttm> <chr>
1 1 bird 2021-05-11 05:32:34 2021-05-11 05:32:34 0:00:00
2 1 mouse 2021-05-11 05:23:46 2021-05-11 05:23:50 0:00:04
3 1 squirrel 2021-05-15 09:35:20 2021-05-15 09:35:35 0:00:15
4 2 mouse 2021-05-10 23:11:16 2021-05-10 23:11:17 0:00:01
5 2 woodchuck 2021-05-12 11:06:08 2021-05-12 11:24:10 0:18:02
6 3 mouse 2021-05-12 02:04:01 2021-05-12 02:04:03 0:00:02
7 3 mouse 2021-05-13 01:15:00 2021-05-13 01:15:50 0:00:50
However, you end up with encounter_time stored as character, so that may or may not be useful.
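If the character column is a concern, one option is to keep the duration numeric and format it only for display. Below is a small base R sketch of that idea; fmt_hms is a hypothetical helper name, and it assumes the durations are given in seconds.

```r
# Format a number of seconds as H:MM:SS (hypothetical helper, base R only).
fmt_hms <- function(secs) {
  secs <- as.integer(round(as.numeric(secs)))
  sprintf("%d:%02d:%02d",
          secs %/% 3600L,            # whole hours
          (secs %% 3600L) %/% 60L,   # remaining whole minutes
          secs %% 60L)               # remaining seconds
}

print(fmt_hms(c(0, 4, 1082)))
```

The numeric column can then still be summed or averaged, with fmt_hms() applied only when printing.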

Related

In R, make a conditional indicator variable based on (a) the first instance of a record type and (b) a date difference

Background
Here's a df with some data in it from a Costco-like members-only big-box store:
d <- data.frame(ID = c("a","a","b","c","c","d"),
purchase_type = c("grocery","grocery",NA,"auto","grocery",NA),
date_joined = as.Date(c("2014-01-01","2014-01-01","2013-04-30","2009-03-08","2009-03-08","2015-03-04")),
date_purchase = as.Date(c("2014-04-30","2016-07-08","2013-06-29","2015-04-07","2017-09-10","2017-03-10")),
stringsAsFactors=T)
d <- d %>%
mutate(date_diff = d$date_purchase - d$date_joined)
This yields the following table:
As you can see, it's got a member ID, purchase types based on the broad category of what people bought, and two dates: the date the member originally became a member, and the date of a given purchase. I've also made a variable date_diff to tally the time between a given purchase and the beginning of membership.
The Problem
I'd like to make a new variable early_shopper that's marked 1 on all of a member's purchases if:
1. That member's first purchase was made within a year of joining (so date_diff <= 365 days).
2. This first purchase doesn't have an NA in purchase_type.
If these criteria aren't met, give a 0.
What I'm looking for is a table that looks like this:
Note that Member a is the only "true" early_shopper: their first purchase is non-NA in purchase_type, and only 119 days passed between their joining the store and making a purchase there. Member b looks like they could be based on my date_diff criterion, but since they don't have a non-NA value in purchase_type, they don't count as an early_shopper.
What I've Tried
So far, I've tried using mutate and first functions like this:
d <- d %>%
mutate(early_shopper = if_else(!is.na(first(purchase_type,order_by = date_joined)) & date_diff < 365, 1, 0))
Which gives me this:
Something's kinda working here, but not fully. As you can see, I get the correct early_shopper = 1 in Member a's first purchase, but not their second. I also get a false positive with member b, who's marked as an early_shopper when I don't want them to be (because their purchase_type is NA).
Any ideas? I can further clarify if need be. Thanks!
You could use
library(dplyr)
d %>%
mutate(date_diff = date_purchase - date_joined) %>%
group_by(ID, purchase_type) %>%
arrange(ID, date_joined) %>%
mutate(
early_shopper = +(!is.na(first(purchase_type)) & date_diff <= 365)
) %>%
group_by(ID) %>%
mutate(early_shopper = max(early_shopper)) %>%
ungroup()
which returns
# A tibble: 6 x 6
ID purchase_type date_joined date_purchase date_diff early_shopper
<fct> <fct> <date> <date> <drtn> <int>
1 a grocery 2014-01-01 2014-04-30 119 days 1
2 a grocery 2014-01-01 2016-07-08 919 days 1
3 b NA 2013-04-30 2013-06-29 60 days 0
4 c auto 2009-03-08 2015-04-07 2221 days 0
5 c grocery 2009-03-08 2017-09-10 3108 days 0
6 d NA 2015-03-04 2017-03-10 737 days 0
If you want the early_shopper column to be boolean/logical, just remove the +.
Data
I used this data. Here the date_joined for b is 2013-04-30, as shown in your images, rather than the value in the data you posted.
structure(list(ID = structure(c(1L, 1L, 2L, 3L, 3L, 4L), .Label = c("a",
"b", "c", "d"), class = "factor"), purchase_type = structure(c(2L,
2L, NA, 1L, 2L, NA), .Label = c("auto", "grocery"), class = "factor"),
date_joined = structure(c(16071, 16071, 15825, 14311, 14311,
16498), class = "Date"), date_purchase = structure(c(16190,
16990, 15885, 16532, 17419, 17235), class = "Date")), class = "data.frame", row.names = c(NA,
-6L))
Here is my approach using a join to get the early_shopper value to be the same for all rows of the same ID.
library(dplyr)
d <- structure(list(ID = structure(c(1L, 1L, 2L, 3L, 3L, 4L),
.Label = c("a","b", "c", "d"),
class = "factor"),
purchase_type = structure(c(2L, 2L, NA, 1L, 2L, NA),
.Label = c("auto", "grocery"),
class = "factor"),
date_joined = structure(c(16071, 16071, 15825, 14311, 14311, 16498),
class = "Date"),
date_purchase = structure(c(16190, 16990, 15885, 16532, 17419, 17235),
class = "Date")),
class = "data.frame", row.names = c(NA, -6L))
d %>%
inner_join(d %>%
mutate(date_diff = date_purchase - date_joined) %>%
group_by(ID) %>%
slice_min(date_diff) %>%
transmute(early_shopper = if_else(!is.na(first(purchase_type,
order_by = date_joined)) &
date_diff <= 365, 1, 0)) %>%
ungroup()
)
ID purchase_type date_joined date_purchase early_shopper
1 a grocery 2014-01-01 2014-04-30 1
2 a grocery 2014-01-01 2016-07-08 1
3 b <NA> 2013-04-30 2013-06-29 0
4 c auto 2009-03-08 2015-04-07 0
5 c grocery 2009-03-08 2017-09-10 0
6 d <NA> 2015-03-04 2017-03-10 0
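For comparison, here is a base R sketch that applies the two criteria from the question literally, looking only at each member's earliest purchase. It assumes the "first purchase" is the row with the smallest date_purchase within an ID; flag_by_id is a made-up name.

```r
d <- data.frame(
  ID = c("a", "a", "b", "c", "c", "d"),
  purchase_type = c("grocery", "grocery", NA, "auto", "grocery", NA),
  date_joined = as.Date(c("2014-01-01", "2014-01-01", "2013-04-30",
                          "2009-03-08", "2009-03-08", "2015-03-04")),
  date_purchase = as.Date(c("2014-04-30", "2016-07-08", "2013-06-29",
                            "2015-04-07", "2017-09-10", "2017-03-10"))
)

# One 0/1 flag per member, computed from that member's earliest purchase only.
flag_by_id <- sapply(split(d, d$ID), function(g) {
  first_row <- g[which.min(as.numeric(g$date_purchase)), ]
  days <- as.numeric(first_row$date_purchase - first_row$date_joined)
  as.integer(!is.na(first_row$purchase_type) && days <= 365)
})

# Copy the per-member flag back onto every purchase row of that member.
d$early_shopper <- unname(flag_by_id[as.character(d$ID)])
print(d)
```

Because the flag is computed once per ID and then looked up by name, every row of a member gets the same value without a join.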

How to count the number of customer per month in R?

So I have a table of customers with the respective date as below:
ID  Date
1   2019-04-17
4   2019-05-12
1   2019-04-25
2   2019-05-19
I just want to count how many customers there are for each month-year, like below:
Month-Year  Count of Customer
Apr-19      2
May-19      2
EDIT:
Sorry but I think my Question should be clearer.
The same customer can appear more than once in a month and would then be counted as 2 customers for that month. I would basically like to find the number of transactions per month based on customer ID.
My assumed approach would be to first change the date into a month-year format, and then count the customers grouped by month, but I am not sure how to do this in R. Thank you!
You can use count -
library(dplyr)
df %>% count(Month_Year = format(as.Date(Date), '%b-%y'))
# Month_Year n
#1 Apr-19 2
#2 May-19 2
Or table in base R -
table(format(as.Date(df$Date), '%b-%y'))
#Apr-19 May-19
# 2 2
data
df <- structure(list(ID = c(1L, 4L, 1L, 2L), Date = c("2019-04-17",
"2019-05-12", "2019-04-25", "2019-05-19")),
class = "data.frame", row.names = c(NA, -4L))
We can use zoo::as.yearmon
library(dplyr)
df %>%
count(Date = zoo::as.yearmon(Date))
Date n
1 Apr 2019 2
2 May 2019 2
data
df <- structure(list(ID = c(1L, 4L, 1L, 2L), Date = c("2019-04-17",
"2019-05-12", "2019-04-25", "2019-05-19")),
class = "data.frame", row.names = c(NA, -4L))
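Since the question's edit distinguishes transactions from customers, the base R sketch below computes both side by side. It uses a "%Y-%m" key instead of "%b-%y" as an assumption on my part, because that key does not depend on the locale and sorts chronologically; month_key, transactions and unique_customers are illustrative names.

```r
df <- data.frame(ID = c(1L, 4L, 1L, 2L),
                 Date = c("2019-04-17", "2019-05-12", "2019-04-25", "2019-05-19"))

# Year-month key, e.g. "2019-04"
month_key <- format(as.Date(df$Date), "%Y-%m")

# Every row counts: number of transactions per month
transactions <- table(month_key)

# Each ID counts once per month: number of distinct customers per month
unique_customers <- tapply(df$ID, month_key, function(x) length(unique(x)))

print(transactions)
print(unique_customers)
```

In this sample, April has two transactions but only one distinct customer, since ID 1 appears twice.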

how to filter data based on the latest date of a date group?

I know my question is not as clear as it should be, so I hope my explanation will make it more comprehensible. I have data like this:
# total_call data
call_id | from_number | retrieved_date
1 1 2020-01-12 12:03:34
2 1 2020-01-12 12:06:34
3 2 2020-01-15 13:02:40
4 2 2020-01-15 13:05:40
5 1 2020-01-12 13:09:34
I want to group the calls by from_number and retrieved_date, where the times in a group must be within 1 hour of the earliest one. After 1 hour, a call belongs to a new group. Then I want to keep only the latest call of each group. This is the result I want:
# total_call data
call_id | from_number | retrieved_date
2 1 2020-01-12 12:06:34
4 2 2020-01-15 13:05:40
5 1 2020-01-12 13:09:34
Thanks for your attention. I’m looking forward to your reply.
We convert retrieved_date to POSIXct format, arrange the data and create a new group when the current retrieved_date is greater than previous retrieved_date by more than an hour and select the row with max retrieved_date.
library(dplyr)
df %>%
mutate(retrieved_date = lubridate::ymd_hms(retrieved_date)) %>%
arrange(from_number, retrieved_date) %>%
group_by(from_number) %>%
group_by(gr = cumsum(difftime(retrieved_date, lag(retrieved_date,
default = first(retrieved_date)), units = "hours") > 1), .add = TRUE) %>%
slice(which.max(retrieved_date)) %>%
ungroup() %>%
select(-gr)
# A tibble: 3 x 3
# call_id from_number retrieved_date
# <int> <int> <dttm>
#1 2 1 2020-01-12 12:06:34
#2 5 1 2020-01-12 13:09:34
#3 4 2 2020-01-15 13:05:40
data
df <- structure(list(call_id = 1:5, from_number = c(1L, 1L, 2L, 2L,
1L), retrieved_date = structure(c(1L, 2L, 4L, 5L, 3L),
.Label = c("2020-01-12 12:03:34","2020-01-12 12:06:34", "2020-01-12 13:09:34",
"2020-01-15 13:02:40", "2020-01-15 13:05:40"), class = "factor")),
class = "data.frame", row.names = c(NA, -5L))
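Note that the code above starts a new group when the gap between consecutive calls exceeds an hour, while the question's wording ("within 1 hour since the earliest") can also be read as anchoring each group to its first call. Both readings give the same result on the sample data. A minimal base R sketch of the anchored reading follows; window_group and width_secs are hypothetical names, and the input is assumed to be sorted in ascending order.

```r
# Assign group ids so that every member of a group lies within width_secs
# of that group's first (earliest) element. Input must be sorted ascending.
window_group <- function(times, width_secs = 3600) {
  t <- as.numeric(times)
  group <- integer(length(t))
  if (length(t) == 0) return(group)
  start <- t[1]
  id <- 1L
  for (i in seq_along(t)) {
    if (t[i] - start > width_secs) {  # outside the current window
      start <- t[i]                   # this call opens a new window
      id <- id + 1L
    }
    group[i] <- id
  }
  group
}

calls <- as.POSIXct(c("2020-01-12 12:03:34", "2020-01-12 12:06:34",
                      "2020-01-12 13:09:34"), tz = "UTC")
print(window_group(calls))
```

The two readings differ for a sequence such as 0, 50 and 70 minutes: the consecutive-gap rule keeps all three together, whereas the anchored rule puts the third call in a new group.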

How to divide contents of one column by different values, conditional on contents of a second column?

I've got a data frame that looks like something along these lines:
Day Salesperson Value
==== ============ =====
Monday John 40
Monday Sarah 50
Tuesday John 60
Tuesday Sarah 30
Wednesday John 50
Wednesday Sarah 40
I want to divide the value for each salesperson by the number of times that each of the days of the week has occurred. So: there have been 3 Mondays, 3 Tuesdays, and 2 Wednesdays. I don't have this information digitally, but can create a vector along the lines of
c(3, 3, 2)
How can I conditionally divide the Value column based on the number of times each day occurs?
I've found an inelegant solution, which entails copying the Day column to a temp column, replacing each of the names of the week in the new column with the number of times each day occurs using
df$temp <- sub("Monday", 3, df$temp)
but doing this seems kinda clunky. Is there a neat way to do this?
Suppose your auxiliary data is in another data.frame:
Day N_Day
1 Monday 3
2 Tuesday 3
3 Wednesday 2
The simplest way would be to merge:
DF_new <- merge(DF, DF2, by="Day")
DF_new$newcol <- DF_new$Value / DF_new$N_Day
which gives
Day Salesperson Value N_Day newcol
1 Monday John 40 3 13.33333
2 Monday Sarah 50 3 16.66667
3 Tuesday John 60 3 20.00000
4 Tuesday Sarah 30 3 10.00000
5 Wednesday John 50 2 25.00000
6 Wednesday Sarah 40 2 20.00000
The mergeless shortcut is
DF$newcol <- DF$Value / DF2$N_Day[match(DF$Day, DF2$Day)]
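If the counts only exist as a hand-typed vector like c(3, 3, 2), a named vector can serve as the lookup table directly. This is a sketch under that assumption; n_day is a hypothetical name.

```r
DF <- data.frame(
  Day = c("Monday", "Monday", "Tuesday", "Tuesday", "Wednesday", "Wednesday"),
  Salesperson = c("John", "Sarah", "John", "Sarah", "John", "Sarah"),
  Value = c(40, 50, 60, 30, 50, 40)
)

# Hand-typed counts, named by day so they can be indexed by the Day column
n_day <- c(Monday = 3, Tuesday = 3, Wednesday = 2)

# as.character() guards against Day being a factor (indexing by a factor
# would use its integer codes rather than its labels)
DF$newcol <- DF$Value / unname(n_day[as.character(DF$Day)])
print(DF)
```

Indexing by name does the same job as match() here, without needing a second data frame.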
Data:
DF <- structure(list(Day = structure(c(1L, 1L, 2L, 2L, 3L, 3L), .Label =
c("Monday",
"Tuesday", "Wednesday"), class = "factor"), Salesperson = structure(c(1L,
2L, 1L, 2L, 1L, 2L), .Label = c("John", "Sarah"), class = "factor"),
Value = c(40L, 50L, 60L, 30L, 50L, 40L)), .Names = c("Day",
"Salesperson", "Value"), class = "data.frame", row.names = c(NA,
-6L))
DF2 <- structure(list(Day = structure(1:3, .Label = c("Monday", "Tuesday",
"Wednesday"), class = "factor"), N_Day = c(3, 3, 2)), .Names = c("Day",
"N_Day"), row.names = c(NA, -3L), class = "data.frame")
You can use the library dplyr to merge your data frame with the frequency of each day.
library(dplyr)
df <- data.frame(
Day=c("Monday","Monday","Tuesday","Tuesday","Wednesday","Wednesday"),
Salesperson=c("John","Sarah","John","Sarah","John","Sarah"),
Value=c(40,50,60,30,50,40), stringsAsFactors=F)
aux <- data.frame(
Day=c("Monday","Tuesday","Wednesday"),
n=c(3,3,2)
)
output <- df %>% left_join(aux, by="Day") %>% mutate(Value2=Value/n)
To create the auxiliary table from the count of days that appear in your original data, instead of building it manually, you could use the following (the output shown below comes from rerunning the join with this version of aux):
aux <- df %>% group_by(Day) %>% summarise(n=n())
> output
Day Salesperson Value n Value2
1 Monday John 40 2 20
2 Monday Sarah 50 2 25
3 Tuesday John 60 2 30
4 Tuesday Sarah 30 2 15
5 Wednesday John 50 2 25
6 Wednesday Sarah 40 2 20
If you want to overwrite the actual Value column, use mutate(Value=Value/n), and to remove the additional column you can add select(-n):
output <- df %>% left_join(aux, by="Day") %>% mutate(Value=Value/n) %>% select(-n)

Convert factor type time into number of minutes

There is a column in my dataset that contains time in the format 00:20:10. I have two questions. First, when I import it into R using read.xlsx2(), this column is converted to factor type. How can I convert it to time type?
Second, I want to calculate each person's total time in number of minutes.
ID Time
1 00:10:00
1 00:21:30
2 00:30:10
2 00:04:10
The output I want is:
ID Total.time
1 31.5
2 34.3
I haven't dealt with time issue before, and I hope someone would recommend some packages as well.
You could use times() from the chron package to convert the Time column to "times" class. Then aggregate() to sum the times, grouped by the ID column. This first block will give us actual times in the result.
library(chron)
df$Time <- times(df$Time)
aggregate(list(Total.Time = df$Time), df[1], sum)
# ID Total.Time
# 1 1 00:31:30
# 2 2 00:34:20
For decimal output, we can employ minutes() and seconds(), also from chron.
aggregate(list(Total.Time = df$Time), df[1], function(x) {
minutes(s <- sum(x)) + (seconds(s) / 60)
})
# ID Total.Time
# 1 1 31.50000
# 2 2 34.33333
Furthermore, we can also use data.table for improved efficiency.
library(data.table)
setDT(df)[, .(Total.Time = minutes(s <- sum(Time)) + (seconds(s) / 60)), by = ID]
# ID Total.Time
# 1: 1 31.50000
# 2: 2 34.33333
Data:
df <- structure(list(ID = c(1L, 1L, 2L, 2L), Time = structure(c(2L,
3L, 4L, 1L), .Label = c("00:04:10", "00:10:00", "00:21:30", "00:30:10"
), class = "factor")), .Names = c("ID", "Time"), class = "data.frame", row.names = c(NA,
-4L))
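If installing chron is not an option, the same totals can be computed in base R by splitting the strings. This is a minimal sketch assuming every value has the form HH:MM:SS; to_minutes is a hypothetical helper name.

```r
# Convert "HH:MM:SS" strings (or factors) to decimal minutes.
to_minutes <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  # hours * 60 + minutes + seconds / 60
  vapply(parts, function(p) sum(as.numeric(p) * c(60, 1, 1 / 60)), numeric(1))
}

df <- data.frame(ID = c(1L, 1L, 2L, 2L),
                 Time = factor(c("00:10:00", "00:21:30", "00:30:10", "00:04:10")))

# Sum the decimal minutes per ID
totals <- aggregate(list(Total.time = to_minutes(df$Time)), df["ID"], sum)
print(totals)
```

The as.character() call handles the factor column produced by read.xlsx2(), so no separate conversion step is needed first.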

Resources