I am doing data analysis of photos of animals from trail cameras. My data includes which camera a picture was taken with, the date and time the picture was taken, and the animal in the photo. I wish to aggregate my data based on the time animals spent in front of the camera. For our purposes, a new encounter begins whenever we photograph an animal more than 10 minutes after the previous photograph of the same species. Encounters can therefore last longer than 10 minutes: if we took 3 pictures of the same animal 7 minutes apart from one another, that is a single 14-minute encounter. I want my output to aggregate my data into individual encounters for all animals photographed, and to include start and end times for each encounter photo series.
My code thus far
library(dplyr)
#Data
df <- structure(list(camera_id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L), date = c("11-May-21", "11-May-21", "11-May-21",
"15-May-21", "15-May-21", "10-May-21", "10-May-21", "12-May-21",
"12-May-21", "12-May-21", "12-May-21", "12-May-21", "13-May-21",
"13-May-21"), time = c("5:23:46", "5:23:50", "5:32:34", "9:35:20",
"9:35:35", "23:11:16", "23:11:17", "11:06:08", "11:15:09", "11:24:10",
"2:04:01", "2:04:03", "1:15:00", "1:15:50"), organism = c("mouse",
"mouse", "bird", "squirrel", "squirrel", "mouse", "mouse", "woodchuck",
"woodchuck", "woodchuck", "mouse", "mouse", "mouse", "mouse")), class = "data.frame", row.names = c(NA,
-14L))
#Combining date and time
df$datetime <- as.POSIXct(paste(df$date, df$time), format ="%d-%B-%y %H:%M:%S")
#Time differences in minutes, based on organism
df <- df %>% group_by(organism) %>%
mutate(timediff = (datetime - lag(datetime))/60
)
#Round minutes to 2 decimal points
df$timediff <- round(df$timediff, digits=2)
#Make negative and NA values = 0. Negative values appear when going from one camera to the next. R thinks it is going back in time, rather than
#swapping cameras
df$timediff[df$timediff<0] <- 0
df$timediff[is.na(df$timediff)] <- 0
At this point, I want to use timediff as my condition for aggregation, and aggregate any subsequent rows of data with a timediff < 10, as long as the row has the same camera_id and organism. I've been trying different dplyr approaches but haven't been able to crack this. The output should look like this.
structure(list(camera_id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L), start_datetime = c("5/11/2021 5:23",
"5/11/2021 5:32", "5/15/2021 9:35", "5/10/2021 23:11", "5/10/2021 11:06",
"5/12/2021 2:04", "5/13/2021 1:15"), end_datetime = c("5/11/2021 5:23",
"5/11/2021 5:32", "5/15/2021 9:35", "5/10/2021 23:11", "5/10/2021 11:24",
"5/12/2021 2:04", "5/13/2021 1:15"), organism = c("mouse", "bird",
"squirrel", "mouse", "woodchuck", "mouse", "mouse"), encounter_time = c("0:00:04",
"0:00:00", "0:00:15", "0:00:01", "0:18:00", "0:00:02", "0:00:50"
)), class = "data.frame", row.names = c(NA, -7L))
I think this gets you your desired result:
A couple of key changes: when we calculate timediff, it makes sense to group by camera_id in addition to organism, since that grouping persists throughout.
Then we need a helper column to generate our grouping based on the 10-minute condition:
under_10 is 0 for all values of timediff less than 10, and also when timediff is NA (when a row is the first within its group); it is 1 when the elapsed time is greater than 10 minutes.
Then we create a grouping variable via cumsum() that increments each time the elapsed time exceeds 10 minutes. Finally, we simply summarize, calculating start and end from the min/max datetimes, and remove the grouping column.
library(tidyverse)
df$datetime <- as.POSIXct(paste(df$date, df$time), format ="%d-%B-%y %H:%M:%S")
#Time differences in minutes, based on organism
df <- df %>% group_by(organism, camera_id) %>%
  mutate(timediff = as.numeric(difftime(datetime, lag(datetime), units = "mins")))
#Round minutes to 2 decimal points
df$timediff <- round(df$timediff, digits=2)
df %>%
  mutate(under_10 = ifelse(timediff < 10 | is.na(timediff), 0, 1)) %>%
  arrange(camera_id, datetime) %>%
  mutate(grouping = cumsum(under_10)) %>%
  group_by(camera_id, organism, grouping) %>%
  summarize(start_datetime = min(datetime),
            end_datetime = max(datetime),
            encounter_time = end_datetime - start_datetime) %>%
  select(-grouping)
camera_id organism start_datetime end_datetime encounter_time
<int> <chr> <dttm> <dttm> <drtn>
1 1 bird 2021-05-11 05:32:34 2021-05-11 05:32:34 0 secs
2 1 mouse 2021-05-11 05:23:46 2021-05-11 05:23:50 4 secs
3 1 squirrel 2021-05-15 09:35:20 2021-05-15 09:35:35 15 secs
4 2 mouse 2021-05-10 23:11:16 2021-05-10 23:11:17 1 secs
5 2 woodchuck 2021-05-12 11:06:08 2021-05-12 11:24:10 1082 secs
6 3 mouse 2021-05-12 02:04:01 2021-05-12 02:04:03 2 secs
7 3 mouse 2021-05-13 01:15:00 2021-05-13 01:15:50 50 secs
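As an aside, the grouping trick above is just a cumulative sum over a "gap exceeded" flag. A minimal base-R sketch with hypothetical gap values (not your data) shows the mechanics:

```r
# Hypothetical per-photo gaps in minutes (NA marks the first photo in a group)
gaps_min <- c(NA, 0.07, 8.73, 15, 0.25)

# Flag rows that start a new encounter (gap over 10 minutes),
# treating NA as "same encounter", just like under_10 does above
new_encounter <- ifelse(is.na(gaps_min) | gaps_min < 10, 0, 1)

# cumsum() turns the flags into an encounter id that increments at each flagged row
encounter_id <- cumsum(new_encounter)
encounter_id
# [1] 0 0 0 1 1
```

Rows 1-3 share encounter 0; the 15-minute gap opens encounter 1, which row 5 joins.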
Also, if you'd rather have the H:MM:SS format for encounter_time, you can get there by adding the following after the summarize() call in the code above:
library(lubridate)
...
mutate(encounter_time = seconds_to_period(as.numeric(encounter_time, units = "secs"))) %>%
select(-grouping) %>%
mutate(encounter_time = sprintf("%1i:%02i:%02i",
lubridate::hour(encounter_time),
lubridate::minute(encounter_time),
lubridate::second(encounter_time)))
camera_id organism start_datetime end_datetime encounter_time
<int> <chr> <dttm> <dttm> <chr>
1 1 bird 2021-05-11 05:32:34 2021-05-11 05:32:34 0:00:00
2 1 mouse 2021-05-11 05:23:46 2021-05-11 05:23:50 0:00:04
3 1 squirrel 2021-05-15 09:35:20 2021-05-15 09:35:35 0:00:15
4 2 mouse 2021-05-10 23:11:16 2021-05-10 23:11:17 0:00:01
5 2 woodchuck 2021-05-12 11:06:08 2021-05-12 11:24:10 0:18:02
6 3 mouse 2021-05-12 02:04:01 2021-05-12 02:04:03 0:00:02
7 3 mouse 2021-05-13 01:15:00 2021-05-13 01:15:50 0:00:50
However, you end up with encounter_time stored as character, so that may or may not be useful to you.
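If you'd rather skip the lubridate round-trip, a base-R sprintf() over the raw seconds produces the same H:MM:SS strings (a sketch using the durations from the output above):

```r
# Encounter durations in seconds, taken from the summarized output above
secs <- c(0, 4, 15, 1, 1082, 2, 50)

sprintf("%d:%02d:%02d",
        secs %/% 3600,           # whole hours
        (secs %% 3600) %/% 60,   # leftover whole minutes
        secs %% 60)              # leftover seconds
# [1] "0:00:00" "0:00:04" "0:00:15" "0:00:01" "0:18:02" "0:00:02" "0:00:50"
```

The result is still character, of course, but without the extra dependency.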
So I have a table of customers with the respective date as below:
ID   Date
1    2019-04-17
4    2019-05-12
1    2019-04-25
2    2019-05-19
I just want to count how many Customer is there for each month-year like below:
Month-Year   Count of Customer
Apr-19       2
May-19       2
EDIT:
Sorry, I think my question should be clearer.
The same customer can appear more than once in a month, and each appearance should count, so a customer with 2 transactions in a month counts as 2. I basically want to find the number of transactions per month, based on customer ID.
My assumed approach would be to first change the date into a month-year format, and then count the customers grouped by month, but I am not sure how to do this in R. Thank you!
You can use count -
library(dplyr)
df %>% count(Month_Year = format(as.Date(Date), '%b-%y'))
# Month_Year n
#1 Apr-19 2
#2 May-19 2
Or table in base R -
table(format(as.Date(df$Date), '%b-%y'))
#Apr-19 May-19
# 2 2
data
df <- structure(list(ID = c(1L, 4L, 1L, 2L), Date = c("2019-04-17",
"2019-05-12", "2019-04-25", "2019-05-19")),
class = "data.frame", row.names = c(NA, -4L))
We can use zoo::as.yearmon
library(dplyr)
df %>%
count(Date = zoo::as.yearmon(Date))
Date n
1 Apr 2019 2
2 May 2019 2
data
df <- structure(list(ID = c(1L, 4L, 1L, 2L), Date = c("2019-04-17",
"2019-05-12", "2019-04-25", "2019-05-19")),
class = "data.frame", row.names = c(NA, -4L))
I have
household person start time end time
1 1 07:45:00 21:45:00
1 2 09:45:00 17:45:00
1 3 22:45:00 23:45:00
1 4 08:45:00 01:45:00
1 1 23:50:00 24:00:00
2 1 07:45:00 21:45:00
2 2 16:45:00 22:45:00
I want to add a column indicating overlapping time between family members.
The indicator is 1 if a person's start and end time intersects with another member's, otherwise 0.
In the first family above, the times of the first, second, and fourth persons intersect, so their indicator is 1; the third and fifth rows don't intersect with any of the other people in the household.
output:
household person start time end time overlap
1 1 07:45:00 21:45:00 1
1 2 09:45:00 17:45:00 1
1 3 22:45:00 23:45:00 0
1 4 08:45:00 01:45:00 1
1 1 23:50:00 24:00:00 0
2 1 07:45:00 21:45:00 1
2 2 16:45:00 22:45:00 1
data with dput format:
structure(list(SAMPN = c(1L, 1L, 1L, 2L, 2L, 2L), PERNO = c(1,
1, 1, 1, 1, 1), arr = structure(c(30300, 35280, 37200, 32400,
34200, 39600), class = c("hms", "difftime"), units = "secs"),
dep = structure(c(34200, 36300, 61800, 33600, 37800, 50400
), class = c("hms", "difftime"), units = "secs")), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), groups = structure(list(
SAMPN = 1:2, PERNO = c(1, 1), .rows = list(1:3, 4:6)), row.names = c(NA,
-2L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE))
I have tried a tidyverse solution:
library(tidyverse)
df = structure(list(SAMPN = c(1L, 1L, 1L, 2L, 2L, 2L),
                    PERNO = c(1:3, 1:3),
                    arr = structure(c(30300, 35280, 37200, 32400, 34200, 39600),
                                    class = c("hms", "difftime"), units = "secs"),
                    dep = structure(c(34200, 36300, 61800, 33600, 37800, 50400),
                                    class = c("hms", "difftime"), units = "secs")),
               class = c("grouped_df", "tbl_df", "tbl", "data.frame"),
               row.names = c(NA, -6L),
               groups = structure(list(SAMPN = 1:2, PERNO = c(1, 1),
                                       .rows = list(1:3, 4:6)),
                                  row.names = c(NA, -2L),
                                  class = c("tbl_df", "tbl", "data.frame"),
                                  .drop = TRUE))
Then, I added:
df %>% group_by(SAMPN) %>%
mutate(
arr_min = mapply(function(x) min(arr[-x]), 1:n()),
dep_max = mapply(function(x) max(dep[-x]), 1:n()),
overlap = ifelse(arr<arr_min | dep>dep_max, 0, 1)
)
You will get:
SAMPN PERNO arr dep arr_min dep_max overlap
<int> <int> <time> <time> <dbl> <dbl> <dbl>
1 1 1 08:25 09:30 35280 61800 0
2 1 2 09:48 10:05 30300 61800 1
3 1 3 10:20 17:10 30300 36300 0
4 2 1 09:00 09:20 34200 50400 0
5 2 2 09:30 10:30 32400 50400 1
6 2 3 11:00 14:00 32400 37800 0
You basically compare the current arr and dep with arr_min (the minimum arr value excluding the current row) and dep_max (the maximum dep value excluding the current row).
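To see the leave-one-out comparison in isolation, here is the same calculation run on just the first household's times (arrivals 30300/35280/37200 and departures 34200/36300/61800 seconds, from the data above):

```r
arr <- c(30300, 35280, 37200)
dep <- c(34200, 36300, 61800)

# For each person, the earliest arrival and latest departure among the *other* members
arr_min <- sapply(seq_along(arr), function(i) min(arr[-i]))
dep_max <- sapply(seq_along(dep), function(i) max(dep[-i]))

# A person is flagged 0 if they arrive before everyone else arrives
# or leave after everyone else has left
overlap <- ifelse(arr < arr_min | dep > dep_max, 0, 1)
overlap
# [1] 0 1 0
```

This reproduces the overlap column for SAMPN 1 in the output above.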
Tidyverse solution
Here's a solution in tidyverse syntax. The basic idea is the same: we perform a many-to-many merge, matching on household (SAMPN in your current example data), and remove the cases that compare a person to themself (PERNO). We check for overlaps, then collapse to a single record per household and person. Note that this code will error if all records have the same PERNO.
compare <-
df %>%
left_join(df %>%
rename(compare_PERNO = PERNO,
compare_arr = arr,
compare_dep = dep), by = "SAMPN") %>%
filter(PERNO != compare_PERNO) %>%
mutate(overlap = arr <= compare_dep & dep >= compare_arr) %>%
group_by(SAMPN, PERNO) %>%
summarize(overlap = max(overlap))
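The overlap test in the mutate() is the standard interval-intersection condition: two intervals intersect exactly when each one starts before the other ends. A quick check with two hypothetical second-of-day intervals:

```r
# Person A: 08:25-09:30, person B: 09:00-10:05, as seconds since midnight
a_arr <- 30300; a_dep <- 34200
b_arr <- 32400; b_dep <- 36300

# Intervals intersect iff A starts before B ends AND A ends after B starts
a_arr <= b_dep & a_dep >= b_arr
# [1] TRUE
```

Flipping either comparison (e.g., a disjoint B starting at 10:30) turns the result FALSE.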
SQL Solution with household grouping
Grouping the data by household actually makes this problem slightly easier. Again, I'm using SQL to solve it. In the inner SQL statement I do a many-to-many merge, matching all members of a household to all other members, and remove any cases that match a person to themself. Then, in the outer SQL statement, we reduce to one record per household and person, indicating whether they ever overlapped.
df = data.frame(
household = c(rep(1,5), rep(2,2)),
person = c(1:5, 1:2),
start_time=as.Date(c("2017-05-31","2018-01-14", "2019-02-03", "2018-01-19", "2019-04-17",
"2018-02-03", "2018-03-03"),
format="%Y-%m-%d"),
end_time=as.Date(c("2018-01-17", "2018-01-20", "2019-04-15", "2018-02-20", "2019-05-17",
"2019-03-03", "2019-03-03"),
format="%Y-%m-%d"))
library(sqldf)
compare <- sqldf(
"
SELECT * FROM (
SELECT L.* ,
CASE WHEN L.start_time <= R.end_time AND L.end_time >= R.start_time THEN 1
ELSE 0 END AS overlap
FROM df as L
LEFT JOIN df as R ON L.household = R.household
WHERE L.person != R.person
)
GROUP BY household, person
HAVING overlap = MAX(overlap)
"
)
SQL Solution without household grouping
This is an SQL solution to your problem. I do a keyless many-to-many merge to compare each row to every other row (but don't compare a row to itself), then I parse the big data frame down to a single record per ID that records whether any matches were found. Your data isn't quite a reprex (use the dput function in R), so I used an example dataset I had lying around. If you have trouble adapting this to your exact data, post reproducible data and I can help you out.
df = data.frame(
id = 1:3,
start_time=as.Date(c("2017-05-31","2018-01-14", "2018-02-03"), format="%Y-%m-%d"),
end_time=as.Date(c("2018-01-17", "2018-01-20", "2018-04-17"), format="%Y-%m-%d"))
library(sqldf)
compare <- sqldf(
"
SELECT * FROM (
SELECT L.* ,
CASE WHEN L.start_time <= R.end_time AND L.end_time >= R.start_time THEN 1
ELSE 0 END AS overlap
FROM df as L
CROSS JOIN df as R
WHERE L.id != R.id
)
GROUP BY id
HAVING overlap = MAX(overlap)
"
)
I've got a data frame that looks like something along these lines:
Day Salesperson Value
==== ============ =====
Monday John 40
Monday Sarah 50
Tuesday John 60
Tuesday Sarah 30
Wednesday John 50
Wednesday Sarah 40
I want to divide the value for each salesperson by the number of times that each of the days of the week has occurred. So: there have been 3 Mondays, 3 Tuesdays, and 2 Wednesdays. I don't have this information digitally, but I can create a vector along the lines of
c(3, 3, 2)
How can I conditionally divide the Value column based on the number of times each day occurs?
I've found an inelegant solution, which entails copying the Day column to a temp column, then replacing each day name in the new column with the number of times that day occurs, using
df$temp <- sub("Monday", 3, df$temp)
but doing this seems kinda clunky. Is there a neat way to do this?
Suppose your auxiliary data is in another data.frame:
Day N_Day
1 Monday 3
2 Tuesday 3
3 Wednesday 2
The simplest way would be to merge:
DF_new <- merge(DF, DF2, by="Day")
DF_new$newcol <- DF_new$Value / DF_new$N_Day
which gives
Day Salesperson Value N_Day newcol
1 Monday John 40 3 13.33333
2 Monday Sarah 50 3 16.66667
3 Tuesday John 60 3 20.00000
4 Tuesday Sarah 30 3 10.00000
5 Wednesday John 50 2 25.00000
6 Wednesday Sarah 40 2 20.00000
The mergeless shortcut is
DF$newcol <- DF$Value / DF2$N_Day[match(DF$Day, DF2$Day)]
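For clarity, match() just returns the position of each Day in the lookup table, and that index then pulls the aligned divisor out of N_Day. A tiny standalone illustration mirroring DF2:

```r
days    <- c("Monday", "Monday", "Tuesday", "Wednesday")
day_tab <- c("Monday", "Tuesday", "Wednesday")
n_tab   <- c(3, 3, 2)

idx <- match(days, day_tab)  # position of each day in the lookup table
idx
# [1] 1 1 2 3

n_tab[idx]                   # the divisor aligned to each original row
# [1] 3 3 3 2
```

Dividing Value by n_tab[idx] is then an ordinary vectorized division.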
Data:
DF <- structure(list(Day = structure(c(1L, 1L, 2L, 2L, 3L, 3L), .Label =
c("Monday",
"Tuesday", "Wednesday"), class = "factor"), Salesperson = structure(c(1L,
2L, 1L, 2L, 1L, 2L), .Label = c("John", "Sarah"), class = "factor"),
Value = c(40L, 50L, 60L, 30L, 50L, 40L)), .Names = c("Day",
"Salesperson", "Value"), class = "data.frame", row.names = c(NA,
-6L))
DF2 <- structure(list(Day = structure(1:3, .Label = c("Monday", "Tuesday",
"Wednesday"), class = "factor"), N_Day = c(3, 3, 2)), .Names = c("Day",
"N_Day"), row.names = c(NA, -3L), class = "data.frame")
You can use the library dplyr to merge your data frame with the frequency of each day.
df <- data.frame(
Day=c("Monday","Monday","Tuesday","Tuesday","Wednesday","Wednesday"),
Salesperson=c("John","Sarah","John","Sarah","John","Sarah"),
Value=c(40,50,60,30,50,40), stringsAsFactors=F)
aux <- data.frame(
Day=c("Monday","Tuesday","Wednesday"),
n=c(3,3,2)
)
output <- df %>% left_join(aux, by="Day") %>% mutate(Value2=Value/n)
To create this auxiliary table from the counts of days that appear in your original data, instead of building it manually, you could use:
aux <- df %>% group_by(Day) %>% summarise(n=n())
> output
Day Salesperson Value n Value2
1 Monday John 40 2 20
2 Monday Sarah 50 2 25
3 Tuesday John 60 2 30
4 Tuesday Sarah 30 2 15
5 Wednesday John 50 2 25
6 Wednesday Sarah 40 2 20
If you want to replace the actual Value column, use mutate(Value=Value/n) instead, and to remove the additional column you can add select(-n):
output <- df %>% left_join(aux, by="Day") %>% mutate(Value=Value/n) %>% select(-n)
There is a column in my dataset that contains time in the format 00:20:10. I have two questions. First, when I import it into R using read.xlsx2(), this column is converted to factor type. How can I convert it to time type?
Second, I want to calculate each person's total time in number of minutes.
ID Time
1 00:10:00
1 00:21:30
2 00:30:10
2 00:04:10
The output I want is:
ID Total.time
1 31.5
2 34.3
I haven't dealt with time issues before, and I hope someone can recommend some packages as well.
You could use times() from the chron package to convert the Time column to "times" class. Then aggregate() to sum the times, grouped by the ID column. This first block will give us actual times in the result.
library(chron)
df$Time <- times(df$Time)
aggregate(list(Total.Time = df$Time), df[1], sum)
# ID Total.Time
# 1 1 00:31:30
# 2 2 00:34:20
For decimal output, we can employ minutes() and seconds(), also from chron.
aggregate(list(Total.Time = df$Time), df[1], function(x) {
minutes(s <- sum(x)) + (seconds(s) / 60)
})
# ID Total.Time
# 1 1 31.50000
# 2 2 34.33333
Furthermore, we can also use data.table for improved efficiency.
library(data.table)
setDT(df)[, .(Total.Time = minutes(s <- sum(Time)) + (seconds(s) / 60)), by = ID]
# ID Total.Time
# 1: 1 31.50000
# 2: 2 34.33333
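If you'd rather not add a dependency, the "HH:MM:SS" strings can also be converted to minutes directly in base R (a sketch on the same data, with a small helper function of my own):

```r
df2 <- data.frame(ID = c(1L, 1L, 2L, 2L),
                  Time = c("00:10:00", "00:21:30", "00:30:10", "00:04:10"))

# Split "HH:MM:SS" and weight the parts: hours * 60 + minutes + seconds / 60
to_minutes <- function(x) {
  sapply(strsplit(as.character(x), ":"),
         function(p) sum(as.numeric(p) * c(60, 1, 1/60)))
}

aggregate(list(Total.time = to_minutes(df2$Time)), df2["ID"], sum)
#   ID Total.time
# 1  1   31.50000
# 2  2   34.33333
```

This works on factor columns too, since to_minutes() coerces with as.character() first.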
Data:
df <- structure(list(ID = c(1L, 1L, 2L, 2L), Time = structure(c(2L,
3L, 4L, 1L), .Label = c("00:04:10", "00:10:00", "00:21:30", "00:30:10"
), class = "factor")), .Names = c("ID", "Time"), class = "data.frame", row.names = c(NA,
-4L))