I am currently working with a dataset containing sensor data and I want some summary statistics: more precisely, the number of visits and the total occupancy length per sensor. A visit counts as finished when a timestamp with value 1 is followed by 0 values spanning X minutes.
My data looks like this:
SensorId timestamp value
1 10:10:10 1
1 10:12:10 1
1 10:14:00 1
1 10:16:00 0
1 10:18:00 0
1 10:20:00 0
2 13:10:10 1
2 13:12:10 1
2 13:14:00 1
2 13:20:00 1
2 13:22:00 0
This is my desired result:
SensorId  total_time_in_use  number_of_visits
1 4 1
2 10 1
There are quite a lot of rows, so I want the total time in use and the number of visits to update each time.
We can convert timestamp to the POSIXct class, sort the rows, group them by SensorId and by runs of consecutive identical value, and subtract the first timestamp of each run from the last one. Keeping only the runs where value is 1 gives the time in use.
library(dplyr)
df %>%
  mutate(timestamp = as.POSIXct(timestamp, format = "%T")) %>%
  arrange(SensorId, timestamp) %>%
  # rleid() gives each run of identical consecutive values its own id
  group_by(SensorId, grp = data.table::rleid(value)) %>%
  summarise(total_time = round(last(timestamp) - first(timestamp)),
            number_of_visit = first(value)) %>%
  filter(number_of_visit == 1) %>%
  select(-grp)
# SensorId total_time number_of_visit
# <int> <drtn> <int>
#1 1 4 mins 1
#2 2 10 mins 1
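The summary above returns one row per run of 1s, so a sensor with several separate visits would appear on several rows. A minimal sketch of collapsing those runs into one row per sensor could look like the following. It assumes that every uninterrupted run of 1s is one visit and it ignores the X-minute threshold from the question; the column names total_minutes and n_visits are made up for illustration.

```r
library(dplyr)

# Sample data copied from the question
df <- data.frame(
  SensorId  = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  timestamp = c("10:10:10", "10:12:10", "10:14:00", "10:16:00", "10:18:00",
                "10:20:00", "13:10:10", "13:12:10", "13:14:00", "13:20:00",
                "13:22:00"),
  value     = c(1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0)
)

visits <- df %>%
  mutate(timestamp = as.POSIXct(timestamp, format = "%T", tz = "UTC")) %>%
  arrange(SensorId, timestamp) %>%
  group_by(SensorId) %>%
  # run increases by one each time value changes within a sensor
  mutate(run = cumsum(value != lag(value, default = first(value)))) %>%
  group_by(SensorId, run) %>%
  # one row per run: was it occupied, and how long did it last
  summarise(occupied = first(value) == 1,
            minutes  = as.numeric(difftime(last(timestamp), first(timestamp),
                                           units = "mins")),
            .groups = "drop") %>%
  filter(occupied) %>%
  group_by(SensorId) %>%
  # assumption: each run of 1s counts as exactly one visit
  summarise(total_minutes = round(sum(minutes)),
            n_visits      = n())

visits
```

Because the whole summary is recomputed from df, rerunning it after new rows arrive refreshes both columns.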
My idea is to count observations, grouped by id, within 30-day windows. I want to add one exception to the counting: if an observation inside the 30 days being analyzed will itself be discarded (because n > 1), the count should be built only from the observations that are not discarded. (n is the variable that counts the number of observations within a 30-day window.)
Example
id date
1 1/1/2021
1 22/1/2021
1 1/2/2021
Code:
test <- test %>%
  group_by(id) %>%
  mutate(n = sapply(seq(length(date)),
                    function(x) sum(between(date[1:x], date[x] - days(30), date[x]))))
id date n
1 1/1/2021 1
1 22/1/2021 2
1 1/2/2021 2
1 3/3/2021 2
1 2/2/2021 3
1 7/7/2021 1
Expected result:
id date n nexpected
1 1/1/2021 1 1
1 22/1/2021 2 2
1 1/2/2021 2 1
1 3/3/2021 2 2
1 2/2/2021 3 1
1 7/7/2021 1 1
Alternative explanation
I just want to keep one observation per ID for every 30 days. I want to do this by creating a variable that tells me which observations pass the filter (1) and which ones do not (0).
Not sure this is what you want, but lubridate::floor_date is often useful in those situations:
library(tidyverse)
library(lubridate)
test %>%
mutate(date = dmy(date)) %>%
group_by(id, floor = floor_date(date, 'month')) %>%
mutate(n = row_number())
id date floor n
<int> <date> <date> <int>
1 1 2021-01-01 2021-01-01 1
2 1 2021-01-22 2021-01-01 2
3 1 2021-02-01 2021-02-01 1
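floor_date groups by calendar month, which is not quite the same as keeping one observation per 30 days counted from the last kept observation. A minimal sketch of that rolling variant follows. The helper keep_every and the column keep are hypothetical names, and the sketch assumes that an observation exactly 30 days after the last kept one still falls inside the window, as in the question's between() call.

```r
library(dplyr)

# Hypothetical helper: flags the observations that start a new window.
keep_every <- function(dates, window = 30) {
  keep   <- logical(length(dates))
  anchor <- as.Date(NA)               # date of the last kept observation
  for (i in order(dates)) {           # walk through the dates chronologically
    if (is.na(anchor) || as.numeric(dates[i] - anchor) > window) {
      keep[i] <- TRUE                 # first observation outside the last window
      anchor  <- dates[i]             # the window restarts here
    }
  }
  keep
}

# The three rows from the question's example
test <- data.frame(
  id   = c(1, 1, 1),
  date = as.Date(c("2021-01-01", "2021-01-22", "2021-02-01"))
)

flagged <- test %>%
  group_by(id) %>%
  mutate(keep = as.integer(keep_every(date))) %>%
  ungroup()

flagged
```

The loop is sequential on purpose: whether a row is kept depends on which earlier rows were kept, so a plain vectorised window count cannot express it.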
Thank you, experts, for the previous answers (How to filter by range of dates in R?).
I am still having some problems dealing with the data.
Example:
id q date
a 1 01/01/2021
a 1 01/01/2021
a 1 21/01/2021
a 1 21/01/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
My idea is to eliminate the observations that have more than 3 "units" in a period of 30 days. That is, if "a" has a unit "q" on "12/02/2021" [dd/mm/yyyy]: (a) if there are already 3 observations between 12/01/2021 and 12/02/2021, it must be deleted; (b) if there are fewer than 3, it must remain.
My expected result is:
id q date
a 1 01/01/2021
a 1 01/01/2021
a 1 21/01/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
With this code:
df <- df %>%
  mutate(day = dmy(date)) %>%
  group_by(id) %>%
  arrange(day, .by_group = TRUE) %>%
  mutate(diff = day - first(day)) %>%
  mutate(row = row_number()) %>%
  filter(row <= 3 | !diff < 30)
But the result is:
P Q DATE DAY DIFF ROW
a 1 1/1/2021 1/1/2021 0 1
a 1 1/1/2021 1/1/2021 0 2
a 1 21/1/2021 21/1/2021 20 3
a 1 12/2/2021 12/2/2021 42 5
a 1 12/2/2021 12/2/2021 42 6
a 1 12/2/2021 12/2/2021 42 7
a 1 12/2/2021 12/2/2021 42 8
The main problem is that the diff variable must count days in periods of 30 days from the last day of the previous 30-days period - not since the first observation day.
Any help? Thanks
Using floor_date it is quite straightforward:
library(lubridate)
library(dplyr)
df %>%
  # group by id as well, so the limit of 3 applies per id and per period
  group_by(id, floor = floor_date(date, '30 days')) %>%
  slice_head(n = 3) %>%
  ungroup() %>%
  select(-floor)
# A tibble: 6 x 3
id q date
<chr> <int> <date>
1 a 1 2021-01-01
2 a 1 2021-01-01
3 a 1 2021-01-21
4 a 1 2021-02-12
5 a 1 2021-02-12
6 a 1 2021-02-12
data
df <- read.table(header = TRUE, text = "id q date
a 1 01/01/2021
a 1 01/01/2021
a 1 21/01/2021
a 1 21/01/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021")
df$date <- as.Date(df$date, format = "%d/%m/%Y")
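floor_date assigns dates to fixed bins, whereas the question asks for periods that restart at the first observation after the previous 30-day period has ended. A minimal sketch of that rule is given below. The helper period_id is a hypothetical name, and the sketch assumes that a date 30 or more days after the period start opens a new period, matching the `!diff < 30` condition in the question.

```r
library(dplyr)

# Hypothetical helper: numbers consecutive periods, each starting at the first
# date that falls `width` or more days after the current period's start.
# Assumes `dates` is sorted in ascending order.
period_id <- function(dates, width = 30) {
  id    <- integer(length(dates))
  start <- dates[1]
  k     <- 1L
  for (i in seq_along(dates)) {
    if (as.numeric(dates[i] - start) >= width) {
      k     <- k + 1L        # open a new period
      start <- dates[i]      # which starts at this observation
    }
    id[i] <- k
  }
  id
}

# Same rows as the data block above
df <- data.frame(
  id   = "a",
  q    = 1,
  date = as.Date(c("01/01/2021", "01/01/2021", "21/01/2021", "21/01/2021",
                   "12/02/2021", "12/02/2021", "12/02/2021", "12/02/2021"),
                 format = "%d/%m/%Y")
)

capped <- df %>%
  group_by(id) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(period = period_id(date)) %>%
  group_by(id, period) %>%
  slice_head(n = 3) %>%      # keep at most 3 observations per id and period
  ungroup()

capped
```

On this sample both approaches keep the same six rows; they can differ once a period straddles a bin boundary.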
If I had:
person_ID visit_date
1 2/25/2001
1 2/27/2001
1 4/2/2001
2 3/18/2004
3 9/22/2004
3 10/27/2004
3 5/15/2008
and I wanted another column holding the same person's next visit if it occurs within 90 days (NA otherwise), with the desired output:
person_ID visit_date date
1 2/25/2001 2/27/2001
1 2/27/2001 4/2/2001
1 4/2/2001 NA
2 3/18/2004 NA
3 9/22/2004 10/27/2004
3 10/27/2004 NA
3 5/15/2008 NA
Thank you!
We convert 'visit_date' to the Date class and group by 'person_ID'. We then create a binary column that is 1 if the difference between the next and the current visit_date is less than 90 days and 0 otherwise, and use this column to pull in the corresponding next 'visit_date' where the value is 1.
library(dplyr)
library(lubridate)
library(tidyr)
df1 %>%
  mutate(visit_date = mdy(visit_date)) %>%
  group_by(person_ID) %>%
  # i1 is 1 when the next visit is less than 90 days away, 0 otherwise
  mutate(i1 = replace_na(+(difftime(lead(visit_date),
                                    visit_date, units = 'days') < 90), 0),
         date = case_when(as.logical(i1) ~ lead(visit_date)), i1 = NULL) %>%
  ungroup()
-output
# A tibble: 7 x 3
# person_ID visit_date date
# <int> <date> <date>
#1 1 2001-02-25 2001-02-27
#2 1 2001-02-27 2001-04-02
#3 1 2001-04-02 NA
#4 2 2004-03-18 NA
#5 3 2004-09-22 2004-10-27
#6 3 2004-10-27 NA
#7 3 2008-05-15 NA
I would like to get the time elapsed between events in my dataframe, for each grouping of data by the ID. The dates I want to use are in their own columns. I have done the following already using dplyr:
Grouped my data by the ID
Ordered by the ID
This is how the data looks. I would like the output to be the time_diff column. Any help would be very much appreciated!
ID: Status: Start-time: End-time: time-diff:
1 Active 01/01/2018 NA 0
1 Complete NA 01/02/2018 1
2 Active 03/02/2018 0
2 Active NA 0
2 Complete NA 03/06/2018 4
Taking the time difference between a time and an NA value will just return NA. A more meaningful approach would be to take the individual time difference of each event, and then summarize over each group (id).
library(dplyr)
library(lubridate)
d <- tibble(id = c(1,1,2,2),
            st = ymd(c("2019-05-03", "2019-02-06", "2019-07-11","2019-05-13")),
            et = ymd(c("2019-05-10", "2019-02-16", "2019-07-04","2019-05-09")))
d2 <- d %>%
  mutate(td = et - st,      # calculate the time difference (td)
         atd = abs(td))     # calculate the absolute td (atd)
d2
# A tibble: 4 x 5
id st et td atd
<dbl> <date> <date> <time> <time>
1 1 2019-05-03 2019-05-10 7 days 7 days
2 1 2019-02-06 2019-02-16 10 days 10 days
3 2 2019-07-11 2019-07-04 -7 days 7 days
4 2 2019-05-13 2019-05-09 -4 days 4 days
Then you can take the mean of the absolute differences for example and get:
d2 %>%
group_by(id) %>% # for each group (id)
summarise(mtd = mean(atd)) # calculate the mean time difference (mtd)
# A tibble: 2 x 2
id mtd
<dbl> <time>
1 1 8.5 days
2 2 5.5 days
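In the question's own layout the start and end dates sit on different rows of the same ID, so a row-wise difference is always NA. One possible sketch is to collapse each ID to its earliest start and latest end before subtracting. The data frame below is a reconstruction under assumptions: the dates are read as mm/dd/yyyy (which fits the time-diff values 1 and 4), the row printed as "2 Active NA 0" is taken to have neither a start nor an end, and the names events, Start, End and days are made up for illustration.

```r
library(dplyr)

# Reconstructed sample (see the assumptions in the text above)
events <- data.frame(
  ID     = c(1, 1, 2, 2, 2),
  Status = c("Active", "Complete", "Active", "Active", "Complete"),
  Start  = as.Date(c("01/01/2018", NA, "03/02/2018", NA, NA),
                   format = "%m/%d/%Y"),
  End    = as.Date(c(NA, "01/02/2018", NA, NA, "03/06/2018"),
                   format = "%m/%d/%Y")
)

elapsed <- events %>%
  group_by(ID) %>%
  # earliest non-missing start and latest non-missing end per ID
  summarise(days = as.numeric(difftime(max(End, na.rm = TRUE),
                                       min(Start, na.rm = TRUE),
                                       units = "days")))

elapsed
```

An ID with no start or no end at all would need separate handling, since min() and max() have nothing to work with once the NAs are removed.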
I would like to create a subset of the following example data frame. The condition is to select, for each id, the rows whose time value falls between that id's minimum time and, let's say, one hour after it.
id time
1 1468696537
1 1468696637
1 1482007490
2 1471902849
2 1471902850
2 1483361074
3 1474207754
3 1474207744
3 1471446673
3 1471446693
And the output should be like this:
id time
1 1468696537
1 1468696637
2 1471902849
2 1471902850
3 1471446673
3 1471446693
Could you please help me with how to do that?
We can do the following:
library(magrittr)
library(dplyr)
df %>%
  group_by(id) %>%
  filter(time <= min(time) + 3600)
# id time
# <int> <int>
#1 1 1468696537
#2 1 1468696637
#3 2 1471902849
#4 2 1471902850
#5 3 1471446673
#6 3 1471446693
Explanation: Group entries by id, then filter entries that are within min(time) + 1 hour.
Sample data
df <- read.table(text =
" id time
1 1468696537
1 1468696637
1 1482007490
2 1471902849
2 1471902850
2 1483361074
3 1474207754
3 1474207744
3 1471446673
3 1471446693 ", header = T)