I have a dataframe like this (the real one is much larger):
time <- as.POSIXct(c('2011-11-11 06:00:00', '2011-11-11 06:05:00', '2011-11-11 07:05:00',
                     '2011-11-11 07:10:00', '2011-11-11 07:13:00', '2011-11-11 07:33:00',
                     '2011-11-11 05:05:00', '2011-11-11 06:05:00', '2011-11-11 06:20:00',
                     '2011-11-11 09:05:00'))
plate <- c('a','a','a','b','c','d','e','e','e','e')
df <- data.frame(time, plate)
The time variable is the time at which a vehicle was identified by the video device, and plate is the vehicle's plate number. The data frame is already sorted, first by plate and then by time.
Given this, I want to divide each vehicle's records into trips by marking the rows. Different vehicles (plates) always belong to different trips. For a single vehicle, consecutive identifications less than 30 minutes apart belong to the same trip; a gap of 30 minutes or more starts a new trip.
In my way, I will do this by the following code:
trip <- vector()
trip[1] <- 1
# subtracting two POSIXct values gives a 30-minute difftime
time_diff <- as.POSIXct('2011-11-11 07:00:00') - as.POSIXct('2011-11-11 06:30:00')
for (x in 2:nrow(df)) {
  if (df$plate[x] != df$plate[x - 1]) {
    trip[x] <- trip[x - 1] + 1   # new plate, new trip
  } else if (df$time[x] - df$time[x - 1] < time_diff) {
    trip[x] <- trip[x - 1]       # same plate, within 30 minutes
  } else {
    trip[x] <- trip[x - 1] + 1   # same plate, 30+ minute gap
  }
}
df<-cbind(df,trip)
However, my df contains more than seven million rows, so this loop is very slow. Are there more efficient ways to do this?
I'll suggest using dplyr for this, though with 7M rows you might consider a data.table solution if this doesn't perform well for you (a sketch follows the second example below).
library(dplyr)
time_diff<-as.POSIXct('2011-11-11 07:00:00')-as.POSIXct('2011-11-11 06:30:00')
df %>%
arrange(time) %>% # it's important, so I reinforce it here
group_by(plate) %>%
mutate(
trip = cumsum( c(TRUE, diff(time) > time_diff) )
) %>%
ungroup()
# # A tibble: 10 × 3
# time plate trip
# <dttm> <fctr> <int>
# 1 2011-11-11 06:00:00 a 1
# 2 2011-11-11 06:05:00 a 1
# 3 2011-11-11 07:05:00 a 2
# 4 2011-11-11 07:10:00 b 1
# 5 2011-11-11 07:13:00 c 1
# 6 2011-11-11 07:33:00 d 1
# 7 2011-11-11 05:05:00 e 1
# 8 2011-11-11 06:05:00 e 2
# 9 2011-11-11 06:20:00 e 2
# 10 2011-11-11 09:05:00 e 3
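A subtlety worth flagging: diff() on POSIXct returns a difftime whose units (secs, mins, hours) are chosen from the data, so comparing it against another difftime, as above, is unit-safe, while comparing against a bare number is not. A quick illustration:
diff(df$time[1:2]) > time_diff   # FALSE; units are reconciled automatically
# diff(df$time[1:2]) > 30       # risky: the units of diff() vary with the data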
I much prefer the above solution using group_by, but if you want the trip to be unique across plates, one technique is to handle the grouping yourself (requiring strict ordering):
df %>%
arrange(plate, time) %>%
mutate(
trip = cumsum( plate != lag(plate, default = plate[1]) | c(TRUE, diff(time) > time_diff) )
)
# time plate trip
# 1 2011-11-11 06:00:00 a 1
# 2 2011-11-11 06:05:00 a 1
# 3 2011-11-11 07:05:00 a 2
# 4 2011-11-11 07:10:00 b 3
# 5 2011-11-11 07:13:00 c 4
# 6 2011-11-11 07:33:00 d 5
# 7 2011-11-11 05:05:00 e 6
# 8 2011-11-11 06:05:00 e 7
# 9 2011-11-11 06:20:00 e 7
# 10 2011-11-11 09:05:00 e 8
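And the data.table sketch mentioned above; untested at the 7M-row scale, but grouped by-reference assignment usually performs well there. as.difftime() is just an explicit stand-in for the subtraction trick used earlier:
library(data.table)

time_diff <- as.difftime(30, units = "mins")  # same 30-minute threshold as above

setDT(df)                 # convert in place, no copy
setorder(df, plate, time) # strict ordering, as required
df[, trip := cumsum(c(TRUE, diff(time) > time_diff)), by = plate]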
In my longitudinal data set, each row represents a time period of observation for each person, and each row is bounded by a start and end date. The rows are numbered ('episode'), and contain many row-specific variables (eg, 'edu_level') that I need to retain throughout the following steps.
I created a new date variable, hx_start, which can relate to the start and end date of each row of data in 1 of 3 ways (below). For each scenario, I need to edit (splice) the existing row of data accordingly, using dplyr:
1. Between a given row's start and end date (ie, as it does for persons 2 and 4)
In this case, I want to splice the existing row into two new ones, so that the date of
hx_start is the start date of one of the rows. The other row would retain the original row's
start date and its end date would be one day before the date of hx_start.
2. On the same date as someone's row start date (ie, person 1)
In this case, no change is needed.
3. On the same date as someone's row end date (ie, person 3)
Same as #1: I need to splice the existing row into two new ones, so that the date of hx_start
is the start date of one of the rows. The other row would retain the original row's
start date and its end date would be one day before the date of hx_start.
So far, I have created a new data set that has 2 duplicates of each row, assuming that I will need to edit up to 2 rows per existing row, and then drop the originals (or retain only the original, in the case of person 1). Importantly, I need a way to carry forward all of the other variables from the original row to all new rows without naming them all, if possible (there are many in my real data set).
#Load packages
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
#Create data set
person <- c(1, 2, 3, 4)
episode <- c(33, 50, 65, 70)
start <- c('2013-01-01', '2010-01-21', '2009-09-18', '2010-05-26')
end <- c('2013-06-04', '2010-06-19', '2009-12-31', '2010-12-24')
hx_start <- c('2013-01-01', '2010-03-09', '2009-12-31', '2010-07-04')
edu_level <- c(2, 3, 2, 1)
#Populate data frame
# (data.frame() rather than cbind(), which would coerce every column to character)
d <- data.frame(person, episode, start, hx_start, end, edu_level)
#Format dates and add to data frame
d$start <- as.Date(start, format = '%Y-%m-%d')
d$end <- as.Date(end, format = '%Y-%m-%d')
d$hx_start <- as.Date(hx_start, format = '%Y-%m-%d')
#Create 2 duplicates of each row (3 copies total) for each person
d1 <- d[rep(seq_len(nrow(d)), each = 3), ]
d1
#> person episode start hx_start end edu_level
#> 1 1 33 2013-01-01 2013-01-01 2013-06-04 2
#> 1.1 1 33 2013-01-01 2013-01-01 2013-06-04 2
#> 1.2 1 33 2013-01-01 2013-01-01 2013-06-04 2
#> 2 2 50 2010-01-21 2010-03-09 2010-06-19 3
#> 2.1 2 50 2010-01-21 2010-03-09 2010-06-19 3
#> 2.2 2 50 2010-01-21 2010-03-09 2010-06-19 3
#> 3 3 65 2009-09-18 2009-12-31 2009-12-31 2
#> 3.1 3 65 2009-09-18 2009-12-31 2009-12-31 2
#> 3.2 3 65 2009-09-18 2009-12-31 2009-12-31 2
#> 4 4 70 2010-05-26 2010-07-04 2010-12-24 1
#> 4.1 4 70 2010-05-26 2010-07-04 2010-12-24 1
#> 4.2 4 70 2010-05-26 2010-07-04 2010-12-24 1
Created on 2022-03-23 by the reprex package (v2.0.0)
You can do this by creating a small helper function. I've done it with data.table:
library(data.table)
f <- function(s, m, e) {
  # hx_start falls after the row's start (or on its end date):
  # split into two rows, one starting at hx_start, one ending the day before it
  if (m > s) return(list("start" = c(m, s), "hx_start" = c(m, m), "end" = c(e, m - 1)))
  # hx_start equals the row's start date: keep the row as-is
  if (m == s) return(list("start" = s, "hx_start" = m, "end" = e))
}
# drop the date columns, then join the spliced dates back on by person
setDT(d)[, !c(3:5)][d[, f(start, hx_start, end), by = person], on = .(person)]
Output:
person episode edu_level start hx_start end
1: 1 33 2 2013-01-01 2013-01-01 2013-06-04
2: 2 50 3 2010-03-09 2010-03-09 2010-06-19
3: 2 50 3 2010-01-21 2010-03-09 2010-03-08
4: 3 65 2 2009-12-31 2009-12-31 2009-12-31
5: 3 65 2 2009-09-18 2009-12-31 2009-12-30
6: 4 70 1 2010-07-04 2010-07-04 2010-12-24
7: 4 70 1 2010-05-26 2010-07-04 2010-07-03
Notice that:
For persons 2 and 4, one row now has hx_start as its start date, while the other row keeps the original start date, with an end date one day before hx_start.
For person 1, there has been no change.
For person 3, the split works the same way as for persons 2 and 4.
Tidyverse option (reusing the helper function above; note it needs tidyr for unnest_wider() and unnest()):
library(dplyr)
library(tidyr)

inner_join(
  d %>% select(-c(start, hx_start, end)),
  d %>%
    rowwise() %>%
    summarize(person = max(person),
              dates = list(f(start, hx_start, end))) %>%
    unnest_wider(dates) %>%
    unnest(cols = everything()),
  by = "person"
)
Output:
person episode edu_level start hx_start end
1: 1 33 2 2013-01-01 2013-01-01 2013-06-04
2: 2 50 3 2010-03-09 2010-03-09 2010-06-19
3: 2 50 3 2010-01-21 2010-03-09 2010-03-08
4: 3 65 2 2009-12-31 2009-12-31 2009-12-31
5: 3 65 2 2009-09-18 2009-12-31 2009-12-30
6: 4 70 1 2010-07-04 2010-07-04 2010-12-24
7: 4 70 1 2010-05-26 2010-07-04 2010-07-03
So I have some data with a time stamp, and for each row I want to count the number of rows that fall within a certain time window. For example, given the data below with a time stamp in h:mm (column ts), I want to count the number of rows that occur from that time stamp back to five minutes in the past (column count). The first n rows that are less than five minutes after the first data point should be NA.
ts data count
1:01 123 NA
1:02 123 NA
1:03 123 NA
1:04 123 NA
1:06 123 5
1:07 123 5
1:10 123 3
1:11 123 4
1:12 123 4
This is straightforward to do with a for loop, but I've been trying to implement with the apply() family and have not yet found any success. Any suggestions?
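For reference, a minimal sapply() sketch of that rolling count, assuming ts is already POSIXct and the five-minute window is inclusive at both ends; it reproduces the count column above but is O(n^2):
df$count <- sapply(df$ts, function(t) {
  # NA while the full five-minute window extends before the first reading
  if (t - 5 * 60 < min(df$ts)) NA_integer_
  else sum(df$ts >= t - 5 * 60 & df$ts <= t)
})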
EDIT: modified to account for the possibility of multiple readings per minute, raised in a comment.
Data with new mid-minute reading:
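(The edited data block did not survive here; the following is a plausible reconstruction, with the 1:06:30 row as the new mid-minute reading. These values reproduce the output shown below.)
df <- data.frame(
  ts   = c("1:01:00", "1:02:00", "1:03:00", "1:04:00", "1:06:00",
           "1:06:30", "1:07:00", "1:10:00", "1:11:00", "1:12:00"),
  data = 123
)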
library(dplyr)
library(tidyr)  # for replace_na()
df %>%
# Take the text above and convert to datetime
mutate(ts = lubridate::ymd_hms(paste(Sys.Date(), ts))) %>%
# Count how many observations per minute
group_by(ts_min = lubridate::floor_date(ts, "1 minute")) %>%
summarize(obs_per_min = sum(!is.na(data))) %>%
# Add rows for any missing minutes, count as zero observations
padr::pad(interval = "1 min") %>%
replace_na(list(obs_per_min = 0)) %>%
# Count cumulative observations, and calc how many in window that
# begins 5 minutes ago and ends at end of current minute
mutate(cuml_count = cumsum(obs_per_min),
prior_cuml = lag(cuml_count) %>% tidyr::replace_na(0),
in_window = cuml_count - lag(prior_cuml, 5)) %>%
# Exclude unneeded columns and rows
select(-cuml_count, -prior_cuml) %>%
filter(obs_per_min > 0)
Output (now reflects the additional reading at 1:06:30):
# A tibble: 9 x 3
ts_min obs_per_min in_window
<dttm> <dbl> <dbl>
1 2018-09-26 01:01:00 1 NA
2 2018-09-26 01:02:00 1 NA
3 2018-09-26 01:03:00 1 NA
4 2018-09-26 01:04:00 1 NA
5 2018-09-26 01:06:00 2 6
6 2018-09-26 01:07:00 1 6
7 2018-09-26 01:10:00 1 4
8 2018-09-26 01:11:00 1 5
9 2018-09-26 01:12:00 1 4
I am trying to get a count of active clients per month, using data that has a start and end date for each client's episode. With the code I'm using, I can't work out how to count per month rather than per every n days.
Here is some sample data:
Start.Date <- as.Date(c("2014-01-01", "2014-01-02","2014-01-03","2014-01-03"))
End.Date<- as.Date(c("2014-01-04", "2014-01-03","2014-01-03","2014-01-04"))
Make sure the dates are Date objects (a no-op here, since as.Date() was already applied above):
Start.Date <- as.Date(Start.Date, "%d/%m/%Y")
End.Date <- as.Date(End.Date, "%d/%m/%Y")
Here is the code I am using, which currently counts the number per day:
library(plyr)
# expand each episode to one entry per day, then tally occurrences per day
count(Reduce(c, Map(seq, Start.Date, End.Date, by = 1)))
which returns:
x freq
1 2014-01-01 1
2 2014-01-02 2
3 2014-01-03 4
4 2014-01-04 2
The "by" argument can be changed to be however many days I want, but problems arise because months have different lengths.
Would anyone be able to suggest how I can count per month?
Thanks a lot.
note: I now realize that for my example data I have only used dates in the same month, but my real data has dates spanning 3 years.
Here's a solution that seems to work. First, I set the seed so that the example is reproducible.
# Set seed for reproducible example
set.seed(33550336)
Next, I create a dummy data frame.
# Test data (dplyr for mutate(), lubridate for day() used below)
library(dplyr)
library(lubridate)

df <- data.frame(Start_date = as.Date(sample(seq(as.Date('2014/01/01'), as.Date('2015/01/01'), by = "day"), 12))) %>%
  mutate(End_date = as.Date(Start_date + sample(1:365, 12, replace = TRUE)))
which looks like,
# Start_date End_date
# 1 2014-11-13 2015-09-26
# 2 2014-05-09 2014-06-16
# 3 2014-07-11 2014-08-16
# 4 2014-01-25 2014-04-23
# 5 2014-05-16 2014-12-19
# 6 2014-11-29 2015-07-11
# 7 2014-09-21 2015-03-30
# 8 2014-09-15 2015-01-03
# 9 2014-09-17 2014-09-26
# 10 2014-12-03 2015-05-08
# 11 2014-08-03 2015-01-12
# 12 2014-01-16 2014-12-12
The function below takes a start date and end date and creates a sequence of months between these dates.
# Sequence of months
mon_seq <- function(start, end){
# Change each day to the first to aid month counting
day(start) <- 1
day(end) <- 1
# Create a sequence of months
seq(start, end, by = "month")
}
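A quick sanity check of the helper on hypothetical dates:
mon_seq(as.Date("2014-11-13"), as.Date("2015-02-02"))
# [1] "2014-11-01" "2014-12-01" "2015-01-01" "2015-02-01"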
Right, this is the tricky bit. I apply mon_seq to all rows of the data frame with mapply, which gives the months between each start and end date. Then I combine all of these months into a single vector and format it so the dates contain just year and month. Finally, I pipe (using dplyr's %>%) this vector into table, which counts each occurrence of year-month, and cast the result as a data frame.
data.frame(format(do.call("c", mapply(mon_seq, df$Start_date, df$End_date)), "%Y-%m") %>% table)
This gives,
# . Freq
# 1 2014-01 2
# 2 2014-02 2
# 3 2014-03 2
# 4 2014-04 2
# 5 2014-05 3
# 6 2014-06 3
# 7 2014-07 3
# 8 2014-08 4
# 9 2014-09 6
# 10 2014-10 5
# 11 2014-11 7
# 12 2014-12 8
# 13 2015-01 6
# 14 2015-02 4
# 15 2015-03 4
# 16 2015-04 3
# 17 2015-05 3
# 18 2015-06 2
# 19 2015-07 2
# 20 2015-08 1
# 21 2015-09 1
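For comparison, the same tally can be written with dplyr and tidyr instead of table(); a sketch, reusing the df and mon_seq() defined above:
library(tidyr)

df %>%
  rowwise() %>%
  mutate(month = list(mon_seq(Start_date, End_date))) %>%  # list-column of month sequences
  ungroup() %>%
  unnest(month) %>%                                        # one row per episode-month
  dplyr::count(month = format(month, "%Y-%m"))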
I have repeated measurements on individuals who have made donations, or not, when solicited. I want to carry the date of the last successful solicitation forward to subsequent observations until a new success occurs.
Here is my sample data:
set.seed(13)
df <- data.frame(ID=rep(letters[1:3], each=4),
SolicitationDate= sample(seq(as.Date('2016/01/01'),
as.Date('2018/01/01'), by="day"), 3),
Success=rbinom(4,1,0.2))
df$ExpectedResult <- c(NA, NA, "2016-06-28", "2016-06-28",
NA, NA, "2016-10-11", "2016-10-11",
NA,NA,"2017-06-03", "2017-06-03")
Should an individual have multiple successes, the last success date should be carried forward.
Thanks
Romain
Here's a version using the tidyverse. I think your expected output may be off, since the dates should be ordered within ID, but I may be wrong about that; if so, let me know.
df %>%
group_by(ID) %>% # Group by ID
arrange(SolicitationDate) %>% # Sort according to date
mutate(res=replace(SolicitationDate, Success==0, NA)) %>% # Create new value
tidyr::fill(res) # Fill down
This will give you
# A tibble: 12 x 4
# Groups: ID [3]
ID SolicitationDate Success res
<fct> <date> <int> <date>
1 a 2016-06-28 1 2016-06-28
2 a 2016-10-11 0 2016-06-28
3 a 2017-06-03 0 2016-06-28
4 a 2017-06-03 0 2016-06-28
5 b 2016-06-28 0 NA
6 b 2016-06-28 0 NA
7 b 2016-10-11 1 2016-10-11
8 b 2017-06-03 0 2016-10-11
9 c 2016-06-28 0 NA
10 c 2016-10-11 0 NA
11 c 2016-10-11 0 NA
12 c 2017-06-03 1 2017-06-03
I'm not sure whether you want the success dates themselves to be part of the result. If not, you could set those rows to missing and fill down again; a sketch follows below. In any case: hope this helps.
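One hedged way to read that: lag the dates before filling, so a success date is only carried to the observations after it:
df %>%
  group_by(ID) %>%
  arrange(SolicitationDate, .by_group = TRUE) %>%
  mutate(res = lag(replace(SolicitationDate, Success == 0, NA))) %>%  # shift by one row
  tidyr::fill(res)                                                    # then fill down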
Let's say I have a dataframe of timestamps with the corresponding number of tickets sold at that time.
Timestamp ticket_count
(time) (int)
1 2016-01-01 05:30:00 1
2 2016-01-01 05:32:00 1
3 2016-01-01 05:38:00 1
4 2016-01-01 05:46:00 1
5 2016-01-01 05:47:00 1
6 2016-01-01 06:07:00 1
7 2016-01-01 06:13:00 2
8 2016-01-01 06:21:00 1
9 2016-01-01 06:22:00 1
10 2016-01-01 06:25:00 1
I want to know how to calculate, for each row, the number of tickets sold within a certain time frame of that row; for example, the number of tickets sold from each row's time stamp up to 15 minutes afterwards. In this case, the first row would have three tickets, the second row four tickets, etc.
Ideally, I'm looking for a dplyr solution, since I want to do this for multiple stores with group_by(). However, I'm having trouble figuring out how to hold each Timestamp fixed for a given row while simultaneously searching across all Timestamps in dplyr syntax.
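For what it's worth, a brute-force dplyr sketch of that windowed count (quadratic per group, so only viable for modest data; the store grouping is hypothetical, based on the group_by() mentioned above):
library(dplyr)

df %>%
  group_by(store) %>%  # drop this line for the single-store sample above
  mutate(count_15min = sapply(Timestamp, function(t)
    sum(ticket_count[Timestamp >= t & Timestamp <= t + 15 * 60]))) %>%
  ungroup()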
In the current development version of data.table, v1.9.7, non-equi joins are implemented. Assuming your data.frame is called df and the Timestamp column is POSIXct type:
require(data.table) # v1.9.7+
window = 15L # minutes
(counts = setDT(df)[.(t=Timestamp+window*60L), on=.(Timestamp<t),
.(counts=sum(ticket_count)), by=.EACHI]$counts)
# [1] 3 4 5 5 5 9 11 11 11 11
# add that as a column to original data.table by reference
df[, counts := counts]
For each row of t, all rows where df$Timestamp < that row's t are fetched, and by = .EACHI instructs the expression sum(ticket_count) to run once for each row of t. That gives your desired result.
Hope this helps.
This is a simpler version of the ugly one I wrote earlier. Note that cut() produces fixed 15-minute bins rather than a rolling window per row, which is a slightly different summary from what was asked.
# install.packages('dplyr')
library(dplyr)
your_data %>%
mutate(timestamp = as.POSIXct(timestamp, format = '%m/%d/%Y %H:%M'),
ticket_count = as.numeric(ticket_count)) %>%
mutate(window = cut(timestamp, '15 min')) %>%
group_by(window) %>%
dplyr::summarise(tickets = sum(ticket_count))
window tickets
(fctr) (dbl)
1 2016-01-01 05:30:00 3
2 2016-01-01 05:45:00 2
3 2016-01-01 06:00:00 3
4 2016-01-01 06:15:00 3
Here is a solution using data.table that also incorporates different stores.
Example data:
library(data.table)
dt <- data.table(Timestamp = as.POSIXct("2016-01-01 05:30:00") + seq(60, 120000, by = 60),
                 ticket_count = sample(1:9, 2000, replace = TRUE),
                 store = rep(c("A", "B", "C", "D"), 500))
Now apply the following:
ts <- dt$Timestamp
for (x in ts) {
  end <- x + 900  # 15-minute window, in seconds
  dt[Timestamp <= end & Timestamp >= x, CS := sum(ticket_count), by = store]
}
This gives you:
Timestamp ticket_count store CS
1: 2016-01-01 05:31:00 3 A 13
2: 2016-01-01 05:32:00 5 B 20
3: 2016-01-01 05:33:00 3 C 19
4: 2016-01-01 05:34:00 7 D 12
5: 2016-01-01 05:35:00 1 A 15
---
1996: 2016-01-02 14:46:00 4 D 10
1997: 2016-01-02 14:47:00 9 A 9
1998: 2016-01-02 14:48:00 2 B 2
1999: 2016-01-02 14:49:00 2 C 2
2000: 2016-01-02 14:50:00 6 D 6