I am looking to run a cumulative sum at every row for values that occur in two columns before and after that point. So in this case I have volume of 2 incident types at every given minute over two days. I want to create a column which adds all the incidents that occured before and after for each row by the type. Sumif from excel comes to mind but I'm not sure how to port that over to R:
EDIT: ADDED set.seed and easier numbers
I have the following data set:
set.seed(42)
master_min =
setDT(
data.frame(master_min = seq(
from=as.POSIXct("2016-1-1 0:00", tz="America/New_York"),
to=as.POSIXct("2016-1-2 23:00", tz="America/New_York"),
by="min"
))
)
incident1= round(runif(2821, min=0, max=10))
incident2= round(runif(2821, min=0, max=10))
master_min = head(cbind(master_min, incident1, incident2), 5)
How do I essentially compute the following logic:
for each row, sum all the incident1s that occured before that row's timestamp and all the incident2s that occured after that row's timestamp? It would be great to get a data table solution, if not a dplyr as I am working with a large dataset. Below is a before and after for the data`:
BEFORE:
master_min incident1 incident2
1: 2016-01-01 00:00:00 9 6
2: 2016-01-01 00:01:00 9 5
3: 2016-01-01 00:02:00 3 5
4: 2016-01-01 00:03:00 8 6
5: 2016-01-01 00:04:00 6 9
AFTER THE CALCULATION:
master_min incident1 incident2 new_column
1: 2016-01-01 00:00:00 9 6 25
2: 2016-01-01 00:01:00 9 5 29
3: 2016-01-01 00:02:00 3 5 33
4: 2016-01-01 00:03:00 8 6 30
5: 2016-01-01 00:04:00 6 9 29
If I understand correctly:
# Cumsum of incident1, without current row:
master_min$sum1 <- cumsum(master_min$incident1) - master_min$incident1
# Reverse cumsum of incident2, without current row:
master_min$sum2 <- rev(cumsum(rev(master_min$incident2))) - master_min$incident2
# Your new column:
master_min$new_column <- master_min$sum1 + master_min$sum2
*update
The following two lines can do the job
master_min$sum1 <- cumsum(master_min$incident1)
master_min$sum2 <- sum(master_min$incident2) - cumsum(master_min$incident2)
I rewrote the question a bit to show a bit more comprehensive structure
library(data.table)
master_min <-
setDT(
data.frame(master_min = seq(
from=as.POSIXct("2016-1-1 0:00", tz="America/New_York"),
to=as.POSIXct("2016-1-1 0:09", tz="America/New_York"),
by="min"
))
)
set.seed(2)
incident1= as.integer(runif(10, min=0, max=10))
incident2= as.integer(runif(10, min=0, max=10))
master_min = cbind(master_min, incident1, incident2)
Now master_min looks like this
> master_min
master_min incident1 incident2
1: 2016-01-01 00:00:00 1 5
2: 2016-01-01 00:01:00 7 2
3: 2016-01-01 00:02:00 5 7
4: 2016-01-01 00:03:00 1 1
5: 2016-01-01 00:04:00 9 4
6: 2016-01-01 00:05:00 9 8
7: 2016-01-01 00:06:00 1 9
8: 2016-01-01 00:07:00 8 2
9: 2016-01-01 00:08:00 4 4
10: 2016-01-01 00:09:00 5 0
Apply transformations
master_min$sum1 <- cumsum(master_min$incident1)
master_min$sum2 <- sum(master_min$incident2) - cumsum(master_min$incident2)
Results
> master_min
master_min incident1 incident2 sum1 sum2
1: 2016-01-01 00:00:00 1 5 1 37
2: 2016-01-01 00:01:00 7 2 8 35
3: 2016-01-01 00:02:00 5 7 13 28
4: 2016-01-01 00:03:00 1 1 14 27
5: 2016-01-01 00:04:00 9 4 23 23
6: 2016-01-01 00:05:00 9 8 32 15
7: 2016-01-01 00:06:00 1 9 33 6
8: 2016-01-01 00:07:00 8 2 41 4
9: 2016-01-01 00:08:00 4 4 45 0
10: 2016-01-01 00:09:00 5 0 50 0
Related
So I basically got a while loop function that creates 1's in the "algorithm_column" based on the highest percentages in the "percent" column, until a certain total percentage is reached (90% or something). The rest of the rows that are not taken into account will have a value of 0 in the "algorithm_column" ( Create while loop function that takes next largest value untill condition is met)
I want to show, based on what the loop function found, the min and max times of the column "timeinterval" (the min is where the 1's start and max is the last row with a 1, the 0's are out of the scope). And then finally create a time interval from this.
So if we have the following code, I want to create in another column, lets say "total_time" a calculation from the min time 09:00 ( this is where 1 start in the algorithm_column) until 11:15, which makes a time interval of 02:15 hours added to the "total_time" column.
algorithm
# pc4 timeinterval stops percent idgroup algorithm_column
#1 5464 08:45:00 1 1.3889 1 0
#2 5464 09:00:00 5 6.9444 2 1
#3 5464 09:15:00 8 11.1111 3 1
#4 5464 09:30:00 7 9.7222 4 1
#5 5464 09:45:00 5 6.9444 5 1
#6 5464 10:00:00 10 13.8889 6 1
#7 5464 10:15:00 6 8.3333 7 1
#8 5464 10:30:00 4 5.5556 8 1
#9 5464 10:45:00 7 9.7222 9 1
#10 5464 11:00:00 6 8.3333 10 1
#11 5464 11:15:00 5 6.9444 11 1
#12 5464 11:30:00 8 11.1111 12 0
I have multiple pc4 groups, so it should look at every group and calculate a total_time for each group respectively.
I got this function, but I'm a bit stuck if this is what I need.
test <- function(x) {
ind <- x[["algorithm$algorithm_column"]] == 0
Mx <- max(x[["timeinterval"]][ind], na.rm = TRUE);
ind <- x[["algorithm$algorithm_column"]] == 1
Mn <- min(x[["timeinterval"]][ind], na.rm = TRUE);
list(Mn, Mx) ## or return(list(Mn, Mx))
}
test(algorithm)
Here is a dplyr solution.
library(dplyr)
algorithm %>%
mutate(tmp = cumsum(c(0, diff(algorithm_column) != 0))) %>%
filter(algorithm_column == 1) %>%
group_by(pc4, tmp) %>%
summarise(first = first(timeinterval),
last = last(timeinterval)) %>%
select(-tmp)
## A tibble: 1 x 3
## Groups: pc4 [1]
# pc4 first last
# <int> <fct> <fct>
#1 5464 09:00:00 11:15:00
Data.
algorithm <- read.table(text = "
pc4 timeinterval stops percent idgroup algorithm_column
1 5464 08:45:00 1 1.3889 1 0
2 5464 09:00:00 5 6.9444 2 1
3 5464 09:15:00 8 11.1111 3 1
4 5464 09:30:00 7 9.7222 4 1
5 5464 09:45:00 5 6.9444 5 1
6 5464 10:00:00 10 13.8889 6 1
7 5464 10:15:00 6 8.3333 7 1
8 5464 10:30:00 4 5.5556 8 1
9 5464 10:45:00 7 9.7222 9 1
10 5464 11:00:00 6 8.3333 10 1
11 5464 11:15:00 5 6.9444 11 1
12 5464 11:30:00 8 11.1111 12 0
", header = TRUE)
The dataframe df1 summarizes detections of different individuals (ID) through time (Datetime). As a short example:
library(lubridate)
df1<- data.frame(ID= c(1,2,1,2,1,2,1,2,1,2),
Datetime= ymd_hms(c("2016-08-21 00:00:00","2016-08-24 08:00:00","2016-08-23 12:00:00","2016-08-29 03:00:00","2016-08-27 23:00:00","2016-09-02 02:00:00","2016-09-01 12:00:00","2016-09-09 04:00:00","2016-09-01 12:00:00","2016-09-10 12:00:00")))
> df1
ID Datetime
1 1 2016-08-21 00:00:00
2 2 2016-08-24 08:00:00
3 1 2016-08-23 12:00:00
4 2 2016-08-29 03:00:00
5 1 2016-08-27 23:00:00
6 2 2016-09-02 02:00:00
7 1 2016-09-01 12:00:00
8 2 2016-09-09 04:00:00
9 1 2016-09-01 12:00:00
10 2 2016-09-10 12:00:00
I want to calculate for each row, the number of hours (Hours_since_begining) since the first time that the individual was detected.
I would expect something like that (It can contain some mistakes since I did the calculations by hand):
> df1
ID Datetime Hours_since_begining
1 1 2016-08-21 00:00:00 0
2 2 2016-08-24 08:00:00 0
3 1 2016-08-23 12:00:00 60 # Number of hours between "2016-08-21 00:00:00" (first time detected the Ind 1) and "2016-08-23 12:00:00"
4 2 2016-08-29 03:00:00 115
5 1 2016-08-27 23:00:00 167 # Number of hours between "2016-08-21 00:00:00" (first time detected the Ind 1) and "2016-08-27 23:00:00"
6 2 2016-09-02 02:00:00 210
7 1 2016-09-01 12:00:00 276
8 2 2016-09-09 04:00:00 380
9 1 2016-09-01 12:00:00 276
10 2 2016-09-10 12:00:00 412
Does anyone know how to do it?
Thanks in advance!
You can do this :
library(tidyverse)
# first get min datetime by ID
min_datetime_id <- df1 %>% group_by(ID) %>% summarise(min_datetime=min(Datetime))
# join with df1 and compute time difference
df1 <- df1 %>% left_join(min_datetime_id) %>% mutate(Hours_since_beginning= as.numeric(difftime(Datetime, min_datetime,units="hours")))
I have a dataframe with a lot of time series:
1 0:03 B 1
2 0:05 A 1
3 0:05 A 1
4 0:05 B 1
5 0:10 A 1
6 0:10 B 1
7 0:14 B 1
8 0:18 A 1
9 0:20 A 1
10 0:23 B 1
11 0:30 A 1
I want to group the time series into every 6 minutes and count the frequency of A and B:
1 0:06 A 2
2 0:06 B 2
3 0:12 A 1
4 0:12 B 1
5 0:18 A 1
6 0:24 A 1
7 0:24 B 1
8 0:18 A 1
9 0:30 A 1
Also, the class of the time series is character. What should I do?
Here's an approach to convert times to POSIXct, cut the times by 6 minute intervals, then count.
First, you need to specify the year, month, day, hour, minute, and seconds of your data. This will help with scaling it to larger datasets.
library(tidyverse)
library(lubridate)
# sample data
d <- data.frame(t = paste0("2019-06-02 ",
c("0:03","0:06","0:09","0:12","0:15",
"0:18","0:21","0:24","0:27","0:30"),
":00"),
g = c("A","A","B","B","B"))
d$t <- ymd_hms(d$t) # convert to POSIXct with `lubridate::ymd_hms()`
If you check the class of your new date column, you will see it is "POSIXct".
> class(d$t)
[1] "POSIXct" "POSIXt"
Now that the data is in "POSIXct", you can cut it by minute intervals! We will add this new grouping factor as a new column called tc.
d$tc <- cut(d$t, breaks = "6 min")
d
t g tc
1 2019-06-02 00:03:00 A 2019-06-02 00:03:00
2 2019-06-02 00:06:00 A 2019-06-02 00:03:00
3 2019-06-02 00:09:00 B 2019-06-02 00:09:00
4 2019-06-02 00:12:00 B 2019-06-02 00:09:00
5 2019-06-02 00:15:00 B 2019-06-02 00:15:00
6 2019-06-02 00:18:00 A 2019-06-02 00:15:00
7 2019-06-02 00:21:00 A 2019-06-02 00:21:00
8 2019-06-02 00:24:00 B 2019-06-02 00:21:00
9 2019-06-02 00:27:00 B 2019-06-02 00:27:00
10 2019-06-02 00:30:00 B 2019-06-02 00:27:00
Now you can group_by this new interval (tc) and your grouping column (g), and count the frequency of occurences. Getting the frequency of observations in a group is a fairly common operation, so dplyr provides count for this:
count(d, g, tc)
# A tibble: 7 x 3
g tc n
<fct> <fct> <int>
1 A 2019-06-02 00:03:00 2
2 A 2019-06-02 00:15:00 1
3 A 2019-06-02 00:21:00 1
4 B 2019-06-02 00:09:00 2
5 B 2019-06-02 00:15:00 1
6 B 2019-06-02 00:21:00 1
7 B 2019-06-02 00:27:00 2
If you run ?dplyr::count() in the console, you'll see that count(d, tc) is simply a wrapper for group_by(d, g, tc) %>% summarise(n = n()).
According to the sample dataset, the time series is given as time-of-day, i.e., without date.
The data.table package has the ITime class which is a time-of-day class stored as the integer number of seconds in the day. With data.table, we can use a rolling join to map times to the upper limit of the 6 minutes intervals (right-closed intervals):
library(data.table)
# coerce from character to class ITime
setDT(ts)[, time := as.ITime(time)]
# create sequence of breaks
breaks <- as.ITime(seq(as.ITime("0:00"), as.ITime("23:59:59"), as.ITime("0:06")))
# rolling join and aggregate
ts[, CJ(breaks, group, unique = TRUE)
][ts, on = .(group, breaks = time), roll = -Inf, .(x.breaks, group)
][, .N, by = .(upper = x.breaks, group)]
which returns
upper group N
1: 00:06:00 B 2
2: 00:06:00 A 2
3: 00:12:00 A 1
4: 00:12:00 B 1
5: 00:18:00 B 1
6: 00:18:00 A 1
7: 00:24:00 A 1
8: 00:24:00 B 1
9: 00:30:00 A 1
Addendum
If the direction of the rolling join is changed (roll = +Inf instead of roll = -Inf) we get left-closed intervals
ts[, CJ(breaks, group, unique = TRUE)
][ts, on = .(group, breaks = time), roll = +Inf, .(x.breaks, group)
][, .N, by = .(lower = x.breaks, group)]
which changes the result significantly:
lower group N
1: 00:00:00 B 2
2: 00:00:00 A 2
3: 00:06:00 A 1
4: 00:06:00 B 1
5: 00:12:00 B 1
6: 00:18:00 A 2
7: 00:18:00 B 1
8: 00:30:00 A 1
Data
library(data.table)
ts <- fread("
1 0:03 B 1
2 0:05 A 1
3 0:05 A 1
4 0:05 B 1
5 0:10 A 1
6 0:10 B 1
7 0:14 B 1
8 0:18 A 1
9 0:20 A 1
10 0:23 B 1
11 0:30 A 1"
, header = FALSE
, col.names = c("rn", "time", "group", "value"))
Hi i would like to change my data frame profile_table_long which represents 24 hour/data for 50 companies from 2 years.
Data - date from 2015-01-01 to 2016-12-31
name - name of firm 1:50
hour - hour 1:24 (with additional 2a between 2 and 3)
load - variable
x <- NULL
x$Data <- rep(seq(as.Date("2015/1/1"), as.Date("2016/12/31"), "days"), length.out=913750)
x$Name <- rep(rep(1:50, each=731), length.out=913750)
x$hour <- rep(rep(c(1, 2, "2a", 3:24), each=36550),length.out=913750)
x$load <- sample(2000:2500, 913750, replace=T)
x <- data.frame(x)
Data name hour load
1 2015-01-01 1 1 8837.050
2 2015-01-01 1 2 6990.952
3 2015-01-01 1 2a 8394.421
4 2015-01-01 1 3 8267.276
5 2015-01-01 1 4 8324.069
6 2015-01-01 1 5 8644.901
7 2015-01-01 1 6 8720.878
8 2015-01-01 1 7 9213.204
9 2015-01-01 1 8 9601.976
10 2015-01-01 1 9 8549.170
11 2015-01-01 1 10 9379.324
12 2015-01-01 1 11 9370.418
13 2015-01-01 1 12 7159.201
14 2015-01-01 1 13 8497.344
15 2015-01-01 1 14 6419.835
16 2015-01-01 1 15 9354.910
17 2015-01-01 1 16 9320.462
18 2015-01-01 1 17 9263.098
19 2015-01-01 1 18 9167.991
20 2015-01-01 1 19 9004.010
21 2015-01-01 1 20 9134.466
22 2015-01-01 1 21 7631.472
23 2015-01-01 1 22 6492.074
24 2015-01-01 1 23 6888.025
25 2015-01-01 1 24 8821.283
25 2015-01-02 1 1 8902.135
I would like to make it look like that:
data hour name1 name2 .... name49 name50
2015-01-01 1 load load .... load load
2015-01-01 2 load load .... load load
.....
2015-01-01 24 load load .... load load
2015-01-02 1 load load .... load load
.....
2016-12-31 24 load load .... load load
I tried spread() from tidyr package profile_table_tidy <- spread(profile_table_long, name, load) but I am getting an error Error: Duplicate identifiers for rows
This method uses the reshape2 package:
library("reshape2")
profile_table_wide = dcast(data = profile_table_long,
formula = Data + hour ~ name,
value.var = "load")
You might also want to choose a value for fill as well. Good luck!
I am trying to find average of values that are within a certain time frame within the same data.table and save it to a new column.
Below is a sample data set
Updated the dataset to represent the discontinuous timeline in my original dataset.
> x
ts value avg
1: 2015-01-01 00:00:23 9 0
2: 2015-01-01 00:01:56 11 0
3: 2015-01-01 00:02:03 18 0
4: 2015-01-01 00:03:16 1 0
5: 2015-01-01 00:05:19 6 0
6: 2015-01-01 00:05:54 16 0
7: 2015-01-01 00:06:27 13 0
8: 2015-01-01 00:06:50 7 0
9: 2015-01-01 00:08:41 12 0
10: 2015-01-01 00:09:08 17 0
11: 2015-01-01 00:09:28 8 0
12: 2015-01-01 00:10:56 5 0
13: 2015-01-01 00:11:44 10 0
14: 2015-01-01 00:12:23 20 0
15: 2015-01-01 00:12:28 2 0
16: 2015-01-01 00:12:37 15 0
17: 2015-01-01 00:12:42 4 0
18: 2015-01-01 00:12:48 19 0
19: 2015-01-01 00:13:41 3 0
20: 2015-01-01 00:16:04 14 0
My code assigns value 10.5 to all the rows and I donot get the expected results. Here is my code.
require(lubridate)
x[, avg := x[ts>=ts-minutes(2) & ts<=ts , mean(value)], verbose=TRUE ]
Updated
I want the results to be as below
ts value avg
1 01-01-2015 00:00:23 9 0
2 01-01-2015 00:01:56 11 9
3 01-01-2015 00:02:03 18 10
4 01-01-2015 00:03:16 1 14.5
5 01-01-2015 00:05:19 6 0
6 01-01-2015 00:05:54 16 6
7 01-01-2015 00:06:27 13 11
8 01-01-2015 00:06:50 7 11.66666667
9 01-01-2015 00:08:41 12 7
10 01-01-2015 00:09:08 17 12
11 01-01-2015 00:09:28 8 14.5
12 01-01-2015 00:10:56 5 12.5
13 01-01-2015 00:11:44 10 5
14 01-01-2015 00:12:23 20 7.5
15 01-01-2015 00:12:28 2 11.66666667
16 01-01-2015 00:12:37 15 9.25
17 01-01-2015 00:12:42 4 10.4
18 01-01-2015 00:12:48 19 9.333333333
19 01-01-2015 00:13:41 3 11.666667
20 01-01-2015 00:16:04 14 0
I want to do this to a data with a larger data set, also with min and max values in separate columns separately( here I have shown only the average function). Any help would be great.
Updated
Below is the reproducible code.
#reproducible code
ts<- seq(from=ISOdatetime(2015,1,1,0,0,0,tz="GMT"),to=ISOdatetime(2015,1,1,0,0,19,tz="GMT"), by="sec")
set.seed(2)
ts <-ts + seconds(round(runif(20,0,1000),0))
value <- 1:20
avg <- 0
x <- data.table(ts,value,avg)
setkey(x,ts)
x
Solution
Thanks to #Saksham for poining me towards apply functions. Here is the solution that I have come up with.
find <- function(y){
mean(x[ts>=y-minutes(2) & ts<y,value])
}
x$avg <- mapply(find,x[,ts])
> x
ts value avg
1: 2015-01-01 00:00:23 9 NaN
2: 2015-01-01 00:01:56 11 9.000000
3: 2015-01-01 00:02:03 18 10.000000
4: 2015-01-01 00:03:16 1 14.500000
5: 2015-01-01 00:05:19 6 NaN
6: 2015-01-01 00:05:54 16 6.000000
7: 2015-01-01 00:06:27 13 11.000000
8: 2015-01-01 00:06:50 7 11.666667
9: 2015-01-01 00:08:41 12 7.000000
10: 2015-01-01 00:09:08 17 12.000000
11: 2015-01-01 00:09:28 8 14.500000
12: 2015-01-01 00:10:56 5 12.500000
13: 2015-01-01 00:11:44 10 5.000000
14: 2015-01-01 00:12:23 20 7.500000
15: 2015-01-01 00:12:28 2 11.666667
16: 2015-01-01 00:12:37 15 9.250000
17: 2015-01-01 00:12:42 4 10.400000
18: 2015-01-01 00:12:48 19 9.333333
19: 2015-01-01 00:13:41 3 11.666667
20: 2015-01-01 00:16:04 14 NaN
Will this do
ts[,avg] <- ts[,val] - 0.5
As logically and seeing your expected result, it is doing the same thing. You can edit you expected result to make it more flexible if I interpreted it wrong.
EDIT:
This base R approach should do the trick. As I an not familiar with manipulating time, I am assuming that arithmetic works in the same way as it does in most of the languages
interval <- minutes(2) #Assuming this is how we define 5 minutes
x$avg <- apply( x, 1, function(y){
mean(x$value[x$time > ( y["time"]) - interval ) && x$time < y["time"]])
})