R: Aggregating by date and hour and placing into a separate matrix

I am looking to take a dataframe which has data ordered through time and aggregate up to the hourly level, and place the data into a separate dataframe. It's best explained with an example:
tradeData dataframe:
Time Amount
2014-05-16 14:00:05 10
2014-05-16 14:00:10 20
2014-05-16 14:08:15 30
2014-05-16 14:23:09 51
2014-05-16 14:59:54 84
2014-05-16 15:09:45 94
2014-05-16 15:24:41 53
2014-05-16 16:30:51 44
The matrix above contains the data I would like to aggregate. Below is the dataframe into which I would like to insert it:
HourlyData dataframe:
Time Profit
2014-05-16 00:00:00 100
2014-05-16 01:00:00 200
2014-05-16 02:00:00 250
...
2014-05-16 14:00:00 30
2014-05-16 15:00:00 -50
2014-05-16 16:00:00 67
...
2014-05-16 23:00:00 -8
I would like to aggregate the data in the tradeData dataframe and place it in the correct place in the hourlyData dataframe as below:
New hourlyData dataframe:
Time Profit Amount
2014-05-16 00:00:00 100 0
2014-05-16 01:00:00 200 0
2014-05-16 02:00:00 250 0
...
2014-05-16 14:00:00 30 0
2014-05-16 15:00:00 -50 195 (10+20+30+51+84)
2014-05-16 16:00:00 67 147 (94+53)
2014-05-16 17:00:00 20 44
...
2014-05-16 23:00:00 -8 0
Using the solution provided by Akrun below, I was able to get a solution for most instances. However, there appears to be an issue when an event occurs within the last hour of the day, as below:
TradeData
Time Amount
2014-08-15 22:09:07 11037.778
2014-08-15 23:01:33 13374.724
2014-08-20 23:25:40 133373.000
HourlyData
Time Amount
2014-08-15 23:00:00 11037.778 (correct)
2014-08-18 00:00:00 0 (incorrect)
2014-08-21 00:00:00 133373 (correct)
The formula appears to skip the data for the second trade in the tradeData dataframe when aggregating into the hourlyData dataframe. This seems to happen for trades that occur in the last hour of a Friday, because (I imagine) no row exists for Saturday at 12am, i.e. Friday 11PM + 1 hour. It works for a trade occurring in the last hour of Monday to Thursday.
Any ideas on how to adjust the algo? Please let me know if anything is unclear.
Thanks
Mike

Try
library(dplyr)
res <- left_join(df2,
                 df %>%
                   group_by(hour = as.POSIXct(cut(Time, breaks = 'hour')) + 3600) %>%
                   summarise(Amount = sum(Amount)),
                 by = c('Time' = 'hour'))
res$Amount[is.na(res$Amount)] <- 0
res
# Time Profit Amount
#1 2014-05-16 00:00:00 100 0
#2 2014-05-16 01:00:00 200 0
#3 2014-05-16 02:00:00 250 0
#4 2014-05-16 14:00:00 30 0
#5 2014-05-16 15:00:00 -50 195
#6 2014-05-16 16:00:00 67 147
#7 2014-05-16 23:00:00 -8 0
Or using data.table
library(data.table)
DT <- data.table(df)
DT2 <- data.table(df2)
DT1 <- DT[, list(Amount = sum(Amount)),
          by = list(Time = as.POSIXct(cut(Time, breaks = 'hour')) + 3600)]
setkey(DT1, Time)
DT1[DT2][is.na(Amount), Amount:=0][]
# Time Amount Profit
#1: 2014-05-16 00:00:00 0 100
#2: 2014-05-16 01:00:00 0 200
#3: 2014-05-16 02:00:00 0 250
#4: 2014-05-16 14:00:00 0 30
#5: 2014-05-16 15:00:00 195 -50
#6: 2014-05-16 16:00:00 147 67
#7: 2014-05-16 23:00:00 0 -8
Update
Based on the weekend info,
indx <- with(df, as.numeric(format(Time, '%H')) == 23 &
                 as.numeric(format(Time, '%S')) > 0 & format(Time, '%a') == 'Fri')
grp <- with(df, as.POSIXct(cut(Time, breaks = 'hour')))
grp[indx] <- grp[indx] + 3600 * 49   # Fri 23:00 + 49h lands on Mon 00:00
grp[!indx] <- grp[!indx] + 3600
df$Time <- grp
df %>%
  group_by(Time) %>%
  summarise(Amount = sum(Amount)) # in the example dataset, it is just 3 rows
# Time Amount
#1 2014-08-15 23:00:00 11037.78
#2 2014-08-18 00:00:00 13374.72
#3 2014-08-21 00:00:00 133373.00
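If you'd rather not hard-code the Friday case, a more general variant is to build the usual hourly label and then roll any label that lands on a weekend forward to Monday. This is only a sketch under the same assumption (trading resumes Monday 00:00); roll_weekend is a hypothetical helper, not part of the answer above, and adding whole days in seconds ignores DST changes:
roll_weekend <- function(x) {
  wd <- format(x, '%u')                      # ISO weekday: '6' = Sat, '7' = Sun
  x[wd == '6'] <- x[wd == '6'] + 2 * 86400   # Sat 00:00 -> Mon 00:00
  x[wd == '7'] <- x[wd == '7'] + 1 * 86400   # Sun -> Mon, same hour
  x
}
df$Time <- roll_weekend(as.POSIXct(cut(df$Time, breaks = 'hour')) + 3600)
Followed by the same group_by/summarise as above, this reproduces the three rows for the newdata example and leaves Monday-to-Thursday trades untouched.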
data
df <- structure(list(Time = structure(c(1400263205, 1400263210, 1400263695,
1400264589, 1400266794, 1400267385, 1400268281, 1400272251), class = c("POSIXct",
"POSIXt"), tzone = ""), Amount = c(10L, 20L, 30L, 51L, 84L, 94L,
53L, 44L)), .Names = c("Time", "Amount"), row.names = c(NA, -8L
), class = "data.frame")
df2 <- structure(list(Time = structure(c(1400212800, 1400216400, 1400220000,
1400263200, 1400266800, 1400270400, 1400295600), class = c("POSIXct",
"POSIXt"), tzone = ""), Profit = c(100L, 200L, 250L, 30L, -50L,
67L, -8L)), .Names = c("Time", "Profit"), row.names = c(NA, -7L
), class = "data.frame")
newdata
df <- structure(list(Time = structure(c(1408158000, 1408334400, 1408593600
), tzone = "", class = c("POSIXct", "POSIXt")), Amount = c(11037.778,
13374.724, 133373)), .Names = c("Time", "Amount"), row.names = c(NA,
-3L), class = "data.frame")

Related

'Interpolation' of a missing date/value in R?

I have a dataframe like so:
Month CumulativeSum
2019-02-01 40
2019-03-01 70
2019-04-01 80
2019-07-01 100
2019-08-01 120
The problem is that nothing happened in May and June, hence there is no data. Plotting this in bar charts results in some empty space on the x-axis.
Is there some way to "fill" the missing spot like so, using the last known value?:
Month CumulativeSum
2019-02-01 40
2019-03-01 70
2019-04-01 80
2019-05-01 80 <--
2019-06-01 80 <--
2019-07-01 100
2019-08-01 120
We can use complete
library(dplyr)
library(tidyr)
df1 %>%
  complete(Month = seq(min(Month), max(Month), by = '1 month')) %>%
  fill(CumulativeSum)
Output:
# A tibble: 7 x 2
# Month CumulativeSum
# <date> <int>
#1 2019-02-01 40
#2 2019-03-01 70
#3 2019-04-01 80
#4 2019-05-01 80
#5 2019-06-01 80
#6 2019-07-01 100
#7 2019-08-01 120
data
df1 <- structure(list(Month = structure(c(17928, 17956, 17987, 18078,
18109), class = "Date"), CumulativeSum = c(40L, 70L, 80L, 100L,
120L)), row.names = c(NA, -5L), class = "data.frame")
Here is a base R option using cummax
transform(
  data.frame(
    Month = seq(min(df1$Month), max(df1$Month), by = "1 month"),
    CumulativeSum = -Inf
  ),
  CumulativeSum = cummax(replace(CumulativeSum, Month %in% df1$Month, df1$CumulativeSum))
)
which gives
Month CumulativeSum
1 2019-02-01 40
2 2019-03-01 70
3 2019-04-01 80
4 2019-05-01 80
5 2019-06-01 80
6 2019-07-01 100
7 2019-08-01 120
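If zoo is already in your toolchain, the same fill can be had by merging against the full month sequence and carrying the last observation forward; a small sketch, assuming the same df1 as in the data block above:
library(zoo)
full <- data.frame(Month = seq(min(df1$Month), max(df1$Month), by = "1 month"))
out <- merge(full, df1, all.x = TRUE)            # NA rows appear for the missing months
out$CumulativeSum <- na.locf(out$CumulativeSum)  # last observation carried forward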

Beginner: set up time series in R

I am brand new to R, and am having trouble figuring out how to set up a simple time series.
Illustration: say I have three variables: Event (0 or 1), HR (heart rate), DT (datetime):
df = data.frame(Event = c(1,0,0,0,1,0,0),
                HR = c(100,120,115,105,105,115,100),
                DT = c("2020-01-01 09:00:00","2020-01-01 09:15:00","2020-01-01 10:00:00","2020-01-01 10:30:00",
                       "2020-01-01 11:00:00","2020-01-01 12:00:00","2020-01-01 13:00:00"),
                stringsAsFactors = F)
Event HR DT
1 1 100 2020-01-01 09:00:00
2 0 120 2020-01-01 09:15:00
3 0 115 2020-01-01 10:00:00
4 0 105 2020-01-01 10:30:00
5 1 105 2020-01-01 11:00:00
6 0 115 2020-01-01 12:00:00
7 0 100 2020-01-01 13:00:00
What I would like to do is to calculate elapsed time after each new event: So, row1=0 min, row2=15, row3=60,... row5=0, row6=60 Then I can do things like plot HR vs elapsed.
What might be a simple way to calculate elapsed time?
Apologies for such a low level question, but would be very grateful for any help!
Here is a one-line approach using data.table.
Data:
df <- structure(list(Event = c(1, 0, 0, 0, 1, 0, 0), HR = c(100, 120,
115, 105, 105, 115, 100), DT = structure(c(1577869200, 1577870100,
1577872800, 1577874600, 1577876400, 1577880000, 1577883600), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), row.names = c(NA, -7L), class = "data.frame")
Code:
library(data.table)
dt <- as.data.table(df)
dt[, mins_since_last_event := as.numeric(difftime(DT, DT[1], units = "mins")),
   by = .(cumsum(Event))]
Output:
dt
Event HR DT mins_since_last_event
1: 1 100 2020-01-01 09:00:00 0
2: 0 120 2020-01-01 09:15:00 15
3: 0 115 2020-01-01 10:00:00 60
4: 0 105 2020-01-01 10:30:00 90
5: 1 105 2020-01-01 11:00:00 0
6: 0 115 2020-01-01 12:00:00 60
7: 0 100 2020-01-01 13:00:00 120
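The grouping trick is cumsum(Event): every 1 starts a new run, so all rows up to the next event share a group id, and DT[1] within each group is that event's time. Illustrated on the Event column from the example:
cumsum(c(1, 0, 0, 0, 1, 0, 0))
# [1] 1 1 1 1 2 2 2   -> rows 1-4 are event group 1, rows 5-7 are group 2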
The following uses the chron library and converts your date/time column to time objects that the library can run calculations and conversions on.
Example Data:
df <- data.frame(
  Event = c(1,0,0,0,1,0,0),
  HR = c(100,125,115,105,105,115,100),
  DT = c("2020-01-01 09:00:00",
         "2020-01-01 09:15:00",
         "2020-01-01 10:00:00",
         "2020-01-01 10:30:00",
         "2020-01-01 11:00:00",
         "2020-01-01 12:00:00",
         "2020-01-01 13:00:00"))
Code:
library(chron)
Dates <- lapply(strsplit(as.character(df$DT)," "),head,n=1)
Times <- lapply(strsplit(as.character(df$DT)," "),tail,n=1)
df$DT <- chron(as.character(Dates),as.character(Times),format=c(dates="y-m-d",times="h:m:s"))
df$TimeElapsed[1] <- 0   # initialise the new column
for(i in 1:nrow(df)){
  if(df$Event[i] == 1){TimeStart <- df$DT[i]}
  df$TimeElapsed[i] <- (df$DT[i] - TimeStart) * 24 * 60   # chron differences are in days
}
output:
> df
Event HR DT TimeElapsed
1 1 100 (20-01-01 09:00:00) 0
2 0 125 (20-01-01 09:15:00) 15
3 0 115 (20-01-01 10:00:00) 60
4 0 105 (20-01-01 10:30:00) 90
5 1 105 (20-01-01 11:00:00) 0
6 0 115 (20-01-01 12:00:00) 60
7 0 100 (20-01-01 13:00:00) 120
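As a side note, the loop above can be vectorized in base R with ave(), reusing the cumsum(Event) grouping idea from the data.table answer; a sketch, assuming df$DT is already a chron object (so numeric differences are in days):
df$TimeElapsed <- ave(as.numeric(df$DT), cumsum(df$Event),
                      FUN = function(x) (x - x[1]) * 24 * 60)  # days -> minutes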
Welcome to Stack Overflow #greyguy.
Here is an approach with the dplyr library, which is pretty good with large data sets:
library(dplyr)
library(zoo)   # for na.locf()
# Your data
df = data.frame(Event = c(1,0,0,0,1,0,0),
                HR = c(100,120,115,105,105,115,100),
                DT = c("2020-01-01 09:00:00","2020-01-01 09:15:00","2020-01-01 10:00:00","2020-01-01 10:30:00",
                       "2020-01-01 11:00:00","2020-01-01 12:00:00","2020-01-01 13:00:00"),
                stringsAsFactors = F
)
# Convert DT from string to time format and order by time, in case the data are not ordered
df = df %>%
  mutate(DT = as.POSIXct(DT, format = "%Y-%m-%d %H:%M:%S")) %>%
  arrange(DT) %>%
  mutate( # little trick to carry the last event's DT forward
    last_DT = case_when(Event == 1 ~ DT),
    last_DT = na.locf(last_DT, na.rm = FALSE),
    Elapsed_min = as.numeric(difftime(DT, last_DT, units = "mins"))
  ) %>%
  select(-last_DT)
The output:
# Event HR DT Elapsed_min
# 1 100 2020-01-01 09:00:00 0
# 0 120 2020-01-01 09:15:00 15
# 0 115 2020-01-01 10:00:00 60
# 0 105 2020-01-01 10:30:00 90
# 1 105 2020-01-01 11:00:00 0
# 0 115 2020-01-01 12:00:00 60
# 0 100 2020-01-01 13:00:00 120

Apply Scale function for every 24 hour data period

I have several days of heart rate data for every second of the day (with random missing gaps of data) like this:
structure(list(TimePoint = structure(c(1523237795, 1523237796,
1523237797, 1523237798, 1523237799, 1523237800, 1523237801, 1523237802,
1523237803, 1523237804), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
HR = c(80L, 83L, 87L, 91L, 95L, 99L, 102L, 104L, 104L, 103L
)), row.names = c(NA, 10L), class = "data.frame")
------------------------------
TimePoint HR
1 2018-04-09 01:36:35 80
2 2018-04-09 01:36:36 83
3 2018-04-09 01:36:37 87
4 2018-04-09 01:36:38 91
5 2018-04-09 01:36:39 95
6 2018-04-09 01:36:40 99
7 2018-04-09 01:36:41 102
8 2018-04-09 01:36:42 104
9 2018-04-09 01:36:43 104
10 2018-04-09 01:36:44 103
.
.
.
I would like to apply the scale(center = TRUE, scale = TRUE) function to the data to normalize across participants.
However, I don't want to normalize across the entire span of available data, but within every 24-hour period.
So if a participant has 3 days of data, the HR will be scaled to a z-distribution 3 separate times, each for its respective day.
I am having trouble doing this successfully.
library(dplyr)       # for %>%
library(lubridate)   # for wday() and hour()
# read csv
DF = read.csv(x)
# parse the 'day.month.Year H:M:S' timestamps into class POSIXct
x2 = as.POSIXct(DF[,1], format = '%d.%m.%Y %H:%M:%S', tz = "UTC") %>% data.frame()
# rename column
colnames(x2)[1] = "TimePoint"
# add the participant HR data to this dataframe
x2$HR = DF[,2]
# break time stamps into 60 minute windows
by60 = cut(x2$TimePoint, breaks = "60 min")
# get the average HR per 60 min window
DF_Sum = aggregate(HR ~ by60, FUN = mean, data = x2)
# add weekday / hour for future plot visualization
DF_Sum$WeekDay = wday(DF_Sum$by60, label = TRUE)
DF_Sum$Hour = hour(DF_Sum$by60)
I am able to split the data by time series and average the HR by hour, but I cannot seem to add the scale function properly.
Help appreciated.
Create time intervals of 24 hours for each patient, group_by patient and time intervals, then calculate the scaled HR for each group.
library(dplyr)
df %>%
  # remove the following mutate, and replace ID in group_by by the ID column name in your data set
  mutate(ID = 1) %>%
  group_by(ID, Int = cut(TimePoint, breaks = "24 hours")) %>%
  mutate(HR_sc = scale(HR, center = TRUE, scale = TRUE))
# A tibble: 10 x 5
# Groups: ID, Int [1]
TimePoint HR ID Int HR_sc
<dttm> <int> <dbl> <fct> <dbl>
1 2018-04-09 01:26:35 80 1 2018-04-09 01:00:00 -1.63
2 2018-04-09 01:28:16 83 1 2018-04-09 01:00:00 -1.30
3 2018-04-09 01:29:57 87 1 2018-04-09 01:00:00 -0.860
4 2018-04-09 01:31:38 91 1 2018-04-09 01:00:00 -0.419
5 2018-04-09 01:33:19 95 1 2018-04-09 01:00:00 0.0221
6 2018-04-09 01:33:20 99 1 2018-04-09 01:00:00 0.463
7 2018-04-09 01:35:01 102 1 2018-04-09 01:00:00 0.794
8 2018-04-09 01:36:42 104 1 2018-04-09 01:00:00 1.01
9 2018-04-09 01:38:23 104 1 2018-04-09 01:00:00 1.01
10 2018-04-09 01:39:59 103 1 2018-04-09 01:00:00 0.905
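One design note: cut(TimePoint, breaks = "24 hours") starts its 24-hour windows at the first timestamp (floored to the hour), which is why the interval above begins at 01:00 rather than midnight. If "24 hour period" should instead mean calendar day, a hedged alternative is to group on as.Date():
df %>%
  mutate(ID = 1) %>%                        # replace with the real participant ID column
  group_by(ID, Day = as.Date(TimePoint)) %>%
  mutate(HR_sc = as.numeric(scale(HR)))     # as.numeric() drops scale()'s matrix attributes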

Add a new date field based off of repeat encounters [duplicate]

This question already has answers here:
How to create a lag variable within each group?
(5 answers)
Closed 5 years ago.
I have repeat encounters with animals that have a unique IndIDII and a unique GPSSrl number. Each encounter has a FstCptr date.
dat <- structure(list(IndIDII = c("BHS_115", "BHS_115", "BHS_372", "BHS_372",
"BHS_372", "BHS_372"), GPSSrl = c("035665", "036052", "034818",
"035339", "036030", "036059"), FstCptr = structure(c(1481439600,
1450162800, 1426831200, 1481439600, 1457766000, 1489215600), class = c("POSIXct",
"POSIXt"), tzone = "")), .Names = c("IndIDII", "GPSSrl", "FstCptr"
), class = "data.frame", row.names = c(1L, 2L, 29L, 30L, 31L,
32L))
> dat
IndIDII GPSSrl FstCptr
1 BHS_115 035665 2016-12-11
2 BHS_115 036052 2015-12-15
29 BHS_372 034818 2015-03-20
30 BHS_372 035339 2016-12-11
31 BHS_372 036030 2016-03-12
32 BHS_372 036059 2017-03-11
For each IndID-GPSSrl grouping, I want to create a new field (NextCptr) that documents the date of the next encounter. For the last encounter, the new field would be NA, for example:
dat$NextCptr <- as.Date(c("2015-12-15", NA, "2016-12-11", "2016-03-12", "2017-03-11", NA))
> dat
IndIDII GPSSrl FstCptr NextCptr
1 BHS_115 035665 2016-12-11 2015-12-15
2 BHS_115 036052 2015-12-15 <NA>
29 BHS_372 034818 2015-03-20 2016-12-11
30 BHS_372 035339 2016-12-11 2016-03-12
31 BHS_372 036030 2016-03-12 2017-03-11
32 BHS_372 036059 2017-03-11 <NA>
I would like to work within dplyr and group_by(IndIDII, GPSSrl).
As always, many thanks!
Group by column IndIDII then use lead to shift FstCptr forward by one:
dat %>% group_by(IndIDII) %>% mutate(NextCptr = lead(FstCptr))
# A tibble: 6 x 4
# Groups: IndIDII [2]
# IndIDII GPSSrl FstCptr NextCptr
# <chr> <chr> <dttm> <dttm>
#1 BHS_115 035665 2016-12-11 02:00:00 2015-12-15 02:00:00
#2 BHS_115 036052 2015-12-15 02:00:00 NA
#3 BHS_372 034818 2015-03-20 02:00:00 2016-12-11 02:00:00
#4 BHS_372 035339 2016-12-11 02:00:00 2016-03-12 02:00:00
#5 BHS_372 036030 2016-03-12 02:00:00 2017-03-11 02:00:00
#6 BHS_372 036059 2017-03-11 02:00:00 NA
If you need to shift the column in the opposite direction, lag could also be useful, dat %>% group_by(IndIDII) %>% mutate(NextCptr = lag(FstCptr)).
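Note that lead() shifts by row position, so NextCptr here is the next encounter in the data's existing order; in the example, BHS_115's first row (2016-12-11) actually post-dates its second (2015-12-15). If you want the next encounter chronologically instead, a hedged variant is to sort within groups first:
dat %>%
  group_by(IndIDII) %>%
  arrange(FstCptr, .by_group = TRUE) %>%   # put encounters in date order within each group
  mutate(NextCptr = lead(FstCptr))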

R time aggregate with start/stop

I have a set of time series data that has a start and stop time. Each event can last from a few seconds to a few days. I need to calculate, for every hour, the sum over the jobs active at that time; in this example, the total memory used. Here is a sample of the data:
mem_used start_time stop_time
16 2015-10-24 17:24:41 2015-10-25 04:19:44
80 2015-10-24 17:24:51 2015-10-25 03:14:59
44 2015-10-24 17:25:27 2015-10-25 01:16:10
28 2015-10-24 17:25:43 2015-10-25 00:00:31
72 2015-10-24 17:30:23 2015-10-24 23:58:31
In this case it should give something like:
time total_mem
2015-10-24 17:00:00 240
2015-10-24 18:00:00 240
...
2015-10-25 00:00:00 168
2015-10-25 01:00:00 140
2015-10-25 02:00:00 96
2015-10-25 03:00:00 96
2015-10-25 04:00:00 16
I'm trying to do something with the aggregate function but I cannot figure it out. Any ideas? Thanks.
Here's how I would do it, using lubridate.
First, make sure that your dates are in POSIXct format:
dat$start_time = as.POSIXct(dat$start_time, format = "%Y-%m-%d %H:%M:%S")
dat$stop_time = as.POSIXct(dat$stop_time, format = "%Y-%m-%d %H:%M:%S")
Then make an interval object with lubridate:
library(lubridate)
dat$interval <- interval(dat$start_time, dat$stop_time)
Now we can make a vector of times, replace these with your desired times:
z <- seq(from = dat$start_time[1], to = dat$stop_time[5], by = "hours")
And sum those where we have an overlap:
out <- data.frame(times = z,
                  mem_used = sapply(z, function(x) sum(dat$mem_used[x %within% dat$interval])))
times mem_used
1 2015-10-24 17:24:41 16
2 2015-10-24 18:24:41 240
3 2015-10-24 19:24:41 240
4 2015-10-24 20:24:41 240
5 2015-10-24 21:24:41 240
6 2015-10-24 22:24:41 240
7 2015-10-24 23:24:41 240
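To get rows on whole hours like the desired output (17:00:00, 18:00:00, ...) rather than at offsets of the first start time, one option is to floor the endpoints before building the sequence, e.g. with lubridate's floor_date(), which is already loaded here:
z <- seq(from = floor_date(min(dat$start_time), "hour"),
         to = floor_date(max(dat$stop_time), "hour"),
         by = "hours")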
Here's the data used:
structure(list(mem_used = c(16L, 80L, 44L, 28L, 72L), start_time = structure(c(1445721881,
1445721891, 1445721927, 1445721943, 1445722223), class = c("POSIXct",
"POSIXt"), tzone = ""), stop_time = structure(c(1445761184, 1445757299,
1445750170, 1445745631, 1445745511), class = c("POSIXct", "POSIXt"
), tzone = "")), .Names = c("mem_used", "start_time", "stop_time"
), row.names = c(NA, -5L), class = "data.frame")
Here is another solution based on dplyr and lubridate.
First make sure the data are in the right format (e.g. dates as POSIXct); note that the question's stop_time column is called end_time below.
library(dplyr)
library(lubridate)
glimpse(df)
## Observations: 5
## Variables: 3
## $ mem_used (int) 16, 80, 44, 28, 72
## $ start_time (time) 2015-10-24 17:24:41, 2015-10-24 17:24:51...
## $ end_time (time) 2015-10-25 04:19:44, 2015-10-25 03:14:59...
Then we will just keep the hour (removing minutes and seconds) since we want to aggregate per hour.
### Remove minutes and seconds
minute(df$start_time) <- 0
second(df$start_time) <- 0
minute(df$end_time) <- 0
second(df$end_time) <- 0
The most important step now is to create a new data.frame with one row for each hour between start_time and end_time. For example, if the first line of the original data.frame spans several hours, we will end up with one row per hour in that span, with its mem_used value duplicated on each row.
n <- nrow(df)
l <- lapply(1:n, function(i) {
  date <- seq.POSIXt(df$start_time[i], df$end_time[i], by = "hour")
  mem_used <- rep(df$mem_used[i], length(date))
  data.frame(time = date, mem_used = mem_used)
})
df <- Reduce(rbind, l)
glimpse(df)
## Observations: 47
## Variables: 2
## $ time (time) 2015-10-24 17:00:00, 2015-10-24 18:00:00, ...
## $ mem_used (int) 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,...
Finally, we can now aggregate using dplyr or aggregate (or other similar functions)
df %>%
  group_by(time) %>%
  summarise(tot = sum(mem_used))
## time tot
## (time) (int)
## 1 2015-10-24 17:00:00 240
## 2 2015-10-24 18:00:00 240
## 3 2015-10-24 19:00:00 240
## 4 2015-10-24 20:00:00 240
## 5 2015-10-24 21:00:00 240
## 6 2015-10-24 22:00:00 240
## 7 2015-10-24 23:00:00 240
## 8 2015-10-25 00:00:00 168
## 9 2015-10-25 01:00:00 140
## 10 2015-10-25 02:00:00 96
## 11 2015-10-25 03:00:00 96
## 12 2015-10-25 04:00:00 16
## Or aggregate
aggregate(mem_used ~ time, FUN = sum, data = df)
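For larger data sets, an alternative worth mentioning (not from either answer above) is an interval join with data.table::foverlaps(), which matches each hourly time point against the job intervals without expanding one row per hour per job. A hedged sketch, assuming the question's five-row data frame is loaded as dat (as in the first answer):
library(data.table)
dt <- as.data.table(dat)              # columns: mem_used, start_time, stop_time
setkey(dt, start_time, stop_time)     # foverlaps() needs the interval key on y
hours <- data.table(time = seq(from = as.POSIXct(trunc(min(dt$start_time), "hours")),
                               to = max(dt$stop_time), by = "hour"))
hours[, c("start", "end") := .(time, time)]   # zero-width intervals, one per hour
res <- foverlaps(hours, dt, by.x = c("start", "end"),
                 by.y = c("start_time", "stop_time"))
res[, .(total_mem = sum(mem_used, na.rm = TRUE)), by = time]  # NA rows where no job overlaps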
