I have a CSV file consisting of around 200,000 rows of transactions. Here is the import and a little preprocessing of the data:
data <- read.csv("bitfinex_data/trades.csv", header = TRUE)
data$date <- as.character(data$date)
data$date <- substr(data$date, 1, 10)  # keep the first 10 characters (the Unix-seconds part)
data$date <- as.numeric(data$date)
data$date <- as.POSIXct(data$date, origin = "1970-01-01", tz = "GMT")
head(data)
id exchange symbol date price amount sell
1 24892563 bf btcusd 2018-01-02 00:00:00 13375 0.05743154 False
2 24892564 bf btcusd 2018-01-02 00:00:01 13374 0.12226129 False
3 24892565 bf btcusd 2018-01-02 00:00:02 13373 0.00489140 False
4 24892566 bf btcusd 2018-01-02 00:00:02 13373 0.07510860 False
5 24892567 bf btcusd 2018-01-02 00:00:02 13373 0.11606086 False
6 24892568 bf btcusd 2018-01-02 00:00:03 13373 0.47000000 False
My goal is to obtain hourly sums of the amount of tokens being traded. For this I need to split my data based on hours, which I did in the following way:
tmp <- split(data, cut(data$date, "hour"))
However, this is taking way too long (up to 1 hour), and I wonder whether this is normal behaviour for functions such as split() and cut(). Is there any alternative to using those two functions?
UPDATE:
After using the great suggestion by @Maurits Evers, my output table looks like this:
# A tibble: 25 x 2
date_hour amount.sum
<chr> <dbl>
1 1970-01-01 00 48.2
2 2018-01-02 00 2746.
3 2018-01-02 01 1552.
4 2018-01-02 02 2010.
5 2018-01-02 03 2171.
6 2018-01-02 04 3640.
7 2018-01-02 05 1399.
8 2018-01-02 06 836.
9 2018-01-02 07 856.
10 2018-01-02 08 819.
# ... with 15 more rows
This is exactly what I wanted, except for the first row, where the date is from the year 1970. Any suggestion on what might be causing the problem? I tried to change the origin parameter of the as.POSIXct() function, but that did not solve the problem.
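One way to track this down (a diagnostic sketch on my part, not from the original post): a timestamp near the 1970 origin usually means some raw date strings did not contain a full 10-digit Unix timestamp, so as.numeric() produced a very small number. You can inspect the offending rows directly:
# Rows whose parsed date falls long before the trading period are suspect
suspect <- data[data$date < as.POSIXct("2018-01-01", tz = "GMT"), ]
head(suspect)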
I agree with @Roland's comment. To illustrate, here is an example.
Let's generate some data with 200,000 entries at one-minute intervals.
set.seed(2018);
df <- data.frame(
date = seq(from = as.POSIXct("2018-01-01 00:00"), by = "min", length.out = 200000),
amount = runif(200000))
head(df);
# date amount
#1 2018-01-01 00:00:00 0.33615347
#2 2018-01-01 00:01:00 0.46372327
#3 2018-01-01 00:02:00 0.06058539
#4 2018-01-01 00:03:00 0.19743361
#5 2018-01-01 00:04:00 0.47431419
#6 2018-01-01 00:05:00 0.30104860
We now (1) create a new column date_hour containing the date and hour part of the full date-time, (2) group_by column date_hour, and (3) sum the entries from column amount to give amount.sum.
df %>%
mutate(date_hour = format(date, "%Y-%m-%d %H")) %>%
group_by(date_hour) %>%
summarise(amount.sum = sum(amount))
## A tibble: 3,333 x 2
# date_hour amount.sum
# <chr> <dbl>
# 1 2018-01-01 00 28.9
# 2 2018-01-01 01 26.4
# 3 2018-01-01 02 32.7
# 4 2018-01-01 03 29.9
# 5 2018-01-01 04 29.7
# 6 2018-01-01 05 28.5
# 7 2018-01-01 06 34.2
# 8 2018-01-01 07 33.8
# 9 2018-01-01 08 30.7
#10 2018-01-01 09 27.7
## ... with 3,323 more rows
This is very fast (it takes around 0.3 seconds on my 2012 MacBook Air), and you should be able to easily adjust this example to your particular case.
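Applied to your data frame, the same pipeline would look like this (a sketch assuming the date and amount columns shown in your head(data) output):
library(dplyr)
data %>%
    mutate(date_hour = format(date, "%Y-%m-%d %H")) %>%
    group_by(date_hour) %>%
    summarise(amount.sum = sum(amount))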
You can compute hourly sums without any packages by using tapply. I use the random data suggested by Maurits Evers:
set.seed(2018)
df <- data.frame(
date = seq(from = as.POSIXct("2018-01-01 00:00"),
by = "min", length.out = 200000),
amount = runif(200000))
head(df)
## date amount
## 1 2018-01-01 00:00:00 0.33615347
## 2 2018-01-01 00:01:00 0.46372327
## 3 2018-01-01 00:02:00 0.06058539
## 4 2018-01-01 00:03:00 0.19743361
## 5 2018-01-01 00:04:00 0.47431419
## 6 2018-01-01 00:05:00 0.30104860
tapply(df$amount,
format(df$date, "%Y-%m-%d %H"),
sum)
## 2018-01-01 00 2018-01-01 01 2018-01-01 02
## 28.85825 26.39385 32.73600
## 2018-01-01 03 2018-01-01 04 2018-01-01 05
## 29.88545 29.74048 28.46781
## ...
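If you want the result as a data frame rather than a named vector, a small follow-up sketch:
res <- tapply(df$amount, format(df$date, "%Y-%m-%d %H"), sum)
res_df <- data.frame(date_hour = names(res), amount.sum = as.numeric(res))
head(res_df)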
Related
I have some air pollution data measured by hours.
Datetime             PM2.5  Station.id
2020-01-01 00:00:00  10     1
2020-01-01 01:00:00  NA     1
2020-01-01 02:00:00  15     1
2020-01-01 03:00:00  NA     1
2020-01-01 04:00:00  7      1
2020-01-01 05:00:00  20     1
2020-01-01 06:00:00  30     1
2020-01-01 00:00:00  NA     2
2020-01-01 01:00:00  17     2
2020-01-01 02:00:00  21     2
2020-01-01 03:00:00  55     2
I have a very large amount of data collected from many stations. Using R, what is the most efficient way to remove a day when it has (1) a total of 18 hours of missing data AND (2) 8 hours of continuous missing data?
PS. The original data may come either with the NA rows already removed OR with the NAs inserted.
The "most efficient" way will almost certainly use data.table. Something like this:
library(data.table)
setDT(your_data)
your_data[, date := as.IDate(Datetime)][,
  if (!(sum(is.na(PM2.5)) >= 18 &
        with(rle(is.na(PM2.5)), max(lengths[values])) >= 8
  )) .SD,
  by = .(date, Station.id)
]
# date Datetime PM2.5
# 1: 2020-01-01 2020-01-01 00:00:00 10
# 2: 2020-01-01 2020-01-01 01:00:00 NA
# 3: 2020-01-01 2020-01-01 02:00:00 15
# 4: 2020-01-01 2020-01-01 03:00:00 NA
# 5: 2020-01-01 2020-01-01 04:00:00 7
# 6: 2020-01-01 2020-01-01 05:00:00 20
# 7: 2020-01-01 2020-01-01 06:00:00 30
Using this sample data:
your_data = fread(text = 'Datetime PM2.5
2020-01-01 00:00:00 10
2020-01-01 01:00:00 NA
2020-01-01 02:00:00 15
2020-01-01 03:00:00 NA
2020-01-01 04:00:00 7
2020-01-01 05:00:00 20
2020-01-01 06:00:00 30')
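One caveat (my addition, not part of the original answer): with the vectorised &, the rle() part is evaluated even on days that contain no NAs, where lengths[values] is empty and max() returns -Inf with a warning. The result is still correct (-Inf >= 8 is FALSE), but the short-circuiting && avoids the warning, since the rle() term is only reached when at least 18 NAs exist:
your_data[, date := as.IDate(Datetime)][,
  if (!(sum(is.na(PM2.5)) >= 18 &&
        with(rle(is.na(PM2.5)), max(lengths[values])) >= 8
  )) .SD,
  by = date
]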
I am trying to split rows in an Excel file based on day and time. The data comes from a study in which participants wore a tracking watch. Each row of the data set starts when a participant put on the watch (variable 'Wear Time Start') and ends when they took off the device (variable 'Wear Time End').
I need to calculate how many hours each participant wore the device on each day (NOT per time period in one row).
Data set before split:
ID WearStart WearEnd
1 01 2018-05-14 09:00:00 2018-05-14 20:00:00
2 01 2018-05-14 21:30:00 2018-05-15 02:00:00
3 01 2018-05-15 07:00:00 2018-05-16 22:30:00
4 01 2018-05-16 23:00:00 2018-05-16 23:40:00
5 01 2018-05-17 01:00:00 2018-05-19 15:00:00
6 02 ...
Some explanation about the data set before the split: the data type of 'WearStart' and 'WearEnd' is POSIXlt.
Desired output after split:
ID WearStart WearEnd Interval
1 01 2018-05-14 09:00:00 2018-05-14 20:00:00 11
2 01 2018-05-14 21:30:00 2018-05-15 00:00:00 2.5
3 01 2018-05-15 00:00:00 2018-05-15 02:00:00 2
4 01 2018-05-15 07:00:00 2018-05-16 00:00:00 17
5 01 2018-05-16 00:00:00 2018-05-16 22:30:00 22.5
6 01 2018-05-16 23:00:00 2018-05-16 23:40:00 0.4
7 01 2018-05-17 01:00:00 2018-05-18 00:00:00 23
8 01 2018-05-18 00:00:00 2018-05-19 00:00:00 24
9 01 2018-05-19 00:00:00 2018-05-19 15:00:00 15
Then I need to accumulate hours based on day:
ID Wear_Day Total_Hours
1 01 2018-05-14 13.5
2 01 2018-05-15 19
3 01 2018-05-16 22.9
4 01 2018-05-17 23
5 01 2018-05-18 24
6 01 2018-05-19 15
So, I reworked the entire answer. Please review the code; I am pretty sure this is what you want.
Short summary
The problem is that you need to split rows whose start and end fall on different dates, and you need to do this recursively. So I split the dataframe into a list of 1-row dataframes. For each one, I check whether start and end are on the same day. If not, I turn it into a 2-row dataframe with adjusted start and end times. That is then split up again into a list of 1-row dataframes, and so on.
In the end there is a nested list of 1-row dataframes in which start and end are on the same day, and this list is recursively bound back together.
# Load Packages ---------------------------------------------------------------------------------------------------
library(tidyverse)
library(lubridate)
df <- tribble(
~ID, ~WearStart, ~WearEnd
, 01, "2018-05-14 09:00:00", "2018-05-14 20:00:00"
, 01, "2018-05-14 21:30:00", "2018-05-15 02:00:00"
, 01, "2018-05-15 07:00:00", "2018-05-16 22:30:00"
, 01, "2018-05-16 23:00:00", "2018-05-16 23:40:00"
, 01, "2018-05-17 01:00:00", "2018-05-19 15:00:00"
)
df <- df %>% mutate_at(vars(starts_with("Wear")), ymd_hms)
# Helper Functions ------------------------------------------------------------------------------------------------
endsOnOtherDay <- function(df){
  as_date(df$WearStart) != as_date(df$WearEnd)
}
split1rowInto2Days <- function(df){
  df1 <- df
  df2 <- df
  df1$WearEnd <- as_date(df1$WearStart) + days(1) - milliseconds(1)  # first part ends just before midnight
  df2$WearStart <- as_date(df2$WearStart) + days(1)                  # second part starts at midnight
  rbind(df1, df2)
}
splitDates <- function(df){
  # more than one row: split into 1-row dataframes and recurse on each
  if (nrow(df) > 1){
    return(df %>%
      split(f = 1:nrow(df)) %>%
      lapply(splitDates) %>%
      reduce(rbind))
  }
  # a single row spanning midnight: split it in two and recurse
  if (df %>% endsOnOtherDay()){
    return(df %>%
      split1rowInto2Days() %>%
      splitDates())
  }
  # a single row within one day: done
  df
}
# The actual Calculation ------------------------------------------------------------------------------------------
df %>%
splitDates() %>%
mutate(wearDuration = difftime(WearEnd, WearStart, units = "hours")
, wearDay = as_date(WearStart)) %>%
group_by(ID, wearDay) %>%
summarise(wearDuration_perDay = sum(wearDuration))
ID wearDay wearDuration_perDay
<dbl> <date> <drtn>
1 1 2018-05-14 13.50000 hours
2 1 2018-05-15 19.00000 hours
3 1 2018-05-16 23.16667 hours
4 1 2018-05-17 23.00000 hours
5 1 2018-05-18 24.00000 hours
6 1 2018-05-19 15.00000 hours
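If you prefer plain numeric hours instead of the drtn (difftime) column, one more step converts it (a sketch; res is a hypothetical name for the summarised tibble above):
res %>% mutate(wearDuration_perDay = as.numeric(wearDuration_perDay))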
Here is my solution to your question using just base R functions:
#step 1: read data from file
d <- read.csv("dt.csv", header = TRUE)
d
ID WearStart WearEnd
1 1 2018-05-14 09:00:00 2018-05-14 20:00:00
2 1 2018-05-14 21:30:00 2018-05-15 02:00:00
3 1 2018-05-15 07:00:00 2018-05-16 22:30:00
4 1 2018-05-16 23:00:00 2018-05-16 23:40:00
5 1 2018-05-17 01:00:00 2018-05-19 15:00:00
6 2 2018-05-16 11:30:00 2018-05-16 11:40:00
7 2 2018-05-16 22:05:00 2018-05-22 22:42:00
#step 2: change class of WearStart and WearEnd to POSIXlt
d$WearStart <- as.POSIXlt(d$WearStart, tryFormats = "%Y-%m-%d %H:%M")
d$WearEnd <- as.POSIXlt(d$WearEnd, tryFormats = "%Y-%m-%d %H:%M")
#step 3: calculate time interval (days and hours) for each record
timeInt <- function(d) {
  WearStartDay <- as.Date(d$WearStart, "%Y/%m/%d")
  Interval_days <- as.numeric(difftime(d$WearEnd, d$WearStart, units = "days"))
  Days <- WearStartDay + seq(0, Interval_days, 1)
  N_FullBTWDays <- length(Days) - 2  # full 24-hour days strictly between start and end
  if (N_FullBTWDays >= 0) {
    # hours remaining on the start day
    sd <- d$WearStart
    sd_h <- 24 - sd$hour - 1
    sd_m <- (60 - sd$min) / 60
    sd_total <- sd_h + sd_m
    hours <- sd_total
    # full days in between contribute 24 hours each
    hours <- c(hours, rep(24, N_FullBTWDays))
    # hours elapsed on the end day
    ed <- d$WearEnd
    ed_h <- ed$hour
    ed_m <- ed$min / 60
    ed_total <- ed_h + ed_m
    hours <- c(hours, ed_total)
  } else {
    # start and end fall on the same day
    hours <- as.numeric(difftime(d$WearEnd, d$WearStart, units = "hours"))
  }
  df <- data.frame(id = rep(d$ID, length(Days)), days = Days, hours = hours)
  return(df)
}
df <- data.frame(matrix(ncol = 3, nrow = 0))
colnames(df) <- c("id", "days", "hours")
for ( i in 1:nrow(d)) {
df <- rbind(df,timeInt(d[i,]))
}
id days hours
1 1 2018-05-14 11.0000000
2 1 2018-05-14 4.5000000
3 1 2018-05-15 17.0000000
4 1 2018-05-16 22.5000000
5 1 2018-05-16 0.6666667
6 1 2018-05-17 23.0000000
7 1 2018-05-18 24.0000000
8 1 2018-05-19 15.0000000
9 2 2018-05-16 0.1666667
10 2 2018-05-16 1.9166667
11 2 2018-05-17 24.0000000
12 2 2018-05-18 24.0000000
13 2 2018-05-19 24.0000000
14 2 2018-05-20 24.0000000
15 2 2018-05-21 24.0000000
16 2 2018-05-22 22.7000000
#daily usage of device for each customer
res <- as.data.frame(tapply(df$hours, list(df$days,df$id), sum))
res[is.na(res)] <- 0
res$date <- rownames(res)
res
1 2 date
2018-05-14 15.50000 0.000000 2018-05-14
2018-05-15 17.00000 0.000000 2018-05-15
2018-05-16 23.16667 2.083333 2018-05-16
2018-05-17 23.00000 24.000000 2018-05-17
2018-05-18 24.00000 24.000000 2018-05-18
2018-05-19 15.00000 24.000000 2018-05-19
2018-05-20 0.00000 24.000000 2018-05-20
2018-05-21 0.00000 24.000000 2018-05-21
2018-05-22 0.00000 22.700000 2018-05-22
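If you would rather keep the long layout from the question (one row per ID and day), aggregate() produces it directly from the same df (a sketch using the column names above):
res_long <- aggregate(hours ~ id + days, data = df, FUN = sum)
res_long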
How can I plot time series data hourly so that the x-axis is 1:24, if I have, let's say, one year of data, i.e. 365 days and 8000+ rows?
I tried it with ggplot2 but didn't get it to work.
head() looks like this:
Value DateTime
1 104 2018-01-01 01:00:00
2 104 2018-01-01 02:00:00
3 108 2018-01-01 03:00:00
4 106 2018-01-01 04:00:00
5 117 2018-01-01 05:00:00
6 166 2018-01-01 06:00:00
And tail():
Value DateTime
8754 160.10 2018-12-31 19:00:00
8755 156.11 2018-12-31 20:00:00
8756 139.11 2018-12-31 21:00:00
8757 112.11 2018-12-31 22:00:00
8758 96.10 2018-12-31 23:00:00
8759 90.11 2019-01-01 00:00:00
Here is an image of what I'm trying to achieve.
What about having time of day and date as separate variables? You can use the hms package to do this.
timeOfDay <- as.hms(df$DateTime)
date <- as.Date(df$DateTime)
Now you can use timeOfDay on the x-axis and date as your grouping aesthetic.
This works for me:
ggplot(df, aes(x = timeOfDay, y = Value)) +
  geom_line(aes(group = date))
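A version note (my addition, not part of the original answer): newer releases of the hms package deprecate as.hms() in favour of as_hms(), so depending on your version you may need:
timeOfDay <- hms::as_hms(df$DateTime)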
I have a dataset of one year of hourly records. For my analysis, I need to extract, separately for each month of the year, each hour of the day: January 00h, January 01h, January 02h, January 03h, ..., March 21h, March 22h, March 23h, and so on.
Thanks in advance for any useful help!
I want to select observations for specified hours of the day during a period (with datetime, filter, subset, ...). The code below (filter with month(time)) generates a fatal error:
Error: unexpected ')' in "at<-subset(groenenborgerno, timestamp=hour(time) == 01))"
groenenborgerno$timestamp <- as.POSIXct(groenenborgerno$date, format="%Y-%m-%d %H:%M:%S")
library(lubridate)
january01<-filter(atimeframe,
(month(time) == 01 & hour(time) == 01) )
Since no data is provided, I will try to answer your question with sample data:
require(lubridate)
require(tidyverse)
## Create some sample data:
time_index <- seq(from = as.POSIXct("2017-01-01 07:00"),
to = as.POSIXct("2018-01-01 18:00"), by = "hour")
value <- rnorm(n = length(time_index))
data <- data.frame(time_index,value)
data <- data %>% mutate(hour = hour(time_index),
                        month = month(time_index)) %>%
        group_by(month, hour)
data
# A tibble: 8,772 x 4
# Groups: month, hour [288]
time_index value hour month
<dttm> <dbl> <int> <dbl>
1 2017-01-01 07:00:00 -0.626 7 1
2 2017-01-01 08:00:00 0.184 8 1
3 2017-01-01 09:00:00 -0.836 9 1
4 2017-01-01 10:00:00 1.60 10 1
5 2017-01-01 11:00:00 0.330 11 1
6 2017-01-01 12:00:00 -0.820 12 1
7 2017-01-01 13:00:00 0.487 13 1
8 2017-01-01 14:00:00 0.738 14 1
9 2017-01-01 15:00:00 0.576 15 1
10 2017-01-01 16:00:00 -0.305 16 1
# ... with 8,762 more rows
and then just filter() the hour/month combination you want, like so:
data %>% filter(hour > 12 & month == 1)
# A tibble: 347 x 4
# Groups: month, hour [11]
time_index value hour month
<dttm> <dbl> <int> <dbl>
1 2017-01-01 13:00:00 0.487 13 1
2 2017-01-01 14:00:00 0.738 14 1
3 2017-01-01 15:00:00 0.576 15 1
4 2017-01-01 16:00:00 -0.305 16 1
5 2017-01-01 17:00:00 1.51 17 1
6 2017-01-01 18:00:00 0.390 18 1
7 2017-01-01 19:00:00 -0.621 19 1
8 2017-01-01 20:00:00 -2.21 20 1
9 2017-01-01 21:00:00 1.12 21 1
10 2017-01-01 22:00:00 -0.0449 22 1
# ... with 337 more rows
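If you need every month/hour combination as its own data frame rather than filtering one at a time, dplyr's group_split() is one option (a sketch applied to the grouped data from above):
pieces <- data %>%
  group_by(month, hour) %>%
  group_split()
length(pieces)  # 288 tibbles: 12 months x 24 hours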
I have a dataframe that looks like this:
dat <- data.frame(time = seq(as.POSIXct("2010-01-01"),
as.POSIXct("2016-12-31") + 60*99,
by = 60*15),
radiation = sample(1:500, 245383, replace = TRUE))
So I have a measurement value every 15 minutes. The structure is:
> str(dat)
'data.frame': 245383 obs. of 2 variables:
$ time : POSIXct, format: "2010-01-01 00:00:00" "2010-01-01 00:15:00" "2010-01-01 00:30:00" "2010-01-01 00:45:00" ...
$ radiation: num 230 443 282 314 286 225 77 89 97 330 ...
Now I want to interpolate, so my aim is a dataframe with a value for every minute.
I searched around and tried some methods with the zoo package, but I had some problems with the dataframe. I guess I have to convert it to a text file? I have no idea how to do that.
Here is a tidyverse solution.
library('tidyverse')
dat <- data.frame(time = seq(as.POSIXct("2010-01-01"),
as.POSIXct("2016-12-31") + 60*99,
by = 60*15),
radiation = sample(1:500, 245383, replace = TRUE))
dat <- head(dat, 3)
dat
# time radiation
# 1 2010-01-01 00:00:00 241
# 2 2010-01-01 00:15:00 438
# 3 2010-01-01 00:30:00 457
You can create a data frame with all of the required times. full_join will then leave the missing radiation values as NA, and approx will fill those NAs by linear interpolation.
dat %>%
full_join(data.frame(time = seq(
from = min(.$time),
to = max(.$time),
by = 'min'))) %>%
arrange(time) %>%
mutate(radiation = approx(radiation, n = n())$y)
# Joining, by = "time"
# time radiation
# 1 2010-01-01 00:00:00 241.0000
# 2 2010-01-01 00:01:00 254.1333
# 3 2010-01-01 00:02:00 267.2667
# 4 2010-01-01 00:03:00 280.4000
# 5 2010-01-01 00:04:00 293.5333
# 6 2010-01-01 00:05:00 306.6667
# 7 2010-01-01 00:06:00 319.8000
# 8 2010-01-01 00:07:00 332.9333
# 9 2010-01-01 00:08:00 346.0667
# 10 2010-01-01 00:09:00 359.2000
# 11 2010-01-01 00:10:00 372.3333
# 12 2010-01-01 00:11:00 385.4667
# 13 2010-01-01 00:12:00 398.6000
# 14 2010-01-01 00:13:00 411.7333
# 15 2010-01-01 00:14:00 424.8667
# 16 2010-01-01 00:15:00 438.0000
# 17 2010-01-01 00:16:00 439.2667
# 18 2010-01-01 00:17:00 440.5333
# 19 2010-01-01 00:18:00 441.8000
# 20 2010-01-01 00:19:00 443.0667
# 21 2010-01-01 00:20:00 444.3333
# 22 2010-01-01 00:21:00 445.6000
# 23 2010-01-01 00:22:00 446.8667
# 24 2010-01-01 00:23:00 448.1333
# 25 2010-01-01 00:24:00 449.4000
# 26 2010-01-01 00:25:00 450.6667
# 27 2010-01-01 00:26:00 451.9333
# 28 2010-01-01 00:27:00 453.2000
# 29 2010-01-01 00:28:00 454.4667
# 30 2010-01-01 00:29:00 455.7333
# 31 2010-01-01 00:30:00 457.0000
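One design note (my reading, not stated explicitly above): approx(radiation, n = n()) interpolates against the row index, which works here because after the join every row is exactly one minute apart. With irregularly spaced rows you would pass the times explicitly, e.g.:
mutate(radiation = approx(time, radiation, xout = time)$y)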
You can use the approx function like this:
dat <- data.frame(time = seq(as.POSIXct("2016-12-01"),
as.POSIXct("2016-12-31") + 60*99,
by = 60*15),
radiation = sample(1:500, 2887, replace = TRUE))
mins <- seq(as.POSIXct("2016-12-01"),
as.POSIXct("2016-12-31") + 60*99,
by = 60)
out <- approx(dat$time, dat$radiation, mins)
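approx() returns a list with components x (the new times) and y (the interpolated values), so you may want to turn the result back into a data frame:
out_df <- data.frame(time = out$x, radiation = out$y)
head(out_df)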
Here is a solution using pad from the padr package to fill the gaps in your time column; na.approx from zoo is used for the interpolation.
library(padr)
library(zoo)
dat[1:2, ]
time radiation
#1 2010-01-01 00:00:00 133
#2 2010-01-01 00:15:00 187
dat_padded <- pad(dat[1:2, ], interval = "min")
dat_padded$radiation <- zoo::na.approx(dat_padded$radiation)
dat_padded
time radiation
#1 2010-01-01 00:00:00 133.0
#2 2010-01-01 00:01:00 136.6
#3 2010-01-01 00:02:00 140.2
#4 2010-01-01 00:03:00 143.8
#5 2010-01-01 00:04:00 147.4
#6 2010-01-01 00:05:00 151.0
#7 2010-01-01 00:06:00 154.6
#8 2010-01-01 00:07:00 158.2
#9 2010-01-01 00:08:00 161.8
#10 2010-01-01 00:09:00 165.4
#11 2010-01-01 00:10:00 169.0
#12 2010-01-01 00:11:00 172.6
#13 2010-01-01 00:12:00 176.2
#14 2010-01-01 00:13:00 179.8
#15 2010-01-01 00:14:00 183.4
#16 2010-01-01 00:15:00 187.0
data
set.seed(1)
dat <-
data.frame(
time = seq(
as.POSIXct("2010-01-01"),
as.POSIXct("2016-12-31") + 60 * 99,
by = 60 * 15
),
radiation = sample(1:500, 245383, replace = TRUE)
)
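The same two lines scale to the full data set: pad() inserts a row for every missing minute between the first and last timestamp, so the ~245,000 quarter-hourly rows expand to roughly 3.7 million one-minute rows (expect this to take a moment):
dat_padded <- pad(dat, interval = "min")
dat_padded$radiation <- zoo::na.approx(dat_padded$radiation)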