I'm currently struggling with a beginner's issue: calculating the time difference between two events.
I want to take a column containing both date and time (both values in one column) and calculate the time difference to the previous/next row with the same ID (A or B in this example).
ID = c("A", "A", "B", "B")
time = c("08.09.2014 10:34","12.09.2014 09:33","13.08.2014 15:52","11.09.2014 02:30")
d = data.frame(ID,time)
My desired output is in the format Hours:Minutes
time difference = c("94:59","94:59","682:38","682:38")
The format Days:Hours:Minutes or anything similar would also work, as long as it could be conveniently implemented. I am flexible regarding the format of the output, the above is just an idea that crossed my mind.
For each ID I always have exactly two rows (in the example, 2xA and 2xB). I don't have a convincing idea how to avoid the repetition of the difference.
I've tried some examples I found on Stack Overflow. Most of them used POSIXt and strptime, but I didn't manage to apply those ideas to my data set.
Here's my attempt using dplyr:
library(dplyr)
d %>%
  mutate(time = as.POSIXct(time, format = "%d.%m.%Y %H:%M")) %>%
  group_by(ID) %>%
  # diff(time) is a difftime (in days for this data), so * 24 gives hours;
  # the gsub calls split the whole hours from the fractional part (-> minutes)
  mutate(diff = paste0(gsub("[.].*", "", diff(time) * 24), ":",
                       round(as.numeric(gsub(".*[.]", ".", diff(time) * 24)) * 60)))
# Source: local data frame [4 x 3]
# Groups: ID
#
# ID time diff
# 1 A 2014-09-08 10:34:00 94:59
# 2 A 2014-09-12 09:33:00 94:59
# 3 B 2014-08-13 15:52:00 682:38
# 4 B 2014-09-11 02:30:00 682:38
A very (to me) hack-ish base solution:
ID <- c("A", "A", "B", "B")
time <- c("08.09.2014 10:34", "12.09.2014 09:33", "13.08.2014 15:52","11.09.2014 02:30")
d <- data.frame(ID, time)
d$time <- as.POSIXct(d$time, format="%d.%m.%Y %H:%M")
unlist(unname(lapply(split(d, d$ID), function(d) {
  # diff over the shifted copy gives the same absolute difference for both rows
  sapply(abs(diff(c(d$time[2], d$time))), function(x) {
    sprintf("%s:%s", round((x * 24) %/% 1), round(((x * 24) %% 1) * 60))
  })
})))
## [1] "94:59" "94:59" "682:38" "682:38"
I have to believe this function exists somewhere already, though.
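For what it's worth, the formatting part could live in a small helper; a sketch (my own, not an existing library function), leaning on difftime so the units stay explicit:
# hypothetical helper: absolute difference between two POSIXct values as "H:MM"
fmt_hm <- function(t1, t2) {
  mins <- as.integer(round(abs(as.numeric(difftime(t1, t2, units = "mins")))))
  sprintf("%d:%02d", mins %/% 60, mins %% 60)
}
fmt_hm(d$time[1], d$time[2])  # "94:59" for the A rows above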
Similar to the attempts of David and hrbrmstr, I found that this solution using difftime works.
I use a rowShift function I found on Stack Overflow:
rowShift <- function(x, shiftLen = 1L) {
  r <- (1L + shiftLen):(length(x) + shiftLen)
  r[r < 1] <- NA
  return(x[r])
}
d$time.c <- as.POSIXct(d$time, format = "%d.%m.%Y %H:%M")
d$time.prev <- rowShift(d$time.c,-1)
d$diff <- difftime(d$time.c,d$time.prev, units="hours")
Every other row of d$diff ends up positive or negative in the results. I remove all the rows with NA or negative values, which leaves the difference between the first and the last time for every ID.
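That cleanup might look like this (a sketch of the step I described):
# keep one row per ID: drop the NA first rows and the negative cross-ID diffs
d_clean <- d[!is.na(d$diff) & d$diff > 0, ]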
Hi, I'd like to group by two dataframe columns and apply a function to another two dataframe columns.
For example:
ticker <- c("A", "A", 'A', "B", "B", "B")
date <- c(1,1,2,1,2,1)
ret <- c(1,2,4,6,9,5)
vol <- c(3,5,1,6,2,3)
dat <- data.frame(ticker,date,ret,vol)
For each ticker and each date, I'd like to calculate its PIN.
Now, to avoid further confusion, perhaps it helps to just spell out the actual function. YZ is a function in the InfoTrad package, and YZ only accepts a dataframe with two columns. It uses some optimisation tool and returns an estimated PIN.
install.packages("InfoTrad")
library(InfoTrad)
get_pin_yz <- function(data) {
  return(YZ(data[, c('volume_krw_buy', 'volume_krw_sell')])[['PIN']])
}
I know how to do this in R using a for loop, but for loops are computationally costly and it might take weeks to finish running on my large dataset. So I would like to ask how to do this with a group-by.
# the output is in wide format, as opposed to the long format of `dat`
dat_w <- data.frame(ticker = NA, date = NA, PIN = NA)
for (j in c("A", "B")) {
  for (k in 1:2) {
    subset <- dat %>% subset((ticker == j & date == k), select = c("ret", "vol"))
    new_row <- data.frame(ticker = j, date = k, PIN = YZ(subset)$PIN)
    dat_w <- rbind(dat_w, new_row)
  }
}
dat_w <- dat_w[-1, ]
dat_w
Don't know if this can help you help me -- I know how to do this in Python: I just write a function and run df.groupby(['ticker','date']).apply(function).
Finally, the wanted dataframe is:
ticker <- c('A','A','B','B')
date <- c(1,2,1,2)
PIN <- c(1.05e-17,2.81e-09,1.12e-08,5.39e-09)
data.frame(ticker,date,PIN)
Could somebody help out, please?
Thank you!
Best,
Darcy
Previous stuff (Feel free to ignore)
Previously, I wrote this:
My function is:
get_rv <- function(data) {
  return(data[['vol']] + data[['ret']])
}
What I want is:
ticker_wanted <- c('A','A', 'B', 'B')
date_wanted <- c(1,2,1,2)
rv_wanted <- c(7,5,10,11)
df_wanted <-data.frame(ticker_wanted,date_wanted,rv_wanted)
But this is not literally what my actual function is. The vol + ret is just an example; I'm more interested in the general case: how to group by and apply a general function to two or more dataframe columns. I used vol + ret only because I didn't want to bother others by asking them to install a potentially irrelevant package.
Update based on real-life example:
You can do a direct approach like this:
library(tidyverse)
library(InfoTrad)
dat %>%
  group_by(ticker, date) %>%
  summarize(PIN = YZ(as.data.frame(cur_data()))$PIN)
# A tibble: 4 x 3
# Groups: ticker [2]
ticker date PIN
<chr> <dbl> <dbl>
1 A 1 1.05e-17
2 A 2 1.56e- 1
3 B 1 1.12e- 8
4 B 2 7.07e- 9
The difficulty here was that the YZ function only accepts true data frames, not tibbles, and that it returns several values, not just the PIN.
You could theoretically wrap this up into your own function and then run it like I've shown in the example below, but maybe this way already does the trick.
I also don't expect this to run much faster than a for loop. This YZ function seems to have more-than-linear runtime, so passing larger amounts of data will still take some time. You can start with a small set of data, then repeatedly increase its size by a factor of maybe 10 and check how fast it runs.
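A rough way to run that scaling check might look like this (my own sketch, reusing the pipeline from above; sample() just inflates the example data):
for (n in c(100, 1000, 10000)) {
  big <- dat[sample(nrow(dat), n, replace = TRUE), ]
  print(system.time(
    big %>%
      group_by(ticker, date) %>%
      summarize(PIN = YZ(as.data.frame(cur_data()))$PIN)
  ))
}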
In your example, you can do:
my_function <- function(data) {
  data %>%
    summarize(rv = sum(ret, vol))
}
library(tidyverse)
df %>%
  group_by(ticker, date) %>%
  my_function()
# A tibble: 4 x 3
# Groups: ticker [2]
ticker date rv
<chr> <dbl> <dbl>
1 A 1 7
2 A 2 5
3 B 1 10
4 B 2 11
But as mentioned in my comment, I'm not sure if this general example helps in your real-life use case.
It might also be that you don't need to create your own function, because built-in functions already exist. As in the example, you are better off summarizing directly instead of wrapping it into a function.
You could just do this (with summarise as an example of your function):
ticker <- c("A", "A", 'A', "B", "B", "B")
date <- c(1,1,2,1,2,1)
ret <- c(1,-2,4,6,9,-5)
vol <- c(3,5,1,6,2,3)
df <- data.frame(ticker,date,ret,vol)
library(dplyr)

get_rv <- function(data) {
  result <- data %>%
    group_by(ticker, date) %>%
    summarise(rv = sum(ret) + sum(vol)) %>%
    as.data.frame()
  names(result) <- c('ticker_wanted', 'date_wanted', 'rv_wanted')
  return(result)
}

df_wanted <- get_rv(df)
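If I've added the sums up right, printing df_wanted should then reproduce the values asked for in the question:
df_wanted
#   ticker_wanted date_wanted rv_wanted
# 1             A           1         7
# 2             A           2         5
# 3             B           1        10
# 4             B           2        11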
Assuming that your dataframe is as follows:
data <- data.frame(ticker,date,ret,vol)
Use split to split your dataframe into a list of dataframes based on the values of ticker and date.
dflist = split(data, f = list(data$ticker, data$date), drop = TRUE)
Now use lapply or sapply to run the function YZ() on each dataframe member of dflist.
pins <- lapply(dflist, function(x) YZ(x)$PIN)
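If you then need the result as a data frame again, the list can be reassembled like this (a sketch; split() names each element "ticker.date"):
pins_df <- data.frame(group = names(pins), PIN = unlist(pins), row.names = NULL)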
I have a question about a faster way to compute something over date intervals.
My input:
A data frame: one row per (person, period) pair. Each row has the ID of a person, a start date, and an end date.
A period of time: all the dates, day by day, over two years.
What I am trying to do is calculate the number of people I have, date by date.
I have code that works, but it is not efficient enough for a large dataset (~100k to 1M rows).
The current problem is that, since I have two years of dates, my code repeats the following steps 730 times (365 x 2):
Filter the dataset to the rows whose start and end dates enclose the specific date
Count the number of unique IDs in the filtered dataset
These operations are very slow, or impossible, on a large dataset.
I am wondering whether there is a better and faster way to do them, e.g. with aggregation or another technique.
An example with a short input and output :
library(lubridate)
library(dplyr)
# Vector of date
vector_day <- seq(ymd('2017-01-01'), ymd('2018-12-30'), by= "days")
# Input Data
df <- data.frame(
  id_people = c(1, 2, 3, 4, 1),
  StartDate = as.Date(c("2018-11-01", "2018-11-03", "2018-12-01", "2018-11-15", "2018-11-15")),
  EndDate   = as.Date(c("2018-11-10", "2018-12-04", "2018-12-10", "2018-11-17", "2018-11-23")),
  Gender    = c("F", "F", "M", "F", "F"))
# `df_f` is referenced below but was never defined in the post;
# presumably it is the subset of women, e.g.:
df_f <- df[df$Gender == "F", ]

# Function to compute the number of people for a specific date
compute_nb_f_by_day <- function(date) {
  cond1 <- df_f$StartDate <= date
  cond2 <- df_f$EndDate > date
  cond <- cond1 & cond2
  res <- length(unique(df_f[cond, ]$id_people))
  return(res)
}
# An example of how the function works for one date
compute_nb_f_by_day(as.Date("2018-12-01"))
# Computation for all the dates
nb_f_by_day <- data.frame(
  day = vector_day,
  nb_f = sapply(vector_day, compute_nb_f_by_day))
Thanks.
This solution benchmarked significantly faster than your code on the given example (your code: 0.132 s; this code: 0.032 s on my system). Give it a try to see if it also improves things significantly for the large dataset!
#-- Create the 'Interval' (%--% and %within% come from lubridate, loaded above)
df2 <- df %>%
  mutate(DateInterval = StartDate %--% EndDate)

#-- Create a result df instead of using cbind (more efficient)
result_df <- data.frame(Day = vector_day, Nb = NA)

#-- Count the intervals that contain each day in vector_day
result_df$Nb <- sapply(vector_day, function(day) sum(day %within% df2$DateInterval))
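As an aside, not part of the solution above: if it is acceptable to count rows rather than unique id_people, and to use the question's convention (StartDate <= day, EndDate > day), a cumulative-count sketch should be faster still, since it avoids the per-day scan entirely:
# Tally how many intervals start and end on each day, then cumulate.
# Assumes all start/end dates fall inside vector_day.
day_levels <- as.character(vector_day)
starts <- table(factor(as.character(df$StartDate), levels = day_levels))
ends   <- table(factor(as.character(df$EndDate),   levels = day_levels))
result_df$Nb_cum <- cumsum(starts) - cumsum(ends)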
I can't seem to figure the following out.
I have a data frame with 398 rows and 16 variables. I want to add a date variable. I know that for each row the date increases by a week and starts with 2010-01-01. I've tried the following:
date <- ymd("2010-01-01")
df <- as.data.frame(c(1:nrow(data), 1))
for (i in 1:nrow(data)) {
  date <- date + 7
  df[i, ] <- as.Date(date)
}
I then want to bind it to my data frame. However, the values inside df are not dates. The date + 7 calculation itself works (e.g. it steps to 2010-01-08), but once I assign the result to df it turns into weird numeric values.
Appreciate any help.
Try the following. (Your loop produced numbers because df[i, ] <- as.Date(date) assigns into a numeric column, which silently drops the Date class and keeps only the underlying day count.)
library(lubridate)
date <- ymd("2010-01-01")
df <- data.frame(ind = 1:5)
df$dates <- seq.Date(from = date, length.out = nrow(df), by = 7)
# note that `by = "1 week"` would also work, if you prefer more readable code.
df
ind dates
1 1 2010-01-01
2 2 2010-01-08
3 3 2010-01-15
4 4 2010-01-22
5 5 2010-01-29
Try this:
df$date <- seq(as.Date("2010-01-01"), by = 7, length.out = 398)
Also, try to get into the habit of not giving your variables names that are already used by functions, such as data and date.
I'm currently struggling with R and calculating the time difference in days.
I have data.frame with around 60 000 rows. In this data frame there are two columns called "start" and "end". Both columns contain data in UNIX time format WITH milliseconds - as you can see by the last three digits.
Start <- c("1470581434000", "1470784954000", "1470811368000", "1470764345000")
End <- c("1470560601000", "1470581549000", "1470785452000", "1470764722000")
d <- data.frame(Start, End)
My desired output is an extra column called timediff where the time difference is given in days.
I tried difftime and strptime, which I found here, but nothing worked out.
Maybe one of you worked with calculation of time differences in the past.
Thanks a lot
There is a very small and fast solution:
Start_POSIX <- as.POSIXct(as.numeric(Start)/1000, origin="1970-01-01")
End_POSIX <- as.POSIXct(as.numeric(End)/1000, origin="1970-01-01")
difftime(Start_POSIX, End_POSIX)
Time differences in mins
[1] 347.216667 3390.083333 431.933333 -6.283333
or if you want another unit:
difftime(Start_POSIX, End_POSIX, units = "secs")
Time differences in secs
[1] 20833 203405 25916 -377
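Since the question asks for days, the same call with units = "days" should give the desired extra column:
d$timediff <- difftime(Start_POSIX, End_POSIX, units = "days")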
You have a few steps you'll need to take:
# 1. Separate the milliseconds.
#    To do this, insert a period in front of the last three digits.
Start <- sub(pattern = "(\\d{3}$)",  # three digits at the end of the string
             replacement = ".\\1",   # replace with a . and then the pattern
             x = Start)

# 2. Convert to numeric
Start <- as.numeric(Start)

# 3. Convert to POSIXct
Start <- as.POSIXct(Start, origin = "1970-01-01")
For convenience, it would be good to put these all into a function
# Bundle all three steps into one function
unixtime_to_posixct <- function(x) {
  x <- sub(pattern = "(\\d{3}$)",
           replacement = ".\\1",
           x = x)
  x <- as.numeric(x)
  as.POSIXct(x, origin = "1970-01-01")
}
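A quick sanity check of the helper (the printed timestamp depends on your time zone):
unixtime_to_posixct("1470581434000")
# e.g. "2016-08-07 14:50:34 UTC" when run with the TZ set to UTC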
And with that, you can get your differences in days
#* Put it all together.
library(dplyr)
library(magrittr)
Start <- c("1470581434000", "1470784954000", "1470811368000", "1470764345000")
End <- c("1470560601000", "1470581549000", "1470785452000", "1470764722000")
d <- data.frame(Start,
End,
stringsAsFactors = FALSE)
lapply(X = d, FUN = unixtime_to_posixct) %>%
  as.data.frame() %>%
  mutate(diff = difftime(Start, End, units = "days"))
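On the example data, the diff column comes out to roughly 0.241, 2.354, 0.300, and -0.004 days (the minute differences shown earlier, divided by 1440).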
I have a data.table of millions of rows, and one of the columns is a date column. I would like to add 12 months to all the dates in that column and create a new column, so I use the dplyr and lubridate packages. E.g.:
library(dplyr)
library(lubridate)
new_data <- data %>% mutate(date12m = date %m+% months(12))
This works; however, it is very slow for large datasets. Am I missing something? How can it be sped up? I generally don't expect R to run for more than 10 minutes on such a simple task.
Edit:
I note that my solution is already more efficient than using as.yearmon. Thanks to Colonel Beauvel for the solution.
library(zoo)  # for as.yearmon

a <- data.frame(date = rep(today(), 1000000))

# note: func assumes scalar input; the answer below applies it via sapply()
func <- function(u) {
  d <- as.Date(as.yearmon(u) + 1, frac = 1)  # + 1 year = 12 months
  if (day(u) > day(d)) return(d)
  day(d) <- day(u)
  d
}

pt <- proc.time()
a <- a %>% mutate(date12m = func(date))
data.table::timetaken(pt)

pt <- proc.time()
a <- a %>% mutate(date12m = date %m+% months(12))
data.table::timetaken(pt)
Just add 1 to the month:
library(lubridate)  # month() and month<-() come from lubridate
x = seq.Date(from = as.Date("2007-01-01"), to = as.Date("2014-12-12"), by = "day")
month(x) = month(x) + 1
#> head(x)
#[1] "2007-02-01" "2007-02-02" "2007-02-03" "2007-02-04" "2007-02-05" "2007-02-06"
Edit: as per @akrun's comment, here is the solution using as.yearmon from the zoo package. The trick is to do a quick check when taking the day from the last date of the next month:
library(zoo)
func = function(u) {
  d = as.Date(as.yearmon(u) + 1/12, frac = 1)  # last day of the next month
  if (day(u) > day(d)) return(d)
  day(d) = day(u)
  d
}
x=as.Date(c("2014-01-31","2015-02-28","2013-03-02"))
#> as.Date(sapply(x, func))
#[1] "2014-02-28" "2015-03-28" "2013-04-02"
I am also working with big data frames in R. You can use the DescTools package; it has a function named AddMonths(date, NoOfMonths).
It works quite well for me:
> a <- ymd("2011-09-9")
> b <- AddMonths(a,1)
> b
[1] "2011-10-09"