A subject was measured at several time points over several days. I have a column "resptime_s" (the time at which the subject answered the beep on his smartphone). Now I want to know the mean time between consecutive responses (that is, between the rows of this column) with the night time taken out (night time is always from 22:30 till 7:30). Take as example:
The R script:
setwd("C:/Users/Hanne/Desktop/")
dat <- read.csv(file="datnew2.csv", sep=";",header=TRUE)
rows <- c(1:388) #time points
columns <- c(2,60) # datum and time
nVariables = 2
newdata<-dat[rows,columns]
head(newdata)
fun2 <- function(x){
bt <- as.integer(sub("(^\\d{1,2}):.*", "\\1", x))
f <- cumsum(c(FALSE, diff(bt) < 0))
d <- rep(as.Date("2018-01-01"), length.out = length(bt))
bt <- as.POSIXct(paste(d, x))
res <- sapply(split(bt, f), function(b) c(0, difftime(b[-1], b[1])))
unname(unlist(res))
}
fun2(newdata$resptime_s)
But the result isn't correct.
And with:
dput(head(newdata, 30))
I obtained this output:
Using the different functions for working with time intervals in lubridate gives the most elegant and easy to understand solution.
library(tidyverse)
library(lubridate)
data <- tribble(
~time_point, ~beeptime,
1, "08:30",
2, "11:13",
3, "12:08",
4, "17:20",
5, "22:47",
6, "7:36",
7, "9:40"
) %>%
mutate(beeptime = as_datetime(hm(beeptime)))
1. Define the daytime interval
day <- interval(
as_datetime(hm("07:30")),
as_datetime(hm("22:30"))
)
2. Keep daytime beeps and estimate the time (interval) between them
# %--% is basically the same as interval() above.
data_interval <-
data %>%
filter(beeptime %within% day) %>%
mutate(beep_interval = lag(beeptime) %--% beeptime)
3. Take the average
# You can use as.numeric() to extract (e.g.) minutes, which you can
# just pass to mean().
data_interval$beep_interval %>%
as.numeric("minutes") %>%
abs() %>%
mean(na.rm = TRUE)
#> [1] 247.6
Try the following. It pastes a date that increments every time the next hour is less than the previous one. Then difftime works as expected.
fun <- function(x){
bt <- as.integer(sub("(^\\d{1,2}):.*", "\\1", x))
inx <- as.logical(cumsum(c(FALSE, diff(bt) < 0)))
d <- rep(as.Date("2018-01-01"), length.out = length(bt))
d[inx] <- d[inx] + 1
beeptime <- as.POSIXct(paste(d, x))
difftime(beeptime[-1], beeptime[1])
}
fun(newdata$beeptime)
#Time differences in hours
#[1] 2.716667 3.633333 8.833333 14.283333 23.100000 25.166667
Data.
newdata <-
structure(list(time_point = 1:7, beeptime = structure(1:7, .Label = c("08:30",
"11:13", "12:08", "17:20", "22:47", "7:36", "9:40"), class = "factor")), class = "data.frame", row.names = c(NA,
-7L))
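Note that inx in fun is logical, so every beep after the first midnight gets the same +1 day. That fits the two-day example but not a longer series. A minimal sketch that uses the cumulative count itself as the day offset, so that each further drop in the hour moves the date forward by one more day. The name add_day_offsets, the UTC time zone and the sample beeps are assumptions for illustration, and a day is only detected when the hour drops from one beep to the next.

```r
# Hypothetical helper: attach an increasing dummy date to clock times.
add_day_offsets <- function(x, origin = as.Date("2018-01-01")) {
  # hour part of each "H:MM" string
  hour <- as.integer(sub("^(\\d{1,2}):.*", "\\1", x))
  # 0 for the first day, +1 every time the hour drops (a new day started)
  day_index <- cumsum(c(0, diff(hour) < 0))
  as.POSIXct(paste(origin + day_index, x), tz = "UTC")
}

# Made-up beeps spanning three days
beeps <- c("08:30", "22:47", "7:36", "9:40", "6:15")
stamps <- add_day_offsets(beeps)

# Hours elapsed since the first beep
hours_since_first <- as.numeric(difftime(stamps[-1], stamps[1], units = "hours"))
hours_since_first
```

The last beep lands on the third dummy date, which the logical index in fun would not do.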
Edit.
I believe that I have misunderstood the question. The OP does not want the differences between the first time and all others. What is needed are the differences restarting from zero every night.
If this is the case, the following function will do it.
fun2 <- function(x){
bt <- as.integer(sub("(^\\d{1,2}):.*", "\\1", x))
f <- cumsum(c(FALSE, diff(bt) < 0))
d <- rep(as.Date("2018-01-01"), length.out = length(bt))
bt <- as.POSIXct(paste(d, x))
res <- sapply(split(bt, f), function(b) c(0, difftime(b[-1], b[1])))
unname(unlist(res))
}
fun2(newdata$beeptime)
#[1] 0.000000 2.716667 3.633333 8.833333 14.283333 0.000000 2.066667
Another approach could be to convert beeptime into an offset (in seconds) from midnight using the lubridate package.
We can then write functions to calculate the difference in time excluding night time (22:30 - 7:30).
Before we start on the solution, let's have a look at the offset in seconds from midnight for 7:30 and 22:30.
library(lubridate)
as.numeric(seconds(hm("7:30")))
# [1] 27000
as.numeric(seconds(hm("22:30")))
# [1] 81000
I have written two functions to calculate the difference between two times:
# Checks an individual time and shifts it to the night boundary, so that
# the time spent overnight can be excluded.
checkNightBoundry <- function(val){
if(val < 27000){
val = 27000
} else if(val > 81000) {
val = 81000
}
val
}
# Arguments are offsets from midnight in seconds
# Calculate the difference between two times, excluding night time
calcDifftime <- function(currVal, prevVal){
diffTime <- 0
currVal = checkNightBoundry(currVal)
prevVal = checkNightBoundry(prevVal)
if(currVal > prevVal){
diffTime = currVal - prevVal
}else if(currVal < prevVal){
diffTime = (81000 - prevVal) + (currVal - 27000)
}
diffTime
}
Now, use the above functions:
library(dplyr)
library(lubridate)
df %>% mutate(beeptimeOffset = as.numeric(seconds(hm(beeptime)))) %>%
mutate(diffTime = mapply(calcDifftime,
beeptimeOffset, lag(beeptimeOffset, default = first(beeptimeOffset)))/3600)
# timepoint beeptime beeptimeOffset(sec) diffTime(hrs)
# 1 1 08:30 30600 0.0000000
# 2 2 11:13 40380 2.7166667
# 3 3 12:08 43680 0.9166667
# 4 4 17:20 62400 5.2000000
# 5 5 22:47 82020 5.1666667
# 6 6 7:36 27360 0.1000000
# 7 7 9:40 34800 2.0666667
Data:
df <- read.table(text =
"timepoint beeptime
1 08:30
2 11:13
3 12:08
4 17:20
5 22:47
6 7:36
7 9:40",
header = TRUE, stringsAsFactors = FALSE)
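The clamping and wrap-around logic above can also be sketched in vectorised base R, which gives the mean gap the question asks for directly. This is only a sketch under the same assumptions as the functions above (at most one night between consecutive beeps); to_seconds is a hypothetical helper, not part of lubridate.

```r
# Hypothetical helper: convert "H:MM" strings to seconds after midnight.
to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  sapply(parts, function(p) as.numeric(p[1]) * 3600 + as.numeric(p[2]) * 60)
}

day_start <- 7.5 * 3600   # 07:30 -> 27000 s
day_end   <- 22.5 * 3600  # 22:30 -> 81000 s

beeps <- c("08:30", "11:13", "12:08", "17:20", "22:47", "7:36", "9:40")

# Clamp every beep into the daytime window, as checkNightBoundry does
secs <- pmin(pmax(to_seconds(beeps), day_start), day_end)

curr <- secs[-1]
prev <- secs[-length(secs)]

# If the clock time drops, the pair spans a night: add the rest of the
# previous day and the start of the current day
gap_hours <- ifelse(curr >= prev,
                    curr - prev,
                    (day_end - prev) + (curr - day_start)) / 3600

mean_gap_hours <- mean(gap_hours)
mean_gap_hours
```

The individual gaps match the diffTime column above (without the leading zero), so the mean is taken over the six real gaps only.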
I want to count the number of rows within a certain time range based on each row after grouping by id. For instance, let us say a 1-month window around each datetime entry in the column "cleaned_date".
head(data$cleaned_date)
[1] "2004-10-11 CDT" "2008-09-10 CDT" "2011-10-25 CDT" "2011-12-31 CST"
The dates are in POSIXct format.
For the first entry, I need to count the number of rows within the time from 2004-09-11 to 2004-11-11, for the second entry, count the number of rows within the time from 2008-08-10 to 2008-10-10, so on and so forth.
I used roughly the following code
data %>% group_by(id) %>% filter(cleaned_date %within% interval(cleaned_date - 24 * 60 * 60 * 30, cleaned_date + 24 * 60 * 60 * 30)) %>% mutate(counts = n())
But it does not seem to work and I got counts as an empty column. Any help would be appreciated, thanks!
A reproducible example can be the following:
The input is
cleaned_date id
1 2008-09-11 A
2 2008-09-10 B
3 2008-09-30 B
4 2011-10-25 A
5 2011-11-14 A
And I want the output to be
cleaned_date id counts
1 2008-09-11 A 1
2 2008-09-10 B 2
3 2008-09-30 B 2
4 2011-10-25 A 2
5 2011-11-14 A 2
For the first entry, I want to count the rows in the timeframe 2008-08-11 to 2008-10-11. The second entry falls in that range, but we need to group by "id", so it does not count. For the second entry I want to count the rows in the timeframe 2008-08-10 to 2008-10-10; rows 2 and 3 satisfy this, so the count is 2. For the third entry I want to count the rows in the timeframe 2008-08-30 to 2008-10-30; rows 2 and 3 satisfy this again, and so on.
Note that the actual dataset I would like to operate on has millions of rows, so it might be more efficient to use tidyverse rather than base R.
Perhaps not the most elegant solution.
# input data. Dates as character vector
input = data.frame(
cleaned_date = c("2008-09-11", "2008-09-10", "2008-09-30", "2011-10-25", "2011-11-14"),
id = c("A", "B", "B", "A", "A")
)
# function to create a date window n months around specified date
window <- function(x, n = 1){
x <- rep(as.POSIXlt(x),2)
x[1]$mon <- x[1]$mon - n
x[2]$mon <- x[2]$mon + n
return(format(seq(from = x[1], to = x[2], by = "day"), format="%Y-%m-%d"))
}
# find counts for each row
input$counts <- unlist(lapply(1:nrow(input), function(x){
length(which((input$cleaned_date %in% window(input$cleaned_date[x])) & input$id == input$id[x]))
}))
input
cleaned_date id counts
1 2008-09-11 A 1
2 2008-09-10 B 2
3 2008-09-30 B 2
4 2011-10-25 A 2
5 2011-11-14 A 2
Edit for large datasets:
# dummy dataset with 1,000,000 rows
years <- c(2000:2020)
months <- c(1:12)
days <- c(1:20)
n <- 1000000
dates <- paste(sample(years, size = n, replace = T), sample(months, size = n, replace = T), sample(days, size = n, replace = T), sep = "-")
groups <- sample(c("A","B","C"), size = n, replace = T)
input <- data.frame(
cleaned_date = dates,
id = groups
)
input$cleaned_date <- format(as.POSIXlt(input$cleaned_date), format="%Y-%m-%d")
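# sort data by date: the loop below appends counts date by date, so the rows
# must be in date order for the counts to line up (it also helps performance a little)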
input <- input[order(input$cleaned_date),]
counts <- NULL
#pb <- progress::progress_bar$new(total = length(unique(input$cleaned_date)))
t1 <- Sys.time()
# split up vectorization for each unique date.
for(date in unique(input$cleaned_date)){
#pb$tick()
w <- window(date)
tmp <- input[which(input$cleaned_date %in% w),]
tmp_counts <- unlist(lapply(which(tmp$cleaned_date == date), function(x){
length(which(tmp$id == tmp$id[x]))
}))
counts <- c(counts, tmp_counts)
}
# add counts to dataset
input$counts <- counts
# optional, re-order data to original format
input <- input[order(as.numeric(rownames(input))),]
print(Sys.time() - t1)
Time difference of 3.247204 mins
If you want to go faster, you can run the loop in parallel
library(foreach)
library(doParallel)
cores=detectCores()
cl <- makeCluster(cores[1]-1)
registerDoParallel(cl)
dates = unique(input$cleaned_date)
t1 <- Sys.time()
counts <- foreach(i=1:length(dates), .combine= "c") %dopar% {
w <- window(dates[i])
tmp <- input[which(input$cleaned_date %in% w),]
tmp_counts <- unlist(lapply(which(tmp$cleaned_date == dates[i]), function(x){
length(which(tmp$id == tmp$id[x]))
}))
tmp_counts
}
stopCluster(cl)
input$counts <- counts
input <- input[order(as.numeric(rownames(input))),]
print(Sys.time() - t1)
Time difference of 37.37211 secs
Note, I'm running this on a MacBook Pro with a 2.3 GHz Quad-Core Intel Core i7 and 16 GB of RAM.
It is still hard to determine exactly what you're trying to accomplish, but this will at least get you counts for a specified date range:
df %>%
group_by(id) %>%
filter(cleaned_date >= "2008-08-11" & cleaned_date <= "2008-10-11") %>%
mutate(counts = n())
Will give us:
cleaned_date id counts
<date> <chr> <int>
1 2008-09-11 A 1
2 2008-09-10 B 2
3 2008-09-30 B 2
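For large inputs, one further option is to avoid building day-by-day windows and count with findInterval on sorted dates instead. The sketch below is only an illustration: count_in_window is a hypothetical name, and the window is assumed to be a fixed 30 days on either side (as in the asker's own code) rather than a calendar month.

```r
# Hypothetical helper: for each date, count the dates within +/- half_width days.
count_in_window <- function(dates, half_width = 30) {
  d <- as.numeric(dates)  # days since 1970-01-01
  s <- sort(d)
  # number of dates <= upper bound minus number of dates < lower bound
  findInterval(d + half_width, s) -
    findInterval(d - half_width, s, left.open = TRUE)
}

input <- data.frame(
  cleaned_date = as.Date(c("2008-09-11", "2008-09-10", "2008-09-30",
                           "2011-10-25", "2011-11-14")),
  id = c("A", "B", "B", "A", "A")
)

# ave() applies the helper within each id and keeps the original row order
input$counts <- ave(as.numeric(input$cleaned_date), input$id,
                    FUN = count_in_window)
input
```

Each group is sorted once and every lookup is a binary search, so this should scale much better than comparing every row against every window.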
I have a starting time specified as a year-month character, e.g. "2020-12". From the start, for each of T consecutive months, I need to generate n different dates (year-month-day), where the day is random.
Any help will be useful!
The data I'm working on:
data <- data.frame(
data = sample(seq(as.Date('2000/01/01'), as.Date('2020/01/01'), by="day"), 500),
price = round(runif(500, min = 10, max = 20),2),
quantity = round(rnorm(500,30),0)
)
func <- function(start, months, n) {
startdate <- as.Date(paste0(start, "-01"))
enddate <- seq(startdate, by = "month", length.out = months)
months <- seq_len(months)
enddate_lt <- as.POSIXlt(enddate)
enddate_lt$mon <- enddate_lt$mon + 1
enddate_lt$mday <- enddate_lt$mday - 1
days_per_month <- as.integer(format(enddate_lt, format = "%d"))
days <- lapply(days_per_month, sample, size = n)
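# note: enddate holds the first day of each month, so adding 1..days_per_month
# yields the 2nd of the month through the 1st of the next month; use
# `enddate - 1` here to sample from the 1st through the last day instead
dates <- Map(`+`, enddate, days)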
do.call(c, dates)
}
set.seed(2021)
func("2020-12", 4, 3)
# [1] "2020-12-08" "2020-12-07" "2020-12-15" "2021-01-27" "2021-01-08" "2021-01-13" "2021-02-21" "2021-02-07" "2021-02-28"
# [10] "2021-03-28" "2021-03-07" "2021-03-15"
func("2020-12", 5, 2)
# [1] "2020-12-06" "2020-12-16" "2021-01-08" "2021-01-10" "2021-02-24" "2021-02-13" "2021-03-20" "2021-03-29" "2021-04-19"
# [10] "2021-04-28"
func("2020-12", 2, 10)
# [1] "2020-12-29" "2020-12-30" "2020-12-04" "2020-12-15" "2020-12-09" "2020-12-27" "2020-12-05" "2020-12-06" "2020-12-23"
# [10] "2020-12-17" "2021-01-03" "2021-01-20" "2021-01-05" "2021-01-22" "2021-01-23" "2021-01-06" "2021-01-10" "2021-01-07"
# [19] "2021-01-19" "2021-01-12"
Most of the dancing with POSIXlt objects is because it gives us clean (base R) access to the number of days in a month, which makes sampling the days in a month rather simple. It can also be done (code-golf shorter) using the lubridate package, but I don't know that that is any more correct than this code is.
This just dumps out a sequence of random dates, with n days per month. It does not sort within each month, though it does output the months in order. (That's not a difficult extension, there just wasn't a requirement for it.) It doesn't put out a frame; you can easily extend this to fit in a frame, or call data.frame(date = do.call(c, dates)) on the last line, depending on what you need to do with the output.
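A compact variant of the same idea, as a sketch only: it derives the month lengths from the difference between consecutive month starts and samples day offsets 0 to (days in month - 1), so every date stays inside its own month. The name random_dates is an assumption, and n must not exceed the number of days in the shortest month because sampling is without replacement.

```r
# Hypothetical helper: n random dates in each of `months` consecutive months.
random_dates <- function(start, months, n) {
  # first day of each month, plus one extra month to get the lengths
  firsts <- seq(as.Date(paste0(start, "-01")), by = "month",
                length.out = months + 1)
  n_days <- as.integer(diff(firsts))  # days in each month
  out <- lapply(seq_len(months), function(i) {
    # sample(k, n) draws from 1..k; subtract 1 to get offsets 0..k-1
    firsts[i] + sample(n_days[i], n) - 1
  })
  do.call(c, out)
}

set.seed(1)
d <- random_dates("2020-12", 4, 3)
format(d, "%Y-%m")
```

Sorting within each month would only need a sort() around the sampled offsets.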
You could convert the start time to a class for monthly data, zoo::yearmon. Then use as.Date.yearmon and its frac argument ("a number between 0 and 1 inclusive that indicates the fraction of the way through the period that the result represents") with random values from runif (uniform between 0 and 1) to convert to a random date within each year-month.
start = "2020-12"
T = 3
n = 2
library(zoo)
set.seed(1)
as.Date(as.yearmon(start) + rep((1:T)/12, each = n), frac = runif(T * n))
# [1] "2021-01-08" "2021-01-12" "2021-02-16" "2021-02-25" "2021-03-07" "2021-03-27"
I have the following R matrix:
Date MyVal
2016 1
2017 2
2018 3
....
2026 10
What I want to do is "blow it up" so that it goes like this (where monthly values are linearly interpolated):
Date MyVal
01/01/2016 1
02/01/2016 ..
....
01/01/2017 2
....
01/01/2026 10
I realize I can easily generate the sequence using:
DateVec <- seq(as.Date(paste(minYear,"/01/01", sep = "")), as.Date(paste(maxYear, "/01/01", sep = "")), by = "month")
And I can use that to make a large matrix and then fill things in using a for loop over DateVec, but I wonder if there's a more elegant R way to do this?
You can use stats::approx:
library(stats)
ipc <- approx(df$Date, df$MyVal, xout = DateVec,
rule = 1, method = "linear", ties = mean)
You probably need to first convert the dates in your original data frame so that they include month and day, and store them in as.POSIXct or as.Date format.
Based on what you provided, this works:
#Make the reference data-frame for interpolation:
DateVec <- seq(min(df$Date, na.rm=T),
max(df$Date, na.rm=T), by = "month")
#Interpolation:
intrpltd_df <- approx(df$Date, df$MyVal, xout = DateVec,
rule = 1, method = "linear", ties = mean)
# x y
# 1 2016-01-01 1.000000
# 2 2016-02-01 1.084699
# 3 2016-03-01 1.163934
# 4 2016-04-01 1.248634
# 5 2016-05-01 1.330601
# 6 2016-06-01 1.415301
Data:
#reproducing the data-frame:
Date <- seq(2016,2026)
MyVal <- seq(1:11)
Date <- data.frame(as.Date(paste0(Date,"/01/01"))) #yyyy-mm-dd format
df <- cbind(Date, MyVal)
df <- as.data.frame(df)
colnames(df) <- c ("Date", "MyVal") #Changing Column Names
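Since approx returns a list with components x and y rather than a data frame, a small self-contained sketch of the whole round trip may help. The three-year data and the names yearly and monthly are made up for illustration.

```r
# Made-up yearly data: one value on 1 January of each year
yearly <- data.frame(
  Date  = as.Date(paste0(2016:2018, "-01-01")),
  MyVal = 1:3
)

# Monthly grid between the first and last date
monthly_dates <- seq(min(yearly$Date), max(yearly$Date), by = "month")

# Interpolate on the numeric day count, then put the dates back
interp <- approx(x = as.numeric(yearly$Date), y = yearly$MyVal,
                 xout = as.numeric(monthly_dates), method = "linear")

monthly <- data.frame(Date = monthly_dates, MyVal = interp$y)
head(monthly, 3)
```

Because the interpolation runs on days, months of different lengths get slightly different increments, which is why the values in the output above are not evenly spaced.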
I have a dataset which looks like:
mother_id,dateOfBirth
1,1962-09-24
2,1991-02-19
3,1978-11-11
I need to extract the constituent elements (day,month,year) from date of birth and put them in corresponding columns to look like:
mother_id,dateOfBirth,dayOfBirth,monthOfBirth,yearOfBirth
1,1962-09-24,24,09,1962
2,1991-02-19,19,02,1991
3,1978-11-11,11,11,1978
Currently, I have it coded as a loop:
data <- read.csv("/home/tumaini/Desktop/IHI-Projects/Data-Linkage/matching file dss nacp.csv",stringsAsFactors = F)
dss_individuals <- read.csv("/home/tumaini/Desktop/IHI-Projects/Data-Linkage/Data/dssIndividuals.csv", stringsAsFactors = F)
lookup <- data[,c("patientid","extId")]
# remove duplicates
lookup <- lookup[!(duplicated(lookup$patientid)),]
dss_individuals$dateOfBirth <- as.character.Date(dss_individuals$dob)
dss_individuals$dayOfBirth <- 0
dss_individuals$monthOfBirth <- 0
dss_individuals$yearOfBirth <- 0
# Loop starts here
for(i in 1:nrow(dss_individuals)){ #nrow(dss_individuals)
split_list <- unlist(strsplit(dss_individuals[i,]$dateOfBirth,'[- ]'))
dss_individuals[i,]["dayOfBirth"] <- split_list[3]
dss_individuals[i,]["monthOfBirth"] <- split_list[2]
dss_individuals[i,]["yearOfBirth"] <- split_list[1]
}
This seems to work, but is horrendously slow as I have 400 000 rows. Is there a way I can get this done more efficiently?
I compared the speed of substr, format, and lubridate. It seems that lubridate and format are much faster than substr if the variable is stored as a date. However, substr would be fastest if the variable is stored as a character vector. The results of a single run are shown.
x <- sample(
seq(as.Date('1000/01/01'), as.Date('2000/01/01'), by="day"),
400000, replace = T)
system.time({
y <- substr(x, 1, 4)
m <- substr(x, 6, 7)
d <- substr(x, 9, 10)
})
# user system elapsed
# 3.775 0.004 3.779
system.time({
y <- format(x,"%Y")
m <- format(x,"%m")
d <- format(x,"%d")
})
# user system elapsed
# 1.118 0.000 1.118
library(lubridate)
system.time({
y <- year(x)
m <- month(x)
d <- day(x)
})
# user system elapsed
# 0.951 0.000 0.951
x1 <- as.character(x)
system.time({
y <- substr(x1, 1, 4)
m <- substr(x1, 6, 7)
d <- substr(x1, 9, 10)
})
# user system elapsed
# 0.082 0.000 0.082
Not sure if this will solve your speed issues, but here is a nicer way of doing it using dplyr and lubridate. In general, when it comes to manipulating data frames I personally recommend using either data.table or dplyr. data.table is supposed to be faster, but dplyr is more verbose, which I personally prefer as I find it easier to pick up my code after not having read it for months.
library(dplyr)
library(lubridate)
dat <- data.frame( mother_id = c(1,2,3),
dateOfBirth = ymd(c( "1962-09-24" ,"1991-02-19" ,"1978-11-11"))
)
dat %>% mutate( year = year(dateOfBirth) ,
month = month(dateOfBirth),
day = day(dateOfBirth) )
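Or you can apply several functions to the column in one call to save having to write the variable name multiple times (the output columns are then named dateOfBirth_year and so on by default). mutate_each() and funs() are deprecated in current dplyr; across() is the replacement:
dat %>% mutate(across(dateOfBirth, list(year = year, month = month, day = day)))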
Here are some solutions. These solutions each (i) use 1 or 2 lines of code and (ii) return numeric year, month and day columns. In addition, the first two solutions use no packages -- the third uses chron's month.day.year function.
1) POSIXlt Convert to "POSIXlt" class and pick off the parts.
lt <- as.POSIXlt(DF$dateOfBirth, origin = "1970-01-01")
transform(DF, year = lt$year + 1900, month = lt$mon + 1, day = lt$mday)
giving:
mother_id dateOfBirth year month day
1 1 1962-09-24 1962 9 24
2 2 1991-02-19 1991 2 19
3 3 1978-11-11 1978 11 11
2) read.table
cbind(DF, read.table(text = format(DF$dateOfBirth), sep = "-",
col.names = c("year", "month", "day")))
giving:
mother_id dateOfBirth year month day
1 1 1962-09-24 1962 9 24
2 2 1991-02-19 1991 2 19
3 3 1978-11-11 1978 11 11
3) chron::month.day.year
library(chron)
cbind(DF, month.day.year(DF$dateOfBirth))
giving:
mother_id dateOfBirth month day year
1 1 1962-09-24 9 24 1962
2 2 1991-02-19 2 19 1991
3 3 1978-11-11 11 11 1978
Note 1: Often when year, month and day are added to data it is not really necessary and in fact they could be generated on the fly when needed using format, substr or as.POSIXlt so you might critically examine whether you actually need to do this.
Note 2: The input data frame, DF in reproducible form, was assumed to be:
Lines <- "mother_id,dateOfBirth
1,1962-09-24
2,1991-02-19
3,1978-11-11"
DF <- read.csv(text = Lines)
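Use format once for each part (this requires dateOfBirth to be of class Date; if it is stored as character, wrap it in as.Date() first):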
dss_individuals$dayOfBirth <- format(dss_individuals$dateOfBirth,"%d")
dss_individuals$monthOfBirth <- format(dss_individuals$dateOfBirth,"%m")
dss_individuals$yearOfBirth <- format(dss_individuals$dateOfBirth,"%Y")
Check the substr function from the base package (or other functions from the nice stringr package) to extract different parts of a string. This approach assumes that day, month and year are always in the same place and have the same length.
The strsplit function is vectorized so using rbind.data.frame to convert your list to a dataframe works:
do.call(rbind.data.frame, strsplit(df$dateOfBirth, split = '-'))
Results need to be transposed in order to be used: you can do it using do.call or the t function.
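A minimal sketch of the vectorised strsplit route, using plain rbind so that each date becomes one row of a character matrix and no transposing is needed. The sample values come from the question; the name result is an assumption.

```r
dob <- c("1962-09-24", "1991-02-19", "1978-11-11")

# strsplit works on the whole vector; rbind stacks the pieces row by row
parts <- do.call(rbind, strsplit(dob, "-", fixed = TRUE))

result <- data.frame(
  mother_id    = 1:3,
  dateOfBirth  = dob,
  dayOfBirth   = parts[, 3],
  monthOfBirth = parts[, 2],
  yearOfBirth  = parts[, 1],
  stringsAsFactors = FALSE
)
result
```

The parts stay character here (so "09" keeps its leading zero); wrap the columns in as.integer() if numbers are needed.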
I would like a function that counts the number of specific weekdays per month,
i.e. Nov '13 -> 5 Fridays, while Dec '13 would return 4 Fridays.
Is there an elegant function that would return this?
library(lubridate)
num_days <- function(date){
x <- as.Date(date)
start = floor_date(x, "month")
count = days_in_month(x)
d = wday(start)
sol = ifelse(d > 4, 5, 4) # rough estimate: if the month starts on Thu, Fri or Sat (wday > 4, with Sunday = 1), guess 5 Fridays, otherwise 4
sol
}
num_days("2013-08-01")
num_days(today())
What would be a better way to do this?
1) Here d is the input, a Date class object, e.g. d <- Sys.Date(). The result gives the number of Fridays in the year/month that contains d. Replace 5 with 1 to get the number of Mondays:
first <- as.Date(cut(d, "month"))
last <- as.Date(cut(first + 31, "month")) - 1
sum(format(seq(first, last, "day"), "%w") == 5)
2) Alternately replace the last line with the following line. Here, the first term is the number of Fridays from the Epoch to the next Friday on or after the first of the next month and the second term is the number of Fridays from the Epoch to the next Friday on or after the first of d's month. Again, we replace all 5's with 1's to get the count of Mondays.
ceiling(as.numeric(last + 1 - 5 + 4) / 7) - ceiling(as.numeric(first - 5 + 4) / 7)
The second solution is slightly longer (although it has the same number of lines) but it has the advantage of being vectorized, i.e. d could be a vector of dates.
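To illustrate the vectorised form, here is a sketch that wraps the second solution in a function and generalises the hard-coded 5 to a weekday argument using the same %w numbering (0 = Sunday, 5 = Friday). The function name and the test dates are assumptions; it uses format() instead of cut() to find the first of the month so that a vector of dates from different months works.

```r
# Hypothetical wrapper around the closed-form count.
count_weekday_vec <- function(d, weekday = 5) {
  first <- as.Date(format(d, "%Y-%m-01"))                # first of d's month
  next_first <- as.Date(format(first + 31, "%Y-%m-01"))  # first of next month
  # number of `weekday`s from the Epoch up to (not including) x, plus a constant
  # that cancels in the difference
  n <- function(x) ceiling((as.numeric(x) - weekday + 4) / 7)
  n(next_first) - n(first)
}

d <- as.Date(c("2013-11-15", "2013-12-25", "2015-02-10"))
fridays <- count_weekday_vec(d)
fridays
```

For the three dates this gives the Friday counts of November 2013, December 2013 and February 2015 in one call.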
UPDATE: Added second solution.
There are a number of ways to do it. Here is one:
countFridays <- function(y, m) {
fr <- as.Date(paste(y, m, "01", sep="-"))
to <- fr + 31
dt <- seq(fr, to, by="1 day")
df <- data.frame(date=dt, mon=as.POSIXlt(dt)$mon, wday=as.POSIXlt(dt)$wday)
df <- subset(df, df$wday==5 & df$mon==df[1,"mon"])
return(nrow(df))
}
It creates the first of the month, and a day in the next month.
It then creates a data frame of month index (on a 0 to 11 range, but we only use this for comparison) and weekday.
We then subset to a) be in the same month and b) on a Friday. That is your result set, and
we return the number of rows as your answer.
Note that this only uses base R code.
Without using lubridate -
#arguments to pass to function:
whichweekday <- 5
whichmonth <- 11
whichyear <- 2013
#function code:
firstday <- as.Date(paste('01',whichmonth,whichyear,sep="-"),'%d-%m-%Y')
lastday <- seq(firstday, length.out = 2, by = "1 month")[2] - 1 # last day of the month; this also works for December in any year
sum(
strftime(
seq.Date(
from = firstday,
to = lastday,
by = "day"),
'%w'
) == whichweekday)
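The same steps can be wrapped in a function, as the "arguments to pass to function" comment suggests. This is a sketch only; the function name count_weekday is an assumption, and the weekday uses the 0 (Sunday) to 6 (Saturday) numbering of %w.

```r
# Hypothetical wrapper: count a given weekday in a given month, base R only.
count_weekday <- function(whichyear, whichmonth, whichweekday = 5) {
  firstday <- as.Date(sprintf("%04d-%02d-01", whichyear, whichmonth))
  # first day of the following month minus one day = last day of this month
  lastday <- seq(firstday, by = "1 month", length.out = 2)[2] - 1
  days <- seq(firstday, lastday, by = "day")
  sum(as.POSIXlt(days)$wday == whichweekday)
}

count_weekday(2013, 11)  # Fridays in November 2013
count_weekday(2013, 12)  # Fridays in December 2013
```

December needs no special case because seq() rolls over into January of the next year.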