I have an Excel dataset in which there are dates and time points, as follows:
record_id date_E1 time_E1 date_E2 time_E2 ...
1 2019/8/24 09:00:00 2019/8/25 18:00:00
I would like to construct a variable containing the number of hours past the first date and time (09:00 a.m. on 2019/8/24). When I read the Excel file with
read_excel("C:/visit.xlsx")
the time_E1 column (and so on) appears as 0.3750000 0.7736111 0.4131944 0.4131944, and the dates appear as 43640 43640 43641 43642 in R. I use
visit_dates <- as.Date(as.numeric(visit_date_L$Day), origin = "1899-12-30")
to convert the dates to 2019-8-24 and so on, but I do not know how to convert the time of day and compute the hours past the first time point. What I expect is a vector like 0, 42, ... hours past the first time point.
I have used the following code:
as.POSIXct(visit_times, format = " %H-%M", origin = "09:00:00")
but it returns a NULL vector. I was, however, able to transpose and combine the date and time data with the following code:
visit_time <- subset(MY_visit, select = c(record_id, time_E1, ...))
visit_date <- subset(MY_visit, select = c(record_id, date_E1, ...))
visit_time_L <- melt(visit_time, id.vars=c("record_id"))
visit_date_L <- melt(visit_date, id.vars=c("record_id"))
names(visit_time_L)[names(visit_time_L)=="value"] <- "time"
names(visit_date_L)[names(visit_date_L)=="value"] <- "Day"
visit_all <- cbind(visit_time_L, visit_date_L)
Any ideas on how I can solve this problem?
Here is an approach that you can try. I have dates/times stored in an Excel file. Read it in, keeping the columns as character. Convert the dates to Date, as you did. Convert the day-fraction times to numeric and multiply by 24 to get hours. Paste each date and hour together and parse the result as a date-time, then take the difference between the two date-times (the difference here comes back in days, so multiply by 24 to get hours).
library(dplyr)
library(readxl)
library(lubridate)

df <- read_excel('Book1.xlsx', col_types = 'text')
# A tibble: 1 x 4
date1 time1 date2 time2
<chr> <chr> <chr> <chr>
1 43466 0.375 43467 0.41666666666666669
df %>%
  mutate_at(c('date1', 'date2'), ~ as.Date(as.numeric(.), origin = '1899-12-30')) %>%
  mutate_at(c('time1', 'time2'), ~ as.numeric(.) * 24) %>%
  mutate(t1 = ymd_h(paste(date1, time1)),
         t2 = ymd_h(paste(date2, time2)),
         diff = as.numeric(t2 - t1) * 24)
# A tibble: 1 x 7
date1 time1 date2 time2 t1 t2 diff
<date> <dbl> <date> <dbl> <dttm> <dttm> <dbl>
1 2019-01-01 9 2019-01-02 10 2019-01-01 09:00:00 2019-01-02 10:00:00 25
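To get the asker's desired vector of hours past the first time point, another option is to build POSIXct values directly from the raw Excel numbers (serial day plus day fraction) and difference everything against the first one; difftime(units = "hours") avoids any assumption about which unit the difference comes back in. A minimal sketch with made-up values (43701 is the Excel serial for 2019-08-24, and 25569 is the serial for 1970-01-01):

```r
# Excel serial dates (origin 1899-12-30) and day-fraction times, as read_excel
# returns them when the cells come through as numbers
serial_date <- c(43701, 43702)   # 2019-08-24, 2019-08-25
day_frac    <- c(0.375, 0.750)   # 09:00:00, 18:00:00

# seconds since 1970-01-01: (serial - 25569) days, plus the fraction of a day
dt <- as.POSIXct((serial_date - 25569 + day_frac) * 86400,
                 origin = "1970-01-01", tz = "UTC")

hours_past <- as.numeric(difftime(dt, dt[1], units = "hours"))
hours_past
# [1]  0 33
```

The first element is always 0 (the reference point itself), and each later element is the elapsed time in hours.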
I have columns of start and stop dates, and I need to extract the latest (most recent) stop date in each row in order to calculate a duration (latest stop date minus start date).
Unfortunately, the dates in the last column are not necessarily the latest, so I have to go by row and compare the dates to find the latest one.
The other caveat is that not every column has a date in every row.
Here's an example data set:
pacman::p_load(tibble, lubridate)
start_1 <- as_tibble(sample(seq(ymd("1999/01/01"), ymd("2000/01/01"), by="day"), 5))
stop_1 <- as_tibble(sample(seq(ymd("2000/01/01"), ymd("2001/01/01"), by="day"), 5))
stop_2 <- as_tibble(c(ymd("2000/03/05"), ymd("2000/11/15"), ymd("2000/07/22"), ymd("2000/05/05"), NA))
stop_3 <- as_tibble(c(ymd("2000/12/12"), ymd("2000/02/09"), NA, NA, NA))
dat <- cbind(start_1, stop_1, stop_2, stop_3)
I really have no idea how to go about this, and would appreciate any help.
Thank you!
One option is to use apply():
durs <- as.Date(apply(dat[, 2:ncol(dat)], 1, max, na.rm = TRUE)) - dat[, 1]
This assumes that the first column contains the start date and all columns thereafter contain possible stop dates. Note that apply() coerces the data frame to a character matrix here; ISO-formatted dates ("YYYY-MM-DD") still compare correctly as strings, which is why max() gives the right answer.
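A vectorized alternative that avoids apply()'s matrix coercion is pmax(), which takes elementwise maxima across columns and keeps the Date class. A self-contained sketch (the two-row data frame here is made up; with the asker's data, dat[-1] covers all the stop columns):

```r
start  <- as.Date(c("1999-10-20", "1999-04-30"))
stop_1 <- as.Date(c("2000-11-12", "2000-05-05"))
stop_2 <- as.Date(c("2000-03-05", NA))
dat <- data.frame(start, stop_1, stop_2)

# elementwise maximum over every column except the first, ignoring NAs
last_stop <- do.call(pmax, c(dat[-1], na.rm = TRUE))
dat$duration <- last_stop - dat$start
dat$duration
# Time differences in days: 389 371
```

Because pmax() works column-wise on whole vectors, this stays fast on wide or long data where a per-row apply() would be slow.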
First fix the column names (cbind() left them all named "value"), then use rowwise() with c_across() from dplyr.
library(dplyr)
colnames(dat) <- c("start_1", "stop_1", "stop_2", "stop_3")
dat %>%
  rowwise() %>%
  mutate(LastDate = max(c_across(starts_with("stop")), na.rm = TRUE),
         Duration = LastDate - start_1)
start_1 stop_1 stop_2 stop_3 LastDate Duration
<date> <date> <date> <date> <date> <drtn>
1 1999-10-20 2000-11-12 2000-03-05 2000-12-12 2000-12-12 419 days
2 1999-04-30 2000-05-05 2000-11-15 2000-02-09 2000-11-15 565 days
3 1999-05-01 2000-04-01 2000-07-22 NA 2000-07-22 448 days
4 1999-04-17 2000-08-23 2000-05-05 NA 2000-08-23 494 days
5 1999-04-10 2000-04-02 NA NA 2000-04-02 358 days
I have a dataframe in the following format:
temp:
id time date
1 06:22:30 2018-01-01
2 08:58:00 2018-01-15
3 09:30:21 2018-01-30
The actual data set continues for 9,000 rows, with observations for times throughout the month of January. I want to write code that assigns each row a new value depending on which hour range the time variable falls into.
A couple of example hour ranges would be:
Morning peak: 06:00:00 - 08:59:00
Morning: 09:00:00 - 11:59:00
The desired output would look like this:
id time date time_of_day
1 06:22:30 2018-01-01 MorningPeak
2 08:58:00 2018-01-15 MorningPeak
3 09:30:21 2018-01-30 Morning
I have tried playing around with time objects from the chron package, using the following code to subset the different time ranges:
MorningPeak <- temp[temp$time >= "06:00:00" & temp$time <= "08:59:59", ]
MorningPeak$time_of_day <- "MorningPeak"
Morning <- temp[temp$time >= "09:00:00" & temp$time <= "11:59:59", ]
Morning$time_of_day <- "Morning"
The results could then be merged and manipulated to get everything into a single column. Is there a way to generate the desired result without all that extra data manipulation? I am interested in learning how to make my code more efficient.
You are comparing character strings, not time/datetime objects; you need to convert them before comparing. Here it is enough to compare the hour of the day to assign the appropriate labels.
library(dplyr)
df %>%
mutate(hour = as.integer(format(as.POSIXct(time, format = "%T"), "%H")),
time_of_day = case_when(hour >= 6 & hour < 9 ~ "MorningPeak",
hour >= 9 & hour < 12 ~ "Morning",
TRUE ~ "Rest of the day"))
# id time date hour time_of_day
#1 1 06:22:30 2018-01-01 6 MorningPeak
#2 2 08:58:00 2018-01-15 8 MorningPeak
#3 3 09:30:21 2018-01-30 9 Morning
You can add more hourly criteria if needed.
We can also use cut(), binning the hour directly:
cut(as.integer(format(as.POSIXct(df$time, format = "%T"), "%H")),
breaks = c(-Inf, 6, 9, 12, Inf), right = FALSE,
labels = c("Rest of the day", "MorningPeak", "Morning", "Rest of the day"))
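As a self-contained check of the cut() approach (the times here are made up; duplicate labels have been legal in factor() since R 3.1 and are merged into a single level, so both tails land in "Rest of the day"):

```r
time <- c("06:22:30", "08:58:00", "09:30:21", "13:05:00")

# extract the hour of day from each time string
hour <- as.integer(format(as.POSIXct(time, format = "%T"), "%H"))

# left-closed bins: [-Inf,6) [6,9) [9,12) [12,Inf)
time_of_day <- cut(hour,
                   breaks = c(-Inf, 6, 9, 12, Inf), right = FALSE,
                   labels = c("Rest of the day", "MorningPeak", "Morning",
                              "Rest of the day"))
as.character(time_of_day)
# [1] "MorningPeak"     "MorningPeak"     "Morning"         "Rest of the day"
```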
I have date data formatted in an odd way that I would like to clean up in R.
The dates are in the format "d-Mon-yy h:mm:ss AM/PM", for example "1-Feb-05 12:00:00 AM". The day and time are useless to me, but I would like to be able to use the month and year while also converting them to date-time format.
I cannot figure out how to do this.
Here is a way to do it with lubridate's handy parsers and extractors. First convert the string into a datetime, then extract the month and the year:
library(tidyverse)
library(lubridate)
tibble(datetime = "1-Feb-05 12:00:00 AM") %>%
mutate(
datetime = dmy_hms(datetime),
year = year(datetime),
month = month(datetime)
)
#> # A tibble: 1 x 3
#> datetime year month
#> <dttm> <dbl> <dbl>
#> 1 2005-02-01 00:00:00 2005 2
Created on 2018-05-09 by the reprex package (v0.2.0).
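If you'd rather not pull in lubridate, the same parse works in base R with a strptime format string (%b, the abbreviated month name, assumes an English locale; %I together with %p handles the 12-hour clock):

```r
x <- "1-Feb-05 12:00:00 AM"

# "%d-%b-%y %I:%M:%S %p" matches "1-Feb-05 12:00:00 AM"
dt <- as.POSIXct(x, format = "%d-%b-%y %I:%M:%S %p", tz = "UTC")

format(dt, "%Y")  # "2005"
format(dt, "%m")  # "02"
```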
I have a data frame that looks like this (of course it is way bigger):
> df1
# A tibble: 10 x 4
index1 index2 date1 date2
<int> <int> <date> <date>
1 5800032 6 2012-07-02 2013-09-18
2 5800032 7 2013-09-18 1970-01-01
3 5800254 6 2013-01-04 1970-01-01
4 5800261 5 2012-01-23 2013-02-11
5 5800261 6 2013-02-11 2014-02-05
6 5800261 7 2014-02-05 1970-01-01
7 3002704 7 2012-01-23 1970-01-01
8 3002728 7 2012-10-20 1970-01-01
9 3002810 7 2012-07-18 1970-01-01
10 8504593 3 2012-01-11 1970-01-01
The original variables are index1, index2 and date1. There are one or more records with the same index1 value (their sequence is determined by index2). My objective is to work out the intervals between consecutive values of date1 for the same index1 value. This means there must be at least two records with the same index1 value to form an interval.
So I created the date2 variable, which provides the end date of the interval that starts on date1. It simply equals date1 of the subsequent record (date2[n] = date1[n+1]). If date1[n] is the latest (or the only) date for the given index1 value, then date2[n] <- 0.
I couldn't come up with a better idea than ordering the df by index1 and index2 and running a for loop:
for (i in 1:(nrow(df1)-1)){
if (df1$index1[i] == df1$index1[i+1]){
df1$date2[i] <- df1$date1[i+1]
}
else{df1$date2[i] <- 0}
}
It sort of worked, but it was visibly slow, and for some reason it did not "find" all the values it should have. Also, I'm sure there must be a much more intelligent way of doing this task - possibly with the sapply function. Any ideas are appreciated!
You can create date2 using lead() from dplyr (date2 is the next record's date1, so it is lead(), not lag(); note also that the default must be a Date to match the column type):
df1 %>%
  group_by(index1) %>%
  arrange(index2, .by_group = TRUE) %>%
  mutate(date2 = lead(date1, default = as.Date("1970-01-01")))
I didn't clearly understand the filtering part of your question; your problem may have to do with filtering on the default date (1970-01-01, i.e. day zero).
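The same next-row shift can also be done in base R without a loop; a sketch using ave() over index1 groups (the small df1 here is a stand-in for the real data):

```r
df1 <- data.frame(
  index1 = c(5800032, 5800032, 5800254),
  index2 = c(6, 7, 6),
  date1  = as.Date(c("2012-07-02", "2013-09-18", "2013-01-04"))
)
df1 <- df1[order(df1$index1, df1$index2), ]

# within each index1 group, shift date1 up one row; the last row of each
# group gets 0, i.e. 1970-01-01 once converted back to Date
df1$date2 <- as.Date(ave(as.numeric(df1$date1), df1$index1,
                         FUN = function(x) c(x[-1], 0)),
                     origin = "1970-01-01")
df1$date2
# [1] "2013-09-18" "1970-01-01" "1970-01-01"
```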
Lets say we have, two time-series data.tables, one sampled by day, another by hour:
dtByDay
EURO TIME ... and some other columns
<num> <POSc>
1: 0.95 2017-01-20
2: 0.97 2017-01-21
3: 0.98 2017-01-22
...
dtByHour
TIME TEMP ... also some other columns
<POSc> <num>
1: 2017-01-20 00:00:00 22.45
2: 2017-01-20 01:00:00 23.50
3: 2017-01-20 02:00:00 23.50
...
and we need to merge them so as to get all the columns together. What's a nice way of doing it?
Evidently dtByDay[dtByHour] does not produce the desired outcome (as one might have wished) - you get NA in the "EURO" column ...
Seems like roll = TRUE might give you funny behavior if a date is present in one data frame but not the other, so I wanted to post this alternative:
Starting with your original data frames:
library(dplyr)
library(lubridate)

dtbyday <- data.frame(EURO = c(0.95, 0.97, 0.98),
                      TIME = ymd(c("2017-01-20", "2017-01-21", "2017-01-22")))
dtbyhour <- data.frame(TEMP = c(22.45, 23.50, 23.40),
                       TIME = ymd_hms(c("2017-01-21 00:00:00",
                                        "2017-01-21 01:00:00",
                                        "2017-01-21 02:00:00")))
I converted dtbyhour$TIME to the same format as dtbyday$TIME using lubridate functions:
dtbyhour <- dtbyhour %>%
  rowwise() %>%
  mutate(TIME = ymd(paste(year(TIME), month(TIME), day(TIME), sep = "-")))
dtbyhour
# A tibble: 3 x 2
TEMP TIME
<dbl> <date>
1 22.45 2017-01-20
2 23.50 2017-01-20
3 23.40 2017-01-20
NOTE: The date changed because of time zone issues.
Then use dplyr::full_join() to join by TIME, which keeps all records and fills in values wherever possible. You'll need to aggregate the by-hour values within each day - I calculated the mean TEMP below.
new.dt <- full_join(dtbyday, dtbyhour, by = "TIME") %>%
  group_by(TIME) %>%
  summarize(EURO = unique(EURO),
            TEMP = mean(TEMP, na.rm = TRUE))
# A tibble: 3 x 3
TIME EURO TEMP
<date> <dbl> <dbl>
1 2017-01-20 0.95 23.11667
2 2017-01-21 0.97 NaN
3 2017-01-22 0.98 NaN
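Incidentally, the rowwise() paste used above to truncate a POSIXct to its calendar day can be replaced with a single vectorized as.Date() call (pass tz explicitly to avoid time-zone surprises); a small sketch:

```r
# truncate date-times to their calendar day in one vectorized call
t <- as.POSIXct(c("2017-01-21 00:00:00", "2017-01-21 13:30:00"), tz = "UTC")
d <- as.Date(t, tz = "UTC")
unique(d)
# [1] "2017-01-21"
```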
Big thanks to the comments above! The solution is as easy as just adding a roll=Inf argument when joining (note that the hourly table goes in i, so the daily value is rolled forward onto every hour):
dtByDay[dtByHour, roll = Inf]
That's exactly what I needed: it takes the dtByDay value and uses it for all hours of that day.
For other applications, you may also consider roll="nearest". This takes the closest (from midnight) dtByDay value for all hours before and after midnight:
dtByDay[dtByHour, roll = "nearest"]
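A minimal, self-contained sketch of this rolling join (the numbers are made up; both tables are keyed on TIME, so the join column is implicit):

```r
library(data.table)

dtByDay  <- data.table(TIME = as.POSIXct(c("2017-01-20", "2017-01-21"),
                                         tz = "UTC"),
                       EURO = c(0.95, 0.97), key = "TIME")
dtByHour <- data.table(TIME = as.POSIXct(c("2017-01-20 05:00:00",
                                           "2017-01-20 23:00:00",
                                           "2017-01-21 13:00:00"),
                                         tz = "UTC"),
                       TEMP = c(22.45, 23.50, 23.40), key = "TIME")

# for each hourly row, roll the most recent (at-or-before) daily EURO forward
res <- dtByDay[dtByHour, roll = Inf]
res$EURO
# [1] 0.95 0.95 0.97
```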