I'm trying to manipulate the data with these dates.
Condition: if start time and end time overlaps with previous date from 21:00 to 9:00 next day, then the correct date would be the previous date.
For example here: Obs 3: the start time 12/5/2020 9:27 and end time 12/5/2020 12:13 overlaps with 12/4/2020 21:00 to 12/5/2020 10:00. So the correct date should be 12/04/2020.
With this logic, the next step is to combine hrs_slept for example Obs 6 and 7 because they have the same dates, which would become:
Condition: for each ID, if there are same dates, hrs_slept need to be combined:
original:
start_time <- c("2020-12-02 23:01:00","2020-12-04 04:17:00","2020-12-05 09:27:00","2020-12-06 06:38:00","2020-12-06 11:42:00","2020-12-08 01:22:00","2020-12-08 05:33:00", "2020-12-08 16:59:00","2020-12-09 03:15:00")
end_time <- c("2020-12-03 01:03:00","2020-12-04 09:42:00","2020-12-05 12:13:00","2020-12-06 09:16:00","2020-12-06 14:50:00","2020-12-08 04:24:00","2020-12-08 06:37:00","2020-12-08 18:05:00","2020-12-09 06:21:00")
hrs_sleep <- c(2.03,5.42,2.77,2.63,3.13,3.03,1.07,1.10,3.10)
a <- data.frame(start_time,end_time,hrs_sleep)%>%
mutate(ID=104)
a$start <- as.POSIXct(a$start)
a$end <- as.POSIXct(a$end)
Desired:
start_date <- c("2020-12-02","2020-12-03","2020-12-04","2020-12-05","2020-12-06","2020-12-07","2020-12-08")
hrs_sleep_new <- c(2.03,5.42,2.77,2.63,3.13,4.10,4.20)
b <- data.frame(start_date,hrs_sleep_new)%>%
mutate(ID=104)
b$start_date <- as.POSIXct(b$start_date)
I hope I made the logic clear. Essentially, I just need some sort of if statements for the two conditions written in bold. And I have a lot more observations with different IDs so these conditions need to be for each ID.
I appreciate all the help there is! Thanks!!!
Sorry, but I'm not sure I understand the conditions. However, try this code:
a <- a %>% mutate(start_date = as.Date(ifelse(hour(start) < 10, date(start)-1,date(start)), origin = "1970-01-01")) # Create the dates
a %>% group_by(ID, start_date) %>% summarise(hrs_sleep_new = sum(hrs_sleep)) # Sum the sleep hours
Related
I am working with daily returns from a Brazilian Index (IBOV) since 1993, I am trying to figure out the best way to subset for periods between 2 dates.
The data frame (IBOV_RET) is as follows :
head(IBOV_RET)
DATE 1D_RETURN
1 1993-04-28 -0.008163265
2 1993-04-29 -0.024691358
3 1993-04-30 0.016877637
4 1993-05-03 0.000000000
5 1993-05-04 0.033195021
6 1993-05-05 -0.012048193
...
I set 2 variables DATE1 and DATE2 as dates
DATE1 <- as.Date("2014-04-01")
DATE2 <- as.Date("2014-05-05")
I was able to create a new subset using this code:
TEST <- IBOV_RET[IBOV_RET$DATE >= DATE1 & IBOV_RET$DATE <= DATE2,]
It worked, but I was wondering if there is a better way to subset the data between 2 date, maybe using subset.
As already pointed out by #MrFlick, you dont get around the basic logic of subsetting. One way to make it easier for you to subset your specific data.frame would be to define a function that takes two inputs like DATE1 and DATE2 in your example and then returns the subset of IBOV_RET according to those subset parameters.
myfunc <- function(x,y){IBOV_RET[IBOV_RET$DATE >= x & IBOV_RET$DATE <= y,]}
DATE1 <- as.Date("1993-04-29")
DATE2 <- as.Date("1993-05-04")
Test <- myfunc(DATE1,DATE2)
#> Test
# DATE X1D_RETURN
#2 1993-04-29 -0.02469136
#3 1993-04-30 0.01687764
#4 1993-05-03 0.00000000
#5 1993-05-04 0.03319502
You can also enter the specific dates directly into myfunc:
myfunc(as.Date("1993-04-29"),as.Date("1993-05-04")) #will produce the same result
You can use the subset() function with the & operator:
subset(IBOV_RET, DATE1> XXXX-XX-XX & DATE2 < XXXX-XX-XX)
Updating for a more "tidyverse-oriented" approach:
IBOV_RET %>%
filter(DATE1 > XXXX-XX-XX, DATE2 < XXXX-XX-XX) #comma same as &
There is no real other way to extract date ranges. The logic is the same as extracting a range of numeric values as well, you just need to do the explicit Date conversion as you've done. You can make your subsetting shorter as you would with any other subsetting task with subset or with. You can break ranges into intervals with cut (there is a specific cut.Date overload). But base R does not have any way to specify Date literals so you cannot avoid the conversion. I can't imagine what other sort of syntax you may have had in mind.
What about:
DATE1 <- as.Date("1993-04-29")
DATE2 <- as.Date("1993-05-04")
# creating a data range with the start and end date:
dates <- seq(DATE1, DATE2, by="days")
IBOV_RET <- subset(IBOV_RET, DATE %in% dates)
I believe lubridate could help here;
daterange <- interval(DATE1, DATE2)
TEST <- IBOV_RET[which(Date %within% daterange),]
I sort of love dplyr package
So if you
>library("dplyr")
and then, as you did:
>Date1<-as.Date("2014-04-01")
>Date2<-as.Date("2014-05-05")
Finally
>test<-filter(IBOV_RET, filter(DATE>Date1 & DATE<Date2))
You can use R's between() function after simply converting the strings to dates:
df %>%
filter(between(date_column, as.Date("string-date-lower-bound"), as.Date("string-date-upper-bound")))
Test = IBOV_RET[IBOV_RET$Date => "2014-04-01" | IBOV_RET$Date <= "1993-05-04"]
Here I am using "or" function | where data should be greater than particular data or data should be less than or equal to this date.
I have data with Order Id, Start Date & End Date. I have to split both the Start and End dates into intervals of 30 days, and derive two new variables “split start date” and “split end date”.
Example: The below example illustrates how split dates are created when the Start Date is “01/05/2017” and the End Date is “06/07/2017”
Suppose, an order have start and end dates as below
see the image for example
What is the code for this problem in R ?
Here is a solution which should generalize to multiple order id's. I have created a sample data with two order id's. The basic idea is to calculate the number of intervals between start_date and end_date. Then we repeat the row for each order id by the number of intervals, and also create a sequence to determine which interval we are in. This is the purpose of creating functions f and g and the use of Map.
The remaining is just vector manipulations where we define split_start_date and split_end_date. The last statement is to ensure that split_end_date does not exceed end_date.
df <- data.frame(
order_id = c(1, 2),
start_date = c(as.Date("2017-05-01"), as.Date("2017-08-01")),
end_date = c(as.Date("2017-07-06"), as.Date("2017-09-15"))
)
df$diff_days <- as.integer(df$end_date - df$start_date)
df$num_int <- ceiling(df$diff_days / 30)
f <- function(rowindex) {
rep(rowindex, each = df[rowindex, "num_int"])
}
g <- function(rowindex) {
1:df[rowindex, "num_int"]
}
rowindex_rep <- unlist(Map(f, 1:nrow(df)))
df2 <- df[rowindex_rep, ]
df2$seq <- unlist(Map(g, 1:nrow(df)))
df3 <- df2
df3$split_start_date <- df3$start_date + (df3$seq - 1) * 30
df3$split_end_date <- df3$split_start_date + 29
df3[which(df3$seq == df3$num_int), ]$split_end_date <-
df3[which(df3$seq == df3$num_int), ]$end_date
I have two columns of dates. Two example dates are:
Date1= "2015-07-17"
Date2="2015-07-25"
I am trying to count the number of Saturdays and Sundays between the two dates each of which are in their own column (5 & 7 in this example code). I need to repeat this process for each row of my dataframe. The end results will be one column that represents the number of Saturdays and Sundays within the date range defined by two date columns.
I can get the code to work for one row:
sum(weekdays(seq(Date1[1,5],Date2[1,7],"days")) %in% c("Saturday",'Sunday')*1))
The answer to this will be 3. But, if I take out the "1" in the row position of date1 and date2 I get this error:
Error in seq.Date(Date1[, 5], Date2[, 7], "days") :
'from' must be of length 1
How do I go line by line and have one vector that lists the number of Saturdays and Sundays between the two dates in column 5 and 7 without using a loop? Another issue is that I have 2 million rows and am looking for something with a little more speed than a loop.
Thank you!!
map2* functions from the purrr package will be a good way to go. They take two vector inputs (eg two date columns) and apply a function in parallel. They're pretty fast too (eg previous post)!
Here's an example. Note that the _int requests an integer vector back.
library(purrr)
# Example data
d <- data.frame(
Date1 = as.Date(c("2015-07-17", "2015-07-28", "2015-08-15")),
Date2 = as.Date(c("2015-07-25", "2015-08-14", "2015-08-20"))
)
# Wrapper function to compute number of weekend days between dates
n_weekend_days <- function(date_1, date_2) {
sum(weekdays(seq(date_1, date_2, "days")) %in% c("Saturday",'Sunday'))
}
# Iterate row wise
map2_int(d$Date1, d$Date2, n_weekend_days)
#> [1] 3 4 2
If you want to add the results back to your original data frame, mutate() from the dplyr package can help:
library(dplyr)
d <- mutate(d, end_days = map2_int(Date1, Date2, n_weekend_days))
d
#> Date1 Date2 end_days
#> 1 2015-07-17 2015-07-25 3
#> 2 2015-07-28 2015-08-14 4
#> 3 2015-08-15 2015-08-20 2
Here is a solution that uses dplyr to clean things up. It's not too difficult to use with to assign the columns in the dataframe directly.
Essentially, use a reference date, calculate the number of full weeks (by floor or ceiling). Then take the difference between the two. The code does not include cases in which the start date or end data fall on Saturday or Sunday.
# weekdays(as.Date(0,"1970-01-01")) -> "Friday"
require(dplyr)
startDate = as.Date(0,"1970-01-01") # this is a friday
df <- data.frame(start = "2015-07-17", end = "2015-07-25")
df$start <- as.Date(df$start,"", format = "%Y-%m-%d", origin="1970-01-01")
df$end <- as.Date(df$end, format = "%Y-%m-%d","1970-01-01")
# you can use with to define the columns directly instead of %>%
df <- df %>%
mutate(originDate = startDate) %>%
mutate(startDayDiff = as.numeric(start-originDate), endDayDiff = as.numeric(end-originDate)) %>%
mutate(startWeekDiff = floor(startDayDiff/7),endWeekDiff = floor(endDayDiff/7)) %>%
mutate(NumSatsStart = startWeekDiff + ifelse(startDayDiff %% 7>=1,1,0),
NumSunsStart = startWeekDiff + ifelse(startDayDiff %% 7>=2,1,0),
NumSatsEnd = endWeekDiff + ifelse(endDayDiff %% 7 >= 1,1,0),
NumSunsEnd = endWeekDiff + ifelse(endDayDiff %% 7 >= 2,1,0)
) %>%
mutate(NumSats = NumSatsEnd - NumSatsStart, NumSuns = NumSunsEnd - NumSunsStart)
Dates are number of days since 1970-01-01, a Thursday.
So the following is the number of Saturdays or Sundays since that date
f <- function(d) {d <- as.numeric(d); r <- d %% 7; 2*(d %/% 7) + (r>=2) + (r>=3)}
For the number of Saturdays or Sundays between two dates, just subtract, after decrementing the start date to have an inclusive count.
g <- function(d1, d2) f(d2) - f(d1-1)
These are all vectorized functions so you can just call directly on the columns.
# Example data, as in Simon Jackson's answer
d <- data.frame(
Date1 = as.Date(c("2015-07-17", "2015-07-28", "2015-08-15")),
Date2 = as.Date(c("2015-07-25", "2015-08-14", "2015-08-20"))
)
As follows
within(d, end_days<-g(Date1,Date2))
# Date1 Date2 end_days
# 1 2015-07-17 2015-07-25 3
# 2 2015-07-28 2015-08-14 4
# 3 2015-08-15 2015-08-20 2
I have a R dataframe containing date info about generic events:
id;start_date;end_date.
Sometimes the same event may occur the same day (1) or at a distance of one day (2), for example:
(1)
1001;2016-05-07;2016-05-11
1001;2016-05-11;2016-05-14
(2)
1001;2016-05-07;2016-05-11
1001;2016-05-12;2016-05-14
In the first case the event "1001" ends and restarts the same day, while in the second case that event ends on 2017-05-11 and starts again the day after. I'd like to delete the second occurrence of the event in both cases.
If the second occurrence is at a distance of two or more days, it's ok to preserve the second occurrence. How can I do this in R?
Thank you in advance.
Partial solution with my guess of how data look like:
library(data.table)
dat <- data.table(id = c(1001,1001,1001,1001),
start_date = as.Date(c("2016-05-07", "2016-05-11", "2016-05-07", "2016-05-12")),
end_date = as.Date(c("2016-05-11", "2016-05-14", "2016-05-11", "2016-05-14")))
dat2 <- data.table(id = c(dat$id, NA),
start_date = c(dat$start_date, NA),
end_date = c(as.Date(NA), dat$end_date))
dat2[, dif := end_date - start_date]
Then you can just remove rows with dif <= 0 I guess.
I've used the data.table package, but you can just do dat2$dif <- dat2$end_date - dat2$start_date.
I'm stuck on a problem calculating travel dates. I have a data frame of departure dates and return dates.
Departure Return
1 7/6/13 8/3/13
2 7/6/13 8/3/13
3 6/28/13 8/7/13
I want to create and pass a function that will take these dates and form a list of all the days away. I can do this individually by turning each column into dates.
## Turn the departure and return dates into a readable format
Dept <- as.Date(travelDates$Dept, format = "%m/%d/%y")
Retn <- as.Date(travelDates$Retn, format = "%m/%d/%y")
travel_dates <- na.omit(data.frame(dept_dates,retn_dates))
seq(from = travel_dates[1,1], to = travel_dates[1,2], by = 1)
This gives me [1] "2013-07-06" "2013-07-07"... and so on. I want to scale to cover the whole data frame, but my attempts have failed.
Here's one that I thought might work.
days_abroad <- data.frame()
get_days <- function(x,y){
all_days <- seq(from = x, to = y, by =1)
c(days_abroad, all_days)
return(days_abroad)
}
get_days(travel_dates$dept_dates, travel_dates$retn_dates)
I get this error:
Error in seq.Date(from = x, to = y, by = 1) : 'from' must be of length 1
There's probably a lot wrong with this, but what I would really like help on is how to run multiple dates through seq().
Sorry, if this is simple (I'm still learning to think in r) and sorry too for any breaches in etiquette. Thank you.
EDIT: updated as per OP comment.
How about this:
travel_dates[] <- lapply(travel_dates, as.Date, format="%m/%d/%y")
dts <- with(travel_dates, mapply(seq, Departure, Return, by="1 day"))
This produces a list with as many items as you had rows in your initial table. You can then summarize (this will be data.frame with the number of times a date showed up):
data.frame(count=sort(table(Reduce(append, dts)), decreasing=T))
# count
# 2013-07-06 3
# 2013-07-07 3
# 2013-07-08 3
# 2013-07-09 3
# ...
OLD CODE:
The following gets the #days of each trip, rather than a list with the dates.
transform(travel_dates, days_away=Return - Departure + 1)
Which produces:
# Departure Return days_away
# 1 2013-07-06 2013-08-03 29 days
# 2 2013-07-06 2013-08-03 29 days
# 3 2013-06-28 2013-08-07 41 days
If you want to put days_away in a separate list, that is trivial, though it seems more useful to have it as an additional column to your data frame.