I want to add a new variable based on the time range between two variables. Intervals that fall within 8:01-20:00 should be labelled day, intervals within 20:01-8:00 should be labelled night, and anything that overlaps both should be labelled mixed.
I've added the variable manually, but is there an easier way to do this?
#Current database
id<-c("m1","m1","m1","m2","m2","m2","m3","m4","m4")
x<-c("2020-01-03 10:00:00","2020-01-03 16:00:00","2020-01-03 19:20:00","2020-01-05 10:00:00","2020-01-05 15:20:00","2020-01-05 20:50:00","2020-01-06 06:30:00","2020-01-08 06:30:00","2020-01-08 07:50:00")
start<-strptime(x,"%Y-%m-%d %H:%M:%S")
y<-c("2020-01-03 16:00:00","2020-01-03 19:20:00","2020-01-03 20:50:00","2020-01-05 15:20:00","2020-01-05 20:50:00","2020-01-05 22:00:00","2020-01-06 07:40:00","2020-01-08 07:50:00","2020-01-08 08:55:00")
end<-strptime(y,"%Y-%m-%d %H:%M:%S")
mydata<-data.frame(id,start,end)
#output
day.night<-c("day","day","mixed","day","mixed","night","night","night","mixed")
newdata<-cbind(mydata,day.night)
Edit: Apologies I forgot to add the date.
One way using dplyr is to convert start.time and end.time to POSIXct objects, compare them against the boundary times, and apply the labels using case_when.
library(dplyr)
data %>%
mutate(start.time1 = as.POSIXct(start.time, format = "%H:%M"),
end.time1 = as.POSIXct(end.time, format = "%H:%M"),
day.night = case_when(
start.time1 > as.POSIXct('08:01:00', format = "%T") &
end.time1 < as.POSIXct('20:00:00', format = "%T") ~"day",
start.time1 > as.POSIXct('20:01:00', format = "%T") |
start.time1 < as.POSIXct('08:00:00', format = "%T") &
end.time1 < as.POSIXct('08:00:00', format = "%T") ~ "night",
TRUE ~ "mixed")) %>%
select(names(data), day.night)
# id start.time end.time day.night
#1 m1 10:00 16:00 day
#2 m1 16:00 19:20 day
#3 m1 19:20 20:50 mixed
#4 m2 10:00 15:20 day
#5 m2 15:20 20:50 mixed
#6 m2 20:50 22:00 night
#7 m3 06:30 07:40 night
#8 m4 06:30 07:50 night
#9 m4 07:50 08:55 mixed
EDIT
If the columns also include a date, one way is to extract the hour of start and end with lubridate's hour() and compare on that, so the date component does not affect the comparison.
library(dplyr)
library(lubridate)
mydata %>%
mutate_at(vars(start, end), ymd_hms) %>%
mutate(start_hour = hour(start),
end_hour = hour(end),
day.night = case_when(start_hour >= 8 & end_hour >= 8 & end_hour < 20 ~ "day",
# end_hour <= 23 always holds, so only the start hour matters for a late start
start_hour >= 20 |
(start_hour < 8 & end_hour < 8) ~ "night",
TRUE ~ "mixed"))
# id start end day.night
#1 m1 2020-01-03 10:00:00 2020-01-03 16:00:00 day
#2 m1 2020-01-03 16:00:00 2020-01-03 19:20:00 day
#3 m1 2020-01-03 19:20:00 2020-01-03 20:50:00 mixed
#4 m2 2020-01-05 10:00:00 2020-01-05 15:20:00 day
#5 m2 2020-01-05 15:20:00 2020-01-05 20:50:00 mixed
#6 m2 2020-01-05 20:50:00 2020-01-05 22:00:00 night
#7 m3 2020-01-06 06:30:00 2020-01-06 07:40:00 night
#8 m4 2020-01-08 06:30:00 2020-01-08 07:50:00 night
#9 m4 2020-01-08 07:50:00 2020-01-08 08:55:00 mixed
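Comparing whole hours treats 20:00 and 20:01 alike, so a minute-level check may sit closer to the boundaries stated in the question. The following is a minimal base R sketch, not the answer's method. It assumes minute resolution and relies on the observation that an interval shorter than 12 hours whose two endpoints fall in the same kind of period cannot cross the other period. classify_day_night is a hypothetical helper name.

```r
# Hypothetical helper: label each interval as "day", "night" or "mixed".
# Assumes day = 08:01-20:00 and night = 20:01-08:00, at minute resolution.
classify_day_night <- function(start, end) {
  # Minutes since midnight, in the time zone carried by the input
  minute_of_day <- function(t) {
    lt <- as.POSIXlt(t)
    lt$hour * 60 + lt$min
  }
  in_day <- function(m) m >= 8 * 60 + 1 & m <= 20 * 60
  s_day <- in_day(minute_of_day(start))
  e_day <- in_day(minute_of_day(end))
  # Both periods last 11 h 59 min, so an interval under 12 h with both
  # endpoints in the same kind of period stays inside one period.
  short <- as.numeric(difftime(end, start, units = "mins")) < 12 * 60
  ifelse(s_day & e_day & short, "day",
         ifelse(!s_day & !e_day & short, "night", "mixed"))
}

start <- as.POSIXct(c("2020-01-03 10:00:00", "2020-01-03 19:20:00",
                      "2020-01-05 20:50:00", "2020-01-08 07:50:00",
                      "2020-01-09 22:00:00", "2020-01-10 06:00:00"),
                    tz = "UTC")
end <- as.POSIXct(c("2020-01-03 16:00:00", "2020-01-03 20:50:00",
                    "2020-01-05 22:00:00", "2020-01-08 08:55:00",
                    "2020-01-10 06:00:00", "2020-01-10 22:00:00"),
                  tz = "UTC")
labels <- classify_day_night(start, end)
```

The fifth interval runs overnight across midnight and the sixth starts and ends at night but covers a whole day period, which is the case an hour-only comparison cannot distinguish.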
I have a data frame like the one below. Three staff members have hourly readings over several days, but the data are incomplete (each staff member should have 24 readings a day).
The staff members have different numbers of readings on each day. I'm only interested in the staff member with the most readings on a given day.
There are many days. I want to insert the missing hourly rows only for the staff member with the most readings on each day: on 2018-03-02 only for Jack, on 2018-03-03 only for David, and on 2018-03-04 only for Kate.
I tried the code from this question (even though it fills every staff member without differentiation), but I'm not getting there.
How can it be done in R?
date_time <- c("2/3/2018 0:00","2/3/2018 1:00","2/3/2018 2:00","2/3/2018 3:00","2/3/2018 5:00","2/3/2018 6:00","2/3/2018 7:00","2/3/2018 8:00","2/3/2018 9:00","2/3/2018 10:00","2/3/2018 11:00","2/3/2018 12:00","2/3/2018 13:00","2/3/2018 14:00","2/3/2018 16:00","2/3/2018 17:00","2/3/2018 18:00","2/3/2018 19:00","2/3/2018 21:00","2/3/2018 22:00","2/3/2018 23:00","3/3/2018 0:00","3/3/2018 0:00","3/3/2018 1:00","3/3/2018 2:00","3/3/2018 4:00","3/3/2018 5:00","3/3/2018 7:00","3/3/2018 8:00","3/3/2018 9:00","3/3/2018 11:00","3/3/2018 12:00","3/3/2018 14:00","3/3/2018 15:00","3/3/2018 17:00","3/3/2018 18:00","3/3/2018 20:00","3/3/2018 22:00","3/3/2018 23:00","4/3/2018 0:00","4/3/2018 0:00","4/3/2018 1:00","4/3/2018 2:00","4/3/2018 3:00","4/3/2018 5:00","4/3/2018 6:00","4/3/2018 7:00","4/3/2018 8:00","4/3/2018 10:00","4/3/2018 11:00","4/3/2018 12:00","4/3/2018 14:00","4/3/2018 15:00","4/3/2018 16:00","4/3/2018 17:00","4/3/2018 19:00","4/3/2018 20:00","4/3/2018 22:00","4/3/2018 23:00")
staff <- c("Jack","Jack","Kate","Jack","Jack","Jack","Jack","Jack","Jack","Jack","Jack","Jack","Kate","Jack","Jack","Jack","David","David","Jack","Kate","David","David","David","David","David","David","David","David","David","David","David","David","David","David","David","David","David","Jack","Kate","David","David","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Jack")
reading <- c(7.5,8.3,7,6.9,7.1,8.1,8.4,8.8,6,7.1,8.9,7.3,7.4,6.9,11.3,18.8,4.6,6.7,7.7,7.8,7,7,6.6,6.8,6.7,6.1,7.1,6.3,7.2,6,5.8,6.6,6.5,6.4,7.2,8.4,6.5,6.5,5.5,6.7,7,7.5,6.5,7.5,7.2,6.3,7.3,8,7,8.2,6.5,6.8,7.5,7,6.1,5.7,6.7,4.3,6.3)
df <- data.frame(date_time, staff, reading)
One option is to do this separately: create a data.table of the dates of interest and the corresponding 'staff', and get the full sequence of date-times for each. Then rbind this with the original dataset and summarise with a condition, so that the inserted hours end up with a reading of 0.
library(data.table)
stf <- c("Jack", "David", "Kate")
date <- as.Date(c("2018-03-02", "2018-03-03", "2018-03-04"))
df1 <- data.table(date, staff= stf)[, .(date_time = seq(as.POSIXct(paste(date, "00:00:00"),
tz = "GMT"),
length.out = 24, by = "1 hour")), staff]
setDT(df)[, date_time := as.POSIXct(date_time, format = "%d/%m/%Y %H:%M", tz = "GMT")]
res <- rbindlist(list(df, df1), fill = TRUE)[,
.(reading = if(any(is.na(reading))) sum(reading, na.rm = TRUE) else reading),
.(staff, date_time)]
table(res$staff, as.Date(res$date_time))
# 2018-03-02 2018-03-03 2018-03-04
# David 3 24 2
# Jack 24 1 1
# Kate 3 1 24
head(res)
# staff date_time reading
#1: Jack 2018-03-02 00:00:00 7.5
#2: Jack 2018-03-02 01:00:00 8.3
#3: Kate 2018-03-02 02:00:00 7.0
#4: Jack 2018-03-02 03:00:00 6.9
#5: Jack 2018-03-02 05:00:00 7.1
#6: Jack 2018-03-02 06:00:00 8.1
tail(res)
# staff date_time reading
#1: Kate 2018-03-04 04:00:00 0
#2: Kate 2018-03-04 09:00:00 0
#3: Kate 2018-03-04 13:00:00 0
#4: Kate 2018-03-04 18:00:00 0
#5: Kate 2018-03-04 21:00:00 0
#6: Kate 2018-03-04 23:00:00 0
Try this code:
Identify each hour in the range and all staff members (parse the dates first, because min and max on the raw strings compare them alphabetically)
dt<-as.POSIXct(date_time,format="%d/%m/%Y %H:%M",tz="GMT")
date_h<-seq(min(dt),max(dt),by="hour")
staff_u<-unique(staff)
comb<-expand.grid(staff_u,date_h)
colnames(comb)<-c("staff","date_time")
Uniform date format in df
df$date_time<-as.POSIXct(df$date_time,format="%d/%m/%Y %H:%M",tz="GMT")
Merge information
out<-merge(comb,df,all.x=TRUE)
Your output:
head(out)
staff date_time reading
1 Jack 2018-03-02 00:00:00 7.5
2 Jack 2018-03-02 01:00:00 8.3
3 Jack 2018-03-02 02:00:00 NA
4 Jack 2018-03-02 03:00:00 6.9
5 Jack 2018-03-02 04:00:00 NA
6 Jack 2018-03-02 05:00:00 7.1
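The merge above fills every staff member. To fill only the staff member with the most readings on each day, one possibility is to count readings per day first and build the hourly grid just for the winners. This is a base R sketch on made-up toy data, not either answer's code. The names readings, top, grid and filled are illustrative, and ties are resolved by taking the first staff member alphabetically, which is an assumption.

```r
# Toy data: two days, two staff members (values are made up for illustration).
readings <- data.frame(
  staff = c("Jack", "Jack", "Kate", "Jack", "Kate", "Jack", "Kate"),
  date_time = as.POSIXct(
    c("2018-03-02 00:00", "2018-03-02 01:00", "2018-03-02 02:00",
      "2018-03-02 03:00", "2018-03-03 00:00", "2018-03-03 01:00",
      "2018-03-03 05:00"),
    format = "%Y-%m-%d %H:%M", tz = "GMT"),
  reading = c(7.5, 8.3, 7.0, 6.9, 6.6, 6.8, 6.1),
  stringsAsFactors = FALSE
)

# Count readings per day and staff member, then pick the busiest per day.
# which.max() keeps the first column on ties.
day <- format(readings$date_time, "%Y-%m-%d")
counts <- table(day, readings$staff)
top <- colnames(counts)[apply(counts, 1, which.max)]
names(top) <- rownames(counts)

# Build the full 24-hour grid only for each day's top staff member.
grid <- do.call(rbind, lapply(names(top), function(d) {
  data.frame(
    staff = top[[d]],
    date_time = seq(as.POSIXct(d, tz = "GMT"), by = "hour", length.out = 24),
    stringsAsFactors = FALSE
  )
}))

# A full outer join keeps every original row and adds the missing hours as NA.
filled <- merge(grid, readings, by = c("staff", "date_time"), all = TRUE)
```

Here Jack is completed to 24 rows on 2018-03-02 and Kate on 2018-03-03, while the other staff member's rows on each day are kept as they were.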
I have a data frame with missing values for "SNAP_ID". I'd like to fill in the missing values with floating point values based on a sequence from the previous non-missing value (lag()?). I would really like to achieve this using just dplyr if possible.
Assumptions:
There will never be missing data in the first or last row; I'm generating the missing dates from the days missing between the min and max of a data set
There can be multiple gaps in the data set
Current data:
end SNAP_ID
1 2015-06-26 12:59:00 365
2 2015-06-26 13:59:00 366
3 2015-06-27 00:01:00 NA
4 2015-06-27 23:00:00 NA
5 2015-06-28 00:01:00 NA
6 2015-06-28 23:00:00 NA
7 2015-06-29 09:00:00 367
8 2015-06-29 09:59:00 368
What I want to achieve:
end SNAP_ID
1 2015-06-26 12:59:00 365.0
2 2015-06-26 13:59:00 366.0
3 2015-06-27 00:01:00 366.1
4 2015-06-27 23:00:00 366.2
5 2015-06-28 00:01:00 366.3
6 2015-06-28 23:00:00 366.4
7 2015-06-29 09:00:00 367.0
8 2015-06-29 09:59:00 368.0
As a data frame:
df <- structure(list(end = structure(c(1435323540, 1435327140, 1435363260,
1435446000, 1435449660, 1435532400, 1435568400, 1435571940), tzone = "UTC", class = c("POSIXct",
"POSIXt")), SNAP_ID = c(365, 366, NA, NA, NA, NA, 367, 368)), .Names = c("end",
"SNAP_ID"), row.names = c(NA, -8L), class = "data.frame")
This was my attempt at achieving this goal, but it only works for the first missing value:
df %>%
arrange(end) %>%
mutate(SNAP_ID=ifelse(is.na(SNAP_ID),lag(SNAP_ID)+0.1,SNAP_ID))
end SNAP_ID
1 2015-06-26 12:59:00 365.0
2 2015-06-26 13:59:00 366.0
3 2015-06-27 00:01:00 366.1
4 2015-06-27 23:00:00 NA
5 2015-06-28 00:01:00 NA
6 2015-06-28 23:00:00 NA
7 2015-06-29 09:00:00 367.0
8 2015-06-29 09:59:00 368.0
The outstanding answer from @mathematical.coffee below:
df %>%
arrange(end) %>%
group_by(tmp=cumsum(!is.na(SNAP_ID))) %>%
mutate(SNAP_ID=SNAP_ID[1] + 0.1*(0:(length(SNAP_ID)-1))) %>%
ungroup() %>%
select(-tmp)
EDIT: new version works for any number of NA runs.
This one doesn't need zoo, either.
First, notice that tmp=cumsum(!is.na(SNAP_ID)) groups the SNAP_IDs such that each group with the same tmp consists of one non-NA value followed by a run of NA values.
Then group by this variable and just add .1 to the first SNAP_ID to fill out the NAs:
df %>%
arrange(end) %>%
group_by(tmp=cumsum(!is.na(SNAP_ID))) %>%
mutate(SNAP_ID=SNAP_ID[1] + 0.1*(0:(length(SNAP_ID)-1)))
end SNAP_ID tmp
1 2015-06-26 12:59:00 365.0 1
2 2015-06-26 13:59:00 366.0 2
3 2015-06-27 00:01:00 366.1 2
4 2015-06-27 23:00:00 366.2 2
5 2015-06-28 00:01:00 366.3 2
6 2015-06-28 23:00:00 366.4 2
7 2015-06-29 09:00:00 367.0 3
8 2015-06-29 09:59:00 368.0 4
Then you can drop the tmp column afterwards (add %>% select(-tmp) to the end).
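To see what the grouping does without dplyr, the same idea can be traced on a plain vector. This is a base R sketch for illustration only. It assumes, as the question does, that the first value is never NA, and the variable names are made up.

```r
snap <- c(365, 366, NA, NA, NA, NA, 367, 368)

# Each non-NA value starts a new group; the NAs after it share its group id.
grp <- cumsum(!is.na(snap))

# Position within each group, counted from 0.
offset <- ave(seq_along(snap), grp, FUN = function(i) seq_along(i) - 1)

# The k-th non-NA value is the starting value of group k.
base_value <- snap[!is.na(snap)][grp]

filled <- base_value + 0.1 * offset
```

With this input, grp is 1 2 2 2 2 2 3 4 and offset is 0 0 1 2 3 4 0 0, so the four NAs become 366.1 to 366.4.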
EDIT: this is the old version which doesn't work for subsequent runs of NAs.
If your aim is to fill each NA with the previous value + 0.1, you can use zoo's na.locf (which fills each NA with the previous value), along with cumsum(is.na(SNAP_ID))*0.1 to add the extra 0.1.
library(zoo)
df %>%
arrange(end) %>%
mutate(SNAP_ID=ifelse(is.na(SNAP_ID),
na.locf(SNAP_ID) + cumsum(is.na(SNAP_ID))*0.1,
SNAP_ID))
BACKGROUND
dplyr has window functions. When you want to control the order of window functions,
you can use order_by.
DATA
mydf <- data.frame(id = c("ana", "bob", "caroline",
"bob", "ana", "caroline"),
order = as.POSIXct(c("2015-01-01 18:00:00", "2015-01-01 18:05:00",
"2015-01-01 19:20:00", "2015-01-01 09:07:00",
"2015-01-01 08:30:00", "2015-01-01 11:11:00"),
format = "%Y-%m-%d %H:%M:%S"),
value = runif(6, 10, 20),
stringsAsFactors = FALSE)
# id order value
#1 ana 2015-01-01 18:00:00 19.00659
#2 bob 2015-01-01 18:05:00 13.64010
#3 caroline 2015-01-01 19:20:00 12.08506
#4 bob 2015-01-01 09:07:00 14.40996
#5 ana 2015-01-01 08:30:00 17.45165
#6 caroline 2015-01-01 11:11:00 14.50865
Suppose you want to use lag(), you can do the following.
arrange(mydf, id, order) %>%
group_by(id) %>%
mutate(check = lag(value))
# id order value check
#1 ana 2015-01-01 08:30:00 17.45165 NA
#2 ana 2015-01-01 18:00:00 19.00659 17.45165
#3 bob 2015-01-01 09:07:00 14.40996 NA
#4 bob 2015-01-01 18:05:00 13.64010 14.40996
#5 caroline 2015-01-01 11:11:00 14.50865 NA
#6 caroline 2015-01-01 19:20:00 12.08506 14.50865
However, you can avoid using arrange() with order_by().
group_by(mydf, id) %>%
mutate(check = lag(value, order_by = order))
# id order value check
#1 ana 2015-01-01 18:00:00 19.00659 17.45165
#2 bob 2015-01-01 18:05:00 13.64010 14.40996
#3 caroline 2015-01-01 19:20:00 12.08506 14.50865
#4 bob 2015-01-01 09:07:00 14.40996 NA
#5 ana 2015-01-01 08:30:00 17.45165 NA
#6 caroline 2015-01-01 11:11:00 14.50865 NA
EXPERIMENT
I wanted to apply the same procedure to assign a row number to a new column.
Using the sample data, you can do the following.
group_by(mydf, id) %>%
arrange(order) %>%
mutate(num = row_number())
# id order value num
#1 ana 2015-01-01 08:30:00 17.45165 1
#2 ana 2015-01-01 18:00:00 19.00659 2
#3 bob 2015-01-01 09:07:00 14.40996 1
#4 bob 2015-01-01 18:05:00 13.64010 2
#5 caroline 2015-01-01 11:11:00 14.50865 1
#6 caroline 2015-01-01 19:20:00 12.08506 2
Can we omit the arrange line? After reading the CRAN manual, I tried the following.
Neither attempt worked.
### Not working
group_by(mydf, id) %>%
mutate(num = row_number(order_by = order))
### Not working
group_by(mydf, id) %>%
mutate(num = order_by(order, row_number()))
How can we achieve this?
I did not mean to answer this question by myself. But, I decided to share
what I found given I have not seen many posts using order_by and particularly
with_order. My answer was to use with_order() instead of order_by().
group_by(mydf, id) %>%
mutate(num = with_order(order_by = order, fun = row_number, x = order))
# id order value num
#1 ana 2015-01-01 18:00:00 19.00659 2
#2 bob 2015-01-01 18:05:00 13.64010 2
#3 caroline 2015-01-01 19:20:00 12.08506 2
#4 bob 2015-01-01 09:07:00 14.40996 1
#5 ana 2015-01-01 08:30:00 17.45165 1
#6 caroline 2015-01-01 11:11:00 14.50865 1
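row_number() also accepts a vector to rank, so ranking the ordering column directly may be a shorter alternative to with_order(). A sketch assuming the same sample data, without the random value column:

```r
library(dplyr)

mydf <- data.frame(
  id = c("ana", "bob", "caroline", "bob", "ana", "caroline"),
  order = as.POSIXct(c("2015-01-01 18:00:00", "2015-01-01 18:05:00",
                       "2015-01-01 19:20:00", "2015-01-01 09:07:00",
                       "2015-01-01 08:30:00", "2015-01-01 11:11:00"),
                     format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
  stringsAsFactors = FALSE
)

# row_number(x) ranks x within each group, breaking ties by position,
# so the rows keep their original order and no arrange() is needed.
out <- mydf %>%
  group_by(id) %>%
  mutate(num = row_number(order)) %>%
  ungroup()
```

As with with_order(), the rows stay in their original order and num holds each row's position within its id when sorted by order.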
I wanted to see if there would be any difference between the two
approaches in terms of speed. The functions have to be called inside microbenchmark(). The output below was produced with microbenchmark(jazz1, jazz2, times = 1000000L), which passes the bare function names, so its nanosecond timings measure only the name lookup and do not show which approach is faster.
library(microbenchmark)
mydf2 <- data.frame(id = rep(c("ana", "bob", "caroline",
"bob", "ana", "caroline"), times = 200000),
order = seq(as.POSIXct("2015-03-01 18:00:00", format = "%Y-%m-%d %H:%M:%S"),
as.POSIXct("2015-01-01 18:00:00", format = "%Y-%m-%d %H:%M:%S"),
length.out = 1200000),
value = runif(1200000, 10, 20),
stringsAsFactors = FALSE)
jazz1 <- function() {group_by(mydf2, id) %>%
arrange(order) %>%
mutate(num = row_number())}
jazz2 <- function() {group_by(mydf2, id) %>%
mutate(num = with_order(order_by = order, fun = row_number, x = order))}
# Call the functions (note the parentheses) and use a small number of repetitions
res <- microbenchmark(jazz1(), jazz2(), times = 10L)
res
#Unit: nanoseconds
# expr min lq mean median uq max neval cld
# jazz1 32 36 47.17647 38 47 12308 1e+06 a
# jazz2 32 36 47.02902 38 47 12402 1e+06 a