I have a dataset that looks like this:
start_date           end_date
2021-11-28 05:00:00  2022-06-29 04:00:00
2021-09-03 04:00:00  2022-12-04 05:00:00
2021-02-22 05:00:00  2021-03-16 04:00:00
2022-07-18 04:00:00  2022-12-19 04:00:00
2020-01-06 05:00:00  2020-07-05 04:00:00
2021-09-18 04:00:00  2022-03-18 04:00:00
2020-07-02 04:00:00  2020-08-30 04:00:00
2021-03-30 04:00:00  2021-04-27 04:00:00
2021-05-31 04:00:00  2021-11-30 05:00:00
2021-08-05 04:00:00  2022-02-03 05:00:00
I add another column with the number of days in this “approved” date range, rounded and converted to numeric so I can use it in other calculations.
dat1$days_approved <- round(as.numeric(difftime(dat1$end_date,dat1$start_date,units=c("days"))),digits = 0)
Now, I want to see where I am, based on today’s date, relative to these time periods. That is, are we halfway through, have we not started, or are we complete?
So, I use lubridate's now() function (with its tzone argument) for “today” and apply some basic division.
dat1$time_progress <- (round(as.numeric(now(tzone = "")-dat1$start_date,units=c("days"))))/dat1$days_approved
That leaves me with a dataset looking like this:
start_date           end_date             days_approved  time_progress
2021-11-28 05:00:00  2022-06-29 04:00:00  213            1.01
2021-09-03 04:00:00  2022-12-04 05:00:00  457            0.661
2021-02-22 05:00:00  2021-03-16 04:00:00  22             22.5
2022-07-18 04:00:00  2022-12-19 04:00:00  154            -0.104
2020-01-06 05:00:00  2020-07-05 04:00:00  181            5.02
2021-09-18 04:00:00  2022-03-18 04:00:00  181            1.59
2020-07-02 04:00:00  2020-08-30 04:00:00  59             12.4
2021-03-30 04:00:00  2021-04-27 04:00:00  28             16.4
2021-05-31 04:00:00  2021-11-30 05:00:00  183            2.17
2021-08-05 04:00:00  2022-02-03 05:00:00  182            1.82
This makes me think I need to set a threshold: if the value is greater than 1, I’d like it to return 1; otherwise, I’d like it to return the value itself.
I can make this work with an if else statement…
ifelse(dat1$time_progress > 1, 1, dat1$time_progress)
However, I’m struggling to apply it as logic to the column. Is there an existing function that can apply a threshold I have not found?
We could create our own threshold function and then apply it to the desired column:
library(dplyr)
library(lubridate)
my_treshold_function <- function(x){
  ifelse(x > 1, 1, x)
}
df %>%
  mutate(across(ends_with("date"), ymd_hms),
         days_approved = round(as.numeric(end_date-start_date), 0),
         progress = round(as.numeric(now(tzone = "")-start_date))/days_approved,
         across(progress, ~my_treshold_function(.), .names="treshold"))
start_date end_date days_approved progress treshold
<dttm> <dttm> <dbl> <dbl> <dbl>
1 2021-11-28 05:00:00 2022-05-29 04:00:00 182 1.18 1
2 2021-09-03 04:00:00 2022-03-04 05:00:00 182 1.65 1
3 2021-02-22 05:00:00 2021-03-16 04:00:00 22 22.5 1
4 2020-09-18 04:00:00 2021-03-19 04:00:00 182 3.58 1
5 2020-01-06 05:00:00 2020-07-05 04:00:00 181 5.01 1
6 2021-09-18 04:00:00 2022-03-18 04:00:00 181 1.58 1
7 2020-07-02 04:00:00 2020-08-30 04:00:00 59 12.4 1
8 2021-03-30 04:00:00 2021-04-27 04:00:00 28 16.4 1
9 2021-05-31 04:00:00 2021-11-30 05:00:00 183 2.16 1
10 2021-08-05 04:00:00 2022-02-03 05:00:00 182 1.81 1
data:
structure(list(start_date = c("2021-11-28 5:00:00", "2021-09-03 4:00:00",
"2021-02-22 5:00:00", "2020-09-18 4:00:00", "2020-01-06 5:00:00",
"2021-09-18 4:00:00", "2020-07-02 4:00:00", "2021-03-30 4:00:00",
"2021-05-31 4:00:00", "2021-08-05 4:00:00"), end_date = c("2022-05-29 4:00:00",
"2022-03-04 5:00:00", "2021-03-16 4:00:00", "2021-03-19 4:00:00",
"2020-07-05 4:00:00", "2022-03-18 4:00:00", "2020-08-30 4:00:00",
"2021-04-27 4:00:00", "2021-11-30 5:00:00", "2022-02-03 5:00:00"
)), class = "data.frame", row.names = c(NA, -10L))
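On the question of whether an existing function can do this: base R's pmin() takes the element-wise minimum of a vector and a constant, which is one way to cap values at 1 without writing a helper. A minimal sketch, using a few time_progress values copied from the question as illustrative input (the names capped and clamped are made up for this example):

```r
# Illustrative input: some time_progress values copied from the question
time_progress <- c(1.01, 0.661, 22.5, -0.104)

# pmin() compares each element with 1 and keeps the smaller one,
# so anything above 1 becomes 1 and everything else is unchanged
capped <- pmin(time_progress, 1)

# pmax() can be chained to also clamp negative ("not started") values at 0
clamped <- pmax(pmin(time_progress, 1), 0)

print(capped)
print(clamped)
```

Inside a dplyr pipeline the same idea would be written as mutate(progress = pmin(progress, 1)).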
dat2<- dat1
dat2$time_progress2 <- ifelse(dat1$time_progress > 1, 1, dat1$time_progress)
Have you tried this? It creates a new data frame, dat2, with the same columns as dat1 plus a new column, time_progress2, which is time_progress with the threshold applied. You can compare the two columns side by side.
Then, if everything checks out, you can just delete the original time_progress variable from dat2.
Judging from the data you provided in your question, it seems like nearly every row will return a 1.
You could also create a new column using the same function as above to return 3 values "day completed", "half completed", and "not started" as that is what you are looking for.
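As a rough sketch of that last suggestion, a nested ifelse() can turn the progress ratio into a status label. The cut-offs (below 0, and 1 or above) and the label wording are assumptions chosen for illustration, not something specified in the question:

```r
# Illustrative progress values: before the start, partway through, past the end
time_progress <- c(-0.104, 0.661, 1.01)

# Negative means the period has not begun, 1 or more means it has ended,
# and anything in between is still under way
status <- ifelse(time_progress < 0, "not started",
          ifelse(time_progress >= 1, "completed", "in progress"))

print(status)
```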
In my dataset I have a variable called visit_datetime. It records when the participant visited the researcher, which can be at any time of day. I want to assign the value "1" if the visit was between 08.00 and 20.00, and the value "2" if the visit was between 20.00 and 08.00. Is there an easy way to do this? For all other date/time calculations I use the package lubridate. The visit_datetime column is parsed correctly, because other calculations on it do work.
I tried it like this:
tijd_presentatie = ifelse(visit_datetime > hm("08:00") & visit_datetime < hm("20:00"), 1, 2)
But this always gives me the value "2".
Using lubridate::hour():
library(lubridate)
visit_datetime <- seq(ymd_hms("2023-02-14 00:00:00"), by = "hour", length.out = 24)
tijd_presentatie <- ifelse(hour(visit_datetime) >= 8 & hour(visit_datetime) < 20, 1, 0)
data.frame(visit_datetime, tijd_presentatie)
visit_datetime tijd_presentatie
1 2023-02-14 00:00:00 0
2 2023-02-14 01:00:00 0
3 2023-02-14 02:00:00 0
4 2023-02-14 03:00:00 0
5 2023-02-14 04:00:00 0
6 2023-02-14 05:00:00 0
7 2023-02-14 06:00:00 0
8 2023-02-14 07:00:00 0
9 2023-02-14 08:00:00 1
10 2023-02-14 09:00:00 1
11 2023-02-14 10:00:00 1
12 2023-02-14 11:00:00 1
13 2023-02-14 12:00:00 1
14 2023-02-14 13:00:00 1
15 2023-02-14 14:00:00 1
16 2023-02-14 15:00:00 1
17 2023-02-14 16:00:00 1
18 2023-02-14 17:00:00 1
19 2023-02-14 18:00:00 1
20 2023-02-14 19:00:00 1
21 2023-02-14 20:00:00 0
22 2023-02-14 21:00:00 0
23 2023-02-14 22:00:00 0
24 2023-02-14 23:00:00 0
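A likely reason the original attempt always returned 2 is that hm("08:00") creates a length of time (a Period), not a clock time, so comparing a full date-time against it does not test the time of day. The sketch below keeps the 1/2 coding asked for in the question. It uses base R's as.POSIXlt()$hour so that it runs without extra packages; lubridate::hour() returns the same number. The sample times are made up:

```r
# Hypothetical visit times, one every 6 hours starting at midnight UTC
visit_datetime <- seq(as.POSIXct("2023-02-14 00:00:00", tz = "UTC"),
                      by = "6 hours", length.out = 4)

# Extract the hour of day (0-23) from each date-time
visit_hour <- as.POSIXlt(visit_datetime)$hour

# 1 = visit from 08:00 up to (not including) 20:00, 2 = any other time
tijd_presentatie <- ifelse(visit_hour >= 8 & visit_hour < 20, 1, 2)

print(data.frame(visit_datetime, tijd_presentatie))
```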
I have a large dataset of electric load data with a missing timestamp for the last Sunday of March of each year due to daylight saving time. I have copied below a few rows containing a missing timestamp.
structure(list(Date_Time = structure(c(1427569200, 1427572800,
1427576400, 1427580000, 1427583600, 1427587200, NA, 1427590800,
1427594400, 1427598000, 1427601600, 1427605200), tzone = "EET", class = c("POSIXct",
"POSIXt")), Day_ahead_Load = c("7139", "6598", "6137", "5177",
"4728", "4628", "N/A", "4426", "4326", "4374", "4546", "4885"
), Actual_Load = c(6541, 6020, 5602, 5084, 4640, 4593, NA, 4353,
NA, NA, 4333, 4556)), row.names = c(NA, -12L), class = "data.frame")
#> Date_Time Day_ahead_Load Actual_Load
#> 1 2015-03-28 21:00:00 7139 6541
#> 2 2015-03-28 22:00:00 6598 6020
#> 3 2015-03-28 23:00:00 6137 5602
#> 4 2015-03-29 00:00:00 5177 5084
#> 5 2015-03-29 01:00:00 4728 4640
#> 6 2015-03-29 02:00:00 4628 4593
#> 7 <NA> N/A NA
#> 8 2015-03-29 04:00:00 4426 4353
#> 9 2015-03-29 05:00:00 4326 NA
#> 10 2015-03-29 06:00:00 4374 NA
#> 11 2015-03-29 07:00:00 4546 4333
#> 12 2015-03-29 08:00:00 4885 4556
I have tried to fill these missing timestamps using na.approx, but the function returns "2015-03-29 02:30:00", instead of "2015-03-29 03:00:00". It does not use the correct scale.
mydata$Date_Time <- as.POSIXct(na.approx(mydata$Date_Time), origin = "1970-01-01 00:00:00", tz = "EET")
#> Date_Time Day_ahead_Load Actual_Load
#> 1 2015-03-28 21:00:00 7139 6541
#> 2 2015-03-28 22:00:00 6598 6020
#> 3 2015-03-28 23:00:00 6137 5602
#> 4 2015-03-29 00:00:00 5177 5084
#> 5 2015-03-29 01:00:00 4728 4640
#> 6 2015-03-29 02:00:00 4628 4593
#> 7 2015-03-29 02:30:00 N/A NA
#> 8 2015-03-29 04:00:00 4426 4353
#> 9 2015-03-29 05:00:00 4326 NA
#> 10 2015-03-29 06:00:00 4374 NA
#> 11 2015-03-29 07:00:00 4546 4333
#> 12 2015-03-29 08:00:00 4885 4556
I have also tried using some other functions, such as "fill", but none of them works properly.
As I am fairly new to R, I would really appreciate any suggestions for filling the missing timestamps. Thank you in advance.
Actually the answer is correct. There is only one hour difference between the 6th and 8th rows due to the change from standard time to daylight savings time.
Use GMT (or equivalently UTC) if you intended that there be 2 hours between those rows. Below we use the same date and time as a character string but change the timezone to GMT to avoid daylight savings time changes.
diff(mydata[c(6, 8), 1])
## Time difference of 1 hours
# use GMT
tt <- as.POSIXct(format(mydata[[1]]), tz = "GMT")
as.POSIXct(na.approx(tt), tz = "GMT", origin = "1970-01-01")
## [1] "2015-03-28 21:00:00 GMT" "2015-03-28 22:00:00 GMT"
## [3] "2015-03-28 23:00:00 GMT" "2015-03-29 00:00:00 GMT"
## [5] "2015-03-29 01:00:00 GMT" "2015-03-29 02:00:00 GMT"
## [7] "2015-03-29 03:00:00 GMT" "2015-03-29 04:00:00 GMT"
## [9] "2015-03-29 05:00:00 GMT" "2015-03-29 06:00:00 GMT"
## [11] "2015-03-29 07:00:00 GMT" "2015-03-29 08:00:00 GMT"
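To see why linear interpolation gives 02:30, it may help to look at the two neighbouring timestamps directly. The sketch below rebuilds rows 6 and 8 from the numeric values in the question's dput() output; the variable names are made up for this example:

```r
# The two timestamps around the missing row, taken from the question's dput()
before <- as.POSIXct(1427587200, origin = "1970-01-01", tz = "EET")  # row 6
after  <- as.POSIXct(1427590800, origin = "1970-01-01", tz = "EET")  # row 8

# The wall clock jumps from 02:00 to 04:00, but only one real hour elapses
gap_hours <- as.numeric(difftime(after, before, units = "hours"))

# Linear interpolation lands halfway through that single hour
midpoint <- before + (as.numeric(after) - as.numeric(before)) / 2
midpoint_label <- format(midpoint, "%H:%M")

print(format(before, "%H:%M"))
print(format(after, "%H:%M"))
print(gap_hours)
print(midpoint_label)
```

Because the local time 03:00 does not exist on that date in EET, there is no real instant for the NA row to represent, which is why switching to GMT/UTC is needed to get a 03:00 value.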
You could use the following loop, which fills each missing timestamp from the last non-missing one, so it also works when several NAs follow each other in the data. It assumes hourly data; dat is the data frame from the question.
library(lubridate)
# re-read the printed local times as UTC so there are no DST jumps
dat$Date_Time <- as_datetime(as.character(dat$Date_Time))
dat$id <- 1:nrow(dat)
dat$previoustime <- NA
dat$timediff <- NA
for (i in 2:nrow(dat)) {
  # rows before i that have a non-NA time; keep the last one
  previousdateinds <- which(!is.na(dat$Date_Time) & dat$id < i)
  previousdateind <- tail(previousdateinds, 1)
  dat$timediff[i] <- i - previousdateind # number of rows between this row and the last non-NA time
  dat$previoustime[i] <- as.character(dat$Date_Time)[previousdateind]
}
dat$previoustime <- as_datetime(dat$previoustime)
# for NA rows, add one hour per row since the last known time
dat$result <- ifelse(is.na(dat$Date_Time), as.character(dat$previoustime+dat$timediff*60*60),
                     as.character(dat$Date_Time))
dat[6:8,]
Date_Time Day_ahead_Load Actual_Load id previoustime timediff result
6 2015-03-29 02:00:00 4628 4593 6 2015-03-29 01:00:00 1 2015-03-29 02:00:00
7 <NA> N/A NA 7 2015-03-29 02:00:00 1 2015-03-29 03:00:00
8 2015-03-29 04:00:00 4426 4353 8 2015-03-29 02:00:00 2 2015-03-29 04:00:00
I have a data frame df1 with a datetime column in UTC. I need to merge it with the data frame df2 by the datetime column. My problem is that df2 is in Europe/Paris time, and when I convert df2$datetime from Europe/Paris to UTC, I lose or duplicate data at the moments when the clocks change, either summer to winter or winter to summer. As an example:
df1 <- data.frame(datetime = c("2016-10-29 22:00:00", "2016-10-29 23:00:00", "2016-10-30 00:00:00",
                               "2016-10-30 01:00:00", "2016-10-30 02:00:00", "2016-10-30 03:00:00",
                               "2016-10-30 04:00:00", "2016-10-30 05:00:00", "2017-03-25 22:00:00",
                               "2017-03-25 23:00:00", "2017-03-26 00:00:00", "2017-03-26 01:00:00",
                               "2017-03-26 02:00:00", "2017-03-26 03:00:00", "2017-03-26 04:00:00"),
                  Var1 = c(4, 56, 76, 54, 34, 3, 4, 6, 78, 23, 12, 3, 5, 6, 7))
df1$datetime <- as.POSIXct(df1$datetime, format = "%Y-%m-%d %H", tz = "UTC")
df2 <- data.frame(datetime = c("2016-10-29 22:00:00", "2016-10-29 23:00:00", "2016-10-30 00:00:00",
                               "2016-10-30 01:00:00", "2016-10-30 02:00:00", "2016-10-30 03:00:00",
                               "2016-10-30 04:00:00", "2016-10-30 05:00:00", "2017-03-25 22:00:00",
                               "2017-03-25 23:00:00", "2017-03-26 00:00:00", "2017-03-26 01:00:00",
                               "2017-03-26 02:00:00", "2017-03-26 03:00:00", "2017-03-26 04:00:00"),
                  Var2 = c(56, 43, 23, 14, 51, 27, 89, 76, 56, 4, 35, 23, 4, 62, 84))
df2$datetime <- as.POSIXct(df2$datetime, format = "%Y-%m-%d %H", tz = "Europe/Paris")
df1
datetime Var1
1 2016-10-29 22:00:00 4
2 2016-10-29 23:00:00 56
3 2016-10-30 00:00:00 76
4 2016-10-30 01:00:00 54
5 2016-10-30 02:00:00 34
6 2016-10-30 03:00:00 3
7 2016-10-30 04:00:00 4
8 2016-10-30 05:00:00 6
9 2017-03-25 22:00:00 78
10 2017-03-25 23:00:00 23
11 2017-03-26 00:00:00 12
12 2017-03-26 01:00:00 3
13 2017-03-26 02:00:00 5
14 2017-03-26 03:00:00 6
15 2017-03-26 04:00:00 7
df2
datetime Var2
1 2016-10-29 22:00:00 56
2 2016-10-29 23:00:00 43
3 2016-10-30 00:00:00 23
4 2016-10-30 01:00:00 14
5 2016-10-30 02:00:00 51
6 2016-10-30 03:00:00 27
7 2016-10-30 04:00:00 89
8 2016-10-30 05:00:00 76
9 2017-03-25 22:00:00 56
10 2017-03-25 23:00:00 4
11 2017-03-26 00:00:00 35
12 2017-03-26 01:00:00 23
13 2017-03-26 02:00:00 4
14 2017-03-26 03:00:00 62
15 2017-03-26 04:00:00 84
When I change df2$datetime format from Europe/Paris to UTC, this happens:
library(lubridate)
df2$datetime<-with_tz(df2$datetime,"UTC")
df2
datetime Var2
1 2016-10-29 20:00:00 56
2 2016-10-29 21:00:00 43
3 2016-10-29 22:00:00 23
4 2016-10-29 23:00:00 14
5 2016-10-30 00:00:00 51
6 2016-10-30 02:00:00 27 # Data at 01:00:00 is missing
7 2016-10-30 03:00:00 89
8 2016-10-30 04:00:00 76
9 2017-03-25 21:00:00 56
10 2017-03-25 22:00:00 4
11 2017-03-25 23:00:00 35
12 2017-03-26 00:00:00 23
13 2017-03-26 00:00:00 4 # There is a duplicate at 00:00:00
14 2017-03-26 01:00:00 62
15 2017-03-26 02:00:00 84
16 2017-03-26 03:00:00 56
Is there another way to convert df2$datetime from Europe/Paris to UTC that lets me merge the two data frames without losing or duplicating data? I don't understand why I have to lose or duplicate information in df2.
Is the conversion I applied to df2$datetime the right one for merging this data frame with df1? What I've done so far to work around this is to add a new row to df2 on 2016-10-30 at 01:00:00 that is the mean of the values at 2016-10-30 00:00:00 and 2016-10-30 02:00:00, and to remove one row on 2017-03-26 at 00:00:00.
Thanks for your help.
I found out that my original df2 should be like this:
df2
datetime Var1
1 2016-10-29 22:00:00 4 # This is time in format "GMT+2". It corresponds to 20:00 UTC
2 2016-10-29 23:00:00 56 # This is time in format "GMT+2". It corresponds to 21:00 UTC
3 2016-10-30 00:00:00 76 # This is time in format "GMT+2". It corresponds to 22:00 UTC
4 2016-10-30 01:00:00 54 # This is time in format "GMT+2". It corresponds to 23:00 UTC
5 2016-10-30 02:00:00 34 # This is time in format "GMT+2". It corresponds to 00:00 UTC
6 2016-10-30 02:00:00 3 # This is time in format "GMT+1". It corresponds to 01:00 UTC
7 2016-10-30 03:00:00 4 # This is time in format "GMT+1". It corresponds to 02:00 UTC
8 2016-10-30 04:00:00 6 # This is time in format "GMT+1". It corresponds to 03:00 UTC
9 2016-10-30 05:00:00 78 # This is time in format "GMT+1". It corresponds to 04:00 UTC
10 2017-03-25 22:00:00 23 # This is time in format "GMT+1". It corresponds to 21:00 UTC
11 2017-03-25 23:00:00 12 # This is time in format "GMT+1". It corresponds to 22:00 UTC
12 2017-03-26 00:00:00 3 # This is time in format "GMT+1". It corresponds to 23:00 UTC
13 2017-03-26 01:00:00 5 # This is time in format "GMT+1". It corresponds to 00:00 UTC
14 2017-03-26 03:00:00 6 # This is time in format "GMT+2". It corresponds to 01:00 UTC
15 2017-03-26 04:00:00 7 # This is time in format "GMT+2". It corresponds to 02:00 UTC
16 2017-03-26 05:00:00 76 # This is time in format "GMT+2". It corresponds to 03:00 UTC
However, my original df2 doesn't have duplicated or lost time data. It is like this:
df2
datetime Var1
1 2016-10-29 22:00:00 4
2 2016-10-29 23:00:00 56
3 2016-10-30 00:00:00 76
4 2016-10-30 01:00:00 54
5 2016-10-30 02:00:00 34
6 2016-10-30 03:00:00 3
7 2016-10-30 04:00:00 4
8 2016-10-30 05:00:00 6
9 2017-03-25 22:00:00 78
10 2017-03-25 23:00:00 23
11 2017-03-26 00:00:00 12
12 2017-03-26 01:00:00 3
13 2017-03-26 02:00:00 5
14 2017-03-26 03:00:00 6
15 2017-03-26 04:00:00 7
16 2017-03-26 05:00:00 76
When I applied the R code df2$datetime<-with_tz(df2$datetime,"UTC"), this happens:
df2
datetime Var1
1 2016-10-29 20:00:00 4
2 2016-10-29 21:00:00 56
3 2016-10-29 22:00:00 76
4 2016-10-29 23:00:00 54
5 2016-10-30 00:00:00 34
6 2016-10-30 02:00:00 3 # I have to manually add a new row between the times "00:00" and "02:00"
7 2016-10-30 03:00:00 4
8 2016-10-30 04:00:00 6
9 2017-03-25 21:00:00 78
10 2017-03-25 22:00:00 23
11 2017-03-25 23:00:00 12
12 2017-03-26 00:00:00 3
13 2017-03-26 01:00:00 5 # I have to manually remove one of the rows referring to the time "01:00".
14 2017-03-26 01:00:00 6
15 2017-03-26 02:00:00 7
16 2017-03-26 03:00:00 76
If my original df2 had one duplicate at "02:00:00" on 30 October and a gap on 26 March between "01:00" and "03:00", the R code df2$datetime<-with_tz(df2$datetime,"UTC") would give me this:
df2
datetime Var1
1 2016-10-29 20:00:00 4
2 2016-10-29 21:00:00 56
3 2016-10-29 22:00:00 76
4 2016-10-29 23:00:00 54
5 2016-10-30 00:00:00 34
6 2016-10-30 00:00:00 3 # I just have to change "00:00:00" for "01:00"
7 2016-10-30 02:00:00 4
8 2016-10-30 03:00:00 6
9 2016-10-30 04:00:00 78
10 2017-03-25 21:00:00 23
11 2017-03-25 22:00:00 12
12 2017-03-25 23:00:00 3
13 2017-03-26 00:00:00 5
14 2017-03-26 01:00:00 6
15 2017-03-26 02:00:00 7
16 2017-03-26 03:00:00 76
#As there are several versions of df2, I use the one shown in the question
df2 <- read.table(text = "
datetime Var2
1 '2016-10-29 22:00:00' 56
2 '2016-10-29 23:00:00' 43
3 '2016-10-30 00:00:00' 23
4 '2016-10-30 01:00:00' 14
5 '2016-10-30 02:00:00' 51
6 '2016-10-30 03:00:00' 27
7 '2016-10-30 04:00:00' 89
8 '2016-10-30 05:00:00' 76
9 '2017-03-25 22:00:00' 56
10 '2017-03-25 23:00:00' 4
11 '2017-03-26 00:00:00' 35
12 '2017-03-26 01:00:00' 23
13 '2017-03-26 02:00:00' 4
14 '2017-03-26 03:00:00' 62
15 '2017-03-26 04:00:00' 84
", header = TRUE)
library(lubridate)
#As soon as you assign the time zone, the content of df2 is already changed (see row 13)
df2$datetimeEP <- as.POSIXct(df2$datetime, format = "%Y-%m-%d %H", tz= "Europe/Paris")
#df2[13,]
# datetime Var2 datetimeEP
#13 2017-03-26 02:00:00 4 2017-03-26 01:00:00
#It looks to me as if your recorded times don't observe daylight saving time.
#So you have to use a fixed-offset zone, e.g. "Etc/GMT-1" (which, despite the sign, means UTC+1), instead of "Europe/Paris"
df2$datetimeG1 <- as.POSIXct(df2$datetime, format = "%Y-%m-%d %H", tz= "Etc/GMT-1")
data.frame(datetime=df2$datetime, utc=with_tz(df2$datetimeG1,"UTC"))
# datetime utc
#1 2016-10-29 22:00:00 2016-10-29 21:00:00
#2 2016-10-29 23:00:00 2016-10-29 22:00:00
#3 2016-10-30 00:00:00 2016-10-29 23:00:00
#4 2016-10-30 01:00:00 2016-10-30 00:00:00
#5 2016-10-30 02:00:00 2016-10-30 01:00:00
#6 2016-10-30 03:00:00 2016-10-30 02:00:00
#7 2016-10-30 04:00:00 2016-10-30 03:00:00
#8 2016-10-30 05:00:00 2016-10-30 04:00:00
#9 2017-03-25 22:00:00 2017-03-25 21:00:00
#10 2017-03-25 23:00:00 2017-03-25 22:00:00
#11 2017-03-26 00:00:00 2017-03-25 23:00:00
#12 2017-03-26 01:00:00 2017-03-26 00:00:00
#13 2017-03-26 02:00:00 2017-03-26 01:00:00
#14 2017-03-26 03:00:00 2017-03-26 02:00:00
#15 2017-03-26 04:00:00 2017-03-26 03:00:00
#You can use dst() to check whether a datetime in a given time zone falls in daylight saving time
dst(df2$datetimeEP)
dst(df2$datetimeG1)
dst(with_tz(df2$datetimeEP,"UTC"))
dst(with_tz(df2$datetimeG1,"UTC"))
#If your recorded times do observe daylight saving time, then you really do HAVE a gap and an overlap.
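To illustrate that gap and overlap, the sketch below steps through three consecutive UTC hours around each of the two Europe/Paris clock changes discussed above and prints them as Paris wall-clock times. It uses only base R; the variable names are made up for this example:

```r
# One-hour steps in UTC around the two Europe/Paris clock changes
spring_utc <- seq(as.POSIXct("2017-03-26 00:00:00", tz = "UTC"),
                  by = "hour", length.out = 3)
autumn_utc <- seq(as.POSIXct("2016-10-30 00:00:00", tz = "UTC"),
                  by = "hour", length.out = 3)

# Render the same instants as Paris wall-clock times
spring_paris <- format(spring_utc, "%H:%M", tz = "Europe/Paris")
autumn_paris <- format(autumn_utc, "%H:%M", tz = "Europe/Paris")

print(spring_paris)  # local 02:00 never appears (gap)
print(autumn_paris)  # local 02:00 appears twice (overlap)
```

A local series that shows every hour exactly once across both dates, like the df2 in the question, is therefore consistent with a clock that was never adjusted for daylight saving time.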
I am using dplyr's mutate function to create a POSIX date column of a data frame by taking the lead of another column. When I try to fill in the missing values in the lead function using a single date, I get an error:
> dates
# A tibble: 5 x 1
orig_date
<dttm>
1 2016-06-21 20:00:00
2 2016-07-09 22:00:00
3 2016-07-10 22:00:00
4 2016-07-20 21:00:00
5 2016-07-21 21:00:00
> fillin_date
[1] "2018-08-29 UTC"
> dates %>% mutate(next_date = lead(orig_date, 1, default = fillin_date))
Error in mutate_impl(.data, dots) :
Not compatible with requested type: [type=symbol; target=double].
This does not happen outside of mutate:
> lead(dates$orig_date, 1, default = fillin_date)
[1] "2016-07-09 22:00:00 UTC" "2016-07-10 22:00:00 UTC" "2016-07-20 21:00:00 UTC"
[4] "2016-07-21 21:00:00 UTC" "2018-08-29 00:00:00 UTC"
What is going wrong here?
I am not sure as to the underlying reason why you can supply the symbol outside of mutate but not inside, but you can get around it by quoting and unquoting the variable. You can also save your date to fill in as character and just convert to date inside the mutate call.
library(tidyverse)
df <- tibble(orig_date = c("2016-06-21 20:00:00", "2016-07-09 22:00:00", "2016-07-10 22:00:00", "2016-07-20 21:00:00", "2016-07-21 21:00:00")) %>%
mutate(orig_date = as.POSIXct(orig_date))
fillin_date <- as.POSIXct("2018-08-29")
fillin_date2 <- "2018-08-29"
df %>%
mutate(next_date = lead(orig_date, 1, default = !!quo(fillin_date)))
#> # A tibble: 5 x 2
#> orig_date next_date
#> <dttm> <dttm>
#> 1 2016-06-21 20:00:00 2016-07-09 22:00:00
#> 2 2016-07-09 22:00:00 2016-07-10 22:00:00
#> 3 2016-07-10 22:00:00 2016-07-20 21:00:00
#> 4 2016-07-20 21:00:00 2016-07-21 21:00:00
#> 5 2016-07-21 21:00:00 2018-08-29 00:00:00
df %>%
mutate(next_date = lead(orig_date, 1, default = as.POSIXct(fillin_date2)))
#> # A tibble: 5 x 2
#> orig_date next_date
#> <dttm> <dttm>
#> 1 2016-06-21 20:00:00 2016-07-09 22:00:00
#> 2 2016-07-09 22:00:00 2016-07-10 22:00:00
#> 3 2016-07-10 22:00:00 2016-07-20 21:00:00
#> 4 2016-07-20 21:00:00 2016-07-21 21:00:00
#> 5 2016-07-21 21:00:00 2018-08-29 00:00:00
Created on 2018-10-03 by the reprex package (v0.2.0).
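If the dplyr behaviour is still in the way, a base R alternative is to build the shifted column by hand and attach it afterwards. This is only a sketch of the same computation, not an explanation of the error; the data are the dates from the question, parsed here as UTC:

```r
# Dates from the question, parsed as UTC
orig_date <- as.POSIXct(c("2016-06-21 20:00:00", "2016-07-09 22:00:00",
                          "2016-07-10 22:00:00", "2016-07-20 21:00:00",
                          "2016-07-21 21:00:00"), tz = "UTC")
fillin_date <- as.POSIXct("2018-08-29", tz = "UTC")

# Drop the first element and append the fill-in value at the end,
# which is what lead(orig_date, 1, default = fillin_date) computes
next_date <- c(orig_date[-1], fillin_date)

dates <- data.frame(orig_date, next_date)
print(dates)
```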
I have a data frame where I split the datetime column into date and time (two columns). However, when I group by time I get duplicate times. To investigate, I ran table() on the time column, and it showed duplicates as well. This is a sample of it:
> table(df$time)
00:00:00 00:00:00 00:15:00 00:15:00 00:30:00 00:30:00
2211 1047 2211 1047 2211 1047
As you can see, after the split one copy of each "unique" value kept a " " inside it. Is there an easy way to solve this?
PS: The datatype of the time column is character.
EDIT: Code added
df$datetime <- as.character.Date(df$datetime)
x <- colsplit(df$datetime, ' ', names = c('Date','Time'))
df <- cbind(df, x)
There are a number of approaches. One of them is to use appropriate functions to extract Dates and Times from Datetime column:
df <- data.frame(datetime = seq(
from=as.POSIXct("2018-5-15 0:00", tz="UTC"),
to=as.POSIXct("2018-5-16 24:00", tz="UTC"),
by="30 min") )
head(df$datetime)
#[1] "2018-05-15 00:00:00 UTC" "2018-05-15 00:30:00 UTC" "2018-05-15 01:00:00 UTC" "2018-05-15 01:30:00 UTC"
#[5] "2018-05-15 02:00:00 UTC" "2018-05-15 02:30:00 UTC"
df$Date <- as.Date(df$datetime)
df$Time <- format(df$datetime,"%H:%M:%S")
head(df)
# datetime Date Time
# 1 2018-05-15 00:00:00 2018-05-15 00:00:00
# 2 2018-05-15 00:30:00 2018-05-15 00:30:00
# 3 2018-05-15 01:00:00 2018-05-15 01:00:00
# 4 2018-05-15 01:30:00 2018-05-15 01:30:00
# 5 2018-05-15 02:00:00 2018-05-15 02:00:00
# 6 2018-05-15 02:30:00 2018-05-15 02:30:00
table(df$Time)
#00:00:00 00:30:00 01:00:00 01:30:00 02:00:00 02:30:00 03:00:00 03:30:00 04:00:00 04:30:00 05:00:00 05:30:00
#3 2 2 2 2 2 2 2 2 2 2 2
#06:00:00 06:30:00 07:00:00 07:30:00 08:00:00 08:30:00 09:00:00 09:30:00 10:00:00 10:30:00 11:00:00 11:30:00
#2 2 2 2 2 2 2 2 2 2 2 2
#12:00:00 12:30:00 13:00:00 13:30:00 14:00:00 14:30:00 15:00:00 15:30:00 16:00:00 16:30:00 17:00:00 17:30:00
#2 2 2 2 2 2 2 2 2 2 2 2
#18:00:00 18:30:00 19:00:00 19:30:00 20:00:00 20:30:00 21:00:00 21:30:00 22:00:00 22:30:00 23:00:00 23:30:00
#2 2 2 2 2 2 2 2 2 2 2 2
#If the data were given as character strings and contain extra spaces the above approach will still work
df <- data.frame(datetime=c("2018-05-15 00:00:00","2018-05-15 00:30:00",
"2018-05-15 01:00:00", "2018-05-15 02:00:00",
"2018-05-15 00:00:00","2018-05-15 00:30:00"),
stringsAsFactors=FALSE)
df$Date <- as.Date(df$datetime)
df$Time <- format(as.POSIXct(df$datetime, tz="UTC"),"%H:%M:%S")
head(df)
# datetime Date Time
# 1 2018-05-15 00:00:00 2018-05-15 00:00:00
# 2 2018-05-15 00:30:00 2018-05-15 00:30:00
# 3 2018-05-15 01:00:00 2018-05-15 01:00:00
# 4 2018-05-15 02:00:00 2018-05-15 02:00:00
# 5 2018-05-15 00:00:00 2018-05-15 00:00:00
# 6 2018-05-15 00:30:00 2018-05-15 00:30:00
table(df$Time)
#00:00:00 00:30:00 01:00:00 02:00:00
# 2 2 1 1
reshape2::colsplit accepts regular expressions, so you could split on '\s+' which matches 1 or more whitespace characters.
You can find out more about regular expressions in R using ?base::regex. The syntax is generally constant between languages, so you can use pretty much any regex tutorial. Take a look at https://regex101.com/. This site evaluates your regular expressions in real time and shows you exactly what each part is matching. It is extremely helpful!
Keep in mind that in R, as compared to most other languages, you must double the number of backslashes \. So \s (to match 1 whitespace character) must be written as \\s in R.
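As a small illustration of the doubled backslash, the sketch below splits on "\\s+" using base R's strsplit() rather than colsplit(), so that it runs without reshape2. The input strings are made up, and two of them deliberately contain two spaces between the date and the time:

```r
# Hypothetical datetime strings; the last two have extra spaces before the time
datetime <- c("2018-05-15 00:00:00", "2018-05-15 00:15:00",
              "2018-05-16  00:00:00", "2018-05-16  00:15:00")

# "\\s+" is the regex \s+ : split on one or more whitespace characters
parts <- do.call(rbind, strsplit(datetime, "\\s+"))
split_df <- data.frame(Date = parts[, 1], Time = parts[, 2],
                       stringsAsFactors = FALSE)

# Each time now appears once in the table, with no leading-space variants
time_counts <- table(split_df$Time)
print(time_counts)
```

Splitting on a single ' ' instead would leave the second space attached to the time, which is what produces the apparent duplicates in table().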