Erase space in splitting - R - r

I have a dataframe where I split the datetime column into date and time (two columns). However, when I group by time, I get duplicate time values. To investigate, I ran table() on the time column, and it showed the duplicates as well. This is a sample of it:
> table(df$time)
00:00:00 00:00:00 00:15:00 00:15:00 00:30:00 00:30:00
2211 1047 2211 1047 2211 1047
As you can see, after the split one copy of each "unique" value kept a leading " ". Is there an easy way to solve this?
PS: The datatype of the time column is character.
EDIT: Code added
df$datetime <- as.character.Date(df$datetime)
x <- colsplit(df$datetime, ' ', names = c('Date','Time'))
df <- cbind(df, x)

There are a number of approaches. One of them is to use appropriate functions to extract Dates and Times from Datetime column:
df <- data.frame(datetime = seq(
from=as.POSIXct("2018-5-15 0:00", tz="UTC"),
to=as.POSIXct("2018-5-16 24:00", tz="UTC"),
by="30 min") )
head(df$datetime)
#[1] "2018-05-15 00:00:00 UTC" "2018-05-15 00:30:00 UTC" "2018-05-15 01:00:00 UTC" "2018-05-15 01:30:00 UTC"
#[5] "2018-05-15 02:00:00 UTC" "2018-05-15 02:30:00 UTC"
df$Date <- as.Date(df$datetime)
df$Time <- format(df$datetime,"%H:%M:%S")
head(df)
# datetime Date Time
# 1 2018-05-15 00:00:00 2018-05-15 00:00:00
# 2 2018-05-15 00:30:00 2018-05-15 00:30:00
# 3 2018-05-15 01:00:00 2018-05-15 01:00:00
# 4 2018-05-15 01:30:00 2018-05-15 01:30:00
# 5 2018-05-15 02:00:00 2018-05-15 02:00:00
# 6 2018-05-15 02:30:00 2018-05-15 02:30:00
table(df$Time)
#00:00:00 00:30:00 01:00:00 01:30:00 02:00:00 02:30:00 03:00:00 03:30:00 04:00:00 04:30:00 05:00:00 05:30:00
#3 2 2 2 2 2 2 2 2 2 2 2
#06:00:00 06:30:00 07:00:00 07:30:00 08:00:00 08:30:00 09:00:00 09:30:00 10:00:00 10:30:00 11:00:00 11:30:00
#2 2 2 2 2 2 2 2 2 2 2 2
#12:00:00 12:30:00 13:00:00 13:30:00 14:00:00 14:30:00 15:00:00 15:30:00 16:00:00 16:30:00 17:00:00 17:30:00
#2 2 2 2 2 2 2 2 2 2 2 2
#18:00:00 18:30:00 19:00:00 19:30:00 20:00:00 20:30:00 21:00:00 21:30:00 22:00:00 22:30:00 23:00:00 23:30:00
#2 2 2 2 2 2 2 2 2 2 2 2
# If the data were given as character strings and contain extra spaces, the above approach will still work
df <- data.frame(datetime=c("2018-05-15 00:00:00","2018-05-15 00:30:00",
"2018-05-15 01:00:00", "2018-05-15 02:00:00",
"2018-05-15 00:00:00","2018-05-15 00:30:00"),
stringsAsFactors=FALSE)
df$Date <- as.Date(df$datetime)
df$Time <- format(as.POSIXct(df$datetime, tz="UTC"),"%H:%M:%S")
head(df)
# datetime Date Time
# 1 2018-05-15 00:00:00 2018-05-15 00:00:00
# 2 2018-05-15 00:30:00 2018-05-15 00:30:00
# 3 2018-05-15 01:00:00 2018-05-15 01:00:00
# 4 2018-05-15 02:00:00 2018-05-15 02:00:00
# 5 2018-05-15 00:00:00 2018-05-15 00:00:00
# 6 2018-05-15 00:30:00 2018-05-15 00:30:00
table(df$Time)
#00:00:00 00:30:00 01:00:00 02:00:00
# 2 2 1 1
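If the split has already been done and only the stray space needs to go, a minimal sketch in base R is to strip the whitespace with trimws(). The vector below is hypothetical sample data imitating the time column from the question, not the asker's actual data.

```r
# Hypothetical time column: every second value carries a leading space
time_raw <- c("00:00:00", " 00:00:00", "00:15:00", " 00:15:00")

# Before cleaning, " 00:00:00" and "00:00:00" count as different values
length(unique(time_raw))

# trimws() removes leading and trailing whitespace from each string
time_clean <- trimws(time_raw)
table(time_clean)
```

The same call can be applied directly to the existing column, for example df$Time <- trimws(df$Time).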

reshape2::colsplit accepts regular expressions, so you could split on '\s+' which matches 1 or more whitespace characters.
You can find out more about regular expressions in R using ?base::regex. The syntax is generally constant between languages, so you can use pretty much any regex tutorial. Take a look at https://regex101.com/. This site evaluates your regular expressions in real time and shows you exactly what each part is matching. It is extremely helpful!
Keep in mind that in R, as compared to most other languages, you must double the number of backslashes \. So \s (to match 1 whitespace character) must be written as \\s in R.
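As a rough illustration of the difference between splitting on a single space and on '\\s+', here is a sketch that uses base R's strsplit() instead of reshape2::colsplit, so it runs without extra packages. The sample strings are hypothetical; the second one has two spaces between date and time.

```r
# Hypothetical sample: the second value has two spaces between date and time
datetime <- c("2018-05-15 00:00:00", "2018-05-15  00:00:00", "2018-05-15 00:15:00")

# Splitting on a single space leaves an empty field where the space is doubled
strsplit(datetime, " ")[[2]]

# Splitting on one or more whitespace characters gives exactly two fields per value
parts <- do.call(rbind, strsplit(datetime, "\\s+"))
x <- data.frame(Date = parts[, 1], Time = parts[, 2], stringsAsFactors = FALSE)
table(x$Time)
```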

Related

na.approx function does not produce correct timestamps

I have a large dataset of electric load data with a missing timestamp for the last Sunday of March of each year due to daylight saving time. I have copied below a few rows containing a missing timestamp.
structure(list(Date_Time = structure(c(1427569200, 1427572800,
1427576400, 1427580000, 1427583600, 1427587200, NA, 1427590800,
1427594400, 1427598000, 1427601600, 1427605200), tzone = "EET", class = c("POSIXct",
"POSIXt")), Day_ahead_Load = c("7139", "6598", "6137", "5177",
"4728", "4628", "N/A", "4426", "4326", "4374", "4546", "4885"
), Actual_Load = c(6541, 6020, 5602, 5084, 4640, 4593, NA, 4353,
NA, NA, 4333, 4556)), row.names = c(NA, -12L), class = "data.frame")
#> Date_Time Day_ahead_Load Actual_Load
#> 1 2015-03-28 21:00:00 7139 6541
#> 2 2015-03-28 22:00:00 6598 6020
#> 3 2015-03-28 23:00:00 6137 5602
#> 4 2015-03-29 00:00:00 5177 5084
#> 5 2015-03-29 01:00:00 4728 4640
#> 6 2015-03-29 02:00:00 4628 4593
#> 7 <NA> N/A NA
#> 8 2015-03-29 04:00:00 4426 4353
#> 9 2015-03-29 05:00:00 4326 NA
#> 10 2015-03-29 06:00:00 4374 NA
#> 11 2015-03-29 07:00:00 4546 4333
#> 12 2015-03-29 08:00:00 4885 4556
I have tried to fill these missing timestamps using na.approx, but the function returns "2015-03-29 02:30:00", instead of "2015-03-29 03:00:00". It does not use the correct scale.
mydata$Date_Time <- as.POSIXct(na.approx(mydata$Date_Time), origin = "1970-01-01 00:00:00", tz = "EET")
#> Date_Time Day_ahead_Load Actual_Load
#> 1 2015-03-28 21:00:00 7139 6541
#> 2 2015-03-28 22:00:00 6598 6020
#> 3 2015-03-28 23:00:00 6137 5602
#> 4 2015-03-29 00:00:00 5177 5084
#> 5 2015-03-29 01:00:00 4728 4640
#> 6 2015-03-29 02:00:00 4628 4593
#> 7 2015-03-29 02:30:00 N/A NA
#> 8 2015-03-29 04:00:00 4426 4353
#> 9 2015-03-29 05:00:00 4326 NA
#> 10 2015-03-29 06:00:00 4374 NA
#> 11 2015-03-29 07:00:00 4546 4333
#> 12 2015-03-29 08:00:00 4885 4556
I have also tried using some other functions, such as "fill", but none of them works properly.
As I am fairly new to R, I would really appreciate any suggestions for filling the missing timestamps. Thank you in advance.
Actually, the result is correct. There is only a one-hour difference between the 6th and 8th rows because of the change from standard time to daylight saving time.
Use GMT (or equivalently UTC) if you intended that there be 2 hours between those rows. Below we use the same date and time as a character string but change the timezone to GMT to avoid daylight savings time changes.
library(zoo) # provides na.approx
diff(mydata[c(6, 8), 1])
## Time difference of 1 hours
# use GMT
tt <- as.POSIXct(format(mydata[[1]]), tz = "GMT")
as.POSIXct(na.approx(tt), tz = "GMT", origin = "1970-01-01")
## [1] "2015-03-28 21:00:00 GMT" "2015-03-28 22:00:00 GMT"
## [3] "2015-03-28 23:00:00 GMT" "2015-03-29 00:00:00 GMT"
## [5] "2015-03-29 01:00:00 GMT" "2015-03-29 02:00:00 GMT"
## [7] "2015-03-29 03:00:00 GMT" "2015-03-29 04:00:00 GMT"
## [9] "2015-03-29 05:00:00 GMT" "2015-03-29 06:00:00 GMT"
## [11] "2015-03-29 07:00:00 GMT" "2015-03-29 08:00:00 GMT"
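To see where 02:30 comes from, here is a small sketch that works directly on the epoch seconds of rows 6 and 8, copied from the structure() output in the question. It assumes the "EET" zone in your system's time zone database follows the EU summer-time rules.

```r
# Epoch seconds of rows 6 and 8, copied from the question's structure() output
t6 <- 1427587200
t8 <- 1427590800

# Only 3600 seconds (one hour) separate the two rows on the time line
t8 - t6

# Local clock times of the two rows in EET: the wall clock skips an hour in between
format(as.POSIXct(c(t6, t8), origin = "1970-01-01", tz = "EET"), "%H:%M")

# Linear interpolation of a single gap yields the midpoint of the two instants
mid <- (t6 + t8) / 2
format(as.POSIXct(mid, origin = "1970-01-01", tz = "EET"), "%H:%M")
```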
You could use the following loop, which ensures that you always get the correct answer, even if the data contain several consecutive NAs.
library(lubridate)
dat$Date_Time <- as_datetime(as.character(dat$Date_Time))
dat$id <- 1:nrow(dat)
dat$previoustime <- NA
dat$timediff <- NA
for (i in 2:nrow(dat)) {
  # rows before row i that have a non-missing timestamp
  previousdateinds <- which(!is.na(dat$Date_Time) & dat$id < i)
  previousdateind <- tail(previousdateinds, 1)
  dat$timediff[i] <- i - previousdateind # number of rows between this row and the last non-NA time
  dat$previoustime[i] <- as.character(dat$Date_Time)[previousdateind]
}
dat$previoustime <- as_datetime(dat$previoustime)
dat$result <- ifelse(is.na(dat$Date_Time), as.character(dat$previoustime+dat$timediff*60*60),
as.character(dat$Date_Time))
dat[6:8,]
Date_Time Day_ahead_Load Actual_Load id previoustime timediff result
6 2015-03-29 02:00:00 4628 4593 6 2015-03-29 01:00:00 1 2015-03-29 02:00:00
7 <NA> N/A NA 7 2015-03-29 02:00:00 1 2015-03-29 03:00:00
8 2015-03-29 04:00:00 4426 4353 8 2015-03-29 02:00:00 2 2015-03-29 04:00:00

R: calculate number of occurrences which have started but not ended - count if within a datetime range

I've got a dataset with the following shape
ID Start Time End Time
1 01/01/2017 00:15:00 01/01/2017 07:15:00
2 01/01/2017 04:45:00 01/01/2017 06:15:00
3 01/01/2017 10:20:00 01/01/2017 20:15:00
4 01/01/2017 02:15:00 01/01/2017 00:15:00
5 02/01/2017 15:15:00 03/01/2017 00:30:00
6 03/01/2017 07:00:00 04/01/2017 09:15:00
I would like to count, every 15 minutes for an entire year, how many items have started but not finished, i.e. the number of items with a start time less than or equal to the time I'm looking at and an end time greater than that time.
I'm looking for an approach using tidyverse/dplyr if possible.
Any help or guidance would be very much appreciated.
If I understand correctly, the OP wants to count the number of simultaneously active events.
One way to tackle this question is the coverage() function from Bioconductor's IRanges package. Another is to aggregate in a non-equi join, which is available with the data.table package.
Non-equi join
# create sequence of datetimes (limited to 4 days for demonstration)
seq15 <- seq(lubridate::as_datetime("2017-01-01"),
lubridate::as_datetime("2017-01-05"), by = "15 mins")
# aggregate within a non-equi join
library(data.table)
result <- periods[.(time = seq15), on = .(Start.Time <= time, End.Time > time),
.(time, count = sum(!is.na(ID))), by = .EACHI][, .(time, count)]
result
time count
1: 2017-01-01 00:00:00 0
2: 2017-01-01 00:15:00 1
3: 2017-01-01 00:30:00 1
4: 2017-01-01 00:45:00 1
5: 2017-01-01 01:00:00 1
---
381: 2017-01-04 23:00:00 0
382: 2017-01-04 23:15:00 0
383: 2017-01-04 23:30:00 0
384: 2017-01-04 23:45:00 0
385: 2017-01-05 00:00:00 0
The result can be visualized graphically:
library(ggplot2)
ggplot(result) + aes(time, count) + geom_step()
Data
periods <- readr::read_table(
"ID Start.Time End.Time
1 01/01/2017 00:15:00 01/01/2017 07:15:00
2 01/01/2017 04:45:00 01/01/2017 06:15:00
3 01/01/2017 10:20:00 01/01/2017 20:15:00
4 01/01/2017 02:15:00 01/01/2017 00:15:00
5 02/01/2017 15:15:00 03/01/2017 00:30:00
6 03/01/2017 07:00:00 04/01/2017 09:15:00"
)
# convert date-time strings to class POSIXct
library(data.table)
cols <- names(periods)[names(periods) %like% "Time$"]
setDT(periods)[, (cols) := lapply(.SD, lubridate::dmy_hms), .SDcols = cols]
periods
ID Start.Time End.Time
1: 1 2017-01-01 00:15:00 2017-01-01 07:15:00
2: 2 2017-01-01 04:45:00 2017-01-01 06:15:00
3: 3 2017-01-01 10:20:00 2017-01-01 20:15:00
4: 4 2017-01-01 02:15:00 2017-01-01 00:15:00
5: 5 2017-01-02 15:15:00 2017-01-03 00:30:00
6: 6 2017-01-03 07:00:00 2017-01-04 09:15:00
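For comparison, the same count can be sketched in plain base R without a join. This is not the data.table approach above and not a dplyr solution. It simply tests every grid time against all intervals, which is fine for small data but scales with the number of grid points times the number of rows. The start and end vectors are copied from the question, and the UTC time zone is an assumption.

```r
# Sample intervals copied from the question (day/month/year format), parsed as UTC
fmt <- "%d/%m/%Y %H:%M:%S"
start <- as.POSIXct(c("01/01/2017 00:15:00", "01/01/2017 04:45:00", "01/01/2017 10:20:00",
                      "01/01/2017 02:15:00", "02/01/2017 15:15:00", "03/01/2017 07:00:00"),
                    format = fmt, tz = "UTC")
end <- as.POSIXct(c("01/01/2017 07:15:00", "01/01/2017 06:15:00", "01/01/2017 20:15:00",
                    "01/01/2017 00:15:00", "03/01/2017 00:30:00", "04/01/2017 09:15:00"),
                  format = fmt, tz = "UTC")

# 15-minute grid, limited to 4 days as in the answer above
grid <- seq(as.POSIXct("2017-01-01", tz = "UTC"),
            as.POSIXct("2017-01-05", tz = "UTC"), by = "15 min")

# An item is active at time t if it has started (start <= t) and not yet ended (end > t)
count <- vapply(seq_along(grid),
                function(i) sum(start <= grid[i] & end > grid[i]),
                integer(1))
result <- data.frame(time = grid, count = count)
head(result, 5)
```

Row 4 of the sample ends before it starts, so it is never counted as active.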

as.POSIXct gives inexplicable NA value [duplicate]

I have a large dataset (21683 records) and I've managed to combine date and time into a datetime correctly using as.POSIXct. However, this did not work for 6 records (17463:17468). This is the dataset I'm using:
> head(solar.angle)
Date Time sol.elev.angle ID Datetime
1 2016-11-24 15:00:00 41.32397 1 2016-11-24 15:00:00
2 2016-11-24 15:10:00 39.11225 2 2016-11-24 15:10:00
3 2016-11-24 15:20:00 36.88180 3 2016-11-24 15:20:00
4 2016-11-24 15:30:00 34.63507 4 2016-11-24 15:30:00
5 2016-11-24 15:40:00 32.37418 5 2016-11-24 15:40:00
6 2016-11-24 15:50:00 30.10096 6 2016-11-24 15:50:00
> solar.angle[17460:17470,]
Date Time sol.elev.angle ID Datetime
17488 2017-03-26 01:30:00 -72.01821 17460 2017-03-26 01:30:00
17489 2017-03-26 01:40:00 -69.53832 17461 2017-03-26 01:40:00
17490 2017-03-26 01:50:00 -67.05409 17462 2017-03-26 01:50:00
17491 2017-03-26 02:00:00 -64.56682 17463 <NA>
17492 2017-03-26 02:10:00 -62.07730 17464 <NA>
17493 2017-03-26 02:20:00 -59.58609 17465 <NA>
17494 2017-03-26 02:30:00 -57.09359 17466 <NA>
17495 2017-03-26 02:40:00 -54.60006 17467 <NA>
17496 2017-03-26 02:50:00 -52.10572 17468 <NA>
17497 2017-03-26 03:00:00 -49.61071 17469 2017-03-26 03:00:00
17498 2017-03-26 03:10:00 -47.11515 17470 2017-03-26 03:10:00
This is the code I'm using:
solar.angle$Datetime <- as.POSIXct(paste(solar.angle$Date,solar.angle$Time), format="%Y-%m-%d %H:%M:%S")
I've already tried to fill them in manually but this did not make any difference:
> solar.angle$Datetime[17463] <- as.POSIXct('2017-03-26 02:00:00', format = "%Y-%m-%d %H:%M:%S")
> solar.angle$Datetime[17463]
[1] NA
Any help will be appreciated!
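The problem here is that this is the hour in which clocks switch to summer time. In a time zone that makes that switch on 2017-03-26, local times between 02:00 and 03:00 do not exist, so the conversion returns NA unless you specify a time zone that has no such gap.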
If you specify a time zone, it will work:
as.POSIXct('2017-03-26 02:00:00', format = "%Y-%m-%d %H:%M:%S", tz = "GMT")
Which returns:
"2017-03-26 02:00:00 GMT"
You can check ?timezones for more information.
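Applied to the whole column, the idea might look like the sketch below. The data frame is a hypothetical excerpt of the Date and Time columns around the clock change, and "UTC" is used here as an equivalent of the "GMT" zone in the answer.

```r
# Hypothetical excerpt of the Date and Time columns around the clock change
solar <- data.frame(
  Date = "2017-03-26",
  Time = c("01:50:00", "02:00:00", "02:30:00", "03:00:00"),
  stringsAsFactors = FALSE
)

# Parsing in UTC avoids the skipped hour, because UTC has no summer time
solar$Datetime <- as.POSIXct(paste(solar$Date, solar$Time),
                             format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# No NA values remain in the combined column
anyNA(solar$Datetime)
```

Whether UTC is the right choice depends on how the timestamps were recorded; if they are genuine local clock times, times inside the skipped hour indicate a problem in the data itself.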

using match function to find and replace missing values in a data frame in R (closed)

I have the following data frame
t <- strptime(c("2012-01-01 00:00:00","2012-01-01 01:00:00", "2012-01-01 02:00:00", "2012-01-01 05:00:00", "2012-01-01 06:00:00"), format ="%Y-%m-%d %H:%M:%S");t
d1 <- 2:6
d2 <- 15:11
dfr <- data.frame(t, d1, d2);dfr
t d1 d2
2012-01-01 00:00:00 2 15
2012-01-01 01:00:00 3 14
2012-01-01 02:00:00 4 13
2012-01-01 05:00:00 5 12
2012-01-01 06:00:00 6 11
You can notice that the data from times "2012-01-01 03:00:00" and "2012-01-01 04:00:00" are missing.
To find the missing data, I first generated the complete sequence of time steps, then compared it with the "t" column as below.
t1Gen <- strptime("2012-01-01 00:00:00",format="%Y-%m-%d %H:%M:%S");
t2Gen <- strptime("2012-01-01 06:00:00",format="%Y-%m-%d %H:%M:%S");
tGen <- seq(t1Gen,t2Gen, 3600);tGen
"2012-01-01 00:00:00 CET"
"2012-01-01 01:00:00 CET"
"2012-01-01 02:00:00 CET"
"2012-01-01 03:00:00 CET"
"2012-01-01 04:00:00 CET"
"2012-01-01 05:00:00 CET"
"2012-01-01 06:00:00 CET"
mdfr <- match(tGen,dfr$t);mdfr
[1] 1 2 3 NA NA 4 5
subfr <- subset(mdfr, is.na(mdfr));subfr
[1] NA NA
Using the match function, 2 elements are singled out as missing with "NA". Now my aim is to fill out the two missing rows with "-99" to show that the data is missing, with the resulting dataframe looking like this;
t d1 d2
2012-01-01 00:00:00 2 15
2012-01-01 01:00:00 3 14
2012-01-01 02:00:00 4 13
2012-01-01 03:00:00 -99 -99
2012-01-01 04:00:00 -99 -99
2012-01-01 05:00:00 5 12
2012-01-01 06:00:00 6 11
I'm stuck at this point; any help with this would be appreciated.
P.S: Any other code would be welcome as well. Thanks
You can merge dfr and the tGen vector (after turning the latter into a data.frame). Specifying all = TRUE allows you to fill missing rows with NA.
dfrM <- merge(dfr, data.frame(t = tGen), all = TRUE)
Then determine which values are missing and replace with -99:
dfrM[is.na(dfrM)] <- -99
> dfrM
t d1 d2
1 2012-01-01 00:00:00 2 15
2 2012-01-01 01:00:00 3 14
3 2012-01-01 02:00:00 4 13
4 2012-01-01 03:00:00 -99 -99
5 2012-01-01 04:00:00 -99 -99
6 2012-01-01 05:00:00 5 12
7 2012-01-01 06:00:00 6 11
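You're almost there! Note that subfr contains only NA values, so it cannot be used as a row index, and dfr does not yet have rows for the missing times. Index dfr with the match result to get one row per element of tGen, then overwrite the unmatched rows:
res <- data.frame(t = tGen, dfr[mdfr, -1], row.names = NULL)
res[is.na(mdfr), -1] <- -99
# assumes that time is your first column, and the rest of the row gets -99
You can also recompute the match inline instead of reusing mdfr, if you'd like:
res[is.na(match(tGen, dfr$t)), -1] <- -99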

convert hourly rainfall data into daily in specific time interval

I have hourly rainfall and temperature data for a long period. I would like to get daily values from the hourly data, where a day runs from 07:00:00 to 07:00:00 the next day.
Could you tell me how to convert hourly data to daily data over a specific time interval
(for example, 07:00:00 to 07:00:00 or 12:00:00 to 12:00:00)?
Rainfall data looks like:
1970-01-05 00:00:00 1.0
1970-01-05 01:00:00 1.0
1970-01-05 02:00:00 1.0
1970-01-05 03:00:00 1.0
1970-01-05 04:00:00 1.0
1970-01-05 05:00:00 3.6
1970-01-05 06:00:00 3.6
1970-01-05 07:00:00 2.2
1970-01-05 08:00:00 2.2
1970-01-05 09:00:00 2.2
1970-01-05 10:00:00 2.2
1970-01-05 11:00:00 2.2
1970-01-05 12:00:00 2.2
1970-01-05 13:00:00 2.2
1970-01-05 14:00:00 2.2
1970-01-05 15:00:00 2.2
1970-01-05 16:00:00 0.0
1970-01-05 17:00:00 0.0
1970-01-05 18:00:00 0.0
1970-01-05 19:00:00 0.0
1970-01-05 20:00:00 0.0
1970-01-05 21:00:00 0.0
1970-01-05 22:00:00 0.0
1970-01-05 23:00:00 0.0
1970-01-06 00:00:00 0.0
First, create some reproducible data so we can help you better:
require(xts)
set.seed(1)
# keep When as POSIXct; wrapping it in as.Date() would drop the hours
X = data.frame(When = seq(from = ISOdatetime(2012, 01, 01, 00, 00, 00),
length.out = 100, by="1 hour"),
Measurements = sample(1:20, 100, replace=TRUE))
We now have a data frame with 100 hourly observations where the dates start at 2012-01-01 00:00:00 and end at 2012-01-05 03:00:00 (time is in 24-hour format).
Second, convert it to an XTS object.
X2 = xts(X$Measurements, order.by=X$When)
Third, learn how to subset a specific time window.
X2['T04:00/T08:00']
# [,1]
# 2012-01-01 04:00:00 5
# 2012-01-01 05:00:00 18
# 2012-01-01 06:00:00 19
# 2012-01-01 07:00:00 14
# 2012-01-01 08:00:00 13
# 2012-01-02 04:00:00 18
# 2012-01-02 05:00:00 7
# 2012-01-02 06:00:00 10
# 2012-01-02 07:00:00 12
# 2012-01-02 08:00:00 10
# 2012-01-03 04:00:00 9
# 2012-01-03 05:00:00 5
# 2012-01-03 06:00:00 2
# 2012-01-03 07:00:00 2
# 2012-01-03 08:00:00 7
# 2012-01-04 04:00:00 18
# 2012-01-04 05:00:00 8
# 2012-01-04 06:00:00 16
# 2012-01-04 07:00:00 20
# 2012-01-04 08:00:00 9
Fourth, use that information with apply.daily and whatever function you want, as follows:
apply.daily(X2['T04:00/T08:00'], mean)
# [,1]
# 2012-01-01 08:00:00 13.8
# 2012-01-02 08:00:00 11.4
# 2012-01-03 08:00:00 5.0
# 2012-01-04 08:00:00 14.2
Update: Custom endpoints
After re-reading your question, I see that I misinterpreted what you wanted.
It seems that you want to take the mean of a 24 hour period, not necessarily from midnight to midnight.
For this, you should ditch apply.daily and instead, use period.apply with custom endpoints, like this:
# You want to start at 7AM. Find out which record is the first one at 7AM.
A = which(as.character(index(X2)) == "2012-01-01 07:00:00")
# Use that to create your endpoints.
# The ends of the endpoints should start at 0
# and end at the max number of records.
ep = c(0, seq(A, 100, by=24), 100)
period.apply(X2, INDEX=ep, FUN=function(x) mean(x))
# [,1]
# 2012-01-01 07:00:00 12.62500
# 2012-01-02 07:00:00 10.08333
# 2012-01-03 07:00:00 10.79167
# 2012-01-04 07:00:00 11.54167
# 2012-01-05 03:00:00 10.25000
You can use this code:
fun <- function(s,i,j) { sum(s[i:(i+j-1)]) }
sapply(X=seq(1,24*nb_of_days,24),FUN=fun,s=your_time_serie,j=24)
You just have to change the 1 to another value to get a different time interval (assuming the series starts at 00:00:00): 8 for 07:00:00 to 07:00:00, or 13 for 12:00:00 to 12:00:00.
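A runnable sketch of this idea with a shifted start is below. The series is hypothetical (72 hourly values of 1, first value at 00:00:00), and first_idx is an illustrative name. The start indices are limited to complete 24-hour windows so the last window does not run past the end of the series.

```r
# Hypothetical hourly series: 72 values (3 days), first value at 00:00:00
your_time_serie <- rep(1, 72)

# The 8th value corresponds to 07:00:00
first_idx <- 8

# Start indices of complete 24-hour windows only
starts <- seq(first_idx, length(your_time_serie) - 24 + 1, by = 24)

# Sum each window of 24 consecutive hourly values
daily_sums <- vapply(starts,
                     function(i) sum(your_time_serie[i:(i + 23)]),
                     numeric(1))
daily_sums
```

The hours before the first 07:00:00 and after the last complete window are left out here; whether to report them as partial days is a separate decision.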
Step 1: transform date to POSIXct
ttt <- as.POSIXct("1970-01-05 08:00:00",tz="GMT")
ttt
#"1970-01-05 08:00:00 GMT"
Step 2: subtract a difftime of 7 hours
ttt <- ttt-as.difftime(7,units="hours")
ttt
#"1970-01-05 01:00:00 GMT"
Step 3: truncate to days
ttt<-trunc(ttt,"days")
ttt
#"1970-01-05 GMT"
Step 4: use plyr, data.table or whatever method you prefer, to calculate daily means
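The four steps can be put together roughly as follows, using only base R. The data frame is hypothetical (48 hourly values of 1 mm starting at 1970-01-05 00:00 UTC), as.Date() is used here in place of trunc() to get the day, and aggregate() stands in for plyr or data.table.

```r
# Hypothetical hourly series: 48 hours starting 1970-01-05 00:00 UTC, 1 mm each hour
rain <- data.frame(
  time = seq(as.POSIXct("1970-01-05 00:00:00", tz = "UTC"),
             by = "1 hour", length.out = 48),
  mm = 1
)

# Shift back 7 hours so that 07:00 becomes midnight, then keep only the date
rain$day <- as.Date(rain$time - as.difftime(7, units = "hours"), tz = "UTC")

# Daily totals, where each "day" runs from 07:00:00 to 06:59:59 the next morning
daily <- aggregate(mm ~ day, data = rain, FUN = sum)
daily
```

For temperature, FUN = mean would be the natural choice instead of sum. The first and last groups cover only part of a day in this sample, so their totals are smaller.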
Using regular expressions should get you what you need. Select lines that match your needs and sum the values. Do this for each day within your hour range and you're set.

Resources