Missing time series for UK Summer time change - r - r

I am seeing missing timseries data corresponding to GMT time change for Summer. I guess this might also be for Winter. I have two parts of query.
How to generate the missing timeseries from the code below. The table is in a xts format.
How to filter the records by time that will include the missing time series once generated. This is only a sample dataset. Thanks.
start <- as.POSIXct("2022-03-27 00:58:00")
interval <- 2
end <- as.POSIXct("2022-03-27 03:00:00")
missing_timestamp <- data.frame(TIMESTAMP = seq(from=start, by=interval*60, to=end))
head(missing_timestamp)
TIMESTAMP
1 2022-03-27 00:58:00
2 2022-03-27 02:00:00
3 2022-03-27 02:02:00
4 2022-03-27 02:04:00
5 2022-03-27 02:06:00
6 2022-03-27 02:08:00
Update:
Linked to the second query, when below code is executed for time between 00hrs and 02:06hrs all data is returned rather than the first four records.
a <- missing_timestamp %>% filter(TIMESTAMP > ymd_hms("2022-03-27 00:00:00") & TIMESTAMP < ymd_hms("2022-03-27 02:06:00"))

Related

Filling not observed observations

I want to make a time series with the frequency a date and time is observed. The raw data looked something like this:
dd-mm-yyyy hh:mm
28-2-2018 0:12
28-2-2018 11:16
28-2-2018 12:12
28-2-2018 13:22
28-2-2018 14:23
28-2-2018 14:14
28-2-2018 16:24
The date and time format is in the wrong way for R, so I had to adjust it:
extracted_times <- as.POSIXct(bedrijf.CSV$viewed_at, format = "%d-%m-%Y %H:%M")
I ordered the data with frequency in a table using the following code:
timeserieswithoutzeros <- table(extracted_times)
The data looks something like this now:
2018-02-28 00:11:00 2018-02-28 01:52:00 2018-02-28 03:38:00
1 2 5
2018-02-28 04:10:00 2018-02-28 04:40:00 2018-02-28 04:45:00
2 1 1
As you may see there are a lot of unobserved dates and times.
I want to add these unobserved dates and times with the frequency of 0.
I tried the complete function, but the error states that it can't best used, because I use as.POSIXct().
Any ideas?
As already mentinoned in the comments by #eric-lecoutre, you can combine your observations with a sequence begining at the earliest ending at the last date using seq and subtract 1 of the frequency table.
timeseriesWithzeros <- table(c(extracted_times, seq(min(extracted_times), max(extracted_times), "1 min")))-1
Maybe the following is what you want.
First, coerce the data to class "POSIXt" and create the sequence of all date/time between min and max by steps of 1 minute.
bedrijf.CSV$viewed_at <- as.POSIXct(bedrijf.CSV$viewed_at, format = "%d-%m-%Y %H:%M")
new <- seq(min(bedrijf.CSV$viewed_at),
max(bedrijf.CSV$viewed_at),
by = "1 mins")
tmp <- data.frame(viewed_at = new)
Now see if these values are in the original data.
tmp$viewed <- tmp$viewed_at %in% bedrijf.CSV$viewed_at
tbl <- xtabs(viewed ~ viewed_at, tmp)
sum(tbl != 0)
#[1] 7
Final clean up.
rm(new, tmp)

Creating a for loop to subset data on R

I have a huge set of data that in the .csv format has 2 columns (one Date_time and other is Q.vanda).
This is what the head and tail of the data looks like,
> head(mdf.vanda)
Date_Time Q.vanda
1 1969-12-05 21:00:00 0
2 1969-12-05 21:01:00 4
3 1969-12-05 21:05:00 11
4 1969-12-05 21:20:00 17
5 1969-12-05 22:45:00 27
6 1969-12-05 22:55:00 23
> tail(mdf.vanda)
Date_Time Q.vanda
165738 2016-01-19 10:15:00 2995.25
165739 2016-01-19 10:30:00 2858.04
165740 2016-01-19 10:45:00 2956.94
165741 2016-01-19 11:00:00 2972.52
165742 2016-01-19 11:15:00 2776.99
165743 2016-01-19 11:30:00 3082.53
There are 48 years of data in between and I want to create a for loop to subset them by year (ex. from 1969/10/01 to 1970/10/01, 1970/10/01 to 1971/10/01 etc.)
I wrote a code but, it's giving me an error that I am not able to resolve. I am pretty new at R so, feel free to suggest some other code that you might think is more efficient for my purpose.
code:
cut <- as.POSIXct(strptime(as.character(c('1969/10/01','1970/10/01','1971/10/01','1972/10/01','1973/10/01','1974/10/01','1975/10/01','1976/10/01','1977/10/01','1978/10/01','1979/10/01','1980/10/01','1981/10/01','1982/10/01','1983/10/01','1984/10/01','1985/10/01','1986/10/01','1987/10/01','1988/10/01','1989/10/01','1990/10/01','1991/10/01','1992/10/01','1993/10/01','1994/10/01','1995/10/01','1996/10/01','1997/10/01','1998/10/01',
'1999/10/01','2000/10/01','2001/10/01','2002/10/01','2003/10/01','2004/10/01',
'2005/10/01','2006/10/01','2007/10/01','2008/10/01','2009/10/01','2010/10/01',
'2011/10/01','2012/10/01','2013/10/01','2014/10/01','2015/10/01','2016/10/01')),format = "%Y/%m/%d"))
df.sub <- as.data.frame(matrix(data=NA,nrow=14496, ncol=96)) #nrow = (31+30+31+31+28)*(4*24)[days * readings/day] , ncol = (48*2)[Seasons*cols]
i.odd <- seq(1,49, by=2)
for (i in 1:48) {df.sub[1:length(mdf.vanda$Date_Time[mdf.vanda$Date_Time >= cut[i] & mdf.vanda$Date_Time < cut[i+1]])
,i.odd[i]:(i.odd[i]+1)] <- subset(mdf.vanda,mdf.vanda$Date_Time > cut[i] & mdf.vanda$Date_Time < cut[i+1])}
Error:
Error in [<-.data.frame(*tmp*, 1:length(mdf.vanda$Date_Time[mdf.vanda$Date_Time >= :
replacement element 1 has 1595 rows, need 1596
you can split your data as shown
split(mdf.vanda,findInterval(as.Date(mdf.vanda$Date_Time),seq(as.Date("1969-10-01"),as.Date("2016-10-01"),"1 year"))
There is no need for a loop here. Base R has the cut function to perform this very operation and significantly faster than the loop. Since you have the break points defined with your "cut" variable.
#cut <- as.POSIXct(c('1969/10/01', ... ,'2016/10/01'),format = "%Y/%m/%d")
mytime<-cut(mdf.vanda$Date_Time, breaks = cut, include.lowest = TRUE)
The variable "mytime" is a vector the length of your data frame with a label to bin the data.
You could then use the split function to break your dataframe in a list of data frames or use the group_by function from the dplyr library for additional data processing.
I suggest you have a look at the convenient quantmod package. Once you have Time Series data, you can use the apply.yearly function and apply any function to every year of data.

subset data by time interval if I have all data between time interval

I have a data frame that looks like this:
X id mat.1 mat.2 mat.3 times
1 1 1 Anne 1495206060 18.5639404 2017-05-19 11:01:00
2 2 1 Anne 1495209660 9.0160321 2017-05-19 12:01:00
3 3 1 Anne 1495211460 37.6559161 2017-05-19 12:31:00
4 4 1 Anne 1495213260 31.1218856 2017-05-19 13:01:00
....
164 164 1 Anne 1497825060 4.8098351 2017-06-18 18:31:00
165 165 1 Anne 1497826860 15.0678781 2017-06-18 19:01:00
166 166 1 Anne 1497828660 4.7636241 2017-06-18 19:31:00
What I would like is to subset the data set by time interval (all data between 11 AM and 4 PM) if there are data points for each hour at least (11 AM, 12, 1, 2, 3, 4 PM) within each day. I want to ultimately sum the values from mat.3 per time interval (11 AM to 4 PM) per day.
I did tried:
sub.1 <- subset(t,format(times,'%H')>='11' & format(times,'%H')<='16')
but this returns all the data from any of the times between 11 AM and 4 PM, but often I would only have data for e.g. 12 and 1 PM for a given day.
I only want the subset from days where I have data for each hour from 11 AM to 4 PM. Any ideas what I can try?
A complement to #Henry Navarro answer for solving an additional problem mentioned in the question.
If I understand in proper way, another concern of the question is to find the dates such that there are data points at least for each hour of the given interval within the day. A possible way following the style of #Henry Navarro solution is as follows:
library(lubridate)
your_data$hour_only <- as.numeric(format(your_data$times, format = "%H"))
your_data$days <- ymd(format(your_data$times, "%Y-%m-%d"))
your_data_by_days_list <- split(x = your_data, f = your_data$days)
# the interval is narrowed for demonstration purposes
hours_intervals <- 11:13
all_hours_flags <- data.frame(days = unique(your_data$days),
all_hours_present = sapply(function(Z) (sum(unique(Z$hour_only) %in% hours_intervals) >=
length(hours_intervals)), X = your_data_by_days_list), row.names = NULL)
your_data <- merge(your_data, all_hours_flags, by = "days")
There is now the column "all_hours_present" indicating that the data for a corresponding day contains at least one value for each hour in the given hours_intervals. And you may use this column to subset your data
subset(your_data, all_hours_present)
Try to create a new variable in your data frame with only the hour.
your_data$hour<-format(your_data$times, format="%H:%M:%S")
Then, using this new variable try to do the next:
#auxiliar variable with your interval of time
your_data$aux_var<-ifelse(your_data$hour >"11:00:00" || your_data$hour<"16:00:00" ,1,0)
So, the next step is filter your data when aux_var==1
your_data[which(your_data$aux_var ==1),]

Map a list of events (instants) to a list of periods (intervals) in R (with or without lubridate)

I have two data frames. One containing time periods marked with character unique IDs and another containing events with another set of unique IDs associated with them
Period DF (code):
periodID <- c("P_UID_00", "P_UID_01", "P_UDI_02", "P_UID_03")
periodStart <- as.POSIXct(c("2016/02/10 19:00", "2016/02/11 19:00",
"2016/02/12 19:00", "2016/02/13 19:00"))
periodEnd <- as.POSIXct(c("2016/02/10 21:00", "2016/02/11 21:00",
"2016/02/12 21:00", "2016/02/13 21:00"))
periodDF <- data.frame(periodID, periodStart, periodEnd)
Period DF:
periodID periodStart periodEnd
1 P_UID_00 2016-02-10 19:00:00 2016-02-10 21:00:00
2 P_UID_01 2016-02-11 19:00:00 2016-02-11 21:00:00
3 P_UDI_02 2016-02-12 19:00:00 2016-02-12 21:00:00
4 P_UID_03 2016-02-13 19:00:00 2016-02-13 21:00:00
Event DF (code):
eventID <- c("E_UID_00", "E_UID_01", "E_UDI_02", "E_UID_03")
eventTime <- as.POSIXct(c("2016/02/09 19:55:01", "2016/02/11 19:12:01",
"2016/02/11 20:22:01", "2016/02/15 19:00:01"))
eventDF <- data.frame(eventID, eventTime)
Event DF:
eventID eventTime
1 E_UID_00 2016-02-09 19:55:01
2 E_UID_01 2016-02-11 19:12:01
3 E_UDI_02 2016-02-11 20:22:01
4 E_UID_03 2016-02-15 19:00:01
I want to to map the event times in second DF to the time periods in the first DF in order to match the ID of the event to the ID of the period. Essentially the result table I want to see should look like:
eventID periodID
1 E_UID_00 NA
2 NA P_UID_00
3 E_UID_01 P_UID_01
4 E_UDI_02 P_UID_01
5 NA P_UID_02
6 NA P_UID_03
7 E_UID_03 NA
I suppose this can be achieved by using lubricate to transform the start and end cloumns in the first DF to intervals and the use some form of apply and instant %within% interval combination, but I am not really familiar with lubridate and did not manage to produce a working code
Additional considerations:
- periods are completely arbitrary and can last from seconds to years
- periods never overlap, so this is not an issue
- more than one event could be associated with a time period
- it is possible for DFs to contain unassociatable events and time periods
- the solution must not include loops
- does not have to be solved with lubridate, in fact a solution with the base R will be even more welcome.
I actually managed to come up with the code that produces exactly what I wanted using lubridate. So if anyone knows how to do this in base OR simply a better way than the one suggested below, sharing this will be greatly appreciated!
First off, the start and end times in the period DF should be converted to lubridate intervals:
intervalsP <- as.interval(periodStart, periodEnd)
Step 2: A function should be created for checking if an instant is located within a list of intervals. The only reason I have created a separate function is to be able using it with apply:
PeriodAssign <- function(x, y){
# x - instants
# y - intervals
variable1 <- mapply(`%within%`, x, y)
if (length(y[variable1]) != 0) {
as.character(y[variable1])
} else {
NA
}
}
NOTE: I had to use the interval to character coercion, because otherwise intervals were coerced to their length in seconds by the apply function and as such being not really useful for matching purposes - i.e. all four intervals in this example are the same length
Step 3: The function can the be used on the event DF and both DFs can then be merged to produce the DF I was looking for:
eventDF$intervals <- lapply(eventTime, PeriodAssign, intervalsP)
periodDF$intervals <- as.character(intervalsP)
mergedDF <- merge(periodDF, eventDF, by = "intervals")
presentableDF <- mergedDF[, c(2, 5)]
# adding in the unmatched Periods and Evenets
tDF1 <- data.frame(periodDF[!(periodDF$periodID %in% presentableDF$periodID), 1], NA)
colnames(tDF1) <- c("periodID", "eventID")
presentableDF <- rbind(presentableDF, tDF1)
tDF2 <- data.frame(NA, eventDF[!(eventDF$eventID %in% presentableDF$eventID), 1])
colnames(tDF2) <- c("periodID", "eventID")
presentableDF <- rbind(presentableDF, tDF2)
presentableDF <- presentableDF[order(presentableDF[,1]),]
The eventual DF looks like:
> presentableDF
periodID eventID
3 P_UID_00 <NA>
1 P_UID_01 E_UID_01
2 P_UID_01 E_UDI_02
4 P_UID_02 <NA>
5 P_UID_03 <NA>
6 <NA> E_UID_00
7 <NA> E_UID_03

R time series with datapoints every 15 min

I have a data file containing the raw values of my measurments. The top entries look like that:
495.08
1117.728
872.712
665.632
713.296
1172.44
1302.544
1428.832
1413.536
1361.896
1126.656
644.776
1251.616
1252.824
[...]
The measurements are done in 15 min intervals. In order to put this in a R time series object I use the following code.
data <- scan("./data__2.dat",skip=1)
datats <- ts(data, frequency=24*60/15, start=c(2014,1))
But this leaves me with:
Although the data is only from one year. Hence, it seems like the frequency is wrong. Any thoughts on how to fix this?
By doing:
library(xts)
library(lubridate)
df <- data.frame(interval = seq(ymd_hms('2014-01-21 00:00:00'),
by = '15 min',length.out=(60*24*365/15)),
data = rnorm(60*24*365/15))
ts <- xts(df, order.by=df$interval)
You get:
interval data
1 2014-01-21 00:00:00 -1.3975823
2 2014-01-21 00:15:00 -0.4710713
3 2014-01-21 00:30:00 0.9149273
4 2014-01-21 00:45:00 -0.3053136
5 2014-01-21 01:00:00 -1.2459707
6 2014-01-21 01:15:00 0.4749215

Resources