Merge time series data with different length (gaps) - r

I have two water flow measurement devices, each of which gives a value every minute. Now I need to merge both time series. My problem: every couple of hours the devices fail and drop some readings, so the two time series have different lengths. I need to fill the gaps first. This could be done with NA, a zero value, or the last value before the gap.
I can easily define the required time vector tseq by min and max values of the time series:
from <- as.POSIXct(min(Measurement1[[1]], Measurement2[[1]]))
to <- as.POSIXct(max(Measurement1[[1]], Measurement2[[1]]))
tseq <- as.data.frame(seq.POSIXt(from = from, to = to, by = deltaT, tz = "UTC"))
Then I tried to complete the two lists Measurement1 and Measurement2 with the zoo function as follows:
Measurement1Zoo <- as.data.frame(zoo(x=Measurement1, tseq[[1]]))
This leads to a df with the same length as tseq, but zoo just adds some values at the end of the vector.
I'm a bit confused about how zoo works. I just want to add the missing time stamps to the two time series and fill them with NA (or another value). How could this be done? You can find two example files here:
Example time series
Thank you!

You can use dplyr to do an outer join (i.e. full_join):
library(data.table)
m1 <- fread(file = "/Measurement1.CSV", sep = ";", header = TRUE)
m1$Date <- as.POSIXct(m1$Date,format="%d.%m.%Y %H:%M",tz=Sys.timezone())
m2 <- fread(file = "/Measurement2.CSV", sep = ";", header = TRUE)
m2$Date <- as.POSIXct(m2$Date,format="%d.%m.%Y %H:%M",tz=Sys.timezone())
names(m2)[2] <- "Value 5"
min(m1$Date) == min(m2$Date) #TRUE
max(m1$Date) == max(m2$Date) #TRUE
library(dplyr)
m_all <- full_join(x = m1, y = m2, by = "Date")
nrow(m1) #11517
nrow(m2) #11520
nrow(m_all) #11520
head(m_all)
# Date Value 1 Value 2 Value 3 Value 4 Value 5
#1 2015-07-24 00:00:00 28 2 0 26 92
#2 2015-07-24 00:01:00 28 2 0 26 95
#3 2015-07-24 00:02:00 28 2 0 26 90
#4 2015-07-24 00:03:00 28 2 0 26 89
#5 2015-07-24 00:04:00 28 2 0 26 94
#6 2015-07-24 00:05:00 27 1 0 26 95
#checking NA's
sum(is.na(m1$`Value 1`)) #0
sum(is.na(m1$`Value 2`)) #0
sum(is.na(m1$`Value 3`)) #3
sum(is.na(m1$`Value 4`)) #0
sum(is.na(m2$`Value 5`)) #42
sum(is.na(m_all$`Value 1`)) #3
sum(is.na(m_all$`Value 2`)) #3
sum(is.na(m_all$`Value 3`)) #6
sum(is.na(m_all$`Value 4`)) #3
sum(is.na(m_all$`Value 5`)) #42
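The question also asks for a complete one-minute grid and for filling gaps with the last value before the gap. A full join alone does not cover that, because a time stamp missing from both files stays missing. Below is a minimal base R sketch of one way to do it, assuming toy data frames m1 and m2 with made-up column names Value1 and Value5 in place of the real files; locf() is a hypothetical helper written for this example, not a library function.

```r
# Hypothetical toy data standing in for the two measurement files
t0 <- as.POSIXct("2015-07-24 00:00:00", tz = "UTC")
m1 <- data.frame(Date = t0 + 60 * c(0, 1, 2, 4, 5),
                 Value1 = c(28, 28, 27, 26, 26))
m2 <- data.frame(Date = t0 + 60 * c(0, 1, 3, 4, 5),
                 Value5 = c(92, 95, 89, 94, 95))

# Complete one-minute grid spanning both series
grid <- data.frame(Date = seq(min(m1$Date, m2$Date),
                              max(m1$Date, m2$Date),
                              by = "1 min"))

# Left-join each series onto the grid; missing minutes become NA
merged <- merge(grid, m1, by = "Date", all.x = TRUE)
merged <- merge(merged, m2, by = "Date", all.x = TRUE)

# Last observation carried forward (leading NAs stay NA)
locf <- function(x) {
  idx <- cumsum(!is.na(x))
  c(NA, x[!is.na(x)])[idx + 1]
}

merged_locf <- merged
merged_locf$Value1 <- locf(merged$Value1)
merged_locf$Value5 <- locf(merged$Value5)
```

To fill with zeros instead, replace the NA entries of merged directly, for example merged$Value1[is.na(merged$Value1)] <- 0.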

Related

Time series - Convert every column of dataframe to time series

I have a dataframe df in R:
month abc1 def2 xyz3
201201 1 2 4
201202 2 5 7
201203 4 11 4
201204 6 23 40
I would like to convert each of the columns (of which there are ~50, each with ~100 monthly observations) to a time series format in order to check for seasonality in the data, using the decompose function.
I assumed a for loop using the ts function would be the best way of doing this. I would like to use something along the lines of the loop below, although I realise using a function on the left side of the <- produces an error. Is there a way to dynamically name variables generated by a loop?
for(i in 2:ncol(df)) {
paste(names(df[, i]), "_ts") <- ts(df[ ,i], start = c(2012, 1), end = c(2021,11), frequency = 12)
}
You could try zoo:
test = data.frame(month=c("201201", "201202", "201203", "201204"), abc1=c(1,2,3,4), def2=c(4,6,7,10), xyz3=c(12,15,16,19))
library(zoo)
ZOO =zoo(test[, c("abc1", "def2", "xyz3")], order.by=as.Date(paste0(test$month, "01"), format="%Y%m%d"))
ts(ZOO, frequency=12)
Output:
abc1 def2 xyz3
Jan 1 1 4 12
Feb 1 2 6 15
Mar 1 3 7 16
Apr 1 4 10 19
attr(,"index")
[1] 2012-01-01 2012-02-01 2012-03-01 2012-04-01
Update:
Now with correct frequency.
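On the original question of dynamically naming loop results: a common alternative is to keep the series in a named list instead of creating one variable per column. The following is a minimal base R sketch under that assumption, using a made-up 36-month data frame (decompose() needs at least two full periods); the names ts_list and decomp_list are introduced here for illustration only.

```r
# Made-up monthly data: 36 observations so that decompose() has 3 full years
n <- 36
df <- data.frame(
  month = format(seq(as.Date("2012-01-01"), by = "month", length.out = n), "%Y%m"),
  abc1 = 10 + sin(2 * pi * (1:n) / 12) + (1:n) / 10,
  def2 = 5 + cos(2 * pi * (1:n) / 12)
)

# Start year and month taken from the first "YYYYMM" entry
first_month <- as.character(df$month[1])
start_ym <- as.integer(c(substr(first_month, 1, 4), substr(first_month, 5, 6)))

# One ts object per data column, kept in a list named after the columns
ts_list <- lapply(df[-1], ts, start = start_ym, frequency = 12)

# Decompose every series; results are accessed by column name
decomp_list <- lapply(ts_list, decompose)
seasonal_abc1 <- decomp_list$abc1$seasonal
```

Each element can then be used like a separate variable, e.g. plot(decomp_list$def2).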

Create Time series observations,timestamps and filling up the values

I have cross-sectional data as follows:
transaction_code <- c('A_111','A_222','A_333')
loan_start_date <- c('2016-01-03','2011-01-08','2013-02-13')
loan_maturity_date <- c('2017-01-03','2013-01-08','2015-02-13')
loan_data <- data.frame(cbind(transaction_code,loan_start_date,loan_maturity_date))
Now the dataframe looks like this
>loan_data
transaction_code loan_start_date loan_maturity_date
1 A_111 2016-01-03 2017-01-03
2 A_222 2011-01-08 2013-01-08
3 A_333 2013-02-13 2015-02-13
Now I want to create a monthly time series tracking the time to maturity (in months) of each of the three loans over a period of 48 months. How can I achieve that? The final output should look like the following:
>loan_data
transaction_code loan_start_date loan_maturity_date feb13 march13 april13........
1 A_111 2016-01-03 2017-01-03 46 45 44
2 A_222 2011-01-08 2013-01-08 NA NA NA
3 A_333 2013-02-13 2015-02-13 23 22 21
Here the new columns (for 48 months) represent the time to maturity of each loan as of the respective month.
I would really appreciate your help. Thanks
Here's an approach using tidyverse packages.
# Define the months to use in the right-hand columns.
months <- seq.Date(from = as.Date("2013-02-01"), by = "month", length.out = 48)
library(tidyverse); library(lubridate)
loan_data2 <- loan_data %>%
# Make a row for each combination of original data and the `months` list
crossing(months) %>%
# Format dates as MonYr and make into an ordered factor
mutate(month_name = format(months, "%b%y") %>% fct_reorder(months)) %>%
# Calculate months remaining -- this task is harder than it sounds! This
# approach isn't perfect, but it's hard to accomplish more simply, since
# months are different lengths.
mutate(months_remaining =
round(interval(months, loan_maturity_date) / ddays(1) / 30.5 - 1),
months_remaining = if_else(months_remaining < 0,
NA_real_, months_remaining)) %>%
# Drop the Date format of months now that calcs done
select(-months) %>%
# Spread into wide format
spread(month_name, months_remaining)
Output
loan_data2[,1:6]
# transaction_code loan_start_date loan_maturity_date Feb13 Mar13 Apr13
# 1 A_111 2016-01-03 2017-01-03 46 45 44
# 2 A_222 2011-01-08 2013-01-08 NA NA NA
# 3 A_333 2013-02-13 2015-02-13 23 22 21
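As an alternative to the 30.5-day approximation, the remaining months can be computed with integer calendar arithmetic. The sketch below is base R and assumes one particular convention, chosen because it reproduces the numbers in the question: the calendar-month difference between the maturity date and the column month, minus one, with negative values set to NA. The helper month_index() is hypothetical.

```r
# Only the columns needed for the calculation (values from the question)
loan_data <- data.frame(
  transaction_code = c("A_111", "A_222", "A_333"),
  loan_maturity_date = as.Date(c("2017-01-03", "2013-01-08", "2015-02-13"))
)
months <- seq(as.Date("2013-02-01"), by = "month", length.out = 48)

# Running month number: 12 * year + zero-based month
month_index <- function(d) {
  lt <- as.POSIXlt(d)
  12L * (lt$year + 1900L) + lt$mon
}

# Rows = loans, columns = months; assumed convention: month difference minus one
remaining <- outer(month_index(loan_data$loan_maturity_date),
                   month_index(months), "-") - 1
remaining[remaining < 0] <- NA
colnames(remaining) <- format(months, "%b%y")  # labels depend on the locale

wide <- cbind(loan_data, remaining)
```

Because the result is built as a matrix, no reshaping step is needed to get the wide layout.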

Check differences of various DATE inside one variables R

I want to split a row into several rows when its Date variable contains dates from different years,
and also divide the "Price" column evenly by the number of dates in that row
--> count(";") + 1
Here is the table with the variable that has not yet been split.
# Dataset called df
Price Date
500 2016-01-01
400 2016-01-03;2016-01-09
1000 2016-01-04;2017-09-01;2017-08-10;2018-01-01
25 2016-01-04;2017-09-01
304 2015-01-02
238 2018-01-02;2018-02-02
Desired output
# Target df
Price Date
500 2016-01-01
400 2016-01-03;2016-01-09
250 2016-01-04
250 2017-09-01
250 2017-08-10
250 2018-01-01
12.5 2016-01-04
12.5 2017-09-01
304 2015-01-02
238 2018-01-02;2018-02-02
Once the rows whose variable contains different years have been identified, the operation below
has to be applied (it is just an example).
mutate(Price = ifelse(DIFFERENT_DATE_ROW,
as.numeric(Price) / (str_count(Date,";")+1),
as.numeric(Price)),
Date = ifelse(DIFFERENT_DATE_ROW,
strsplit(as.character(Date),";"),
Date)) %>%
unnest()
I ran into a constraint: I cannot use dplyr's "if_else" because
the "else" branch (no operation) is not accepted. Only ifelse works properly.
How can I detect that one value of the variable contains different years, in order to
trigger the row split and the price calculation?
So far, an operation that splits the elements, like
unlist(lapply(unlist(strsplit(df1$noFDate[8],";")),FUN = year))
cannot solve the problem.
I am a beginner at coding, so please feel free to change any of the operations above, keeping in mind that the real data has over 2 million rows and 50 columns.
This might not be the most efficient one but can be used to get the required answer.
#Get the row indices which we need to separate
inds <- sapply(strsplit(df$Date, ";"), function(x)
#Format the date into year and count number of unique values
#Return TRUE if number of unique values is greater than 1
length(unique(format(as.Date(x), "%Y"))) > 1
)
library(tidyverse)
library(stringr)
#Select those indices
df[inds, ] %>%
# divide the price by number of dates in that row
mutate(Price = Price / (str_count(Date,";") + 1)) %>%
# separate `;` delimited values in separate rows
separate_rows(Date, sep = ";") %>%
# bind the remaining rows as it is
bind_rows(df[!inds,])
# Price Date
#1 250.0 2016-01-04
#2 250.0 2017-09-01
#3 250.0 2017-08-10
#4 250.0 2018-01-01
#5 12.5 2016-01-04
#6 12.5 2017-09-01
#7 500.0 2016-01-01
#8 400.0 2016-01-03;2016-01-09
#9 304.0 2015-01-02
#10 238.0 2018-01-02;2018-02-02
A bit cumbersome but you could do:
d_new = lapply(1:nrow(dat),function(x) {
a = dat[x,]
b = unlist(strsplit(as.character(a$Date),";"))
l = length(b)
if (l==1) check = 0 else check = ifelse(var(as.numeric(strftime(b,"%Y")))==0,0,1)
if (check==0) {
a
} else {
data.frame(Date = b, Price = rep(a$Price / l,l))
}
})
do.call(rbind,d_new)
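Since the real data is large, a vectorised variant that avoids building one data frame per row may be worth trying. This is only a sketch in base R, assuming Date is a character column of ISO dates (so the year is the first four characters) and using a shortened copy of the example data; it keeps the original row order.

```r
# Shortened copy of the example data; Date must be character, not factor
df <- data.frame(
  Price = c(500, 400, 1000, 25),
  Date = c("2016-01-01",
           "2016-01-03;2016-01-09",
           "2016-01-04;2017-09-01;2017-08-10;2018-01-01",
           "2016-01-04;2017-09-01"),
  stringsAsFactors = FALSE
)

parts <- strsplit(df$Date, ";", fixed = TRUE)

# TRUE for rows whose dates span more than one year
multi_year <- vapply(parts, function(x) length(unique(substr(x, 1, 4))) > 1,
                     logical(1))

# Number of output rows per input row (1 when the row is left as is)
n_out <- ifelse(multi_year, lengths(parts), 1L)
row_id <- rep(seq_len(nrow(df)), times = n_out)

# Rows that are not split keep their original ";"-joined string
dates_out <- parts
dates_out[!multi_year] <- as.list(df$Date[!multi_year])

result <- data.frame(
  Price = (df$Price / n_out)[row_id],
  Date = unlist(dates_out),
  stringsAsFactors = FALSE
)
```

Other columns of the real data could be carried along by indexing them with row_id in the same way.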

Finding each time of daily max variable in climate data

I have a large dataset spanning many years with several variables, but the ones I am interested in are wind speed and dateTime. I want to find the time of the max wind speed for every day in the data set. I have hourly data in POSIXct format, with WS as a numeric with occasional NAs. Below is a short data set that should hopefully illustrate my point; my sample dateTime didn't come out as hourly data (by = 60*24 gives 24-minute steps), but it provides enough for a sample.
dateTime <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
as.POSIXct("2011-01-29 23:00:00", tz = "GMT"),
by = 60*24)
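# Draw one value per time stamp so that all columns have the same length
WS <- sample(0:20, length(dateTime), replace = TRUE)
WD <- sample(0:390, length(dateTime), replace = TRUE)
Temp <- sample(0:40, length(dateTime), replace = TRUE)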
df <- data.frame(dateTime,WS,WD,Temp)
df$WS[WS>15] <- NA
I have previously tried creating a new column with just a POSIX date (minus time) to allow for day isolation; however, everything I have tried (aggregate, splitting, xts) has only returned a shortened data frame with date and WS. Aggregate was the only one that didn't do this; however, it gave me 23:00:00 as a constant time, which isn't correct.
I have looked at How to calculate daily means, medians, from weather variables data collected hourly in R?, https://stats.stackexchange.com/questions/7268/how-to-aggregate-by-minute-data-for-a-week-into-hourly-means and others but none have answered this question, or the solutions have not returned an ideal result.
I need to compare the results of this analysis with another data frame, so hence the reason I need the actual time when the max wind speed occurred for each day in the dataset. I have a feeling there is a simple solution, however, this has me frustrated.
A dplyr solution may be:
library(dplyr)
df %>%
mutate(date = as.Date(dateTime)) %>%
left_join(
df %>%
mutate(date = as.Date(dateTime)) %>%
group_by(date) %>%
summarise(max_ws = max(WS, na.rm = TRUE)) %>%
ungroup(),
by = "date"
) %>%
select(-date)
# dateTime WS WD Temp max_ws
# 1 2011-01-01 00:00:00 NA 313 2 15
# 2 2011-01-01 00:24:00 7 376 1 15
# 3 2011-01-01 00:48:00 3 28 28 15
# 4 2011-01-01 01:12:00 15 262 24 15
# 5 2011-01-01 01:36:00 1 149 34 15
# 6 2011-01-01 02:00:00 4 319 33 15
# 7 2011-01-01 02:24:00 15 280 22 15
# 8 2011-01-01 02:48:00 NA 110 23 15
# 9 2011-01-01 03:12:00 12 93 15 15
# 10 2011-01-01 03:36:00 3 5 0 15
Dee asked for: "I want to find the time of the max wind speed for every day in the data set." Other answers have calculated the max(WS) for every day, but not at which hour that occurred.
So I propose the following solution with dplyr:
library(dplyr)
library(lubridate) # provides hour()
set.seed(12345)
dateTime <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
as.POSIXct("2011-01-29 23:00:00", tz = "GMT"),
by = 60*24)
WS <- sample(0:20,1738,rep=TRUE)
WD <- sample(0:390,1738,rep=TRUE)
Temp <- sample(0:40,1738,rep=TRUE)
df <- data.frame(dateTime,WS,WD,Temp)
df$WS[WS>15] <- NA
df %>%
group_by(Date = as.Date(dateTime)) %>%
mutate(Hour = hour(dateTime),
Hour_with_max_ws = Hour[which.max(WS)])
I want to point out that if several hours share the same maximal wind speed (in the example above: 15), only the first hour with max(WS) is shown as the result, even though a wind speed of 15 was reached on that date at hours 0, 3, 4, 21 and 22! So you might need more specific logic.
For the sake of completeness (and because I like the concise code) here is a "one-liner" using data.table:
library(data.table)
setDT(df)[, max.ws := max(WS, na.rm = TRUE), by = as.IDate(dateTime)][]
dateTime WS WD Temp max.ws
1: 2011-01-01 00:00:00 NA 293 22 15
2: 2011-01-01 00:24:00 15 55 14 15
3: 2011-01-01 00:48:00 NA 186 24 15
4: 2011-01-01 01:12:00 4 300 22 15
5: 2011-01-01 01:36:00 0 120 36 15
---
1734: 2011-01-29 21:12:00 12 249 5 15
1735: 2011-01-29 21:36:00 9 282 21 15
1736: 2011-01-29 22:00:00 12 238 6 15
1737: 2011-01-29 22:24:00 10 127 21 15
1738: 2011-01-29 22:48:00 13 297 0 15
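To get one row per day holding the time stamp of the maximum, and optionally every tied time stamp, something along these lines may help. It is a base R sketch on deterministic made-up hourly data (three days, wind speeds generated by a simple formula), not the original data set; daily_max and ties are names introduced for this example.

```r
# Three days of hourly time stamps
dateTime <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
                as.POSIXct("2011-01-03 23:00:00", tz = "GMT"),
                by = "hour")

# Deterministic made-up wind speeds in 0..16; values above 15 become NA
WS <- (seq_along(dateTime) * 7) %% 17
WS[WS > 15] <- NA
df <- data.frame(dateTime, WS)

# Calendar day of each observation, evaluated in the data's time zone
day <- format(df$dateTime, "%Y-%m-%d", tz = "GMT")

# One row per day: the first time stamp at which the daily maximum occurs
# (which.max() skips NA values)
daily_max <- do.call(rbind, lapply(split(df, day), function(d) {
  d[which.max(d$WS), ]
}))

# All rows that reach the daily maximum, in case ties matter
day_max <- ave(df$WS, day, FUN = function(x) max(x, na.rm = TRUE))
ties <- df[!is.na(df$WS) & df$WS == day_max, ]
```

In this toy data the first day reaches its maximum twice, so ties has one more row than daily_max.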

Counting the Occurrences of Regular Time Spans in a Set of Time Intervals R

Given a data set like the one below, I would like to count how many times a particular hour of the day (00:00, 01:00, ...., 22:00, 23:00) falls completely within any of the given intervals.
The date of occurrence doesn't matter. Just the overall count.
### This code is to create a data set similar to the one I am using.
### This is a function I found on here to generate random times
latemail <- function(N, st="2012/01/01", et="2012/12/31") {
st <- as.POSIXct(as.Date(st))
et <- as.POSIXct(as.Date(et))
dt <- as.numeric(difftime(et,st,unit="sec"))
ev <- sort(runif(N, 0, dt))
rt <- st + ev
}
set.seed(123)
startTimes <- latemail(5)
endTimes <- startTimes +18000
my_data <- data.frame(startTimes, endTimes)
> my_data
start end
1 2012-04-14 16:10:44 2012-04-14 21:10:44
2 2012-05-28 23:38:16 2012-05-29 04:38:16
3 2012-10-14 10:33:10 2012-10-14 15:33:10
4 2012-11-17 23:13:56 2012-11-18 04:13:56
5 2012-12-08 22:29:36 2012-12-09 03:29:36
So that hopefully helps give you an idea of what I am working with.
Ideally the output would be a dataset with one variable for the hour, and another for the count of occurrences. Like this
hour count
1 00:00 3
2 01:00 3
3 etc ?
How to do this in different increments (say 15 minutes) would also be great to know.
Thank you!
Here is my attempt. I am sure there are better ways of doing this. Given the comments above, I did the following. First, I extracted the hour using ifelse, rounding it up (for start times) or down (for end times) as you described in your comment. Using transmute, I built a string containing the hours covered by each interval. In some cases, the start hour can be larger than the end hour (when the record crosses midnight). In order to deal with that, I used setdiff(), c(), and toString(). Using separate(), I split the hours into columns. I wanted to use cSplit() from the splitstackshape package, but I got an error message, so I chose separate() here. Once I had all hours separated, I reshaped the data using gather() and finally counted the hours with count(). filter() was employed to remove NA cases. I hope this will help you to some extent.
** Data **
structure(list(startTimes = structure(c(1328621832.79254, 1339672345.94964,
1343434566.9641, 1346743867.55964, 1355550696.37895), class = c("POSIXct",
"POSIXt")), endTimes = structure(c(1328639832.79254, 1339690345.94964,
1343452566.9641, 1346761867.55964, 1355568696.37895), class = c("POSIXct",
"POSIXt"))), .Names = c("startTimes", "endTimes"), row.names = c(NA,
-5L), class = "data.frame")
# startTimes endTimes
#1 2012-02-07 22:37:12 2012-02-08 03:37:12
#2 2012-06-14 20:12:25 2012-06-15 01:12:25
#3 2012-07-28 09:16:06 2012-07-28 14:16:06
#4 2012-09-04 16:31:07 2012-09-04 21:31:07
#5 2012-12-15 14:51:36 2012-12-15 19:51:36
library(dplyr)
library(tidyr)
mutate(my_data, start = ifelse(as.numeric(format(startTimes, "%M")) >= 0 & as.numeric(format(startTimes, "%S")) > 0,
as.numeric(format(startTimes, "%H")) + 1,
as.numeric(format(startTimes, "%H"))),
end = ifelse(as.numeric(format(endTimes, "%M")) >= 0 & as.numeric(format(endTimes, "%S")) > 0,
as.numeric(format(endTimes, "%H")) - 1,
as.numeric(format(endTimes, "%H"))),
start = replace(start, which(start == "24"), 0),
end = replace(end, which(end == "-1"), 23)) %>%
rowwise() %>%
transmute(hour = ifelse(start < end, toString(seq.int(start, end, by = 1)),
toString(c(setdiff(seq(0, 23, by = 1), seq.int(end, start, by = 1)),
start, end)))) %>%
separate(hour, paste("hour", 1:24, sep = "."), ", ", extra = "merge") %>%
gather(foo, hour) %>%
count(hour) %>%
filter(complete.cases(hour))
# hour n
#1 0 2
#2 1 1
#3 10 1
#4 11 1
#5 12 1
#6 13 1
#7 15 1
#8 16 1
#9 17 2
#10 18 2
#11 19 1
#12 2 1
#13 20 1
#14 21 1
#15 22 1
#16 23 2
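For other increments such as 15 minutes, one possible approach is to work on the numeric time stamps directly: round each start up and each end down to the slot size, list the slot starts in between, and tabulate them by time of day. The sketch below is base R; count_slots() is a hypothetical helper, and it assumes the times are interpreted in UTC and that the slot length (in seconds) divides 24 hours evenly.

```r
count_slots <- function(start, end, step = 3600, tz = "UTC") {
  s <- as.numeric(start)
  e <- as.numeric(end)
  # Starts of all slots that lie completely inside each interval
  slot_starts <- as.numeric(unlist(Map(function(a, b) {
    first <- ceiling(a / step) * step        # first slot starting at or after a
    last <- floor(b / step) * step - step    # last slot ending at or before b
    if (first > last) numeric(0) else seq(first, last, by = step)
  }, s, e)))
  # Reduce each slot start to its time of day
  labels <- format(as.POSIXct(slot_starts, origin = "1970-01-01", tz = tz), "%H:%M")
  all_labels <- format(as.POSIXct(seq(0, 86400 - step, by = step),
                                  origin = "1970-01-01", tz = "UTC"), "%H:%M")
  counts <- table(factor(labels, levels = all_labels))
  data.frame(slot = names(counts), count = as.vector(counts))
}

# Two five-hour intervals taken from the example data, read as UTC
starts <- as.POSIXct(c("2012-02-07 22:37:12", "2012-06-14 20:12:25"), tz = "UTC")
ends <- starts + 18000

hourly <- count_slots(starts, ends)               # 24 rows, one per hour
quarter <- count_slots(starts, ends, step = 900)  # 96 rows, one per 15 minutes
```

Slots that never occur are kept with a count of 0, which makes the result easy to plot or join.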

Resources