R - Gap fill a time series

I am trying to fill in the gaps in one of my time series by merging a full day time series into my original time series. But for some reason I get duplicate entries and all the rest of my data is NA.
My data looks like this:
> head(data)
TIME Water_Temperature
1 2016-08-22 00:00:00 81.000
2 2016-08-22 00:01:00 80.625
3 2016-08-22 00:02:00 85.000
4 2016-08-22 00:03:00 80.437
5 2016-08-22 00:04:00 85.000
6 2016-08-22 00:05:00 80.375
> tail(data)
TIME Water_Temperature
1398 2016-08-22 23:54:00 19.5
1399 2016-08-22 23:55:00 19.5
1400 2016-08-22 23:56:00 19.5
1401 2016-08-22 23:57:00 19.5
1402 2016-08-22 23:58:00 19.5
1403 2016-08-22 23:59:00 19.5
Some minutes in between are missing (1403 rows instead of 1440). I tried to fill them in using:
data.length <- length(data$TIME)
time.min <- data$TIME[1]
time.max <- data$TIME[data.length]
all.dates <- seq(time.min, time.max, by="min")
all.dates.frame <- data.frame(list(TIME=all.dates))
merged.data <- merge(all.dates.frame, data, all=T)
But that gives me 1449 rows instead of 1440. The first nine minutes (00:00 to 00:08) appear twice in the TIME column, and every other value in Water_Temperature is NA. It looks like this:
> merged.data[1:25,]
TIME Water_Temperature
1 2016-08-22 00:00:00 NA
2 2016-08-22 00:00:00 81.000
3 2016-08-22 00:01:00 NA
4 2016-08-22 00:01:00 80.625
5 2016-08-22 00:02:00 NA
6 2016-08-22 00:02:00 85.000
7 2016-08-22 00:03:00 NA
8 2016-08-22 00:03:00 80.437
9 2016-08-22 00:04:00 NA
10 2016-08-22 00:04:00 85.000
11 2016-08-22 00:05:00 NA
12 2016-08-22 00:05:00 80.375
13 2016-08-22 00:06:00 NA
14 2016-08-22 00:06:00 80.812
15 2016-08-22 00:07:00 NA
16 2016-08-22 00:07:00 80.812
17 2016-08-22 00:08:00 NA
18 2016-08-22 00:08:00 80.937
19 2016-08-22 00:09:00 NA
20 2016-08-22 00:10:00 NA
21 2016-08-22 00:11:00 NA
22 2016-08-22 00:12:00 NA
23 2016-08-22 00:13:00 NA
24 2016-08-22 00:14:00 NA
25 2016-08-22 00:15:00 NA
> tail(merged.data)
TIME Water_Temperature
1444 2016-08-22 23:54:00 NA
1445 2016-08-22 23:55:00 NA
1446 2016-08-22 23:56:00 NA
1447 2016-08-22 23:57:00 NA
1448 2016-08-22 23:58:00 NA
1449 2016-08-22 23:59:00 NA
Does anyone have an idea what's going wrong?
EDIT:
I am now using the xts and zoo packages to do the job:
library(xts)
library(zoo)
df1.zoo<-zoo(data[,-1],data[,1])
df2 <- as.data.frame(as.zoo(merge(as.xts(df1.zoo), as.xts(zoo(,seq(start(df1.zoo),end(df1.zoo),by="min"))))))
Very easy and effective!

Instead of merge, use rbind, which gives you an irregular time series without NAs to start with. If you really want a regular time series with a frequency of, say, 1 minute, you can build a time-based sequence as an index, merge it with your data (after using rbind), and fill the resulting NAs with na.locf. Hope this helps.
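The merge-then-fill idea described above could be sketched roughly as follows. This is a minimal base-R illustration on made-up rows, not the answerer's code: the fill step is a base-R stand-in for zoo's na.locf, and it assumes the first row of the merged result holds an observed value (true here because the index starts at the first observed timestamp).

```r
# Toy data: one-minute readings with the 00:02 and 00:03 stamps missing.
data <- data.frame(
  TIME = as.POSIXct(c("2016-08-22 00:00:00", "2016-08-22 00:01:00",
                      "2016-08-22 00:04:00"), tz = "UTC"),
  Water_Temperature = c(81.000, 80.625, 85.000)
)

# Regular one-minute index spanning the observed range.
all.dates <- data.frame(TIME = seq(min(data$TIME), max(data$TIME), by = "min"))

# Left join on TIME: every minute is kept, missing readings become NA.
merged <- merge(all.dates, data, by = "TIME", all.x = TRUE)

# Last observation carried forward: for each row, the index of the most
# recent non-NA reading (base-R stand-in for zoo::na.locf).
last.seen <- cummax(seq_along(merged$Water_Temperature) *
                      !is.na(merged$Water_Temperature))
merged$Filled <- merged$Water_Temperature[last.seen]
```

Both TIME columns are POSIXct in the same time zone here; if the two columns differ in class or time zone, the join keys may not compare equal.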

You can try merging with full_join from the tidyverse.
This works for me with two data frames (daily values) stored in the list my_data and sharing a column named Date.
library(dplyr)
library(purrr)
big_data <- my_data %>%
  reduce(full_join, by = "Date")
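For readers without the tidyverse, a base-R counterpart of the same idea might look like the sketch below. It assumes my_data is a list of data frames that all share a Date column; the three toy data frames are invented for illustration.

```r
# Hypothetical list of daily data frames sharing a Date column.
my_data <- list(
  data.frame(Date = as.Date(c("2020-01-01", "2020-01-02")), a = c(1, 2)),
  data.frame(Date = as.Date(c("2020-01-02", "2020-01-03")), b = c(3, 4)),
  data.frame(Date = as.Date("2020-01-01"), c = 5)
)

# Fold the list with a full outer join on Date (all = TRUE keeps
# unmatched rows from both sides and fills the gaps with NA).
big_data <- Reduce(function(x, y) merge(x, y, by = "Date", all = TRUE),
                   my_data)
```

The result has one row per distinct Date and one column per variable, with NA where a data frame had no value for that day.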

Related

Joining datasets in nearest time R

I am trying to join two data sets on the closest TimeDate (POSIXct format).
Some of the TimeDate values match exactly, while others differ by 5 min.
(df1) //// every 5 min with specific time points
# A tibble: 6 × 3
TimeDate TimeDateAnimal Event
<dttm> <chr> <dbl>
1 2015-03-01 00:55:00 2015-03-01 00:55:00 G 1
**2 2015-03-01 03:40:00 2015-03-01 03:40:00 G 1
3 2015-03-01 03:45:00 2015-03-01 03:45:00 G 1**
4 2015-03-01 13:35:00 2015-03-01 13:35:00 G 1
5 2015-03-01 18:45:00 2015-03-01 18:45:00 G 1
6 2015-03-01 19:10:00 2015-03-01 19:10:00 G 1
(df2) //// every 10 min
# A tibble: 52 × 3
TimeDate TimeDateAnimal Temperature
<dttm> <chr> <dbl>
1 2015-03-01 00:05:00 2015-03-01 00:05:00 G 38.52000
2 2015-03-01 00:15:00 2015-03-01 00:15:00 G 38.65333
3 2015-03-01 00:25:00 2015-03-01 00:25:00 G 38.78667
4 2015-03-01 00:35:00 2015-03-01 00:35:00 G 38.86000
5 2015-03-01 00:45:00 2015-03-01 00:45:00 G 38.92667
6 2015-03-01 00:55:00 2015-03-01 00:55:00 G 38.99333
..
**34 2015-03-01 03:35:00 2015-03-01 03:35:00 G 38.80000
35 2015-03-01 03:45:00 2015-03-01 03:45:00 G 38.80000**
I would like this output:
Merge df:
TimeDate TimeDateAnimal Temperature Event
<dttm> <chr> <dbl> <dbl>
1 2015-03-01 00:05:00 2015-03-01 00:05:00 G 38.52000 NA
2 2015-03-01 00:15:00 2015-03-01 00:15:00 G 38.65333 NA
3 2015-03-01 00:25:00 2015-03-01 00:25:00 G 38.78667 NA
4 2015-03-01 00:35:00 2015-03-01 00:35:00 G 38.86000 NA
5 2015-03-01 00:45:00 2015-03-01 00:45:00 G 38.92667 NA
6 2015-03-01 00:55:00 2015-03-01 00:55:00 G 38.99333 NA
..
**34 2015-03-01 03:35:00 2015-03-01 03:35:00 G 38.80000 1
35 2015-03-01 03:45:00 2015-03-01 03:45:00 G 38.80000 1**
I tried fuzzyjoin and data.table, but I always get extra rows instead of a merge on the nearest TimeDate:
Test<- merge(df1, df2, by = "TimeDate", roll = "nearest",all = T)
#head(Test, n=15)
TimeDate TimeDateAnimal.x Event TimeDateAnimal.y Temperature
10 2015-03-01 00:45:00 <NA> NA 2015-03-01 00:45:00 G 38.92667
11 2015-03-01 00:55:00 2015-03-01 00:55:00 G 1 <NA> NA
12 2015-03-01 00:55:00 <NA> NA 2015-03-01 00:55:00 G 38.99333
Thanks in advance.
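One way to get the desired shape, sketched in base R under stated assumptions: for every event in df1, find the row of df2 whose TimeDate is closest, and write the event into that row. The toy rows are invented and leave out the TimeDateAnimal column. When an event is equally close to two rows, which.min picks the earlier one, and two events that land on the same row overwrite each other.

```r
# Toy stand-ins for the question's tables.
df1 <- data.frame(
  TimeDate = as.POSIXct(c("2015-03-01 00:14:00", "2015-03-01 00:25:00"),
                        tz = "UTC"),
  Event = c(1, 1)
)
df2 <- data.frame(
  TimeDate = as.POSIXct(c("2015-03-01 00:05:00", "2015-03-01 00:15:00",
                          "2015-03-01 00:25:00", "2015-03-01 00:35:00"),
                        tz = "UTC"),
  Temperature = c(38.52, 38.65, 38.79, 38.86)
)

# Work on seconds since the epoch so distances are plain numbers.
t1 <- as.numeric(df1$TimeDate)
t2 <- as.numeric(df2$TimeDate)

# For each event, the index of the nearest df2 row.
nearest <- vapply(t1, function(t) which.min(abs(t2 - t)), integer(1))

# Keep every df2 row; only the matched rows receive an Event.
df2$Event <- NA_real_
df2$Event[nearest] <- df1$Event
```

Because the result is built by filling a column in df2 rather than by merging, no extra rows can appear.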

Time difference calculations considering the midnight time using r

I am working on a problem where I need to calculate the time difference in minutes. I have the time values in hh:mm:ss format in a column (more than 28,000 values).
I have been using the following function to calculate the time difference.
tdiff <- dt[dt, Time_Diff := c(abs(diff(as.numeric(Time))),0.30), Student_ID]
where dt --> is the ordered data table and
0.30 --> 30 minutes assigned to the last activity of the student in a course.
This works, but it is not considering the midnight time.
Thanks to @niko for his help, the midnight problem is solved. However, the 30 minutes that should be assigned to each student's last activity are still not applied. Any help in this direction will be greatly appreciated. Thank you.
The expected output is described below
S_Id Date Time Time_Diff Time_Diff(minutes)
A 10/08/2018 23:49:00 00:01:00 1
A 10/08/2018 23:50:00 00:09:00 9
A 10/08/2018 23:59:00 00:02:00 2
A 10/09/2018 00:01:00 00:09:00 9
A 10/09/2018 00:10:00 08:02:00 482
A 10/09/2018 08:12:00 04:08:00 248
A 10/09/2018 12:20:00 10:01:00 601
A 10/09/2018 22:21:00 01:35:00 95
A 10/09/2018 23:56:00 00:09:00 9
A 10/10/2018 00:05:00 00:05:00 5
A 10/10/2018 00:10:00 00:02:00 2
A 10/10/2018 00:12:00 00:30:00 30
B 10/08/2018 23:49:00 00:01:00 1
B 10/08/2018 23:50:00 00:09:00 9
B 10/08/2018 23:59:00 00:02:00 2
B 10/09/2018 00:01:00 00:09:00 9
B 10/09/2018 00:10:00 08:02:00 482
B 10/09/2018 08:12:00 04:08:00 248
B 10/09/2018 12:20:00 10:01:00 601
B 10/09/2018 22:21:00 01:35:00 95
B 10/09/2018 23:56:00 00:09:00 9
B 10/10/2018 00:05:00 00:05:00 5
B 10/10/2018 00:10:00 00:02:00 2
B 10/10/2018 00:12:00 00:30:00 30
C 10/08/2018 23:49:00 00:01:00 1
C 10/08/2018 23:50:00 00:09:00 9
C 10/08/2018 23:59:00 00:02:00 2
C 10/09/2018 00:01:00 00:09:00 9
C 10/09/2018 00:10:00 08:02:00 482
C 10/09/2018 08:12:00 04:08:00 248
C 10/09/2018 12:20:00 10:01:00 601
C 10/09/2018 22:21:00 01:35:00 95
C 10/09/2018 23:56:00 00:09:00 9
C 10/10/2018 00:05:00 00:05:00 5
C 10/10/2018 00:10:00 00:02:00 2
C 10/10/2018 00:12:00 00:30:00 30
Try converting the date and time to POSIXct:
# dt is your data frame
diff(as.POSIXct(paste(dt$Date, dt$Time), format='%m/%d/%Y %H:%M:%S')) # or '%d/%m/%Y %H:%M:%S'
That should do the trick.
Data
dt <- structure(list(Date = c("10/08/2018", "10/08/2018", "10/08/2018", "10/09/2018", "10/09/2018",
"10/09/2018", "10/09/2018", "10/09/2018", "10/09/2018", "10/10/2018",
"10/10/2018", "10/10/2018"),
Time = c("23:49:00", "23:50:00", "23:59:00", "00:01:00", "00:10:00", "08:12:00",
"12:20:00", "22:21:00", "23:56:00", "00:05:00", "00:10:00", "00:12:00")),
class = "data.frame", row.names = c(NA, -12L))
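For the part of the question that remains open (30 minutes for each student's last activity), one possible base-R sketch is to take the differences per student and append 30 to each group. The S_Id column and the toy rows are assumptions modeled on the expected output, and rows are assumed to be in chronological order within each student.

```r
# Toy data with a student id; two students, activities crossing midnight.
dt <- data.frame(
  S_Id = c("A", "A", "A", "B", "B"),
  Date = c("10/08/2018", "10/08/2018", "10/09/2018",
           "10/08/2018", "10/09/2018"),
  Time = c("23:50:00", "23:59:00", "00:01:00", "23:49:00", "00:10:00")
)

# Combine date and time so that midnight is handled by the calendar.
stamp <- as.POSIXct(paste(dt$Date, dt$Time),
                    format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Minutes since the epoch; differences of these are minutes between rows.
mins <- as.numeric(stamp) / 60

# Per student: gap to the next activity, with 30 for the last activity.
dt$Time_Diff <- ave(mins, dt$S_Id, FUN = function(m) c(diff(m), 30))
```

Here student A gets 9, 2, 30 and student B gets 21, 30; the appended value keeps each group the same length as its input, which ave requires.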

loop for multiple which statements

I have a df in long format with travel data.
The df looks like this:
id from to traveltime Key departuretime arrivaltime (next stop)
1 2 3 00:01:00 301 08:15:00 08:16:00
1 2 3 00:01:00 301 08:30:00 08:31:00
1 2 3 00:01:00 301 08:45:00 08:46:00
2 3 4 00:02:00 301
2 3 4 00:02:00 301
2 3 4 00:02:00 301
1 5 6 00:01:00 302 09:00:00 09:01:00
1 6 7 00:01:00 302 09:01:00 09:02:00
2 7 8 00:01:00 302
Now I want to fill the empty cells. The departure time is always the arrival time at the previous stop, i.e. the previous stop's departure time plus its travel time. So the expected output is:
id from to traveltime Key departuretime arrivaltime (next stop)
1 2 3 00:01:00 301 08:15:00 08:16:00
1 2 3 00:01:00 301 08:30:00 08:31:00
1 2 3 00:01:00 301 08:45:00 08:46:00
2 3 4 00:02:00 301 08:16:00 08:18:00
2 3 4 00:02:00 301 08:31:00 08:33:00
2 3 4 00:02:00 301 08:33:00 08:35:00
1 5 6 00:01:00 302 09:00:00 09:01:00
1 6 7 00:01:00 302 09:01:00 09:02:00
2 7 8 00:01:00 302 09:02:00 09:03:00
I wrote some code that works fine. But I have to adapt the code for every edge in my df.
data$arrivaltime <- data$departuretime + data$traveltime
data$departuretime[which(data$id =="2" & data$Key =="301")]<-data$arrivaltime[which(data$id == "1" & data$Key =="301")]
This works, but it is terribly time consuming, because I would need to adapt this code for every edge.
What I want to do now is automate my code so that I don't have to change the id and Key parameters manually.
I guess that I need to store the Keys and the ids in a list and then build a loop that iterates through the df.
I'm new to R and I don't know how to build such a loop, so I hope that someone has an idea. Thank you in advance!
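A loop of the kind asked for could be sketched as below. It generalizes the two manual lines from the question and rests on several assumptions that the post does not confirm: ids within a Key are consecutive integers, every id in a Key has the same number of rows in the same trip order, and times are stored as minutes after midnight. The toy table is invented.

```r
# Toy table; times are minutes after midnight (495 = 08:15, 540 = 09:00).
trips <- data.frame(
  id            = c(1, 1, 2, 2, 1, 2),
  traveltime    = c(1, 1, 2, 2, 1, 1),
  Key           = c(301, 301, 301, 301, 302, 302),
  departuretime = c(495, 510, NA, NA, 540, NA)
)
trips$arrivaltime <- trips$departuretime + trips$traveltime

for (key in unique(trips$Key)) {
  ids <- sort(unique(trips$id[trips$Key == key]))
  # Walk the stops in order; the first id already has its times.
  for (i in ids[-1]) {
    cur  <- which(trips$Key == key & trips$id == i)
    prev <- which(trips$Key == key & trips$id == i - 1)
    # Departure here is the arrival at the previous stop, trip by trip.
    trips$departuretime[cur] <- trips$arrivaltime[prev]
    trips$arrivaltime[cur]   <- trips$departuretime[cur] + trips$traveltime[cur]
  }
}
```

Processing ids in ascending order matters: each stop reads the arrival times that the previous iteration has just filled in.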

R: Compare data.table and pass variable while respecting key

I have two data.tables:
library(data.table)
original <- data.frame(id = c(rep("RE01",5),rep("RE02",5)),date.time = head(seq.POSIXt(as.POSIXct("2015-11-01 01:00:00"),as.POSIXct("2015-11-05 01:00:00"),60*60*10),10))
compare <- data.frame(id = c("RE01","RE02"),seq = c(1,2),start = as.POSIXct(c("2015-11-01 20:00:00","2015-11-04 08:00:00")),end = as.POSIXct(c("2015-11-02 08:00:00","2015-11-04 20:00:00")))
setDT(original)
setDT(compare)
I would like to check the date in each row of original and see if it lies between the start and end dates of compare whilst respecting the id. If it does, the matching compare$seq should be passed to original as diff.seq. The output should look like this:
original
id date.time diff.seq
1 RE01 2015-11-01 01:00:00 NA
2 RE01 2015-11-01 11:00:00 NA
3 RE01 2015-11-01 21:00:00 1
4 RE01 2015-11-02 07:00:00 1
5 RE01 2015-11-02 17:00:00 NA
6 RE02 2015-11-03 03:00:00 NA
7 RE02 2015-11-03 13:00:00 NA
8 RE02 2015-11-03 23:00:00 NA
9 RE02 2015-11-04 09:00:00 2
10 RE02 2015-11-04 19:00:00 2
I've been reading the manual and SO for hours and trying "on", "by" and so on.. without any success. Can anybody point me in the right direction?
As said in the comments, this is very straightforward using data.table::foverlaps.
You basically have to create an additional column in the original data set in order to set the join boundaries, key the two data sets by the columns you want to join on, and then simply run foverlaps and select the desired columns:
original[, end := date.time]
setkey(original, id, date.time, end)
setkey(compare, id, start, end)
foverlaps(original, compare)[, .(id, date.time, seq)]
# id date.time seq
# 1: RE01 2015-11-01 01:00:00 NA
# 2: RE01 2015-11-01 11:00:00 NA
# 3: RE01 2015-11-01 21:00:00 1
# 4: RE01 2015-11-02 07:00:00 1
# 5: RE01 2015-11-02 17:00:00 NA
# 6: RE02 2015-11-03 03:00:00 NA
# 7: RE02 2015-11-03 13:00:00 NA
# 8: RE02 2015-11-03 23:00:00 NA
# 9: RE02 2015-11-04 09:00:00 2
# 10: RE02 2015-11-04 19:00:00 2
Alternatively, you can run foverlaps the other way around and then update the original data set by reference, selecting only the rows that matched:
indx <- foverlaps(compare, original, which = TRUE)
# look up seq by row index; xid itself only equals seq when seq happens to be 1, 2, ...
original[indx$yid, diff.seq := compare$seq[indx$xid]]
original
# id date.time end diff.seq
# 1: RE01 2015-11-01 01:00:00 2015-11-01 01:00:00 NA
# 2: RE01 2015-11-01 11:00:00 2015-11-01 11:00:00 NA
# 3: RE01 2015-11-01 21:00:00 2015-11-01 21:00:00 1
# 4: RE01 2015-11-02 07:00:00 2015-11-02 07:00:00 1
# 5: RE01 2015-11-02 17:00:00 2015-11-02 17:00:00 NA
# 6: RE02 2015-11-03 03:00:00 2015-11-03 03:00:00 NA
# 7: RE02 2015-11-03 13:00:00 2015-11-03 13:00:00 NA
# 8: RE02 2015-11-03 23:00:00 2015-11-03 23:00:00 NA
# 9: RE02 2015-11-04 09:00:00 2015-11-04 09:00:00 2
# 10: RE02 2015-11-04 19:00:00 2015-11-04 19:00:00 2
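To see what the overlap join computes, the same test can be written in base R. This is only a cross-check sketch, assuming each id has exactly one interval in compare (true for the example data); the rows below are a subset of the question's data with an explicit time zone added.

```r
original <- data.frame(
  id = c("RE01", "RE01", "RE02", "RE02"),
  date.time = as.POSIXct(c("2015-11-01 11:00:00", "2015-11-01 21:00:00",
                           "2015-11-03 23:00:00", "2015-11-04 09:00:00"),
                         tz = "UTC")
)
compare <- data.frame(
  id    = c("RE01", "RE02"),
  seq   = c(1, 2),
  start = as.POSIXct(c("2015-11-01 20:00:00", "2015-11-04 08:00:00"),
                     tz = "UTC"),
  end   = as.POSIXct(c("2015-11-02 08:00:00", "2015-11-04 20:00:00"),
                     tz = "UTC")
)

# Row of compare that belongs to each row of original (one interval per id).
m <- match(original$id, compare$id)

# TRUE where the timestamp falls inside that id's interval.
inside <- original$date.time >= compare$start[m] &
  original$date.time <= compare$end[m]

original$diff.seq <- ifelse(inside, compare$seq[m], NA)
```

With several intervals per id, match would only ever see the first one, which is where foverlaps earns its keep.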

Using Time Diary Data with TraMineR

I am trying to do sequence analysis using time-diary data (American Time Use Survey) using TraMineR in R. I have the data as SPELL data (id, start time, stop time, event) but I receive the following error when trying to convert it to STS or SPS data:
Error in as.matrix.data.frame(subset(data, , 2)) : dims [product 0] do not match the length of object [9]
I believe it has something to do with how I convert my time (stored as character) to a date/time type. I believe TraMineR requires a POSIXlt format?
Here is a snippet of my raw data (trcode is the event)
head(atus.act.short)
tucaseid tustarttim tustoptime trcode
1 2.00701e+13 04:00:00 08:00:00 10101
2 2.00701e+13 08:00:00 08:20:00 110101
3 2.00701e+13 08:20:00 08:50:00 10201
4 2.00701e+13 08:50:00 09:30:00 20102
5 2.00701e+13 09:30:00 09:40:00 180201
6 2.00701e+13 09:40:00 11:40:00 20102
I use strptime to convert the character strings to POSIXlt:
atus.act.short$starttime.new <- strptime(atus.act.short$tustarttim, format="%X")
atus.act.short$stoptime.new <- strptime(atus.act.short$tustoptime, format="%X")
I also cut the ID down to only two digits
atus.act.short$id <- atus.act.short$tucaseid-20070101070000
I end up with a new data frame as follows:
id starttime.new stoptime.new trcode
1 44 2012-08-03 04:00:00 2012-08-03 08:00:00 10101
2 44 2012-08-03 08:00:00 2012-08-03 08:20:00 110101
3 44 2012-08-03 08:20:00 2012-08-03 08:50:00 10201
4 44 2012-08-03 08:50:00 2012-08-03 09:30:00 20102
5 44 2012-08-03 09:30:00 2012-08-03 09:40:00 180201
6 44 2012-08-03 09:40:00 2012-08-03 11:40:00 20102
7 44 2012-08-03 11:40:00 2012-08-03 11:50:00 180201
8 44 2012-08-03 11:50:00 2012-08-03 12:05:00 20102
9 44 2012-08-03 12:05:00 2012-08-03 13:05:00 120303
10 44 2012-08-03 13:05:00 2012-08-03 13:20:00 180704
11 44 2012-08-03 13:20:00 2012-08-03 15:20:00 70104
12 44 2012-08-03 15:20:00 2012-08-03 15:35:00 180704
13 44 2012-08-03 15:35:00 2012-08-03 17:00:00 120303
14 44 2012-08-03 17:00:00 2012-08-03 17:20:00 180701
15 44 2012-08-03 17:20:00 2012-08-03 17:25:00 180701
16 44 2012-08-03 17:25:00 2012-08-03 17:55:00 70101
17 44 2012-08-03 17:55:00 2012-08-03 18:00:00 181203
18 44 2012-08-03 18:00:00 2012-08-03 19:00:00 120303
19 44 2012-08-03 19:00:00 2012-08-03 19:30:00 110101
20 44 2012-08-03 19:30:00 2012-08-03 21:30:00 120303
21 44 2012-08-03 21:30:00 2012-08-03 23:00:00 10101
22 44 2012-08-03 23:00:00 2012-08-03 23:03:00 10201
26 48 2012-08-03 06:45:00 2012-08-03 08:15:00 10201
27 48 2012-08-03 08:15:00 2012-08-03 08:45:00 180209
28 48 2012-08-03 08:45:00 2012-08-03 09:00:00 20902
29 48 2012-08-03 09:00:00 2012-08-03 11:00:00 50101
30 48 2012-08-03 11:00:00 2012-08-03 11:45:00 120312
Then I try to create a sequence object [using library(TraMineR)]
atus.seq <- seqdef(atus.act.short, informat = "SPELL", id="id")
And I get the following error:
Error in as.matrix.data.frame(subset(data, , 2)) : dims [product 0] do not match the length of object [9]
Thoughts?
I've managed to work around this by converting the times to minutes (following another question on Stack Overflow), making the status code a character (as.character), using seqformat, and assigning it to a time axis. The new code reads:
atus.seq2 <- seqformat(atus.act.short2, id="id", from="SPELL", to="STS", begin = "startmin", end = "stopmin", status="trcode", process = FALSE)
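The conversion to minutes is not shown in the post. A minimal base-R sketch of one way it could be done follows; the helper to_minutes and the toy rows are illustrative assumptions, while startmin, stopmin, and trcode are the column names used in the seqformat call. Spells that cross midnight are not handled here.

```r
# Hypothetical helper: "hh:mm:ss" strings to minutes after midnight.
to_minutes <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) {
    as.numeric(p[1]) * 60 + as.numeric(p[2]) + as.numeric(p[3]) / 60
  }, numeric(1))
}

# Toy rows shaped like the question's spell data.
atus.act.short2 <- data.frame(
  id         = c(44, 44, 44),
  tustarttim = c("04:00:00", "08:00:00", "08:20:00"),
  tustoptime = c("08:00:00", "08:20:00", "08:50:00"),
  trcode     = c(10101, 110101, 10201),
  stringsAsFactors = FALSE
)

atus.act.short2$startmin <- to_minutes(atus.act.short2$tustarttim)
atus.act.short2$stopmin  <- to_minutes(atus.act.short2$tustoptime)

# The status code is treated as a label, not a number.
atus.act.short2$trcode <- as.character(atus.act.short2$trcode)
```

Plain numeric begin and end columns avoid the date component that strptime attaches when only a time of day is parsed.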
