Missing value imputation for unevenly spaced univariate time series using R

I have the following dataset:
timestamp value
1 90
3 78
6 87
8 NA
12 98
15 100
18 NA
24 88
27 101
As you can see, the gaps between consecutive timestamps are not equally spaced. Is there a way to impute values to replace the NAs using a timestamp-dependent method?
All the packages I have found are only suitable for equally spaced time series...
Thanks!

The zoo R package can handle irregularly / unevenly spaced time series.
First you have to create a zoo time series object. You can either specify numeric indices or use POSIXct timestamps.
Afterwards you can apply an imputation method to this object. zoo's imputation methods are limited, but they also work on irregularly spaced series: you can use linear interpolation (na.approx) or spline interpolation (na.spline), both of which account for the uneven timestamps.
library(zoo)

# First create an unevenly spaced zoo time series object
# (first vector: the values, second vector: your time indices)
zoo_ts <- zoo(c(90, 78, 87, NA, 98, 100, NA, 88, 101), c(1, 3, 6, 8, 12, 15, 18, 24, 27))
# Perform the imputation
na.approx(zoo_ts)
Your zoo object looks like this:
> 1 3 6 8 12 15 18 24 27
> 90 78 87 NA 98 100 NA 88 101
Your imputed series like this afterwards:
> 1 3 6 8 12 15 18 24 27
> 90.00000 78.00000 87.00000 90.66667 98.00000 100.00000 96.00000 88.00000 101.00000
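Spline interpolation works the same way on the uneven index; a quick sketch on the same object (output omitted):
na.spline(zoo_ts)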
When you have timestamps and each observation is only slightly off (e.g. a few seconds) from a regular grid, you could also try to transform the series into a regular time series by mapping your values to the correct regular intervals (only reasonable if the differences are small). By doing this you could also use additional imputation methods, e.g. from the imputeTS package (which only works for regularly spaced data).
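For illustration, a minimal sketch of that regularization idea (the example timestamps and the na_kalman call are my assumptions, not from the original answer; recent imputeTS versions use underscore function names):
library(zoo)
library(imputeTS)

# Hypothetical example: readings meant to be one per minute, but each
# timestamp (in seconds) is a few seconds off the regular grid
irregular <- zoo(c(5.1, NA, 4.8, 5.3, NA, 4.9), c(59, 121, 179, 242, 299, 361))
minute  <- round(index(irregular) / 60)       # snap each timestamp to its minute
regular <- zoo(coredata(irregular), minute)   # now indexed 1, 2, 3, ...
na_kalman(as.ts(regular))                     # any imputeTS method can be used now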

Related

LOCF and NOCB methods for missing data: how to plot data?

I'm working on the following dataset and its missing data:
# A tibble: 27 x 6
id sex d8 d10 d12 d14
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 F 21 20 21.5 23
2 2 F 21 21.5 24 25.5
3 3 NA NA 24 NA 26
4 4 F 23.5 24.5 25 26.5
5 5 F 21.5 23 22.5 23.5
6 6 F 20 21 21 22.5
7 7 F 21.5 22.5 23 25
8 8 F 23 23 23.5 24
9 9 F NA 21 NA 21.5
10 10 F 16.5 19 19 19.5
# ... with 17 more rows
I would like to fill in the missing data via the Last Observation Carried Forward (LOCF) and Next Observation Carried Backward (NOCB) methods, produce a graphical representation plotting the individual profiles over age by sex (highlighting the imputed values), and compute the means and standard errors at each age by sex. Could you suggest a way to set the arguments of the plot() function properly?
Does anyone have a clue about this?
Below is some code that may turn out to be useful, drawn from another dataset as an example.
library(mice)   # provides mdc(), the colour scheme for observed/imputed points

par(mfrow = c(1, 1))
Oz <- airquality$Ozone

# Last Observation Carried Forward: replace each NA with the most
# recent non-missing value (an NA in position 1 stays NA)
locf <- function(x) {
  a <- x[1]
  for (i in 2:length(x)) {
    if (is.na(x[i])) x[i] <- a
    else a <- x[i]
  }
  return(x)
}

Ozi <- locf(Oz)
colvec <- ifelse(is.na(Oz), mdc(2), mdc(1))  # colour imputed vs observed values

### Figure
plot(Ozi[1:80], col = colvec, type = "l", xlab = "Day number", ylab = "Ozone (ppb)")
points(Ozi[1:80], col = colvec, pch = 20, cex = 1)
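For reference, NOCB is just LOCF applied to the reversed vector; a quick sketch reusing the locf() helper above (my addition, not part of the original post):
nocb <- function(x) rev(locf(rev(x)))
Ozn <- nocb(Oz)   # Next Observation Carried Backward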
Next Observation Carried Backward / Last Observation Carried Forward is probably a very bad choice for your data.
These algorithms are usually used for time series data, where carrying the last observation forward can be a good idea: if you think of temperature measurements taken every 10 minutes, the actual outdoor temperature is quite likely to be similar to the temperature 10 minutes ago.
For cross-sectional data (it seems you are looking at persons), the previous person is usually no more similar to the current person than any other random person.
Take a look at the mice R package for your cross-sectional dataset.
It offers far better algorithms for your case than LOCF/NOCB.
Here is an overview of the functions it offers: https://amices.org/mice/reference/index.html
It also includes several plots to assess the imputations. Usually when using mice you create multiple possible imputations (it is worth reading about the technique of multiple imputation), but you can also just produce one imputed dataset with the package.
There are the following functions for visualization of your imputations:
bwplot() (Box-and-whisker plot of observed and imputed data)
densityplot() (Density plot of observed and imputed data)
stripplot() (Stripplot of observed and imputed data)
xyplot() (Scatterplot of observed and imputed data)
Hope this helps a little bit. My advice would be to take a look at this package and then start a new approach with your new knowledge.
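As a starting point, a minimal sketch of a typical mice workflow (using the nhanes example data that ships with mice; the settings here are illustrative, not a recommendation):
library(mice)

imp <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)  # 5 multiple imputations
densityplot(imp)             # compare densities of observed vs imputed data
completed <- complete(imp)   # extract one completed dataset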

Forecast using time and cluster as groups

I'm a relative newbie with R and I'm trying to figure out the R code to generate a table of forecast data that I can export to a CSV for multiple variables grouped by different slices.
My data looks like this:
Time Cluster X1 X2 X3 ...
2018-04-21 A 10 53 23 ...
2018-04-21 B 65 34 79 ...
2018-04-22 A 35 80 76 ...
2018-04-22 B 12 68 34 ...
I'd like to get a forecast by date per cluster for each X value in the table. The end goal is to combine all the forecasted values into a CSV for import into a DB. My initial dataset has 7 different cluster values and about 3 months of daily data. There are about 6 different values that need forecasts. I can do (and have done) this fairly easily in Excel, but the requirement going forward is R to CSV to a DB.
Thanks in advance!
Brandon
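One possible sketch for this kind of grouped forecasting (the column names, the 14-day horizon, and the weekly frequency are my assumptions; auto.arima is just one model choice among many):
library(dplyr)
library(tidyr)
library(forecast)

# df has columns Time, Cluster, X1, X2, X3, ... as in the question
long <- pivot_longer(df, -c(Time, Cluster),
                     names_to = "variable", values_to = "value")

forecasts <- long %>%
  arrange(Time) %>%
  group_by(Cluster, variable) %>%
  group_modify(~ {
    fit <- auto.arima(ts(.x$value, frequency = 7))  # daily data, weekly cycle
    fc  <- forecast(fit, h = 14)                    # 14 days ahead
    data.frame(horizon = 1:14, forecast = as.numeric(fc$mean))
  }) %>%
  ungroup()

write.csv(forecasts, "forecasts.csv", row.names = FALSE)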

'Forward' cumulative sum in dplyr

When examining datasets from longitudinal studies, I commonly get results like this from a dplyr analysis chain from the raw data:
df = data.frame(n_sessions=c(1,2,3,4,5), n_people=c(59,89,30,23,4))
i.e. a count of how many participants have completed a certain number of assessments at this point in time.
Although it is useful to know how many people have completed exactly n sessions, we more often need to know how many have completed at least n sessions. As per the table below, a standard cumulative sum isn't appropriate. What we want are the values in the n_total column, which is a sort of "forwards cumulative sum" of the values in the n_people column: the value in each row should be the sum of itself and all values beyond it, rather than the standard cumulative sum, which is the sum of all values up to and including itself:
n_sessions n_people n_total cumsum
1 59 205 59
2 89 146 148
3 30 57 178
4 23 27 201
5 4 4 205
Generating the cumulative sum is simple:
mutate(df, cumsum = cumsum(n_people))
What would be an expression for generating a "forwards cumulative sum" that could be incorporated in a dplyr analysis chain? I'm guessing that cumsum would need to be applied to n_people after sorting by n_sessions descending, but can't quite get my head around how to get the answer while preserving the original order of the data frame.
You can take a cumulative sum of the reversed vector, then reverse that result. The built-in rev function is helpful here:
mutate(df, rev_cumsum = rev(cumsum(rev(n_people))))
For example, on your data this returns:
n_sessions n_people rev_cumsum
1 1 59 205
2 2 89 146
3 3 30 57
4 4 23 27
5 5 4 4
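An equivalent window formulation avoids the double reversal (my own sketch, not from the original answer): subtract the running total of all earlier rows from the grand total.
library(dplyr)
# assumes rows are sorted by n_sessions ascending, as in the example
mutate(df, n_total = sum(n_people) - cumsum(n_people) + n_people)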

Detection of time-series outliers

I'm working on a university forecasting project. I have a huge database of demand between city pairs, and I know that this dataset is contaminated, but I do not know which data points are obscured. The dataset is a panel that follows demand between city pairs on a monthly basis. Below is a part of the data that I am working with.
CAI.JED CAI.RUH ADD.DXB CAI.IST ALG.IST
2013-01-01 19196 14777 16 1413 12
2013-02-01 19913 8 18203 1026 5
2013-03-01 34242 11751 17836 985 1
2013-04-01 23481 12000 13479 948 27
2013-05-01 24428 16046 16391 954 9
2013-06-01 31791 23479 16571 1 4
2013-07-01 33716 20090 11323 0 5724
2013-08-01 35553 2 11121 0 0
2013-09-01 18746 13423 12119 0 26
2013-10-01 10 12223 10239 0 0
2013-11-01 19 20234 14231 5 2
2013-12-01 15198 1 12132 10 5
The dataset is a combination of two datasets. The people who provided the data told me that in some months only one of the two datasets is working; however, it is not known for which months which specific dataset is available.
Now comes my question: for the next part of the project, I need to get annual demand numbers. However, as I know that the figures are obscured, I would like to remove outliers. What techniques are available in R to do this?
As the data is in time-series format, I tried to use the tsoutliers package (see http://cran.r-project.org/web/packages/tsoutliers/tsoutliers.pdf). However, I could not get this working. Also, I tried the suggestions from https://stats.stackexchange.com/questions/104882/detecting-outliers-in-time-series-ls-ao-tc-using-tsoutliers-package-in-r-how/104946#104946 , but it didn't work.
After knowing what the outliers are, I would like to either replace them (e.g. with the mean for that route), or if too many points are missing, I would like to reject the entire route from the dataset.
I prefer a density-based clustering algorithm such as DBSCAN. If you tune epsilon and the minimum number of samples, you can filter outliers very specifically, and you can use a plot to visualize the result (points with label -1 are the outliers).
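As a concrete illustration, a minimal sketch with the dbscan R package (an assumption on my part, since the answer names no package; note that the -1 noise label is the scikit-learn convention, while R's dbscan labels noise points as cluster 0):
library(dbscan)

# monthly demand for one route (the CAI.JED column from the question)
demand <- c(19196, 19913, 34242, 23481, 24428, 31791,
            33716, 35553, 18746, 10, 19, 15198)

res <- dbscan(scale(demand), eps = 0.5, minPts = 3)  # tune eps and minPts
which(res$cluster == 0)   # indices of noise points, i.e. candidate outliers
plot(demand, pch = 19, col = ifelse(res$cluster == 0, "red", "black"))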

Using tapply on two columns instead of one

I would like to calculate the Gini coefficient of several plots in R using the gini() function from the reldist package.
I have a data frame from which I need to use two columns as input to the gini function.
> head(merged[,c(1,17,29)])
idp c13 w
1 19 126 14.14
2 19 146 14.14
3 19 76 39.29
4 19 74 39.29
5 19 86 39.29
6 19 93 39.29
The gini function takes the values as its first argument (c13 here) and the corresponding weights as its second (w here).
So I need to use the column c13 and w like this:
gini(merged$c13,merged$w)
[1] 0.2959369
The thing is, I want to do this for each plot (idp). I have four thousand different values of idp, with dozens of values of the two other columns for each.
I thought I could do this using tapply(), but I can't pass two columns to tapply:
tapply(list(merged$c13,merged$w), merged$idp, gini)
As you know, this does not work.
So what I would love to get as a result is a data frame like this:
idp Gini
1 19 0.12
2 21 0.45
3 35 0.65
4 65 0.23
Do you have any idea how to do this? Maybe the plyr package?
Thank you for your help!
You can use the function ddply() from the plyr package to calculate the coefficient for each level (in the example data frame, some idp values were changed to 21).
library(plyr)
library(reldist)
ddply(merged,.(idp),summarize, Gini=gini(c13,w))
idp Gini
1 19 0.15307402
2 21 0.05006588
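A dplyr equivalent of the same per-group computation (my sketch, using the same gini() from reldist):
library(dplyr)
library(reldist)

merged %>%
  group_by(idp) %>%
  summarise(Gini = gini(c13, w))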
