Forecast using time and cluster as groups - r

I'm a relative newbie with R and I'm trying to figure out the R code to generate a table of forecast data that I can export to a CSV for multiple variables grouped by different slices.
My data looks like this:
Time        Cluster  X1  X2  X3  ...
2018-04-21  A        10  53  23  ...
2018-04-21  B        65  34  79  ...
2018-04-22  A        35  80  76  ...
2018-04-22  B        12  68  34  ...
I'd like to get a forecast by date per cluster for each X value in the table. The end goal is to combine all the forecasted values into a CSV for import into a DB. My initial dataset has 7 different cluster values and about 3 months of daily data. There are about 6 different values that need forecasts. I can (and have) done this fairly easily in Excel, but the requirement going forward is R to a CSV to a DB.
Thanks in advance!
Brandon~
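One way to approach this in base R (no extra packages) is to loop over clusters and measurement columns, fit a model per series, and bind the per-cluster forecasts into one data frame for write.csv(). Everything below is a hedged sketch: the toy data, the AR(1) model via stats::arima, and the 7-day horizon are assumptions for illustration, not details from the question.

```r
set.seed(42)
# Toy data standing in for the real table (values invented)
df <- data.frame(
  Time    = rep(seq(as.Date("2018-04-21"), by = "day", length.out = 30), each = 2),
  Cluster = rep(c("A", "B"), times = 30),
  X1      = rnorm(60, 50, 10),
  X2      = rnorm(60, 60, 10)
)

results <- list()
for (cl in unique(df$Cluster)) {
  sub <- df[df$Cluster == cl, ]
  for (var in c("X1", "X2")) {
    fit <- arima(sub[[var]], order = c(1, 0, 0))  # placeholder AR(1) model
    fc  <- predict(fit, n.ahead = 7)              # assumed 7-day horizon
    results[[paste(cl, var)]] <- data.frame(
      Cluster  = cl,
      Variable = var,
      Day      = max(sub$Time) + 1:7,
      Forecast = as.numeric(fc$pred)
    )
  }
}
out <- do.call(rbind, results)
write.csv(out, "forecasts.csv", row.names = FALSE)
```

The forecast package's auto.arima() would be a natural upgrade over the hand-picked AR(1) order, but the loop-and-rbind shape stays the same.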

Related

missing value imputation for unevenly spaced univariate time series using R

I have the following dataset:
timestamp  value
1          90
3          78
6          87
8          NA
12         98
15         100
18         NA
24         88
27         101
As you can see, the gaps between consecutive timestamps are not equi-spaced. Is there a way to impute values to replace the NAs using a timestamp-dependent method?
All packages I found are only suitable for equi-spaced time series...
Thanks!
The zoo R package can be used to handle irregular spaced / unevenly spaced time series.
First you have to create a zoo time series object; you can either specify numeric indices or use POSIXct timestamps.
Afterwards you can apply an imputation method to this object. zoo's imputation methods are limited, but they also work on irregularly spaced time series: you can use linear interpolation (na.approx) or spline interpolation (na.spline), both of which account for the uneven time stamps.
# First create an unevenly spaced zoo time series object
# (first vector holds the values, second the indices)
library(zoo)
zoo_ts <- zoo(c(90, 78, 87, NA, 98, 100, NA, 88, 101),
              c(1, 3, 6, 8, 12, 15, 18, 24, 27))
# Perform the imputation
na.approx(zoo_ts)
Your zoo object looks like this:
> 1 3 6 8 12 15 18 24 27
> 90 78 87 NA 98 100 NA 88 101
Your imputed series like this afterwards:
> 1 3 6 8 12 15 18 24 27
> 90.00000 78.00000 87.00000 90.66667 98.00000 100.00000 96.00000 88.00000 101.00000
When you have time stamps and the series is only slightly off (a few seconds per stamp), you could also try to transform the series into a regular time series by mapping your values onto the correct regular intervals (only reasonable if the differences are small). This would also let you use additional imputation methods, e.g. from the imputeTS package (which only works for regularly spaced data).
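The snap-to-a-regular-grid idea can be sketched with zoo itself; the 3-unit grid spacing below is an assumption chosen to fit the toy timestamps, not a general rule.

```r
library(zoo)

# A short irregular series (subset of the example above)
irregular <- zoo(c(90, 78, 87, NA, 98), c(1, 3, 6, 8, 12))

# Snap each timestamp to the nearest multiple of 3 (the assumed grid spacing)
snapped <- zoo(coredata(irregular), round(index(irregular) / 3) * 3)

# Merge onto a full regular index; missing grid points become NA
grid    <- zoo(, seq(0, 12, by = 3))
regular <- merge(snapped, grid)

# Any regular-series method now applies, e.g. na.approx() here,
# or imputeTS functions such as na_interpolation() on the regular series
na.approx(regular, na.rm = FALSE)
```

Whether snapping is acceptable depends entirely on how far the real timestamps drift from the grid; if the drift is large, the irregular zoo methods above are safer.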

LOCF and NOCF methods for missing data: how to plot data?

I'm working on the following dataset and its missing data:
# A tibble: 27 x 6
id sex d8 d10 d12 d14
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 F 21 20 21.5 23
2 2 F 21 21.5 24 25.5
3 3 NA NA 24 NA 26
4 4 F 23.5 24.5 25 26.5
5 5 F 21.5 23 22.5 23.5
6 6 F 20 21 21 22.5
7 7 F 21.5 22.5 23 25
8 8 F 23 23 23.5 24
9 9 F NA 21 NA 21.5
10 10 F 16.5 19 19 19.5
# ... with 17 more rows
I would like to fill in the missing data via the Last Observation Carried Forward (LOCF) and Next Observation Carried Backward (NOCB) methods, and also produce a graphic representation: plotting the individual profiles over age by sex, highlighting the imputed values, and computing the means and standard errors at each age by sex. Can you suggest a way to set the arguments of the plot() function properly?
Does anyone have a clue about this?
Below is some code, in case it turns out useful, drawn from another dataset as an example.
library(mice)  # provides mdc(), the standard observed/imputed color pair
par(mfrow = c(1, 1))
Oz <- airquality$Ozone
locf <- function(x) {
  a <- x[1]
  for (i in 2:length(x)) {
    if (is.na(x[i])) x[i] <- a
    else a <- x[i]
  }
  return(x)
}
Ozi <- locf(Oz)
colvec <- ifelse(is.na(Oz), mdc(2), mdc(1))
### Figure
plot(Ozi[1:80], col = colvec, type = "l", xlab = "Day number", ylab = "Ozone (ppb)")
points(Ozi[1:80], col = colvec, pch = 20, cex = 1)
Next Observation Carried Backward / Last Observation Carried Forward is probably a very bad choice for your data.
These algorithms are usually used for time series data, where carrying the last observation forward can be a good idea: if you think of 10-minute temperature measurements, the current outdoor temperature will quite likely be similar to the temperature 10 minutes ago.
For cross-sectional data (it seems you are looking at persons), the previous person is usually no more similar to the current person than any other random person.
Take a look at the mice R package for your cross-sectional dataset.
It offers way better algorithms for your case than locf/nocb.
Here is an overview of the functions it offers: https://amices.org/mice/reference/index.html
It also includes different plots to assess the imputations.
Usually when using mice you create multiple possible imputations (it is worth reading about the technique of multiple imputation), but you can also just produce one imputed dataset with the package.
There are the following functions for visualization of your imputations:
bwplot() (Box-and-whisker plot of observed and imputed data)
densityplot() (Density plot of observed and imputed data)
stripplot() (Stripplot of observed and imputed data)
xyplot() (Scatterplot of observed and imputed data)
Hope this helps a little. My advice would be to take a look at this package and then start a new approach with your new knowledge.
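A typical mice workflow might look like the sketch below; the nhanes dataset ships with mice, and the choice of m = 5 imputations with the pmm method mirrors the package defaults rather than anything specific to the asker's data.

```r
library(mice)

data(nhanes)                 # small example dataset bundled with mice
# Create 5 imputed datasets with predictive mean matching
imp <- mice(nhanes, m = 5, method = "pmm", seed = 1, printFlag = FALSE)

completed <- complete(imp, 1)  # extract the first completed dataset

# Diagnostic plots comparing observed vs. imputed values
densityplot(imp)
stripplot(imp, pch = 20)
```

For the original question's by-sex summaries, the completed dataset(s) can then be aggregated like any ordinary data frame.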

Sum-product in R for specific conditions

I'm looking to do a sumproduct in R as we do in Excel.
It's a little challenging, as I have to apply some logical conditions at the same time.
The Excel formula looks like this:
SUMPRODUCT(--(ID=A2),--(INDIRECT(A1)<>"-"),INDIRECT(B1),C1)
Here ID, A1, B1 are named ranges on another sheet of the same workbook.
ID  $   Quantity
1   23  34
2   4   55
3   NA  6
4   6   45
5   7   NA
6   8   NA
I want logical operators because some values are NA and I don't want to take them into consideration. I want this process to be automated without much manual work.
I've done this up to a point using dplyr, but it's not giving satisfactory results.
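In base R, the `--(...)` coercion tricks of SUMPRODUCT map naturally onto a logical mask. The sketch below mirrors the small table above (with `$` renamed to a valid column name); the condition `ID <= 4` is invented for illustration, since the real named-range conditions aren't shown.

```r
# Toy data mirroring the table above; "Dollar" stands in for the "$" column
d <- data.frame(
  ID       = 1:6,
  Dollar   = c(23, 4, NA, 6, 7, 8),
  Quantity = c(34, 55, 6, 45, NA, NA)
)

# Logical conditions replace Excel's --(...) coercion; NA rows are dropped
keep <- d$ID <= 4 & !is.na(d$Dollar) & !is.na(d$Quantity)

sum(d$Dollar[keep] * d$Quantity[keep])
# → 1272  (23*34 + 4*55 + 6*45)
```

The same pattern works inside dplyr::filter()/summarise() if the conditions need to vary by group.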

Transposing data frame based on factor column

Let's assume I have a dataframe in the following format, obtained from a .csv file:
Measurement Config Value
--------------------------- _
Time A 10 |
Object A 20 | Run 1
Nodes A 30 _|
Time A 8 |
Object A 18 | Run 2
Nodes A 29 _|
Time B 9 |
Object B 20 | Run 3
Nodes B 35 _|
...
There are a fixed number of Measurements that are taken during each run, and each run is run with a given Config.
The Measurements per run are fixed (e.g., every run consists of a Time, an Object, and a Nodes measurement in the example above), but there can be multiple runs for a single Config (e.g., Config A was run two times in the example above, B only once).
My primary goal is to plot correlations (scatter plots) between two of those measurement types, e.g., plot Objects (x-axis) against Nodes (y-axis) and highlight different Configs (color)
I thought that this could be best achieved if the dataframe is in the following format:
Config Time Objects Nodes
--------------------------
A 10 20 30 <- Run 1
A 8 18 29 <- Run 2
B 9 20 35 <- Run 3
I.e., creating the columns based on the factor-values of the Measurement-column, and assigning the respective Value-value to the cells.
Is there an "easy" way in R to achieve that?
First create a run variable:
# option 1:
d$run <- ceiling(seq_along(d$Measurement)/3)
# option 2:
d$run <- 1 + (seq_along(d$Config)-1) %/% 3
Then you reshape to wide format with the dcast function from reshape2 or data.table:
reshape2::dcast(d, Config + run ~ Measurement, value.var = 'Value')
you will then get:
Config run Nodes Object Time
1 A 1 30 20 10
2 A 2 29 18 8
3 B 3 35 20 9
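For completeness, the same reshape can also be written with tidyr's pivot_wider(), which has largely superseded reshape2 in the tidyverse. The sketch below builds the example data from scratch and reuses the run-numbering trick above:

```r
library(tidyr)

# Reconstruct the example long-format data
d <- data.frame(
  Measurement = rep(c("Time", "Object", "Nodes"), 3),
  Config      = rep(c("A", "A", "B"), each = 3),
  Value       = c(10, 20, 30, 8, 18, 29, 9, 20, 35)
)
d$run <- ceiling(seq_along(d$Measurement) / 3)  # one run per block of 3 rows

# Spread Measurement into columns, keeping Config and run as identifiers
wide <- pivot_wider(d, names_from = Measurement, values_from = Value)
wide
```

The result has one row per (Config, run) pair with Time, Object, and Nodes as columns, matching the dcast output above.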

Using tapply on two columns instead of one

I would like to calculate the Gini coefficient of several plots with R using the gini() function from the package reldist.
I have a data frame from which I need to use two columns as input to the gini function.
> head(merged[,c(1,17,29)])
idp c13 w
1 19 126 14.14
2 19 146 14.14
3 19 76 39.29
4 19 74 39.29
5 19 86 39.29
6 19 93 39.29
The gini function uses the first elements for calculation (c13 here) and the second elements are the weights (w here) corresponding to each element from c13.
So I need to use the column c13 and w like this:
gini(merged$c13,merged$w)
[1] 0.2959369
The thing is, I want to do this for each plot (idp). I have about 4,000 different values of idp, with dozens of values of the two other columns for each.
I thought I could do this using tapply(), but tapply() won't accept two columns as input:
tapply(list(merged$c13,merged$w), merged$idp, gini)
As you know, this does not work.
So what I would love to get as a result is a data frame like this:
idp Gini
1 19 0.12
2 21 0.45
3 35 0.65
4 65 0.23
Do you have any idea of how to do this?? Maybe the plyr package?
Thank you for your help!
You can use the function ddply() from the plyr package to calculate the coefficient for each level (in the example data frame some idp values were changed to 21).
library(plyr)
library(reldist)
ddply(merged,.(idp),summarize, Gini=gini(c13,w))
idp Gini
1 19 0.15307402
2 21 0.05006588
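plyr has since been superseded by dplyr, where the equivalent per-idp computation is a group_by()/summarise() pair. A self-contained sketch with invented toy data standing in for the asker's `merged`:

```r
library(dplyr)
library(reldist)

# Toy data standing in for `merged` (values invented for illustration)
merged <- data.frame(
  idp = rep(c(19, 21), each = 3),
  c13 = c(126, 146, 76, 74, 86, 93),
  w   = c(14.14, 14.14, 39.29, 39.29, 39.29, 39.29)
)

result <- merged %>%
  group_by(idp) %>%
  summarise(Gini = gini(c13, w))  # gini(values, weights) from reldist
result
```

This returns one row per idp with its weighted Gini coefficient, the same shape as the ddply() output above.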
