I'm working on a university forecasting project. I have a huge database of demand between pairs of cities. I know that this dataset is contaminated, but I do not know which data points are affected. It is a panel dataset that follows demand between city pairs on a monthly basis. Below is part of the data that I am working with.
            CAI.JED  CAI.RUH  ADD.DXB  CAI.IST  ALG.IST
2013-01-01    19196    14777       16     1413       12
2013-02-01    19913        8    18203     1026        5
2013-03-01    34242    11751    17836      985        1
2013-04-01    23481    12000    13479      948       27
2013-05-01    24428    16046    16391      954        9
2013-06-01    31791    23479    16571        1        4
2013-07-01    33716    20090    11323        0     5724
2013-08-01    35553        2    11121        0        0
2013-09-01    18746    13423    12119        0       26
2013-10-01       10    12223    10239        0        0
2013-11-01       19    20234    14231        5        2
2013-12-01    15198        1    12132       10        5
The dataset is a combination of two datasets. The people who provided the data told me that in some months only one of the two datasets was working. However, it is not known in which months this happened, or which of the two datasets was available in those months.
Now comes my question: for the next part of the project, I need to get annual demand numbers. However, as I know that the figures are obscured, I would like to remove outliers. What techniques are available in R to do this?
As the data is in time-series format, I tried to use the tsoutliers package (see http://cran.r-project.org/web/packages/tsoutliers/tsoutliers.pdf). However, I could not get this working. Also, I tried the suggestions from https://stats.stackexchange.com/questions/104882/detecting-outliers-in-time-series-ls-ao-tc-using-tsoutliers-package-in-r-how/104946#104946 , but it didn't work.
After knowing what the outliers are, I would like to either replace them (e.g. with the mean for that route), or if too many points are missing, I would like to reject the entire route from the dataset.
I prefer a density-based clustering algorithm such as DBSCAN.
If you tune epsilon and the minimum number of samples, you can filter outliers very specifically.
Use a plot to visualize the result (the points with label -1 are the outliers).
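If you want something simpler than DBSCAN to start with, here is a minimal base-R sketch. It is not DBSCAN but a robust median/MAD rule on the log scale, shown on the CAI.JED column from the question. The function names `flag_outliers` and `clean_route`, the cutoff of 3.5 and the 25% rejection limit are all assumptions for illustration, not recommendations.

```r
# Monthly demand for one route, copied from the CAI.JED column in the question.
cai_jed <- c(19196, 19913, 34242, 23481, 24428, 31791,
             33716, 35553, 18746, 10, 19, 15198)

# Flag points whose log-scale distance from the route median is large
# relative to the MAD. The cutoff (3.5) is an assumed, tunable threshold.
flag_outliers <- function(x, cutoff = 3.5) {
  z <- log1p(x)                      # log1p copes with months that report 0
  spread <- mad(z)
  if (spread == 0) return(rep(FALSE, length(x)))
  abs(z - median(z)) / spread > cutoff
}

# Replace flagged points with the mean of the remaining points, or return
# NULL (reject the route) when too many points are flagged.
# max_share (0.25) is an assumed limit.
clean_route <- function(x, max_share = 0.25) {
  bad <- flag_outliers(x)
  if (mean(bad) > max_share) return(NULL)
  x[bad] <- mean(x[!bad])
  x
}

flagged <- flag_outliers(cai_jed)    # TRUE for 2013-10 and 2013-11
cleaned <- clean_route(cai_jed)
annual_demand <- sum(cleaned)
```

On this column the rule flags only the two near-zero months (10 and 19). A route with strong seasonality or many zero months may need a different cutoff or a different method, so check the flags against a plot before trusting them.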
I am fairly new to survival analysis and I apologize if this is a trivial question, but I wasn't able to find any solution to my problem.
I'm trying to find a good model for predicting whether and when a contract for a specific product (identified by the ID column) will be bought, i.e. a time-to-event prediction. I am mostly interested in the probability that the event will occur in the next 3 months. However, my data is essentially a monthly time series. A sample of the dataset looks like this:
ID  Time     Number of assistance calls  Number of product malfunctions  Time to fix  Contract bought
1   2012-01  0                           0                               NA           0
1   2012-02  3                           1                               37.124       0
1   2012-03  2                           0                               NA           0
1   2012-04  0                           0                               NA           1
2   2012-03  1                           0                               NA           0
2   2012-04  0                           0                               NA           0
Here's what I struggle with. I could use a survival analysis model, e.g. a Cox proportional hazards model, which is able to deal with time-dependent variables, but in that case it wouldn't be able to predict [1]. I could also summarize the data for each ID, but that would mean losing some of the information contained in the data, e.g. a malfunction could occur 1, 2 or 3 months before the event.
Is there a better way to approach this?
Thank you very much for any tips!
Sources:
[1] https://www.annualreviews.org/doi/10.1146/annurev.publhealth.20.1.145
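One possible direction, sketched here under stated assumptions, is to keep the monthly rows and add lagged covariates, so that a malfunction 1 or 2 months earlier stays visible to whatever model is fitted on the person-month rows. The helper names `lag_by_id` and `prob_within` are hypothetical, the rows are assumed to be sorted by month with no gaps within an ID, and the monthly probabilities passed to `prob_within` are made-up numbers, not model output.

```r
# Person-month rows for the two IDs shown in the sample above.
contracts <- data.frame(
  id           = c(1, 1, 1, 1, 2, 2),
  month        = c("2012-01", "2012-02", "2012-03", "2012-04", "2012-03", "2012-04"),
  malfunctions = c(0, 1, 0, 0, 0, 0),
  bought       = c(0, 0, 0, 1, 0, 0)
)

# Shift a column by k rows within each ID (assumes one row per month,
# sorted by month, no missing months).
lag_by_id <- function(x, id, k) {
  ave(x, id, FUN = function(v) c(rep(NA, min(k, length(v))), head(v, -k)))
}

contracts$malf_lag1 <- lag_by_id(contracts$malfunctions, contracts$id, 1)
contracts$malf_lag2 <- lag_by_id(contracts$malfunctions, contracts$id, 2)

# If a model yields per-month event probabilities h_1, h_2, h_3 (conditional
# on no event so far), the probability of an event in at least one of those
# months is 1 - (1 - h_1)(1 - h_2)(1 - h_3).
prob_within <- function(h) 1 - prod(1 - h)
p3 <- prob_within(c(0.1, 0.1, 0.1))   # illustrative inputs only
```

The lagged columns could then feed a model on the monthly rows, for example a logistic regression on `bought`; whether that suits this data is something to verify rather than assume.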
I have the following dataset:
timestamp value
1 90
3 78
6 87
8 NA
12 98
15 100
18 NA
24 88
27 101
As you can see, the consecutive timestamps are not equally spaced. Is there a way to impute values to replace the NAs using a timestamp-dependent method?
All packages I found are only suitable for equi-spaced time series...
Thanks!
The zoo R package can be used to handle irregularly (unevenly) spaced time series.
First you have to create a zoo object. You can either specify indices or use POSIXct timestamps.
Afterwards you can use an imputation method on this object. zoo's imputation methods are limited, but they also work on irregularly spaced time series. You can use linear interpolation (na.approx) or spline interpolation (na.spline), both of which account for the uneven timestamps.
library(zoo)

# First create an unevenly spaced zoo time series object
# (first vector: the values, second vector: your indices)
zoo_ts <- zoo(c(90, 78, 87, NA, 98, 100, NA, 88, 101), c(1, 3, 6, 8, 12, 15, 18, 24, 27))
# Perform the imputation (linear interpolation over the indices)
na.approx(zoo_ts)
Your zoo object looks like this:
> 1 3 6 8 12 15 18 24 27
> 90 78 87 NA 98 100 NA 88 101
Your imputed series looks like this afterwards:
> 1 3 6 8 12 15 18 24 27
> 90.00000 78.00000 87.00000 90.66667 98.00000 100.00000 96.00000 88.00000 101.00000
When you have timestamps and each one is only slightly off (e.g. by a few seconds), you could also try to transform the series into a regular time series by mapping your values to the correct regular intervals. This is only reasonable if the differences are small. By doing this you could also use additional imputation methods, e.g. from the imputeTS package (which only works for regularly spaced data).
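To make the role of the timestamps explicit, the same linear interpolation can be sketched in base R with `approx()` from the stats package. This is only an illustration of what timestamp-aware linear interpolation computes; the variable names are my own.

```r
timestamp <- c(1, 3, 6, 8, 12, 15, 18, 24, 27)
value     <- c(90, 78, 87, NA, 98, 100, NA, 88, 101)

# Interpolate only at the timestamps where the value is missing,
# using the observed (timestamp, value) pairs as support points.
known  <- !is.na(value)
filled <- value
filled[!known] <- approx(x = timestamp[known], y = value[known],
                         xout = timestamp[!known])$y
filled
```

The gap at timestamp 8 sits one third of the way from 6 to 12, so it receives 87 + (98 - 87) / 3, and the gap at 18 sits one third of the way from 15 to 24. Both weights come from the timestamps, not from the row positions.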
I am trying to implement the data.table method described here: Calculating grouped variance from a frequency table in R.
I can successfully replicate their example. But when I apply it to my own data, nothing seems to happen. In particular, output is this:
table <- data.frame(districts,proportions,populations)
table<-setDT(table)
districts proportions populations
1: 24 0.8270270 1269
2: 26 0.8867925 1679
3: 12 0.9136691 510
4: 27 0.4220532 3274
5: 20 0.5457650 3644
---
8937: 1 0.7798072 3444
8938: 1 0.6080247 6128
8939: 1 0.4655172 4335
8940: 1 0.4813200 4297
8941: 1 0.7690167 3906
setDT(table)[, list(GroupMedian=as.double(median(rep(proportions, populations))),
TotalCount=sum(populations)) , by = districts]
print(table)
##Same output as above###
I have no idea what's going on, even after spending a lot of time on it.
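One thing worth checking: a grouped `[ , list(...), by = ...]` call on a data.table returns a new data.table and leaves the original untouched, so printing the original afterwards shows the same rows as before. A minimal sketch with made-up numbers (the values and the names `dt` and `result` are hypothetical):

```r
library(data.table)

# Small made-up frequency table: each row is a proportion observed
# 'populations' times within a district.
dt <- data.table(
  districts   = c(1, 1, 2),
  proportions = c(0.2, 0.8, 0.5),
  populations = c(1, 3, 2)
)

# The grouped summary is returned as a NEW object; store it to keep it.
result <- dt[, list(GroupMedian = as.double(median(rep(proportions, populations))),
                    TotalCount  = sum(populations)),
             by = districts]

print(result)  # one row per district
print(dt)      # still the original three rows
```

Only `:=` modifies a data.table in place; a summary with fewer rows than the input has to be assigned to a name or printed directly.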
I have loaded two datasets as data.frames, named DF1 and DF2. Both have the columns time and area. DF1 has more rows than DF2, i.e. more time points (or data points). The merge function would allow me to combine the area columns of the two datasets with by="time", but the time points do not match. round isn't useful here (it is too coarse and produces duplicates).
What I actually want to do is run a two-sample wilcox.test (because the data don't follow a normal distribution), which doesn't allow for vectors of different length (afaik).
> head(DF1)
timesteps area time
1 0 1030 40.00
2 100 1031 40.11
3 200 1039 40.22
4 300 1046 40.32
5 400 1053 40.43
6 500 1061 40.54
> head(DF2)
time area
1 33.83506 952.7843
2 43.31922 935.7430
3 47.95656 1528.4501
4 52.78808 2400.7030
5 67.29044 5699.4736
6 72.12320 8277.1240
Why not just use
wilcox.test(DF1$time, DF2$time)
or area if that is the desired test.
The following works:
wilcox.test(rnorm(50), (rnorm(100)+2))
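As a sketch on the posted values: the area columns below are copied from head(DF1) and head(DF2), with DF2 cut to four rows purely to make the lengths differ. The names `area1`, `area2` and `res` are my own.

```r
# area values copied from head(DF1) and head(DF2) in the question;
# DF2 is truncated to four rows only so that the lengths differ.
area1 <- c(1030, 1031, 1039, 1046, 1053, 1061)
area2 <- c(952.7843, 935.7430, 1528.4501, 2400.7030)

# Unpaired rank-sum test: the two vectors are treated as independent
# samples, so they need not line up row by row or have equal length.
res <- wilcox.test(area1, area2)
res$statistic
res$p.value
```

Equal lengths are only required with `paired = TRUE`, where each observation in one sample is matched to one in the other.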
Thanks in advance for taking the time to read and answer this.
I have a data frame (15264*3) the head of which is:
head(actData)
steps date interval
289 0 2012-10-02 0
290 0 2012-10-02 5
291 0 2012-10-02 10
292 0 2012-10-02 15
293 0 2012-10-02 20
294 0 2012-10-02 25
The "date" variable (a factor) has 53 levels. I want to split the data by date, calculate the mean of steps per date, and then create a plot of interval vs. mean steps.
What I have done:
mn<- ddply(actData, c("date"), function (x) apply(x[1], 2, mean)) # to calculate mean of steps per day (with the length of 53)
splt<- split(actData, actData$date) # split the data based on date (it should divide the data into 53 parts)
Now I have two variables with the same length (53); but when I try plotting them, I get an error for the difference in their length:
plot(splt$interval, mn[,2], type="l")
Error in xy.coords(x, y, xlabel, ylabel, log) : 'x' and 'y' lengths differ
When I check the length of splt$interval, it gives me 0!
I've also visited here "How to split a data frame by rows, and then process the blocks?", "Split data based on column values and create scatter plot." and so on... with a lot of good suggestions but none of them addresses my questions!
Sorry if my question is a little stupid, I am not an expert in R :)
I am using windows 7, Rstudio 3.0.1.
Thanks.
EDIT:
head(splt, 2)
$`2012-10-01`
[1] steps date interval
<0 rows> (or 0-length row.names)
$`2012-10-02`
steps date interval
289 0 2012-10-02 0
290 0 2012-10-02 5
291 0 2012-10-02 10
292 0 2012-10-02 15
head(mn)
date steps
1 2012-10-02 0.43750
2 2012-10-03 39.41667
3 2012-10-04 42.06944
4 2012-10-05 46.15972
5 2012-10-06 53.54167
6 2012-10-07 38.24653
I want to split the data based on date, calculate the mean of the steps/date and then create a plot for interval vs. steps' mean;
After step 2, you will have a matrix like this:
mean(steps) date
289 0.23 2012-10-02
290 0.42 2012-10-03
291 0.31 2012-10-04
You want to plot this against "the intervals", but there are also multiple intervals per 'date'. What are you exactly trying to plot in x vs y?
The mean steps per date?
The mean steps vs mean intervals (i.e. an x-y point per date)?
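The zero length itself has a simple explanation: split() returns a list of data frames named by date, so splt$interval looks for a list element called "interval", finds none, and returns NULL. A minimal sketch with made-up step counts (the values are hypothetical, only the column names follow the question):

```r
# Made-up data with the same columns as actData in the question.
actData <- data.frame(
  steps    = c(0, 2, 4, 6, 1, 3),
  date     = factor(c("2012-10-02", "2012-10-02", "2012-10-02",
                      "2012-10-03", "2012-10-03", "2012-10-03")),
  interval = c(0, 5, 10, 0, 5, 10)
)

splt <- split(actData, actData$date)
names(splt)            # the dates, not the column names
length(splt$interval)  # 0, because there is no list element named "interval"

# Mean steps per date (one value per day) ...
mean_by_date <- tapply(actData$steps, actData$date, mean)

# ... versus mean steps per interval, averaged across days.
mean_by_interval <- tapply(actData$steps, actData$interval, mean)
interval_values  <- as.numeric(names(mean_by_interval))
```

If the second quantity is the one wanted, plot(interval_values, mean_by_interval, type = "l") gives x and y vectors of equal length.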