How to choose the best imputed data for further analysis in R

I have a multivariate time series dataset (almost 30 years) with random missing values.
       T        S   po4     si   din
 9.00000       NA 0.290  5.310 18.51
 8.45000       NA 0.130  6.180 14.74
13.60000 36.46000 0.010  0.500  1.86
23.20000 32.12000 0.010  6.580  0.81
26.00000 32.13000 0.070  0.500  0.23
      NA 35.41400 0.010  1.670  0.72
24.80000 36.42000 0.000  3.540
24.20000 33.16000 0.110  2.020
22.50000 37.60000 0.040  0.400
16.32000 36.01000 0.020  2.900
17.60000 38.04000 0.010  0.970
 9.70000 36.36000 0.120  7.950
13.80000 38.33000 0.010  5.760
 7.90000 35.51000 0.060  2.350
11.90000 38.33000 0.030  3.410
24.10000 36.30000 0.020  0.730
25.20000 35.77000 0.020  1.370
24.70000 37.54000 0.330  0.700
 5.75000 33.26000 0.120  0.860
13.30000 33.14000 0.000  0.000
13.60000 38.21265 0.000  0.190
15.70000 28.33000 0.040 11.500 41.64
I would like to fill the gaps in order to have a constant frequency (I have monthly data with missing values) so that I can try different techniques in the context of a time series analysis.
I have tried to use the mice package in R, deciding which imputed dataset to use with with() and pool(), but I don't want to use all of them in a model; I would rather obtain the most correct one and use only that one for further analysis.
How can I do that? How can I find the best one?
Otherwise, can you suggest another method as an alternative to mice?
Thank you very much in advance

If you have strong correlation over time, you can use the imputeTS package for time series imputation.
library(imputeTS)
na_kalman(your_dataframe)
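For example, a few of the package's other imputation functions (standard imputeTS calls, applied to the same placeholder data frame):
na_interpolation(your_dataframe)                 # linear/spline interpolation
na_ma(your_dataframe, k = 4)                     # moving-average imputation
na_seadec(your_dataframe, algorithm = "kalman")  # removes seasonality first (needs a seasonal frequency, e.g. ts(x, frequency = 12))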
There are several other imputation methods included in the package as well. As for mice, the whole point of multiple imputation is to have several imputed datasets: you perform your analysis separately on each of them and then compare or pool the results. Because the missing-data replacements are only estimates, imputation always carries some uncertainty, and this technique lets you model and get a feeling for that uncertainty.
If you don't want to run multiple analyses and prefer single imputation, you can use any one of these datasets; they are equally valid and there is no single best one.
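For completeness, a minimal sketch of that mice workflow (standard mice functions; the lm formula is only a placeholder):
library(mice)
imp <- mice(your_dataframe, m = 5, seed = 1)  # 5 imputed datasets
# Multiple-imputation route: fit the model on each dataset and pool the results
fits <- with(imp, lm(din ~ T + S))            # placeholder model
summary(pool(fits))
# Single-dataset route: extract one completed dataset and continue with it
completed <- complete(imp, 1)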
Or you could also use a single-imputation package like missForest.
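A minimal missForest sketch (single imputation with random forests; the completed data frame is returned in the ximp element):
library(missForest)
set.seed(1)
imp_rf <- missForest(your_dataframe)
imputed_data <- imp_rf$ximp   # completed data frame
imp_rf$OOBerror               # out-of-bag estimate of the imputation error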

Related

Changepoints detection in time series in R

I need some guidance regarding how changepoints work in time series. I am trying to detect some changepoints using R, and the package called "changepoint" (https://cran.r-project.org/web/packages/changepoint/changepoint.pdf).
There are options for how to detect when the variance (cpt.var) and the mean (cpt.mean) change, but what I'm trying to look for is when the time series changes trend.
Maybe I'm confused with what changepoints really are, but is there any way to get this information?
I am showing the result of using cpt.var() function, and I have added some arrows, showing what I would like to achieve.
Is there any way to achieve this? I guess it should be something like inflection points...
I would appreciate any light on this.
Thanks beforehand,
Jon
EDIT
I have tried the approach of using diff(), but it is not detecting the changes correctly:
The data I am using is the following:
[1] 10.695 10.715 10.700 10.665 10.830 10.830 10.800 11.070 11.145 11.270 11.015 11.060 10.945 10.965 10.780 10.735 10.705 10.680 10.600 10.335 10.220 10.125
[23] 10.370 10.595 10.680 11.000 10.980 11.065 11.060 11.355 11.445 11.415 11.350 11.310 11.330 11.360 11.445 11.335 11.275 11.300 11.295 11.470 11.445 11.325
[45] 11.300 11.260 11.200 11.210 11.230 11.240 11.300 11.250 11.285 11.215 11.260 11.395 11.410 11.235 11.320 11.475 11.470 11.685 11.740 11.740 11.700 11.905
[67] 11.720 12.230 12.285 12.505 12.410 11.995 12.110 12.005 11.915 11.890 11.820 11.730 11.700 11.660 11.685 11.615 11.360 11.425 11.185 11.275 11.265 11.375
[89] 11.310 11.250 11.050 10.880 10.775 10.775 10.805 10.755 10.595 10.700 10.585 10.510 10.290 10.255 10.395 10.290 10.425 10.405 10.365 10.010 10.305 10.185
[111] 10.400 10.700 10.725 10.875 10.750 10.760 10.905 10.680 10.670 10.895 10.790 10.990 10.925 10.980 10.975 11.035 10.895 10.985 11.035 11.295 11.245 11.535
[133] 11.510 11.430 11.450 11.390 11.520 11.585
And when I do diff() I get this data:
[1] 0.020 -0.015 -0.035 0.165 0.000 -0.030 0.270 0.075 0.125 -0.255 0.045 -0.115 0.020 -0.185 -0.045 -0.030 -0.025 -0.080 -0.265 -0.115 -0.095 0.245
[23] 0.225 0.085 0.320 -0.020 0.085 -0.005 0.295 0.090 -0.030 -0.065 -0.040 0.020 0.030 0.085 -0.110 -0.060 0.025 -0.005 0.175 -0.025 -0.120 -0.025
[45] -0.040 -0.060 0.010 0.020 0.010 0.060 -0.050 0.035 -0.070 0.045 0.135 0.015 -0.175 0.085 0.155 -0.005 0.215 0.055 0.000 -0.040 0.205 -0.185
[67] 0.510 0.055 0.220 -0.095 -0.415 0.115 -0.105 -0.090 -0.025 -0.070 -0.090 -0.030 -0.040 0.025 -0.070 -0.255 0.065 -0.240 0.090 -0.010 0.110 -0.065
[89] -0.060 -0.200 -0.170 -0.105 0.000 0.030 -0.050 -0.160 0.105 -0.115 -0.075 -0.220 -0.035 0.140 -0.105 0.135 -0.020 -0.040 -0.355 0.295 -0.120 0.215
[111] 0.300 0.025 0.150 -0.125 0.010 0.145 -0.225 -0.010 0.225 -0.105 0.200 -0.065 0.055 -0.005 0.060 -0.140 0.090 0.050 0.260 -0.050 0.290 -0.025
[133] -0.080 0.020 -0.060 0.130 0.065
The results I get are the following:
> cpt =cpt.mean(diff(vector), method="PELT")
> (cpt.pts <- attributes(cpt)$cpts)
[1] 137
Apparently it does not make sense... Any clue?
In R, there are many packages available for time series changepoint detection. changepoint is definitely a very useful one. A partial list of the packages is summarized in CRAN Task View:
Change point detection is provided in strucchange (using linear regression models), and in trend (using nonparametric tests). The changepoint package provides many popular changepoint methods, and ecp does nonparametric changepoint detection for univariate and multivariate series. changepoint.np implements the nonparametric PELT algorithm, while changepoint.mv detects changepoints in multivariate time series. InspectChangepoint uses sparse projection to estimate changepoints in high-dimensional time series. robcp provides robust change-point detection using Huberized cusum tests, and Rbeast provides Bayesian change-point detection and time series decomposition.
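As a small illustration of one of the options above, strucchange estimates breaks in a piecewise linear regression of the series on time, i.e. changes in trend rather than only in mean (here y is assumed to hold the numeric vector posted in the question):
library(strucchange)
tt <- seq_along(y)
bp <- breakpoints(y ~ tt)   # piecewise linear fits; number of breaks chosen by BIC
summary(bp)                 # RSS/BIC for 0, 1, 2, ... breaks
bp                          # prints the observation indices of the optimal breaks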
Here is also a great blog comparing several alternative packages: https://www.marinedatascience.co/blog/2019/09/28/comparison-of-change-point-detection-methods/. Another impressive comparison is from Dr. Jonas Kristoffer LindelΓΈv who developed the mcp package: https://lindeloev.github.io/mcp/articles/packages.html.
Below I used your sample time series to generate some quick results with the Rbeast package, developed by myself (chosen here admittedly out of self-promotion as well as perceived relevance). Rbeast is a Bayesian changepoint detection algorithm that can estimate the probability of changepoint occurrence. It can also decompose a time series into seasonality and trend, but your time series is apparently trend-only, so season='none' is specified in the beast call below.
y = c(10.695,10.715,10.700,10.665,10.830,10.830,10.800,11.070,11.145,11.270,11.015,11.060,10.945,10.965,10.780,10.735,10.705,
10.680,10.600,10.335,10.220,10.125,10.370,10.595,10.680,11.000,10.980,11.065,11.060,11.355,11.445,11.415,11.350,11.310,11.330,
11.360,11.445,11.335,11.275,11.300,11.295,11.470,11.445,11.325,11.300,11.260,11.200,11.210,11.230,11.240,11.300,11.250,11.285,
11.215,11.260,11.395,11.410,11.235,11.320,11.475,11.470,11.685,11.740,11.740,11.700,11.905,11.720,12.230,12.285,12.505,12.410,
11.995,12.110,12.005,11.915,11.890,11.820,11.730,11.700,11.660,11.685,11.615,11.360,11.425,11.185,11.275,11.265,11.375,11.310,
11.250,11.050,10.880,10.775,10.775,10.805,10.755,10.595,10.700,10.585,10.510,10.290,10.255,10.395,10.290,10.425,10.405,10.365,
10.010,10.305,10.185,10.400,10.700,10.725,10.875,10.750,10.760,10.905,10.680,10.670,10.895,10.790,10.990,10.925,10.980,10.975,
11.035,10.895,10.985,11.035,11.295,11.245,11.535 ,11.510,11.430,11.450,11.390,11.520,11.585)
library(Rbeast)
out=beast(y, season='none')
plot(out)
print(out)
In the figure above, dashed vertical lines mark the most likely locations of changepoints; the green Pr(tcp) curve shows the point-wise probability of changepoint occurrence over time. The order_t curve gives the estimated mean order of the piecewise polynomials needed to adequately fit the trend (the 0-th order is constant and the 1st order is linear): an average order toward 0 means the trend is more likely flat, and an order close to 1 means the trend is linear. The output can also be printed as ASCII text, as shown below. Again, it says that the time series most likely has 8 changepoints; their most probable locations are given in out$trend$cp.
Result for time series #1 (total number of time series in 'out': 1)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+ SEASONAL CHANGEPOINTS +
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
No seasonal/periodic component present (i.e., season='none')
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+ TREND CHANGEPOINTS +
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
An ascii plot of the probability dist for number of chgpts(ncp)
---------------------------------------------------------------
Pr(ncp=0 )=0.000|* |
Pr(ncp=1 )=0.000|* |
Pr(ncp=2 )=0.000|* |
Pr(ncp=3 )=0.000|* |
Pr(ncp=4 )=0.000|* |
Pr(ncp=5 )=0.000|* |
Pr(ncp=6 )=0.055|***** |
Pr(ncp=7 )=0.074|****** |
Pr(ncp=8 )=0.575|******************************************** |
Pr(ncp=9 )=0.240|******************* |
Pr(ncp=10)=0.056|***** |
---------------------------------------------------------------
Max ncp : 10 | A parameter you set (e.g., maxTrendKnotNum) |
Mode ncp: 8 | Pr(ncp= 8)=0.57; there is a 57.5% probability|
| that the trend componet has 8 chngept(s). |
Avg ncp : 8.17 | Sum[ncp*Pr(ncp)] |
---------------------------------------------------------------
List of most probable trend changepoints (avg number of changpts: 8.17)
--------------------------------.
tcp# |time (cp) |prob(cpPr)|
-----|---------------|----------|
1 |8.0000 | 0.92767|
2 |112.0000 | 0.91433|
3 |68.0000 | 0.84213|
4 |21.0000 | 0.80188|
5 |32.0000 | 0.78171|
6 |130.0000 | 0.76938|
7 |101.0000 | 0.66404|
8 |62.0000 | 0.61171|
--------------------------------'
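If you want the changepoint estimates as plain vectors rather than the printed summary, they can be pulled from the returned object (field names as they appear in the printout above):
out$trend$cp     # most probable changepoint locations
out$trend$cpPr   # their occurrence probabilities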
If the signal isn't too noisy, you could use diff to detect changepoints in slope instead of mean:
library(changepoint)
set.seed(1)
slope <- rep(sample(10,10)-5,sample(100,10))
sig <- cumsum(slope)+runif(n=length(slope),min = -1, max = 1)
cpt =cpt.mean(diff(sig),method="PELT")
# Show change point
(cpt.pts <- attributes(cpt)$cpts)
#> [1] 58 109 206 312 367 440 447 520 599
plot(sig,type="l")
lines(x=cpt.pts,y=sig[cpt.pts],type="p",col="red",cex=2)
Another option which seems to work better with the data you provided is to use piecewise linear segmentation:
library(ifultools)
# 'data' is assumed to hold the numeric vector posted in the question
changepoints <- linearSegmentation(x = 1:length(data), y = data,
                                   angle.tolerance = 90, n.fit = 10, plot = TRUE)
changepoints
#[1] 13 24 36 58 72 106

Reshaping database using reshape package

I would like to reshape some rows of my data. In particular, I have rows that are repeated for the same Id, and I would like to convert these rows into columns. The code below creates an example of my data.
I'm trying with t() and reshape but it doesn't do what I want. Can anyone give me any suggestions?
test<-data.frame(Id=c(1,1,2,3),
St=c(20,80,80,20),
gap=seq(0.02,0.08,by=0.02),
gip=c(0.23,0.60,0.86,2.09),
gat=c(0.0107,0.989,0.337,0.663))
setNames(data.frame(t(test))[2:nrow(data.frame(t(test))),], test$Id)
1 1 2 3
St 20.0000 80.000 80.000 20.000
gap 0.0200 0.040 0.060 0.080
gip 0.2300 0.600 0.860 2.090
gat 0.0107 0.989 0.337 0.663
It helps to provide an expected output. Is this what you expected?

Marginal densities (or bar plots) on facets in ggplot2

my problem is the following: I have this table below
0 1-5 6-10 11-15 16-20 21-26 27-29
a 0.019 0.300 0.296 0.211 0.117 0.042 0.014
b 0.058 0.448 0.308 0.120 0.042 0.019 0.005
c 0.026 0.277 0.316 0.187 0.105 0.068 0.020
d 0.054 0.297 0.378 0.108 0.108 0.041 0.014
e 0.004 0.252 0.358 0.216 0.102 0.053 0.015
f 0.032 0.097 0.312 0.280 0.161 0.065 0.054
g 0.113 0.500 0.233 0.094 0.043 0.014 0.003
h 0.328 0.460 0.129 0.050 0.020 0.010 0.003
representing some marginal frequencies (by row) for each subgroup of my data (a to h).
My dataset is actually in the long format (very long, counting more than 100 thousand entries), with the first 6 rows as you see below:
RX_SUMM_SURG_PRIM_SITE Nodes.Examined.Class
1 Wedge Resection 1-5
2 Segmental Resection 1-5
3 Lobectomy w/mediastinal LNdissection 6-10
4 Lobectomy w/mediastinal LNdissection 6-10
5 Lobectomy w/mediastinal LNdissection 1-5
6 Lobectomy w/mediastinal LNdissection 11-15
When I plot a barplot by group (the table above is simply the cross tabulation of these two covariates with the row marginal probabilities taken), here's what happens:
The code I have for this is
ggplot(data.ln.red, aes(x=Nodes.Examined.Class))+geom_bar(aes(x=Nodes.Examined.Class, group=RX_SUMM_SURG_PRIM_SITE))+
facet_grid(RX_SUMM_SURG_PRIM_SITE~.)
Actually, I would be happy with just the marginal frequencies (i.e. the ones in the table) on the y-axis of each facet of the plot, instead of the counts.
Can anybody help me with this?
Thanks for all your help!
EM
geom_bar calculates both counts and proportions of observations. You can access these calculated proportions with either ..prop.. (the old way) or stat(prop) / after_stat(prop) (introduced in newer versions of ggplot2). Use this as your y aesthetic.
You can also get rid of the aes you have in geom_bar, as this is just a repeat of what you've already covered by ggplot and facet_grid.
It looks like your counts/proportions are going to vary widely between groups, so I'm adding free y-scaling to the faceting.
Here's an example of a similar plot with the iris data, which you can model your code off of:
library(tidyverse)
ggplot(iris, aes(x = Sepal.Length, y = after_stat(prop))) +
geom_bar() +
facet_grid(Species ~ ., scales = "free_y")
Created on 2018-04-06 by the reprex package (v0.2.0).
Edit: the calculated prop variable is proportions within each group, not proportions across all groups, so it works differently when x is a factor. For categorical x, prop treats x as the group; to override this, include group = 0 or some other dummy value in your aes. Sorry I missed that the first time!
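For example, a quick sketch with a discrete x, using the built-in mpg data (after_stat(prop) is the current ggplot2 spelling of the calculated-aesthetic helper, and group = 1 is just a dummy constant so proportions are computed within each facet):
library(ggplot2)
ggplot(mpg, aes(x = class, y = after_stat(prop), group = 1)) +
  geom_bar() +
  facet_grid(drv ~ ., scales = "free_y")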

Building a time series of r-squared values in R

I have multiple xts objects that look something like this:
t x0 x1 x2 x3 x4
2000-01-01 0.397 0.262 -0.039 0.440 0.186
2000-01-02 0.020 -0.219 0.197 0.301 -0.300
2000-01-03 0.010 0.064 -0.034 0.214 -0.451
2000-01-04 -0.002 -0.483 -0.251 0.023 0.153
2000-01-05 0.451 0.375 0.566 0.258 -0.092
2000-01-06 0.411 0.219 0.108 0.137 -0.087
2000-01-07 0.111 -0.039 0.187 -0.271 0.346
2000-01-08 0.131 -0.078 0.163 -0.057 -0.313
2000-01-09 -0.409 -0.022 0.211 -0.297 0.217
.
.
.
Now, I am interested in finding out how well the average of the other variables explains each single x variable during each period τ (e.g. monthly).
Or in other words, I'm interested in building a time series of the r-squared in the following regression:
x_{i,t} = β_0 + β_1 · avg_{i,t} + ε_{i,t}
where avg_{i,t} is the average of all the other variables at time t, and the regression is run for each of the i ∈ {1, …, n} variables and for the observations t within each period τ (e.g. τ is a month). I would like to be able to run this with any period τ, where I believe xts might be able to help, since there are functions such as endpoints or apply.monthly?
I have years and years of data, and multiple such xts objects, so the question is: what would be a sensible way to run those regressions and collect the time series of r-squared values? A for-loop should be able to handle iterating over the separate xts objects, but I really don't think a for-loop would be wise within each single object.
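To make the setup concrete, here is a minimal sketch of what I have in mind (the example object x is made up, and the monthly split / lm choices are just one possibility):
library(xts)
set.seed(1)
x <- xts(matrix(rnorm(500), ncol = 5,
                dimnames = list(NULL, paste0("x", 0:4))),
         order.by = as.Date("2000-01-01") + 0:99)
# One r-squared per column and per month: regress x_i on the average of the others
rsq_by_month <- t(sapply(split(x, f = "months"), function(win) {
  sapply(seq_len(ncol(win)), function(i) {
    avg <- rowMeans(win[, -i])
    summary(lm(as.numeric(win[, i]) ~ avg))$r.squared
  })
}))
colnames(rsq_by_month) <- colnames(x)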

Print loadings and yloadings details in R

I am new to R and have been asked to prepare a script that will be used to capture some R output in a text file.
I have been given a set of commands that creates a DB connection, loads data, performs some mathematical calculations, and churns out Summary, Loadings and YLoadings. I am to capture this output and save it in a database. I have got everything working already except one bit that keeps giving issues time and again.
The loadings and yloadings functions sometimes give out a matrix that has white space in it. For example,
Comp 1, Comp 2, Comp 3
Row1 0.495 0.748 -0.272
Row2 0.605 -0.562
Row3 0.666 -0.397 0.781
Row4
LongNameRow1 0.536 -1.483
LongNameRow2 -0.681 -0.408 -1.145
Because of such outputs I have to manually check the files and edit them so that they become,
Comp 1, Comp 2, Comp 3
Row1 0.495 0.748 -0.272
Row2 0.605 0.000 -0.562
Row3 0.666 -0.397 0.781
Row4 0.000 0.000 0.000
LongNameRow1 0.536 0.000 -1.483
LongNameRow2 -0.681 -0.408 -1.145
i.e. I have to manually replace all the spaces with 0.000 (I am not sure if 0.000 is the correct value, but it was the only thing I could think of) in the output. This is very time consuming and painful to do.
I did some search around the loadings function and found,
Small loadings are conventionally not printed (replaced by spaces), to draw the eye to the pattern of the larger loadings.
So my question is: are there any other methods or configuration options that I am missing that can give me the output the way I need, i.e. 0.000 (or some other reasonable value) instead of white space? At the very least, can I delimit the output with commas or the pipe character (i.e. "|") or something similar to make parsing the text possible?
Thanks in advance for help!!!
The answer is to use unclass to convert the loadings to a matrix. The following example illustrates this.
The loadings function extracts the loadings matrix and changes the class of this matrix to loadings. When you print an object of class loadings, small values are not printed, as you observe.
Here is the example from ?princomp:
fit <- princomp(USArrests, cor = TRUE)
l <- loadings(fit)
l
Loadings:
Comp.1 Comp.2 Comp.3 Comp.4
Murder -0.536 0.418 -0.341 0.649
Assault -0.583 0.188 -0.268 -0.743
UrbanPop -0.278 -0.873 -0.378 0.134
Rape -0.543 -0.167 0.818
Comp.1 Comp.2 Comp.3 Comp.4
SS loadings 1.00 1.00 1.00 1.00
Proportion Var 0.25 0.25 0.25 0.25
Cumulative Var 0.25 0.50 0.75 1.00
It is quite straightforward to change the class of this object back to its default. If you then print it, the values are displayed as the true underlying values:
l <- unclass(l)
l
Comp.1 Comp.2 Comp.3 Comp.4
Murder -0.5358995 0.4181809 -0.3412327 0.64922780
Assault -0.5831836 0.1879856 -0.2681484 -0.74340748
UrbanPop -0.2781909 -0.8728062 -0.3780158 0.13387773
Rape -0.5434321 -0.1673186 0.8177779 0.08902432
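And if you want a delimited text file that is easy to parse later, you can write the unclassed matrix out directly, for example (the file name and pipe separator are just examples):
write.table(round(unclass(l), 3), file = "loadings.txt",
            sep = "|", quote = FALSE, col.names = NA)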
