Subset using R Studio - r

How do I run the ADF test in R on only part of my observations?
My time series contains 3000 observations, and I want to compute the ADF test for, say, the first 200 observations. I tried the following call from the urca package (library(urca)): ur.df(x, lags=5, selectlags="AIC", type="drift", subset=1:200), but I get the following error message:
Error in summary(ur.df(Vstoxx, lags = 5, selectlags = "AIC", type = "drift", :
Fehler bei der Auswertung des Argumentes 'object' bei der Methodenauswahl
for function 'summary': Error in ur.df(Vstoxx, lags = 5, selectlags = "AIC", type = "drift", subset = 1:200) :
unused argument (subset = 1:200)
where the German part translates to: Error during evaluation of the argument 'object' in the method selection.
Here is a small data sample:
x
1 14.4700
2 14.5100
3 14.4200
4 13.8000
5 13.5700
6 12.9200
7 13.6800
8 14.0500
9 13.6400
10 13.5700
11 13.2000
12 13.1700
13 13.6300
14 14.1700
15 13.9600
16 14.1100
17 13.6300
18 13.3200
19 12.4600
20 12.8100
21 12.7200
22 12.3600
23 12.2500
24 12.3800
25 11.6000
26 11.9900
27 11.9200
28 12.1900
29 12.0400
30 11.9900
31 12.5200
32 12.3500
33 13.6600
34 13.5700
35 13.0100
36 13.2400
37 13.4900
38 13.9900
39 13.1900
40 12.2100
41 12.8900
42 12.3500
43 12.8600
44 12.5700
45 11.9300
46 11.7200
47 12.0000
48 12.5300
49 13.4700
50 12.9600
51 13.3500
52 12.4900
53 14.5700
Many thanks

ur.df() has no subset= argument, so instead of passing one you can simply use indexing to subset x (see my example below):
x <- c(14.4700, 14.5100, 14.4200, 13.8000, 13.5700, 12.9200, 13.6800,
14.0500, 13.6400, 13.5700, 13.2000, 13.1700, 13.6300, 14.1700, 13.9600,
14.1100, 13.6300, 13.3200, 12.4600, 12.8100, 12.7200, 12.3600, 12.2500, 12.3800,
11.6000, 11.9900, 11.9200, 12.1900, 12.0400, 11.9900, 12.5200, 12.3500, 13.6600,
13.5700, 13.0100, 13.2400, 13.4900, 13.9900, 13.1900, 12.2100, 12.8900, 12.3500,
12.8600, 12.5700, 11.9300, 11.7200, 12.0000, 12.5300, 13.4700, 12.9600, 13.3500,
12.4900, 14.5700)
library(urca)
# We'll use only the first 50 elements of x
ur.df(x[1:50], lags=5, selectlags="AIC", type="drift")
Output:
###############################################################
# Augmented Dickey-Fuller Test Unit Root / Cointegration Test #
###############################################################
The value of the test statistic is: -2.1741 2.3635
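The same idea can be applied to the first 200 observations of a longer series. The sketch below is only an illustration: the simulated random walk is a stand-in for the asker's 3000-observation series, and the ur.df() call is guarded so the snippet still runs if urca is not installed.

```r
set.seed(42)
# Simulated random walk standing in for the real 3000-observation series (assumption)
x <- cumsum(rnorm(3000))

# Plain indexing: keep the first 200 observations (equivalent to head(x, 200))
x_first <- x[1:200]

# If the data are stored as a ts object, window() selects the same range
x_ts <- ts(x)
x_win <- window(x_ts, start = 1, end = 200)

# Run the test on the subset only; skipped when urca is unavailable
if (requireNamespace("urca", quietly = TRUE)) {
  library(urca)
  adf_fit <- ur.df(x_first, lags = 5, selectlags = "AIC", type = "drift")
  print(summary(adf_fit))
}
```

Any other range works the same way, for example x[201:400] for the next 200 observations.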

Related

Using a loop to create a polynomial model gives R trouble understanding it?

I create a lot of polynomial models to compare them, so I used a loop like this:
library(ISLR)
library(boot)
data(Wage)
list = list()
for (i in 1:10){
list[[i]] = lm(wage ~ poly(age, i), data = Wage)
assign(paste("fit.aov", i, sep = ""), list[[i]])
}
agelims <- range(Wage$age)
age.grid <- seq(agelims[1], agelims[2])
If I run the following code
preds <- predict(fit.aov1, data.frame(age = age.grid), se=TRUE)
I receive the following error:
Error: variable 'poly(age, i)' was fitted with type "nmatrix.1" but type "nmatrix.10" was supplied
In addition: Warning message:
In Z/rep(sqrt(norm2[-1L]), each = length(x)) :
longer object length is not a multiple of shorter object length
However, if I create each model manually like this
fit1 = lm(wage ~ poly(age, 1), data = Wage)
Then the predict() function runs just fine.
Here we need to build the formula as a string (with sprintf or paste), so that the value of i is written into the formula itself:
lst1 <- vector('list', 10)
for (i in 1:10){
fmla <- sprintf("wage~ poly(age,%d)", i)
print(fmla)
lst1[[i]] = lm(as.formula(fmla), data = Wage)
lst1[[i]]$call <- parse(text =fmla )[[1]]
assign(paste("fit.aov", i, sep = ""), lst1[[i]])
}
Testing with predict:
predict(fit.aov1, data.frame(age = age.grid), se=TRUE)
#$fit
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14
# 94.43570 95.14298 95.85025 96.55753 97.26481 97.97208 98.67936 99.38663 100.09391 100.80119 101.50846 102.21574 102.92301 103.63029
# 15 16 17 18 19 20 21 22 23 24 25 26 27 28
#104.33757 105.04484 105.75212 106.45939 107.16667 107.87394 108.58122 109.28850 109.99577 110.70305 111.41032 112.11760 112.82488 113.53215
# 29 30 31 32 33 34 35 36 37 38 39 40 41 42
#114.23943 114.94670 115.65398 116.36126 117.06853 117.77581 118.48308 119.19036 119.89764 120.60491 121.31219 122.01946 122.72674 123.43402
# 43 44 45 46 47 48 49 50 51 52 53 54 55 56
#124.14129 124.84857 125.55584 126.26312 126.97039 127.67767 128.38495 129.09222 129.79950 130.50677 131.21405 131.92133 132.62860 133.33588
# 57 58 59 60 61 62 63
#134.04315 134.75043 135.45771 136.16498 136.87226 137.57953 138.28681
# ...
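The issue was that the original formula stores poly(age, i) unevaluated, so the degree is not recorded as 1, 2, ... but only as the symbol i. When predict() re-evaluates the term after the loop has finished, i is 10, which is why a 10-column matrix was supplied to a model fitted with one column.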
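As an alternative to building the formula from a string, the degree can be substituted into the call with bquote(). This is a minimal sketch, not the answerer's method; it uses the built-in mtcars data (mpg and hp) instead of Wage so that it runs without extra packages.

```r
degrees <- 1:3

# bquote() replaces .(d) with the current value of d before lm() is evaluated,
# so each stored formula contains a literal degree rather than the symbol d
fits <- lapply(degrees, function(d) {
  eval(bquote(lm(mpg ~ poly(hp, .(d)), data = mtcars)))
})

# Prediction now works for every model, including the first one
new_hp <- data.frame(hp = c(100, 150, 200))
preds <- predict(fits[[1]], new_hp, se.fit = TRUE)
preds$fit
```

Keeping the models in a list (fits[[1]], fits[[2]], ...) also avoids creating numbered variables with assign().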

evaluating neural network performance

I trained my neural network with a sigmoid activation function, so the predicted values lie in the range [0,1). However, the real data, to which a z-score transformation has been applied, extend beyond [0,1). In this case, what would be the appropriate way to evaluate my model? Should I also rescale the original test data to the same range and then evaluate with a criterion such as the mean squared forecast error?
> real_predicted_neural
predicted real
1 1.909219e-07 -3.57877473
2 4.161819e-08 -2.28704595
3 1.754706e-11 -1.08509429
4 1.149891e-13 -0.46573114
5 7.777560e-02 0.42381300
6 4.173448e-07 -0.44060297
7 1.119703e-01 0.21075550
8 8.682557e-01 -0.01292402
9 4.736056e-08 -0.29830701
10 7.506821e-08 -1.20302227
11 7.341235e-01 -0.03986571
12 7.501776e-05 -0.94315815
13 1.145697e-04 0.49730175
14 2.214929e-13 0.04252241
15 4.597199e-01 -0.38539901
16 2.324931e-03 -0.74468628
17 4.366025e-06 -0.77037244
18 1.394450e-06 0.16679048
19 5.869884e-11 -0.75876486
20 1.817941e-04 0.04303387
21 7.060773e-04 0.06099372
22 8.267170e-06 -1.21687318
23 9.388680e-02 0.61135319
24 1.099290e-01 0.55715201
25 9.757236e-01 -0.33480226
26 9.544055e-01 0.09061006
27 7.322074e-07 0.09290822
28 1.014327e-06 -0.61658893
29 7.848382e-08 -0.78739456
30 1.791908e-04 -0.44073540
31 1.357918e-03 -0.22099008
32 5.192233e-06 -0.32744703
33 2.624779e-06 -0.37644068
34 6.414216e-02 -0.36947939
35 1.388143e-06 -0.00994845
36 3.010872e-05 -0.05984833
37 9.873201e-03 -0.21815268
38 3.896163e-04 -0.24009094
39 2.718760e-02 0.33383333
40 1.025650e-02 0.09779867
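The rescaling idea raised in the question could be sketched as follows. This is only an illustration under assumptions: the five values are the first rows of the table above, and the bounds lo and hi are taken from those same values, whereas in practice they would come from the training data.

```r
# First five rows of the table above
real      <- c(-3.57877473, -2.28704595, -1.08509429, -0.46573114, 0.42381300)
predicted <- c(1.909219e-07, 4.161819e-08, 1.754706e-11, 1.149891e-13, 7.777560e-02)

# Assumption: bounds of the target; normally computed on the training set
lo <- min(real)
hi <- max(real)

# Option 1: bring the real values into [0, 1] and compare on that scale
real_scaled <- (real - lo) / (hi - lo)
mse_scaled  <- mean((predicted - real_scaled)^2)

# Option 2: map the predictions back to the z-score scale and compare there
predicted_z <- predicted * (hi - lo) + lo
mse_z       <- mean((predicted_z - real)^2)

c(mse_scaled = mse_scaled, mse_z = mse_z)
```

Both options measure the same errors; the two mean squared errors differ only by the constant factor (hi - lo)^2, so the choice mainly affects the units in which the error is reported.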

How can I create a matrix of random numbers in R with no repeats within a row, but repeats allowed within a column?

How can I create a matrix of random numbers in which no number is repeated within a row?
Like this:
5 29 24 20 31 33
2 18 35 4 11 21
30 40 22 14 2 28
33 14 4 18 5 10
10 33 15 2 28 18
7 22 9 25 31 20
12 29 31 22 37 26
7 31 34 28 19 23
7 34 11 6 31 28
my code :
matrix(sample(1:42, 60, replace = FALSE), ncol = 6)
But I receive this error message:
Error in sample.int(length(x), size, replace, prob) : cannot take a
sample larger than the population when 'replace = FALSE'
I understand the error: there are only 42 values (1 to 42), so a single draw without replacement cannot fill a matrix of 60 cells.
You cannot generate all 60 numbers with a single sample() call, because you want to allow the same number to appear again in a different row. Therefore you have to draw one sample per row. @Jav provided very neat code to accomplish this in a comment on the question:
t(sapply(1:10, function(x) sample(1:42, 6, replace = FALSE)))
If you want a different sample in each row, then replicate can help you. However, replicate (like pretty much everything else in R) naturally works column-wise, so you have to transpose the result:
t(replicate(10, sample(1:42, 6)))
replace = FALSE is the default, so I didn't include it.
After transposing, 10 becomes the number of rows and 6 becomes the number of columns.
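A quick check that the result has the intended shape and that no row contains a repeated number could look like this. The seed is only there to make the sketch reproducible.

```r
set.seed(1)

# One independent draw of 6 distinct numbers per row, 10 rows in total
m <- t(replicate(10, sample(1:42, 6)))

dim(m)  # rows, then columns

# anyDuplicated() returns 0 when a vector has no repeated values
row_has_duplicates <- apply(m, 1, anyDuplicated) > 0
any(row_has_duplicates)
```

Numbers may still repeat across rows, which is exactly what a per-row draw allows.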

How can I look at a specific generated train and test sets made from for loop?

My program divides my dataset into a train and a test set, builds a decision tree based on the train and test set, and calculates the accuracy, sensitivity and specificity from the confusion matrix.
I added a for loop to rerun my program 100 times. This means I get 100 train and test sets. The output of the for loop is a result_df with columns for accuracy, specificity and sensitivity.
This is the for loop:
result_df<-matrix(ncol=3,nrow=100)
colnames(result_df)<-c("Acc","Sens","Spec")
for (g in 1:100 )
{
# Divide into Train and test set
smp_size <- floor(0.8 * nrow(mydata1))
train_ind <- sample(seq_len(nrow(mydata1)), size = smp_size)
train <- mydata1[train_ind, ]
test <- mydata1[-train_ind, ]
# REST OF MY CODE
}
My result_df (first 20 rows) looks like this:
> result_df[1:20,]
Acc Sens Spec id
1 26 22 29 1
2 10 49 11 2
3 37 43 36 3
4 4 79 4 4
5 21 21 20 5
6 31 17 34 6
7 57 4 63 7
8 33 3 39 8
9 56 42 59 9
10 65 88 63 10
11 6 31 7 11
12 57 44 62 12
13 25 10 27 13
14 32 24 32 14
15 19 8 19 15
16 27 27 29 16
17 38 89 33 17
18 54 32 56 18
19 35 62 33 19
20 37 6 40 20
I use ggplot() to plot the specificity and the sensitivity as a scatterplot.
What I want to do:
I want to see, e.g., the train and test sets for data point 17.
I think I can do this with the set.seed function, but I am very unfamiliar with it.
First, clearly, if your code stored the estimated models, e.g., in a list, then you could recover your data from those models. However, it doesn't look like that's the case.
With your current code all you can do is see the last train and test sets (number 100). That is because you keep redefining the test, train and train_ind variables. The cheapest way (in terms of memory) to achieve what you want would be to store train_ind from each iteration. For instance, you could use
train_inds <- vector("list", 100) # one slot of training indices per iteration
for (g in 1:100)
{
smp_size <- floor(0.8 * nrow(mydata1))
# Store the sampled row indices so every split can be recovered later
train_inds[[g]] <- sample(seq_len(nrow(mydata1)), size = smp_size)
train <- mydata1[train_inds[[g]], ]
test <- mydata1[-train_inds[[g]], ]
# The rest
}
and in this way you would always know which observations were in which set. If you are interested in only one specific iteration, you could save only that one.
Lastly, set.seed isn't really going to help here. If all you were doing was running rnorm(1) a hundred times, then yes, by using set.seed you could quickly recover the n-th generated value later. In your case, however, you are not only using sample for train_ind; the model estimation functions are also very likely generating random values.
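Once the indices of every iteration are stored, recovering the split behind a particular row of the results (here number 17) could look like the sketch below. The built-in mtcars data is used as a stand-in for mydata1, and the model-fitting part of the loop is left out.

```r
# Stand-in for the asker's data (assumption)
mydata1 <- mtcars

n_runs <- 100
smp_size <- floor(0.8 * nrow(mydata1))
train_inds <- vector("list", n_runs)

for (g in seq_len(n_runs)) {
  # Only the row indices are kept, which is cheap compared with storing the data
  train_inds[[g]] <- sample(seq_len(nrow(mydata1)), size = smp_size)
  # ... fit the tree and fill row g of the results here ...
}

# Rebuild the train and test sets of iteration 17 from the stored indices
train_17 <- mydata1[train_inds[[17]], ]
test_17  <- mydata1[-train_inds[[17]], ]
```

The list position matches the loop counter, so train_inds[[17]] belongs to row 17 of the results.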

R error type "Subscript out of bounds"

I am simulating a correlation matrix, where the 60 variables correlate in the following way:
more highly (0.6) for every two variables (1-2, 3-4... 59-60)
moderate (0.3) for every group of 12 variables (1-12,13-24...)
mc <- matrix(0,60,60)
diag(mc) <- 1
for (c in seq(1,59,2)){ # every pair of variables in order are given 0.6 correlation
mc[c,c+1] <- 0.6
mc[c+1,c] <- 0.6
}
for (n in seq(1,51,10)){ # every group of 12 are given correlation of 0.3
for (w in seq(12,60,12)){ # these are variables 11-12, 21-22 and such.
mc[n:n+1,c(n+2,w)] <- 0.2
mc[c(n+2,w),n:n+1] <- 0.2
}
}
for (m in seq(3,9,2)){ # every group of 12 are given correlation of 0.3
for (w in seq(12,60,12)){ # these variables are the rest.
mc[m:m+1,c(1:m-1,m+2:w)] <- 0.2
mc[c(1:m-1,m+2:w),m:m+1] <- 0.2
}
}
The first loop works well, but not the second and third ones. I get this error message:
Error in `[<-`(`*tmp*`, m:m + 1, c(1:m - 1, m + 2:w), value = 0.2) :
subscript out of bounds
Error in `[<-`(`*tmp*`, m:m + 1, c(1:m - 1, m + 2:w), value = 0.2) :
subscript out of bounds
I would really appreciate any hints, since I don't see the loop commands get to exceed the matrix dimensions. Thanks a lot in advance!
Note that : takes precedence over +. E.g., n:n+1 is the same as n+1. I guess you want n:(n+1).
The maximal value of w is 60:
w <- 60
m <- 1
m+2:w
#[1] 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
#[49] 51 52 53 54 55 56 57 58 59 60 61
And 61 is out of bounds for a 60 x 60 matrix. You need to add a lot of parentheses; presumably you want n:(n+1), 1:(m-1) and (m+2):w.
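The effect of the precedence rule on the expressions in the question can be checked directly. The values of n, m and w below are arbitrary small examples chosen for illustration.

```r
n <- 3
m <- 3
w <- 12

# ':' binds tighter than '+' and '-', so the sequence is built first
without_parens_n <- n:n + 1     # (n:n) + 1, a single number
with_parens_n    <- n:(n + 1)   # the two consecutive indices n and n + 1

without_parens_lo <- 1:m - 1    # (1:m) - 1, starts at 0
with_parens_lo    <- 1:(m - 1)  # indices 1 up to m - 1

without_parens_hi <- m + 2:w    # m + (2:w), runs past w
with_parens_hi    <- (m + 2):w  # indices m + 2 up to w

max(without_parens_hi)  # larger than w, hence the out-of-bounds subscript
max(with_parens_hi)
```

An index of 0, as produced by 1:m - 1, does not raise an error but is silently dropped, so that part of the bug would go unnoticed without the check above.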
