Creating folds for k-fold CV in R using Caret - r

I'm trying to set up k-fold CV for several classification methods/hyperparameters using the data available at
http://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data.
This set is made of 208 rows, each with 60 attributes. I'm reading it into a data.frame using the read.table function.
The next step is to split my data into k folds, let's say k = 5. My first attempt was to use
test <- createFolds(t, k=5)
I had two issues with this. The first is that the fold lengths are far from equal:
      Length Class  Mode
Fold1 29     -none- numeric
Fold2 14     -none- numeric
Fold3  7     -none- numeric
Fold4  5     -none- numeric
Fold5  5     -none- numeric
The other is that this apparently split my data according to the attribute indexes, but I want to split the rows themselves. I thought I could fix this by transposing my data.frame, using:
test <- t(myDataNumericValues)
But when I call the createFolds function, it gives me something like this:
      Length Class  Mode
Fold1 2496   -none- numeric
Fold2 2496   -none- numeric
Fold3 2495   -none- numeric
Fold4 2496   -none- numeric
Fold5 2497   -none- numeric
The length issue was solved, but it's still not splitting my 208 rows as intended.
What can I do? Is the caret package maybe not the most appropriate?

Please read ?createFolds to understand what the function does. It creates the indices that define which data are held out in the separate folds (see the options to return the converse):
> library(caret)
> library(mlbench)
> data(Sonar)
>
> folds <- createFolds(Sonar$Class)
> str(folds)
List of 10
$ Fold01: int [1:21] 25 39 58 63 69 73 80 85 90 95 ...
$ Fold02: int [1:21] 19 21 42 48 52 66 72 81 88 89 ...
$ Fold03: int [1:21] 4 5 17 34 35 47 54 68 86 100 ...
$ Fold04: int [1:21] 2 6 22 29 32 40 60 65 67 92 ...
$ Fold05: int [1:20] 3 14 36 41 45 75 78 84 94 104 ...
$ Fold06: int [1:21] 10 11 24 33 43 46 50 55 56 97 ...
$ Fold07: int [1:21] 1 7 8 20 23 28 31 44 71 76 ...
$ Fold08: int [1:20] 16 18 26 27 38 57 77 79 91 99 ...
$ Fold09: int [1:21] 13 15 30 37 49 53 74 83 93 96 ...
$ Fold10: int [1:21] 9 12 51 59 61 62 64 70 82 87 ...
To use these to split the data:
> split_up <- lapply(folds, function(ind, dat) dat[ind,], dat = Sonar)
> dim(Sonar)
[1] 208 61
> unlist(lapply(split_up, nrow))
Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09 Fold10
21 21 21 21 20 21 21 20 21 21
The function train is used in this package to do the actual modeling, so you don't usually need to do the splitting yourself (see this page).
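As a minimal sketch of both routes for k = 5, assuming the caret and mlbench packages are installed: createFolds is given the outcome vector (one element per row), and the returned indices select the held-out rows. The rpart model, the seed, and the names test_rows, test_set, train_set, and ctrl are illustrative choices, not part of the original answer.

```r
library(caret)    # createFolds(), trainControl(), train()
library(mlbench)  # Sonar data, as used in the answer above

data(Sonar)
set.seed(1)  # arbitrary seed, only to make the folds reproducible

# Pass the outcome vector, not the whole data frame: one index per row.
folds <- createFolds(Sonar$Class, k = 5)

for (f in names(folds)) {
  test_rows <- folds[[f]]
  test_set  <- Sonar[test_rows, ]   # the held-out fold
  train_set <- Sonar[-test_rows, ]  # the remaining four folds
  cat(f, ": train", nrow(train_set), "rows, test", nrow(test_set), "rows\n")
}

# Letting train() do the 5-fold splitting itself (rpart is an arbitrary example model).
ctrl <- trainControl(method = "cv", number = 5)
fit  <- train(Class ~ ., data = Sonar, method = "rpart", trControl = ctrl)
print(fit$results)
```

Every row index appears in exactly one fold, so the five test sets together cover all 208 rows.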

I'm not familiar with the caret package, but I once wrote a function that computes CV accuracy for a decision tree from the rpart package. Of course, the function needs modifying to suit your purpose.
library(rpart)  # provides rpart() and prune()
CV <- function(form, x, fold = 10, cp = 0.01) {
  # x is the data
  n <- nrow(x)
  set.seed(7)
  newseq <- rank(runif(n))  # random permutation of 1..n
  # assign every row to one of `fold` folds of near-equal size
  k <- as.factor((newseq - 1) %% fold + 1)
  # name of the response variable, taken from the left-hand side of the formula
  y <- unlist(strsplit(as.character(form), " "))[2]
  vec.accuracy <- vector(length = fold)
  for (i in seq(fold)) {
    # It depends on which classification method you use
    fit <- rpart(form, data = x[k != i, ], method = "class")
    fit.prune <- prune(fit, cp = cp)
    fcast <- predict(fit.prune, newdata = x[k == i, ], type = "class")
    cm <- table(x[k == i, y], fcast)
    # accuracy from the confusion matrix (assumes a two-class outcome)
    accuracy <- (cm[1, 1] + cm[2, 2])/sum(cm)
    vec.accuracy[i] <- accuracy
  }
  avg.accuracy <- mean(vec.accuracy)
  avg.error <- 1 - avg.accuracy
  cv <- data.frame(Accuracy = avg.accuracy, Error = avg.error)
  return(cv)
}

Related

how to find average of vector of averages of multiple samples in R

I want to generate 8820 samples of size 139 each with binomial distribution. I then want to find the average of each sample, put all of these averages in a single array and then find the average of said array. Here's my code:
set.seed(1865)
for(i in 1:8820){
name <- paste("V", i, sep = "")
assign(name, rbinom(139, 46, 0.83))
avgs = c(mean(name[i]))
avg = mean(avgs)
}
print(avg)
However, I get NA error after the print.
Any help?
You are attempting to create 8820 separate sample objects in your workspace.
It is better to use replicate.
n <- 8820
set.seed(1865)
samps <- replicate(n, rbinom(139, 46, 0.83), simplify=FALSE) |>
setNames(paste0('V', sprintf('%04d', seq_len(n))))
Gives a list.
str(samps)
# List of 8820
# $ V0001: int [1:139] 39 39 37 35 39 34 38 43 33 41 ...
# $ V0002: int [1:139] 43 41 39 37 36 41 38 35 40 39 ...
# $ V0003: int [1:139] 41 37 40 34 38 39 44 40 37 39 ...
# $ V0004: int [1:139] 38 43 40 39 40 39 39 39 33 39 ...
# $ V0005: int [1:139] 37 42 38 38 40 38 40 43 39 39 ...
# $ V0006: int [1:139] 40 40 40 41 38 39 40 41 39 39 ...
# $ V0007: ...
Calculate means.
sapply(samps, mean)
# V0001 V0002 V0003 V0004 V0005 V0006 V0007
# 37.88489 38.10791 38.14388 38.04317 38.31655 38.37410 38.07194
# V0008 V0009 V0010 V0011 V0012 V0013 V0014
# 38.58273 37.90647 37.94245 38.17986 38.28058 38.33813 38.28058
# V0015 V0016 V0017 V0018 V0019 V0020 V0021
# 38.51799 37.92086 38.20863 38.01439 38.05036 38.48921 38.58273
# ...
If you meant to put all the samples into a single vector:
main <- do.call(c, samps)
length(main)/n
# [1] 139
Calculate mean.
mean(main)
# [1] 38.181
If you still want the individual objects in your workspace, you can create them using:
# list2env(samps, .GlobalEnv) ## not recommended!
If you are looking for an 8820 x 139 matrix, just leave out simplify=FALSE.
set.seed(1865)
m <- t(replicate(n, rbinom(139, 46, 0.83)) )
dim(m)
# [1] 8820 139
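For completeness, here is a sketch of why the loop in the question ends in NA, and a loop-based fix that stays close to the original code. Inside that loop, name holds the text "V1", "V2", ..., not the sample itself, so mean() is applied to a character value. The names bad and n_samples are illustrative.

```r
# mean() of a character string is NA (with a warning); this is what the
# question's loop computed, because `name` is the text "V1", not the sample.
bad <- suppressWarnings(mean("V1"))

set.seed(1865)
n_samples <- 8820
avgs <- numeric(n_samples)                 # one slot per sample mean
for (i in seq_len(n_samples)) {
  avgs[i] <- mean(rbinom(139, 46, 0.83))   # mean of the i-th sample
}
avg <- mean(avgs)                          # mean of the sample means
print(avg)
```

Because every sample has the same size, the mean of the sample means equals the mean of all draws pooled together.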

Using a loop to create a polynomial model gives R trouble understanding it?

I create a lot of polynomial models to compare them, so I used a loop like this:
library(ISLR)
library(boot)
data(Wage)
list = list()
for (i in 1:10){
list[[i]] = lm(wage ~ poly(age, i), data = Wage)
assign(paste("fit.aov", i, sep = ""), list[[i]])
}
agelims <- range(Wage$age)
age.grid <- seq(agelims[1], agelims[2])
If I run the following code
preds <- predict(fit.aov1, data.frame(age = age.grid), se=TRUE)
I receive the following error:
Error: variable 'poly(age, i)' was fitted with type "nmatrix.1" but type "nmatrix.10" was supplied
In addition: Warning message:
In Z/rep(sqrt(norm2[-1L]), each = length(x)) :
longer object length is not a multiple of shorter object length
However, if I create each model manually like this
fit1 = lm(wage ~ poly(age, 1), data = Wage)
Then the predict() function runs just fine.
Here we need to build the formula as a string (with sprintf or paste), so that the degree is written into it as a number:
lst1 <- vector('list', 10)
for (i in 1:10){
fmla <- sprintf("wage~ poly(age,%d)", i)
print(fmla)
lst1[[i]] = lm(as.formula(fmla), data = Wage)
lst1[[i]]$call <- parse(text =fmla )[[1]]
assign(paste("fit.aov", i, sep = ""), lst1[[i]])
}
Testing with predict:
predict(fit.aov1, data.frame(age = age.grid), se=TRUE)
#$fit
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14
# 94.43570 95.14298 95.85025 96.55753 97.26481 97.97208 98.67936 99.38663 100.09391 100.80119 101.50846 102.21574 102.92301 103.63029
# 15 16 17 18 19 20 21 22 23 24 25 26 27 28
#104.33757 105.04484 105.75212 106.45939 107.16667 107.87394 108.58122 109.28850 109.99577 110.70305 111.41032 112.11760 112.82488 113.53215
# 29 30 31 32 33 34 35 36 37 38 39 40 41 42
#114.23943 114.94670 115.65398 116.36126 117.06853 117.77581 118.48308 119.19036 119.89764 120.60491 121.31219 122.01946 122.72674 123.43402
# 43 44 45 46 47 48 49 50 51 52 53 54 55 56
#124.14129 124.84857 125.55584 126.26312 126.97039 127.67767 128.38495 129.09222 129.79950 130.50677 131.21405 131.92133 132.62860 133.33588
# 57 58 59 60 61 62 63
#134.04315 134.75043 135.45771 136.16498 136.87226 137.57953 138.28681
# ...
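The issue is that the formula stores poly(age, i) unevaluated, with the symbol i rather than its value 1, 2, .... When predict() is called after the loop, i is looked up again and is now 10, which is why the error mentions "nmatrix.10".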
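Another way to get the degree into the formula as a literal number, sketched here as an alternative and not taken from the answer above, is bquote(), which substitutes the current value of i into the call before lm() evaluates it. The sketch uses the built-in cars data instead of Wage so that it runs without ISLR; fits, new_speed, and preds are illustrative names.

```r
# bquote() replaces .(i) with the current value of i, so each stored call
# reads poly(speed, 1L), poly(speed, 2L), ... instead of poly(speed, i).
fits <- vector("list", 3)
for (i in 1:3) {
  fits[[i]] <- eval(bquote(lm(dist ~ poly(speed, .(i)), data = cars)))
}

new_speed <- data.frame(speed = 10:12)
preds <- predict(fits[[1]], new_speed)  # no longer depends on the value of i
print(preds)
print(formula(fits[[2]]))               # the degree appears as a number
```

With this approach the model's call and formula are already self-contained, so there is no need to patch the call element afterwards.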

Writing a function to compare differences of a series of numeric variables

I am working on a problem set and absolutely cannot figure this one out. I think I've fried my brain to the point where it doesn't even make sense anymore.
Here is a look at the data ...
sex age chol tg ht wt sbp dbp vldl hdl ldl bmi
<chr> <int> <int> <int> <dbl> <dbl> <int> <int> <int> <int> <int> <dbl>
1 M 60 137 50 68.2 112. 110 70 10 53 74 2.40
2 M 26 154 202 82.8 185. 88 64 34 31 92 2.70
3 M 33 198 108 64.2 147 120 80 22 34 132 3.56
4 F 27 154 47 63.2 129 110 76 9 57 88 3.22
5 M 36 212 79 67.5 176. 130 100 16 37 159 3.87
6 F 31 197 90 64.5 121 122 78 18 58 111 2.91
7 M 28 178 163 66.5 167 118 68 19 30 135 3.78
8 F 28 146 60 63 105. 120 80 12 46 88 2.64
9 F 25 231 165 64 126 130 72 23 70 137 3.08
10 M 22 163 30 68.8 173 112 70 6 50 107 3.66
# … with 182 more rows
I must write a function, myTtest, to perform the following task:
Perform two-sample t-tests to compare a series of numeric variables between the levels of a classification variable
The first argument, dat, is a data frame
The second argument, classVar, is a character vector of length 1. It is the name of the classification variable, such as 'sex.'
The third argument, numVar, is a character vector that contains the name of the numeric variables, such as c("age", "chol", "tg"). This means I need to perform three t-tests to compare the difference of those between males and females.
The function should return a data frame with the following variables: Varname, F.mean, M.mean, t (for t-statistics), df (for degrees of freedom), and p (for p-value).
I should be able to run this ...
myTtest(dat = chol, classVar = "sex", numVar = c("age", "chol", "tg"))
... and then get the data frame to appear.
Any help is greatly appreciated. I am pulling my hair out over this one! As well, as noted in my comment below, this has to be done without Tidyverse ... which is why I'm having so much trouble to begin with.
The intuition for this solution is that you can loop over your dependent variables, and call t.test() in each loop. Then save the results from each DV and stack them together in one big data frame.
I'll leave out some bits for you to fill in, but here's the gist:
First, some example data:
set.seed(123)
n <- 20
grp <- sample(c("m", "f"), n, replace = TRUE)
df <- data.frame(grp = grp, age = rnorm(n), chol = rnorm(n), tg = rnorm(n))
df
grp age chol tg
1 m 1.2240818 0.42646422 0.25331851
2 m 0.3598138 -0.29507148 -0.02854676
3 m 0.4007715 0.89512566 -0.04287046
4 f 0.1106827 0.87813349 1.36860228
5 m -0.5558411 0.82158108 -0.22577099
6 f 1.7869131 0.68864025 1.51647060
7 f 0.4978505 0.55391765 -1.54875280
8 f -1.9666172 -0.06191171 0.58461375
9 m 0.7013559 -0.30596266 0.12385424
10 m -0.4727914 -0.38047100 0.21594157
Now make a container that each of the model outputs will go into:
fits_df <- data.frame()
Loop over each DV and append the model output to fits_df each time with rbind:
for (dv in c("age", "chol", "tg")) {
  frml <- as.formula(paste0(dv, " ~ grp")) # make a model formula: dv ~ grp
  fit <- t.test(frml, alternative = "two.sided", data = df) # perform the t-test
  # hint: use str(fit) to figure out how to pull out each value you care about
  fit_df <- data.frame(
    dv = dv,
    f_mean = xxx,
    m_mean = xxx,
    t = xxx,
    df = xxx,
    p = xxx
  )
  fits_df <- rbind(fits_df, fit_df)
}
Your output will look like this:
fits_df
dv f_mean m_mean t df p
1 age -0.18558068 -0.04446755 -0.297 15.679 0.7704954
2 chol 0.07731514 0.22158672 -0.375 17.828 0.7119400
3 tg 0.09349567 0.23693052 -0.345 14.284 0.7352112
One note: When you're pulling out values from fit, you may get odd row names in your output data frame. This is due to the names property of the various fit attributes. You can get rid of these by using as.numeric() or as.character() wrappers around the values you pull from fit (for example, fit$statistic can be cleaned up with as.character(round(fit$statistic, 3))).
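Putting the pieces together, one possible base-R sketch of myTtest with the requested column names could look like the following. It assumes the classification variable has exactly the two levels "F" and "M", and it uses R's default Welch t-test; the data frame chol_demo is simulated purely for illustration and is not the asker's data.

```r
myTtest <- function(dat, classVar, numVar) {
  rows <- lapply(numVar, function(v) {
    frml  <- as.formula(paste(v, "~", classVar))     # e.g. age ~ sex
    fit   <- t.test(frml, data = dat)                # Welch two-sample t-test
    means <- tapply(dat[[v]], dat[[classVar]], mean) # group means by level
    data.frame(
      Varname = v,
      F.mean  = means[["F"]],          # assumes a level named "F"
      M.mean  = means[["M"]],          # assumes a level named "M"
      t       = unname(fit$statistic), # unname() avoids odd row names
      df      = unname(fit$parameter),
      p       = fit$p.value
    )
  })
  do.call(rbind, rows)                 # stack one row per variable
}

# Simulated stand-in data (hypothetical values, for demonstration only).
set.seed(42)
chol_demo <- data.frame(
  sex  = rep(c("F", "M"), each = 15),
  age  = rnorm(30, mean = 35, sd = 8),
  chol = rnorm(30, mean = 180, sd = 30),
  tg   = rnorm(30, mean = 100, sd = 40)
)
res <- myTtest(dat = chol_demo, classVar = "sex", numVar = c("age", "chol", "tg"))
print(res)
```

Using lapply with do.call(rbind, ...) avoids growing a data frame inside a loop, but the for loop shown above gives the same result.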

How to resample and remodel n times by vectorization?

Here's my for-loop version of resampling and refitting the model:
B <- 999
n <- nrow(butterfly)
estMat <- matrix(NA, B+1, 2)
estMat[B+1,] <- model$coef
for (i in 1:B) {
resample <- butterfly[sample(1:n, n, replace = TRUE),]
re.model <- lm(Hk ~ inv.alt, resample)
estMat[i,] <- re.model$coef
}
I tried to avoid the for loop:
B <- 999
n <- nrow(butterfly)
resample <- replicate(B, butterfly[sample(1:n, replace = TRUE),], simplify = FALSE)
re.model <- lapply(resample, lm, formula = Hk ~ inv.alt)
re.model.coef <- sapply(re.model,coef)
estMat <- cbind(re.model.coef, model$coef)
It worked but didn't improve efficiency. Is there a way to vectorize this?
Sorry, not quite familiar with StackOverflow. Here's the dataset butterfly.
colony alt precip max.temp min.temp Hk
pd+ss 0.5 58 97 16 98
sb 0.8 20 92 32 36
wsb 0.57 28 98 26 72
jrc+jrh 0.55 28 98 26 67
sj 0.38 15 99 28 82
cr 0.93 21 99 28 72
mi 0.48 24 101 27 65
uo+lo 0.63 10 101 27 1
dp 1.5 19 99 23 40
pz 1.75 22 101 27 39
mc 2 58 100 18 9
hh 4.2 36 95 13 19
if 2.5 34 102 16 42
af 2 21 105 20 37
sl 6.5 40 83 0 16
gh 7.85 42 84 5 4
ep 8.95 57 79 -7 1
gl 10.5 50 81 -12 4
(Assuming butterfly$inv.alt <- 1/butterfly$alt)
You get the error because resample is not a list of resampled data.frames, which you can obtain with:
resample <- replicate(B, butterfly[sample(1:n, replace = TRUE),], simplify = FALSE)
Then the following should work:
re.model <- lapply(resample, lm, formula = Hk ~ inv.alt)
To extract coefficients from a list of models, re.model$coef does not work. The correct paths to the coefficients are: re.model[[1]]$coef, re.model[[2]]$coef, .... You can get all of them with the following code:
re.model.coef <- sapply(re.model, coef)
Then you can combine them with the observed coefficients:
estMat <- cbind(re.model.coef, model$coef)
In fact, you can put all of them into replicate:
re.model.coef <- replicate(B, {
bf.rs <- butterfly[sample(1:n, replace = TRUE),]
coef(lm(formula = Hk ~ inv.alt, data = bf.rs))
})
estMat <- cbind(re.model.coef, model$coef)
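On the efficiency question: replicate and lapply are still loops internally, so most of the time goes into lm() itself. One option, sketched here as an assumption and not as part of the answer above, is to call lm.fit() on a prebuilt design matrix, which skips the formula and model-frame handling that lm() repeats on every call. Only the alt and Hk columns of the posted data are used; X, y, idx, and boot.coef are illustrative names, and the seed is arbitrary.

```r
# alt and Hk columns copied from the butterfly data posted in the question
butterfly <- data.frame(
  alt = c(0.5, 0.8, 0.57, 0.55, 0.38, 0.93, 0.48, 0.63, 1.5,
          1.75, 2, 4.2, 2.5, 2, 6.5, 7.85, 8.95, 10.5),
  Hk  = c(98, 36, 72, 67, 82, 72, 65, 1, 40,
          39, 9, 19, 42, 37, 16, 4, 1, 4)
)
butterfly$inv.alt <- 1 / butterfly$alt

X <- cbind(1, butterfly$inv.alt)  # design matrix: intercept and inv.alt
y <- butterfly$Hk
n <- nrow(X)
B <- 999

set.seed(1)
boot.coef <- replicate(B, {
  idx <- sample.int(n, n, replace = TRUE)            # resample row indices
  lm.fit(X[idx, , drop = FALSE], y[idx])$coefficients
})

# Last column holds the coefficients fitted to the observed data.
estMat <- cbind(boot.coef, lm.fit(X, y)$coefficients)
dim(estMat)
```

Whether this is noticeably faster depends on the data size, so it is worth timing both versions with system.time() before switching.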

Subset using R Studio

How do I compute the ADF test with R if I do not want all my observations in it?
My time series contains 3000 observations. Now I want to compute the ADF test for example for the 200 first observations. I tried the following: ur.df(x, lags=5, selectlags="AIC", type="drift", subset=1:200) from the package urca, library(urca), but I get the following error message:
Error in summary(ur.df(Vstoxx, lags = 5, selectlags = "AIC", type = "drift", :
Fehler bei der Auswertung des Argumentes 'object' bei der Methodenauswahl
for function 'summary': Error in ur.df(Vstoxx, lags = 5, selectlags = "AIC", type = "drift", subset = 1:200) :
unused argument (subset = 1:200)
where the German part translates to: Error during evaluation of the argument 'object' in the method selection.
Here is a small data sample:
x
1 14.4700
2 14.5100
3 14.4200
4 13.8000
5 13.5700
6 12.9200
7 13.6800
8 14.0500
9 13.6400
10 13.5700
11 13.2000
12 13.1700
13 13.6300
14 14.1700
15 13.9600
16 14.1100
17 13.6300
18 13.3200
19 12.4600
20 12.8100
21 12.7200
22 12.3600
23 12.2500
24 12.3800
25 11.6000
26 11.9900
27 11.9200
28 12.1900
29 12.0400
30 11.9900
31 12.5200
32 12.3500
33 13.6600
34 13.5700
35 13.0100
36 13.2400
37 13.4900
38 13.9900
39 13.1900
40 12.2100
41 12.8900
42 12.3500
43 12.8600
44 12.5700
45 11.9300
46 11.7200
47 12.0000
48 12.5300
49 13.4700
50 12.9600
51 13.3500
52 12.4900
53 14.5700
Many thanks
Instead of adding a subset= parameter, you can simply use indexing to subset x (see my example below)
x <- c(14.4700, 14.5100, 14.4200, 13.8000, 13.5700, 12.9200, 13.6800,
14.0500, 13.6400, 13.5700, 13.2000, 13.1700, 13.6300, 14.1700, 13.9600,
14.1100, 13.6300, 13.3200, 12.4600, 12.8100, 12.7200, 12.3600, 12.2500, 12.3800,
11.6000, 11.9900, 11.9200, 12.1900, 12.0400, 11.9900, 12.5200, 12.3500, 13.6600,
13.5700, 13.0100, 13.2400, 13.4900, 13.9900, 13.1900, 12.2100, 12.8900, 12.3500,
12.8600, 12.5700, 11.9300, 11.7200, 12.0000, 12.5300, 13.4700, 12.9600, 13.3500,
12.4900, 14.5700)
library(urca)
# We'll use only the first 50 elements of x
ur.df(x[1:50], lags=5, selectlags="AIC", type="drift")
Output:
###############################################################
# Augmented Dickey-Fuller Test Unit Root / Cointegration Test #
###############################################################
The value of the test statistic is: -2.1741 2.3635
