Following is the code I am trying to run. The main objective is to run the model for different K values and then calculate the accuracies in order to choose the best K value.
So I thought of using a for loop in which each model's result and its accuracy are stored in lists and then returned with the respective k values.
But with the following code, the list has no values for elements 1 to 29; there are predicted values only for element 30.
k = 1:30
for(l in k){
pre[[l]] = knn(train_dataset,test_dataset,cl = labels_train, k = l)
}
Can someone help me out with this? Why does the list come out like that, and what should be done in order to get the correct result?
Here is a solution, with the models fit using the code in tacoman's comment.
library(class)
set.seed(1) # Make the results reproducible
knn_list <- lapply(1:30, function(l){
knn(train_dataset, test_dataset, cl = labels_train, k = l)
})
ok <- sapply(knn_list, '==', labels_test)
acc <- colMeans(ok)
which(acc == max(acc))
plot(acc, type = "b")
The for loop in the question can also be run, as long as the results list is created beforehand. The results are identical.
set.seed(1) # Make the results reproducible
k <- 1:30
pre <- vector("list", length = 30)
for(l in k){
pre[[l]] <- knn(train_dataset, test_dataset, cl = labels_train, k = l)
}
identical(pre, knn_list)
#[1] TRUE
Example data
set.seed(2021)
n <- nrow(iris)
i <- sample(n, 0.7*n)
train_dataset <- iris[i, -5]
test_dataset <- iris[-i, -5]
labels_train <- iris[i, 5]
labels_test <- iris[-i, 5]
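With the example data above, the accuracy for each k can also be collected directly into a named vector instead of keeping every prediction. This is only a sketch: it assumes the class package is installed and uses iris as a stand-in for the real data; the names k_values, acc and best_k are illustrative.

```r
library(class)

# Example split of iris, mirroring the example data above
set.seed(2021)
n <- nrow(iris)
i <- sample(n, 0.7 * n)
train_dataset <- iris[i, -5]
test_dataset  <- iris[-i, -5]
labels_train  <- iris[i, 5]
labels_test   <- iris[-i, 5]

# Fit one model per k and keep only its accuracy on the test set
set.seed(1)  # knn() breaks ties at random
k_values <- 1:30
acc <- vapply(k_values, function(k) {
  pred <- knn(train_dataset, test_dataset, cl = labels_train, k = k)
  mean(pred == labels_test)
}, numeric(1))
names(acc) <- k_values

# which.max() returns the first (smallest) k among tied maxima
best_k <- k_values[which.max(acc)]
```

Unlike which(acc == max(acc)), which lists every tied k, which.max() picks a single value, so it is worth looking at the whole acc vector before settling on one k.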
Title's a little rough, open to suggestions to improve.
I'm trying to calculate time-average covariances for a 500 length vector.
This is the equation we're using
The result I'm hoping for is a vector with an entry for k from 0 to 500 (0 would just be the variance of the whole set).
I've started with something like this, but I know I'll need to reference the gap (i) in the first mean comparison as well:
x <- rnorm(500)
xMean <-mean(x)
i <- seq(1, 500)
dfGam <- data.frame(i)
dfGam$gamma <- (1/(500-dfGam$i))*(sum((x-xMean)*(x[-dfGam$i]-xMean)))
Is it possible to do this using vector math or will I need to use some sort of for loop?
Here's the for loop that I've come up with for the solution:
gamma_func <- function(input_vec) {
output_vec <- c()
input_mean <- mean(input_vec)
iter <- seq(1, length(input_vec)-1)
for(val in iter){
iter2 <- seq((val+1), length(input_vec))
gamma_sum <- 0
for(val2 in iter2){
gamma_sum <- gamma_sum + (input_vec[val2]-input_mean)*(input_vec[val2-val]-input_mean)
}
output_vec[val] <- (1/length(iter2))*gamma_sum
}
return(output_vec)
}
Thanks
Using data.table, mostly for the shift function to make x_{t - k}, you can do this:
library(data.table)
gammabar <- function(k, x){
xbar <- mean(x)
n <- length(x)
df <- data.table(xt = x, xtk = shift(x, k))[!is.na(xtk)]
df[, sum((xt - xbar)*(xtk - xbar))/n]
}
gammabar(k = 10, x)
# [1] -0.1553118
The filter [!is.na(xtk)] starts the sum at t = k + 1, because xtk will be NA for the first k indices due to being shifted by k.
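If the covariance is needed for every lag at once, the same sum can be written in base R. A minimal sketch, assuming the estimator is gamma_k = (1/n) * sum over t = k+1..n of (x_t - xbar) * (x_{t-k} - xbar), i.e. the 1/n divisor used in gammabar above (the loop in the question divides by n - k instead). The names gamma_all and max_lag are hypothetical.

```r
gamma_all <- function(x, max_lag = length(x) - 1) {
  n <- length(x)
  d <- x - mean(x)  # deviations from the overall mean
  vapply(0:max_lag, function(k) {
    # pair x_t (t = k+1..n) with x_{t-k} (t-k = 1..n-k)
    sum(d[(k + 1):n] * d[1:(n - k)]) / n
  }, numeric(1))
}

set.seed(1)
x <- rnorm(500)
g <- gamma_all(x, max_lag = 20)  # g[1] is lag 0, g[k + 1] is lag k
```

With this divisor the result should agree with acf(x, type = "covariance", plot = FALSE) from the stats package, which makes a convenient cross-check.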
Reproducible x
x <- c(0.376972124936433, 0.301548373935665, -1.0980231706536, -1.13040590360378,
-2.79653431987176, 0.720573498411587, 0.93912102300901, -0.229377746707471,
1.75913134696347, 0.117366786802848, -0.853122822287008, 0.909259181618213,
1.19637295955276, -0.371583903741348, -0.123260233287436, 1.80004311672545,
1.70399587729432, -3.03876460529759, -2.28897494991878, 0.0583034949929225,
2.17436525195634, 1.09818265352131, 0.318220322390854, -0.0731475581637693,
0.834268741278827, 0.198750636733429, 1.29784138432631, 0.936718306241348,
-0.147433193833294, 0.110431994640128, -0.812504663900505, -0.743702167768748,
1.09534507180741, 2.43537370755095, 0.38811846676708, 0.290627670295127,
-0.285598287083935, 0.0760147178373681, -0.560298603759627, 0.447188372143361,
0.908501134499943, -0.505059597708343, -0.301004012157305, -0.726035976548133,
-1.18007702699501, 0.253074712637114, -0.370711296884049, 0.0221795637601637,
0.660044122429767, 0.48879363533552)
I am fairly new to programming in R, so I apologize if this question is too basic. I am trying to study the properties of OLS with error terms created by three different processes (i.e., normal1, normal2, and chi-square). I include these in a list, 'fun_list'.
I would like to iterate through 1,000 (iter) regressions, each with sample size 500 (n). I would like to save all 1,000 X 500 observations in a dataset (big_data) as well as the regression results (reg_results).
At the end of the program, I would like 1,000 regressions for each of the three processes (for a total of 3,000 regressions). I have set up nested loops for the three functions on one level and the 1,000 iterations on a different (sub-) level. I am having trouble getting the program to loop through the three different functions. I am not sure how to call out each element of the list in this embedded loop. Any help would be greatly appreciated!
library(psych)
library(arm)
library(dplyr)
library(fBasics)
library(sjstats)
#set sample size and number of iterations
set.seed(12345)
n <- 500
iter <- 1000
#setting empty vectors. Probably a better way to do this. :)
bn <- rep(NA,iter)
sen <- rep(NA,iter)
#these are the three functions I want to use to generate en,
#which is the error term below. I want one loop for each of the three.
# I can get f1, f2 and f3 to work independently, but I can't get the list
#to work to cycle through all three.
f1 <- function (n) {rnorm(n, 0, 2)}
f2 <- function (n) {rnorm(n, 0, 10)}
f3 <- function (n) {rchisq(n, 2)}
fun_list <- list(f1, f2, f3)
#following line starting point for saving all iterations in one big
#dataset
datalist = list()
#if I remove the following line (for (j ....)), I can get this to work by
#referencing each function independently (i.e., using 'en <- f1(n)').
for (j in fun_list) {
for (s in 1:iter) {
# en <- f1(n)
en <- fun_list[[1]]
x <- rnorm(n, 0, .5)
yn <- .3*x + en
#this is the part that saves the data#
dat <- data.frame(yn, x, en)
dat$s <- s
datalist[[s]] <- dat
#### run model for normal data and save parameters###
lm1n <- lm(yn ~ x)
int.hatn <- coef (lm1n)[1]
b.hatn <- coef (lm1n)[2]
se.hatn <- se.coef (lm1n) [2]
##save them for each iteration
bn[s] = b.hatn
sen[s] = se.hatn
}
}
reg_results<- tibble(bn, sen)
big_data = do.call(rbind,datalist)
When using the loop, I get the following error:
Error in 0.3 * x + en : non-numeric argument to binary operator
I am assuming this is because I do not fully understand how to call out each of the three functions in the list.
Here is a complete solution which wraps the multiple points discussed in the comments:
library(psych)
library(arm)
library(dplyr)
library(fBasics)
library(sjstats)
#set sample size and number of iterations
set.seed(12345)
n <- 500
iter <- 1000
#setting empty vectors. Probably a better way to do this. :)
bn <- c()
sen <- c()
#these are the three functions used to generate the error term.
#The outer loop below runs once for each of them.
f1 <- function (n) {rnorm(n, 0, 2)}
f2 <- function (n) {rnorm(n, 0, 10)}
f3 <- function (n) {rchisq(n, 2)}
fun_list <- list(f1, f2, f3)
#following line starting point for saving all iterations in one big
#dataset
datalist = list()
#loop over the positions in fun_list so that each generator can be
#extracted with [[j]] and then called as a function.
for (j in seq_along(fun_list)) {
en <- fun_list[[j]]
for (s in 1:iter) {
x <- rnorm(n, 0, .5)
random_part <- en(n)
yn <- .3*x + random_part
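#this is the part that saves the data#
dat <- data.frame(yn, x, random_part)
dat$s <- s
dat$fun <- j #record which error process generated these rows
datalist[[length(datalist) + 1]] <- dat #append, so later processes do not overwrite earlier ones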
#### run model for normal data and save parameters###
lm1n <- lm(yn ~ x)
int.hatn <- coef(lm1n)[1]
b.hatn <- coef(lm1n)[2]
se.hatn <- se.coef(lm1n)[2]
##save them for each iteration
bn = c(bn,b.hatn)
sen = c(sen,se.hatn)
}
}
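#bn and sen hold iter results per function, in the order of fun_list
reg_results <- tibble(fun = rep(seq_along(fun_list), each = iter), bn, sen)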
big_data = do.call(rbind,datalist)
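The error in the question comes from en <- fun_list[[1]], which stores the function itself rather than a vector of draws, so .3*x + en tries to add a function to a numeric vector. A small sketch of extracting and calling functions held in a list; the list names and the object names en_fun and draws are illustrative only.

```r
f1 <- function(n) rnorm(n, 0, 2)
f2 <- function(n) rnorm(n, 0, 10)
f3 <- function(n) rchisq(n, 2)
fun_list <- list(normal1 = f1, normal2 = f2, chisq = f3)

set.seed(12345)
en_fun <- fun_list[[1]]   # [[ ]] extracts the function object
is.function(en_fun)       # it is still a function, not numbers
en <- en_fun(5)           # calling it produces the error terms

# Looping over the list directly also works: each element is a function
draws <- lapply(fun_list, function(f) f(5))
```

Naming the list elements is optional, but it means the results in draws carry the name of the process that produced them.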
I have a dataset (HEART) that I split into chunks. I would like to predict each chunk with its M = 3 previous models. In this case, I would like to predict chunk 10 with models 7, 8, 9; chunk 9 with models 6, 7, 8; and so on down to chunk 4 with models 1, 2, 3.
Here is my code:
library(caret)
library(rpart) #rpart() is used below
dat1 <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"), header = FALSE,sep = ",")
colnames(dat1) <- c(LETTERS[1:(ncol (dat1)-1)],"CLA")
dat1$CLA<-as.factor (dat1$CLA)
chunk <- 30
M <- 3 #number of previous models used for each chunk
n <- nrow(dat1)
r <- rep(1:floor(n/chunk),each=chunk)[1:n]
d <- split(dat1,r)
N<-floor(n/chunk)
cart.models <- list()
for(i in 1:N){cart.models[[i]]<-rpart(CLA~ ., data = d[[i]]) }
for (i in (1+M):N) { k=0
for (j in (i-M):(i-1)) {
k=k+1
d[[i]][,(ncol(d[[i]])+k)]<-(predict(cart.models[[j]], d[[i]][,c(-14)], type = "class") )
}
}
I get the following Error:
Error in `[<-.data.frame`(`*tmp*`, , (ncol(d[[i]]) + k), value = c(1L, :
new columns would leave holes after existing columns
Your question is a bit puzzling: you load caret without using any functions from it. The objective looks like a time series analysis, but instead of building on one chunk and predicting on the one that comes after it, you want something more complex, so createTimeSlices from caret won't do the trick.
You could create custom folds in caret with index and indexOut arguments in trainControl but that would ultimately lead to the creation of more models (21 to be exact) than is required for the presented objective (9). So I do believe loops are an appropriate way:
create the models:
library(rpart)
N <- 9
cart.models <- list()
for(i in 1:N){
cart.models[[i]] <- rpart(CLA~ ., data = d[[i]])
}
N can be 9 since model 10 is never used for prediction later on.
create a matrix to store the values:
cart.predictions <- matrix(nrow = chunk, ncol = length(4:10)*3)
It should have the same number of rows as there are predictions in each chunk (so 30) and as many columns as there are predictions (three models for each of the chunks 4:10).
k <- 0 #as a counter
for (j in 4:10) { #prediction on chunks 4:10
p <- j-3
pred <- list()
for(i in p : (p+2)) { #using models (chunk - 3) : (chunk - 1)
k = k + 1
predi <- predict(cart.models[[i]], d[[j]], type = "class")
cart.predictions[,k] <- predi
}
}
This creates a numeric matrix of predictions. By default, when R converts a factor to numeric it returns the integer codes: 1 for the first level, 2 for the second, and so on. So to get the levels (0:4) you can just subtract 1:
cart.predictions <- as.data.frame(cart.predictions - 1)
to create the column names:
names <- expand.grid(3:1, 4:10)
names$Var1 <- with(names, Var2 - Var1)
colnames(cart.predictions) <- make.names(paste0(names$Var1,"_", names$Var2))
Let's check that it is correct:
The prediction from model 5 on chunk 6, converted to numeric,
as.numeric(as.character(predict(cart.models[[5]], d[[6]], type = "class")))
should be equal to
cart.predictions[["X5_6"]] #that's how the names were designed
all.equal(as.numeric(as.character(predict(cart.models[[5]], d[[6]], type = "class"))),
cart.predictions[["X5_6"]])
#output
TRUE
or you can create a character matrix in the first place:
cart.predictions <- matrix(data = NA_character_, nrow = chunk, ncol = length(4:10)*3)
k <- 0 #as a counter
for (j in 4:10) {
p <- j-3
pred <- list()
for(i in p : (p+2)) {
k = k + 1
predi <- predict(cart.models[[i]], d[[j]], type = "class")
cart.predictions[,k] <- as.character(predi) #store the level labels, not the integer codes
}
}
cart.predictions <- as.data.frame(cart.predictions)
This should be the preferred method if the classes are names rather than numbers.
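As for the error in the question: a data frame only accepts a new column at position ncol + 1. In the original loop ncol(d[[i]]) grows after the first assignment while k also increases, so the second assignment targets position ncol + 2 and would leave a gap. A toy sketch on a made-up data frame (toy and n_orig are illustrative names):

```r
toy <- data.frame(a = 1:3, b = 4:6)

# Skipping a position fails: column 3 would be left empty
res <- try(toy[, ncol(toy) + 2] <- 0, silent = TRUE)
inherits(res, "try-error")

# Fixing the original column count keeps the new columns contiguous
n_orig <- ncol(toy)
for (k in 1:3) {
  toy[, n_orig + k] <- k * 10
}
ncol(toy)
```

Storing the predictions in a separate matrix, as done above, avoids the problem altogether and keeps the predictions out of the data passed to predict().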
I need some help with a train and test set that I am creating in R. The goal of the code is to split a data set into k folds and use fold number i as the test set. It should then return the training and test sets. We assume that k will be 5 or 10.
This is what I have so far.
create_sets<-function(df,k,i)
{
n<-dim(df)[1]
#fold size
size<-n/k
#beginning of test set
test_start<-(size*i)-(size)+1
#end of test set
test_end<-size*i
indices<-df(test_start,test_end)
train<-df[indices,]
test<-df[-indices,]
return (list(train=train,test=test))
}
df is just a random data frame of x and y. That is:
x<-c(1,6,7,4,3,5,7,8,9,8,7,6,5,4,3,4,5,3,2,1)
y<-c(3,5,6,7,5,4,3,5,7,8,9,0,2,3,4,5,6,7,5,6)
df<-data.frame(x,y)
When I run the code I get an error:
Error in df(test_start, test_end) :
argument "df2" is missing, with no default
This is how I would approach it:
n <- nrow(df)
k <- 5
set.seed(10272015)
s <- sample(1:k, n, replace=TRUE)
result <- rep(NA, k)
for (i in 1:k) {
train <- df[s!=i, ]
test <- df[s==i,]
# fit model
# evaluate
# result[i] <- evalscore
}
mean(result)
I think you just need an index for the different subsets, like this:
k <- 5
folds <- sample(rep(1:k,length=nrow(df)))
Then you can get any one of the k subsets like this (taking fold 1 as an example):
df[folds==1,]
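Coming back to the function in the question: df(test_start, test_end) calls the function df(), which in R is the density of the F distribution, hence the complaint about the missing df2 argument. The train and test subsets are also swapped there. A minimal sketch of a corrected version, assuming contiguous folds are wanted and that fold sizes may differ by one when n is not a multiple of k:

```r
create_sets <- function(df, k, i) {
  n <- nrow(df)
  # fold label for every row: 1,1,...,2,2,...,k,k
  folds <- sort(rep(seq_len(k), length.out = n))
  test_idx <- which(folds == i)
  list(train = df[-test_idx, , drop = FALSE],
       test  = df[test_idx, , drop = FALSE])
}

x <- c(1,6,7,4,3,5,7,8,9,8,7,6,5,4,3,4,5,3,2,1)
y <- c(3,5,6,7,5,4,3,5,7,8,9,0,2,3,4,5,6,7,5,6)
df <- data.frame(x, y)
sets <- create_sets(df, k = 5, i = 2)
```

Here sets$test holds rows 5 to 8 and sets$train the remaining 16 rows. If the rows are ordered in some meaningful way, shuffling the fold labels with sample(), as in the answers above, is usually the safer choice.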
I'm writing some code in R and I came across the following problem:
Basically, I want to calculate a variable X[k], where X takes on values for each k, like this:
where A is a known variable which takes on different values for each index.
For the moment, I have something like this:
k <- NULL
X <- NULL
z<- 1: n
for (k in seq(along =z)){
for (j in seq (along = 1:k)){
X[k] = 1/k*sum(A[n-k]/A[n-j+1])
}
}
which can't be right. Any idea on how to fix this one?
As always, any help would be dearly appreciated.
Try this
# define A
A <- c(1,2,3,4)
n <- length(A)
z <- 1:(n-1) #k = n is skipped because A[n-k] would be A[0], which is empty
#predefine X with the same length as A; X[n] stays NA
X <- rep(NA_real_, n)
for(k in z){
#sum over j = 1..k in one vectorised step; an inner loop over j would keep only the last term
X[k] = 1/k*sum(A[n-k]/A[n-(1:k)+1])
}
You don't need to define z, it is only used inside the for. In that case, write for(k in 1:(n-1)){
You can do the following
set.seed(42)
A <- rnorm(10)
k <- sample(length(A), 4)
calc_x <- function(A, k){
n <- length(A)
c_sum <- cumsum(1/rev(A)[1:max(k)])
A[n-k]/k * c_sum[k]
}
calc_x(A,k)
which returns:
[1] 0.07775603 2.35789999 -0.45393983 0.13323284
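Reading the formula from the code as X[k] = (1/k) * sum over j = 1..k of A[n-k] / A[n-j+1] (an assumption, since the equation itself is not shown), the cumulative-sum version can be checked against a direct evaluation. calc_x_direct is a hypothetical helper, and k is restricted to 1..(n-1) because A[n-k] does not exist for k = n.

```r
calc_x <- function(A, k){
  n <- length(A)
  c_sum <- cumsum(1/rev(A)[1:max(k)])  # c_sum[k] = sum_{j=1}^{k} 1/A[n-j+1]
  A[n-k]/k * c_sum[k]
}

# Direct evaluation of the sum for each requested k
calc_x_direct <- function(A, k) {
  n <- length(A)
  vapply(k, function(kk) {
    j <- seq_len(kk)
    sum(A[n - kk] / A[n - j + 1]) / kk
  }, numeric(1))
}

set.seed(42)
A <- rnorm(10)
k <- 1:9
all.equal(calc_x(A, k), calc_x_direct(A, k))
```

The cumulative sum is computed once and reused for every k, so the vectorised version avoids recomputing the inner sum for each index.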