Find the best value of k for KNN with a for loop - R

I am trying to write a for statement to find the best value of k in KNN. I have run my code snippet several times, but it does not seem to calculate the correct value. Do you have any idea what is wrong with the statement?
# Tune the value of k using cross-validation (knn.cv performs leave-one-out CV)
library(class)  # provides knn.cv()
bestaccuracy <- 0
n.folds <- 100  # largest k to try
for (k in 1:n.folds) {
  set.seed(1)
  knn.cvac <- knn.cv(train = x.australian.stan, cl = y.australian, k = k)
  knn.cvac.table <- table(knn.cvac, y.australian)
  knn.cvac.accuracy <- sum(diag(knn.cvac.table)) / sum(knn.cvac.table)
  if (bestaccuracy < knn.cvac.accuracy) {
    bestk <- k
    bestaccuracy <- knn.cvac.accuracy
  }
}
print(bestk)
print(bestaccuracy)

I tested it on some simulated data and it works just fine! The only thing to note is that several values of k may tie for the highest accuracy, and because the comparison is strict (<), your loop keeps only the first (smallest) of them.
Perhaps you can store the accuracy for each k in a vector (create accuracies <- numeric(n.folds) before the loop and set accuracies[k] <- knn.cvac.accuracy inside it), and then after the loop use:
bestk <- which(accuracies == max(accuracies))
So you can see all the optimal ks when you print bestk.
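
As a rough illustration of collecting every tied k, here is a minimal sketch. It assumes the built-in iris data as a stand-in for the Australian data set from the question, and the names x, y, max_k, accuracies, best_k and best_accuracy are made up for this example.

```r
library(class)  # provides knn.cv(), leave-one-out cross-validation for KNN

# Stand-in data (assumption): standardized iris measurements and species labels
x <- scale(iris[, 1:4])
y <- iris$Species
max_k <- 30  # largest k to try (illustrative choice)

set.seed(1)  # knn.cv breaks distance ties at random
accuracies <- sapply(seq_len(max_k), function(k) {
  pred <- knn.cv(train = x, cl = y, k = k)
  mean(pred == y)  # proportion of correctly classified observations
})

best_accuracy <- max(accuracies)
best_k <- which(accuracies == best_accuracy)  # all k that reach the maximum
print(best_k)
print(best_accuracy)
```

Keeping the whole accuracies vector also makes it easy to plot accuracy against k and see how flat the optimum is.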

Related

Grid Search in R for Nonparametric Quantile Regression

I use the R package "quantreg" to estimate a fully nonparametric quantile regression on time series data. To get statistically significant results I try lots of variables and smoothing parameter values (lambda), but this is exhausting and very time consuming. Therefore, I want to apply a grid search, but I find it a bit hard. I want to determine the best smoothing values, so I need a for loop that tries every combination. At the end I want to have the lambda values of the best model or models (those where all variables' p-values are < 0.05).
For example if I have three variables in my equation I've written something like that:
lambdas1 <- rbind(1,2,3)
lambdas2 <- rbind(1,2,3)
lambdas3 <- rbind(1,2,3)
mylist <- list()
for (i in 1:3) {
  for (j in 1:3) {
    for (n in 1:3) {
      f <- try(rqss(Y~qss(X1,lambda = lambdas1[i])+qss(X2,lambda = lambdas2[j])+qss(X3,lambda = lambdas3[n]), tau=0.05))
      sf <- summary(f)
      if( (sf[["qsstab"]]['X1','Pr(>F)']<0.05)&(sf[["qsstab"]]['X2','Pr(>F)']<0.05)&(sf[["qsstab"]]['X3','Pr(>F)']<0.05) ){
        mylist[[i]] <- f$lambdas
      }
    }
  }
}
How can I rearrange this code?
Is there any shortcut?
Any help will be appreciated.
Thank you in advance.
You can use base R's expand.grid to create a data.frame of all the possible combinations and then use apply(grid, MARGIN=1, ...) to loop through its rows. I also simplified the condition: instead of checking each p-value separately, it uses all(p.vals < .05).
lambdas <- expand.grid(1:3,1:3,1:3)
check_lambdas <- function(lambdas){
  f <- try(rqss(Y~qss(X1,lambda = lambdas[1])+qss(X2,lambda = lambdas[2])+qss(X3,lambda = lambdas[3]), tau=0.05))
  if (inherits(f, "try-error")) return(NULL)  # skip combinations where the fit fails
  if( all(summary(f)$qsstab[,'Pr(>F)']<0.05) ) f$lambdas else NULL
}
apply(lambdas, 1, check_lambdas)  # MARGIN = 1 applies the function to each row
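
The row-wise grid search pattern can be tried without quantreg. The sketch below is only an illustration: check_combo is a hypothetical stand-in for "fit the model and test its p-values" (it simply accepts combinations whose sum is even), and the names grid, fits, kept and kept_grid are made up for this example.

```r
# All combinations of three candidate smoothing values
grid <- expand.grid(lambda1 = 1:3, lambda2 = 1:3, lambda3 = 1:3)

# Hypothetical acceptance rule standing in for the model fit and p-value check:
# return the combination if it "passes", otherwise NULL
check_combo <- function(lambdas) {
  if (sum(lambdas) %% 2 == 0) lambdas else NULL
}

# Visit every row of the grid; lapply always returns a list, so mixing
# NULL and non-NULL results is unproblematic
fits <- lapply(seq_len(nrow(grid)), function(r) check_combo(unlist(grid[r, ])))

# Drop the rejected combinations and stack the accepted ones into a matrix
kept <- Filter(Negate(is.null), fits)
kept_grid <- do.call(rbind, kept)
head(kept_grid)
```

Using lapply over row indices rather than apply keeps the return type predictable when some combinations yield NULL and others yield a vector.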

Dynamic linear regression loop for different order summation

I've been trying hard to recreate this model in R:
[Image: model equation] (FARHANI 2012)
I've tried many things, such as a cumsum paste - however that would not work as I could not assign strings the correct variable as it kept thinking that L was a function.
I tried to do it manually, I'm only looking for p,q = 1,2,3,4,5 however after starting I realized how inefficient this is.
This is essentially what I am trying to do
model5 <- vector("list",20)
#p=1-5, q=0
model5[[1]] <- dynlm(DLUSGDP~L(DLUSGDP,1))
model5[[2]] <- dynlm(DLUSGDP~L(DLUSGDP,1)+L(DLUSGDP,2))
model5[[3]] <- dynlm(DLUSGDP~L(DLUSGDP,1)+L(DLUSGDP,2)+L(DLUSGDP,3))
model5[[4]] <- dynlm(DLUSGDP~L(DLUSGDP,1)+L(DLUSGDP,2)+L(DLUSGDP,3)+L(DLUSGDP,4))
model5[[5]] <- dynlm(DLUSGDP~L(DLUSGDP,1)+L(DLUSGDP,2)+L(DLUSGDP,3)+L(DLUSGDP,4)+L(DLUSGDP,5))
I'm also trying to do this for regressing DLUSGDP on DLWTI (my oil variable's name) for when p=0, q=1-5 and also p=1-5, q=1-5
cumsum would not work as it would sum the variables rather than treating them as independent regressors.
My goal is to run these models and then use IC to determine which should be analyzed further.
I hope you understand my problem and any help would be greatly appreciated.
I think this is what you are looking for:
reformulate(paste0("L(DLUSGDP,", 1:n,")"), "DLUSGDP")
where n is some order you want to try. For example,
n <- 3
reformulate(paste0("L(DLUSGDP,", 1:n,")"), "DLUSGDP")
# DLUSGDP ~ L(DLUSGDP, 1) + L(DLUSGDP, 2) + L(DLUSGDP, 3)
Then you can construct your model fitting by
model5 <- vector("list",20)
for (i in 1:20) {
  form <- reformulate(paste0("L(DLUSGDP,", 1:i,")"), "DLUSGDP")
  model5[[i]] <- dynlm(form)
}
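
The same reformulate idea can be extended to the p and q combinations mentioned in the question. This is a minimal sketch that only builds the formulas, so it runs without dynlm. The helper make_formula and the names orders and forms are made up for this example, and it assumes the oil lags run from 1 to q.

```r
# Build DLUSGDP ~ (lags 1..p of DLUSGDP) + (lags 1..q of DLWTI)
make_formula <- function(p, q) {
  own_lags <- if (p > 0) paste0("L(DLUSGDP,", seq_len(p), ")")  # NULL when p == 0
  oil_lags <- if (q > 0) paste0("L(DLWTI,", seq_len(q), ")")    # NULL when q == 0
  reformulate(c(own_lags, oil_lags), response = "DLUSGDP")
}

# Every (p, q) pair with p, q in 0..5, except the empty model p = q = 0
orders <- expand.grid(p = 0:5, q = 0:5)
orders <- orders[orders$p + orders$q > 0, ]

forms <- Map(make_formula, orders$p, orders$q)
names(forms) <- paste0("p", orders$p, "_q", orders$q)
forms[["p2_q1"]]
```

Each formula in forms could then be passed to dynlm as in the loop above, and the fitted models compared with an information criterion.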

Genetic algorithms under R, adding suggestions

In the genalg package, the rbga.bin function offers the possibility of adding a list of suggestions. However, I can't find any example of this feature actually working. Could anyone give me some help?
library(genalg)
evaluation <- function(x) {
  n <- 2
  if (sum(x) != n) {
    return(100)
  }
  if (sum(x) == n) {
    sequen <- which(x > 0)
    l <- sum(sequen)
    return(-l)
  }
}
vect1 <- rep(0, times = 40)
vect1[c(1,2)] <- c(1,1)
sug <- list(vect1)
for (i in 2:100) {
  vect1 <- sample(vect1)
  sug[[i]] <- vect1
}
GAmodel <- rbga.bin(size=40,popSize =100, iters =100, suggestions=sug,mutationChance = 0.01,elitism =T, evalFunc=evaluation,verbose=T)
Although documentation for rbga.bin function says:
suggestions: optional list of suggested chromosomes
rbga.bin apparently wants a data.frame or matrix:
# taken from the rbga.bin source code
suggestionCount = dim(suggestions)[1]
for (i in 1:suggestionCount) {
population[i, ] = suggestions[i, ]
}
When given a matrix, it seems to work fine:
sug2 <- t(replicate(sample(vect1),n = 10)) # needs to be rotated. check your solution n = 99 and it will fail
GAmodel <- rbga.bin(size=40,popSize =100, iters =100, suggestions=sug2,mutationChance = 0.01,elitism =T, evalFunc=evaluation,verbose=T)
Output:
Testing the sanity of parameters...
Not showing GA settings...
Adding suggestions to first population...
Filling others with random values in the given domains...
Starting iteration 1
Calucating evaluation values... .................................................................................................... done.
Creating next generation...
sorting results...
applying elitism...
applying crossover...
applying mutations... 40 mutations applied
Starting iteration 2
Calucating evaluation values... .................................................................................................. done.
Creating next generation...
<...>
Starting iteration 100
Calucating evaluation values... .................................................................................................. done.
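
If the suggestions already exist as a list, as in the question, they can be converted to the row-per-chromosome matrix that the source excerpt above indexes. This is a minimal base R sketch. The names sug and sug_mat are illustrative, and the result is not passed to rbga.bin here, so how many suggestions the function accepts is left unverified.

```r
set.seed(1)

# A chromosome of length 40 with exactly two genes switched on
vect1 <- rep(0, times = 40)
vect1[c(1, 2)] <- 1

# A list of 10 shuffled chromosomes, mimicking the list built in the question
sug <- replicate(10, sample(vect1), simplify = FALSE)

# Stack the list elements as rows: one suggested chromosome per row
sug_mat <- do.call(rbind, sug)
dim(sug_mat)
```

The matrix sug_mat has the same layout as sug2 in the answer and could be supplied through the suggestions argument in the same way.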

A small simulation study about normality tests in R

I am conducting a small simulation study to judge how good two normality tests really are. My plan is to generate a multitude of normal distribution samples of not too many observations and determine how often each test rejects the null hypothesis of normality.
The (incomplete) code I have so far is
library(nortest)
y<-replicate(10000,{
x<-rnorm(50)
ad.test(x)$p.value
ks.test(x,y=pnorm)$p.value
}
)
Now I would like to count the proportion of these p-values that are smaller than 0.05 for each test. Could you please tell me how I could do that? I apologise if this is a newbie question, but I myself am new to R.
Thank you.
library(nortest)
nsim <- 10000
nx <- 50
set.seed(101)
y <- replicate(nsim,{
x<-rnorm(nx)
c(ad=ad.test(x)$p.value,
ks=ks.test(x,y=pnorm)$p.value)
}
)
apply(y<0.05,MARGIN=1,mean)
## ad ks
## 0.0534 0.0480
Using MARGIN=1 tells apply to take the mean across rows, rather than columns -- this is sensible given the ordering that replicate()'s built-in simplification produces.
For examples of this type, the type I error rates of any standard tests will be extremely close to their nominal value (0.05 in this example).
If you run each test separately, then you can simply count how many of the values stored in y are less than 0.05.
y<-replicate(1000,{
x<-rnorm(50)
ks.test(x,y=pnorm)$p.value})
length(which(y<0.05))
Your code isn't outputting the p-values. You could do something like this:
library(nortest)  # for ad.test()
rep_test <- function(reps=10000) {
  p_ks <- rep(NA, reps)
  p_ad <- rep(NA, reps)
  for (i in 1:reps) {
    x <- rnorm(50)
    p_ks[i] <- ks.test(x, y=pnorm)$p.value
    p_ad[i] <- ad.test(x)$p.value
  }
  return(data.frame(cbind(p_ks, p_ad)))
}
test <- rep_test()  # run the simulation and keep the p-values
sum(test$p_ks<.05)/10000
sum(test$p_ad<.05)/10000
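
For readers without nortest installed, the same simulation pattern can be sketched with base R only. Note the swap: this example uses shapiro.test in place of ad.test, so it illustrates the counting technique rather than reproducing the comparison above. The names p_values and reject_rate are made up for this example.

```r
set.seed(101)
nsim <- 2000  # number of simulated samples (smaller than above to run quickly)
nx <- 50      # observations per sample

# Each replication returns a named vector of two p-values;
# replicate() simplifies the results to a 2 x nsim matrix
p_values <- replicate(nsim, {
  x <- rnorm(nx)
  c(sw = shapiro.test(x)$p.value,
    ks = ks.test(x, "pnorm")$p.value)
})

# Proportion of p-values below 0.05 for each test (one row per test)
reject_rate <- rowMeans(p_values < 0.05)
reject_rate
```

Because the samples really are standard normal, both proportions estimate the type I error rate and should land near 0.05.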

Writing data from loops to a matrix in R

I need help. I have Form X (5000 samples) and Form Y (5000 samples). I draw samples (with sample sizes ranging from 10 to 400) from Forms X and Y and equate the forms in R. But I have a problem: I want to write the equated scores (for each sample size and for 100 replications) to a matrix, but I couldn't. I would be glad if you could help.
My code:
x<-read.table("X_Top_25_SANS_0.csv")
y<-read.table("Y_sum25_SANS_0_SD0.1.csv")
data_xy<-data.frame(x,y)
ii<-c(10,15,25,50,75,100,125,150,200,250,400) # Sample sizes
jj<-c(rep(1:100)) # replication
################# sample loops ######################
for(i in 1:length(ii)){ #sample loop
for(j in 1:length(jj)){ #replication loop
x_rep<-sample(data_xy[,1],ii[i],replace=TRUE) #drawing sample
y_rep<-sample(data_xy[,2],ii[i],replace=TRUE)
xy_lin=(sd(y_rep)/sd(x_rep))*x_rep + mean(y_rep) - (mean(x_rep)*(sd(y_rep)/sd(x_rep)))
}}
I want to write "xy_lin" to matrix
As I said in my comment, it is difficult to understand the problem without a reproducible example, and I am not sure I correctly understand the formula that you apply in the loop, so my solution is based on some guesses. Anyway, try this:
data_xy<-data.frame(x=rnorm(5000),y=rnorm(5000)) # random data
ii<-c(10,15,25,50,75,100,125,150,200,250,400)
jj<-c(rep(1:100))
xy_lin <- matrix(NA, nrow=length(ii), ncol=length(jj)) # create the results matrix
for(i in 1:length(ii)){
for(j in 1:length(jj)){
x_rep<-sample(data_xy[,1],ii[i],replace=TRUE)
y_rep<-sample(data_xy[,2],ii[i],replace=TRUE)
xy_lin[i,j] <- sd(y_rep)/sd(x_rep)*mean(x_rep)+mean(y_rep) - mean(x_rep)*(sd(y_rep)/sd(x_rep))
}}
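Note that I changed your formula from ...*x_rep to ...*mean(x_rep), since I think that this is what you had in mind and it yields a single number that fits into one matrix cell. But I might be mistaken, so please be aware of that. Also be aware that with this change the two mean(x_rep) terms cancel, so each cell simply holds mean(y_rep).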
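
If the full vector of equated scores is wanted for every replication, one option is a list with one matrix per sample size, holding one column per replication. This is a minimal sketch on simulated stand-in data. The helper equate_once and the names n_rep and equated are made up for this example, and the formula is the linear equating expression from the question.

```r
set.seed(42)

# Simulated stand-in for the two forms (assumption: the real data are not available)
data_xy <- data.frame(x = rnorm(5000, 50, 10), y = rnorm(5000, 52, 9))
ii <- c(10, 15, 25, 50, 75, 100, 125, 150, 200, 250, 400)  # sample sizes
n_rep <- 100                                               # replications

# Draw one pair of samples of size n and return the n equated scores
equate_once <- function(n) {
  x_rep <- sample(data_xy$x, n, replace = TRUE)
  y_rep <- sample(data_xy$y, n, replace = TRUE)
  slope <- sd(y_rep) / sd(x_rep)
  slope * x_rep + mean(y_rep) - slope * mean(x_rep)
}

# One matrix per sample size: n rows (scores) by n_rep columns (replications)
equated <- lapply(ii, function(n) replicate(n_rep, equate_once(n)))
names(equated) <- paste0("n", ii)
dim(equated$n10)
```

A list is used because the matrices have different numbers of rows, so they cannot share a single rectangular matrix without padding.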

Resources