Grid Search in R for Nonparametric Quantile Regression - r

I use the R package "quantreg" to estimate a fully nonparametric quantile regression on time series data. To get statistically significant results I try many variables and smoothing parameter values (lambda), but this is exhausting and very time consuming. I therefore want to apply a grid search, which I find a little hard to set up. To determine the best smoothing values I need a for loop that tries every combination. At the end I want the lambda values of the best model or models (those where every variable's p-value is < 0.05).
For example, with three variables in my equation I've written something like this:
lambdas1<-rbind(1,2,3)
lambdas2<-rbind(1,2,3)
lambdas3<-rbind(1,2,3)
mylist<-list()
for (i in 1:3) {
for (j in 1:3) {
for (n in 1:3) {
f <-try(rqss(Y~qss(X1,lambda = lambdas1[i])+qss(X2,lambda = lambdas2[j])+qss(X3,lambda = lambdas3[n]), tau=0.05))
sf<-summary(f)
if( (sf[["qsstab"]]['X1','Pr(>F)']<0.05)&(sf[["qsstab"]]['X2','Pr(>F)']<0.05)&(sf[["qsstab"]]['X3','Pr(>F)']<0.05) ){
mylist[[i]]<-f$lambdas
}
}
}
}
How can I rearrange this code?
Is there any shortcut?
Any help will be appreciated.
Thank you in advance.

You can use base R's expand.grid to create a data.frame of all the possible combinations and then use apply(grid, MARGIN=1, ...) to loop through its rows. I also simplified the check of each p-value to all(p.vals < .05).
lambdas <- expand.grid(1:3,1:3,1:3)
check_lambdas <- function(lambdas){
f <-try(rqss(Y~qss(X1,lambda = lambdas[1])+qss(X2,lambda = lambdas[2])+qss(X3,lambda = lambdas[3]), tau=0.05), silent = TRUE)
if( inherits(f, "try-error") ) return(NULL) # skip combinations where the fit fails
if( all(summary(f)$qsstab[,'Pr(>F)']<0.05) ) f$lambdas else NULL
}
apply(lambdas, 1, check_lambdas) # MARGIN = 1 passes one row (one lambda combination) at a time
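The row-wise grid search can be tried without quantreg by swapping in a stand-in for the model fit. The sketch below is only an illustration: fake_pvalues is a hypothetical function that returns made-up p-values, and the column names lambda1 to lambda3 are assumptions, not part of the original code.

```r
# Hypothetical stand-in for fitting rqss() and extracting the p-values
fake_pvalues <- function(l1, l2, l3) c(X1 = l1 / 100, X2 = l2 / 50, X3 = l3 / 30)

# One row per combination of the three smoothing parameters
grid <- expand.grid(lambda1 = 1:3, lambda2 = 1:3, lambda3 = 1:3)

# MARGIN = 1: the function receives one row (one combination) at a time
keep <- apply(grid, 1, function(row) {
  p <- fake_pvalues(row[["lambda1"]], row[["lambda2"]], row[["lambda3"]])
  all(p < 0.05)
})

# Combinations for which every (made-up) p-value is below 0.05
best <- grid[keep, ]
print(best)
```

Keeping the result as a subset of the grid means each retained combination stays in its own row, instead of being overwritten as in mylist[[i]].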

Related

Permutations and combinations of all the columns in R

I want to check all the permutations and combinations of columns while selecting models in R. I have 8 columns in my data set and the below piece of code lets me check some of the models, but not all. Models like column 1+6, 1+2+5 will not be covered by this loop. Is there any better way to accomplish this?
best_model <- rep(0,3) #store the best model in this array
for(i in 1:8){
for(j in 1:8){
for(x in k){
diabetes_prediction <- knn(train = diabetes_training[, i:j], test = diabetes_test[, i:j], cl = diabetes_train_labels, k = x)
accuracy[x] <- 100 * sum(diabetes_test_labels == diabetes_prediction)/183
if( best_model[1] < accuracy[x] ){
best_model[1] = accuracy[x]
best_model[2] = i
best_model[3] = j
}
}
}
}
Well, this answer isn't complete, but maybe it'll get you started. You want to be able to subset by all possible subsets of columns. So instead of having i:j for some i and j, you want to be able to subset by c(1,6) or c(1,2,5), etc.
Using the sets package, you can form the power set (set of all subsets) of a set. That's the easy part. I'm new to R, so the hard part for me is understanding the difference between sets, lists, vectors, etc. I'm used to Mathematica, in which they're all the same.
library(sets)
my.set <- 1:8 # you want column indices from 1 to 8
my.power.set <- set_power(my.set) # this creates the set of all subsets of those indices
my.names <- c("a") #I don't know how to index into sets, so I created names (that are numbers, but of type characters)
for(i in 1:length(my.power.set)) {my.names[i] <- as.character(i)}
names(my.power.set) <- my.names
my.indices <- vector("list",length(my.power.set)-1)
for(i in 2:length(my.power.set)) {my.indices[i-1] <- as.vector(my.power.set[[my.names[i]]])} #this is the line I couldn't get to work
I wanted to create a list of lists called my.indices, so that my.indices[i] was a subset of {1,2,3,4,5,6,7,8} that could be used in place of where you have i:j. Then, your for loop would have to run from 1:length(my.indices).
But alas, I have been spoiled by Mathematica, and thus cannot decipher the incredibly complicated world of R data types.
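One way to get such a list of index vectors without the sets package is base R's combn. This is a minimal sketch, assuming 8 candidate columns as in the question; the names cols and subsets are made up for illustration.

```r
cols <- 1:8  # candidate column indices

# For each subset size m, combn() lists every combination of m columns;
# flattening one level gives a plain list of index vectors
subsets <- unlist(
  lapply(seq_along(cols), function(m) combn(cols, m, simplify = FALSE)),
  recursive = FALSE
)

length(subsets)  # number of non-empty subsets of 8 columns
subsets[[9]]     # the first two-column subset
```

Each element can then be used where the question has i:j, for example diabetes_training[, subsets[[s]]] inside a loop over seq_along(subsets).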
Solved it, below is the code with explanatory comments:
# find out the best model for this data
number_of_columns_to_model <- ncol(diabetes_training)-1
best_model <- c()
best_model_accuracy = 0
for(i in 1:(2^number_of_columns_to_model - 1)){
# i = 0 is skipped because it selects no columns; the parentheses make the range 1..2^n-1 explicit
# convert the value of i to binary, e.g. i=5 will give combination = 0 0 0 0 0 1 0 1
combination = as.binary(i, n=number_of_columns_to_model) # from the binaryLogic package
model <- c()
for(b in 1:length(combination)){
# choose which columns to consider depending on the combination
# (b is used instead of i so the outer loop variable is not reused)
if(combination[b])
model <- c(model, b)
}
for(x in k){
# for the columns selected by model, find the accuracy of the model for k=1:27
diabetes_prediction <- knn(train = diabetes_training[, model, with = FALSE], test = diabetes_test[, model, with = FALSE], cl = diabetes_train_labels, k = x)
accuracy[x] <- 100 * sum(diabetes_test_labels == diabetes_prediction)/length(diabetes_test_labels)
if( best_model_accuracy < accuracy[x] ){
best_model_accuracy = accuracy[x]
best_model = model
print(model)
}
}
}
I trained with Pima.tr and tested with Pima.te. KNN Accuracy for pre-processed predictors was 78% and 80% without pre-processing (and this because of the large influence of some variables).
The 80% performance is at par with a Logistic Regression model. You don't need to preprocess variables in Logistic Regression.
RandomForest, and Logistic Regression provide a hint on which variables to drop, so you don't need to go and perform all possible combinations.
Another way is to look at a scatter plot matrix.
You get a sense that there is a difference between type 0 and type 1 when it comes to npreg, glu, bmi, and age.
You also notice that ped and age are highly skewed, and that there may be an anomalous data point between skin and the other variables (you may need to remove that observation before going further).
The skin vs. type box plot shows that an extreme outlier exists for type Yes (try removing it).
You also notice that most of the boxes for type Yes are higher than for type No, so the variables may add predictive power to the model (you can confirm this with a Wilcoxon rank-sum test).
The high correlation between skin and bmi means that you can use one or the other, or an interaction of both.
Another approach to reducing the number of predictors is to use PCA

Dynamic linear regression loop for different order summation

I've been trying hard to recreate this model in R:
Model
(FARHANI 2012)
I've tried many things, such as a cumsum paste - however that would not work as I could not assign strings the correct variable as it kept thinking that L was a function.
I tried to do it manually, I'm only looking for p,q = 1,2,3,4,5 however after starting I realized how inefficient this is.
This is essentially what I am trying to do
model5 <- vector("list",20)
#p=1-5, q=0
model5[[1]] <- dynlm(DLUSGDP~L(DLUSGDP,1))
model5[[2]] <- dynlm(DLUSGDP~L(DLUSGDP,1)+L(DLUSGDP,2))
model5[[3]] <- dynlm(DLUSGDP~L(DLUSGDP,1)+L(DLUSGDP,2)+L(DLUSGDP,3))
model5[[4]] <- dynlm(DLUSGDP~L(DLUSGDP,1)+L(DLUSGDP,2)+L(DLUSGDP,3)+L(DLUSGDP,4))
model5[[5]] <- dynlm(DLUSGDP~L(DLUSGDP,1)+L(DLUSGDP,2)+L(DLUSGDP,3)+L(DLUSGDP,4)+L(DLUSGDP,5))
I'm also trying to do this for regressing DLUSGDP on DLWTI (my oil variable's name) for when p=0, q=1-5 and also p=1-5, q=1-5
cumsum would not work as it would sum the variables rather than treating them as independent regressors.
My goal is to run these models and then use IC to determine which should be analyzed further.
I hope you understand my problem and any help would be greatly appreciated.
I think this is what you are looking for:
reformulate(paste0("L(DLUSGDP,", 1:n,")"), "DLUSGDP")
where n is some order you want to try. For example,
n <- 3
reformulate(paste0("L(DLUSGDP,", 1:n,")"), "DLUSGDP")
# DLUSGDP ~ L(DLUSGDP, 1) + L(DLUSGDP, 2) + L(DLUSGDP, 3)
Then you can construct your model fitting by
model5 <- vector("list",20)
for (i in 1:20) {
form <- reformulate(paste0("L(DLUSGDP,", 1:i,")"), "DLUSGDP")
model5[[i]] <- dynlm(form)
}
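The same idea can be extended to the p and q combinations mentioned in the question. The following is a sketch under assumptions: build_formula is a hypothetical helper, the oil lags are assumed to start at lag 1, and only the formulas are built here (fitting them would still need dynlm and the time series data).

```r
# Hypothetical helper: p lags of DLUSGDP and q lags of DLWTI on the right-hand side
build_formula <- function(p, q) {
  terms <- c(
    if (p > 0) paste0("L(DLUSGDP, ", seq_len(p), ")"),
    if (q > 0) paste0("L(DLWTI, ", seq_len(q), ")")
  )
  reformulate(terms, response = "DLUSGDP")
}

# All (p, q) pairs from 0 to 5, dropping the empty model p = q = 0
orders <- expand.grid(p = 0:5, q = 0:5)
orders <- orders[orders$p + orders$q > 0, ]

formulas <- Map(build_formula, orders$p, orders$q)
print(build_formula(2, 1))
```

Each formula could then be passed to dynlm(), and the fitted models compared with an information criterion such as AIC() or BIC().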

integrating vector of one variable in r

I'm looking for a way to properly integrate my function:
lik = function(par, x){
cl = c()
for(i in 1:ncluster){
sub = c()
for(j in 1:nsub){
times = t[[i]][[j]]
m = c(1,t[[i]][j],t(cov[[i]][j,]))
repmat = cbind(1,1:t[[i]][j],matrix(rep(cov[[i]][j,], times),times, 3,byrow=T))
sub[j] = d[[i]][j]*m%*%c(par[-5],x)-sum(log((1+exp(repmat%*%c(par[-5],x)))))
}
cl[i] = sum(sub)
}
return(exp(cl))
}
function lik (which is likelihood) takes x, vector par of length 5, and yields a vector of likelihood at x at each cluster. For example,
> lik(1:5,1)
[1] 4.640101e-30 3.632315e-44 5.348611e-09 1.121790e-27 1.696704e-98
> #number of clusters=5
I want to integrate out x so that I can obtain the vector of marginalized pdf at each cluster, but function integrate or any other numerical integration packages are only capable of integrating scalar function. I've searched questions relating to this, and maybe Vectorization is the key to solving this problem, but I just do not know how.
I will really appreciate if you can give me any help. Thanks
Typically I recommend converting a function of 2 variables to a function of one variable prior to integrating as follows.
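myfunc <-function(x,y){ stuff} # stuff: the body of your two-argument function
intfunc <-function(x){myfunc(x,y)} # y is held fixed at a value defined beforehand
integrate(intfunc, lower, upper) # lower, upper: your integration limits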
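For a function that returns a vector, one option is to integrate each component separately. This is a minimal sketch, not the asker's likelihood: vec_fun is a hypothetical stand-in that returns two components, and integrate_component is a made-up helper name.

```r
# Hypothetical stand-in for lik(par, x): one value per "cluster"
vec_fun <- function(par, x) c(par[1] * dnorm(x), par[2] * dnorm(x, mean = 1))

integrate_component <- function(k, par, lower, upper) {
  # integrate() calls its integrand with a vector of x values, so evaluate
  # the function point by point and keep only component k
  f_k <- function(xs) vapply(xs, function(x) vec_fun(par, x)[k], numeric(1))
  integrate(f_k, lower, upper)$value
}

par <- c(1, 2)
marginals <- vapply(1:2, integrate_component, numeric(1),
                    par = par, lower = -Inf, upper = Inf)
print(marginals)
```

This evaluates the full vector function once per point and per component, so it is simple rather than fast; for an expensive likelihood it may be worth restructuring so that a single cluster can be computed on its own.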

How to find significant correlations in a large dataset

I'm using R.
My dataset has about 40 different variables/vectors and each has about 80 entries. I'm trying to find significant correlations, meaning I want to pick one variable and let R calculate all the correlations of that variable with the other 39 variables.
I tried to do this by using a linear model with one explanatory variable, that is: Y=a*X+b.
Then the lm() command gives me an estimate of a and the p-value of that estimate. I would then go on and use one of the other variables I have for X and try again until I find a p-value that's really small.
I'm sure this is a common problem. Is there some sort of package or function that can try all these possibilities (brute force), show them, and maybe even sort them by p-value?
You can use the function rcorr from the package Hmisc.
Using the same demo data from Richie:
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
Then:
library(Hmisc)
correlations <- rcorr(as.matrix(the_data))
To access the p-values:
correlations$P
To visualize you can use the package corrgram
library(corrgram)
corrgram(the_data)
In order to print a list of the significant correlations (p < 0.05), you can use the following.
Using the same demo data from @Richie:
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
Install Hmisc
install.packages("Hmisc")
Import library and find the correlations (@Carlos)
library(Hmisc)
correlations <- rcorr(as.matrix(the_data))
Loop over the values printing the significant correlations
for (i in 1:m){
for (j in 1:m){
if ( !is.na(correlations$P[i,j])){
if ( correlations$P[i,j] < 0.05 ) {
print(paste(rownames(correlations$P)[i], "-" , colnames(correlations$P)[j], ": ", correlations$P[i,j]))
}
}
}
}
Warning
You should not use this to draw any serious conclusions; it is only useful for exploratory analysis and for formulating hypotheses. If you run enough tests, you increase the probability of finding some significant p-values by random chance: https://www.xkcd.com/882/. There are statistical methods that are more suitable for this and that adjust for running multiple tests, e.g. https://en.wikipedia.org/wiki/Bonferroni_correction.
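An adjustment of this kind is available in base R through p.adjust. The sketch below is an illustration on smaller made-up data (10 columns instead of 40, an arbitrary seed), using cor.test instead of Hmisc so that it runs with base R only; the names p_raw, p_adj and result are assumptions.

```r
set.seed(1)  # arbitrary seed, only for reproducibility of the demo data
m <- 10
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))

# Raw p-value of the correlation between y and every other column
p_raw <- vapply(the_data[, -1],
                function(x) cor.test(the_data$y, x)$p.value,
                numeric(1))

# Bonferroni adjustment for the number of tests that were run
p_adj <- p.adjust(p_raw, method = "bonferroni")

# Table sorted by raw p-value, smallest first
result <- data.frame(variable = names(p_raw), p_raw = p_raw, p_adj = p_adj)
result <- result[order(result$p_raw), ]
print(result)
```

Other methods, such as method = "BH", are less conservative; which one is appropriate depends on the analysis.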
Here's some sample data for reproducibility.
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
You can calculate the correlation between two columns using cor. This code loops over all columns except the first one (which contains our response), and calculates the correlation between that column and the first column.
correlations <- vapply(
the_data[, -1],
function(x)
{
cor(the_data[, 1], x)
},
numeric(1)
)
You can then find the column with the largest magnitude of correlation with y using:
correlations[which.max(abs(correlations))]
So knowing which variables are correlated with which other variables can be interesting, but please don't draw any big conclusions from this knowledge. You need to have a proper think about what you are trying to understand, and which techniques you need to use. The folks over at Cross Validated can help.
If you are trying to predict y using only one variable, then you should take the one that is most strongly correlated with y.
To do this just use the command which.max(abs(cor(x,y))). If you want to use more than one variable in your model, then you have to consider something like the lasso estimator.
One option is to run a correlation matrix:
cor_result=cor(data)
write.csv(cor_result, file="cor_result.csv")
This correlates all the variables in the file against each other and outputs a matrix.

Fill matrix with loop

I am trying to create a matrix n by k with k mvn covariates using a loop.
Quite simple but not working so far... Here is my code:
n=1000
k=5
p=100
mu=0
sigma=1
x=matrix(data=NA, nrow=n, ncol=k)
for (i in 1:k){
x [[i]]= mvrnorm(n,mu,sigma)
}
What's missing?
I see several things here:
You may want to set the random seed for replicability (set.seed(20430)). This means that every time you run the code, you will get exactly the same set of pseudorandom variates.
Next, your data will just be independent variates; they won't actually have any multivariate structure (although that may be what you want). In general, if you want to generate multivariate data, you should use ?mvrnorm, from the MASS package. (For more info, see here.)
As a minor point, if you want standard normal data, you don't need to specify mu = 0 and sigma = 1, as those are the default values for rnorm().
You don't need a loop to fill a matrix in R, just generate as many values as you like and add them directly using the data= argument in the matrix() function. If you really were committed to using a loop, you should probably use a double loop, so that you are looping over the columns, and within each loop, looping over the rows. (Note that this is a very inefficient way to code in R--although I do things like that all the time ;-).
Lastly, I can't tell what p is supposed to be doing in your code.
Here is a basic way to do what you seem to be going for:
set.seed(20430)
n = 1000
k = 5
dat = rnorm(n*k)
x = matrix(data=dat, nrow=n, ncol=k)
If you really wanted to use loops you could do it like this:
mu = 0
sigma = 1
x = matrix(data=NA, nrow=n, ncol=k)
for(j in 1:k){
for(i in 1:n){
x[i,j] = rnorm(1, mu, sigma)
}
}
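If correlated multivariate normal columns are wanted, mvrnorm from the MASS package can fill the whole n by k matrix in one call. This is a sketch under assumptions: the mean vector of zeros and the exchangeable correlation of 0.6 are illustrative choices, not taken from the question.

```r
library(MASS)  # provides mvrnorm()

set.seed(20430)
n <- 1000
k <- 5

mu <- rep(0, k)              # one mean per column (assumed to be 0)
Sigma <- matrix(0.6, k, k)   # assumed exchangeable correlation of 0.6
diag(Sigma) <- 1             # unit variances on the diagonal

# One call returns an n x k matrix; no loop is needed
x <- mvrnorm(n, mu = mu, Sigma = Sigma)
dim(x)
```

Note that mu must have length k and Sigma must be a k by k matrix, which is why passing the scalars mu = 0 and sigma = 1 inside a loop over columns yields independent columns rather than a joint multivariate draw.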
Define the matrix first:
E<-matrix(data=0, nrow=10, ncol=10);
Then run two loops, iterating i over rows and j over columns. Mine is an exchangeable correlation structure:
for (i in 1:10)
{
for (j in 1:10)
{
if (i==j) {E[i,j]=1}
else {E[i,j]=0.6}
}
};
A=c(2,3,4,5);# In your case row terms
B=c(3,4,5,6);# In your case column terms
x=matrix(,nrow = length(A), ncol = length(B));
for (i in 1:length(A)){
for (j in 1:length(B)){
x[i,j]<-(A[i]*B[j])# do the similarity function, simi(A[i],B[j])
}
}
x # matrix is filled
I was thinking in my problem perspective.
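When each cell depends only on the row term and the column term, base R's outer can fill the matrix without explicit loops. A minimal sketch using the same A and B; the names x_loop and x_outer are made up for the comparison.

```r
A <- c(2, 3, 4, 5)  # row terms
B <- c(3, 4, 5, 6)  # column terms

# Loop version, as in the answer above
x_loop <- matrix(NA_real_, nrow = length(A), ncol = length(B))
for (i in seq_along(A)) {
  for (j in seq_along(B)) {
    x_loop[i, j] <- A[i] * B[j]
  }
}

# Vectorized version: outer() applies the function to every (A[i], B[j]) pair
x_outer <- outer(A, B, FUN = function(a, b) a * b)

all.equal(x_loop, x_outer)
```

The function passed as FUN must be vectorized, because outer calls it once with all pairs rather than once per cell.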
