Optimizing weights for average across 3 variables - r

I am trying to create an ensemble of machine learning models, and I have code that makes predictions based on occupation (Ocp), age (Age), and gender (Gender).
I want to average the three predictions for a final prediction, but I am not sure how I should optimize the weights to minimize the RMSE.
I know that Gender should dominate the data set.
Here is my attempt at the code:
temp <- NA; temp2 <- NA; temp3 <- NA
for (i in seq_len(11)) {
  for (j in seq_len(11)) {
    temp2 = ((i-1)/10)*(((j-1)/10)*movie_pred2[,1] + ((11-j)/10)*movie_pred2[,2]) +
      ((11-i)/10)*movie_pred[,3]
    temp2[temp2 > 5] = 5
    temp2[temp2 < 1] = 1
    temp[j] <- (sum((temp2 - tsind2[,2])^2)/length(tsind2[,2]))^.5
  }
  temp3[i,j] = temp[j]
}
I now get the following error and warning:
Error in temp3[i, j] = temp[j] : incorrect number of subscripts on matrix
In ((i - 1)/10) * (((j - 1)/10) * movie_pred2[, 1] + ((11 - ... :
longer object length is not a multiple of shorter object length

Your code begins:
> temp3<- NA
.. then some other stuff and ends
> temp3[i,j] = temp[j]
but it doesn't matter what dimensions or size your result temp is: you can't push dimensioned data into a null-dimension object.
> dim(temp3)
NULL
You probably want something like:
> temp3 <- matrix(NA, i, j)
> temp3[, j] <- something
Now, firstly, I'm sorry I can't be any more helpful/specific, but it's near impossible to interpret the rest of your code without an example of the input data. Secondly, unless this is a homework assignment or self-learning, I recommend you investigate the many R packages that will calculate the RMSE and/or do ensemble learning for you, e.g. the train function of caret.
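To make that concrete, here is a minimal sketch of the grid search with a pre-allocated result matrix. It assumes your movie_pred2 and tsind2 objects; note that your original code mixed movie_pred and movie_pred2, which would also explain the object-length warning.
rmse <- matrix(NA, 11, 11) # pre-allocate so rmse[i, j] is a valid assignment
for (i in seq_len(11)) {
  for (j in seq_len(11)) {
    w1 <- (i - 1)/10 # weight on the first pair of predictions vs. the third
    w2 <- (j - 1)/10 # split between the two predictions inside the pair
    pred <- w1*(w2*movie_pred2[,1] + (1 - w2)*movie_pred2[,2]) + (1 - w1)*movie_pred2[,3]
    pred <- pmin(pmax(pred, 1), 5) # clip predictions to the 1..5 rating scale
    rmse[i, j] <- sqrt(mean((pred - tsind2[,2])^2))
  }
}
which(rmse == min(rmse), arr.ind = TRUE) # grid cell with the lowest RMSE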

Related

ChoiceModelR, none=TRUE, "Error in betadraw[good, ] = newbeta[good, ]"

I am using ChoiceModelR for a hierarchical multinomial logit. I want to get estimates for the utility of the outside good (which follows a normal distribution). The outside good has no covariates like the inside goods (e.g. it cannot have a price or brand dummy), so I set list(none=TRUE) and do not add this no-choice to the X data (as described in the documentation of ChoiceModelR), but only to the y (choice) data.
The iterations start normally, then at some point it stops and says
"Error in betadraw[good, ] = newbeta[good, ] : NAs are not allowed in subscripted assignments".
This likely happens because in row 388 of the function "choicemodelr", the "good" subscript is NA.
I looked at some questions about choicemodelr (this, this, this), and also about NAs in subscripts (this, this), but my guess is that my problem is specific to this function, in the sense that probably some inputs in the iteration get so large/small that "good" turns out to be NA.
Below is a very simple example. I generate data with 3 products with varying attributes. In half of the periods, product 3 is not offered. The 2000 consumers have preferences, distributed normally, over 3 attributes (and a preference for the outside good). A logit error is added to be consistent with the model. The outside good is indexed as product 4 (both when 3 and when 2 products were in the choice set).
How could I avoid the NA error? Am I doing something wrong, or is it a general bug in the function?
I also searched online for examples setting the option none=TRUE, but I did not find any reproducible one. Perhaps this option alone is the problematic thing, as there is no problem recovering the true parameters if I set none=FALSE and do not let customers choose the outside option.
So the code which results in the NA bug is the following:
library("ChoiceModelR")
library("MASS")
set.seed(36)
# Set demand pars
beta_mu = c(-3,4,1)
beta_sigma = diag(c(1,1,1))
alfa_mu = 5 #outside good mean utility
alfa_sigma = 2 #outside good sd
# Three/two products, 3 vars (2 continuous,1 dummy)
threeprod <- list()
twoprod <- list()
purchase <- list()
for (t in 1:1000){
threeprod[[t]] = cbind(rep(t,3),c(1,1,1),c(1,2,3),runif(3),runif(3),ceiling(runif(3,-0.5,0.5)))
purchase[[t]] = which.max(rbind(threeprod[[t]][,c(4,5,6)]%*%mvrnorm(1,beta_mu,beta_sigma) +
matrix( -log(-log(runif(3))), 3, 1),rnorm(1,alfa_mu,alfa_sigma)) )
threeprod[[t]] = cbind(threeprod[[t]],c(purchase[[t]],0,0))
}
for (t in 1001:2000){
twoprod[[t]] = cbind(rep(t,2),c(1,1),c(1,2),runif(2),runif(2),ceiling(runif(2,-0.5,0.5)))
purchase[[t]] = which.max(rbind(twoprod[[t]][,c(4,5,6)]%*%mvrnorm(1,beta_mu,beta_sigma) +
matrix( -log(-log(runif(2))), 2, 1),rnorm(1,alfa_mu,alfa_sigma)) )
if (purchase[[t]] == 3) {purchase[[t]] <- 4}
twoprod[[t]] = cbind(twoprod[[t]],c(purchase[[t]],0))
}
X <- rbind(do.call(rbind,threeprod),do.call(rbind,twoprod))
xcoding <- c(1,1,1)
mcmc = list(R = 5000, use = 2000)
options = list(none=TRUE, save=TRUE, keep=5)
out = choicemodelr(X, xcoding, mcmc = mcmc,options = options)
You have to sort the data by ID, Set, and Alt; that solved the error (the same one you got). The rows have to be sorted by respondent ID, then by the set number (question), then by the alternatives within a given question.
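For the example above, a minimal sketch of that fix, assuming (as in the ChoiceModelR data layout) that the first three columns of X are the ID, set, and alternative columns:
# Sort rows by respondent ID, then set, then alternative before estimating
X <- X[order(X[,1], X[,2], X[,3]), ]
out <- choicemodelr(X, xcoding, mcmc = mcmc, options = options)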

Permutations and combinations of all the columns in R

I want to check all the permutations and combinations of columns while selecting models in R. I have 8 columns in my data set, and the below piece of code lets me check some of the models, but not all. Models like columns 1+6 or 1+2+5 will not be covered by this loop. Is there any better way to accomplish this?
best_model <- rep(0,3) # store the best model in this array
for (i in 1:8) {
  for (j in 1:8) {
    for (x in k) {
      diabetes_prediction <- knn(train = diabetes_training[, i:j], test = diabetes_test[, i:j], cl = diabetes_train_labels, k = x)
      accuracy[x] <- 100 * sum(diabetes_test_labels == diabetes_prediction)/183
      if (best_model[1] < accuracy[x]) {
        best_model[1] = accuracy[x]
        best_model[2] = i
        best_model[3] = j
      }
    }
  }
}
Well, this answer isn't complete, but maybe it'll get you started. You want to be able to subset by all possible subsets of columns. So instead of having i:j for some i and j, you want to be able to subset by c(1,6) or c(1,2,5), etc.
Using the sets package, you can form the power set (the set of all subsets) of a set. That's the easy part. I'm new to R, so the hard part for me is understanding the difference between sets, lists, vectors, etc. I'm used to Mathematica, in which they're all the same.
library(sets)
my.set <- 1:8 # you want column indices from 1 to 8
my.power.set <- set_power(my.set) # this creates the set of all subsets of those indices
my.names <- c("a") # I don't know how to index into sets, so I created names (that are numbers, but of type character)
for (i in 1:length(my.power.set)) {my.names[i] <- as.character(i)}
names(my.power.set) <- my.names
my.indices <- vector("list", length(my.power.set) - 1)
for (i in 2:length(my.power.set)) {my.indices[i-1] <- as.vector(my.power.set[[my.names[i]]])} # this is the line I couldn't get to work
I wanted to create a list of lists called my.indices, so that my.indices[i] was a subset of {1,2,3,4,5,6,7,8} that could be used in place of where you have i:j. Then, your for loop would have to run from 1:length(my.indices).
But alas, I have been spoiled by Mathematica, and thus cannot decipher the incredibly complicated world of R data types.
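For what it's worth, base R can build that list without the sets package. This is a sketch of my own using combn, not part of the original answer:
# Build a list of every non-empty subset of the column indices 1..8
my.indices <- unlist(
  lapply(1:8, function(m) combn(8, m, simplify = FALSE)),
  recursive = FALSE
)
length(my.indices) # 255 = 2^8 - 1 non-empty subsets
# my.indices[[s]] can now replace i:j in the original knn loop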
Solved it, below is the code with explanatory comments:
# find out the best model for this data
number_of_columns_to_model <- ncol(diabetes_training) - 1
best_model <- c()
best_model_accuracy = 0
for (i in 1:(2^number_of_columns_to_model - 1)) {
  # i = 0 (the empty model) is skipped; every other i encodes one subset of columns
  # convert the value of i to binary, e.g. i=5 gives combination = 0 0 0 0 0 1 0 1
  combination = as.binary(i, n = number_of_columns_to_model) # from the binaryLogic package
  model <- c()
  for (bit in 1:length(combination)) {
    # choose which columns to consider depending on the combination
    if (combination[bit])
      model <- c(model, bit)
  }
  for (x in k) {
    # for the columns decided by model, find out the accuracy of the model for k = 1:27
    diabetes_prediction <- knn(train = diabetes_training[, model, with = FALSE], test = diabetes_test[, model, with = FALSE], cl = diabetes_train_labels, k = x)
    accuracy[x] <- 100 * sum(diabetes_test_labels == diabetes_prediction)/length(diabetes_test_labels)
    if (best_model_accuracy < accuracy[x]) {
      best_model_accuracy = accuracy[x]
      best_model = model
      print(model)
    }
  }
}
I trained with Pima.tr and tested with Pima.te. KNN accuracy was 78% for pre-processed predictors and 80% without pre-processing (the difference is due to the large influence of some variables).
The 80% performance is on par with a Logistic Regression model. You don't need to preprocess variables for Logistic Regression.
Random Forest and Logistic Regression provide a hint about which variables to drop, so you don't need to try all possible combinations.
Another way is to look at a matrix scatter plot.
You get a sense that there is a difference between type 0 and type 1 when it comes to npreg, glu, bmi, and age.
You also notice the highly skewed ped and age, and that there may be an anomalous data point between skin and the other variables (you may need to remove that observation before going further).
A skin vs. type box plot shows that for type Yes an extreme outlier exists (try removing it).
You also notice that most of the boxes for the Yes type are higher than for the No type, so these variables may add predictive power to the model (you can confirm this with a Wilcoxon rank-sum test).
The high correlation between skin and bmi means that you can use one or the other, or an interaction of both.
Another approach to reducing the number of predictors is to use PCA.
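For reference, a minimal sketch of the plots described above, using the Pima.tr data from MASS; the colors and exact calls are my own choices:
library(MASS) # provides Pima.tr and Pima.te
# Matrix scatter plot of the seven predictors, colored by diabetes status
pairs(Pima.tr[, 1:7], col = ifelse(Pima.tr$type == "Yes", "red", "blue"))
# Skin vs. type box plot, which shows the extreme outlier for type Yes
boxplot(skin ~ type, data = Pima.tr, ylab = "skin")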

Performing an Interval Regression in R

I am trying to run an interval regression where my dependent variable, y, is made up of 14 intervals representing incomes. I have 5000 observations and six independent variables that I am trying to use to predict y.
I am trying to follow the steps performed here:
http://www.karlin.mff.cuni.cz/~pesta/NMFM404/interval.html#References
I actually have y at its exact values, but I am trying to learn how to do an interval regression from this. So, first I convert y into intervals:
Income[Income < 10000] <- 1
Income[Income > 10000 & Income < 20001] <- 2
Income[Income > 20000 & Income < 30001] <- 3
...
Income[Income > 300000] <- 14
Okay, fine. From the above link, I should actually convert it to correspond to each lower bound and each upper bound of the interval. I have to imagine that isn't the only way, but for now I am following those directions.
lIncome <- rep(0,5000) # lower income bound
uIncome <- rep(0,5000) # upper income bound
for (i in 1:5000) {
  if (Income[i] == 1) {
    lIncome[i] = 0
    uIncome[i] = 10000
  }
  if (Income[i] == 2) {
    lIncome[i] = 10001
    uIncome[i] = 20000
  }
  ...
  if (Income[i] == 14) {
    lIncome[i] = 300001
    uIncome[i] = Inf
  }
}
So now I have columns lIncome and uIncome which correspond to the levels of income; I am fine up to this part. Perhaps it is problematic that my last interval goes to infinity, but even if I just cap it at 500000 I still get errors.
The instructions next say to incorporate the Surv() function.
So, I perform:
TEST <- Surv(lIncome, uIncome, event = rep(3,5000))
However, my errors start now. I get:
Warning message:
In Surv(lIncome, uIncome, event = rep(3, 5000)) :
Invalid status value, converted to NA
If I try
TEST <- Surv(lIncome, uIncome, event = rep(2,5000))
it works, but then:
m <- survreg(TEST ~ Age + AgeSq + ... , dist="gaussian")
gives:
Error in survreg(TEST ~ Age + AgeSq + NoDegree, dist = "gaussian") :
Invalid survival type
First of all, I am not sure why changing the 3 to a 2 makes it work. Even if I change the Inf value to 500000 (or any appropriate number), having the event equal to 2 (or any number) does not resolve the issue.
Second, even when I can get past that part, the fact that survreg is failing leaves me puzzled.
Right now, my approach is to play around with my intervals to see if I can get it to work somehow, and then go from there. I am also looking closer at all the documentation for ?Surv and ?survreg.
Any help is very appreciated though, thank you.
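No answer is recorded for this question, but here is a sketch of what I believe is going on, based on the survival package documentation: with two time arguments and no type, Surv() defaults to counting-process data, for which a status of 3 is invalid (hence the warning) and which survreg() cannot fit (hence "Invalid survival type"). For interval censoring the type has to be stated explicitly; Age and AgeSq below are the asker's covariates.
library(survival)
# Status code 3 means interval-censored only when type = "interval" is given
TEST <- Surv(lIncome, uIncome, event = rep(3, 5000), type = "interval")
# Alternatively, "interval2" drops the event vector entirely; an NA bound
# marks that side as open, so use NA rather than Inf for the top bracket
uIncome[is.infinite(uIncome)] <- NA
TEST2 <- Surv(lIncome, uIncome, type = "interval2")
m <- survreg(TEST2 ~ Age + AgeSq, dist = "gaussian")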

R: multicollinearity issues using glib(), Bayesian Model Averaging (BMA-package)

I am experiencing difficulties estimating a BMA model via glib() due to multicollinearity issues, even though I have clearly specified which columns to use. Please find the details below.
The data I'll be using for the estimation via Bayesian Model Averaging:
Cij <- c(357848,766940,610542,482940,527326,574398,146342,139950,227229,67948,
352118,884021,933894,1183289,445745,320996,527804,266172,425046,
290507,1001799,926219,1016654,750816,146923,495992,280405,
310608,1108250,776189,1562400,272482,352053,206286,
443160,693190,991983,769488,504851,470639,
396132,937085,847498,805037,705960,
440832,847631,1131398,1063269,
359480,1061648,1443370,
376686,986608,
344014)
n <- length(Cij);
TT <- trunc(sqrt(2*n))
i <- rep(1:TT,TT:1); #row numbers: year of origin
j <- sequence(TT:1) #col numbers: year of development
k <- i+j-1 #diagonal numbers: year of payment
#Since k=i+j-1, we have to leave out another dummy in order to avoid multicollinearity
k <- ifelse(k == 2, 1, k)
I want to evaluate the effect of i and j both via levels and via factors, but of course not in the same model. Since I can include i and j each as a factor, as a level, or not at all, and k either as a level or not at all, there are a total of 18 (3x3x2) models. This brings us to the following data frame:
X <- data.frame(Cij,i.factor=as.factor(i),j.factor=as.factor(j),k,i,j)
X <- model.matrix(Cij ~ -1 + i.factor + j.factor + k + i + j,X)
X <- as.data.frame(X[,-1])
Next, via the following declaration I specify which variables to consider in each of the 18 models. As far as I can see, no linear dependence exists in these specifications.
model.set <- rbind(
c(rep(0,9),rep(0,9),0,0,0),
c(rep(0,9),rep(0,9),0,1,0),
c(rep(0,9),rep(0,9),0,0,1),
c(rep(0,9),rep(0,9),1,0,0),
c(rep(1,9),rep(0,9),0,0,0),
c(rep(0,9),rep(1,9),0,0,0),
c(rep(0,9),rep(0,9),0,1,1),
c(rep(0,9),rep(0,9),1,1,0),
c(rep(0,9),rep(1,9),0,1,0),
c(rep(0,9),rep(0,9),1,0,1),
c(rep(1,9),rep(0,9),0,0,1),
c(rep(1,9),rep(0,9),1,0,0),
c(rep(0,9),rep(1,9),1,0,0),
c(rep(1,9),rep(1,9),0,0,0),
c(rep(0,9),rep(0,9),1,1,1),
c(rep(0,9),rep(1,9),1,1,0),
c(rep(1,9),rep(0,9),1,0,1),
c(rep(1,9),rep(1,9),1,0,0))
Then I call the glib() function, telling it to select the specified columns from X according to model.set.
library(BMA)
model.glib <- glib(X,Cij,error="poisson", link="log",models=model.set)
which results in the error
Error in glim(x, y, n, error = error, link = link, scale = scale) : X matrix is not full rank
The function first checks whether the matrix is of full column rank before it evaluates which columns to select from X via model.set. How do I circumvent this, or is there any other way to include all 18 models in the glib() function?
Thank you in advance.
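No answer is recorded for this question either, but one workaround consistent with the error message, sketched under my own assumptions rather than taken from the BMA documentation, is to fit each row of model.set separately, so that the matrix glib() hands to glim() only ever contains the columns that model actually uses:
# Fit the 18 models one at a time on their own column subsets
fits <- vector("list", nrow(model.set))
for (m in seq_len(nrow(model.set))) {
  cols <- which(model.set[m, ] == 1)
  if (length(cols) == 0) next # the null model has no columns to pass
  fits[[m]] <- glib(X[, cols, drop = FALSE], Cij, error = "poisson",
                    link = "log", models = matrix(1, 1, length(cols)))
}
The posterior model quantities would then have to be combined across the 18 fits by hand, so this trades convenience for sidestepping the rank check.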

Create list from for loop if value in sequence is greater than .5

Trying to create a list of significant variables for a regression model. I have 82 variables, so in order to include only the significant ones, I created a list of the correlations and sorted them. I want to include the variables with correlations > 0.5 or < -0.5. I created a for loop, but am having difficulty getting a list.
composite is a data frame
sortedcor <- sort(cor(composite)[,1])
regvar = 0
for (i in sortedcor) {
  if (sortedcor[i] < 0.5) {
    regvar[] <- i
  }
}
I am getting this error:
Error in if (sortedcor[i] < 0.5) { : argument is of length zero
In addition: Warning messages:
1: In if (sortedcor[i] < 0.5) { :
the condition has length > 1 and only the first element will be used
Here is some code doing what you want. You will need to change sortcor in my code to match your sortedcor.
# Create some example correlation values
correlations = runif(100, -1, 1)
# Don't need to sort, but I'll do it anyway to follow your logic
sortcor = sort(correlations)
# Create an empty vector to hold the parameters of interest
regvar = NULL
k = 1
for (i in 1:length(sortcor)) {
  if (sortcor[i] > .5 | sortcor[i] < -.5) {
    regvar[k] = i
    k = k + 1
  }
}
So now regvar is a vector of indices corresponding to the parameters of the regression model that have correlation greater than 0.5 or less than -0.5. Note that I sorted the correlations like you did, but really you don't need to; if you do sort, the indices refer to the sorted order rather than to the original columns of your model.
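As a follow-up of my own, not part of the original answer: the whole loop collapses to one vectorized line in idiomatic R:
# Indices of correlations with absolute value above 0.5
regvar <- which(abs(sortcor) > 0.5)
names(regvar) # the variable names, if the correlation vector is named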
