R programming - not deleting the right column

I am pasting my code here.
I am following an online course in R and I was trying to automate a multiple-variable regression. I have tried to check what is going on: at the beginning it works, but when it comes to the last two variables it enters a loop and does not eliminate them, even though it enters the if.
At the end, I get this error:
Error in if (maxVar > sl) { : missing value where TRUE/FALSE needed
Here is the code
backwardElimination <- function(training, sl) {
  numVar = length(training)
  funzRegressor = lm(formula = profit ~ .,
                     data = training)
  p = summary(funzRegressor)$coefficients[, 4]
  maxVar = max(p)
  if (maxVar > sl) {
    for (j in c(1:numVar)) {
      if (maxVar == p[j]) {
        training = training[, -j]
        backwardElimination(training, sl)
      }
    }
  }
  return(summary(funzRegressor))
}
Thanks in advance
Edit: this is the rest of my code
# importing dataset
dataset = read.csv('50_Startups.csv')

# encoding categorical data
dataset$State = factor(dataset$State,
                       levels = c('New York', 'California', 'Florida'),
                       labels = c(1, 2, 3))

# splitting into train / test set
library(caTools)
set.seed(123)
split = sample.split(dataset$Profit, SplitRatio = 4/5)
trainingSet = subset(dataset, split == TRUE)
testSet = subset(dataset, split == FALSE)

# transforming State into dummy variables
trainingSet$State = factor(trainingSet$State)
dummies = model.matrix(~ trainingSet$State)
trainingSet = cbind(trainingSet, dummies)
profit = trainingSet$Profit
trainingSet = trainingSet[, -4]
trainingSet = trainingSet[, -4]
trainingSet = cbind(trainingSet, profit)

# calling the function
SL = 0.05
backwardElimination(trainingSet, SL)

This error indicates that you have an NA instead of a boolean value in your if statement.
if (NA) {}
## Error in if (NA) { : missing value where TRUE/FALSE needed
Either your p contains NA, or sl is NA.
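A minimal sketch of a guard against that (not a full fix of the elimination logic):
maxVar = max(p, na.rm = TRUE)            # ignore NA p-values when taking the maximum
if (!is.na(maxVar) && maxVar > sl) {
  # ... elimination step ...
}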

The intercept is also fed back into the next modelling step; you need to get rid of it before moving to the next iteration.
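One way to do that (a sketch, assuming you also want p[j] to line up with the predictor columns of training) is to drop the intercept row from the p-value vector:
# drop the "(Intercept)" row so the remaining p-values correspond to the predictors
p = summary(funzRegressor)$coefficients[-1, 4]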

I can replicate your error with the built-in R dataset state.x77:
dataset <- as.data.frame(state.x77)
dataset$State <- rownames(dataset)
dataset$profit <- rnorm(nrow(dataset))

backwardElimination <- function(training, sl) {
  if (!"profit" %in% names(training)) return(NULL)
  numVar = length(training)
  funzRegressor = lm(formula = profit ~ .,
                     data = training)
  p = summary(funzRegressor)$coefficients[, 4]
  maxVar = max(p)
  # print(funzRegressor)
  if (maxVar > sl) {
    for (j in c(1:numVar)) {
      if (maxVar == p[j]) {
        training = training[, -j]
        backwardElimination(training, sl)
      }
    }
  }
  return(summary(funzRegressor))
}

backwardElimination(dataset, 0.05)
There are NAs in some of your betas and all the p-values become NaN. Do you need to regress within states? Otherwise you can remove the State column to get rid of the error (see the sketch below).
There will be another error when you reach the boundary case in your recursion, which you can fix :)
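A sketch of the second suggestion, dropping the State column from the replication data before calling the function:
dataset$State <- NULL               # remove the character column that produces the NA betas
backwardElimination(dataset, 0.05)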

Related

DepmixS4 fix state order - R

I am using the package depmixS4 to fit a HMM on time-series data. Here is an example with some high vol and low vol data.
In the getpars function we can see the parameter value estimates.
What is happening is that sometimes the first two values in the density are the low-vol state and sometimes the second two values are the low-vol state. Is there any way to fix this (maybe by setting initial priors)?
set.seed(1)
a <- data.frame(v1 = c(rnorm(n = 100, sd = 10), rnorm(n = 100, sd = 1)))
a <- sample(a)
my_model <- depmixS4::depmix(response = v1 ~ 1, nstates = 2, data = a)
fitted_model <- depmixS4::fit(my_model)
getpars(fitted_model)

for (i in 100:200) {
  my_model2 <- depmixS4::depmix(response = v1 ~ 1, nstates = 2,
                                data = a[1:i, , drop = FALSE])
  fitted_model2 <- depmixS4::fit(my_model2)
  pars <- getpars(fitted_model2)
  if (pars[8] > 8) {
    print(i)
  }
}
This is called label switching.
Models in which you swap the label of states (e.g., relabel state 1 as state 2 and state 2 as state 1) have the same likelihood and hence are both valid maximum likelihood solutions.
You can try to "fix" this issue by:
setting initial values for the parameters (which makes it more likely that the EM algorithm will converge to a particular solution, although this is not guaranteed!), as sketched after this list;
or by setting order constraints (e.g. forcing the mean for state 1 to be larger than the mean for state 2). Such constraints can be supplied to the fit method in depmixS4 (see examples in ?fit);
a final option is to switch the labels of a fitted depmixS4 object.
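A minimal sketch of the first option, using the question's data: depmix() accepts respstart, trstart and instart arguments for starting values. The particular values below, and the per-state (mean, sd) ordering of respstart, are assumptions worth checking against ?depmix.
library(depmixS4)

# starting values nudging state 1 towards high volatility and state 2 towards
# low volatility (values and parameter ordering are assumptions, see ?depmix)
my_model <- depmix(response = v1 ~ 1, nstates = 2, data = a,
                   respstart = c(0, 10,    # state 1: mean, sd
                                 0, 1),    # state 2: mean, sd
                   trstart = c(0.9, 0.1,
                               0.1, 0.9),
                   instart = c(0.5, 0.5))
fitted_model <- fit(my_model)
getpars(fitted_model)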
Here is a function to relabel a fitted depmix object I have used before (not tested well though!):
label_switch <- function(mod, labels) {
  # labels is a vector: the first element is the new integer label for original state 1,
  # the second element is the new integer label for original state 2, etc.
  if (!is(mod, "depmix") || !is(mod, "depmix.fitted")) stop("this function is for depmix models")
  n_states <- mod@nstates
  if (length(labels) != n_states || length(unique(labels)) != n_states || !all(labels %in% 1:n_states)) {
    stop("labels needs to be a vector of unique integers between 1 and ", n_states)
  }
  inv_labels <- sapply(1:n_states, function(x) which(labels == x))
  tmp <- mod
  # relabel prior
  ppars <- getpars(mod@prior)
  fpars <- getpars(mod@prior, which = "fixed")
  out_pars <- as.numeric(t(matrix(ppars, nrow = length(ppars)/n_states, byrow = TRUE)[, inv_labels]))
  out_fixed <- as.logical(t(matrix(fpars, nrow = length(fpars)/n_states, byrow = TRUE)[, inv_labels]))
  if (!tmp@prior@family$link == "identity") tmp@prior@family$base <- labels[tmp@prior@family$base]
  # relabel transition
  for (i in 1:n_states) {
    ppars <- getpars(mod@transition[[inv_labels[i]]])
    fpars <- getpars(mod@transition[[inv_labels[i]]], which = "fixed")
    out_pars <- c(out_pars, as.numeric(t(matrix(ppars, nrow = length(ppars)/n_states, byrow = TRUE)[, inv_labels])))
    out_fixed <- c(out_fixed, as.logical(t(matrix(fpars, nrow = length(fpars)/n_states, byrow = TRUE)[, inv_labels])))
    tmp@transition[[i]] <- mod@transition[[inv_labels[i]]]
    if (!tmp@transition[[i]]@family$link == "identity") tmp@transition[[i]]@family$base <- labels[tmp@transition[[i]]@family$base]
    # out_pars <- c(out_pars, getpars(mod@transition[[inv_labels[i]]]))
  }
  # relabel response
  for (i in 1:n_states) {
    out_pars <- c(out_pars, unlist(lapply(mod@response[[inv_labels[i]]], getpars)))
    out_fixed <- c(out_fixed, unlist(lapply(mod@response[[inv_labels[i]]], getpars, which = "fixed")))
  }
  tmp <- setpars(tmp, out_fixed, which = "fixed")
  tmp <- setpars(tmp, out_pars)
  if (is(tmp, "depmix.fitted")) tmp@posterior <- viterbi(tmp)
  return(tmp)
}
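Hypothetical usage on the fitted model from the question, swapping the two state labels (assuming the function above has been sourced):
relabelled <- label_switch(fitted_model, labels = c(2, 1))  # old state 1 -> 2, old state 2 -> 1
getpars(relabelled)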

Error in bayesm rhierNegbinRw function:

I am attempting to fit a hierarchical negative binomial model with bayesm. Though my data is proprietary, I was able to recreate the same error with the margarine dataset. The error I get is as follows:
> look <- rhierNegbinRw(Data = list(regdata = dat1), Mcmc = list(R = 1000,
nprint = 100))
Z not specified - using a column of ones instead
Error in alpha <= 0 :
comparison (4) is possible only for atomic and list types
I set up the mock data as follows (the regression is completely nonsensical -- just trying to get the thing to work):
data(margarine)
chpr <- margarine$choicePrice
chpr$hhid <- as.factor(chpr$hhid)
N <- nlevels(chpr$hhid)
dat1 <- vector(mode = "list", length = N)
for (i in 1:N) {
  dat1[[i]]$y <- chpr[chpr$hhid == levels(chpr$hhid)[i], "PPk_Stk"]
  dat1[[i]]$X <- model.matrix(~ choice + PBB_Stk,
                              data = chpr[chpr$hhid == levels(chpr$hhid)[i], ])
}
I would greatly appreciate any insight into this issue.

Applying a function to a list and outputting results

I've got a big database which I've split up by year, and I've created a train and test set for each respective year.
# split the dataset into a list of datasets, one per year
Y <- split(dat_all, dat_all$year)

# create a train and test dataset for all years
# takes Y as input
create_sets <- function(x) {
  train_set <- sample(2, nrow(x), replace = TRUE, prob = c(0.7, 0.3))
  train <- x[train_set == 1, ]
  test <- x[train_set == 2, ]
  assign('x', list(train = train, test = test))
}
Ylist <- lapply(Y, create_sets)
To call each item out you use Ylist$'2016'$train
I've made an accuracy ratio function which I can run each list item through individually, but I am looking for a way to do it all in one go to save massive amounts of code (there's 16 years of data).
Below is how I currently create an accuracy ratio for one year
val_train<-Ylist$'2016'$train
val_train$pred<-predict(modf,newdata=Ylist$'2016'$train)
val_train$probs<-exp(val_train$pred)/(1+exp(val_train$pred))
x<-data.frame(rcorr.cens(val_train$probs, val_train$default_flag))
train_AR<-x[2,1]
train_AR
modfull <-ModFit(test)
val_test<-test
val_test$pred<-predict(modf,newdata=test)
val_test$probs<-exp(val_test$pred)/(1+exp(val_test$pred))
x<-data.frame(rcorr.cens(val_test$probs, val_test$default_flag))
test_AR<-x[2,1]
test_AR
AR_Logistic1<-c(train_AR,test_AR,)
AR_Logistic2<-c(train_AR,test_AR) #just in to see if table works
AccuracyRatio<-rbind(AR_Logistic1,AR_Logistic2)
colnames(AccuracyRatio)<-c("Train","Test","All")
AccuracyRatio
Just to clarify, I'm trying to run my whole list through my accuracy ratio function and then output the AR for each year's train and test set.
Any help is greatly appreciated
With lapply and wrapping the AR calculations in a function you can summarise the output as below.
Without sample data, I could not test it but let us know if you face any errors.
fn_Calc_AR <- function(yearDat = listInput) {
  # yearDat <== e.g. Ylist$'2016'
  trainDat <- yearDat$train
  testDat <- yearDat$test

  val_train <- trainDat
  val_train$pred <- predict(modf, newdata = trainDat)
  val_train$probs <- exp(val_train$pred)/(1 + exp(val_train$pred))
  x <- data.frame(rcorr.cens(val_train$probs, val_train$default_flag))
  train_AR <- x[2, 1]
  # train_AR

  modfull <- ModFit(testDat)
  val_test <- testDat
  val_test$pred <- predict(modf, newdata = testDat)
  val_test$probs <- exp(val_test$pred)/(1 + exp(val_test$pred))
  x <- data.frame(rcorr.cens(val_test$probs, val_test$default_flag))
  test_AR <- x[2, 1]
  # test_AR

  AR_Logistic1 <- c(train_AR, test_AR)  # removed extraneous comma, previous input was c(train_AR, test_AR,)
  AR_Logistic2 <- c(train_AR, test_AR)  # just in to see if table works
  AccuracyRatio <- rbind(AR_Logistic1, AR_Logistic2)
  colnames(AccuracyRatio) <- c("Train", "Test")  # only two columns now that the extra comma is gone
  # confirm yearName is being created
  try(yearName <- head(names(x), 1))  # retain only year
  if (length(yearName) > 0L) {
    AR_DF <- data.frame(yearName = yearName, AccuracyRatio, stringsAsFactors = FALSE)
  } else {
    AR_DF <- AccuracyRatio
  }
  return(AR_DF)
}
Summarise Output:
AR_Summary = do.call(rbind,lapply(Ylist,fn_Calc_AR))
Aggregate Dataset:
aggregateTrain = do.call(rbind,lapply(Ylist,function(x) x$train))
aggregateTest = do.call(rbind,lapply(Ylist,function(x) x$test))
aggregateList = list(train = aggregateTrain,test = aggregateTest)
AR_AggregateSummary = do.call(rbind,lapply(aggregateList,function(x) fn_Calc_AR(x) ))

problems with cross validation code - r

I'm writing a function to perform logistic regression on two columns of a dataframe. I can't get around the errors... I am trying to use 10-fold cross validation. Here's the code I'm using:
SAdata = read.table("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data",
                    sep = ",", head = T, row.names = 1)

log.fun = function(x, y) {
  prediction = data.frame()
  tset = data.frame()
  dframe = cbind(x, y)
  dframe = as.data.frame(dframe)
  dframe$fold = sample(1:10, nrow(data), replace = TRUE)
  list = 1:10
  for (i in 1:10) {
    train = subset(dframe, fold %in% list[-i])
    test = subset(dframe, fold %in% c(i))
    model = glm(x ~ y, data = train, family = binomial)
    pred = as.data.frame(predict(model, test[, -1]))
    prediction <- rbind(prediction, pred)
  }
}
log.fun(SAdata$chd, SAdata$obesity)
log.fun(SAdata$chd,SAdata$obesity)
The error I get is "Error in sample.int(length(x), size, replace, prob) :
invalid 'size' argument"
Any ideas?
This is a somewhat sub-optimal use of for loops, and especially of modelling... if you want to try developing some good models, try the 'caret' package.
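For example, a minimal sketch of 10-fold cross-validated logistic regression with caret (converting chd to a factor here is an assumption about how you want the outcome treated):
library(caret)

# treat chd as a class label so method = "glm" fits a binomial (logistic) model
SAdata$chd <- factor(SAdata$chd, labels = c("no", "yes"))

ctrl <- trainControl(method = "cv", number = 10)    # 10-fold cross validation
cv_fit <- train(chd ~ obesity, data = SAdata,
                method = "glm", trControl = ctrl)
cv_fit                                              # resampled accuracy / kappa across the 10 folds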
If you still want to use that function, here is a workaround:
SAdata = read.table("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data",
                    sep = ",", head = T, row.names = 1)

log.fun = function(x, y) {
  prediction = data.frame()
  tset = data.frame()
  dframe = cbind(x, y)
  dframe = as.data.frame(dframe)
  dframe$fold = sample(1:10, nrow(dframe), replace = TRUE)
  list = 1:10
  results <- list()
  for (i in 1:10) {
    results[[paste0('Fold', i)]]$train <- subset(dframe, fold %in% list[-i])
    results[[paste0('Fold', i)]]$test <- subset(dframe, fold %in% c(i))
    results[[paste0('Fold', i)]]$model <- glm(x ~ y, data = results[[i]]$train, family = binomial)
    results[[paste0('Fold', i)]]$pred <- as.data.frame(predict(results[[i]]$model, results[[i]]$test[, -1]))
    results[[paste0('Fold', i)]]$prediction <- rbind(prediction, results[[i]]$pred)
  }
  results
}
your_results <- log.fun(SAdata$chd, SAdata$obesity)
head(your_results$Fold1$prediction)
your_results<-log.fun(SAdata$chd,SAdata$obesity)
head(your_results$Fold1$prediction)
In fact you had some problems in the 'sample' call, since you were specifying 'data' and that object did not exist ... I replaced it with dframe and added some names to each part of your results.
Hope it helps

How can I ensure that a partition has representative observations from each level of a factor?

I wrote a small function to partition my dataset into training and testing sets. However, I am running into trouble when dealing with factor variables. In the model validation phase of my code, I get an error if the model was built on a dataset that doesn't have representation from each level of a factor. How can I fix this partition() function to include at least one observation from every level of a factor variable?
test.df <- data.frame(a = sample(c(0, 1), 100, rep = T),
                      b = factor(sample(letters, 100, rep = T)),
                      c = factor(sample(c("apple", "orange"), 100, rep = T)))
set.seed(123)

partition <- function(data, train.size = .7) {
  train <- data[sample(1:nrow(data), round(train.size*nrow(data)), rep = FALSE), ]
  test <- data[-as.numeric(row.names(train)), ]
  partitioned.data <- list(train = train, test = test)
  return(partitioned.data)
}

part.data <- partition(test.df)
table(part.data$train[, 'b'])
table(part.data$test[, 'b'])
EDIT - New function using 'caret' package and createDataPartition():
partition <- function(data, factor = NULL, train.size = .7) {
  if (("package:caret" %in% search()) == FALSE) {
    stop("Install and Load 'caret' package")
  }
  if (is.null(factor)) {
    train.index <- createDataPartition(as.numeric(row.names(data)),
                                       times = 1, p = train.size, list = FALSE)
    train <- data[train.index, ]
    test <- data[-train.index, ]
  } else {
    train.index <- createDataPartition(factor,
                                       times = 1, p = train.size, list = FALSE)
    train <- data[train.index, ]
    test <- data[-train.index, ]
  }
  partitioned.data <- list(train = train, test = test)
  return(partitioned.data)
}
Try the caret package, particularly the function createDataPartition(). It should do exactly what you need; it is available on CRAN, and the data-splitting documentation is here:
caret - data splitting
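A minimal sketch of that approach on the test.df from the question, stratifying on the factor column b (createDataPartition samples within the levels of a factor, so levels with enough observations end up in both sets):
library(caret)
set.seed(123)
train.index <- createDataPartition(test.df$b, p = 0.7, list = FALSE)
train <- test.df[train.index, ]
test  <- test.df[-train.index, ]
table(train$b)   # every level of b should be represented in the training set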
The function below is partly some code I found a while back on the net, which I then modified slightly to better handle edge cases (like when you ask for a sample size larger than the set, or a subset).
stratified <- function(df, group, size) {
  # USE: * Specify your data frame and grouping variable (as column
  #        number) as the first two arguments.
  #      * Decide on your sample size. For a sample proportional to the
  #        population, enter "size" as a decimal. For an equal number
  #        of samples from each group, enter "size" as a whole number.
  #
  # Example 1: Sample 10% of each group from a data frame named "z",
  #            where the grouping variable is the fourth variable, use:
  #
  #            > stratified(z, 4, .1)
  #
  # Example 2: Sample 5 observations from each group from a data frame
  #            named "z"; grouping variable is the third variable:
  #
  #            > stratified(z, 3, 5)
  #
  require(sampling)
  temp = df[order(df[group]), ]
  colsToReturn <- ncol(df)
  # don't want to attempt to sample more than possible
  dfCounts <- table(df[group])
  if (size > min(dfCounts)) {
    size <- min(dfCounts)
  }
  if (size < 1) {
    size = ceiling(table(temp[group]) * size)
  } else if (size >= 1) {
    size = rep(size, times = length(table(temp[group])))
  }
  strat = strata(temp, stratanames = names(temp[group]),
                 size = size, method = "srswor")
  (dsample = getdata(temp, strat))
  dsample <- dsample[order(dsample[1]), ]
  dsample <- data.frame(dsample[, 1:colsToReturn], row.names = NULL)
  return(dsample)
}
