Producing a 2-class label prediction with the neuralnet package in R

I am working with the neuralnet package in R (I am more familiar with nnet).
My target variable is a 2-label class (Phone_sale, 1/0). I have a train and a test set. Also, all variables were normalized to the [0,1] scale.
My nn model is:
wireless_model <- neuralnet(formula = Phone_sale ~ Topflight + Balance +
Qual_miles + cc1_miles. + cc2_miles. +
cc3_miles. + Bonus_miles + Bonus_trans +
Flight_miles_12mo + Flight_trans_12 +
Online_12 + Email + Club_member + Any_cc_miles_12mo,
data = wireless_train, hidden=1, linear.output=FALSE)
The predicted results from wireless_model$net.result are floats between 0 and 1 (in fact almost all hover very close to zero), e.g. .07 and .21, instead of 1 or 0.
So obviously, when I compare my train to my test, my prediction is bad because of the two different types of DV.
I want the predicted results to be in the form of either 1 or 0. I am sure I did not specify a correct setting somewhere in the neuralnet package.
A guess is that I may need to set the "family" in the formula to logistic so I get only 1 or 0 as output, but I am not sure how that works in this package.
Any help?
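One common way to get 0/1 labels is to threshold the network's output. With linear.output=FALSE the output passes through the activation function (logistic by default in neuralnet), so values between 0 and 1 are expected and can be treated as scores. A minimal sketch, assuming a cutoff of 0.5; the vectors below are made-up illustration values, not results from the model in the question, and in practice prob would come from something like wireless_model$net.result[[1]]:

```r
# Hypothetical predicted scores (stand-ins for the values in net.result)
prob   <- c(0.07, 0.21, 0.64, 0.48, 0.93)
# Hypothetical true labels for the same rows
actual <- c(0, 0, 1, 1, 1)

# Assumed cutoff; lower it if almost all scores sit close to zero
cutoff <- 0.5

# Convert scores to hard 0/1 class labels
pred_class <- as.integer(prob > cutoff)

# Confusion table and accuracy on the hard labels
table(actual = actual, predicted = pred_class)
accuracy <- mean(pred_class == actual)
accuracy
```

If nearly all scores are close to zero, as described in the question, the classes are probably imbalanced, and a cutoff below 0.5 may separate them better than the default.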

Related

lavaan WARNING: some observed variances are (at least) a factor 1000 times larger than others; use varTable(fit) to investigate

I am trying to evaluate an SEM model on a dataset. Some of the data are on a Likert scale, i.e. from 1 to 5, and some of the data are COUNTS generated from the computer log for some of the activities.
While fitting the model, lavaan gives me the following warning:
lavaan WARNING: some observed variances are (at least) a factor 1000 times larger than others; use varTable(fit) to investigate
To mitigate this warning I want to scale some of the variables, but I couldn't work out how to do that.
Log_And_SurveyResult <- read_excel("C:/Users/Aakash/Desktop/analysis/Log-And-SurveyResult.xlsx")
model <- '
Reward =~ REW1 + REW2 + REW3 + REW4
ECA =~ ECA1 + ECA2 + ECA3
Feedback =~ FED1 + FED2 + FED3 + FED4
Motivation =~ Reward + ECA + Feedback
Satisfaction =~ a*MaxTimeSpentInAWeek + a*TotalTimeSpent + a*TotalLearningActivityView
Motivation ~ Satisfaction'
fit <- sem(model,data = Log_And_SurveyResult)
summary(fit, standardized=T, std.lv = T)
fitMeasures(fit, c("cfi", "rmsea", "srmr"))
I want to scale some of the variables like MaxTimeSpentInAWeek and TotalTimeSpent
Could you please help me figure out how to scale the variables? Thank you very much.
As Elias pointed out, the difference in the magnitude between the variables is huge and it is suggested to scale the variables.
The warning gives a hint and inspecting varTable(fit) returns summary information about the variables in a fitted lavaan object.
Rather than running scale() separately on each column, you could use apply() on a subset or on your whole data.frame:
## Scale the variables in the 4th and 7th columns
Log_And_SurveyResult[, c(4, 7)] <- apply(Log_And_SurveyResult[, c(4, 7)], 2, scale)
## Scale the whole data.frame (all columns must be numeric);
## apply() returns a matrix, so convert back to a data.frame
Log_And_SurveyResult <- as.data.frame(apply(Log_And_SurveyResult, 2, scale))
You can just use scale(MaxTimeSpentInAWeek). This will scale your variable to mean = 0 and variance = 1. E.g:
Log_And_SurveyResult$MaxTimeSpentInAWeek <-
scale(Log_And_SurveyResult$MaxTimeSpentInAWeek)
Log_And_SurveyResult$TotalTimeSpent <-
scale(Log_And_SurveyResult$TotalTimeSpent)
Or did I misunderstand your question?
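The two approaches above can also be combined by selecting the columns by name, which avoids relying on column positions. A minimal sketch on a small made-up data frame (the values are hypothetical; only the column names come from the question):

```r
# Toy data: one Likert item and two log-based counts (hypothetical values)
dat <- data.frame(
  REW1                = c(1, 3, 5, 2, 4),
  MaxTimeSpentInAWeek = c(1200, 5400, 300, 8000, 2500),
  TotalTimeSpent      = c(15000, 42000, 900, 61000, 20000)
)

# Columns to standardize
cols <- c("MaxTimeSpentInAWeek", "TotalTimeSpent")

# scale() returns a one-column matrix; as.numeric() keeps plain vectors
dat[cols] <- lapply(dat[cols], function(x) as.numeric(scale(x)))

# Variances are now on a comparable order of magnitude
sapply(dat, var)
```

The Likert items are left untouched here; whether they should be standardized as well is a modelling decision.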

R Quantreg: Singularity with categorical survey data

For my Bachelor's thesis I am trying to apply a linear median regression model to constant sum data from a survey (see the formula from A. Blass et al. (2008)). It is an attempt to recreate the probability elicitation approach proposed by A. Blass et al. (2008), "Using Elicited Choice Probabilities to Estimate Random Utility Models: Preferences for Electricity Reliability".
My dependent variable is the log-odds transformation of the constant sum allocations, calculated as follows:
PE_raw <- PE_raw %>% group_by(sys_RespNum, Task) %>% mutate(LogProb = c(log(Response[1]/Response[1]),
log(Response[2]/Response[1]),
log(Response[3]/Response[1])))
My independent variables are delivery costs, minimum order quantity and delivery window, each a categorical variable with levels 0, 1, 2 and 3. Here, level 0 represents the none-option.
Data snapshot
I tried running the following quantile regression (using R's quantreg package):
LAD.factor <- rq(LogProb ~ factor(`Delivery costs`) + factor(`Minimum order quantity`) + factor(`Delivery window`) + factor(NoneOpt), data=PE_raw, tau=0.5)
However, I ran into the following error indicating singularity:
Error in rq.fit.br(x, y, tau = tau, ...) : Singular design matrix
I ran a linear regression and applied R's alias function for further investigation. This informed me of three cases of perfect multicollinearity:
minimum order quantity 3 = delivery costs 1 + delivery costs 2 + delivery costs 3 - minimum order quantity 1 - minimum order quantity 2
delivery window 3 = delivery costs 1 + delivery costs 2 + delivery costs 3 - delivery window 1 - delivery window 2
NoneOpt = intercept - delivery costs 1 - delivery costs 2 - delivery costs 3
In hindsight these cases all make sense. When R dichotomizes the categorical variables you get these results by construction, as delivery costs 1 + delivery costs 2 + delivery costs 3 = 1 and minimum order quantity 1 + minimum order quantity 2 + minimum order quantity 3 = 1. Rewriting gives the first formula.
It looks like a classic dummy trap. In an attempt to work around this issue I tried to manually dichotomize the data and used the following formula:
LM.factor <- rq(LogProb ~ Delivery.costs_1 + Delivery.costs_2 + Minimum.order.quantity_1 + Minimum.order.quantity_2 + Delivery.window_1 + Delivery.window_2 + factor(NoneOpt), data=PE_dichomitzed, tau=0.5)
Instead of an error message I now got the following:
Warning message:
In rq.fit.br(x, y, tau = tau, ...) : Solution may be nonunique
When using the summary function:
> summary(LM.factor)
Error in base::backsolve(r, x, k = k, upper.tri = upper.tri, transpose = transpose, :
singular matrix in 'backsolve'. First zero in diagonal [2]
In addition: Warning message:
In summary.rq(LM.factor) : 153 non-positive fis
Is anyone familiar with this issue? I am looking for alternative solutions. Perhaps I am making mistakes using the rq() function, or the data might be misrepresented.
I am grateful for any input, thank you in advance.
Reproducible example
library(quantreg)
#### Raw dataset (PE_raw_SO) ####
# quantile regression (produces singularity error)
LAD.factor <- rq(
LogProb ~ factor(`Delivery costs`) +
factor(`Minimum order quantity`) + factor(`Delivery window`) +
factor(NoneOpt),
data = PE_raw_SO,
tau = 0.5
)
# linear regression to check for singularity
LM.factor <- lm(
LogProb ~ factor(`Delivery costs`) +
factor(`Minimum order quantity`) + factor(`Delivery window`) +
factor(NoneOpt),
data = PE_raw_SO
)
alias(LM.factor)
# impose assumptions on standard errors
summary(LM.factor, se = "iid")
summary(LM.factor, se = "boot")
#### Manually created dummy variables to get rid of
#### collinearity (PE_dichotomized_SO) ####
LAD.di.factor <- rq(
LogProb ~ Delivery.costs_1 + Delivery.costs_2 +
Minimum.order.quantity_1 + Minimum.order.quantity_2 +
Delivery.window_1 + Delivery.window_2 + factor(NoneOpt),
data = PE_dichotomized_SO,
tau = 0.5
)
summary(LAD.di.factor) #backsolve error
# impose assumptions (unusual results)
summary(LAD.di.factor, se = "iid")
summary(LAD.di.factor, se = "boot")
# linear regression to check for singularity
LM.di.factor <- lm(
LogProb ~ Delivery.costs_1 + Delivery.costs_2 +
Minimum.order.quantity_1 + Minimum.order.quantity_2 +
Delivery.window_1 + Delivery.window_2 + factor(NoneOpt),
data = PE_dichotomized_SO
)
alias(LM.di.factor)
summary(LM.di.factor) #regular results, all significant
Link to sample data + code: GitHub
The "Solution may be nonunique" behaviour is not unusual when doing quantile regressions with dummy explanatory variables.
See, e.g., the quantreg FAQ:
The estimation of regression quantiles is a linear programming
problem. And the optimal solution may not be unique.
A more intuitive explanation for what is happening is given by Roger Koenker (the author of quantreg) on r-help back in 2006:
When computing the median from a sample with an even number of
distinct values there is inherently some ambiguity about its value:
any value between the middle order statistics is "a" median.
Similarly, in regression settings the optimization problem solved by
the "br" version of the simplex algorithm, modified to do general
quantile regression identifies cases where there may be non
uniqueness of this type. When there are "continuous" covariates this
is quite rare, when covariates are discrete then it is relatively
common, atleast when tau is chosen from the rationals. For univariate
quantiles R provides several methods of resolving this sort of
ambiguity by interpolation, "br" doesn't try to do this, instead
returning the first vertex solution that it comes to.
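The ambiguity described in the quote can be seen in the simplest case, a plain median, without any regression. A small sketch with made-up numbers: for an even number of distinct values, every point between the two middle order statistics gives the same sum of absolute deviations, which is the quantity that median regression minimizes.

```r
# Four distinct values; the middle order statistics are 3 and 5
x <- c(1, 3, 5, 9)

# Median-regression objective with an intercept only:
# sum of absolute deviations from a candidate value m
loss <- function(m) sum(abs(x - m))

# Any m in [3, 5] attains the same minimum, so the median is not unique
sapply(c(3, 3.5, 4, 5), loss)

# Outside that interval the loss is strictly larger
loss(2)
loss(6)
```

With discrete covariates the same thing happens within each cell of the design, which is why the warning appears so often with dummy variables.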
Your second warning -- "153 non-positive fis" -- is related to how rq calculates the local densities. Occasionally the estimated local densities of the quantile regression function end up negative (which is obviously impossible). If this happens, rq automatically sets them to zero. Again, quoting from the FAQ:
This is generally harmless, leading to a somewhat conservative
(larger) estimate of the standard errors, however if the reported
number of non-positive fis is large relative to the sample size then
it is an indication of misspecification of the model.

How to run cross validation for logistic regression in R?

I'm trying to run a cross validation (leave one out and k fold) using logistic regression in R, binary outcome.
I have a problem with the cost function. I do not understand the cost function in the R help, and found a more intuitive one here on Stack Overflow, but I don't know how to call it, more specifically, how to pass on the arguments.
library(ISLR)
D = Default
mycost <- function(r, pi)
{
weight1 = 1 #cost for getting 1 wrong
weight0 = 1 #cost for getting 0 wrong
c1 = (r==1)&(pi<0.5) #logical vector - true if actual 1 but predict 0
c0 = (r==0)&(pi>=0.5) #logical vector - true if actual 0 but predict 1
return(mean(weight1*c1+weight0*c0))
}
glm.fit1 = glm(default~balance + student, data = D, family = binomial)
The problem is: if R runs several logistic regressions in the background (for example 3 for K=3), how can I pass on the vector of predicted probabilities (pi) and the vector of actual values?
I am confused...
Is there a way to use a for loop and do it manually instead of using cv.glm?
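A manual loop is possible. The sketch below is one way to do K-fold cross-validation by hand, using simulated data so that it runs without the ISLR package (the data-generating values are made up for illustration). It reuses the mycost function from above: in each fold, the observed responses of the held-out rows are passed as r and the predicted probabilities as pi. As far as the documentation of boot::cv.glm describes it, cv.glm does the same thing internally, calling cost(observed, predicted), so something like cv.glm(D, glm.fit1, cost = mycost, K = 3) should not need the arguments passed explicitly.

```r
set.seed(1)

# Simulated stand-in for the Default data (hypothetical coefficients)
n <- 300
D <- data.frame(balance = rnorm(n), student = rbinom(n, 1, 0.3))
D$default <- rbinom(n, 1, plogis(-1 + 1.5 * D$balance + 0.5 * D$student))

# Misclassification cost from the question: r = observed 0/1, pi = predicted probability
mycost <- function(r, pi) {
  weight1 <- 1                       # cost for getting a 1 wrong
  weight0 <- 1                       # cost for getting a 0 wrong
  c1 <- (r == 1) & (pi < 0.5)        # actual 1, predicted 0
  c0 <- (r == 0) & (pi >= 0.5)       # actual 0, predicted 1
  mean(weight1 * c1 + weight0 * c0)
}

K <- 3
folds <- sample(rep(1:K, length.out = n))  # random fold label for each row
cv_err <- numeric(K)

for (k in 1:K) {
  train <- D[folds != k, ]
  test  <- D[folds == k, ]
  fit   <- glm(default ~ balance + student, data = train, family = binomial)
  p     <- predict(fit, newdata = test, type = "response")
  cv_err[k] <- mycost(test$default, p)     # observed first, predicted second
}

mean(cv_err)  # average misclassification cost over the K folds
```

Setting K equal to n gives leave-one-out cross-validation. With the real Default data the response is a factor, so it would need converting first, e.g. as.numeric(test$default == "Yes").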

How to create manual contrasts with emmeans?

Suppose I have these data
library(MASS)     # provides the oats data
library(lme4)     # provides lmer()
library(emmeans)
m <- lmer(Y ~ N * V + (1 | B), data = oats)
How can I create a manual contrast in emmeans? For example
Victoria_0.2cwt 1
Victoria_0.4cwt -1
Marvellous_0.2cwt -1
Marvellous_0.4cwt 1
emm = emmeans(m, ~ V * N)
emm
contrast(emm, list(con = c(0,0,0,0,-1,1,0,0,-1,0,0,0)))
However, this is actually a linear function, not a contrast, because the coefficients do not sum to zero.
Note: I may have mis-remembered the factor levels, and if so, the coefficients may need to be rearranged. They should correspond to the combinations you see in the results if the 2nd line
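To avoid counting positions by hand, the coefficient vector can be built from the factor-level combinations by name. This is a sketch under assumptions: it assumes the rows of emm are ordered with V varying fastest and then N, which is what expand.grid produces here, so in practice the grid should be taken from as.data.frame(emm) instead. The helper cell() is hypothetical, and the variety level in the oats data is "Victory".

```r
library(MASS)  # the oats data ships with MASS

# Assumed row order of the emmeans grid: V varies fastest, then N
grid <- expand.grid(V = levels(oats$V), N = levels(oats$N))

# Hypothetical helper: indicator vector for one named cell of the grid
cell <- function(v, n) as.numeric(grid$V == v & grid$N == n)

# Interaction contrast from the question
con <- cell("Victory", "0.2cwt") - cell("Victory", "0.4cwt") -
  cell("Marvellous", "0.2cwt") + cell("Marvellous", "0.4cwt")

# Show the coefficients next to their cells and check they sum to zero
cbind(grid, con)
sum(con)

# The vector could then be passed on as: contrast(emm, list(con = con))
```

Because the coefficients are matched by level name, the result does not depend on remembering the order of the twelve combinations.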

LMEM: Chi-square = 0 , prob = 1 - what's wrong with my code?

I'm running a LMEM (linear mixed effects model) on some data, and compare the models (in pairs) with the anova function. However, on a particular subset of data, I'm getting nonsense results.
This is my full model:
m3_full <- lmer(totfix ~ psource + cond + psource:cond +
(1 + cond | subj) + (1 + psource + cond | object), data, REML=FALSE)
And this is the model I'm comparing it to: (basically dropping out one of the main effects)
m3_psource <- lmer (totfix ~ psource + cond + psource:cond -
psource + (1 + cond | subj) + (1 + psource + cond | object),
data, REML=FALSE)
Running the anova() function (anova(m3_full, m3_psource)) returns Chisq = 0, Pr(>Chisq) = 1.
I'm doing the same for a few other LMEMs and everything seems fine; it's just this particular response variable that gives me the weird chi-square and probability values. Does anyone have an idea why, and how I can fix it? Any help will be much appreciated!
This is not really a mixed-model-specific question: rather, it has to do with the way that R constructs model matrices from formulas (and, possibly, with the logic of your model comparison).
Let's narrow it down to the comparison between
form1 <- ~ psource + cond + psource:cond
and
form2 <- ~ psource + cond + psource:cond - psource
(which is equivalent to ~cond + psource:cond). These two formulas give equivalent model matrices, i.e. model matrices with the same number of columns, spanning the same design space, and giving the same overall goodness of fit.
Making up a minimal data set to explore:
dd <- expand.grid(psource=c("A","B"),cond=c("a","b"))
What constructed variables do we get with each formula?
colnames(model.matrix(form1,data=dd))
## [1] "(Intercept)" "psourceB" "condb" "psourceB:condb"
colnames(model.matrix(form2,data=dd))
## [1] "(Intercept)" "condb" "psourceB:conda" "psourceB:condb"
We get the same number of contrasts.
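The equivalence can be checked numerically. The sketch below replicates the minimal data set, adds a made-up random response y, and confirms that the two model matrices span the same column space and give the same residual sum of squares. It uses lm rather than lmer to keep the example self-contained; the fixed-effects design is constructed the same way.

```r
set.seed(42)

# Replicate the 2 x 2 design and add a hypothetical response
dd <- expand.grid(psource = c("A", "B"), cond = c("a", "b"))
dd <- dd[rep(1:4, each = 5), ]
dd$y <- rnorm(nrow(dd))

form1 <- y ~ psource + cond + psource:cond
form2 <- y ~ psource + cond + psource:cond - psource

X1 <- model.matrix(form1, data = dd)
X2 <- model.matrix(form2, data = dd)

# Same number of columns, and stacking them side by side adds no new
# dimensions, so both matrices span the same space
rank_joint <- qr(cbind(X1, X2))$rank
c(ncol(X1), ncol(X2), rank_joint)

# Identical fit, hence a test statistic of 0 on 0 degrees of freedom
fit1 <- lm(form1, data = dd)
fit2 <- lm(form2, data = dd)
all.equal(deviance(fit1), deviance(fit2))
```

This is why the likelihood-ratio comparison in the question reports Chisq = 0: the two models are reparameterizations of each other, not nested models that differ by a term.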
There are two possible responses to this problem.
There is one school of thought (typified by Nelder, Venables, etc.: e.g. see Venables' famous (?) but unpublished exegeses on linear models, section 5, or Wikipedia on the principle of marginality) that says that it doesn't make sense to try to test main effects in the presence of interaction terms, which is what you're trying to do.
There are occasional situations (e.g in a before-after-control-impact design where the 'before' difference between control and impact is known to be zero due to experimental protocol) where you really do want to do this comparison. In this case, you have to make up your own dummy variables and add them to your data, e.g.
## set up model matrix and drop intercept and "psourceB" column
dummies <- model.matrix(form1,data=dd)[,-(1:2)]
## d='dummy': avoid colons in column names
colnames(dummies) <- c("d_cond","d_source_by_cond")
colnames(model.matrix(~d_cond+d_source_by_cond,data.frame(dd,dummies)))
## [1] "(Intercept)" "d_cond" "d_source_by_cond"
This is a nuisance. My guess at the reason for this being difficult is that the original authors of R and S before it were from school of thought #1, and figured that generally when people were trying to do this it was a mistake; they didn't make it impossible, but they didn't go out of their way to make it easy.
