Neural network in R

Hi, I am trying to use the neuralnet function in R to predict an integer outcome (meaning) from the rest of the variables.
Here is the code that I have used:
library("neuralnet")
I am going to use 2/3 of the data to train the neural network and the rest for testing:
ind<-sample(1:nrow(Data),6463,replace=FALSE)
Train<-Data[ind,]
Test<-Data[-ind,]
m <- model.matrix(
~meaning +
firstLevelAFFIRM + firstLevelDAT.PRSN + firstLevelMODE +
firstLevelO.DEF + firstLevelO.INDIV + firstLevelS.AGE.INDIV +
secondLevelV.BIN + secondLevelWord1 + secondLevelWord2 +
secondLevelWord3 + secondLevelWord4 + thirdLevelP.TYPE,
data = Train[,-1]) #(the first column is ID; I am not going to use it)
PredictorVariables <- paste("m[," , 3:ncol(m),"]" ,sep="")
Formula <- formula(paste("meaning ~ ", paste(PredictorVariables, collapse=" + ")))
net <- neuralnet(Formula,data=m, hidden=3, threshold=0.05)
m.test <- model.matrix(
~meaning +
firstLevelAFFIRM + firstLevelDAT.PRSN + firstLevelMODE +
firstLevelO.DEF + firstLevelO.INDIV + firstLevelS.AGE.INDIV +
secondLevelV.BIN + secondLevelWord1 + secondLevelWord2 +
secondLevelWord3 + secondLevelWord4 + thirdLevelP.TYPE,
data = Test[,-1])
net.results <- compute(net, m.test[,-c(1,2)]) #(the first column of the model matrix is the intercept and the second is the outcome that I am trying to predict)
output<-cbind(round(net.results$net.result),Test$meaning)
mean(round(net.results$net.result)!=Test$meaning)
The misclassification rate that I got was around 0.01, which is great, but my question is: why is the output that I got (net.results$net.result) not an integer?

I assume that your output is linear. Try setting linear.output = FALSE.
net <- neuralnet(Formula, data = m, hidden = 3, threshold = 0.05, linear.output = FALSE)
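The column indexing in the question (3:ncol(m) and -c(1,2)) depends on how model.matrix() lays out its result. A minimal sketch on a made-up data frame (the names d, id, outcome, colour and size are assumptions for illustration, not the asker's data) shows that the first column is the intercept and the second is the outcome, with the dummy-coded predictors following:

```r
# Toy data: an ID column, a 0/1 outcome, one factor and one numeric predictor
d <- data.frame(
  id      = 1:4,
  outcome = c(0, 1, 1, 0),
  colour  = factor(c("red", "blue", "red", "green")),
  size    = c(1.5, 2, 0.5, 3)
)

# Drop the ID column, then expand the factor into dummy columns
m <- model.matrix(~ outcome + colour + size, data = d[, -1])

colnames(m)
# [1] "(Intercept)" "outcome"     "colourgreen" "colourred"   "size"

# Predictors only: everything except the intercept and the outcome
predictors <- m[, -c(1, 2)]
```

The factor's first level (here "blue") is the reference level and gets no column of its own.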

Related

svyglm - how to code for a logistic regression model across all variables?

In R, you can include all variables in a glm() formula by simply using a ".", as shown in "How to succinctly write a formula with many variables from a data frame?"
For example:
y <- c(1,4,6)
d <- data.frame(y = y, x1 = c(4,-1,3), x2 = c(3,9,8), x3 = c(4,-4,-2))
mod <- lm(y ~ ., data = d)
However, I am struggling to do this with svydesign. I have many explanatory variables plus an ID and a weight variable, so first I create my survey design:
des <-svydesign(ids=~id, weights=~wt, data = df)
Then I try creating my binomial model using weights:
binom <- svyglm(y~.,design = des, family="binomial")
But I get the error:
Error in svyglm.survey.design(y ~ ., design = des, family = "binomial") :
all variables must be in design = argument
What am I doing wrong?
You typically wouldn't want to do this, because "all the variables" would include design metadata such as weights, cluster indicators, stratum indicators, etc.
You can use colnames() to extract all the variable names from a design object and then reformulate(), probably after subsetting the names, e.g. with the api example in the package:
> all_the_names <- colnames(dclus1)
> all_the_actual_variables <- all_the_names[c(2, 11:37)]
> reformulate(all_the_actual_variables,"y")
y ~ stype + pcttest + api00 + api99 + target + growth + sch.wide +
comp.imp + both + awards + meals + ell + yr.rnd + mobility +
acs.k3 + acs.46 + acs.core + pct.resp + not.hsg + hsg + some.col +
col.grad + grad.sch + avg.ed + full + emer + enroll + api.stu
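The same idea can be sketched in base R without the survey package. The data frame df below is made up, with id and wt standing in for the design variables (all names here are assumptions for illustration). setdiff() removes the response and the design metadata before reformulate() builds the formula, which could then be handed to svyglm():

```r
# Toy data: design variables (id, wt), a response (y) and three predictors
df <- data.frame(
  id     = 1:6,
  wt     = c(1.2, 0.8, 1.0, 1.1, 0.9, 1.0),
  y      = c(0, 1, 0, 1, 1, 0),
  age    = c(23, 35, 41, 29, 52, 38),
  income = c(30, 45, 50, 28, 61, 44),
  region = c("n", "s", "n", "e", "s", "e")
)

# Exclude the response and the design metadata by name, not by position
design_vars <- c("id", "wt")
predictors  <- setdiff(names(df), c("y", design_vars))

f <- reformulate(predictors, response = "y")
f
# y ~ age + income + region
```

Selecting by name rather than by column index keeps the formula correct if the column order changes.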

lmer poly() function on interactions

My goal is to run a quadratic function using several IVs across time within subject. I have come across some code and am a little confused. Below is a reproducible example of what I am trying to run. Following the code will be my questions.
library(lme4) # provides lmer()
set.seed(1234)
obs <- 1:200
IV1<- rnorm(length(obs), mean = 1, sd = mean(obs^2) / 4)
IV2<- rnorm(length(obs), mean = 1.5, sd = mean(obs^2) / 8)
IV3<- rnorm(length(obs), mean = .5, sd = mean(obs^3) / 4)
y <- (obs + obs^2 + obs^3) + rnorm(length(obs), mean = 0, sd = mean(obs^2) / 4)
my.data <- data.frame(obs,IV1,IV2,IV3,y,
DV = y/10000,
time= c(1,2,3,4,5),
Subj= rep(letters[1:20], each =5),
group= rep(letters[1:2], each =5))
my.data$group.factor<-as.factor(my.data$group)
my.data$dx <- as.numeric(ifelse(my.data$group == "a", 1, 0))
Polylmer<- lmer(DV ~ poly(time*IV1, 2) + poly(time*IV2, 2) + poly(time*IV3, 2) + poly(time*dx, 2) + (1|Subj), data = my.data)
My questions are as follows:
In a non-poly() lmer formula, time*IV2 would give a coefficient for the time and IV2 interaction as well as the lower-order coefficients for time and IV2. Am I correct that using poly() does not put the lower-order terms into the model?
I have been taught that if you include the higher-order terms you should also include the lower-order terms. Is this still correct with the poly() function in R?
If so, would it make sense to use either of the following?
Polylmer2<- lmer(DV ~ poly(time, 2)*poly(IV1, 2) + poly(time, 2)*poly(IV2, 2) + poly(time, 2)*poly(IV3, 2) + poly(time*dx, 2) + (1|Subj), data = my.data)
Polylmer3<- lmer(DV ~ time + IV1 + IV2 + IV3 + dx + poly(time, 2) + poly(IV1, 2) + poly(IV2, 2) + poly(IV3, 2) + poly(time*IV1, 2) + poly(time*IV2, 2) + poly(time*IV3, 2) + poly(time*dx, 2) + (1|Subj), data = my.data)
I assumed that the two models above are equivalent; however, I must be wrong, as the second gives me an error:
Error: Dropping columns failed to produce full column rank design matrix
What columns did I drop?
Thank you for helping. I am very new to R, so I am trying my best to understand what these functions do rather than just follow a recipe.
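As a small illustration of what poly() produces (a sketch on made-up data, not the asker's model; x and y are assumptions for illustration): poly(x, 2) already returns both a linear and a quadratic column, so the lower-order term is part of it, whereas poly(a * b, 2) is a polynomial in the single product variable rather than an interaction. Adding a plain x next to poly(x, 2) duplicates the linear column, which is likely the kind of redundancy behind the rank-deficiency error above; lm() reports it as an NA coefficient instead of stopping.

```r
set.seed(1)
x <- 1:10
y <- 2 + 0.5 * x + 0.1 * x^2 + rnorm(10, sd = 0.2)

# poly(x, 2) has two columns: an orthogonal linear and an orthogonal quadratic term
P <- poly(x, 2)
dim(P)

# With raw = TRUE the two columns are simply x and x^2
R <- poly(x, 2, raw = TRUE)
all.equal(as.numeric(R[, 1]), as.numeric(x))
all.equal(as.numeric(R[, 2]), as.numeric(x^2))

# x is already spanned by the intercept and poly(x, 2), so one column is redundant
fit <- lm(y ~ x + poly(x, 2))
coef(fit)  # the redundant column gets an NA coefficient
```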

How to resolve "number of items to replace is not a multiple of replacement length" error in a bootstrapped regression?

I am trying to conduct a bootstrapped regression model using code from Andy Field's textbook Discovering Statistics Using R.
I am struggling to interpret an error message that I receive when running the boot() function. From reading other forum posts I understand that it is telling me that there is an imbalance in the number of items between two objects, but I don't understand what this means in my context and how I can resolve it.
You can download my data here (a publicly available Dataset on Airbnb listings) and find my code and the full error message below. I am using a mixture of factored dummy variables and continuous variables as predictors. Thanks in advance for any help!
Code:
library(boot) # provides boot()
bootReg <- function (formula, data, i)
{
d <- data [i,]
fit <- lm(formula, data = d)
return(coef(fit))
}
bootResults <- boot(statistic = bootReg, formula = review_scores_rating ~ instant_bookable + cancellation_policy +
host_since_cat + host_location_cat + host_response_time +
host_is_superhost + host_listings_cat + property_type + room_type +
accommodates + bedrooms + beds + price + security_deposit +
cleaning_fee + extra_people + minimum_nights + amenityBreakfast +
amenityAC + amenityElevator + amenityKitchen + amenityHostGreeting +
amenitySmoking + amenityPets + amenityWifi + amenityTV,
data = listingsRating, R = 2000)
Error:
Error in t.star[r, ] <- res[[r]] :
number of items to replace is not a multiple of replacement length
In addition: Warning message:
In doTryCatch(return(expr), name, parentenv, handler) :
restarting interrupted promise evaluation
The Problem
The problem is your factor variables. When you run lm() on a subset of your data (which boot::boot() does over and over again), you only get coefficients for the factor levels that are present in that subset, so the coefficient vectors from different resamples can have different lengths. This can be reproduced if you do
debug(boot)
set.seed(123)
bootResults <- boot(statistic = bootReg, formula = review_scores_rating ~ instant_bookable + cancellation_policy +
host_since_cat + host_location_cat + host_response_time +
host_is_superhost + host_listings_cat + property_type + room_type +
accommodates + bedrooms + beds + price + security_deposit +
cleaning_fee + extra_people + minimum_nights + amenityBreakfast +
amenityAC + amenityElevator + amenityKitchen + amenityHostGreeting +
amenitySmoking + amenityPets + amenityWifi + amenityTV,
data = listingsRating, R = 2)
which will allow you to move through the function call one line at a time. After you run the line
res <- if (ncpus > 1L && (have_mc || have_snow)) {
if (have_mc) {
parallel::mclapply(seq_len(RR), fn, mc.cores = ncpus)
}
else if (have_snow) {
list(...)
if (is.null(cl)) {
cl <- parallel::makePSOCKcluster(rep("localhost",
ncpus))
if (RNGkind()[1L] == "L'Ecuyer-CMRG")
parallel::clusterSetRNGStream(cl)
res <- parallel::parLapply(cl, seq_len(RR), fn)
parallel::stopCluster(cl)
res
}
else parallel::parLapply(cl, seq_len(RR), fn)
}
} else lapply(seq_len(RR), fn)
Then try
setdiff(names(res[[1]]), names(res[[2]]))
# [1] "property_typeBarn" "property_typeNature lodge"
There are two factor levels present in the first subset not present in the second. This is causing your problem.
The Solution
Use model.matrix() to expand your factors beforehand (following this Stack Overflow post):
df2 <- model.matrix( ~ review_scores_rating + instant_bookable + cancellation_policy +
host_since_cat + host_location_cat + host_response_time +
host_is_superhost + host_listings_cat + property_type + room_type +
accommodates + bedrooms + beds + price + security_deposit +
cleaning_fee + extra_people + minimum_nights + amenityBreakfast +
amenityAC + amenityElevator + amenityKitchen + amenityHostGreeting +
amenitySmoking + amenityPets + amenityWifi + amenityTV - 1, data = listingsRating)
undebug(boot)
set.seed(123)
bootResults <- boot(statistic = bootReg, formula = review_scores_rating ~ .,
data = as.data.frame(df2), R = 2)
(Note that throughout I reduce R to 2 just for faster runtime during debugging).
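A toy reproduction of the mechanism, using a made-up data frame d rather than the Airbnb data (all names are assumptions for illustration): lm() drops factor levels that are absent from the rows it is given, so a resample that misses a level returns a shorter coefficient vector. Dummy columns built beforehand with model.matrix() keep the length fixed; absent or redundant columns then appear as NA coefficients instead.

```r
d <- data.frame(
  y = c(1, 2, 3, 4, 5, 6),
  g = factor(c("a", "a", "b", "b", "c", "c"))
)

# Fit on all rows, then on a subset in which level "c" never occurs
full <- coef(lm(y ~ g, data = d))
sub  <- coef(lm(y ~ g, data = d[1:4, ]))
length(full)                      # 3
length(sub)                       # 2
setdiff(names(full), names(sub))  # "gc"

# Expand the factor into dummy columns once, before any subsetting
mm <- as.data.frame(model.matrix(~ y + g - 1, data = d))
full_mm <- coef(lm(y ~ ., data = mm))
sub_mm  <- coef(lm(y ~ ., data = mm[1:4, ]))
length(full_mm) == length(sub_mm) # TRUE: same length in both fits
```

With equal-length coefficient vectors, boot() can store every replicate in one matrix row; the NA entries still need to be handled when summarising the results.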
The way you define and call bootReg is wrong.
First, you must keep to the argument order that boot() expects of the statistic function, in this case bootReg. The first argument is the dataset and the second argument is the indices. Then come any other, optional arguments.
bootReg <- function (data, i, formula){
d <- data[i, ]
fit <- lm(formula, data = d)
return(coef(fit))
}
Second, in the call, the other optional arguments will be passed in the dots ... argument. So once again, keep to the order of arguments as defined in help("boot"), section Usage.
bootResults <- boot(data = iris, statistic = bootReg, R = 2000,
formula = Sepal.Length ~ Sepal.Width)
colMeans(bootResults$t)
#[1] 6.5417719 -0.2276868

Holding the coefficients of a linear model constant while exchanging predictors for their sample means?

I've been trying to look at the explanatory power of individual variables in a model by holding other variables constant at their sample mean.
However, I am unable to do something like:
Temperature = alpha + Beta1*RFGG + Beta2*RFSOx + Beta3*RFSolar
where Beta1=Beta2=Beta3 -- something like
Temperature = alpha + Beta1*(RFGG + RFSolar + RFSOx)
I want to do this so I can compare the difference in explanatory power (R^2/size of residuals) when one independent variable is not held at the sample mean while the rest are.
Temperature = alpha + Beta1*(RFGG + meanRFSolar + meanRFSOx)
or
Temperature = alpha + Beta1*RFGG + Beta1*meanRFSolar + Beta1*meanRFSOx
However, the lm function seems to estimate its own coefficients so I don't know how I can hold anything constant.
Here's some ugly code I tried throwing together that I know reeks of wrongness:
# fixing a new clean matrix for my data
dat = cbind(dat[,1:2],dat[,4:6]) # contains 162 rows of: Date, Temp, RFGG, RFSolar, RFSOx
# make a bunch of sample mean independent variables to use
meandat = dat[,3:5]
meandat$RFGG = mean(dat$RFGG)
meandat$RFSolar = mean(dat$RFSolar)
meandat$RFSOx = mean(dat$RFSOx)
RFTot = dat$RFGG + dat$RFSOx + dat$RFSolar
B = coef(lm(dat$Temp ~ 1 + RFTot)) # trying to save the coefficients to use them...
B1 = c(rep(B[1],length = length(dat[,1])))
B2 = c(rep(B[2],length = length(dat[,1])))
summary(lm(dat$Temp ~ B1 + B2*dat$RFGG:meandat$RFSOx:meandat$RFSolar)) # failure
summary(lm(dat$Temp ~ B1 + B2*RFTot))
Thanks for taking a look to whoever sees this and please ask me any questions.
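Thank you both; it was a combination of eliminating the intercept with -1 and using the offset() function.
# sample means of the predictors (assumed definitions of the *_bar variables)
RFGG_bar = mean(dat$RFGG)
RFSOx_bar = mean(dat$RFSOx)
RFSolar_bar = mean(dat$RFSolar)
# fit the constrained model once and save its coefficients
a = lm(Temp ~ I(RFGG + RFSOx + RFSolar),data = dat)
beta1hat = rep(coef(a)[1],length=length(dat[,1]))
beta2hat = rep(coef(a)[2],length=length(dat[,1]))
# offset() keeps the saved coefficients fixed; only one predictor varies in each model
b = lm(Temp ~ -1 + offset(beta1hat) + offset(beta2hat*(RFGG + RFSOx_bar + RFSolar_bar)),data = dat)
c = lm(Temp ~ -1 + offset(beta1hat) + offset(beta2hat*(RFGG_bar + RFSOx + RFSolar_bar)),data = dat)
d = lm(Temp ~ -1 + offset(beta1hat) + offset(beta2hat*(RFGG_bar + RFSOx_bar + RFSolar)),data = dat)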
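A self-contained sketch of the same offset() approach on simulated data (the variables x1, x2 and y and the simulated coefficients are assumptions for illustration, not the temperature data). A term wrapped in offset() enters the model with a fixed coefficient of 1, so with the intercept removed nothing is re-estimated and the residuals are simply the response minus the offset:

```r
set.seed(1)
dat <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
dat$y <- 1 + 2 * (dat$x1 + dat$x2) + rnorm(50, sd = 0.1)

# Fit once with a single shared slope and save the coefficients
fit <- lm(y ~ I(x1 + x2), data = dat)
b0  <- unname(coef(fit)[1])
b1  <- unname(coef(fit)[2])

# Hold x2 at its sample mean and keep b0 and b1 fixed via offset()
x2_bar <- mean(dat$x2)
held   <- lm(y ~ -1 + offset(b0 + b1 * (x1 + x2_bar)), data = dat)

# Compare residual sums of squares: full fit vs. x2 held at its mean
rss_full <- sum(residuals(fit)^2)
rss_held <- sum(residuals(held)^2)
c(full = rss_full, x2_held = rss_held)
```

Because the held model has no free parameters, its residual sum of squares can only be equal to or larger than that of the original fit; the size of the increase indicates how much the held variable contributed.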

How to get which data point was in which cluster based on flexmix

I have done cluster analysis using flexmix.
m7 <- stepFlexmix(ADA ~ NLEAD + BIG4 + LOGMKT + LEV + ROA + ROAL + LOSS +
CFO + BTM + GROWTH + ALTMAN + ABSACCRL +
STDEARN + TENURE + LOGASSETS, data = dt,
control = list(verbose = 0), k = 1:5, nrep = 5)
m7 <- getModel(m7, "BIC");
However, I am not sure how to extract which data point fell into which cluster. Could someone suggest a solution? Thanks.
With the function str() you can see the structure of the object m7 (the object made with the function stepFlexmix() and not with getModel()), and you will see that there is a slot named cluster that contains the cluster numbers.
str(m7)
m7@cluster
