Looping GLM models and printing results in R

I have a data set on which I want to run about 20 univariate regressions. Below is a truncated data frame showing just two of the variables of interest (Age and Anesthesia). The regressions run fine; the issue I am running into is how to store the results. Age yields only one coefficient, but Anesthesia (a factor) yields several, and the error I get is:
Error in results[i, i] <- summary.glm(logistic.model)$coefficients :
number of items to replace is not a multiple of replacement length
Here is the truncated data.
COMPLICATION Age Anesthesia
0 45 General
1 23 Local
1 33 Lumbar
0 21 Other
varlist <- c("Age", "Anesthesia")
univars <- data.frame() # create an empty data frame
for (i in seq_along(varlist))
{
mod <- as.formula(sprintf("COMPLICATION ~ %s", varlist[i]))
logistic.model <- glm(formula = mod, family = binomial, data = data.clean)
results[i, i] <- summary.glm(logistic.model)$coefficients
}

Filling a data frame inside a loop is not a good idea and can be inefficient. Moreover, summary.glm(logistic.model)$coefficients returns more than one row (one per model term), hence the error. Try this lapply approach instead.
varlist <- c("Age", "Anesthesia")
lapply(varlist, function(x) {
mod <- glm(reformulate(x, 'COMPLICATION'), data.clean, family = binomial)
summary.glm(mod)$coefficients
}) -> result
result
If you want to combine the results into one data frame, you can do:
result <- do.call(rbind, result)
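If you also want each row of the combined output to record which univariate model it came from, one option (a sketch; the column names variable and term are my own choices) is to convert each coefficient matrix to a data frame before binding, instead of the rbind above:
names(result) <- varlist   # assumes result is still the list returned by lapply()
univars <- do.call(rbind, lapply(varlist, function(x) {
  cf <- result[[x]]
  # keep the predictor name and the term (e.g. AnesthesiaLocal) alongside the estimates
  data.frame(variable = x, term = rownames(cf), cf, row.names = NULL)
}))
univars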

Related

Failing regression loop in R

I wish to regress certain factor exposures, calculated for a portfolio, on that portfolio's returns over 550 months (monthly observations): a Fama-MacBeth regression.
So essentially, I am regressing 550 return observations on a constant plus the previously calculated factor exposures. My loop so far is as follows:
# Second Regression
library(sandwich)
library(broom)
library(tibble)
# Choose the file containing factor exposures (note: columns = exposures, rows = portfolios)
f <- file.choose()
betas <- read.csv(f)
BTM2R <- betas[1, 3]
BIPR <- betas[1, 4]
BInfR <- betas[1, 5]
BUnR <- betas[1, 6]
BOilR <- betas[1, 7]
# Choose the file containing return data (columns = portfolios, rows = monthly return observations)
f <- file.choose()
retur <- read.csv(f)
for (i in 1:nrow(retur)){
mod <- lm(data = retur, retur[[i,1]]~BTM2R+BIPR+BInfR+BUnR+BOilR)
print(mod$coefficients)
}
(I also wish to develop this loop further so that, after running the regression for one portfolio, it runs for the next one, i.e. iterating over the column number j of retur. But I will address my current problem first.)
My current problem is that when I run the regression, all coefficient values come back "NA" apart from the intercept, which does return a value. To confuse matters further, the intercept does not equal the return value at time t, which is what one would expect if all other coefficients were NA (even though the regression results would still obviously be incorrect).
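For what it's worth, one likely reason for the NAs is that BTM2R, BIPR, BInfR, BUnR and BOilR are each a single number (taken from row 1 of betas), so inside lm they are constants and get dropped. A Fama-MacBeth second pass normally regresses, month by month, the cross-section of portfolio returns on the portfolios' exposures. A rough sketch of that structure, assuming betas has one row per portfolio (exposures in columns 3:7) and retur has one column per portfolio, so that ncol(retur) equals nrow(betas):
# portfolio exposures, constant over time: one row per portfolio, one column per factor
X <- as.matrix(betas[, 3:7])
for (t in 1:nrow(retur)) {
  rt  <- unlist(retur[t, ])   # month-t returns across all portfolios
  mod <- lm(rt ~ X)           # cross-sectional regression for month t
  print(coef(mod))
}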

How to capture the most important variables in Bootstrapped models in R?

I have several models whose choices of important predictors I would like to compare over the same data set, Lasso being one of them. The data set consists of census data with around a thousand variables that have been renamed to "x1", "x2", and so on for convenience's sake (the original names are extremely long). I would like to report the top features and then rename these variables with shorter, more concise names.
My attempt at solving this is to extract the top variables in each iterated model, put them into a list, and then find the mean of the top variables over X loops. However, the top 10 most-used predictors still vary, so I cannot manually rename them, because each run of the code chunk yields different results. I suspect this is because I have so many variables in my analysis and because the resampling creates new models on every bootstrap iteration.
For the sake of a simple example I used mtcars and will look for the top 3 most common predictors, since this data set only has 10 variables.
library(glmnet)
data("mtcars") # Base R Dataset
df <- mtcars
topvar <- list()
for (i in 1:100) {
# CV and Splitting
ind <- sample(nrow(df), nrow(df), replace = TRUE)
ind <- unique(ind)
train <- df[ind, ]
xtrain <- model.matrix(mpg~., train)[,-1]
ytrain <- df[ind, 1]
test <- df[-ind, ]
xtest <- model.matrix(mpg~., test)[,-1]
ytest <- df[-ind, 1]
# Create Model per Loop
model <- glmnet(xtrain, ytrain, alpha = 1, lambda = 0.2)
# Store coefficients per loop
coef_las <- coef(model, s = 0.2)[-1, ] # Remove intercept
# Store all nonzero Coefficients
topvar[[i]] <- coef_las[which(coef_las != 0)]
}
# Unlist
varimp <- unlist(topvar)
# Count all predictors
novar <- table(names(varimp))
# Find the mean of all variables
meanvar <- tapply(varimp, names(varimp), mean)
# Return top 3 repeated Coefs
repvar <- novar[order(novar, decreasing = TRUE)][1:3]
# Return mean of repeated Coefs
repvar.mean <- meanvar[names(repvar)]
repvar
Now if you rerun the code chunk above, you will notice that the top 3 variables change, so renaming them would be difficult since they are not constant and change on every run. Any suggestions on how I could approach this?
You can use the function set.seed() to ensure sample() returns the same draw each time, making the run reproducible. For example:
set.seed(123)
When I add this at the top of the code above and run it twice, the following is returned both times:
wt carb hp
98 89 86
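Once the selection is reproducible, the renaming itself can be done with a named lookup vector instead of by hand. A minimal sketch, using a made-up long-to-short mapping for the mtcars example:
# hypothetical mapping from original names to the shorter labels you want to report
short_names <- c(wt = "Weight", hp = "Horsepower", carb = "Carburetors")
names(repvar)      <- short_names[names(repvar)]       # relabel the selection counts
names(repvar.mean) <- short_names[names(repvar.mean)]  # and the matching mean coefficients
repvar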

naive bayes error in R: subscript out of bounds

I'm trying to classify 94 texts of speeches.
Since naiveBayes cannot work well if categories present in the training set are missing from the test set, I randomized the split and confirmed there was no problem with the categories.
But the classifier didn't work on the test set. Here is my code; the error message appears below it:
Df.dtm<-cbind(Df.dtm, category)
dim(Df.dtm)
Df.dtm[1:10, 530:532]
# Randomize and Split data by rownumber
train <- sample(nrow(Df.dtm), ceiling(nrow(Df.dtm) * .50))
test <- (1:nrow(Df.dtm))[- train]
# Isolate classifier
cl <- Df.dtm[, "category"]
summary(cl[train])
dip eds ind pols
23 8 3 13
# Create model data and remove "category"
modeldata <- Df.dtm[,!colnames(Df.dtm) %in% "category"]
#Boolean feature Multinomial Naive Bayes
#Function to convert the word frequencies to yes and no labels
convert_count <- function(x) {
y <- ifelse(x > 0, 1,0)
y <- factor(y, levels=c(0,1), labels=c("No", "Yes"))
y
}
#Apply the convert_count function to get final training and testing DTMs
train.cc <- apply(modeldata[train, ], 2, convert_count)
test.cc <- apply(modeldata[test, ], 2, convert_count)
#Training the Naive Bayes Model
#Train the classifier
system.time(classifier <- naiveBayes(train.cc, cl[train], laplace = 1) )
This classifier worked well:
   user  system elapsed
   0.45    0.00    0.46
#Use the classifier we built to make predictions on the test set.
system.time(pred <- predict(classifier, newdata=test.cc))
However, prediction failed.
Error in `[.default`(object$tables[[v]], , nd) : subscript out of bounds
Timing stopped at: 0.2 0 0.2
Consider the following:
# Indices (row numbers) of the training observations
train <- sample(nrow(Df.dtm), ceiling(nrow(Df.dtm) * .50))
# Everything left over from that sample; note that this returns the remaining
# rows of Df.dtm themselves, not their indices:
test <- Df.dtm[-train, ]
After clearing up what the sample returns (row indices) and how you want to slice up the test set (rows versus columns needs to be settled at this point), I would tweak the apply call accordingly. The key is apply's MARGIN argument (the apply documentation covers it): passing 2 applies the function over each column, passing 1 applies it over each row. Depending on how you want to treat your sample (rows or columns), we can tweak this either way.
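As a tiny self-contained illustration of the MARGIN argument (unrelated to your data):
m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns
apply(m, 2, sum)             # MARGIN = 2: sum of each column -> 3 7 11
apply(m, 1, sum)             # MARGIN = 1: sum of each row    -> 9 12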

In R using the pls package, how can I obtain estimates of coefficients by group/factor

I've started looking at the pls package & I am unsure about how to extract separate coefficients by group/factor. I can run separate models per group, or consider the X ~ group interaction term, but that isn't what I'm after.
I'm using the following syntax:
model1 <- plsr(outcome ~ pred * group, data =plsDATA,2)
I've tried using the following:
model2 <- plsr(outcome ~ embed(pred:as.factor(group)), data=plsDATA,2)
but this results in this error:
Error in model.frame.default(formula = outcome ~ embed(pred:as.factor(group)), :
variable lengths differ (found for 'embed(pred:as.factor(group))')
In addition: Warning messages:
1: In pred:as.factor(group) :
numerical expression has 640 elements: only the first used
2: In pred:as.factor(group) :
numerical expression has 32 elements: only the first used
I'm not sure why I'm getting the variable lengths error since running the following command gives compatible dimensions:
dim(group)
[1] 32 1
dim(outcome)
[1] 32 1
dim(pred)
[1] 32 20
The code is below:
library(pls) #Dummy Data
setwd("/Users/John/Documents")
Data <- read.csv("SamplePLS.csv") #Define each of the inputs pred is X, group is the factor & outcome is Y
pred <- as.matrix(Data[,3:22])
group <- as.matrix(Data[,1])
outcome <- as.matrix(Data[,2]) #now combine the matrices into a single dataframe
plsDATA <- data.frame(SampN=c(1:nrow(Data)))
plsDATA$pred <- pred
plsDATA$group <- group
plsDATA$outcome <-outcome #define the model - ask for two components
model1 <- plsr(outcome ~ pred * group, data=plsDATA,2)#Get coefficients from this object
According to your question, you want to extract the coefficients. There is a function, coef(), that will pull them out easily. See the results below.
> Data <- read.csv("SamplePLS.csv") #Define each of the inputs: pred is X, group is the factor & outcome is Y
> pred <- as.matrix(Data[,3:22])
> group <- as.matrix(Data[,1])
> outcome <- as.matrix(Data[,2]) #now combine the matrices into a single dataframe
> plsDATA <- data.frame(SampN=c(1:nrow(Data)))
> plsDATA$pred <- pred
> plsDATA$group <- group
> plsDATA$outcome <-outcome #define the model - ask for two components
> model1 <- plsr(outcome ~ pred * group, data=plsDATA,2)
> coef(model1)
, , 2 comps
outcome
predpred1 -1.058426e-02
predpred2 2.634832e-03
predpred3 3.579453e-03
predpred4 1.135424e-02
predpred5 3.271867e-04
predpred6 4.438445e-03
predpred7 8.425997e-03
predpred8 3.001517e-03
predpred9 2.111697e-03
predpred10 -9.264594e-04
predpred11 1.885554e-03
predpred12 -2.798959e-04
predpred13 -1.390471e-03
predpred14 -1.023795e-03
predpred15 -3.233470e-03
predpred16 5.398053e-03
predpred17 9.796533e-03
predpred18 -8.237801e-04
predpred19 4.778983e-03
predpred20 1.235484e-03
group 9.463735e-05
predpred1:group -8.814101e-03
predpred2:group 9.013430e-03
predpred3:group 7.597494e-03
predpred4:group 1.869234e-02
predpred5:group 1.462835e-03
predpred6:group 6.928687e-03
predpred7:group 1.925111e-02
predpred8:group 3.752095e-03
predpred9:group 2.404539e-03
predpred10:group -1.288023e-03
predpred11:group 4.271393e-03
predpred12:group 6.704938e-04
predpred13:group -3.943964e-04
predpred14:group -5.468510e-04
predpred15:group -5.595737e-03
predpred16:group 1.090501e-02
predpred17:group 1.977715e-02
predpred18:group -3.013597e-04
predpred19:group 1.169534e-02
predpred20:group 3.389127e-03
The same results could also be achieved with the call model1$coefficients or model1[[1]]. Based on the question, I think this is the result you are looking for.
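If you prefer a plain named vector instead of the three-dimensional array that coef() returns, wrapping the call in drop() removes the singleton dimensions:
drop(coef(model1))   # named vector of coefficients for the 2-component fit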
Actually, I've just figured this out. You need to dummy code the grouping variable and make it the outcome (i.e. the predicted variable). In my case, I had two columns representing group membership, with membership in the group coded 1 and non-membership 0. I then called the first two columns as group (i.e. group <- as.matrix(Data[,1:2])) and ran the rest of the code as before, substituting group for outcome.
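For reference, here is a sketch of that workaround as I read it, assuming columns 1 and 2 of SamplePLS.csv hold the 0/1 group-membership indicators:
group <- as.matrix(Data[, 1:2])   # dummy-coded group membership (assumed 0/1 columns)
plsDATA$group <- group            # replaces the single-column version used earlier
model2 <- plsr(group ~ pred, data = plsDATA, 2)   # group is now the (multivariate) response
coef(model2)                                      # one set of coefficients per group indicator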

How to run an ANOVA on specific variables in R using a CSV?

I am new to R and am currently working with participant data on which I want to run ANOVAs for each category.
I have read my data from my CSV using the following line of code:
CSV <- read.csv("data.csv", header=TRUE)
The data frame CSV now holds many observations of 12 variables. The last variable indicates the group, whereas the first 11 variables are the categories. I wish to separate the data into groups based on the value of the 12th variable and run an ANOVA for each of variables 2-11.
How would I separate the data into N groups based on the 12th variable and run an ANOVA for each of variables 2-11?
I'm a little confused as to what you mean by an ANOVA for each variable. Below is a loop that runs through each value of your 12th variable, subsets the data, and then runs an ANOVA. You will need to change the y ~ x part, since I don't know what your dependent and independent variables are. If you want to run an ANOVA of each variable on every other variable, you might need another loop, which I have attempted below.
for(i in unique(CSV[,12])) {
data<-subset(CSV, subset=CSV[,12]==i)
fit <- aov(y ~ x, data = data)  # replace y ~ x with your response and predictor
print(fit)                      # print explicitly so results show up inside the loop
}
For an ANOVA of each variable against the others:
`%ni%`<-Negate(`%in%`) ##setting up 'not in'
for(i in unique(CSV[,12])) {
data<-subset(CSV, subset=CSV[,12]==i)
for (j in 1:11) {
  # response = variable j; predictors = every other variable except the (now constant) group column 12
  fm  <- as.formula(paste(names(data)[j], "~",
                          paste(names(data)[names(data) %ni% names(data)[c(j, 12)]], collapse = "+")))
  fit <- aov(fm, data = data)
  print(fit)  # you may want to store the results rather than just printing them (see the sketch below)
}
}
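If you would rather collect the fits than print them, one way (a sketch; the list name fits, and the use of variables 2 to 11 as in the question, are my own choices) is:
fits <- list()
for (i in unique(CSV[, 12])) {
  data <- subset(CSV, subset = CSV[, 12] == i)
  for (j in 2:11) {
    fm <- as.formula(paste(names(data)[j], "~",
                           paste(names(data)[-c(j, 12)], collapse = "+")))
    fits[[paste(i, names(data)[j], sep = "_")]] <- aov(fm, data = data)
  }
}
lapply(fits, summary)   # inspect all of the fits afterwards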
Read in the file:
CSV <- read.csv("data.csv", header=TRUE)
Set the index of the grouping column (12) and the lower and upper bounds of the variables to iterate through:
lenCSV <- 12
upper <- 11
lower <- 2
Iterate through the categories and output the summary:
for (j in lower:upper) {
  fm  <- as.formula(paste(names(CSV)[j], "~", names(CSV)[lenCSV]))
  fit <- aov(fm, data = CSV)
  print(summary(fit))
}
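If instead the intent is a one-way ANOVA of each of variables 2 to 11 against the group in column 12 (one reading of the question), a compact sketch:
results <- lapply(names(CSV)[2:11], function(v) {
  fm <- as.formula(paste(v, "~ factor(", names(CSV)[12], ")"))
  summary(aov(fm, data = CSV))
})
names(results) <- names(CSV)[2:11]
results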
