I have a data set with around 20 variables on which I want to run a series of univariate regressions. Listed below is a truncated data frame showing just two of the variables of interest (Age and Anesthesia). The regression works fine; the issue I am running into is how to store the results, since Age has only one coefficient but Anesthesia has several, and the error I get is:
Error in results[i, i] <- summary.glm(logistic.model)$coefficients :
number of items to replace is not a multiple of replacement length
Here is the truncated data.
COMPLICATION Age Anesthesia
0 45 General
1 23 Local
1 33 Lumbar
0 21 Other
varlist <- c("Age", “Anesthesia")
results <- data.frame() # create an empty data frame to hold the results
for (i in seq_along(varlist))
{
mod <- as.formula(sprintf("COMPLICATION ~ %s", varlist[i]))
logistic.model <- glm(formula = mod, family = binomial, data = data.clean)
results[i, i] <- summary.glm(logistic.model)$coefficients
}
Filling a data frame inside a loop is not a good idea and can be inefficient. Moreover, summary.glm(logistic.model)$coefficients returns more than one row, hence the error. Try this lapply approach instead.
varlist <- c("Age", "Anesthesia")
result <- lapply(varlist, function(x) {
  mod <- glm(reformulate(x, "COMPLICATION"), data = data.clean, family = binomial)
  summary.glm(mod)$coefficients
})
result
If you want to combine the results into one data frame you can do:
result <- do.call(rbind, result)
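If, in addition, you want the combined table to record which predictor each block of rows came from, here is a small optional sketch that starts from the lapply list (i.e. before the rbind above):
names(result) <- varlist                     # label each list element with its predictor
result.df <- do.call(rbind, lapply(varlist, function(x) {
  data.frame(variable = x,
             term = rownames(result[[x]]),
             result[[x]],
             row.names = NULL, check.names = FALSE)
}))
result.df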
I wish to run the second pass of a Fama-MacBeth regression: regressing a portfolio's returns over 550 months (monthly observations) on factor exposures previously calculated for that portfolio.
So essentially, I am regressing 550 return observations on a constant plus the previously calculated factor exposures. My loop thus far is as follows:
# Second Regression
library(sandwich)
library(broom)
library(tibble)
# Choose the file containing the factor exposures (note: data columns = exposures, rows = portfolios)
f <- file.choose()
betas <- read.csv(f)
BTM2R <- betas[1, 3]
BIPR <- betas[1, 4]
BInfR <- betas[1, 5]
BUnR <- betas[1, 6]
BOilR <- betas[1, 7]
# Choose the file containing the return data (columns = portfolios, rows = monthly return observations)
f <- file.choose()
retur <- read.csv(f)
for (i in 1:nrow(retur)){
mod <- lm(data = retur, retur[[i,1]]~BTM2R+BIPR+BInfR+BUnR+BOilR)
print(mod$coefficients)
}
(I also wish to develop this loop further so that after running this regression for each portfolio it is run for the next portfolio, i.e. iterating over the column number j of retur, but I will first address my current problem.)
My current problem is that when I run the regression, all coefficient values come back NA, aside from the intercept, which does return a value. To confuse matters further, the intercept does not equal the return value at time t, which is what one would expect if all the other coefficients were NA (even though the regression results would still obviously be incorrect).
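For reference, the NA slopes arise because BTM2R, BIPR, BInfR, BUnR and BOilR are single numbers (portfolio 1's exposures), so inside lm() they are constant regressors and get dropped. A Fama-MacBeth second pass is usually cross-sectional: at each month, the returns of all portfolios are regressed on the portfolios' exposures. Below is a minimal, hedged sketch of that idea; it assumes betas has one row per portfolio with the exposures in columns 3:7, that every column of retur is a numeric return series for one portfolio, and that the portfolios appear in the same order in both files (none of which is confirmed above).
library(broom)  # already loaded above; used to tidy each monthly fit
# Exposure matrix: one row per portfolio, one column per factor (assumed layout)
X <- as.matrix(betas[, 3:7])
colnames(X) <- c("BTM2R", "BIPR", "BInfR", "BUnR", "BOilR")
# One cross-sectional regression per month: returns of all portfolios at month t
# regressed on their exposures
monthly <- lapply(seq_len(nrow(retur)), function(t) {
  rt <- as.numeric(retur[t, ])  # returns across portfolios at month t
  tidy(lm(rt ~ X))
})
# Fama-MacBeth estimates: average the monthly slope estimates
est <- do.call(rbind, monthly)
tapply(est$estimate, est$term, mean)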
I have several models whose choices of important predictors I would like to compare over the same data set, Lasso being one of them. The data set I am using consists of census data with around a thousand variables that have been renamed to "x1", "x2", and so on for convenience's sake (the original names are extremely long). I would like to report the top features and then rename these variables with shorter, more concise names.
My attempt to solve this is to extract the top variables in each iterated model, put them into a list, and then find the mean of the top variables over X loops. However, my issue is that there is still variability in the top 10 most-used predictors, so I cannot manually rename the variables because each run of the code chunk yields different results. I suspect this is because I have so many variables in my analysis and because the CV/bootstrap resampling creates a new model on every iteration.
For the sake of a simple example I used mtcars and will look for the top 3 most common predictors, since this data set has only 10 predictors.
library(glmnet)
data("mtcars") # Base R Dataset
df <- mtcars
topvar <- list()
for (i in 1:100) {
# CV and Splitting
ind <- sample(nrow(df), nrow(df), replace = TRUE)
ind <- unique(ind)
train <- df[ind, ]
xtrain <- model.matrix(mpg~., train)[,-1]
ytrain <- df[ind, 1]
test <- df[-ind, ]
xtest <- model.matrix(mpg~., test)[,-1]
ytest <- df[-ind, 1]
# Create Model per Loop
model <- glmnet(xtrain, ytrain, alpha = 1, lambda = 0.2)
# Store coefficients per loop
coef_las <- coef(model, s = 0.2)[-1, ] # Remove intercept
# Store all nonzero Coefficients
topvar[[i]] <- coef_las[which(coef_las != 0)]
}
# Unlist
varimp <- unlist(topvar)
# Count all predictors
novar <- table(names(varimp))
# Find the mean of all variables
meanvar <- tapply(varimp, names(varimp), mean)
# Return top 3 repeated Coefs
repvar <- novar[order(novar, decreasing = TRUE)][1:3]
# Return mean of repeated Coefs
repvar.mean <- meanvar[names(repvar)]
repvar
Now if you rerun the code chunk above, you will notice that the top 3 variables change, so if I had to rename these variables it would be difficult, since they are not constant and change on every run. Any suggestions on how I could approach this?
You can use set.seed() to ensure your sample() call returns the same sample each time. For example:
set.seed(123)
When I add this to the above code and then run it twice, the following is returned both times:
wt carb hp
98 89 86
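To make the placement concrete: the seed goes once before the resampling loop, not inside it. A tiny self-contained check (not the lasso loop itself) that the same resamples are then drawn on every run:
set.seed(123)
first <- replicate(3, sort(unique(sample(10, 10, replace = TRUE))))
set.seed(123)
second <- replicate(3, sort(unique(sample(10, 10, replace = TRUE))))
identical(first, second)  # TRUE: identical resamples on each run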
I'm trying to classify 94 speech texts.
Since naiveBayes does not work well if categories present in the training set are missing from the test set, I randomized the split and checked.
There was no problem with the categories.
But the classifier did not work on the test set.
My code is below, followed by the error message:
Df.dtm <- cbind(Df.dtm, category)
dim(Df.dtm)
Df.dtm[1:10, 530:532]
# Randomize and Split data by rownumber
train <- sample(nrow(Df.dtm), ceiling(nrow(Df.dtm) * .50))
test <- (1:nrow(Df.dtm))[- train]
# Isolate classifier
cl <- Df.dtm[, "category"]
summary(cl[train])
dip eds ind pols
23 8 3 13
# Create model data and remove "category"
modeldata <- Df.dtm[,!colnames(Df.dtm) %in% "category"]
#Boolean feature Multinomial Naive Bayes
#Function to convert the word frequencies to yes and no labels
convert_count <- function(x) {
y <- ifelse(x > 0, 1,0)
y <- factor(y, levels=c(0,1), labels=c("No", "Yes"))
y
}
#Apply the convert_count function to get final training and testing DTMs
train.cc <- apply(modeldata[train, ], 2, convert_count)
test.cc <- apply(modeldata[test, ], 2, convert_count)
#Training the Naive Bayes Model
#Train the classifier
system.time(classifier <- naiveBayes(train.cc, cl[train], laplace = 1) )
Building the classifier worked fine:
   user  system elapsed
0.45 0.00 0.46
#Use the classifier we built to make predictions on the test set.
system.time(pred <- predict(classifier, newdata=test.cc))
However, prediction failed.
Error in [.default(object$tables[[v]], , nd) : subscript out of bounds
Timing stopped at: 0.2 0 0.2
Consider the following:
# Indices (row numbers) of the training observations.
train <- sample(nrow(Df.dtm), ceiling(nrow(Df.dtm) * .50))
# Everything left over from the previous sample; note that here actual observations
# (the rows that remain inside Df.dtm) are returned rather than indices:
test <- Df.dtm[-train,]
After clearing up what the sample returns (row indices) and how the test set should be sliced (rows or columns need to be established at this point), I would tweak the apply() call accordingly. In short, apply() applies the given function over each column when you pass MARGIN = 2 and over each row when you pass MARGIN = 1 (see ?apply for details). Depending on whether you want your sample over rows or columns, we can tweak this either way.
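To illustrate the MARGIN argument mentioned above, here is a tiny toy matrix (not the DTM from the question):
m <- matrix(c(0, 2, 1, 0, 3, 0), nrow = 2,
            dimnames = list(c("doc1", "doc2"), c("w1", "w2", "w3")))
apply(m, 2, function(x) ifelse(x > 0, "Yes", "No"))  # MARGIN = 2: over columns, same shape back
apply(m, 1, function(x) ifelse(x > 0, "Yes", "No"))  # MARGIN = 1: over rows, result comes back transposed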
I've started looking at the pls package & I am unsure about how to extract separate coefficients by group/factor. I can run separate models per group, or consider the X ~ group interaction term, but that isn't what I'm after.
I'm using the following syntax:
model1 <- plsr(outcome ~ pred * group, data = plsDATA, 2)
I've tried using the following:
model2 <- plsr(outcome ~ embed(pred:as.factor(group)), data=plsDATA,2)
but this results in this error:
Error in model.frame.default(formula = outcome ~ embed(pred:as.factor(group)), :
variable lengths differ (found for 'embed(pred:as.factor(group))')
In addition: Warning messages:
1: In pred:as.factor(group) :
numerical expression has 640 elements: only the first used
2: In pred:as.factor(group) :
numerical expression has 32 elements: only the first used
I'm not sure why I'm getting the variable lengths error since running the following command gives compatible dimensions:
dim(group)
[1] 32 1
dim(outcome)
[1] 32 1
dim(pred)
[1] 32 20
The code is below:
library(pls)
# Dummy data
setwd("/Users/John/Documents")
Data <- read.csv("SamplePLS.csv")
# Define each of the inputs: pred is X, group is the factor, and outcome is Y
pred <- as.matrix(Data[, 3:22])
group <- as.matrix(Data[, 1])
outcome <- as.matrix(Data[, 2])
# Now combine the matrices into a single data frame
plsDATA <- data.frame(SampN = 1:nrow(Data))
plsDATA$pred <- pred
plsDATA$group <- group
plsDATA$outcome <- outcome
# Define the model - ask for two components
model1 <- plsr(outcome ~ pred * group, data = plsDATA, 2)
# Get coefficients from this object
According to your question, you want to extract the coefficients. The coef() function will pull them out easily. See the results below.
Data <- read.csv("SamplePLS.csv") #Define each of the inputs pred is X, group
is the factor & outcome is Y
> pred <- as.matrix(Data[,3:22])
> group <- as.matrix(Data[,1])
> outcome <- as.matrix(Data[,2]) #now combine the matrices into a single dataframe
> plsDATA <- data.frame(SampN=c(1:nrow(Data)))
> plsDATA$pred <- pred
> plsDATA$group <- group
> plsDATA$outcome <-outcome #define the model - ask for two components
> model1 <- plsr(outcome ~ pred * group, data=plsDATA,2)
> coef(model1)
, , 2 comps
outcome
predpred1 -1.058426e-02
predpred2 2.634832e-03
predpred3 3.579453e-03
predpred4 1.135424e-02
predpred5 3.271867e-04
predpred6 4.438445e-03
predpred7 8.425997e-03
predpred8 3.001517e-03
predpred9 2.111697e-03
predpred10 -9.264594e-04
predpred11 1.885554e-03
predpred12 -2.798959e-04
predpred13 -1.390471e-03
predpred14 -1.023795e-03
predpred15 -3.233470e-03
predpred16 5.398053e-03
predpred17 9.796533e-03
predpred18 -8.237801e-04
predpred19 4.778983e-03
predpred20 1.235484e-03
group 9.463735e-05
predpred1:group -8.814101e-03
predpred2:group 9.013430e-03
predpred3:group 7.597494e-03
predpred4:group 1.869234e-02
predpred5:group 1.462835e-03
predpred6:group 6.928687e-03
predpred7:group 1.925111e-02
predpred8:group 3.752095e-03
predpred9:group 2.404539e-03
predpred10:group -1.288023e-03
predpred11:group 4.271393e-03
predpred12:group 6.704938e-04
predpred13:group -3.943964e-04
predpred14:group -5.468510e-04
predpred15:group -5.595737e-03
predpred16:group 1.090501e-02
predpred17:group 1.977715e-02
predpred18:group -3.013597e-04
predpred19:group 1.169534e-02
predpred20:group 3.389127e-03
The same results could also be achieved with the call model1$coefficients or model1[[1]]. Based on the question, I think this is the result you are looking for.
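If the goal is group-specific coefficients, one way to read this output (a sketch only, and it assumes group is coded 0/1, which the question does not state) is that a predictor's coefficient in the second group equals its main effect plus the interaction term:
cf <- drop(coef(model1))                         # named vector of the "2 comps" coefficients
main <- cf[paste0("predpred", 1:20)]             # effects when group == 0
inter <- cf[paste0("predpred", 1:20, ":group")]  # adjustment added when group == 1
cbind(group0 = main, group1 = main + inter)      # per-group coefficients under 0/1 coding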
Actually, I've just figured this out. You need to dummy-code the grouping variable and make it the outcome (i.e. the predicted variable). In this case, I had two columns representing group membership; in each column, membership in the group was indicated by 1 and non-membership by 0. I then defined the first two columns as group (i.e. group <- as.matrix(Data[,1:2])) and ran the rest of the code as before, substituting group for outcome.
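A minimal sketch of what that self-answer describes, assuming the first two columns of SamplePLS.csv are the 0/1 group-membership indicators and the predictors still sit in columns 3:22 (the exact file layout is not shown):
library(pls)
Data <- read.csv("SamplePLS.csv")
group <- as.matrix(Data[, 1:2])   # dummy-coded membership: 1 = in group, 0 = not
pred <- as.matrix(Data[, 3:22])
plsDATA <- data.frame(SampN = 1:nrow(Data))
plsDATA$pred <- pred
plsDATA$group <- group
model3 <- plsr(group ~ pred, data = plsDATA, 2)  # group is now the predicted variable
coef(model3)                                     # one column of coefficients per group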
I am new to R, and currently I am working with participant data on which I want to run ANOVAs for each category.
I have read my data from my CSV using the following line of code:
CSV <- read.csv("data.csv", header=TRUE)
The object CSV is now a data frame with many observations of 12 variables. The last variable indicates the group, whereas the first 11 variables are the categories. I wish to split the data into groups based on the value of the 12th variable and run an ANOVA for each of variables 2 through 11.
How would I separate the data into N groups based on the 12th variable and run an ANOVA for each of variables 2 through 11?
I'm a little confused as to what you mean by an ANOVA for each variable. Below is a loop that runs through each value of your 12th variable, subsets the data, and then runs an ANOVA. You need to change the y ~ x part since I don't know what your dependent/independent variables will be. If you want to run an ANOVA of each variable against all the other variables, you might require another loop, which I tried below.
for (i in unique(CSV[, 12])) {
  data <- subset(CSV, subset = CSV[, 12] == i)
  fit <- aov(y ~ x, data = data)
  print(fit)  # inside a loop the fit is not auto-printed
}
To run it for each variable:
`%ni%` <- Negate(`%in%`)  # setting up 'not in'
for (i in unique(CSV[, 12])) {
  data <- subset(CSV, subset = CSV[, 12] == i)
  for (j in 1:11) {
    # regress variable j on all the other variables, excluding itself and the
    # grouping column (which is constant within each subset)
    fm <- as.formula(paste(names(data)[j], "~",
                           paste(names(data)[names(data) %ni% names(data)[c(j, 12)]],
                                 collapse = "+")))
    fit <- aov(fm, data = data)
    print(fit)  # you may want to store the results rather than just printing them here
  }
}
Read in the file:
CSV <- read.csv("data.csv", header=TRUE)
Set the index of the 12th (grouping) column, plus the upper and lower bounds, for iterating through the categories:
lenCSV <- 12
upper <- 11
lower <- 2
Iterate through the categories and output the summary:
for (j in lower:upper) {
  fm <- as.formula(paste(names(CSV)[j], "~", names(CSV)[lenCSV]))
  fit <- aov(fm, data = CSV)
  print(summary(fit))
}
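If you would rather keep the fits than only print them, here is a small optional sketch (same assumptions about the CSV layout as above):
fits <- lapply(lower:upper, function(j) {
  fm <- as.formula(paste(names(CSV)[j], "~", names(CSV)[lenCSV]))
  aov(fm, data = CSV)
})
names(fits) <- names(CSV)[lower:upper]
lapply(fits, summary)  # one ANOVA table per category variable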