Julia: iterating over columns in a DataFrame to fit and plot a linear regression for each

I have a data frame with 10 feature columns, COL0 to COL9, and a column RESP. How do I fit a linear regression model for each pair COL0 to COL9 ~ RESP?
I expect to get 10 plots showing the model, plus a table with the coefficients of the model for each column.
What I tried so far:
model2 = fit(LinearModel, @formula(RESP ~ EXPL_0 + EXPL_1 + EXPL_2 +
    EXPL_3 + EXPL_4 + EXPL_5 + EXPL_6 + EXPL_7 + EXPL_8 + EXPL_9), df)
And I get what I want.
I still need to know how to plot all these graphs,
and if I had COL0 to COL1000, how can I avoid typing out all the columns from 0 to 1000?
I am new to Julia and I really don't have a clue how to get this done. Any help?
Thanks

As Bogumil says, it's not ideal to ask many questions in one post on StackOverflow - your question should be well defined and targeted, ideally with a minimum working example to make it easiest for people to help you.
Let's therefore answer what, from the title of the post, I take to be your main question: how can I fit a linear regression model with GLM that includes many predictor columns? That question is almost a duplicate of this question, so a very similar answer applies: broadcast the term function over the names of the DataFrame columns you want to include on the right-hand side, like this:
julia> using DataFrames, GLM
julia> df = hcat(DataFrame(RESP = rand(100)), DataFrame(rand(100, 10), :auto));
julia> mymodel = lm(term(:RESP) ~ sum(term.(names(df[!, Not(:RESP)]))), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}
RESP ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10
(...)
You also ask about plotting, but that's probably best dealt with in a separate question. You can access the estimated coefficients of your linear model using the coef function, e.g.
julia> coef(mymodel)
11-element Vector{Float64}:
0.504236533528822
0.11712812154185266
-0.0206810430546413
-0.15693089456050294
-0.011916514466331067
-0.1030171434361648
0.10378957999147352
-0.09447743618381275
-0.08860977078650123
0.0816071818033377
0.09939548661830626
and the full output with coeftable.
Finally, note that you won't necessarily see from that which columns "impact" your model most, as you say in the comment, unless you have standardized your regressors.

Related

lavaan WARNING: some observed variances are (at least) a factor 1000 times larger than others; use varTable(fit) to investigate

I am trying to evaluate an SEM model on a dataset. Some of the data are on a Likert scale, i.e. from 1 to 5, and some of the data are counts generated from the computer log of certain activities.
While performing the fit, lavaan gives me the warning:
lavaan WARNING: some observed variances are (at least) a factor 1000 times larger than others; use varTable(fit) to investigate
To mitigate this warning I want to scale some of the variables, but I couldn't figure out how to do that.
Log_And_SurveyResult <- read_excel("C:/Users/Aakash/Desktop/analysis/Log-And-SurveyResult.xlsx")
model <- '
Reward =~ REW1 + REW2 + REW3 + REW4
ECA =~ ECA1 + ECA2 + ECA3
Feedback =~ FED1 + FED2 + FED3 + FED4
Motivation =~ Reward + ECA + Feedback
Satisfaction =~ a*MaxTimeSpentInAWeek + a*TotalTimeSpent + a*TotalLearningActivityView
Motivation ~ Satisfaction'
fit <- sem(model,data = Log_And_SurveyResult)
summary(fit, standardized=T, std.lv = T)
fitMeasures(fit, c("cfi", "rmsea", "srmr"))
I want to scale some of the variables like MaxTimeSpentInAWeek and TotalTimeSpent
Could you please help me figure out how to scale the variables? Thank you very much.
As Elias pointed out, the difference in magnitude between the variables is huge, so it is advisable to scale them.
The warning gives a hint and inspecting varTable(fit) returns summary information about the variables in a fitted lavaan object.
Rather than running scale() separately on each column, you could use apply() on a subset or on your whole data.frame:
## Scale the variables in the 4th and 7th column
Log_And_SurveyResult[, c(4, 7)] <- apply(Log_And_SurveyResult[, c(4, 7)], 2, scale)
## Scale the whole data.frame
Log_And_SurveyResult <- apply(Log_And_SurveyResult, 2, scale)
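One caveat worth noting: apply() on a whole data.frame returns a matrix, so if downstream code expects a data.frame you may prefer lapply(), which preserves the class. A minimal base-R sketch with invented data (the column names mirror the question; the values are made up):

```r
## Hypothetical stand-in for Log_And_SurveyResult -- values are invented
d <- data.frame(MaxTimeSpentInAWeek = c(1200, 3400, 560, 7800),
                TotalTimeSpent      = c(15000, 42000, 9000, 61000))

## scale() returns a one-column matrix; as.numeric() flattens it
## so each column stays a plain numeric vector and d stays a data.frame
d[] <- lapply(d, function(x) as.numeric(scale(x)))

colMeans(d)  # effectively zero after centering
```

After this, each column has mean 0 and standard deviation 1, and d is still a data.frame rather than a matrix.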
You can just use scale(MaxTimeSpentInAWeek), which scales your variable to mean = 0 and variance = 1. E.g.:
Log_And_SurveyResult$MaxTimeSpentInAWeek <-
scale(Log_And_SurveyResult$MaxTimeSpentInAWeek)
Log_And_SurveyResult$TotalTimeSpent <-
scale(Log_And_SurveyResult$TotalTimeSpent)
Or did I misunderstand your question?

Combining the grep() family of functions with a conditional if statement

I'm using LASSO as a variable-selection method for my analysis, but there's one particular variable that I wish to ensure is contained in the final formula. I have automated the entire process to return the variables that LASSO selects and paste them into a character-string formula, e.g. formula = y~x1+x2+x3+... However, there is one variable in particular that I would like to keep in the formula even if LASSO does not select it. Now, I could easily add this variable to the formula manually after the fact, but in the interest of improving my R skills I'm trying to automate the entire process.
My idea so far was to nest the grep() function inside an ifelse() statement, e.g. ifelse(grep("variable I'm concerned with", formula)!=1, formula=formula, formula=paste0(formula, "variable I'm concerned with", collapse="+")), but this has not done the trick.
Am I on the right track, or can anyone think of alternative routes to take?
According to documentation
penalty.factor - Separate penalty factors can be applied to each
coefficient. This is a number that multiplies lambda to allow
differential shrinkage. Can be 0 for some variables, which implies no
shrinkage, and that variable is always included in the model. Default
is 1 for all variables (and implicitly infinity for variables listed
in exclude). Note: the penalty factors are internally rescaled to sum
to nvars, and the lambda sequence will reflect this change.
So apply this as an argument to glmnet using a penalty factor of 0 for your "key coefficient" and 1 elsewhere.
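A minimal sketch of that suggestion; the data and the column name "keep_me" are invented for illustration, and glmnet is assumed to be installed:

```r
library(glmnet)

set.seed(1)
## invented predictor matrix; "keep_me" is the variable to force into the model
x <- matrix(rnorm(100 * 5), 100, 5,
            dimnames = list(NULL, c("keep_me", paste0("x", 2:5))))
y <- rnorm(100)

## penalty factor 0 = no shrinkage, so "keep_me" is never dropped;
## all other variables keep the default factor of 1
pf <- ifelse(colnames(x) == "keep_me", 0, 1)

fit <- glmnet(x, y, penalty.factor = pf)
```

Along the whole lambda path, the unpenalized variable stays in the model while the penalized ones enter and leave as usual.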
A formula is not a character object, but you might want to explore terms.formula if your goal is to edit formulas directly based on character output. terms objects are a really powerful way of doing variable subsetting and selection. But you really need to explore them, because the formula language was not really meant to be automated easily; rather, it was meant to be a convenient and readable way to specify model fits (look at how difficult SAS is by comparison).
f <- y ~ x1 + x2
t <- terms(f)
## drop 'x2'
i.x2 <- match('x2', attr(t, 'term.labels'))
t <- t[, -i.x2]  ## drop the variable
## t is still a "terms" object, but `lm` and related functions have
## implicit methods for interpreting it as a "formula" object
lm(t)
Currently, you are attempting to adjust formula, a character value, as if it were a formula object, which will not work given the different types. Instead, consider stats::update, which will not add any term that is already included:
lasso_formula <- as.formula("y ~ x1 + x2 + x3")
# EXISTING TERM
lasso_formula <- update(lasso_formula, ~ . + x3)
lasso_formula
# y ~ x1 + x2 + x3
# NEEDED VARIABLE
lasso_formula <- update(lasso_formula, ~ . + myTerm)
lasso_formula
# y ~ x1 + x2 + x3 + myTerm
Should formula be a character string, be sure to use grepl (not grep) in ifelse. And do not assign with = inside ifelse, since ifelse is itself a function returning a value and is not to be confused with if...else:
lasso_formula <- "y ~ x1 + x2 + x3"
lasso_formula <- ifelse(grepl("myterm", lasso_formula),
lasso_formula,
paste(lasso_formula, "+ myterm"))
lasso_formula
# [1] "y ~ x1 + x2 + x3 + myterm"

How to use dplyr to make several simple regressions using always the same independent variable but changing the dependent one?

I hope this is not the simplest question. I need to run a simple regression (yes, a simple one: Y = a + bX + epsilon). My data frame is such that each column holds one variable, and each column has 20 rows (observations). The problem is that the first 10 columns are Y1 to Y10 and the last one is the only independent variable.
So, I have to run 10 regressions, changing only the Yi (i = 1,...10). For example:
Y1 = a + bX + epsilon
Y2 = a + bX + epsilon
...
Y10 = a + bX + epsilon
(Yi and X are all vectors (20 x 1); it's really a simple exercise)
I can do it one by one, but I was hoping to do them all in one command. I am not a veteran programmer, and I was wondering whether dplyr could help me with this.
I am really looking for suggestions.
Thank you.
You can try
lapply(d1[paste0('Y', 1:10)], function(y) lm(y ~ d1[, 'X']))
where d1 is the dataset.
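As a self-contained illustration of that pattern (the data here is simulated just for the example, and reformulate is used so each fitted model carries a readable formula such as Y1 ~ X rather than y ~ d1[, 'X']):

```r
set.seed(42)
## simulated stand-in for the question's data: Y1..Y10 plus X, 20 rows each
d1 <- as.data.frame(matrix(rnorm(20 * 11), 20, 11))
names(d1) <- c(paste0("Y", 1:10), "X")

## fit one simple regression per response column
models <- lapply(paste0("Y", 1:10),
                 function(y) lm(reformulate("X", response = y), data = d1))

## collect intercept and slope for each of the 10 regressions
coefs <- t(sapply(models, coef))
rownames(coefs) <- paste0("Y", 1:10)
```

coefs is then a 10 x 2 table of intercepts and slopes, one row per response.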

How to separate specific list items with "+" and add to formula?

I am trying to generate a formula using dataframe column names of the following format:
d ~ x1 + x2 + x3 + x4
From the following sample dataset:
a = c(1,2,3)
b = c(2,4,6)
c = c(1,3,5)
d = c(9,8,7)
x1 = c(1,2,3)
x2 = c(2,4,6)
x3 = c(1,3,5)
x4 = c(9,8,7)
df = data.frame(a,b,c,d,x1,x2,x3,x4)
As for what I have tried already:
I know that I can subset only the columns I need using the following approach
predictors = names(df[5:8])
response = names(df[4])
However, my efforts to include these in a formula have failed.
How can I assemble the predictors and the response variable into the following format:
d ~ x1 + x2 + x3 + x4
I ultimately want to input this formula into a randomForest function.
We can avoid the entire problem by using the default method of randomForest (rather than the formula method):
randomForest(df[5:8], df[[4]])
or in terms of predictors and response defined in the question:
randomForest(df[predictors], df[[response]])
As mentioned in the Note section of the randomForest help file the default method used here has the additional advantage of better performance than the formula method.
How about:
reformulate(predictors, response = response)
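For completeness, a runnable sketch plugging in the predictors and response defined in the question:

```r
## the question's sample data
a <- c(1, 2, 3); b <- c(2, 4, 6); c <- c(1, 3, 5); d <- c(9, 8, 7)
x1 <- c(1, 2, 3); x2 <- c(2, 4, 6); x3 <- c(1, 3, 5); x4 <- c(9, 8, 7)
df <- data.frame(a, b, c, d, x1, x2, x3, x4)

predictors <- names(df[5:8])
response <- names(df[4])

## build the formula from the character vectors
f <- reformulate(predictors, response = response)
f
# d ~ x1 + x2 + x3 + x4
```

The resulting formula can be passed straight to the formula method, e.g. randomForest(f, data = df), though as the other answer notes, the default method avoids the formula interface altogether.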

Multiple formulae with shared parameters in R

We're trying to come up with a way for an R function to handle a model which has multiple responses, multiple explanatory variables, and possibly shared parameters between the responses. For example:
Y1 ~ X1 + X2 + X3
Y2 ~ X3 + X4
specifies two responses and four explanatory variables. X3 appears in both, and we want the user to control whether the associated parameter value is the same or different, i.e.:
Y1 = b1 X1 + b2 X2 + b3 X3
Y2 = b3 X3 + b4 X4
which is a model with four 'b' parameters, or
Y1 = b1 X1 + b2 X2 + b3 X3
Y2 = b4 X3 + b5 X4
a model with five parameters.
Two possibilities:
Specify all the explanatory variables in one formula and supply a matrix mapping responses to explanatories. In which case
Foo( Y1+Y2 ~ X1 + X2 + X3 + X4 + X5, map=cbind(c(1,1,1,0),c(0,0,1,1)))
would correspond to the first case, and
Foo( Y1+Y2 ~ X1 + X2 + X3 + X4 + X5, map=cbind(c(1,1,1,0,0),c(0,0,0,1,1)))
would be the second. Obviously some parsing of the LHS would be needed, or it could be cbind(Y1,Y2). The advantage of this notation is that there is also other information that might be required for each parameter - starting values, priors etc - and the ordering is given by the ordering in the formula.
Have multiple formulae and a grouping function that just adds an attribute so shared parameters can be identified - the two examples then become:
Foo( Y1 ~ X1+X2+G(X3,1), Y2 ~ G(X3,1)+X4)
where the X3 parameter is shared between the formula, and
Foo( Y1 ~ X1+X2+X3, Y2 ~ X3+X4)
which has independent parameters. The second parameter of G() is a grouping ID which gives the power to share model parameters flexibly.
A further explanation of the G function is shown by the following:
Foo( Y1 ~ X1+X2+G(X3,1), Y2~G(X3,1)+G(X4,2), Y3~G(X3,3)+G(X4,2), Y4~G(X3,3))
would be a model where:
Y1 = b1 X1 + b2 X2 + b3 X3
Y2 = b3 X3 + b4 X4
Y3 = b5 X3 + b4 X4
Y4 = b5 X3
where there are two independent parameters for X3 (G(X3,1) and G(X3,3)). How to handle a group that refers to a different explanatory variable is an open question - suppose that model had Y4~G(X3,2) - that seems to imply a shared parameter between different explanatory variables, since there's a G(X4,2) in there.
This notation seems easier for the user to comprehend, but if you also have to specify starting values then the mapping between a vector of starting values and the parameters they correspond to is no longer obvious. I suspect that internally we'd have to compute the mapping matrix from the G() notation.
There may be better ways of doing this, so my question is... does anyone know one?
Interesting question (I wish all package authors worried a lot more in advance about how they were going to create extensions to the basic Wilkinson-Rogers formula notation ...)
How about something like
formula=list(Y1~X1+X2+X3,Y2~X3+X4,Y3~X3+X4,Y4~X3),
shared=list(Y1+Y2~X3,Y2+Y3~X4,Y3+Y4~X3)
or something like that for your second example above?
The formula component gives the list of equations.
The shared component simply lists which response variables share the same parameter for specified predictor variables. It could obviously be mapped into a logical or binary table, but (for me at least -- this is certainly in the eye of the beholder) it's more straightforward. I think the map solution above is awkward when (as in this case) a variable (such as X3) is shared in two distinct sets of relationships.
I guess some straightforward rule like "starting values in the order in which the parameters appear in the list of formulas" -- in this case
X1, X2, X3(1), X4, X3(2)
would be OK, but it might be nice to provide a helper function that would tell the users the names of the coefficient vector (i.e. the order) given a formula/shared specification ...
From a bit of personal experience, I would say that embedding more fanciness in the formula itself leads to pain ... for example, the original nlme syntax with the random effects specified separately was easier to deal with than the new lme4-style syntax with random effects and fixed effects mixed in the same formula ...
An alternative (which I don't like nearly as well) would be
formula=list(Y1~X1+X2+X3,Y2~X3+X4,Y3~X3[2]+X4,Y4~X3[2])
where new parameters are indicated by some sort of tag (with [1] being implicit).
Also note suggestion from the chat room by #Andrie that interfaces designed for structural equation modeling (sem, lavaan packages) may be useful references.
Of the two methods you propose, the second one with the idea of several formulae looks more natural, but the G notation makes no sense to me.
The first one is much easier to understand, but I have two suggested tweaks to the map argument.
It should really take logical values rather than numbers.
Consider having a default of including all the independent variables for each response variable.
