How to extract coefficient outputs from a linear regression with a loop - r

I would like to know how I can run a regression n times, each time with a different set of variables, and extract a data.frame in which each column is one regression and each row represents a variable.
In my case I have a data.frame of:
dt_deals <- data.frame(Premium=c(1,3,4,5),Liquidity=c(0.2,0.3,1.5,0.8),Leverage=c(1,3,0.5,0.7))
But I have another explanatory dummy variable called hubris, which is drawn from a binomial distribution with probability 0.25, like this:
n <- 10
hubris_dataset <- data.frame(replicate(n, rbinom(4,1,0.25)))
What I need is to run n simulations of hubris so that I can fit n regressions, each one with a different random binomial draw, and collect the coefficients of each regression in a data.frame.
So far I could reach this:
# define n as the number of simulations i want
n=10
# define beta as a data.frame to put every coefficient from the lm regression
beta=NULL
for(i in 1:n) {
  dt_deals2 <- dt_deals
  beta[[i]] <- coef(lm(dt_deals$Premium ~ dt_deals$Liquidity + dt_deals$Leverage + hubris_dataset[,i], data=dt_deals2))
  beta <- cbind(reg$coefficients)
}
But this way it only generates the first set of coefficients and doesn't add another ten columns to the data.frame.

@jogo suggested dropping the for-loop in favour of sapply and changing the object beta. This was the result:
beta <- sapply(1:n, function(i) coef(lm(Premium ~ Liquidity +Leverage+ hubris_dataset[,i], data=dt_deals2)))
And it worked.
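Putting the pieces together, here is a minimal end-to-end sketch (assuming the dt_deals data and the n = 10 hubris simulations defined above); each column of beta is one regression and each row one coefficient, and as.data.frame() turns it into the requested data.frame:
# end-to-end sketch of the sapply approach (assumes the objects defined above)
dt_deals <- data.frame(Premium   = c(1, 3, 4, 5),
                       Liquidity = c(0.2, 0.3, 1.5, 0.8),
                       Leverage  = c(1, 3, 0.5, 0.7))
n <- 10
hubris_dataset <- data.frame(replicate(n, rbinom(4, 1, 0.25)))

# one regression per simulated hubris column
beta <- sapply(1:n, function(i) {
  coef(lm(Premium ~ Liquidity + Leverage + hubris_dataset[, i], data = dt_deals))
})
beta_df <- as.data.frame(beta)   # columns = regressions, rows = coefficients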

Related

Finding non-linear correlations in R

I have about 90 variables stored in data[2-90]. I suspect about 4 of them will have a parabola-like correlation with data[1]. I want to identify which ones have the correlation. Is there an easy and quick way to do this?
I have tried building a model like this (which I could do in a loop for each variable i = 2:90):
y <- data$AvgRating
x <- data$Hamming.distance
x2 <- x^2
quadratic.model = lm(y ~ x + x2)
And then look at the R^2/coefficient to get an idea of the correlation. Is there a better way of doing this?
Maybe R could build a regression model with the 90 variables and choose the significant ones itself? Would that be in any way possible? I can do this in JMP for linear regression, but I'm not sure I could do non-linear regression in R for all the variables at once. Therefore I was manually trying to see which ones are correlated in advance. It would be helpful if there were a function for that.
You can use the nlcor package in R. This package finds the nonlinear correlation between two data vectors.
There are different approaches to estimate a nonlinear correlation, such as infotheo. However, nonlinear correlations between two variables can take any shape.
nlcor is robust to most nonlinear shapes. It works pretty well in different scenarios.
At a high level, nlcor works by adaptively segmenting the data into linearly correlated segments. The segment correlations are aggregated to yield the nonlinear correlation. The output is a number between 0 and 1, with values close to 1 meaning high correlation. Unlike a Pearson correlation, negative values are not returned because they have no meaning in nonlinear relationships.
More details about this package are available here.
To install nlcor, follow these steps:
install.packages("devtools")
library(devtools)
install_github("ProcessMiner/nlcor")
library(nlcor)
After you install it:
# Implementation
x <- seq(0,3*pi,length.out=100)
y <- sin(x)
plot(x,y,type="l")
# linear correlation is small
cor(x,y)
# [1] 6.488616e-17
# nonlinear correlation is more representative
nlcor(x,y, plt = T)
# $cor.estimate
# [1] 0.9774
# $adjusted.p.value
# [1] 1.586302e-09
# $cor.plot
As shown in the example, the linear correlation was close to zero even though there was a clear relationship between the variables, which nlcor could detect.
Note: The order of x and y inside the nlcor is important. nlcor(x,y) is different from nlcor(y,x). The x and y here represent 'independent' and 'dependent' variables, respectively.
Fitting a generalized additive model will help you identify curvature in the relationships between the explanatory variables. Read the example on page 22 here.
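As a concrete illustration (not part of the original answer; the data and variable names below are made up), a small mgcv sketch shows how smooth terms flag curvature: a smooth whose estimated degrees of freedom (edf) are well above 1 indicates a nonlinear relationship.
# hedged sketch with mgcv; dat, x1..x3 and y are hypothetical placeholders
library(mgcv)

set.seed(1)
dat <- data.frame(x1 = runif(200), x2 = runif(200), x3 = runif(200))
dat$y <- (dat$x1 - 0.5)^2 + rnorm(200, sd = 0.05)   # x1 enters quadratically

fit <- gam(y ~ s(x1) + s(x2) + s(x3), data = dat)
summary(fit)          # smooths with edf clearly above 1 suggest curvature
plot(fit, pages = 1)  # partial effect plots show the estimated shapes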
Another option would be to compute the mutual information score between each pair of variables. For example, using the mutinformation function from the infotheo package, you could do:
set.seed(1)
library(infotheo)
# correlated vars (x & y correlated, z is noise)
x <- seq(-10,10, by=0.5)
y <- x^2
z <- rnorm(length(x))
# list of vectors
raw_dat <- list(x, y, z)
# convert to a dataframe and discretize for mutual information
dat <- matrix(unlist(raw_dat), ncol=length(raw_dat))
dat <- discretize(dat)
mutinformation(dat)
Result:
| | V1| V2| V3|
|:--|---------:|---------:|---------:|
|V1 | 1.0980124| 0.4809822| 0.0553146|
|V2 | 0.4809822| 1.0943907| 0.0413265|
|V3 | 0.0553146| 0.0413265| 1.0980124|
By default, mutinformation() computes the discrete empirical mutual information score between two or more variables. The discretize() function is necessary if you are working with continuous data: it transforms the data to discrete values.
This might be helpful, at least as a first stab, for looking for nonlinear relationships between variables, such as the one described above.

Pooled likelihood in R, how to code it?

I don't really know how to construct the following likelihood in R.
I have panel data for several individuals. Each individual is observed over time and I have several observations for each individual. I assume that the observations for each individual are independent and follow a normal distribution with an individual-specific mean and a variance common to all the individuals. In other words, I want to use all the data for estimating the variance, while using individual specific data to estimate the individual means.
Formally, let $i = 1, 2, \ldots, N$ index the individuals and $j = 1, 2, \ldots, n_i$ the observations of individual $i$.
The likelihood function for individual $i$ is
$$
L_i(\mu_i,\sigma)=\prod_{j=1}^{n_i}\frac{1}{\sigma\sqrt{2\pi}}\exp(-\frac{(x_{ij}-\mu_i)^2}{2\sigma^2})
$$
and the full likelihood is then
$$
L(\mu_1,\ldots,\mu_N,\sigma)=\prod_{i=1}^N L_i(\mu_i,\sigma)=\prod_{i=1}^N \prod_{j=1}^{n_i}\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x_{ij}-\mu_i)^2}{2\sigma^2}\right)
$$
What's a smart way to code the log of the full likelihood in R in order to pass it to some optimizer?
Example data:
set.seed(2)
A <- rep(1,15)
B <- rep(2,11)
ID <- c(A,B)
dA <- rnorm(15,2,3)
dB <- rnorm(11,1,3)
X <- c(dA,dB)
DATA <- data.frame(ID,X)
DATA
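One possible way to code it (a sketch, not taken from an answer; it assumes the two-individual example data above and parameterises sigma on the log scale so the optimiser stays in the valid range) is to write the negative log-likelihood and hand it to optim():
# negative pooled log-likelihood: one mean per individual, common sigma
negloglik <- function(par, data) {
  mu    <- par[1:2]        # individual means (IDs 1 and 2 in the example data)
  sigma <- exp(par[3])     # log-parameterised so sigma > 0
  -sum(dnorm(data$X, mean = mu[data$ID], sd = sigma, log = TRUE))
}

fit <- optim(c(0, 0, 0), negloglik, data = DATA, method = "BFGS")
fit$par[1:2]     # estimated individual means
exp(fit$par[3])  # estimated common sigma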

how to create many linear models at once and put the coefficients into a new matrix?

I have 365 columns. In each column I have 60 values. I need to know the rate of change over time for each column (slope or linear coefficient). I created a generic column as a series of numbers from 1:60 to represent the 60 corresponding time intervals. I want to create 365 linear regression models using the generic time-stamp column with each of the 365 columns of data.
In other words, I have many columns and I would like to create many linear regression models at once, extract the coefficients and put those coefficients into a new matrix.
First of all, statistically this might not be the best possible approach to analyse temporal data. However, regarding the approach you propose, it is very simple to build a loop to obtain this:
# Assuming your generic 1:60 column is not in the same object
Coefs <- matrix(NA, ncol(Data), 2)
for(i in 1:ncol(Data)){
  Coefs[i, ] <- lm(Data[, i] ~ GenericColumn)$coefficients
}
Here's a way to do it:
# Fake data
dat = data.frame(x=1:60, y1=rnorm(60), y2=rnorm(60),
                 y3=rnorm(60))
t(sapply(names(dat)[-1], function(var) {
  coef(lm(dat[, var] ~ x, data=dat))
}))
(Intercept) x
y1 0.10858554 -0.004235449
y2 -0.02766542 0.005364577
y3 0.20283168 -0.008160786
Now, where's that turpentine soap?

Compare model fits of two GAM

I have a matrix Expr with rows representing variables and columns samples.
I have a categorical vector called groups (containing either "A","B", or "C")
I want to test which of the variables in Expr can be explained by the fact that a sample belongs to a group.
My strategy would be modelling the problem with a generalized additive model (with a negative binomial distribution).
Then I want to use a likelihood ratio test variable-wise to get a p-value for each variable.
I do:
require(VGAM)
m <- vgam(Expr ~ group, family=negbinomial)
m_alternative <- vgam(Expr ~ 1, family=negbinomial)
and then:
lr <- lrtest(m, m_alternative)
The last step is wrong because it tests the overall likelihood ratio of the two models, not the variable-wise ratios.
Instead of a single p-value I would like to get a vector of p-values, one for every variable.
How should I do it?
(I am very new to R, so forgive me my stupidity)
It sounds like you want to use Expr as your predictors. I think you may have your formula backwards. The response should be on the left, so I guess that's groups in your case.
If Expr is a data.frame, you can do regression on all variables with
m <- vgam(group ~ ., Expr, family=negbinomial)
If class(Expr)=="matrix", then
m <- vgam(group ~ Expr, family=negbinomial)
probably should work, but you may just get slightly odd looking coefficient labels.
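If you do want a p-value per variable with expression as the response, as in your original formulation, one possible sketch (not from the answer above, and using MASS::glm.nb rather than VGAM) is to loop over the rows of Expr and run a likelihood ratio test per variable:
# hedged sketch: per-variable negative-binomial fits and LR tests
# assumes Expr is a numeric count matrix (variables in rows, samples in columns)
# and groups is a vector of length ncol(Expr)
library(MASS)

pvals <- apply(Expr, 1, function(expr_i) {
  m_full <- glm.nb(expr_i ~ groups)   # model with the group effect
  m_null <- glm.nb(expr_i ~ 1)        # intercept-only model
  stat <- 2 * (logLik(m_full) - logLik(m_null))
  pchisq(as.numeric(stat), df = nlevels(factor(groups)) - 1, lower.tail = FALSE)
})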

How to do statistics and save results in a loop in R

In modeling it is helpful to do univariate regressions of a dependent variable on an independent variable in linear, quadratic, cubic and quartic forms to see which captures the basic shape of the data. I'm a fairly new R programmer and need some help.
Here's pseudocode:
for i in 1:ncol(data)
  data[, ncol(data) + i]   <- data[, i]^2   # create squared term
  data[, 2*ncol(data) + i] <- data[, i]^3   # create cubed term
  # ...and similarly for the fourth-power term
  # now do four regressions, starting with linear and adding one higher-order
  # term each time, and display for each i the form of regression with the
  # highest adjusted R^2
  lm(y ~ data[, i], ...)
  # retrieve R^2 and save it for the linear case in row i of a vector
  lm(y ~ data[, i] + data[, ncol(data) + i], ...)
  # retrieve R^2 and save...
The result is a data frame indexed by i, containing the column name of the original x variable in data and the results for each of the four regressions (all run with an intercept term).
Ordinarily we do this by looking at plots but where you have 800 variables that is not feasible.
If you really want to help out, write code to automatically insert the required number of exponentiated variables into data (a rough sketch of this is given below).
And this doesn't even take care of the kinky variables that come clumped up in a couple of clusters or only relevant at one value, etc.
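As a rough sketch of the column-generation part (assumed names; data here is taken to hold only the predictor columns, with the response kept separately in y), one could append squared, cubed and fourth-power versions of every column like this:
# hedged sketch: append x^2, x^3 and x^4 columns for every predictor in data
powers <- 2:4
orig_names <- names(data)
for (p in powers) {
  new_cols <- data[orig_names]^p
  names(new_cols) <- paste0(orig_names, "_pow", p)
  data <- cbind(data, new_cols)
}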
I'd say the best way to do this is by using the polynomial function in R, poly(). Imagine you have an independent numeric variable, x, and a numeric response variable, y.
models <- list()
for (i in 1:4) {
  models[[i]] <- lm(y ~ poly(x, i, raw = TRUE))
}
The raw=TRUE part ensures that the model uses the raw polynomials, and not the orthogonal polynomials.
When you want to get one of the models, just type in models[[1]] or models[[2]], etc.
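To pick the best-fitting degree, as the question describes, a small follow-up sketch (not part of the original answer) compares the adjusted R-squared of the four fits:
# compare the four polynomial fits by adjusted R-squared
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)
adj_r2             # adjusted R^2 for degrees 1 to 4
which.max(adj_r2)  # degree with the highest adjusted R^2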
