In modeling it is helpful to run univariate regressions of a dependent variable on an independent variable in linear, quadratic, cubic and quartic forms to see which captures the basic shape of the data. I'm a fairly new R programmer and need some help.
Here's pseudocode:
p <- ncol(data) # number of original x variables, fixed before columns are added
for (i in 1:p) {
  data[, p + i] <- data[, i]^2 # create squared term
  data[, 2 * p + i] <- data[, i]^3 # create cubed term
  # ...and similarly for the fourth power term
  # now do four regressions starting with linear and including one higher order term each time and display for each i the form of regression that has the highest adj R2.
  lm(y ~ data[, i], ...)
  # retrieve R2 and save indexed for linear case in vector in row i
  lm(y ~ data[, i] + data[, p + i], ...)
  # retrieve R2 and save...
}
Result is a dataframe indexed by i with column name in data of original x variable and results for each of the four regressions (all run with an intercept term).
Ordinarily we do this by looking at plots but where you have 800 variables that is not feasible.
If you really want to help out write code to automatically insert the required number of exponentiated variables into data.
And this doesn't even take care of the kinky variables that come clumped up in a couple of clusters or only relevant at one value, etc.
I'd say the best way to do this is by using the polynomial function in R, poly(). Imagine you have an independent numeric variable, x, and a numeric response variable, y.
models <- list()
for (i in 1:4)
  models[[i]] <- lm(y ~ poly(x, i, raw = TRUE)) # raw = TRUE is an argument of poly(), not lm()
The raw=TRUE part ensures that the model uses the raw polynomials, and not the orthogonal polynomials.
When you want to get one of the models, just type in models[[1]] or models[[2]], etc.
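As a minimal sketch of how the per-variable comparison from the question could be automated, the following loops over the columns of a data frame, fits polynomial degrees 1 to 4 with poly(), and records the adjusted R2 of each fit. The names best_poly_degree, X and max_degree are hypothetical, and the data are simulated purely for illustration.

```r
# Hypothetical helper: for each column of X, fit y ~ poly(x, d, raw = TRUE)
# for d = 1..max_degree and report the adjusted R^2 of each fit.
best_poly_degree <- function(y, X, max_degree = 4) {
  res <- t(sapply(X, function(x) {
    sapply(seq_len(max_degree), function(d) {
      summary(lm(y ~ poly(x, d, raw = TRUE)))$adj.r.squared
    })
  }))
  colnames(res) <- paste0("adj_r2_deg", seq_len(max_degree))
  data.frame(variable = names(X), res,
             best_degree = apply(res, 1, which.max),
             row.names = NULL)
}

# Simulated example: y is linear in a and quadratic in b
set.seed(42)
X <- data.frame(a = rnorm(200), b = rnorm(200))
y <- 1 + 2 * X$a + 3 * X$b^2 + rnorm(200, sd = 0.5)
result <- best_poly_degree(y, X)
print(result)
```

Adjusted R2 can still favor an unnecessarily high degree by chance, so the best_degree column is best read as a screening aid rather than a final choice.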
I'm new to applying splines to longitudinal data, so here comes my question:
I have some longitudinal data on growing mice at 3 timepoints: x, y and z months. It's known from the existing literature that growth trajectories in this type of data are usually better modeled with non-linear terms.
However, since I have only 3 timepoints, I wonder whether this allows me to apply a natural quadratic spline to the age variable in my lmer model?
Edit: I mean, is
lmer<-mincLmer(File ~ ns(Age,2) * Genotype + Sex + (1|Subj_ID),data, mask=mask)
a legitimate way to go about it?
I'm sorry if this is a stupid question - I'm just a lonely PhD student without supervision, and I would be super-grateful for any advice!!!
Marina
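One way to see what ns(Age, 2) does with only three timepoints: together with the intercept it uses three coefficients for the age trend, which is as many as there are distinct ages. The sketch below uses simulated data and a plain lm() rather than mincLmer (both are assumptions made for illustration), and compares the spline fit with a model that simply estimates one mean per timepoint.

```r
library(splines)

set.seed(1)
# Simulated stand-in data: 3 timepoints, 10 animals each (assumed values)
age <- rep(c(2, 4, 6), each = 10)
weight <- c(15, 24, 28)[match(age, c(2, 4, 6))] + rnorm(30)

fit_spline <- lm(weight ~ ns(age, 2))   # natural spline with 2 df
fit_factor <- lm(weight ~ factor(age))  # one mean per timepoint

# Largest difference between the fitted values of the two models
max_diff <- max(abs(fitted(fit_spline) - fitted(fit_factor)))
print(max_diff)
```

If the difference is numerically zero, the spline is not smoothing anything here: it reproduces the three timepoint means, just as treating age as a factor would.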
With the nls() function you can fit your data to whatever non-linear function you want. From the biological point of view, your data are probably described by a Gompertz-like (sigmoidal) function, but as you have only three time points, you can probably simplify this kind of function to an exponential one. Try the following:
fit_formula <- dependent_variable ~ a * exp(b * independent_variable)
result <- nls(formula = fit_formula, data = your_Dataset)
It will probably give you an error the first few times, something like "singular gradient matrix at initial parameter estimates"; if this happens, try adding the start argument, where you provide starting values for a and b that are closer to the true values. Remember that the column names in your dataset must match the variable names in the formula.
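A minimal sketch of supplying start values, assuming simulated data with hypothetical column names age and weight. The starting values come from a straight-line fit on the log scale, which is one common heuristic for an exponential model and not the only option.

```r
set.seed(1)
# Simulated growth data (assumed for illustration only)
growth <- data.frame(age = 1:10)
growth$weight <- 5 * exp(0.3 * growth$age) * exp(rnorm(10, sd = 0.05))

# Rough starting values: log(weight) = log(a) + b * age
log_fit <- lm(log(weight) ~ age, data = growth)
start_vals <- list(a = unname(exp(coef(log_fit)[1])),
                   b = unname(coef(log_fit)[2]))

# Non-linear least squares fit of weight = a * exp(b * age)
fit <- nls(weight ~ a * exp(b * age), data = growth, start = start_vals)
print(coef(fit))
```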
The toy model below stands in for one with a bunch more variables, transforms, lags, etc. Assume I got that stuff right.
My data is ordered in time, but is not formatted as an R time series, because I need to exclude certain periods, etc. I'd rather not make it a time series for this reason, because I think it would be easy to muck up, but if I need to, or if it greatly simplifies the estimation process, I'd like to just use an integer sequence, such as index. below, to represent time, if that is allowed.
My problem is a simple one (I hope). I would like to use the first part of my data to estimate the coefficients of the model. Then I want to use those estimates, and not estimates from a sliding window, to do one-ahead forecasts for each of the remaining values of that data. The idea is that the formula is applied with a sliding window even though it is not estimated with one. Obviously I could retype the model with coefficients included and then get what I want in multiple ways, with base R sapply, with tidyverse dplyr::mutate or purrr::map_dbl, etc. But I am morally certain there is some standard way of pulling the formula out of the lm object and then wielding it as one desires, that I just haven't been able to find. Example:
library(tidyverse) # for tibble() and dplyr's lag()
set.seed(1)
x1 <- 1:20
y1 <- 2 + x1 + lag(x1) + rnorm(20)
index. <- x1
data. <- tibble(index., x1, y1)
mod_eq <- y1 ~ x1 + lag(x1)
lm_obj <- lm(mod_eq, data.[1:15,])
and I want something along the lines of:
my_forecast_values <- apply_eq_to_data(eq = get_estimated_equation(lm_obj), my_data = data.[16:20, ])
and the lag shouldn't give me an error.
Also, this is not part of my question per se, but I could use a pointer to a nice tutorial on using R formulas and the standard estimation output objects produced by lm, glm, nls and the like. Not the statistics, just the programming.
For what it is worth, the common way to use the coefficients is to call predict(), coefficients(), or summary() on the model object. See the ?predict.lm documentation for details.
A simple example:
data.$lagx <- dplyr::lag(data.$x1, 1) # create the lag variable on the full data
lm_obj1 <- lm(y1 ~ x1 + lagx, data = data.[2:15, ]) # fit on the first part only (row 1 has no lag)
pred1 <- predict(lm_obj1, newdata = data.[16:20, ]) # predict rows 16 to 20; newdata needs the same column names
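A self-contained sketch of the same idea in base R: estimate on rows 1 to 15, then apply the fixed coefficients to rows 16 to 20, with the lag taken from the full series so that row 16 uses the value from row 15. Note one deliberate change from the toy example: x1 is drawn at random here, because with x1 = 1:20 the lagged column equals x1 - 1 and is perfectly collinear with x1. The names dat, newdat, forecast and manual are hypothetical.

```r
set.seed(1)
n <- 20
x1 <- rnorm(n)
lagx <- c(NA, head(x1, -1))   # lag computed on the full series
y1 <- 2 + x1 + lagx + rnorm(n)
dat <- data.frame(index = seq_len(n), x1, lagx, y1)

# Estimate once on the first 15 rows (row 1 is dropped because its lag is NA)
fit <- lm(y1 ~ x1 + lagx, data = dat[1:15, ])

# Apply the fixed coefficients to the remaining rows
newdat <- dat[16:20, ]
forecast <- predict(fit, newdata = newdat)

# The same numbers computed by hand from the stored coefficients
manual <- drop(cbind(1, newdat$x1, newdat$lagx) %*% coef(fit))
print(cbind(forecast, manual))
```

Building the lag as an ordinary column before subsetting avoids the problem of lag() being re-evaluated on a five-row newdata, where the first row would otherwise lose its lagged value.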
I am trying to perform a linear regression on experimental data consisting of replicate measures of the same condition (for several conditions) to check for the reliability of the experimental data. For each condition I have ~5k-10k observations stored in a data frame df:
[1] cond1 repA cond1 repB cond2 repA cond2 repB ...
[2] 4.158660e+06 4454400.703 ...
[3] 1.458585e+06 4454400.703 ...
[4] NA 887776.392 ...
...
[5024] 9571785.382 9.679092e+06 ...
I use the following code to plot scatterplot + lm + R^2 values (stored in rdata) for the different conditions:
for (i in seq(1,13,2)){
vec <- matrix(0, nrow = nrow(df), ncol = 2)
vec[,1] <- df[,i]
vec[,2] <- df[,i+1]
vec <- na.exclude(vec)
plot(log10(vec[,1]),log10(vec[,2]), xlab = 'rep A', ylab = 'rep B' ,col="#00000033")
abline(fit<-lm(log10(vec[,2])~log10(vec[,1])), col='red')
legend("topleft",bty="n",legend=paste("R2 is",rdata[1,((i+1)/2)] <- format(summary(fit)$adj.r.squared,digits=4)))
}
However, the lm seems to be shifted so that it does not fit the trend I see in the experimental data:
It consistently occurs for every condition. I unsuccessfully tried to find an explanation by looking up the source code and browsing different forums and posts (this or here).
I would have liked to simply comment and ask a few questions, but I can't.
From what I've understood, both repA and repB are measured with error. Hence, you cannot fit your data using an ordinary least squares procedure, which only takes into account the error in Y (some might argue a weighted OLS may work, but I'm not skilled enough to discuss that). Your question seems linked to this one.
What you can use is a total least squares procedure: it takes into account the error in both X and Y. In the example below, I've used a "normal" TLS assuming the same error in X and Y (thus error.ratio=1). If that is not the case, you can specify the error ratio by entering error.ratio=var(y1)/var(x1) (at least I think it's var(Y)/var(X): check the documentation to be sure).
library(mcr)
MCR_reg=mcreg(x1,y1,method.reg="Deming",error.ratio=1,method.ci="analytical")
MCR_intercept=getCoefficients(MCR_reg)[1,1]
MCR_slope=getCoefficients(MCR_reg)[2,1]
# CI for predicted values
x_to_predict=seq(0,35)
predicted_values=MCResultAnalytical.calcResponse(MCR_reg,x_to_predict,alpha=0.05)
CI_low=predicted_values[,4]
CI_up=predicted_values[,5]
Please note that, in Deming/TLS regressions, your x- and y-errors are supposed to follow a normal distribution, as explained here. If that is not the case, go for a Passing-Bablok regression (the R code is here).
Also note that R2 isn't defined for either Deming or Passing-Bablok regressions (see here). A correlation coefficient is a good proxy, although it does not provide exactly the same information. Since you're studying a linear correlation between two factors, see Pearson's product-moment correlation coefficient, and use e.g. the rcorr function.
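To illustrate why the OLS line can look shifted when both replicates carry error, here is a small base-R sketch on simulated data. It uses the textbook closed-form Deming slope rather than the mcr implementation, and cor.test() from base R for Pearson's correlation; the variable names and noise levels are assumptions chosen only for the demonstration.

```r
set.seed(1)
n <- 2000
truth <- rnorm(n, mean = 6, sd = 1)   # simulated "true" log10 signal
repA <- truth + rnorm(n, sd = 0.5)    # both replicates are measured with error
repB <- truth + rnorm(n, sd = 0.5)

# OLS only models error in y, so its slope is pulled toward zero
ols_slope <- unname(coef(lm(repB ~ repA))[2])

# Closed-form Deming slope; delta = var(error in y) / var(error in x)
deming_slope <- function(x, y, delta = 1) {
  sxx <- var(x); syy <- var(y); sxy <- cov(x, y)
  (syy - delta * sxx + sqrt((syy - delta * sxx)^2 + 4 * delta * sxy^2)) / (2 * sxy)
}
dem_slope <- deming_slope(repA, repB)

# Pearson's product-moment correlation as a summary of agreement
pearson <- cor.test(repA, repB, method = "pearson")
print(c(ols = ols_slope, deming = dem_slope, r = unname(pearson$estimate)))
```

In this simulation the true relationship has slope 1, so the gap between the two slopes gives a rough sense of how much the error in x flattens the OLS line.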
I have a matrix Expr with rows representing variables and columns samples.
I have a categorical vector called groups (containing either "A","B", or "C")
I want to test which of the variables in Expr can be explained by the group each sample belongs to.
My strategy would be to model the problem with a generalized additive model (with a negative binomial distribution).
Then I want to use a likelihood ratio test, variable by variable, to get a p-value for each variable.
I do:
require(VGAM)
m <- vgam(Expr ~ group, family=negbinomial)
m_alternative <- vgam(Expr ~ 1, family=negbinomial)
and then:
lr <- lrtest(m, m_alternative)
The last step is wrong because it tests the overall likelihood ratio of the two models, not the variable-wise one.
Instead of a single p value I would like to get a vector of the p-values for every variable.
How should I do it?
(I am very new to R, so forgive me my stupidity)
It sounds like you want to use Expr as your predictors. I think you may have your formula backwards. The response should be on the left, so I guess that's groups in your case.
If Expr is a data.frame, you can do regression on all variables with
m <- vgam(group ~ ., Expr, family=negbinomial)
If class(Expr)=="matrix", then
m <- vgam(group ~ Expr, family=negbinomial)
probably should work, but you may just get slightly odd looking coefficient labels.
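If the aim is instead the one stated in the question, a p-value per variable with each row of Expr as the response, one possible sketch is to fit two negative binomial models per row and compare them with a likelihood ratio test. This swaps in glm.nb() from the MASS package in place of vgam(), and the data are simulated; the helper name lrt_pvalue is hypothetical.

```r
library(MASS)  # provides glm.nb()

set.seed(1)
groups <- factor(rep(c("A", "B", "C"), each = 10))
# Simulated counts: 5 variables (rows) x 30 samples (columns)
Expr <- matrix(rnbinom(5 * 30, mu = 50, size = 5), nrow = 5)
Expr[1, groups == "C"] <- rnbinom(10, mu = 500, size = 5)  # variable 1 differs in group C

# Likelihood ratio test for one variable: full model (~ g) vs null model (~ 1)
lrt_pvalue <- function(y, g) {
  full <- glm.nb(y ~ g)
  null <- glm.nb(y ~ 1)
  stat <- 2 * (as.numeric(logLik(full)) - as.numeric(logLik(null)))
  pchisq(stat, df = nlevels(g) - 1, lower.tail = FALSE)
}

# One p-value per row of Expr
p_values <- apply(Expr, 1, lrt_pvalue, g = groups)
print(p_values)
```

With many variables the resulting vector would normally be passed through p.adjust() to account for multiple testing.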
I'm trying to use R for the first time.
In this case, y is oxygen consumption, x is time and g is status indicated by up to three letters (NYF, IR, F, M, or NF). It will run regressions for each status except for F.
[Side note: I've also tried accomplishing this with multiple regressions using the subset function. When I use
lm(O2~time,subset(data,Status=="NYF"))
it does not actually adhere to the subset and gives me a regression for the entire data set regardless of which status I enter.]
How do I get multiple simple linear regressions from a single data set based on the codes in the status column?
Your question isn't clear. Suppose you have a data frame, dd, with three columns: y, x, g. The variables y and x are numeric and g takes the values NYF, IR, F, M, or NF. To carry out a simple linear regression for a particular status, use:
lm(y ~ x, data=dd[dd$g=="NYF",])
#Or
lm(y ~ x, data=dd[dd$g=="IR",])
To perform multiple linear regression, try
lm(y ~ x + g, data=dd)
where the presence or absence of each level of the factor g is indicated by a binary variable.
lm(y~x,subset(dd,g=='NYF'))
is appropriate syntax to fit the line for a single status (although others are giving you variants that will work). I would check to make sure your data frame is indeed named "data" and your status variable is named "Status".
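To get all the per-status regressions in one go, one option is to split the data frame by status and fit lm() within each piece. This is a sketch on simulated data using the column names y, x and g from the answer above; the object names fits and coef_table are hypothetical.

```r
set.seed(1)
# Simulated stand-in for the data frame dd with columns y, x, g
dd <- data.frame(
  x = rep(1:10, times = 5),
  g = rep(c("NYF", "IR", "F", "M", "NF"), each = 10)
)
dd$y <- 3 + 0.5 * dd$x + rnorm(nrow(dd))

# One lm() per status: split the rows by g, then fit within each piece
fits <- lapply(split(dd, dd$g), function(d) lm(y ~ x, data = d))

# Intercept and slope for every status, one row each
coef_table <- t(sapply(fits, coef))
print(coef_table)
```

An individual model is then available by name, for example fits[["NYF"]], and summary(fits[["NYF"]]) gives the usual regression output for that status.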