Zero-inflated data and lognormal distribution in R

I am trying to fit a regression model to zero-inflated data with a lognormal distribution using R.
The histogram looks like this:
I searched a little on the net. So far I believe it is not possible to fit data like this with glm. I found the gamlss function, which can fit a lognormal distribution with the LOGNO family. However, I get an error: "family = LOGNO, : response variable out of range". Is that because of the zero inflation?
To make my question a little clearer:
I am trying to investigate the influence of various amino acid combinations, collected under certain conditions, on a certain ratio. The ratio is my response variable, plotted in the histogram shown. So I end up with a continuous response variable and several categorical independent variables.
Does anyone have an idea how I can deal with the above-mentioned problem? I couldn't find a solution so far!
Thank you!
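The LOGNO error is consistent with the lognormal distribution being defined only for strictly positive values, so exact zeros fall outside its range. One common way to handle such a response is a two-part (hurdle) model: a logistic regression for zero vs. non-zero, plus a lognormal regression on the positive values only. The sketch below uses base R on simulated data rather than gamlss; the names `ratio` and `group` and all simulated effect sizes are made up for illustration.

```r
set.seed(1)
n <- 300
group <- factor(sample(c("A", "B", "C"), n, replace = TRUE))

# Simulated truth (assumption): zero probability and log-scale mean depend on group
p_zero <- c(A = 0.2, B = 0.4, C = 0.6)[as.character(group)]
mu_log <- c(A = 0.0, B = 0.5, C = 1.0)[as.character(group)]
is_zero <- rbinom(n, 1, p_zero) == 1
ratio <- ifelse(is_zero, 0, rlnorm(n, meanlog = mu_log, sdlog = 0.5))
dat <- data.frame(ratio = ratio, group = group, zero = as.integer(ratio == 0))

# Part 1: logistic regression for the probability of a zero
fit_zero <- glm(zero ~ group, family = binomial, data = dat)

# Part 2: lognormal regression (linear model on the log scale) for positive values only
fit_pos <- lm(log(ratio) ~ group, data = subset(dat, ratio > 0))

summary(fit_zero)$coefficients
summary(fit_pos)$coefficients
```

The two parts are estimated separately, so each can use its own set of predictors if the process that produces zeros differs from the one that drives the size of the positive values.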

Related

Extracting normal-distributed subset from a dataset in R

I am working with a dataset of ~200 observations and a number of variables. Unfortunately, none of the variables is normally distributed. Is it possible to extract a data subset in which at least one desired variable is normally distributed? I want to do some statistics afterwards (at least logistic regression).
Any help will be much appreciated,
Phil
If there are just a few observations that skew the distribution of individual variables, and no other reasons speaking against using a particular method (such as logistic regression) on your data, you might want to study the nature of "weird" observations before deciding on which analysis method to use eventually.
I would:
Carry out the desired regression analysis (e.g. logistic regression) and, as is always required, a residual analysis (normal Q-Q plot, Tukey-Anscombe plot, leverage plot; also see here) to check the model assumptions. See whether the residuals are normally distributed: the actual assumption in linear regression is that the model residuals are normally distributed, not that each variable is. You might, for example, have bimodally distributed data if there are differences between groups. See if there are observations which could be regarded as outliers, study them (see e.g. here), and if possible remove them from the final dataset before re-fitting the linear model without outliers.
However, you always have to state which observations were removed, and on what grounds. Maybe the outliers can be explained as errors in data collection?
The issue of whether it's a good idea to remove outliers, or a better idea to use robust methods was discussed here.
As suggested by GuedesBF, you may want to find a test or modelling method that does not assume normality.
Before modelling anything or removing any data, I would always plot the data by treatment / outcome groups, and check for missing values. After quickly looking at your dataset, it seems that quite a few variables have high levels of missingness, and your variable 15 has a lot of zeros. This can be quite problematic for e.g. linear regression.
Understanding and describing your data in a model-free way (with clever plots, e.g. using ggplot2 and multiple aesthetics) is much better than fitting a model and interpreting p-values when violating model assumptions.
A good way to get an overview of all the data, their distributions and pairwise correlations (provided you don't have more than around 20 variables) is to use pairs.panels from the psych library.
# Read the tab-delimited file; the data have no header row
dat <- read.delim("~/Downloads/dput.txt", header = FALSE)
library(psych)
# Distributions, scatterplots and pairwise correlations, in two batches of columns
pairs.panels(dat[, 1:12])
pairs.panels(dat[, 13:23])
You can then quickly see the distribution of each variable, and the presence of correlations among each pair of variables. You can tune arguments of that function to use different correlation methods, and different displays. Happy exploratory data analysis :)
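The residual checks described in this answer can be sketched in base R as follows. The data are simulated and the names `x`, `y` and `fit` are hypothetical; the predictor is deliberately skewed to illustrate that the residuals, not the variables themselves, are what the normality check is about. The 4/n cutoff for Cook's distance is only a common rule of thumb, not something prescribed by the answer.

```r
set.seed(42)
# Hypothetical data: a strongly skewed predictor, but normally distributed errors
x <- rexp(200)
y <- 2 + 3 * x + rnorm(200)
fit <- lm(y ~ x)

res <- residuals(fit)

qqnorm(res); qqline(res)      # normal Q-Q plot of the residuals
plot(fitted(fit), res)        # residuals vs. fitted values (Tukey-Anscombe plot)
plot(fit, which = 5)          # residuals vs. leverage

# Flag observations worth a closer look before deciding whether to remove anything
cd <- cooks.distance(fit)
suspects <- which(cd > 4 / length(cd))
suspects
```

Any observation flagged this way is a candidate for inspection, not automatic removal; as noted above, removals need to be reported and justified.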

Multiple Regression in R

I have been trying to do a simple regression in R using the following syntax:
Unfortunately, R keeps giving me warnings and the summary is not possible:
I can't find out the problem. The data includes more than just the 11 predictors mentioned in the syntax.
Thank you!
Melanie
This answer partially consists of comments in the original question.
That is not an error. It's a warning message (which is different from an error). It's generated because you attempt to use lm() with a factor-type response variable. Operations like + and - do not work on factors, hence the message "-" not meaningful for factors.
If the response is truly a categorical variable, lm() might not be the right way to go to model it. Alternatives in this situation:
glm(): Binary logistic regression, Poisson regression (negative binomial regression is available via MASS::glm.nb())
MASS::polr(): Ordinal logistic regression
nnet::multinom(): Multinomial logistic regression
and many others.
Please research the corresponding methods before actually using them.
If the response is actually NOT a categorical variable, you will want to investigate why it is coded as a factor, and convert it to numeric first.
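A minimal sketch of that conversion, using a made-up factor `y_factor` to stand in for the response. Calling as.numeric() directly on a factor returns the internal level codes rather than the original values, so the usual route is to go through as.character() first.

```r
# Hypothetical example: numeric values that were read in as a factor
y_factor <- factor(c("10.5", "3.2", "7.1", "3.2"))

as.numeric(y_factor)                          # internal level codes: 1 2 3 2
y_num <- as.numeric(as.character(y_factor))   # original values: 10.5 3.2 7.1 3.2

str(y_num)
```

If the conversion produces NA values with a warning, the column contains entries that are not numbers (stray text, a different decimal separator), which is often the reason it was read as a factor in the first place.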

Inverse prediction using drm package in R

I've fit a 5-parameter logistic model using the drm library. I apologize if this is a dumb question; I'm just getting started with R.
If dose is on my x-axis and response is on my y-axis, it is very easy to use this model to predict my response for a given dose. You can use either the function PR or predict. However, I want to estimate the dose for a given response, and I can't find a function to do this. For my assay, I fit my data to a standard curve and now I have measured a response from my unknowns. I would like to estimate the concentration (dose) based on this response. I could fit the data in the opposite direction (flip x and y), but the fit differs slightly and that's not a very conventional strategy. If anyone has any suggestions I would greatly appreciate them. Thank you.
In case anyone comes across this, the easiest way to do this is to use the ED function with the argument type = "absolute".
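To illustrate what inverse prediction does, here is a base R sketch that inverts a fitted curve numerically with uniroot(). It is not the ED function from the answer and does not use drm: it assumes a four-parameter logistic fitted with nls() and the self-starting SSfpl model from the stats package, on simulated standard-curve data. All names (`std`, `inverse_predict`, `target`) are hypothetical.

```r
set.seed(7)
# Hypothetical standard curve: response vs. log10(dose), three replicates per dose
log_dose <- rep(seq(-3, 3, by = 0.5), each = 3)
true_curve <- 0.1 + (2 - 0.1) / (1 + exp((0 - log_dose) / 0.8))
std <- data.frame(log_dose = log_dose,
                  response = true_curve + rnorm(length(log_dose), sd = 0.03))

# Fit a four-parameter logistic curve (self-starting, so no start values are needed)
fit <- nls(response ~ SSfpl(log_dose, A, B, xmid, scal), data = std)

# Find the log-dose at which the fitted curve equals the measured response `target`
inverse_predict <- function(fit, target, interval) {
  f <- function(x) {
    as.numeric(predict(fit, newdata = data.frame(log_dose = x))) - target
  }
  uniroot(f, interval = interval, tol = 1e-8)$root
}

est_log_dose <- inverse_predict(fit, target = 1.0, interval = range(std$log_dose))
10^est_log_dose   # estimated dose on the original scale
```

The target response has to lie between the fitted lower and upper asymptotes, otherwise there is no dose to solve for and uniroot() stops with an error. This sketch returns only a point estimate, with no standard error or confidence interval.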

How to fit a curve to data with sd in R?

I'm completely new to R, so apologies for asking something I'm sure must be basic. I just wonder if I can use the nls() command in R to fit a non-linear curve to a data structure where I have means and sd's, but not the actual replicates. I understand how to fit a curve to single data points or to replicates, but I can't see how to proceed when I have a mean+sd for each data point and I want R to consider variation in my data when fitting.
One possible approach is to simulate data from your means and standard deviations and run the regression on the simulated data. Doing this a number of times gives you a good impression of the range of plausible values for your regression coefficients.
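A minimal sketch of that simulation idea, under several assumptions that are not in the question: the summary values below are invented, each mean is assumed to come from three normally distributed replicates, and the curve is taken to be an exponential decay. Substitute your own means, sds, replicate count and model formula.

```r
set.seed(123)
# Hypothetical summary data: one mean and one sd per x value, no raw replicates
x     <- 1:8
means <- c(7.41, 5.49, 4.07, 3.01, 2.23, 1.65, 1.22, 0.91)
sds   <- rep(0.4, length(x))
n_rep <- 3      # assumed number of replicates behind each mean
n_sim <- 200    # number of simulated data sets

# Simulate one data set from the means/sds and return the fitted coefficients
fit_once <- function() {
  sim <- data.frame(
    x = rep(x, each = n_rep),
    y = rnorm(length(x) * n_rep,
              mean = rep(means, each = n_rep),
              sd   = rep(sds, each = n_rep))
  )
  coef(nls(y ~ a * exp(-b * x), data = sim, start = list(a = 8, b = 0.2)))
}

coefs <- t(replicate(n_sim, fit_once()))

# Spread of the estimates across simulated data sets
apply(coefs, 2, quantile, probs = c(0.025, 0.5, 0.975))
```

The quantiles give a rough range of plausible values for each coefficient. With noisier data or a harder model, some nls() fits may fail to converge, in which case wrapping the call in tryCatch() and dropping failed runs is one option.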

How do you correctly perform a glmmPQL on non-normal data?

I ran a model using glmer looking at the effect that Year and Treatment had on the number of points covered with wood, then plotted the residuals to check for normality and the resulting graph is slightly skewed to the right. Is this normally distributed?
model <- glmer(Number~Year*Treatment(1|Year/Treatment), data=data,family=poisson)
This site recommends using glmmPQL if your data is not normal: http://ase.tufts.edu/gsc/gradresources/guidetomixedmodelsinr/mixed%20model%20guide.html
library(MASS)
library(nlme)
model1<-glmmPQL(Number~Year*Treatment,~1|Year/Treatment,
family=gaussian(link = "log"),
data=data,start=coef(lm(Log~Year*Treatment)),
na.action = na.pass,verbose=FALSE)
summary(model1)
plot(model1)
Now do you transform the data in the Excel document or in the R code (Number1 <- log(Number)) before running this model? Does the link="log" imply that the data is already log transformed or does it imply that it will transform it?
If you have data with zeros, is it acceptable to add 1 to all observations to make it more than zero in order to log transform it: Number1<-log(Number+1)?
Is fit<-anova(model,model1,test="Chisq") sufficient to compare both models?
Many thanks for any advice!
tl;dr your diagnostic plots look OK to me, you can probably proceed to interpret your results.
This formula:
Number~Year*Treatment+(1|Year/Treatment)
might not be quite right (besides the missing + between the terms above ...) In general you shouldn't include the same term in both the random and the fixed effects (although there is one exception - if Year has more than a few values and there are multiple observations per year you can include it as a continuous covariate in the fixed effects and a grouping factor in the random effects - so this might be correct).
I'm not crazy about the linked introduction; at a quick skim there's nothing horribly wrong with it, but there seem to be a lot of minor inaccuracies and confusions. "Use glmmPQL if your data aren't Normal" is really shorthand for "you might want to use a GLMM if your data aren't Normal". Your glmer model should be fine.
Interpreting diagnostic plots is a bit of an art, but the degree of deviation that you show above doesn't look like a problem.
Since you don't need to log-transform your data, you don't need to get into the slightly messy issue of how to log-transform data containing zeros. In general log(1+x) transformations for count data are reasonable - but, again, unnecessary here.
anova() in this context does a likelihood ratio test, which is a reasonable way to compare models.
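The last two points can be illustrated with a plain Poisson GLM in base R. This is a simplified sketch without random effects, so it is not the glmer model from the question, and the data and names (`counts`, `treatment`) are simulated for illustration. It shows that the log link is applied to the expected count inside the model, so the raw counts (zeros included) are passed in untransformed, and that anova() with test = "Chisq" compares nested fits by a likelihood ratio test.

```r
set.seed(2024)
# Hypothetical count data containing zeros
treatment <- factor(rep(c("control", "treated"), each = 50))
counts <- rpois(100, lambda = ifelse(treatment == "control", 1, 3))

# The log link acts on the expected count; the response itself is not transformed
m0 <- glm(counts ~ 1,         family = poisson(link = "log"))
m1 <- glm(counts ~ treatment, family = poisson(link = "log"))

# Back-transformed coefficients: control mean and treated/control rate ratio
exp(coef(m1))

# Likelihood ratio test between the nested models
lrt <- anova(m0, m1, test = "Chisq")
lrt
```

Because the coefficients live on the log scale, exponentiating them gives quantities on the scale of the counts, which is usually easier to report.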
