What happens when logistic regression does not quite capture the data?

I have modeled the probability of an aggressive (vs. indolent) form of recurrent respiratory papillomatosis as a function of age at diagnosis. Generally speaking, those who are diagnosed before the age of 5 have an 80% probability of running an aggressive course. Those diagnosed after the age of 10 have about a 30% chance. Between 5 and 10 years it is somewhere in between. Within each of these three age groups, the probability does not seem to vary with age.
Look at the curve that logistic regression wants to fit to the data (open circles), and compare it with my hand-drawn line (dotted), which seems to describe what is going on better. My x-axis is the log of diagnostic age. The y-axis is the probability of aggressive disease. How do I model the dotted line? I thought of using my own logistic function, but I do not know how to make R find the best parameters.
Am I missing something in my understanding of the mathematics of the two graphs?
How do I operationalize this in R? Or perhaps I am looking for the dashed green line, but I cannot believe that line is correct: biologically speaking, there is little reason to think that the risk for someone diagnosed at age 9.9 years is very different from that for someone diagnosed at age 10.1 years.

I agree that discontinuous or step functions typically make little biological sense. Then again, your dotted line probably doesn't, either. If we can agree that the level won't make any discontinuous jumps (as in your green dashed line), then why should the slope of the response with respect to age make discontinuous jumps, yielding the "kinks" in your hand-drawn line?
You could consider transforming your age using splines to model nonlinearities. Just make sure that you don't overfit. Logistic regression will never yield a perfect fit, so don't search for one.
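A minimal sketch of what a spline-based logistic fit could look like, assuming a data frame dat with a 0/1 outcome aggressive and the log diagnostic age log_dxage (both names are illustrative):
library(splines)

# Logistic regression with a natural cubic spline on (log) diagnostic age;
# df controls the flexibility -- keep it small (2-3) to avoid overfitting
fit_spline <- glm(aggressive ~ ns(log_dxage, df = 3), family = binomial, data = dat)

# Compare with the plain linear-in-age logistic fit, e.g. via AIC
fit_linear <- glm(aggressive ~ log_dxage, family = binomial, data = dat)
AIC(fit_linear, fit_spline)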

The "standard" logistic function $\frac{1}{1+e^{-x}}$ passes through 0 and 1 at $±\infty$. This is not a great match for your data, which doesn't seem to approach either of those values, but instead approaches 0.8 from the left and 0.3 from the right.
You may want to add scale and offset parameters so that you can squash and shift that curve into that range. My guess is that, despite the extra parameters, the model will fit better (via AIC, etc) and will end up resembling your dashed line.
Edit: You're on the right track. The next step would be to replace the hard-coded values of 0.5 and 0.3 with parameters to be fitted. Your model would look something like this (writing p_aggressive for the observed outcome and dxage for the log diagnostic age):
p_aggressive ~ gain / (1 + exp(-tau * (dxage - shift))) + offset
You would then fit this with nls(): simply pass in the formula above along with the data. If you have reasonable guesses for the starting values (which you do here), providing them can help speed up convergence.
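A minimal sketch of such a fit, assuming a data frame dat with those two columns (names and starting values are illustrative; shift = log(7) just places the midpoint between ages 5 and 10 on the log scale):
# Scaled, shifted logistic: offset is one plateau, offset + gain the other,
# shift the midpoint on the log-age axis, tau the steepness (negative here,
# because the probability falls as age increases)
fit <- nls(p_aggressive ~ gain / (1 + exp(-tau * (dxage - shift))) + offset,
           data  = dat,
           start = list(gain = 0.5, offset = 0.3, tau = -5, shift = log(7)))
summary(fit)

# Fitted curve over the observed range, for plotting against the data
grid <- data.frame(dxage = seq(min(dat$dxage), max(dat$dxage), length.out = 200))
grid$p_hat <- predict(fit, newdata = grid)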

Related

Checking for multicollinearity and adjusting the data

I would appreciate help interpreting the following pairwise scatterplots of the predictor variables to check for multicollinearity, and then adjusting the data based on what they show so that the problem does not carry over into the regression.
Background: I am working on a task where I have to carry out multiple linear regression. I have three explanatory variables, tar, nico and weight, and want to predict CO, so CO is the response (dependent) variable. The data come from 25 American cigarette brands, where tar, nico and weight are each brand's tar content, nicotine content and weight per cigarette, and CO is how much carbon monoxide a cigarette emits.
Question: For the task I now plot all the explanatory variables against each other in pairs to look for multicollinearity and to find an observation that is questionable to include in the regression, which I have done (see the picture above). But how should I interpret this image?
My thoughts: I have understood that multicollinearity would be absent if all the panels in this plot looked different, but I can clearly see that this is not the case here. For example, three of the four plots in the "tar" row look similar, and the same shape also appears in one plot in the "nico" row and two plots in the "weight" row. Does this mean that the three predictor variables are multicollinear? Or that some observations in "tar" are collinear with other observations in "tar"? After I figure out where this collinearity (possibly) arises, I need to adjust the data and run a new multiple linear regression on the reduced data set from which the questionable observation has been removed. I think this is done by setting the value of the dubious observation to NA, but first I have to find it.
Finally: how should I interpret the image, and how should I then adjust the data to get rid of any collinearity?
Any thoughts and tips on this are welcomed! Thanks in advance!
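For concreteness, a minimal sketch of the usual numerical checks that complement the scatterplot matrix, assuming the data frame is called cig with columns tar, nico, weight and CO (the names and the dropped row are illustrative):
library(car)  # for vif()

# Pairwise scatterplots and correlations among the predictors
pairs(cig[, c("tar", "nico", "weight")])
cor(cig[, c("tar", "nico", "weight")])

# Variance inflation factors from the full regression; values well above
# 5-10 are a common rule of thumb for problematic collinearity
fit <- lm(CO ~ tar + nico + weight, data = cig)
vif(fit)

# Refit without a questionable observation (here row 3, purely illustrative)
fit_reduced <- lm(CO ~ tar + nico + weight, data = cig[-3, ])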

DHARMa residuals vs. predictors: interpretation of dashed lines

I'm creating diagnostics for my glmmTMB model using DHARMa and, while I understand most of the lines, I have trouble interpreting the plots of scaled residuals versus predictor variables: there is a red dashed line. Any advice on how to interpret it?
Example of residual vs one of the predictor plots:
Let me know if you need more information to give me an answer.
The red dashed line is (to my knowledge) simply a non-parametric estimate of the average. In a perfect world we would expect it to be 0.
In the real world, we expect not to see any systematic deviations from 0. Yours looks rather good here: it oscillates around 0 at random, and only in the region where information is lacking (pred > 2.5) does it start to deviate.
The answer can be found here. In short, what you see is the nonparametric smoother that is the default for large data sets. You can get a more intuitive plot by specifying quantreg = TRUE.
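A short sketch of how that could look; the model object and predictor are placeholders:
library(DHARMa)

# Simulate scaled residuals from the fitted glmmTMB model
sim <- simulateResiduals(fittedModel = my_glmmTMB_model)

# Residuals against a single predictor; quantreg = TRUE replaces the default
# smoother with quantile regressions at 0.25/0.5/0.75, which are easier to read
plotResiduals(sim, form = my_data$pred, quantreg = TRUE)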

Interpreting the confidence interval from CausalImpact

I am not sure how to interpret the confidence interval obtained when using the CausalImpact function in the CausalImpact R package.
I am confused because I think there is a contradiction: the model returns a very low p-value (0.009), which indicates that there is a causal effect, and yet the "actual" line (the solid line) appears to be well within the 95% confidence band of the counterfactual. If there were a causal impact, wouldn't you expect the line to be outside the blue band?
These are my results:
and here are the model summary results (my apologies for the large text)
What's happening here?
The two results answer different questions.
The plot shows daily effects. The fact that the CIs contain zero means that the effect wasn't significant on any day by itself.
The table shows overall effects. Unlike the plot, the table pools information over time, which increases statistical power. The fact that effects were consistently negative throughout the post-period provides evidence that, overall, there probably was a negative effect. It's just too subtle to show up on any day by itself.
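Both views come from the same fitted object; a minimal sketch, with the data object and periods as placeholders:
library(CausalImpact)

# pre.period / post.period delimit the time ranges before and after the intervention
impact <- CausalImpact(my_series, pre.period, post.period)

plot(impact)               # daily (pointwise) effects with 95% intervals
summary(impact)            # pooled average and cumulative effects, with p-value
summary(impact, "report")  # plain-text interpretation of the pooled results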
A side note: There seems to be a strong dip in the gap between pre- and post-period. You may want to be extra careful here and think about whether the effect in the post-period could have been caused by whatever happened in the gap rather than by the treatment.

Did I just do an ANCOVA or MANOVA?

I’m trying to do an ANCOVA here ...
I want to analyze the effect of EROSION FORCE and ZONATION on all the species (listed with small letters) in each POOL.STEP (ranging from 1-12/1-4), while controlling for the effect of FISH.
I’m not sure if I’m doing it right. What is the command for ANCOVA?
So far I used lm(EROSIONFORCE~ZONATION+FISH,data=d), which yields:
So what I see here is that both the erosion force percentage (the intercept?) and sublittoral zonation are significant in some way, but I'm still not sure whether I've done an ANCOVA correctly here or whether this is just an ANOVA.
In general, ANCOVA (analysis of covariance) is simply a special case of the general linear model with one categorical predictor (factor) and one continuous predictor (the "covariate"), so lm() is the right function to use.
However ... the bottom line is that you have a moderately challenging statistical problem here, and I would strongly recommend that you try to get local help (if you're working within a research group, can you consult with others in your group about appropriate methods?). I would suggest following up either on CrossValidated or r-sig-ecology@r-project.org.
by putting EROSIONFORCE on the left side of the formula, you're specifying that you want to use EROSIONFORCE as the response (dependent) variable, i.e. your model estimates how erosion force varies across zones and with fish numbers; it says nothing about the species' responses
if you want to analyze the response of a single species to erosion and zone, controlling for fish numbers, you need something like
lm(`Acmaeidae s...` ~ EROSIONFORCE+ZONATION+FISH, data=your_data)
the lm() suggestion above would handle each species independently, i.e. you'd have to do a separate analysis for each species. If you also want to do it separately for each POOL.STEP you're going to have to run a lot of separate analyses. There are various ways of automating this in R; the most idiomatic is probably to melt your data (see reshape2::melt or tidyr::gather) into long format and then use lmList from lme4 (a sketch follows this list).
since you have count data with low means, i.e. lots of zeros (and a few big values), you should probably consider a Poisson or negative binomial model, and possibly even a zero-inflated/hurdle model (i.e. analyze presence-absence and size of positive responses separately)
if you really want to analyze the joint distribution of all species (i.e., treat the species as a multivariate response, which is the M in MANOVA), you're going to have to work quite a bit harder ... there are a variety of joint species distribution models by people like Pierre Legendre, David Warton and others ... I'd suggest you start with the mvabund package, but you might need to do some reading first
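As a rough sketch of the reshaping idea mentioned above (column names are guesses based on your description; lme4::lmList also takes a family argument if you move to a Poisson model):
library(reshape2)
library(lme4)

# Reshape to long format: one row per (pool, species) with a single count column;
# the id columns here are guesses based on the description above
long <- melt(your_data,
             id.vars = c("POOL.STEP", "EROSIONFORCE", "ZONATION", "FISH"),
             variable.name = "species", value.name = "count")

# One regression per species (add family = poisson for count data)
fits <- lmList(count ~ EROSIONFORCE + ZONATION + FISH | species, data = long)
summary(fits)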

R - Approach to find outliers/artefacts in blood pressure curve

Do you guys have an idea how to approach the problem of finding artefacts/outliers in a blood pressure curve? My goal is to write a program that finds the start and end of each artefact. Here are some examples of different artefacts; the green area is the correct blood pressure curve and the red one is the artefact that needs to be detected:
And this is an example of a whole blood pressure curve:
My first idea was to calculate the mean of the whole curve and many means over short intervals of the curve, and then find out where they differ. But the blood pressure varies so much that I don't think this could work; it would flag too many non-existent "artefacts".
Thanks for your input!
EDIT: Here is some data for two example artefacts:
Artefact1
Artefact2
Without any data, the only option is to point you towards different methods.
First (without knowing your data, which is always a huge drawback), I would point you towards Markov switching models, which can be analysed using the HiddenMarkov-package, or the HMM-package. (Unfortunately the RHmm-package that the first link describes is no longer maintained)
You might find it worthwhile to look into Twitter's outlier detection.
Furthermore, there are many blogposts that look into change point detection or regime changes. I find this R-bloggers blog post very helpful for a start. It refers to the CPM-package, which stands for "Sequential and Batch Change Detection Using Parametric and Nonparametric Methods", the BCP-package ("Bayesian Analysis of Change Point Problems"), and the ECP-package ("Non-Parametric Multiple Change-Point Analysis of Multivariate Data"). You probably want to look into the first two as you don't have multivariate data.
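To make that concrete, a rough sketch with the cpm package, where bp stands for the raw pressure series and the choice of test statistic and tuning values is illustrative:
library(cpm)

# Sequential detection of multiple change points; cpmType chooses the test
# statistic and ARL0 the average run length between false alarms
res <- processStream(bp, cpmType = "Mann-Whitney", ARL0 = 500, startup = 20)

res$changePoints    # estimated change-point locations
res$detectionTimes  # the times at which each change was flagged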
Does that help you getting started?
I can offer a graphical approach that does not use any statistical algorithm. From your data I observe that the "abnormal" sequences seem to show either constant portions or, conversely, very high variation. Working on the derivative and setting limits on it could work. Here is a workaround:
require(forecast)
# Smooth the signal with a moving average to suppress micro-variations
bp <- as.numeric(df2$BP)
smoothed <- ma(bp, order = 50)
smoothed <- smoothed[!is.na(smoothed)]
# Derivative (padded so lengths match); flag where it exceeds 1, then smooth
# the flags so small gaps inside an abnormal block are bridged
deriv <- c(0, abs(diff(smoothed)))
flagged <- ma(as.numeric(deriv > 1), order = 10) > 0.1
flagged[is.na(flagged)] <- FALSE
# Overplot the flagged (abnormal) portions in red
abnormal <- smoothed
abnormal[!flagged] <- NA
plot(seq_along(smoothed), smoothed, type = "l")
lines(seq_along(smoothed), abnormal, col = "red")
What it does: it first "smooths" the data with a moving average to prevent the micro-variations from being detected. Then it takes the "diff" (derivative) and tests whether it is greater than 1 (this value has to be adjusted manually depending on the smoothing amplitude). Then, in order to get whole "blocks" of abnormal sequence without tiny gaps, we smooth the boolean flags again and test against 0.1 to better capture the boundaries of the zone. Finally, I overplot the spotted portions in red.
This works for one type of abnormality. For the other type (the constant portions), you could instead set a low threshold on the derivative and play with the tuning parameters of the smoothing.
