Interpreting the confidence interval from CausalImpact - r

I am not sure how to interpret the confidence interval obtained when using the CausalImpact function in the CausalImpact R package.
I am confused because I think there is a contradiction: the model returns a very low p-value (0.009), which indicates that there is a causal effect, and yet the "actual" line (the solid line) appears to be well within the 95% confidence band of the counterfactual. If there was a causal impact, wouldn't you expect the line to be outside the blue band?
These are my results:
And here are the model summary results (my apologies for the large text):
What's happening here?

The two results answer different questions.
The plot shows daily effects. The fact that the CIs contain zero means that the effect wasn't significant on any day by itself.
The table shows overall effects. Unlike the plot, the table pools information over time, which increases statistical power. The fact that effects were consistently negative throughout the post-period provides evidence that, overall, there probably was a negative effect. It's just too subtle to show up on any day by itself.
A side note: There seems to be a strong dip in the gap between pre- and post-period. You may want to be extra careful here and think about whether the effect in the post-period could have been caused by whatever happened in the gap rather than by the treatment.
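For reference, a minimal sketch of where the two displays come from; `data`, `pre.period` and `post.period` stand in for whatever was passed in the original analysis:
library(CausalImpact)
# Fit as in the question; data, pre.period and post.period are placeholders
impact <- CausalImpact(data, pre.period, post.period)
plot(impact)     # daily (pointwise) effects, each with its own interval
summary(impact)  # pooled average and cumulative effects over the post-period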

Related

use.u=TRUE in bootMer function

I have a question about bootstrapping confidence intervals for the random effects (BLUPs) of a multilevel model.
I'm currently using bootMer and there is an argument use.u=TRUE that allows one to treat the BLUPs as fixed instead of re-estimating them. Since the BLUPs are random variables it would seem appropriate to re-estimate them at each bootstrap, and indeed the default option is use.u=FALSE.
However, the underlying assumption is that my clusters are a random sample drawn from a population of clusters. In my case I am running a survey experiment in 26 countries (this is the cluster of interest), which in reality were not randomly drawn. And while I am interested in drawing inferences about the larger population of countries from which my sample is drawn, I am also interested in the cluster-specific effects, i.e. the BLUPs, for each of these clusters. Because of this, I'm resorting to bootstrapping to get valid confidence intervals for these "estimates".
In this case would it be OK to set use.u=TRUE?
A related question was asked here: https://stats.stackexchange.com/questions/417518/how-to-get-confidence-intervals-for-modeled-data-of-lmer-model-in-r-with-bootmer
However, I'm not sure whether that answer carries over to my case. Does anyone have ideas?
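For concreteness, a minimal sketch of the call in question, with made-up names (`survey_data`, `outcome`, `treatment`, `country`) standing in for the actual survey data:
library(lme4)
# Hypothetical multilevel model with a random slope for treatment by country
fit <- lmer(outcome ~ treatment + (treatment | country), data = survey_data)
# use.u = TRUE conditions on the estimated random effects (treats them as fixed);
# use.u = FALSE (the default) re-simulates them from their estimated distribution
blups <- function(m) unlist(ranef(m)$country)
boot_res <- bootMer(fit, FUN = blups, nsim = 500, use.u = TRUE, type = "parametric")
# Percentile confidence intervals for each BLUP
apply(boot_res$t, 2, quantile, probs = c(0.025, 0.975), na.rm = TRUE)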

Is there a numerical method for approaching the first derivative at t = 0 s in a real-time application?

I want to refine, step by step as unevenly-sampled data come in, the estimate of the first derivative at t = 0 s. For example, suppose you want to find the initial velocity of a projectile's motion, but you do not know its final position and velocity; you are, however, (slowly) receiving measurements of the projectile's current position and time.
Update - 26 Aug 2018
I would like to give you more details:
"Unevenly-sampled data" means the time intervals are not regular (irregular times between successive measurements). However, the sampling interval is roughly constant, at about 15 min. Because of the nature of the phenomenon (heat transfer), some successive measurements show no change. The process has an exponential tendency and I could fit the data to a known model, but that requires a substantial amount of data. For practical purposes, I only need the value of the very first slope of the whole process.
I tried a progressive Weighted Least Squares (WLS) fitting procedure, with a weight matrix such as
W = diag((0.5).^(1:kk)); % where kk is the last measurement id
But it used preprocessed data (i.e., jitter removal, smoothing, and fitting with the theoretical functional form). It gave me the following result:
This is a real example of the problem and its "current solution"
This is good enough for me, but I would like to know whether there is an optimal way of doing this using the raw (or merely smoothed) data.
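For reference, a rough R translation of that weighting idea; `t_obs` and `y_obs` are made-up names for the kk time stamps and measurements received so far:
# Exponentially decaying weights, as in the MATLAB snippet above:
# the earliest measurements get the largest weight
w <- 0.5^(seq_along(t_obs))
# Weighted least-squares fit; with the front-loaded weights the fitted slope
# is dominated by the behavior near t = 0
fit_wls <- lm(y_obs ~ t_obs, weights = w)
coef(fit_wls)["t_obs"]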
IMO, additional data are not relevant for improving the estimate at zero, because perturbations come into play and the correlation between the first and last samples decreases.
Also, the asymptotic behavior of the phenomenon is probably not known rigorously (is it truly a first-order linear model?), and this can introduce a bias into the estimate.
I would stick to the first points (say up to t=20) and fit a simple model, say quadratic.
If in fact what you are trying to do is to fit a first order linear model to the data, then least-squares fitting on the raw data is fine. If there are significant outliers, robust fitting is preferable.
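A minimal sketch of that suggestion, assuming a data frame `obs` with columns `t` and `y` (both names made up):
# Keep only the early part of the series and fit a quadratic;
# the first derivative at t = 0 is the coefficient of the linear term
early <- subset(obs, t <= 20)
fit <- lm(y ~ t + I(t^2), data = early)
coef(fit)["t"]
# Robust alternative if there are significant outliers
library(MASS)
fit_rob <- rlm(y ~ t + I(t^2), data = early)
coef(fit_rob)["t"]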

What happens when logistic regression does not quite capture the data?

I have modeled the probability of an aggressive (vs. indolent) form of recurrent respiratory papillomatosis as a function of age at diagnosis. Generally speaking, those diagnosed before the age of 5 have an 80% probability of running an aggressive course. Those diagnosed after the age of 10 have about a 30% chance. Between 5 and 10 years it is somewhere in between. Within each of the three age groups there does not seem to be a correlation with age.
Look at the curve (open circles) that logistic regression wants to go with, and compare it with my manual line (dotted line), which seems to better describe what is going on. My x-axis is the log of diagnostic age; the y-axis is the probability of aggressive disease. How do I model the dotted line? I thought of using my own logistic function, but I do not know how to make R find the best parameters.
Am I missing something in my understanding of the mathematics of the two graphs?
How do I operationalize this in R? Or perhaps I am looking for the dashed green line. I cannot believe the dashed line is correct: biologically speaking, there is little reason to imagine that the risk for someone diagnosed at age 9.9 years is very different from that for someone diagnosed at age 10.1 years.
I agree that discontinuous or step functions typically make little ecological sense. Then again, probably your dotted line doesn't, either. If we can agree that the level won't make any discontinuous jumps (as in your green dashed line), then why should the regression coefficient of the response to age make discontinuous jumps to yield the "kinks" in your green line?
You could consider transforming your age using splines to model nonlinearities. Just make sure that you don't overfit. Logistic regression will never yield a perfect fit, so don't search for one.
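A hedged sketch of the spline idea, with made-up names: a data frame `dat` holding a 0/1 outcome `y` (aggressive disease) and the log diagnostic age `dxage`:
library(splines)
# Natural cubic spline in log(age): the logit can bend smoothly,
# with no discontinuous jumps or kinks
fit_spline <- glm(y ~ ns(dxage, df = 3), data = dat, family = binomial)
# Compare with the plain logistic fit to guard against overfitting
fit_plain <- glm(y ~ dxage, data = dat, family = binomial)
AIC(fit_plain, fit_spline)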
The "standard" logistic function $\frac{1}{1+e^{-x}}$ approaches 0 and 1 as $x \to \pm\infty$. This is not a great match for your data, which don't seem to approach either of those values, but instead approach 0.8 from the left and 0.3 from the right.
You may want to add scale and offset parameters so that you can squash and shift that curve into that range. My guess is that, despite the extra parameters, the model will fit better (via AIC, etc) and will end up resembling your dashed line.
Edit: You're on the right track. The next step would be to replace the hard-coded values of 0.5 and 0.3 with parameters to be fitted. Your model would look something like
y ~ gain / (1 + exp(-tau * (dxage - shift))) + offset
where y is the 0/1 indicator of aggressive disease and dxage is the log diagnostic age. You would then fit this with nls: simply pass in the formula above and the data. If you have reasonable guesses for the starting values (which you do here), providing them can help speed convergence.
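A minimal sketch of that fit, using the same made-up data frame `dat` with columns `y` and `dxage` as above; the starting values are rough guesses read off the description (a drop from about 0.8 to about 0.3 somewhere between ages 5 and 10):
# gain, tau, shift and offset are the parameters to be estimated
fit <- nls(
  y ~ gain / (1 + exp(-tau * (dxage - shift))) + offset,
  data  = dat,
  start = list(gain = 0.5, tau = -3, shift = log(7), offset = 0.3)
)
summary(fit)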

Meaning of pct.change parameter in longpower package?

Hey, so I'm trying to perform a power calculation for a longitudinal study. I've been working with the longpower package. I was a bit confused about the meaning of the pct.change parameter in the lmmpower command when trying to calculate the sample size for an nlme model. For example, in the following command, what does the .3 represent?
lmmpower(model.3, pct.change = .3, t = seq(1,7,1), power = 0.80)$n
The package writeup lists it as "the percent change in the pilot estimate of the parameter of interest (beta, the placebo/null effect)", but I'm having trouble understanding it. If someone could explain it with a simple example, I'd really appreciate it. Also, I'm not sure if this belongs here or on Cross Validated, so sorry if it doesn't.
Whenever you do a power calculation, you need to specify an effect size, i.e. the magnitude of response that you're expecting from your experiment. This is usually the smallest effect you would reasonably expect to be able to detect, and an effect size below which you would be comfortable concluding that the effect probably wasn't important in terms of your subject area (biology, economics, business, whatever ...)
lmmpower allows you to specify either pct.change or delta; probably the reason that pct.change is emphasized is that it's often reasonable/easier to interpret proportional changes in an effect. Among other things, this makes the value of the parameter independent of the scale (units) on which the response variable is measured. Alternatively, you can specify delta, which is the change in absolute units (i.e., the same units as the parameter of interest, such as response units per unit time for a slope).
For what it's worth
"the percent change in the pilot estimate of the parameter of interest (beta, the placebo/null effect)"
seems a little puzzling to me too; I would add "as a proportion of" before "beta" in the parenthetical clause.
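To make that concrete with a hedged example: suppose the pilot model's slope of interest (beta) is estimated at 1.5 units per visit (a made-up value). Then pct.change = 0.3 targets a 30% change in that slope, which is the same effect size as delta = 0.3 * 1.5 = 0.45 in absolute units:
library(longpower)
# With the assumed pilot slope of 1.5, these two calls specify the same effect size
lmmpower(model.3, pct.change = 0.3, t = seq(1, 7, 1), power = 0.80)$n
lmmpower(model.3, delta = 0.45, t = seq(1, 7, 1), power = 0.80)$n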

R - Approach to find outliers/artefacts in blood pressure curve

Do you guys have an idea how to approach the problem of finding artefacts/outliers in a blood pressure curve? My goal is to write a program that finds the start and end of each artefact. Here are some examples of different artefacts; the green area is the correct blood pressure curve and the red one is the artefact that needs to be detected:
And this is an example of a whole blood pressure curve:
My first idea was to calculate the mean of the whole curve as well as means over short intervals of the curve, and then find where they differ. But the blood pressure varies so much that I don't think this could work: it would find too many non-existent "artefacts".
Thanks for your input!
EDIT: Here is some data for two example artefacts:
Artefact1
Artefact2
Without any data, there is only the option of pointing you towards different methods.
First (without knowing your data, which is always a huge drawback), I would point you towards Markov switching models, which can be analysed using the HiddenMarkov-package, or the HMM-package. (Unfortunately the RHmm-package that the first link describes is no longer maintained)
You might find it worthwhile to look into Twitter's outlier detection.
Furthermore, there are many blogposts that look into change point detection or regime changes. I find this R-bloggers blog post very helpful for a start. It refers to the CPM-package, which stands for "Sequential and Batch Change Detection Using Parametric and Nonparametric Methods", the BCP-package ("Bayesian Analysis of Change Point Problems"), and the ECP-package ("Non-Parametric Multiple Change-Point Analysis of Multivariate Data"). You probably want to look into the first two as you don't have multivariate data.
Does that help you get started?
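As a starting point, a minimal sketch of the Bayesian change-point route with the bcp package, assuming the pressure values are in a numeric vector `bp` (name made up):
library(bcp)
# Posterior probability of a change point at each position; runs of high
# probability are candidate artefact boundaries
fit <- bcp(bp)
candidates <- which(fit$posterior.prob > 0.5)
plot(fit)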
I can provide a graphical answer that does not use any statistical algorithm. From your data, I observe that the "abnormal" sequences seem to contain either constant portions or, conversely, very high variations. Working on the derivative and setting limits on it could work. Here is a workaround:
require(forecast)
# Smooth the raw series so micro-variations are not flagged as artefacts
bp <- ma(as.numeric(df2$BP), order = 50)
bp <- bp[!is.na(bp)]
# Flag points whose absolute first difference exceeds a threshold (tune it with
# the smoothing window), then smooth the flags to bridge tiny gaps in a block
flag <- ma(as.numeric(abs(diff(bp)) > 1), order = 10) > 0.1
flag <- c(flag, FALSE)       # pad diff() output back to the series length
flag[is.na(flag)] <- FALSE   # moving-average edges
abnormal <- bp
abnormal[!flag] <- NA        # keep only the flagged portions
plot(seq_along(bp), bp, type = "l")
lines(seq_along(bp), abnormal, col = "red")
What it does: first it "smooths" the data with a moving average so that micro-variations are not detected. Then it applies the diff function (a discrete derivative) and tests whether it is greater than 1 (this value has to be adjusted manually, depending on the smoothing amplitude). Then, in order to get a whole "block" of abnormal sequence without tiny gaps, we smooth the boolean flags again and test whether the result is above 0.1, to better capture the boundaries of the zone. Finally, I overplot the spotted portions in red.
This works for one type of abnormality. For the other type (the flat portions), you could instead set a low threshold on the derivative and play with the tuning parameters of the smoothing.
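A hedged sketch for that flat-portion case, reusing `bp` and the same tuning idea from the block above (the thresholds are arbitrary and would need adjusting):
# Flag stretches where the smoothed absolute derivative stays near zero
flat <- ma(as.numeric(abs(diff(bp)) < 0.01), order = 10) > 0.9
flat <- c(flat, FALSE); flat[is.na(flat)] <- FALSE
flat_part <- bp
flat_part[!flat] <- NA
lines(seq_along(bp), flat_part, col = "blue")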
