I'm coding a hierarchical Poisson model in JAGS+R for count data that are subject to censoring.
A is a matrix whose rows are the different places and whose columns are the different time intervals in which I count the rainy days. As covariates I have a set of X_k matrices: for every place and time interval I have covariates such as mean temperature (stored in X_1), count of windy days (in X_2), and mean humidity (in X_3).
Should I separate the censored data from the non-censored ones? How do I do that with matrices?
Thanks for your help!
Update:
I have the following inside a loop that runs up to the second-to-last recorded time interval (for each place there might be censoring due to equipment failure):
mu[i,j] <- a[1]*x[i,1] + a[2]*x[i,2] + b[1,j]*varx[i,j] + b[2,j]*varx[i,j]
N[i,j] ~ dpois(lambda[i,j])   # stochastic node: needs ~, not <-
log(lambda[i,j]) <- mu[i,j] + alpha[i]
alpha[i] <- G0[latent[i]]
latent[i] ~ dcat(prob[])
I repeated this for the last time interval and added censor[i] <- step(-censored[i]) in that last-interval block, where censored[i] is a vector that indicates whether there was an equipment failure.
I am new to JAGS and it doesn't work; any help? Thanks.
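In case it helps: JAGS usually handles censored observations with the dinterval distribution rather than by splitting the data into separate matrices. Below is a minimal sketch of that approach, assuming the counts are right-censored when equipment fails; the names model_string, lim, is.censored, nplace and ntime are made up for illustration, and the priors are omitted.
library(rjags)

# Sketch only. N is the places x intervals count matrix with NA in censored cells;
# lim[i,j] is the partial count observed before the failure (the known lower bound)
# for censored cells, and any value larger than N[i,j] for uncensored cells;
# is.censored[i,j] is 1 for censored cells and 0 otherwise.
model_string <- "
model {
  for (i in 1:nplace) {
    for (j in 1:ntime) {
      is.censored[i,j] ~ dinterval(N[i,j], lim[i,j])
      N[i,j] ~ dpois(lambda[i,j])
      log(lambda[i,j]) <- mu[i,j] + alpha[i]
      mu[i,j] <- a[1]*x[i,1] + a[2]*x[i,2] + b[1,j]*varx[i,j] + b[2,j]*varx[i,j]
    }
    alpha[i] <- G0[latent[i]]
    latent[i] ~ dcat(prob[])
  }
  # priors for a, b, G0 and prob go here
}
"

# jm <- jags.model(textConnection(model_string),
#                  data = list(N = N, lim = lim, is.censored = is.censored,
#                              x = x, varx = varx,
#                              nplace = nrow(N), ntime = ncol(N)))
# Note: the censored N[i,j] are treated as latent; you typically need to supply
# initial values for them that lie above the corresponding lim[i,j].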
I have two variables, x and y, measured at one minute intervals for over two years. The average daily values of x and y are almost 90% correlated. However, when I analyze x and y in one minute intervals they are only 50% correlated. How can I detect the time interval at which this correlation becomes 90%? Ideally I'd like to do this in R.
I'm new to statistics/econometrics, so my apologies if this question is very basic!
I'm not quite sure what you are asking here. What do you mean by x and y being 90 "percent" correlated? Do you mean you get a correlation coefficient of .9?
Beyond this clarification, you can absolutely have a situation where the averages of two variables are more highly correlated than the underlying raw values. In other words, order matters, so the correlation of the averages is not the average of the correlation. For example, the R code below shows that if we took 3 measurements each hour for 2 hours (6 measurements total), the overall correlation is about .5, while the correlation of the average hourly measures is a perfect 1. Essentially, when you take the correlation of averages you remove the effect of how your measurement values are ordered within the interval you are averaging over, and that ordering turns out to matter a great deal for the correlation. Let me know if I missed something about your question, though.
X <- c(1,2,3,4,5,6)
Y <- c(3,2,1,6,5,4)
cor(X, Y)                        # minute-level correlation: about 0.54

HourAvgX <- c(mean(X[1:3]), mean(X[4:6]))
HourAvgY <- c(mean(Y[1:3]), mean(Y[4:6]))
cor(HourAvgX, HourAvgY)          # correlation of the hourly averages: 1
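Coming back to the original question of finding the aggregation interval at which the correlation reaches .9, one rough approach - a sketch, with cor_by_window and the synthetic series being made-up illustrations - is to compute the correlation of block averages over a grid of window sizes and see where it crosses .9, assuming x and y are equal-length numeric vectors of the one-minute data:
# Correlation of block averages as a function of window size (in minutes).
cor_by_window <- function(x, y, windows = c(1, 5, 15, 30, 60, 240, 1440)) {
  sapply(windows, function(w) {
    n <- floor(length(x) / w) * w          # drop the incomplete final block
    g <- rep(seq_len(n / w), each = w)     # block labels
    cor(tapply(x[1:n], g, mean), tapply(y[1:n], g, mean))
  })
}

# Illustration with synthetic data:
set.seed(1)
x <- cumsum(rnorm(10000))
y <- x + rnorm(10000, sd = 20)
cor_by_window(x, y)    # the correlation typically rises as the window lengthens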
Here is a short description of the problem I am trying to solve: I have test data for multiple variables (weight, thickness, absorption, etc.) that are taken at varying intervals over time - no set schedule, sometimes a test a day, sometimes days go by between tests. I want to detect trends in each of these and alert stakeholders when any parameter is trending up or down by more than a certain amount. I first fit a linear model between each variable's raw data and test time (converted to days or weeks since a fixed date) and created a table with slopes for each variable, so the stakeholders can view one table for all variables and quickly see if any of them raises concern.

The issue is that the data for most variables are very noisy. Someone suggested using time series functions, separating noise and seasonality from the trend, and studying the trend component for a cleaner analysis. I started to look into this and already see a couple of concerns/questions:
1. Time series analysis seems to require specifying a frequency - how do you handle this if your test data is not taken at regular intervals?
2. If one gets over the issue in #1 above, decomposes the data, and gets the trend separated out (i.e. takes out particularly the random variation/noise), how would you then get a slope metric from that? Namely, if I wanted to then fit a linear model to the trend component of the raw data (after decomposing), what would be the x (independent) variable? Is there a way to connect the trend component of the ts-decompose function with the original data's x-axis data (in this case the actual test date/times, say converted to weeks or days from a fixed date)?
3. Finally, is there a better way of accomplishing what I explained above? I am only looking for general trends over time - say over 3 months of data, not day-to-day trends.
Time series models are generally used to see whether previous observations of a variable have influence on future observations: you model under the assumption that previous observations are able to predict future ones. That is the reason that most (not all) time series models require evenly spaced training data. If your data are not only very noisy but also not collected on a regular basis, then you should seriously consider whether a time series model is the appropriate choice.
Time series analysis seems to require specifying a frequency - how do you handle this if your test data is not taken at regular intervals.
What you can do is create an aggregate by increasing the time bucket (shift from daily data to a weekly average, for instance) such that every unit of time has an instance of training data. Following your final comment, you could instead work with averages of the observations over the last 3 months; a sketch of the weekly aggregation is shown below.
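For example, a minimal aggregation sketch; the tests data frame and its date and value columns are made up for illustration:
# Bucket irregularly spaced tests into weekly averages.
set.seed(1)
tests <- data.frame(
  date  = as.Date("2021-01-01") + sort(sample(0:180, 60)),
  value = rnorm(60)
)
tests$week <- cut(tests$date, breaks = "week")              # label each test with its week
weekly <- aggregate(value ~ week, data = tests, FUN = mean)
head(weekly)
# Note: weeks in which no test was run simply do not appear and may need to be filled in.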
If one gets over the issue in #1 above, decomposes the data, and gets the trend separated out (ie. take out particularly the random variation/noise), how would you then get a slope metric from that? Namely, if I wanted to then fit a linear model to the trend component of the raw data (after decomposing), what would be the x (independent) variable?
In the simplest case of a linear model, the independent variable is the unit of time corresponding to the prediction you are trying to make. However, this is not always regarded as a time series model.
In the case of an autoregressive model, the predictor would be the previous observation of the quantity you are trying to predict, something like y(t) = phi * y(t-1), where phi is a smoothing coefficient. I encourage you to read Forecasting: Principles and Practice, which is an excellent book on the matter.
Is there a way to connect the trend component of the ts-decompose function with the original data's x-axis data (in this case the actual test date/times, say converted to weeks or days from a fixed date)?
The decompose() function returns a list which includes trend: a series of estimated trend components, each corresponding to its respective time value.
Let's create an example time series with a linear trend:
df <- data.frame(
  date = seq(from = as.Date("2021-01-01"), to = as.Date("2021-01-10"), by = 1)
)
df$value <- jitter(seq(from = 1, to = nrow(df), by = 1))  # linear trend plus a little noise
time_series <- ts(df$value, frequency = 5)
df$trend <- decompose(time_series)$trend                  # moving-average trend estimate
> df
date value trend
1 2021-01-01 0.9170296 NA
2 2021-01-02 1.8899565 NA
3 2021-01-03 3.0816892 2.992256
4 2021-01-04 4.0075589 4.042486
5 2021-01-05 5.0650478 5.046874
6 2021-01-06 6.1681775 6.051641
7 2021-01-07 6.9118942 7.074260
8 2021-01-08 8.1055282 8.041628
9 2021-01-09 9.1206522 NA
10 2021-01-10 9.9018900 NA
As you can see, the trend component is already an estimate of the dependent variable at the corresponding time. In decompose() the trend estimate is based on a centred moving average, which is also why the first and last values are NA.
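To then get a slope metric (question #2), one option - a sketch building directly on the df above - is to keep the trend column alongside the original dates and regress it on time measured from a fixed date:
df$days <- as.numeric(df$date - min(df$date))   # days since the first test
fit <- lm(trend ~ days, data = df)              # rows with NA trend are dropped automatically
coef(fit)["days"]                               # estimated slope of the trend, per day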
I have a time series and would like to find the period that has the lowest contiguous variability, i.e. the period in which the rolling SD hovers around the minimum for the longest consecutive time steps.
library(zoo)   # for rollapply
test <- c(10,12,14,16,13,13,14,15,15,14,16,16,16,16,16,16,16,15,14,15,12,11,10)
rol <- rollapply(test, width = 4, FUN = sd)
rol
I can easily see from the data or the graph that the longest period with the lowest variability starts at t=11. Is there a function that can help me find this period of sustained low variability, perhaps automatically trying different sizes for the rolling window? I am not interested in finding the time step with the lowest SD, but rather a period where this low SD is more consistent than elsewhere.
All I can think of for now is looking at the differences rol[i+1] - rol[i], looping through the vector and using a counter to find runs of consecutive low SD values. I was also thinking of using cluster analysis, something like kmeans(rol, 5), but I can have long, complex time series and I would have to pick the number of clusters manually.
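One way to make "consistently low SD" concrete - a sketch rather than a built-in function, and the 25% quantile cutoff is an arbitrary assumption - is to threshold the rolling SD and use rle() to find the longest run of below-threshold values:
library(zoo)

test <- c(10,12,14,16,13,13,14,15,15,14,16,16,16,16,16,16,16,15,14,15,12,11,10)
rol  <- rollapply(test, width = 4, FUN = sd)

thr  <- quantile(rol, 0.25)   # "low variability" cutoff: an arbitrary choice
low  <- rol <= thr
runs <- rle(low)              # lengths of consecutive TRUE/FALSE stretches

best  <- which(runs$values)[which.max(runs$lengths[runs$values])]
start <- sum(runs$lengths[seq_len(best - 1)]) + 1   # start position within rol
len   <- runs$lengths[best]                         # length of the longest low-SD run
c(start = start, length = len)
# Note: positions refer to rol, which is offset from the original series by the
# rolling-window alignment (align = "center" by default in rollapply).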
I have been experimenting with an R package called rnn.
The code is available here:
https://github.com/bquast/rnn
It has a very nice example for financial time series prediction.
I have read the code and I understand that it uses the sequence of the time series to predict the value of the instrument one day in advance.
The following is an example run with 10 hidden nodes and 200 epochs:
[Figure: RNN financial time series prediction]
What I would expect as a result is that the algorithm succeeds, at least in part, in predicting the value of the instrument in advance.
From what I can see, it apparently only approximates the value of the time series at the current day, without giving any prediction for the next day.
Is my expectation wrong?
This code is very simple; how would you improve it?
y <- X[,1:input$training_amount+input$prediction_gap,as.numeric(input$target)]
matrix(y, ncol=input$training_amount)
y.train moves all the data forward by the prediction gap, so that is what is being trained on - next-day data for the currency pair you care about. Because : binds more tightly than +, 1:input$training_amount + input$prediction_gap selects columns (1 + prediction_gap) through (training_amount + prediction_gap); the first prediction_gap data points fall off the front, and all the data gets moved forward by the prediction_gap.
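A quick way to see that indexing behaviour in the console (plain base R operator precedence, independent of the rnn package):
training_amount <- 5
prediction_gap  <- 2
1:training_amount + prediction_gap      # 3 4 5 6 7   - ":" is evaluated before "+"
1:(training_amount + prediction_gap)    # 1 2 3 4 5 6 7 - what parentheses would give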
I am a biologist working in aquaculture nutrition research, and until recently I hadn't paid much attention to the power of statistics. The usual method of analysis had been to run ANOVA on the final weights of animals given various treatments and boom, you have a result. I have tried to improve on this by designing an experiment that tracks individuals' growth over time, but I am having a really hard time understanding which model to use for the data I have.
A simplified explanation of my experiment: I have 900 abalone/snails sourced from a single cohort (spawned/born at the same time). I individually marked each abalone (id) and recorded a length and weight at Time 0. The animals were then randomly assigned to 1 of 6 treatment diets (n=30 abalone per treatment per replicate), each replicated n=5 times (n=150 abalone per treatment in total). Each replicate looks like a randomized block design: each treatment appears only once within each block and is assigned to an independent tank with n=30 abalone per tank (one tank per treatment within each block). Abalone were fed a known amount of feed for 90 days before being weighed and measured again (Time 1). They are back in their homes for another 90 days before the experiment concludes.
From my understanding:
fixed effects - Time, Treatment
nested random effects - replicate, id
My raw data are entered in long format, with each row being one animal at one time point and columns for Time (0 or 1), Replicate (1-5), Treatment (1-6), Sex (M or F), Animal ID (1-900), Length (mm), Weight (g), and Condition Factor (Weight/Length^2.99 * 5655).
I have used columns from my raw data and converted them to factors and vectors before using the new variables to create a data frame.
id<-as.factor(data.long[,5])
time<-as.factor(data.long[,1])
replicate<-as.factor(data.long[,2])
treatment<-data.long[,3]
weight<-as.vector(data.long[,7])
length<-as.vector(data.long[,6])
cf<-as.vector(data.long[,10])
My data frame is currently in the following structure:
df1<-data.frame(time,replicate,treatment,id,weight,length,cf)
I am struggling to understand how to nest my individual abalone within replicates. I could convert the weight data to change from the initial measurement, but I think the nlme package already accounts for this change when the model is coded correctly. I could also compute a Specific Growth Rate for each animal at Time 1, but this would not allow the Time factor to be used.
lme(weight ~ time*treatment, random = ~1 | id, method = "ML", data = df1)
I would like to structure a mixed-effects model that takes individual animal variability into account so that I can detect statistical differences in weight at Time 1 between treatments.
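For what it's worth, nlme expresses nesting through the grouping formula on the right-hand side of |, so one possible starting point - a sketch using the df1 built above, not a definitive specification for this design - nests id within replicate:
library(nlme)

# Sketch: random intercepts for replicate tanks and for individual animals nested
# within them; fixed effects as in the model attempted above.
fit <- lme(weight ~ time * treatment,
           random = ~ 1 | replicate/id,
           method = "ML",
           data   = df1)
summary(fit)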