I tried to ask these questions through imputations, but I want to see if this can be done with predictive modelling instead. I am trying to use information from 2003-2004 NHANES to predict future NHANES cycles. For some context, in 2003-2004 NHANES measured blood contaminants in individual people's blood. In this cycle, they also measured things such as triglycerides, cholesterol etc. that influence the concentration of these blood contaminants.
The first step in my workflow is the impute missing blood contaminant concentrations in 2003-2004 using the measured values of triglycerides, cholesterol etc. This is an easy step and very straightforward. This will be my training dataset.
For future NHANES years (for example 2005-2006), they took individual blood samples combined them (or pooled in other words) and then measured blood contaminants. I need to figure out what the individual concentrations were in these cycles. I have individual measurements for triglycerides, cholesterol etc. and the pooled value is considered the mean. Could I use the mean, 2003-2004 data to unpool or predict the values? For example, if a pool contains 8 individuals, we know the mean, the distribution (2003-2004) and the other parameters (triglycerides) which we can use in the regression to estimate the blood contaminants in those 8 individuals. This would be my test dataset where I have the same contaminants as in the training dataset, with a column for the number of individuals in each pool and the mean value. Alternatively, I can create rows of empty values for contaminants, add mean values separately.
I can easily run MICE, but I need to make sure that the distribution of the imputed data matches 2003-2004 and that the average of the imputed 8 individuals from the pools is equal to the measured pool. So the 8 values for each pool, need to average to the measured pool value while the distribution has to be the same as 2003-2004.
Does that make sense? Happy to provide context if need be. There is an outline code below.
library(mice)
library(tidyverse)
library(VIM)
#Papers detailing these functions can be found in MICE Cran package
df <- read.csv('2003_2004_template.csv', stringsAsFactors = TRUE, na.strings = c("", NA))
#Checking out the NA's that we are working with
non_detect_summary <- as.data.frame(df %>% summarize_all(funs(sum(is.na(.)))))
#helpful representation of ND
aggr_plot <- aggr(df[, 7:42], col=c('navyblue', 'red'),
numbers=TRUE,
sortVars=TRUE,
labels=names(df[, 7:42]),
cex.axis=.7,
gap=3,
ylab=c("Histogram of Missing Data", "Pattern"))
#Mice time, m is the number of imputed datasets (you can think of this as # of cycles)
#You can check out what regression methods below in console
methods(mice)
#Pick Method based on what you think is the best method. Read up.
#Now apply the right method
imputed_data <- mice(df, m = 30)
summary(imputed_data)
#if you want to see imputed values
imputed_data$imp
#finish the dataset
finished_imputed_data <- complete(imputed_data)
#Check for any missing values
sapply(finished_imputed_data, function(x) sum(is.na(x))) #All features should have a value of zero
#Helpful plot is the density plot. The density of the imputed data for each imputed dataset is showed
#in magenta while the density of the observed data is showed in blue.
#Again, under our previous assumptions we expect the distributions to be similar.
densityplot(x = imputed_data, data = ~ LBX028LA+LBX153LA+LBX189LA)
#Print off finished dataset
write_csv(finished_imputed_data, "finished_imputed_data.csv")
#This is where I need to use the finished_imputed_data to impute the values in the future years.
I have a community matrix (species as columns, samples as rows) from which I would like to generate a species accumulation curve (SAC) using the specaccum() and fitspecaccum() functions in R's vegan package. In order for the resulting SAC and cumulative species richness at sample X to be comparable among regions (I have 1 community matrix per region), I need to have specaccum() choose the same number of sets within each region. My problem is that some regions have a larger number of sets than others. I would like to limit the sample size to the minimum number of sets among regions (in my case, the minimum number of sets is 45, so I would like specaccum() to randomly sample 45 sets, 100 times (set permutations=100) for each region. I would like to sample from the entire data set available for each region. The code below has not worked... it doesn't recognize "subset=45". The vegan package info says "subset" needs to be logical... I don't understand how subset number can be logical, but maybe I am misinterpreting what subset is... Is there another way to do this? Would it be sufficient to run specaccum() for the entire number of sets available for each region and then just truncate the output to 45?
require(vegan)
pool1<-specaccum(comm.matrix, gamma="jack1", method="random", subet=45, permutations=100)
Any help is much appreciated.
Why do you want to limit the function to work in a random sample of 45 cases? Just use the species accumulation up to 45 cases. Taking a random subset of 45 cases gives you the same accumulation, except for the random error of subsampling and throwing away information. If you want to compare your different cases, just compare them at the sample size that suits all cases, that is, at 45 or less. That is the idea of species accumulation models.
The subset is intended for situations where you have (possibly) heterogeneous collection of sampling units, and you want to stratify data. For instance, if you want to see only the species accumulation in the "OldLow" habitat type of the Barro Colorado data, you could do:
data(BCI, BCI.env)
plot(specaccum(BCI, subset = BCI.env$Habitat == "OldLow"))
If you want to have, say, a subset of 30 sample plots of the same data, you could do:
take <- c(rep(TRUE, 30), rep(FALSE, 20))
plot(specaccum(BCI)) # to see it all
# repeat the following to see how taking subset influences
lines(specaccum(BCI, subset = sample(take)), col = "blue")
If you repeat the last line, you see how taking random subset influences the results: the lines are normally within the error bars of all data, but differ from each other due to random error.
I have many data sets with known outliers (big orders)
data <- matrix(c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3","10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2","13Q3","13Q4","14Q1","14Q2","14Q3","14Q4","15Q1", 155782698, 159463653.4, 172741125.6, 204547180, 126049319.8, 138648461.5, 135678842.1, 242568446.1, 177019289.3, 200397120.6, 182516217.1, 306143365.6, 222890269.2, 239062450.2, 229124263.2, 370575384.7, 257757410.5, 256125841.6, 231879306.6, 419580274, 268211059, 276378232.1, 261739468.7, 429127062.8, 254776725.6, 329429882.8, 264012891.6, 496745973.9, 284484362.55),ncol=2,byrow=FALSE)
The top 11 outliers of this specific series are:
outliers <- matrix(c("14Q4","14Q2","12Q1","13Q1","14Q2","11Q1","11Q4","14Q2","13Q4","14Q4","13Q1",20193525.68, 18319234.7, 12896323.62, 12718744.01, 12353002.09, 11936190.13, 11356476.28, 11351192.31, 10101527.85, 9723641.25, 9643214.018),ncol=2,byrow=FALSE)
What methods are there that i can forecast the time series taking these outliers into consideration?
I have already tried replacing the next biggest outlier (so running the data set 10 times replacing the outliers with the next biggest until the 10th data set has all the outliers replaced).
I have also tried simply removing the outliers (so again running the data set 10 times removing an outlier each time until all 10 are removed in the 10th data set)
I just want to point out that removing these big orders does not delete the data point completely as there are other deals that happen in that quarter
My code tests the data through multiple forecasting models (ARIMA weighted on the out sample, ARIMA weighted on the in sample, ARIMA weighted, ARIMA, Additive Holt-winters weighted and Multiplcative Holt-winters weighted) so it needs to be something that can be adapted to these multiple models.
Here are a couple more data sets that i used, i do not have the outliers for these series yet though
data <- matrix(c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3","10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2","13Q3","13Q4","14Q1","14Q2","14Q3", 26393.99306, 13820.5037, 23115.82432, 25894.41036, 14926.12574, 15855.8857, 21565.19002, 49373.89675, 27629.10141, 43248.9778, 34231.73851, 83379.26027, 54883.33752, 62863.47728, 47215.92508, 107819.9903, 53239.10602, 71853.5, 59912.7624, 168416.2995, 64565.6211, 94698.38748, 80229.9716, 169205.0023, 70485.55409, 133196.032, 78106.02227), ncol=2,byrow=FALSE)
data <- matrix(c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3","10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2","13Q3","13Q4","14Q1","14Q2","14Q3",3311.5124, 3459.15634, 2721.486863, 3286.51708, 3087.234059, 2873.810071, 2803.969394, 4336.4792, 4722.894582, 4382.349583, 3668.105825, 4410.45429, 4249.507839, 3861.148928, 3842.57616, 5223.671347, 5969.066896, 4814.551389, 3907.677816, 4944.283864, 4750.734617, 4440.221993, 3580.866991, 3942.253996, 3409.597269, 3615.729974, 3174.395507),ncol=2,byrow=FALSE)
If this is too complicated then an explanation of how, in R, once outliers are detected using certain commands, the data is dealt with to forecast. e.g smoothing etc and how i can approach that writing a code myself (not using the commands that detect outliers)
Your outliers appear to be seasonal variations with the largest orders appearing in the 4-th quarter. Many of the forecasting models you mentioned include the capability for seasonal adjustments. As an example, the simplest model could have a linear dependence on year with corrections for all seasons. Code would look like:
df <- data.frame(period= c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3",
"10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2",
"13Q3","13Q4","14Q1","14Q2","14Q3","14Q4","15Q1"),
order= c(155782698, 159463653.4, 172741125.6, 204547180, 126049319.8, 138648461.5,
135678842.1, 242568446.1, 177019289.3, 200397120.6, 182516217.1, 306143365.6,
222890269.2, 239062450.2, 229124263.2, 370575384.7, 257757410.5, 256125841.6,
231879306.6, 419580274, 268211059, 276378232.1, 261739468.7, 429127062.8, 254776725.6,
329429882.8, 264012891.6, 496745973.9, 42748656.73))
seasonal <- data.frame(year=as.numeric(substr(df$period, 1,2)), qtr=substr(df$period, 3,4), data=df$order)
ord_model <- lm(data ~ year + qtr, data=seasonal)
seasonal <- cbind(seasonal, fitted=ord_model$fitted)
library(reshape2)
library(ggplot2)
plot_fit <- melt(seasonal,id.vars=c("year", "qtr"), variable.name = "Source", value.name="Order" )
ggplot(plot_fit, aes(x=year, y = Order, colour = qtr, shape=Source)) + geom_point(size=3)
which gives the results shown in the chart below:
Models with a seasonal adjustment but nonlinear dependence upon year may give better fits.
You already said you tried different Arima-models, but as mentioned by WaltS, your series don't seem to contain big outliers, but a seasonal-component, which is nicely captured by auto.arima() in the forecast package:
myTs <- ts(as.numeric(data[,2]), start=c(2008, 1), frequency=4)
myArima <- auto.arima(myTs, lambda=0)
myForecast <- forecast(myArima)
plot(myForecast)
where the lambda=0 argument to auto.arima() forces a transformation (or you could take log) of the data by boxcox to take the increasing amplitude of the seasonal-component into account.
The approach you are trying to use to cleanse your data of outliers is not going to be robust enough to identify them. I should add that there is a free outlier package in R called tsoutliers, but it won't do the things I am about to show you....
You have an interesting time series here. The trend changes over time with the upward trend weakening a bit. If you bring in two time trend variables with the first beginning at 1 and another beginning at period 14 and forward you will capture this change. As for seasonality, you can capture the high 4th quarter with a dummy variable. The model is parsimonios as the other 3 quarters are not different from the average plus no need for an AR12, seasonal differencing or 3 seasonal dummies. You can also capture the impact of the last two observations being outliers with two dummy variables. Ignore the 49 above the word trend as that is just the name of the series being modeled.