Apologies if this is a bit of a simple question, but I haven't been able to find any answer to this over the past week and it's driving me crazy.
Background Info: I have a dataset that tracks the weight of 5 individuals over 5 years. Each year, I have a distribution for the weight of individuals in the group, from which I calculate the mean and standard deviation. Data is as follows:
Year = [2002, 2003, 2004, 2005, 2006]
Weights_2002 = [12, 14, 16, 18, 20]
Weights_2003 = [14, 16, 18, 20, 20]
Weights_2004 = [16, 18, 20, 22, 18]
Weights_2005 = [18, 21, 22, 22, 20]
Weights_2006 = [2, 21, 19, 20, 20]
The Question: How do I project annual distributions of weight for the group the next 10 years? Ideally, I would like the uncertainty about the mean to increase as time goes on. Likewise, I would like the uncertainty about the standard deviation to increase too. Phrased another way, I would like to project the distributions of weight going forward, accounting for both:
Natural variance in the data
Increasing uncertainty over time
Any help would be greatly, greatly appreciated. If anyone can suggest how to do this in R, that would be even better.
Thanks guys!
Absent specific suggestions on how to use the forecasting tools in R (see the comments to your question), here is an alternative approach that uses Monte Carlo simulation.
First, some housekeeping: the value 2 in Weights_2006 is either a typo or an outlier. Since I can't tell which, I will assume it's an outlier and exclude it from the analysis.
Second, you say you want to project the distributions based on increasing uncertainty. But your data doesn't support that.
Year <- c(2002, 2003, 2004, 2005, 2006)
W2 <- c(12, 14, 16, 18, 20)
W3 <- c(14, 16, 18, 20, 20)
W4 <- c(16, 18, 20, 22, 18)
W5 <- c(18, 21, 22, 22, 20)
W6 <- c(NA, 21, 19, 20, 20)        # the value 2 is treated as an outlier and set to NA
df <- rbind(W2, W3, W4, W5, W6)    # rows = years, columns = individuals
df <- data.frame(Year, df)
library(reshape2) # for melt(...)
library(ggplot2)
data <- melt(df,id="Year", variable.name="Individual",value.name="Weight")
ggplot(data) +
  geom_histogram(aes(x=Weight), binwidth=1, fill="lightgreen", colour="grey50") +
  facet_grid(Year~.)
The mean weight goes up over time, but the variance decreases. A look at the individual time series shows why.
ggplot(data, aes(x=Year, y=Weight, color=Individual))+geom_line()
In general, an individual's weight increases linearly with time (about 2 units per year), until it reaches 20, when it stops increasing but fluctuates. Since your initial distribution was uniform, the individuals with lower weight saw an increase over time, driving the mean up. But the weight of heavier individuals stopped growing. So the distribution gets "bunched up" around 20, resulting in a decreasing variance. We can see this in the numbers: increasing mean, decreasing standard deviation.
smry <- function(x)c(mean=mean(x),sd=sd(x))
aggregate(Weight~Year,data,smry)
# Year Weight.mean Weight.sd
# 1 2002 16.0000000 3.1622777
# 2 2003 17.6000000 2.6076810
# 3 2004 18.8000000 2.2803509
# 4 2005 20.6000000 1.6733201
# 5 2006 20.0000000 0.8164966
We can model this behavior using a Monte Carlo simulation.
set.seed(1)
start <- runif(1000, 12, 20)   # 1000 individuals, weights uniform between 12 and 20
X <- start
result <- X
for (i in 2003:2008) {
  X <- X + 2                                       # grow by 2 units per year
  X <- ifelse(X < 20, X, 20) + rnorm(length(X))    # clip at 20, then add N(0,1) noise
  result <- rbind(result, X)
}
result <- data.frame(Year=2002:2008, result)
In this model, we start with 1000 individuals whose weight forms a uniform distribution between 12 and 20, as in your data. At each time step we increase the weights by 2 units. If the result is >20 we clip it to 20. Then we add random noise distributed as N[0,1]. Now we can plot the distributions.
model <- melt(result,id="Year",variable.name="Individual",value.name="Weight")
ggplot(model, aes(x=Weight)) +
  geom_histogram(aes(y=..density..), fill="lightgreen", colour="grey50", bins=20) +
  stat_density(geom="line", colour="blue") +
  geom_vline(data=aggregate(Weight~Year, model, mean),
             aes(xintercept=Weight), colour="red", size=2, linetype=2) +
  facet_grid(Year~., scales="free")
The red dashed lines mark the mean weight in each year.
If you believe that the natural variation in the weight of an individual increases over time, then use N[0,sigma] as the error term in the model, with sigma increasing with Year. The problem is that there is nothing in your data to support that.
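For completeness, here is a minimal variation of the simulation loop above with that change; the 20% annual growth in sigma is purely illustrative, not something estimated from your data.
set.seed(1)
X <- runif(1000, 12, 20)          # same uniform start as before
result <- X
sigma <- 1
for (i in 2003:2008) {
  sigma <- sigma * 1.2                                       # assumed: noise sd grows 20% per year
  X <- ifelse(X + 2 < 20, X + 2, 20) + rnorm(length(X), sd = sigma)
  result <- rbind(result, X)
}
result <- data.frame(Year=2002:2008, result)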
I struggle with multilevel models and have prepared a reproducible example to be clear.
Let's say I would like to predict the height of children after 12 months of follow-up, i.e. their height at month == 12, using their previously measured heights but also their previously measured weights, with a data frame like the following.
df <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
                 month = c(1, 3, 6, 12, 1, 6, 12, 1, 6, 8, 12),
                 weight = c(14, 15, 17, 18, 21, 21, 22, 8, 8, 9, 10),
                 height = c(100, 102, 103, 104, 122, 123, 125, 82, 86, 88, 90))
ID month weight height
1 1 1 14 100
2 1 3 15 102
3 1 6 17 103
4 1 12 18 104
5 2 1 21 122
6 2 6 21 123
7 2 12 22 125
8 3 1 8 82
9 3 6 8 86
10 3 8 9 88
11 3 12 10 90
My plan was to use the following model (obviously I have much more data than 3 patients, and more rows per patient). Because the height measurements are correlated within each patient, I wanted to add a random intercept, (1|ID), but also a random slope, which is why I used (month|ID) (in several examples of predicting student scores I saw the "occasion" or "test day" added as a random slope). So I used the following code.
library(tidymodels)
library(multilevelmod)
library(lme4)
# Specification
mixed_model_spec <- linear_reg() %>%
  set_engine("lmer") %>%
  set_args(na.action = na.exclude, control = lmerControl(optimizer = "bobyqa"))

# Fitting the model
mixed_model_fit <-
  mixed_model_spec %>%
  fit(height ~ weight + month + (month | ID),
      data = df)
My first problem is that if I add "weight" (with its multiple values per ID) as a predictor, I get the message "boundary (singular) fit: see help('isSingular')" (even in my large dataset), whereas if I keep only variables with one value per patient (e.g. sex) I do not have this problem.
Can anyone explain why?
My second problem is that, by training a similar model, I can predict the height of new children at nearly every month (I get a predicted value at month 1, month X, ..., month 12) and compare it to the real values collected in my test set.
However, what I am interested in is predicting the value at month 12 while also using the previous values already available for each patient in the test set (at month 1, month 4, month 6, etc.). In other words, I do not want the model to predict the whole series from scratch (that is, from the patient data used for training alone); I want it to also use the new patient's earlier measurements. How can I write my code to obtain such a prediction?
Thanks a lot for your help!
My first problem is that if I add "weight" (with its multiple values per ID) as a predictor, I get the message "boundary (singular) fit: see help('isSingular')" (even in my large dataset), whereas if I keep only variables with one value per patient (e.g. sex) I do not have this problem. Can anyone explain why?
This happens when the random effects structure is too complex to be supported by the data. Beyond that, it is usually not possible to identify exactly why it happens in some situations and not others; essentially, the model is overfitted. A few things you can try are:
centering the month variable
centering other numeric variables
fitting the model without the correlation between random slopes and intercepts, by using || instead of | (see the sketch below)
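For what it's worth, a sketch of the last two suggestions using plain lme4 (the same formula can also be dropped into the tidymodels fit() call above); on the tiny toy data this will likely still be singular, but on a larger dataset it can help:
library(lme4)
df$month_c <- df$month - mean(df$month)   # centre the time variable

fit_uncorr <- lmer(height ~ weight + month_c + (month_c || ID),   # || = uncorrelated intercept and slope
                   data = df,
                   control = lmerControl(optimizer = "bobyqa"),
                   na.action = na.exclude)
summary(fit_uncorr)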
There are also some related questions and answers here:
https://stats.stackexchange.com/questions/378939/dealing-with-singular-fit-in-mixed-models/379068#379068
https://stats.stackexchange.com/questions/509892/why-is-this-linear-mixed-model-singular/509971#509971
As for the second question, it sounds like you want some kind of time series model. An autoregressive model such as AR(1) might be sufficient, but residual correlation structures like that are not supported by lme4. You could try nlme instead.
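For example, a rough sketch with nlme (variable names as in the question; corCAR1 is the continuous-time AR(1) structure, used here because the visits are unequally spaced, and the random-effects part is illustrative only):
library(nlme)

fit_car1 <- lme(height ~ weight + month,
                random = ~ 1 | ID,                           # or ~ month | ID if the data supports a random slope
                correlation = corCAR1(form = ~ month | ID),  # continuous-time AR(1) residual correlation
                data = df,
                na.action = na.exclude)
summary(fit_car1)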
I found this post and solution on annual growth rates using dplyr really helpful:
R annual rate of change (growth rate) with dplyr lag
Instead of calculating annual rate of change, I want to calculate month-to-month growth rates for bimonthly data from part of the year, from May to September. I think I figured it out, using the post referenced above as a guide. Here is a reproducible example:
# Build toy dataset. It contains the Plant ID, Leaf length, Sampling Month, and Sampling Time within the month (either T1 or T2)
Plant_ID <- c("365","365","365","365","365","365","365","365","365","365")
Leaf_length <- c(4, 10, 15, 17, 20, 25, 30, 50, 45, 47)
Month <- c(5,5,6,6,7,7,8,8,9,9)
Period <- c("T1","T2","T1","T2","T1","T2","T1","T2","T1","T2")
toy_growthrate <- data.frame(Plant_ID, Leaf_length, Month, Period)
#look at dataset
toy_growthrate
# try to calculate bimonthly percentage change (dplyr is needed for %>%, mutate() and lag())
library(dplyr)
toy_growthrate <- toy_growthrate %>%
  mutate(change = (Leaf_length - lag(Leaf_length, 2))/lag(Leaf_length, 2)*100)
#the new column "change" is filled with month-to-month growth
toy_growthrate
However, I am still stuck on how to do this month-to-month growth calculation when some months have bimonthly data but other months have 4 data points (i.e. weekly data). Do I need to take averages of the weekly data to convert it to bimonthly so that all months have the same number of data points?
Here is another code example with this new twist:
# Build toy dataset. It contains the Plant ID, Leaf length, Sampling Month, and Sampling Time within the month (T1, T2, T3, T4 in May; T1 and T2 in the remaining months)
Plant_ID <- c("365","365","365","365","365","365","365","365","365","365","365","365")
Leaf_length <- c(1,2,4, 10, 15, 17, 20, 25, 30, 50, 45, 47)
Month <- c(5,5,5,5,6,6,7,7,8,8,9,9)
Period <- c("T1","T2","T3","T4","T1","T2","T1","T2","T1","T2","T1","T2")
toy_growthrate_with_twist <- data.frame(Plant_ID, Leaf_length, Month, Period)
#look at dataset
toy_growthrate_with_twist
In this new dataset, there are 4 measurements of Leaf length in May, but only 2 in the remaining months. How can I do the month-to-month growth calculation in this case?
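One idea I had (not sure it is the right approach) is to first average May's four measurements down to two, so that every month again has a T1 and a T2, and then reuse the lag(..., 2) calculation. A rough sketch, assuming the rows within each month are ordered by sampling time:
library(dplyr)

bimonthly <- toy_growthrate_with_twist %>%
  group_by(Plant_ID, Month) %>%
  mutate(Half = ifelse(row_number() <= n()/2, "T1", "T2")) %>%   # first vs second half of the month's samples
  group_by(Plant_ID, Month, Half) %>%
  summarise(Leaf_length = mean(Leaf_length), .groups = "drop") %>%
  arrange(Plant_ID, Month, Half) %>%
  group_by(Plant_ID) %>%
  mutate(change = (Leaf_length - lag(Leaf_length, 2))/lag(Leaf_length, 2)*100)

bimonthly
Is something like this reasonable, or is there a better way?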
Thanks!
I'm trying to create an Excel-style one-way data table in R so that I can find the exponent that minimizes the error of a coefficient in an equation. I have a for loop that produces the correct result, but it does something strange that I can't figure out.
Here is an example of the data. I'll use the Pythagorean win formula from baseball and a for loop to find the exponent that minimizes the mean absolute error in the win projections.
## Create Data
Teams <- c("Bulls", "Sharks", "Snakes", "Dogs", "Cats")
Wins <- c(5, 3, 8, 1, 9)
Losses <- 10 - Wins
Win.Pct <- Wins/(Wins + Losses)
Points.Gained <- c(30, 50, 44, 28, 60)
Points.Allowed <- c(28, 74, 40, 92, 25)
season <- data.frame(Teams, Wins, Losses, Win.Pct, Points.Gained, Points.Allowed)
season
## Calculate Scoring Ratio
season$Score.Ratio <- with(season, Points.Gained/Points.Allowed)
## Predict Wins from Scoring Ratio
exponent <- 2
season$Predicted.Wins <- season$Score.Ratio^exponent / (1 + season$Score.Ratio^exponent)
## Calculate Mean Absolute Error
season$Abs.Error <- with(season, abs(Win.Pct - Predicted.Wins))
mae <- mean(season$Abs.Error)
mae
Here is my for loop, which looks at a range of exponent options to see if any of them are better than the exponent of 2 used above. For some strange reason, when I run the for loop it keeps printing the table several times (many of the tables with incorrect results) until finally producing the correct table as the last one. Can anyone explain what is wrong with my for loop and why this is happening?
## Identify potential exponent options that minimize mean absolute error
exp.options <- seq(from = 0.5, to = 3, by = 0.1)
mae.results <- data.frame("Exp" = exp.options, "Results" = NA)
for (i in 1:length(exp.options)) {
  win.pct <- season$Predicted.Wins
  pred.win.pct <-
    (season$Points.Gained/season$Points.Allowed)^exp.options[i] /
    (1 + (season$Points.Gained/season$Points.Allowed)^exp.options[i])
  mae.results[i, 2] <- mean(abs(win.pct - pred.win.pct))
  print(mae.results)
}
I have a vector of numbers, and I would like to sample a number between a given position in the vector and its neighbors, such that the two closest neighbors have the largest impact and the impact decreases with distance from the reference point.
For example, lets say I have the following vector:
vec = c(15, 16, 18, 21, 24, 30, 31)
and my reference is the number 16 in position #2. I would like to sample a number which will be with a high probability between 15 and 16 or (with the same high probability) between 16 and 18. The sampled numbers can be floats. Then, with a decreasing probability to sample a number between 16 and 21, and with a yet lower probability between 16 and 24, and so on.
The position of the reference is not known in advance, it can be anywhere in the vector.
I tried playing with runif and quantiles, but I'm not sure how to design the scores of the neighbors.
Specifically, I wrote the following function but I suspect there might be a better/more efficient way of doing this:
GenerateNumbers <- function(Ind, N) {
  dist <- 1/abs(Ind - 1:length(N))    # inverse-distance weights relative to the reference position
  dist <- dist[!is.infinite(dist)]    # drop the Inf at the reference position itself
  dist <- dist/sum(dist)
  sum(dist)                           # sanity check --> 1
  V <- numeric(length(N) - 1)
  for (i in 1:(length(N) - 1)) {
    V[i] <- runif(1, N[i], N[i + 1])  # one uniform draw from each interval between neighbours
  }
  sample(V, 1, prob = dist)           # pick one of the interval draws, weighted by dist
}
where Ind is the position of the reference number (16 in this case) and N is the vector. dist is a way of weighting the probabilities so that the closer neighbors have a higher impact.
Improvements upon this code would be highly appreciated!
I would go with a truncated Gaussian random sample generator, such as in the truncnorm package. On your example:
# To install it: install.packages("truncnorm")
library(truncnorm)
vec <- c(15, 16, 18, 21, 24, 30, 31)
x <- rtruncnorm(n=100, a=vec[1], b=vec[7], mean=vec[2], sd=1)
A histogram of the generated sample shows that it satisfies the stated requirements.
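For a quick visual check (base graphics; the bin count is arbitrary):
hist(x, breaks = 20, main = "Samples around vec[2] = 16", xlab = "Sampled value")
abline(v = vec[2], col = "red", lty = 2)   # the reference point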
Suppose I have 100 marbles, and 8 of them are red. I draw 30 marbles, and I want to know the probability that at least five of them are red. I am currently using http://stattrek.com/online-calculator/hypergeometric.aspx and I entered 100, 8, 30, and 5 for population size, number of successes, sample size, and number of successes in sample, respectively. So the probability I'm interested in is the cumulative probability $P(X \geq 5)$, which equals 0.050 in this case. My question is, how do I calculate this in R?
I tried
> 1-phyper(5, 8, 92, 30, lower.tail = TRUE)
[1] 0.008503108
But this is very different from the previous answer.
phyper(5, 8, 92, 30) gives the probability of drawing five or fewer red marbles.
1 - phyper(5, 8, 92, 30) thus returns the probability of getting six or more red marbles
Since you want the probability of getting five or more (i.e. more than 4) red marbles, you should use one of the following:
1 - phyper(4, 8, 92, 30)
[1] 0.05042297
phyper(4, 8, 92, 30, lower.tail=FALSE)
[1] 0.05042297
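As a cross-check, summing the point probabilities P(X = 5) through P(X = 8) directly gives the same value (dhyper takes the same m, n, k arguments as phyper):
sum(dhyper(5:8, 8, 92, 30))   # should match 0.05042297 above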
Why use 1 - phyper(..., lower.tail = TRUE) when it is easier to use phyper(..., lower.tail = FALSE)? Even though the two are mathematically equivalent, there are also numerical reasons for preferring the latter.
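To illustrate the numerical point with an exaggerated example far out in the tail (unrelated to the marbles above):
1 - phyper(90, 100, 900, 100)                   # the subtraction cancels to exactly 0
phyper(90, 100, 900, 100, lower.tail = FALSE)   # computes the tiny upper tail directly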
Does that fix your problem? I believe you are putting the correct inputs into the phyper function. Is it possible that you're looking at the wrong output on the website you linked?