I am new to coding as well as to posting on forums, but I will do my best to explain the problem and give enough background so that you're able to help me work through it. I have done a lot of searching for solutions to similar errors, but the code that produces them and the format of the data behind them are very different from mine.
I am working with biological data that consists of various growth categories, but all that I am interested in is length (SCL in my code) and age (Age in my code). I have many length and age estimates for each individual through time, and I am fitting a linear nlme model to the juvenile ages and a von Bertalanffy curve to the mature ages. My juvenile model works just fine, and I extracted h (slope of the line) and t (x intercept). I now need to use those parameters as well as T (known age at maturity) to fit the mature stage. The mature model will estimate K (this is my only unknown). I have included a subset of my data for one individual (ID50). This is information for only the mature years, with the h and t from its juvenile fit appended in the rightmost columns.
Subset of my data:
This didn't format very well but I'm not sure how else to display it
Grouped Data: SCL ~ Age | ID
ID  SCL   Age    Sex     Location  MeanSCL  Growth  Year  Status  T      h         t
50  86.8  27.75  Female  VA        86.8     0.2     1994  Mature  27.75  1.807394  -19.83368
50  86.9  28.75  Female  VA        87.1     0.4     1995  Mature  27.75  1.807394  -19.83368
50  87.3  29.75  Female  VA        87.5     0.5     1996  Mature  27.75  1.807394  -19.83368
50  87.8  30.75  Female  VA        88       0.4     1997  Mature  27.75  1.807394  -19.83368
50  88.1  31.75  Female  VA        88.1     0       1998  Mature  27.75  1.807394  -19.83368
50  88.1  32.75  Female  VA        88.2     0       1999  Mature  27.75  1.807394  -19.83368
50  88.2  33.75  Female  VA        88.3     0.2     2000  Mature  27.75  1.807394  -19.83368
50  88.4  34.75  Female  VA        88.4     0.1     2001  Mature  27.75  1.807394  -19.83368
50  88.4  35.75  Female  VA        88.4     0       2002  Mature  27.75  1.807394  -19.83368
50  88.5  36.75  Female  VA        88.5     0       2003  Mature  27.75  1.807394  -19.83368
This is the growth function:
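vbBiphasic = function(Age,h,T,t,K) {
  # Predicted length at Age for the mature phase; h and t come from the juvenile fit, T is age at maturity
  (h/(exp(K)-1))*(1-exp(K*((T+log(1-(exp(K)-1)*(T-t))/K)-Age)))
}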
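A quick way to sanity-check a growth function like this before handing it to nlme is to evaluate it at the age at maturity. Algebraically the curve should meet the juvenile line there, i.e. return h*(T - t) whatever K is. The sketch below assumes the h, T and t values shown for ID50; the names vb_biphasic, T_mat and t_int are made up here (T_mat avoids masking R's shorthand for TRUE). It also shows that the log term only stays defined for K below log(1 + 1/(T - t)).

```r
# Same formula as vbBiphasic, split into named pieces (hypothetical names)
vb_biphasic <- function(Age, h, T_mat, t_int, K) {
  Linf <- h / (exp(K) - 1)                                   # asymptotic length
  t0   <- T_mat + log(1 - (exp(K) - 1) * (T_mat - t_int)) / K  # theoretical age at length 0
  Linf * (1 - exp(K * (t0 - Age)))
}

# Values taken from the ID50 rows in the question
h     <- 1.807394
T_mat <- 27.75
t_int <- -19.83368

# At Age = T_mat the curve should equal the juvenile line for any admissible K
K_try         <- c(0.005, 0.01, 0.02)
at_maturity   <- vb_biphasic(T_mat, h, T_mat, t_int, K_try)
juvenile_line <- h * (T_mat - t_int)

# Largest K for which the argument of log() stays positive
K_max <- log(1 + 1 / (T_mat - t_int))

print(at_maturity)
print(juvenile_line)
print(K_max)
```

If that reasoning holds, any starting value or estimate for K above K_max would make the function return NaN for this individual, which is worth knowing before fitting.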
This is the original growth model that SHOULD have fit:
ID50 refers to my subsetted dataset with only individual 50
VB_mat <- nlme(SCL~vbBiphasic(Age,h,T,t,K),
data = ID50,
fixed = list(K~1),
random = K~1,
start = list(fixed=c(K~.01))
)
However this model produces the error:
Error in pars[, nm] : incorrect number of dimensions
Which tells me that it's trying to estimate a different number of parameters than I have (I think). Originally I was fitting it to all mature individuals (but for the sake of simplification I'm now trying to fit to one). Here are all of my variations to the model code; ALL of them produced the same error:
Inputting averaged values of (Age, h, T, t, K) for the whole population instead of the variables.
Using a subset of 5 individuals, with both (Age, h, T, t, K) and the averaged values for those individuals for each variable.
Using 5 different individuals separately, with both (Age, h, T, t, K) and their actual values for those variables (all run individually, i.e. 10 different strings of code, just in case some worked and others didn't... but none did).
Telling the model to estimate all parameters, not just K.
Eliminating all parameters except K.
Turning all values into vectors (that's what one forum post about a similar error said to do).
Most of these were in an effort to change the number of parameters that R thought it needed to estimate, however none have worked for me.
I'm no expert on nlme and often have similar problems when fitting models, especially when you cannot use nlsList to get started. My guess is that you have 4 parameters in your function (h, T, t, K), but you are only estimating one of them, as both a fixed effect and a random effect. I believe this then constrains the other parameters to zero, which would in effect eliminate them from the model (but you still have them in the model!). Usually you include all the parameters as fixed, and then decide how many of them should also have a random effect. So I would include all 4 in the fixed argument and the start argument. Since you have 4 parameters, each one has to be either fixed or random, or both; otherwise, how can they be in the model?
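As a cross-check that does not depend on nlme at all, K for a single individual can be estimated by plain least squares with h, T and t held at their known values. This is a different technique from the nlme fit in the question, sketched here only to see what K the ID50 data support. The names sse, T_mat and t_int are made up for the example, and the search interval assumes K must lie between 0 and log(1 + 1/(T - t)) so that the log term is defined.

```r
# ID50 mature-stage data copied from the question
id50 <- data.frame(
  Age = 27.75 + 0:9,
  SCL = c(86.8, 86.9, 87.3, 87.8, 88.1, 88.1, 88.2, 88.4, 88.4, 88.5)
)
h     <- 1.807394   # juvenile slope
T_mat <- 27.75      # age at maturity (renamed from T)
t_int <- -19.83368  # juvenile x intercept (renamed from t)

# Growth function from the question, unchanged apart from argument names
vb <- function(Age, K) {
  (h / (exp(K) - 1)) *
    (1 - exp(K * ((T_mat + log(1 - (exp(K) - 1) * (T_mat - t_int)) / K) - Age)))
}

# Sum of squared residuals as a function of the single unknown K
sse <- function(K) sum((id50$SCL - vb(id50$Age, K))^2)

# One-dimensional search over the admissible range of K
K_max <- log(1 + 1 / (T_mat - t_int))
opt   <- optimize(sse, interval = c(1e-4, K_max - 1e-6))

print(opt$minimum)    # least-squares estimate of K for this individual
print(opt$objective)  # residual sum of squares at that K
```

A value obtained this way could also serve as a starting value for K in a mixed model across individuals.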
I would like to be able to acquire overall estimates from GAMs of the type provided by emmeans, in order to plot these fitted values and their confidence intervals, and then do some subsequent analysis.
I am working on a similar dataset to the one described here: https://rpubs.com/bbolker/ratgrowthcurves . I see at the end of the document the author notes that how best to get out overall estimates from the model is not resolved, but that one option might be emmeans. So here I am posting an example to see if people think this approach is correct, or if they could suggest a better package and method.
I will use the 'Orange' dataset as an example, but to make it fit my question let's first add a 'variety' factor:
data(Orange)
temp<-character(nrow(Orange))  # one label per row; Orange has 35 rows
temp[c(8:14,22:28)]<-"Tall"
temp[c(1:7,15:21,29:35)]<-"Short"
Orange$variety<-as.factor(temp)
head(Orange)
# Tree age circumference variety
#1 1 118 30 Short
#2 1 484 58 Short
#3 1 664 87 Short
#4 1 1004 115 Short
#5 1 1231 120 Short
#6 1 1372 142 Short
Create a GAM with 'variety' as a factor and 'tree' as a random effect:
library(mgcv)
ex.mod<-gam(circumference~s(age,k=7,by=variety)+s(Tree,bs="re")+variety,method="REML",data=Orange)
The ggeffects package seems to provide nice functionality for a plot via emmeans:
library(ggeffects)
library(emmeans)
plotme<-ggemmeans(ex.mod,terms=c("age","variety"))
plot(plotme)
Next I can extract the estimated marginal means themselves, for example over a range of ages:
emmeans(ex.mod,specs=c("age","variety"),at=list(age=seq(from=300,to=1500,by=500)))
# age variety emmean SE df lower.CL upper.CL
# 300 Short 44.8 4.37 27.1 35.9 53.8
# 800 Short 90.4 3.27 27.1 83.7 97.1
# 1300 Short 136.0 3.68 27.1 128.5 143.6
# 300 Tall 51.5 6.24 27.1 38.7 64.3
# 800 Tall 127.3 6.09 27.1 114.8 139.8
# 1300 Tall 189.8 5.56 27.1 178.4 201.2
#Results are averaged over the levels of: Tree
#Confidence level used: 0.95
My questions are:
1) If I am interested in using this GAM to compare, for example, the estimated mean difference between orange trees of variety 'Tall' and variety 'Short' at 800 days of age, is it appropriate to base this pairwise comparison on the emmeans?
2) If several studies have been done on Orange tree growth in different places, and I am interested in meta-analysing the mean difference in circumference between 'Tall' and 'Short' variety trees at certain ages, is it appropriate to use mean differences and variance in the emmeans for meta-analysis?
(emmeans provides the SE, I think this would need to be converted to standard deviation...)
3) ... or does someone have a better suggestion for either of the above?
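For question 1, one possible cross-check that stays within mgcv is to build the linear-predictor matrix for the two varieties at age 800 and take the difference of the two rows. This is not the emmeans computation itself, just a sketch of the same contrast done by hand. The same Tree is used in both rows (an arbitrary choice made here) so that the random-effect columns cancel in the difference, and the variety labels are rebuilt with rep() for brevity.

```r
library(mgcv)

data(Orange)
# Same labelling as in the question: trees in rows 8-14 and 22-28 are "Tall"
Orange$variety <- factor(rep(c("Short", "Tall", "Short", "Tall", "Short"), each = 7))

mod <- gam(circumference ~ s(age, k = 7, by = variety) + s(Tree, bs = "re") + variety,
           method = "REML", data = Orange)

# Two prediction rows that differ only in variety
nd <- data.frame(
  age     = 800,
  variety = factor(c("Short", "Tall"), levels = levels(Orange$variety)),
  Tree    = Orange$Tree[1]
)

# Rows of Xp map coefficients to fitted values, so their difference maps
# coefficients to the Tall-minus-Short contrast
Xp  <- predict(mod, newdata = nd, type = "lpmatrix")
d   <- Xp[2, ] - Xp[1, ]
est <- sum(d * coef(mod))
se  <- sqrt(drop(d %*% vcov(mod) %*% d))  # vcov() of a gam is the Bayesian posterior covariance by default

print(c(estimate = est, se = se, lower = est - 1.96 * se, upper = est + 1.96 * se))
```

The interval uses a normal approximation, which is an assumption of this sketch; emmeans reports t-based limits with its own degrees of freedom, so the two will not match exactly.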
I feel that this is a somewhat complex issue which may not necessarily have a simple solution and may require machine learning or other advanced techniques to resolve.
Firstly, to explain the issue at hand, say we have a runner who participates in a number of outdoor races where the elements (i.e. wind) affect the athlete's speed. If we know the baseline speed of the runner, it’s easy to determine the percentage effect that the elements have had in each race, for example:
Name Baseline Race1 Race2 Race3
1 Runner 100 102 98 106
The contributing element_factors for Race1, Race2 and Race3 are:
[1] 1.02 0.98 1.06
In this example we can see that the runner in Race 1 has had a tail wind which has increased his baseline speed by 2%, etc.
However, in the real world we don’t necessarily know what the runner's baseline speed is, because all we have are their race results to go on, and we don’t necessarily know how the elements are affecting the baseline.
Take for example the race results as listed in the following dataframe
df<-data.frame(Name = c("Runner 1","Runner 2","Runner 3","Runner 4","Runner 5"),
Baseline = c("unknown","unknown","unknown","unknown","unknown"),
# NA is left unquoted so that the race columns stay numeric
Race1 = c(101,NA,80.8,111.1,95.95),
Race2 = c(102,91.8,NA,112.2,NA),
Race3 = c(95,85.5,76,NA,90.25),
Race4 = c(NA,95.4,84.8,116.6,100.7))
      Name Baseline  Race1 Race2 Race3 Race4
1 Runner 1  unknown 101.00 102.0 95.00    NA
2 Runner 2  unknown     NA  91.8 85.50  95.4
3 Runner 3  unknown  80.80    NA 76.00  84.8
4 Runner 4  unknown 111.10 112.2    NA 116.6
5 Runner 5  unknown  95.95    NA 90.25 100.7
What I want to be able to do is calculate (approximate) from this dataframe each runner's baseline speed and the factor relating to each race. The solutions in this case would be:
Baseline<-c(100,90,80,110,95)
[1] 100 90 80 110 95
element_factors<-c(1.01,1.02,0.95,1.06)
[1] 1.01 1.02 0.95 1.06
Setting the baseline speed to the runner's average is overly simplistic, as we can see that some runners only race in events that have a tail wind, and therefore their baseline will fall lower than all their race results.
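One way to look at this without machine learning: if speed = baseline × race factor, then log(speed) = log(baseline) + log(factor), which is an ordinary two-way additive model that lm() can fit even with missing races. The sketch below uses synthetic data built from baselines and factors chosen for illustration, with the same pattern of missing races as the table above, so the recovery can be checked. One caveat it makes visible: results alone only identify the factors relative to one another (here relative to race 1), because multiplying every baseline by a constant and dividing every factor by it gives identical results.

```r
# Illustrative "true" values, used only to build the synthetic results
baseline <- c(100, 90, 80, 110, 95)
factors  <- c(1.01, 1.02, 0.95, 1.06)

# Race results = baseline x factor, with some races not run
speed <- outer(baseline, factors)
speed[cbind(c(1, 2, 3, 4, 5), c(4, 1, 2, 3, 2))] <- NA

# Long format: one row per runner-race result
long <- data.frame(
  runner = factor(as.vector(row(speed))),
  race   = factor(as.vector(col(speed))),
  speed  = as.vector(speed)
)
long <- na.omit(long)

# Additive model on the log scale; race 1 is the reference level
fit <- lm(log(speed) ~ 0 + runner + race, data = long)
co  <- coef(fit)

# Race factors relative to race 1, and baselines expressed at race-1 conditions
rel_factor <- exp(c(race1 = 0, co[grep("^race", names(co))]))
rel_base   <- exp(co[grep("^runner", names(co))])

print(round(rel_factor, 4))
print(round(rel_base, 2))
```

To turn these into absolute baselines, some extra assumption is needed, for example that one particular race had neutral conditions or that the factors average out to 1 on the log scale. With noisy real results the same model gives least-squares estimates rather than an exact recovery.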
Looking at my time trend plot, how can I test the statistical significance of the trend in this simple "years vs rate" ecological data using R? I tried ANOVA, treating the year variable as a factor, and it returned p<0.05, but I'm not satisfied with ANOVA. The article I reviewed suggested Wald statistics to test the time trend, but I have not found any guiding examples on Google yet.
My data:
> head(yrrace)
year racecat rate outcome pop
1 1995 1 14.2 1585 11170482
2 1995 2 8.7 268 3070363
3 1996 1 14.1 1574 11170482
4 1996 2 7.5 230 3070363
5 1997 1 13.3 1482 11170482
6 1997 2 8.3 254 3070363
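One common way to get a Wald test for a linear time trend with count data like this is a Poisson regression of the counts on year, with log(pop) as an offset so that the model is for rates. The sketch below uses only the six rows shown by head(), so the numbers it prints are illustrative rather than a result for the full data; the column year0 is introduced here to centre the years at 1995.

```r
# The six rows printed by head(yrrace) in the question
yrrace <- data.frame(
  year    = c(1995, 1995, 1996, 1996, 1997, 1997),
  racecat = c(1, 2, 1, 2, 1, 2),
  outcome = c(1585, 268, 1574, 230, 1482, 254),
  pop     = c(11170482, 3070363, 11170482, 3070363, 11170482, 3070363)
)
yrrace$year0 <- yrrace$year - 1995  # years since 1995

# Rate model: log(E[outcome] / pop) = a + b * year0 + effect of racecat
fit <- glm(outcome ~ year0 + factor(racecat) + offset(log(pop)),
           family = poisson, data = yrrace)

# Wald test for the trend: z = estimate / standard error
wald <- summary(fit)$coefficients["year0", ]
print(wald)

# Estimated multiplicative change in the rate per year
print(exp(wald[["Estimate"]]))
```

Whether the trend should be allowed to differ by racecat (an interaction term) or whether overdispersion needs handling are separate modelling choices that this sketch does not address.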
This is similar but not equal to Using weights in R to consider the inverse of sampling probability.
I have a long data frame and this is a part of the real data:
age gender labour_situation industry_code FACT FACT_2....
35 M unemployed 15 1510
21 F inactive 00 651
FACT is a variable that means, for the first row, that a male unemployed individual of 35 years represents 1510 individuals of the population.
I need to obtain some tables to show relevant information like the % of employed and unemployed people, etc. In Stata there are some options like tab labour_situation [w=FACT] that shows the number of employed and unemployed people in the population while tab labour_situation shows the number of employed and unemployed people in the sample.
A partial solution could be to repeat the 1st row of the data frame 1510 times and the 2nd row 651 times. From what I've found, one option is to run
longdata <- data[rep(1:nrow(data), data$FACT), ]
employment_table = with(longdata, addmargins(table(labour_situation, useNA = "ifany")))
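For the tables, expanding the data frame is not strictly necessary: xtabs() can sum the FACT column within each category, which gives the same population counts. A small sketch follows; the data frame dat is a made-up stand-in in which the first two rows follow the excerpt above and the other two are invented for illustration.

```r
# Hypothetical miniature of the survey data
dat <- data.frame(
  labour_situation = c("unemployed", "inactive", "employed", "employed"),
  FACT             = c(1510, 651, 1200, 900)
)

# Weighted counts: sum of FACT within each category (population totals)
weighted_counts <- xtabs(FACT ~ labour_situation, data = dat)

# Weighted percentages
weighted_pct <- 100 * prop.table(weighted_counts)

# Unweighted counts for comparison (number of sampled individuals)
sample_counts <- table(dat$labour_situation)

print(addmargins(weighted_counts))
print(round(weighted_pct, 1))
print(sample_counts)
```

This avoids creating a data frame with as many rows as the population, which matters when the weights are large.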
The other thing I need to do is to run a regression having in mind that there was cluster sampling in the following way: the population was divided in regions. This creates a problem: one individual
interviewed in represents people while an individual interviewed in represents people but and are not in proportion to the total population of each region, so some regions will be overrepresented and other regions will be underrepresented. In order to take this into account, each observation should be weighted by the inverse of its probability of being sampled.
The last paragraph means that the model can be estimated with valid equations, BUT the variance-covariance matrix won't be valid unless I consider the inverse of sampling probability.
In Stata it is possible to run a regression by doing reg y x1 x2 [pweight=n], which calculates the right variance-covariance matrix considering the inverse of sampling probability. At the moment I have to use Stata for some parts of my work and R for others. I'd like to use just R.
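As far as I know, a pweight regression in Stata pairs weighted least squares point estimates with a robust (sandwich) variance estimate. Under that assumption, the sketch below reproduces the idea in base R on simulated data: lm() with weights gives the coefficients, and the sandwich formula with an n/(n - k) small-sample factor gives the standard errors. The data, the weight column w and the scaling factor are all assumptions of the example, not something taken from the survey in question.

```r
set.seed(1)

# Simulated stand-in data: y, x1, x2 and a sampling weight w
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), w = runif(n, 1, 10))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)

# Point estimates: weighted least squares
fit <- lm(y ~ x1 + x2, data = d, weights = w)

X <- model.matrix(fit)
e <- residuals(fit)   # raw residuals, y minus fitted values
w <- d$w

# Sandwich variance: (X'WX)^-1  X'W diag(e^2) W X  (X'WX)^-1
bread <- solve(crossprod(X, w * X))
meat  <- crossprod(X * (w * e))
V     <- bread %*% meat %*% bread * n / (n - ncol(X))

se_robust <- sqrt(diag(V))
print(cbind(estimate = coef(fit), robust_se = se_robust))
```

Note that the standard errors printed by summary(fit) treat the weights as precision weights and will generally differ from these. For real survey work with clusters or strata, the survey package (svydesign() to declare the design, svyglm() to fit the model) is the more usual route in R.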
You can do this by repeating the rownames:
df1 <- df[rep(row.names(df), df$FACT), 1:5]
> head(df1)
age gender labour_situation industry_code FACT
1 35 M unemployed 15 1510
1.1 35 M unemployed 15 1510
1.2 35 M unemployed 15 1510
1.3 35 M unemployed 15 1510
1.4 35 M unemployed 15 1510
1.5 35 M unemployed 15 1510
> tail(df1)
age gender labour_situation industry_code FACT
2.781 21 F inactive 0 787
2.782 21 F inactive 0 787
2.783 21 F inactive 0 787
2.784 21 F inactive 0 787
2.785 21 F inactive 0 787
2.786 21 F inactive 0 787
Here, 1:5 refers to the columns to keep. If you leave that bit blank, all columns will be returned.
Apologies for what is likely to be a very basic question; I am very new to R.
I am looking to read values off my augPred plot in order to average them and provide a prediction over a time period.
> head(tthm.groupeddata)
Grouped Data: TTHM ~ Yearmon | WSZ_Code
WSZ_Code Treatment_Code Year Month TTHM CL2_FREE BrO3 Colour PH TURB Yearmon
1 2 3 1996 1 30.7 0.35 0.00030 0.75 7.4 0.055 Jan 1996
2 6 1 1996 2 24.8 0.25 0.00055 0.75 6.9 0.200 Feb 1996
3 7 4 1996 2 60.4 0.05 0.00055 0.75 7.1 0.055 Feb 1996
4 7 4 1996 2 58.1 0.15 NA 0.75 7.5 0.055 Feb 1996
5 7 4 1996 3 62.2 0.20 NA 2.00 7.6 0.055 Mar 1996
6 5 2 1996 3 40.3 0.15 0.00140 2.00 7.7 0.055 Mar 1996
This is my model:
modellme<- lme(TTHM ~ Yearmon, random = ~ 1|WSZ_Code, data=tthm.groupeddata)
and my current plot:
plot(augPred(modellme, order.groups=TRUE),xlab="Date", ylab="TTHM concentration", main="TTHM Concentration with Time for all Water Supply Zones")
I would like a way to read off the graph by either placing lines between a specific time period in a specific WSZ_Code (my group) and averaging the values between this period...
Of course any other way/help or guidance would be much appreciated!
Thanks in advance
I don't think we can tell whether it is "entirely incorrect", since you have not described the question and have not included any data. (The plotting question is close to being entirely incorrect, though.) I can tell you that the answer is NOT to use abline, since augPred objects are plotted with plot.augPred which returns (and plots) a lattice object. abline is a base graphic function and does not share a coordinate system with the lattice device. Lattice objects are lists that can be modified. Your plot probably had different panels at different levels of WSZ_Code, but the location of the desired lines is entirely unclear especially since you trail off with an ellipsis. You refer to "times" but there is no "times" variable.
There are lattice functions such as trellis.focus and update.trellis that allow one to apply modifications to lattice objects. You would first assign the plot object to a named variable, make mods to it and then plot() it again.
help(package='lattice')
?Lattice
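A minimal sketch of that workflow, using the Orthodont data shipped with nlme as a stand-in for the TTHM data (which is not available here): the plot is stored as a trellis object, modified with update(), printed, and then one panel is brought into focus so that lattice panel functions can draw reference lines in its coordinate system. The ages 10 and 12 and the choice of panel are arbitrary.

```r
library(nlme)
library(lattice)

# Stand-in model with the same structure: one covariate, random intercept per group
fit <- lme(distance ~ age, random = ~ 1 | Subject, data = Orthodont)

# plot.augPred returns a trellis object; it can be stored and modified
p  <- plot(augPred(fit))
p2 <- update(p, xlab = "Age (years)", ylab = "Distance (mm)")

pdf(NULL)   # draw to a null device; use a real device or file in practice
print(p2)

# Focus on one panel and add vertical lines marking a period of interest
trellis.focus("panel", column = 1, row = 1, highlight = FALSE)
panel.abline(v = c(10, 12), lty = 2)
trellis.unfocus()

dev.off()
```

Repeating the trellis.focus() step with other column and row values would mark the same period in other panels.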
(If this is a rush job, you might be better off making any calculations by hand and using ImageMagick to edit pdf or png output.)
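Rather than reading values off the plot, the predictions that augPred draws can be computed directly with predict() and then averaged over the period of interest. The sketch below again uses Orthodont from nlme as a stand-in; the subject "M01" and the age range 10 to 12 are arbitrary choices standing in for one WSZ_Code and one time period.

```r
library(nlme)

# Stand-in model: one covariate, random intercept per group
fit <- lme(distance ~ age, random = ~ 1 | Subject, data = Orthodont)

# Grid of covariate values covering the period of interest, for one group
newdat <- data.frame(
  age     = seq(10, 12, by = 0.5),
  Subject = factor("M01", levels = levels(Orthodont$Subject))
)

# level = 1 gives group-specific predictions (fixed effects plus that group's random intercept);
# level = 0 would give population-level predictions
pred <- predict(fit, newdata = newdat, level = 1)

mean_pred <- mean(as.numeric(pred))
print(data.frame(age = newdat$age, predicted = as.numeric(pred)))
print(mean_pred)
```

With a date-like covariate such as Yearmon, the grid would need to be built in the same representation the model was fitted with, which is left open here.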