I am trying to fit some data to a negative binomial model and run a pairwise comparison using emmeans. The data has two different sample sizes, 15 and 20 (num_sample in the example below).
I have set up two data frames: good.data, which produces the expected result of offset() using random sample sizes between 15 and 20, and bad.data, which uses a sample size of either 15 or 20 and seems to produce a factor with levels 15 and 20. The bad.data pairwise comparison produces far too many comparisons compared to good.data, even though they should produce the same number. Why is that?
set.seed(1)
library(dplyr)
library(emmeans)
library(MASS)
# make data that works
data.frame(site=c(rep("A",24),
rep("B",24),
rep("C",24),
rep("D",24),
rep("E",24)),
trt_time=rep(rep(c(10,20,30),8),5),
pre_trt=rep(rep(c(rep("N",3),rep("Y",3)),4),5),
storage_time=rep(c(rep(0,6),rep(30,6),rep(60,6),rep(90,6)),5),
num_sample=sample(c(15,17,20),24*5,T),# more than 2 sample sizes...
bad=sample(c(1:7),24*5,T,c(0.6,0.1,0.1,0.05,0.05,0.05,0.05)))->good.data
# make data that doesn't work
data.frame(site=c(rep("A",24),
rep("B",24),
rep("C",24),
rep("D",24),
rep("E",24)),
trt_time=rep(rep(c(10,20,30),8),5),
pre_trt=rep(rep(c(rep("N",3),rep("Y",3)),4),5),
storage_time=rep(c(rep(0,6),rep(30,6),rep(60,6),rep(90,6)),5),
num_sample=sample(c(15,20),24*5,T),# only 2 sample sizes...
bad=sample(c(1:7),24*5,T,c(0.6,0.1,0.1,0.05,0.05,0.05,0.05)))->bad.data
# fit models
good.data %>%
  mutate(trt_time = factor(trt_time),
         pre_trt = factor(pre_trt),
         storage_time = factor(storage_time)) %>%
  MASS::glm.nb(bad ~ trt_time:pre_trt:storage_time + offset(log(num_sample)),
               data = .) -> mod.good
bad.data %>%
  mutate(trt_time = factor(trt_time),
         pre_trt = factor(pre_trt),
         storage_time = factor(storage_time)) %>%
  MASS::glm.nb(bad ~ trt_time:pre_trt:storage_time + offset(log(num_sample)),
               data = .) -> mod.bad
# pairwise comparison
emmeans::emmeans(mod.good, pairwise ~ trt_time:pre_trt:storage_time + offset(log(num_sample)))$contrasts %>% as.data.frame()
emmeans::emmeans(mod.bad, pairwise ~ trt_time:pre_trt:storage_time + offset(log(num_sample)))$contrasts %>% as.data.frame()
First, I think you should look up how to use emmeans. The intent is not to give a duplicate of the model formula, but rather to specify which factors you want the marginal means of.
However, that is not the issue here. What emmeans does first is to set up a reference grid that consists of all combinations of:
- the levels of each factor
- the average of each numeric predictor, except that if a numeric predictor has just two different values, both of its values are included
It is that exception you have run up against. Since num_sample has just two values, 15 and 20, both values are kept separate rather than averaged. If you want them averaged, add cov.keep = 1 to the emmeans call (sketched below). This has nothing to do with the offsets you specify in emmeans-related functions; it has to do with the fact that num_sample is a predictor in your model.
The reason for the exception is that a lot of people specify models with indicator variables (e.g., female having values of 1 if true and 0 if false) in place of factors. We generally want those treated like factors rather than numeric predictors.
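A minimal sketch of that cov.keep = 1 fix, reusing mod.bad from the question (treat the exact call as an assumption; I have not run it against these data):
# average over num_sample instead of keeping both of its values in the grid
emmeans::emmeans(mod.bad,
                 pairwise ~ trt_time:pre_trt:storage_time,
                 cov.keep = 1)$contrasts %>%
  as.data.frame() %>%
  nrow()  # should give choose(24, 2) = 276, matching good.data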
To be honest, I'm not exactly sure what's going on with the expansion (276, the 'correct' number of contrasts, is choose(24, 2); the 'incorrect' number, 1128, is choose(48, 2)), but I would say that you should probably follow the guidance in the "offsets" section of one of the emmeans vignettes, where it says
If a model is fitted and its formula includes an offset() term, then by default, the offset is computed and included in the reference grid. ...
However, many users would like to ignore the offset for this kind of model, because then the estimates we obtain are rates per unit value of the (logged) offset. This may be accomplished by specifying an offset parameter in the call ...
The most natural choice is to set the offset to 0 (i.e., make predictions etc. for a sample size of 1), but in this case I don't think it matters.
get_contr <- function(x) as_tibble(x$contrasts)  # as_tibble() is re-exported by dplyr, which is already loaded
cfun <- function(m) {
  emmeans::emmeans(m,
                   pairwise ~ trt_time:pre_trt:storage_time, offset = 0) |>
    get_contr()
}
nrow(cfun(mod.good)) ## 276
nrow(cfun(mod.bad))  ## 276
From a statistical point of view I question the wisdom of looking at 276 pairwise comparisons, but that's a different issue ...
I am investigating the relationship between body measurements and overall weight in a set of biological specimens using regression equations. I have been comparing my results to previous studies, which did not draw their measurement data and body weights from the same series of individuals. Instead, these studies used the mean values reported for each species in the previously published literature (with body measurements and weights drawn from different sets of individuals) or simply took the midpoint of the reported ranges of body measurements.
I am trying to figure out how to introduce a small amount of random error into my data to simulate the effect of drawing measurement and weight data from different sources: for example, altering every value by roughly +/- 5% of its actual value, which is close to the difference I get between my measurements and the literature measurements, and seeing how much that affects the accuracy statistics. I know there is the jitter() command, but it only seems to come up in the context of plotting data.
There is a jitter() function in base R which allows you to add random noise to the data.
x <- 1:10
set.seed(123)
jitter(x)
#[1] 0.915 2.115 2.964 4.153 5.176 5.818 7.011 8.157 9.021 9.983
Check ?jitter which explains different ways to control the noise added.
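For instance, a quick sketch of the two documented controls, factor and amount:
x <- 1:10
jitter(x, factor = 2)   # twice the default amount of uniform noise
jitter(x, amount = 0.5) # uniform noise drawn from [-0.5, 0.5]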
Straightforward if you know what the error looks like (i.e., how your error is distributed). Is the error normally distributed? Uniform?
set.seed(1)                   # fix the RNG seed so the noisy values below are reproducible
v1 <- rep(100, 10)            # measurements with no noise
v1_n <- v1 + rnorm(10, 0, 20) # error with mean 0 and sd 20, sampled from a normal distribution
v1_u <- v1 + runif(10, -5, 5) # error with mean 0 (min -5, max 5), from a uniform distribution
v1_n
[1] 87.47092 103.67287 83.28743 131.90562 106.59016 83.59063 109.74858 114.76649 111.51563 93.89223
v1_u
[1] 104.34705 97.12143 101.51674 96.25555 97.67221 98.86114 95.13390 98.82388 103.69691 98.40349
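Since you mention errors of roughly +/- 5% of each value, a multiplicative (proportional) version may be closer to what you want; a small sketch:
set.seed(1)
v1 <- rep(100, 10)
v1_p <- v1 * runif(10, 0.95, 1.05) # each value perturbed by up to +/- 5% of itself
v1_p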
Hello StackOverflow community,
Five weeks ago I learned to read and write R, and it made me a happier being :) Stack Overflow has helped me out a hundred times or more! I have been struggling with vegan for a while now. So far I have succeeded in making beautiful nMDS plots; the next step for me is DCA, but here I run into trouble...
Let me explain:
I have an abundance dataset where the columns are different species (N = 120) and the rows are transects (N = 460). Column 1 with the transect codes has been deleted. Abundance is given as raw counts of individuals (not relative or transformed). Most species are rare to very rare, and a couple of species have very high abundance (10,000-30,000 individuals). The total number of individuals is about 100,000.
When I run the decorana function it returns this info.
decorana(veg = DCAMVA)
Detrended correspondence analysis with 26 segments.
Rescaling of axes with 4 iterations.
DCA1 DCA2 DCA3 DCA4
Eigenvalues 0.7121 0.4335 0.1657 0.2038
Decorana values 0.7509 0.4368 0.2202 0.1763
Axis lengths 1.7012 4.0098 2.5812 3.3408
The DCA1 species scores are, however, really small... Only one species has a DCA1 value of about 2; the rest are all around -1.4E-4, etc. This high-scoring species has an abundance of just 1 individual, but it is not the only species with only 1 individual...
DCA1 DCA2 DCA3 DCA4 Totals
almaco.jack 6.44e-04 1.85e-01 1.37e-01 3.95e-02 0
Atlantic.trumpetfish 4.21e-05 5.05e-01 -6.89e-02 9.12e-02 104
banded.butterflyfish -4.62e-07 6.84e-01 -4.04e-01 -2.68e-01 32
bar.jack -3.41e-04 6.12e-01 -2.04e-01 5.53e-01 91
barred.cardinalfish -3.69e-04 2.94e+00 -1.41e+00 2.30e+00 15
and so on
I can't post the plot on Stack Overflow yet, but the idea is that there is spread along the Y-axis while there is almost none along the X-axis, resulting in a vertical line in the plot.
I guess everything is running okay; no errors are returned. I just really wonder what the reason for this clustering is... Does anybody have any clue? Is there an ecological explanation behind it?
Any help is appreciated :)
Love
Erik
Looks like your data has an "outlier", a deviant site with a deviant species composition. DCA has essentially selected the first axis to separate this site from everything else, and then DCA2 reflects a major pattern of variance in the remaining sites. (D)CA is known to suffer (if you want to call it that) from this problem, but it really is telling you something about your data. This likely didn't affect the NMDS at all, because metaMDS() maps the rank order of the distances between samples, which means it only needs to place this sample slightly further from every other sample than the distance between the next two most dissimilar samples.
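If you want to see which transect DCA1 is isolating, the site scores make that easy to check; a sketch, assuming dca holds the decorana() result for your DCAMVA matrix:
dca <- decorana(DCAMVA)
site_sc <- scores(dca, display = "sites")
head(sort(abs(site_sc[, "DCA1"]), decreasing = TRUE)) # transects with the most extreme DCA1 scores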
You could just stop using (D)CA for these sorts of data and continue to use NMDS via metaMDS() in vegan. An alternative is to apply a transformation such as the Hellinger transformation and then use PCA (see Legendre & Gallagher 2001, Oecologia, for the details). This transformation can be applied via decostand(...., method = "hellinger") but it is trivial to do by hand as well...
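A sketch of that alternative, again assuming DCAMVA is your transect-by-species abundance matrix (in vegan, an unconstrained rda() is simply a PCA):
library(vegan)
sp_hel <- decostand(DCAMVA, method = "hellinger") # square roots of row-relative abundances
pca_hel <- rda(sp_hel)                            # unconstrained rda() = PCA
plot(pca_hel)
# the "by hand" equivalent: sqrt(sweep(DCAMVA, 1, rowSums(DCAMVA), "/"))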
I'm trying to do some econometric analysis using R and can't figure out how to do the analysis I'm looking for. Specifically, I want to calculate consumer surplus.
I am trying to predict the number of trips (dependent variable) based on variables like water quality, scenery, parking, etc. I've run a regression of my dependent variable on my independent variables using:
lm()
and also got my predicted values using:
y_hat <- as.matrix(mydata[c("y")])
Now I want to calculate the consumer surplus for each individual (~260 total) from my predicted (y_hat) values.
Welcome to R. I studied economics in college and wish R was taught. You will find that the programming language is very useful in your work.
Note that R is able to accomplish vectorized operations that may speed up your analysis. Consider:
mydata <- data.frame(x = letters[1:3], y = 1:3)
mydata
  x y
1 a 1
2 b 2
3 c 3
Let's say your predicted 'y' is 1.25.
y_hat <- 1.25
You can subtract the entire column of the dataset from that number, and R will go row by row for you without the need for complicated 'for' loops.
y_hat - mydata[c("y")]
y
1 0.25
2 -0.75
3 -1.75
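If the goal is one predicted value per individual from your lm() fit, predict() returns the whole vector at once, and you can combine it with the observed values in the same vectorized way. A sketch with hypothetical names (fit, trips_data, trips, water_quality, scenery, parking stand in for your actual objects and columns):
fit <- lm(trips ~ water_quality + scenery + parking, data = trips_data) # hypothetical model
y_hat <- predict(fit)           # one fitted value per individual (~260 of them)
gap <- trips_data$trips - y_hat # vectorized difference, no loop needed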
Without more information about your particular issue, that is all the help that I can offer. In the future, add a reproducible example that illustrates your data and the specific issue that you are stuck on.
Hi… I have a very basic question regarding the input of weighted data into R. Currently I have to process data (mostly for curve-fitting purposes) similar to the following:
> head(mydata, 10)
v sf
1 0.3003434 3.933106
2 0.3027852 5.947432
3 0.3052270 9.832596
4 0.3076688 12.927439
5 0.3101106 14.197519
6 0.3125525 13.572904
7 0.3149943 11.691078
8 0.3174361 9.543095
9 0.3198779 8.048558
10 0.3223197 7.660252
The first column is the data (increasing and equidistant), while the second column gives the frequency (weights). Currently these weights don't add up to one, but I can easily fix that.
Now, I searched for weighted data in R and the closest I found was using the survey package and its svydesign() command, but is it really that hard?
What I did to work around my lack of knowledge, and what got me in trouble with the Kolmogorov-Smirnov test (more below), is the following:
> y <- with(mydata, c(rep(v, times=floor(10*sf))))
which repeats the elements of the first column in proportion to the corresponding weight (times 10 to get a whole number). But now the problem is that when I conduct the Kolmogorov-Smirnov goodness-of-fit test, I get a warning that the p-value cannot be computed because the data has ties.
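To illustrate why the test complains: the expansion necessarily produces many identical values (here using the first three rows shown above):
v  <- c(0.3003434, 0.3027852, 0.3052270)
sf <- c(3.933106, 5.947432, 9.832596)
y  <- rep(v, times = floor(10 * sf)) # each value repeated 39, 59 and 98 times
anyDuplicated(y) > 0                 # TRUE: ties are guaranteed, hence the ks.test() warning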
Question is: How can I input and process the data in its original form (i.e. as a frequency or probability table) for the purpose of curve fitting? Thanks.