Boxplot including outliers in R, make the whole ranges being compared. - r

I am comparing several values using R, they are 8 variables stored in 1000 length vectors. That means, 1000*8 matrix, 8 columns represent 8 variables.
Then I call
boxplot(test),
I got like:
The mean values of 8 variables are very close to each other. Which makes the comparison and interpretation very hard. Can I include all the outliers in my plot ? Then the whole range would be easier to compare ? Or any other suggestions could be given to distinguish these variables ?

Here is the boxplot in question (since the OP doesn't have the rep to post pictures):
It looks like the medians (and likely also the means) are pretty much identical, but the variances differ between the eight categories, with category 1 having the lowest and 8 the highest variance. Depending on the real question involved, these two pieces of information (similar median/mean, different variance) may already be enough.
If you want a formal significance test whether the variances are equal, you can use Hartley's or Bartlett's test. If you want to formally test equality of means with unequal variances (so ANOVA is not appropriate), look here.

Related

Negative Binomial model offset seems to be creating a 2 level factor

I am trying to fit some data to a negative binomial model and run a pairwise comparison using emmeans. The data has two different sample sizes, 15 and 20 (num_sample in the example below).
I have set up two data frames: good.data which produces the expected result of offset() using random sample sizes between 15 and 20, and bad.data using a sample size of either 15 or 20, which seems to produce a factor of either 15 or 20. The bad.data pairwise comparison produces way too many comparisons compared to the good.data, even though they should produce the same number?
set.seed(1)
library(dplyr)
library(emmeans)
library(MASS)
# make data that works
data.frame(site=c(rep("A",24),
rep("B",24),
rep("C",24),
rep("D",24),
rep("E",24)),
trt_time=rep(rep(c(10,20,30),8),5),
pre_trt=rep(rep(c(rep("N",3),rep("Y",3)),4),5),
storage_time=rep(c(rep(0,6),rep(30,6),rep(60,6),rep(90,6)),5),
num_sample=sample(c(15,17,20),24*5,T),# more than 2 sample sizes...
bad=sample(c(1:7),24*5,T,c(0.6,0.1,0.1,0.05,0.05,0.05,0.05)))->good.data
# make data that doesn't work
data.frame(site=c(rep("A",24),
rep("B",24),
rep("C",24),
rep("D",24),
rep("E",24)),
trt_time=rep(rep(c(10,20,30),8),5),
pre_trt=rep(rep(c(rep("N",3),rep("Y",3)),4),5),
storage_time=rep(c(rep(0,6),rep(30,6),rep(60,6),rep(90,6)),5),
num_sample=sample(c(15,20),24*5,T),# only 2 sample sizes...
bad=sample(c(1:7),24*5,T,c(0.6,0.1,0.1,0.05,0.05,0.05,0.05)))->bad.data
# fit models
good.data%>%
mutate(trt_time=factor(trt_time),
pre_trt=factor(pre_trt),
storage_time=factor(storage_time))%>%
MASS::glm.nb(bad~trt_time:pre_trt:storage_time+offset(log(num_sample)),
data=.)->mod.good
bad.data%>%
mutate(trt_time=factor(trt_time),
pre_trt=factor(pre_trt),
storage_time=factor(storage_time))%>%
MASS::glm.nb(bad~trt_time:pre_trt:storage_time+offset(log(num_sample)),
data=.)->mod.bad
# pairwise comparison
emmeans::emmeans(mod.good,pairwise~trt_time:pre_trt:storage_time+offset(log(num_sample)))$contrasts%>%as.data.frame()
emmeans::emmeans(mod.bad,pairwise~trt_time:pre_trt:storage_time+offset(log(num_sample)))$contrasts%>%as.data.frame()
First , I think you should look up how to use emmeans.The intent is not to give a duplicate of the model formula, but rather to specify which factors you want the marginal means of.
However, that is not the issue here. What emmeans does first is to setup a reference grid that consists of all combinations of
the levels of each factor
the average of each numeric predictor; except if a
numeric predictor has just two different values, then
both its values are included.
It is that exception you have run against. Since num_samples has just 2 values of 15 and 20, both levels are kept separate rather than averaged. If you want them averaged, add cov.keep = 1 to the emmeans call. It has nothing to do with offsets you specify in emmeans-related functions; it has to do with the fact that num_samples is a predictor in your model.
The reason for the exception is that a lot of people specify models with indicator variables (e.g., female having values of 1 if true and 0 if false) in place of factors. We generally want those treated like factors rather than numeric predictors.
To be honest I'm not exactly sure what's going on with the expansion (276, the 'correct' number of contrasts, is choose(24,2), the 'incorrect' number of contrasts is 1128 = choose(48,2)), but I would say that you should probably be following the guidance in the "offsets" section of one of the emmeans vignettes where it says
If a model is fitted and its formula includes an offset() term, then by default, the offset is computed and included in the reference grid. ...
However, many users would like to ignore the offset for this kind of model, because then the estimates we obtain are rates per unit value of the (logged) offset. This may be accomplished by specifying an offset parameter in the call ...
The most natural choice for setting the offset is to 0 (i.e. make predictions etc. for a sample size of 1), but in this case I don't think it matters.
get_contr <- function(x) as_tibble(x$contrasts)
cfun <- function(m) {
emmeans::emmeans(m,
pairwise~trt_time:pre_trt:storage_time, offset=0) |>
get_contr()
}
nrow(cfun(mod.good)) ## 276
nrow(cfun(mod.bad)) ## 276
From a statistical point of view I question the wisdom of looking at 276 pairwise comparisons, but that's a different issue ...

R partykit::ctree() how to break tie in selecting splitting variable of identical p-value

For a node x in partykit::ctree object, I use the following lines to get the splitting variables on the node:
k=info_node(x)
names(k$p.value)
However, a splitting variables of a node returned by this code is different from the one on the tree created by plot. It turns out that three columns in k$criterion have the minimum p-value; i.e.
inds=which(k$criterion['p.value',]==k$p.value)
length(inds) #3
Seems the info_node(x) returns the 1st of the three variables as names(k$p.value), but plot chooses the 3rd one. I wonder if such discrepancy is caused by two reasons:
Multiple variables have the minimum p-value, and there is an internal method to break such a tie in selecting only one splitting variable.
Maybe these three variable have slightly different p-value, but because of the fixed p-value precision in k$criterion, they appear to have the same p-value.
Any insight is appreciated!
The comparisons are done internally on the log-p-value scale, i.e., are more reliable in case of tiny p-values. If ties (within machine precision) still remain for the p-value, they are broken based on the size of the corresponding test statistic.
here is one example. Thank you!
library(partykit)
a=rep('N',87)
a[77]='Y'
b=rep(F,87)
b[c(7,10,11,33,56,77)]=T
d=rep(1,87)
d[c(29,38,40,42,65,77)]=0
dfb=data.frame(a=as.factor(a),b=as.factor(b),d=as.factor(d))
tFit=ctree(a ~ ., data=dfb, control = ctree_control(minsplit= 10,minbucket = 5,
maxsurrogate=2, alpha = 0.05))
plot(tFit) #displayed splitting variable is d
tNodes=node_party(tFit)
nodeInfo=info_node(tNodes)
names(nodeInfo$p.value) #b, not d

PCoA function pcoa extract vectors; percentage of variance explained

I have a dataset consisting of 132 observations and 10 variables.
These variables are all categorical. I am trying to see how my observations cluster and how they are different based on the percentage of variance. i.e I want to find out if a) there are any variables which helps to draw certain observation points apart from one another and b) if yes, what is the percentage of variance explained by it?
I was advised to run a PCoA (Principle Coordinates Analysis) on my data. I ran it using vegan and ape package. This is my code after loading my csv file into r, I call it data
#data.dis<-vegdist(data,method="gower",na.rm=TRUE)
#data.pcoa<-pcoa(data.dis)
I was then told to extract the vectors from the pcoa data and so
#data.pcoa$vectors
It then returned me 132 rows but 20 columns of values (e.g. from Axis 1 to Axis 20)
I was perplexed over why there were 20 columns of values when I only have 10 variables. I was under the impression that I would only get 10 columns. If any kind souls out there could help to explain a) what do the vectors actually represent and b) how do I get the percentage of variance explained by Axis 1 and 2?
Another question that I had was I don't really understand the purpose of extracting the eigenvalues from data.pcoa because I saw some websites doing that after running a pcoa on their distance matrix but there was no further explanation on it.
Gower index is non-Euclidean and you can expect more real axes than the number of variables in Euclidean ordination (PCoA). However, you said that your variables are categorical. I assume that in R lingo they are factors. If so, you should not use vegan::vegdist() which only accepts numeric data. Moreover, if the variable is defined as a factor, vegan::vegdist() refuses to compute the dissimilarities and gives an error. If you managed to use vegdist(), you did not properly define your variables as factors. If you really have factor variables, you should use some other package than vegan for Gower dissimilarity (there are many alternatives).
Te percentage of "variance" is a bit tricky for non-Euclidean dissimilarities which also give some negative eigenvalues corresponding to imaginary dimensions. In that case, the sum of all positive eigenvalues (real axes) is higher than the total "variance" of data. ape::pcoa() returns the information you asked in the element values. The proportion of variances explained is in its element values$Relative_eig. The total "variance" is returned in element trace. All this was documented in ?pcoa where I read it.

In R, what is the difference between ICCbare and ICCbareF in the ICC package?

I am not sure if this is a right place to ask a question like this, but Im not sure where to ask this.
I am currently doing some research on data and have been asked to find the intraclass correlation of the observations within patients. In the data, some patients have 2 observations, some only have 1 and I have an ID variable to assign each observation to the corresponding patient.
I have come across the ICC package in R, which calculates the intraclass correlation coefficient, but there are 2 commands available: ICCbare and ICCbareF.
I do not understand what is the difference between them as they do give completely different ICC values on the same variables. For example, on the same variable, x:
ICCbare(ID,x) gave me a value of -0.01035216
ICCbareF(ID,x) gave me a value of 0.475403
The second one using ICCbareF is almost the same as the estimated correlation I get when using random effects models.
So I am just confused and would like to understand the algorithm behind them so I could explain them in my research. I know one is to be used when the data is balanced and there are no NA values.
In the description it says that it is either calculated by hand or using ANOVA - what are they?
By: https://www.rdocumentation.org/packages/ICC/versions/2.3.0/topics/ICCbare
ICCbare can be used on balanced or unbalanced datasets with NAs. ICCbareF is similar, however ICCbareF should not be used with unbalanced datasets.

Dealing with "less than"s in R

Perhaps this is a philosophical question rather than a programming question, but here goes...
In R, is there some package or method that will let you deal with "less than"s as a concept?
Backstory: I have some data which, for privacy reasons, is given as <5 for small numbers (representing integers 1, 2, 3 or 4, in fact). I'd like to do some simple arithmetic on this data (adding, subtracting, averaging, etc.) but obviously I need to find some way to deal with these <5s conceptually. I could replace them all with NAs, sure, but of course that's throwing away potentially useful information, and I would like to avoid that if possible.
Some examples of what I mean:
a <- c(2,3,8)
b <- c(<5,<5,8)
mean(a)
> 4.3333
mean(b)
> 3.3333 -> 5.3333
If you are interested in the values at the bounds, I would take each dataset and split it into two datasets; one with all <5s set to 1 and one with all <5s set to 4.
a <- c(2,3,8)
b1 <- c(1,1,8)
b2 <- c(4,4,8)
mean(a)
# 4.333333
mean(b1)
# 3.3333
mean(b2)
# 5.3333
Following #hedgedandlevered proposal, but he's wrong wrt normal and/or uniform. You ask for integer numbers, so you have to use discrete distributions, like Poisson, binomial (including negative one), geometric etc
In statistics "less than" data is known as "left censored" https://en.wikipedia.org/wiki/Censoring_(statistics), searching on "censored data" might help.
My favoured approach to analysing such data is maximum likelihood https://en.wikipedia.org/wiki/Maximum_likelihood. There are a number of R packages for maximum likelihood estimation, I like the survival package https://cran.r-project.org/web/packages/survival/index.html but there are others, e.g. fitdistrplus https://cran.r-project.org/web/packages/fitdistrplus/index.html which "provides functions for fitting univariate distributions to different types of data (continuous censored or non-censored data and discrete data) and allowing different estimation methods (maximum likelihood, moment matching, quantile matching and maximum goodness-of-t estimation)".
You will have to specify (assume?) the form of the distribution of the data; you say it is integer so maybe a Poisson [related] distribution may be appropriate.
Treat them as a certain probability distribution of your choosing, and replace them with actual randomly generated numbers. All equal to 2.5, normal-like distribution capped at 0 and 5, uniform on [0,5] are all options
I deal with similar data regularly. I strongly dislike any of the suggestions of replacing the <5 values with a particular number. Consider the following two cases:
c(<5,<5,<5,<5,<5,<5,<5,<5,6,12,18)
c(<5,6,12,18)
The problem comes when you try to do arithmetic with these.
I think a solution to your issue is to think of the values as factors (in the R sense. You can bin the values above 5 too if that helps, for example
c(<5,<5,<5,<5,<5,<5,<5,<5,5-9,10-14,15-19)
c(<5,5-9,10-14,15-19)
Now, you still wouldn't do arithmetic on these, but your summary statistics (histograms/proportion tables/etc...) would make more sense.

Resources