Trim argument in mean() when number of observations is odd - r

I need some clarification about the trim argument in the function mean().
In ?mean we find that
trim is the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed
If trim is non-zero, a symmetrically trimmed mean is computed
I assume that it will trim the values symmetrically, taking as many observations from the lower range of values as from the upper.
My question is: if x has an odd number of observations and we set trim = 0.5, will it remove one fewer observation so that the same number is cut from each side? Or will it just take one extra out randomly, either from the top or from the bottom?
Thanks in advance,
Ines

I don't exactly know the answer to your question, but I tested with this:
vec <- c(rep(0, 50), rep(1, 51))
mean(vec)
# 0.5049505
mean(vec, trim = .1)
# 0.5061728
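So in this case it seems that the function trimmed the same number of values from each end: 0.5061728 is 41/81, which is what remains after dropping floor(101 * 0.1) = 10 zeros and 10 ones.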
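The test above can be checked by hand. The following is a minimal sketch, assuming that mean() drops floor(n * trim) values from each end of the sorted vector and falls back to the median once trim reaches 0.5 (this is how mean.default behaves in base R, as far as I can tell). The names n, k and manual are only illustrative.

```r
vec <- c(rep(0, 50), rep(1, 51))
n <- length(vec)                      # 101, an odd number of observations

# Assumption: floor(n * trim) values are dropped from EACH end
k <- floor(n * 0.1)                   # 10
manual <- mean(sort(vec)[(k + 1):(n - k)])

# Compare the hand-trimmed mean with the built-in trimmed mean
isTRUE(all.equal(manual, mean(vec, trim = 0.1)))

# With trim = 0.5 nothing is left to average, so the median is returned
mean(vec, trim = 0.5) == median(vec)
```

Under that assumption an odd n never forces an asymmetric cut: the count removed per side is rounded down, so the two ends are always treated alike.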

Related

Rolling weighted average in R (multiple observations)

Is there a fast function that can calculate a weighted rolling average? I need this because I have multiple observations (not always the same number) per data point (change in seconds), which I average. When I take the rolling average, I want to re-weight by the number of observations to get an unbiased rolling average.
So far, I came up with this solution (in this example with a window of 3 seconds).
library(data.table)
library(zoo)  # provides rollapply()
sam <- data.table(val_mean = 1:15, N = 11:25)
sam[, weighted := val_mean * N]
sam[, rollnumerator := rollapply(weighted, 3, sum, fill = NA, align = "left")]
sam[, rolldenominator := rollapply(N, 3, sum, fill = NA, align = "left")]
sam[, rollnumerator / rolldenominator]
I couldn't find any question that already addresses this problem.
This is not about unequal spacing of the data: I can take care of that by expanding my data.table with NAs to include each second (the example above is equally spaced). Also, I don't want to include weights in the sense of RcppRoll's roll_mean: There, weights are fixed for all time windows ("A vector of length n, giving the weights for each element within a window."), while in my case the weights change according to the values currently processed. Thirdly, I don't want an adaptive window size, it should stay fixed (say at 3 seconds).
1) Use by.column = FALSE:
library(data.table)
library(zoo)
wmean <- function(x) weighted.mean(x[, 1], x[, 2])
sam[, rollapply(.SD, 3, wmean, by.column = FALSE, fill = NA, align = "left")]
2) Another approach is to encode the values and weights into a complex vector:
wmean_cmplx <- function(x) weighted.mean(Re(x), Im(x))
sam[, rollapply(complex(real = val_mean, imag = N), 3, wmean_cmplx,
                fill = NA, align = "left")]
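For comparison, the same left-aligned rolling weighted mean can be sketched in base R only. This is not part of either approach above, just an illustration that uses embed() to build the windows. The names k, num, den and roll_wmean are made up for the example.

```r
val_mean <- 1:15
N <- 11:25
k <- 3                                 # fixed window width

# embed(x, k) returns one row per complete window of k consecutive values
num <- rowSums(embed(val_mean * N, k)) # sum of value * weight per window
den <- rowSums(embed(N, k))            # sum of weights per window

# Left-aligned result: the last k - 1 positions have no complete window
roll_wmean <- c(num / den, rep(NA, k - 1))
roll_wmean
```

The first element is the weighted mean of values 1:3 with weights 11:13, which is the same quantity as rollnumerator / rolldenominator in the question.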

Understanding hist() and break intervals in R [duplicate]

I've recently started using R and I don't think I'm understanding the hist() function well. I'm currently working with a numeric vector of length 296, and I'd like to divide it up into 10 equal intervals, and produce a frequency histogram to see which values fall into each interval. I thought hist(dataset, breaks = 10) would do the job, but it's dividing it into 12 intervals instead. I obviously misunderstood what breaks does.
If I want to divide up my data into 10 intervals in my histogram, how should I go about doing that? Thank you.
As per the documentation, if you give the breaks argument a single number, it is treated only as a suggestion, because the breakpoints are set to pretty values. If you want to force 10 equally spaced bins, the easiest way is probably the following:
x = rnorm(50)
hist(x, breaks = seq(min(x), max(x), length.out = 11))
The length should be n+1 where n is the number of desired bins.
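A quick way to confirm the bin count without drawing anything is to call hist() with plot = FALSE and inspect the returned object. This is a small sketch on simulated data. The seed and the names n_bins and h are only for illustration.

```r
set.seed(42)                           # arbitrary seed, for reproducibility only
x <- rnorm(296)
n_bins <- 10

# n_bins + 1 equally spaced breakpoints spanning the full range of x
h <- hist(x, breaks = seq(min(x), max(x), length.out = n_bins + 1),
          plot = FALSE)

length(h$counts)                       # number of bins actually used
sum(h$counts)                          # every observation is counted
```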
If you read help(hist) you will find this explanation:
breaks: one of:
    • a vector giving the breakpoints between histogram cells,
    • a function to compute the vector of breakpoints,
    • a single number giving the number of cells for the histogram,
    • a character string naming an algorithm to compute the number of cells (see ‘Details’),
    • a function to compute the number of cells.
    In the last three cases the number is a suggestion only; as the breakpoints will be set to ‘pretty’ values, the number is limited to ‘1e6’ (with a warning if it was larger). If ‘breaks’ is a function, the ‘x’ vector is supplied to it as the only argument (and the number of breaks is only limited by the amount of available memory).
So the help specifically says that if you provide the function with a number it will only be used as a suggestion.
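One possible solution is to provide the break points yourself, like so (note that these 10 break points define 9 bins, and that the breaks must span the whole range of x):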
x <- rnorm(296)
hist(x, breaks=c(-4,-3,-2,-1,0,1,2,3,4,5))
If you don't want to do that but instead want to specify the number of bins, you can use the cut function:
plot(cut(x, 10))
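To see what cut() produces before plotting it, the factor can be tabulated. A minimal sketch on simulated data, where the seed and the names bins and counts are illustrative:

```r
set.seed(1)                            # arbitrary seed, for reproducibility only
x <- rnorm(296)

# cut() with a single number splits the range of x into that many
# equal-width intervals and returns a factor with one level per interval
bins <- cut(x, 10)
nlevels(bins)

# Frequencies per interval, i.e. the bar heights that plot(bins) would draw
counts <- table(bins)
counts
```

Unlike hist() with a single number, cut() does not round the breakpoints to pretty values, so the requested number of intervals is honoured.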

What does a single number mean when passed as parameter 'breaks' in an R histogram?

I am learning to plot histograms in R, but I have a problem with the parameter "breaks" when it is given a single number. The help says:
breaks: a single number giving the number of cells for the histogram
I did the following experiment:
data("women")
hist(women$weight, breaks = 7)
I expected this to give me 7 bins, but it gives me 10 instead.
What does breaks = 7 actually mean? And what does "number of cells" mean in the help?
Reading the help for the breaks argument carefully to the end, it says:
breaks
one of:
a vector giving the breakpoints between histogram cells,
a function to compute the vector of breakpoints,
a single number giving the number of cells for the histogram,
a character string naming an algorithm to compute the number of cells (see ‘Details’),
a function to compute the number of cells.
In the last three cases the number is a suggestion only; the breakpoints will be set to pretty values. If breaks is a function, the x vector is supplied to it as the only argument.
So, as you can see, the number is only a "suggestion": hist() tries to get close to it, but the result depends on the input values and on whether they can be split nicely into that many buckets (it uses the function pretty to compute the breakpoints).
Hence, the only way to force the number of bins is to provide the vector of breakpoints between the cells.
For example:
data("women")
n <- 7
minv <- min(women$weight)
maxv <- max(women$weight)
breaks <- c(minv, minv + cumsum(rep.int((maxv - minv) / n, n-1)), maxv)
hist(women$weight, breaks = breaks)
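The effect of the "suggestion" can be reproduced directly. The sketch below assumes that hist() turns a single number into breakpoints via pretty(range(x), n = breaks, min.n = 1), which is my reading of hist.default. It also shows seq() as a shorter way of building the same equal-width breaks as above. The names h, suggested, forced and h7 are illustrative.

```r
data("women")

# What hist() does with breaks = 7 (no plot, just the computed object)
h <- hist(women$weight, breaks = 7, plot = FALSE)

# Assumption: the breakpoints come from pretty(), which rounds to "nice" values
suggested <- pretty(range(women$weight), n = 7, min.n = 1)
isTRUE(all.equal(h$breaks, suggested))
length(h$counts)                       # bins actually used

# Forcing exactly n equal-width bins with explicit breakpoints
n <- 7
forced <- seq(min(women$weight), max(women$weight), length.out = n + 1)
h7 <- hist(women$weight, breaks = forced, plot = FALSE)
length(h7$counts)
```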

Boxplot main rectangles delimiter which percentage of data points?

I used the command:
boxplot(V15~Class,data=trainData, main="V15 value depending on Class", xlab="Class", ylab="V15")
I would like to understand what percentage of the points lies inside the rectangle(s).
I mean: if I take all the samples inside the main rectangle, what percentage of the total count of samples will it be?
I found the documentation, but cannot figure out this answer.
The help text for boxplot, which you refer to, suggests that you should "See Also boxplot.stats which does the computation". From the "Details" section of ?boxplot.stats:
The two ‘hinges’ are versions of the first and third quartile, i.e., close to quantile(x, c(1,3)/4).
The hinges equal the quartiles for odd n (where n <- length(x)) and differ for even n.
Whereas the quartiles only equal observations for n %% 4 == 1 (n = 1 mod 4),
the hinges do so additionally for n %% 4 == 2 (n = 2 mod 4), and are in the middle of two observations otherwise.
So yes, basically the middle 50% of the values fall inside the box, but the details of the calculation depend on the nature of the data.
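A small worked example makes the difference between hinges and quartiles concrete. This is only a sketch with toy vectors (1:9 and 1:10, chosen for illustration), using boxplot.stats() and quantile() from base R.

```r
x_odd  <- 1:9                          # odd n: hinges equal the quartiles
x_even <- 1:10                         # even n: hinges and quartiles differ

# Elements 2 and 4 of $stats are the lower and upper hinge (the box edges)
hinges_odd  <- boxplot.stats(x_odd)$stats[c(2, 4)]
hinges_even <- boxplot.stats(x_even)$stats[c(2, 4)]

quart_odd  <- unname(quantile(x_odd,  c(1, 3) / 4))
quart_even <- unname(quantile(x_even, c(1, 3) / 4))

# Share of observations lying inside the box (edges included)
frac_even <- mean(x_even >= hinges_even[1] & x_even <= hinges_even[2])
frac_even
```

In the even case the box runs from 3 to 8 and contains 6 of the 10 points, so "about 50%" is only approximate for small samples or tied values.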

how are the intervals of histogram plot identified

I was trying out a simple histogram
hist(c(-2,-1,0,1,2))
The resulting histogram has a frequency of 2 for the interval from -2 to -1.
I don't quite understand how R assigns the values to the intervals in the plot. Shouldn't the frequency (y axis) be 1 everywhere, since there are no repeated values?
Also, I don't understand how the interval bounds work: is the upper bound inclusive, the lower bound inclusive, or something else? That is, [,) or (,] or [,] or (,)?
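All your questions can be answered by reading the help file for hist: help('hist') or ?hist.
There are arguments include.lowest and right, which both default to TRUE. With these defaults the cells are right-closed intervals (a, b], and the lowest value is additionally included in the first cell, which is why both -2 and -1 land in the first bar.
Quoting from the help: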
include.lowest: logical; if TRUE, an x[i] equal to the breaks value will be included in the first (or last, for right = FALSE) bar. This will be ignored (with a warning) unless breaks is a vector.
right: logical; if TRUE, the histogram cells are right-closed (left open) intervals.
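The effect of these two arguments on the example from the question can be inspected without plotting. A minimal sketch, where the names h_right and h_left are illustrative:

```r
x <- c(-2, -1, 0, 1, 2)

# Defaults: right = TRUE, include.lowest = TRUE
# Cells are [-2,-1], (-1,0], (0,1], (1,2]
h_right <- hist(x, plot = FALSE)
h_right$breaks
h_right$counts

# right = FALSE flips the closed side
# Cells are [-2,-1), [-1,0), [0,1), [1,2]
h_left <- hist(x, right = FALSE, plot = FALSE)
h_left$counts
```

Five values spread over four cells means one cell must hold two of them. Which cell that is depends only on which side of the intervals is closed.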
