How to calculate the mean of multiple standard deviations in R

I have a small problem: I have 10 standard deviations and 10 means from normal distributions, like this:
N(5,1), N(10,3), N(8,2), N(6,1), N(10,3), N(7,2), N(4,1), N(10,3), N(9,2), N(8,1).
If I want the mean of all the means, the R code is:
c=cbind(c(5,10,8,6,10,7,4,10,9,8))
y=mean(c)
So how do I calculate the average of the standard deviations, given that this kind of average doesn't follow the usual averaging formula?

Not sure if I understand you correctly, but if you have a vector of standard deviations, you can also just calculate the mean.
So e.g.
my_sd = c(1.23, 4.53, 3.343)
mean(my_sd)
If your question is about how to calculate a standard deviation, this can be done easily with the sd function.
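For example, using the vector of means from the question:
x <- c(5, 10, 8, 6, 10, 7, 4, 10, 9, 8)
sd(x)   # sample standard deviation, about 2.16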

Not sure what error appears in your console.
Could it be that you are missing a ) at the end of your cbind call?
As you can see below, I can calculate the mean of c without issue.
> c <- cbind(c(5,10,8,6,10,7,4,10,9,8))
> y <- mean(c)
> y
[1] 7.7

I'm not sure what your objective is, but averaging the standard deviations obscures the relationship between each standard deviation and its associated mean. For instance, N(25, 2) and N(5, 5) would have very different summary statistics than N(25, 5) and N(5, 2), even though the averages of the means and of the standard deviations would be the same. A better statistic could be the average of the coefficients of variation (CV = sd/mean) of the distributions. So:
ms <- c(5, 10, 8, 6, 10, 7, 4, 10, 9, 8)
sds <- c(1, 3, 2, 1, 3, 2, 1, 3, 2, 1)
cvs <- sds / ms
cvs
# [1] 0.2000000 0.3000000 0.2500000 0.1666667 0.3000000 0.2857143 0.2500000 0.3000000 0.2222222 0.1250000
meancvs <- mean(cvs)
meancvs
# [1] 0.2399603
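To see why the pairing matters, a quick check on the N(25, 2)/N(5, 5) example above: both pairs have the same average mean and average standard deviation, but their average CVs differ.
ms_a <- c(25, 5); sds_a <- c(2, 5)   # N(25, 2) and N(5, 5)
ms_b <- c(25, 5); sds_b <- c(5, 2)   # N(25, 5) and N(5, 2)
mean(ms_a) == mean(ms_b)    # TRUE, both 15
mean(sds_a) == mean(sds_b)  # TRUE, both 3.5
mean(sds_a / ms_a)          # 0.54
mean(sds_b / ms_b)          # 0.3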

Related

Jaro-Winkler's difference between packages

I am using fuzzy matching with Jaro-Winkler distance to clean up medication data input by users. I was testing which package's Jaro-Winkler implementation was faster when I noticed that the default settings do not give identical values. Can anyone help me understand where the difference comes from? Example:
library(RecordLinkage)
library(stringdist)
jarowinkler("advil", c("advi", "advill", "advil", "dvil", "sdvil"))
# [1] 0.9600000 0.9666667 1.0000000 0.9333333 0.8666667
1- stringdist("advil", c("advi", "advill", "advil", "dvil", "sdvil"), method = "jw")
# [1] 0.9333333 0.9444444 1.0000000 0.9333333 0.8666667
I am assuming it has to do with the weights, and I know I am using the defaults on both. However, if someone with more experience could shed light on what's going on, I would really appreciate it. Thanks!
Documentation:
https://cran.r-project.org/web/packages/stringdist/stringdist.pdf
https://cran.r-project.org/web/packages/RecordLinkage/RecordLinkage.pdf
Tucked away in the documentation for stringdist is the following:
The Jaro-Winkler distance (method=jw, 0<p<=0.25) adds a correction term to the Jaro-distance. It is defined as d − l · p · d, where d is the Jaro-distance. Here, l is obtained by counting, from the start of the input strings, after how many characters the first character mismatch between the two strings occurs, with a maximum of four. The factor p is a penalty factor, which in the work of Winkler is often chosen 0.1.
However, in stringdist::stringdist, p = 0 by default. Hence:
1 - stringdist("advil", c("advi", "advill", "advil", "dvil", "sdvil"),
               method = "jw", p = .1)
# [1] 0.9600000 0.9666667 1.0000000 0.9333333 0.8666667
In fact, p = 0.1 is hard-coded in the source of RecordLinkage::jarowinkler.
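For reference, a minimal sketch (winkler_by_hand is my own helper, not part of either package) that applies the quoted correction d − l · p · d to the plain Jaro distance from stringdist (p = 0) and reproduces the values above:
library(stringdist)
winkler_by_hand <- function(a, b, p = 0.1) {
  d <- stringdist(a, b, method = "jw", p = 0)   # p = 0 gives the plain Jaro distance
  # l = length of the common prefix, capped at 4
  l <- mapply(function(x, y) {
    cx <- strsplit(x, "")[[1]]
    cy <- strsplit(y, "")[[1]]
    n  <- min(length(cx), length(cy), 4)
    mismatch <- which(cx[seq_len(n)] != cy[seq_len(n)])
    if (length(mismatch) == 0) n else mismatch[1] - 1
  }, a, b, USE.NAMES = FALSE)
  d - l * p * d
}
1 - winkler_by_hand("advil", c("advi", "advill", "advil", "dvil", "sdvil"))
# [1] 0.9600000 0.9666667 1.0000000 0.9333333 0.8666667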

Stationarity Tests in R, checking mean, variance and covariance

Preamble
I am not sure whether this question belongs on Stack Overflow or Stack Exchange; I posted it on Stack Overflow since it contains R code. Feel free to move it if you think it belongs on Stack Exchange.
Question
I was modelling some time-series and played around with different stationarity tests in R, when I noticed something strange. Do we have stationarity tests that check for all three components of a stationary time-series (according to Wikipedia: constant mean, constant variance, and constant covariance)?
Simulating some time series, I found that the usual tests fail to detect some non-stationary series. Now the question is: am I missing something, did I get something wrong, or are there better options than the tests mentioned?
Simulating some data
The first time series should be stationary, whereas the other five are simulated to have either changing means or changing variances.
set.seed(123)
dat <- data.frame(x = 1:1000,
                  # stationary
                  ts1 = rnorm(1000),
                  # random walk
                  ts2 = cumsum(rnorm(1000)),
                  # jumps in mean
                  ts3 = c(rnorm(200), rnorm(400, mean = 5), rnorm(400)),
                  # jumps in variance
                  ts4 = c(rnorm(100), rnorm(300, sd = 5), rnorm(500), rnorm(100)),
                  # increasing variance
                  ts5 = sapply(1:1000, function(x) rnorm(1, sd = x/1000)),
                  # variance as a squared function
                  ts6 = sapply(1:1000, function(x) rnorm(1, sd = (x - 500)^2))/1000000)
The series look like this (the plot is not reproduced here; a sketch of one way to draw it follows). I would say that it is clear that only the first series has a constant mean and variance and that the other series are non-stationary.
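A minimal base-graphics sketch that draws the six series (my own; the original post may have plotted them differently):
op <- par(mfrow = c(3, 2), mar = c(2, 2, 2, 1))
for (s in paste0("ts", 1:6)) plot(dat$x, dat[[s]], type = "l", main = s)
par(op)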
The issues (?) with tests
So far, I have looked into four different tests: adf.test, Box.test, kpss.test, and PP.test (adf.test and kpss.test from the tseries package; Box.test and PP.test from stats).
While a significant p-value in the adf.test and the PP.test indicates a stationary series, a significant value indicates non-stationarity for the other two tests.
library(tseries)
tests <- c("adf.test", "Box.test", "kpss.test", "PP.test")
sapply(tests, function(test) {
  apply(dat[, c("ts1", "ts2", "ts3", "ts4", "ts5", "ts6")], 2, function(x) {
    get(test)(x)$p.value
  })
})
# resulting in
# adf.test Box.test kpss.test PP.test # stat:non_stat
# ts1 0.0100000 0.386053779 0.10 0.0100000 # 4:0 clearly stat
# ts2 0.4195604 0.000000000 0.01 0.3260713 # 0:4 clearly non-stat
# ts3 0.5467517 0.000000000 0.01 0.0100000 # 1:3 most-likely non-stat
# ts4 0.0100000 0.004360365 0.10 0.0100000 # 2:2 ?!
# ts5 0.0100000 0.033007310 0.10 0.0100000 # 2:2 ?!
# ts6 0.0100000 0.307453035 0.10 0.0100000 # 4:0 stationary ?!
While all four tests find the first series to be stationary, they have issues with the other series; in particular, series with changing variance are misclassified in many cases.
Do you know if there are better tests that account for changes in variance and/or covariance?
Thank you very much for ideas, thoughts, and solutions.

How to exclude unwanted comparisons in two-way ANOVA in R

I have already asked about this on stats.exchange (original question); I am re-posting the same content here hoping to get help from a wider audience.
I would like to know how to exclude unwanted pairs from the output of a two-way ANOVA, so that when summary(aov()) shows a significant result, the post-hoc test does not give me comparisons I don't want. Details as follows:
I have datTable, which contains proportion data under two factors: site (four levels: A, B, C, D) and treatment (two levels: control and treated). Specifically, I want to do pairwise tests among all sites within the same treatment (e.g. control-A vs. control-B, control-A vs. control-C, treated-A vs. treated-C, etc.), while excluding comparisons that differ in both site and treatment (e.g. pairs such as control-A vs. treated-B, control-B vs. treated-C).
The data looks like this:
> datTable
site treatment proportion
A control 0.5000000
A control 0.4444444
A treated 0.1000000
A treated 0.4000000
B control 0.4444444
B control 0.4782609
B treated 0.0500000
B treated 0.3000000
C control 0.3214286
C control 0.4705882
C treated 0.1200000
C treated 0.4000000
D control 0.3928571
D control 0.4782609
D treated 0.4000000
D treated 0.4100000
I did a two-way ANOVA (I am also not sure whether to use the within-subject site/treatment or the between-subject site*treatment formulation...) and summarised the results.
m1 <- aov(proportion~site*treatment,data=datTable) # Or should I use 'site/treatment'?
Then my summary(m1) gave me the following:
> summary(m1)
Df Sum Sq Mean Sq F value Pr(>F)
site 3 0.02548 0.00849 0.513 0.6845
treatment 1 0.11395 0.11395 6.886 0.0305 *
site:treatment 3 0.03686 0.01229 0.742 0.5561
Residuals 8 0.13239 0.01655
The next step is to use the TukeyHSD post-hoc test to see which pairs actually caused the * significance.
> TukeyHSD(m1)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = proportion ~ site * treatment, data = datTable)
$site
diff lwr upr p adj
B-A -0.042934783 -0.3342280 0.2483585 0.9631797
C-A -0.033106909 -0.3244002 0.2581863 0.9823452
D-A 0.059168392 -0.2321249 0.3504616 0.9124774
C-B 0.009827873 -0.2814654 0.3011211 0.9995090
D-B 0.102103175 -0.1891901 0.3933964 0.6869754
D-C 0.092275301 -0.1990179 0.3835685 0.7461309
$treatment
diff lwr upr p adj
treated-control -0.1687856 -0.3171079 -0.02046328 0.0304535
$`site:treatment`
diff lwr upr p adj
B:control-A:control -0.010869565 -0.5199109 0.4981718 1.0000000
C:control-A:control -0.076213819 -0.5852551 0.4328275 0.9979611
D:control-A:control -0.036663216 -0.5457045 0.4723781 0.9999828
A:treated-A:control -0.222222222 -0.7312635 0.2868191 0.6749021
B:treated-A:control -0.297222222 -0.8062635 0.2118191 0.3863364 # Not wanted
C:treated-A:control -0.212222222 -0.7212635 0.2968191 0.7154690 # Not wanted
D:treated-A:control -0.067222222 -0.5762635 0.4418191 0.9990671 # Not wanted
C:control-B:control -0.065344254 -0.5743856 0.4436971 0.9992203
D:control-B:control -0.025793651 -0.5348350 0.4832477 0.9999985
A:treated-B:control -0.211352657 -0.7203940 0.2976887 0.7189552 # Not wanted
B:treated-B:control -0.286352657 -0.7953940 0.2226887 0.4233804 # Not wanted
C:treated-B:control -0.201352657 -0.7103940 0.3076887 0.7583437 # Not wanted
D:treated-B:control -0.056352657 -0.5653940 0.4526887 0.9996991
D:control-C:control 0.039550603 -0.4694907 0.5485919 0.9999713
A:treated-C:control -0.146008403 -0.6550497 0.3630329 0.9304819 # Not wanted
B:treated-C:control -0.221008403 -0.7300497 0.2880329 0.6798628 # Not wanted
C:treated-C:control -0.136008403 -0.6450497 0.3730329 0.9499131
D:treated-C:control 0.008991597 -0.5000497 0.5180329 1.0000000 # Not wanted
A:treated-D:control -0.185559006 -0.6946003 0.3234823 0.8168230 # Not wanted
B:treated-D:control -0.260559006 -0.7696003 0.2484823 0.5194129 # Not wanted
C:treated-D:control -0.175559006 -0.6846003 0.3334823 0.8505865 # Not wanted
D:treated-D:control -0.030559006 -0.5396003 0.4784823 0.9999950
B:treated-A:treated -0.075000000 -0.5840413 0.4340413 0.9981528
C:treated-A:treated 0.010000000 -0.4990413 0.5190413 1.0000000
D:treated-A:treated 0.155000000 -0.3540413 0.6640413 0.9096378
C:treated-B:treated 0.085000000 -0.4240413 0.5940413 0.9960560
D:treated-B:treated 0.230000000 -0.2790413 0.7390413 0.6429921
D:treated-C:treated 0.145000000 -0.3640413 0.6540413 0.9326207
However, there are some pairs that I don't want included in the two-way ANOVA I performed; they are marked above as # Not wanted.
Is there any way to tweak the aov or TukeyHSD function so that it excludes those 'Not wanted' comparisons listed above? I could easily pick out the significant entries I am interested in (marked with *) from the long list produced by TukeyHSD, but I don't want my ANOVA result to be biased by the unwanted comparisons! (In the real data, the significance is actually caused by those unwanted pairs.)
NB: You might have noticed that the site:treatment post-hoc tests don't show any significance; this is because I only selected a small sample from the original data.
If you mean excluding those comparisons completely from the calculations: Tukey's test works by doing pairwise comparisons for all combinations of conditions, so it doesn't make sense to "exclude" any pairs from the test itself.
If you mean you want to exclude the unwanted comparisons from your final results, then yes, that is possible. The result of TukeyHSD is simply a list, and its site:treatment element is simply a matrix that you can manipulate as you like.
lst <- TukeyHSD(m1)
lst[['site:treatment']] <- lst[['site:treatment']][-c(5,6,7,10,11,12,15,16,18,19,20,21),]
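A slightly more general sketch (my own, using the same lst object) that keeps only comparisons sharing a site or a treatment, instead of hard-coding the row numbers:
lst <- TukeyHSD(m1)
tab <- lst[["site:treatment"]]
# each row name looks like "B:treated-A:control"; split it into its two groups
pairs <- strsplit(rownames(tab), "-", fixed = TRUE)
keep <- sapply(pairs, function(p) {
  sites      <- sub(":.*", "", p)   # e.g. "B", "A"
  treatments <- sub(".*:", "", p)   # e.g. "treated", "control"
  sites[1] == sites[2] || treatments[1] == treatments[2]
})
lst[["site:treatment"]] <- tab[keep, ]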

Set time series vector lengths equal (resize/rescale them using linear interpolation)

I have a huge dataset of time series represented as vectors (no time labels available). Due to errors in the measuring process, their lengths (as reported by length()) vary slightly (~10%), but each of them definitely describes a time interval of exactly two minutes. I would like to rescale/resize them and then calculate some statistics across them (so I need time series of equal length).
I need a very fast approach, and linear interpolation is a perfectly good choice for me, because speed is the priority.
Simple example, rescaling a vector of length 5 to a vector of length 10:
input <- 0:4 # should be rescaled/resized into :
output <- c(0, .444, .888, 1.333, 1.777, 2.222, 2.666, 3.111, 3.555, 4)
I think the fastest approach is to create a matrix w ('w' for weights) with dimensions length(output) x length(input), so that w %*% input gives output (as a matrix object). If that is indeed the fastest way, how do I create such matrices w efficiently?
I think this could be enough:
resize <- function (input, len) approx(seq_along(input), input, n = len)$y
For example:
> resize(0:4, 10)
[1] 0.0000000 0.4444444 0.8888889 1.3333333 1.7777778 2.2222222 2.6666667 3.1111111 3.5555556 4.0000000
> resize( c(0, 3, 2, 1), 10)
[1] 0.000000 1.000000 2.000000 3.000000 2.666667 2.333333 2.000000 1.666667 1.333333 1.000000
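If you specifically want the weight-matrix formulation from the question (handy when many vectors share the same length, so w can be built once and reused), here is a minimal sketch; resize_matrix is a hypothetical helper, not an existing function:
resize_matrix <- function(n_in, n_out) {
  pos  <- seq(1, n_in, length.out = n_out)  # fractional positions in the input
  lo   <- pmin(floor(pos), n_in - 1)        # index of the left neighbour (clamped)
  frac <- pos - lo                          # weight of the right neighbour
  w <- matrix(0, n_out, n_in)
  w[cbind(seq_len(n_out), lo)]     <- 1 - frac
  w[cbind(seq_len(n_out), lo + 1)] <- frac
  w
}
w <- resize_matrix(5, 10)
as.vector(w %*% 0:4)   # same values as resize(0:4, 10) above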

Calculating standard error after a log-transform

Consider a random set of numbers that are normally distributed:
x <- rnorm(n=1000, mean=10)
We'd like to know the mean and the standard error on the mean so we do the following:
se <- function(x) { sd(x)/length(x) }
mean(x) # something near 10.0 units
se(x) # something near 0.001 units
Great!
However, let's assume we don't necessarily know that our original distribution follows a normal distribution. We log-transform the data and perform the same standard error calculation.
z <- log(x, base=10)
mean(z) # something near 1 log units
se(z) # something near 0.000043 log units
Cool, but now we need to back-transform to get our answer in the original units, not log units.
10^mean(z) # something near 10.0 units
10^se(z) # something near 1.00 units
My question: why, for a normal distribution, does the standard error differ depending on whether it is calculated from the distribution itself or whether the data are transformed, the standard error calculated, and the result back-transformed? In this example, it is interesting that the difference is almost exactly 3 orders of magnitude. Note: the means came out the same regardless of the transformation.
EDIT #1: Ultimately, I am interested in calculating a mean and confidence intervals for non-normally distributed data, so if you can give some guidance on how to calculate 95% CIs on transformed data, including how to back-transform them to their native units, I would appreciate it!
END EDIT #1
EDIT #2: I tried using the quantile function to get the 95% confidence intervals:
quantile(x, probs = c(0.05, 0.95)) # around [8.3, 11.6]
10^quantile(z, probs = c(0.05, 0.95)) # around [8.3, 11.6]
So that converged on the same answer, which is good. However, this method doesn't give exactly the same interval for non-normal data with "small" sample sizes:
t <- rlnorm(10)
mean(t) # around 1.46 units
10^mean(log(t, base=10)) # around 0.92 units
quantile(t, probs = c(0.05, 0.95)) # around [0.211, 4.79]
10^(quantile(log(t, base=10), probs = c(0.05, 0.95))) # around [0.209, 4.28]
Which method would be considered "more correct"? I assume one would pick the more conservative estimate?
As an example, would you report this result for the non-normal data (t) as having a mean of 0.92 units with a 95% confidence interval of [0.211, 4.79]?
END EDIT #2
Thanks for your time!
