How do I compute the weighted mean in R?
For example, I have 4 elements of which 1 element is of size (or: length, width, etc.) 10 and 3 elements are of size 2.
> z = data.frame(count=c(1,3), size=c(10,2))
> z
  count size
1     1   10
2     3    2
The weighted average is (10 * 1 + 2 * 3) / 4 = 4.
Use weighted.mean:
> weighted.mean(z$size, z$count)
[1] 4
Seems like you already know how to calculate this, just need a nudge in the right direction to implement it. Since R is vectorized, this is pretty simple:
with(z, sum(count*size)/sum(count))
The with bit just saves on typing and is equivalent to sum(z$count*z$size)/sum(z$count)
Or use the built-in function weighted.mean(), as you also pointed out. Using your own function can prove faster, though it will not do the same amount of error checking that the built-in function does.
builtin <- function() with(z, weighted.mean(size, count))
rollyourown <- function() with(z, sum(count*size)/sum(count))
require(rbenchmark)
benchmark(builtin(), rollyourown(),
          replications = 1000000,
          columns = c("test", "elapsed", "relative"),
          order = "relative")
#-----
           test elapsed relative
2 rollyourown()   13.26 1.000000
1     builtin()   22.84 1.722474
I've been wracking my brain for the past four hours trying to find the solution to an R problem, which is driving me nuts. I've been searching everywhere for a decent answer, but so far I've been hitting wall after wall. I am now appealing to the good will of this fine community for help.
Consider the following dataset:
set.seed(2112)
DataSample <- matrix(rnorm(24000),nrow=1000)
colnames(DataSample) <- c(paste("Trial",1:12,sep=""),paste("Control",13:24,sep=""))
I need to perform a t-test for every row in DataSample in order to find out if groups TRIAL and CONTROL differ (equal variance applies).
Then I need to count the number of rows with a p-value equal to, or lower than 0.05.
So here is the code I tried, which I know is wrong:
set.seed(2112)
DataSample <- matrix(rnorm(24000),nrow=1000)
colnames(DataSample) <- c(paste("Trial",1:12,sep=""),paste("Control",13:24,sep=""))
pValResults <- apply(
  DataSample[, 1:12], 1,
  function(x) t.test(x, DataSample[, 13:24], var.equal = TRUE)$p.value
)
sum(pValResults < 0.05) # Returns the wrong answer (so I was told)
I did try looking at many similar questions around Stack Overflow, but I would often end up with syntax errors or a dimensional mismatch. The code above is the best I could get without R returning an error -- but since it returns the wrong answer, I have nothing to feel proud of.
Any advice will be greatly appreciated! Thanks in advance for your time.
One option is to loop over the rows of the data set, calculating the t-test for each row. It is not as elegant as a vectorised approach, but it works.
set.seed(2112)
DataSample <- matrix(rnorm(24000),nrow=1000)
colnames(DataSample) <- c(paste("Trial",1:12,sep=""),paste("Control",13:24,sep=""))
# initialize vector of stored p-values
pvalue <- rep(0,nrow(DataSample))
for (i in 1:nrow(DataSample)){
pvalue[i] <- t.test(DataSample[i, 1:12], DataSample[i, 13:24], var.equal = TRUE)$p.value
}
# finding number that are significant
sum(pvalue < 0.05)
I converted to a data.table, and the answer I got was 45:
library(data.table)
DataSample.dt <- as.data.table(DataSample)
sum(sapply(seq_len(nrow(DataSample.dt)), function(x)
  t.test(DataSample.dt[x, paste0('Trial', 1:12), with = FALSE],
         DataSample.dt[x, paste0('Control', 13:24), with = FALSE],
         var.equal = TRUE)$p.value) < 0.05)
To do a paired t-test, you need to supply the paired = TRUE argument. The t.test function isn't vectorised, but it's quite simple to run t-tests on a whole matrix at a time. Here are three methods (including one using apply):
library("genefilter")
library("matrixStats")
library("microbenchmark")
dd <- DataSample[, 1:12] - DataSample[, 13:24]
microbenchmark::microbenchmark(
manual = {ps1 <- 2 * pt(-abs(rowMeans(dd) / sqrt(rowVars(dd) / ncol(dd))), ncol(dd) - 1)},
apply = {ps2 <- apply(DataSample, 1, function(x) t.test(x[1:12], x[13:24], paired=TRUE)$p.value)},
rowttests = {ps3 <- rowttests(dd)[, "p.value"]})
#Unit: milliseconds
#      expr        min         lq       mean     median         uq        max neval
#    manual   1.611808   1.641783   1.677010   1.663122   1.709401   1.852347   100
#     apply 390.869635 398.720930 404.391487 401.508382 405.715668 634.932675   100
# rowttests   2.368823   2.417837   2.639671   2.574320   2.757870   7.207135   100
You can see the manual method is over 200x faster than apply.
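As a quick sanity check (assuming the ps1, ps2 and ps3 objects assigned inside the benchmark above), you can confirm the three approaches give the same p-values:
# the three p-value vectors should match (up to floating-point error)
all.equal(ps1, ps2, check.attributes = FALSE)
all.equal(ps1, ps3, check.attributes = FALSE)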
If you actually meant an unpaired test, here's the equivalent comparison:
microbenchmark::microbenchmark(
manual = {x <- DataSample[, 1:12]; y <- DataSample[, 13:24]; ps1 <- 2 * pt(-abs((rowMeans(x) - rowMeans(y)) / sqrt((rowVars(x) + rowVars(y)) / ncol(x))), ncol(DataSample) - 2)},
apply = { ps2 <- apply(DataSample, 1, function(x) t.test(x[1:12], x[13:24], var.equal = TRUE)$p.value)},
rowttests = {ps3 <- rowttests(DataSample, factor(rep(1:2, each = 12)))[, "p.value"]})
Note the manual method assumes that the two groups are the same sizes.
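If the group sizes differed, the pooled-variance formula generalises. A minimal sketch, assuming matrixStats is loaded as above and using group sizes n1 and n2:
x <- DataSample[, 1:12];  n1 <- ncol(x)
y <- DataSample[, 13:24]; n2 <- ncol(y)
# pooled variance and equal-variance t statistic for (possibly) unequal group sizes
sp2 <- ((n1 - 1) * rowVars(x) + (n2 - 1) * rowVars(y)) / (n1 + n2 - 2)
tstat <- (rowMeans(x) - rowMeans(y)) / sqrt(sp2 * (1 / n1 + 1 / n2))
ps_unequal <- 2 * pt(-abs(tstat), n1 + n2 - 2)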
Adding an alternative using an external library.
Performing the test:
library(matrixTests)
res <- row_t_equalvar(DataSample[,1:12], DataSample[,13:24])
Format of the result:
res
  obs.x obs.y obs.tot      mean.x       mean.y    mean.diff     var.x     var.y var.pooled    stderr df    statistic     pvalue   conf.low  conf.high alternative mean.null conf.level
1    12    12      24  0.30569721  0.160622830  0.145074376 0.5034806 1.0769678  0.7902242 0.3629105 22  0.399752487 0.69319351 -0.6075559 0.89770469   two.sided         0       0.95
2    12    12      24 -0.27463354 -0.206396781 -0.068236762 0.8133311 0.2807800  0.5470556 0.3019535 22 -0.225984324 0.82329990 -0.6944500 0.55797651   two.sided         0       0.95
3    12    12      24 -0.19805092 -0.023207888 -0.174843032 0.4278359 0.5604078  0.4941219 0.2869733 22 -0.609265949 0.54858909 -0.7699891 0.42030307   two.sided         0       0.95
Number of rows with p <= 0.05:
> sum(res$pvalue <= 0.05)
[1] 4
How can I round off a number like 0.0000234889 (or one in the form 8.829847e-07) to a power of ten, either below or above (whichever I choose), i.e. here 0.00001 or 0.0001?
I tried round(...., digits=-100000) but it returns a NaN error.
Example: round(2e-07, digits=6) gives 0, while I would like a function that gives 1e-06 (and another that gives 1e-07).
# Is this what you're looking for?
# find the nearest power of ten for some number
x <- 0.0000234889 # Set test input value
y <- log10(x) # What is the fractional base ten logarithm?
yy <- round(y) # What is the nearest whole number base ten log?
xx <- 10 ^ yy # What integer power of ten is nearest the input?
print(xx)
# [1] 1e-05
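Since the question asks for the power of ten below or above (rather than the nearest), you can swap round() for floor() or ceiling() on the logarithm; a minimal sketch:
x <- 0.0000234889
10 ^ floor(log10(x))    # power of ten below: 1e-05
10 ^ ceiling(log10(x))  # power of ten above: 1e-04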
The digits argument to round() controls the number of decimal places, so it cannot force the value to display with a particular power-of-ten exponent.
In other words, rounding will only get you so far. For example, round(0.0000234889, digits = 6) still prints as 2.3e-05: you rounded to 6 decimal places, yet the exponent shown is 5, because R chooses the exponent used in scientific notation, not round().
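If what you are really after is control over how the value is printed, formatC() can force scientific notation with a chosen number of digits; a small side note beyond the rounding itself:
formatC(0.0000234889, format = "e", digits = 2)
# [1] "2.35e-05"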
Use options("scipen" = ) like this:
num <- 0.0000234889
> num
[1] 2.34889e-05
options("scipen" = 10)
options()$scipen
> num
[1] 0.0000234889
This will change the global option for the session. Read the documentation here: https://stat.ethz.ch/R-manual/R-devel/library/base/html/options.html
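To go back to R's default display afterwards, reset the option (the default penalty is 0):
options(scipen = 0)  # restore the default scientific-notation behaviour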
There are several posts on computing pairwise differences among vectors, but I cannot find how to compute all differences within a vector.
Say I have a vector, v.
v<-c(1:4)
I would like to generate a second vector that is the absolute value of all pairwise differences within the vector. Similar to:
abs(1-2) = 1
abs(1-3) = 2
abs(1-4) = 3
abs(2-3) = 1
abs(2-4) = 2
abs(3-4) = 1
The output would be a vector of 6 values, which are the result of my 6 comparisons:
output<- c(1,2,3,1,2,1)
Is there a function in R that can do this?
as.numeric(dist(v))
seems to work; it treats v as a column matrix and computes the Euclidean distance between rows, which in this case is sqrt((x-y)^2)=abs(x-y)
If we're golfing, then I'll offer c(dist(v)), which is equivalent and which I'm guessing will be unbeatable.
@AndreyShabalin makes the good point that using method="manhattan" will probably be slightly more efficient, since it avoids the squaring/square-rooting.
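For reference, a minimal sketch of that variant; it gives the same absolute differences:
c(dist(v, method = "manhattan"))
# [1] 1 2 3 1 2 1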
Let's play golf
abs(apply(combn(1:4,2), 2, diff))
@Ben, yours is a killer!
> system.time(apply(combn(1:1000, 2), 2, diff))
   user  system elapsed
   6.65    0.00    6.67
> system.time(c(dist(1:1000)))
   user  system elapsed
   0.02    0.00    0.01
> system.time({
+   v <- 1:1000
+   z <- outer(v, v, '-')
+   z[lower.tri(z)]
+ })
   user  system elapsed
   0.03    0.00    0.03
Who knew that elegant (read understandable/flexible) code can be so slow.
A possible solution is:
z <- outer(v, v, '-')
z[lower.tri(z)]
[1] 1 2 3 1 2 1
I have some data showing a long list of regions, the population of each region and the number of people in each region with a certain disease. I'm trying to show the confidence intervals for each proportion (but I'm not testing whether the proportions are statistically different).
One approach is to manually calculate the standard errors and confidence intervals but I'd like to use a built-in tool like prop.test, because it has some useful options. However, when I use prop.test with vectors, it runs a chi-square test across all the proportions.
I've solved this with a while loop (see dummy data below), but I sense there must be a better and simpler way to approach this problem. Would apply work here, and how? Thanks!
dat <- data.frame(1:5, c(10, 50, 20, 30, 35))
names(dat) <- c("X", "N")
dat$Prop <- dat$X / dat$N
ConfLower = 0
x = 1
while (x < 6) {
a <- prop.test(dat$X[x], dat$N[x])$conf.int[1]
ConfLower <- c(ConfLower, a)
x <- x + 1
}
ConfUpper = 0
x = 1
while (x < 6) {
a <- prop.test(dat$X[x], dat$N[x])$conf.int[2]
ConfUpper <- c(ConfUpper, a)
x <- x + 1
}
dat$ConfLower <- ConfLower[2:6]
dat$ConfUpper <- ConfUpper[2:6]
Here's an attempt using Map, essentially stolen from a previous answer here:
https://stackoverflow.com/a/15059327/496803
res <- Map(prop.test,dat$X,dat$N)
dat[c("lower","upper")] <- t(sapply(res,"[[","conf.int"))
#  X  N      Prop       lower     upper
#1 1 10 0.1000000 0.005242302 0.4588460
#2 2 50 0.0400000 0.006958623 0.1485882
#3 3 20 0.1500000 0.039566272 0.3886251
#4 4 30 0.1333333 0.043597084 0.3164238
#5 5 35 0.1428571 0.053814457 0.3104216
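Since the question asks specifically about the apply family: sapply over the row indices works too. A minimal sketch, assuming dat as defined in the question:
ci <- sapply(seq_len(nrow(dat)), function(i) prop.test(dat$X[i], dat$N[i])$conf.int)
dat$lower <- ci[1, ]  # sapply simplifies the intervals into a 2 x nrow(dat) matrix
dat$upper <- ci[2, ]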
I have a file which contains Timestamps like this:
0.000100
0.003890
0.567980
0.999000
0.999990
1.000010
1.236800
1.456098
1.989001
2.098710
2.309879
2.890879
I want to find the per-second statistics using R, e.g. for the file above: 1st second: 5 values, 2nd second: 4 values, 3rd second: 3 values. I also want to find the average per second, the maximum value across all seconds, and the minimum value across all seconds. How can these be extracted using R? I am a newbie to R and still learning. I know how to plot these in histograms, but I don't know how to extract the values.
Data:
x <- c(0.0001, 0.00389, 0.56798, 0.999, 0.99999, 1.00001, 1.2368, 1.456098,
1.989001, 2.09871, 2.309879, 2.890879)
You can also use the cut function to create a factor (time range) and then use it in a similar fashion to how Justin proposes with aggregate:
y <- data.frame(val=x, time=cut(x, 0:round(max(x))))
aggregate(val~time, y, length)
aggregate(val~time, y, mean)
Or create your own function and do it in one fell swoop:
funner <- function(x){
c(mean=mean(x), n=length(x), min=min(x), max=max(x), sd=sd(x))
}
aggregate(val~time, y, funner)
yielding:
> aggregate(val~time, y, funner)
   time  val.mean     val.n   val.min   val.max    val.sd
1 (0,1] 0.5141920 5.0000000 0.0001000 0.9999900 0.4996575
2 (1,2] 1.4204773 4.0000000 1.0000100 1.9890010 0.4223025
3 (2,3] 2.4331560 3.0000000 2.0987100 2.8908790 0.4102205
You can do this using integer math:
x <- c(1e-04, 0.00389, 0.56798, 0.999, 0.99999, 1.00001, 1.2368, 1.456098,
1.989001, 2.09871, 2.309879, 2.890879)
> aggregate(x, list(x %/% 1), mean)
  Group.1        x
1       0 0.514192
2       1 1.420477
3       2 2.433156
I would also suggest you look at the data.table and plyr packages for this sort of aggregation.
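For example, a minimal data.table sketch of the same per-second summary (assuming the x vector above):
library(data.table)
dt <- data.table(val = x, sec = x %/% 1)  # integer second for each timestamp
dt[, .(n = .N, mean = mean(val), min = min(val), max = max(val)), by = sec]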
The max and min for each group follow fairly easily. If you just want the max or min of the whole series, you can use those functions directly:
> max(x)
[1] 2.890879
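For the per-group max and min mentioned above, the same %/% trick works; a minimal sketch:
aggregate(x, list(second = x %/% 1), max)  # max within each second
aggregate(x, list(second = x %/% 1), min)  # min within each second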