How to calculate a pooled standard deviation in R?

I want to calculate the pooled (actually weighted) standard deviation for all the unique sites in my data frame.
The values for these sites come from single-species forest stands, and I want to pool the mean and the sd so that I can compare broadleaved stands with conifer stands.
This is the data frame (df) with values for the broadleaved stands:
keybl             n    mean      sd
Vest02DenmDesp    3   58.16    6.16
Vest02DenmDesp    5   54.45    7.85
Vest02DenmDesp    3   51.34    1.71
Vest02DenmDesp    3   59.57    5.11
Vest02DenmDesp    5   62.89   10.26
Vest02DenmDesp    3   77.33    2.14
Mato10GermDesp    4   41.89   12.60
Mato10GermDesp    4   11.92    1.80
Wawa07ChinDesp   18    0.097   0.004
Chen12ChinDesp    3   41.93    1.12
Hans11SwedDesp    2 1406.20  679.46
Hans11SwedDesp    2 1156.20  464.07
Hans11SwedDesp    2 4945.30  364.58
Keybl is the code for the site. The formula for the pooled SD is:
s = sqrt(((n1-1)*s1^2 + (n2-1)*s2^2) / (n1+n2-2))
(Sorry I can't post pictures and did not find a link that would directly go to the formula)
Here 2 is the number of groups, which will change depending on the site. I know this formula is used for t-tests comparing two groups; in this case I'm not planning to compare the groups, but my professor suggested I use it to get a weighted sd. I didn't find an R function that incorporates this formula in the way I need, so I tried to build my own. I am, however, new to R and not very good at writing functions and loops, so I hope for your help.
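For concreteness, the two-group version of the formula translates directly into R (a minimal sketch, with names of my own choosing):
pooled_sd2 <- function(n1, s1, n2, s2) {
  # degrees-of-freedom-weighted average of the two variances, then square root
  sqrt(((n1 - 1)*s1^2 + (n2 - 1)*s2^2) / (n1 + n2 - 2))
}
pooled_sd2(3, 6.16, 5, 7.85)  # first two Vest02DenmDesp rows: ~7.33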
This is what I got so far:
sd = function(data) {
  nc1 = data[z, "nc"]
  sc1 = data[z, "sc"]
  nc2 = data[z+1, "nc"]
  sc2 = data[z+1, "sc"]
  sd1 = (nc1-1)*sc1^2 + (nc2-1)*sc2^2
  sd2 = sd1/(nc1+nc2-length(nc1))
  sqrt(sd2)
}
splitdf = split(df, with(df, df$keybl), drop = TRUE)
for (c in 1:length(splitdf)) {
  for (i in 1:length(splitdf[[i]])) {
    a = (splitdf[[i]])
    b = sd(a)
  }
}
1) The function itself is not correct as it gives slightly lower values than it should and I don't understand why. Could it be that it does not stop when z+1 has reached the last row? If so, how can that be corrected?
2) The loop is totally wrong but it is what I could come up with after several hours of no success.
Can anybody help me?
Thanks,
Antra

What you're trying to do will benefit from a more general form of the pooled-SD formula, one that works for any number of groups. If you didn't need to break it into pieces by the keybl variable you'd be done.
dd <- df #df is not a good name for a data.frame variable since df has a meaning in statistics
dd$df <- dd$n-1
pooledSD <- sqrt( sum(dd$sd^2 * dd$df) / sum(dd$df) )
# note, in this case I only pre-calculated df because I'll need it more than once. The sum of squares, variance, etc. are only used once.
An important general principle in R is that you use vector math as much as possible. In this trivial case it won't matter much but in order to see how to do this on large data.frame objects where compute speed is more important read on.
# First use R's vector facilities to define the variables you need for pooling.
dd$df <- dd$n-1
dd$s2 <- dd$sd^2 # sd isn't a good name for a standard-deviation variable, even inside a data.frame: it's already a function, and standard deviations have a standard name
dd$ss <- dd$s2 * dd$df
And now just use convenience functions for splitting and calculating the necessary sums. Note only one function is executed here in each implicit loop (*apply, aggregate, etc. are all implicit loops executing functions many times).
ds <- aggregate(ss ~ keybl, data = dd, sum)
ds$df <- tapply(dd$df, dd$keybl, sum) #two different built in methods for split apply, we could use aggregate for both if we wanted
# divide your ss by your df and voila
ds$s2 <- ds$ss / ds$df
# and you can easily get your sd
ds$s <- sqrt(ds$s2)
And the correct answer is:
keybl ss df s2 s
1 Chen12ChinDesp 2.508800e+00 2 1.254400e+00 1.120000
2 Hans11SwedDesp 8.099454e+05 3 2.699818e+05 519.597740
3 Mato10GermDesp 4.860000e+02 6 8.100000e+01 9.000000
4 Vest02DenmDesp 8.106832e+02 16 5.066770e+01 7.118125
5 Wawa07ChinDesp 2.720000e-04 17 1.600000e-05 0.004000
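As the tapply comment above notes, aggregate could handle both sums in a single call; a sketch of that variant (same numbers):
ds <- aggregate(cbind(ss, df) ~ keybl, data = dd, sum)
ds$s <- sqrt(ds$ss / ds$df)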
This looks much less concise than other methods (like 42-'s answer), but if you unroll those in terms of how many R commands are actually executed, this is much more concise. For a short problem like this either way is fine, but I thought I'd show you the method that uses the most vector math. It also highlights why those convenient implicit-loop functions are available: for expressiveness. If you used for loops to accomplish the same thing, the temptation would be stronger to put everything in the loop, which can be a bad idea in R.

The pooled SD under the assumption of independence (so the covariance terms can be assumed to be zero) will be:
sqrt( sum_over_groups[ (n-1)*var ] / ( sum(n-1) - N_groups ) )
lapply( split(dat, dat$keybl),
function(dd) sqrt( sum( dd$sd^2 * (dd$n-1) )/(sum(dd$n-1)-nrow(dd)) ) )
#-------------------------
$Chen12ChinDesp
[1] 1.583919
$Hans11SwedDesp
[1] Inf
$Mato10GermDesp
[1] 11.0227
$Vest02DenmDesp
[1] 9.003795
$Wawa07ChinDesp
[1] 0.004123106
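Note the Inf for Hans11SwedDesp: that site has three stands of n = 2 each, so the denominator sum(n-1) - N_groups is 3 - 3 = 0 and the formula divides by zero. A quick check (a sketch, reusing dat from above):
dd <- subset(dat, keybl == "Hans11SwedDesp")
sum(dd$n - 1) - nrow(dd)  # 0, hence the division by zero and Inf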

Related

R comparing unequal vectors with inequality

I have two single-column data frames of unequal length:
aa <- data.frame(c(2,12,35))
bb <- data.frame(c(1,2,3,4,5,6,7,15,22,36))
For each observation in aa, I want to count the number of instances where bb is less than aa.
My result:
bb<aa
1 1
2 7
3 9
I have been able to do it two ways by creating a function and using apply, but my datasets are large and I let one run all night without end.
What I have:
fun1 <- function(a, b){
  k <- colSums(b < a)
  k <- k*.000058242
}
system.time(replicate(5000,data.frame(apply(aa,1,fun1,b=bb))))
user system elapsed
3.813 0.011 3.883
Secondly,
fun2 <- function(a, b){
  k <- length(which(b < a))
  k <- k*.000058242
}
system.time(replicate(5000,data.frame(apply(aa,1,fun2,b=bb))))
user system elapsed
3.648 0.006 3.664
The second function is slightly faster in all my tests, but I let the first run all night on a dataset where bb has over 1.7 million rows and aa over 160,000.
I found this post and have tried using with(), but cannot seem to get it to work; I also tried a for loop without success.
Any help or direction is appreciated.
Thank you!
aa<-data.frame(c(2,12,35))
bb<-data.frame(c(1,2,3,4,5,6,7,15,22,36))
sapply(aa[[1]],function(x)sum(bb[[1]]<x))
# [1] 1 7 9
Some more realistic examples:
n <- 1.6e3
bb <- sample(1:n,1.7e6,replace=T)
aa <- 1:n
system.time(sapply(aa,function(x)sum(bb<x)))
# user system elapsed
# 14.63 2.23 16.87
n <- 1.6e4
bb <- sample(1:n,1.7e6,replace=T)
aa <- 1:n
system.time(sapply(aa,function(x)sum(bb<x)))
# user system elapsed
# 148.77 18.11 167.26
So with length(aa) = 1.6e4 this takes about 2.5 min (on my system), and the process scales as O(length(aa)) - no surprise there. Therefore, with your full dataset, it should run in about 25 min. Still kind of slow. Maybe someone else will come up with a better way.
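A much faster alternative (a sketch, not benchmarked above): sort bb once, then use findInterval with left.open = TRUE, which returns for each element of aa the number of sorted values strictly below it. That replaces the full scan per element with one sort plus binary searches.
bb.sorted <- sort(bb)
counts <- findInterval(aa, bb.sorted, left.open = TRUE)
# with the original aa = c(2,12,35) and bb this gives 1 7 9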
In my original post I had been looking for the number of times bb < aa.
So in my example
aa<-data.frame(c(2,12,35))
bb<-data.frame(c(1,2,3,4,5,6,7,15,22,36))
x<-ecdf(bb[,1])
x(2)
[1] 0.2
x(12)
[1] 0.7
x(35)
[1] 0.9
To get the answers in my original post I would need to multiply by the number of data points within bb, in this instance 10. The first one does not match because ecdf() counts values less than or equal to the query (x(2)*10 = 2, since bb contains 2 itself), whereas in my original post I had asked for bb strictly less than aa.
I am dealing with large datasets of land elevation and water elevation, over 1 million data points each, and in the end I am creating an inundation curve: I want to know how much land will be inundated at a water level with a given exceedance probability.
So using the above ecdf() function on all 1 million data points would still be time consuming, but I realized I do not need all the data points, just enough to create my curve.
So I applied the ecdf() function to the entire land data set, but then created a water-elevation sequence long enough to draw the curve I needed, yet small enough to be computed rapidly.
land_elevation <- data.frame(rnorm(1e6))
water_elevation<- data.frame(rnorm(1e6))
cdf_land<- ecdf(land_elevation[,1])
elevation_seq <- seq(from = min(water_elevation[,1]), to = max(water_elevation[,1]), length.out = 1000)
land <- sapply(elevation_seq, cdf_land)
My results are the same, but they are much faster.

For loop inside a for loop? in R

I am new to R and am trying to create a new dataframe of bootstrapped resamples of groups of different sizes. My dataframe has 6 variables and a group designation, and there are 128 groups of different Ns. Here is an example of my data:
head(PhenoM2)
ID Name PhenoNames Group HML RML FML TML FHD BIB
1 378607 PaleoAleut PaleoAleut 1 323.5 248.75 434.50 355.75 46.84 NA
2 378664 PaleoAleut PaleoAleut 1 NA 238.50 441.50 353.00 45.83 277.0
3 378377 PaleoAleut PaleoAleut 1 309.5 227.75 419.00 332.25 46.39 284.0
4 378463 PaleoAleut PaleoAleut 1 283.5 228.75 397.75 331.00 44.37 255.5
5 378602 PaleoAleut PaleoAleut 1 279.5 230.00 393.00 329.50 45.93 265.0
6 378610 PaleoAleut PaleoAleut 1 307.5 234.25 419.50 338.50 43.98 271.5
Pulling from this question - bootstrap resampling for hierarchical/multilevel data - and taking some advice from others (thanks!) I wrote the code:
resample.M <- NULL
for(i in 1000){
  groups <- unique(PhenoM2$"Group")
  for(ii in 1:128)
    data.i.ii <- PhenoM2[PhenoM2$"Group"==groups[ii],]
  resample.M[i] <- data.i.ii[sample(1:nrow(data.i.ii),replace=T),]
}
Unfortunately, this gives me the warning:
In resample.M[i] <- data.i.ii[sample(1:nrow(data.i.ii), replace = T),:
number of items to replace is not a multiple of replacement length
Which I understand, since each of the 128 groups has a different N and none of it is a multiple of 1000. I put in resample.M[i] to try and accumulate all of the 1000x resamples of the 128 groups into a single database, and I'm pretty sure the problem is here.
Nearly all of the examples of for loops I've read create a vector database - numeric(1000) - then plug in the information, but since I'm wanting all of the data (which include factors, integers, and numerics) this doesn't work. I tried making a matrix to put the info in (there are 2187 unique individuals in the dataframe):
resample.M <- matrix(ncol=2187000,nrow=10)
But it's giving me the same warning.
So, since I'm sure I'm missing something basic here, I have three questions:
How can I get this code to resample all of the groups (with replacement and based on their individual Ns)?
How can I get this code to repeat this resampling 1000x?
How can I get the resamples of every group into the same database?
Thank you so much for your insight and expertise!
I think you may have wanted to use double square brackets to store the results in a list, i.e. resample.M[[i]] <- .... Apart from that, it makes more sense to write PhenoM2$Group than PhenoM2$"Group", and groups <- unique(PhenoM2$Group) can go outside your for loop since you only need to compute it once. Also replace 1:128 by 1:length(groups) or seq_along(groups) so you don't need to hard-code the length of the vector. A corrected version of the loop is sketched below.
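Putting those fixes together (a minimal sketch, keeping your overall structure; the inner results are collected and bound into one data frame per iteration):
groups <- unique(PhenoM2$Group)      # computed once, outside the loops
resample.M <- vector("list", 1000)   # one data frame per iteration
for (i in 1:1000) {
  samples.i <- vector("list", length(groups))
  for (ii in seq_along(groups)) {
    data.i.ii <- PhenoM2[PhenoM2$Group == groups[ii], ]
    samples.i[[ii]] <- data.i.ii[sample(1:nrow(data.i.ii), replace = TRUE), ]
  }
  resample.M[[i]] <- do.call(rbind, samples.i)
}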
Because you will often need to operate on data frames grouped by some variable, I suggest you familiarise yourself with a package designed to do that, rather than using for loops, which can be very slow. The best one for a beginner in R may be plyr, which has an easy syntax (although there are many possibilities, including the slightly more "advanced" packages like dplyr and data.table).
So for a subset d <- subset(PhenoM2, Group == 1), you already have the function you need to perform on it: function(d) d[sample(1:nrow(d), replace = TRUE),].
Now to go over all such subsets, perform this operation and then arrange the results in a new data frame named samples you do
library(plyr)
samples <- ddply(PhenoM2, .(Group),
                 function(d) d[sample(1:nrow(d), replace = TRUE), ])
So what remains is to iterate this 1000 or however many times you want. You can use a for loop for this, storing the results in a list. Note that you need to use double square bracket [[ to set elements of the list.
n <- 1000 # number of iterations
samples <- vector("list", n) # list of length n to store results
for (i in seq_along(samples))
samples[[i]] <- ddply(PhenoM2, .(Group),
function(d) d[sample(1:nrow(d), replace = TRUE),])
An alternative way would be to use the function replicate, which performs the same task many times; see the sketch below.
Once you have done this, all resamples will be stored in a list. I am not sure what you mean by "How can I get the resamples of every group into the same database". If you want to group them in a single data frame, you do all.samples <- do.call(rbind, samples). In general, you can format your list of samples using do.call and lapply together with a function.
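For instance, combining the two ideas above (a sketch using replicate with simplify = FALSE so the result stays a list, then rbind to stack everything):
samples <- replicate(1000,
                     ddply(PhenoM2, .(Group),
                           function(d) d[sample(1:nrow(d), replace = TRUE), ]),
                     simplify = FALSE)
all.samples <- do.call(rbind, samples)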

Using Histogram as input in R

This is admittedly a very simple question that I just can't find an answer to.
In R, I have a file that has 2 columns: 1 of categorical data names, and the second a count column (count for each of the categories). With a small dataset, I would use 'reshape' and the function 'untable' to make 1 column and do analysis that way. The question is, how to handle this with a large data set?
In this case, my data is humongous and that just isn't going to work.
My question is, how do I tell R to use something like the following as distribution data:
Cat Count
A 5
B 7
C 1
That is, I give it a histogram as an input and have R figure out that it means there are 5 of A, 7 of B and 1 of C when calculating other information about the data.
In other words, the desired behavior is for R to understand that the histogram input above is equivalent to the following expanded data:
A
A
A
A
A
B
B
B
B
B
B
B
C
With reasonably sized data I can do this on my own, but what do you do when the data is very large?
Edit
The total sum of all the counts is 262,916,849.
In terms of what it would be used for:
This is new data, trying to understand the correlation between this new data and other pieces of data. Need to work on linear regressions and mixed models.
I think what you're asking is to reshape a data frame of categories and counts into a single vector of observations, where categories are repeated. Here's one way:
dat <- data.frame(Cat=LETTERS[1:3],Count=c(5,7,1))
# Cat Count
#1 A 5
#2 B 7
#3 C 1
rep.int(dat$Cat,times=dat$Count)
# [1] A A A A A B B B B B B B C
#Levels: A B C
To follow up on @Blue Magister's excellent answer, here's a 100,000 row histogram with a total count of 551,245,193:
set.seed(42)
Cat <- sapply(rep(10, 100000), function(x) {
paste(sample(LETTERS, x, replace=TRUE), collapse='')
})
dat <- data.frame(Cat, Count=sample(1000:10000, length(Cat), replace=TRUE))
> head(dat)
Cat Count
1 XYHVQNTDRS 5154
2 LSYGMYZDMO 4724
3 XDZYCNKXLV 8691
4 TVKRAVAFXP 2429
5 JLAZLYXQZQ 5704
6 IJKUBTREGN 4635
This is a pretty big dataset by my standards, and the operation Blue Magister describes is very quick:
> system.time(x <- rep(dat$Cat,times=dat$Count))
user system elapsed
4.48 1.95 6.42
It uses about 6GB of RAM to complete the operation.
This really depends on what statistics you are trying to calculate. The xtabs function will create tables for you where you can specify the counts. The Hmisc package has functions like wtd.mean that will take a vector of weights for computing a mean (and related functions for standard deviation, quantiles, etc.). The biglm package could be used to expand parts of the dataset at a time and analyze. There are probably other packages as well that would handle the frequency data, but which is best depends on what question(s) you are trying to answer.
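For instance, a minimal sketch of the weighted route, assuming the binned variable is numeric (the x column here is hypothetical; Hmisc's wtd.* functions take the counts as weights):
library(Hmisc)  # provides wtd.mean, wtd.var, wtd.quantile

dat <- data.frame(x = c(1.5, 2.5, 3.5), Count = c(5, 7, 1))  # hypothetical bins
xtabs(Count ~ x, data = dat)          # frequency table, no expansion
wtd.mean(dat$x, weights = dat$Count)  # weighted mean
wtd.var(dat$x, weights = dat$Count)   # weighted variance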
The existing answers all expand the pre-binned dataset into a full distribution and then use R's histogram function, which is memory-inefficient and will not scale for very large datasets like the one the original poster asked about. The HistogramTools CRAN package includes a PreBinnedHistogram function which takes arguments for breaks and counts to create a Histogram object in R without massively expanding the dataset.
For example, if the data set has 3 buckets with 5, 7, and 1 elements, all of the other solutions posted here so far expand that into a 13-element vector first and then create the histogram. PreBinnedHistogram, in contrast, creates the histogram directly from the 3-element input without creating a much larger intermediate vector in memory.
big.histogram <- PreBinnedHistogram(my.data$breaks, my.data$counts)
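A minimal sketch with the 3-bucket example, assuming the breaks/counts arguments described above (breaks needs one more element than counts):
library(HistogramTools)
big.histogram <- PreBinnedHistogram(breaks = 0:3, counts = c(5, 7, 1))
plot(big.histogram)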

Identifying duplicate columns in a dataframe

I'm an R newbie and am attempting to remove duplicate columns from a largish dataframe (50K rows, 215 columns). The frame has a mix of discrete, continuous, and categorical variables.
My approach has been to generate a table() of each column in the frame, collect them in a list, then use the duplicated() function to find entries in the list that are duplicates, as follows:
age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)
tables=apply(testframe,2,table)
dups=which(duplicated(tables))
testframe <- subset(testframe, select = -c(dups))
This isn't very efficient, especially for large continuous variables. However, I've gone down this route because I've been unable to get the same result using summary (note, the following assumes an original testframe containing duplicates):
summaries=apply(testframe,2,summary)
dups=which(duplicated(summaries))
testframe <- subset(testframe, select = -c(dups))
If you run that code you'll see it only removes the first duplicate found. I presume this is because I am doing something wrong. Can anyone point out where I am going wrong or, even better, point me in the direction of a better way to remove duplicate columns from a dataframe?
How about:
testframe[!duplicated(as.list(testframe))]
as.list() turns each column into a list element, duplicated() then flags any column identical to an earlier one, and the negation keeps the first occurrence of each distinct column.
You can do it with lapply:
testframe[!duplicated(lapply(testframe, summary))]
summary summarizes the distribution while ignoring the order.
Not 100% sure, but I would use digest if the data is huge — each column is hashed once, so only the short hashes need to be compared:
library(digest)
testframe[!duplicated(lapply(testframe, digest))]
A nice trick that you can use is to transpose your data frame and then check for duplicates.
duplicated(t(testframe))
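To actually drop the duplicates with this result (a sketch; note that t() coerces a mixed-type data frame to a character matrix, which is what makes whole-column comparison possible here but can be slow and memory-hungry on large frames):
testframe[, !duplicated(t(testframe))]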
unique(testframe, MARGIN=2)
does not work, though I think it should, so try
as.data.frame(unique(as.matrix(testframe), MARGIN=2))
or if you are worried about numbers turning into factors,
testframe[,colnames(unique(as.matrix(testframe), MARGIN=2))]
which produces
age height gender
1 18 76.1 M
2 19 77.0 F
3 20 78.1 M
4 21 78.2 M
5 22 78.8 F
6 23 79.7 F
7 24 79.9 M
8 25 81.1 M
9 26 81.2 F
10 27 81.8 M
11 28 82.8 F
12 29 83.5 M
It is probably best for you to first find the duplicate column names and treat them accordingly (for example summing the two, taking the mean, first, last, second, mode, etc.). To find the duplicate columns:
names(df)[duplicated(names(df))]
What about just:
unique.matrix(testframe, MARGIN=2)
Actually you would just need to invert the duplicated result in your code and could stick to using subset (which is more readable compared to bracket notation, imho):
require(dplyr)
iris %>% subset(., select=which(!duplicated(names(.))))
Here is a simple command that would work if the duplicated columns of your data frame had the same names:
testframe[names(testframe)[!duplicated(names(testframe))]]
If the problem is that dataframes have been merged one time too many using, for example:
testframe2 <- merge(testframe, testframe, by = c('age'))
it is also good to remove the .x suffix from the column names. I applied it here on top of Mostafa Rezaei's great answer:
testframe2 <- testframe2[!duplicated(as.list(testframe2))]
names(testframe2) <- sub('\\.x$', '', names(testframe2)) # escape the dot and anchor at the end so only a trailing ".x" is stripped
Since this Q&A is a popular Google search result but the answer is a bit slow for a large matrix, I propose a new version using exponential search and data.table power.
This is a function I implemented in the dataPreparation package.
The function is dataPreparation::which_are_in_double:
which_are_in_double(testframe)
This returns 3 and 4, the columns that are duplicated in your example.
Build a data set with wanted dimensions for performance tests
age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)
for (i in 1:12){
testframe = rbind(testframe,testframe)
}
# Result in 49152 rows
for (i in 1:5){
testframe = cbind(testframe,testframe)
}
# Result in 160 columns
The benchmark
To perform the benchmark, I use the rbenchmark library, which repeats each computation 100 times:
library(rbenchmark)
library(digest)
library(dataPreparation)
benchmark(
  which_are_in_double(testframe, verbose=FALSE),
  duplicated(lapply(testframe, summary)),
  duplicated(lapply(testframe, digest))
)
test replications elapsed
3 duplicated(lapply(testframe, digest)) 100 39.505
2 duplicated(lapply(testframe, summary)) 100 20.412
1 which_are_in_double(testframe, verbose = FALSE) 100 13.581
So which_are_in_double is 1.5 to 3 times faster than the other proposed solutions.
NB 1: I excluded from the benchmark the solution testframe[,colnames(unique(as.matrix(testframe), MARGIN=2))] because it was already 10 times slower with 12k rows.
NB 2: Please note that, because of the way this data set is constructed, there are a lot of duplicated columns, which reduces the advantage of exponential search. With just a few duplicated columns, which_are_bijection would perform much better, and the other methods would perform similarly.

Replacing for-loops with apply to improve performance (with weighted.mean)

I am an R newbie, so hopefully this is a solvable problem for some of you.
I have a dataframe containing more than a million data points. My goal is to compute a weighted mean with a moving starting point.
To illustrate, consider this frame, A <- data.frame(matrix(c(1,2,3,2,2,1),3,2)):
X1 X2
1 1 2
2 2 2
3 3 1
where X1 is the data and X2 is the sampling weight.
I want to compute the weighted mean for X1 from starting point 1 to 3, from 2:3 and from 3:3.
With a loop I simply wrote:
B <- rep(NA,3) # empty result vector
for(i in 1:3){
  B[i] <- weighted.mean(x=A$X1[i:3], w=A$X2[i:3]) # shifting the starting point of the data and weights toward the end
}
With my real data this is impossible to compute: each iteration subsets the data frame again, and the computation runs for hours with no result.
Is there a way to implement a varying starting point with an apply command, so that the performance improves?
regards,
Ruben
Building upon @joran's answer to produce the correct result: reversing, taking cumulative sums, and reversing back yields every tail sum of X1*X2 and of X2 in a single pass, and their ratio is the weighted mean from each starting point:
with(A, rev(cumsum(rev(X1*X2)) / cumsum(rev(X2))))
# [1] 1.800000 2.333333 3.000000
Also note that this is much faster than the sapply/lapply approach.
You can use lapply to create your subsets, and sapply to loop over these, but I'd wager there would be a quicker way.
sapply(lapply(1:3, ":", 3), function(x) with(A[x,], weighted.mean(X1, X2)))
[1] 1.800000 2.333333 3.000000
