This is admittedly a very simple question that I just can't find an answer to.
In R, I have a file with 2 columns: one of categorical names and one of counts (the count for each category). With a small dataset, I would use the 'reshape' package's 'untable' function to expand it into a single column and do the analysis that way. The question is, how do I handle this with a large dataset?
In this case, my data is humongous and that approach just isn't going to work.
My question is, how do I tell R to use something like the following as distribution data:
Cat Count
A 5
B 7
C 1
That is, I give it a histogram as an input and have R figure out that it means there are 5 of A, 7 of B and 1 of C when calculating other information about the data.
In other words, rather than expanding it myself, I want R to understand that the input above is equivalent to the following data:
A
A
A
A
A
B
B
B
B
B
B
B
C
In reasonable size data, I can do this on my own, but what do you do when the data is very large?
Edit
The total sum of all the counts is 262,916,849.
In terms of what it would be used for:
This is new data; I am trying to understand the correlation between it and other pieces of data, and I need to run linear regressions and mixed models.
I think what you're asking is to reshape a data frame of categories and counts into a single vector of observations, where categories are repeated. Here's one way:
dat <- data.frame(Cat=LETTERS[1:3],Count=c(5,7,1))
# Cat Count
#1 A 5
#2 B 7
#3 C 1
rep.int(dat$Cat,times=dat$Count)
# [1] A A A A A B B B B B B B C
#Levels: A B C
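As a quick sanity check (an addition, not part of the original answer), tabulating the expanded vector recovers the counts you started from:
table(rep(dat$Cat, times=dat$Count))
# A B C
# 5 7 1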
To follow up on @Blue Magister's excellent answer, here's a 100,000 row histogram with a total count of 551,245,193:
set.seed(42)
Cat <- sapply(rep(10, 100000), function(x) {
paste(sample(LETTERS, x, replace=TRUE), collapse='')
})
dat <- data.frame(Cat, Count=sample(1000:10000, length(Cat), replace=TRUE))
> head(dat)
Cat Count
1 XYHVQNTDRS 5154
2 LSYGMYZDMO 4724
3 XDZYCNKXLV 8691
4 TVKRAVAFXP 2429
5 JLAZLYXQZQ 5704
6 IJKUBTREGN 4635
This is a pretty big dataset by my standards, and the operation Blue Magister describes is very quick:
> system.time(x <- rep(dat$Cat,times=dat$Count))
user system elapsed
4.48 1.95 6.42
It uses about 6GB of RAM to complete the operation.
This really depends on what statistics you are trying to calculate. The xtabs function will create tables for you where you can specify the counts. The Hmisc package has functions like wtd.mean that will take a vector of weights for computing a mean (and related functions for standard deviation, quantiles, etc.). The biglm package could be used to expand parts of the dataset at a time and analyze. There are probably other packages as well that would handle the frequency data, but which is best depends on what question(s) you are trying to answer.
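To illustrate that route, here is a small hedged sketch (assuming the Hmisc package is installed; the per-category score values are invented for the example):
library(Hmisc)
dat <- data.frame(Cat=LETTERS[1:3], Count=c(5,7,1))
# xtabs() builds a contingency table directly from the Count column
xtabs(Count ~ Cat, data=dat)
# Hmisc's weighted helpers take the counts as weights, so the expanded
# 262-million-row vector never has to exist in memory
score <- c(2.3, 4.1, 0.7)   # hypothetical numeric value per category
wtd.mean(score, weights=dat$Count)
wtd.var(score, weights=dat$Count)
wtd.quantile(score, weights=dat$Count, probs=0.5)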
The existing answers all expand the pre-binned dataset into a full vector of individual observations before doing anything with it, which is memory-inefficient and will not scale to very large datasets like the one the original poster asked about. The HistogramTools CRAN package includes a PreBinnedHistogram function which takes arguments for breaks and counts and creates a Histogram object in R without massively expanding the dataset.
For example, if the dataset has 3 buckets with 5, 7, and 1 elements, all of the other solutions posted here so far expand that into a vector of 13 elements first and then create the histogram. PreBinnedHistogram, in contrast, creates the histogram directly from the 3-element input without creating a much larger intermediate vector in memory.
big.histogram <- PreBinnedHistogram(my.data$breaks, my.data$counts)
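Fleshing that line out into a minimal runnable sketch (assuming HistogramTools is installed; my.data is hypothetical, and breaks must have one more element than counts since they are the bin edges):
library(HistogramTools)
my.data <- list(breaks=c(0, 1, 2, 3), counts=c(5, 7, 1))
big.histogram <- PreBinnedHistogram(my.data$breaks, my.data$counts)
plot(big.histogram)   # behaves like an ordinary base-R histogram object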
I have some datasets for which I want to calculate gamma diversity as the Shannon H index.
Example dataset:
Site SpecA SpecB SpecC
1 4 0 0
2 3 2 4
3 1 1 1
Calculating the alpha diversity is as follows:
vegan::diversity(df, index = "shannon")
However, I want this diversity function to calculate one number for the complete dataset instead of for each row, and I can't wrap my head around how. My thought is that I need to write a function to merge all the columns into one, taking the average abundance of each species, thus creating a data frame with one site containing all the species information:
site SpecA SpecB SpecC
1 2.6 1 1.6
This seems like a giant workaround for something that could be done with existing functions, but I don't know how. I hope someone can help in creating this data frame, or suggest some other method to use the diversity() function over the complete data frame.
Regards
library(vegan)
data(BCI)
diversity(colSums(BCI)) # vector of sums is all you need
## vegan 2.6-0 in github has 'groups' argument for sampling units
diversity(BCI, groups = rep(1, nrow(BCI))) # one group, same result as above
diversity(BCI, groups = "a") # arg 'groups' recycled: same result as above
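For completeness, a small hedged sketch of the one-row averaged data frame described in the question; because Shannon H depends only on relative abundances, the colMeans version gives the same value as the colSums answer above:
df <- data.frame(SpecA=c(4, 3, 1), SpecB=c(0, 2, 1), SpecC=c(0, 4, 1))  # example data from the question
pooled <- as.data.frame(t(colMeans(df)))   # one "site" of average abundances
pooled
#      SpecA SpecB    SpecC
# 1 2.666667     1 1.666667
vegan::diversity(pooled, index = "shannon")   # ~1.024, same as diversity(colSums(df))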
I have a datatable, DT, with columns A, B and C. I want only one A per unique B, and I want to choose that A based on the value of C (choose the largest C).
Based on this (incredibly helpful) SO page, Use data.table to get first of subgroup based on a variable, I tried something like this:
test <- data.table(A=c(1:5,1:3),B=c(1:8),C=c(11:18))
setkey(test,A,C)
test[,.SD[.N],by="A"]
In my test case, this gives me an answer that seems right:
# A B C
# 1: 1 6 16
# 2: 2 7 17
# 3: 3 8 18
# 4: 4 4 14
# 5: 5 5 15
And, as expected, the number of rows matches the number of unique entries for "A" in my DT:
length(unique(test$A))
# 5
However, when I apply this to my actual dataset, I am missing approximately 20% of my initially ~2 million rows.
I cannot seem to put together a test dataset that will recreate this type of a loss. There are no null values in the actual dataset. What else could be a factor in a dataset that would cause a discrepancy between the number of results from something like test[,.SD[.N],by="A"] and length(unique(test$A))?
Thanks to @eddi's debugging coaching, here's the answer, at least for my dataset: differential handling of numbers in scientific notation.
In particular: in my actual dataset, columns A and B were very long numbers that, upon import from SQL to R, had been read in scientific notation. It turns out that test[,.SD[.N],by="A"] and length(unique(test$A)) were handling this differently: length(unique(test$A)) preserved the difference between two values that differed only in a low-order digit not visible in the printed scientific notation, whereas test[,.SD[.N],by="A"] was, in essence, rounding the values and thus collapsing some of them together.
(I feel foolish that I didn't catch this myself before posting, but much appreciate the help - I hope somehow this spares someone else the same confusion, perhaps!)
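For readers hitting the same thing, here is a minimal illustration (not the poster's data) of how long numeric IDs lose low-order digits once they are stored as doubles; reading such IDs in as character, or with bit64::integer64, avoids the collapse:
x <- c(2^53, 2^53 + 1)   # the two IDs differ by 1, beyond exact double precision
print(x)                 # both print identically
x[1] == x[2]             # TRUE: they are stored as the same double
length(unique(x))        # 1
# data.table's rounding tolerance when grouping on numeric columns can also be
# inspected or changed with data.table::setNumericRounding() in recent versions.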
I have two single vector data frames of unequal length
aa<-data.frame(c(2,12,35))
bb<-data.frame(c(1,2,3,4,5,6,7,15,22,36))
For each observation in aa, I want to count the number of values in bb that are less than it.
My result:
bb<aa
1 1
2 7
3 9
I have been able to do it two ways by creating a function and using apply, but my datasets are large and I let one run all night without end.
What I have:
fun1 <- function(a, b) {
  k <- colSums(b < a)
  k <- k * .000058242
}
system.time(replicate(5000,data.frame(apply(aa,1,fun1,b=bb))))
user system elapsed
3.813 0.011 3.883
Secondly,
fun2 <- function(a, b) {
  k <- length(which(b < a))
  k <- k * .000058242
}
system.time(replicate(5000,data.frame(apply(aa,1,fun2,b=bb))))
user system elapsed
3.648 0.006 3.664
The second function is slightly faster in all my tests, but I let the first run all night on a dataset where bb has more than 1.7 million rows and aa more than 160,000.
I found this post and have tried using with(), but cannot seem to get it to work; I also tried a for loop without success.
Any help or direction is appreciated.
Thank you!
aa<-data.frame(c(2,12,35))
bb<-data.frame(c(1,2,3,4,5,6,7,15,22,36))
sapply(aa[[1]],function(x)sum(bb[[1]]<x))
# [1] 1 7 9
Some more realistic examples:
n <- 1.6e3
bb <- sample(1:n,1.7e6,replace=T)
aa <- 1:n
system.time(sapply(aa,function(x)sum(bb<x)))
# user system elapsed
# 14.63 2.23 16.87
n <- 1.6e4
bb <- sample(1:n,1.7e6,replace=T)
aa <- 1:n
system.time(sapply(aa,function(x)sum(bb<x)))
# user system elapsed
# 148.77 18.11 167.26
So with length(aa) = 1.6e4 this takes about 2.5 min (on my system), and the process scales as O(length(aa)) - no surprise there. Therefore, with your full dataset, it should run in about 25 min. Still kind of slow. Maybe someone else will come up with a better way.
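A hedged sketch of one possibly better way (not from the original answers): sort bb once and let findInterval() binary-search every value of aa. left.open = TRUE makes the comparison strict (bb < aa) and needs a reasonably recent R; after the O(n log n) sort the lookups are O(m log n), so it should scale to millions of rows.
aa <- c(2, 12, 35)
bb <- c(1, 2, 3, 4, 5, 6, 7, 15, 22, 36)
bb_sorted <- sort(bb)
findInterval(aa, bb_sorted, left.open = TRUE)
# [1] 1 7 9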
In my original post I had been looking for the number of times bb was less than each value of aa.
So in my example:
aa<-data.frame(c(2,12,35))
bb<-data.frame(c(1,2,3,4,5,6,7,15,22,36))
x<-ecdf(bb[,1])
x(2)
[1] 0.2
x(12)
[1] 0.7
x(35)
[1] 0.9
To get the counts from my original post I would need to multiply by the number of data points in bb, in this instance 10. (The first value does not quite match, because my original post asked for bb < aa, a strict inequality, whereas ecdf() gives the proportion of bb less than or equal to the value.)
I am dealing with large datasets of land elevation and water elevation, over 1 million data points each, but in the end I am creating an inundation curve: I want to know how much land will be inundated at water levels with a given exceedance probability.
Using the ecdf() function on all 1 million water data points would still be time consuming, but I realized I do not need all of them, just enough to draw my curve.
So I applied the ecdf() function to the entire land dataset, but then evaluated it on a water-elevation sequence long enough to draw the curve I needed, yet small enough to compute rapidly.
land_elevation <- data.frame(rnorm(1e6))
water_elevation<- data.frame(rnorm(1e6))
cdf_land<- ecdf(land_elevation[,1])
elevation_seq <- seq(from = min(water_elevation[,1]), to = max(water_elevation[,1]), length.out = 1000)
land <- sapply(elevation_seq, cdf_land)
My results are the same, but they are much faster.
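Not part of the original answer, but one quick way to turn those values into the inundation curve itself (object names taken from the snippet above):
plot(elevation_seq, land, type = "l",
     xlab = "Water elevation", ylab = "Fraction of land below this elevation")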
I'm running R version 3.0.2 in RStudio and Excel 2011 for Mac OS X. I'm performing a quantile normalization between 4 sets of 45,015 values. Yes I do know about the bioconductor package, but my question is a lot more general. It could be any other computation. The thing is, when I perform the computation (1) "by hand" in Excel and (2) with a program I wrote from scratch in R, I get highly similar, yet not identical results. Typically, the values obtained with (1) and (2) would differ by less than 1.0%, although sometimes more.
Where is this variation likely to come from, and what should I be aware of concerning number approximations in R and/or Excel? Does this come from a lack of float accuracy in either one of these programs? How can I avoid this?
[EDIT]
As was suggested to me in the comments, this may be case-specific. To provide some context, I described methods (1) and (2) below in detail using test data with 9 rows. The four data sets are called A, B, C, D.
[POST-EDIT COMMENT]
When I perform this on a very small data set (test sample: 9 rows), the results in R and Excel do not differ. But when I apply the same code to the real data (45,015 rows), I get slight variation between R and Excel. I have no clue why that may be.
(2) R code:
dataframe A
Aindex A
1 2.1675e+05
2 9.2225e+03
3 2.7925e+01
4 7.5775e+02
5 8.0375e+00
6 1.3000e+03
7 8.0575e+00
8 1.5700e+02
9 8.1275e+01
dataframe B
Bindex B
1 215250.000
2 10090.000
3 17.125
4 750.500
5 8.605
6 1260.000
7 7.520
8 190.250
9 67.350
dataframe C
Cindex C
1 2.0650e+05
2 9.5625e+03
3 2.1850e+01
4 1.2083e+02
5 9.7400e+00
6 1.3675e+03
7 9.9325e+00
8 1.9675e+02
9 7.4175e+01
dataframe D
Dindex D
1 207500.0000
2 9927.5000
3 16.1250
4 820.2500
5 10.3025
6 1400.0000
7 120.0100
8 175.2500
9 76.8250
Code:
#re-order by ascending values
A <- A[order(A$A),, drop=FALSE]
B <- B[order(B$B),, drop=FALSE]
C <- C[order(C$C),, drop=FALSE]
D <- D[order(D$D),, drop=FALSE]
row.names(A) <- NULL
row.names(B) <- NULL
row.names(C) <- NULL
row.names(D) <- NULL
#compute average
qnorm <- data.frame(cbind(A$A,B$B,C$C,D$D))
colnames(qnorm) <- c("A","B","C","D")
qnorm$qnorm <- (qnorm$A+qnorm$B+qnorm$C+qnorm$D)/4
#replace original values by average values
A$A <- qnorm$qnorm
B$B <- qnorm$qnorm
C$C <- qnorm$qnorm
D$D <- qnorm$qnorm
#re-order by index number
A <- A[order(A$Aindex),,drop=FALSE]
B <- B[order(B$Bindex),,drop=FALSE]
C <- C[order(C$Cindex),,drop=FALSE]
D <- D[order(D$Dindex),,drop=FALSE]
row.names(A) <- NULL
row.names(B) <- NULL
row.names(C) <- NULL
row.names(D) <- NULL
(1) Excel
assign index numbers to each set.
re-order each set in ascending order: select the columns two by two and use Custom Sort... by A, B, C, or D.
calculate the average with =AVERAGE() over columns A, B, C, and D.
replace the values in columns A, B, C, and D by those in the average column using Paste Special... > Values.
re-order everything according to the original index numbers.
If you use exactly the same algorithm you will get exactly the same results: not within 1%, but identical to the 10th decimal. So you're not using the same algorithms, and the details probably won't change this general answer.
(Or it could be a bug in Excel or R, but this is less likely.)
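As an added illustration of how small representation errors arise in any double-precision system (R, Excel, or anything else):
0.1 + 0.2 == 0.3            # FALSE: neither side is exactly representable
(0.1 + 0.2) - 0.3           # on the order of 5.6e-17
all.equal(0.1 + 0.2, 0.3)   # TRUE: comparison with a tolerance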
Answering my own question!
It ended up being Excel's fault (well, kind of): at some point, either in the conversion from the original TAB-delimited file to CSV, or later on when I started copying and pasting things around, the values got rounded.
The original TAB-delimited files had 6 decimals, whereas the CSV files only had 2. I had been doing the Excel quantile normalization from the 6-decimal data, whereas my R function read the data from the CSV files, hence the discrepancy.
For the above examples for R and Excel respectively, I used data coming from the same source, which is why I got the same results.
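One possible way to avoid this in the future (file name hypothetical) is to skip the CSV step and read the original tab-delimited file straight into R, then check that the precision survived:
raw <- read.delim("original_data.txt")   # reads the 6-decimal values directly
print(head(raw), digits = 10)            # only the printing is rounded, not the stored values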
What would you suggest would be best now that I figured this out:
1/Change the title to let other clueless people know that this kind of thing can happen?
2/Consider this post useless and delete it?
I want to calculate the pooled (actually weighted) standard deviation for all the unique sites in my data frame.
The values for these sites are values for single species forest stands and I want to pool the mean and the sd so that I can compare broadleaved stands with conifer stands.
This is the data frame (df) with values for the broadleaved stands:
keybl n mean sd
Vest02DenmDesp 3 58.16 6.16
Vest02DenmDesp 5 54.45 7.85
Vest02DenmDesp 3 51.34 1.71
Vest02DenmDesp 3 59.57 5.11
Vest02DenmDesp 5 62.89 10.26
Vest02DenmDesp 3 77.33 2.14
Mato10GermDesp 4 41.89 12.6
Mato10GermDesp 4 11.92 1.8
Wawa07ChinDesp 18 0.097 0.004
Chen12ChinDesp 3 41.93 1.12
Hans11SwedDesp 2 1406.2 679.46
Hans11SwedDesp 2 1156.2 464.07
Hans11SwedDesp 2 4945.3 364.58
Keybl is the code for the site. The formula for the pooled SD is:
s = sqrt( ((n1 - 1)*s1^2 + (n2 - 1)*s2^2) / (n1 + n2 - 2) )
(Sorry I can't post pictures and did not find a link that would directly go to the formula)
Here 2 is the number of groups, which will change depending on the site. I know this formula is used for t-tests comparing two groups; in this case I'm not planning to compare the groups, but my professor suggested I use it to get a weighted SD. I didn't find an R function that incorporates this formula in the way I need it, so I tried to build my own. I am, however, new to R and not very good at writing functions and loops, so I hope for your help.
This is what I got so far:
sd=function (data) {
nc1=data[z,"nc"]
sc1=data[z, "sc"]
nc2=data[z+1, "nc"]
sc2=data[z+1, "sc"]
sd1=(nc1-1)*sc1^2 + (nc2-1)*sc2^2
sd2=sd1/(nc1+nc2-length(nc1))
sqrt(sd2)
}
splitdf=split(df, with(df, df$keybl), drop = TRUE)
for (c in 1:length(splitdf)) {
for (i in 1:length(splitdf[[i]])) {
a = (splitdf[[i]])
b =sd(a)
}
}
1) The function itself is not correct as it gives slightly lower values than it should and I don't understand why. Could it be that it does not stop when z+1 has reached the last row? If so, how can that be corrected?
2) The loop is totally wrong but it is what I could come up with after several hours of no success.
Can anybody help me?
Thanks,
Antra
What you're trying to do would benefit from a more general formula which will make it easier. If you didn't need to break it into pieces by the keybl variable you'd be done.
dd <- df #df is not a good name for a data.frame variable since df has a meaning in statistics
dd$df <- dd$n-1
pooledSD <- sqrt( sum(dd$sd^2 * dd$df) / sum(dd$df) )
# note, in this case I only pre-calculated df because I'll need it more than once. The sum of squares, variance, etc. are only used once.
An important general principle in R is that you use vector math as much as possible. In this trivial case it won't matter much but in order to see how to do this on large data.frame objects where compute speed is more important read on.
# First use R's vector facilities to define the variables you need for pooling.
dd$df <- dd$n-1
dd$s2 <- dd$sd^2 # sd isn't a good name for standard deviation variable even in a data.frame just because it's a bad habit to have... it's already a function and standard deviations have a standard name
dd$ss <- dd$s2 * dd$df
And now just use convenience functions for splitting and calculating the necessary sums. Note only one function is executed here in each implicit loop (*apply, aggregate, etc. are all implicit loops executing functions many times).
ds <- aggregate(ss ~ keybl, data = dd, sum)
ds$df <- tapply(dd$df, dd$keybl, sum) #two different built in methods for split apply, we could use aggregate for both if we wanted
# divide your ss by your df and voila
ds$s2 <- ds$ss / ds$df
# and also you can easly get your sd
ds$s <- sqrt(ds$s2)
And the correct answer is:
keybl ss df s2 s
1 Chen12ChinDesp 2.508800e+00 2 1.254400e+00 1.120000
2 Hans11SwedDesp 8.099454e+05 3 2.699818e+05 519.597740
3 Mato10GermDesp 4.860000e+02 6 8.100000e+01 9.000000
4 Vest02DenmDesp 8.106832e+02 16 5.066770e+01 7.118125
5 Wawa07ChinDesp 2.720000e-04 17 1.600000e-05 0.004000
This looks much less concise than other methods (like 42-'s answer) but if you unroll those in terms of how many R commands are actually being executed this is much more concise. For a short problem like this either way is fine but I thought I'd show you the method that uses the most vector math. It also highlights why those convenient implicit loop functions are available, for expressiveness. If you used for loops to accomplish the same then the temptation would be stronger to put everything in the loop. This can be a bad idea in R.
The pooled SD under the assumption of independence (so the covariance terms can be assumed to be zero) will be: sqrt( sum_over_groups[ (n_i - 1) * var_i ] / ( sum(n_i) - N_groups ) )
lapply( split(dat, dat$keybl),
function(dd) sqrt( sum( dd$sd^2 * (dd$n-1) )/(sum(dd$n-1)-nrow(dd)) ) )
#-------------------------
$Chen12ChinDesp
[1] 1.583919
$Hans11SwedDesp
[1] Inf
$Mato10GermDesp
[1] 11.0227
$Vest02DenmDesp
[1] 9.003795
$Wawa07ChinDesp
[1] 0.004123106
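A brief added note on the Inf for Hans11SwedDesp: with the denominator used in the code above, its three rows of n = 2 leave zero degrees of freedom.
n <- c(2, 2, 2)
sum(n - 1) - length(n)   # 0, so the division above yields Inf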