I have a simple question.
I would like to sum two non-parametric distributions.
Here is an example.
There are two cities, each with 10 houses, and we know the energy consumption of each house. I want to get the probability distribution of the sum of the consumptions of one house chosen at random from each city.
A1 <- c(1,2,3,3,3,4,4,5,6,7) #10 houses' energy consumption for city A
B1 <- c(11,13,15,17,17,18,18,19,20,22) #10 houses' energy consumption for city B
I have the probability distributions of A1 and B1; how can I get the probability distribution of A1+B1?
If I just use A1+B1 in R, it gives 12 15 18 20 20 22 22 24 26 29. However, I don't think this is right, because the houses have no particular order.
When I change the order of the houses, it gives different results.
# Original
A1 <- c(1,2,3,3,3,4,4,5,6,7)
B1 <- c(11,13,15,17,17,18,18,19,20,22)
#change order 1
A2 <- c(7,6,5,4,4,3,3,3,2,1)
B2 <- c(22,20,19,18,18,17,17,15,13,11)
#change order 2
A3 <- c(3,3,3,4,4,5,6,7,1,2)
B3 <- c(17,17,18,18,19,13,20,11,22,15)
sum1 <- A1+B1; sum1
sum2 <- A1+B2; sum2
sum3 <- A3+B3; sum3
(In my plot, the red lines are sum1, sum2, and sum3.) I am not sure how to get the distribution of the sum of two distributions. Please give me any ideas. Thanks!
(If those distributions were normal or uniform, I could get the distribution of the sum easily, but these are not normal and the houses have no order.)
In theory, the distribution of the sum of two independent random variables is the convolution of their PDFs:
PDF(Z) = PDF(X) * PDF(Y)
So, I think this case can be computed by convolution.
# your data
A1 <- c(1,2,3,3,3,4,4,5,6,7) #10 houses' energy consumption for city A
B1 <- c(11,13,15,17,17,18,18,19,20,22) #10 houses' energy consumption for city B
# compute PDF/CDF
PDF_A1 <- table(A1)/length(A1)
CDF_A1 <- cumsum(PDF_A1)
PDF_B1 <- table(B1)/length(B1)
CDF_B1 <- cumsum(PDF_B1)
# compute the sum distribution
PDF_C1 <- convolve(PDF_B1, rev(PDF_A1), type = "open") # rev() gives the usual convolution (see ?convolve)
# plotting
plot(PDF_C1, type="l", axes=FALSE, main="PDF of A1+B1")
box()
axis(2)
# FIXME: is my understanding of the x-axis correct?
axis(1, at=1:14, labels=c(names(PDF_A1)[-1], names(PDF_B1)))
Note:
CDF: cumulative distribution function
PDF: probability density function
## To make the x-values correspond to actual sums, consider
## compute PDF
## pad zeros in probability vectors to convolve
r <- range(c(A1, B1))
pdfA <- pdfB <- vector('numeric', diff(r)+1L)
PDF_A1 <- table(A1)/length(A1) # same as what you have done
PDF_B1 <- table(B1)/length(B1)
pdfA[as.numeric(names(PDF_A1))] <- as.vector(PDF_A1) # fill the values
pdfB[as.numeric(names(PDF_B1))] <- as.vector(PDF_B1)
## compute the convolution and plot
res <- convolve(pdfA, rev(pdfB), type = "open")
## element k of res corresponds to the sum k + 1, because both value grids start at 1
plot(seq_along(res) + 1, res, type="h", xlab='Sum', ylab='Probability')
## In this simple case (with discrete distribution) you can compare
## to previous solution
tst <- rowSums(expand.grid(A1, B1))
plot(table(tst) / length(tst), type='h')
Edit:
Now that I better understand the question, and have seen @jeremycg's answer, I have a different approach that I think will scale better with sample size.
Rather than treating the values in A1 and B1 as the only possible values, we can regard them as samples from underlying distributions. To avoid imposing a particular form on those distributions, I'll use an empirical 'equivalent': the sample density. With the density() function we can infer the relative probabilities of sampling a continuous range of household energy uses from either town. We can then randomly draw an arbitrary number of energies (with replacement) from the density()$x values, weighting the draws with prob=density()$y, i.e. peaks in the density plot sit at x-values that should be resampled more often.
As a heuristic, an oversimplified statement would be that mean(A1) is 3.8 and mean(B1) is 17, so the sum of energy uses from the two cities should be, on average, about 20.8. Using this as a "does it make sense?" check, I think the following approach gives the type of result you want.
sample_sum <- function(A, B, n, ...){
  qss <- function(X, n, ...){
    r_X <- range(X)
    dens_X <- density(X, ...)
    sample(dens_X$x, size=n, prob=dens_X$y, replace=TRUE)
  }
  sample_A <- qss(A, n=n, ...)
  sample_B <- qss(B, n=n, ...)
  sample_A + sample_B
}
ss <- sample_sum(A1, B1, n=100, from=0)
png("~/Desktop/answer.png", width=5, height=5, units="in", res=150)
plot(density(ss))
dev.off()
Note that I bounded the density estimate at 0 (from=0), because I'm assuming you don't want to infer negative energies. The peak in the resulting density is just above 20, so it makes sense.
The potential advantage here is that you don't need to look at every possible combination of energies from the houses in the two cities to understand the distribution of summed energy uses. If you can define the distribution of both, you can define the distribution of paired sums.
Finally, the computation time is trivial, especially compared to the approach of finding all combinations. E.g., with 10 million houses in each city, the expand.grid approach gives me Error: cannot allocate vector of size 372529.0 Gb, whereas the sample_sum approach takes 0.12 seconds.
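As a rough illustration of that scaling claim, here is a minimal timing sketch; the two large city vectors are hypothetical stand-ins generated with rlnorm(), not real data, and sample_sum() is the function defined above.
## hypothetical large cities (rlnorm() is only a stand-in for real consumption data)
big_A <- rlnorm(1e7, meanlog = 1.3)
big_B <- rlnorm(1e7, meanlog = 2.8)
## density() bins internally, so sample_sum() stays fast even at this size
system.time(ss_big <- sample_sum(big_A, big_B, n = 100, from = 0))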
Of course, if the answer doesn't help you, the speed is worthless ;)
You probably want something like:
rowSums(expand.grid(A1, B1))
Using expand.grid will get you a dataframe of all combinations of A1 and B1, and rowSums will add them.
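For example, to turn those pairwise sums into the probability distribution asked for (assuming one house is drawn uniformly and independently from each city), you could tabulate them:
sums <- rowSums(expand.grid(A1, B1))  # all 100 equally likely pairings
pmf <- table(sums) / length(sums)     # probability of each possible sum
pmf
plot(pmf, type = "h", xlab = "Sum", ylab = "Probability")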
Is it not the case that sorting the distribution prior to adding solves this problem?
A1 <- c(1,2,3,3,3,4,4,5,6,7) #10 houses' energy consumption for city A
B1 <- c(11,13,15,17,17,18,18,19,20,22) #10 houses' energy consumption for city B
sort(A1)+sort(B1)
Related
Suppose I have some observations of variables and a model. The model result is not directly observable, but from physical assumptions I know roughly what to expect. Here is a simplified example; the real model is quite complex:
its <- 1000 # number of iterations
obs1 <- rnorm(n=its, mean=20, sd=1) # an observation of one parameter
obs2 <- runif(n=its, min=3, max=12) # an observation of a second parameter
mod <- function(obs1, obs2){
  res <- (obs1 + obs2)^2
  return(res)
} # a model; the result cannot be directly observed
result <- mod(obs1=obs1, obs2=obs2)
## but I know from physical principles something about the result: it must follow a specific distribution, here for example a normal one:
res.info <- density(rnorm(1e4, mean=600, sd=100))
### and I also know the ratio of obs1/obs2 can't be greater than 4
res.info2 <- density(runif(n=1e4, min=0, max=4))
layout(mat=matrix(1:4, nrow=2))
par(mar=c(3,5,1,1))
hist(result, xlim=c(200, 1400))
plot(res.info, xlim=c(200, 1400))
hist(obs1/obs2, xlim=c(0,8))
plot(res.info2, xlim=c(0,8))
My question: How do I obtain a probability density function of the result, given what I know about obs1, obs2, and the result? Is this a situation in which Bayes Theorem can be applied? How would I do this programmatically?
Let's say I have a time-series like this
t x
1 100
2 50
3 200
4 210
5 90
6 80
7 300
Is it possible in R to generate a new dataset x1 which has the exact same summary statistics (e.g. mean, variance, kurtosis, and skew) as x?
The reason I ask is that I would like to run an experiment testing how subjects react to different graphs of data that contain the same information.
I recently read:
Matejka, Justin, and George Fitzmaurice. "Same stats, different graphs: Generating datasets with varied appearance and identical statistics through simulated annealing." Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM, 2017.
Generating Data with Identical Statistics but Dissimilar Graphics: A Follow up to the Anscombe Dataset. The American Statistician, 2007.
However, Matejka and Fitzmaurice use fairly involved Python code and their data are more complex than a simple time series, so I was wondering whether there is a more straightforward way to do this for a simpler data set?
Best regards
I'm not aware of a package that can give you precisely what you are looking for. One option is using the datasets in the datasauRus package as JasonAizkalns pointed out. However, if you want to create your own dataset, you can try this:
Fit the Johnson distribution from the SuppDists package to get the moments of the dataset, and draw new sets from that distribution until the difference is sufficiently small. Below is an example with your dataset, although more observations would make it easier to replicate the summary statistics:
library(SuppDists)
a <- c(100,50,200,210,90,80,300)
momentsDiffer <- function(x1,x2){
  diff <- sum(abs(moments(x1) - moments(x2)))
  return(diff)
}
repDataset <- function(x,n){
  # fit Johnson distribution
  parms <- JohnsonFit(x, moment="quant")
  # generate from the distribution n times, keeping the best draw
  current <- rJohnson(length(x),parms)
  momDiff <- momentsDiffer(x,current)
  for(i in 1:n){
    temp <- rJohnson(length(x),parms)
    tempDiff <- momentsDiffer(x,temp)
    if(tempDiff < momDiff){
      current <- temp
      momDiff <- tempDiff
    }
  }
  return(current)
}
# Drawing 1000 times to allow improvement
b <- repDataset(a,1000)
> moments(b)
mean sigma skew kurt
148.14048691 84.24884165 1.04201116 -0.05008629
> moments(a)
mean sigma skew kurt
147.1428571 84.1281821 0.5894543 -1.0198303
EDIT - Added additional method
Following the suggestion of @Jj Blevins, the method below generates a random sequence based upon the original sequence, leaving out 4 observations. Those 4 observations are then added by solving a non-linear system of equations on the difference between the four moments of the original sequence and the new sequence. This still does not generate a perfect match; feel free to improve.
library(nleqslv)
library(e1071)
set.seed(1)
a <- c(100,50,200,210,90,80,300)
#a <- floor(runif(1000,0,101))
init <- floor(runif(length(a)-4,min(a),max(a)+1))
target_moments <- moments(a)   # mean, sigma, skew, kurt of the original data
f <- function(x) {
  m1 <- mean(c(init,x))
  m2 <- sd(c(init,x))          # moments()[2] is sigma (a standard deviation), so compare with sd() rather than var()
  m3 <- skewness(c(init,x))
  m4 <- kurtosis(c(init,x))
  c(m1-target_moments[1], m2-target_moments[2], m3-target_moments[3], m4-target_moments[4])
}
result <- nleqslv(runif(4,min(a),max(a)+1), f,control=list(ftol=.00000001, allowSingular=TRUE))
> moments(c(init,result$x))
mean sigma skew kurt
49.12747961 29.85435993 0.03327868 -1.25408078
> moments(a)
mean sigma skew kurt
49.96600000 29.10805462 0.03904256 -1.18250616
Suppose I have two data sets of different sizes; each data set contains x and y coordinates locating each observation.
set.seed(1)
x1 <- runif(1000,-195.5,195.5)
y1 <- runif(1000,-49,49)
data1 <- data.frame(x1,y1)
x2 <- runif(2000,-195.5,195.5)
y2 <- runif(2000,-49,49)
data2 <- data.frame(x2,y2)
Here I generated two data sets with random locations within a specific area.
Then I generated a hexbin object for each dataset. I know that to trace observations back to their bins, I need to set IDs = TRUE.
library(hexbin)
hbin_1 <- hexbin(x=data1$x1,y=data1$y1,xbins=30,shape=98/391,IDs=TRUE)
hbin_2 <- hexbin(x=data2$x2,y=data2$y2,xbins=30,shape=98/391,IDs=TRUE)
In the next step, I want to apply KL divergence to compare these two datasets. The problem is: how can I match the bins of the second data set to those of the first? (I want to compare bins at the same locations in the two datasets.)
UPDATE
We can get a table containing the cell names (bin numbers) and the corresponding counts of observations with
tI1 <- table(hbin_1@cID)
tI2 <- table(hbin_2@cID)
The problem is that the bin numbers differ between dataset1 and dataset2. Even if we set the same xbins and shape in hexbin, we still get different bins for the two datasets. So how can I compare the two datasets (or get the bins at the same locations)?
The function hexbin does not return empty bins. Hence, even if we set the same xbins, xbnds and ybnds, the returned hexbin objects can differ between the two datasets.
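One way to line the bins up anyway (a sketch, not part of the original answer) is to pass the same xbnds and ybnds to both hexbin calls, so that the cell IDs refer to the same hexagons, and then pad the empty cells with zero counts; the bounds below are taken from the runif() ranges used to generate the data.
library(hexbin)
xbnds <- c(-195.5, 195.5)  # common binning region
ybnds <- c(-49, 49)
hb1 <- hexbin(data1$x1, data1$y1, xbins = 30, shape = 98/391,
              xbnds = xbnds, ybnds = ybnds, IDs = TRUE)
hb2 <- hexbin(data2$x2, data2$y2, xbins = 30, shape = 98/391,
              xbnds = xbnds, ybnds = ybnds, IDs = TRUE)
## union of occupied cells; cells missing from one object get a count of zero
cells <- sort(union(hb1@cell, hb2@cell))
cnt1 <- setNames(integer(length(cells)), cells)
cnt2 <- cnt1
cnt1[as.character(hb1@cell)] <- hb1@count
cnt2[as.character(hb2@cell)] <- hb2@count
## cnt1 and cnt2 are now directly comparable, cell by cell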
Alternatively, we can use kde2d from the package MASS to perform two-dimensional kernel density estimation on a common grid.
library(MASS)
## common evaluation region (the bounds used to generate the data)
xbnds <- c(-195.5, 195.5)
ybnds <- c(-49, 49)
b1 <- kde2d(data1$x1,data1$y1,lims = c(xbnds,ybnds))
b2 <- kde2d(data2$x2,data2$y2,lims = c(xbnds,ybnds))
This gives us two kernel density estimates on the same grid. We can turn each into a vector, normalise it by dividing by its sum, and finally apply KL divergence to quantify the similarity of the two distributions.
z1 <- as.vector(b1$z)
z2 <- as.vector(b2$z)
z1 <- z1/sum(z1)   # normalise so each density vector sums to 1
z2 <- z2/sum(z2)
kullback.leibler(z1, z2)
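If no ready-made kullback.leibler() helper is available, the discrete KL divergence can also be computed directly from the normalised vectors; this is a sketch and assumes every entry of z2 is strictly positive, which is normally the case for kde2d output on data like this.
## discrete KL divergence D(P || Q) = sum p * log(p / q)
kl_div <- function(p, q) sum(p * log(p / q))
kl_div(z1, z2)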
I want to generate 2 continuous random variables Q1, Q2 (quantitative traits, each normally distributed) and 2 binary random variables Z1, Z2 (binary traits) with given pairwise correlations between all possible pairs of them.
Say
(Q1,Q2):0.23
(Q1,Z1):0.55
(Q1,Z2):0.45
(Q2,Z1):0.4
(Q2,Z2):0.5
(Z1,Z2):0.47
Please help me generate such data in R.
This is crude but might get you started in the right direction.
library(copula)
options(digits=3)
probs <- c(0.5,0.5)
corrs <- c(0.23,0.55,0.45,0.4,0.5,0.47) ## lower triangle
Simulate correlated values (first two quantitative, last two transformed to binary)
sim <- function(n,probs,corrs) {
  tmp <- normalCopula( corrs, dim=4 , "un")
  getSigma(tmp) ## test
  x <- rCopula(n, tmp)   ## use the n argument rather than a hard-coded sample size
  x2 <- x
  x2[,3:4] <- qbinom(x[,3:4],size=1,prob=rep(probs,each=nrow(x)))
  x2
}
Test SSQ distance between observed and target correlations:
objfun <- function(corrs,targetcorrs,probs,n=1000) {
  cc <- try(cor(sim(n,probs,corrs)),silent=TRUE)
  if (is(cc,"try-error")) return(NA)
  sum((cc[lower.tri(cc)]-targetcorrs)^2)
}
See how bad things are when input corrs=target:
cc0 <- cor(sim(1000,probs=probs,corrs=corrs))
cc0[lower.tri(cc0)]
corrs
objfun(corrs,corrs,probs=probs) ## 0.112
Now try to optimize.
opt1 <- optim(fn=objfun,
              par=corrs,
              targetcorrs=corrs, probs=c(0.5,0.5))
opt1$value ## 0.0208
Stops after 501 iterations with "max iterations exceeded". This will never work really well because we're trying to use a deterministic hill-climbing algorithm on a stochastic objective function ...
cc1 <- cor(sim(1000,probs=c(0.5,0.5),corrs=opt1$par))
cc1[lower.tri(cc1)]
corrs
Maybe try simulated annealing?
opt2 <- optim(fn=objfun,
              par=corrs,
              targetcorrs=corrs, probs=c(0.5,0.5),
              method="SANN")
It doesn't seem to do much better than the previous value. Two possible problems (left as an exercise for the reader) are (1) that we have specified a set of correlations that are not feasible with the marginal distributions we have chosen, or (2) that the stochastic error in the objective function surface is getting in the way -- to do better we would have to average over more replicates (i.e. increase n).
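As a minimal sketch of that averaging idea (the wrapper name and the reps argument are mine, not part of the original answer), one could smooth the objective before handing it to optim:
objfun_avg <- function(corrs, targetcorrs, probs, n=1000, reps=10) {
  ## average the stochastic objective over several simulated data sets
  vals <- replicate(reps, objfun(corrs, targetcorrs, probs=probs, n=n))
  mean(vals, na.rm=TRUE)
}
opt3 <- optim(fn=objfun_avg, par=corrs,
              targetcorrs=corrs, probs=c(0.5,0.5))
opt3$value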
In the following code I use bootstrapping to calculate the C.I. and the p-value under the null hypothesis that two different fertilizers applied to tomato plants have no effect on plant yields (the alternative being that the "improved" fertilizer is better). The first random sample (x) comes from plants where a standard fertilizer has been used, while an "improved" one has been used in the plants where the second sample (y) comes from.
x <- c(11.4,25.3,29.9,16.5,21.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
library(boot)
diff <- function(x,i) mean(x[i[6:11]]) - mean(x[i[1:5]])
b <- boot(total, diff, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
What I don't like about the code above is that resampling is done as if there were only one sample of 11 values (treating the first 5 as belonging to sample x and leaving the rest to sample y).
Could you show me how this code should be modified in order to draw resamples of size 5 with replacement from the first sample and separate resamples of size 6 from the second sample, so that bootstrap resampling would mimic the “separate samples” design that produced the original data?
EDIT2:
Hack deleted as it was a wrong solution. Instead, one has to use the strata argument of the boot function:
total <- c(x,y)
id <- as.factor(c(rep("x",length(x)),rep("y",length(y))))
b <- boot(total, diff, strata=id, R = 10000)
...
Be aware you're not going to get even close to a correct estimate of your p.value:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
b <- boot(total, diff, strata=id, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
> p.value
[1] 0.5162
How would you explain a p-value of 0.51 for two samples where all values of the second are higher than the highest value of the first?
The above code is fine for getting a (biased) estimate of the confidence interval, but significance testing of the difference should be done by permutation over the complete dataset.
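A minimal sketch of that permutation test (the helper below is illustrative, not part of the original answer): shuffle the group labels over the pooled data and recompute the difference in means each time.
perm_test <- function(x, y, R = 10000) {
  total <- c(x, y)
  n_x <- length(x)
  obs <- mean(y) - mean(x)                      # observed difference in means
  perm <- replicate(R, {
    idx <- sample(length(total))                # shuffle group labels
    mean(total[idx[-(1:n_x)]]) - mean(total[idx[1:n_x]])
  })
  mean(perm >= obs)                             # one-sided p-value
}
perm_test(x, y)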
Following John, I think the appropriate way to use the bootstrap to test whether the sums of these two different populations are significantly different is as follows:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
b_x <- boot(x, function(d, i) sum(d[i]), R = 10000)   # boot statistics take (data, indices)
b_y <- boot(y, function(d, i) sum(d[i]), R = 10000)
z<-(b_x$t0-b_y$t0)/sqrt(var(b_x$t[,1])+var(b_y$t[,1]))
pnorm(z)
So we can clearly reject the null that they are the same population. I may have missed a degrees-of-freedom adjustment (I am not sure how bootstrapping works in that regard), but such an adjustment will not change your results drastically.
While the actual soil beds could be considered a stratification variable in some instances, this is not one of them. You only have the one manipulation, between the groups of plants. Therefore, your null hypothesis is that they really do come from the exact same population. Treating the items as if they're from a single set of 11 samples is the correct way to bootstrap in this case.
If you had two plots, and in each plot tried the different fertilizers over different seasons in a counterbalanced fashion, then the plots would be stratified samples and you'd want to treat them as such. But that isn't the case here.