This question is sort of a follow-up to how to extract intragroup and intergroup distances from a distance matrix? in R. In that question, they first computed the distance matrix for all points, and then simply extracted the inter-class distance matrix. I have a situation where I'd like to bypass the initial computation and skip right to extraction, i.e. I want to directly compute the inter-class distance matrix. Drawing from the linked example, with tweaks, let's say I have some data in a dataframe called df:
values<-c(0.002,0.3,0.4,0.005,0.6,0.2,0.001,0.002,0.3,0.01)
class<-c("A","A","A","B","B","B","B","A","B","A")
df<-data.frame(values, class)
What I'd like is a distance matrix:
1 2 3 8 10
4 .003 .295 .395 .003 .005
5 .598 .300 .200 .598 .590
6 .198 .100 .200 .198 .190
7 .001 .299 .399 .001 .009
9 .298 .000 .100 .298 .290
Does there already exist in R an elegant and fast way to do this?
EDIT After receiving a good solution for the 1D case above, I thought of a bonus question: what about a higher-dimensional case, say if instead df looks like this:
values1<-c(0.002,0.3,0.4,0.005,0.6,0.2,0.001,0.002,0.3,0.01)
values2<-c(0.001,0.1,0.1,0.001,0.1,0.1,0.001,0.001,0.1,0.01)
class<-c("A","A","A","B","B","B","B","A","B","A")
df<-data.frame(values1, values2, class)
And I'm interested in again getting a matrix of the Euclidean distance between points in class B with points in class A.
For general n-dimensional Euclidean distance, we can exploit the equation (not R, but algebra):
square_dist(b,a) = sum_i(b[i]*b[i]) + sum_i(a[i]*a[i]) - 2*inner_prod(b,a)
where the sums are over the dimensions of vectors a and b for i=[1,n]. Here, a and b are one pair from A and B. The key here is that this equation can be written as a matrix equation for all pairs in A and B.
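As a quick sanity check of this identity for a single pair (my own illustration; the full matrix version for all pairs follows below):
a <- c(0.002, 0.001); b <- c(0.005, 0.001)   ## one point from class A and one from class B
sum((b - a)^2)                               ## squared Euclidean distance, 9e-06
sum(b*b) + sum(a*a) - 2*sum(b*a)             ## same value via the identity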
In code:
## First split the data with respect to the class
n <- 2 ## the number of dimensions, for this example is 2
tmp <- split(df[,1:n], df$class)
d <- sqrt(matrix(rowSums(expand.grid(rowSums(tmp$B*tmp$B), rowSums(tmp$A*tmp$A))),
                 nrow=nrow(tmp$B)) -
          2. * as.matrix(tmp$B) %*% t(as.matrix(tmp$A)))
Notes:
The inner rowSums compute sum_i(b[i]*b[i]) and sum_i(a[i]*a[i]) for each b in B and a in A, respectively.
expand.grid then generates all pairs between B and A.
The outer rowSums computes the sum_i(b[i]*b[i]) + sum_i(a[i]*a[i]) for all these pairs.
This result is then reshaped into a matrix. Note that the number of rows of this matrix is the number of points of class B as you requested.
Then subtract two times the inner product of all pairs. This inner product can be written as a matrix multiply tmp$B %*% t(tmp$A) where I left out the coercion to matrix for clarity.
Finally, take the square root.
Using this code with your data:
print(d)
## 1 2 3 8 10
##4 0.0030000 0.3111688 0.4072174 0.0030000 0.01029563
##5 0.6061394 0.3000000 0.2000000 0.6061394 0.59682493
##6 0.2213707 0.1000000 0.2000000 0.2213707 0.21023796
##7 0.0010000 0.3149635 0.4110985 0.0010000 0.01272792
##9 0.3140143 0.0000000 0.1000000 0.3140143 0.30364453
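As an independent cross-check (a sketch of my own, not part of the original answer), the same matrix can be computed with an explicit double loop over all B/A pairs, which is slower but easy to verify against:
A <- as.matrix(tmp$A); B <- as.matrix(tmp$B)
d_check <- matrix(NA_real_, nrow = nrow(B), ncol = nrow(A))
for (i in seq_len(nrow(B)))
  for (j in seq_len(nrow(A)))
    d_check[i, j] <- sqrt(sum((B[i, ] - A[j, ])^2))
all.equal(unname(d), d_check)  ## TRUE, up to floating point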
Note that this code will work for any n > 1. We can recover your previous 1-D result by setting n to 1 and not performing the inner rowSums (because there is now only one column in tmp$A and tmp$B):
n <- 1 ## the number of dimensions, set this now to 1
tmp <- split(df[,1:n], df$class)
d <- sqrt(matrix(rowSums(expand.grid(tmp$B*tmp$B, tmp$A*tmp$A)),
                 nrow=length(tmp$B)) -
          2. * as.matrix(tmp$B) %*% t(as.matrix(tmp$A)))
print(d)
## [,1] [,2] [,3] [,4] [,5]
##[1,] 0.003 0.295 0.395 0.003 0.005
##[2,] 0.598 0.300 0.200 0.598 0.590
##[3,] 0.198 0.100 0.200 0.198 0.190
##[4,] 0.001 0.299 0.399 0.001 0.009
##[5,] 0.298 0.000 0.100 0.298 0.290
Here's an attempt via generating each combination and then simply taking the (absolute) difference for each pair of values:
abs(matrix(Reduce(`-`, expand.grid(split(df$values, df$class))), nrow=5, byrow=TRUE))
# [,1] [,2] [,3] [,4] [,5]
#[1,] 0.003 0.295 0.395 0.003 0.005
#[2,] 0.598 0.300 0.200 0.598 0.590
#[3,] 0.198 0.100 0.200 0.198 0.190
#[4,] 0.001 0.299 0.399 0.001 0.009
#[5,] 0.298 0.000 0.100 0.298 0.290
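For the one-dimensional case there is also an equivalent variant using outer (my own addition, shown only for comparison with the answers above):
with(split(df$values, df$class), abs(outer(B, A, "-")))
which returns the same 5 x 5 matrix of absolute differences, with rows indexed by class B and columns by class A.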
I am doing some PCA analysis, and when comparing the results of the FactoMineR function PCA with those of prcomp from base R, I don't get the same results. One example:
library(ISLR)
library(FactoMineR)
data("NCI60")
df <- NCI60$data
pca_prcomp <- prcomp(df, scale. = T)
pca_facto <- FactoMineR::PCA(df, scale.unit = T, graph = F, ncp = 65)
# One column is missing
dim(pca_prcomp$x)
dim(pca_facto$ind$coord)
# Values are similar - but not the same
head(pca_prcomp$x[, 1:2])
head(pca_facto$ind$coord[, 1:2])
# Using scale function - does not return same values
pca_facto_scale <- PCA(scale(df), scale.unit = F, graph = F, ncp = 65)
head(pca_facto$ind$coord[, 1:2], 3)
head(pca_facto_scale$ind$coord[, 1:2], 3)
Sorry for being late. The FactoMineR package uses the same svd() approach, which should be similar (but not identical) to the prcomp() approach; both are listed under Q-mode, which is the preferred way to do PCA because of its numerical accuracy. But note that I didn't say identical. Why? FactoMineR uses its own algorithm for PCA, where it calculates the number of components like the following:
ncp <- min(ncp, nrow(X) - 1, ncol(X))
which tells you clearly why you got 63 components rather than the 64 that prcomp() would normally give. Your data set is typical of genomics data, where the number of rows n is smaller than the number p of gene columns, and the code above takes rows or columns, whichever gives the smaller number. If you follow the svd() algorithm it will return 64 dimensions, not 63.
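Plugging the dimensions of this data set into that line makes it concrete (my own illustration; NCI60$data has 64 rows and 6830 columns):
ncp <- 65               ## as requested in the PCA() call above
min(ncp, 64 - 1, 6830)  ## 63: the number of components FactoMineR keeps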
To explore the source code further type FactoMineR:::PCA.
For differences between the Q-mode (svd, prcomp(), FactoMineR::PCA()) and R-mode (eigen(), princomp()) I would recommend visiting this answer.
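To see the Q-mode connection concretely, here is a small sketch (my own, using the built-in USArrests data rather than NCI60) showing that prcomp() scores can be reproduced from svd() of the centered and scaled matrix, up to possible column sign flips:
X  <- scale(USArrests)            ## center and scale a small built-in data set
sv <- svd(X)
scores_svd <- sv$u %*% diag(sv$d) ## principal component scores from the SVD
scores_pr  <- prcomp(X)$x
all.equal(abs(unname(scores_svd)), abs(unname(scores_pr)))  ## TRUE; signs may differ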
Side note: for prcomp() you want to pass the center = T argument in order to center your data before doing PCA. Scaling, on the other hand, gives all your gene columns equal weight.
pca_prcomp <- prcomp(df, center = T, scale. = T) # add center=T
For scaling, prcomp() (via scale()) uses N - 1 as the divisor, while FactoMineR::PCA() uses N instead. The code below will prove it (refer to the same linked answer above):
# this is the scaled data by scale()
df_scaled <- scale(df)
# then you need to get the standardized data matrix from the output of the FactoMineR::PCA() function, which can be done easily as follows:
df_restored <- pca_facto$svd$U %*% diag(pca_facto$svd$vs) %*% t(pca_facto$svd$V)
# then, to make FactoMineR::PCA() and scale() match up, you need to apply the correction
df_corrected <- df_restored * sqrt(63 / 64) # correct by the factor sqrt((N-1)/N)
head(df[, 1:5]) # glimpse the first five columns only!
head(df_scaled[, 1:5])
head(df_restored[, 1:5]) # glimpse the first five columns only!
head(df_corrected[, 1:5])
round(head(df_scaled[, 1:5]), 3) == round(head(df_corrected[, 1:5]), 3) # TRUE
R> head(df[, 1:5])
1 2 3 4 5
V1 0.300 1.180 0.550 1.140 -0.265
V2 0.680 1.290 0.170 0.380 0.465
V3 0.940 -0.040 -0.170 -0.040 -0.605
V4 0.280 -0.310 0.680 -0.810 0.625
V5 0.485 -0.465 0.395 0.905 0.200
V6 0.310 -0.030 -0.100 -0.460 -0.205
R> head(df_scaled[, 1:5])
1 2 3 4 5
V1 0.723 1.59461 1.315 1.345 -0.600
V2 1.584 1.73979 0.438 0.649 0.905
V3 2.173 -0.01609 -0.346 0.264 -1.301
V4 0.678 -0.37256 1.615 -0.441 1.235
V5 1.142 -0.57720 0.958 1.130 0.359
V6 0.746 -0.00289 -0.185 -0.120 -0.476
R> head(df_restored[, 1:5])
[,1] [,2] [,3] [,4] [,5]
[1,] 0.729 1.60722 1.326 1.356 -0.605
[2,] 1.596 1.75354 0.442 0.654 0.912
[3,] 2.190 -0.01622 -0.349 0.266 -1.311
[4,] 0.683 -0.37550 1.628 -0.444 1.244
[5,] 1.151 -0.58176 0.965 1.139 0.361
[6,] 0.752 -0.00291 -0.186 -0.121 -0.480
R> head(df_corrected[, 1:5])
[,1] [,2] [,3] [,4] [,5]
[1,] 0.723 1.59461 1.315 1.345 -0.600
[2,] 1.584 1.73979 0.438 0.649 0.905
[3,] 2.173 -0.01609 -0.346 0.264 -1.301
[4,] 0.678 -0.37256 1.615 -0.441 1.235
[5,] 1.142 -0.57720 0.958 1.130 0.359
[6,] 0.746 -0.00289 -0.185 -0.120 -0.476
R> round(head(df_scaled[, 1:5]), 3) == round(head(df_corrected[, 1:5]), 3)
1 2 3 4 5
V1 TRUE TRUE TRUE TRUE TRUE
V2 TRUE TRUE TRUE TRUE TRUE
V3 TRUE TRUE TRUE TRUE TRUE
V4 TRUE TRUE TRUE TRUE TRUE
V5 TRUE TRUE TRUE TRUE TRUE
V6 TRUE TRUE TRUE TRUE TRUE
Book excerpt
There is also a book for the FactoMineR package, "Exploratory Multivariate Analysis by Example Using R", 2nd edition, by François Husson, Sébastien Lê, and Jérôme Pagès. Page 55 of the book discusses a data set from a genomic study similar to yours, with n = 43 rows far fewer than p = 7407 columns (the chicken.csv data set); you can find more information on their website, and the data set itself can be downloaded from this link.
The difference is most likely between the eigenvalue and SVD methods for performing PCA (see this great answer for some details).
From ?prcomp:
The calculation is done by a singular value decomposition of the
(centered and possibly scaled) data matrix, not by using ‘eigen’
on the covariance matrix. This is generally the preferred method
for numerical accuracy.
From ?PCA:
Returns a list including:
eig: a matrix containing all the eigenvalues, the percentage of
variance and the cumulative percentage of variance
Let us consider a matrix C with 7 rows and 2 columns, where the columns are
x=c(0.018,0.021,0.006,-0.018,-0.021,-0.006,0.018)
y=c(-0.017,0.002,0.027,0.0179,-0.002,-0.027,-0.017)
C=cbind(x,y)
I want to write C as C' where
C' = (0.018 -0.017, 0.021 0.002, 0.006 0.027, -0.018 0.0179, -0.021 -0.002, -0.006 -0.027, 0.018 -0.017)
Thanks in advance.
x1 <- toString(paste(C[,1], C[,2], collapse = ','))
#[1] "0.018 -0.017,0.021 0.002,0.006 0.027,-0.018 0.0179,-0.021 -0.002,-0.006 -0.027,0.018 -0.017"
To get it without quotations,
print(x1, quote = FALSE)
#[1] 0.018 -0.017,0.021 0.002,0.006 0.027,-0.018 0.0179,-0.021 -0.002,-0.006 -0.027,0.018 -0.017
Reading your question as asking for a vector of numbers, the quickest way is to just call as.numeric on the transpose of C:
as.numeric(t(C))
# [1] 0.0180 -0.0170 0.0210 0.0020 0.0060 0.0270 -0.0180 0.0179 -0.0210 -0.0020 -0.0060
# [12] -0.0270 0.0180 -0.0170
This works by transposing C, i.e. flipping its i and j dimensions, so it looks like:
t(C)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# x 0.018 0.021 0.006 -0.0180 -0.021 -0.006 0.018
# y -0.017 0.002 0.027 0.0179 -0.002 -0.027 -0.017
as.numeric then drops the dimension information that makes a matrix a matrix, reducing it to just a vector of numbers.
You could do the same thing on C untransposed, but since matrices are stored column by column by default, that would read all the x values before all the y values instead of returning the values by row. Transposing first makes it read the values in the order you want.
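A quick way to see the column-major behaviour for yourself (a small illustration of my own, reusing the same C):
as.numeric(C)      ## reads down the columns: all x values first, then all y values
as.numeric(t(C))   ## reads across the rows: x1, y1, x2, y2, ...
matrix(as.numeric(t(C)), ncol = 2, byrow = TRUE)  ## round-trips back to C (minus the dimnames)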
I have two matrices in R showing the probability of each scoreline in the first half, and second half of a soccer game. I want to combine these to have one object (I'm guessing a 4 dimensional array) giving me the probability of each HT/FT scoreline combination.
Here is what I currently have, for simplicity here I have assume each team can score a maximum of 3 goals in each half, but in practice I will allow for more than this
# Probability of teams scoring 0,1,2,3 goals in each half
t1_h1 <- c(0.5, 0.3, 0.1, 0.1)
t2_h1 <- c(0.8, 0.1, 0.06, 0.04)
t1_h2 <- c(0.5, 0.4, 0.05, 0.05)
t2_h2 <- c(0.7, 0.1, 0.1, 0.1)
# Create matrix showing probability of possible first half scorelines
h1 <- t(t1_h1 %*% t(t2_h1))
h1
[,1] [,2] [,3] [,4]
[1,] 0.40 0.240 0.080 0.080
[2,] 0.05 0.030 0.010 0.010
[3,] 0.03 0.018 0.006 0.006
[4,] 0.02 0.012 0.004 0.004
# Create matrix showing probability of possible second half scorelines
h2 <- t(t1_h2 %*% t(t2_h2))
h2
[,1] [,2] [,3] [,4]
[1,] 0.35 0.28 0.035 0.035
[2,] 0.05 0.04 0.005 0.005
[3,] 0.05 0.04 0.005 0.005
[4,] 0.05 0.04 0.005 0.005
So for example from h1 you can see the probability of it being 0-0 at HT is 0.4, probability of it being 0-1 is 0.250 etc.
I want to end up with an object which gives the probability of each possible HT/FT combination. For example, it would tell me that the probability of it being 0-0 at HT and 0-0 at FT is 0.40 * 0.35 = 0.14. Or the probability of it being 1-1 at HT and 1-1 at FT (i.e. no goals in the second half) is 0.03 * 0.35 = 0.0105. It should also be clever enough to know that the probability of it being 1-0 at HT and 0-0 at FT is zero (FT goals cannot be less than HT goals).
It perhaps might be easier to end up with an object which shows HT/Second Half scoreline rather than HT/FT so we can ignore the last constraint that FT score can not be less than HT score.
I think what you are looking for is outer:
myscores <- outer(t(t1_h1 %*% t(t2_h1)), t(t2_h2 %*% t(t1_h2)))
dim(myscores) # 4 4 4 4
It's four dimensional as you mention and
myscores[1,1,1,1] # 0.14
myscores[3,1,1,1] # 0.0105
calculate the probabilities you want. Navigating this object works by following the coordinates of the matrices in your question.
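If the simpler HT/second-half object is enough (as suggested at the end of the question), the same idea can be applied directly to the two printed matrices; this is a small variant sketch of my own using the h1 and h2 from the question:
ht_h2 <- outer(h1, h2)   ## ht_h2[i,j,k,l] = h1[i,j] * h2[k,l]
dim(ht_h2)               ## 4 4 4 4
ht_h2[1, 1, 1, 1]        ## 0.40 * 0.35 = 0.14   (0-0 at HT, no goals in the second half)
ht_h2[2, 2, 1, 1]        ## 0.03 * 0.35 = 0.0105 (1-1 at HT, no goals in the second half)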
I am trying to simulate data via bootstrapping to create confidence bands for my real data with a funnel plot. I am building on the strategy of the accepted answer to a previous question. Instead of using a single probability distribution for simulating my data I want to modify it to use different probability distributions depending on the part of the data being simulated.
I greatly appreciate anyone who can help answer the question or help me phrase the question more clearly.
My problem is writing the appropriate R code to do a more complicated form of data simulation.
The current code is:
n <- 1e4
set.seed(42)
sims <- sapply(1:80,
               function(k)
                 rowSums(
                   replicate(k, sample((1:7)/10, n, TRUE, ps))) / k)
This code simulates data where each data point has a value which is the mean of between 1:80 observations.
For example, when the values of the data points are the mean of 10 observations (k = 10), it randomly samples 10 values (which can be 0.1, 0.2, 0.3, 0.4, 0.5, 0.6 or 0.7) based on a probability distribution ps, which gives the probability of each value (based on the entire empirical distribution).
ps looks like this:
ps <- prop.table(table((DF$mean_score)[DF$total_number_snps == 1]))
# 0.1 0.2 0.3 0.4 0.5 0.6 0.7
#0.582089552 0.194029851 0.124378109 0.059701493 0.029850746 0.004975124 0.004975124
e.g. the probability that the value of an observation is 0.1 is 0.582089552.
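For reference, here is a self-contained illustration (my own, rebuilding ps from the printed frequencies above) of what one simulated data point looks like for k = 10:
ps <- c(`0.1` = 0.582089552, `0.2` = 0.194029851, `0.3` = 0.124378109,
        `0.4` = 0.059701493, `0.5` = 0.029850746, `0.6` = 0.004975124,
        `0.7` = 0.004975124)
set.seed(1)
obs <- sample((1:7)/10, 10, replace = TRUE, prob = ps)  ## 10 observations drawn with these frequencies
mean(obs)                                               ## one simulated data point for k = 10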
Now instead of using one frequency distribution for all simulations I would like to use different frequency distributions conditionally depending on the number of observations underlying each datapoint.
I made a table, cond_probs, that has a row for each of my real data points. There is a column with the total number of observations and a column giving the frequency of each of the values for each observation.
Example of the cond_probs table:
gene_name 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 total
A1 0.664 0.319 0.018 0.000 0.000 0.000 0.000 0.000 0.000 113.000
A2 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000
So for the data point A2, there is only 1 observation, which has a value of 0.1. Therefore the frequency of the 0.1 observations is 1. For A1, there are 113 observations and the majority of those (0.664) have the value 0.1. The idea is that cond_probs is like ps, but cond_probs has a probability distribution for each data point rather than one for all the data.
I would like to modify the above code so that the sampling is modified to use cond_probs instead of ps for the frequency distribution. And to use the number of observations, k , as a criteria when choosing which row in cond_probs to sample from. So it would work like this:
For data points with k number of observations:
look in the cond_probs table and randomly select a row where the total number of observations is similar in size to k: 0.9k-1.1k. If no such rows exist, continue.
Once a datapoint is selected, use the probability distribution from that line in cond_probs just like ps is used in the original code, to randomly sample k number of observations and output the mean of these observations.
For each of the n iterations of replicate, randomly sample with replacement a new data point from cond_probs, out of all rows where the value of total is similar to the current value of k ( 0.9k-1.1k).
The idea is that for this dataset one should condition which probability distribution to use based on the number of observations underlying a data point. This is because in this dataset the probability of an observation is influenced by the number of observations (genes with more SNPs tend to have a lower score per observation due to genetic linkage and background selection).
UPDATE USING ANSWER BELOW:
I tried using the answer below and it works for the simulated cond_probs data in the example but not for my real cond_probs file.
I imported and converted my cond_probs file to a matrix with
cond_probs <- read.table("cond_probs.txt", header = TRUE, check.names = FALSE)
cond_probs <- as.matrix(cond_probs)
and the first example ten rows (out of ~20,000 rows) looks like this:
>cond_probs
total 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
[1,] 109 0.404 0.174 0.064 0.183 0.165 0.009 0.000 0.000 0.000 0.000
[2,] 181 0.564 0.221 0.144 0.066 0.006 0.000 0.000 0.000 0.000 0.000
[3,] 289 0.388 0.166 0.118 0.114 0.090 0.093 0.028 0.003 0.000 0.000
[4,] 388 0.601 0.214 0.139 0.039 0.008 0.000 0.000 0.000 0.000 0.000
[5,] 133 0.541 0.331 0.113 0.000 0.008 0.008 0.000 0.000 0.000 0.000
[6,] 221 0.525 0.376 0.068 0.032 0.000 0.000 0.000 0.000 0.000 0.000
[7,] 147 0.517 0.190 0.150 0.054 0.034 0.048 0.007 0.000 0.000 0.000
[8,] 107 0.458 0.196 0.252 0.084 0.009 0.000 0.000 0.000 0.000 0.000
[9,] 13 0.846 0.154 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
If I run:
sampleSize <- 20
set.seed(42)
#replace 1:80 with 1: max number of SNPs in gene in dataset
sims_test <- sapply( 1:50, simulateData, sampleSize )
and look at the means from the sampling with a given number of observations, I only get a single result when there should be 20.
for example:
> sims_test[[31]]
[1] 0.1
And sims_test is not ordered in the same way as sims:
>sims_test
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 0.1 0.1 0.1666667 0.200 0.14 0.2666667 0.2000000 0.2375 0.1888889
[2,] 0.1 0.1 0.1333333 0.200 0.14 0.2333333 0.1571429 0.2625 0.1222222
[3,] 0.1 0.1 0.3333333 0.225 0.14 0.1833333 0.2285714 0.2125 0.1555556
[4,] 0.1 0.1 0.2666667 0.250 0.10 0.1500000 0.2000000 0.2625 0.2777778
[5,] 0.1 0.1 0.3000000 0.200 0.16 0.2000000 0.2428571 0.1750 0.1000000
[6,] 0.1 0.1 0.3666667 0.250 0.16 0.1666667 0.2142857 0.2500 0.2000000
[7,] 0.1 0.1 0.4000000 0.300 0.12 0.2166667 0.1857143 0.2375 0.1666667
[8,] 0.1 0.1 0.4000000 0.250 0.10 0.2500000 0.2714286 0.2375 0.2888889
[9,] 0.1 0.1 0.1333333 0.300 0.14 0.1666667 0.1714286 0.2750 0.2888889
UPDATE 2
Using cond_probs <- head(cond_probs, n) I have determined that the code works until n = 517; for all sizes greater than this it produces the same output as above. I am not sure if this is an issue with the file itself or a memory issue. I found that if I remove line 518 and duplicate the preceding lines several times to make a larger file, it works, suggesting that the line itself is causing the problem. Line 518 looks like this:
9.000 0.889 0.000 0.000 0.000 0.111 0.000 0.000 0.000 0.000 0.000
I found another 4 offending lines:
9.000 0.444 0.333 0.111 0.111 0.000 0.000 0.000 0.000 0.000 0.000
9.000 0.444 0.333 0.111 0.111 0.000 0.000 0.000 0.000 0.000 0.000
9.000 0.111 0.222 0.222 0.111 0.111 0.222 0.000 0.000 0.000 0.000
9.000 0.667 0.111 0.000 0.000 0.000 0.222 0.000 0.000 0.000 0.000
I don't notice anything unusual about them. They all have a 'total' of 9 sites. If I remove these lines and run the 'cond_probs' file containing only the lines BEFORE these then the code works. But there must be other problematic lines as the entire 'cond_probs' still doesn't work.
I tried putting these problematic lines back into a smaller 'cond_probs' file and this file then works, so I am very confused as it doesn't seem the lines are inherently problematic. On the other hand the fact they all have 9 total sites suggests some kind of causative pattern.
I would be happy to share the entire file privately if that helps as I don't know what to do next for troubleshooting.
One further issue that comes up is I'm not sure if the code is working as expected. I made a dummy cond_probs file where there are two data points with a 'total' of '1' observation:
total 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
1.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000
1.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
So I would expect them to both be sampled for data points with '1' observation and therefore get roughly 50% of observations with a mean of '0.2' and 50% with a mean of '0.6'. However the mean is always 0.2:
sims_test[[1]]
[1] 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2
Even if I sample 10000 times all observations are 0.2 and never 0.6. My understanding of the code is that it should be randomly selecting a new row from cond_probs with similar size for each observation, but in this case is seems not to be doing so. Do I misunderstand the code or is it still a problem with my input not being correct?
The entire cond_probs file can be found at the following address:
cond_probs
UPDATE 3
Changing sapply to lapply when running the simulations fixed this issue.
Another reason I think leaving cond_probs as it is and choosing a distribution sampleSize number of times might be the best solution: The probability of choosing a distribution should be related to its frequency in cond_probs. If we combine distributions the odds of picking a distribution with total 9 or 10 will no longer depend on the number of observations with these totals. Example: If there are 90 distributions with total=10 and 10 with total=9 there should be a 90% chance to choose a distribution with total=10. If we combine distributions wouldn't the odds become 50/50 for choosing a distribution with 'total'= 9 or 10 (which would not be ideal)?
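A toy check of this weighting argument (my own sketch, assuming rows are drawn uniformly from the subset whose total is within 10% of k, as in the answer below):
## 90 distributions with total = 10 and 10 with total = 9; drawing uniformly
## from all rows within 10% of k = 10 picks a total-10 row about 90% of the time
set.seed(1)
totals <- c(rep(10, 90), rep(9, 10))
picks  <- replicate(1e4, {
  similar <- which(abs(totals - 10) / 10 <= 0.1)
  totals[sample(similar, 1)]
})
mean(picks == 10)   ## roughly 0.9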
I simply wrote a function ps that chooses an appropriate distribution from cond_probs:
N <- 10 # The sampled values are 0.1, 0.2, ... , N/10
M <- 8 # number of distributions in "cond_probs"
#-------------------------------------------------------------------
# Example data:
set.seed(1)
cond_probs <- matrix(0,M,N)
is.numeric(cond_probs)
for(i in 1:nrow(cond_probs)){ cond_probs[i,] <- dnorm((1:N)/M,i/M,0.01*N) }
is.numeric(cond_probs)
total <- sort( sample(1:80,nrow(cond_probs)) )
cond_probs <- cbind( total, cond_probs/rowSums(cond_probs) )
colnames(cond_probs) <- c( "total", paste("P",1:N,sep="") )
#---------------------------------------------------------------------
# A function that chooses an appropriate distribution from "cond_probs",
# depending on the number of observations "numObs":
ps <- function( numObs, similarityLimit = 0.1 )
{
  similar <- which( abs(cond_probs[, "total"] - numObs) / numObs < similarityLimit )
  if ( length(similar) == 0 )
  {
    return(NA)
  }
  else
  {
    return( cond_probs[similar[sample(1:length(similar), 1)], -1] )
  }
}
#-----------------------------------------------------------------
# A function that simulates data using a distribution that is
# appropriate to the number of observations, if possible:
simulateData <- function( numObs, sampleSize )
{
  if (any(is.na(ps(numObs))))
  {
    return(NA)
  }
  else
  {
    return( rowSums(
              replicate( numObs,
                         replicate( sampleSize,
                                    sample((1:N)/10, 1, prob = ps(numObs)) ) )
            ) / numObs )
  }
}
#-----------------------------------------------------------------
# Test:
sampleSize <- 30
set.seed(42)
sims <- lapply( 1:80, simulateData, sampleSize )
The distributions in cond_probs:
total P1 P2 P3 P4 P5 P6 P7 P8 P9 P10
[1,] 16 6.654875e-01 3.046824e-01 2.923948e-02 5.881753e-04 2.480041e-06 2.191926e-09 4.060763e-13 1.576900e-17 1.283559e-22 2.189990e-28
[2,] 22 2.335299e-01 5.100762e-01 2.335299e-01 2.241119e-02 4.508188e-04 1.900877e-06 1.680045e-09 3.112453e-13 1.208647e-17 9.838095e-23
[3,] 30 2.191993e-02 2.284110e-01 4.988954e-01 2.284110e-01 2.191993e-02 4.409369e-04 1.859210e-06 1.643219e-09 3.044228e-13 1.182153e-17
[4,] 45 4.407425e-04 2.191027e-02 2.283103e-01 4.986755e-01 2.283103e-01 2.191027e-02 4.407425e-04 1.858391e-06 1.642495e-09 3.042886e-13
[5,] 49 1.858387e-06 4.407417e-04 2.191023e-02 2.283099e-01 4.986746e-01 2.283099e-01 2.191023e-02 4.407417e-04 1.858387e-06 1.642492e-09
[6,] 68 1.642492e-09 1.858387e-06 4.407417e-04 2.191023e-02 2.283099e-01 4.986746e-01 2.283099e-01 2.191023e-02 4.407417e-04 1.858387e-06
[7,] 70 3.042886e-13 1.642495e-09 1.858391e-06 4.407425e-04 2.191027e-02 2.283103e-01 4.986755e-01 2.283103e-01 2.191027e-02 4.407425e-04
[8,] 77 1.182153e-17 3.044228e-13 1.643219e-09 1.859210e-06 4.409369e-04 2.191993e-02 2.284110e-01 4.988954e-01 2.284110e-01 2.191993e-02
The means of the distributions:
> cond_probs[,-1] %*% (1:10)/10
[,1]
[1,] 0.1364936
[2,] 0.2046182
[3,] 0.3001330
[4,] 0.4000007
[5,] 0.5000000
[6,] 0.6000000
[7,] 0.6999993
[8,] 0.7998670
Means of the simulated data for 31 observations:
> sims[[31]]
[1] 0.2838710 0.3000000 0.2935484 0.3193548 0.3064516 0.2903226 0.3096774 0.2741935 0.3161290 0.3193548 0.3032258 0.2967742 0.2903226 0.3032258 0.2967742
[16] 0.3129032 0.2967742 0.2806452 0.3129032 0.3032258 0.2935484 0.2935484 0.2903226 0.3096774 0.3161290 0.2741935 0.3161290 0.3193548 0.2935484 0.3032258
The appropriate distribution is the third one:
> ps(31)
P1 P2 P3 P4 P5 P6 P7 P8 P9 P10
2.191993e-02 2.284110e-01 4.988954e-01 2.284110e-01 2.191993e-02 4.409369e-04 1.859210e-06 1.643219e-09 3.044228e-13 1.182153e-17
I have this R code:
> coef
[1] 1.5 2.4 3.9 4.4
> y
[,1] [,2] [,3] [,4]
[1,] 1 2 12 45
[2,] 5 6 7 8
[3,] 9 10 2 12
[4,] 13 14 15 45
[5,] 17 18 39 7
I have to multiply each value of a column by the corresponding coefficient. The result should be:
First column:
1*1.5
5*1.5
9*1.5
13*1.5
17*1.5
Second column:
2*2.4
6*2.4
10*2.4
14*2.4
18*2.4
Third column:
12*3.9
7*3.9
2*3.9
15*3.9
39*3.9
Fourth column:
45*4.4
8*4.4
12*4.4
45*4.4
7*4.4
All of a column's values are multiplied by the coefficient at the same index in the vector.
How can I do this calculation?
The solution could be:
> y[,1] <- y[,1] * coef[1]
> y[,2] <- y[,2] * coef[2]
> y[,3] <- y[,3] * coef[3]
> y[,4] <- y[,4] * coef[4]
But this doesn't seem very optimized! Is there something better?
Thank you!
This will give you what you want:
t( t(y) * coef )
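As a quick check (a sketch of my own, rebuilding y and coef from the question), the one-liner matches the column-by-column assignments above:
coef <- c(1.5, 2.4, 3.9, 4.4)
y <- matrix(c(1, 5, 9, 13, 17,
              2, 6, 10, 14, 18,
              12, 7, 2, 15, 39,
              45, 8, 12, 45, 7), ncol = 4)
manual <- y
for (j in seq_along(coef)) manual[, j] <- y[, j] * coef[j]   ## the explicit version
all.equal(t(t(y) * coef), manual)                            ## TRUE
It works because t(y) turns the columns of y into rows, and elementwise multiplication by coef recycles the coefficients down each column of t(y), so the row that was originally column i of y is multiplied by coef[i]; transposing back restores the original layout.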
Two more possibilities: sweep and scale (the latter only operates columnwise, and seems to me to be a bit of a hack).
coef <- c(1.5,2.4,3.9,4.4)
y <- matrix(c(seq(1,17,by=4),
              seq(2,18,by=4),
              c(12,7,2,15,39,
                45,8,12,45,7)),
            ncol=4)
t(t(y)*coef)
t(apply(y,1,"*",coef))
sweep(y,2,coef,"*")
scale(y,center=FALSE,scale=1/coef)
library(rbenchmark)
benchmark(t(t(y)*coef),
          y %*% diag(coef),
          t(apply(y,1,"*",coef)),
          sweep(y,2,coef,"*"),
          scale(y,center=FALSE,scale=1/coef),
          replications=1e4)
test replications elapsed relative
5 scale(y, center = FALSE, scale = 1/coef) 10000 0.990 4.342105
4 sweep(y, 2, coef, "*") 10000 0.846 3.710526
3 t(apply(y, 1, "*", coef)) 10000 1.537 6.741228
1 t(t(y) * coef) 10000 0.228 1.000000
2 y %*% diag(coef) 10000 0.365 1.600877
edit: added y %*% diag(coef) from #baptiste [not fastest, although it might be so for a big problem with a sufficiently optimized BLAS package ...] [and it was fastest in another trial, so I may just not have had a stable estimate]
edit: fixed typo in t(t(y)*coef) [thanks to Timur Shtatland] (but did not update timings, so they might be slightly off ...)
I also tried library(Matrix); y %*% Diagonal(x=coef), which is very slow for this example but might be fast for a large matrix (??). (I also tried constructing the diagonal matrix just once, but even multiplication by a predefined matrix was slow in this example: 25x slower than the best, vs. 47x slower when defining the matrix on the fly.)
I have a mild preference for sweep as I think it expresses most clearly the operation being done ("multiply the columns by the elements of coef")
apply(y, 1, "*", coef)
# -- result --
[,1] [,2] [,3] [,4] [,5]
[1,] 1.5 7.5 13.5 19.5 25.5
[2,] 4.8 14.4 24.0 33.6 43.2
[3,] 46.8 27.3 7.8 58.5 152.1
[4,] 198.0 35.2 52.8 198.0 30.8
A late entry:
coef[col(y)]*y
On my system, this is the fastest.
test replications elapsed relative
6 coef[col(y)] * y 10000 0.068 1.000
5 scale(y, center = FALSE, scale = 1/coef) 10000 0.640 9.412
4 sweep(y, 2, coef, "*") 10000 0.535 7.868
3 t(apply(y, 1, "*", coef)) 10000 0.837 12.309
1 t(t(y) * coef) 10000 0.176 2.588
2 y %*% diag(coef) 10000 0.187 2.750
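For anyone wondering why that last entry works (a note of my own, not from the original answer): col(y) gives the column index of every cell of y, so coef[col(y)] lines up the right coefficient with every element of y in column-major order, and the elementwise product scales each column by its coefficient:
col(y)                                                ## column index of every cell of y
all.equal(coef[col(y)] * y, sweep(y, 2, coef, "*"))   ## TRUE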