I have a matrix with n rows of observations. Observations are frequency distributions of the features. I would like to transform the frequency distributions to probability distributions where the sum of each row is 1. Therefore each element in the matrix should be divided by the sum of the row of the element.
I wrote the following R function that does the work but it is very slow with large matrices:
prob_dist <- function(x) {
  row_prob_dist <- function(row) {
    return(t(lapply(row, function(x, y = sum(row)) x / y)))
  }
  for (i in 1:nrow(x)) {
    if (i == 1) p_dist <- row_prob_dist(x[i, ])
    else p_dist <- rbind(p_dist, row_prob_dist(x[i, ]))
  }
  return(p_dist)
}
B = matrix(c(2, 4, 3, 1, 5, 7), nrow=3, ncol=2)
B
[,1] [,2]
[1,] 2 1
[2,] 4 5
[3,] 3 7
prob_dist(B)
[,1] [,2]
[1,] 0.6666667 0.3333333
[2,] 0.4444444 0.5555556
[3,] 0.3 0.7
Could you suggest an R function that does the job and/or tell me how I can optimise my function to run faster?
Here's an attempt, but on a dataframe instead of a matrix:
df <- data.frame(replicate(100,sample(1:10, 10e4, rep=TRUE)))
I tried a dplyr approach:
library(dplyr)
df %>% mutate(rs = rowSums(.)) %>% mutate_each(funs(. / rs), -rs) %>% select(-rs)
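Note that mutate_each() and funs() have since been deprecated in dplyr; a rough equivalent using across() (a sketch, assuming a recent dplyr version) would be:
df %>%
  mutate(rs = rowSums(across(everything()))) %>%  # temporary row-sum column
  mutate(across(-rs, ~ .x / rs)) %>%              # divide every other column by it
  select(-rs)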
Here are the results:
library(microbenchmark)
mbm <- microbenchmark(
  dplyr = df %>% mutate(rs = rowSums(.)) %>% mutate_each(funs(. / rs), -rs) %>% select(-rs),
  t = t(t(df) / rep(rowSums(df), each = ncol(df))),
  apply = t(apply(df, 1, prop.table)),
  times = 100
)
#> mbm
#Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 123.1894 124.1664 137.7076 127.3376 131.1523 445.8857 100
# t 384.6002 390.2353 415.6141 394.8121 408.6669 787.2694 100
# apply 1425.0576 1520.7925 1646.0082 1599.1109 1734.3689 2196.5003 100
Edit: @David's benchmark below is more in line with the OP's setup, so I suggest you consider his approach if you are working with matrices.
Without apply, a vectorized solution in one line:
t(t(B) / rep(rowSums(B), each=ncol(B)))
[,1] [,2]
[1,] 0.6666667 0.3333333
[2,] 0.4444444 0.5555556
[3,] 0.3000000 0.7000000
Or:
diag(1/rowSums(B)) %*% B
Actually, I gave it a quick thought, and the best vectorization would be simply:
B/rowSums(B)
# [,1] [,2]
# [1,] 0.6666667 0.3333333
# [2,] 0.4444444 0.5555556
# [3,] 0.3000000 0.7000000
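This works because rowSums(B) has length nrow(B) and gets recycled down the columns, so element (i, j) is divided by the i-th row sum. A quick check that it matches the transpose-based version above:
all.equal(B / rowSums(B), t(t(B) / rep(rowSums(B), each = ncol(B))))
# [1] TRUE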
Actually, @Steven's benchmark was misleading, because the OP has a matrix while Steven benchmarked on a data frame.
Here's a benchmark with a matrix. For matrices, both vectorized solutions will be better than dplyr, which doesn't work on matrices directly:
set.seed(123)
m <- matrix(sample(1e6), ncol = 100)
library(dplyr)
library(microbenchmark)
Res <- microbenchmark(
  dplyr = as.data.frame(m) %>% mutate(rs = rowSums(.)) %>% mutate_each(funs(. / rs), -rs) %>% select(-rs),
  t = t(t(m) / rep(rowSums(m), each = ncol(m))),
  apply = t(apply(m, 1, prop.table)),
  DA = m / rowSums(m),
  times = 100
)
I'm not sure that your function has any value, since you could just use the hist or density functions to accomplish the same result. Also, the use of apply would work as mentioned. But it serves as a reasonable programming example.
There are several inefficiencies in your code.
1. You use a for loop instead of vectorizing your code. This is very expensive. You should use apply, as mentioned in the comments above.
2. You are using rbind instead of pre-allocating space for your output. This is extremely expensive as well. For example:
out <- matrix(NA, nrow = nrow(B), ncol = ncol(B))
for (i in 1:nrow(B)) {
  out[i, ] <- row_prob_dist(B[i, ])
}
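Even the pre-allocated loop can be replaced by a single vectorized call; for example, either of these (equivalent to the solutions in the other answers) produces the same matrix:
t(apply(B, 1, function(r) r / sum(r)))  # apply-based, row by row
sweep(B, 1, rowSums(B), "/")            # no apply at all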
I am trying to split a matrix in the same way that you might split a data.frame using split. Is there a function that does that? For example, I have matrix m and I am trying to split it into a list of matrices using vector g.
m <- matrix(rnorm(50), ncol = 5)
groups <- c('A', 'B', 'C')
g <- sample(groups, 10, replace = T)
split doesn't seem to work with matrices so we could convert it into a data.frame:
split(data.frame(m), f = g)
This works but I'd like to keep it as a matrix. The following loop works:
lapply(groups, function(x) m[g == x,])
But is there a dedicated function, or a better way?
Here is a way to split a matrix using lapply/split.
lapply(split(m, g), matrix, ncol = ncol(m))
This can easily be written as a one-line function but I prefer a version with some error check.
mat_split <- function(x, f) {
  stopifnot(nrow(x) == length(f))
  lapply(split(x, f), matrix, ncol = ncol(x))
}
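A quick usage check with the m and g from the question:
str(mat_split(m, g))  # a named list with one ncol(m)-column matrix per group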
Edit
The original question is (my emphasis):
But is there a dedicated function, or a better way?
Following this comment by user20650, there is a function, or better said, a method.
The split.data.frame method can solve the problem.
split.data.frame(m, g)
And this is written in the documentation. From help('split') (my emphasis).
split and split<- are generic functions with default and data.frame methods. The data frame method can also be used to split a matrix into a list of matrices, and the replacement form likewise, provided they are invoked explicitly.
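The "invoked explicitly" part is the key: plain split() on a matrix dispatches the default method and splits the underlying vector element by element, whereas calling the data.frame method keeps whole rows together. A quick comparison:
str(split(m, g))             # default method: a list of plain (flattened) vectors
str(split.data.frame(m, g))  # data.frame method: a list of matrices, split by row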
We can split on the sequence of rows of 'm', and use that index to subset the rows of 'm'
lapply(split(seq_len(nrow(m)), g), function(i) m[i,])
You could iterate over unique values of g and use it to subset the matrices.
sapply(unique(g), function(x) m[g == x, ])
#$B
# [,1] [,2] [,3] [,4] [,5]
#[1,] -0.3752 -0.180 -0.333 -0.634 2.011
#[2,] -0.0628 -0.537 2.089 2.069 0.605
#[3,] -0.7635 1.028 -0.779 1.205 -0.110
#$A
# [,1] [,2] [,3] [,4] [,5]
#[1,] -0.128 -1.037 0.512 -1.104 0.303
#[2,] -0.246 -0.691 1.303 -0.571 2.023
#[3,] -1.374 0.867 0.810 -0.904 -0.268
#[4,] 0.111 -0.013 0.827 1.294 0.999
#$C
# [,1] [,2] [,3] [,4] [,5]
#[1,] 0.712 1.465 -0.471 -0.383 -0.191
#[2,] 1.400 0.121 0.360 -0.890 0.412
#[3,] 0.967 -1.176 0.146 0.570 -0.143
A couple of solutions have been proposed. Here's a benchmark for a matrix of size 100,000 * 5 being split into 1,000 groups:
nr <- 1e5
m <- matrix(rnorm(nr * 5), ncol = 5)
groups <- seq_len(1000)
g <- sample(groups, nr, replace = T)
microbenchmark(
  data.frame = split(data.frame(m), f = g),
  split.data.frame = split.data.frame(m, g),
  matrix = lapply(split(m, g), matrix, ncol = ncol(m)),
  lapply1 = lapply(groups, function(x) m[g == x, ]),
  lapply2 = lapply(split(seq_len(nrow(m)), g), function(i) m[i, ])
)
Unit: milliseconds
expr min lq mean median uq max neval
data.frame 101.50167 119.37017 132.39754 124.60196 133.04204 299.9586 100
split.data.frame 14.82502 17.43736 24.66659 18.96938 25.33538 119.5009 100
matrix 18.99796 22.73603 28.14735 25.82694 31.52766 102.1667 100
lapply1 699.65742 778.61135 817.87159 811.95775 840.05130 1089.5721 100
lapply2 15.37083 17.58404 24.13295 19.08363 24.65315 106.8594 100
For a small number of groups lapply1 can be competitive, but with many groups (as in this benchmark) it is clearly the slowest, since it scans g once per group.
I have a matrix filled with somewhat random elements. I need every row sorted in decreasing order, then a function is called on the matrix, and finally the resulting matrix needs to be unsorted to the original order.
This is quickly accomplished vector-wise as shown here, but what's the fastest way to do this to every row in a matrix? Right now I'm doing:
# Example matrix
m <- matrix(runif(100), nrow = 25, ncol = 4)
# Get the initial order by row
om <- t(apply(m, 1, order, decreasing = T))
sm <- m
for (i in seq_len(nrow(m))) {
  sm[i, ] <- sm[i, om[i, ]]
}
# ** Operations performed on sm **
# Then unsort
for (i in seq_len(nrow(m))) {
  sm[i, ] <- sm[i, order(om[i, ])]
}
# sm is now sorted by-row in the same order as m
Is there some way, given om in the above, to sort and unsort while avoiding the for loop or an apply function (both of which make this operation very slow for big m)? Thanks!
Edit: There are pointers here: Fastest way to sort each row of a large matrix in R
The operation is done inside a function that is already called using parallel, so this operation must be done using serial code.
Row-wise sorting seems to be straightforward. To get the original order back (un-sort) we need the row-wise ranks rather than their order. Afterwards, what works for column sorting in @Josh O'Brien's answer can be adapted for rows.
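A small illustration of why the ranks (not the order) are what undo the sort, using a single vector:
x <- c(0.3, 0.9, 0.1)
order(x)          # 3 1 2 : positions of the sorted values within x
rank(x)           # 2 3 1 : where each element of x lands after sorting
sort(x)[rank(x)]  # 0.3 0.9 0.1 : indexing the sorted vector by the ranks restores x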
Base R solution:
rr <- t(apply(m, 1, rank)) ## get initial RANKS by row
sm <- t(apply(m, 1, sort)) ## sort m
## DOING STUFF HERE ##
sm[] <- sm[cbind(as.vector(row(rr)), as.vector(rr))] ## un-sort
all(m == sm) ## check
# [1] TRUE
Seems to work.
In your linked answer, the rowSort function of the Rfast package stands out in terms of performance, which should cover the sorting part. Moreover, there's also a rowRanks function that covers the ranking part, so we can avoid apply.
Let's try it out.
m[1:3, ]
# [,1] [,2] [,3] [,4]
# [1,] 0.9148060 0.5142118 0.3334272 0.719355838
# [2,] 0.9370754 0.3902035 0.3467482 0.007884739
# [3,] 0.2861395 0.9057381 0.3984854 0.375489965
library(Rfast)
rr <- rowRanks(m) ## get initial RANKS by row
sm <- rowSort(m) ## sort m
sm[1:3, ] # check
# [,1] [,2] [,3] [,4]
# [1,] 0.36106962 0.4112159 0.6262453 0.6311956
# [2,] 0.01405302 0.2171577 0.5459867 0.6836634
# [3,] 0.07196981 0.2165673 0.5739766 0.6737271
## DOING STUFF HERE ##
sm[] <- sm[cbind(as.vector(row(rr)), as.vector(rr))] ## un-sort
all(sm == m) ## check
# [1] TRUE
Ditto.
Benchmark
m.test <- matrix(runif(4e6), ncol = 4)
dim(m.test)
# [1] 1000000 4
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# Rfast 897.6286 910.91 956.6259 924.1914 986.1246 1048.058 3 a
# baseR 87931.2824 88004.73 95659.8671 88078.1737 99524.1594 110970.145 3 c
# forloop 58927.7784 59434.54 60317.3903 59941.2930 61012.1963 62083.100 3 b
Not so bad!!
Data/Code:
set.seed(42)
m <- matrix(runif(100), nrow = 25, ncol = 4)
## benchmark
m.test <- matrix(runif(4e6), ncol = 4)
microbenchmark::microbenchmark(
  Rfast = {
    rr <- rowRanks(m.test)
    sm <- rowSort(m.test)
    sm[] <- sm[cbind(as.vector(row(rr)), as.vector(rr))]
  },
  baseR = {
    rr <- t(apply(m.test, 1, rank))
    sm <- t(apply(m.test, 1, sort))
    sm[] <- sm[cbind(as.vector(row(rr)), as.vector(rr))]
  },
  forloop = {
    om <- t(apply(m.test, 1, order, decreasing = T))
    sm <- m.test
    for (i in seq_len(nrow(m.test))) {
      sm[i, ] <- sm[i, om[i, ]]
    }
    for (i in seq_len(nrow(m.test))) {
      sm[i, ] <- sm[i, order(om[i, ])]
    }
  },
  times = 3L
)
I want to compute the distance among all points in a very large matrix using distm from geosphere.
See a minimal example:
library(geosphere)
library(data.table)
coords <- data.table(coordX=c(1,2,5,9), coordY=c(2,2,0,1))
distances <- distm(coords, coords, fun = distGeo)
The issue is that, due to the nature of the distances I am computing, distm gives me back a symmetric matrix, so I could avoid calculating more than half of the distances:
structure(c(0, 111252.129800202, 497091.059564718, 897081.91986428,
111252.129800202, 0, 400487.621661164, 786770.053508848, 497091.059564718,
400487.621661164, 0, 458780.072878927, 897081.91986428, 786770.053508848,
458780.072878927, 0), .Dim = c(4L, 4L))
Could you help me find a more efficient way to compute all those distances, avoiding computing each one twice?
If you want to compute all pairwise distances for points x, it is better to use distm(x) rather than distm(x,x). The distm function returns the same symmetric matrix in both cases but when you pass it a single argument it knows that the matrix is symmetric, so it won't do unnecessary computations.
You can time it.
library("geosphere")
n <- 500
xy <- matrix(runif(n*2, -90, 90), n, 2)
system.time( replicate(100, distm(xy, xy) ) )
# user system elapsed
# 61.44 0.23 62.79
system.time( replicate(100, distm(xy) ) )
# user system elapsed
# 36.27 0.39 38.05
You can also look at the R code for geosphere::distm to check that it treats the two cases differently.
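For instance, simply printing the function shows its body in the console:
library(geosphere)
distm  # per the above, the single-argument case is handled without the redundant computations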
Aside: a quick Google search finds parallelDist (Parallel Distance Matrix Computation) on CRAN; the geodesic distance is an option there.
You can prepare a data frame of all possible pair combinations without repetitions (with the gtools package) and then compute the distances only for those pairs. Here is the code:
library(gtools)
library(geosphere)
library(data.table)
coords <- data.table(coordX = c(1, 2, 5, 9), coordY = c(2, 2, 0, 1))
pairs <- combinations(n = nrow(coords), r = 2, repeats.allowed = F, v = 1:nrow(coords))
distances <- apply(pairs, 1, function(x) {
  distm(coords[x[1], ], coords[x[2], ], fun = distGeo)
})
# Construct the symmetric distance matrix; combinations() enumerates the pairs in
# the same column-major order as lower.tri(), so fill the lower triangle first
dist_mat <- matrix(NA, nrow = nrow(coords), ncol = nrow(coords))
dist_mat[lower.tri(dist_mat)] <- distances
dist_mat[upper.tri(dist_mat)] <- t(dist_mat)[upper.tri(dist_mat)]
dist_mat[is.na(dist_mat)] <- 0
print(dist_mat)
The results:
         [,1]     [,2]     [,3]     [,4]
[1,]      0.0 111252.1 497091.1 897081.9
[2,] 111252.1      0.0 400487.6 786770.1
[3,] 497091.1 400487.6      0.0 458780.1
[4,] 897081.9 786770.1 458780.1      0.0
Using combn() from base R might be slightly simpler and probably faster than loading additional packages. Moreover, distm() uses distGeo() under the hood, so calling the latter directly should be even faster.
coords <- as.data.frame(coords) # this won't work with data.tables though
pairs <- t(combn(nrow(coords), 2))
cbind(pairs, geosphere::distGeo(coords[pairs[, 1], ], coords[pairs[, 2], ]))
#      [,1] [,2]     [,3]
# [1,]    1    2 111252.1
# [2,]    1    3 497091.1
# [3,]    1    4 897081.9
# [4,]    2    3 400487.6
# [5,]    2    4 786770.1
# [6,]    3    4 458780.1
We could check it out with a benchmark.
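The benchmark call itself isn't shown; a minimal sketch of what it might have looked like (the expression names match the results below, and pairs is the object built above):
microbenchmark::microbenchmark(
  distm   = distm(coords, coords, fun = distGeo),
  distGeo = cbind(pairs, geosphere::distGeo(coords[pairs[, 1], ], coords[pairs[, 2], ])),
  times = 100
)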
Unit: microseconds
expr min lq mean median uq max neval cld
distm 555.690 575.846 597.7672 582.352 596.1295 904.718 100 b
distGeo 426.335 434.372 450.0196 441.516 451.8490 609.524 100 a
Looks good.
I'm using the code below to create a matrix that compares all strings in one vector to see if they contain any of the patterns in the second vector:
library(stringr)

strngs <- c("hello there", "welcome", "how are you")
pattern <- c("h", "e", "o")
M <- matrix(nrow = length(strngs), ncol = length(pattern))
for (i in 1:length(strngs)) {
  for (j in 1:length(pattern)) {
    M[i, j] <- str_count(strngs[i], pattern[j])
  }
}
M
It works great, and returns the matrix I'm looking for:
[,1] [,2] [,3]
[1,] 2 3 1
[2,] 0 2 1
[3,] 1 1 2
However, my real data set is huge, and looping like this doesn't scale well to a matrix with 117,746,754 values. Does anyone know a way I could vectorize this or otherwise speed it up? Or should I just learn C++? ;)
Thanks!
You can use outer and stri_count_fixed as suggested by @snoram.
outer(strngs, pattern, stringi::stri_count_fixed)
# [,1] [,2] [,3]
#[1,] 2 3 1
#[2,] 0 2 1
#[3,] 1 1 2
Here is some marginal improvement by removing the inner loop and switching to stringi (which stringr is built upon).
M <- matrix(0L, nrow = length(strngs), ncol = length(pattern))
for (i in 1:length(strngs)) {
  M[i, ] <- stringi::stri_count_fixed(strngs[i], pattern)
}
And then a more standard R way:
t(sapply(strngs, stringi::stri_count_fixed, pattern))
Yet another solution, with sapply. Basically snoram's solution.
t(sapply(strngs, stringi::stri_count_fixed, pattern))
# [,1] [,2] [,3]
#hello there 2 3 1
#welcome 0 2 1
#how are you 1 1 2
Tests.
Since there are a total of 4 ways, here are some speed tests.
f0 <- function() {
  M <- matrix(nrow = length(strngs), ncol = length(pattern))
  for (i in 1:length(strngs)) {
    for (j in 1:length(pattern)) {
      M[i, j] <- stringr::str_count(strngs[i], pattern[j])
    }
  }
  M
}
f1 <- function() {
  M <- matrix(0L, nrow = length(strngs), ncol = length(pattern))
  for (i in 1:length(strngs)) {
    M[i, ] <- stringi::stri_count_fixed(strngs[i], pattern)
  }
  M
}
f2 <- function() outer(strngs, pattern, stringi::stri_count_fixed)
f3 <- function() t(sapply(strngs, stringi::stri_count_fixed, pattern))
r0 <- f0()
r1 <- f1()
r2 <- f2()
r3 <- f3()
identical(r0, r1)
identical(r0, r2)
identical(r0, r3) # FALSE, the return has rownames
library(microbenchmark)
library(ggplot2)
mb <- microbenchmark(
  op = f0(),
  snoram = f1(),
  markus = f2(),
  rui = f3()
)
mb
#Unit: microseconds
# expr min lq mean median uq max
# op 333.425 338.8705 348.23310 341.7700 345.8060 542.699
# snoram 47.923 50.8250 53.96677 54.8500 56.3870 69.903
# markus 27.502 29.8005 33.17537 34.3670 35.7490 54.095
# rui 68.994 72.3020 76.77452 73.4845 77.1825 215.328
autoplot(mb)
I am converting a MATLAB script into R and regretting it so far, as it is slower at the moment. I'm trying to use "vectorized functions" as much as possible, but I'm relatively new to R and do not know what is meant by this. From my research, for loops are only slower than the apply() method in R if you use lots of operators (including parentheses). Otherwise, I don't see what R could have done to slow it down further. Here is code that works that I want to speed up.
somPEs <- 9;
inputPEs <- 6;
initial_w <- matrix(1, nrow=somPEs, ncol=inputPEs)
w <- apply(initial_w, 1, function(i) runif(i));
# Reshape w to a 3D matrix of dimension: c(sqrt(somPEs), sqrt(somPEs), inputPEs)
nw <- array(0, dim=c(sqrt(somPEs), sqrt(somPEs), inputPEs))
for (i in 1:inputPEs) {
  nw[,,i] <- matrix(w[i,], nrow=sqrt(somPEs), ncol=sqrt(somPEs), byrow=TRUE)
}
w <- nw;
In MATLAB, this code is executed by a built-in function called "reshape", as is done as below:
w = reshape(w,[sqrt(somPEs) sqrt(somPEs) inputPEs]);
I timed my current R code and it's actually super fast, but I'd still like to learn about vectorization and how to convert my code to apply() for readability's sake.
user system elapsed
0.003 0.000 0.002
The first step is to convert your array w from 6x9 to 3x3x6 size, which in your case can be done by transposing and then changing the dimension:
neww <- t(w)
dim(neww) <- c(sqrt(somPEs), sqrt(somPEs), inputPEs)
This is almost what we want, except that the first two dimensions are flipped. You can use the aperm function to transpose them:
neww <- aperm(neww, c(2, 1, 3))
This should be a good deal quicker than looping through the matrix and individually copying over data by row. To see this, let's look at a larger example with 10,000 rows and 100 columns (which will be mapped to a 10x10x10k matrix):
josilber <- function(w) {
  neww <- t(w)
  dim(neww) <- c(sqrt(dim(w)[2]), sqrt(dim(w)[2]), dim(w)[1])
  aperm(neww, c(2, 1, 3))
}
OP <- function(w) {
  nw <- array(0, dim = c(sqrt(dim(w)[2]), sqrt(dim(w)[2]), dim(w)[1]))
  for (i in 1:(dim(w)[1])) {
    nw[,,i] <- matrix(w[i,], nrow = sqrt(dim(w)[2]), ncol = sqrt(dim(w)[2]), byrow = TRUE)
  }
  nw
}
bigw <- matrix(runif(1000000), nrow=10000, ncol=100)
all.equal(josilber(bigw), OP(bigw))
# [1] TRUE
microbenchmark(josilber(bigw), OP(bigw))
# Unit: milliseconds
# expr min lq mean median uq max neval
# josilber(bigw) 8.483245 9.08430 14.46876 9.431534 11.76744 135.7204 100
# OP(bigw) 83.379053 97.07395 133.86606 117.223236 129.28317 1553.4381 100
The approach using t, dim, and aperm is more than 10x faster in median runtime than the looping approach.
I did not test the speed, but you could try
nw1 <- aperm(`dim<-`(t(w), list(3, 3, 6)), c(2, 1, 3))
> nw1
, , 1
[,1] [,2] [,3]
[1,] 0.8257185 0.5475478 0.4157915
[2,] 0.8436991 0.3310513 0.1546463
[3,] 0.1794918 0.1836032 0.2675192
, , 2
[,1] [,2] [,3]
[1,] 0.6914582 0.1674163 0.2921129
[2,] 0.2558240 0.4269716 0.7335542
[3,] 0.6416367 0.8771934 0.6553210
, , 3
[,1] [,2] [,3]
[1,] 0.9761232 0.05223183 0.6651574
[2,] 0.5740032 0.80621864 0.2295017
[3,] 0.1138926 0.76009870 0.6932736
, , 4
[,1] [,2] [,3]
[1,] 0.437871558 0.5172516 0.1145181
[2,] 0.006923583 0.3235762 0.3751655
[3,] 0.823235642 0.4586850 0.6013853
, , 5
[,1] [,2] [,3]
[1,] 0.7425735 0.1665975 0.8659373
[2,] 0.1418979 0.1878132 0.2357267
[3,] 0.6963537 0.5391961 0.1112467
, , 6
[,1] [,2] [,3]
[1,] 0.7246276 0.02896792 0.04692648
[2,] 0.7563403 0.22027518 0.41138672
[3,] 0.8303413 0.31908307 0.25180560