I have a data.frame where each ID has exactly 3 attributes. For simplicity I use only 100 rows here, although my real dataset has around 1,000,000. There are around 50 different possible attributes, which are a mixture of numbers and characters.
data <- data.frame(id = 1:100,
a1 = sample(letters,100,replace = T),
a2 = sample(letters,100,replace = T),
a3 = sample(letters,100,replace = T),
stringsAsFactors=FALSE) %>%
as_tibble()
I want to know which combinations are the most frequent (the order does not matter).
So the outcome is supposed to be something like this:
pattern | frequency
a,a,a | 10
A,b,c | 5
a,e,c | 4
... | ....
First I started to create a vector which contains all possible combinations:
possible_combinations <- combn(c(letters,LETTERS),3) %>%
t() %>%
as_tibble() %>%
unite("combination",sep="") %>%
pull()
Then I wrote this nested loop to count the frequencies:
counter = 0
inner_counter = 0
combination_counter = vector(mode = "numeric", length = length(possible_combinations))
for (j in 1:length(possible_combinations)){
  for (i in 1:nrow(data)){
    # inner_counter counts how many attributes of one ID appear in one combination
    inner_counter = inner_counter + str_count(possible_combinations[j], data[[i,2]])
    inner_counter = inner_counter + str_count(possible_combinations[j], data[[i,3]])
    inner_counter = inner_counter + str_count(possible_combinations[j], data[[i,4]])
    # if all three attributes are in the combination, the counter increases by one
    if (inner_counter == 3) {
      counter = counter + 1
    }
    inner_counter = 0
  }
  # combination_counter is a vector which saves the frequency with
  # which a combination occurred across all different ids
  combination_counter[[j]] = counter
  counter = 0
}
I know this is really not very R like, but I don't know how to do it in a different way. The runtime is even bad for my little toy example and it's almost infeasible for my real data.
You could also do this with base R:
table(apply(data[,2:4], 1, function(x) paste0(sort(x), collapse = ",")))
The problem that you are going to run into is dealing with the massive number of combinations. Even if you try to apply a simple solution of sorting each row, this will cost a lot of time for the number of rows you are dealing with.
Take the following example with the straightforward approach offered by @Lennyy:
set.seed(123)
n <- 1e7
data <- data.frame(id = 1:n,
a1 = sample(letters, n, replace = T),
a2 = sample(letters, n, replace = T),
a3 = sample(letters, n, replace = T),
stringsAsFactors = FALSE)
system.time(t2 <- table(apply(data[,2:4], 1, function(x) paste0(sort(x), collapse = ","))))
user system elapsed
373.281 1.695 375.445
That's a long time...
Here is the output for reference:
head(t2)
a,a,a a,a,b a,a,c a,a,d a,a,e a,a,f
603 1657 1620 1682 1759 1734
We need to somehow code each row quickly without worrying about which column a particular element came from. Additionally, we need to do this in a way that will guarantee uniqueness.
What about a hash table? We can easily do this with Rcpp.
#include <Rcpp.h>
#include <unordered_map>
using namespace Rcpp;
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
IntegerVector countCombos(IntegerMatrix myMat, int numAttr, CharacterVector myAttr) {
unsigned long int numRows = myMat.nrow();
unsigned long int numCols = myMat.ncol();
std::unordered_map<std::string, int> mapOfVecs;
for (std::size_t i = 0; i < numRows; ++i) {
std::vector<int> testVec(numAttr, 0);
for (std::size_t j = 0; j < numCols; ++j) {
++testVec[myMat(i, j) - 1];
}
std::string myKey(testVec.begin(), testVec.end());
auto it = mapOfVecs.find(myKey);
if (it == mapOfVecs.end()) {
mapOfVecs.insert({myKey, 1});
} else {
++(it->second);
}
}
std::size_t count = 0;
IntegerVector out(mapOfVecs.size());
CharacterVector myNames(mapOfVecs.size());
for (const auto& elem: mapOfVecs) {
std::size_t i = 0;
for (auto myChar: elem.first) {
while (myChar) {
myNames[count] += myAttr[i];
--myChar;
}
++i;
}
out[count++] = elem.second;
}
out.attr("names") = myNames;
return out;
}
This offers a great efficiency gain over any of the other solutions posted:
myRows <- 1:nrow(data)
attrCount <- 26
matOfInts <- vapply(2:ncol(data), function(x) {
match(data[, x], letters)
}, myRows, USE.NAMES = FALSE)
system.time(t <- countCombos(matOfInts, attrCount, letters))
user system elapsed
2.570 0.007 2.579
That's over 100 times faster!!!!
And here is the output:
head(t)
jkk ddd qvv ttu aaq ccd
1710 563 1672 1663 1731 1775
Testing equality (the output is in different order, so we must sort first):
identical(sort(unname(t)), as.integer(sort(unname(t2))))
[1] TRUE
Explanation
The countCombos function accepts a matrix of integers. This matrix represents the indices of elements of the unique attributes (in our example, this would be represented by letters).
As we are dealing with combinations with repetition, we can easily represent them as an indexing frequency vector.
The template vector is:
 a  b  c  d  e  ...  y  z
 |  |  |  |  |       |  |
 v  v  v  v  v       v  v
(0, 0, 0, 0, 0, ...  0, 0)
And here is how certain combinations get mapped:
aaa -->> (3, rep(0, 25))
zdd -->> dzd -->> ddz -->> (0, 0, 0, 2, rep(0, 21), 1)
Once we have created our vector, we convert it to a string, so ddz becomes:
ddz -->> c(0, 0, 0, 2, rep(0, 21), 1) -->> `00020000000000000000000001`
And this is the key that is used in our hash.
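For illustration, here is a pure-R sketch of the same encoding for a single row (countCombos builds this key in C++; the digits below are just to show the idea, and row_attrs is a made-up example row):
row_attrs <- c("d", "z", "d")  # i.e. the row "dzd"
key <- paste(tabulate(match(row_attrs, letters), nbins = 26), collapse = "")
key
#> [1] "00020000000000000000000001"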
If I've understood you correctly, the ordering of the attributes doesn't matter, so aba is the same as aab and baa. You also have 50 different attributes, and the other solutions seem to rely on typing these in manually.
The following code creates a column that is the concatenation of all attribute columns, sorted within each row so that the order of the attributes is ignored, and then calculates the count per group:
library(dplyr)
library(rlang)
cnames <- colnames(data)
cnames <- cnames[2:length(cnames)] #assuming the first column is the only non-attribute column,
#remove any other non-attribute columns as necessary
#!!!syms(cnames) outputs them as the columns rather than text, taken from here
# https://stackoverflow.com/questions/44613279/dplyr-concat-columns-stored-in-variable-mutate-and-non-standard-evaluation?rq=1
data %>%
  rowwise() %>%
  mutate(comb = paste0(sort(c(!!!syms(cnames))), collapse = "")) %>%
  ungroup() %>%
  group_by(comb) %>%
  summarise(cnt = n())
You can use dplyr to do this efficiently. First use group_by to group variables a1, a2, and a3, then use summarize and n() to count frequencies:
set.seed(100)
N = 1e5
data <- data.frame(id = 1:N,
a1 = sample(letters[1:5],N,replace = T),
a2 = sample(letters[1:5],N,replace = T),
a3 = sample(letters[1:5],N,replace = T),
stringsAsFactors=FALSE)
data %>%
group_by(a1, a2, a3) %>%
summarize(count = n()) %>%
arrange(count)
## A tibble: 125 x 4
## Groups: a1, a2 [25]
# a1 a2 a3 count
# <chr> <chr> <chr> <int>
# 1 b a d 735
# 2 c b d 741
# 3 a d e 747
# 4 d a e 754
# 5 d e e 754
# 6 d e c 756
# 7 e a d 756
# 8 d c d 757
# 9 c c c 758
#10 d a b 759
## ... with 115 more rows
Related
Summary of real-world problem
Essentially this is a scenario evaluation of a linear system of equations.
I have two data tables.
s_dt contains the scenarios, drivers (d) and values (v) for each observed scenario (o).
c_dt contains a series of terms (n) for a number of fitted model bases (b).
The individual powers of drivers, and associated coefficients are coded into (d and t) as name-value pairs.
Each basis (b) is essentially a polynomial with n terms.
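In formula terms (this is what fun() in the repro below computes): for each scenario o and basis b,
r[o, b] = sum over terms n of ( product over the rows of term n of x ),
where x = t for the coefficient row (d == "c"), x = v[d]^t for a driver row, and a driver with no value v makes the whole term 0.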
The issue
Repro case below gives the desired output format.
But it is far too slow for the required use case, even on a cut-down problem.
Numbers are junk, but I can't share actual data. Running on real-world data gives similar timing.
Circa 3 sec for the "lil" problem on my system (12 threads).
But the "big" problem is 4000 times larger, so expect circa 3 hours. Ouch!
The aim is to have the "big" problem run in under 5 min (or ideally much faster!)
So, awesome clever people, how can this be made a lot faster?
(And what is the root cause of the slowdown?)
I'll happily accept base/tidyverse solutions too, if they meet the performance needs. I just assumed data.table was the best way to go for the size of the problem.
Current solution
Run fun on s_dt, grouping by o.
fun: joins c_dt with each group's data to populate v, enabling the calculation of r (the result of evaluating each of the polynomial equations).
In data.table parlance:
s_dt[, fun(.SD), keyby = .(o)]
Repro case
Creates two data.tables that have the combinations and field types matching real-world problem.
But with cut-down size for illustrative purposes.
Defines fun, then runs to populate r for all scenarios.
library(data.table)
# problem sizing ----
dims <- list(o = 50000, d = 50, b = 250, n = 200) # "big" problem - real-life size
dims <- list(o = 100, d = 50, b = 25, n = 200) # "lil" problem (make runtime shorter as example)
# build some test data tables ----
build_s <- function() {
o <- seq_len(dims$o)
d <- paste0("d",seq_len(dims$d))
v <- as.double(seq_len(dims$o * dims$d))/10000
CJ(o, d)[, `:=`(v = v)]
}
s_dt <- build_s()
build_c <- function() {
b <- paste0("c", seq_len(dims$b))
n <- seq_len(dims$n)
d <- c("c", paste0("d", seq_len(dims$d)))
t <- as.double(rep_len(0:6, dims$b * dims$n * (dims$d+1)))
dt <- CJ(d, b, n)[, `:=`(t = t)]
dt <- dt[t != 0]
}
c_dt <- build_c()
# define fun and evaluate ----
# (this is what needs optimising)
profvis::profvis({
fun <- function(dt) {
# don't use chaining here, for more useful profvis output
dt <- dt[c_dt, on = .(d)]
dt <- dt[, r := fcase(d == "c", t,
is.na(v), 0,
rep(TRUE, .N), v^t)]
dt <- dt[, .(r = prod(r)), keyby = .(b, n)]
dt <- dt[, .(r = sum(r)), keyby = .(b)]
}
res <- s_dt[, fun(.SD), keyby = .(o)]
})
Example inputs and outputs
> res
o b r
1: 1 c1 0.000000e+00
2: 1 c10 0.000000e+00
3: 1 c11 0.000000e+00
4: 1 c12 0.000000e+00
5: 1 c13 0.000000e+00
---
2496: 100 c5 6.836792e-43
2497: 100 c6 6.629646e-43
2498: 100 c7 6.840915e-43
2499: 100 c8 6.624668e-43
2500: 100 c9 6.842608e-43
> s_dt
o d v
1: 1 d1 0.0001
2: 1 d10 0.0002
3: 1 d11 0.0003
4: 1 d12 0.0004
5: 1 d13 0.0005
---
4996: 100 d50 0.4996
4997: 100 d6 0.4997
4998: 100 d7 0.4998
4999: 100 d8 0.4999
5000: 100 d9 0.5000
> c_dt
d b n t
1: c c1 2 1
2: c c1 3 2
3: c c1 4 3
4: c c1 5 4
5: c c1 6 5
---
218567: d9 c9 195 5
218568: d9 c9 196 6
218569: d9 c9 198 1
218570: d9 c9 199 2
218571: d9 c9 200 3
This would be difficult to fully vectorize. The "big" problem requires so many operations that going parallel is probably the most straightforward way to get to ~5 minutes.
But first, we can get a ~3x speed boost by using RcppArmadillo for the product and sum calculations instead of data.table's grouping operations.
library(data.table)
library(parallel)
Rcpp::cppFunction(
"std::vector<double> sumprod(arma::cube& a) {
for(unsigned int i = 1; i < a.n_slices; i++) a.slice(0) %= a.slice(i);
return(as<std::vector<double>>(wrap(sum(a.slice(0), 0))));
}",
depends = "RcppArmadillo",
plugins = "cpp11"
)
cl <- makeForkCluster(detectCores() - 1L)
The following approach requires extensive preprocessing. The upshot is that it makes it trivial to parallelize. However, it will work only if the values of s_dt$d are the same for each o as in the MRE:
identical(s_dt$d, rep(s_dt[o == 1]$d, length.out = nrow(s_dt)))
#> [1] TRUE
Now let's build the functions to accept s_dt and c_dt:
# slightly modified original function for comparison
fun1 <- function(dt, c_dt) {
# don't use chaining here, for more useful profvis output
dt <- dt[c_dt, on = .(d)]
dt <- dt[, r := fcase(d == "c", t,
is.na(v), 0,
rep(TRUE, .N), v^t)]
dt <- dt[, .(r = prod(r)), keyby = .(b, n)]
dt <- dt[, .(r = sum(r)), keyby = .(b)]
}
fun2 <- function(s_dt, c_dt, cl = NULL) {
s_dt <- copy(s_dt)
c_dt <- copy(c_dt)
# preprocess to get "a", "tt", "i", and "idxs"
i_dt <- s_dt[o == 1][, idxs := .I][c_dt, on = .(d)][, ic := .I][!is.na(v)]
ub <- unique(c_dt$b)
un <- unique(c_dt$n)
nb <- length(ub)
nn <- length(un)
c_dt[, `:=`(i = match(n, un) + nn*(match(b, ub) - 1L), r = 0)]
c_dt[, `:=`(i = i + (0:(.N - 1L))*nn*nb, ni = .N), i]
c_dt[d == "c", r := t]
a <- array(1, c(nn, nb, max(c_dt$ni)))
a[c_dt$i] <- c_dt$r # 3-d array to store v^t (updated for each unique "o")
i <- c_dt$i[i_dt$ic] # the indices of "a" to update (same for each unique "o")
tt <- c_dt$t[i_dt$ic] # c_dt$t ordered for "a" (same for each unique "o")
idxs <- i_dt$idxs # the indices to order s_dt$v (same for each unique "o")
uo <- unique(s_dt$o)
v <- collapse::gsplit(s_dt$v, s_dt$o)
if (is.null(cl)) {
# non-parallel solution
data.table(
o = rep(uo, each = length(ub)),
b = rep(ub, length(v)),
r = unlist(
lapply(
v,
function(x) {
a[i] <- x[idxs]^tt
sumprod(a)
}
)
),
key = "o"
)
} else {
# parallel solution
clusterExport(cl, c("a", "tt", "i", "idxs"), environment())
data.table(
o = rep(uo, each = length(ub)),
b = rep(ub, length(v)),
r = unlist(
parLapply(
cl,
v,
function(x) {
a[i] <- x[idxs]^tt
sumprod(a)
}
)
),
key = "o"
)
}
}
Now the data:
# problem sizing ----
bigdims <- list(o = 50000, d = 50, b = 250, n = 200) # "big" problem - real-life size
lildims <- list(o = 100, d = 50, b = 25, n = 200) # "lil" problem (make runtime shorter as example)
# build some test data tables ----
build_s <- function(dims) {
o <- seq_len(dims$o)
d <- paste0("d",seq_len(dims$d))
v <- as.double(seq_len(dims$o * dims$d))/10000
CJ(o, d)[, `:=`(v = v)]
}
build_c <- function(dims) {
b <- paste0("c", seq_len(dims$b))
n <- seq_len(dims$n)
d <- c("c", paste0("d", seq_len(dims$d)))
t <- as.double(rep_len(0:6, dims$b * dims$n * (dims$d+1)))
dt <- CJ(d, b, n)[, `:=`(t = t)]
dt <- dt[t != 0]
}
Timing the lil problem, which is so small that parallelization doesn't help:
s_dt <- build_s(lildims)
c_dt <- build_c(lildims)
microbenchmark::microbenchmark(fun1 = s_dt[, fun1(.SD, c_dt), o],
fun2 = fun2(s_dt, c_dt),
times = 10,
check = "equal")
#> Unit: seconds
#> expr min lq mean median uq max neval
#> fun1 3.204402 3.237741 3.383257 3.315450 3.404692 3.888289 10
#> fun2 1.134680 1.138761 1.179907 1.179872 1.210293 1.259249 10
Now the big problem:
s_dt <- build_s(bigdims)
c_dt <- build_c(bigdims)
system.time(dt2p <- fun2(s_dt, c_dt, cl))
#> user system elapsed
#> 24.937 9.386 330.600
stopCluster(cl)
A bit longer than 5 minutes with 31 cores.
I want to speed up a function for creating a pairwise matrix that describes the number of times an object is selected before and after all other objects, within a set of locations.
Here is an example df:
df <- data.frame(Shop = c("A","A","A","B","B","C","C","D","D","D","E","E","E"),
Fruit = c("apple", "orange", "pear",
"orange", "pear",
"pear", "apple",
"pear", "apple", "orange",
"pear", "apple", "orange"),
Order = c(1, 2, 3,
1, 2,
1, 2,
1, 2, 3,
1, 1, 1))
In each Shop, Fruit is picked by a customer in a given Order.
The following function creates an m x n pairwise matrix:
loop.function <- function(df){
fruits <- unique(df$Fruit)
nt <- length(fruits)
mat <- array(dim=c(nt,nt))
for(m in 1:nt){
for(n in 1:nt){
## filter df for each pair of fruit
xm <- df[df$Fruit == fruits[m],]
xn <- df[df$Fruit == fruits[n],]
## index instances when a pair of fruit are picked in same shop
mm <- match(xm$Shop, xn$Shop)
## filter xm and xn based on mm
xm <- xm[! is.na(mm),]
xn <- xn[mm[! is.na(mm)],]
## assign number of times fruit[m] is picked after fruit[n] to mat[m,n]
mat[m,n] <- sum(xn$Order < xm$Order)
}
}
row.names(mat) <- fruits
colnames(mat) <- fruits
return(mat)
}
Where mat[m,n] is the number of times fruits[m] is picked after fruits[n]. And mat[n,m] is the number of times fruits[m] is picked before fruits[n]. It is not recorded if pairs of fruit are picked at the same time (e.g. in Shop E).
See expected output:
>loop.function(df)
apple orange pear
apple 0 0 2
orange 2 0 1
pear 1 2 0
You can see here that pear is chosen twice before apple (in Shop C and D), and apple is chosen once before pear (in Shop A).
I am trying to improve my knowledge of vectorization, especially in place of loops, so I want to know how this loop can be vectorized.
(I have a feeling there may be a solution using outer(), but my knowledge of vectorizing functions is still very limited.)
Update
See benchmarking with real data (times = 10000) for loop.function(), tidyverse.function(), loop.function2(), datatable.function() and loop.function.TMS():
Unit: milliseconds
expr min lq mean median uq max neval cld
loop.function(dat) 186.588600 202.78350 225.724249 215.56575 234.035750 999.8234 10000 e
tidyverse.function(dat) 21.523400 22.93695 26.795815 23.67290 26.862700 295.7456 10000 c
loop.function2(dat) 119.695400 126.48825 142.568758 135.23555 148.876100 929.0066 10000 d
datatable.function(dat) 8.517600 9.28085 10.644163 9.97835 10.766749 215.3245 10000 b
loop.function.TMS(dat) 4.482001 5.08030 5.916408 5.38215 5.833699 77.1935 10000 a
Probably the most interesting result for me is the performance of tidyverse.function() on the real data. I will have to try to add the Rcpp solutions at a later date - I'm having trouble making them work on the real data.
I appreciate all the interest and answers given to this post - my intention was to learn and improve performance, and there is certainly a lot to learn from all the comments and solutions given. Thanks!
A data.table solution :
library(data.table)
setDT(df)
setkey(df,Shop)
dcast(df[df, on = .(Shop = Shop), allow.cartesian = TRUE][
  , .(cnt = sum(i.Order < Order & i.Fruit != Fruit)), by = .(Fruit, i.Fruit)],
  Fruit ~ i.Fruit, value.var = 'cnt')
Fruit apple orange pear
1: apple 0 0 2
2: orange 2 0 1
3: pear 1 2 0
The Shop index isn't necessary for this example, but will probably improve performance on a larger dataset.
As the question raised many comments on performance, I decided to check what Rcpp could bring:
library(Rcpp)
cppFunction('NumericMatrix rcppPair(DataFrame df) {
std::vector<std::string> Shop = Rcpp::as<std::vector<std::string> >(df["Shop"]);
Rcpp::NumericVector Order = df["Order"];
Rcpp::StringVector Fruit = df["Fruit"];
StringVector FruitLevels = sort_unique(Fruit);
IntegerVector FruitInt = match(Fruit, FruitLevels);
int n = FruitLevels.length();
std::string currentShop = "";
int order, fruit, i, f;
NumericMatrix result(n,n);
NumericVector fruitOrder(n);
for (i=0;i<Fruit.length();i++){
if (currentShop != Shop[i]) {
//Init counter for each shop
currentShop = Shop[i];
std::fill(fruitOrder.begin(), fruitOrder.end(), 0);
}
order = Order[i];
fruit = FruitInt[i];
fruitOrder[fruit-1] = order;
for (f=0;f<n;f++) {
if (order > fruitOrder[f] && fruitOrder[f] > 0) {
result(fruit-1,f) = result(fruit-1,f)+1;
}
}
}
rownames(result) = FruitLevels;
colnames(result) = FruitLevels;
return(result);
}
')
rcppPair(df)
apple orange pear
apple 0 0 2
orange 2 0 1
pear 1 2 0
On the example dataset, this runs >500 times faster than the data.table solution, probably because it doesn't have the cartesian product problem. It isn't meant to be robust to malformed input, and it expects the shops/orders to be in ascending order.
Considering the few minutes spent to find the 3 lines of code for the data.table solution, compared to the much longer Rcpp solution / debugging process, I wouldn't recommend going for Rcpp here unless there's a real performance bottleneck.
It is interesting, however, to remember that if performance is a must, Rcpp might be worth the effort.
Here is an approach that makes simple modifications to make it 5x faster.
loop.function2 <- function(df){
spl_df = split(df[, c(1L, 3L)], df[[2L]])
mat <- array(0L,
dim=c(length(spl_df), length(spl_df)),
dimnames = list(names(spl_df), names(spl_df)))
for (m in 1:(length(spl_df) - 1L)) {
xm = spl_df[[m]]
mShop = xm$Shop
for (n in ((1+m):length(spl_df))) {
xn = spl_df[[n]]
mm = match(mShop, xn$Shop)
inds = which(!is.na(mm))
mOrder = xm[inds, "Order"]
nOrder = xn[mm[inds], "Order"]
mat[m, n] <- sum(nOrder < mOrder)
mat[n, m] <- sum(mOrder < nOrder)
}
}
mat
}
There are 3 main concepts:
The original df[df$Fruit == fruits[m], ] lines were inefficient as you would be making the same comparison length(fruits)^2 times. Instead, we can use split(), which means we only scan the Fruit column once (see the short illustration after this list).
There was a lot of use of df$var which will extract the vector during each loop. Here, we place the assignment of xm outside of the inner loop and we try to minimize what we need to subset / extract.
I changed it to be closer to combn as we can re-use our match() condition by doing both sum(nOrder < mOrder) and then switching it to sum(mOrder < nOrder).
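For illustration, here is roughly what that split() produces on the example df (this is the structure loop.function2() iterates over):
spl_df <- split(df[, c(1L, 3L)], df[[2L]])
names(spl_df)
#> [1] "apple"  "orange" "pear"
spl_df$apple
#>    Shop Order
#> 1     A     1
#> 7     C     2
#> 9     D     2
#> 12    E     1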
Performance:
bench::mark(loop.function(df), loop.function2(df))
# A tibble: 2 x 13
## expression min median
## <bch:expr> <bch:tm> <bch:>
##1 loop.function(df) 3.57ms 4.34ms
##2 loop.function2(df) 677.2us 858.6us
My hunch is that for your larger dataset, @Waldi's data.table solution will be faster. But for smaller datasets, this should be pretty performant.
Finally, here's yet another Rcpp approach that seems to be slower than @Waldi's:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerMatrix loop_function_cpp(List x) {
int x_size = x.size();
IntegerMatrix ans(x_size, x_size);
for (int m = 0; m < x_size - 1; m++) {
DataFrame xm = x[m];
CharacterVector mShop = xm[0];
IntegerVector mOrder = xm[1];
int mrows = mShop.size();
for (int n = m + 1; n < x_size; n++) {
DataFrame xn = x[n];
CharacterVector nShop = xn[0];
IntegerVector nOrder = xn[1];
int nrows = nShop.size();
for (int i = 0; i < mrows; i++) {
for (int j = 0; j < nrows; j++) {
if (mShop[i] == nShop[j]) {
if (mOrder[i] > nOrder[j])
ans(m, n)++;
else if (mOrder[i] < nOrder[j])
ans(n, m)++;
// ties (picked at the same time) are not counted
break;
}
}
}
}
}
return(ans);
}
loop_wrapper = function(df) {
loop_function_cpp(split(df[, c(1L, 3L)], df[[2L]]))
}
loop_wrapper(df)
It doesn't seem possible to vectorize over the original data frame df. But you can transform it using reshape2::dcast() to have one line per shop:
require(reshape2)
df$Fruit <- as.character(df$Fruit)
by_shop <- dcast(df, Shop ~ Fruit, value.var = "Order")
# Shop apple orange pear
# 1 A 1 2 3
# 2 B NA 1 2
# 3 C 2 NA 1
# 4 D 2 3 1
# 5 E 1 1 1
..., then you can easily vectorize at least for each combination of [m, n]:
fruits <- unique(df$Fruit)
outer(fruits, fruits,
Vectorize(
function (m, n, by_shop) sum(by_shop[,m] > by_shop[,n], na.rm = TRUE),
c("m", "n")
),
by_shop)
# [,1] [,2] [,3]
# [1,] 0 0 2
# [2,] 2 0 1
# [3,] 1 2 0
This is probably the solution you desired to do with outer. A much faster solution would be true vectorization over all combinations of fruits [m, n], but I've been thinking about it and I don't see any way to do it. So I had to use the Vectorize function, which of course is much slower than true vectorization.
Benchmark comparison with your original function:
Unit: milliseconds
expr min lq mean median uq max neval
loop.function(df) 3.788794 3.926851 4.157606 4.002502 4.090898 9.529923 100
loop.function.TMS(df) 1.582858 1.625566 1.804140 1.670095 1.756671 8.569813 100
Function & benchmark code (also added the preservation of the dimnames):
require(reshape2)
loop.function.TMS <- function(df) {
df$Fruit <- as.character(df$Fruit)
by_shop <- dcast(df, Shop ~ Fruit, value.var = "Order")
fruits <- unique(df$Fruit)
o <- outer(fruits, fruits, Vectorize(function (m, n, by_shop) sum(by_shop[,m] > by_shop[,n], na.rm = TRUE), c("m", "n")), by_shop)
colnames(o) <- rownames(o) <- fruits
o
}
require(microbenchmark)
microbenchmark(loop.function(df), loop.function.TMS(df))
OK, here is a solution:
library(tidyverse)
# a dataframe with all fruit combinations
df_compare <- expand.grid(row_fruit = unique(df$Fruit)
, column_fruit = unique(df$Fruit)
, stringsAsFactors = FALSE)
df_compare %>%
left_join(df, by = c("row_fruit" = "Fruit")) %>%
left_join(df, by = c("column_fruit" = "Fruit")) %>%
filter(Shop.x == Shop.y &
Order.x < Order.y) %>%
group_by(row_fruit, column_fruit) %>%
summarise(obs = n()) %>%
pivot_wider(names_from = row_fruit, values_from = obs) %>%
arrange(column_fruit) %>%
mutate_if(is.numeric, function(x) replace_na(x, 0)) %>%
column_to_rownames("column_fruit") %>%
as.matrix()
apple orange pear
apple 0 0 2
orange 2 0 1
pear 1 2 0
If you don't know what is going on in the second code part (df_compare %>% ...), read the "pipe" (%>%) as 'then'. Run the code from df_compare to just before any of the pipes to see the intermediate results.
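For example, to see the first intermediate result you could stop the pipeline right after the first join:
# one row per fruit pair and per shop where the row_fruit was bought
df_compare %>%
  left_join(df, by = c("row_fruit" = "Fruit"))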
I want to find how many combinations of bases are found in a genome sequence. I mean, for 2-element combinations: AA, AT, AG, AC, ... (16 such combinations); or for 3-element combinations: ATG, ACG, ... (64 such combinations). I know how to do that with a package, and I will write it down here, but I want to create my own code to perform this.
The seqinr package is perfect for this job. This is the code that I used:
install.packages('seqinr')
library(seqinr)
m = read.fasta(file='sequence.fasta')
mseq = m[[1]]
count(mseq, 2) # counts of every 2-element combination found in the sequence
count(mseq, 3) # counts of every 3-element combination found in the sequence
This is a slow way to do it. I am certain it is faster in the Bioconductor package.
# some practice data
mseq = paste(sample(c("A", "C", "G", "T"), 1000, rep=T), collapse="")
# define a function called count
count = function(mseq, n){
# split the sequence into every possible sub sequence of length n
x = sapply(1:(nchar(mseq) - n + 1), function(i) substr(mseq, i, i+n-1))
# how many unique sub-sequences of length n are there?
length(table(x))
}
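Used on the practice data above, for example:
count(mseq, 2) # number of distinct 2-letter sub-sequences present
count(mseq, 3) # number of distinct 3-letter sub-sequences present
(If you want the frequency of each combination rather than how many distinct ones occur, return table(x) instead of length(table(x)).)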
Actually just checked and this is pretty much how they did it:
function (seq, wordsize, start = 0, by = 1, freq = FALSE, alphabet = s2c("acgt"),
frame = start)
{
if (!missing(frame))
start = frame
istarts <- seq(from = 1 + start, to = length(seq), by = by)
oligos <- seq[istarts]
oligos.levels <- levels(as.factor(words(wordsize, alphabet = alphabet)))
if (wordsize >= 2) {
for (i in 2:wordsize) {
oligos <- paste(oligos, seq[istarts + i - 1], sep = "")
}
}
counts <- table(factor(oligos, levels = oligos.levels))
if (freq == TRUE)
counts <- counts/sum(counts)
return(counts)
}
If you want to find the code for a function use getAnywhere()
getAnywhere(count)
The simple thing to do is just something like this:
# Generate a test sequence
set.seed(1234)
testSeq <- paste(sample(LETTERS[1:3], 100, replace = T), collapse = "")
# Split string into chunks of size 2 and then count occurrences
testBigram <- substring(testSeq, seq(1, nchar(testSeq), 2), seq(2, nchar(testSeq), 2))
table(testBigram)
AA AB AC BA BB BC CA CB CC
10 10 14 3 3 2 2 5 1
Here is a way using a "function factory" (https://adv-r.hadley.nz/function-factories.html).
The 2-element and 3-element combinations are n-grams of size 2 and 3. So we make this n-gram function factory.
# Generate a function to create a function
ngram <- function(size) {
function(myvector) {
substring(myvector, seq(1, nchar(myvector), size), seq(size, nchar(myvector), size))
}
}
# Assign the functions names (optional)
bigram <- ngram(2)
trigram <- ngram(3)
# 2 element combinations
table(bigram(testSeq))
AA AB AC BA BB BC CA CB CC
10 10 14 3 3 2 2 5 1
# count of 2 element combinations
length(unique(bigram(testSeq)))
[1] 9
# counting function
count <- function(mseq, n) length(unique(ngram(n)(mseq)))
count(testSeq, 2)
[1] 9
# and if we wanted to do with with 3 element combinations
table(trigram(testSeq))
Suppose I have the following data.table, for arbitrary J:
set.seed(1)
J=2
n = 100
BB = data.table(r=1:n)
BB[, (paste0("a",seq(J))) := rnorm(n,1,7) ]
So the output is...
> BB
r a1 a2
1: 1 -3.38517668 -3.38517668
2: 2 2.28550327 2.28550327
3: 3 -4.84940029 -4.84940029
...
How come the two columns are identical and not different rnorm draws?
The right-hand side rnorm(n, 1, 7) is evaluated only once, so the same vector is recycled into every column named in paste0("a", seq(J)). To get a fresh draw per column, you can use the super-fast for/set combination:
for(i in seq(J))
set(x = BB, j = paste0('a',i), value = rnorm(n, 1, 7))
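Alternatively, a small sketch of the same idea without the explicit loop: give := a list with one independent draw per column.
# each list element is a fresh rnorm() draw, so the columns differ
BB[, (paste0("a", seq(J))) := replicate(J, rnorm(n, 1, 7), simplify = FALSE)]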
I want to multiply each list element in, say, l1 with b1's col1 and store the result in a separate column. Basically this is what I want to do:
res = 0
for item in a
for col_item in b
res = res + item * col_item
E.g.
l1 = list(c('17-Nov-14', 10), c('17-Apr-15', 20))
b1 = data.frame(col1 = c(10, 20), res=c(0))
result = data.frame(col1= c(10, 20), res = c(2*10+4*10+3*10, 2*20+4*20+3*20))
I have working code, but it can be improved.
test <- function(param, df) {
df$res <- as.integer(param[2]) * df$col1
df
}
t <- lapply(l1, test, b1)
result <- cbind(t[[1]]$col1, t[[1]]$res + t[[2]]$res + t[[3]]$res)
We can simplify the computation with a little algebra. If we factor out the element of b1$col1, then we can precompute the sum of the list and perform a vectorized multiplication against it:
b1$res <- sum(unlist(l1))*b1$col1;
b1;
## col1 res
## 1 10 90
## 2 20 180
For your new problem definition, we need to extract the required element out of each list component vector:
b1$res <- sum(as.integer(sapply(l1,`[`,2L)))*b1$col1;
b1;
## col1 res
## 1 10 300
## 2 20 600
If you are looking for a method to reduce your list after lapply, you can use the Reduce function:
Reduce(function(df1, df2) data.frame(col = df1[1], res = df1[2] + df2[2]), myList)
# col1 res
# 1 10 90
# 2 20 180
Suppose myList <- lapply(...).
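Putting it together (a minimal sketch, assuming l1, b1 and the test() helper from the question):
myList <- lapply(l1, test, b1)
Reduce(function(df1, df2) data.frame(col = df1[1], res = df1[2] + df2[2]), myList)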