I have the following function:
func <- function(scores, labels, thresholds) {
  labels <- if (is.data.frame(labels)) labels else data.frame(labels)
  sapply(thresholds, function(t) {
    sapply(labels, function(lbl) sum(lbl[which(scores >= t)]))
  })
}
I also have the following that I'll pass into func.
> scores
[1] 0.187 0.975 0.566 0.793 0.524 0.481 0.005 0.756 0.062 0.124
> thresholds
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
> var1
[1] 1 1 0 0 0 1 0 1 1 1
> df
var1 var2
1 1 0
2 1 1
3 0 0
4 0 0
5 0 0
6 1 1
7 0 1
8 1 1
9 1 1
10 1 0
Here are two different calls to func, one with labels as a vector and the other with labels as a data.frame:
> func(scores, var1, thresholds)
labels labels labels labels labels labels labels labels labels labels labels
6 5 3 3 3 2 2 2 1 1 0
> func(scores, df, thresholds)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
var1 6 5 3 3 3 2 2 2 1 1 0
var2 5 3 3 3 3 2 2 2 1 1 0
Why does "labels" get applied as a colname in the vector version, and "var1" and "var2" get applied as a rowname in the data.frame version?
What I'm looking for is the vector version to be more like:
> func(scores, var1, thresholds)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
labels 6 5 3 3 3 2 2 2 1 1 0
To create the variables above:
scores <- sample(seq(0, 1, 0.001), 10, replace = T)
thresholds <- seq(0, 1, 0.1)
var1 <- sample(c(0, 1), 10, replace = T)
var2 <- sample(c(0, 1), 10, replace = T)
df <- data.frame(var1, var2)
Try switching the order of the nested sapplys:
func <- function(scores, labels, thresholds) {
  labels <- if (is.data.frame(labels)) labels else data.frame(labels)
  t(sapply(labels, function(lbl) {
    sapply(thresholds, function(t) sum(lbl[which(scores >= t)]))
  }))
}
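With the example data from the question, this returns the desired layout:
func(scores, var1, thresholds)
#        [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
# labels    6    5    3    3    3    2    2    2    1     1     0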
From ?sapply:
‘sapply’ is a user-friendly version and wrapper of ‘lapply’ by
default returning a vector, matrix or, if ‘simplify = "array"’, an
array if appropriate, by applying ‘simplify2array()’.
To understand what's going on in your original function, it's perhaps useful to think about each sapply in turn.
The inner sapply(labels, ...) creates a named vector of length k (where k is the number of columns in labels -- so k is 1 in the vector case, and 2 in the dataframe example), where the names of the vector elements are given by the column names (labels in the vector case, and var1/var2 in the dataframe example).
The outer sapply(thresholds, ...) runs the inner sapply 11 times, each time with a different value of t. So in the vector case, you'll end up with 11 vectors of length 1 where the name of the one and only element in each vector is labels, which sapply returns ("simplifies") as one vector of length 11.
By switching the order of the sapplys, the inner sapply now returns an unnamed vector of length 11. The outer sapply then does this k times. In the vector case, k is 1, and the name of the vector returned is labels. In the dataframe example, k is 2, and the names of the 2 vectors returned are var1 and var2.
(It might also be a useful exercise to name the elements in the thresholds vector; e.g. thresholds <- setNames(seq(0, 1, 0.1), LETTERS[1:11]) and re-run func to see what happens.)
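For instance, with the revised func above and the question's data, the threshold names become the column names (a quick sketch of that exercise; thresholds2 is used here just to leave the original vector untouched):
thresholds2 <- setNames(seq(0, 1, 0.1), LETTERS[1:11])
func(scores, var1, thresholds2)
#        A B C D E F G H I J K
# labels 6 5 3 3 3 2 2 2 1 1 0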
Note: @weihuang-wong's answer is great, and the solution is in some ways better than this one. But I already had most of this written before that answer was posted, so I decided to post it anyway.
You get the names you do because those are the names of the things you iterate over. But why do you get a named vector in the first case and a matrix with rownames in the second case? Here is a simpler case that makes it easier to see.
sapply(1, function(x) sapply(c(a = 1), function(y) y))
# a
# 1
sapply(1, function(x) sapply(c(a = 1, b = 2), function(y) y))
# [,1]
# a 1
# b 2
OK, so what is happening here? Let's break it down so we can see.
sapply(c(a = 1), function(y) y)
returns a named length-one vector.
sapply(c(a = 1, b = 2), function(y) y)
returns a named length-two vector.
Now it's the job of the outer sapply to combine those results. When it sees that the inner sapply returns a length-one vector it simplifies it to a named vector. That simplification doesn't work when the return value is of length > 1, so sapply simplifies to a matrix instead.
So if we want consistency we need sapply to return a matrix, even in the length-one case. How do we make sapply consistent? It's surprisingly difficult. In the end I would just convert it to a matrix after the fact.
matrix(sapply(1, function(x) sapply(c(a = 1), function(y) y)), dimnames = list("a"))
# [,1]
# a 1
Now that we understand what's happening we can apply what we've learned to the original problem.
func <- function(scores, labels, thresholds) {
  labels <- if (is.data.frame(labels)) labels else data.frame(labels)
  r <- sapply(thresholds, function(t) {
    sapply(labels, function(lbl) sum(lbl[which(scores >= t)]))
  })
  if (!is.matrix(r)) r <- matrix(r, nrow = 1, dimnames = list(names(labels)))
  r
}
func(scores, df, thresholds)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
# var1 6 5 3 3 3 2 2 2 1 1 0
# var2 5 3 3 3 3 2 2 2 1 1 0
func(scores, var1, thresholds)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
# labels 6 5 3 3 3 2 2 2 1 1 0
A bag contains 5 billiard balls numbered 1, 2, 3, 4, 5. A random
sample of size n = 3 is drawn without replacement from the bag.
What is the probability mass function of the sample median?
Here is what I have:
library(listviewer)
sampleSpaceAndMedian = list()
# the random samples (1,2,3), (1,3,2), (2,1,3),
# (2,3,1), (3,1,2), and (3,2,1) have the same median,
# therefore belong to the same equivalence class
for (a in 1:3) {
  for (b in 2:4) {
    for (c in 3:5) {
      # a unique random sample of size 3 (ignores the order)
      if (b > a && c > b) {
        tString = paste(toString(a), toString(b), toString(c),
                        toString(median(c(a, b, c))), sep = " ")
        sampleSpaceAndMedian <- append(sampleSpaceAndMedian, tString)
      }
    }
  }
}
# the random sample is in the first three columns
# median is the fourth column
jsonedit(sampleSpaceAndMedian)
Can you please help me to get the PMF? Thanks.
You can use combn to get all the combinations of a vector and apply a function to it:
combn(1:5, 3)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#[1,] 1 1 1 1 1 1 2 2 2 3
#[2,] 2 2 2 3 3 4 3 3 4 4
#[3,] 3 4 5 4 5 5 4 5 5 5
To get the distribution of the median you can use the following:
prop.table(table(combn(1:5, 3, median)))
#> 2 3 4
#>0.3 0.4 0.3
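As a quick sanity check (a simulation sketch; the seed is only for reproducibility), repeatedly drawing size-3 samples without replacement and tabulating their medians should approach the same PMF:
set.seed(1)
prop.table(table(replicate(1e5, median(sample(1:5, 3)))))
# comes out near 0.3, 0.4, 0.3 for medians 2, 3, 4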
Given n=2, I want the set of values (1, 1), (1, 2), and (2, 2). For n=3, I want (1, 1), (1, 2), (1, 3), (2, 2), (2, 3), and (3, 3). And so on for n=4, 5, etc.
I'd like to do this entirely within the base libraries. Recently, I've taken to using
gen <- function(n)
{
  x <- seq_len(n)
  cbind(combn(x, 2), rbind(x, x))
}
which gives some workable but hacky output. We get the below for n=4.
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x 1 1 1 2 2 3 1 2 3 4
x 2 3 4 3 4 4 1 2 3 4
Is there a better way? Between expand.grid, outer, combn, and R's many other ways of generating vectors, I was hoping to be able to do this with just one combination-producing function rather than having to bind together the output of combn with something else. I could write the obvious for loop, but that seems like a waste of R's powers.
Starting with expand.grid and then subsetting is an option that many answers so far have taken, but I find the idea of generating twice the set that I need to be a poor use of memory. This probably also rules out outer.
Here are some ways to do this.
1) upper.tri
n <- 4
d <- diag(n)
u <- upper.tri(d, diag = TRUE)
rbind(row(d)[u], col(d)[u])
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 1 2 1 2 3 1 2 3 4
## [2,] 1 2 2 3 3 3 4 4 4 4
The last line of code could alternatively be written as:
t(sapply(c(row, col), function(f) f(d)[u]))
2) combn
n <- 4
combn(n+1, 2, function(x) if (x[2] == n+1) x[1] else x)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 1 1 1 2 2 2 3 3 4
## [2,] 2 3 4 1 3 4 2 4 3 4
A variation of this is:
co <- combn(n+1, 2)
co[2, ] <- ifelse(co[2, ] == n+1, co[1, ], co[2, ])
co
3) list comprehension
library(listcompr)
t(gen.matrix(c(i, j), i = 1:n, j = i:n))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 1 2 1 2 3 1 2 3 4
## [2,] 1 2 2 3 3 3 4 4 4 4
Performance
library(microbenchmark)
library(listcompr)
n <- 25
microbenchmark(
  upper.tri = {
    d <- diag(n)
    u <- upper.tri(d, diag = TRUE)
    rbind(row(d)[u], col(d)[u]) },
  upper.tri2 = {
    d <- diag(n)
    u <- upper.tri(d, diag = TRUE)
    t(sapply(c(row, col), function(f) f(d)[u])) },
  combn = combn(n+1, 2, function(x) if (x[2] == n+1) x[1] else x),
  combn2 = {
    co <- combn(n+1, 2)
    co[2, ] <- ifelse(co[2, ] == n+1, co[1, ], co[2, ])
    co
  },
  listcompr = t(gen.matrix(c(i, j), i = 1:n, j = i:n)))
giving:
Unit: microseconds
expr min lq mean median uq max neval cld
upper.tri 41.8 52.00 65.761 61.30 77.15 132.6 100 a
upper.tri2 110.8 128.95 187.372 154.85 178.60 3024.6 100 a
combn 1342.8 1392.25 1514.038 1432.90 1473.65 7034.7 100 a
combn2 687.5 725.50 780.686 765.85 812.65 1129.4 100 a
listcompr 97889.0 100321.75 106442.425 101347.95 105826.55 307089.4 100 b
Update
Here is another version, inspired by @G. Grothendieck:
gen <- function(n) t(which(upper.tri(diag(n), diag = TRUE), arr.ind = TRUE))
or
gen <- function(n) {
  unname(do.call(
    cbind,
    sapply(
      seq(n),
      function(k) rbind(k, k:n)
    )
  ))
}
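A quick check of the second version for n = 4 (the which()-based version returns the same pairs, but ordered column-by-column of the triangle and with row/col dimnames; the version above groups pairs by their first element):
gen(4)
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,]    1    1    1    1    2    2    2    3    3     4
# [2,]    1    2    3    4    2    3    4    3    4     4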
You can try expand.grid + subset like below
gen <- function(n) {
  unname(t(
    subset(
      expand.grid(rep(list(seq(n)), 2)),
      Var1 <= Var2
    )
  ))
}
and you will see
> gen(2)
[,1] [,2] [,3]
[1,] 1 1 2
[2,] 1 2 2
> gen(3)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 1 2 1 2 3
[2,] 1 2 2 3 3 3
> gen(4)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 1 2 1 2 3 1 2 3 4
[2,] 1 2 2 3 3 3 4 4 4 4
Here's a slightly modified version of @G. Grothendieck's upper.tri, and a comparison of both to @rawr's method in the comments:
upper.tri3 <- function(n) {
  mrow <- row(diag(n))
  mcol <- t(mrow)
  i <- mrow <= mcol
  rbind(mrow[i], mcol[i])
}
library(bench)
n <- 1e4
mark(
  upper.tri = {
    d <- diag(n)
    u <- upper.tri(d, diag = TRUE)
    rbind(row(d)[u], col(d)[u]) },
  upper.tri3 = upper.tri3(n),
  rawr = {
    s <- 1:n
    rbind(sequence(s), rep(s, s))
  }
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 upper.tri 3.96s 3.96s 0.252 4.47GB 0.757
#> 2 upper.tri3 2.46s 2.46s 0.406 3.73GB 1.62
#> 3 rawr 372.25ms 429.55ms 2.33 763.06MB 1.16
Created on 2021-10-18 by the reprex package (v2.0.1)
You can use expand.grid. I see it as the most intuitive and easiest-to-read solution.
simple_solution <- function(x) {
  df <- expand.grid(1:x, 1:x)
  return(df[df$Var1 <= df$Var2, ])
}
> simple_solution(4)
Var1 Var2
1 1 1
5 1 2
6 2 2
9 1 3
10 2 3
11 3 3
13 1 4
14 2 4
15 3 4
16 4 4
The function should count the numbers in a vector and how often each occurs (input: vector; output: matrix). (I know how I could do this more easily, but I want to understand the error.)
The problem is that the function ignores 1s in the vector.
count_elements <- function(x) {
  y <- sort(x)
  m <- matrix(, nrow = 2, ncol = length(unique(x)))
  a <- 1
  for (i in 1:length(sort(x))) {
    if (is.element(y[i], m)) {
    } else {
      (m[1, a] <- y[i]) & (m[2, a] <- sum(y == y[i])) & (a <- a+1) }
  }
  m
}
Example inputs and outputs:
Without a 1 in the vector
x <- c(2:10, 2, 3:7, -1)
count_elements(x)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#[1,] -1 2 3 4 5 6 7 8 9 10
#[2,] 1 2 2 2 2 2 2 1 1 1
With a 1 in the vector
x <- c(0:10, 2, 3:7, -1)
count_elements(x)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
#[1,] -1 0 2 3 4 5 6 7 8 9 10 NA
#[2,] 1 1 2 2 2 2 2 2 1 1 1 NA
Thank you and sorry for my English :)
The issue is with your logical if test. You ask "is y[i] an element of m?" This checks both the first and second row of m. So using your second example x, when you've gone through the loop twice, m looks like this:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] -1 0 NA NA NA NA NA NA NA NA NA NA
[2,] 1 1 NA NA NA NA NA NA NA NA NA NA
and on the next iteration of the loop, the if is TRUE because there is a 1 in the second row. Note that you just happened to find the error with a 1, but the same error could occur with any positive number. For example, if x = c(1, 1, 2, 3) you would get the error with the 2 because you've found two 1s by the time you get to the 2 in the loop.
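To see that concretely, here is a minimal sketch of the state m is in just before the loop reaches the 2 in x = c(1, 1, 2, 3):
m <- matrix(NA, nrow = 2, ncol = 3)
m[, 1] <- c(1, 2)  # value 1 was seen, with a count of 2
is.element(2, m)   # TRUE -- the count 2 in row 2 shadows the value 2
# [1] TRUE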
I think the simplest fix is to change the if statement to this:
if(is.element(y[i], m[1,]))
A few other coding thoughts, for what they're worth (a cleaned-up version putting these together follows below):
length(sort(x)) is the same as length(x). Not sure why you need the "sort"
Nothing happens if the if statement evaluates to TRUE, so it seems like there's no need for any if/else business
I've never seen this parentheses-ampersand coding style, e.g.: () & () & (). Newlines or semi-colons are the typical way to separate commands in R
And even though you mention that you're intentionally doing this the hard way, it's worth noting that table(x) will do exactly the same thing, in a lot less time and with fewer errors :)
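Putting the fix and those thoughts together, a cleaned-up version might look like the sketch below (count_elements_fixed is just an illustrative name):
count_elements_fixed <- function(x) {
  y <- sort(x)
  m <- matrix(NA, nrow = 2, ncol = length(unique(x)))
  a <- 1
  for (i in 1:length(y)) {
    if (!is.element(y[i], m[1, ])) {  # check only the values row, not the counts row
      m[1, a] <- y[i]            # the value itself
      m[2, a] <- sum(y == y[i])  # how many times it occurs
      a <- a + 1
    }
  }
  m
}

x <- c(0:10, 2, 3:7, -1)
count_elements_fixed(x)
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
# [1,]   -1    0    1    2    3    4    5    6    7     8     9    10
# [2,]    1    1    1    2    2    2    2    2    2     1     1     1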
You're doing your assignments inside a logical expression. The & operator means AND; it is not generally used to chain expressions together the way you are doing.
You can simplify your code by indexing only over the unique elements of x. This eliminates a and the need to check whether or not you've included an element, which is where your problem is coming from.
count_elements_new <- function(x) {
  y <- sort(x)
  u <- unique(y)
  m <- matrix(, nrow = 2, ncol = length(u))
  for (i in 1:length(u)) {
    m[1, i] <- u[i]
    m[2, i] <- sum(y == u[i])
  }
  m
}
(This eliminates the if(), but you could also use a ! in your if() statement and delete the else, like this: if(!is.element(y[i], m[1, ])))
Given a vector, say c(1, 2, 3), I'd like to generate samples of this vector sorted according to probabilities calculated from its values. The process is illustrated below - is there an R function that does this?
A simple example: use probabilities calculated as each value divided by the vector sum, c(1/6, 2/6, 3/6), to determine the first element of the sorted vector. In this case value 3 has probability 3/6 (50%) of being the first element, value 2 has probability 2/6 (33.3%), and value 1 has probability 1/6 (16.6%).
After the first element is selected, the process continues similarly for the remaining elements of the vector until a 'statistically' ordered vector is produced.
As the number of 'statistically' ordered samples grows, I'd expect 3 to be first 50% of the time, etc. A mocked-up example with sample size 6:
c(3, 2, 1)
c(2, 3, 1)
c(3, 1, 2)
c(3, 2, 1)
c(1, 3, 2)
c(2, 1, 3)
sample(1:3, prob = 1:3, replace = FALSE)
testing it:
set.seed(42)
res <- replicate(1e5, sample(1:3, prob = 1:3, replace = FALSE))
prop.table(table(res[1,]))
# 1 2 3
#0.16620 0.33324 0.50056
prop.table(table(res[2,]))
# 1 2 3
#0.25026 0.39827 0.35147
prop.table(table(res[3,]))
# 1 2 3
#0.58354 0.26849 0.14797
Try
N <- 100
X <- 3
replicate(N, sample(X, prob=prop.table(1:X)))
Output
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
[1,] 3 3 3 3 3 1 3 3 3 3 2 3 2 2
[2,] 2 1 2 2 1 3 1 1 1 1 3 2 1 3
[3,] 1 2 1 1 2 2 2 2 2 2 1 1 3 1
# etc
You can transpose the output if you prefer
t(replicate(N, sample(X, prob=prop.table(1:X))))
I'm looking for a general approach to combine two matrices so that the columns from the two initial matrices alternate in the new matrix:
col1m1...col1m2...col2m1...col2m2...col3m1...col3m2......
for example:
matrix.odd <- matrix(c(rep(1,3),rep(3,3),rep(5,3)),nrow=3,ncol=3)
matrix.even <- matrix(c(rep(2,3),rep(4,3),rep(6,3)),nrow=3,ncol=3)
# would look like
matrix.combined <- matrix(c(rep(1,3),rep(2,3),rep(3,3),rep(4,3),rep(5,3),rep(6,3)),
                          nrow=3,ncol=6)
I'm looking for a general approach because I will have matrix combinations with more than just 3 columns. I've tried some for loops and some if statements but it isn't really coming together for me. Searches on combining matrices with shuffle and with alternation have not proven fruitful either. Any thoughts?
Something like this should do:
m <- cbind(matrix.odd, matrix.even) # combine
m <- m[, c(matrix(1:ncol(m), nrow = 2, byrow = T))] # then reorder
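The reordering index works because writing 1:ncol(m) into a two-row matrix by row and then flattening it column-wise interleaves the first and second halves:
c(matrix(1:6, nrow = 2, byrow = TRUE))
# [1] 1 4 2 5 3 6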
Another option for fun:
matrix(rbind(matrix.odd, matrix.even), nrow = nrow(matrix.odd))
And to play the many matrices game:
weave = function(...) {
  l = list(...)
  matrix(do.call(rbind, l), nrow = nrow(l[[1]]))
}
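A quick check against the example matrices from the question:
identical(weave(matrix.odd, matrix.even), matrix.combined)
# [1] TRUE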
rows.combined <- nrow(matrix.odd)
cols.combined <- ncol(matrix.odd) + ncol(matrix.even)
matrix.combined <- matrix(NA, nrow=rows.combined, ncol=cols.combined)
matrix.combined[, seq(1, cols.combined, 2)] <- matrix.odd
matrix.combined[, seq(2, cols.combined, 2)] <- matrix.even
alternate.cols <- function(m1, m2) {
  cbind(m1, m2)[, order(c(seq(ncol(m1)), seq(ncol(m2))))]
}
identical(matrix.combined, alternate.cols(matrix.odd, matrix.even))
# [1] TRUE
which also does the right thing (subjective) if m1 and m2 have a different number of columns:
alternate.cols(matrix.odd, matrix.even[, -3])
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 2 3 4 5
# [2,] 1 2 3 4 5
# [3,] 1 2 3 4 5
It is easy to generalize to any number of matrices:
alternate.cols <- function(...) {
  l <- list(...)
  m <- do.call(cbind, l)
  i <- order(sequence(sapply(l, ncol)))
  m[, i]
}
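For example, interleaving three matrices (here a second copy of matrix.odd is thrown in):
alternate.cols(matrix.odd, matrix.even, matrix.odd)
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,]    1    2    1    3    4    3    5    6    5
# [2,]    1    2    1    3    4    3    5    6    5
# [3,]    1    2    1    3    4    3    5    6    5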
You could turn them into a 3D array and then transpose...
arr <- array( c(m1,m2) , dim = c(dim(m1),2) )
matrix( aperm( arr , c(1,3,2) ) , nrow(m1) )
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 2 3 4 5 6
[2,] 1 2 3 4 5 6
[3,] 1 2 3 4 5 6
And as a function, generalisable to many matrices...
bindR <- function(...){
  args <- list(...)
  dims <- c( dim(args[[1]]) , length(args) )
  arr <- array( unlist( args ) , dim = dims )
  matrix( aperm( arr , c(1,3,2) ) , dims[1] )
}
bindR(m1,m2,m1,m2)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
#[1,] 1 2 1 2 3 4 3 4 5 6 5 6
#[2,] 1 2 1 2 3 4 3 4 5 6 5 6
#[3,] 1 2 1 2 3 4 3 4 5 6 5 6
There is likely a more succinct way to do this, and if the matrices are large you may need a more efficient method.
# Test data
(X <- matrix(1:16, nrow=4, ncol=4))
(Y <- matrix(-16:-1, nrow=4, ncol=4))
# Set indices for the new matrix
X.idx <- seq(1, ncol(X)*2, by=2)
Y.idx <- seq(2, ncol(Y)*2+1, by=2)
# Column bind the matrices and name columns according to the indices
XY <- cbind(X, Y)
colnames(XY) <- c(X.idx, Y.idx)
# Now order the columns
XY[, order(as.numeric(colnames(XY)))]
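With the test data above, the final line interleaves the columns (the retained column names are the indices used for ordering):
#      1   2 3   4  5  6  7  8
# [1,] 1 -16 5 -12  9 -8 13 -4
# [2,] 2 -15 6 -11 10 -7 14 -3
# [3,] 3 -14 7 -10 11 -6 15 -2
# [4,] 4 -13 8  -9 12 -5 16 -1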