Given a vector, say c(1, 2, 3), I'd like to generate samples of this vector sorted according to probabilities calculated from its values. The process is illustrated below - is there an R function that does this?
As a simple example, use probabilities calculated as each value divided by the vector sum, c(1/6, 2/6, 3/6), to determine the first value in the sorted vector. In this case the value 3 has probability 3/6 (50%) of being the first element, the value 2 has probability 2/6 (33.3%), and the value 1 has probability 1/6 (16.7%).
After the first element is selected, the process continues similarly for the remaining elements of the vector until a 'statistically' ordered vector is produced.
As the number of 'statistically' ordered samples grows, I'd expect 3 to be first 50% of the time, and so on. A mocked-up example with a sample of size 6:
c(3, 2, 1)
c(2, 3, 1)
c(3, 1, 2)
c(3, 2, 1)
c(1, 3, 2)
c(2, 1, 3)
sample(1:3, prob = 1:3, replace = FALSE)
Testing it:
set.seed(42)
res <- replicate(1e5, sample(1:3, prob = 1:3, replace = FALSE))
prop.table(table(res[1,]))
#      1       2       3
#0.16620 0.33324 0.50056
prop.table(table(res[2,]))
#      1       2       3
#0.25026 0.39827 0.35147
prop.table(table(res[3,]))
#      1       2       3
#0.58354 0.26849 0.14797
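For comparison, the step-by-step procedure described in the question can also be written out by hand. This is only a sketch: `weighted_order` is a hypothetical helper name, and it assumes the weights default to the values themselves. It draws one element at a time with probability proportional to the remaining weights, so its first-position frequencies should approach 1/6, 2/6 and 3/6, the same as `sample()` with `prob` and `replace = FALSE`.

```r
# Hypothetical helper (not part of base R): draw elements one at a time,
# each with probability proportional to its weight among those remaining.
weighted_order <- function(x, w = x) {
  out <- numeric(0)
  while (length(x) > 0) {
    i <- sample.int(length(x), 1, prob = w)  # prob is normalised internally
    out <- c(out, x[i])
    x <- x[-i]   # remove the chosen element ...
    w <- w[-i]   # ... and its weight before the next draw
  }
  out
}

set.seed(1)
first <- replicate(1e4, weighted_order(c(1, 2, 3))[1])
prop.table(table(first))
```

In practice the one-line `sample()` call is simpler and faster. The loop is only useful for confirming that it implements the intended process.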
Try
N <- 100
X <- 3
replicate(N, sample(X, prob=prop.table(1:X)))
Output
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
[1,]    3    3    3    3    3    1    3    3    3     3     2     3     2     2
[2,]    2    1    2    2    1    3    1    1    1     1     3     2     1     3
[3,]    1    2    1    1    2    2    2    2    2     2     1     1     3     1
# etc
You can transpose the output if you prefer
t(replicate(N, sample(X, prob=prop.table(1:X))))
A bag contains 5 billiard balls numbered 1, 2, 3, 4, 5. A random
sample of size n = 3 is drawn without replacement from the bag.
What is the probability mass function of the sample median?
Here is what I have:
library(listviewer)
sampleSpaceAndMedian = list()
# the random samples (1,2,3), (1,3,2), (2,1,3),
# (2,3,1), (3,1,2), and (3,2,1) have the same median
# therefore, belong to the same equivalence class
for (a in 1:3){
  for (b in 2:4){
    for (c in 3:5){
      # a unique random sample of size 3 (ignores the order)
      if (b > a && c > b){
        tString = paste(toString(a), toString(b), toString(c), toString(median(c(a,b,c))), sep = " ")
        sampleSpaceAndMedian <- append(sampleSpaceAndMedian, tString)
      }
    }
  }
}
# the random sample is in the first three columns
# median is the fourth column
jsonedit( sampleSpaceAndMedian )
Can you please help me to get the PMF? Thanks.
You can use combn to get all the combinations of a vector and apply a function to it:
combn(1:5, 3)
#     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#[1,]    1    1    1    1    1    1    2    2    2     3
#[2,]    2    2    2    3    3    4    3    3    4     4
#[3,]    3    4    5    4    5    5    4    5    5     5
To get the distribution of the median you can use the following:
prop.table(table(combn(1:5, 3, median)))
#>   2   3   4
#> 0.3 0.4 0.3
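The enumerated result can be cross-checked against a counting argument: the median equals m exactly when one of the other two balls is below m and one is above it, which gives (m - 1)(5 - m) favourable samples out of choose(5, 3) = 10. A short sketch of that check (the variable names `pmf` and `enumerated` are just for illustration):

```r
# Exact PMF of the median of 3 balls drawn without replacement from 1:5.
# The median is m when one ball is below m and one ball is above m.
m <- 2:4
pmf <- setNames((m - 1) * (5 - m) / choose(5, 3), m)
pmf

# Cross-check against brute-force enumeration of all 10 samples
enumerated <- prop.table(table(combn(1:5, 3, median)))
all.equal(unname(pmf), as.vector(enumerated))
```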
I am attempting to generate a matrix where each column represents a sequence of factors in R. The factors can take the values 1, 2, 3 or 4. Each sequence has 13 elements, giving a total of 4^13 potential sequences. However, only a specific subset of these potential sequences is considered valid. The logic is as follows:
A sequence can start at any factor
If a sequence starts at 4, its second element can be less than or equal to 4
Once an element drops below 4, subsequent entries must be weakly increasing
If a sequence starts with 1, 2, or 3 it must be weakly increasing
So for example, the sequence (1,2,3,3,3.....3) is valid. The sequence (4,4,1,1,2,4,4....4) is also valid. The sequence (4,1,2,3,1,1....1) is not, since it is not weakly increasing after the first drop from 4 to 1.
At the moment, I have code to combine the 2&3 factors and generate this matrix. The process involves generating a matrix of all possible sequences and then filtering down based on the above logic. This is highly inefficient, but I can post it if necessary. This process also cannot be generalized to a four factor model, as the 4^13 potential sequences overwhelm my machine.
If any of you can offer insight into how I might generate these valid sequences, it would be greatly appreciated. Thank you.
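The rules above can be summarised as "an optional run of leading 4s, followed by a weakly increasing sequence". A minimal sketch of a validity check under that reading (the function name `is_valid_seq` is hypothetical, and it assumes a sequence cannot drop again once it has climbed back to 4):

```r
# Hypothetical checker: strip the initial run of 4s, then require the
# remainder to be weakly increasing.
is_valid_seq <- function(x) {
  lead <- cumprod(x == 4) == 1   # TRUE only for the unbroken run of leading 4s
  rest <- x[!lead]
  all(diff(rest) >= 0)           # vacuously TRUE when rest has 0 or 1 elements
}

# The three examples from the question
is_valid_seq(c(1, 2, rep(3, 11)))            # valid
is_valid_seq(c(4, 4, 1, 1, 2, rep(4, 8)))    # valid
is_valid_seq(c(4, 1, 2, 3, rep(1, 9)))       # not valid
```

A checker like this is handy for testing any generator on short sequences, where enumerating all 4^n candidates is still cheap.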
I am assuming that once a gradually increasing vector reaches 4, it cannot jump back down again to a lower value the way that it can if 4 is the first number (if it can, the code is actually easier).
The following function generates compatible sequences, essentially using switch to implement a Markov chain
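generate_seq <- function(n)
{
  x <- numeric(n)
  x[1] <- sample(4, 1)   # start at any of 1:4
  # becomes TRUE once the sequence is back at 4 after containing a smaller value
  had_a_four <- FALSE
  for(i in seq_len(n - 1)) {
    if(!had_a_four)
    {
      # from 1, 2 or 3: stay (weight 3) or step up by one (weight 1);
      # from an unbroken run of 4s: move to any of 1:4
      x[i + 1] <- switch(x[i], sample(1:2, 1, prob = c(3, 1)),
                               sample(2:3, 1, prob = c(3, 1)),
                               sample(3:4, 1, prob = c(3, 1)),
                               sample(4, 1))
    }
    else
    {
      # after climbing back to 4 the sequence must stay at 4
      x[i + 1] <- switch(x[i], sample(1:2, 1, prob = c(3, 1)),
                               sample(2:3, 1, prob = c(3, 1)),
                               sample(3:4, 1, prob = c(3, 1)),
                               4)
    }
    if(x[i + 1] == 4 && !all(x[1:(i+1)] == 4)) had_a_four <- TRUE
  }
  x
}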
And we can generate a 4-column matrix like this:
set.seed(4)
matrix(replicate(4, generate_seq(13)), ncol = 4)
#>       [,1] [,2] [,3] [,4]
#>  [1,]    4    4    1    1
#>  [2,]    3    3    1    2
#>  [3,]    3    4    2    3
#>  [4,]    3    4    2    3
#>  [5,]    4    4    2    4
#>  [6,]    4    4    3    4
#>  [7,]    4    4    3    4
#>  [8,]    4    4    3    4
#>  [9,]    4    4    4    4
#> [10,]    4    4    4    4
#> [11,]    4    4    4    4
#> [12,]    4    4    4    4
#> [13,]    4    4    4    4
I think you can use RcppAlgos to do this efficiently by generating the combinations for vectors of length 1:n (where it's assumed that the shorter vectors are left padded to length 13 with 4) :
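library(RcppAlgos)
get_combos <- function(n) {
  # For each length x, take every weakly increasing sequence of length x
  # (combinations with repetition) and left-pad it with n - x fours.
  # lapply() always returns a list, which is what do.call(rbind, ...) needs.
  unique(do.call(rbind, lapply(rev(seq(n)), function(x)
    do.call(
      cbind, c(rep(4, n - x), list(comboGeneral(1:4, x, TRUE)))
    ))))
}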
res <- get_combos(13)
head(res)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
[1,]    1    1    1    1    1    1    1    1    1     1     1     1     1
[2,]    1    1    1    1    1    1    1    1    1     1     1     1     2
[3,]    1    1    1    1    1    1    1    1    1     1     1     1     3
[4,]    1    1    1    1    1    1    1    1    1     1     1     1     4
[5,]    1    1    1    1    1    1    1    1    1     1     1     2     2
[6,]    1    1    1    1    1    1    1    1    1     1     1     2     3
nrow(res)
[1] 2367
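The row count can be sanity-checked without generating anything. Under the same reading of the rules (leading 4s followed by a weakly increasing tail), there are choose(n + 3, 3) weakly increasing sequences of length n, and for each shorter tail length m there are choose(m + 3, 3) - 1 tails that do not start with 4. A small sketch of that count (`count_valid` is a hypothetical name, and `k` is the number of factor levels):

```r
# Hypothetical closed-form count of valid sequences of length n over 1:k.
count_valid <- function(n, k = 4) {
  m <- seq_len(n - 1)                       # possible tail lengths after >= 1 leading k
  choose(n + k - 1, k - 1) +                # fully weakly increasing sequences
    sum(choose(m + k - 1, k - 1) - 1)       # padded tails, excluding the all-k tail
}

count_valid(13)
# [1] 2367
```

The result agrees with nrow(res) above, which suggests the unique() call removes exactly the duplicated padded rows.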
I have a data table where I want to replace each negative value with the positive value in the previous row of the same column. For example:
1 2 3 4
2 -3 -2 3
should be
1 2 3 4
2 2 3 3
Thanks!
Since there are no answers from more experienced guys, here is what I've come up with.
# I'm reconstructing your example:
n <- matrix(c(1, 2, 2, -3, 3, -2, 4, 3), nrow = 2)
n
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 2 -3 -2 3
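changeMat <- function(mat) {
  new_mat <- mat
  # Matrices are stored column by column, so element i - 1 is the value one
  # row above element i (for a first-row element it is the bottom of the
  # previous column). Start at 2 so that mat[i - 1] always exists.
  for(i in seq_along(mat)[-1]) {
    if(mat[i] < 0) new_mat[i] <- mat[i - 1]
  }
  return(new_mat)
}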
changeMat(n)
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 2 2 3 3
I checked that, for a data.table object dt, changeMat(as.matrix(dt)) works properly.
Anyway, I am pretty sure that there must be a smarter way...
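One possibly smarter way is to avoid the loop by building a copy of the matrix shifted down by one row and using logical indexing. This is a sketch only: `fix_neg` is a hypothetical name, and it assumes a negative value in the first row should become NA because there is no row above it (unlike the loop version, which falls back to the previous column).

```r
n <- matrix(c(1, 2, 2, -3, 3, -2, 4, 3), nrow = 2)

# Hypothetical vectorised version: replace negatives with the value directly above.
fix_neg <- function(mat) {
  above <- rbind(NA, mat[-nrow(mat), , drop = FALSE])  # every row shifted down by one
  neg <- mat < 0
  mat[neg] <- above[neg]
  mat
}

fix_neg(n)
#      [,1] [,2] [,3] [,4]
# [1,]    1    2    3    4
# [2,]    2    2    3    3
```

Like the loop, this takes the original value from the row above, so two negatives stacked in the same column would still leave a negative behind.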
I have the following function:
func <- function(scores, labels, thresholds) {
labels <- if (is.data.frame(labels)) labels else data.frame(labels)
sapply(thresholds, function(t) { sapply(labels, function(lbl) { sum(lbl[which(scores >= t)]) }) })
}
I also have the following that I'll pass into func.
> scores
[1] 0.187 0.975 0.566 0.793 0.524 0.481 0.005 0.756 0.062 0.124
> thresholds
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
> var1
[1] 1 1 0 0 0 1 0 1 1 1
> df
var1 var2
1 1 0
2 1 1
3 0 0
4 0 0
5 0 0
6 1 1
7 0 1
8 1 1
9 1 1
10 1 0
Here are two different calls to func, one with labels as a vector, and the other with labels as a data.frame:
> func(scores, var1, thresholds)
labels labels labels labels labels labels labels labels labels labels labels
     6      5      3      3      3      2      2      2      1      1      0
> func(scores, df, thresholds)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
var1    6    5    3    3    3    2    2    2    1     1     0
var2    5    3    3    3    3    2    2    2    1     1     0
Why does "labels" get applied as a colname in the vector version, and "var1" and "var2" get applied as a rowname in the data.frame version?
What I'm looking for is the vector version to be more like:
> func(scores, var1, thresholds)
       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
labels    6    5    3    3    3    2    2    2    1     1     0
To create the variables above:
scores <- sample(seq(0, 1, 0.001), 10, replace = TRUE)
thresholds <- seq(0, 1, 0.1)
var1 <- sample(c(0, 1), 10, replace = TRUE)
var2 <- sample(c(0, 1), 10, replace = TRUE)
df <- data.frame(var1, var2)
Try switching the order of the nested sapplys:
func <- function(scores, labels, thresholds) {
  labels <- if (is.data.frame(labels)) labels else data.frame(labels)
  t(sapply(labels, function(lbl) {
    sapply(thresholds, function(t) sum(lbl[which(scores >= t)]))
  }))
}
From ?sapply:
‘sapply’ is a user-friendly version and wrapper of ‘lapply’ by
default returning a vector, matrix or, if ‘simplify = "array"’, an
array if appropriate, by applying ‘simplify2array()’.
To understand what's going on in your original function, it's perhaps useful to think about each sapply in turn.
The inner sapply(labels, ...) creates a named vector of length k (where k is the number of columns in labels -- so k is 1 in the vector case, and 2 in the dataframe example), where the names of the vector elements are given by the column names (labels in the vector case, and var1/var2 in the dataframe example).
The outer sapply(thresholds, ...) runs the inner sapply 11 times, each time with a different value of t. So in the vector case, you'll end up with 11 vectors of length 1 where the name of the one and only element in each vector is labels, which sapply returns ("simplifies") as one vector of length 11.
By switching the order of the sapplys, the inner sapply now returns an unnamed vector of length 11. The outer sapply then does this k times. In the vector case, k is 1, and the name of the vector returned is labels. In the dataframe example, k is 2, and the names of the 2 vectors returned are var1 and var2.
(It might also be a useful exercise to name the elements in the thresholds vector; e.g. thresholds <- setNames(seq(0, 1, 0.1), LETTERS[1:11]) and re-run func to see what happens.)
Note: @weihuang-wong's answer is great, and the solution is in some ways better than this one. But I already had most of this answer written before that answer was posted, so I decided to post this answer anyway.
You get the names you do because those are the names of the things you iterate over. But why do you get a named vector in the first case and a matrix with rownames in the second case? Here is a simpler case that makes it easier to see.
sapply(1, function(x) sapply(c(a = 1), function(y) y))
# a
# 1
sapply(1, function(x) sapply(c(a = 1, b = 2), function(y) y))
# [,1]
# a 1
# b 2
OK, so what is happening here? Let's break it down so we can see.
sapply(c(a = 1), function(y) y)
returns a named length-one vector.
sapply(c(a = 1, b = 2), function(y) y)
returns a named length-two vector.
Now it's the job of the outer sapply to combine those results. When it sees that the inner sapply returns a length-one vector it simplifies it to a named vector. That simplification doesn't work when the return value is of length > 1, so sapply simplifies to a matrix instead.
So if we want consistency we need sapply to return a matrix, even in the length-one case. How do we make sapply consistent? It's surprisingly difficult. In the end I would just convert it to a matrix after the fact.
matrix(sapply(1, function(x) sapply(c(a = 1), function(y) y)), dimnames = list("a"))
# [,1]
# a 1
Now that we understand what's happening we can apply what we've learned to the original problem.
func <- function(scores, labels, thresholds) {
  labels <- if (is.data.frame(labels)) labels else data.frame(labels)
  r <- sapply(thresholds, function(t) { sapply(labels, function(lbl) { sum(lbl[which(scores >= t)]) }) })
  if(!is.matrix(r)) r <- matrix(r, nrow = 1, dimnames = list(names(labels)))
  r
}
func(scores, df, thresholds)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
# var1 6 5 3 3 3 2 2 2 1 1 0
# var2 5 3 3 3 3 2 2 2 1 1 0
func(scores, var1, thresholds)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
# labels 6 5 3 3 3 2 2 2 1 1 0
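A different way to sidestep sapply's simplification rules altogether is to compute the sums as a matrix product, which returns a matrix whatever the number of label columns. This is an alternative technique rather than either approach above. `func_mat` is a hypothetical name, and the data are the values printed in the question.

```r
scores <- c(0.187, 0.975, 0.566, 0.793, 0.524, 0.481, 0.005, 0.756, 0.062, 0.124)
thresholds <- seq(0, 1, 0.1)
var1 <- c(1, 1, 0, 0, 0, 1, 0, 1, 1, 1)
var2 <- c(0, 1, 0, 0, 0, 1, 1, 1, 1, 0)
df <- data.frame(var1, var2)

# Hypothetical matrix-product version of func
func_mat <- function(scores, labels, thresholds) {
  labels <- if (is.data.frame(labels)) labels else data.frame(labels)
  # keep[i, j] is 1 when scores[i] >= thresholds[j], otherwise 0
  keep <- outer(scores, thresholds, ">=") * 1
  # t(labels) %*% keep: one row per label column, one column per threshold
  crossprod(as.matrix(labels), keep)
}

func_mat(scores, var1, thresholds)   # 1 x 11 matrix with row name "labels"
func_mat(scores, df, thresholds)     # 2 x 11 matrix with row names "var1", "var2"
```

Because crossprod() always returns a matrix, the row names come from the column names of labels in both the vector and the data.frame case.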
A = matrix(c(1,2,3, 0, 2, 2, 0,2 ,3), nrow=3, ncol=3)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 2 2 2
[3,] 3 2 3
B = matrix(c(1,2,3, 1, 4, 2, 2,2 ,1), nrow=3, ncol=3)
[,1] [,2] [,3]
[1,] 1 1 2
[2,] 2 4 2
[3,] 3 2 1
C = (A + B) / (number of non-zero values at each position), i.e. the element-wise mean ignoring zeros:
C = matrix(c(1,2,3, 1, 3, 2, 2,2 ,2), nrow=3, ncol=3)
[,1] [,2] [,3]
[1,] 1 1 2
[2,] 2 3 2
[3,] 3 2 2
I need to do this for a list of N matrices (mat_vect[[i]]):
list_mat_vect[[i]] <- assign(paste("a", i, sep = ""), mat_vect[[i]])
Summing the matrices and taking the mean:
mat_sum_mean = Reduce("+", list_mat_vect) / length(file_list)
This divides by the total number of matrices, so the zeros are counted as well. I don't want that.
You can do
(A+B)/((A!=0) + (B!=0))
to get
[,1] [,2] [,3]
[1,] 1 1 2
[2,] 2 3 2
[3,] 3 2 2
Here != tests for inequality with zero, returning TRUE or FALSE. When we add those up, the TRUEs are treated like 1 and the FALSEs become 0.
You can do this with a list of matrices as well
list_mat_vect<-list(A,B)
Reduce("+", list_mat_vect) / Reduce("+", lapply(list_mat_vect, function(x) x!=0))
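One caveat: if a position is zero in every matrix, the count is 0 and the division gives NaN (0/0). A hedged sketch of a wrapper that guards against this, assuming such positions should simply stay 0 (`mean_nonzero` is a hypothetical name):

```r
A <- matrix(c(1, 2, 3, 0, 2, 2, 0, 2, 3), nrow = 3)
B <- matrix(c(1, 2, 3, 1, 4, 2, 2, 2, 1), nrow = 3)

# Hypothetical helper: element-wise mean over a list of matrices, ignoring zeros.
mean_nonzero <- function(mats) {
  total <- Reduce(`+`, mats)
  count <- Reduce(`+`, lapply(mats, function(m) m != 0))  # non-zero entries per position
  ifelse(count == 0, 0, total / count)                    # assumption: all-zero positions stay 0
}

mean_nonzero(list(A, B))
#      [,1] [,2] [,3]
# [1,]    1    1    2
# [2,]    2    3    2
# [3,]    3    2    2
```

If NA is preferred for all-zero positions, replace the 0 in ifelse() with NA.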