improve efficiency of filling dataset with sampling


Given myletters:
library(tidyverse)
myletters <- letters
myletters
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
I want to sample 4 letters at a time (without replacement) from myletters, repeat this X times, and find the probability of having sampled every letter at least once, for X = 1:100 draws.
For example if X = 10 we could get:
set.seed(10)
X <- unlist(rerun(10, sample(myletters, 4, replace = F)))
X
# [1] "k" "i" "j" "p" "l" "w" "h" "v" "g" "s" "x" "o" "o" "j" "g" "y" "b" "x" "m" "h" "n" "g" "f" "y" "v" "r" "u" "y" "m" "e" "a" "g" "z" "r" "d" "y" "x" "s" "v"
# [40] "r"
#test if X contains all 26 letters
n_distinct(X) == 26 #26 = no of letters
#FALSE
The following approach does what I want in a simulation but doesn't scale very well as it fills a dataframe column with up to 400 letters in a cell so is awkward and inefficient:
output <- crossing(drawsX = 1:100,
trial = 1:100) %>%
mutate(draws_output = map(drawsX, ~ unlist(rerun(., sample(myletters, 4, replace = F)))),
all_letters = map_lgl(draws_output, ~ n_distinct(.) == 26))
output
#plot
output %>%
group_by(drawsX) %>%
summarise(prob_of_all_letters = mean(all_letters)) %>%
ggplot(., aes(drawsX, prob_of_all_letters)) +
geom_line() +
scale_y_continuous(labels = scales::percent_format()) +
labs(y = "Probability")
Ideally I would like to simulate more times e.g. trial = 1:100000 but the approach above is inefficient if I wanted to do this.
1) Is there a more efficient way to fill my dataset (or using a matrix) with samples?
2) Also, is there an analytic way to solve this problem in R instead of simulation? E.g. what is the probability of getting all 26 letters from 10 draws of 4 samples each?
thanks

Here's a somewhat improved version. The code is a bit more efficient and certainly cleaner:
sample_sets = function(replicates, k, set = letters) {
draws = vapply(1:replicates, function(z, ...) sample.int(...), FUN.VALUE = integer(k), n = length(set), size = k, replace = FALSE)
all(seq_along(set) %in% draws)
}
## example use
output <- crossing(
drawsX = 1:100,
trial = 1:100
) %>%
mutate(
outcome = map_lgl(drawsX, sample_sets, set = letters, k = 4),
)
## timing
system.time({output <- crossing(
drawsX = 1:100,
trial = 1:100
) %>%
mutate(
outcome = map_lgl(drawsX, sample_sets, set = letters, k = 4),
)
})
# user system elapsed
# 2.79 0.04 2.95
## original way
system.time({output <- crossing(drawsX = 1:100,
trial = 1:100) %>%
mutate(draws_output = map(drawsX, ~ unlist(rerun(., sample(letters, 4, replace = F)))),
all_letters = map_lgl(draws_output, ~ n_distinct(.) == 26))})
# user system elapsed
# 4.96 0.06 5.18
So it's about 40% faster on this data - hopefully that performance gain will continue as draws increases.
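On the second question (an analytic solution), one option that the answer above does not cover is inclusion-exclusion. The chance that a given set of j letters is missed by a single draw of k letters is choose(N - j, k) / choose(N, k), and the draws are independent, so P(all N letters seen after n draws) = sum over j = 0..N of (-1)^j * choose(N, j) * (choose(N - j, k) / choose(N, k))^n. The following is a minimal sketch under those assumptions; the function name prob_all_seen and its argument names are made up for illustration.

```r
# Hypothetical helper: exact probability that every one of `n_items` items
# has appeared after `n_draws` independent draws of `k` items each
# (each draw is without replacement), via inclusion-exclusion.
prob_all_seen <- function(n_draws, n_items = 26, k = 4) {
  j <- 0:n_items
  # Probability that one draw misses a fixed set of j items;
  # choose() returns 0 when fewer than k items remain.
  p_miss <- choose(n_items - j, k) / choose(n_items, k)
  sum((-1)^j * choose(n_items, j) * p_miss^n_draws)
}

# Small sanity check that can be verified by hand:
# 2 items, 1 item per draw, 2 draws -> both seen with probability 1/2
prob_all_seen(2, n_items = 2, k = 1)
# [1] 0.5

# Curve for the original problem, one value per number of draws
analytic <- sapply(1:100, prob_all_seen)
```

Because this is an alternating sum, the result carries a small floating-point error (tiny values may come out as roughly zero rather than exactly zero), which should not matter for plotting against the simulated curve.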

Related

Substitution Encryption/Decryption in R

Using the variables alpha and key, encrypt ptext into a variable named ctext. Using substitution cipher
So I have a text file separated in a vector
ptext <- strsplit(ptext,split = "", fixed = TRUE)
ptext <- unlist(ptext)
I also created a key for this cipher
key <- "ZGYHXIWJVKULTMSARBQCPDOENF"
key <- unlist(strsplit(key,""))
and an Alphabet vector for the key
alpha <- "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
alpha <- toupper(alpha)
alpha <- unlist(strsplit(alpha,""))
Now my goal is to replace every character in the ptext vector with the letter in key that corresponds to its position in alpha (example: A in alpha corresponds to Z in key, so all A's in the text would be replaced by a Z).
I know I am supposed to match the alpha in key
cipher <- match(key,alpha)
Now my issue is that the ptext vector has over 1000 characters in it. How can I replace all the letters in that vector?

You could use chartr which will avoid splitting the string and pasting back.
ptext <- 'REQWDSFFFSLK'
alpha <- 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
key <- 'ZGYHXIWJVKULTMSARBQCPDOENF'
chartr(alpha, key, ptext)
#[1] "BXROHQIIIQLU"
Here, every R is replaced with B, every E with X, and so on for each character.
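Since the question title also mentions decryption, here is a small sketch of the reverse direction. It assumes the key is a permutation of alpha (as it is here), so swapping the first two arguments of chartr undoes the substitution. The variable name decoded is only for illustration.

```r
alpha <- "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
key   <- "ZGYHXIWJVKULTMSARBQCPDOENF"
ptext <- "REQWDSFFFSLK"

# Encrypt: map each letter of alpha to the letter at the same position in key
ctext <- chartr(alpha, key, ptext)

# Decrypt: swap the roles of key and alpha
decoded <- chartr(key, alpha, ctext)
identical(decoded, ptext)
# [1] TRUE
```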

Basically, you need to do what you are doing with the cipher but apply that to each letter of ptext. You can either leave as a vector of single strings or put back together as desired, for example using paste0 below.
set.seed(123)
ptext <- strsplit(LETTERS[sample(26, 100, T)],split = "", fixed = TRUE)
ptext <- unlist(ptext)
key <- "ZGYHXIWJVKULTMSARBQCPDOENF"
key <- unlist(strsplit(key,""))
alpha <- unlist(strsplit(LETTERS,""))
encoded <- sapply(ptext, function(x) key[match(x, alpha)])
encoded
#> O S N C J R V K E T N V Y Z E S Y Y I C
#> "S" "Q" "M" "Y" "K" "B" "D" "U" "X" "C" "M" "D" "N" "F" "X" "Q" "N" "N" "V" "Y"
#> H Z G J I S D N Q K G U L O J M G I I J
#> "J" "F" "W" "K" "V" "Q" "H" "M" "R" "U" "W" "P" "L" "S" "K" "T" "W" "V" "V" "K"
#> W U G U F Y B E H L M R A Y Y F U O I O
#> "O" "P" "W" "P" "I" "N" "G" "X" "J" "L" "T" "B" "Z" "N" "N" "I" "P" "S" "V" "S"
#> Z P T F K H V V G P Q V R Q B D M E V S
#> "F" "A" "C" "I" "U" "J" "D" "D" "W" "A" "R" "D" "B" "R" "G" "H" "T" "X" "D" "Q"
#> Y T V Y N Y W C H P L Y N C N G C W V Z
#> "N" "C" "D" "N" "M" "N" "O" "Y" "J" "A" "L" "N" "M" "Y" "M" "W" "Y" "O" "D" "F"
paste0(encoded, collapse = "")
#> [1] "SQMYKBDUXCMDNFXQNNVYJFWKVQHMRUWPLSKTWVVKOPWPINGXJLTBZNNIPSVSFACIUJDDWARDBRGHTXDQNCDNMNOYJALNMYMWYODF"
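As a possible simplification, match is already vectorised, so the sapply loop is not strictly needed. The sketch below does the lookup in one step. The helper name encode_chars is hypothetical, and leaving unmatched characters (spaces, punctuation) unchanged is an assumption of this sketch, not something the code above does.

```r
alpha <- LETTERS
key <- strsplit("ZGYHXIWJVKULTMSARBQCPDOENF", "")[[1]]

# Hypothetical helper: substitute every character with one vectorised lookup
encode_chars <- function(chars, from = alpha, to = key) {
  idx <- match(chars, from)        # position in `from`, NA if not found
  out <- to[idx]
  unmatched <- is.na(idx)
  out[unmatched] <- chars[unmatched]  # assumption: keep unknown characters as-is
  out
}

ptext <- strsplit("HELLO, WORLD", "")[[1]]
paste0(encode_chars(ptext), collapse = "")
# [1] "JXLLS, OSBLH"
```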

Get a random column from R matrix

In R language, I defined a matrix this way:
data <- matrix(c("A","B","C","D","E","F"), nrow = 2)
This gives me something like this:
"A" | "C" | "E"
"B" | "D" | "F"
Now, how do I get a random column of the matrix?
If I do:
sample(x = data, size = 2)
I get random elements from all around the matrix, like "A" and "F". What I want is to get a column like "A" and "B", or "C" and "D" or "E" and "F"
I am new to R so any help is really appreciated

I'd use something like this:
f <- function(mat) {
j <- sample(seq_len(ncol(mat)), size=1)
## (Use `drop=FALSE` to say "don't convert 1-column matrices to vectors")
mat[, j, drop=FALSE]
}
## Try it out
f(data)
# [,1]
# [1,] "E"
# [2,] "F"
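If more than one random column is needed, the same idea extends naturally. This is only a sketch; the helper name sample_cols and its k argument are made up for illustration, and it samples columns without replacement.

```r
# Hypothetical helper: pick k distinct random columns of a matrix
sample_cols <- function(mat, k = 1) {
  stopifnot(k <= ncol(mat))
  j <- sample.int(ncol(mat), size = k)  # k distinct column indices
  mat[, j, drop = FALSE]                # keep the result a matrix
}

data <- matrix(c("A","B","C","D","E","F"), nrow = 2)
set.seed(1)
picked <- sample_cols(data, k = 2)
dim(picked)
# [1] 2 2
```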

Horizontal barplot labels overlapping on each other

I am using a dataset to create a barplot in the horizontal orientation, similar to what has been proposed as a solution in R: How can I make a barplot with labels parallel (horizontal) to bars.
However, there are too many labels on the Y axis of my horizontal barplot (due to the problem at hand), so they overlap each other.
Is there a way to preserve the barplot bin size and show a subset of the Y labels in the horizontal orientation of the barplot?
thanks,
rajat

Here's one way to do it: blank out every other name from your data, so that only a subset of the labels is drawn.
Generate some data:
set.seed(123)
df1 <- data.frame(x = replicate(50, paste(sample(letters, 2, replace = T), collapse = '')),
y = sample(1:10, 50, replace = T), stringsAsFactors = FALSE)
Make a barplot, using a subset of the names:
barplot(df1$y, names.arg = replace(df1$x, seq(2, nrow(df1), by = 2), ''), horiz = T, las = 1)
The main trick is the names.arg = replace(df1$x, seq(2, nrow(df1), by = 2), '') argument. It replaces every second name with an empty string, so half of the labels disappear while each remaining label still lines up with its own bar.
If that's not sufficient, we can define a function which takes in a vector of names, x, and a multiple, m that defines which values to replace with blanks:
replace_multiple <- function(x, m){
len_x <- length(x)
index_to_replace <- seq(1, len_x, by = m)
x[index_to_replace] <- ''
return(x)
}
replace_multiple(letters[1:12], m = 2)
# "" "b" "" "d" "" "f" "" "h" "" "j" "" "l"
replace_multiple(letters[1:12], m = 3)
# "" "b" "c" "" "e" "f" "" "h" "i" "" "k" "l"
replace_multiple(letters[1:12], m = 4)
# "" "b" "c" "d" "" "f" "g" "h" "" "j" "k" "l"
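With very many bars it may be more useful to do the opposite and keep only every m-th label. The following is a sketch of that variant; the name keep_every is hypothetical and not part of the answer above.

```r
# Hypothetical helper: keep labels 1, 1 + m, 1 + 2m, ... and blank the rest
keep_every <- function(x, m) {
  keep <- (seq_along(x) - 1) %% m == 0
  x[!keep] <- ''
  x
}

keep_every(letters[1:12], m = 4)
# "a" "" "" "" "e" "" "" "" "i" "" "" ""
```

The result can be passed to names.arg in the same way as before, e.g. names.arg = keep_every(df1$x, 5).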

Merging two vectors on precisely defined intervals

Given are two vectors:
vec_nums <- 1:20
vec_ltrs <- letters[1:10]
I would like to write a function that merges them so that each element from the second vector appears at a precisely defined position within the first vector. For example, running:
vec_mrg <- funMergeVectsByPlace(x = vec_nums, y = vec_ltrs, position = 3)
Should return vec_mrg of the following content:
[1] "a" "b" "1" "c" "d" "2" "e" "f" "3" "g" "h" "4" "i" "j" "5" ...
Desired characteristics:
The function places elements from the vector passed via y = at the position given in position =, counting from the left-hand side. So position = 3 should be understood as "every third place", i.e. positions 3, 6, ...
The function should work on numeric, string and factor vectors and return an ordered factor.
In case vector y is shorter than the number of inserts in x, the function should return the remaining part of x without any additions.
Suggested structure
I would envisage for the function to be of this structure:
funMergeVectsByPlace <- function(x, y, position = 3) {
# Convert
vec_a <- as.character(x)
vec_b <- as.character(y)
# Missing part
# Combine two vectors
# Create ordered factor
vec_fac <- factor(vec_mrg,
# levels =
# I want the levels to reflect the order of elements in the vec_mrg
)
# Return
return(vec_fac)
}
Samples
Simplest
Concerning attempts, simplest approach:
vec_mrg <- c(vec_nums, vec_ltrs)
vec_mrg <- order(vec_mrg)
But this would not create the order
Loop
for (i in 1:length(vec_nums)) {
pos <- position
vec_nums[pos] <- vec_ltrs[i]
pos <- pos + pos
# i will be out of bounds and the way to move the other vector is missing
}

vec_mrg <- function(x,y,pos) {
res <- y
counter <- seq(floor(length(y)/(pos-1)))
for(i in counter) {
res <- append(res, x[i], seq(pos-1,by=pos, length.out=length(counter))[i])
}
res
}
vec_mrg(vec_nums, vec_ltrs, 3)
#[1] "a" "b" "1" "c" "d" "2" "e" "f" "3" "g" "h" "4" "i" "j"
#[15] "5"

A loop-free solution:
funMergeVectsByPlace <- function( x, y, position )
{
n <- min( length(y)%/%(position-1), length(x) )
A <- rbind( matrix(head(y,n*(position-1)),position-1), head(x,n) )
rest <- c( x[-(1:n)], y[-(1:(n*(position-1)))] )
c(c(A),rest)
}
Speed comparison with Lafortune's solution:
> library(microbenchmark)
> vec_nums <- 1:20
> vec_ltrs <- letters[1:10]
> microbenchmark(Lafortune = vec_mrg(vec_nums,vec_ltrs,3),
+ mra68 = funMergeVectsByPlace(vec_nums,vec_ltrs,3),
+ times .... [TRUNCATED]
Unit: microseconds
expr min lq mean median uq max neval
Lafortune 137.677 143.112 161.12006 146.734 153.980 2931.512 10000
mra68 77.443 81.067 92.13208 83.331 86.954 2718.204 10000
Larger vectors:
> vec_nums <- 1:2000
> vec_ltrs <- letters[rep(1:10,100)]
> microbenchmark(Lafortune = vec_mrg(vec_nums,vec_ltrs,3),
+ mra68 = funMergeVectsByPlace(vec_nums,vec_ltrs,3),
+ times .... [TRUNCATED]
Unit: milliseconds
expr min lq mean median uq max neval
Lafortune 32.993883 40.991796 63.758011 51.171020 90.122351 456.9748 1000
mra68 1.101865 1.489533 2.468496 1.751299 3.338881 230.0460 1000
> v1 <- vec_mrg(vec_nums,vec_ltrs,3)
> v2 <- funMergeVectsByPlace(vec_nums,vec_ltrs,3)
>
Notice that the vec_mrg function does not append the rest of the x vector to the result, but funMergeVectsByPlace does. Otherwise the results are the same:
> v1 <- vec_mrg(1:20,letters[1:10],3)
> v2 <- funMergeVectsByPlace(1:20,letters[1:10],3)
> v1
[1] "a" "b" "1" "c" "d" "2" "e" "f" "3" "g" "h" "4" "i" "j" "5"
> v2
[1] "a" "b" "1" "c" "d" "2" "e" "f" "3" "g" "h" "4" "i" "j" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20"
> identical(v1,v2[1:length(v1)])
[1] TRUE
>
Neither vec_mrg nor funMergeVectsByPlace returns a factor. If one wraps the result in factor(...), both functions get slower, but funMergeVectsByPlace is still faster than vec_mrg.
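To get the ordered factor the question asks for, the loop-free approach can be wrapped as sketched below. The function name merge_as_factor is hypothetical. It assumes position >= 2 and takes the levels in order of first appearance, which is one reading of "levels reflect the order of elements".

```r
# Hypothetical wrapper around the loop-free merge that returns an ordered factor
merge_as_factor <- function(x, y, position = 3) {
  stopifnot(position >= 2)
  x <- as.character(x)
  y <- as.character(y)
  gap <- position - 1                      # y elements between two x elements
  n <- min(length(y) %/% gap, length(x))   # number of complete groups
  # Each column holds `gap` elements of y followed by one element of x
  groups <- rbind(matrix(head(y, n * gap), nrow = gap), head(x, n))
  rest <- c(tail(x, length(x) - n), tail(y, length(y) - n * gap))
  merged <- c(groups, rest)                # c() flattens the matrix column by column
  factor(merged, levels = unique(merged), ordered = TRUE)
}

f <- merge_as_factor(1:20, letters[1:10], 3)
as.character(f)[1:6]
# [1] "a" "b" "1" "c" "d" "2"
is.ordered(f)
# [1] TRUE
```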

Define names of objects in multiple lists within a list (using lapply?)

My apologies for the somewhat confusing title (any suggestions for improvement are welcome).
Suppose I have a list which contains several (e.g. four) lists in which I would like to store 20 objects later on:
mylist <- vector(mode="list",length=4)
names(mylist) <- c("One","Two","Three","Four")
mylist$One <- mylist$Two <- mylist$Three <- mylist$Four <- vector(mode="list",
length=20)
I would like to define the names of those objects beforehand. Of course, I can do that as following:
names(mylist$One) <- c("A","B","C","D","E","F","G","H","I","J",
"K","L","M","N","O","P","Q","R","S","T")
names(mylist$Two) <- names(mylist$Three) <- names(mylist$Four) <- names(mylist$One)
But if the number of the lists would increase (as is the case in my actual data), this becomes rather cumbersome, so I was trying to do this with a function such as lapply :
mylist <- lapply(mylist,FUN=function(x) {names(x) <-
c("A","B","C","D","E","F","G","H","I","J",
"K","L","M","N","O","P","Q","R","S","T")})
This, however, does not give me the same result, but I can not seem to figure out what I am overlooking here. Any suggestions?
Thanks!

You need to return a value in your lapply call:
mylist <- lapply(mylist,FUN=function(x) {names(x) <-
c("A","B","C","D","E","F","G","H","I","J",
"K","L","M","N","O","P","Q","R","S","T")
x ## <- note the x here; you could also use return(x)
})
mylist
# $One
# A B C D E F G H I J K L M N O P Q R S T
# "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T"
#
# $Two
# A B C D E F G H I J K L M N O P Q R S T
# "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T"
#
# $Three
# A B C D E F G H I J K L M N O P Q R S T
# "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T"
#
# $Four
# A B C D E F G H I J K L M N O P Q R S T
# "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T"

This is my implementation, which I think produces the results you are expecting:
mylist <- vector(mode="list",length=4)
names(mylist) <- c("One","Two","Three","Four")
mylist$One <- mylist$Two <- mylist$Three <- mylist$Four <- vector(mode="list",length=20)
renameList <- function(mylist,k){
names(mylist) <- LETTERS[1:k]
return(mylist)
}
mylist2 <- lapply(mylist, function(x) renameList(x,20))
# > str(mylist2)
# List of 4
# $ One :List of 20
# ..$ A: NULL
# ..$ B: NULL
# ..$ C: NULL
