I was hoping someone could point me in the right direction to create a vector in R containing a defined number of randomly generated numbers. I am a complete newbie to R, and I have learned that the c() function is used for creating vectors. However, I wish to populate the vector with 50 random numbers. I do not wish to specify a range or any other conditions for the numbers.
MyVectorObject <- c(...)
Any suggestions would be greatly appreciated!
It depends on which numbers you want to generate. These are some options.
x1 <- rpois(n = 50, lambda = 10)
x2 <- runif(n = 50, min = 1, max = 10)
x3 <- sample(x = c(1, 3, 5), size = 50, replace = TRUE)
If we are talking about integers, you want to generate numbers in the interval [-.Machine$integer.max, .Machine$integer.max], which on my computer, for example, is [-2147483647, 2147483647].
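You can check this bound on your own machine; it is typically the 32-bit limit 2^31 - 1:
# largest representable value of R's integer type (usually 2^31 - 1)
.Machine$integer.max
## [1] 2147483647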
Implementation
You can use base::sample to generate positive integers from 1 to .Machine$integer.max:
random.pos <- function(N) {
int.max <- .Machine$integer.max
return(sample(int.max, N, replace=TRUE))
}
If you also want negative numbers, you can use
random.posneg <- function(N) {
int.max <- .Machine$integer.max
random.numbers <- sample(int.max, N, replace = TRUE)
random.signs <- sample(c(1,-1), N, replace=TRUE)
return(random.numbers * random.signs)
}
NOTE: neither of the functions above ever generates 0 (zero).
The best approach, in my opinion, is to use the base::runif function.
random.runif <- function(N) {
int.max <- .Machine$integer.max
int.min <- -int.max
return(as.integer(runif(N, int.min, int.max)))
}
This will be a little slower than using base::sample, but you get uniformly distributed random numbers that can include zero.
Benchmark
library(microbenchmark)
require(compiler)
random.runif <- cmpfun(random.runif)
random.pos <- cmpfun(random.pos)
random.posneg <- cmpfun(random.posneg)
N <- 500
op <- microbenchmark(
RUNIF = random.runif(N),
POS = random.pos(N),
POSNEG = random.posneg(N),
times = 10000
)
print(op)
## library(ggplot2)
## boxplot(op)
## qplot(y=time, data=op, colour=expr) + scale_y_log10()
And the results of the benchmark above:
Unit: microseconds
expr min lq mean median uq max neval
RUNIF 13.423 14.251 15.197122 14.482 14.694 2425.290 10000
POS 4.174 5.043 5.613292 5.317 5.645 2436.909 10000
POSNEG 11.673 12.845 13.383194 13.285 13.800 60.304 10000
Related
I want to generate 599 numbers (with repeats) between 0 and 100 with a specified mean and sd; the generated numbers should all be multiples of 5. Pseudocode might look like:
df$Murphy_task_strategy<-rnorm(mean=57, sd= 25.30, n=599,by=5, min=0, max=100)
Thank you
It won't strictly be a normal distribution, but I'm assuming you're mainly looking for something that looks approximately bell-shaped in a histogram.
rdiscnorm <- function(n, mean, sd, min = 0, max = 100, by = 5){
# generate the possible values we can take on
vals <- seq(from = min, to = max, by = by)
# use dnorm to get the density at each of those points
unnormed_probabilities <- dnorm(vals, mean = mean, sd = sd)
# normalize so that the probabilities sum to 1
# - this isn't strictly necessary because we use sample
# but it makes sense when thinking about the process
ps <- unnormed_probabilities/sum(unnormed_probabilities)
# Take a sample with replacement of the vals
# using the generated probabilities
output <- sample(vals, size = n, replace = TRUE, prob = ps)
return(output)
}
out <- rdiscnorm(599, 57, 25.30)
df$Murphy_task_strategy <- out
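A few quick sanity checks on the output (exact values will vary with the random draw):
mean(out)            # should land near the requested mean of 57
sd(out)              # near 25, slightly reduced by truncation to [0, 100]
all(out %% 5 == 0)   # TRUE: every value is a multiple of 5
range(out)           # stays within 0 and 100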
I need to write a function according to the info in the title. I'm trying to do that with the following code:
my.function <- function(x=1:6,Nsample=20,prob1=NULL) {
rolling.die <- sample(x, size=Nsample, replace=TRUE, prob=prob1)
for (die in 1:10000) {
die.sum <- sum(rolling.die)
average <- die.sum/Nsample
}
return(var(average))
}
my.function()
But I always get NA as a result. Could you please help me understand what I am doing wrong?
You need replicate() -
set.seed(2)
test <- replicate(1000, mean(sample(1:6, 20, replace = T)))
# for expectation
mean(test)
[1] 3.50025
# for variance
var(test)
[1] 0.147535
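For reference, the theoretical values for the average of 20 fair six-sided dice are a mean of 3.5 and a variance of (35/12)/20 ≈ 0.1458, so the simulation agrees closely.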
average is a single number. It does not make sense to calculate the variance of a single number: what is the variance of 5? Variance applies to a collection of numbers, so your average must be a vector.
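A quick way to see this in R (the variance of a length-one vector is NA):
var(5)
## [1] NA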
A more efficient approach is to generate all of your data ahead of time. As long as you have the memory, this is very fast:
# sim parameters
n_rolls <- 20L #L means integer variables
n_sim <- 10000L
n_sides <- 6L
#generate data
set.seed(2)
sims <- sample(n_sides, n_rolls * n_sim, replace = T)
#make into matrix of n_sims x n_rolls
mat <- matrix(sims, ncol = n_rolls)
#mean of each simulation
rowMeans(mat)
#var of everything
var(rowMeans(mat))
This is around 14x faster than using replicate because this approach calls sample() only once.
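The timings below can be reproduced with a comparison along these lines; the wrapper names mirror the benchmark output, but the exact harness is an assumption on my part:
library(microbenchmark)
# wrap each approach so microbenchmark can time them side by side
shree_replic <- function() {
  replicate(n_sim, mean(sample(n_sides, n_rolls, replace = TRUE)))
}
cole_samp_mat <- function() {
  sims <- sample(n_sides, n_rolls * n_sim, replace = TRUE)
  rowMeans(matrix(sims, ncol = n_rolls))
}
microbenchmark(
  shree_replic = shree_replic(),
  cole_samp_mat = cole_samp_mat(),
  times = 10
)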
Unit: milliseconds
expr min lq mean median uq max neval
shree_replic 137.7283 138.9809 145.78485 142.34755 147.2499 172.4633 10
cole_samp_mat 11.3998 11.4477 11.57025 11.52105 11.7628 11.8218 10
As for your current function, it doesn't make sense: the loop doesn't do anything. It just repeats the same calculation 10,000 times and, as @user31264 points out, then tries to calculate the variance of a scalar after the loop. I think you meant to do something like:
my.function2 <- function(x=1:6,Nsample=20,prob1=NULL) {
rolling.die <- sample(x, size=Nsample, replace=TRUE, prob=prob1)
return(mean(rolling.die))
}
means <- vector(mode = 'double', length = n_sim)
for (i in 1:n_sim){
means[i] <- my.function2()
}
#which is equivalent to
means <- sapply(1:n_sim, my.function2)
#which is also equivalent to
means <- replicate(n_sim, my.function2())
var(means)
And @shree has a much more succinct version of your function.
I'm trying to divide each row of a dataframe by a number stored in a second mapping dataframe.
for(g in rownames(data_table)){
print(g)
data_table[g,] <- data_table[g,]/mapping[g,2]
}
However, this is incredibly slow; each row takes 1-2 seconds to run. I know iteration is usually not the best way to do things in R, but I don't know how else to do it. Is there any way I can speed up the runtime?
Try this:
sweep(data_table, 1, mapping[[2]], "/")
In terms of speed, here is a benchmark of the possibilities using the iris dataset, including your version:
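Note that test isn't defined in the snippet; for reproducibility, I'm assuming something like the numeric columns of iris:
# assumed setup for the benchmark below: a numeric-only copy of iris
test <- iris[, 1:4]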
microbenchmark::microbenchmark(
A = {
for(g in rownames(test)){
# print(g)
test[g,] <- test[g,]/test[g,2]
}
},
B = sweep(test, 1, test[[2]], "/"),
C = test / test[[2]],
times = 100
)
#Unit: microseconds
#expr min lq mean median uq max neval
#A 82374.693 83722.023 101688.1254 84582.052 147280.057 157507.892 100
#B 453.652 484.393 514.4094 513.850 539.480 623.688 100
#C 404.506 423.794 456.0063 446.101 470.675 729.205 100
You can vectorize this operation if the two data frames have the same number of rows:
dt <- data.frame(a = rnorm(100), b = rnorm(100))
mapping <- data.frame(x = rnorm(100), y = rnorm(100))
dt / mapping[,2]
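This works because a data frame is divided column by column, with the vector recycled down each column, so row i of dt ends up divided by mapping[i, 2], which is exactly the per-row division you want.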
I have a large symmetric matrix and want to reduce it to a much smaller matrix matrix_small with nrow(matrix_small) = n. The mean of matrix_small should be maximized. Is there a way to achieve this goal in R with a better algorithm than the one I already have? Better means either faster with the same mean, or a higher mean at the same speed.
I feel like there should be a smarter way than searching for the minimum so often, but I'm not aware of a way to set an SQL-like index on a matrix in R to increase performance.
library(microbenchmark)
set.seed(2016)
sym_matrix <- matrix(runif(1e+06), ncol = 1000)
sym_matrix[lower.tri(sym_matrix)] <- t(sym_matrix)[lower.tri(sym_matrix)]
diag(sym_matrix) <- NA
rownames(sym_matrix) <- 1:1000
colnames(sym_matrix) <- 1:1000
findNrows <- function(sym_matrix, nrows){
# Return a matrix with rows(matrix) = nrows.
# mean(matrix) should be maximized
set.seed(2017)
k <- nrow(sym_matrix)
for (i in nrows:(k-1)) { #eliminate rows with minimum values
min_rows <- arrayInd(which.min(sym_matrix), dim(sym_matrix))
choose_row <- sample(min_rows, 1)
sym_matrix <- sym_matrix[-choose_row, -choose_row]
}
sym_matrix
}
microbenchmark(findNrows(sym_matrix = sym_matrix, nrows = 10), times = 25L)
mean(findNrows(sym_matrix = sym_matrix, nrows = 10), na.rm = TRUE)
The problem is one of finding the optimal nrows rows (and likewise columns) of a symmetric matrix that maximize the sum of the elements in the selected sub-matrix. Unlike the so-called maximum subarray problem in 2D, which has a solution using Kadane's algorithm, the key issue here is that the chosen rows need not be contiguous. As a result, this problem appears to be a much harder combinatorial optimization. A brute-force search over all combinations of nrows rows (here 10) out of N rows (here 1000) is clearly impractical. However, a very simple approach, different from the OP's algorithm, is a plain random search over the space of all combinations: at each trial we randomly select nrows rows (and likewise columns) from the symmetric matrix and keep the best set of nrows rows seen across trials:
findNrows.random <- function(sym_matrix, nrows, ntrials){
set.seed(2017)
s.rows <- sample.int(nrow(sym_matrix),nrows)
s <- sym_matrix[s.rows,s.rows]
for (i in 1:ntrials) {
t.rows <- sample.int(nrow(sym_matrix),nrows)
t <- sym_matrix[t.rows,t.rows]
if (sum(s,na.rm=TRUE) < sum(t,na.rm=TRUE)) {
s.rows <- t.rows
s <- t
}
}
return(s)
}
This algorithm, implemented in R, is fast even for a large number of trials, and with just 1000 trials it produces a result (for this particular data set and seed) that is surprisingly on par with the OP's result but roughly 500 times faster. This speaks more to the sub-optimality of the OP's algorithm than to the optimality of random search, because 1000 samples is a tiny portion of the overall search space. In addition, by construction, the performance in terms of the mean value of the selected sub-matrix is guaranteed not to decrease as the number of trials increases. Therefore, for the same compute time, simple random search will outperform the OP's algorithm.
## OP results
microbenchmark(findNrows(sym_matrix = sym_matrix, nrows = 10), times = 25L)
##Unit: seconds
## expr min lq mean median uq max neval
## findNrows(sym_matrix = sym_matrix, nrows = 10) 11.67548 11.69193 11.70937 11.6997 11.71076 11.87105 25
mean(findNrows(sym_matrix = sym_matrix, nrows = 10), na.rm = TRUE)
##[1] 0.6256406
## Random search
microbenchmark(findNrows.random(sym_matrix = sym_matrix, nrows = 10, ntrials=1000), times = 25L)
##Unit: milliseconds
## expr min lq mean median uq max neval
## findNrows.random(sym_matrix = sym_matrix, nrows = 10, ntrials = 1000) 21.81462 23.20069 27.06079 23.89368 26.25163 46.77016 25
mean(findNrows.random(sym_matrix = sym_matrix, nrows = 10, ntrials=1000), na.rm = TRUE)
##[1] 0.6374652
Now, instead of simply discarding the previous set of selected nrows rows whenever the next set has a bigger sum, we can try to improve the performance of this random search by using the newly selected rows to improve the previously selected set. The heuristic we employ is the row sums of the resulting sub-matrices: at each trial we seek to replace rows in the current sub-matrix with rows from the newly selected sub-matrix that have larger rowSums (or equivalently larger rowMeans). This seems reasonable given that the values in the full matrix are uniformly distributed, because a row of a selected sub-matrix with a higher rowMean will, on average, have higher-valued elements across the full row. Of course, after replacing rows (if any) in the current sub-matrix with rows from the newly selected sub-matrix to form a new sub-matrix, we still check whether this new sub-matrix is better (i.e., has a bigger sum) than the current one before making it the current best sub-matrix for the next trial. The code is as follows:
findNrows.faster <- function(sym_matrix, nrows, ntrials){
set.seed(2017)
s.rows <- sample.int(nrow(sym_matrix),nrows)
s.means <- rowSums(sym_matrix[s.rows,s.rows],na.rm=TRUE)
for (i in 1:ntrials) {
t.rows <- sample.int(nrow(sym_matrix),nrows)
t.means <- rowSums(sym_matrix[t.rows,t.rows],na.rm=TRUE)
st.rows <- c(s.rows,t.rows)
st.means <- c(s.means,t.means)
## need to make sure we do not have duplicates before we choose the best nrows
dups <- duplicated(st.rows)
st.rows <- st.rows[!dups]
st.means <- st.means[!dups]
new.rows <- st.rows[order(st.means,decreasing=TRUE)[1:nrows]]
new.means <- rowSums(sym_matrix[new.rows,new.rows],na.rm=TRUE)
if (sum(s.means) < sum(new.means)) {
s.rows <- new.rows
s.means <- new.means
}
}
sym_matrix[s.rows,s.rows]
}
This algorithm is slower, but the result is significantly better than plain random search. Note that the comparison with findNrows.random is apples to apples, since the same number of trials is used and the same seed selects the same rows at each trial. However, note that we would expect the optimal algorithm to select a sub-matrix with a mean well over 0.9, so this algorithm is still far from optimal.
## Improved random search
microbenchmark(findNrows.faster(sym_matrix = sym_matrix, nrows = 10, ntrials=1000), times = 25L)
##Unit: milliseconds
## expr min lq mean median uq max neval
## findNrows.faster(sym_matrix = sym_matrix, nrows = 10, ntrials = 1000) 135.0531 136.3961 137.1123 136.7667 137.0439 143.0155 25
mean(findNrows.faster(sym_matrix = sym_matrix, nrows = 10, ntrials=1000), na.rm = TRUE)
##[1] 0.7797313
I have a problem similar to the one explained in this question. Like that question, I have a data frame with 3 columns (id, group, value). I want to take n samples with replacement from each group and produce a smaller data frame with n samples from each group.
However, I am doing hundreds of subsamples in a simulation, and the solution based on ddply is too slow to be used in my code. I tried to write a simpler version to see if I could get better performance, but it is still slow (no better than the ddply solution, if not worse). Below is my code. I am wondering if it can be improved for performance.
#Producing example DataFrame
dfsize <- 10
groupsize <- 7
test.frame.1 <- data.frame(id = 1:dfsize, group = rep(1:groupsize,each = ceiling(dfsize/groupsize))[1:dfsize], junkdata = sample(1:10000, size =dfsize))
#Main function for subsampling
sample.from.group<- function(df, dfgroup, size, replace){
outputsize <- 1
newdf <- df # assuming a sample cannot be larger than the original
uniquegroups <- unique(dfgroup)
for (uniquegroup in uniquegroups){
dataforgroup <- which(dfgroup==uniquegroup)
mysubsample <- df[sample(dataforgroup, size, replace),]
sizeofsample <- nrow(mysubsample)
newdf[outputsize:(outputsize+sizeofsample-1), ] <- mysubsample
outputsize <- outputsize + sizeofsample
}
return(newdf[1:(outputsize-1),])
}
#Using the function
sample.from.group(test.frame.1, test.frame.1$group, 100, replace = TRUE)
Here are two plyr-based solutions:
library(plyr)
dfsize <- 1e4
groupsize <- 7
testdf <- data.frame(
id = seq_len(dfsize),
group = rep(1:groupsize, length = dfsize),
junkdata = sample(1:10000, size = dfsize))
sample_by_group_1 <- function(df, dfgroup, size, replace) {
ddply(df, dfgroup, function(x) {
x[sample(nrow(x), size = size, replace = replace), , drop = FALSE]
})
}
sample_by_group_2 <- function(df, dfgroup, size, replace) {
idx <- split_indices(df[[dfgroup]])
subs <- lapply(idx, sample, size = size, replace = replace)
df[unlist(subs, use.names = FALSE), , drop = FALSE]
}
library(microbenchmark)
microbenchmark(
ddply = sample_by_group_1(testdf, "group", 100, replace = TRUE),
plyr = sample_by_group_2(testdf, "group", 100, replace = TRUE)
)
# Unit: microseconds
# expr min lq median uq max neval
# ddply 4488 4723 5059 5360 36606 100
# plyr 443 487 507 536 31343 100
The second approach is much faster because it does the subsetting in a single step; if you can figure out how to do something in one step, that is usually an easy way to get better performance.
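For illustration, here is a rough base-R sketch of the same one-step idea, without plyr (sample_by_group_base is just an illustrative name, not code from the answer): collect all the row indices first, then subset the data frame once.
sample_by_group_base <- function(df, dfgroup, size, replace = TRUE) {
  # row indices for each group
  idx <- split(seq_len(nrow(df)), df[[dfgroup]])
  # sample 'size' indices within each group, then flatten
  picks <- unlist(lapply(idx, function(v) v[sample.int(length(v), size, replace = replace)]),
                  use.names = FALSE)
  # a single subsetting step
  df[picks, , drop = FALSE]
}
sample_by_group_base(testdf, "group", 100)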
I think this is cleaner and possibly faster:
z <- sapply(unique(test.frame.1$group), FUN= function(x){
sample(which(test.frame.1$group==x), 100, TRUE)
})
out <- test.frame.1[z,]
out