Why does set.seed() affect sample() in R?

I always thought set.seed() only makes random variable generators (e.g., rnorm) generate the same sequence for a given seed value.
However, I'm wondering why, once set.seed() has been called, sample() also returns the same result every time instead of a fresh random sample.
Question
Specifically, given the example below, is there a way to call set.seed() before rnorm() so that the rnorm() values stay reproducible, while sample() still produces new random samples from them each time it is run?
Here is an R code:
set.seed(123458)
x.y = rnorm(1e2)
sampled = sample(x = x.y, size = 20, replace = TRUE)
plot(sampled)

As per the help file at ?set.seed
"If called with seed = NULL it re-initializes (see ‘Note’) as if no
seed had yet been set."
So, since rnorm and sample are both affected by set.seed(), you can do:
set.seed(639245)
rn <- rnorm(1e2)
set.seed(NULL)
sample(rn,5)

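Instead of resetting the seed with NULL, I think it makes more sense to save the current state and restore it. Note that .Random.seed only exists once a seed has been set or a random number has been generated in the session, and the assignment below must run at top level so that it changes the global .Random.seed.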
x <- .Random.seed
set.seed(639245)
rn <- rnorm(1e2)
.Random.seed <- x
sample(rn,5)
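The save-and-restore idea can also be wrapped in a small function so that it works inside other functions and when no seed exists yet. This is only a sketch: with_fixed_seed() is a hypothetical helper name, not part of base R.

```r
# Hypothetical helper: evaluate `expr` under a fixed seed, then put the
# global RNG state back the way it was before the call.
with_fixed_seed <- function(seed, expr) {
  had_state <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  if (had_state) old_state <- get(".Random.seed", envir = globalenv())
  on.exit({
    if (had_state) {
      assign(".Random.seed", old_state, envir = globalenv())
    } else {
      rm(list = ".Random.seed", envir = globalenv())
    }
  })
  set.seed(seed)
  expr  # lazy evaluation: the expression is only run here, after set.seed()
}

rn1 <- with_fixed_seed(639245, rnorm(1e2))
rn2 <- with_fixed_seed(639245, rnorm(1e2))
identical(rn1, rn2)  # the seeded rnorm() draws are reproducible
sample(rn1, 5)       # drawn from the restored, unseeded RNG state
```

Because the state is restored on exit, later calls to sample() continue from wherever the session's generator was before the helper ran.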

Related

Simulation with N trials in R

I am trying to create a simulation where a number 0:100 is chosen by a person, then a random number 0:100 is generated using sample(). The difference between their chosen number and the random number is calculated and stored. I would like to use a for loop to run this 10000 times and store the results in a vector so I can later plot the results. Can anyone point me to where I can read about this or see some examples? Below is what I have so far but I keep getting errors saying the lengths aren't the same multiple.
N = 10000
chosen.number = 0:100
generated.number = sample(0:100, N, replace = T)
differences = numeric(0)
for(i in 1:length(chosen.number)){
differences = (generated.number - chosen.number)
}
Then I'll make a scatterplot of the vector differences.
Here's an example of how you could go about it (if I understand your questions correctly).
You can set how many loops you want using Repeat.
Since you want a different randomly generated number each time, you'll have to put sample() within your loop. I didn't know where your user-selected number would come from, but in this example, it gets randomly generated with the same set of criteria as the random selection.
Then differences are collected in collect_differences for you to use downstream.
Repeat = 10 # Number of times to repeat/loop
collect_differences <- NULL
for(i in 1:Repeat){
randomly.generated.number = sample(0:100, size = 1, replace = T)
selected.number = sample(0:100, size = 1, replace = T)
differences = randomly.generated.number - selected.number
collect_differences = c(collect_differences, differences)
}
collect_differences
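Since sample() can draw all N values at once, the same simulation can be sketched without a loop. As in the loop above, the chosen numbers are randomly generated stand-ins, because the source of the person's picks is not given.

```r
set.seed(1)  # only so that this sketch is reproducible
N <- 10000

# Stand-in for the person's picks: one chosen number per trial
chosen.number <- sample(0:100, N, replace = TRUE)
# One randomly generated number per trial
generated.number <- sample(0:100, N, replace = TRUE)

# Both vectors have length N, so the subtraction is element-wise
# and no recycling warning is raised
differences <- generated.number - chosen.number
```

The vector differences has one entry per trial and can be passed straight to plot().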
As for resources, you can look up anything related to the fundamentals of looping. You could also look through The Carpentries lessons in R as they have some resources for this as well.

R microbenchmark: How to pass same argument to evaluated functions?

I'd like to evaluate the time to extract data from a raster time series using different file types (geotiff, binary) or objects (RasterBrick, RasterStack). I created a function that will extract the time series from a random point of the raster object and I then use microbenchmark to test it.
Ex.:
# read a random point from a raster stack
sample_raster <- function(stack) {
poi <- sample(ncell(stack), 1)
raster::extract(stack, poi)
}
# opening the data using different methods
data_stack <- stack(list.files(pattern = '3B.*tif'))
data_brick <- brick('gpm_multiband.tif')
bench <- microbenchmark(
sample_stack = sample_raster(data_stack),
sample_brick = sample_raster(data_brick),
times = 10
)
boxplot(bench)
# this fails because sampled point is different
bench <- microbenchmark(
sample_stack = sample_raster(data_stack),
sample_brick = sample_raster(data_brick),
times = 10,
check = 'equal'
)
I included a sample of my dataset here
With this I can see that sampling on RasterBrick is faster than stacks (R Raster manual also says so -- good). The problem is that I'm sampling at different points at each evaluated expression. So I can't check if the results are the same. What I'd like to do is sample at the same location (poi) on both objects. But have the location be different for each iteration. I tried to use the setup option in microbenchmark but from what I figured out, the setup is evaluated before each function is timed, not once per iteration. So generating a random poi using the setup will not work.
Is it possible to pass the same argument to the functions being evaluated in microbenchmark?
Result
Solution using microbenchmark
As suggested (and explained below), I tried the bench package with the press call. But for some reason it was slower than setting the same seed at each microbenchmark iteration, as suggested by mnist. So I ended up going back to microbenchmark. This is the code I'm using:
library(microbenchmark)
library(raster)
annual_brick <- raster::brick('data/gpm_tif_annual/gpm_2016.tif')
annual_stack <- raster::stack('data/gpm_tif_annual/gpm_2016.tif')
raster_size <- ncell(annual_brick) # number of cells to sample from (was undefined in the snippet)
x <- 0
y <- 0
bm <- microbenchmark(
ext = {
x <- x + 1
set.seed(x)
poi = sample(raster_size, 1)
raster::extract(annual_brick, poi)
},
slc = {
y <- y + 1
set.seed(y)
poi = sample(raster_size, 1)
raster::extract(annual_stack, poi)
},
check = 'equal'
)
Solution using bench::press
For completeness' sake, this is how I did it using bench::press. In the process, I also separated the code for selecting the random cell from the point sampling function, so I can time only the point sampling part of the code. Here is how I'm doing it:
library(bench)
library(raster)
annual_brick <- raster::brick('data/gpm_tif_annual/gpm_2016.tif')
annual_stack <- raster::stack('data/gpm_tif_annual/gpm_2016.tif')
bm <- bench::press(
pois = sample(ncell(annual_brick), 10),
mark(
iterations = 1,
sample_brick = raster::extract(annual_brick, pois),
sample_stack = raster::extract(annual_stack, pois)
)
)
My approach would be to set the same seeds for each option in microbenchmark but change them prior to each function call. See the output and how the same seeds are eventually used for both calls:
x <- 0
y <- 0
microbenchmark::microbenchmark(
"checasdk" = {
# increase seed value by 1
x <- x + 1
print(paste("1", x))
set.seed(x)},
"check2" = {
y <- y + 1
print(paste("2", y))
set.seed(y)
}
)
If I understand correctly, the OP has two requirements:
The same data points should be sampled when timing the two expressions in order to check the results are identical.
In addition, timing of the two expressions is to be repeated for different data points sampled.
Using the same random numbers
As suggested by Roman, set.seed() can be used to set the seed values for R's random number generator. If the same parameter is used, the sequence of generated random numbers will be the same.
sample_raster() can be modified to ensure that the random number generator is initialised for each call.
sample_raster <- function(stack) {
set.seed(1L)
poi <- sample(ncell(stack), 1)
raster::extract(stack, poi)
}
This will meet requirement 1 but not requirement 2, as the same data samples will be used for all repetitions.
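A quick way to see why requirement 2 is not met, without needing any raster data: with the seed fixed inside the function, every call draws the same cell. draw_poi() is a hypothetical stand-in for the first two lines of sample_raster().

```r
# Stand-in for the sampling step of sample_raster(); n_cells plays the
# role of ncell(stack)
draw_poi <- function(n_cells) {
  set.seed(1L)
  sample(n_cells, 1)
}

pois <- replicate(5, draw_poi(1000))
length(unique(pois))  # 1: all five repetitions hit the same cell
```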
Different random numbers in repetitions
The OP has asked:
Is it possible to pass the same argument to the functions being
evaluated in microbenchmark?
One possibility is to use for or lapply() to loop over a sequence of seed values as suggested in answers to a similar question.
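A minimal base R sketch of the loop idea, assuming two stand-in functions in place of the real extraction calls. extract_a() and extract_b() are hypothetical; in practice they would be raster::extract() on the stack and on the brick.

```r
# Stand-ins for the two extraction methods being compared
values <- seq_len(1000) * 2
extract_a <- function(poi) values[poi]
extract_b <- function(poi) values[[poi]]

seeds <- 1:10
timings <- lapply(seeds, function(seed) {
  set.seed(seed)
  poi <- sample(length(values), 1)  # drawn once, shared by both methods
  t_a <- system.time(res_a <- extract_a(poi))[["elapsed"]]
  t_b <- system.time(res_b <- extract_b(poi))[["elapsed"]]
  # both methods must return the same value for the same poi
  stopifnot(isTRUE(all.equal(res_a, res_b)))
  data.frame(seed = seed, poi = poi, a = t_a, b = t_b)
})
timings <- do.call(rbind, timings)
```

Each row of timings holds one repetition, so the location differs between repetitions but is identical for the two methods within a repetition. system.time() is much coarser than a dedicated benchmarking package, so this only illustrates the control flow.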
In this case, I suggest using the bench package for benchmarking. It has a press() function which runs bench::mark() across a grid of parameters.
For this, sample_raster() gets a second parameter:
sample_raster <- function(stack, seed) {
set.seed(seed)
poi <- sample(ncell(stack), 1L)
# cat(deparse(substitute(stack)), seed, poi, "\n") # just to check, NOT to use for timings
raster::extract(stack, poi)
}
The timings are executed for different seeds as given in vector seed_vec.
library(bench)
bm <- press(
seed_vec = 1:10,
mark(
iterations = 1L,
sample_stack = sample_raster(data_stack, seed_vec),
sample_brick = sample_raster(data_brick, seed_vec)
)
)
Note that the length of seed_vec determines the number of repetitions with different poi, now. The iterations parameter to mark() specifies how often the timings are to be repeated for the same seed / poi.
The results can be plotted using
library(ggplot2)
autoplot(bm)
or summarized using
library(dplyr)
bm %>%
group_by(expression = expression %>% as.character()) %>%
summarise(median = median(median), n_itr = n())

Scaling only some columns of a training set and a test set

I often have to deal with the following issue:
I have a test set and a training set
I want to scale all columns of a training set, except for a few ones which are identified by a character vector
then, based on the sample means and sample standard deviations of the selected columns of the training set, I want to rescale the test set too
Currently, my workflow is kludgy: I use an index vector and then partial assignment to scale only some columns of the train set. I store the means and standard deviations from the scaling operation on the training set, and I use them to scale the test set. I was wondering if there could be a simpler way, without having to install caret (for a series of reasons, I'm not a big fan of caret and I definitely won't start using it just for this problem).
Here is my current workflow:
# define dummy train and test sets
train <- data.frame(letters = LETTERS[1:10], months = month.abb[1:10], numbers = 1:10,
x = rnorm(10, 1), y = runif(10))
test <- train
test$x <- rnorm(10, 1)
test$y <- runif(10)
# names of variables I don't want to scale
varnames <- c("letters", "months", "numbers")
# index vector of columns which must not be scaled
index <- names(train) %in% varnames
# scale only the columns not in index
temp <- scale(train[, !index])
train[, !index] <- temp
# get the means and standard deviations from temp, to scale test too
means <- attr(temp, "scaled:center")
standard_deviations <- attr(temp, "scaled:scale")
# scale test
test[, !index] <- scale(test[, !index], center = means, scale = standard_deviations)
Is there a simpler/more idiomatic way to do this?
It is a nice question and I have tried a lot to come up with an answer.
I think this is a bit more elegant code:
library(dplyr)
# scale the numeric columns of the training set
train0 <- train %>% select(-c(letters, months, numbers)) %>% as.matrix() %>% scale()
means <- attr(train0, "scaled:center")
standard_deviations <- attr(train0, "scaled:scale")
train0 <- cbind(select(train, c(letters, months, numbers)), train0)
# scale the test set with the training means and standard deviations
test0 <- test %>% select(-c(letters, months, numbers)) %>% as.matrix() %>% scale(center = means, scale = standard_deviations)
test0 <- cbind(select(test, c(letters, months, numbers)), test0)
I have tried hard to work with mutate_at in order to avoid the extra cbind code, but with no luck.
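For a base R alternative without caret or dplyr, the workflow can be wrapped in one function. This is a sketch: scale_like_train() is a hypothetical name, and it assumes every column not listed in exclude is numeric.

```r
# Hypothetical helper: scale all columns except `exclude` in train, then
# apply the training means and standard deviations to test.
scale_like_train <- function(train, test, exclude) {
  cols <- setdiff(names(train), exclude)
  scaled_train <- scale(train[cols])
  centers <- attr(scaled_train, "scaled:center")
  scales <- attr(scaled_train, "scaled:scale")
  train[cols] <- scaled_train
  test[cols] <- scale(test[cols], center = centers, scale = scales)
  list(train = train, test = test, center = centers, scale = scales)
}

# Dummy data in the spirit of the question
set.seed(1)
train <- data.frame(letters = LETTERS[1:10], numbers = 1:10,
                    x = rnorm(10, 1), y = runif(10))
test <- train
test$x <- rnorm(10, 1)
test$y <- runif(10)

scaled <- scale_like_train(train, test, exclude = c("letters", "numbers"))
```

Returning the centers and scales alongside the data frames keeps them available for scaling further data later.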

how to create a random loss sample in r using if function

I am working currently on generating some random data for a school project.
I have created a variable in R using a binomial distribution to determine if an observation had a loss yes=1 or not=0.
Afterwards I am trying to generate the loss amount using a random distribution for all observations which already had a loss (=1).
As my loss amount is a percentage, it can be anywhere between 0 and 1, so I looked at the beta distribution (see "What Is The Intuition Behind Beta Distribution" on stats.stackexchange).
In a third step I am looking for an if statement, which combines my two variables.
Please find below my code (which is only working for the Loss_Y_N variable):
Loss_Y_N = rbinom(1000000,1,0.01)
Loss_Amount = dbeta(x, 10, 990, ncp = 0, log = FALSE)
ideally I can combine the two into something like
if(Loss_Y_N=1 then Loss_Amount=dbeta(...) #... is meant to be a random variable with mean=0.15 and should be 0<x=<1
else Loss_Amount=0)
Any input highly appreciated!
Create a vector for your loss proportion. Fill up the elements corresponding to losses with draws from the beta. Tweak the parameters for the beta until you get the desired result.
N <- 100000
loss_indicator <- rbinom(N, 1, 0.1)
loss_prop <- numeric(N)
loss_prop[loss_indicator > 0] <- rbeta(sum(loss_indicator), 10, 990)
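One way to tweak the parameters: a Beta(a, b) distribution has mean a / (a + b), so rbeta(n, 10, 990) averages about 0.01 rather than the 0.15 asked for. The sketch below picks the shapes from a target mean; the concentration value of 100 is an arbitrary assumption that controls the spread and is not taken from the question.

```r
set.seed(42)  # only so that this sketch is reproducible
N <- 100000
target_mean <- 0.15
concentration <- 100  # assumed value of a + b; larger means less spread

shape1 <- target_mean * concentration        # a = 15
shape2 <- (1 - target_mean) * concentration  # b = 85

loss_indicator <- rbinom(N, 1, 0.01)
has_loss <- loss_indicator == 1

# Zero for observations without a loss, a beta draw for those with one
loss_prop <- numeric(N)
loss_prop[has_loss] <- rbeta(sum(has_loss), shape1, shape2)

mean(loss_prop[has_loss])  # close to target_mean
```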

how to solve errors in frbs package of R using GFS.GCCL method?

I'm using frbs package in R on my data set using 5-fold stratified cross validation. I've implemented stratified CV. I use GFS.GCCL method for frbs.learn function in each fold and predict the result using test data. I get this error as well as 30 equal warning messages:
Error: object 'temp.rule.degree' not found
Warning: In max(MF.temp[m, ], na.rm = TRUE) :
no non-missing arguments to max; returning -Inf
My code is below:
library(frbs)
data<-read.csv(file.address)
data[,30] <- unclass(data[,30]) #column 30 has the class of samples
data <- data[,c(1,14,20,26,27, 30)] # I choose to have 5 attr. since
#my data is high dimensional
k <- 5 # 5-fold
seed <- 1
folds <- strf.cv(data, k, seed) #stratification function for CV
range.data.inp <- matrix(apply(data[,-ncol(data)], 2, range), nrow=2)
data <- norm.data(as.matrix(data[,-ncol(data)]), range.data.inp, min.scale = 0.1, max.scale = 1)
ctrl <- list(popu.size = 30, num.class = 2, num.labels= 3,
persen_cross = 0.9, max.gen = 200, persen_mutant = 0.3,
name="sim-1")
for(i in 1:k){
str <- paste("fold",i)
print(str)
test.ind <- folds[[str]]
test.data <- data[test.ind,]
train.data <- data[-test.ind,]
obj <- frbs.learn(train.data , method.type="GFS.GCCL",
range.data.inp , ctrl)
pred <- predict(obj, test.data)
print("Predicted classes:")
print(pred)
}
I don't have any idea about error and warnings. Please let me know what I should do.
I've had a similar problem (and others) trying to reproduce the SLAVE learning starting with the iris example data. I had 2 format items to solve before being able to run this with my artificial data:
my dataframe import was giving me integer, where the learn needs at least numeric.
my distribution of criteria was not flat. When I flattened the distribution (3 values so n/3 samples per value) everything went fine.
That's all I know.
Hope it helps.
I encountered the same issue when running SLAVE and GFS.GCCL. Looking at the source code of the library, I found that in frbs.learn() each method has its own implementation to calculate the range of the input data. So I think it might be a problem with the range of the input data. For example, in the GFS.GCCL source code, the parameters are set like this:
range.data.input <- range.data
data.train.ori <- data.train
popu.size <- control$popu.size
persen_cross <- control$persen_cross
persen_mutant <- control$persen_mutant
max.gen <- control$max.gen
name <- control$name
n.labels <- control$num.labels
n.class <- control$num.class
num.labels <- matrix(rep(n.labels, ncol(range.data)), nrow = 1)
num.labels <- cbind(num.labels, n.class)
## normalize range of data and data training
range.data.norm <- range.data.input
range.data.norm[1, ] <- 0
range.data.norm[2, ] <- 1
range.data.input.ori <- range.data.input
data.tra.norm <- norm.data(data.train[, 1 : ncol(data.train) - 1], range.data.input, min.scale = 0, max.scale = 1)
data.train <- cbind(data.tra.norm, matrix(data.train[, ncol(data.train)], ncol = 1))
In the first line, range.data comes either from your own specification or from the default setting of frbs.learn(). For the default setting, it gets the max and min for each row. In the source code:
range.data <- rbind(dt.min, dt.max)
After that, the range of data taken by the GFS.GCCL is
range.data.norm <- range.data.input
range.data.norm[1, ] <- 0
range.data.norm[2, ] <- 1
which is between 0 and 1. GFS.GCCL also takes range.data.input as a parameter, so it uses both range.data.norm and range.data.input.
Therefore, I think the error is generated when some internal calculation relies on range.data.input (which needs to be the min and max for each row) but the matrix that was passed in does not actually hold those values.
In summary, after I removed "range.data" from the frbs.learn() call, both GFS.GCCL and SLAVE worked for me.
You can download the source code from here:
https://cran.r-project.org/web/packages/frbs/index.html
You can find the code for GFS.GCCL and SLAVE in:
FRBS.MainFunction.R
GFS.Methods.R
In addition to @Pilip38's good advice, I have three other ideas that have fixed similar errors for me while working with the frbs package.
Most important: Make sure your output variable is never equal to 0. It looks like you have a binary output variable so I am hoping just adding 1 to it so it is 1/2 instead of 0/1 will work.
Try setting your range.data.inp matrix to be all 0's in the first row and all 1's in the second. Naturally it's better to have a tighter range but it may be causing your bug.
Try decreasing the number of labels to 2.
It can be a brittle procedure.
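The data preparation ideas from these answers can be sketched in base R before anything is passed to frbs.learn(). The toy data frame and all object names below are hypothetical, the class column is assumed to be the last one, and a 0/1 range matrix only makes sense if the inputs have already been scaled to that interval.

```r
# Hypothetical toy data: two integer input columns and a 0/1 class column
toy <- data.frame(a = c(2L, 5L, 9L), b = c(1L, 3L, 2L), class = c(0L, 1L, 0L))

# Integer columns from an import become numeric (double)
toy[] <- lapply(toy, as.numeric)

# Shift the class labels from 0/1 to 1/2 so the output is never 0
toy$class <- toy$class + 1

n_inputs <- ncol(toy) - 1

# Range matrix with all 0s in the first row and all 1s in the second
range_unit <- rbind(rep(0, n_inputs), rep(1, n_inputs))

# For comparison: observed min (row 1) and max (row 2) of each input column
range_observed <- apply(toy[, -ncol(toy)], 2, range)
```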

Resources