I am using R to construct an agent-based model with a Monte Carlo process, so I have many functions that use a random number generator of some kind. To get reproducible results, I must fix the seed. But, as far as I understand, I would have to set the seed before every random draw or sample, which is a real pain in the neck. Is there a way to fix the seed once?
set.seed(123)
print(sample(1:10,3))
# [1] 3 8 4
print(sample(1:10,3))
# [1] 9 10 1
set.seed(123)
print(sample(1:10,3))
# [1] 3 8 4
There are several options, depending on your exact needs. I suspect the first option, the simplest, is not sufficient, but the second and third options may be more appropriate, with the third being the most automatable.
Option 1
If you know in advance that each function using/creating random numbers will always draw the same number of them, and you don't reorder the function calls or insert a new call in between existing ones, then all you need to do is set the seed once. Indeed, you probably don't want to keep resetting the seed, as you would just keep getting the same set of random numbers for each function call.
For example:
> set.seed(1)
> sample(10)
[1] 3 4 5 7 2 8 9 6 10 1
> sample(10)
[1] 3 2 6 10 5 7 8 4 1 9
>
> ## second time round
> set.seed(1)
> sample(10)
[1] 3 4 5 7 2 8 9 6 10 1
> sample(10)
[1] 3 2 6 10 5 7 8 4 1 9
Option 2
If you really want to make sure that a function uses the same seed and you only want to set it once, pass the seed as an argument:
foo <- function(...., seed) {
    ## set the seed
    if (!missing(seed))
        set.seed(seed)
    ## do other stuff
    ....
}
my.seed <- 42
bar <- foo(...., seed = my.seed)
fbar <- foo(...., seed = my.seed)
(where .... means other args to your function; this is pseudo code).
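For concreteness, here is a minimal runnable sketch of that pattern; draw_sample is a hypothetical stand-in for one of your Monte Carlo functions, not code from the question:

draw_sample <- function(n, seed) {
    ## only touch the RNG state if a seed was supplied
    if (!missing(seed))
        set.seed(seed)
    sample(1:10, n)
}

draw_sample(3, seed = 42)   # reproducible
draw_sample(3, seed = 42)   # same three numbers again
draw_sample(3)              # no seed: continues the current RNG stream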
Option 3
If you want to automate this even more, then you could abuse the options mechanism, which is fine if you are just doing this in a script (for a package you should use your own options object). Then your function can look for this option. E.g.
foo <- function() {
    if (!is.null(seed <- getOption("myseed")))
        set.seed(seed)
    sample(10)
}
Then in use we have:
> getOption("myseed")
NULL
> foo()
[1] 1 2 9 4 8 7 10 6 3 5
> foo()
[1] 6 2 3 5 7 8 1 4 10 9
> options(myseed = 42)
> foo()
[1] 10 9 3 6 4 8 5 1 2 7
> foo()
[1] 10 9 3 6 4 8 5 1 2 7
> foo()
[1] 10 9 3 6 4 8 5 1 2 7
> foo()
[1] 10 9 3 6 4 8 5 1 2 7
I think this question suffers from a confusion. In the example, the seed has been set for the entire session. However, this does not mean it will produce the same set of numbers every time you call print(sample(1:10,3)) during a run; that would not resemble a random process, as it would be entirely determined that the same three numbers appear every time. Instead, what actually happens is that once you have set the seed, every time you run the script the same seed is used to produce a pseudo-random sequence of numbers, that is, numbers that look as if they are random but are in fact produced by a reproducible process using the seed you have set.
If you rerun the entire script from the beginning, you reproduce those numbers that look random but are not. So, in the example, the second time that the seed is set to 123, the output is again 3, 8, and 4, which is exactly what you'd expect to see because the process is starting again from the beginning. If you were to continue to reproduce your first run by writing print(sample(1:10,3)) once more, then the next output would again be 9, 10, and 1.
So the short answer to the question is: if you want to set a seed to create a reproducible process then do what you have done and set the seed once; however, you should not set the seed before every random draw because that will start the pseudo-random process again from the beginning.
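A quick sketch of that idea: setting the seed once and rerunning the whole sequence reproduces the entire stream of draws, not each individual draw:

set.seed(123)
first_run <- list(sample(1:10, 3), sample(1:10, 3))   # 3 8 4, then 9 10 1

set.seed(123)                                         # "rerun the script"
second_run <- list(sample(1:10, 3), sample(1:10, 3))  # 3 8 4, then 9 10 1 again

identical(first_run, second_run)                      # TRUE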
This question is old, but still comes high in search results, and it seemed worth expanding on Spacedman's answer.
If you want random processes to always return the same results, you can keep re-setting the seed after every top-level call with:
addTaskCallback(function(...) {set.seed(123);TRUE})
Now the output is the same every time:
print(sample(1:10,3))
# [1] 3 8 4
print(sample(1:10,3))
# [1] 3 8 4
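Note that addTaskCallback returns an identifier, so if you only want this behaviour temporarily you can capture it and remove the callback later; a minimal sketch:

cb <- addTaskCallback(function(...) { set.seed(123); TRUE })
print(sample(1:10, 3))   # always the same triple while the callback is active
removeTaskCallback(cb)   # stop re-setting the seed after every top-level call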
You could do a wrapper function, like so:
> wrap.3.digit.sample <- function(x) {
+ set.seed(123)
+ return(sample(x, 3))
+ }
> wrap.3.digit.sample(c(1:10))
[1] 3 8 4
> wrap.3.digit.sample(c(1:10))
[1] 3 8 4
There is probably a more elegant way, and I'm sure someone will chime in with it. But, if they don't, this should make your life easier.
No need. Although the results are different from sample to sample (which you almost certainly want, otherwise the randomness is very questionable), results from run to run will be the same. See, here's the output from my machine.
> set.seed(123)
> sample(1:10,3)
[1] 3 8 4
> sample(1:10,3)
[1] 9 10 1
I suggest that you call set.seed before each random number generator call in R. I think what you need is reproducibility for Monte Carlo simulations. Inside a for loop, you can call set.seed(i) before calling sample, which guarantees full reproducibility. In your outer function, you may specify an argument seed = 1 so that inside the for loop you use set.seed(i + seed).
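A minimal sketch of that per-iteration pattern (the function name and loop body are illustrative, not from the question):

run_mc <- function(n_iter, seed = 1) {
    results <- vector("list", n_iter)
    for (i in seq_len(n_iter)) {
        set.seed(i + seed)              # each iteration gets its own reproducible seed
        results[[i]] <- sample(1:10, 3) # stand-in for one Monte Carlo replication
    }
    results
}

identical(run_mc(5), run_mc(5))   # TRUE: every iteration reproduces exactly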
Related
I am trying to make a model that will predict a city's group according to its development level. The cities in the 1st group are the most developed and the ones in the 6th group are the least developed. I have 10 numerical variables about each city in my data.
First, I normalized them using min-max normalization. Then I generated the training and test sets. I have 81 cities. The dimensions of the training and test sets are 61x10 and 20x10, respectively. I excluded the target variable from them. Then I made labels for them as training labels and test labels with dimensions 61x1 and 20x1.
Then I run the knn function like this:
knn(train = Data.training, test = Data.test, cl = Data.trainLabels , k = 3)
Its output is this:
[1] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
Levels: 1 2 3 4 5 6
But if I set the argument use.all to FALSE, I get this output, and it changes every time I run the code:
[1] 1 4 2 2 2 3 5 4 3 5 5 6 5 6 5 6 4 5 2 2
Levels: 1 2 3 4 5 6
I can't figure out why my code gives the same prediction for every city in the first case, or what use.all has to do with it.
As explained in the knn documentation:
use.all controls handling of ties. If true, all distances equal to the kth largest are included. If false, a random selection of distances equal to the kth is chosen to use exactly k neighbours.
In your case, all points have the same distances, so they all win as 'best neighbour' (use.all = TRUE) or the algorithm picks k winners at random (use.all = FALSE).
The problem seems to be in how you trained the algorithm or in the data itself. Since you did not post a sample of your data, I cannot help with that, but I suggest that you re-check it. You can also compute a few distances by hand, to see what is going on.
Also, check that you randomised your data before splitting it into training and testing sets. For example, say that the dataset is ordered by the label (the target variable). If you use the first 20 points to train the algorithm, it is likely that the algorithm will never see some of the labels during the training phase and therefore it will perform poorly on those during the testing phase.
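As a sketch of such a randomised split (assuming the full data frame is called Data with the target in a column named group; these names are placeholders, not from the question, and the training-set size is just an example):

set.seed(1)                           # reproducible split
train_idx <- sample(nrow(Data), 61)   # e.g. 61 rows for training, the rest for testing

Data.training    <- Data[train_idx, 1:10]
Data.test        <- Data[-train_idx, 1:10]
Data.trainLabels <- Data$group[train_idx]
Data.testLabels  <- Data$group[-train_idx]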
In R, when I press return for a line of code (for example, a histogram), what does the [1] that comes up in the results mean?
If there's another line, it comes up as [18], then [35].
The numbers that you see in the console in the situation you are describing are the indices of the first element printed on each line.
1:20
# [1] 1 2 3 4 5 6 7 8 9 10 11 12
# [13] 13 14 15 16 17 18 19 20
How many values are displayed per line depends by default on the width of the console (at least in RStudio).
The value I printed is a numeric vector of length 20. A single number is technically also a numeric vector, just of length 1; R makes no distinction between the two, so when you print only one value the [1] still shows.
42
# [1] 42
It's not true of every object; for example, there is no vector of functions of length 2, since c(mean, median) is a list (containing functions). But it works like this for the atomic modes (see ?atomic) and usually for the classes that are built on them.
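You can check this directly:

length(c(mean, median))      # [1] 2
class(c(mean, median))       # [1] "list"   <- a list, not an atomic vector
is.atomic(c(mean, median))   # [1] FALSE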
You might not always see these numbers for all objects, because they depend on which print method is called, which in turn depends on the class.
library(glue)
glue("a")
# a # <- we don't see [1]!
mode(glue("a"))
# character
class(glue("a"))
# [1] "glue" "character"
The print method that is called when typing print(1:20) is print.default; it can be overridden to avoid displaying the [numbers]:
print.default <- function(x) cat(x,"\n")
print(1:20)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
rm(print.default) # necessary cleanup!
The autoprint (what you get when not calling print explicitly) won't change, however, as auto-printing only involves method dispatch for explicit classes (those with a class attribute, a.k.a. objects).
Type methods(print) to see all the available methods.
I have a vector of numbers stored in R.
my_vector <- c(1,2,3,4,5)
I want to add two to each number.
my_vector + 2
[1] 3 4 5 6 7
However, I want there to only be a twenty percent chance of adding two to the numbers in my vector each time I run the code. Is there a way to code this in R?
What I mean is, if I run the code, the output could be:
[1] 3 4 5 6 9
Or perhaps
[1] 5 4 5 6 7
i.e. there is only a 20% chance that any one number in the vector will get two added to it.
my_vector + 2*sample(c(TRUE, FALSE), length(my_vector), prob = c(0.2, 0.8), replace = TRUE)
That will add 2 to a variable number of elements (which is what you were asking), but sometimes people want to guarantee that exactly 20% get a 2 added, in which case it would be:
my_vector + 2*sample(c(TRUE, rep(FALSE, 4)))
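Note that the second form relies on the vector having length 5 (one TRUE out of five positions is exactly 20%). Two sketches that restate the first form and generalise the second (the exact version assumes the length is a multiple of 5):

n <- length(my_vector)

## per-element 20% chance, equivalent in distribution to the sample() call above
my_vector + 2 * rbinom(n, size = 1, prob = 0.2)

## exactly 20% of the elements get +2 (assumes n is a multiple of 5)
my_vector + 2 * sample(rep(c(TRUE, FALSE), c(n %/% 5, n - n %/% 5)))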
I am using a package for a glm which reads a data frame in chunks. It is required that all levels of a factor occur in every chunk. I am looking for a nice strategy to rearrange the observations so as to maximise the probability of having all levels in every chunk.
The example would be
c(4,7,4,4,4,4,4,4,4,4,4,7,4,4,8,8,5,5)
for a chunk size of 8, the best rearrangement would be
c(4,7,5,8,4,4,4,4,4,4,4,7,4,4,8,8,4,5,8)
Is there some elegant way to shuffle the data around?
Just saw the comments. The library itself is called bigglm (it reads the data chunkwise). The vectors should be of equal length. The question is really just about rearranging the data so that most levels are present in most chunks.
An example of the column of the data frame can be found here:
(https://www.dropbox.com/s/cth8kwcq9ph5j0p/d1.RData?dl=0)
The most important thing in this case is that as many levels as possible are present in as many chunks as possible. The smaller the chunk, the less memory will be needed when reading in. I think it would be a good starting point to assume 10 chunks.
I think I understand what you are asking for, though admittedly I am not familiar with the function that reads data in by chunks and uses stringsAsFactors = TRUE while making assumptions a priori on the makeup of the data (and does not offer a way to superimpose other characteristics of the factors). I offer in advance the suggestion that either you are misinterpreting the function or you are mis-applying it to your specific data problem.
I'm easily wrong in problems like this, so I'll try to address the inferred problem regardless.
You claim that the function will read in the first 8 elements and do its processing on them. It must know that there are (in this case) four factor levels to be considered; the easiest way, as you are asking, is to have each of these levels present in each chunk. Once it has processed these first 8 rows, it will then read the second 8 elements. In the case of your sample data, this does not work, since the second 8 elements do not include a 5.
I'll define slightly augmented data later to remedy this.
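A quick way to see the problem with the original 18-element vector is to split it into chunks of 8 (a throwaway check, not part of the final function):

x <- c(4,7,4,4,4,4,4,4,4,4,4,7,4,4,8,8,5,5)
split(x, ceiling(seq_along(x) / 8))
## $`1`: 4 7 4 4 4 4 4 4
## $`2`: 4 4 4 7 4 4 8 8   <- no 5 in the second chunk
## $`3`: 5 5               <- last chunk is missing 4, 7, and 8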
Assumptions / Rules
the number of unique values overall in the data must be no larger than the size of each chunk;
each factor must have at least as many occurrences as the number of chunks to be read; and
all chunks have precisely chunksize elements in them (i.e., they are full), except for the last chunk, which will have between 1 and chunksize elements in it; ergo,
the last chunk has at least as many elements as there are unique values.
Function Definition
Given those rules, here's some code. This is most certainly not the only solution, and it may not perform well with significantly large datasets (I have not done extensive testing).
myfunc <- function(x, chunksize = 8) {
    numChunks <- ceiling(length(x) / chunksize)
    uniqx <- unique(x)
    lastChunkSize <- chunksize * (1 - numChunks) + length(x)
    ## check to see if it is mathematically possible
    if (length(uniqx) > chunksize)
        stop('more factors than can fit in one chunk')
    if (any(table(x) < numChunks))
        stop('not enough of at least one factor to cover all chunks')
    if (lastChunkSize < length(uniqx))
        stop('last chunk will not have all factors')
    ## actually arrange things in one feasible permutation
    allIndices <- sapply(uniqx, function(z) which(z == x))
    ## fill one of each unique x into chunks
    chunks <- lapply(1:numChunks, function(i) sapply(allIndices, `[`, i))
    ## leftovers: drop the first numChunks occurrences of each value (already placed)
    remainder <- unlist(sapply(allIndices, tail, n = -numChunks))
    ## split the leftovers into groups that fill each chunk's free slots
    ## (assumes chunksize > length(uniqx))
    remainderCut <- split(remainder,
                          ceiling(seq_along(remainder) / (chunksize - length(uniqx))))
    ## combine them all together, wary of empty lists
    finalIndices <- sapply(1:numChunks,
                           function(i) {
                               if (i <= length(remainderCut))
                                   c(chunks[[i]], remainderCut[[i]])
                               else
                                   chunks[[i]]
                           })
    x[unlist(finalIndices)]
}
Supporting Execution
In your offered data, you have 18 elements requiring three chunks. Your data will fail on two counts: three of the values occur only twice, so the third chunk will most certainly not contain all of them; and your last chunk will only have two elements, which cannot contain each of the four unique values.
I'll augment your data to satisfy both misses, with:
dat3 <- c(4,7,5,7,8,4,4,4,4,4,4,7,4,4,8,8,5,5,5,5)
which will not work in its unadjusted order, if for no other reason than that the last chunk would contain only four 5's.
The solution:
myfunc(dat3, chunksize = 8)
## [1] 4 7 5 8 4 4 4 4   4 7 5 8 4 4 5 5   4 7 5 8
(spaces were added to the output for easy inspection). Each chunk has 4, 7, 5, 8 as its first four elements, therefore all factors are covered in each chunk.
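A quick programmatic check of that claim (a throwaway sketch: split the result back into chunks of 8 and confirm every chunk contains every unique value):

out <- myfunc(dat3, chunksize = 8)
chunks <- split(out, ceiling(seq_along(out) / 8))
sapply(chunks, function(ch) all(unique(dat3) %in% ch))
##    1    2    3
## TRUE TRUE TRUE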
Breakdown
A quick walkthrough (using debug(myfunc)), assuming x = dat3 and chunksize = 8. Jumping down the code:
## Browse[2]> uniqx
## [1] 4 7 5 8
## Browse[2]> allIndices
## [[1]]
## [1] 1 6 7 8 9 10 11 13 14
## [[2]]
## [1] 2 4 12
## [[3]]
## [1] 3 17 18 19 20
## [[4]]
## [1] 5 15 16
This shows the indices for each unique element. For example, there are 4's located at indices 1, 6, 7, etc.
## Browse[2]> chunks
## [[1]]
## [1] 1 2 3 5
## [[2]]
## [1] 6 4 17 15
## [[3]]
## [1] 7 12 18 16
There are three chunks to be filled, and this list starts forming those chunks. In this example, we have placed indices 1, 2, 3, and 5 in the first chunk. Looking back at allIndices, you'll see that these represent the first instance of each of uniqx, so the first chunk now contains c(4, 7, 5, 8), as do the other two chunks.
At this point, we have satisfied the basic requirement that each unique element be found in every chunk. The rest of the code fills with the remaining elements.
## Browse[2]> remainder
## [1] 8 9 10 11 13 14 19 20
These are all indices that have so far not been added to the chunks.
## Browse[2]> remainderCut
## $`1`
## [1] 8 9 10 11
## $`2`
## [1] 13 14 19 20
Though we have three chunks, we only have two lists here. This is fine, we have nothing (and need nothing) to add to the last chunk. We will then zip-merge these with chunks to form a list of index lists. (Note: you might be tempted to try mapply(function(a, b) c(a, b), chunks, remainderCut), but you may notice that if remainderCut is not the same size as chunks, as we see here, then its values are recycled. Not acceptable. Try it.)
## Browse[2]> finalIndices
## [[1]]
## [1] 1 2 3 5 8 9 10 11
## [[2]]
## [1] 6 4 17 15 13 14 19 20
## [[3]]
## [1] 7 12 18 16
Remember, each number represents an index into x (originally dat3). We then unlist this list of index vectors and use the indices to subset the data.
I have a data set containing the following information:
Workload name
Configuration used
Measured performance
Here you have a toy data set to illustrate my problem (performance data does not make sense at all, I just selected different integers to make the example easy to follow. In reality that data would be floating point values coming from performance measurements):
workload cfg perf
1 a 1 1
2 b 1 2
3 a 2 3
4 b 2 4
5 a 3 5
6 b 3 6
7 a 4 7
8 b 4 8
You can generate it using:
dframe <- data.frame(workload = rep(letters[1:2], 4),
                     cfg = unlist(lapply(seq_len(4),
                                         function(x) { return(c(x, x)) })),
                     perf = round(seq_len(8)))
I am trying to compute the harmonic speedup for the different configurations. For that a base configuration is needed (cfg = 1 in this example). Then the harmonic speedup is computed as:
HS(cfg_i) = num_workloads / sum_{wl_j} [ perf(cfg_base, wl_j) / perf(cfg_i, wl_j) ]
where the sum runs over all workloads wl_j (j = 1 .. num_workloads).
For instance, for configuration 2 it would be:
HS(cfg_2) = 2 / [perf(cfg_1, wl_1) / perf(cfg_2, wl_1) +
                 perf(cfg_1, wl_2) / perf(cfg_2, wl_2)]
I would like to compute harmonic speedup for every workload pair and configuration. By using the example data set, the result would be:
workload.pair cfg harmonic.speedup
1 a-b 1 2 / (1/1 + 2/2) = 1
2 a-b 2 2 / (1/3 + 2/4) = 2.4
3 a-b 3 2 / (1/5 + 2/6) = 3.75
4 a-b 4 2 / (1/7 + 2/8) = 5.09
I am struggling with aggregate and ddply to find a solution that does not use loops, but I have not been able to come up with a working one. So, the basic problems that I am facing are:
how to handle the relationship between workloads and configurations. The results for a given workload pair (A-B) and a given configuration must be handled together (the first two performance measurements in the denominator of the harmonic speedup formula come from workload A, while the other two come from workload B)
for each workload pair and configuration, I need to "normalize" performance values with the values from configuration base (cfg 1 in the example)
I do not really know how to express that with some R function, such as aggregate or ddply (if it is possible, at all).
Does anyone know how this can be solved?
EDIT: I was somewhat afraid that using 1..8 as perf could lead to confusion. I did that for the sake of simplicity, but the values do not need to be those ones (for instance, imagine initializing them like this: dframe$perf <- runif(8)). Both James's and Zach's answers got that part of my question wrong, so I thought it was better to clarify it in the question. Anyway, I generalized both answers to deal with the case where the performance for configuration 1 is not (1, 2).
Try this:
library(plyr)
baseline <- dframe[dframe$cfg == 1,]$perf
hspeed <- function(x) length(x) / sum(baseline / x)
ddply(dframe, .(cfg), summarise,
      workload.pair = paste(workload, collapse = "-"),
      harmonic.speedup = hspeed(perf))
cfg workload.pair harmonic.speedup
1 1 a-b 1.000000
2 2 a-b 2.400000
3 3 a-b 3.750000
4 4 a-b 5.090909
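If you prefer to avoid plyr, here is a roughly equivalent base-R sketch (it makes the same assumption that, within each cfg, rows appear in the same workload order as the baseline rows):

baseline <- dframe$perf[dframe$cfg == 1]
res <- do.call(rbind, by(dframe, dframe$cfg, function(d) {
    data.frame(cfg = d$cfg[1],
               workload.pair = paste(d$workload, collapse = "-"),
               harmonic.speedup = nrow(d) / sum(baseline / d$perf))
}))
res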
For problems like this, I like to "reshape" the dataframe, using the reshape2 package, giving a column for workload a, and a column for workload b. It is then easy to compare the 2 columns using vector operations:
library(reshape2)
dframe <- dcast(dframe, cfg~workload, value.var='perf')
baseline <- dframe[dframe$cfg == 1, ]
dframe$harmonic.speedup <- 2/((baseline$a/dframe$a)+(baseline$b/dframe$b))
> dframe
cfg a b harmonic.speedup
1 1 1 2 1.000000
2 2 3 4 2.400000
3 3 5 6 3.750000
4 4 7 8 5.090909