How do I stop set.seed() after a specific line of code?

I would like to end the scope of set.seed() after a specific line in order to have real randomization for the rest of the code. Here is an example in which I want set.seed() to work for rnorm() (line 4), but not for the sample(nrow(...)) call (line 9):
set.seed(2014)
f<-function(x){0.5*x+2}
datax<-1:100
datay<-f(datax)+rnorm(100,0,5)
daten<-data.frame(datax,datay)
model<-lm(datay~datax)
plot(datax,datay)
abline(model)
a<-daten[sample(nrow(daten),20),]
points(a,col="red",pch=16)
modela<-lm(a$datay~a$datax)
abline(modela, col="red")
Thanks for any suggestions!

set.seed(NULL)
See help documents - ?set.seed:
"If called with seed = NULL it re-initializes (see ‘Note’) as if no seed had yet been set."

Simply use the current system time to "undo" the seed by introducing a new unique random seed:
set.seed(Sys.time())
If you need more precision, use a millisecond-resolution timestamp (for example by calling out to the shell via R's system(..., intern = TRUE)).
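If you prefer to stay in plain R rather than shelling out, one possibility (an assumption, not part of the answer above) is to build a millisecond-based seed from Sys.time():
ms <- as.numeric(Sys.time()) * 1000                # milliseconds since the epoch
set.seed(as.integer(ms %% .Machine$integer.max))   # keep the value in integer range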

set.seed() only fixes the starting state of the random number generator; each random draw then advances that state, so later calls produce different (though still deterministic) results. In that sense, what you want is already happening.
See this example:
set.seed(12)
sample(1:15, 5)
[1] 2 12 13 4 15
sample(1:15, 5) # run the same code again you will see different results
[1] 1 3 9 15 12
set.seed(12)  # set seed again to see the first set of results
sample(1:15, 5)
[1] 2 12 13 4 15

set.seed() only makes the very next random draw reproducible; it does not re-apply itself to the commands that follow. If you want the same numbers again on a later line, call set.seed() again with the same seed value.
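A two-line illustration of that last point:
set.seed(2014)
x <- rnorm(3)
set.seed(2014)   # same seed again before the later line
y <- rnorm(3)
identical(x, y)  # TRUE: re-seeding with the same value reproduces the draw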

Related

Report weight from row in R

I am a beginner in R who got this question:
One of the functions we will be using often is sample(). Read the help file for sample() using ?sample. Now take a random sample of size 1 from the numbers 13 to 24 and report back the weight of the mouse represented by that row. Make sure to type set.seed(1) to ensure that everybody gets the same answer.
I tried this:
set.seed(1)
i <- sample( 13:24, 1)
dat$Bodyweight[i]
And got the answer 25.34. But apparently, that's wrong. What am I doing wrong?!
You need to specify which column you want to sample from first:
set.seed(1)
sample(dat$Bodyweight[13:24], 1)

modelling an infinite series in R

I'm trying to write a code to approximate the following infinite Taylor series from the Theis hydrogeological equation in R.
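(The image of the series is not reproduced here; reconstructed from the code below, the well function being approximated is W(u) = -gamma - ln(u) + u - u^2/(2*2!) + u^3/(3*3!) - u^4/(4*4!) + ..., with gamma ≈ 0.5772.)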
I'm pretty new to functional programming, so this was a challenge! This is my attempt:
Wu <- function(u, repeats = 100) {
  result <- numeric(repeats)
  for (i in seq_along(result)) {
    result[i] <- -((-u)^i) / (i * factorial(i))
  }
  return(sum(result) - log(u) - 0.5772)
}
I've compared the results with values from a data table available here: https://pubs.usgs.gov/wsp/wsp1536-E/pdf/wsp_1536-E_b.pdf - see below (excuse verbose code - should have made a csv, with hindsight):
Wu_QC <- data.frame(u = c(1.0*10^-15, 4.1*10^-14,9.9*10^-13, 7.0*10^-12, 3.7*10^-11,
2.3*10^-10, 6.8*10^-9, 5.7*10^-8, 8.4*10^-7, 6.3*10^-6,
3.1*10^-5, 7.4*10^-4, 5.1*10^-3, 2.9*10^-2,8.7*10^-1,
4.6,9.90),
Wu_table = c(33.9616, 30.2480, 27.0639, 25.1079, 23.4429,
21.6157, 18.2291, 16.1030, 13.4126, 11.3978,
9.8043,6.6324, 4.7064,2.9920,0.2742,
0.001841,0.000004637))
Wu_QC$rep_100 <- Wu(Wu_QC$u,100)
The good news is the formula gives identical results for repeats = 50, 100, 150 and 170 (so I've just given you the 100 version above). The bad news is that, while the function performs well for u < ~10^-3, it goes off the rails and gives negative outputs for numbers within an order of magnitude or so of 1. This doesn't happen when I just call the function on an individual number. i.e:
> Wu(4.6)
[1] 0.001856671
Which is the correct answer to 2sf.
Can anyone spot what I've done wrong and/or suggest a better way to code this equation? I think the problem is something to do with my for loop and/or an issue with the factorials generating infinite numbers as u gets larger, but I'm not at all certain.
Thanks!
As it says on page 93 of your reference, W is also known as the exponential integral.
Then, e.g., the package expint provides a function to compute W(u):
library(expint)
expint(10^(-8))
# [1] 17.84347
expint(4.6)
# [1] 0.001841006
where the results are exactly as in your referred table.
You can write a function that takes in a value together with the repetition times and outputs the required value:
w <- function(u, l) {
  a <- 2:l
  -0.5772 - log(u) + u + sum(u^a * rep(c(-1, 1), length = l - 1) / a / factorial(a))
}
transform(Wu_QC,new=Vectorize(w)(u,170))
u Wu_table new
1 1.0e-15 3.39616e+01 3.396158e+01
2 4.1e-14 3.02480e+01 3.024800e+01
3 9.9e-13 2.70639e+01 2.706387e+01
4 7.0e-12 2.51079e+01 2.510791e+01
5 3.7e-11 2.34429e+01 2.344290e+01
6 2.3e-10 2.16157e+01 2.161574e+01
7 6.8e-09 1.82291e+01 1.822914e+01
8 5.7e-08 1.61030e+01 1.610301e+01
9 8.4e-07 1.34126e+01 1.341266e+01
10 6.3e-06 1.13978e+01 1.139777e+01
11 3.1e-05 9.80430e+00 9.804354e+00
12 7.4e-04 6.63240e+00 6.632400e+00
13 5.1e-03 4.70640e+00 4.706408e+00
14 2.9e-02 2.99200e+00 2.992051e+00
15 8.7e-01 2.74200e-01 2.741930e-01
16 4.6e+00 1.84100e-03 1.856671e-03
17 9.9e+00 4.63700e-06 2.030179e-05
As u becomes large the estimate is no longer good; we would need more than 170 terms, but factorial(171) already overflows to Inf in double precision, so base R cannot go further. You could try a platform with arbitrary-precision arithmetic, e.g. Python.
I think I may have solved this myself (though borrowing heavily from Onyambo's answer!) Here's my code:
well_func2 <- function(u, l = 100) {
  result <- numeric(length(u))
  a <- 2:l
  for (i in seq_along(u)) {
    result[i] <- -0.5772 - log(u[i]) + u[i] + sum(u[i]^a * rep(c(-1, 1), length = l - 1) / a / factorial(a))
  }
  return(result)
}
As far as I can tell so far, this matches the tabulated results well for u <5 (as did Onyambo's code), and it also gives the same result for vector vs single-value inputs.
Still needs a bit more testing, and there's probably a tidier way to code it using map() or similar instead of the for loop, but I'm happy enough for now. Thought I'd share in case anyone else has the same problem.
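For what it's worth, one tidier possibility is a sketch using vapply() in place of the explicit for loop (same series as well_func2 above; the accuracy limits for large u still apply):
well_func3 <- function(u, l = 100) {
  a <- 2:l
  vapply(u, function(ui) {
    -0.5772 - log(ui) + ui + sum(ui^a * rep(c(-1, 1), length.out = l - 1) / a / factorial(a))
  }, numeric(1))
}
well_func3(c(1e-8, 4.6))   # should match well_func2(c(1e-8, 4.6))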

R sample function issue over 10 million values

I found this quirk in R and can't find much evidence for why it occurs. I was trying to recreate a sample as a check and discovered that the sample function behaves differently in certain cases. See this example:
# Look at the first ten rows of a randomly ordered vector of the first 10 million integers
set.seed(4)
head(sample(1:10000000), 10)
[1] 5858004 89458 2937396 2773749 8135739 2604277 7244055 9060916 9490395 731445
# Select a specified sample of size 10 from this same list
set.seed(4)
sample(1:10000000, size = 10)
[1] 5858004 89458 2937396 2773749 8135739 2604277 7244055 9060916 9490395 731445
# Try the same for sample size 10,000,001
set.seed(4)
head(sample(1:10000001), 10)
[1] 5858004 89458 2937396 2773750 8135740 2604277 7244056 9060917 9490396 731445
set.seed(4)
sample(1:10000001, size = 10)
[1] 5858004 89458 2937397 2773750 8135743 2604278 7244060 9060923 9490404 731445
I tested many values up to this 10 million threshold and found that the values matched (though I admit to not testing more than 10 output rows).
Anyone know what's going on here? Is there something significant about this 10 million number?
Yes, there's something special about 1e7. If you look at the sample code, it ends up calling sample.int. And as you can see at ?sample, the default value for the useHash argument of sample.int is
useHash = (!replace && is.null(prob) && size <= n/2 && n > 1e7)
That && n > 1e7 means that once n exceeds 1e7, the default preference switches to useHash = TRUE. If you want consistency, call sample.int directly and specify the useHash value. (TRUE is a good choice for memory efficiency; see the argument description at ?sample for details.)
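For example, a minimal sketch (assuming an R version recent enough for sample.int() to expose useHash) that forces the same algorithm on both sides of the 1e7 boundary:
set.seed(4)
head(sample.int(10000001, useHash = FALSE), 10)
set.seed(4)
sample.int(10000001, size = 10, useHash = FALSE)
# With useHash fixed, the first ten values of the full permutation and the
# size-10 sample should agree again, just as they did below 1e7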

Using parSapply to generate random numbers

I am trying to run a function that contains a random number generator. The results are not what I expected, so I have done the following test:
# Case 1
set.seed(100)
A1 = matrix(NA,20,10)
for (i in 1:10) {
A1[,i] = sample(1:100,20)
}
# Case 2
set.seed(100)
A2 = sapply(seq_len(10),function(x) sample(1:100,20))
# Case 3
require(parallel)
set.seed(100)
cl <- makeCluster(detectCores() - 1)
A3 = parSapply(cl,seq_len(10), function(x) sample(1:100,20))
stopCluster(cl)
# Check: Case 1 result equals Case 2 result
identical(A1,A2)
# [1] TRUE
# Check: Case 1 result does NOT equal to Case 3 result
identical(A1,A3)
# [1] FALSE
# Check2: Would like to check if it's a matter of ordering
range(rowSums(A1))
# [1] 319 704
range(rowSums(A3))
# [1] 288 612
In the above code, parSapply generates a different set of random numbers than A1 and A2. My purpose in having Check2 was that I suspected parSapply might simply alter the ordering, but that doesn't seem to be the case, since the max and min row sums of the two matrices differ.
I'd appreciate it if someone could shed some light on why parSapply gives a different result from sapply. What am I missing here?
Thanks in advance!
Have a look at vignette("parallel") and in particular at "Section 6 Random-number generation". Among other things it states the following:
Some care is needed with parallel computation using (pseudo-)random numbers: the processes/threads which run separate parts of the computation need to run independent (and preferably reproducible) random-number streams.
When an R process is started up it takes the random-number seed from the object .Random.seed in a saved workspace or constructs one from the clock time and process ID when random-number generation is first used (see the help on RNG). Thus worker processes might get the same seed because a workspace containing .Random.seed was restored or the random number generator has been used before forking: otherwise these get a non-reproducible seed (but with very high probability a different seed for each worker).
You should also have a look at ?clusterSetRNGStream.
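A minimal sketch of the clusterSetRNGStream() route (this makes the parallel results reproducible across runs, although they will still not match the serial A1/A2 results):
library(parallel)
cl <- makeCluster(2)
clusterSetRNGStream(cl, iseed = 100)   # independent, reproducible L'Ecuyer-CMRG streams per worker
A3 <- parSapply(cl, seq_len(10), function(x) sample(1:100, 20))
stopCluster(cl)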

R: Sample into bins of predefined sizes (partition sample vector)

I'm working on a dataset that consists of ~10^6 values which are clustered into a variable number of bins. In the course of my analysis, I am trying to randomize my clustering while keeping the bin sizes constant. As a toy example (in pseudocode), this would look something like this:
data <- list(c(1,5,6,3), c(2,4,7,8), c(9), c(10,11,15), c(12,13,14));
sizes <- lapply(data, length);
for (rand in 1:no.of.randomizations) {
rand.data <- partition.sample(seq(1,15), partitions=sizes, replace=F)
}
So, I am looking for a function like "partition.sample" that will take a vector (like seq(1,15)) and randomly sample from it, returning a list with the data partitioned into the right bin sizes given already by "sizes".
I've been trying to write one such function myself, since the task seems to be not so hard. However, the partitioning of a vector into given bin sizes looks like it would be a lot faster and more efficient if done "under the hood", meaning probably not in native R. So I wonder whether I have just missed the name of the appropriate function, or whether someone could please point me to a smart solution that is around :-)
Your help & time are very much appreciated! :-)
Best,
Lymond
UPDATE:
By "no.of.randomizations" I mean the actual number of times I run through the whole "randomization loop". This will, later on, obviously include more steps than just the actual sampling.
Moreover, I would in addition be interested in a trick to do the above feat for sampling without replacement.
Thanks in advance, your help is very much appreciated!
Revised: This should be fairly efficient. Its complexity should be primarily in the permutation step:
# A single step:
x <- sample( unlist(data))
list(one = x[1:4], two = x[5:8], three = x[9], four = x[10:12], five = x[13:15])
As mentioned above, the "no.of.randomizations" may have been the number of repeated applications of this process, in which case you may want to wrap replicate around it:
replic <- replicate(n=4, { x <- sample(unlist(data))
list( x[1:4], x[5:8], x[9], x[10:12], x[13:15]) } )
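Note that replicate() simplifies by default, so the four replications come back as a list-matrix; passing simplify = FALSE (an addition, not in the answer above) keeps one plain list per replication instead:
data <- list(c(1,5,6,3), c(2,4,7,8), c(9), c(10,11,15), c(12,13,14))
replic <- replicate(n = 4, {
  x <- sample(unlist(data))
  list(x[1:4], x[5:8], x[9], x[10:12], x[13:15])
}, simplify = FALSE)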
After some more thinking and googling, I have come up with a feasible solution. However, I am still not convinced that this is the fastest and most efficient way to go.
In principle, I can generate one long vector of a unique permutation of "data" and then split it into a list of vectors of lengths "sizes" by going via a factor argument supplied to split. For this, I need an additional ID scheme for my different groups of "data", which I happen to have in my case.
It becomes clearer when viewed as code:
data <- list(c(1,5,6,3), c(2,4,7,8), c(9), c(10,11,15), c(12,13,14));
sizes <- lapply(data, length);
So far, everything as above
names <- c("set1", "set2", "set3", "set4", "set5");
In my case, I am lucky enough to have "names" already provided from the data. Otherwise, I would have to obtain them as (e.g.)
names <- seq(1, length(data));
This "names" vector can then be expanded by "sizes" using rep:
cut.by <- rep(names, times = sizes);
[1] 1 1 1 1 2 2 2 2 3 4 4 4 5
[14] 5 5
This new vector "cut.by" can then by provided as argument to split()
rand.data <- split(sample(1:15, 15), cut.by)
$`1`
[1] 8 9 14 4
$`2`
[1] 10 2 15 13
$`3`
[1] 12
$`4`
[1] 11 3 5
$`5`
[1] 7 6 1
This does the job I was looking for alright. It samples from the background "1:15" and splits the result into vectors of lengths "sizes" through the vector "cut.by".
However, I am still not happy to have to go via an additional (possibly) long vector to indicate the split positions, such as "cut.by" in the code above. This definitely works, but for very long data vectors, it could become quite slow, I guess.
Thank you anyway for the answers and pointers provided! Your help is very much appreciated :-)
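Putting the pieces together, here is a minimal sketch of the partition.sample() helper asked for in the question (the function name and arguments follow the question's pseudocode; the body is just the permute-and-split idea above):
partition.sample <- function(x, partitions) {
  sizes <- unlist(partitions)                       # partitions is the list of bin sizes
  stopifnot(sum(sizes) == length(x))
  shuffled <- sample(x)                             # one permutation of the whole vector
  split(shuffled, rep(seq_along(sizes), times = sizes))
}
data  <- list(c(1,5,6,3), c(2,4,7,8), c(9), c(10,11,15), c(12,13,14))
sizes <- lapply(data, length)
rand.data <- partition.sample(seq(1, 15), partitions = sizes)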
