Interaction of the sample and rnorm functions in R

I am setting up my R code for a Monte Carlo simulation, and I need to draw a single number from a distribution. To test how the sample function behaves in R, I ran the code below, but I do not understand why the results differ.
x <- rnorm(1, 8, 0)
x
# [1] 8
y <- sample(x = rnorm(1, 8, 0), size = 1)
y
# [1] 4

Quoting ?sample,
If x has length 1, is numeric (in the sense of is.numeric) and x >= 1,
sampling via sample takes place from 1:x.
Since the standard deviation is 0, rnorm(1, 8, 0) returns exactly 8, so you're actually drawing from 1:8, i.e. c(1, 2, 3, 4, 5, 6, 7, 8), and not from c(8).
However, it works if we coerce the value to character first, since this convenience rule only applies to numeric input:
as.numeric(sample(as.character(rnorm(1,8,0)), size=1))
# [1] 8
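More generally, the Examples section of ?sample defines a small resample() wrapper that avoids this surprise by always indexing into x, so a length-1 numeric vector is treated like any other vector:
resample <- function(x, ...) x[sample.int(length(x), ...)]
resample(rnorm(1, 8, 0), size = 1)
# [1] 8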

Related

How to fix the random seed of sample function in R

I am wondering how to fix the random seed for the sample function in R.
An easy example is here:
set.seed(1)
tmp <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
sample(tmp, size = 3) # (*)
#[1] 9 4 7
sample(tmp, size = 3) # (**)
#[1] 1 2 5
sample(tmp, size = 3) # (***)
#[1] 7 2 3
sample(tmp, size = 3) # (****)
#[1] 3 1 5
# reset the random seed and retry
set.seed(1)
sample(tmp, size = 3) # (*)
#[1] 9 4 7
sample(tmp, size = 3) # (**)
#[1] 1 2 5
sample(tmp, size = 3) # (***)
#[1] 7 2 3
sample(tmp, size = 3) # (****)
#[1] 3 1 5
When I repeat the sample call, the returned values change, although the pattern of the changes is identical after each set.seed(1). I cannot understand why sample is affected by set.seed in this way. How can I fix the value returned by sample?
Thank you for taking the time to read this question.
Something like this may be what you want:
{set.seed(1); sample(tmp, 3)}
[1] 9 4 7
This will return the same result whenever it is called.
I think your post sort of answers the question. Notice that the first sample drawn after setting the seed is the same each time, which is exactly what set.seed is expected to do:
set.seed(1)
tmp <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
sample(tmp, size = 3)
[1] 9 4 7
sample(tmp, size = 3)
[1] 1 2 5
set.seed(1)
tmp <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
sample(tmp, size = 3)
[1] 9 4 7
sample(tmp, size = 3)
[1] 1 2 5
The first set of three draws after setting the seed is the same, and the second set of three draws is the same.
Perhaps it is easier to see how set.seed works by looking at a single draw of six randomly selected values:
set.seed(1)
tmp <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
sample(tmp, size = 6)
[1] 9 4 7 1 2 5
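If you would rather rewind the generator than re-derive everything from a seed, you can also snapshot and restore .Random.seed directly (base R; a sketch using the same tmp as above):
set.seed(1)
saved <- .Random.seed   # snapshot of the PRNG state
sample(tmp, size = 3)
# [1] 9 4 7
.Random.seed <- saved   # restore the state (works in the global environment)
sample(tmp, size = 3)
# [1] 9 4 7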
Your question seems to reflect a basic misunderstanding of Pseudo-Random Number Generators (PRNGs) and seeds. PRNGs maintain an internal state containing a fixed number of bits, and advance from the current state to the next state by applying a deterministic function. That function determines how "random" the sequence of state-based outputs appears to be, even though it's not actually random (hence the pseudo prefix). Given a fixed number of bits, the algorithm will eventually repeat its internal state, at which point the entire sequence repeats since it's deterministic. How long it takes to do this is called the cycle length of the PRNG.
Once you realize that all PRNGs will eventually repeat their sequence, it's easy to see that a seed does nothing more (nor less) than provide an entry point to that cycle. A lot of people mistakenly think the seed influences the quality of the PRNG, but that's incorrect. The quality is based on the deterministic transition function, not on the seed.
So why do PRNGs allow you to specify a seed? Basically it's for reproducibility. Firstly, it's really hard to debug a program if you can't reproduce the data which exposed your bug. Re-running with the same seed will allow you to trace through to whatever caused a problem. Secondly, this allows for fairer comparisons of alternative systems. Let's say you want to compare how a grocery store, or a bank, works when you add an additional clerk or teller. If you run both configurations with the same set of customers and customer transactions by controlling seeding, the differences you see between those configurations are directly attributable to the change in the system configuration. Without controlling the seeding, differences may be damped or amplified by the randomness of the sequence. Yes, in the long run that would come out by taking a larger sample, but the point is that you can reduce the variance of the answer by clever use of seeding rather than boosting the sample size.
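As a toy illustration of that grocery-store point (a sketch with made-up rates, not a real queueing model), reusing one seed gives both configurations the identical customer stream, so the measured difference is attributable to the capacity change alone:
set.seed(42)
demand <- rpois(1000, lambda = 5)   # shared stream of customer arrivals
served_one <- pmin(demand, 6)       # configuration A: capacity 6 per period
set.seed(42)
demand <- rpois(1000, lambda = 5)   # same seed, so the identical stream
served_two <- pmin(demand, 12)      # configuration B: capacity 12 per period
mean(served_two - served_one)       # difference due only to the capacity change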
Bottom line - your example is working exactly as expected. Resetting the seed produces the same sequence of randomness when sampled in the same way.

Calculating the value to know the trend in a set of numeric values

I have a requirement where I have a set of numeric values, for example: 2, 4, 2, 5, 0
As we can see, the trend in the set of numbers above is mixed, but since the latest number is 0, I would consider the value to be going DOWN. Is there any way to measure the trend (whether it is going up or down)?
Is there an R package available for that?
Thanks
Suppose your vector is c(2, 4, 2, 5, 0) and you want to classify the last change (increasing, constant, or decreasing); then you could use the diff function with a lag of 1 and check the sign of the last difference. Below is an example.
MyVec <- c(2, 4, 2, 5, 0)
Lagged_vec <- diff(MyVec, lag = 1)   # differences between consecutive values
last_change <- Lagged_vec[length(Lagged_vec)]
if (last_change < 0) {
  print("Decreasing")
} else if (last_change == 0) {
  print("Constant")
} else {
  print("Increasing")
}
Please let me know if this is what you wanted.
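If you care about the overall direction rather than just the last step, one common approach (no package required; shown here as a sketch) is the sign of a least-squares slope:
MyVec <- c(2, 4, 2, 5, 0)
slope <- coef(lm(MyVec ~ seq_along(MyVec)))[2]   # slope of a linear fit over time
if (slope < 0) print("Overall trend: down") else print("Overall trend: up")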

How to calculate the distance between an array and a matrix

Consider a matrix A and an array b. I would like to calculate the distance between b and each row of A. For instance consider the following data:
A <- matrix(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15), 3, 5, byrow=TRUE)
b <- c(1, 2, 3, 4, 5)
I would expect as output some array of the form:
distance_array = c(0, 11.18, 22.36)
where the value 11.18 comes from the Euclidean distance between A[2,] and b:
sqrt(sum((A[2,] - b)^2))
This seems pretty basic, but so far all the R functions I have found compute distance matrices between all pairs of rows of a single matrix, not this vector-matrix calculation.
I would recommend putting the rows of A in a list instead of a matrix, as that might allow for faster processing. But here's how I would do it with your example:
A <- matrix(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15), 3, 5, byrow=TRUE)
b <- c(1, 2, 3, 4, 5)
apply(A, 1, function(x) sqrt(sum((x - b)^2)))
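A vectorised alternative, assuming the same A and b as above, subtracts b from every row with sweep() and then reduces with rowSums():
sqrt(rowSums(sweep(A, 2, b)^2))
# [1]  0.00000 11.18034 22.36068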

Multivariate Granger's causality

I'm having issues doing a multivariate Granger causality test. I'd like to check whether conditioning on a third variable affects the results of the causality test.
Here's one sample for a single dependent and independent variable, based on an earlier question I asked that was answered by @Alex:
Granger's causality test by column
library(lmtest)
M1<- matrix( c(2,3, 1, 4, 3, 3, 1,1, 5, 7), nrow=5, ncol=2)
M2<- matrix( c(7,3, 6, 9, 1, 2, 1,2, 8, 1), nrow=5, ncol=2)
M3<- matrix( c(1, 3, 1,5, 7,3, 1, 3, 3, 4), nrow=5, ncol=2)
For example, the formula for a conditioned linear regression would be
formula = y ~ w + x * z
How do I carry out this test as a function of a third or fourth variable?
1. The solution for stationary variables is well established: see the FIAR (v 0.3) package.
This is the paper related to the package, which includes a concrete example of multivariate Granger causality (in the case where all of the variables are stationary).
Page 12: theory. Page 15: practice.
2. In the case of mixed (stationary and nonstationary) variables, make all the nonstationary variables stationary first (via differencing etc.); leave the already-stationary ones alone. Then finish with the procedure above (case 1).
3. In the case of non-cointegrated nonstationary variables, there is no need for a VECM. Run a VAR on the variables (after making them stationary first, of course) and apply FIAR::condGranger etc.
4. In the case of cointegrated nonstationary variables, the answer is really very long:
Detect the cointegration rank via the Johansen procedure (urca::ca.jo).
Apply vec2var to convert the VECM to a VAR (since FIAR is based on VAR).
John Hunter's latest book nicely summarizes what can happen and what can be done in this last case.
You may want to read this as well.
To my knowledge, conditional/partial Granger causality supersedes GC via the "block exogeneity Wald test over VAR".
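If you want to see what the conditional test does under the hood, here is a minimal hand-rolled sketch (simulated data, one lag, and variable names chosen purely for illustration): fit a restricted model that excludes the lagged candidate cause x and an unrestricted one that includes it, conditioning on z in both, then compare them with a Wald test from lmtest.
library(lmtest)
set.seed(1)
n <- 100
z <- rnorm(n)
x <- rnorm(n)
y <- c(0, x[-n]) + rnorm(n)                     # y is driven by lagged x
d <- data.frame(y = y[-1], y1 = y[-n],          # y at time t and its first lag
                x1 = x[-n], z1 = z[-n])         # first lags of x and z
restricted   <- lm(y ~ y1 + z1, data = d)       # without lagged x
unrestricted <- lm(y ~ y1 + z1 + x1, data = d)  # with lagged x
waldtest(restricted, unrestricted)              # rejection => x Granger-causes y given z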

Decile function in R - nested ifelse() statements lead to poor runtime

I wrote a function that assigns each element of a vector to a decile. I am doing this with the intention of creating graphics to evaluate the efficacy of a predictive model. There has to be an easier way to do this, but I haven't been able to figure it out. Does anyone have any idea how I could score a vector this way without so many nested ifelse() statements? I included the function as well as some code to reproduce my results.
# function
decile <- function(x){
  deciles <- vector(length = 10)
  for (i in seq(0.1, 1, 0.1)){
    deciles[i * 10] <- quantile(x, i)
  }
  return(ifelse(x < deciles[1], 1,
         ifelse(x < deciles[2], 2,
         ifelse(x < deciles[3], 3,
         ifelse(x < deciles[4], 4,
         ifelse(x < deciles[5], 5,
         ifelse(x < deciles[6], 6,
         ifelse(x < deciles[7], 7,
         ifelse(x < deciles[8], 8,
         ifelse(x < deciles[9], 9, 10))))))))))
}
# check functionality
test.df <- data.frame(a = 1:10, b = rnorm(10, 0, 1))
test.df$deciles <- decile(test.df$b)
test.df
# order data frame
test.df[with(test.df, order(b)),]
You can use quantile and findInterval
# find the decile locations
decLocations <- quantile(test.df$b, probs = seq(0.1,0.9,by=0.1))
# use findInterval with -Inf and Inf as lower and upper bounds
findInterval(test.df$b,c(-Inf,decLocations, Inf))
Another solution is to use ecdf(), described in the help files as the inverse of quantile().
round(ecdf(test.df$b)(test.df$b) * 10)
Note that @mnel's solution is around 100 times faster.
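For completeness, the same bucketing can also be written with base R's cut(), assuming the same test.df as above (labels = FALSE makes cut() return the integer bucket codes directly):
breaks <- quantile(test.df$b, probs = seq(0, 1, by = 0.1))
cut(test.df$b, breaks = breaks, include.lowest = TRUE, labels = FALSE)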
