Translating SAS code to R: Simulation and Dataframes

I am working with this SAS code.
data a;
do i=1 to 10000000;
x = 12 + 2.5*rannor(0);
y = 15000 + 2500*x + 5000*rannor(0);
output;
end;
I am having a hard time writing R code that replicates (or at least approximates) what the SAS code above does.
All that I've been able to do was this:
set.seed(0)
x = 12 +2.5*rnorm(1)
y = 1500+ 250*x+ 500*rnorm(1)
...but I think the SAS program actually generates 10000000 x's and y's that have values based on their equations above, so I assume a dataframe is involved.
Has anyone used R and/or SAS before? Any ideas on how I can do something similar to the SAS code?

set.seed(0)
n = 10000000
library(dplyr)
# tibble() replaces the deprecated data_frame(); later columns may refer to earlier ones
tibble(x = 12 + 2.5*rnorm(n),
       y = 1500 + 250*x + 500*rnorm(n))

I don't think an external package is needed here.
set.seed(0)
n <- 10000000
x <- 12+rnorm(n = n,mean = 0,sd = 1)*2.5
y <- 1500 + 250*x + 500*rnorm(n = n, mean = 0, sd = 1)
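data <- data.frame(x, y)  # cbind(x, y) would give a matrix rather than a data frame
The key is to pass n as the first argument of rnorm(), so that it returns n draws instead of a single one. I believe the code above does that.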
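For comparison, here is a minimal sketch that keeps the constants from the SAS program (15000, 2500 and 5000; the R attempts above use values ten times smaller) and returns a data frame. The name a mirrors the SAS data set, and n is reduced to 1e5 only so the example runs quickly; both are choices made for this illustration, not part of the original answers.

```r
set.seed(0)

# The SAS loop runs 10000000 times; a smaller n keeps this sketch fast.
n <- 1e5

# 12 + 2.5 * rannor(0) is a normal draw with mean 12 and sd 2.5
x <- rnorm(n, mean = 12, sd = 2.5)

# Linear signal in x plus normal noise with sd 5000
y <- 15000 + 2500 * x + rnorm(n, mean = 0, sd = 5000)

# One row per iteration of the SAS do-loop
a <- data.frame(x, y)
str(a)
```

Because the seeds and generators differ, the individual values will not match SAS; only the distributions should agree.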

Related

curious behavior of set.seed inside function

I came across something today that I don't quite understand. The setup is that I want to generate some uniformly distributed points in the plane and afterwards assign each point an arrival rate. I want to be able to reproduce the same points but assign different arrival rates. I figured I could use the set.seed function for this.
library(dplyr)
library(ggplot2)
seed = NULL
no_of_points = 50
interval = c("min" = -10, "max" = 10)
arv = c("min" = 1/80, "max" = 1)
plot_data <- function() {
id <- 1:no_of_points
# setting the seed here to be able to reproduce if desired
set.seed(seed)
x <- runif(no_of_points, min = interval["min"], max = interval["max"])
y <- runif(no_of_points, min = interval["min"], max = interval["max"])
# resetting the seed to give "random" arrival rates regardless of the seed
set.seed(NULL)
arrival_rate <- runif(no_of_points, min = arv["min"], max = arv["max"])
data <- tibble(
"Demand point id" = as.character(id),
"x" = x,
"y" = y,
"Arrival rate" = arrival_rate
)
}
ggplot(plot_data()) +
geom_point(aes(x, y, size = `Arrival rate`))
This works fine when I set a seed, and I get a plot like this, which is what I would expect.
However, when I have seed = NULL as in the example code, I get a plot like this, where it seems that the arrival rates are correlated with the x-axis.
How can this be explained? Additionally, I tried running the same code outside a function, and then I get the expected behavior. So I suspect it has something to do with the seed being set inside a function.
I don't think set.seed(NULL) is doing what you expect. According to the documentation, set.seed(NULL) re-initializes the generator as if no seed had yet been set, and a new seed is then created from the current time and the process ID. Inside your function the two calls happen almost back to back, so I think they end up initializing the generator with the same seed both times. Therefore, the first draw after the first set.seed(NULL) (x) is correlated with the first draw after the second set.seed(NULL) (Arrival rate), but not with the second draw after the first call (y). In this simple example, you can see that the nth draw after setting a particular seed is correlated with the nth draw after setting that same seed again, and that using NULL and NULL is basically the same as using 1 and 1.
f <- function(s1 = NULL, s2 = NULL) {
set.seed(s1)
a <- runif(50)
b <- runif(50)
c <- runif(50)
set.seed(s2)
d <- runif(50)
e <- runif(50)
f <- runif(50)
x <- data.frame(a, b, c, d, e, f)
plot(x)
}
f(NULL, NULL)
f(1, 1)
f(1, 2)
Created on 2022-01-04 by the reprex package (v2.0.1)
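One possible way to get what the question asks for (identical points for a given seed, but fresh arrival rates on every call) is to avoid set.seed(NULL) altogether: save the current generator state, seed only the point generation, then restore the saved state before drawing the arrival rates. The sketch below assumes that approach; make_points and its arguments are hypothetical names, and it returns a plain data frame instead of the tibble used in the question.

```r
make_points <- function(n, seed = NULL) {
  if (!is.null(seed)) {
    # Make sure the global RNG state exists before saving it
    if (!exists(".Random.seed", envir = globalenv(), inherits = FALSE)) runif(1)
    saved <- get(".Random.seed", envir = globalenv())
    set.seed(seed)
  }

  # Reproducible when a seed is given
  x <- runif(n, min = -10, max = 10)
  y <- runif(n, min = -10, max = 10)

  # Put the earlier state back so the arrival rates continue the
  # unseeded stream instead of depending on `seed`
  if (!is.null(seed)) assign(".Random.seed", saved, envir = globalenv())

  arrival_rate <- runif(n, min = 1/80, max = 1)
  data.frame(id = as.character(seq_len(n)), x, y, arrival_rate)
}

p1 <- make_points(50, seed = 123)
p2 <- make_points(50, seed = 123)
identical(p1$x, p2$x)                        # same points
identical(p1$arrival_rate, p2$arrival_rate)  # different arrival rates
```

With seed = NULL the function never touches the seed, so all three columns simply continue the current random stream.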

R/Python - For Loop Statement for a Trigonometric Formula

I am working on an estimation module, where we are computing seasonality variations and forecasting. Previously, we were using fixed 5-order sinusoidal functions for estimation. The formula was as follows
doy_seasonality = exp(z[1]*sin(2*pi*doy/365)+z[2]*cos(2*pi*doy/365)
+z[3]*sin(4*pi*doy/365)+ z[4]*cos(4*pi*doy/365)
+z[5]*sin(6*pi*doy/365)+ z[6]*cos(6*pi*doy/365)
+z[7]*sin(8*pi*doy/365)+ z[8]*cos(8*pi*doy/365)
+ z[9]*sin(10*pi*doy/365)+ z[10]*cos(10*pi*doy/365))
Now, we have tried some modifications in our model. Using Fast Fourier Transform, we are able to generate the orders for trigonometric functions automatically.
For example, on my current dataset, I have the following array of orders.
order_FFT = [2, 6, 10, 24], such that
order_FFT[0] = 2
order_FFT[1] = 6
order_FFT[2] = 10
order_FFT[3] = 24
There are 4 orders here. With another dataset, there could be more or fewer orders. Therefore, I need to define a for loop so that the formula adapts to the number of orders.
With my current dataset and corresponding order_FFT array, the for loop should evaluate the following formula:
doy_seasonality = exp(z[1]*sin(2*pi*doy/365)+z[2]*cos(2*pi*doy/365)
+z[3]*sin(6*pi*doy/365)+ z[4]*cos(6*pi*doy/365)
+z[5]*sin(10*pi*doy/365)+ z[6]*cos(10*pi*doy/365)
+z[7]*sin(24*pi*doy/365)+ z[8]*cos(24*pi*doy/365))
which basically means
doy_seasonality = exp(z[1]*sin(order_FFT[0]*pi*doy/365)+z[2]*cos(order_FFT[0]*pi*doy/365)
+z[3]*sin(order_FFT[1]*pi*doy/365)+ z[4]*cos(order_FFT[1]*pi*doy/365)
+z[5]*sin(order_FFT[2]*pi*doy/365)+ z[6]*cos(order_FFT[2]*pi*doy/365)
+z[7]*sin(order_FFT[3]*pi*doy/365)+ z[8]*cos(order_FFT[3]*pi*doy/365))
I am at a loss trying to figure out a for loop code for this. Sorry that I am not able to show my own efforts here.
I would not use a loop. Here is an R approach:
#Some test data
set.seed(42)
z <- rnorm(8)
doy <- 1:365
order_FFT <- c(2, 6, 10, 24)
#separate coefficients for sin and cos in two rows:
z <- matrix(z, nrow = 2)
#calculate the sines and cosines:
sins <- outer(doy, order_FFT, function(x, y) sin(x * pi * y / 365))
cosins <- outer(doy, order_FFT, function(x, y) cos(x * pi * y / 365))
#use matrix products to multiply and sum
doy_seasonality2 <- c(exp(sins %*% z[1,] + cosins %*% z[2,]))
Does it produce the same result?
doy_seasonality = exp(z[1]*sin(2*pi*doy/365)+z[2]*cos(2*pi*doy/365)
+z[3]*sin(6*pi*doy/365)+ z[4]*cos(6*pi*doy/365)
+z[5]*sin(10*pi*doy/365)+ z[6]*cos(10*pi*doy/365)
+z[7]*sin(24*pi*doy/365)+ z[8]*cos(24*pi*doy/365))
all.equal(doy_seasonality, doy_seasonality2)
#[1] TRUE
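If an explicit loop is preferred, as the question asks, the same sum can be accumulated one order at a time. This is a minimal sketch using the same test data as above; the names expo and doy_seasonality_loop are introduced here for illustration, and z is kept as a plain vector in which the kth order uses coefficients z[2k - 1] and z[2k].

```r
# Same test data as in the answer above
set.seed(42)
z <- rnorm(8)
doy <- 1:365
order_FFT <- c(2, 6, 10, 24)

# Accumulate the exponent term by term
expo <- numeric(length(doy))
for (k in seq_along(order_FFT)) {
  angle <- order_FFT[k] * pi * doy / 365
  # odd-indexed coefficients go with sin, even-indexed with cos
  expo <- expo + z[2 * k - 1] * sin(angle) + z[2 * k] * cos(angle)
}
doy_seasonality_loop <- exp(expo)
```

The loop works for any length of order_FFT, provided z holds two coefficients per order.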

Goal seek in R with 3 parameters

#========
#DATABASE
#========
database <- matrix(c(51,43,-22,-92,28,-21,68,22,9,-20,-30,-1,-10,10,-10,-5,10,-2,30,-3,-5),ncol=3,byrow=TRUE)
colnames(database ) <- c("A","B","C")
database <- as.data.frame(database )
x<-1
y<-1
z<-1
database$RES<-c(1,0,0,0,1,0,1)
database$SCORE<- database$A*x+database$B*y+database$C*z
database$PREV<- ifelse(database$SCORE>1,1,0)
#========
#TARGET
#========
t<-table(database$RES, database$PREV)
P<-(t[1]+t[4])/nrow(database)
This is an example of my database (the real one has 60k rows). I want to find the values of x, y and z that maximize P (in the code I set them to 1 just so the script runs). The target P should be 1 or close to 1.
I didn't find what I'm looking for in threads with similar titles.
In Excel this is pretty simple, but Goal Seek can't solve for more than one parameter.
Thanks in advance.
I'm not satisfied with this answer, but maybe it is something that can at least get you started.
The optim() function searches for the parameter values that optimize the objective, but it looks to me, at least with the toy data, as if it gets stuck in a local maximum. You would have to run it several times to find the best parameters. For me the best result is P = 0.8571429, and even then the x, y, z values vary quite significantly between runs, which suggests that there are several equally good solutions for this particular data.
database <- matrix(c(51,43,-22,-92,28,-21,68,22,9,-20,-30,-1,-10,10,-10,-5,10,-2,30,-3,-5),ncol=3,byrow=TRUE)
colnames(database ) <- c("A","B","C")
database <- as.data.frame(database )
database$RES <- c(1,0,0,0,1,0,1)
find_best <- function(data, x) {
SCORE <- data$A*x[1]+data$B*x[2]+data$C*x[3]
PREV <- ifelse(SCORE>1,1,0)
t <- table(data$RES, PREV)
P <- (t[1]+t[4])/nrow(data)
P
}
result <- optim(c(1, 1, 1), find_best, data = database, method = "SANN", control = list(fnscale = -1))
result$value
#[1] 0.8571429  (the P value)
result$par
#[1]  2.396844 -4.460343 -7.137460  (your sought-after x, y, z parameters)
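The advice to run the optimizer several times can be sketched as a small multi-start loop that keeps the best run. Two things here are assumptions of this sketch rather than part of the answer: the helper name accuracy, and the choice to compute P as the share of rows where the prediction matches RES. That share equals (t[1]+t[4])/nrow(data) whenever the table is 2 x 2, but it also stays defined when every row gets the same prediction. The number of restarts (5) and maxit = 2000 are arbitrary.

```r
database <- data.frame(
  A   = c(51, -92, 68, -20, -10, -5, 30),
  B   = c(43, 28, 22, -30, 10, 10, -3),
  C   = c(-22, -21, 9, -1, -10, -2, -5),
  RES = c(1, 0, 0, 0, 1, 0, 1)
)

# Share of rows whose predicted class matches RES
accuracy <- function(par, data) {
  score <- data$A * par[1] + data$B * par[2] + data$C * par[3]
  prev <- as.numeric(score > 1)
  mean(prev == data$RES)
}

set.seed(1)
# Simulated annealing is stochastic, so repeat it and keep the best run
runs <- lapply(1:5, function(i) {
  optim(c(1, 1, 1), accuracy, data = database, method = "SANN",
        control = list(fnscale = -1, maxit = 2000))
})
values <- vapply(runs, function(r) r$value, numeric(1))
best <- runs[[which.max(values)]]

best$value  # best P found across the restarts
best$par    # corresponding x, y, z
```

Since P is a step function of the parameters, gradient-based methods are unlikely to help here, which is presumably why the answer picked SANN.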

How to write down the number of the repeated data in R function

Suppose I have simulated data replicated 100 times (sometimes 1000 times), and I want to apply a function to each replicate. Because there are so many replicates, I would like to know which one my code is working on at any moment. That is, I would like the code to print the number of the dataset it is currently processing: when it starts on the first dataset it should tell me so, then the same for the second, and so on. I know that I will eventually get the results in the console as a list, but my real function is much more complicated; this is just a simple example to explain my problem.
This is my code:
N.a = 186; N.b = 38; N.ab=13; N.o = 284
## 1) numerical optimization
llk = function(xpar){
tmp = exp(c(xpar,0))
pr = tmp/sum(tmp) ## A/B/O
res1 = N.a*log(pr[1]^2+2*pr[1]*pr[3]) + N.b*log(pr[2]^2+2*pr[2]*pr[3])
res2 = N.ab*log(2*pr[1]*pr[2]) + N.o*log(pr[3]^2)
-res1-res2
}
pr = rep(1/3,3) ## A/B/O
it = 0; pdiff = 1
while( (it<100)&(pdiff>1e-5) ){
tmp = c(pr[1]^2, 2*pr[1]*pr[3])
tmp = tmp/sum(tmp)
N.aa = N.a*tmp[1]
N.ao = N.a*tmp[2]
tmp = c(pr[2]^2, 2*pr[2]*pr[3])
tmp = tmp/sum(tmp)
N.bb = N.b*tmp[1]
N.bo = N.b*tmp[2]
pr1 = c(2*N.aa+N.ao+N.ab, 2*N.bb+N.bo+N.ab, N.ao+N.bo+2*N.o)
pr1 = pr1/sum(pr1)
pdiff = mean(abs(pr1-pr))
it = it+1
pr = pr1
cat(it, pr, "\n")
}
How can I use the cat() function for this? For example, how would I use this in my code:
cat(paste0("data: ", i, "\n"))
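A minimal sketch of where such a cat() call could go, assuming the replicated datasets are stored in a list and processed in a for loop. The list datasets, the placeholder computation mean() and the vector results are all hypothetical stand-ins for the real data and function.

```r
set.seed(1)

# Hypothetical stand-in for the simulated data: a list of 5 replicates
datasets <- replicate(5, rnorm(20), simplify = FALSE)

results <- numeric(length(datasets))
for (i in seq_along(datasets)) {
  # Report which replicate is being processed before the work starts
  cat(paste0("data: ", i, " of ", length(datasets), "\n"))

  # Placeholder for the real, more complicated function
  results[i] <- mean(datasets[[i]])
}
```

The same line works inside the function passed to lapply() if the function is applied over the indices seq_along(datasets) rather than over the list itself.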

R - Dealing with zeros in randomized subsamples

I've run into a little problem simulating the throw of dice. Basically I'm doing this to get familiar with loops and their output.
The intention is to simulate the throw of two dice as follows:
R = 100
d6 = c(1:6)
d = 60
DICE = NULL
for (i in 1:R)
{
i <- as.factor((sample(d6, size=d, replace = T)) + (sample(d6, size=d, replace = T)))
j <- summary(i)
DICE = rbind(DICE, j)
}
head(DICE)
HIS = colMeans(DICE)
boxplot(DICE)
title(main= "Result 2d6", ylab= "Throws", xlab="")
relHIS = (HIS / sum(HIS))*100
relHIS
Problems occur if the count in one category is 0 (that result did not occur in the sample). If this happens by chance in the first subsample, one or more of the categories (numbers 2-12) are missing. This causes problems ("number of columns of result is not a multiple of vector length (arg 2)") in the following subsamples.
I'm sure there is a really simple solution for this, by defining everything beforehand...
Thanks for your help!
Here are some fixes:
R = 100
d6 = c(1:6)
d = 60
DICE = matrix(nrow = R, ncol = 11) #pre-allocate
colnames(DICE) <- 2:12
for (i in 1:R)
{
sim <- ordered((sample(d6, size=d, replace = T)) + (sample(d6, size=d, replace = T)),
levels = 2:12) #define the factor levels
sumsim <- table(sim)
DICE[i,] <- sumsim #sub-assign
}
head(DICE)
HIS = colMeans(DICE)
boxplot(DICE)
title(main= "Result 2d6", ylab= "Throws", xlab="")
prop.table(HIS) * 100
Always pre-allocate your result data structure. Growing it in a loop is terribly slow and you know how big it needs to be. Also, don't use the same symbol for the iteration variable and something else.
Omit as.factor() in the seventh line of your code.
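The fixed-size result can also be built without an explicit loop. The sketch below is an alternative to the answers above, not a restatement of them: it assumes tabulate() with nbins = 12 to count every possible sum (so sums that never occur get a 0) and replicate() to repeat the experiment. A seed is added only to make the illustration repeatable.

```r
set.seed(1)
R <- 100  # number of subsamples
d <- 60   # throws per subsample

# One subsample: counts for the sums 2..12, zeros included
one_sample <- function() {
  sums <- sample(1:6, size = d, replace = TRUE) +
          sample(1:6, size = d, replace = TRUE)
  tabulate(sums, nbins = 12)[2:12]
}

# replicate() returns one column per subsample, so transpose to R x 11
DICE <- t(replicate(R, one_sample()))
colnames(DICE) <- 2:12

head(DICE)
prop.table(colMeans(DICE)) * 100
```

Every row has exactly 11 entries regardless of which sums occurred, so the column mismatch from the question cannot arise.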

Resources