R using set.seed() within a purrr::map iteration structure - r

When generating new data in R, I can use set.seed() to ensure that I get the same data sets every time the code is run:
set.seed(12345)
a <- rnorm(500, mean = 50, sd = 10)
set.seed(12345)
b <- rnorm(500, mean = 50, sd = 10)
identical(a, b)
# TRUE
If I comment out the set.seed() lines, identical(a,b) returns FALSE.
Now I want to use a purrr::map() structure to generate multiple data sets with slightly different parameters:
library(tidyverse)
means <- c(40, 50, 60)
sds <- c(9, 10, 11)
set.seed(12345)
data <- map2(
means,
sds,
~
rnorm(500, mean = .x, sd = .y)
)
The map2() call generates a list of three numeric vectors. With this relatively simple operation, I get identical vectors every time I run the code. But I'm finding that with more complex, longer functional pipelines involving certain packages (e.g., bestNormalize), I'm not getting identical output when the set.seed() call sits outside the iterative structure of map().
I'm at a loss for how to bring set.seed() within the map() iteration structure so that it is called anew at the beginning of each iteration. To be clear, the larger goal is to be able to iterate over functions that use random number generation, and to get identical results every time. Perhaps there's a better way to accomplish this in the tidyverse that doesn't depend on set.seed(). Thanks in advance for any help!

I hope this answers your question about how to position the seed inside the map call:
means <- c(40, 50, 60)
sds <- c(9, 10, 11)
myfun <- function(means, sds){
set.seed(12345) # set it before each call
ret <- rnorm(500, mean = means, sd = sds)
return(ret)
}
data <- purrr::map2(means,
sds,
~ myfun(.x, .y))

As a followup, here is the most concise way to solve my original problem:
library(tidyverse)
means <- c(40, 50, 60)
sds <- c(9, 10, 11)
data <- map2(
means,
sds,
~ {
set.seed(12345)
rnorm(500, mean = .x, sd = .y)
}
)
This code returns identical results each time it is run.
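One caveat with resetting the same seed inside every iteration: with R's default generator, each rnorm() call then starts from the same random stream, so the three vectors are shifted and rescaled copies of one another rather than independent samples. If independent but still reproducible draws are wanted, one option is to give each iteration its own seed. The sketch below uses base R's Map() so that it runs without extra packages; the seeds vector and the draw() helper are illustrative assumptions, not part of the original question.

```r
means <- c(40, 50, 60)
sds <- c(9, 10, 11)
seeds <- c(101, 102, 103)  # hypothetical: any distinct integers will do

# Hypothetical helper: reset the generator, then draw one sample
draw <- function(m, s, seed) {
  set.seed(seed)               # applies to this iteration only
  rnorm(500, mean = m, sd = s)
}

data1 <- Map(draw, means, sds, seeds)
data2 <- Map(draw, means, sds, seeds)
identical(data1, data2)
# TRUE
```

With purrr loaded, pmap(list(means, sds, seeds), draw) should be the equivalent call.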

Related

Simulate rnorm in R for many observations using mean and sd from each row

I am attempting to apply the rnorm function to many rows (214) of a data frame in R.
I want to use the predefined row mean and sd values of each row of the data frame to complete the simulations and n=10,000 for all observations.
I would like to use the apply function to do this, however, I am unclear how to write the rnorm call within the apply function to accomplish this for all rows at once.
Reproducible example:
set.seed(1)
Data <- data.frame(
Hazard = LETTERS[1:10],
mean = sample(1:10),
sd = c(0.14,0.23,0.21,0.27,0.12,0.19,0.21,0.18,
0.29,0.22)
)
Code I tried:
dist <- rnorm(10000, mean=Data$mean, sd=Data$sd)
apply(X= Data,
FUN = dist,
MARGIN = 1)
Thanks in advance for your assistance.
It may be better to use Map here, which loops over the corresponding elements of the 'mean' and 'sd' columns, applies rnorm to each pair, and returns a list:
n <- 10000
lst1 <- Map(function(x, y) rnorm(n, mean = x, sd = y), Data$mean, Data$sd)
Or, if we prefer apply, subset the numeric columns of interest and loop over their rows. This returns a matrix with one column of draws per row of Data:
apply(Data[-1], 1, FUN = function(x) rnorm(n, mean = x[1], sd = x[2]))
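For context on why the original attempt fails: apply() expects FUN to be a function, whereas dist is already a numeric vector, and rnorm() recycles vector-valued mean and sd across the draws instead of producing one sample per row. That recycling can also be used deliberately. The sketch below is one possible alternative rather than part of the answers above; it rebuilds the question's Data, and the names k and sims are illustrative.

```r
set.seed(1)
Data <- data.frame(
  Hazard = LETTERS[1:10],
  mean = sample(1:10),
  sd = c(0.14, 0.23, 0.21, 0.27, 0.12, 0.19, 0.21, 0.18, 0.29, 0.22)
)

n <- 10000
k <- nrow(Data)

# rnorm() cycles through mean and sd, so draw i uses row ((i - 1) %% k) + 1.
# Filling a k-row matrix column by column puts all draws for row j in row j.
sims <- matrix(rnorm(n * k, mean = Data$mean, sd = Data$sd), nrow = k)
rownames(sims) <- Data$Hazard

# Each row's sample mean should sit close to the mean it was drawn with
round(rowMeans(sims) - Data$mean, 3)
```

The result is a 10 x 10000 matrix, so sims["A", ] holds the simulated values for Hazard A.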
My solution would be to use expand.grid to build a new data frame with one row per Hazard/iteration combination, then draw all the values in a single vectorized rnorm call:
library(dplyr)
sim_data <-
  expand.grid(Hazard = Data$Hazard, iteration = 1:10000, stringsAsFactors = FALSE) %>%
  left_join(Data, by = "Hazard") %>%
  mutate(x = rnorm(n(), mean = mean, sd = sd))

K-means iterated for same data for 10 times

I am new to R. I am trying to see whether I can optimize K-means (using R) by repeatedly calling the k-means routine on the same dataset with the same value of K (k = 3 in my case), 10 to 15 times, and checking whether that gives me good results. I see the clustering change at every call, and even the total sum of squares and withinss start changing, but I am not sure how to stop at the best result.
Can anyone guide me?
code:
run_kmeans <- function(xtimes)
{
for (x in 1:xtimes)
{
kmeans_results <- kmeans(filtered_data, 3)
print(kmeans_results["totss"])
print(kmeans_results["tot.withinss"])
}
return(kmeans_results)
}
kmeans_results = run_kmeans(10)
Not sure I understood your question because this is not the usual way of selecting the best partition (elbow method, silhouette method, etc.)
Let's say you want to find the kmeans partition that minimizes your within-cluster sum of squares.
Let's take the example from ?kmeans
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
You could run kmeans repeatedly like this:
xtimes <- 10
fits <- lapply(seq_len(xtimes), function(i) {
  kmeans(x, 3)
})
lapply is convenient here because it collects the fits in a list, which a plain for loop does not do on its own. To extract tot.withinss and see which fit has the smallest value:
perf <- sapply(fits, function(d) d$tot.withinss)
which.min(perf)
However, unless I misunderstood your objective, this is an unusual way to select the best-performing partition. Usually it is the number of clusters that is evaluated, not different partitions produced with the same sample data and the same number of clusters.
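If the aim is simply to keep the best of several random starts for a fixed k, kmeans() already has an nstart argument that does this internally and returns the fit with the lowest total within-cluster sum of squares. A minimal sketch on the same kind of simulated data; the seed, the choice of 10 starts, and the object names are arbitrary assumptions.

```r
set.seed(42)  # arbitrary seed so the example is reproducible
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

# Manual version: ten independent fits, keep the one with the smallest tot.withinss
fits <- lapply(seq_len(10), function(i) kmeans(x, centers = 3))
perf <- vapply(fits, function(fit) fit$tot.withinss, numeric(1))
best_manual <- fits[[which.min(perf)]]

# Built-in version: kmeans() tries 10 random starts and returns the best one
best_nstart <- kmeans(x, centers = 3, nstart = 10)

c(manual = best_manual$tot.withinss, nstart = best_nstart$tot.withinss)
```

The two values will usually be close, since both approaches keep the lowest tot.withinss among ten random initializations.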
Edit from your comment
OK, so you want to find the combination of columns that gives you the best performance. Below is an example where every pair drawn from three variables is tested. You could generalize it, but the number of possible combinations with 8 variables grows quickly, so you would need a routine to reduce the number of combinations tested.
x <- rbind(matrix(rnorm(150, sd = 0.3), ncol = 3),
           matrix(rnorm(150, mean = 1, sd = 0.3), ncol = 3))
colnames(x) <- c("x", "y", "z")
combinations <- combn(colnames(x), 2, simplify = FALSE)
fits <- lapply(combinations, function(i) {
  kmeans(x[, i], 3)
})
perf <- sapply(fits, function(d) d$tot.withinss)
which.min(perf)

How to pass a parameter to purrr's map function without using formula notation?

I am following along with Hadley Wickham's online book "R for Data Science" and got a little confused once the purrr::map function was introduced. In particular, exercise 21.5.3 (4) asks to create 10 random normals for each of the means (-10, 0, 10, 100), but my attempts to apply a function using map failed.
I did note that (unlike previous examples) the mean serves as a parameter here and not as the object to which the function is applied. The solution makes use of the (abbreviated) formula notation. What exactly lets the code work with a formula rather than a function even though the explanation suggests that both are equivalent inside map()?
The given solution is:
library("tidyverse")
mu <- c(-10, 0, 10, 100)
map(mu, ~ rnorm(n = 10, mean = .))
To me, the equivalent function would look something like:
library("tidyverse")
mu <- c(-10, 0, 10, 100)
map(mu, rnorm(n = 10, mean = mu))
Note also that it is indeed possible to use the vector that is passed to map as a parameter to a function, as in:
library("tidyverse")
map(1:5, rnorm)
mu <- c(-10, 0, 10, 100)
#map(mu, rnorm(n = 10, mean = mu))
map(mu, partial(rnorm, n = 10))
map(mu, rnorm, n = 10)
map(mu, function(x, n = 10) rnorm(n = n, mean = x))
Three different ways of doing it. Personally, I like number 2.
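On the "why" part of the question: map() needs something it can call once per element. The formula ~ rnorm(n = 10, mean = .) is shorthand that purrr turns into a one-argument function, whereas rnorm(n = 10, mean = mu) is evaluated immediately and hands map() a numeric vector of ten draws instead of a function. The base R sketch below shows the same distinction with lapply(), which avoids any purrr dependency; the object names are illustrative.

```r
mu <- c(-10, 0, 10, 100)

# An anonymous function: called once per element, with m bound to that element
draws <- lapply(mu, function(m) rnorm(n = 10, mean = m))
lengths(draws)
# 10 10 10 10

# Extra arguments after the function are passed along on every call,
# so each element of mu fills the first unmatched argument (mean)
draws2 <- lapply(mu, rnorm, n = 10)

# By contrast, this is a plain numeric vector, not something that can be called
not_a_function <- rnorm(n = 10, mean = mu)
is.function(not_a_function)
# FALSE
```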

For Loop t.test, Comparing Means by Factor Class in R

I want to loop a lot of one sided t.tests, comparing mean crop harvest value by pattern for a set of different crops.
My data is structured like this:
df <- data.frame("crop" = rep(c('Beans', 'Corn', 'Potatoes'), 10),
"value" = rnorm(n = 30),
"pattern" = rep(c("mono", "inter"), 15),
stringsAsFactors = TRUE)
I would like the output to provide results from a t.test, comparing mean harvest of each crop by pattern (i.e. compare harvest of mono-cropped potatoes to intercropped potatoes), where the alternative is greater value for the intercropped pattern.
Help!
Here's an example using base R.
# Generate example data
df <- data.frame("crop" = rep(c('Beans', 'Corn', 'Potatoes'), 10),
"value" = rnorm(n = 30),
"pattern" = rep(c("inter", "mono"), 15),
stringsAsFactors = TRUE)
# Create a list which will hold the output of the test for each crop
crops <- unique(df$crop)
test_output <- vector('list', length = length(crops))
names(test_output) <- crops
# For each crop, save the output of a one-sided t-test
for (crop in crops) {
# Filter the data to include only observations for the particular crop
crop_data <- df[df$crop == crop,]
# Save the results of a t-test with a one-sided alternative
test_output[[crop]] <- t.test(formula = value ~ pattern,
data = crop_data,
alternative = 'greater')
}
It's important to note that when calling t.test with the formula interface (e.g. y ~ x) with a factor as the independent variable, the setting alternative = 'greater' tests whether the mean in the first factor level (in the case of your data, "inter") is greater than the mean in the second factor level (here, that's "mono").
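Because the direction of the one-sided test depends on the factor level order, it can help to set that order explicitly rather than rely on the alphabetical default. A small sketch, assuming the same kind of simulated df as above; the seed and the choice of the Potatoes subset are arbitrary additions.

```r
set.seed(1)  # arbitrary seed, only to make the example reproducible
df <- data.frame(crop = rep(c("Beans", "Corn", "Potatoes"), 10),
                 value = rnorm(n = 30),
                 pattern = rep(c("mono", "inter"), 15),
                 stringsAsFactors = TRUE)

# Put "inter" first so that alternative = "greater" tests inter > mono
df$pattern <- factor(df$pattern, levels = c("inter", "mono"))
levels(df$pattern)

potatoes <- df[df$crop == "Potatoes", ]
res <- t.test(value ~ pattern, data = potatoes, alternative = "greater")
res$estimate  # group means, reported in level order
res$p.value
```

Swapping the order in levels = c(...) flips which group is treated as the "greater" one.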
Here's the elegant "tidyverse" approach, which makes use of the tidy function from broom which allows you to store the output of a t-test as a data frame.
Instead of a formal for loop, the group_by and do functions from the dplyr package are used to accomplish the same thing as a for loop.
library(dplyr)
library(broom)
# Generate example data
df <- data.frame("crop" = rep(c('Beans', 'Corn', 'Potatoes'), 10),
"value" = rnorm(n = 30),
"pattern" = rep(c("inter", "mono"), 15),
stringsAsFactors = TRUE)
# Group the data by crop, and run a t-test for each subset of data.
# Use the tidy function from the broom package
# to capture the t.test output as a data frame
df %>%
group_by(crop) %>%
do(tidy(t.test(formula = value ~ pattern,
data = .,
alternative = 'greater')))
Consider by, an object-oriented wrapper for tapply designed to subset a data frame by one or more factors and run an operation on each subset:
t_test_list <- by(df, df$crop, function(sub)
t.test(formula = value ~ pattern,
data = sub, alternative = 'greater')
)

How to store the output of the function as a matrix with specific name of columns and rows in r?

I simulate mixture data for 10 runs, then apply my function to all runs using the apply function, which gives me 10 different results. I would like to save the output as a matrix with column and row names of my choosing. For example, with this code:
dist = rnorm(n=100, m=2, sd=2.2)
rep. = function (dist) {
replicate(n=2, dist)
list(mean(dist), mode(dist), sd(dist))
}
I would like to find the mode, mean and sd as a matrix for the 2 runs. That is:
Iteration  Mean  Mode  sd
1          0.5   3     0.4
2          0.3   1     0.6
Any help please?
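First of all, mode() in base R returns the storage mode of an object (e.g. "numeric"), not the statistical mode. So here is a function that estimates the mode from a kernel density: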
mode <- function(x){
dd <- density(x)
dd$x[which.max(dd$y)]
}
Second, you need to wrap your code inside the call to replicate. As written, your code does nothing with the result of replicate: it evaluates dist twice, discards the result, and then calculates the mean, mode and sd only once.
Third: if you want to do this on randomly generated vectors, the rnorm() call also has to be inside the replicate call.
Don't return a list object; return a named vector and let replicate do the job for you.
Fourth: replicate() returns by default a matrix, with names where available. That is what its simplify argument is for. Keep in mind that replicate binds the results together as columns. So to get a matrix with one column for mean, one for mode and one for sd, you need to transpose it.
So what you want to do, is better solved by:
out <- replicate(n = 2, {
  x <- rnorm(n = 100, mean = 2, sd = 2.2)
  c(mean = mean(x),
    mode = mode(x),
    sd = sd(x))
})
t(out)
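To get the row and column labels asked for in the question, the transposed matrix can be given dimnames directly. A sketch under a few assumptions: the helper is called est_mode here (a hypothetical name chosen to avoid masking base R's mode()), and the "run" labels and the seed are arbitrary.

```r
# Kernel-density estimate of the mode (same idea as the helper above)
est_mode <- function(x) {
  dd <- density(x)
  dd$x[which.max(dd$y)]
}

set.seed(123)  # arbitrary seed for reproducibility
n_runs <- 2
out <- replicate(n_runs, {
  x <- rnorm(n = 100, mean = 2, sd = 2.2)
  c(Mean = mean(x), Mode = est_mode(x), sd = sd(x))
})

res <- t(out)                                    # one row per run
rownames(res) <- paste0("run", seq_len(n_runs))  # hypothetical row labels
res
```

The column names come from the names of the vector returned inside replicate(), so changing them there changes the matrix header.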
The following code does what you want for the mean and sd:
library(dplyr)
replicate(n = 2, expr = x <- rnorm(n = 100, mean = 2, sd = 2.2)) %>%
data.frame() %>%
summarize_all(c("mean", "sd"))
There is no function in base R for calculating the mode, so I suggest you do that in a separate function as described here. However, calculating the mode does not make much sense in this case since the mean = median = mode for the normal distribution.

Resources