I'm trying to nest simulated data for an efficient frontier inside a tibble containing all 250 simulations. The tibble has one column named "sim", which holds the number of the simulation, i.e. the rows in this column run from 1 to 250. The other column should contain the nested simulation data, which is a 3x123 tibble for each simulation. With help from a nice soul here, I've managed to create this tibble containing the efficient frontiers. Now I need a loop that runs through this tibble and plots all 250 efficient frontiers in one plot.
I've tried to replicate the problem such that you don't need all of the previous code and data to see the issue. In this simple and reproducible example I have a table which is a 5x2 Tibble where the column 'sim' lists simulations (1:5) and 'obs' holds an individual 5x3 tibble with some coordinates:
library(tidyverse)
library(ggplot2)
counter = 0
table <- tibble(sim = 1:5, obs = NA)
for(i in (1:5)){
counter = counter + 1
tibble <- tibble(a = NA, b = 1:5, x = c(counter + 1), y = c(counter*2-1))
tibble$a <- counter
nested_tibble <- tibble %>% nest(data = -a) %>% select(-a)
table$obs[i] <- nested_tibble[[1]]
}
for (i in (1:5)){
print(ggplot()+
geom_point( data = (table %>% filter(sim == i) %>% .$obs)[[1]],
aes(x = x, y = y),
color = "red",
size = 4))
}
As mentioned I wish for it to plot all of the 5 coordinates in one plot such that I can replicate this to plot 250 efficient frontiers. However, when I run the code it only returns the last coordinate.
I hope my formulation makes sense. If you need any additional documentation please let me know.
I am not sure, but this should do the job. I think a plain list is a better way to store a nested structure like this, so the code below builds a list called table_out and binds it into one table for plotting.
Please have a look and see whether this is what you want.
library(tibble)
library(data.table)
library(ggplot2)
N_sim <- 5
table_out <- vector("list", N_sim)
for ( i in seq_len(N_sim) ) {
current_table <- tibble(a = i, b = 1L:N_sim, x = i + 1, y = i*2 - 1)
table_out[[ i ]] <- current_table
}
# this creates a data.table (like a data.frame) from a list
final <- rbindlist( table_out )
ggplot(final, aes(x, y)) +
geom_point(color = "red", size = 4)
Created on 2021-03-03 by the reprex package (v1.0.0)
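If the nested tibble from the question has to be kept as it is, one possible alternative is to flatten the list-column with tidyr's unnest() and hand the result to a single ggplot() call. This is only a sketch: the `nested` object below is a hypothetical stand-in for the question's `table`, built directly rather than in a loop.

```r
library(tibble)
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for the question's table: one row per simulation,
# with a list-column `obs` holding that simulation's coordinates.
nested <- tibble(
  sim = 1:5,
  obs = lapply(1:5, function(i) tibble(b = 1:5, x = i + 1, y = i * 2 - 1))
)

# Flatten the list-column so every observation carries its simulation number.
flat <- unnest(nested, obs)

# One plot for all simulations; colour distinguishes the simulations.
p <- ggplot(flat, aes(x = x, y = y, colour = factor(sim))) +
  geom_point(size = 4)
p
```

The original loop showed only the last coordinate because each iteration called print() on a brand new ggplot object, so every plot replaced the previous one.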
I have a function that outputs a 1 by 2 data frame like the below reproducible example:
g_1 <- 2
g_2 <- 3
tbl <- cbind(g_1, g_2)
tbl <- as.data.frame(tbl)
tbl
And I'm trying to run a simulation of this function 5000 times and map the output of the function to a matrix or in other words fill the matrix with the output of each iteration. I have this code which I know doesn't work because I get the error under it but also because I think it's trying to fill the output into one column maybe?
nreps <- 5000
#Creating workspace
df_sim <- matrix(-999, nreps, 2, dimnames = list( c(), c("g_1", "g_2")))
for (i in 1:nreps){
df_sim[i] <- sim_tab(x = 5, y = 6)
}
number of items to replace is not a multiple of replacement length
Is there a way to fill the matrix with each 1,2 output from each iteration of the loop?
You can use replicate to repeat the function nreps times and combine the result using do.call.
result <- do.call(rbind, replicate(nreps,sim_tab(x = 5, y = 6),simplify = FALSE))
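A minimal sketch of both routes follows. The sim_tab() defined here is a hypothetical stand-in for the question's function (its body is not shown in the question), and nreps is reduced from 5000 to keep the sketch quick. The second option shows why the original loop failed: df_sim[i] addresses a single cell, while df_sim[i, ] addresses the whole row.

```r
# Hypothetical stand-in for the question's sim_tab(): returns a 1 x 2 data frame.
sim_tab <- function(x, y) {
  data.frame(g_1 = rnorm(1, mean = x), g_2 = rnorm(1, mean = y))
}

set.seed(1)
nreps <- 1000  # smaller than the question's 5000, for speed only

# Option 1: collect the replicates in a list and stack them row by row.
result <- do.call(rbind, replicate(nreps, sim_tab(x = 5, y = 6), simplify = FALSE))

# Option 2: keep the preallocated matrix, but assign to the whole row i.
df_sim <- matrix(NA_real_, nreps, 2, dimnames = list(NULL, c("g_1", "g_2")))
for (i in seq_len(nreps)) {
  df_sim[i, ] <- unlist(sim_tab(x = 5, y = 6))
}
```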
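We could use rerun from purrr (deprecated since purrr 1.0.0, where mapping over seq_len(nreps) is the suggested replacement). The number of repetitions goes first, followed by the expression to repeat:
library(purrr)
library(dplyr)
rerun(nreps, sim_tab(x = 5, y = 6)) %>%
bind_rows()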
My understanding of the difference between the merge() function (in base R) and the join() functions of plyr and dplyr is that join() is faster and more efficient when working with "large" data sets.
Is there some way to determine a threshold for when to use join() over merge(), without using a heuristic approach?
I am sure you will be hard pressed to find a "hard and fast" rule for when to switch from one function to the other. As others have mentioned, R has a set of tools to help you measure performance. object.size and system.time are two such functions; they report memory usage and run time, respectively. One general approach is to measure the two functions directly over an arbitrarily expanding data set. Below is one attempt at this. We create data frames with an 'id' column and a random set of numeric values, let them grow, and measure how size and time change. I'll use inner_join here as you mentioned dplyr. We measure time as "elapsed" time.
library(tidyverse)
set.seed(424)
#number of rows in a cycle
growth <- c(100,1000,10000,100000,1000000,5000000)
#empty lists
n <- 1
l1 <- c()
l2 <- c()
#test for inner join in dplyr
for(i in growth){
x <- data.frame("id" = 1:i, "value" = rnorm(i,0,1))
y <- data.frame("id" = 1:i, "value" = rnorm(i,0,1))
test <- inner_join(x,y, by = c('id' = 'id'))
l1[[n]] <- object.size(test)
print(system.time(test <- inner_join(x,y, by = c('id' = 'id')))[3])
l2[[n]] <- system.time(test <- inner_join(x,y, by = c('id' = 'id')))[3]
n <- n+1
}
#empty lists
n <- 1
l3 <- c()
l4 <- c()
#test for merge
for(i in growth){
x <- data.frame("id" = 1:i, "value" = rnorm(i,0,1))
y <- data.frame("id" = 1:i, "value" = rnorm(i,0,1))
test <- merge(x,y, by = c('id'))
l3[[n]] <- object.size(test)
# print(object.size(test))
print(system.time(test <- merge(x,y, by = c('id')))[3])
l4[[n]] <- system.time(test <- merge(x,y, by = c('id')))[3]
n <- n+1
}
#plotting output (some coercing may happen, so be it)
plot <- bind_rows(data.frame("size_bytes" = l3, "time_sec" = l4, "id" = "merge"),
data.frame("size_bytes" = l1, "time_sec" = l2, "id" = "inner_join"))
plot$size_MB <- plot$size_bytes/1000000
ggplot(plot, aes(x = size_MB, y =time_sec, color = id)) + geom_line()
merge seems to perform worse from the start, and the gap widens sharply once the result reaches roughly 20 MB. Is this the final word on the matter? No. But this kind of testing can give you an idea of how to choose a function.
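The timing part of that experiment can be condensed into a small helper, sketched here under the assumption that a single elapsed-time measurement per size is good enough for a rough comparison. The name time_join is hypothetical, and only merge() is timed so that the sketch runs in base R.

```r
set.seed(424)

# Time one call of a join function `f` on two n-row data frames.
# Returns elapsed seconds as a plain number.
time_join <- function(f, n) {
  x <- data.frame(id = 1:n, value = rnorm(n))
  y <- data.frame(id = 1:n, value = rnorm(n))
  unname(system.time(f(x, y))["elapsed"])
}

sizes <- c(100, 1000, 10000, 100000)

# Elapsed time of merge() at each size.
merge_times <- vapply(
  sizes,
  function(n) time_join(function(x, y) merge(x, y, by = "id"), n),
  numeric(1)
)

data.frame(rows = sizes, merge_sec = merge_times)
```

Passing `function(x, y) dplyr::inner_join(x, y, by = "id")` as `f` would time the dplyr join in the same way. Single measurements are noisy, so repeating each one a few times and taking the median gives a steadier picture.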
I have a dataset with 99 observations and I need to create boxplots for ones with a specific string in them. However, when I run this code I get 57 of the exact same plots from the original function instead of the loop. I was wondering how to prevent the plots from being overwritten but still create all 57. Here is the code and a picture of the plot.
Thanks!
Boxplot Format
#starting boxplot function
myboxplot <- function(mydata=ivf_dataset, myexposure =
"ART_CURRENT", myoutcome = "MEG3_DMR_mean")
{bp <- ggplot(ivf_dataset, aes(ART_CURRENT, MEG3_DMR_mean))
bp <- bp + geom_boxplot(aes(group =ART_CURRENT))
}
#pulling out variables needed for plots
outcomes = names(ivf_dataset)[grep("_DMR_", names(ivf_dataset),
ignore.case = T)]
#creating loop for 57 boxplots
allplots <- list()
for (i in seq_along(outcomes))
{
allplots[[i]]<- myboxplot (myexposure = "ART_CURRENT", myoutcome =
outcomes[i])
}
allplots
I recommend reading about standard and non-standard evaluation and how this works with the tidyverse. Here are some links
http://adv-r.had.co.nz/Functions.html#function-arguments
http://adv-r.had.co.nz/Computing-on-the-language.html
I also found this useful
https://rstudio-pubs-static.s3.amazonaws.com/97970_465837f898094848b293e3988a1328c6.html
Also, you need to produce an example so that it is possible to replicate your problem. Here is the data that I created.
df <- data.frame(label = rep(c("a","b","c"), 5),
x = rnorm(15),
y = rnorm(15),
x2 = rnorm(15, 10),
y2 = rnorm(15, 5))
I kept most of your code the same and only changed what needed to be changed.
myboxplot2 <- function(mydata = df, myexposure, myoutcome){
bp <- ggplot(mydata, aes_(as.name(myexposure), as.name(myoutcome))) +
geom_boxplot()
print(bp)
}
myboxplot2(myexposure = "label", myoutcome = "y")
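Because aes() uses non-standard evaluation, it cannot take column names that are stored as strings; aes_() accepts them once they are converted with as.name(). Note that aes_() is deprecated in recent ggplot2 releases, which offer the .data pronoun for the same purpose. Again, read the links above.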
Here I am getting all the columns that start with x. I am assuming that your code gets the columns that you want.
outcomes <- names(df)[grep("^x", names(df), ignore.case = TRUE)]
Here I am looping through in the same way that you did. I am only storing the plot object though.
allplots <- list()
for (i in seq_along(outcomes)){
allplots[[i]] <- myboxplot2(myexposure = "label", myoutcome = outcomes[i])
}
allplots
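For newer ggplot2 versions, the same idea can be sketched with the .data pronoun instead of aes_(). This is an alternative to the answer's function, not a copy of it: boxplot_by_name is a hypothetical name, and the function returns the plot rather than printing it, so the list holds one distinct plot per outcome.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(label = rep(c("a", "b", "c"), 5),
                 x = rnorm(15), y = rnorm(15),
                 x2 = rnorm(15, 10), y2 = rnorm(15, 5))

# Column names arrive as strings; .data[[...]] looks them up in the plot data.
boxplot_by_name <- function(data, exposure, outcome) {
  ggplot(data, aes(.data[[exposure]], .data[[outcome]])) +
    geom_boxplot()
}

# One plot per column whose name starts with "x".
outcomes <- grep("^x", names(df), value = TRUE)
allplots <- lapply(outcomes, function(o) boxplot_by_name(df, "label", o))
names(allplots) <- outcomes
```

The question's function drew the same plot 57 times because its body hard-coded ART_CURRENT and MEG3_DMR_mean and never used the myoutcome argument.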
Using dplyr: is there a way to loop over variables in a data frame and pass both the data and the variable name to a custom function?
I have a solution for this using mapply in base R. In the interest of learning I am wondering if there is a neat dplyr-way to achieve the same result.
Here is a small example, where each column in a data frame is transformed by adding a constant. The constant I wish to add is different for each variable, as listed in myconstants.
library(tidyverse)
mydata <- tibble(
a = 1:5,
b = 1:5,
c = 1:5
)
myconstants <- tibble(
a = 10,
b = 20
)
custom_function <- function (x, y, k) {
constant <- if (is.null(k[[y]])) 0 else k[[y]]
x + constant
}
# solution in base R
foo <- mapply(
custom_function,
mydata,
names(mydata),
MoreArgs = list(k = myconstants)
) %>%
as_tibble()
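One possible dplyr approach, offered as a sketch rather than a definitive answer, is across() combined with cur_column(), which reports the name of the column currently being processed. The helper add_constant below is a hypothetical variant of custom_function that checks names(k) explicitly instead of relying on a NULL lookup.

```r
library(dplyr)

mydata <- tibble(a = 1:5, b = 1:5, c = 1:5)
myconstants <- tibble(a = 10, b = 20)

# Hypothetical variant of custom_function(): columns without a constant get 0.
add_constant <- function(x, name, k) {
  constant <- if (name %in% names(k)) k[[name]] else 0
  x + constant
}

# across() applies the function to every column; cur_column() supplies the
# name of the column being processed.
foo <- mydata %>%
  mutate(across(everything(), ~ add_constant(.x, cur_column(), myconstants)))
```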
Neither this post nor this post apply to my case.
Assume:
set.seed(42)
x<-rep(c("A","B","C"), c(3,4,1))
y<-rep(c("V","W"),c(5,3))
z<-rnorm(8,-2,1)
df<-data.frame(x,y,z)
boxplot(z~x+y,df)
I want my plot to include only groups with more than, say, one element. This means that I want my plot to show only A.V, B.V and B.W.
Furthermore, since my graph has about 70 groups, I don't want to do it by writing a list by hand.
Thanks
You can create a new column ('xy') using paste, create a logical index using ave for the 'xy' groups that have more than one element, and then do the boxplot.
df1 <- df  # work on a copy of the question's data
df1$xy <- factor(paste(df1$x, df1$y, sep='.'))
index <- with(df1, ave(1:nrow(df1), xy, FUN=length))>1
boxplot(z~xy, droplevels(df1[index,]))
Or using ggplot
library(dplyr)
library(tidyr)
library(ggplot2)
df %>%
group_by(x,y) %>%
filter(n()>1) %>%
unite(xy, x,y) %>%
ggplot(., aes(xy, z))+
geom_boxplot()
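The same filtering can also be sketched with table() in base R, which makes the group counts visible before plotting. The threshold of one element is taken from the question; the names `counts`, `keep` and `sub` are only illustrative.

```r
set.seed(42)
df <- data.frame(x = rep(c("A", "B", "C"), c(3, 4, 1)),
                 y = rep(c("V", "W"), c(5, 3)),
                 z = rnorm(8, -2, 1))

# Combine x and y into one grouping factor, keeping only observed combinations.
df$xy <- interaction(df$x, df$y, drop = TRUE)

# Count the elements per group and keep groups with more than one.
counts <- table(df$xy)
keep <- names(counts)[counts > 1]

# Drop the small groups and their now-unused factor levels, then plot.
sub <- droplevels(df[df$xy %in% keep, ])
boxplot(z ~ xy, data = sub)
```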
You can check whether any bp$n are 0 and subset on that:
set.seed(42)
df <- data.frame(x = rep(c("A","B","C"), c(3,4,1)),
y = rep(c("V","W"),c(5,3)),
z = rnorm(8,-2,1))
bp <- boxplot(z ~ x + y, df, plot = FALSE)
wh <- which(bp$n == 0)
bp[] <- lapply(bp, function(x) if (length(x)) {
## `bp` contains a list of boxplot statistics, some vectors and
## some matrices which need to be indexed accordingly
if (!is.null(nrow(x))) x[, -wh] else x[-wh]
## some of `bp` will not be present depending on how you called
## `boxplot`, so if that is the case, you need to leave them alone
## to keep the original structure so `bxp` can deal with it
} else x)
## call `bxp` on the subset of `bp`
bxp(bp)
Or you can use any value you like:
wh <- which(bp$n <= 1)
bp[] <- lapply(bp, function(x) if (length(x)) {
if (!is.null(nrow(x))) x[, -wh] else x[-wh]
} else x)
bxp(bp)
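The subsetting can be wrapped in a helper, sketched below under the assumption that `bp` comes from boxplot(..., plot = FALSE). The name drop_small_groups is hypothetical. Unlike the lapply() version, it indexes by the groups to keep, so it also works when no group is removed, and it treats `out` and `group` separately because those are indexed by outlier rather than by group.

```r
set.seed(42)
df <- data.frame(x = rep(c("A", "B", "C"), c(3, 4, 1)),
                 y = rep(c("V", "W"), c(5, 3)),
                 z = rnorm(8, -2, 1))

# Keep only the groups of a boxplot() result that have at least `min_n` points.
drop_small_groups <- function(bp, min_n) {
  keep <- which(bp$n >= min_n)
  # Outliers are stored per point; keep those in surviving groups and
  # renumber their group index to match the new column positions.
  out_keep <- bp$group %in% keep
  bp$out <- bp$out[out_keep]
  bp$group <- match(bp$group[out_keep], keep)
  # Per-group components: matrices by column, vectors by element.
  bp$stats <- bp$stats[, keep, drop = FALSE]
  bp$conf <- bp$conf[, keep, drop = FALSE]
  bp$n <- bp$n[keep]
  bp$names <- bp$names[keep]
  bp
}

bp <- boxplot(z ~ x + y, df, plot = FALSE)
small <- drop_small_groups(bp, min_n = 2)
bxp(small)
```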