Plotting ggplot with for loop. How should variable be referenced? - r

I am trying to create scatterplots where a single plot displays the relationship between each predictor and a single outcome. However, the outcome is not displaying normally. I assume this is because the ggplot function is not recognizing the outcome as a column name. Any advice on how to properly refer to the outcome?
# data
data <- data.frame(o1=rnorm(100, 3, sd=1.2),
o2=rnorm(100, 3.5, sd=1.4),
p1=rnorm(100, 2, sd=1.9),
p2=rnorm(100, 1, sd=1.2),
p3=rnorm(100, 7, sd=1.6)
)
func <- function(data, outcomes, predictors) {
for(i in seq_along(outcomes)){
print(data %>% select(outcomes[[i]], predictors[[i]]) %>%
gather(var, value, -outcomes[[i]]) %>%
ggplot(aes(x=value, y=outcomes[[i]])) + geom_point() + facet_wrap(~var))
}
}
func(data, outcomes=c("o1", "o2"), predictors=list(c("p1", "p2"), c("p2","p3")))

You could try this without the loop:
data <- data.frame(o1=rnorm(100, 3, sd=1.2),
o2=rnorm(100, 3.5, sd=1.4),
p1=rnorm(100, 2, sd=1.9),
p2=rnorm(100, 1, sd=1.2),
p3=rnorm(100, 7, sd=1.6)
)
tidy_data <- data %>%
pivot_longer(c(p1:p3), names_to = "predictor", values_to = "x") %>%
pivot_longer(c(o1:o2), names_to = "outcome", values_to = "y")
ggplot(tidy_data) +
geom_point(aes(x = x,
y = y)) +
facet_grid(outcome~predictor)

In the function you can:
- create a list to hold all the plots;
- replace gather with pivot_longer, since gather is superseded;
- use the .data pronoun to refer to the y-axis variable by its name.
library(tidyverse)
func <- function(data, outcomes, predictors) {
plot_list <- vector('list', length(outcomes))
for(i in seq_along(outcomes)){
plot_list[[i]] <- data %>%
select(outcomes[i], predictors[[i]]) %>%
pivot_longer(cols = -outcomes[i]) %>%
ggplot(aes(x=value, y=.data[[outcomes[i]]])) +
geom_point() + facet_wrap(~name)
}
return(plot_list)
}
and call it as :
result <- func(data, outcomes=c("o1", "o2"),
predictors=list(c("p1", "p2"), c("p2","p3")))
where result is a list of plots; each individual plot can be accessed as result[[1]], result[[2]] and so on.
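
To see why the original loop misbehaved, it may help to compare the two mappings side by side. In this minimal sketch (the toy data frame d and the variable outcome are assumptions made up for illustration), aes(y = outcome) maps the literal string, so every point gets the same y value, whereas .data[[outcome]] looks the column up by name.

```r
library(ggplot2)

set.seed(1)
d <- data.frame(o1 = rnorm(10), p1 = rnorm(10))  # toy data, for illustration only
outcome <- "o1"

# Maps the constant string "o1": all points share one discrete y value.
p_string <- ggplot(d, aes(x = p1, y = outcome)) + geom_point()

# Looks up the column named by `outcome`: y takes the values of d$o1.
p_pronoun <- ggplot(d, aes(x = p1, y = .data[[outcome]])) + geom_point()
```

Printing p_string and p_pronoun one after the other should make the difference visible.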

Related

Extrapolate dataset with limited data points and add all values to new dataset

I have a dataset with very limited data points.
x<- c(4, 8, 13, 24)
y<- c(40, 37, 28, 20)
df<- data.frame(x,y)
Now I want to extrapolate this data, creating a dataset where the value of y will be given for every value (no decimals) of x between 1-100. x and y have a linear relationship.
Secondly, could this be done for multiple dataframes by using something like a loop?
Thank you!
This is a short snippet that does this:
linear_xy <- lm(y ~ x, data = df)
# df <- broom::augment(linear_xy, newdata = complete(df, x = 1:100)) # one way
df <- df %>% # another way
complete(x = 1:100) %>%
mutate(.fitted = predict(linear_xy, newdata = .))
ggplot(df, aes(x, y)) +
geom_line(aes(y = .fitted)) +
geom_point() +
ggpubr::theme_pubr()
This requires that you have the packages {tidyverse}, {broom}, and {ggpubr} installed.
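
If installing those packages is not an option, the same extrapolation can be sketched in base R. This is only an illustration: extrapolate_xy is a hypothetical helper name, the second data frame in the list is made up, and the sketch assumes each data frame has numeric x and y columns with a roughly linear relationship.

```r
# Hypothetical helper (not part of the answer above):
# fit y ~ x on the observed points and predict on an integer grid.
extrapolate_xy <- function(df, x_new = 1:100) {
  fit <- lm(y ~ x, data = df)
  data.frame(x = x_new,
             y = unname(predict(fit, newdata = data.frame(x = x_new))))
}

df <- data.frame(x = c(4, 8, 13, 24), y = c(40, 37, 28, 20))
full <- extrapolate_xy(df)

# For several data frames, keep them in a list and apply the helper to each.
# The second data frame is invented here purely to have more than one input.
dfs <- list(a = df, b = transform(df, y = y + 5))
full_list <- lapply(dfs, extrapolate_xy)
```

Here full has one row per integer x from 1 to 100, and full_list holds one such data frame per input.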
Second part
Assuming we want to do this with multiple data frames, we have to
restructure things a bit.
x <- c(4, 8, 13, 24)
y <- c(40, 37, 28, 20)
df <- tibble(x, y)
I don't have multiple data-frames (or tibbles), so I'll make this the
primary one, and make up a function (a factory) that yields data-frames, that are a bit different from the above df.
df_factory <- . %>%
mutate(x_new = x + sample.int(100, size = n()),
x = if_else(x_new >= 100, x, x_new),
y_new = y + rnorm(n(), mean = median(y), sd = sd(y)),
y = y_new,
y_new = NULL,
x_new = NULL)
Thus df_factory is a function of one argument, which must be a
data frame that has x and y columns:
df1 <- df_factory(df)
df2 <- df_factory(df)
df3 <- df_factory(df)
all_dfs <- list(df1, df2, df3)
all_dfs <- bind_rows(all_dfs, .id = "df_id")
Here we ensure that the relation to the original data-frame is preserved in the all_dfs data-frame via the new variable df_id.
Next we want to:
- Collapse the variables of each individual data frame into a list-column named data.
- For each row (see rowwise):
  - fit an "interpolating" linear model (a single fit, not a piecewise one);
  - predict from each of these linear_xy models (which are also stored in a list-column).
- Unnest it all back, so it can be fed into ggplot as one contiguous data frame.
all_dfs %>%
nest(data = c(x,y)) %>%
rowwise() %>%
mutate(linear_xy = list(lm(y ~ x, data = data)),
augment = list(broom::augment(linear_xy,
newdata = complete(data, x = 1:100)))) %>%
ungroup() %>%
select(-data, -linear_xy) %>%
unnest(augment) ->
all_dfs_predictions
Note: the -> at the end assigns the result of the pipe to all_dfs_predictions.
The group aesthetic tells ggplot to treat the rows as separate series via their
df_id. And for fun we make color and fill depend on df_id as well. In fact, I could have chosen something else for the color aesthetic, like "original df" vs. "others", or a threshold that distinguishes them. The group aesthetic would still tell ggplot to separate the rows by df_id.
ggplot(all_dfs_predictions, aes(x, y, group = df_id, color = df_id, fill = df_id)) +
geom_line(aes(y = .fitted)) +
geom_point() +
lims(x = c(1,100)) +
ggpubr::theme_pubr()

Repeat a ggplot for each value of a variable in the dataframe

I want to make a graph for each value of a variable in my dataframe, and then pass that value through to the graph as the title. I think the best way to do this is by using the apply() family of functions, but I'm a bit of a novice and can't figure out how to do that.
For example, say I have this dataframe:
df <- data.frame(month=c("Chair", "Table", "Rug", "Wardrobe", "Chair", "Desk", "Lamp", "Shelves", "Curtains", "Bed"),
cutlery=c("Fork", "Knife", "Spatula", "Spoon", "Corkscrew", "Fork", "Tongs", "Chopsticks", "Spoon", "Knife"),
type=c("bbb", "ccc", "aaa", "bbb", "ccc", "aaa", "bbb", "ccc", "aaa", "bbb"),
count=c(341, 527, 2674, 811, 1045, 4417, 1178, 1192, 4793, 916))
I could manually go through and select on the value of type doing this:
df %>%
filter(type=='aaa') %>%
ggplot() +
geom_col(aes(month, count)) +
labs(title = 'value of {{type}} being plotted')
df %>%
filter(type=='bbb') %>%
ggplot() +
geom_col(aes(month, count)) +
labs(title = 'value of {{type}} being plotted')
df %>%
filter(type=='ccc') %>%
ggplot() +
geom_col(aes(month, count)) +
labs(title = 'value of {{type}} being plotted')
But this quickly becomes a lot of code, with enough levels of type and assuming a fair amount of additional code for each plot. Let's also assume that I don't want to use facet_wrap(~type). As you can see, the values of the x variable vary quite a lot across values of type so facet_wrap() results in lots of missing spaces along the x-axis. Ideally i'd just create a function that takes as an input the x and y variables, and type and then filters on type, makes the plot, and extracts the value of type to use in the title.
Can anyone help me out with this?
Or you can create your custom function and then lapply over the levels of type
myplot <- function(var){
df %>%
filter(type==var) %>%
ggplot() +
geom_col(aes(month, count)) +
labs(title = paste0("value of ",var))
}
plot.list <- lapply(unique(df$type), myplot)
plot.list[[1]]
plot.list[[2]]
plot.list[[3]]
EDIT: to include the x variable as an argument:
myplot <-
function(var, xvar) {
df %>%
filter(type == var) %>%
ggplot() +
geom_col(aes(x={{xvar}}, count)) +
labs(title = paste0("value of ", var))
}
plot.list <- lapply(unique(df$type), myplot,xvar=cutlery)
You were missing the {{ }} (aka 'curly-curly') operator, which replaces the older approach of quoting with enquo and unquoting with !! (aka 'bang-bang'). Also, xvar has to be passed as an extra argument to lapply itself, which forwards it to myplot, rather than inside the function call.
You can split the data for each value of type and generate a list of plots.
library(tidyverse)
df %>%
group_split(type) %>%
map(~ggplot(.x) +
geom_col(aes(month, count)) +
labs(title = sprintf('value of %s being plotted',
first(.x$type)))) -> plot_list
plot_list[[1]] and plot_list[[2]] return the plots for the first two values of type (output images not shown).
From your statement "Ideally i'd just create a function that takes as an input the x and y variables, and type ..." I assume you specifically want to have a function that takes x and y as different variables (e.g. vectors) as inputs rather than taking the whole dataframe as an input. In this case, this function may suit your needs:
library(ggplot2)
library(dplyr)
myPlot_function = function(x, y, type) {
gg_list = list() # empty list to collect the plots
df = data.frame(x, y, type)
types = unique(df$type) # distinct values of type
for (i in seq_along(types)){ # loop over each distinct type
g = df %>%
filter(type == types[i]) %>% # compare with the i-th distinct value, not the i-th row
ggplot() +
geom_col(aes(x, y)) +
labs(title = paste0("Value of ", types[i], " being plotted"))
gg_list[[paste0("plot_", i)]] = g
}
return(gg_list)
}
And then, you can specify your x and y and type as vectors. For example:
some_plots = myPlot_function(x = df$month, y = df$count, type = df$type)
The function myPlot_function returns a list containing your plots. You can now either use some_plots$plot_1 or some_plots[[1]] to see your plot. For example:
some_plots$plot_1

Use scale_x_continuous with labeller function that also takes a data frame as an argument as well as default breaks

Here's a code block:
# scale the log of price per group (cut)
my_diamonds <- diamonds %>%
mutate(log_price = log(price)) %>%
group_by(cut) %>%
mutate(scaled_log_price = scale(log_price) %>% as.numeric) %>% # scale within each group as opposed to overall
nest() %>%
mutate(mean_log_price = map_dbl(data, ~ .x$log_price %>% mean)) %>%
mutate(sd_log_price = map_dbl(data, ~ .x$log_price %>% sd)) %>%
unnest(cols = data) %>%
select(cut, price, scaled_log_price:sd_log_price) %>%
ungroup
# for each cut, find the back transformed actual values (exp) of each unit of zscore between -3:3
for (i in -3:3) {
my_diamonds <- my_diamonds %>%
mutate(!! paste0('mean_', ifelse(i < 0 , 'less_', 'plus_'), abs(i), 'z') := map2(.x = mean_log_price, .y = sd_log_price, ~ (.x + (i * .y)) %>% exp) %>% unlist)
}
my_diamonds_split <- my_diamonds %>% group_split(cut)
split_names <- my_diamonds %>% mutate(cut = as.character(cut)) %>% group_keys(cut) %>% pull(cut)
names(my_diamonds_split) <- split_names
I now have a variable my_diamonds_split that is a list of data frames. I would like to loop over these data frames and each time create a new ggplot.
I can use a custom labeller function with a single df, but I don't know how to do this within a loop:
labeller <- function(x) {
paste0(x,"\n", scales::dollar(sd(ex_df$price) * x + mean(ex_df$price)))
}
ex_df <- my_diamonds_split$Ideal
ex_df %>%
ggplot(aes(x = scaled_log_price)) +
geom_density() +
scale_x_continuous(label = labeller, limits = c(-3, 3))
This creates a plot for the 'Ideal' cut of diamonds. Each x-axis tick also carries two labels: the z-score (-2, 0 and 2) and the corresponding raw dollar value (3.8K, 3.9K and 11.8K).
When I define the labeller function, I must specify the df to scale with. I tried using the dot in place of ex_df, hoping that ggplot would pick up the current df on each iteration:
labeller <- function(x) {
paste0(x,"\n", scales::dollar(sd(.$price) * x + mean(.$price)))
}
ex_df <- my_diamonds_split$Ideal
ex_df %>%
ggplot(aes(x = scaled_log_price)) +
geom_density() +
scale_x_continuous(label = labeller, limits = c(-3, 3))
Returns:
Error in is.data.frame(x) : object '.' not found
I then tried writing the function to accept an argument for the df to scale with:
labeller <- function(x, df) {
paste0(x,"\n", scales::dollar(sd(df$price) * x + mean(df$price)))
}
ex_df <- my_diamonds_split$Ideal
ex_df %>%
ggplot(aes(x = scaled_log_price)) +
geom_density() +
scale_x_continuous(label = labeller(df = ex_df), limits = c(-3, 3)) # because when it comes to running in real life, I will try something like labeller(df = my_diamonds_split[[i]])
Error in paste0(x, "\n", scales::dollar(sd(df$price) * x + mean(df$price))) :
argument "x" is missing, with no default
Bearing in mind that the scaling must be done per iteration, how could I loop over my_diamonds_split, and on each iteration generate a ggplot per above?
labeller <- function(x) {
# how can I make df variable
paste0(x,"\n", scales::dollar(sd(df$price) * x + mean(df$price)))
}
for (i in split_names) {
my_diamonds_split[[i]] %>%
ggplot(aes(x = scaled_log_price)) +
geom_density() +
scale_x_continuous(label = labeller, # <--- here, labeller must be defined with df$price except that will differ on each iteration
limits = c(-3, 3))
}
There's a hacky way to get this result in facets. Basically, after converting to z scores, you add different amounts (say, multiples of 1000) to each group's z scores. Then you set all the breaks to this collection of points and label them with pre-calculated labels.
library(ggplot2)
library(dplyr)
f <- function(x) {
y <- diamonds$price[diamonds$cut == x]
paste(seq(-3, 3), scales::dollar(round(mean(y) + seq(-3, 3) * sd(y))), sep = "\n")
}
breaks <- as.vector(sapply(levels(diamonds$cut), f))
diamonds %>%
group_by(cut) %>%
mutate(z = scale(price) + 3 + 1000 * as.numeric(cut)) %>%
ggplot(aes(z)) +
geom_point(aes(x = z - 2, y = 1), alpha = 0) +
geom_density() +
scale_x_continuous(breaks = as.vector(sapply(1:5 * 1000, "+", 0:6)),
labels = breaks) +
facet_wrap(vars(cut), scales = "free_x") +
theme(text = element_text(size = 16),
axis.text.x = element_text(size = 6))
You would have to increase the plot size to make the dollar values more visible of course.
Created on 2020-08-04 by the reprex package (v0.3.0)
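
If separate plots per cut are preferred over facets, one option is a function factory: a function that takes a data frame and returns a labelling function bound to it. The sketch below is an assumption-laden illustration, not the answerer's method: make_labeller is a hypothetical name, split() stands in for my_diamonds_split, and the label formula is copied unchanged from the question.

```r
library(ggplot2)

# Hypothetical factory: returns a labeller that remembers the mean and sd
# of the data frame it was created with.
make_labeller <- function(df) {
  m <- mean(df$price)
  s <- sd(df$price)
  function(x) paste0(x, "\n", scales::dollar(s * x + m))
}

# Stand-in for my_diamonds_split: one data frame per cut.
diamonds_split <- split(diamonds, diamonds$cut)

plots <- lapply(diamonds_split, function(d) {
  # scale the log of price within this cut
  d$scaled_log_price <- as.numeric(scale(log(d$price)))
  ggplot(d, aes(x = scaled_log_price)) +
    geom_density() +
    scale_x_continuous(labels = make_labeller(d), limits = c(-3, 3))
})
```

Each element of plots (for example plots$Ideal) then uses labels computed from its own cut, because the mean and sd are captured when the labeller is created.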

How to plot a(n unknown) number of data series as geom_line in same chart

My first Q here, so please go lightly if I'm out of step anywhere.
I'm trying to code R to produce a single chart to contain a number of data series lines. The number of data series may vary but will be provided in the data frame. I have tried to rearrange another thread's content to print the geom_line , but not successfully.
The logic is:
#desire to replace loop of 1:5 with ncol(df)
print(ggplot(df,aes(x=time))
for (i in 1:5) {
print (+ geom_line(aes(y=df[,i]))
}
#functioning geom point loops ggplot production:
for (i in 1:5) {
print(ggplot(df,aes(x=time,y=df[,i]))+geom_point())
}
#functioning multi-line ggplot where n is explicit:
ggplot(data=df, aes(x=time), group=1) +
geom_line(aes(y=df$`3`))+
geom_line(aes(y=df$`4`))
The functioning example code produces n number of point charts, 5 in this case. I would like just one chart to contain n line series.
This may be similar to How to plot n dimensional matrix? for which there are currently no relevant answers
Any contributions much appreciated, thanks
You can use gather from the tidyverse to do that.
As you didn't supply sample data, I used mtcars.
I created two data frames, one with 3 columns besides mpg and one with 9. In each of them I plotted all of the variables against mpg.
library(tidyverse)
df3Columns <- mtcars[, 1:4]
df9Columns <- mtcars[, 1:10]
df3Columns %>%
gather(var, value, -mpg) %>%
ggplot(aes(mpg, value, group = var, color = var)) +
geom_line()
df9Columns %>%
gather(var, value, -mpg) %>%
ggplot(aes(mpg, value, group = var, color = var)) +
geom_line()
Edit - using the sample data in comments.
library(tidyverse)
df %>%
rownames_to_column("time") %>%
gather(var, value, -time) %>%
ggplot(aes(time, value, group = var, color = var)) +
geom_line()
Sample data:
df <- structure(list("39083" = c(96, 100, 100), "39090" = c(99, 100, 100), "39097" = c(99, 100, 100)), row.names = 3:5, class = "data.frame")
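
Since gather is superseded in current tidyr, the same reshaping can be sketched with pivot_longer. This is a hedged variant of the answer above, not a replacement for it: the names wide, long, var and value are chosen here for illustration, and time is taken from the row names as in the snippet above.

```r
library(tidyr)
library(ggplot2)

df <- structure(list("39083" = c(96, 100, 100), "39090" = c(99, 100, 100),
                     "39097" = c(99, 100, 100)),
                row.names = 3:5, class = "data.frame")

# Work on a copy and turn the row names into an explicit time column.
wide <- df
wide$time <- as.integer(rownames(df))

# One row per (time, series) pair, however many series columns there are.
long <- pivot_longer(wide, cols = -time, names_to = "var", values_to = "value")

p <- ggplot(long, aes(time, value, group = var, color = var)) +
  geom_line()
```

Because the number of series is never written into the code, the same lines should work unchanged when df gains or loses columns.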
To strictly answer your question, you can simply store your ggplot in a variable and add the geom_line one by one:
df <- structure(list("39083" = c(96, 100, 100), "39090" = c(99, 100, 100), "39097" = c(99, 100, 100)), row.names = 3:5, class = "data.frame")
g <- ggplot(df, aes(x = 1:nrow(df)))
for (i in colnames(df))
{
g <- g + geom_line(y = df[,i])
}
g <- g + scale_y_continuous(limits = c(min(df), max(df)))
print(g)
However, this is not a very convenient solution. I would highly recommend reshaping your data frame into the long format that ggplot works best with.
df.ultimate <- data.frame(time = numeric(), value = numeric(), group = character())
for (i in colnames(df))
{
df.ultimate <- rbind(df.ultimate, data.frame(time = 1:nrow(df), value = df[, i], group = i))
}
g <- ggplot(df.ultimate, aes(x = time, y = value, color = group))
g <- g + geom_line()
print(g)
A one-line solution:
ggplot(data.frame(time = rep(1:nrow(df), ncol(df)),
value = as.vector(as.matrix(df)),
group = rep(colnames(df), each = nrow(df))),
aes(x = time, y = value, color = group)) + geom_line()

Variable column names in the pipe

I have the following code:
install.packages('tidyverse')
library(tidyverse)
x <- 1:10
y <- x^2
df <- data.frame(first_column = x, second_column = y)
tibble <- as_tibble(df)
tibble %>%
filter(second_column != 16) %>%
ggplot(aes(x = first_column, y = second_column)) +
geom_line()
Now I would like to create the following function
test <- function(colname) {
tibble %>%
filter(colname != 16) %>%
ggplot(aes(x = first_column, y = colname)) +
geom_line()
}
test('second_column')
But running it creates a vertical line instead of the function. How can I make this function work?
Edit: My focus is on getting the pipe to work, not ggplot.
In order to pass character strings for variable names, you have to use the standard evaluation version of each function. It is aes_string for aes, and filter_ for filter. See the NSE vignette for more details.
Your function could look like:
test <- function(colname) {
tibble %>%
filter_(.dots= paste0(colname, "!= 16")) %>%
ggplot(aes_string(x = "first_column", y = colname)) +
geom_line()
}
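
Note that filter_ and aes_string are deprecated in current dplyr and ggplot2. A minimal sketch of the same idea with the .data pronoun follows; it differs from the function above in that the data frame is passed in as an argument (named data here, an assumption for illustration) instead of being read from the global tibble.

```r
library(dplyr)
library(ggplot2)

df <- tibble(first_column = 1:10, second_column = (1:10)^2)

# colname is a character string; .data[[colname]] looks the column up by name
# both inside filter() and inside aes().
test <- function(data, colname) {
  data %>%
    filter(.data[[colname]] != 16) %>%
    ggplot(aes(x = first_column, y = .data[[colname]])) +
    geom_line()
}

p <- test(df, "second_column")
```

Printing p should draw the curve with the point where second_column equals 16 left out.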
