How to index a dataframe for using ggplot in a loop - r

I am struggling with creating multiple ggplots using a loop.
I use data in the following format:
a <- c(1,2,3,4)
b <- c(5,6,7,8)
c <- c(9,10,11,12)
d <- c(13,14,15,16)
time <- c(1,2,3,4)
data <- cbind(a,b,c,d,time)
What I want to create is a list of plots that plot one of the letters against the variable time.
Which I tried in the following way:
library(ggplot2)
library(gridExtra)
plots <- list()
for (i in 1:4){
plots[[i]] <- ggplot() + geom_line(data = data, aes(x = time, y = data[,i]))
}
grid.arrange(plots[[1]], plots[[2]], plots[[3]], plots[[4]])
This results in four times the fourth plot. How do I index this correctly in a way that creates the four intended plots?

(Up front: the reason that your plots are all identical is due to ggplot's "lazy" evaluation of code. See my #2 below, where I identify that the data[,i] is evaluated when you try to plot the data, at which point i is 4, the last pass in the for loop.)
It's generally preferred/recommended to use data.frames instead of matrices or vectors (as you're doing here). It gives a bit more power and control.
data <- data.frame(a,b,c,d,time)
Also, I tend to prefer lapply to for-loops and lists, for various (some subjective) reasons. Ultimately, the issue you're having is that ggplot2 is evaluating the data lazily, so plots is a list with four plots that make reference to i ... and that is realized when you try to plot them all, at which point i is 4 (from the last pass through the loop). One benefit of using lapply is that the i referenced is a local-only (inside of the anon-func) version of i that is preserved as you would expect.
plots <- lapply(names(data)[1:4],
function(nm) ggplot(data, aes(x = time, y = .data[[nm]])) + geom_line())
gridExtra::grid.arrange(plots[[1]], plots[[2]])
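If you would rather keep the for loop, one option is to inject the current column name into the aesthetic with `!!`, so nothing refers back to the loop variable when the plot is finally drawn. This is only a sketch: it assumes ggplot2 3.0.0 or later (where aes() supports quasiquotation) and uses rlang::sym(), which is not part of the original code.

```r
library(ggplot2)

# Same values as the question's data, built as a data.frame
data <- data.frame(a = 1:4, b = 5:8, c = 9:12, d = 13:16, time = 1:4)

plots <- list()
for (nm in c("a", "b", "c", "d")) {
  # `!!` substitutes the column symbol immediately, so the stored plot
  # no longer depends on the value of `nm` at drawing time
  plots[[nm]] <- ggplot(data, aes(x = time, y = !!rlang::sym(nm))) +
    geom_line()
}
```

Each element of plots now maps y to its own column, and the list is named by column (plots[["a"]], plots[["b"]], and so on).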
I also prefer patchwork to gridExtra, mostly because it makes more-customized layouts a bit more intuitive, plus adds functionality such as axis-alignment, shared legends, shared titles, etc. (None of those other features are demonstrated here.)
library(patchwork)
plots[[1]] / plots[[2]] # same plot
plots[[1]] + plots[[2]] # side-by-side instead of top/bottom
(plots[[1]] + plots[[2]]) / (plots[[3]] + plots[[4]]) # grid
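When the plots already live in a list, indexing each element by hand is not necessary. As a hedged sketch, patchwork::wrap_plots() accepts the list directly; the ylab(nm) call is my addition so that each panel is labelled with its column name.

```r
library(ggplot2)
library(patchwork)

data <- data.frame(a = 1:4, b = 5:8, c = 9:12, d = 13:16, time = 1:4)

plots <- lapply(names(data)[1:4], function(nm) {
  ggplot(data, aes(x = time, y = .data[[nm]])) +
    geom_line() +
    ylab(nm)  # label the axis with the column name
})

# Arrange every plot in the list on a 2 x 2 grid
combined <- wrap_plots(plots, ncol = 2)
combined
```

With gridExtra, gridExtra::grid.arrange(grobs = plots, ncol = 2) should give a comparable layout.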
Ultimately, though, I suggest that facets can be useful and very powerful. For this, we need to melt/pivot the data into a "long format" so that the column names a-d are actually in one column.
reshape2::melt(data, id.vars = "time") |>
ggplot(aes(time, value)) +
geom_line() +
facet_grid(variable ~ ., scales = "free_y")
I assumed the preference for independent (free) y-scales, ergo the scales="free_y". Try it without if you want to see the options. (There are also scales="free_x" and scales="free" (both).)
To see what I mean by "long" format:
reshape2::melt(data, id.vars = "time")
# time variable value
# 1 1 a 1
# 2 2 a 2
# 3 3 a 3
# 4 4 a 4
# 5 1 b 5
# 6 2 b 6
# 7 3 b 7
# 8 4 b 8
# 9 1 c 9
# 10 2 c 10
# 11 3 c 11
# 12 4 c 12
# 13 1 d 13
# 14 2 d 14
# 15 3 d 15
# 16 4 d 16
This can also be done with tidyr::pivot_longer(data, -time), although the column holding the variable names is then called name rather than variable. For this use, neither reshape2::melt nor tidyr::pivot_longer has an advantage over the other; the latter supports significantly more complex pivoting, which is not relevant with this data.
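For reference, a minimal sketch of the tidyr variant of the faceted plot. The names_to = "variable" argument is my choice here so that the faceting formula can stay the same as in the melt() version.

```r
library(ggplot2)

data <- data.frame(a = 1:4, b = 5:8, c = 9:12, d = 13:16, time = 1:4)

# names_to = "variable" keeps the column name used by reshape2::melt()
long <- tidyr::pivot_longer(data, -time, names_to = "variable")

ggplot(long, aes(time, value)) +
  geom_line() +
  facet_grid(variable ~ ., scales = "free_y")
```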
Data
data <- structure(list(a = c(1, 2, 3, 4), b = c(5, 6, 7, 8), c = c(9, 10, 11, 12), d = c(13, 14, 15, 16), time = c(1, 2, 3, 4)), class = "data.frame", row.names = c(NA, -4L))

Related

Identify and plot datapoints surrounded by NAs

I am using ggplot2 and geom_line() to make a line plot of a large number of time series. The dataset has a high number of missing values, and I am generally happy that lines are not drawn across missing segments, as this would look awkward.
My problem is that single non-NA datapoints surrounded by NAs (or points at the beginning/end of the series with an NA on the other side) are not plotted. A potential solution would be adding geom_point() for all observations, but this increases my filesize tenfold, and makes the plot harder to read.
Thus, I want to identify only those datapoints that do not get shown with geom_line() and add points only for those. Is there a straightforward way to identify these points?
My data is currently in long format, and the following MWE can serve as an illustration. I want to identify rows 1 and 7 so that I can plot them:
library(ggplot2)
set.seed(1)
dat <- data.frame(time=rep(1:5,2),country=rep(1:2,each=5),value=rnorm(10))
dat[c(2,6,8),3] <- NA
ggplot(dat) + geom_line(aes(time,value,group=country))
> dat
time country value
1 1 1 -0.6264538
2 2 1 NA
3 3 1 -0.8356286
4 4 1 1.5952808
5 5 1 0.3295078
6 1 2 NA
7 2 2 0.4874291
8 3 2 NA
9 4 2 0.5757814
10 5 2 -0.3053884
You can use the zoo::rollapply function to create a new column that keeps only the values surrounded by NAs. Then you can simply plot those points. For example:
library(zoo)
library(ggplot2)
foo <- data.frame(time =c(1:11), value = c(1 ,NA, 3, 4, 5, NA, 2, NA, 4, 5, NA))
# Perform sliding window processing
val <- c(NA, NA, foo$value, NA, NA) # Add NA at the ends of vector
val <- rollapply(val, width = 3, FUN = function(x){
if (all(is.na(x) == c(TRUE, FALSE, TRUE))){
return(x[2])
} else {
return(NA)
}
})
foo$val_clean <- val[c(-1, -length(val))] # Remove first and last values
foo$val_clean
ggplot(foo) + geom_line(aes(time, value)) + geom_point(aes(time, val_clean))
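The same idea can be applied to the long-format data from the question, where the check has to be done per country. This is a sketch using dplyr::lag() and dplyr::lead() instead of zoo (a swapped-in technique, not the method above); the column name isolated is made up for illustration.

```r
library(dplyr)
library(ggplot2)

set.seed(1)
dat <- data.frame(time = rep(1:5, 2), country = rep(1:2, each = 5), value = rnorm(10))
dat[c(2, 6, 8), 3] <- NA

flagged <- dat %>%
  group_by(country) %>%
  arrange(time, .by_group = TRUE) %>%
  # lag()/lead() return NA at the series boundaries, so points at the
  # start or end with an NA on the other side are flagged as well
  mutate(isolated = !is.na(value) &
           is.na(dplyr::lag(value)) &
           is.na(dplyr::lead(value))) %>%
  ungroup()

which(flagged$isolated)  # rows 1 and 7

ggplot(flagged, aes(time, value, group = country)) +
  geom_line() +
  geom_point(data = filter(flagged, isolated))
```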
Do you mean something like this?
library(tidyverse)
dat %>%
na.omit() %>%
ggplot() +
geom_line(aes(time, value, group = country))

Plot every 10 datapoint in a vector by different color in R

I have a one-dimensional vector in R that I would like to plot so that every 10 data points have a different color. How do I do this in R with the normal plot function, with ggplot, and with plotly?
In base R you can try this.
I changed the data a little compared to the other answer.
# The data
set.seed(2017);
df <- data.frame(x = 1:100, y = 0.001 * 1:100 + runif(100));
nCol <- 10;
df$col <- rep(1:10, each = 10);
# base R plot
plot(df[1:2]) #add `type="n"` to remove the points
sapply(1:nrow(df), function(x) lines(df[x+0:1,1:2], col=df$col[x], lwd=2))
Because lines() draws one connected line and does not apply a separate col value to each segment, you have to loop over the rows (here with sapply) and draw the segments one at a time.
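As a possible alternative to the row-wise loop, segments() is vectorised over its col argument, so all pieces can be drawn in one call. A minimal sketch with the same data:

```r
set.seed(2017)
df <- data.frame(x = 1:100, y = 0.001 * 1:100 + runif(100))
df$col <- rep(1:10, each = 10)

n <- nrow(df)

# Set up an empty plot, then draw one segment per consecutive pair of
# points, coloured by the group of the segment's starting point
plot(df$x, df$y, type = "n", xlab = "x", ylab = "y")
segments(x0 = df$x[-n], y0 = df$y[-n],
         x1 = df$x[-1], y1 = df$y[-1],
         col = df$col[-n], lwd = 2)
```

Note that numeric colours index palette(), which has eight entries by default, so groups 9 and 10 reuse the first two colours unless you set a longer palette.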
Here is a ggplot solution; unfortunately you don't provide sample data, so I'm generating some random data.
# Sample data
set.seed(2017);
df <- data.frame(x = 1:100, y = 0.001 * 1:100 + runif(100));
# The number of different colours
nCol <- 5;
df$col <- rep(1:nCol, each = 10);
# ggplot
library(tidyverse);
ggplot(df, aes(x = x, y = y, col = as.factor(col), group = 1)) +
geom_line();
For plotly just wrap the ggplot call within ggplotly.
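A short sketch of that wrapping step, assuming the plotly package is installed. The names p and interactive_plot are only illustrative.

```r
library(ggplot2)

set.seed(2017)
df <- data.frame(x = 1:100, y = 0.001 * 1:100 + runif(100))
# 5 colours in blocks of 10, repeated to fill all 100 rows
df$col <- rep(1:5, each = 10, length.out = 100)

p <- ggplot(df, aes(x = x, y = y, col = as.factor(col), group = 1)) +
  geom_line()

# Convert the ggplot object to an interactive plotly widget;
# print interactive_plot in an interactive session to view it
interactive_plot <- plotly::ggplotly(p)
```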
This answer doesn't show you how to do it in a specific plotting package, but instead shows how to assign random colors to your data according to your specifications. The benefit of this approach is that it gives you control over which colors you use if you choose.
library(dplyr) # assumed okay given ggplot2 mention
df = tibble(v1=rnorm(100)) # tibble() replaces the deprecated data_frame()
n = nrow(df)
df$group = (1:n - (1:n %% -10)) / 10
colors = sample(colors(), max(df$group), replace=FALSE)
df$color = colors[df$group]
df %>% group_by(group) %>% filter(row_number() <= 2) %>% ungroup()
# A tibble: 20 x 3
v1 group color
<dbl> <dbl> <chr>
1 -0.6941434087 1 lightsteelblue2
2 -0.4559695973 1 lightsteelblue2
3 0.7567737300 2 darkgoldenrod2
4 0.9478937275 2 darkgoldenrod2
5 -1.2358486079 3 slategray3
6 -0.7068140340 3 slategray3
7 1.3625895045 4 cornsilk
8 -2.0416315923 4 cornsilk
9 -0.6273386846 5 darkgoldenrod4
10 -0.5884521130 5 darkgoldenrod4
11 0.0645078975 6 antiquewhite1
12 1.3176727205 6 antiquewhite1
13 -1.9082708004 7 khaki
14 0.2898018693 7 khaki
15 0.7276799336 8 greenyellow
16 0.2601492048 8 greenyellow
17 -0.0514811315 9 seagreen1
18 0.8122600269 9 seagreen1
19 0.0004641533 10 darkseagreen4
20 -0.9032770589 10 darkseagreen4
The above code first creates a fake dataset with 100 rows of data, and sets n equal to 100. df$group is set by taking the row numbers (1:n) and performing a rather convoluted evaluation to get a vector of numbers like c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, ..., 10). It then samples the colors available in base R, returning as many colors as there are groups (max(df$group)), and uses the group variable to index the color vector to get the color for each row. The final output is just the first two rows of each group, to show that the colors are the same within a group but different between groups. The color column can now be passed in as a variable in your various plotting environments.
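If the modulo expression feels too convoluted, the same grouping can probably be written more directly with ceiling(). A quick sketch comparing the two (group_original and group_simple are illustrative names):

```r
n <- 100

# Expression used above: %% with a negative divisor rounds up to the
# next multiple of 10
group_original <- (1:n - (1:n %% -10)) / 10

# Equivalent and easier to read: block number of each row
group_simple <- ceiling(seq_len(n) / 10)

all(group_original == group_simple)  # TRUE
```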

How do you store evaluated ggplots in R from a list of lists by string?

I have made a series of lists that contain ggplots. I would like to evaluate the objects ahead of time so that the plotting cost is paid early. I have gathered the names of the variables I would like to evaluate in a character vector. Additionally, I want to keep the same variable names as before.
The solution I tried was to lapply the eval(as.symbol("myvarstring")). To my knowledge, it evaluates the variable without storing the evaluated expression.
Adding as.symbol("myvarstring") <- eval(as.symbol("myvarstring")) does not work for me.
Below is a minimal reproducible example of my failed solution.
library(tidyverse)
tbl <- tibble(
x = 1:10,
y = 1:10
)
g <- ggplot(tbl, aes(x, y)) + geom_point()
my_plot_list1 <- list(g,g,g,g,g,g)
my_plot_list2 <- list(g,g,g,g,g,g)
my_plot_list3 <- list(g,g,g,g,g,g)
my_vars <- c(
"my_plot_list1",
"my_plot_list2",
"my_plot_list3"
)
lapply(my_vars, FUN = function(x) {as.symbol(x) <- eval(as.symbol(x))})
How would you accomplish this task?
Thank you
EDIT:
These graphs will ultimately be displayed through an rmarkdown script. The graphs will be loaded in the rscript. My graphs take an enormous amount of time to plot. If I could save an environment with "rendered" graphs, it would shorten the rmarkdown runtime, which is the ultimate goal.
Why don't you just store the lists in a list, rather than relying on tricks to get them from the global environment?
library(tidyverse)
tbl <- tibble(
x = 1:10,
y = 1:10
)
g <- ggplot(tbl, aes(x, y)) + geom_point()
my_plot_list1 <- list(g,g,g,g,g,g)
my_plot_list2 <- list(g,g,g,g,g,g)
my_plot_list3 <- list(g,g,g,g,g,g)
my_vars <- list(
my_plot_list1,
my_plot_list2,
my_plot_list3
)
lapply(my_vars, function(x) lapply(x, function(y) y))
If you want to ensure that the plots print (eg, if you were to call this code in a function or script) then replace the inner function(y) y with function(y) print(y)
EDIT: I believe I misunderstood.
If you want to assign variables to a programmatically generated name, you would do:
x <- "mygeneratedname"
assign(x, g, envir = .GlobalEnv)
The get function in base R will retrieve the object from the character string. For example:
get("tbl")
# # A tibble: 10 x 2
# x y
# <int> <int>
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 4
# 5 5 5
# 6 6 6
# 7 7 7
# 8 8 8
# 9 9 9
# 10 10 10
So in your example:
lapply(my_vars, FUN = function(x) { get(x)})
should work.
I believe there are better approaches, depending on what you want to do with the plots next. Consider whether this is the best way to handle the data. Can a list of lists work? Could you store the lists in a vector?
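One way to pay part of the cost up front, sketched under assumptions: ggplot2::ggplotGrob() runs the build step and returns a drawable gtable, and mget() fetches the lists by name while keeping those names. The file name below is hypothetical, and whether this actually shortens the rmarkdown run depends on where the time is spent (building versus drawing), which I have not measured.

```r
library(ggplot2)

tbl <- data.frame(x = 1:10, y = 1:10)
g <- ggplot(tbl, aes(x, y)) + geom_point()
my_plot_list1 <- list(g, g)
my_plot_list2 <- list(g, g)
my_vars <- c("my_plot_list1", "my_plot_list2")

# mget() returns a list whose names are the variable names
plot_lists <- mget(my_vars)

# Build every plot now; each element becomes a gtable
built <- lapply(plot_lists, function(plot_list) lapply(plot_list, ggplotGrob))

# Hypothetical file name; the rmarkdown document could readRDS() this
saveRDS(built, file.path(tempdir(), "built_plots.rds"))

# Drawing a pre-built plot later
grid::grid.newpage()
grid::grid.draw(built$my_plot_list1[[1]])
```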

Histogram of specific rows

I have a csv that looks similar to the following.
Library Parameter1 Parameter2 Parameter3
A 3 6 4
A 4 6 3
A 7 8 9
B 2 10 7
B 4 4 5
B 3 5 4
C 4 6 4
C 6 3 12
C 5 6 8
I would like to be able to create a function to create a histogram for a specific library and parameter e.g., histogram of the frequency of Parameter 2 in Library B.
I kind of know how to use the histogram function here's what I have right now.
### x = "Parameter"
histogram <- function(x) {hist(filename[[x]], main = "Normalized",
xlab = "x", ylab = "Frequency", breaks = ceiling(sqrt(nrow(filename))))}
Edit: This is the actual data frame I am working with. It is quite large so I couldn't put the dput in here???
https://www.dropbox.com/s/2ivbhc7wyqms0fy/All-Norm.csv?dl=0
(Sorry if I've done anything incorrectly, still very new.)
One solution would be to subset your data first:
sub <- subset(yourdata, Library == "B")$Parameter2
hist(sub) # sub is already a numeric vector, so call hist() directly
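The subsetting step can also be folded into a function that takes the library and the parameter as arguments. A minimal sketch using the sample rows from the question; the function name library_histogram and its argument names are made up for illustration.

```r
dat <- data.frame(Library = rep(c("A", "B", "C"), each = 3),
                  Parameter1 = c(3, 4, 7, 2, 4, 3, 4, 6, 5),
                  Parameter2 = c(6, 6, 8, 10, 4, 5, 6, 3, 6),
                  Parameter3 = c(4, 3, 9, 7, 5, 4, 4, 12, 8))

library_histogram <- function(df, lib, parameter) {
  # Keep only the rows for the requested library, then pull one column
  values <- df[df$Library == lib, parameter]
  hist(values,
       main = paste("Library", lib, "-", parameter),
       xlab = parameter, ylab = "Frequency",
       breaks = ceiling(sqrt(length(values))))
}

# Histogram of Parameter2 in Library B; hist() returns its bins invisibly
h <- library_histogram(dat, "B", "Parameter2")
```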
This is a really simple ggplot I just put together
Code:
dat <- data.frame(Library = c("A","A","A","B","B","B","C","C","C"),
Parameter1=c(3,4,7,2,4,3,4,6,5),
Parameter2 = c(6,6,8,10,4,5,6,3,6),
Parameter3=c(4,3,9,7,5,4,4,12,8))
dat <- data.table::melt(data.table::as.data.table(dat), id.vars = "Library") # data.table::melt expects a data.table
library(ggplot2)
ggplot(dat,aes(x = value)) + geom_histogram() + facet_grid(Library~variable)
Output:
Obviously this could be cleaned up a lot, but this is a place to start.

Separate charts for element in list using ggplot2 and lapply

Okay guys I need a hand with using ggplot2 in a loop over a list (with lapply) to obtain a separate chart for each element of the list.
I'm new to R, so forgive the noob-ness.
Say I have a dataframe as such:
df <- cbind.data.frame(Time = c(1,2,3,4,1,2,3,4),
Person = c("A","A","A","A","B","B","B","B"),
Quantity = c(1,4,6,8,1,6,2,10))
df <- data.table(df)
> df
Time Person Quantity
1: 1 A 1
2: 2 A 4
3: 3 A 6
4: 4 A 8
5: 1 B 1
6: 2 B 6
7: 3 B 2
8: 4 B 10
I want to produce a chart for person A and person B separately.
At the moment I have my function set up like this:
Persons = c("A","B")
PersonList = as.list(Persons)
MyFunction <- function(x){
SubsetPersons = Persons[!(Persons %in% x)]
df <- df[!(df$Person %in% SubsetPersons)]
g <- ggplot(data=df, aes(x=Time, y=Quantity))
g <- g + geom_line()
print(g)
}
Results <- lapply(Persons, MyFunction)
But I'm not sure how to save the charts with different names corresponding to the list elements?
NOTE: I know this function may seem an odd way to solve this problem, but for the larger more complex problem I have at hand it is required.
I am simply trying to figure out how to save different names for the charts in the list!
Thanks in advance!
Persons = c("A","B")
MyFunction <- function(x){
dfs <- df[df$Person == x,]
g <- ggplot(data=dfs, aes(x=Time, y=Quantity))
g <- g + geom_line()
#have added extra bracket after ".PNG"
ggsave(paste0("plot_for_person", x, ".PNG"), g)
print(g)
return(g)
}
Results <- lapply(Persons, MyFunction)
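To make the elements of the returned list carry names matching the persons, the result of lapply can be wrapped in setNames(). A small sketch, leaving out the ggsave() step; the ggtitle() call is my addition.

```r
library(ggplot2)

df <- data.frame(Time = c(1, 2, 3, 4, 1, 2, 3, 4),
                 Person = c("A", "A", "A", "A", "B", "B", "B", "B"),
                 Quantity = c(1, 4, 6, 8, 1, 6, 2, 10))
Persons <- c("A", "B")

MyFunction <- function(x) {
  dfs <- df[df$Person == x, ]
  ggplot(dfs, aes(x = Time, y = Quantity)) +
    geom_line() +
    ggtitle(paste("Person", x))
}

# Name each list element after the person it belongs to
Results <- setNames(lapply(Persons, MyFunction), Persons)

Results$A  # the chart for person A
```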
