Setting order of scale_x_discrete when there are repeated levels

I want to make a standard geom_point plot using ggplot. Some of the x values are repeated, and I want them to appear again on the x axis. I tried scale_x_discrete, following the example in change-the-order-of-a-discrete-x-scale, but I was not able to get the result I want.
Here is my example
x = c(seq(1,4),seq(2,4))
y= (seq(1,7))
ex=rep(c("ex1","ex2"),c(4,3))
df <- data.frame(x,y,ex)
x y ex
1 1 1 ex1
2 2 2 ex1
3 3 3 ex1
4 4 4 ex1
5 2 5 ex2
6 3 6 ex2
7 4 7 ex2
ggplot(df, aes(x=factor(x),y=y)) +
geom_point(size=4) +
scale_x_discrete(limits=c(seq(1,4),seq(2,4)))
With a discrete x that has repeated values, the repeated values are not shown on the x axis. How can I repeat the values 2, 3, 4 after 1, 2, 3, 4 on the x axis?
Thanks

Because what you want on the x-axis is not x itself but the combination of x and the repeat (ex), a natural approach is to map that combination to aes(x) and then relabel the axis with the original x values.
ggplot(df, aes(x = interaction(x, ex), y = y)) +
geom_point(size=4) +
scale_x_discrete(labels = df$x)
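The labels = df$x shortcut works here because the rows of df happen to be in the same order as the levels that remain on the axis. A minimal sketch that derives the labels from the levels themselves, so that row order does not matter, might look like the following. The column name key and the helper strip_suffix are made up for illustration, and the plotting part only runs if ggplot2 is installed.

```r
df <- data.frame(x = c(1:4, 2:4), y = 1:7, ex = rep(c("ex1", "ex2"), c(4, 3)))

# One level per (x, ex) pair that occurs in the data; drop = TRUE removes
# combinations such as "1.ex2" that never appear.
df$key <- interaction(df$x, df$ex, drop = TRUE)

# Hypothetical helper: turn a level such as "2.ex2" back into "2" for display.
strip_suffix <- function(lv) sub("\\.ex[0-9]+$", "", lv)

levels(df$key)
strip_suffix(levels(df$key))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(x = key, y = y)) +
    geom_point(size = 4) +
    scale_x_discrete(labels = strip_suffix)  # labels given as a function of the breaks
}
```

Passing a function to labels means the axis text is always computed from the breaks that are actually drawn, at the cost of assuming the ex values follow the "ex" plus digits pattern used in this example.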

Related

How to index a dataframe for using ggplot in a loop

I am struggling with creating multiple ggplots using a loop.
I use data in the following format:
a <- c(1,2,3,4)
b <- c(5,6,7,8)
c <- c(9,10,11,12)
d <- c(13,14,15,16)
time <- c(1,2,3,4)
data <- cbind(a,b,c,d,time)
What I want to create is a list of plots, each plotting one of the letters against the variable time.
I tried this in the following way:
library(ggplot2)
library(gridExtra)
plots <- list()
for (i in 1:4){
plots[[i]] <- ggplot() + geom_line(data = data, aes(x = time, y = data[,i]))
}
grid.arrange(plots[[1]], plots[[2]], plots[[3]], plots[[4]])
This results in four times the fourth plot. How do I index this correctly in a way that creates the four intended plots?

(Up front: the reason that your plots are all identical is due to ggplot's "lazy" evaluation of code. See my #2 below, where I identify that the data[,i] is evaluated when you try to plot the data, at which point i is 4, the last pass in the for loop.)
It's generally preferred/recommended to use data.frames instead of matrices or vectors (as you're doing here). It gives a bit more power and control.
data <- data.frame(a,b,c,d,time)
Also, I tend to prefer lapply to for-loops and lists, for various (some subjective) reasons. Ultimately, the issue you're having is that ggplot2 is evaluating the data lazily, so plots is a list with four plots that make reference to i ... and that is realized when you try to plot them all, at which point i is 4 (from the last pass through the loop). One benefit of using lapply is that the i referenced is a local-only (inside of the anon-func) version of i that is preserved as you would expect.
plots <- lapply(names(data)[1:4],
function(nm) ggplot(data, aes(x = time, y = .data[[nm]])) + geom_line())
gridExtra::grid.arrange(plots[[1]], plots[[2]])
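The scoping difference between the for loop and lapply described above can be reproduced without ggplot2. The following base-R sketch (the names fs_loop and fs_lapply are made up for illustration) stores functions that refer to i, much as a stored plot refers to data[,i], and only calls them after the loop has finished.

```r
# for loop: every stored function looks up the same variable i, which holds
# the value from the last pass by the time the functions are called.
fs_loop <- list()
for (i in 1:4) {
  fs_loop[[i]] <- function() i
}
sapply(fs_loop, function(f) f())
# 4 4 4 4

# lapply: each call of the anonymous function has its own i
# (lapply forces the argument in R >= 3.2.0).
fs_lapply <- lapply(1:4, function(i) function() i)
sapply(fs_lapply, function(f) f())
# 1 2 3 4
```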
I also prefer patchwork to gridExtra, mostly because it makes more-customized layouts a bit more intuitive, plus adds functionality such as axis-alignment, shared legends, shared titles, etc. (None of those other features are demonstrated here.)
library(patchwork)
plots[[1]] / plots[[2]] # same plot
plots[[1]] + plots[[2]] # side-by-side instead of top/bottom
(plots[[1]] + plots[[2]]) / (plots[[3]] + plots[[4]]) # grid
Ultimately, though, I suggest that facets can be useful and very powerful. For this, we need to melt/pivot the data into a "long format" so that the column names a-d are actually in one column.
reshape2::melt(data, id.vars = "time") |>
ggplot(aes(time, value)) +
geom_line() +
facet_grid(variable ~ ., scales = "free_y")
I assumed the preference for independent (free) y-scales, ergo the scales="free_y". Try it without if you want to see the options. (There are also scales="free_x" and scales="free" (both).)
To see what I mean by "long" format:
reshape2::melt(data, id.vars = "time")
# time variable value
# 1 1 a 1
# 2 2 a 2
# 3 3 a 3
# 4 4 a 4
# 5 1 b 5
# 6 2 b 6
# 7 3 b 7
# 8 4 b 8
# 9 1 c 9
# 10 2 c 10
# 11 3 c 11
# 12 4 c 12
# 13 1 d 13
# 14 2 d 14
# 15 3 d 15
# 16 4 d 16
This can also be done with tidyr::pivot_longer(data, -time), although the variable column is then called name. For this use, neither reshape2::melt nor tidyr::pivot_longer has an advantage over the other; the latter supports significantly more complex pivoting, which is not needed with this data.
Data
data <- structure(list(a = c(1, 2, 3, 4), b = c(5, 6, 7, 8), c = c(9, 10, 11, 12), d = c(13, 14, 15, 16), time = c(1, 2, 3, 4)), class = "data.frame", row.names = c(NA, -4L))

Looping through dataframe names and plotting in R

I have a dataframe of 12 variables and I'd like to plot exactly one variable against each of the others using ggplot's geom_point(). I don't want to do it manually, so I need to loop through the variables and make the plots.
For example, I have a df like this (simplified to 4 variables for readability):
> head(df)
letters value1 value2 value3
A 1 0 10
B 3 1 9
C 6 0 8
D 76 0 7
E 13 1 6
F 58 1 5
And I'd like to produce two plots where value1 is plotted over value2 and value3.
I've tried this:
plts <- vector()
for (i in names(df)) {
p <- ggplot(df, aes(x=value1, y=i, fill=letters)) + geom_point()
plts <- append(plts, p)
}
but it treats value2 and value3 differently from value1 and produces something like this (e.g., value1 over value3):
[Plot of value1 over value3]
What should be changed to get plots like the following one, which was produced without a loop?
ggplot(df, aes(x=value1, y=value3, fill=letters)) + geom_point()

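I think using aes_string() instead of aes() will give you what you want. Your problem is caused by the tidyverse's use of non-standard evaluation (NSE): inside aes(), y=i is not replaced by the column that i names. Note that aes_string() has been soft-deprecated since ggplot2 3.0.0; the .data pronoun, as in aes(y = .data[[y]]), is the current equivalent.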
lapply(
names(df),
function(y) {
df %>% ggplot() + geom_point(aes_string(x="value1", y=y, colour="letters"))
}
)
You can customise the first argument to lapply to select the variables you need.
That said, I think it would be easier and more robust to reformat your data frame to a more helpful layout and then create your plots...
For example,
df %>%
pivot_longer(
cols=c("value2", "value3"),
names_to="Variable",
values_to="y"
) %>%
ggplot() +
geom_point(aes(x=value1, y=y, colour=letters)) +
facet_grid(rows=vars(Variable))
By the way, using colour=letters is probably more informative than fill=letters when using geom_point.

stack bars in plot without preserving label order

ggplot preserves the order of stacked bars according to labels:
d <- read.table(text='Day Location Length Amount
1 2 3 1
1 1 4 2
3 3 3 2
3 2 5 1',header=T)
d$Amount<-as.factor(d$Amount) # in real world is not numeric
ggplot(d, aes(x = Day, y = Length)) +
geom_bar(aes(fill = Amount), stat = "identity")
What I want is something similar to the plot produced without the as.factor line, that is, with the larger bars always on top. However, I cannot do that with my real data because Amount is categorical, not numeric.
Similar post: https://www.researchgate.net/post/R_ggplot2_Reorder_stacked_plot
A solution that uses another R package is fine.
Note: data.frame is only demonstrative.

I came up with this solution:
(1) First, sort data.frame by column of values in decreasing order
(2) Then duplicate column of values, as factor.
(3) In ggplot group by new factor (values)
d <- read.table(text='Day Length Amount
1 3 1
1 4 2
3 3 2
3 5 1',header=T)
d$Amount<-as.factor(d$Amount)
d <- d[order(d$Length, decreasing = TRUE),] # (1)
d$LengthFactor<-factor(d$Length, levels= unique(d$Length) ) # (2)
ggplot(d)+
geom_bar(aes(x=Day, y=Length, group=LengthFactor, fill=Amount), # (3)
stat="identity", color="white")
The same three steps applied to a second example, where the values are first summed per group:
library(data.table)
library(ggplot2)
sam <- data.frame(population = c(rep("PRO", 8), rep("SOM", 4)),
allele = c("alele1", "alele2", "alele3", "alele4", rep("alele5", 2),
rep("alele3", 2), "alele2", "alele3", "alele3", "alele2"),
frequency = rep(c(10, 5, 4, 6, 7, 16), 2))
# Sum the frequencies per population and allele
sam <- setDT(sam)[, .(frequencySum = sum(frequency)), by = .(population, allele)]
sam <- sam[order(sam$frequencySum, decreasing = TRUE), ] # (1)
sam$frequencyFactor <- factor(sam$frequencySum, levels = unique(sam$frequencySum)) # (2)
ggplot(sam) +
geom_bar(aes(x = population, y = frequencySum, group = frequencyFactor, fill = allele), # (3)
stat = "identity", color = "white")

Conditional coloring of geom_path in ggplot in R

I have the following simplified data frame, which contains four paths:
df <- read.table(text="id x y
a 1 1
a 2 2.0
a 2 3.1
a 3.2 4
b 1.0 1
b 2 0
b 2 -1
b 3 -3
c 1 1
c 0 0
c 0 -1
c -1 -2
d 1 1
d 0 1
d -1 0
d -2 0", header=TRUE)
I am able to plot the paths using ggplot's geom_path() function:
ggplot(data = df) +
geom_path(aes(x = x, y = y, color = id))
Question
How do I color the paths that have a Y-value of over 2 in red (or even better, in red scales in case of multiple paths), while coloring the others in different greyscales? I am able to manually alter the colors of the lines, but I have plots with 3 up to 50+ paths, so I am looking for a more automated solution.

One approach is to add a colour column for each path before passing the data frame into ggplot: for example, you could assign each path a colour name in luv_colours, which can be passed into an identity colour scale. The example below does this with dplyr.
n_ids <- length(unique(df$id))
group_by(df, id) %>%
mutate(col = if (any(y>2)) "red" else paste0("gray", round(match(id, letters) * 60/n_ids))) %>%
ungroup() %>%
ggplot() +
geom_path(aes(x, y, colour = col, lty = id)) +
scale_colour_identity()
n_ids is used to spread the grayscale values over most of the scale, leaving out the values close to white. This should work if n_ids <= 50, since 60/n_ids > 1, and therefore two paths can't have their gray number be the same after rounding.
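One assumption in the snippet above is that the ids are single lowercase letters, because match(id, letters) is used to number the paths. A minimal base-R sketch of the same gray assignment for arbitrary ids could number them by order of first appearance instead; the helper name gray_for_ids is made up for illustration.

```r
# Hypothetical helper: map each id to one of R's built-in colour names
# "gray0" ... "gray100", spreading the ids over gray1 to gray60 as in the answer.
gray_for_ids <- function(ids) {
  u <- unique(ids)                       # ids in order of first appearance
  n <- length(u)
  paste0("gray", round(match(ids, u) * 60 / n))
}

gray_for_ids(c("a", "a", "b", "c", "d"))
# "gray15" "gray15" "gray30" "gray45" "gray60"

# Check the claim that no two paths share a gray value for up to 50 ids
all(sapply(1:50, function(n) anyDuplicated(round(seq_len(n) * 60 / n)) == 0))
# TRUE
```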
Line types are used here for the legend, because using colour runs into problems if there's more than one red path. This is not ideal, because there are not many line types. I'd therefore recommend that, instead of colouring the key paths red, you use a different line type, reserving colour for the path id since there are many more colours than line types.
group_by(df, id) %>%
mutate(lty = if (any(y>2)) "solid" else "dashed") %>%
ungroup() %>%
ggplot() +
geom_path(aes(x, y, colour = id, lty = lty)) +
scale_linetype_identity()
This also has the advantage of the number of paths not being limited by the number of grayscale colours.
If the path IDs have a meaningful order, you could look at using a colour palette different to the default, such as scale_color_brewer() or scale_color_viridis_d().
EDIT: You could also introduce several shades of red, and use them in a similar way to how I handled the grayscale for different paths. I'd still recommend against this in favour of my alternative for two reasons:
Handling colours in this way is a pain.
It still has the same underlying problem, which is that you're trying to map several features of the paths (y > 2, id) onto a single graphical feature (colour).
This is all predicated on you wanting to identify each path uniquely. If not, you can just do this:
group_by(df, id) %>%
mutate(col = if (any(y>2)) "red" else "black") %>%
ungroup() %>%
ggplot() +
geom_path(aes(x, y, colour = col, group = id)) +
scale_colour_identity()

Your question is not very clear to me but you can create a condition and then color the item based on it.
df <- read.table(text="id x y
a 1 1
a 2 2.0
a 2 3.1
a 3.2 4
b 1.0 1
b 2 0
b 2 -1
b 3 -3
c 1 1
c 0 0
c 0 -1
c -1 -2
d 1 1
d 0 1
d -1 0
d -2 0", header=TRUE)
df$condition <- ifelse(df$y>2,"Yes","No")
ggplot(data = df) + geom_path(aes(x = x, y = y, color = condition))
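The condition above is computed per point, so a path that crosses y = 2 ends up with both values. If the intent is to flag a whole path as soon as any of its points exceeds 2, one possible base-R sketch is to compute the condition per id with ave(); the variable name path_max is made up for illustration.

```r
df <- read.table(text = "id x y
a 1 1
a 2 2.0
a 2 3.1
a 3.2 4
b 1.0 1
b 2 0
b 2 -1
b 3 -3
c 1 1
c 0 0
c 0 -1
c -1 -2
d 1 1
d 0 1
d -1 0
d -2 0", header = TRUE)

# Largest y of each path, repeated for every row of that path
path_max <- ave(df$y, df$id, FUN = max)

# Every row of a path gets the same value
df$condition <- ifelse(path_max > 2, "Yes", "No")
table(df$id, df$condition)
```

The column could then be mapped with aes(x = x, y = y, color = condition, group = id), where group = id keeps the paths separate even though several of them share a colour.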

Identify and plot datapoints surrounded by NAs

I am using ggplot2 and geom_line() to make a line plot of a large number of time series. The dataset has a high number of missing values, and I am generally happy that lines are not drawn across missing segments, as this would look awkward.
My problem is that single non-NA datapoints surrounded by NAs (or points at the beginning/end of the series with an NA on the other side) are not plotted. A potential solution would be adding geom_point() for all observations, but this increases my filesize tenfold, and makes the plot harder to read.
Thus, I want to identify only those datapoints that do not get shown with geom_line() and add points only for those. Is there a straightforward way to identify these points?
My data is currently in long format, and the following MWE can serve as an illustration. I want to identify rows 1 and 7 so that I can plot them:
library(ggplot2)
set.seed(1)
dat <- data.frame(time=rep(1:5,2),country=rep(1:2,each=5),value=rnorm(10))
dat[c(2,6,8),3] <- NA
ggplot(dat) + geom_line(aes(time,value,group=country))
> dat
time country value
1 1 1 -0.6264538
2 2 1 NA
3 3 1 -0.8356286
4 4 1 1.5952808
5 5 1 0.3295078
6 1 2 NA
7 2 2 0.4874291
8 3 2 NA
9 4 2 0.5757814
10 5 2 -0.3053884

You can use the zoo::rollapply function to create a new column that keeps only the values surrounded by NAs on both sides. Then you can simply plot those points. For example:
library(zoo)
library(ggplot2)
foo <- data.frame(time =c(1:11), value = c(1 ,NA, 3, 4, 5, NA, 2, NA, 4, 5, NA))
# Perform sliding window processing
val <- c(NA, NA, foo$value, NA, NA) # Add NA at the ends of vector
val <- rollapply(val, width = 3, FUN = function(x){
if (all(is.na(x) == c(TRUE, FALSE, TRUE))){
return(x[2])
} else {
return(NA)
}
})
foo$val_clean <- val[c(-1, -length(val))] # Remove first and last values
foo$val_clean
ggplot(foo) + geom_line(aes(time, value)) + geom_point(aes(time, val_clean))
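The example above handles a single series. For the long-format data in the question, the same idea has to be applied within each country so that a neighbouring series is not mistaken for a neighbour in time. A minimal base-R sketch, assuming the rows are already sorted by time within each country (is_isolated is a made-up helper name):

```r
set.seed(1)
dat <- data.frame(time = rep(1:5, 2), country = rep(1:2, each = 5), value = rnorm(10))
dat[c(2, 6, 8), 3] <- NA

# TRUE where a non-NA value has no non-NA neighbour within its own series
is_isolated <- function(v) {
  ok <- !is.na(v)
  prev_ok <- c(FALSE, head(ok, -1))  # the start of a series counts as missing
  next_ok <- c(tail(ok, -1), FALSE)  # the end of a series counts as missing
  ok & !prev_ok & !next_ok
}

# ave() applies the helper per country and returns 0/1 in the original row order
dat$isolated <- as.logical(ave(dat$value, dat$country, FUN = is_isolated))
which(dat$isolated)
# 1 7
```

The flagged rows could then be drawn on top of the lines with something like geom_point(data = dat[dat$isolated, ], aes(time, value)), which adds points only where geom_line() shows nothing.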

Do you mean something like this?
library(tidyverse)
dat %>%
na.omit() %>%
ggplot() +
geom_line(aes(time, value, group = country))
