I have a data frame, df, similar to the following:
Time Sample_A Sample_B Sample_C
0 0.12 0.14 0.15
1 0.13 0.20 0.21
2 0.31 0.34 0.36
I am reading in this data from a text file, in which the number of columns will always be changing. I would like to use ggplot in order to quickly and easily graph the x value (always Time) by all of the y values (Sample A, B, C, ....) onto a single graph. The names of the Y-variables are always changing as well.
In essence, I'd like to avoid doing the following on repeat:
ggplot(df, aes(x = Time, y = Sample_A)) + geom_line()
ggplot(df, aes(x = Time, y = Sample_B)) + geom_line()
I have tried creating a vector that contains all of the column names and passing it as the y value to the aes function; however, it returns the number of variables rather than the values within the variables.
What is the most efficient way to go about this?
This is pretty simple:
library(tidyverse)
df <- tibble(
  time = c(0, 1, 2),
  Sample_A = c(0.12, 0.13, 0.31),
  Sample_B = c(0.14, 0.20, 0.34),
  Sample_C = c(0.15, 0.21, 0.36)
)
df %>%
  gather(key = sample, value = value, -time) %>%
  ggplot(aes(x = time, y = value, color = sample)) +
  geom_line()
Basically, you can gather all of the columns except the first into a "long" data frame instead of a "wide" one. Then a couple lines of ggplot code will plot the result, colored by sample.
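The same reshaping can also be written with tidyr::pivot_longer, which tidyr now recommends over gather. The following is a minimal sketch using the example data from this answer; the object names long and p are only illustrative.

```r
library(tidyr)
library(ggplot2)

df <- data.frame(
  time = c(0, 1, 2),
  Sample_A = c(0.12, 0.13, 0.31),
  Sample_B = c(0.14, 0.20, 0.34),
  Sample_C = c(0.15, 0.21, 0.36)
)

# Reshape every column except `time` into name/value pairs,
# so new sample columns are picked up automatically.
long <- pivot_longer(df, cols = -time, names_to = "sample", values_to = "value")

# One line per sample, coloured by the original column name.
p <- ggplot(long, aes(x = time, y = value, color = sample)) +
  geom_line()
```

Because the selection is "everything except time", the code does not need to change when the number or names of the sample columns change.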
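Use lapply to add one geom_line per column, looping over the column names like this (aes_string takes column names as strings, so the x column must be quoted; replace "time" with the name of your x column):
ggplot(data) +
  lapply(names(data)[-1], FUN = function(i) geom_line(aes_string(x = "time", y = i)))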
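Since aes_string() is deprecated in recent ggplot2 releases, here is a sketch of the same looping idea using the .data pronoun instead. It assumes the x column is called time and that every other column is a series to plot; the names y_cols, layers and p are illustrative only.

```r
library(ggplot2)

data <- data.frame(
  time = c(0, 1, 2),
  Sample_A = c(0.12, 0.13, 0.31),
  Sample_B = c(0.14, 0.20, 0.34),
  Sample_C = c(0.15, 0.21, 0.36)
)

# Every column except the x column is treated as a series (assumption).
y_cols <- setdiff(names(data), "time")

# Build one geom_line layer per column; `colour = col` maps the column
# name itself to colour so each line gets its own legend entry.
layers <- lapply(y_cols, function(col) {
  force(col)
  geom_line(aes(x = .data[["time"]], y = .data[[col]], colour = col))
})

# A list of layers can be added to a plot in one step.
p <- ggplot(data) +
  layers +
  labs(x = "time", y = "value", colour = "sample")
```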
Please consider the following simple data frame, where each observation/individual has multiple variables whose values add up to 1. For example, A through E could be different body parts, and each person's body parts all add up to 100% of the person's total weight, but the proportions might differ between individuals:
library(ggplot2)
library(dplyr)
test.df <- data.frame("Variable" = LETTERS[1:5], "Obs1" = c(0.1, 0, 0.5, 0.2, 0.2),
"Obs2" = c(0.3, 0.7, 0, 0, 0))
This should produce the following data frame:
> test.df
Variable Obs1 Obs2
1 A 0.1 0.3
2 B 0.0 0.7
3 C 0.5 0.0
4 D 0.2 0.0
5 E 0.2 0.0
I'd like a pie chart for each observation. So, a total of two pie charts, where all the variables are shown in a legend for each chart, and the color codes match between charts, and values of '0' are represented in the legend but not on the chart itself.
I'm positive that what I'd like to do is simple, but there's some stumbling block that I'm not seeing. I've done this before successfully, but it seems I did not truly understand because now I'm having trouble. I have tried:
ggplot(test.df, aes(x='', y='Obs1', fill = Variable)) +
geom_bar(width = 1, stat = 'identity') + coord_polar("y", start = 0)
What I end up with is a pie chart where each of the five variables takes up an equal amount of space, even though I have specified that the values of 'y' should come from Obs1 in the data frame.
Can anyone please help me? This is driving me crazy!
Best,
A
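In your attempt, y = 'Obs1' is quoted, so ggplot treats it as a constant string rather than as the Obs1 column, which is why every slice has the same size. You could convert your data to a longer format using pivot_longer from tidyr, so that you can plot one chart each for Obs1 and Obs2 using facet_wrap like this: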
library(ggplot2)
library(dplyr)
library(tidyr)
test.df %>%
pivot_longer(cols = Obs1:Obs2) %>%
ggplot(aes(x='', y= value, fill = Variable)) +
geom_bar(width = 1, stat = 'identity') +
coord_polar("y", start = 0) +
facet_wrap(~name)
Created on 2023-02-18 with reprex v2.0.2
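For comparison, here is a sketch of the single-chart version of the original attempt in which the only change is that Obs1 is unquoted, so it refers to the column. The object name p is illustrative.

```r
library(ggplot2)

test.df <- data.frame(
  Variable = LETTERS[1:5],
  Obs1 = c(0.1, 0, 0.5, 0.2, 0.2),
  Obs2 = c(0.3, 0.7, 0, 0, 0)
)

# Unquoted Obs1 refers to the column; a quoted 'Obs1' would be a
# constant string and give every slice the same size.
p <- ggplot(test.df, aes(x = "", y = Obs1, fill = Variable)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0)
```

Variables with a value of 0 still appear in the legend because fill is mapped to Variable, but they occupy no angle in the chart.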
I am trying to create a violin plot from a data frame in long format. The data frame has two columns: Group (a factor with two levels, Efficient and Inefficient) and Glucose m+6 (the corresponding numerical values).
I have tried plotting a violin plot using the following code:
Dta_lng %>%
ggplot(aes(x= Group, y= `Glucose m+6`, fill= Group)) +
geom_violin(show.legend = FALSE) +
geom_jitter(aes(fill=Group),width=0.1, alpha=0.6, pch=21, color="black")
This is the resulting plot:
https://i.stack.imgur.com/VrtbU.jpg
The console also gives 50 warning messages saying groups with fewer than two data points have been dropped.
This is the data I'm working with:
Dta_lng
A tibble: 66 x 2
Group      Glucose m+6
Efficient  0.47699999999999998
Efficient  0.376
Efficient  0.496
Efficient  0.32500000000000001
Efficient  8.8999999999999996E-2
Efficient  4.5999999999999999E-2
Efficient  0.21299999999999999
Efficient  8.2000000000000003E-2
Efficient  0.35899999999999999
Efficient  0.30599999999999999
... with 56 more rows
In the Group column, the first 30 rows are Efficient and the last 35 are Inefficient.
Perhaps like this:
Data: (Note the different labels!)
df <- data.frame(
group = c(sample(c("efficient", "inefficient"), 1000, replace = TRUE)),
Glucose_m_6 = rnorm(1000)
)
The violin plot with scatter plot:
ggplot(data = df,
aes(x = group, y = Glucose_m_6, fill = group)) +
geom_violin(scale = "count", trim = FALSE, adjust = 0.7, kernel = "cosine") +
geom_point(aes(y = Glucose_m_6),
position = position_jitter(width = .25), size = 0.9, alpha = 0.8)
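One possible cause of the warnings in the question, which cannot be confirmed from the post itself, is that the Glucose m+6 column was read in as text: printed values such as 0.47699999999999998 and 8.8999999999999996E-2 look like strings rather than numbers. If y is not numeric, ggplot2 treats each distinct value as its own group, and groups with a single point are dropped. The sketch below assumes that situation. The data frame is a small hypothetical stand-in: the Efficient values are taken from the printout above, and the Inefficient values are made up for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for Dta_lng with the measurement stored as text.
Dta_lng <- data.frame(
  Group = rep(c("Efficient", "Inefficient"), each = 5),
  `Glucose m+6` = c("0.47699999999999998", "0.376", "0.496",
                    "0.32500000000000001", "8.8999999999999996E-2",
                    "0.21", "0.15", "0.31", "0.28", "0.19"),  # made-up values
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Convert the text column to numeric before plotting.
Dta_lng$`Glucose m+6` <- as.numeric(Dta_lng$`Glucose m+6`)

p <- ggplot(Dta_lng, aes(x = Group, y = `Glucose m+6`, fill = Group)) +
  geom_violin(show.legend = FALSE) +
  geom_jitter(width = 0.1, alpha = 0.6, pch = 21, color = "black")
```

Checking str(Dta_lng) on the real data would show whether the column is chr or dbl and therefore whether this conversion is needed at all.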
I have the following dataframe x:
x1 <- data.frame(Date = seq(as.Date("2010-01-01"),
as.Date("2012-12-01"),
by = "month"),
TS1 = rnorm(36,0,1),
TS2 = rnorm(36,0,1),
stringsAsFactors = F)
x2 <- data.frame(Date = seq(as.Date("2010-01-01"),
as.Date("2012-12-01"),
by = "quarter"),
TS3 = rnorm(12,0,1),
stringsAsFactors = F)
x <- dplyr::left_join(x1, x2, by = "Date")
x contains two monthly series and one quarterly series.
I would like to plot all three series at the same time with ggplot. I am aware of dualplot as a way to do it. The issue with it however is that it allows you to plot only 2 mixed frequency series.
Is there anyone who can help me with this?
Thanks!
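Note that ggplot generally works best with long format, so we first use tidyr::pivot_longer.
Next, we can plot TS1 and TS2 easily, but TS3 will not show up as a line: its quarterly values are separated by missing values, so geom_line has no consecutive points to connect.
One option is to plot the series with missing values in a separate geom_line call on its non-missing rows: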
x2 <- x %>%
tidyr::pivot_longer(cols = c(TS1, TS2, TS3), names_to = "TS") %>%
mutate(TS = as.factor(TS))
ggplot(x2, aes(x = Date, y = value, group = TS, color = TS)) +
geom_line() +
geom_line(data = subset(x2, TS == "TS3" & !is.na(value)))
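A variation on the same idea, sketched here under the assumption that the missing rows are not needed for anything else, is to drop them once after reshaping so that a single geom_line call draws all three series. The names x_q, x_long and p are illustrative, and set.seed is added only to make the example reproducible.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(1)
x1 <- data.frame(
  Date = seq(as.Date("2010-01-01"), as.Date("2012-12-01"), by = "month"),
  TS1 = rnorm(36, 0, 1),
  TS2 = rnorm(36, 0, 1)
)
x_q <- data.frame(
  Date = seq(as.Date("2010-01-01"), as.Date("2012-12-01"), by = "quarter"),
  TS3 = rnorm(12, 0, 1)
)
x <- left_join(x1, x_q, by = "Date")

# Reshape to long format, then remove the rows where the quarterly
# series has no value, so its remaining points are consecutive.
x_long <- x %>%
  pivot_longer(cols = c(TS1, TS2, TS3), names_to = "TS") %>%
  filter(!is.na(value))

p <- ggplot(x_long, aes(x = Date, y = value, colour = TS)) +
  geom_line()
```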
In this instance, the data does not have to be transformed into long format for ggplot (although that is a nice solution if you are familiar with transforming data, and it is recommended especially when there are lots of columns or separate lines to plot).
For simplicity, especially when learning ggplot, here is an alternative solution.
TS1 and TS2 can easily be plotted against Date, as neither has NA values. Here, we call geom_line() twice, once for each line:
x %>%
ggplot()+
geom_line(aes(Date, TS1), colour = 'red')+
geom_line(aes(Date, TS2), colour = 'blue')
If you try and include a third geom_line() with TS3, only the original two lines are plotted due to TS3's missing values (NA). A solution is to fill in the NA values in the data before plotting, using zoo::na.approx(). As the name suggests, zoo::na.approx() is able to approximate values when you have NAs, by linear interpolation. In this instance, I assume linear interpolation between known values is appropriate for plotting (as geom_line is doing anyway). Check out ?zoo::na.approx for more details, including non-linear interpolation.
zoo::na.approx(TS3, Date, na.rm = FALSE) may be read aloud like: "Approximate the missing (NA) values of TS3 by interpolating along Date, and where a value still cannot be interpolated, keep it as NA rather than dropping it."
x %>%
mutate(
TS3 = zoo::na.approx(TS3, Date, na.rm = FALSE)
) %>%
ggplot()+
geom_line(aes(Date, TS1), colour = 'red')+
geom_line(aes(Date, TS2), colour = 'blue')+
geom_line(aes(Date, TS3), colour = 'green')
Note that the green line finishes two data points short of the other two lines. This is because, by default, zoo::na.approx() does not extrapolate: an NA that is not between two known data points stays NA. That is also why we specified na.rm = FALSE, which keeps those trailing NAs so the result has the same length as the other columns. Look at the help page ?zoo::na.approx for alternatives (such as repeating the last known observation).
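To make the trailing-NA behaviour concrete, here is a small sketch using base R's stats::approx(), the linear interpolation routine that zoo::na.approx() builds on. The toy vector y and the names idx, known, inside_only and carried are made up for illustration.

```r
# Toy series: known values at positions 1 and 4, missing elsewhere.
idx <- 1:6
y <- c(1, NA, NA, 4, NA, NA)
known <- !is.na(y)

# rule = 1 (the default): interpolate only between known points;
# positions after the last known value stay NA.
inside_only <- approx(idx[known], y[known], xout = idx, rule = 1)$y
# inside_only is 1, 2, 3, 4, NA, NA

# rule = 2: values outside the known range take the nearest known
# value, which repeats the last observation at the end.
carried <- approx(idx[known], y[known], xout = idx, rule = 2)$y
# carried is 1, 2, 3, 4, 4, 4
```

Whether carrying the last value forward is appropriate depends on the data; for a quarterly series it amounts to assuming the value stays flat until the next quarter.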
I'm new to using R and ggplot2 for my data analysis, and I'm trying to figure out how to get my data from R into the format ggplot2 expects. The data is a set of values for 5 different categories, and I want to make a stacked bar graph in which each bar is split into 3 sections based on the value (e.g. small, medium, and large values based on arbitrary cutoffs). This is similar to the 100% stacked bar graph in Excel, where the proportions of all the values add up to 1 (on the y axis). There is a fair amount of data (~1500 observations), if that is also worth noting.
Here is a sample of what the data looks like (but it has approx. 1000 observations for each column). (I included an Excel screenshot because I don't know if the output below worked.)
dput(sample-data)
This sort of problem is usually a data reformatting problem. See reshaping data.frame from wide to long format.
The following code uses built-in data set iris, with 4 numeric columns, to plot a bar graph with the data values cut into levels after reshaping the data.
I have chosen cutoff points 0.2 and 0.7, but any other numbers in (0, 1) will do. The cutoff vector is brks and the level names are labls.
library(tidyverse)
data(iris)
brks <- c(0, 0.2, 0.7, 1)
labls <- c('Small', 'Medium', 'Large')
iris[-5] %>%
pivot_longer(
cols = everything(),
names_to = 'Category',
values_to = 'Value'
) %>%
group_by(Category) %>%
mutate(Value = (Value - min(Value))/diff(range(Value)),
Level = cut(Value, breaks = brks, labels = labls,
include.lowest = TRUE, ordered_result = TRUE)) %>%
ggplot(aes(Category, fill = Level)) +
geom_bar(stat = 'count', position = position_fill()) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
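The cut() step above can be checked in isolation. This sketch applies the same brks and labls to a small made-up vector vals of already rescaled values, to show which level each cutoff boundary falls into.

```r
brks <- c(0, 0.2, 0.7, 1)
labls <- c("Small", "Medium", "Large")

# Made-up values in [0, 1], including the boundaries themselves.
vals <- c(0, 0.1, 0.2, 0.5, 0.7, 1)

# Intervals are closed on the right: [0, 0.2], (0.2, 0.7], (0.7, 1].
# include.lowest = TRUE is what lets the value 0 fall into "Small".
lev <- cut(vals, breaks = brks, labels = labls,
           include.lowest = TRUE, ordered_result = TRUE)

as.character(lev)
# "Small" "Small" "Small" "Medium" "Medium" "Large"
```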
Here's a solution requiring no data reformatting.
The diamonds dataset comes with ggplot2. Column "color" is categorical, column "price" is numeric:
library(ggplot2)
ggplot(diamonds) +
geom_bar(aes(x = color, fill = cut(price, 3, labels = c("low", "mid", "high"))),
position = "fill") +
labs(fill = "price")
So, the issue is as follows: I have a dataset which contains
A Condition factor variable with (for this example) 3 levels that need to be plotted on a y axis,
A Group factor variable with three levels to be plotted on the x, and
A value for each group at every condition (example data below).
The three levels on the x axis indicate conditions and I would like to display observations at each level on the y in a violin plot format. I am aware of the fact that I need a numeric on the y axis for ggplot to plot these data, but cannot find a solution to solve this issue of nesting specific values (which will change from experiment to experiment) for the y value at each x condition. My progress (after receiving prior help here) has been properly formatting the data into a data frame, and melting the data into a long format for ggplot.
Example data below:
Condition  Observation  Value
1          A            11
1          B            7
1          C            2
2          A            21
2          B            2
2          C            5
3          A            16
3          B            45
3          C            34
EDIT:
> SampleA <- c(3,7,9)
> SampleB <- c(15,23,33)
> SampleC <- c(21,19,12)
> Observations <- c("Observation 1", "Observation 2", "Observation 3")
> df0 <- data.frame(Observations = as.factor(Observations), SampleA, SampleB, SampleC)
> library(ggplot2)
> df0 <- reshape2::melt(df0, id.vars = "Observations")
I'd suggest something like this:
library(dplyr)
df0 = df0 %>%
group_by(Observations) %>%
mutate(norm_value = value / sum(value))
ggplot(df0, aes(x = Observations, y = variable, fill = norm_value)) +
geom_tile() +
geom_label(aes(label = scales::percent(norm_value)), fill = "gray80") +
guides(fill = "none") +
coord_equal() +
labs(x = "", y = "") +
theme_minimal()
If you have a lot of data, I'd remove the individual labels and rely on the color scale, but with this few points direct labels seem clearest.
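As a quick check of the normalisation step, the following base R sketch rebuilds the melted data by hand (so it does not depend on reshape2) and confirms that the shares within each observation sum to 1. The use of ave() and the name totals are just one way to write it, not the approach used in the answer.

```r
# Long-format equivalent of the melted df0 from the question.
df0 <- data.frame(
  Observations = rep(c("Observation 1", "Observation 2", "Observation 3"),
                     times = 3),
  variable = rep(c("SampleA", "SampleB", "SampleC"), each = 3),
  value = c(3, 7, 9, 15, 23, 33, 21, 19, 12)
)

# Divide each value by the total of its observation,
# e.g. Observation 1: 3 / (3 + 15 + 21) = 3 / 39.
df0$norm_value <- df0$value / ave(df0$value, df0$Observations, FUN = sum)

# Shares within each observation should add up to 1.
totals <- tapply(df0$norm_value, df0$Observations, sum)
```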