I have several datasets, and my end goal is to build a graph from them, with each line representing the yearly variation of one variable. The data were originally monthly, so I joined and combined them into a single table that contains only the yearly means for each item I want to graph (one column for the year, and four more columns holding the yearly values of 4 different elements).
So I have one factor, the year, and 4 different variables that record yearly variation, and I would like to graph them in the same plot. My idea was to join the 4 columns into one (collapse to one observation per row, with the year alongside each observation), but I can't seem to do that. My thought is that this would give structure to my y axis. I would like some advice, and to know whether my approach to the problem is sound. I am trying ggplot2, but it does not seem to work without a defined (or predefined) y axis. Thanks
I would suggest the following approach. Reshape your data from wide to long, as in the example below; that way all variables can be plotted together. As no data was provided, this solution is sketched using dummy data. You can also swap the lines for another geom, such as points:
library(tidyverse)
set.seed(123)
#Data
df <- data.frame(year=1990:2000,
v1=rnorm(11,2,1),
v2=rnorm(11,3,2),
v3=rnorm(11,4,1),
v4=rnorm(11,5,2))
#Plot
df %>% pivot_longer(-year) %>%
ggplot(aes(x=factor(year),y=value,group=name,color=name))+
geom_line()+
theme_bw()
Output: [line plot with one coloured line per variable]
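To see what the reshape does before plotting, the long table can be inspected on its own. This is a small sketch on the same dummy data; `long_df` is a name introduced here for illustration only.

```r
library(tidyr)

set.seed(123)
df <- data.frame(year = 1990:2000,
                 v1 = rnorm(11, 2, 1),
                 v2 = rnorm(11, 3, 2),
                 v3 = rnorm(11, 4, 1),
                 v4 = rnorm(11, 5, 2))

# One row per (year, variable) pair: 11 years x 4 variables = 44 rows.
# By default pivot_longer() names the new columns "name" and "value".
long_df <- pivot_longer(df, -year)

names(long_df)
nrow(long_df)
```

The `name` column is what the plot maps to `group` and `color`, which is why all four variables end up sharing a single y axis.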
We could use melt from reshape2 without loading multiple other packages
library(reshape2)
library(ggplot2)
ggplot(melt(df, id.vars = 'year'), aes(x = factor(year), y = value,
group = variable, color = variable)) +
geom_line()
Output: [plot]
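Or with matplot from base R (the x axis is suppressed with xaxt = 'n', so the years are added back with axis()):
matplot(as.matrix(df[-1]), type = 'l', xaxt = 'n')
axis(1, at = seq_len(nrow(df)), labels = df$year)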
data
set.seed(123)
df <- data.frame(year=1990:2000,
v1=rnorm(11,2,1),
v2=rnorm(11,3,2),
v3=rnorm(11,4,1),
v4=rnorm(11,5,2))
I am trying to create a histogram/bar plot in R that shows, for each x value in my dataset, the count of rows with that value or higher. I am having trouble doing this, and I don't know whether to use geom_histogram or geom_bar (I want to use ggplot2). To describe my problem further:
On the x axis I have "Percent_Origins", which is a column in my data frame. On the y axis, for each Percent_Origins value that occurs, I want the height of the bar to represent the count of rows with that percent value or higher. Right now, using a histogram, I have:
plot <- ggplot(dataframe, aes(x=dataframe$Percent_Origins)) +
geom_histogram(aes(fill=Percent_Origins), binwidth= .05, colour="white")
What should I change the fill or general code to be to do what I want? That is, plot an accumulation of counts of each value and higher? Thanks!
I think your best bet is to create the cumulative distribution first and then pass it to ggplot. There are several ways to do this, but a simple one (using dplyr) is to sort the data in descending order and assign each row a running count. Then trim the data so that only the largest count for each value is kept, and plot it.
To demonstrate, I am using the builtin iris data.
library(dplyr)
library(ggplot2)

iris %>%
arrange(desc(Sepal.Length)) %>%
mutate(counts = 1:n()) %>%
group_by(Sepal.Length) %>%
slice(n()) %>%
ggplot(aes(x = Sepal.Length, y = counts)) +
geom_step(direction = "vh")
gives: [step plot of the cumulative counts]
If you really want bars instead of a line, use geom_col instead. However, note that you either need to fill in gaps (to ensure the bars are evenly spaced across the range) or deal with breaks in the plot.
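A minimal sketch of the bar variant, again on iris. It computes the "this value or higher" counts with count() and a cumulative sum rather than the sort-and-slice steps above, which is a swapped-in but equivalent route; `cum_counts` and `at_least` are names introduced here for illustration. It does not fill gaps, so bars appear only at values that occur in the data.

```r
library(dplyr)
library(ggplot2)

cum_counts <- iris %>%
  count(Sepal.Length) %>%              # rows per distinct value
  arrange(desc(Sepal.Length)) %>%      # largest value first
  mutate(at_least = cumsum(n)) %>%     # rows with this value or higher
  arrange(Sepal.Length)

ggplot(cum_counts, aes(x = Sepal.Length, y = at_least)) +
  geom_col()
```

The bar at the smallest value has the height of the whole data set, and the heights decrease from there.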
I want to create a ggpairs plot colored by a factor, in order to do that I have to keep the factor column in the data frame that goes into ggpairs().
The problem with that is that it adds plots that are done with the factor (the last column and last line in the plot), which I do not want to have in the ggpairs plot (they only add a limited amount of information and make the plot messy).
Is there a way to not show them in the plot, or alternatively color by a factor which is in a separate dataframe?
I was able to remove the whole top part of the plot by using: upper = 'blank' but it doesn't really help as I cannot remove by columns or rows of the ggmatrix.
Is there a way to do this?
I searched for solutions but didn't find anything relevant.
Here is an example using the gapminder dataset:
library(dplyr)
library(ggplot2)
library(GGally)
library(gapminder)
gapminder %>%
filter(year == 2002 & continent != 'Oceania') %>%
transmute(lifeExp = lifeExp, log_pop = log(pop), log_gdpPercap = log(gdpPercap), continent = continent) %>%
ggpairs(aes(color = continent, alpha = 0.5))
I get this: [image: ggpairs with the factor]
and I would like to get something like this: [image: ggpairs colored by factor but without its related plots]
You can use the columns argument for this.
From the documentation:
which columns are used to make plots. Defaults to all columns.
In your example output you want only columns 1:3.
... %>%
ggpairs(aes(color = continent, alpha = 0.5), columns = 1:3)
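For a self-contained illustration of the same argument, here is a sketch on the built-in iris data (swapped in so that it runs without the gapminder package). The factor stays in the data frame and is available for the colour mapping, but only the four numeric columns are drawn; `num_cols` is a name introduced here.

```r
library(ggplot2)
library(GGally)

num_cols <- 1:4   # the numeric columns of iris; Species (column 5) is left out

# Species is used for colouring but gets no row or column in the matrix
p <- ggpairs(iris, aes(color = Species, alpha = 0.5), columns = num_cols)
p
```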
My data are visualized with ggplot2 as bar plots with several (~10) facets. First I want to split these facets across several rows, which I can do with facet_grid() or facet_wrap(). In the minimal example here I build 8 facets in two rows (4x2). However, I need different scales for different facets: the first row contains small values and the second row larger ones. So I need one common scale for all facets in the first row, so they can be compared along the row, and another scale for the second row.
Here is the minimal example and possible solutions.
#loading necessary libraries and example data
library(dplyr)
library(tidyr)
library(ggplot2)
trial.facets<-read.csv(text="period,xx,yy
A,2,3
B,1.5,2.5
C,3.2,0.5
D,2.5,1.5
E,11,13
F,16,14
G,8,5
H,5,4")
#arranging data to long format with omission of the "period" variable
trial.facets.tidied<-trial.facets %>% gather(key=newvar,value=newvalue,-period)
And now plotting itself:
#First variant
ggplot(trial.facets.tidied,aes(x=newvar,y=newvalue,position="dodge"))+geom_bar(stat ="identity") +facet_grid(.~period)
#Second variant:
ggplot(trial.facets.tidied,aes(x=newvar,y=newvalue,position="dodge"))+geom_bar(stat ="identity") +facet_wrap(~period,nrow=2,scales="free")
The results for the first and second variants are as follows:
In these examples the scales are either fixed for all facets (first variant) or free for all facets (second variant). What I need is for the first row (first 4 facets) to be scaled to about 5, and the second row to about 15.
One way to use facet_grid() is to add a fake variable "row" that specifies which row each letter belongs to. The new dataset, trial.facets.row (only three lines shown), would look as follows:
period,xx,yy,row
C,3.2,0.5,1
D,2.5,1.5,1
E,11,13,2
Then I can perform the same rearrangement into long format, omitting variables "period" and "row":
trial.facets.tidied.2<-trial.facets.row %>% gather(key=newvar,value=newvalue,-period,-row)
Then I arrange the facets by the variables "row" and "period", hoping that the option scales="free_y" will adjust the scales only across rows:
ggplot(trial.facets.tidied.2,aes(x=newvar,y=newvalue,position="dodge"))+geom_bar(stat ="identity") +facet_grid(row~period,scales="free_y")
and, surprise: the problem with the scales is solved; however, I get two groups of empty panels, and the whole plot is again stretched across a long strip:
None of the manual pages and handbooks I found (which usually use the mpg and mtcars datasets) cover this situation of unwanted or dummy data.
I used a combination of your first method (facet_wrap) & second method (leverage on dummy variable for different rows):
# create fake variable "row"
trial.facets.row <- trial.facets %>% mutate(row = ifelse(period %in% c("A", "B", "C", "D"), 1, 2))
# rearrange to long format
trial.facets.tidied.2<-trial.facets.row %>% gather(key=newvar,value=newvalue,-period,-row)
# specify the maximum height for each row
trial.facets.tidied.3<-trial.facets.tidied.2 %>%
group_by(row) %>%
mutate(max.height = max(newvalue)) %>%
ungroup()
ggplot(trial.facets.tidied.3,
aes(x=newvar, y=newvalue))+ # position is not an aesthetic, so it is dropped from aes()
geom_bar(stat = "identity") +
geom_blank(aes(y=max.height)) + # add blank geom to force facets on the same row to the same height
facet_wrap(~period,nrow=2,scales="free")
Note: based on this reproducible example, I'm assuming that all your plots already share a common ymin at 0. If that's not the case, simply create another dummy variable for min.height & add another geom_blank to your ggplot.
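A sketch of the min.height variant mentioned in the note, under the assumption that the lower limit matters. Bars always start at 0, so points are used here instead (a swapped-in geom, purely to make the lower limit visible); `limits` is a name introduced for illustration, and pivot_longer() stands in for gather().

```r
library(dplyr)
library(tidyr)
library(ggplot2)

trial.facets <- read.csv(text = "period,xx,yy
A,2,3
B,1.5,2.5
C,3.2,0.5
D,2.5,1.5
E,11,13
F,16,14
G,8,5
H,5,4")

limits <- trial.facets %>%
  mutate(row = ifelse(period %in% c("A", "B", "C", "D"), 1, 2)) %>%
  pivot_longer(c(xx, yy), names_to = "newvar", values_to = "newvalue") %>%
  group_by(row) %>%
  mutate(min.height = min(newvalue),   # shared lower limit per row
         max.height = max(newvalue)) %>%   # shared upper limit per row
  ungroup()

ggplot(limits, aes(x = newvar, y = newvalue)) +
  geom_point() +                       # points, so the axis is not pinned at 0
  geom_blank(aes(y = min.height)) +    # stretches each facet down to the row minimum
  geom_blank(aes(y = max.height)) +    # stretches each facet up to the row maximum
  facet_wrap(~period, nrow = 2, scales = "free")
```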
Looking over SO, I came across a solution which might be a bit tricky.
The idea is to create a second, fake dataset that plots a single point in each facet. This point is drawn at the position corresponding to the highest desired value of the y scale for that facet, so the height of the scale can be adjusted manually for each facet. Here is the solution for the dataset in question. We want a maximum y value of 5 for the first row and 17 for the second row. So create
df3=data.frame(newvar = rep("xx",8),
period = c("A","B","C","D","E","F","G","H"),
newvalue = c(5,5,5,5,17,17,17,17))
And now superimpose the new data on our graph using geom_point():
ggplot(trial.facets.tidied,aes(x=newvar,y=newvalue))+
geom_bar(stat ="identity") +
facet_wrap(~period,nrow=2,scales="free_y")+
geom_point(data=df3,aes(x=newvar,y=newvalue),alpha=1)
Here is what we get:
Here I intentionally drew the extra point visibly to make things clear. Next we need to make it invisible, which can be achieved by setting alpha=0 instead of 1 in the last command.
This approach draws an invisible line at the maximum for each row:
#loading necessary libraries and example data
library(dplyr)
library(tidyr)
library(ggplot2)
trial.facets<-read.csv(text="period,xx,yy
A,2,3
B,1.5,2.5
C,3.2,0.5
D,2.5,1.5
E,11,13
F,16,14
G,8,5
H,5,4")
# define desired number of columns
n_col <- 4
# assign a row number: integer division of the zero-based index by the number of columns
trial.facets$row <- seq(0, nrow(trial.facets)-1) %/% n_col
# determine the max by row, and round up to nearest multiple of 5
# join back to original
trial.facets.max <- trial.facets %>%
group_by(row) %>%
summarize(maxvalue = (1 + max(xx, yy) %/% 5) * 5 )
trial.facets <- trial.facets %>% inner_join(trial.facets.max, by = "row")
# make long format carrying period, row and maxvalue
trial.facets.tidied<-trial.facets %>% gather(key=newvar,value=newvalue,-period,-row,-maxvalue)
# plot an invisible line at the max
ggplot(trial.facets.tidied,aes(x=newvar,y=newvalue))+
geom_bar(stat ="identity") +
geom_hline(aes(yintercept=maxvalue), alpha = 0) +
facet_wrap(~period,ncol=n_col,scales="free")
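The two arithmetic steps in this answer can be checked in isolation. The sketch below uses only base R; `round_up_5` is a hypothetical helper name wrapping the expression from the summarize() call, and the ceiling() line is an alternative not taken from the answer.

```r
n_col <- 4

# Integer division (not modulo) of a zero-based index: 4 facets per row
row_index <- seq(0, 7) %/% n_col
row_index

# The rounding used above: next multiple of 5 strictly above the value
round_up_5 <- function(x) (1 + x %/% 5) * 5
round_up_5(c(3.2, 16))

# An exact multiple is pushed to the next step, leaving headroom above the bar
round_up_5(15)

# Alternative if an exact multiple should itself be the limit
ceiling(15 / 5) * 5
```

With the example data, the row maxima 3.2 and 16 become limits of 5 and 20.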
I am having difficulty plotting a parallel coordinates plot with ggparcoord from the GGally package. As there are two categorical variables, what I want to show in the visualisation is like the image below. I've found that in ggparcoord, groupColumn accepts only a single variable to group (colour) by. I can use showPoints to mark the values on the axes, but I also need to vary the shape of these markers according to the categorical variables. Is there another package that can help me realise my idea?
Any response will be appreciated! Thanks!
It's not that difficult to roll your own parallel coordinates plot in ggplot2, which will give you the flexibility to customize the aesthetics. Below is an illustration using the built-in diamonds data frame.
To get parallel coordinates, you need to add an ID column so you can identify each row of the data frame, which we'll use as a group aesthetic in ggplot. You also need to scale the numeric values so that they'll all be on the same vertical scale when we plot them. Then you need to take all the columns that you want on the x-axis and reshape them to "long" format. We do all that on the fly below with the tidyverse/dplyr pipe operator.
Even after limiting the number of category combinations, the lines are probably too intertwined for this plot to be easily interpretable, so consider this merely a "proof of concept". Hopefully, you can create something more useful with your data. I've used colour (for the lines) and fill (for the points) aesthetics below. You can use shape or linetype instead, depending on your needs.
library(tidyverse)
theme_set(theme_classic())
# Get 20 random rows from the diamonds data frame after limiting
# to two levels each of cut and color
set.seed(2)
ds = diamonds %>%
filter(color %in% c("D","J"), cut %in% c("Good", "Premium")) %>%
sample_n(20)
ggplot(ds %>%
mutate(ID = 1:n()) %>% # Add ID for each row
mutate_if(is.numeric, scale) %>% # Scale numeric columns
gather(key, value, c(1,5:10)), # Reshape to "long" format
aes(key, value, group=ID, colour=color, fill=cut)) +
geom_line() +
geom_point(size=2, shape=21, colour="grey50") +
scale_fill_manual(values=c("black","white"))
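Since the question asks for marker shapes, here is a sketch of the shape variant mentioned above. It is a restructured version of the same idea rather than the exact code: the reshape is done in a separate step with pivot_longer(), columns are picked by name, and cut is converted to character because shapes are meant for unordered categories. `ds_long` and `num_cols` are names introduced here.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(2)
ds <- diamonds %>%
  filter(color %in% c("D", "J"), cut %in% c("Good", "Premium")) %>%
  slice_sample(n = 20) %>%
  mutate(cut = as.character(cut))      # plain categories for the shape scale

num_cols <- c("carat", "depth", "table", "price", "x", "y", "z")

ds_long <- ds %>%
  mutate(ID = row_number()) %>%                                  # one ID per diamond
  mutate(across(all_of(num_cols), ~ as.numeric(scale(.x)))) %>%  # common vertical scale
  pivot_longer(all_of(num_cols), names_to = "key", values_to = "value")

ggplot(ds_long, aes(key, value, group = ID, colour = color)) +
  geom_line() +
  geom_point(aes(shape = cut), size = 2)   # marker shape encodes the second category
```

Colour then carries one categorical variable and marker shape the other.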
I haven't used ggparcoord before, but the only option that seemed straightforward (at least on my first try with the function) was to paste together two columns of data. Below is an example. Even with just four category combinations, the plot is confusing, but maybe it will be interpretable if there are strong patterns in your data:
library(GGally)
ds$group = with(ds, paste(cut, color, sep="-"))
ggparcoord(ds, columns=c(1, 5:10), groupColumn=11) +
theme(panel.grid.major.x=element_line(colour="grey70"))