Tornado-like plot with two variables - r

A related question that uses three variables is easier to do.
This should be seemingly simple, but I couldn't get it to work. Here's a simple example:
test_me<-data.frame(A=c(-1.5,-5.6,-4.6,-7.8,0.98,0.07,-0.32,-0.4,-0.4),
B=c("A","A","A","B","B","B","C","C","C"))
Here is the kind of plot I would like to make with ggplot2, done with base (the image is not shown, to keep the post as concise as possible):
barplot(test_me$A,col=test_me$B,legend=test_me$B)
This gives me the kind of plot I need. However, barplot returns duplicated names in the legend and efforts to remove these were futile. I could use lattice or barchart but would prefer a solution that either replicates this in ggplot2 or removes the duplicated legend entries in base's output.
Here is one of several things I've tried:
library(ggplot2)
ggplot(test_me,aes(B,A,fill=B))+geom_col()
The above won't work with changes to position. How can I best make this plot? I tried to set manual legends with legend.text in barplot but that removes the "grouping".
EDIT:
The solution below might solve the issue but it leads to overlap in bars unlike the base equivalent. I would therefore prefer a solution that uses base with elimination of the multiple entries in the legend. In short, how can I have a grouped barplot with just two variables and unique legend entries?
test_me %>%
mutate(x = row_number()) %>%
ggplot(aes(x = x, y = A, fill = B)) +
geom_col()
The issue, however, is that the above solution results in overlap, whereas the base plot results in three grouped bars (that is, the groups appear to be non-overlapping).
Thanks.

You need to give each element a discrete value on the x axis. Try this:
library(dplyr)
library(ggplot2)
test_me %>%
  mutate(x = row_number()) %>% # one distinct x position per row
  ggplot(aes(x = x, y = A, fill = B)) +
  geom_col()
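If a base-graphics version is preferred, as the question's edit suggests, one possible sketch is to colour the bars from a small palette indexed by group and then draw the legend separately from the unique group levels. This is not from the original answer; the palette `pal` and the legend position are assumptions chosen only for illustration.

```r
test_me <- data.frame(
  A = c(-1.5, -5.6, -4.6, -7.8, 0.98, 0.07, -0.32, -0.4, -0.4),
  B = c("A", "A", "A", "B", "B", "B", "C", "C", "C")
)

# Convert the grouping column to a factor so it can index a palette.
grp <- factor(test_me$B)

# Hypothetical palette: one colour per group level (any valid colours work).
pal <- c("grey20", "grey60", "grey90")
bar_cols <- pal[as.integer(grp)]

# Draw one bar per row, coloured by group, without barplot()'s own legend.
barplot(test_me$A, col = bar_cols)

# Add a legend with exactly one entry per unique group.
legend("bottomright", legend = levels(grp), fill = pal)
```

Because the legend is built from `levels(grp)` rather than from the per-row vector, each group appears only once.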

Related

Plotting series of factor variables side by side in ggplot

I am trying to plot a series of variables that were collected in two time frames. The data are structured roughly as follows: there are 9700 observations, and the variables are of class factor.
Please see the structure of the data
I want to plot a barplot like this. I will have a list of the sbs based on each wave.
I have used the aggregate function and dplyr, but I could not get the data into a proper structure.
I would be very grateful if you could help me with it.
Thank you,
As @Tung suggested, you can put your data into long format and use position_dodge in the plot so that the bars sit next to each other. Here is an example.
Using tidyr pivot_longer you can put columns that start with "sb" into long form. Then you can filter out rows where the value is zero. unite will combine names - such as sb_1 and x to become sb_1_x.
In this format, it is easier to plot. Use geom_bar to create the bar plot, and use position_dodge2 to put bars next to each other with different wave values. The use of preserve = "single" keeps the bars the same width (in cases where one wave has zero count).
library(tidyverse)
library(ggplot2)
df %>%
pivot_longer(cols = starts_with("sb")) %>%
filter(value != 0) %>%
unite(sb, name, value) %>%
ggplot(aes(x = sb)) +
geom_bar(aes(fill = wave), position = position_dodge2(preserve = "single"))
Plot
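Since the original data were only shown as an image, here is a small sketch of what the reshaping step does. The data frame `df` below is made up for illustration: the column names `wave`, `sb_1` and `sb_2` and all of the values are assumptions, and the real columns are factors rather than numbers.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for the poster's data frame.
df <- tibble(
  wave = factor(c(1, 1, 2, 2, 2)),
  sb_1 = c(1, 0, 1, 1, 0),
  sb_2 = c(0, 2, 2, 0, 1)
)

long <- df %>%
  pivot_longer(cols = starts_with("sb")) %>% # columns sb_* become name/value pairs
  filter(value != 0) %>%                     # drop rows where the value is zero
  unite(sb, name, value)                     # e.g. "sb_1" and 1 become "sb_1_1"

long
```

Each remaining row is one non-zero response, labelled by wave, which is the shape geom_bar needs in order to count and dodge the bars.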

Want to compare all variables from two dataframes with same columns?

I have been using this code to plot all the variables in my synthetic dataset, but I would like to modify it in order to compare the dataset to the main, original dataset.
synthetic %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
I can't figure out how to make the plots go on the same, well, plots, or where to put the color values so I can tell which is which...
I tried combining both into one dataframe and using gather to inform the plots which was which but the coloring didn't work and it didn't work in general.
I tried matplot, but it told me they both have to have the same number of rows, which inclines me to believe it's not the right function for this.
The third thing I tried was:
par(mfrow=c(5,4))
i<-1
for (i in 1:26) {
plot(synthetic[i], col = "red")
points(full2[i], col = "blue")
}
But that failed as well. I would like all the plots to appear at once and not have to click through them.

Reordering data based on a column in [r] to order x-value items from lowest to highest y-values in ggplot

I have a dataframe that I want to reorder to make a ggplot so I can easily see which items have the highest and lowest values in them. In my case, I've grouped the data into two groups, and it'd be nice to have a visual representation of which group tends to score higher. Based on this question I came up with:
library(ggplot2)
cor.data<- read.csv("https://dl.dropbox.com/s/p4uy6uf1vhe8yzs/cor.data.csv?dl=0",stringsAsFactors = F)
cor.data.sorted = cor.data[with(cor.data,order(r.val,pic)),] #<-- line that doesn't seem to be working
ggplot(cor.data.sorted,aes(x=pic,y=r.val,size=df.val,color=exp)) + geom_point()
which produces this:
I've tried quite a few variants to reorder the data, and I feel like this should be pretty simple to achieve. To clarify, if I had successfully reorganised the data, then the y-values would go up as the plot moves along the x-axis. So maybe I'm focussing on the wrong part of the code to achieve this in a ggplot figure?
You could do something like this?
library(tidyverse);
cor.data %>%
mutate(pic = factor(pic, levels = as.character(pic)[order(r.val)])) %>%
ggplot(aes(x = pic, y = r.val, size = df.val, color = exp)) + geom_point()
This obviously still needs some polishing to deal with the x axis label clutter etc.
Rather than try to order the data before creating the plot, I can reorder the data at the time of writing the plot:
cor.data<- read.csv("https://dl.dropbox.com/s/p4uy6uf1vhe8yzs/cor.data.csv?dl=0",stringsAsFactors = F)
cor.data.sorted = cor.data[with(cor.data,order(r.val,pic)),] #<-- This line controls the order in which points are drawn, to make a (slightly) more readable plot
ggplot(cor.data.sorted,aes(x=reorder(pic,r.val),y=r.val,size=df.val,color=exp)) + geom_point()
to create
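To see why sorting the rows alone does not change the axis while reorder() does, it may help to look at the factor levels directly. The sketch below uses a made-up three-row data frame named `toy`, not the poster's file.

```r
# Made-up data for illustration only.
toy <- data.frame(
  pic   = c("a", "b", "c"),
  r.val = c(0.5, -0.2, 0.1)
)

# reorder() returns a factor whose levels are sorted by the second argument.
ordered_pic <- reorder(toy$pic, toy$r.val)
levels(ordered_pic)
```

Here `b` has the lowest `r.val`, so it becomes the first level and would be drawn first on the x-axis. ggplot2 places discrete values by level order, not by row order, which is why reordering the rows has no visible effect.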

geom_histogram to plot counts/accumulation of each x value and higher

I am trying to create a histogram/bar plot in R to show the counts of each x value I have in the dataset and higher. I am having trouble doing this, and I don't know if I use geom_histogram or geom_bar (I want to use ggplot2). To describe my problem further:
On the X axis I have "Percent_Origins," which is a column in my data frame. On my Y axis - for each of the Percent_Origin values I have occurring, I want the height of the bar to represent the count of rows with that percent value and higher. Right now, if I am to use a histogram, I have:
plot <- ggplot(dataframe, aes(x=dataframe$Percent_Origins)) +
geom_histogram(aes(fill=Percent_Origins), binwidth= .05, colour="white")
What should I change the fill or general code to be to do what I want? That is, plot an accumulation of counts of each value and higher? Thanks!
I think that your best bet is going to be creating the cumulative counts first and then passing them to ggplot. There are several ways to do this, but a simple one (using dplyr) is to sort the data in descending order and then assign a running count to each row. Trim the data so that only the largest count for each value is kept, then plot it.
To demonstrate, I am using the builtin iris data.
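library(dplyr)
library(ggplot2)
iris %>%
  arrange(desc(Sepal.Length)) %>% # largest values first
  mutate(counts = 1:n()) %>% # running count of rows seen so far
  group_by(Sepal.Length) %>%
  slice(n()) %>% # keep the largest count for each value
  ggplot(aes(x = Sepal.Length, y = counts)) +
  geom_step(direction = "vh")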
gives:
If you really want bars instead of a line, use geom_col instead. However, note that you either need to fill in gaps (to ensure the bars are evenly spaced across the range) or deal with breaks in the plot.
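As a rough sketch of the bar version, the same "this value and higher" counts can also be built with count() and cumsum(), which is an alternative to the row-numbering approach above rather than a replacement for it. The name `cum_counts` is made up for this example.

```r
library(dplyr)
library(ggplot2)

cum_counts <- iris %>%
  count(Sepal.Length) %>%          # number of rows per distinct value
  arrange(desc(Sepal.Length)) %>%  # largest value first
  mutate(counts = cumsum(n))       # rows with this value or higher

ggplot(cum_counts, aes(x = Sepal.Length, y = counts)) +
  geom_col()
```

The bar for the smallest value has a height equal to the number of rows in the data, and gaps appear wherever a value does not occur.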

How to plot parallel coordinates with multiple categorical variables in R

I am having difficulty plotting a parallel coordinates plot using ggparcoord from the GGally package. As there are two categorical variables, what I want to show in the visualisation is like the image below. I've found that in ggparcoord, groupColumn accepts only a single variable to group (colour) by. I can use showPoints to mark the values on the axes, but I also need to vary the shape of these markers according to the categorical variables. Is there another package that can help me realise my idea?
Any response will be appreciated! Thanks!
It's not that difficult to roll your own parallel coordinates plot in ggplot2, which will give you the flexibility to customize the aesthetics. Below is an illustration using the built-in diamonds data frame.
To get parallel coordinates, you need to add an ID column so you can identify each row of the data frame, which we'll use as a group aesthetic in ggplot. You also need to scale the numeric values so that they'll all be on the same vertical scale when we plot them. Then you need to take all the columns that you want on the x-axis and reshape them to "long" format. We do all that on the fly below with the tidyverse/dplyr pipe operator.
Even after limiting the number of category combinations, the lines are probably too intertwined for this plot to be easily interpretable, so consider this merely a "proof of concept". Hopefully, you can create something more useful with your data. I've used colour (for the lines) and fill (for the points) aesthetics below. You can use shape or linetype instead, depending on your needs.
library(tidyverse)
theme_set(theme_classic())
# Get 20 random rows from the diamonds data frame after limiting
# to two levels each of cut and color
set.seed(2)
ds = diamonds %>%
filter(color %in% c("D","J"), cut %in% c("Good", "Premium")) %>%
sample_n(20)
ggplot(ds %>%
mutate(ID = 1:n()) %>% # Add ID for each row
mutate_if(is.numeric, scale) %>% # Scale numeric columns
gather(key, value, c(1,5:10)), # Reshape to "long" format
aes(key, value, group=ID, colour=color, fill=cut)) +
geom_line() +
geom_point(size=2, shape=21, colour="grey50") +
scale_fill_manual(values=c("black","white"))
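Since the question asks for marker shape to vary by category, here is a hedged variation of the same idea that maps cut to shape instead of fill. It is a sketch, not the code above: it uses across() and pivot_longer() in place of mutate_if() and gather(), names the columns explicitly, and the object name `long` is made up. The cut column is converted to character only to avoid ggplot2's warning about using shapes for an ordered factor.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(2)
ds <- diamonds %>%
  filter(color %in% c("D", "J"), cut %in% c("Good", "Premium")) %>%
  sample_n(20)

long <- ds %>%
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x)))) %>% # common scale
  mutate(ID = row_number()) %>%                                  # one ID per row
  pivot_longer(c(carat, depth, table, price, x, y, z),
               names_to = "key", values_to = "value")            # long format

ggplot(long, aes(key, value, group = ID, colour = color,
                 shape = as.character(cut))) +
  geom_line() +
  geom_point(size = 2) +
  labs(shape = "cut")
```

Each of the 20 sampled rows becomes one line across the seven numeric axes, with colour showing color and point shape showing cut.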
I haven't used ggparcoord before, but the only option that seemed straightforward (at least on my first try with the function) was to paste together two columns of data. Below is an example. Even with just four category combinations, the plot is confusing, but maybe it will be interpretable if there are strong patterns in your data:
library(GGally)
ds$group = with(ds, paste(cut, color, sep="-"))
ggparcoord(ds, columns=c(1, 5:10), groupColumn=11) +
theme(panel.grid.major.x=element_line(colour="grey70"))
