Coxcomb charts in R - r

I have the following two graphs, the first one is provided, and we need to modify it to produce the second one. The code is provided below
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut), width = 1) +
labs(x=NULL) +
theme(axis.title.y=element_blank(),axis.text.y=element_blank(),axis.ticks.y=element_blank()) +
coord_polar()
This is the code that produces the first image, to get the second graph, the geom_bar() call needs to be changed, specifically, stat() needs to be called to manually set the heights. How do I modify this line of code to produce the second graph?

For those who come across a similar issue, I solved this by adding an argument for the y-axis and setting it to sqrt(stat(count)). This yields the second coxcomb chart shown above

Related

Why is the variable considered continous in legend?

I have used the following code to generate a plot with ggplot:
I want the legend to show the runs 1-8 and only the volumes 12.5 and 25 why doesn't it show it?
And is it possible to show all the points in the plot even though there is an overlap? Because right now the plot only shows 4 of 8 points due to overlap.
OP. You've already been given a part of your answer. Here's a solution given your additional comment and some explanation.
For reference, you were looking to:
Change a continuous variable to a discrete/discontinuous one and have that reflected in the legend.
Show runs 1-8 labeled in the legend
Disconnect lines based on some criteria in your dataset.
First, I'm representing your data here again in a way that is reproducible (and takes away the extra characters so you can follow along directly with all the code):
library(ggplot2)
mydata <- data.frame(
`Run`=c(1:8),
"Time"=c(834, 834, 584, 584, 1184, 1184, 938, 938),
`Area`=c(55.308, 55.308, 79.847, 79.847, 81.236, 81.236, 96.842, 96.842),
`Volume`=c(12.5, 12.5, 12.5, 12.5, 25.0, 25.0, 25.0, 25.0)
)
Changing to a Discrete Variable
If you check the variable type for each column (type str(mydata)), you'll see that mydata$Run is an int and the rest of the columns are num. Each column is understood to be a number, which is treated as if it were a continuous variable. When it comes time to plot the data, ggplot2 understands this to mean that since it is reasonable that values can exist between these (they are continuous), any representation in the form of a legend should be able to show that. For this reason, you get a continuous color scale instead of a discrete one.
To force ggplot2 to give you a discrete scale, you must make your data discrete and indicate it is a factor. You can either set your variable as a factor before plotting (ex: mydata$Run <- as.factor(mydata$Run), or use code inline, referring to aes(size = factor(Run),... instead of just aes(size = Run,....
Using reference to factor(Run) inline in your ggplot calls has the effect of changing the name of the variable to be "factor(Run)" in your legend, so you will have to also add that to the labs() object call. In the end, the plot code looks like this:
ggplot(data = mydata, aes(x=Area, y=Time)) +
geom_point(aes(color =as.factor(Volume), size = Run)) +
geom_line() +
labs(
x = "Area", y = "Time",
# This has to be changed now
color='Volume'
) +
theme_bw()
Note in the above code I am also not referring to mydata$Run, but just Run. It is greatly preferable that you refer to just the name of the column when using ggplot2. It works either way, but much better in practice.
Disconnect Lines
The reason your lines are connected throughout the data is because there's no information given to the geom_line() object other than the aesthetics of x= and y=. If you want to have separate lines, much like having separate colors or shapes of points, you need to supply an aesthetic to use as a basis for that. Since the two lines are different based on the variable Volume in your dataset, you want to use that... but keep the same color for both. For this, we use the group= aesthetic. It tells ggplot2 we want to draw a line for each piece of data that is grouped by that aesthetic.
ggplot(data = mydata, aes(x=Area, y=Time)) +
geom_point(aes(color =as.factor(Volume), size = Run)) +
geom_line(aes(group=as.factor(Volume))) +
labs(
x = "Area", y = "Time", color='Volume'
) +
theme_bw()
Show Runs 1-8 Labeled in Legend
Here I'm reading a bit into what you exactly wanted to do in terms of "showing runs 1-8" in the legend. This could mean one of two things, and I'll assume you want both and show you how to do both.
Listing and showing sizes 1-8 in the legend.
To set the values you see in the scale (legend) for size, you can refer to the various scale_ functions for all types of aesthetics. In this case, recall that since mydata$Run is an int, it is treated as a continuous scale. ggplot2 doesn't know how to draw a continuous scale for size, so the legend itself shows discrete sizes of points. This means we don't need to change Run to a factor, but what we do need is to indicate specifically we want to show in the legend all breaks in the sequence from 1 to 8. You can do this using scale_size_continuous(breaks=...).
ggplot(data = mydata, aes(x=Area, y=Time)) +
geom_point(aes(color =as.factor(Volume), size = Run)) +
geom_line(aes(group=as.factor(Volume))) +
labs(
x = "Area", y = "Time", color='Volume'
) +
scale_size_continuous(breaks=c(1:8)) +
theme_bw()
Showing all of your runs as points.
The note about showing all runs might also mean you want to literally see each run represented as a discrete point in your plot. For this... well, they already are! ggplot2 is plotting each of your points from your data into the chart. Since some points share the same values of x= and y=, you are getting overplotting - the points are drawn over top of one another.
If you want to visually see each point represented here, one option could be to use geom_jitter() instead of geom_point(). It's not really great here, because it will look like your data has different x and y values, but it is an option if this is what you want to do. Note in the code below I'm also changing the shape of the point to be a hollow circle for better clarity, where the color= is the line around each point (here it's black), and the fill= aesthetic is instead used for Volume. You should get the idea though.
set.seed(1234) # using the same randomization seed ensures you have the same jitter
ggplot(data = mydata, aes(x=Area, y=Time)) +
geom_jitter(aes(fill =as.factor(Volume), size = Run), shape=21, color='black') +
geom_line(aes(group=as.factor(Volume))) +
labs(
x = "Area", y = "Time", fill='Volume'
) +
scale_size_continuous(breaks=c(1:8)) +
theme_bw()

ggplot - gaps in stacked bar plot [duplicate]

I'm using ggplot and I get those weird horizontal lines out of geom_bar. I cannot provide a minimal working example: the same code works with few observations and it relies on data I am importing and transforming. However, I can show the relevant line of codes and cross my fingers someone ran into this issue:
ggplot(data) + geom_bar(aes(x=Horizon, y=Importance, fill=Groups),
position='fill', stat='identity') +
theme_timeseries2() +
scale_fill_manual(values=c('#1B9E77', 'orange2', 'black',
'red2', 'blue4')) +
xlab('') + ylab('')
My personal function, theme_timeseries2() isn't the source of the problem: it happens even if I stop after geom_bar. I checked for missing values in Importance and every other column of my data frame and there are none.
It's also very odd: the white lines aren't the same on the zoomed page as in the plot window of RStudio. They do print in .png format when I save the file, so there really is something going on with those horizontal bars. Any theory about why geom_bar() does this would be highly appreciated.
You can fix it by adding the fill as color. Like this:
geom_bar(aes(x=Horizon, y=Importance, fill=Groups, color=Groups),
position='fill', stat='identity')
This was suggested here.
I'm guessing the lines are due to a plotting bug between observations that go into each bar. (That could be related to the OS, the graphics device, and/or how ggplot2 interacts with them...)
I expect it'd go away if you summarized before ggplot2, e.g.:
library(dplyr);
data %>%
count(Horizon, Groups, wt = Importance, name = "Importance") %>%
ggplot() +
geom_col(aes(x = Horizon, y= Importance, fill = Groups), position = "fill") + ....
Mine went away when changing the size of the graphs in rmarkdown.

Why does this ggplot only plot the grid without the values?

I am trying to plot a bar chart in ggplot but I am continuously getting only the grid. This is apparently a demonstration about the draw nothing here but I would like to understand how to get the values visible in the simplest way.
library(ggplot2)
testData<-data.frame(x=c("a","b","c","d","e","f"), y=c(10,6,9,28,10,17))
bar <- ggplot(data=testData, aes(x=c("a","b","c","d","e","f"), y=c(10,6,9,28,10,17), fill = "#FFCC00"))
One way I can get the plots is the geom_bar
bar <- ggplot(data=testData, aes(x=c("a","b","c","d","e","f"), y=c(10,6,9,28,10,17), fill = "#FFCC00")) + geom_bar(stat="identity")
Why are the values not plotted on the first bar chart and how to fix it the simplest way? What is the idea behind of this way of plotting with + and what is it called?
With the ggplot2 package, calling ggplot() is only meant to call the basic grid; it's like taking out a piece of graph paper before drawing a graph. In either case, having the grid ready has nothing to do with plotting the graph. That's why running the following command will result in the empty grid in your first example:
ggplot(data=testData, aes(x=x, y=y, fill = "#FFCC00"))
It's not the same as using a function like plot() or hist(), which prep the grid and plot the data at the same time:
plot(x=x,y=y,data=testData)
hist(x=x,data=testData)
The "+" in ggplot is just a way to say that there are more arguments related to the ggplot that we want included on top of the first blank grid. That's why each line separated by a "+" is typically called a layer.
So, if we want to make a simple scatterplot, we add points on top of a grid:
testData<-data.frame(x=c(1:6), y=c(10,6,9,28,10,17))
ggplot(data=testData,aes(x=x,y=y)) +
geom_point()
Output:
If we want to add lines to that scatterplot, we can just add one line of code:
ggplot(data=testData,aes(x=x,y=y)) +
geom_point() +
geom_line()
Output:
We can keep adding layers like this if we want. Just note that they will print in the order that you type them (i.e. the first few lines will be below the lines printed after them):
ggplot(data=testData,aes(x=x,y=y)) +
geom_bar(stat="identity",fill="#00BFC4") +
geom_point() +
geom_line()
Output:
Also, note that it's recommended not to call your data multiple times within a ggplot call; that can lead to errors.
Don't use:
ggplot(data=testData, aes(x=c("a","b","c","d","e","f"),
y=c(10,6,9,28,10,17), fill = "#FFCC00")) +
geom_bar(stat="identity")
#or
ggplot(data=testData, aes(x=testData$x, y=testData$x, fill = "#FFCC00")) +
geom_bar(stat="identity")
Instead use:
ggplot(data=testData, aes(x=x, y=y, fill="#FFCC00")) +
geom_bar(stat="identity")
If you want to plot data from a data frame(s) not called within the first ggplot() line, then simply add a data argument to the "layers" that use that different data frame, like this:
ggplot(data=testData,aes(x=x,y=y)) +
geom_bar(stat="identity",fill="#00BFC4") +
geom_point(data=differentDf, aes(x=x,y=y)) +
geom_line(data=differentDf, aes(x=x,y=y))

compare boxplots with a single value

I want to compare the distribution of several variables (here X1 and X2) with a single value (here bm). The issue is that these variables are too many (about a dozen) to use a single boxplot.
Additionaly the levels are too different to use one plot. I need to use facets to make things more organised:
However with this plot my benchmark category (bm), which is a single value in X1 and X2, does not appear in X1 and seems to have several values in X2. I want it to be only this green line, which it is in the first plot. Any ideas why it changes? Is there any good workaround? I tried the options of facet_wrap/facet_grid, but nothing there delivered the right result.
I also tried combining a bar plot with bm and three empty categories with the boxplot. But firstly it looked terrible and secondly it got similarly screwed up in the facetting. Basically any work around would help.
Below the code to create the minimal example displayed here:
# Creating some sample data & loading libraries
library(ggplot2)
library(RColorBrewer)
set.seed(10111)
x=matrix(rnorm(40),20,2)
y=rep(c(-1,1),c(10,10))
x[y==1,]=x[y==1,]+1
x[,2]=x[,2]+20
df=data.frame(x,y)
# creating a benchmark point
benchmark=data.frame(y=rep("bm",2),key=c("X1","X2"),value=c(-0.216936,20.526312))
# melting the data frame, rbinding it with the benchmark
test_dat=rbind(tidyr::gather(df,key,value,-y),benchmark)
# Creating a plot
p_box <- ggplot(data = test_dat, aes(x=key, y=value,color=as.factor(test_dat$y))) +
geom_boxplot() + scale_color_manual(name="Cluster",values=brewer.pal(8,"Set1"))
# The first line delivers the first plot, the second line the second plot
p_box
p_box + facet_wrap(~key,scales = "free",drop = FALSE) + theme(legend.position = "bottom")
The problem only lies int the use of test_dat$y inside the color aes. Never use $ in aes, ggplot will mess up.
Anyway, I think you plot would improve if you use a geom_hline for the benchmark, instead of hacking in a single value boxplot:
library(ggplot2)
library(RColorBrewer)
ggplot(tidyr::gather(df,key,value,-y)) +
geom_boxplot(aes(x=key, y=value, color=as.factor(y))) +
geom_hline(data = benchmark, aes(yintercept = value), color = '#4DAF4A', size = 1) +
scale_color_manual(name="Cluster",values=brewer.pal(8,"Set1")) +
facet_wrap(~key,scales = "free",drop = FALSE) +
theme(legend.position = "bottom")

R incorrect y-axis in ggplots geom_bar()

I have a dataframe with Wikipedia edits, with information about the number of edit for the user (1st edit, 2nd edit and so on), the timestamp when the edit was made, and how many words were added.
In the actual dataset, I have up to 20.000 edits per user and in some edits, they add up to 30.000 words.
However, here is a downloadable small example dataset to exemplify my problem. The header looks like this:
I am trying to plot the distribution of added words across the Edit Progression and across time. If I use the regular R barplot, i works just like expected:
barplot(UserFrame3$NoOfAdds,UserFrame3$EditNo)
But I want to do it in ggplot for nicer graphics and more customizing options.
If I plot this as a scatterplot, I get the same result:
ggplot(data = UserFrame3, aes(x = UserFrame3$EditNo, y = UserFrame3$NoOfAdds)) + geom_point(size = 0.1)
Same for a linegraph:
ggplot(data = UserFrame3, aes(x = UserFrame3$EditNo, y = UserFrame3$NoOfAdds)) +geom_line(size = 0.1)
But when I try to plot it as a bargraph in ggplot, I get this result:
ggplot(data = UserFrame3, aes(x = UserFrame3$EditNo, y = UserFrame3$NoOfAdds)) + geom_bar(stat = "identity", position = "dodge")
There appear to be a lot more holes on the X-axis and the maximum is nowhere close to where it should be (y = 317).
I suspect that ggplot somehow groups the bars and uses means instead of the actual values despite the "dodge" parameter? How can I avoid this? and how would I go about plotting the time progression as a bargraph aswell without ggplot averaging over multiple edits?
You should expect more x-axis "holes" using bars as compared with lines. Lines connect the zero values together, bars do not.
I used geom_col with your data download, it looks as expected:
UserFrame3 %>%
ggplot(aes(EditNo, NoOfAdds)) + geom_col()

Resources