How can I create a (100%) stacked histogram in R?

My dataset:
I have data in the following format (here, imported from a CSV file). You can find an example dataset as CSV here.
PAIR PREFERENCE
1 5
1 3
1 2
2 4
2 1
2 3
… and so on. In total, there are 19 pairs, and the PREFERENCE ranges from 1 to 5, as discrete values.
What I'm trying to achieve:
What I need is a stacked histogram, e.g. a 100% high column, for each pair, indicating the distribution of the PREFERENCE values.
Something similar to the "100% stacked columns" in Excel, or (although not quite the same) a so-called "mosaic plot".
What I tried:
I figured it'd be easiest using ggplot2, but I don't even know where to start. I know I can create a simple bar chart with something like:
ggplot(d, aes(x=factor(PAIR), y=factor(PREFERENCE))) + geom_bar(position="fill")
… that however doesn't get me very far. So I tried this, and it gets me somewhat closer to what I'm trying to achieve, but it still uses the count of PREFERENCE, I suppose? Note the ylab being "count" here, and the values ranging to 19.
qplot(factor(PAIR), data=d, geom="bar", fill=factor(PREFERENCE_FIXED))
Results in:
So, what do I have to do to get the stacked bars to represent a histogram?
Or do they actually do this already?
If so, what do I have to change to get the labels right (e.g. have percentages instead of the "count")?
By the way, this is not really related to this question, and only marginally related to this (i.e. probably same idea, but not continuous values, instead grouped into bars).

Maybe you want something like this:
ggplot() +
  geom_bar(data = dat,
           aes(x = factor(PAIR), fill = factor(PREFERENCE)),
           position = "fill")
where I've read your data into dat. This outputs something like this:
The y label is still "count", but you can change that manually by adding:
+ scale_x_discrete("Pairs") + scale_y_continuous("Votes")
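To get percentages on the y axis instead of relabelling it by hand, something like the following sketch should work. It assumes the scales package is installed (it normally comes along with ggplot2), uses only the six sample rows from the question, and the axis and legend titles are just placeholders:

```r
library(ggplot2)

# Sample rows from the question; `dat` is the name used in the answer above.
dat <- data.frame(
  PAIR       = c(1, 1, 1, 2, 2, 2),
  PREFERENCE = c(5, 3, 2, 4, 1, 3)
)

# position = "fill" rescales every bar to a total height of 1;
# scales::percent then formats the axis ticks as percentages.
p <- ggplot(dat, aes(x = factor(PAIR), fill = factor(PREFERENCE))) +
  geom_bar(position = "fill") +
  scale_x_discrete("Pairs") +
  scale_y_continuous("Share of votes", labels = scales::percent) +
  labs(fill = "Preference")

# The same proportions as a table, one row per pair (each row sums to 1).
shares <- prop.table(table(dat$PAIR, dat$PREFERENCE), margin = 1)

print(p)
```

So yes, the stacked bars already show the distribution per pair; only the axis formatting needs to change.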

Related

Making a line graph with certain X + Y values expressed differently with lines of 33 user IDs in R

I'm trying to put ActivityDate on the X Axis, and Calories on the Y Axis, relating to how 33 different users ranged in their calorie burnings daily. I'm new to ggplot and visualizations as you can tell, so I'd appreciate the most basic solution that I can understand. Thank you so much.
I really tried several iterations of this code, and each one of them weren't quite right in how the visualization turned out. Here are a couple of my thoughts:
##first and foremost:
install.packages("tidyverse")
install.packages("here")
library(tidyverse)
library(here)
Attempt 1 Bar Graph
ggplot(data=trimmed_dactivity) + geom_bar(mapping=aes(x=Id, color=ActivityDate))
##Not probably the best for stakeholders, but if I could maybe have the bars a little closer together that might help, so I tried to identify the unique IDs. Perhaps the reason why they are so small is that they appear in long number format, and are not sequential, so it could be adding the extra space and making the bars so small because of the spaces of empty sequential numbers.
Attempt 2 Bar Graph
UId <- unique("Id")
ggplot(data=trimmed_dactivity) + geom_bar(mapping=aes(x=UId, color=ActivityDate))
##Facepalm, definitely not what I was looking for at all, but that was my effort to solve the above problem.
Attempt 3 Bar Graph
ggplot(data=trimmed_dactivity) + geom_bar(mapping=aes(x=ActivityDate, fill=Id)) + theme(axis.text.x = element_text(angle=45))
##The fill function does not work, and on the y-axis if you will, I don't know what "count" is referring to in this case, so could be useful except for those two issues.
##Finally, I switch to a line graph
Attempt 4 Line Graph
ggplot(data=trimmed_dactivity) + geom_line(mapping=aes(x=ActivityDate, y=Calories)) + theme(axis.text.x = element_text(angle=45))
##Now what I get is separate lines going up and down, and what I want is 33 separate lines representing unique Id numbers to travel along the x axis for time, and rise in the y axis for calories. Of course I'm not sure how to do that...
Any help with what I'm missing on this journey here?
what I want is 33 separate lines representing unique Id numbers…
It sounds like you want a spaghetti plot. To make one, map Id to color (or to group if you don’t want each id to be colored differently).
library(ggplot2)
ggplot(fakedata, aes(ActivityDate, Calories)) +
geom_line(aes(color = factor(Id)), show.legend = FALSE)
Example data:
set.seed(13)
fakedata <- expand.grid(
Id = 1:33,
ActivityDate = seq(as.Date("2016-04-13"), length.out = 10, by = "day")
)
fakedata$Calories <- round(rnorm(330, 2500, 500))
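If you would rather not give every Id its own color, the group variant mentioned above might look like this. It is only a sketch on the same fake data, and the alpha value is an arbitrary choice to keep overlapping lines readable:

```r
library(ggplot2)

# Same fake data as in the answer: 33 ids observed on 10 consecutive days.
set.seed(13)
fakedata <- expand.grid(
  Id = 1:33,
  ActivityDate = seq(as.Date("2016-04-13"), length.out = 10, by = "day")
)
fakedata$Calories <- round(rnorm(330, 2500, 500))

# group = Id draws one line per id without mapping a color to it.
p <- ggplot(fakedata, aes(ActivityDate, Calories, group = Id)) +
  geom_line(alpha = 0.4)

print(p)
```

Without color or group, geom_line connects all 330 points as one series, which is the up-and-down zigzag from Attempt 4.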

How to plot a gg barplot for a single factor column?

My data frame has 621 rows, and each column describes something about a row. I'm trying to do an exploratory data analysis where I plot the data as a bar plot.
I have a factor column called phenotype, which has 86 levels which describe the main condition in my cohort. I want to plot this out as 86 separate bar plots, each with the total number of people who have that condition on ggplot.
I've attached a screenshot of my data below, I basically want the x axis to have the condition name like the 'Bardet-Biedl Syndrome', 'Classic Ehlers Danlos Syndrome' etc and on the y axis the number of people who have that condition, such as 3,4,5 as displayed below etc. I got the below data by basically doing
table(data.frame$Phenotype)
I'm using the below code to generate my ggplot
ggplot(tiering, aes(x = Phenotype, y = count(tiering$Phenotype))) +
  theme_bw() +
  geom_bar(stat = "identity")
I'm sure the answer is out there, but I've looked on the R help websites and I can't seem to figure this out, so would be very grateful for the help.
EDIT: I got to a bar plot with the help of the code below. Now I'm trying to reorder the bars in decreasing order; I tried this method, but it hasn't worked. Would anyone have any suggestions?
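One possible approach, as a sketch only: geom_bar() counts the rows per factor level by itself when no y aesthetic is given, and the bar order follows the order of the factor levels. The tiny `tiering` data frame below is a made-up stand-in for the real one, and its phenotype labels are placeholders:

```r
library(ggplot2)

# Hypothetical stand-in for the asker's `tiering` data frame.
tiering <- data.frame(
  Phenotype = factor(c("A", "B", "B", "C", "C", "C"))
)

# Reorder the factor levels by decreasing frequency.
counts <- sort(table(tiering$Phenotype), decreasing = TRUE)
tiering$Phenotype <- factor(tiering$Phenotype, levels = names(counts))

# With no y aesthetic, geom_bar() plots the number of rows per level.
p <- ggplot(tiering, aes(x = Phenotype)) +
  geom_bar() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

print(p)
```

With 86 levels, rotating the axis labels as above (or adding coord_flip()) probably helps readability.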

Bar chart - bars jumped to y-axis

I was plotting a bar chart with the code which worked perfectly well until some of the data had a value of 0.
barwidth = 0.35
df1:
norms_number R2.c
1 0.011
2 0
3 0.015
4 0.011
5 0
6 0.012
df2:
norms_number R2.c
1 0.001
2 0
3 0.012
4 0.006
5 0
6 0.004
test <- ggplot()+
geom_bar(data=df1, aes(x=norms_number, y=R2.c),stat="identity", position="dodge", width = barwidth)+
geom_bar(data=df2, aes(x=norms_number+barwidth+0.03, y=R2.c),
stat="identity", position="dodge",width = barwidth)
my result was:
I also got a warning that position stack requires non-overlapping x intervals (but they are not overlapping?).
I looked into it and changed the DV from numeric to factor, which half helped, because now the graph looks like this:
Why are the bars on the y axis? How else can I get around this weird error with values of 0?
First of all, you are intending to plot a bar chart where the heights are represented by a value rather than by number of cases. See here for more details, but you should be using geom_col instead of geom_bar.
With that being said, the error you are getting and the result is because it seems with x=norms_number+barwidth+0.03 you are trying to specify the precise positioning of the second set of data (df2) relative to the first set of data (df1).
In order for ggplot to dodge, it has to understand what to use as a basis for the dodge, and then it will separate (or "dodge") each observation containing the same x= aesthetic based upon that particular group used as the basis. Under normal circumstances, you would specify in aes( something like fill=, and ggplot is smart enough to know that whatever you set as fill= will also be the basis for position='dodge' to function. In the absence of that (or if you wanted to override that), you would need to specify a group= aesthetic that would be used for dodging.
Ultimately, this means that you need to combine your datasets and provide ggplot a way of deciding how to dodge. This makes sense, since both of your dataframes are intended to be placed in the same plot, and both have identical x and y aesthetics. If you leave them as separate dataframes, you can overplot them in the same plot, but there is no good way to have ggplot use position='dodge', because it needs to see all the data in the geom_col call in order to know what to use as the basis for the dodge.
With all that being said, here's what I would recommend:
# combine datasets, but first make a marker called "origin"
# this will be used as a basis for the dodge and fill aesthetics
df1$origin <- 'df1'
df2$origin <- 'df2'
df <- rbind(df1, df2)
# need to change norms_number to a factor to allow for discrete axis
df$norms_number <- as.factor(df$norms_number)
You then use only one call to geom_col to get your plot. In the first case, I will use only the group= aesthetic to show you how ggplot uses this for the dodge mechanism:
ggplot(df, aes(x=norms_number, y=R2.c)) +
geom_col(position='dodge', width=0.35, aes(group=origin), color='black')
As mentioned, you can also just supply a fill= aesthetic, and ggplot will know to use that as the mechanism for dodging:
ggplot(df, aes(x=norms_number, y=R2.c)) +
geom_col(position='dodge', width=0.35, aes(fill=origin), color='black')
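For reference, here are the steps above put together as one self-contained sketch. The two data frames are rebuilt from the values listed in the question; everything else follows the snippets above:

```r
library(ggplot2)

# Data as listed in the question.
df1 <- data.frame(norms_number = 1:6,
                  R2.c = c(0.011, 0, 0.015, 0.011, 0, 0.012))
df2 <- data.frame(norms_number = 1:6,
                  R2.c = c(0.001, 0, 0.012, 0.006, 0, 0.004))

# Mark where each row came from, then stack the two data frames.
df1$origin <- "df1"
df2$origin <- "df2"
df <- rbind(df1, df2)

# A factor gives a discrete x axis, so each norm gets its own slot.
df$norms_number <- as.factor(df$norms_number)

# fill = origin is also what position = "dodge" uses to separate the bars.
p <- ggplot(df, aes(x = norms_number, y = R2.c, fill = origin)) +
  geom_col(position = "dodge", width = 0.35, color = "black")

print(p)
```

Rows with a value of 0 simply produce zero-height bars here, so no manual x offsets are needed.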
I'm not sure whether you are trying to draw something more complicated, like a bar over a bar. Anyhow, one way is to use geom_rect() if you want to have one over the other:
ggplot()+
geom_rect(data=df1,
aes(xmin=norms_number-barwidth,xmax=norms_number,
ymin=0,ymax=R2.c))+
geom_rect(data=df2,
aes(xmin=norms_number,xmax=norms_number+barwidth,
ymin=0,ymax=R2.c))+
scale_x_continuous(breaks=1:6)

Scatterplot in ggplot stacked like barplot

I want to create a scatterplot in ggplot where there are multiple y values for each x value. I want to add these y values and plot the sum against the x value.
>df
a b
1 2
1 2
2 1
2 4
3 1
3 5
I want a plot that plots the sums of the b values for each a
a b
1 4
2 5
3 6
I can do this for a barplot by making a stacked barplot:
ggplot(data=df, aes(x=df$a, y=df$b)) + geom_bar(stat="identity")
but if I do this with geom_point ggplot just plots each value of y without stacking.
I could use ddply for this, but that would require a number of more steps. If there is a more expedient way I'd appreciate it.
I searched the site for other answers. While there were plenty about "stacked scatterplots" they were all about overlaid plots.
I don't see anything stacked about your bar chart example. If you just want to summarize the values to a single point, you can use stat_summary (in ggplot2 3.3.0 and later the argument is fun; older versions call it fun.y):
ggplot(data=df, aes(x=a, y=b)) + stat_summary(fun=sum, geom="point")
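A self-contained sketch with the data from the question, assuming ggplot2 3.3.0 or later for the fun argument. The aggregate() call is only there to show the sums the plot should display:

```r
library(ggplot2)

# Data from the question.
df <- data.frame(
  a = c(1, 1, 2, 2, 3, 3),
  b = c(2, 2, 1, 4, 1, 5)
)

# stat_summary() applies `fun` to all y values that share an x value
# and draws one point per x at the result.
p <- ggplot(df, aes(x = a, y = b)) +
  stat_summary(fun = sum, geom = "point")

# The same sums computed up front, for comparison.
sums <- aggregate(b ~ a, data = df, FUN = sum)

print(p)
```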
There are many ways to achieve this effect - of a 'histogram' but without bars, whose height is the sum of all values at the same X.
This type of graph is called a Cleveland Dot Plot, and is used because the conspicuous bars of a histogram can be a distraction or, at worst, misleading (see works by Cleveland, Tufte etc.).
One way to achieve this is to pre-process the data to do the sum, using functions such as table or hist or tapply or xtabs...
Note that base R has the function dotchart for the production of this type of graph.
dotchart(xtabs(rev(df)))
... but since we are discussing ggplot, which has powerful ways to summarise the data while plotting it, let's stick to MrFlick's theme of how to do it directly with ggplot operators (i.e. not preprocessed).
Using a weighted count summary statistic (current ggplot2 versions need stat="count" rather than stat="bin" when x is a factor):
ggplot(data=df, aes(x=factor(a),weight=b)) + geom_point(stat="count")
you may want to adjust the lower y limit to 0 here.
By stacking the height of the points:
ggplot(data=df, aes(x=factor(a),y=b)) + geom_point(position="stack")
the additional dots visible on this plot are probably superfluous and definitely ambiguous, but highlight the fact of multiplicity in the source data.
Building a dotplot
This one is popular in newspapers, but usually has dollar bills instead of giant black holes:
ggplot(data=df, aes(x=factor(a),weight=b)) + geom_dotplot(method="histodot")
It's probably not what you are looking for, but it's worth being aware of.
You should also be aware that scales are difficult to get correct in this mode, so it's best used in a hand-tuned mode, with the y scale numbering turned off.

Plot a 'top 10' style list/ranking in R based on numerical column of dataframe

I have an R dataframe that contains a string variable and a numerical variable, and I would like to plot the top 10 strings, based on the value of the numerical variable.
I can of course get the top 10 entries pretty simply:
top10_rank <- rank[order(rank$numerical_var_name),]
My first approach to visualizing this was simply to plot it like this:
ggplot(data=top10_rank, aes(x = top10_rank$numerical_var_name, y = top10_rank$string_name)) + geom_point(size=3)
And to a first approximation this "works" - the problem is that the strings on the y axis are sorted alphabetically rather than by the numerical value.
My preference would be to find a way to plot the top 10 strings without having to bother showing the numerical variable at all - just basically as a list (even better would be if I could enumerate the list). I am attempting to plot this so it looks more pleasing than simply dumping the text to the screen.
Any ideas greatly appreciated!
The y-axis tick marks may be sorted alphabetically, but the points are drawn in the order (from left to right) of the top10_rank dataframe. What you need to do is change the order of the y axis. Add + scale_y_discrete(limits=top10_rank$String) to your ggplot call and it should work.
ggplot(data=top10_rank, aes(x = top10_rank$Number,
y = top10_rank$String)) + geom_point(size=3) + scale_y_discrete(limits=top10_rank$String)
Here is a link to a great resource on R graphics: R Graphics Cookbook
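A fuller sketch of the same idea, on made-up data: `rank_df` and its values are hypothetical, while `String` and `Number` follow the column names used above. It also keeps only the ten largest values, which order() alone does not do:

```r
library(ggplot2)

# Hypothetical data; only the column names follow the answer above.
set.seed(1)
rank_df <- data.frame(
  String = paste0("item_", 1:25),
  Number = sample(1:1000, 25),
  stringsAsFactors = FALSE
)

# Sort by Number, largest first, and keep the first 10 rows.
top10_rank <- head(rank_df[order(rank_df$Number, decreasing = TRUE), ], 10)

# A discrete y axis is drawn bottom to top, so rev() puts the
# largest value at the top of the plot.
p <- ggplot(top10_rank, aes(x = Number, y = String)) +
  geom_point(size = 3) +
  scale_y_discrete(limits = rev(top10_rank$String))

print(p)
```

Mapping the columns by name inside aes(), rather than as top10_rank$Number, keeps the plot tied to the data argument.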
