Selective display of group text labels on a stacked ggplot2 bar plot in R

I'm creating several stacked bar plots using ggplot. I'm grouping my results by year and I want to sort my data by a factor variable that has many levels (around 30). I want to display my cumulative sums, but there are so many of them that they overlap.
My bar plot looks OK for categories with big values, but I haven't managed to find a solution for categories that have small values. I tried setting different geom_text arguments. Now I would like to simply exclude the text for those categories from the bar plot, but I don't know how.
ggplot(data=pivot, aes(x=YEAR, y=SUM, fill=GROUP))+
geom_bar(stat="identity")+
geom_text(aes(label=round(SUM)), vjust=1.6,
position = position_stack(), size=2.5)+
labs(x = "YEAR", y="Amount sold in EUR")
I think that my graphs look better with text over categories with bigger values so I want to include them in the final results but don't know how to select only a few for display.
My dataframe looks as follows:
> pivot
A tibble: 86 x 3
Groups: value [31]
value Year SUM
1 1 2011 771.
2 1 2012 999.
3 1 2013 1479.
4 1 2014 512.
5 1 2015 677.
6 3 2012 4.07
7 4 2012 7.92
8 4 2013 3.97
9 4 2014 41.2
10 5 2011 12.0
... with 76 more rows
I would like to display text on the bar plot for values of SUM for category 1, as they are bigger, but not for categories 3, 4 and 5. In the final result I would be content with displaying text only for categories 1, 24 and 26, but I don't know how to select only them.
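One possible approach, sketched here under some assumptions, is to keep every row in the data so that position_stack() still computes the correct heights, and to blank out the label text for the groups that should stay unlabeled. The data frame below is a small made-up stand-in for pivot (the rows for groups 24 and 26 are invented), and the column names value and Year follow the printed tibble rather than the GROUP and YEAR used in the plotting code above.

```r
library(ggplot2)

# Hypothetical stand-in for `pivot`; the rows for groups 24 and 26 are invented
pivot <- data.frame(
  value = factor(c(1, 1, 3, 4, 24, 26)),
  Year  = c(2011, 2012, 2012, 2012, 2011, 2012),
  SUM   = c(771, 999, 4.07, 7.92, 350, 420)
)

# Groups whose values should be printed on the bars
keep <- c("1", "24", "26")

# Empty string for every other group, so nothing is drawn there
pivot$label <- ifelse(pivot$value %in% keep,
                      as.character(round(pivot$SUM)),
                      "")

p <- ggplot(pivot, aes(x = Year, y = SUM, fill = value)) +
  geom_col() +
  geom_text(aes(label = label),
            position = position_stack(vjust = 0.5),  # center text in each segment
            size = 2.5) +
  labs(x = "YEAR", y = "Amount sold in EUR")
```

Filtering the rows passed to geom_text() instead would change what position_stack() stacks, so the remaining labels could land at the wrong heights; blanking the label avoids that. A threshold on SUM (for example, labeling only segments above some value) could replace the %in% test in the same way.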

Related

Order Stacked Bars Plot R

I've tried to organize by factor levels; I've tried to organize my data, but nothing is working.
I want the stacked bars to be from either 1-5 or 5-1.
Data:
Scale variable value
5 5 - Extremely valuable Q10A 17.8%
10 5 - Extremely valuable Q10B 18.9%
4 4 Q10A 27.1%
9 4 Q10B 31.4%
3 3 Q10A 31.5%
8 3 Q10B 32.4%
2 2 Q10A 12.7%
7 2 Q10B 8.8%
1 1 - No value at all Q10A 11%
6 1 - No value at all Q10B 8.6%
Code:
ggplot(breakstablemelt,aes(x=variable, y=value,fill=Scale))+
geom_bar(stat="identity")+
coord_flip()+
labs(title="title",
x="Q10",
y=NULL)
Organizing Data by Scale:
breakstablemelt=breakstablemelt[order(breakstablemelt$Scale,decreasing=T),]
Edit:
Factor Organization:
breakstablemelt$Scale<-factor(breakstablemelt$Scale, levels=breakstable$Scale)
breakstablemelt2=breakstablemelt %>% arrange(desc(Scale))
Graph output: [image: unordered stacked bar graph]
Removed the percent symbols at the end of the Value column, and it fixed everything.
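The likely reason this helped: with the trailing "%" the value column is text, so ggplot treats it as discrete and the bar lengths and stacking no longer reflect the numbers. A minimal sketch of the conversion, using a few values copied from the data above (the variable name value is only for illustration):

```r
# Percent strings as they appear in the value column
value <- c("17.8%", "18.9%", "27.1%", "11%")

# Strip the literal percent sign, then convert to numeric
value_num <- as.numeric(sub("%", "", value, fixed = TRUE))
```

Once the column is numeric, the factor levels of Scale control the stacking order as expected.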

R: Plot several lines in the same plot: ggplot + data tables or frames vs matrices

My general problem: I tend to struggle using ggplot, because it's very data-frame-centric but the objects I work with seem to fit matrices better than data frames. Here is an example (adapted a little).
I have a quantity x that can assume values 0:5, and a "context" that can have values 0 or 1. For each context I have 7 different frequency distributions over the quantity x. (More generally I could have more than two "contexts", more values of x, and more frequency distributions.)
I can represent these 7×2 frequency distributions as a list freqs of two matrices, say:
> freqs
$`context0`
x0 x1 x2 x3 x4 x5
sample1 20 10 10 21 37 2
sample2 34 40 6 10 1 8
sample3 52 4 1 2 17 25
sample4 16 32 25 11 5 10
sample5 28 2 10 4 21 35
sample6 22 13 35 12 13 5
sample7 9 5 43 29 4 10
$`context1`
x0 x1 x2 x3 x4 x5
sample1 15 21 14 15 14 21
sample2 27 8 6 5 29 25
sample3 13 7 5 26 48 0
sample4 33 3 18 11 13 22
sample5 12 23 40 11 2 11
sample6 5 51 2 28 5 9
sample7 3 1 21 10 63 2
or a 3D array.
Or I could use a data.table tablefreqs like this one:
> tablefreqs
context x0 x1 x2 x3 x4 x5
1: 0 20 10 10 21 37 2
2: 0 34 40 6 10 1 8
3: 0 52 4 1 2 17 25
4: 0 16 32 25 11 5 10
5: 0 28 2 10 4 21 35
6: 0 22 13 35 12 13 5
7: 0 9 5 43 29 4 10
8: 1 15 21 14 15 14 21
9: 1 27 8 6 5 29 25
10: 1 13 7 5 26 48 0
11: 1 33 3 18 11 13 22
12: 1 12 23 40 11 2 11
13: 1 5 51 2 28 5 9
14: 1 3 1 21 10 63 2
Now I'd like to draw the following line plot (there's a reason why I need line plots and not, say, histograms or bar plots):
The 7 frequency distributions for context 0, with x as x-axis and the frequency as y-axis, all in the same line plot (with some alpha).
The 7 frequency distributions for context 1, again with x as x-axis and the frequency as y-axis, all in the same line plot (with alpha), but displayed upside-down below the plot for context 0.
Ggplot would surely do this very nicely, but it seems to require some acrobatics with data tables:
– If I use the data table tablefreqs it's not clear to me how to plot all its rows having context==0 in the same plot: ggplot seems to only think column-wise, not row-wise. I could use the six values of x as table rows, but then the "context" values would also end up in a row, and I'm not sure I can subset a data table by values in a row, rather than in a column.
– If I use the matrix freqs, I could create a mini-data-table having x as one column and one frequency distribution as another column, input that into ggplot+geom_line, then go over all 7 frequency distributions in a for-loop maybe. Not clear to me how to tell ggplot to keep the previous plots in this case. Then another for-loop over the two "contexts".
I'd be grateful for suggestions on how to approach this problem in particular, and more generally on what objects to choose for storing this kind of data: matrices? data tables, maybe with a different structure than shown here? some other formats?
I would suggest familiarizing yourself with the concept known as Tidy Data, a set of principles for data handling and storage adopted by ggplot2 and a number of other packages.
You are free to use a matrix or list of matrices to store your data; however, you can certainly store the data as you describe it (and as I understand it) in a data frame or single table using the following convention of columns:
context | sample | x | freq
I'll show you how I would convert the tablefreqs dataset you shared with us into that format, then how I would go about creating a plot as you are describing it in your question. I'm assuming in this case you only have the two values for context, although you allude to there being more. I'm going to try to interpret correctly what you stated in your question.
Create the Tidy Data frame
Your data frame as shown contains columns x0 through x5, so the values for x are spread across more than one column, when you really need them in the format shown above. Converting them is called "gathering" your data, and you can do that with tidyr::gather().
First, I also need to replicate the naming of your samples according to the matrix dataset, so I'll do that and gather your data:
library(dplyr)
library(tidyr)
library(ggplot2)
# create the sample names
tablefreqs$sample <- rep(paste0('sample',1:7), 2)
# gather the columns together
df <- tablefreqs %>%
gather(key='x', value='freq', -c(context, sample))
Note that in the gather() function, we have to specify to leave alone the two columns df$context and df$sample, as they are not part of the gathering effort. But now we are left with df$x containing character vectors. We can't plot that, because we want the values to be numbers (at least... I'm assuming you do). For that, we'll convert using:
df$x <- as.numeric(gsub("[^[:digit:].]", "", df$x))
That extracts the number from each value in df$x and represents it as a number, not a character. We have the opposite issue with df$context, which is actually a discrete factor, and we should represent it as such in order to make plotting a bit easier:
df$context <- factor(df$context)
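For readers on a recent tidyr, the same reshaping can be sketched with pivot_longer(), which tidyr now recommends over gather(). This is an alternative to the code above, not the method used in this answer; the two-row table is a made-up excerpt of tablefreqs with the sample column already added.

```r
library(tidyr)

# Hypothetical excerpt of `tablefreqs` (first sample of each context, x0 to x2 only)
tablefreqs <- data.frame(
  context = c(0, 1),
  sample  = c("sample1", "sample1"),
  x0 = c(20, 15), x1 = c(10, 21), x2 = c(10, 14)
)

df <- pivot_longer(
  tablefreqs,
  cols            = starts_with("x"),        # x0, x1, x2; context and sample are left alone
  names_to        = "x",
  names_prefix    = "x",                     # drop the leading "x" from the column names
  names_transform = list(x = as.numeric),    # store x as a number, not text
  values_to       = "freq"
)

# context is a discrete label, so store it as a factor
df$context <- factor(df$context)
```

This folds the gsub() conversion of df$x into the reshaping step; the resulting columns are context, sample, x and freq, as in the convention described earlier.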
Create the Plot
Now we're ready to create the plot. From your description, I may not have this perfectly right, but it seems that you want a plot containing both context = 1 and context = 0, and when context = 1 the data should be "upside down". By that, I'm assuming you are talking about plotting df$freq when df$context == 0 and -df$freq when df$context == 1. We could do that using some fancy logic in the ggplot() call, but I find it's easier just to create a new column in your dataset to represent what we will be plotting on the y axis. We'll call this column df$freq_adj and use that for plotting:
df$freq_adj <- ifelse(df$context==1, -df$freq, df$freq)
Then we create the plot. I'll explain a bit below the result:
ggplot(df, aes(x=x, y=freq_adj)) +
geom_line(
aes(color=context, linetype=sample)
) +
geom_hline(yintercept=0, color='gray50') +
scale_x_continuous(expand=expansion(mult=0)) +
theme_bw()
Without some clearer description or picture of what you were looking to do, I took some liberties here. I used color to discriminate between the two values for context, and I'm using linetype to discriminate the different samples. I also added a line at 0, since it seemed appropriate to do so here, and the scale_x_continuous() command is removing the extra white space that is put in place at the extreme ends of the data.
An alternative that is maybe closer to your description would be to physically have a separation between the two plots, and represent context = 1 as a physically separate plot compared to context = 0, with one over top of the other.
Here's the code and plot:
ggplot(df, aes(x=x, y=freq_adj)) +
geom_line(aes(group=sample), alpha=0.3) +
facet_grid(context ~ ., scales='free_y') +
scale_x_continuous(expand=expansion(mult=0)) +
theme_bw()
There the use of aes(group=sample) is quite important, since I want all the lines for each sample to be the same (alpha setting and color), yet ggplot2 needs to know that the connections between the points should be based on "sample". This is done using the group= aesthetic. The scales='free_y' argument on facet_grid() allows the y axis scale to shrink and fit the data according to each facet.
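If the data live in the list of matrices freqs rather than in tablefreqs, a tidy data frame with the same four columns can be built from it directly. The sketch below uses base R only and a cut-down, partly rearranged version of freqs (two samples, two x values); the helper name tidy_one is made up for illustration.

```r
# Hypothetical cut-down version of `freqs`: two samples and two x values per context
freqs <- list(
  context0 = matrix(c(20, 34, 10, 40), nrow = 2,
                    dimnames = list(c("sample1", "sample2"), c("x0", "x1"))),
  context1 = matrix(c(15, 27, 21, 8), nrow = 2,
                    dimnames = list(c("sample1", "sample2"), c("x0", "x1")))
)

# Turn one matrix into long format: one row per (sample, x) cell
tidy_one <- function(m, name) {
  long <- as.data.frame(as.table(m), stringsAsFactors = FALSE)
  names(long) <- c("sample", "x", "freq")
  long$x <- as.numeric(sub("^x", "", long$x))          # "x0" -> 0
  long$context <- factor(sub("^context", "", name))    # "context0" -> "0"
  long
}

# Apply to every matrix in the list and stack the results
df <- do.call(rbind, Map(tidy_one, freqs, names(freqs)))
```

The result can be passed to the plotting code above after adding freq_adj. Matrices remain convenient for computation; the long format is mainly what ggplot2 expects for plotting.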

In R, how do I create a bar graph using a categorical x and the average of a numeric y with ggplot2?

I am using R version 3.5.1.
I started with this problem, and used the code suggested by the top comment on my own dataset, where time_len is a categorical variable telling how long it takes to play game i, with values of short, medium, and long; num_fans is a numeric variable telling how many fans game i has:
ggplot(cdata) +
aes(x = time_len, y = num_fans) +
geom_bar(stat = "identity")
Here is the plot:
I created the bar chart using the code above, but the problem is that the num_fans variable looks like it is a total. I want it to show the average for each category.
EDIT: here is a sample of my data.
game_id min_players single_player max_players family_game avg_time time_len year life avg_rating
1 161936 2 multi 4 family 60 short 2015 4 8.65105
2 187645 2 multi 4 family 240 long 2016 3 8.45420
3 12333 2 multi 2 small 180 long 2005 14 8.32843
4 193738 2 multi 4 family 150 medium 2016 3 8.28646
5 162886 1 single 4 family 120 medium 2017 2 8.39401
6 84876 2 multi 4 family 90 medium 2011 8 8.12803
geek_rating rating age owned num_fans
1 8.49375 8.572400 13 47498 2168
2 8.16391 8.309055 14 23989 1594
3 8.18051 8.254470 13 45955 3639
4 8.07144 8.178950 12 21513 835
5 7.96162 8.177815 13 13743 1002
6 8.01120 8.069615 12 49353 1866
Again, I am only asking about time_len. Is there a way to show the average of num_fans?
Try replacing the
geom_bar(stat = "identity")
line with
stat_summary(geom = "bar", fun.y = "mean")
(In ggplot2 3.3.0 and later this argument is named fun, and fun.y is deprecated.)
(Note: this is the answer contributed by Z.Lin, who told me that I could post it as an answer.)
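As a sketch of what this computes, the example below uses only the time_len and num_fans columns from the six sample rows in the question and the newer fun argument (ggplot2 3.3.0 or later is assumed). The aggregate() call is added here only to show the per-category means that the bars should display.

```r
library(ggplot2)

# time_len and num_fans taken from the six sample rows shown in the question
cdata <- data.frame(
  time_len = c("short", "long", "long", "medium", "medium", "medium"),
  num_fans = c(2168, 1594, 3639, 835, 1002, 1866)
)

# Per-category means, computed separately as a cross-check
means <- aggregate(num_fans ~ time_len, data = cdata, FUN = mean)

# Bars whose height is the mean of num_fans within each time_len
p <- ggplot(cdata, aes(x = time_len, y = num_fans)) +
  stat_summary(geom = "bar", fun = mean)
```

An equivalent route is to plot the pre-aggregated means table with geom_col(), which makes the plotted values easy to inspect.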

R: Plot Density Graph for data in tables with respect to Labels in tables

I have data in table form that looks like this in R:
V1 V2
1 19 -1539
2 7 -1507
3 3 -1446
4 7 -1427
5 8 -1401
6 2 -422
7 22 4178
8 5 4277
9 10 4303
10 18 4431
....200 million more lines to go
I would like to plot a density plot of the values in the second column for each label in the first column (i.e. each label has one density curve, all on the same graph). But I don't know how. Any suggestions?
If I understood the question correctly, this would end up somewhat like a density heatmap in the end. (Considering there are 200 million observations total and V1 has fairly considerable range of variation)
For that I would try ggplot and stat_binhex:
df <- read.table(text="V1 V2
1 19 -1539
2 7 -1507
3 3 -1446
4 7 -1427
5 8 -1401
6 2 -422
7 22 4178
8 5 4277
9 10 4303
10 18 4431")
library(ggplot2)
ggplot(data=df,aes(V1,V2)) +
stat_binhex() +
scale_fill_gradient(low="red", high="steelblue") +
scale_y_continuous() +
theme_bw()
stat_binhex should work well with large data and has several parameters that will help with presentation, such as bins and binwidth (see ?stat_binhex). Note that it requires the hexbin package to be installed.
OK, I figured it out myself:
ggplot(data, aes(x=V2, color=V1)) + geom_density(aes(group=V1))
That should do it.
However, there are two things I need to make sure of first in order for it to run:
V1 is a factor
V2 is a numerical value
The data I got wasn't set up by read.table the way I wanted, so I have to do the following before using ggplot:
data$V1 = as.factor(data$V1)
data$V2 = as.numeric(as.character(data$V2))
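The as.character() step matters when read.table() has turned V2 into a factor: calling as.numeric() directly on a factor returns the internal level codes rather than the printed values. A small illustration with three values taken from the table above (the vector v is only for demonstration):

```r
# A factor whose labels look like numbers, as can happen after read.table()
v <- factor(c("-1539", "4178", "-422"))

# Level codes (small integers), not the original values
codes <- as.numeric(v)

# Going through character recovers the printed values
values <- as.numeric(as.character(v))
```

Here values holds -1539, 4178 and -422, while codes holds integers between 1 and 3 whose order depends on how the levels were sorted.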

How to control colors and breaks in heatmap using ggplot?

I am trying to make a heatmap using ggplot2 package.
I have trouble controlling the colors and breaks on the heatmap.
I have 18 questions, 22 firms and the meanvalue of the firms responses on a 1 to 5 scale.
Say I want the ranges (0-1), (1-2), (2-3), (3-4) and (4-5) to be color coded, either with different colors (blue, green, red, yellow, purple) or on a gradient scale, and NA values shown in black.
In short: how do I choose colors and breaks?
I would also like to fix the order on the axis to "Question1, Question2...Question18".
Likewise for the firms. I believe the ordering problem arises because these columns are currently of class "factor".
> head(mydf, 20)
Firm Question Value
1 1 Question1 3.6675482217047
2 1 Question2 3.74327628361858
3 1 Question3 <NA>
4 1 Question4 <NA>
5 1 Question5 <NA>
6 1 Question6 <NA>
7 1 Question7 0.352078239608802
8 1 Question8 3.04180471049169
9 1 Question9 3.9559090659924
10 1 Question10 <NA>
11 1 Question11 1
12 1 Question12 4.26591296778731
13 1 Question13 3.95256943635996
14 1 Question14 0.465686274509804
15 1 Question15 2.61764705882353
16 1 Question16 1.83333333333333
17 1 Question17 <NA>
18 1 Question18 0.225490196078431
19 2 Question1 3.85714285714286
20 2 Question2 4
> ggplot(mydf, aes(Question, Firm, fill=Value)) + geom_tile() + theme(axis.text.x = element_text(angle=330, hjust=0))
http://imgur.com/iM1aLXG Link to picture of my current plot.
The root of your problem appears to be that Value is a factor, rather than a numeric vector. I infer this from the head() output, where missing values are printed as <NA>: R prints them that way for factor (and character) columns, whereas a numeric column would show NA. The image you link to is ggplot's default behavior for coloring based on a factor; the default coloration for numeric is much closer to what you want.
You can check if this is indeed the case by using class(mydf$Value). If it is indeed a factor, convert it to numeric with the following:
mydf$Value <-as.numeric(as.character(mydf$Value))
Your plotting code as written will now return a graph which looks like this:
You can play around with the exact visualization using the gradient scale, or add a manual scale.
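As for your other question, reordering that factor is quite simple. The default levels are sorted alphabetically (Question1, Question10, ..., Question18, Question2, ..., Question9), so pick them out in the order you want. Adapted from R-bloggers:
mydf$Question <- factor(mydf$Question, levels(mydf$Question)[c(1, 11:18, 2:10)])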
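To make the "manual scale" option concrete, here is a minimal sketch of one way to get fixed breaks and colors: bin the numeric values with cut() and map the bins to colors with scale_fill_manual(). The small data frame is made up (values loosely based on the head() output), the column name ValueBin is invented, and the color choices are simply the ones listed in the question.

```r
library(ggplot2)

# Hypothetical miniature of `mydf`: 2 firms, 3 questions, one missing value
q_levels <- paste0("Question", c(1, 2, 10))
mydf <- data.frame(
  Firm     = factor(rep(1:2, each = 3)),
  Question = factor(rep(q_levels, 2), levels = q_levels),  # explicit level order
  Value    = c(3.67, NA, 0.35, 3.86, 4, 1.83)
)

# Bin the 0-5 scale into five intervals; NA stays NA
mydf$ValueBin <- cut(mydf$Value, breaks = 0:5, include.lowest = TRUE)

p <- ggplot(mydf, aes(Question, Firm, fill = ValueBin)) +
  geom_tile() +
  scale_fill_manual(
    values   = c("blue", "green", "red", "yellow", "purple"),  # one color per bin, in level order
    drop     = FALSE,     # keep empty bins in the legend
    na.value = "black"    # missing values drawn in black
  ) +
  theme(axis.text.x = element_text(angle = 330, hjust = 0))
```

For a continuous gradient instead, fill by the numeric Value and use something like scale_fill_gradient(limits = c(0, 5), na.value = "black"). Setting the factor levels explicitly with paste0("Question", 1:18) is also a way to fix the axis order without relying on index positions.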
