How to melt R data.frame and plot group by bar plot

I have the following R data.frame:
group match unmatch unmatch_active match_active
1 A 10 4 0 0
2 B 116 20 0 3
3 c 160 27 1 4
4 D 79 17 0 3
5 E 309 84 4 14
6 F 643 244 10 23
...
My goal is to plot a grouped bar plot, as shown in the section "Graphs with more variables" of http://www.cookbook-r.com/Graphs/Bar_and_line_graphs_(ggplot2)/.
I realize that before getting to that I need to get the data into the following format:
group variable value
1 A match 10
2 B match 116
3 C match 160
4 D match 79
5 E match 309
6 F match 643
7 A unmatch 4
8 B unmatch 20
...
I used the melt function:
groups.df.melt <- melt(groups.df[,c('group','match','unmatch', 'unmatch_active', 'match_active')],id.vars = 1)
I don't think I am doing the melt correctly, because after I execute the line above, groups.df.melt has 1000+ rows, which doesn't make sense to me.
I looked at "Draw histograms per row over multiple columns in R" and tried to follow it, yet I don't get the graph I want.
In addition, I get the following error when I try to do the plotting:
ggplot(groups.df.melt, aes(x='group', y=value)) + geom_bar(aes(fill = variable), position="dodge") + scale_y_log10()
Mapping a variable to y and also using stat="bin".
With stat="bin", it will attempt to set the y value to the count of cases in each group.
This can result in unexpected behavior and will not be allowed in a future version of ggplot2.
If you want y to represent counts of cases, use stat="bin" and don't map a variable to y.
If you want y to represent values in the data, use stat="identity".
See ?geom_bar for examples. (Deprecated; last used in version 0.9.2)
Error in pmin(y, 0) : object 'y' not found

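Try (ddf is the data frame from the question; melt() comes from the reshape2 package):
library(reshape2)
library(ggplot2)
mm <- melt(ddf, id.vars = 'group')
ggplot(data = mm, aes(x = group, y = value, fill = variable)) +
  geom_bar(stat = 'identity', position = 'dodge')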
or
ggplot(data = mm, aes(x = group, y = value, fill = variable)) +
# `geom_col()` uses `stat_identity()`: it leaves the data as is.
geom_col(position = 'dodge')
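For reference, here is a minimal self-contained sketch of the same approach using only the six rows shown in the question. The data frame is rebuilt by hand, so treat it as an assumption about the real data (group 3 is written as "C" here). Note also that the question's call used aes(x='group'): quoting the name maps x to the constant string "group" rather than to the column, so the name has to be unquoted.

```r
library(reshape2)
library(ggplot2)

# Six sample rows copied by hand from the question (an assumption about the real data).
groups.df <- data.frame(
  group          = c("A", "B", "C", "D", "E", "F"),
  match          = c(10, 116, 160, 79, 309, 643),
  unmatch        = c(4, 20, 27, 17, 84, 244),
  unmatch_active = c(0, 0, 1, 0, 4, 10),
  match_active   = c(0, 3, 4, 3, 14, 23)
)

# Wide to long: one row per (group, variable) pair.
mm <- melt(groups.df, id.vars = "group")
nrow(mm)  # 6 groups x 4 measure columns = 24 rows

# Unquoted column names inside aes(); geom_col() plots the values as they are.
p <- ggplot(mm, aes(x = group, y = value, fill = variable)) +
  geom_col(position = "dodge")
```

The melted frame always has (number of input rows) x (number of measure columns) rows, which is a quick way to sanity-check the result of melt().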

Related

Trying to make a bar chart with each categorical column as a different color

I found a cool Wes Anderson palette package but I am failing here in actually using it. The variable I am looking at (Q1) has options 1 and 2. There is an NA in the set which is getting plotted however I would like to remove it as well.
library(readxl)
library(tidyverse)
library(wesanderson)
RA_Survey <- read_excel("file extension")
ggplot(data = RA_Survey, mapping = aes(x = Q1)) +
geom_bar() + scale_fill_manual(values=wes_palette(n=2, name="GrandBudapest"))
The plot I'm getting is working but without the color. Any ideas?
There are several issues which need to be addressed.
Using the Wes Anderson palette
As already mentioned by Mako, the fill aesthetic was missing from the call to aes().
Furthermore, the OP reports an error message saying Palette not found. The wesanderson package contains a list of available palettes:
names(wesanderson::wes_palettes)
[1] "BottleRocket1" "BottleRocket2" "Rushmore1" "Rushmore" "Royal1" "Royal2" "Zissou1"
[8] "Darjeeling1" "Darjeeling2" "Chevalier1" "FantasticFox1" "Moonrise1" "Moonrise2" "Moonrise3"
[15] "Cavalcanti1" "GrandBudapest1" "GrandBudapest2" "IsleofDogs1" "IsleofDogs2"
There is no palette called "GrandBudapest" as requested in OP's code. Instead, we have to choose between "GrandBudapest1" and "GrandBudapest2".
Also, the help file help("wes_palette") lists the available palettes.
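A quick way to check a palette name before plotting is to test it against that list. This is only a small sketch and assumes a wesanderson version that contains the palettes listed above:

```r
library(wesanderson)

# The name used in the question is not in the list of palettes ...
"GrandBudapest" %in% names(wes_palettes)
# ... but the two numbered variants are.
c("GrandBudapest1", "GrandBudapest2") %in% names(wes_palettes)

# Two colours from the first variant, usable in scale_fill_manual(values = pal).
pal <- wes_palette(n = 2, name = "GrandBudapest1")
```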
Here is a working example which uses the dummy data created in the Data section below:
library(ggplot2)
library(wesanderson)
ggplot(RA_Survey, aes(x = Q1, fill = Q1)) +
geom_bar() +
scale_fill_manual(values=wes_palette(n=2, name="GrandBudapest1"))
Removing NA
The OP has asked to remove the NAs from the set. There are two options:
Tell ggplot() to remove the NAs.
Remove the NAs from the data by filtering.
We can tell ggplot() to remove NAs when plotting the x axis:
library(ggplot2)
library(wesanderson)
ggplot(RA_Survey, aes(x = Q1, fill = Q1)) +
geom_bar() +
scale_fill_manual(values=wes_palette(n=2, name="GrandBudapest1")) +
scale_x_discrete(na.translate = FALSE)
Note, this produces a warning message Removed 3 rows containing non-finite values (stat_count). To get rid of the message, we can use geom_bar(na.rm = TRUE).
The other option removes the NAs from the data by filtering:
library(dplyr)
library(ggplot2)
library(wesanderson)
ggplot(RA_Survey %>% filter(!is.na(Q1)), aes(x = Q1, fill = Q1)) +
geom_bar() +
scale_fill_manual(values=wes_palette(n=2, name="GrandBudapest1"))
which creates exactly the same chart.
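To see what the filter does without drawing anything, here is a small sketch. The Q1 values are typed in by hand from the tibble printed in the Data section, so the example does not depend on the random number generator:

```r
library(dplyr)

# Q1 values copied by hand from the printed dummy data (20 rows, 3 of them NA).
RA_Survey <- tibble(Q1 = c("2", "1", "2", "1", NA, "2", "2", "1", "2", "2",
                           NA, "2", "1", "2", "2", "1", "2", "2", "2", NA))

sum(is.na(RA_Survey$Q1))   # rows that will be dropped: 3

clean <- RA_Survey %>% filter(!is.na(Q1))
nrow(clean)                # 17 rows remain
table(clean$Q1)            # counts per answer option
```

The three dropped rows correspond to the "Removed 3 rows" warning mentioned above.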
Data
As the OP has not provided a sample dataset, we need to create our own:
library(dplyr)
set.seed(123L)
RA_Survey <- tibble(Q1 = sample(c("1", "2", NA), 20, TRUE, c(3, 6, 1)))
RA_Survey
# A tibble: 20 x 1
Q1
<chr>
1 2
2 1
3 2
4 1
5 NA
6 2
7 2
8 1
9 2
10 2
11 NA
12 2
13 1
14 2
15 2
16 1
17 2
18 2
19 2
20 NA

ggplot2 geom_bar position failure

I am using the ..count.. transformation in geom_bar and get the warning
position_stack requires non-overlapping x intervals when some of my categories have few counts.
This is best explained using some mock data (my data involves direction and windspeed and I retain names relating to that)
#make data
set.seed(12345)
FF=rweibull(100,1.7,1)*20 #mock speeds
FF[FF>60]=59
dir=sample.int(10,size=100,replace=TRUE) # mock directions
#group into speed classes
FFcut=cut(FF,breaks=seq(0,60,by=20),ordered_result=TRUE,right=FALSE,drop=FALSE)
# stuff into data frame & plot
df=data.frame(dir=dir,grp=FFcut)
ggplot(data=df,aes(x=dir,y=(..count..)/sum(..count..),fill=grp)) + geom_bar()
This works fine, and the resulting plot shows the frequency of directions grouped according to speed. It is of relevance that the velocity class with the fewest counts (here "[40,60)") will have 5 counts.
However, using more velocity classes leads to a warning. For instance, with
FFcut=cut(FF,breaks=seq(0,60,by=15),ordered_result=TRUE,right=FALSE,drop=FALSE)
the velocity class with the fewest counts (now "[45,60)") will have only 3 counts and ggplot2 will warn that
position_stack requires non-overlapping x intervals
and the plot will show data in this category spread out along the x axis.
It seems that 5 is the minimum size for a group to have for this to work correctly.
I would appreciate knowing if this is a feature or a bug in stat_bin (which geom_bar is using) or if I am simply abusing geom_bar.
Also, any suggestions how to get around this would be appreciated.
This occurs because df$dir is numeric, so ggplot assumes a continuous x-axis, and the group aesthetic is derived from the only discrete variable in the plot (fill = grp).
As a result, when there simply aren't that many dir values in grp = [45,60), ggplot gets confused over how wide each bar should be. This becomes more visually obvious if we split the plot into different facets:
ggplot(data=df,
aes(x=dir,y=(..count..)/sum(..count..),
fill = grp)) +
geom_bar() +
facet_wrap(~ grp)
> for(l in levels(df$grp)) print(sort(unique(df$dir[df$grp == l])))
[1] 1 2 3 4 6 7 8 9 10
[1] 1 2 3 4 5 6 7 8 9 10
[1] 2 3 4 5 7 9 10
[1] 2 4 7
We can also check manually that the minimum difference between sorted df$dir values is 1 for the first three grp values, but 2 for the last one. The default bar width for that last group is therefore wider than for the others.
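The manual check can be sketched in base R. The dir values per speed class are copied by hand from the output printed above (the list names are the cut() labels and are only there for readability). As far as I know, geom_bar() sets its default width to 90% of the resolution of the data, which is why the smallest gap matters:

```r
# Unique dir values per speed class, copied from the printed output above.
dir_by_grp <- list(
  "[0,15)"  = c(1, 2, 3, 4, 6, 7, 8, 9, 10),
  "[15,30)" = as.numeric(1:10),
  "[30,45)" = c(2, 3, 4, 5, 7, 9, 10),
  "[45,60)" = c(2, 4, 7)
)

# Smallest gap between neighbouring x values within each class.
min_gap <- sapply(dir_by_grp, function(x) min(diff(sort(unique(x)))))
min_gap   # 1 for the first three classes, 2 for the last one
```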
The following solutions should all achieve the same result:
1. Explicitly specify the same bar width for all groups in geom_bar():
ggplot(data=df,
aes(x=dir,y=(..count..)/sum(..count..),
fill = grp)) +
geom_bar(width = 0.9)
2. Convert dir to a categorical variable before passing it to aes(x = ...):
ggplot(data=df,
aes(x=factor(dir), y=(..count..)/sum(..count..),
fill = grp)) +
geom_bar()
3. Specify that the group parameter should be based on both df$dir & df$grp:
ggplot(data=df,
aes(x=dir,
y=(..count..)/sum(..count..),
group = interaction(dir, grp),
fill = grp)) +
geom_bar()
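A side note that is not part of the original answer: on newer ggplot2 releases the ..count.. notation is, as far as I know, deprecated in favour of after_stat(). Here is a sketch of option 2 in that syntax, assuming ggplot2 3.3.0 or later and rebuilding the mock data from the question (without the drop argument, which cut() does not use):

```r
library(ggplot2)

set.seed(12345)
FF <- rweibull(100, 1.7, 1) * 20                    # mock speeds
FF[FF > 60] <- 59
dir <- sample.int(10, size = 100, replace = TRUE)   # mock directions
grp <- cut(FF, breaks = seq(0, 60, by = 15), ordered_result = TRUE, right = FALSE)
df <- data.frame(dir = dir, grp = grp)

# Discrete x axis via factor(dir); after_stat() replaces the ..count.. notation.
p <- ggplot(df, aes(x = factor(dir),
                    y = after_stat(count / sum(count)),
                    fill = grp)) +
  geom_bar()
```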
This doesn't directly solve the issue, because I also don't get what's going on with the overlapping values, but it's a dplyr-powered workaround, and might turn out to be more flexible anyway.
Instead of relying on geom_bar to take the cut factor and give you shares via ..count../sum(..count..), you can easily enough just calculate those shares yourself up front, and then plot your bars. I personally like having this type of control over my data and exactly what I'm plotting.
First, I put dir and FF into a data frame/tbl_df and cut FF. Then count() groups the data by dir and grp and counts the observations for each combination of those two variables, after which I calculate the share of each n over the sum of n. I'm using geom_col, which is like geom_bar but for when you already have a y value in your aes.
library(tidyverse)
set.seed(12345)
FF <- rweibull(100,1.7,1) * 20 #mock speeds
FF[FF > 60] <- 59
dir <- sample.int(10, size = 100, replace = TRUE) # mock directions
shares <- tibble(dir = dir, FF = FF) %>%
  mutate(grp = cut(FF, breaks = seq(0, 60, by = 15), ordered_result = TRUE, right = FALSE, drop = FALSE)) %>%
  count(dir, grp) %>%
  mutate(share = n / sum(n))
shares
#> # A tibble: 29 x 4
#> dir grp n share
#> <int> <ord> <int> <dbl>
#> 1 1 [0,15) 3 0.03
#> 2 1 [15,30) 2 0.02
#> 3 2 [0,15) 4 0.04
#> 4 2 [15,30) 3 0.03
#> 5 2 [30,45) 1 0.01
#> 6 2 [45,60) 1 0.01
#> 7 3 [0,15) 6 0.06
#> 8 3 [15,30) 1 0.01
#> 9 3 [30,45) 2 0.02
#> 10 4 [0,15) 6 0.06
#> # ... with 19 more rows
ggplot(shares, aes(x = dir, y = share, fill = grp)) +
geom_col()

R Plot Bar graph transposed dataframe

I'm trying to plot the following data frame as a bar plot, where the values for the filteredprovince column are listed in a separate column (n).
Usually, ggplot and the other plotting functions work on a "horizontal" data frame, and after several searches I have not been able to find a way to plot this "transposed" version of the data frame.
The cluster should group the bars, and within each cluster I would like to plot each filteredprovince based on the value of the n column.
Thank you for the support.
d <- read.table(text=
" cluster PROVINCIA n filteredprovince
1 1 08 765 08
2 1 28 665 28
3 1 41 440 41
4 1 11 437 11
5 1 46 276 46
6 1 18 229 18
7 1 35 181 other
8 1 29 170 other
9 1 33 165 other
10 1 38 153 other ", header=TRUE,stringsAsFactors = FALSE)
UPDATE
Thanks to the suggestion in the comments I almost achieved the desired format:
ggplot(tab_s, aes(x = cluster, y = n, fill = factor(filteredprovince))) + geom_col()
Is there any way to show percentages instead of frequencies on the y axis?
If I understand correctly, you're trying to use geom_bar(), which gives you problems because it wants to build a sort of histogram by counting rows, but you have already done this kind of summary.
(If you had provided the code you have tried so far, I would not have to guess.)
In that case you can use geom_col() instead.
ggplot(d, aes(x = filteredprovince, y = n, fill = factor(PROVINCIA))) + geom_col()
Alternatively, you can change the default stat of geom_bar() from "count" to "identity"
ggplot(d, aes(x = filteredprovince, y = n, fill = factor(PROVINCIA))) +
geom_bar(stat = "identity")
See this SO question for what a stat is
EDIT: Update in response to OP's update:
To display percentages, you will have to modify the data itself.
Just divide n by the sum of all n and multiply by 100.
d$percentage <- d$n / sum(d$n) * 100
ggplot(d, aes(x = cluster, y = percentage, fill = factor(filteredprovince))) + geom_col()
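As a self-contained sketch of the percentage step, here are the ten rows from the question entered by hand (the province codes are kept as character strings here, which is an assumption; read.table() would parse PROVINCIA as numbers):

```r
library(ggplot2)

# The ten rows from the question, entered by hand.
d <- data.frame(
  cluster          = 1,
  PROVINCIA        = c("08", "28", "41", "11", "46", "18", "35", "29", "33", "38"),
  n                = c(765, 665, 440, 437, 276, 229, 181, 170, 165, 153),
  filteredprovince = c("08", "28", "41", "11", "46", "18", rep("other", 4))
)

# Share of each row in the overall total, in percent.
d$percentage <- d$n / sum(d$n) * 100
sum(d$percentage)   # the shares add up to 100

p <- ggplot(d, aes(x = factor(cluster), y = percentage, fill = filteredprovince)) +
  geom_col() +
  labs(x = "cluster", y = "share of n (%)")
```

With several clusters you may instead want to divide by the per-cluster total, for example ave(d$n, d$cluster, FUN = sum); which denominator is right depends on what the plot should show.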
I'm not sure I perfectly understand, but if the problem is the orientation of your dataframe, you can transpose it with t(data) where data is your dataframe.

How to incorporate "significance" of a Tukey'sHSD directly into graphs of ggplot2 in R?

I have the following data(dat)
V W X Y Z
1 8 89 3 900
1 8 100 2 800
0 9 333 4 980
0 9 560 1 999
I wish to perform TukeysHSD pairwise test to the above data set. From the results of the test, I want to incorporate the significant comparisons in the graph (showing the "*" or "**" sign between the groups that are significant).
This is the code attempted:
library(ggplot2)
library(tidyr) # gather() comes from tidyr, not reshape2
dat1 <- gather(dat)
ggplot(data = dat1, aes(x = key, y = value)) + stat_summary(fun.data = "mean_cl_normal", colour = "red", size = 1)
pairwise.t.test(dat1$value, dat1$key, p.adj = "holm")
I do not know if (I can) and how to incorporate the results ("significance") of the test directly into the graph without saving each of the results as an external array and then calling it into ggplot2.

Modifying y-axis with ggplot2

I'm trying to plot the number of observations for each instance of a word, both of which are stored in a data frame.
I can generate the plot with ggplot2, but the y-axis displays "1e+05", "2e+05", etc. instead of plain numerical values.
How can I modify this code so that the y-axis displays numbers instead?
Here is my code:
> w
p.word p.freq
1 the 294571
2 and 158624
3 you 84152
4 for 77117
5 that 71672
6 with 47987
7 this 42768
8 was 41088
9 have 39835
10 are 36458
11 but 33899
12 not 30370
13 all 27079
14 your 26923
15 just 25507
16 from 24497
17 out 22578
18 like 22501
19 what 22150
20 will 21530
21 they 21435
22 about 21184
23 one 20877
24 its 20109
ggplot(w, aes(x = p.word, y = p.freq))+ geom_bar(stat = "identity")
Here is the plot that is generated:
"1e+05" etc are numerical values (scientific notation).
If you want the long notation (e.g. "100,000") use library(scales) and the comma formatter:
library(scales)
ggplot(w, aes(x = p.word, y = p.freq))+ geom_bar(stat = "identity") +
scale_y_continuous(labels=comma)
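To see what the formatter does on its own, outside of a plot, here is a small sketch using two of the values from the question's scale:

```r
library(scales)

# Base R's default formatting switches to scientific notation for this value.
format(1e5)

# comma() inserts thousands separators and keeps the fixed notation.
comma(c(1e5, 294571))   # "100,000" "294,571"
```

In recent versions of scales, labels = label_comma() should be an equivalent way to pass the formatter to scale_y_continuous().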
