barplot discrete variables for 2 groups

I have a data frame with a column of 'Y' or 'N' values for 2 groups, e.g.:
drug<-c("Y","Y","N","Y","Y","Y","N","N","N","N","N","Y","Y","Y","N","N")
group<-c(0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1)
df<-data.frame(drug,group)
I want to make barplots of the 'Y'/'N' for both groups with the two groups beside each other.
I've tried various things with ggbarplot and get weird plots out
ggbarplot(my_matches, x = "group", y = "drug",
color = "group", palette = c("#00AFBB", "#FC4E07"))
and have tried making tables and plotting these as barplots like
counts0<-df[which(df$group==0),]
counts1<-df[which(df$group==1),]
grp0<-table(counts0$drug)
grp1<-table(counts1$drug)
s<- as.data.frame(t(rbind(grp0,grp1)))
barplot(s$grp0, s$grp1,beside=T)
As you can tell, I'm a beginner and have been driving myself mad trying to solve this. Please help!

First, there's no need to create separate vectors before building the data frame, and df is not a great variable name because R already has a function called df (the density of the F distribution). Create your data frame in one step like this:
mydata <- data.frame(drug = c("Y","Y","N","Y","Y","Y","N","N","N","N","N","Y","Y","Y","N","N"),
group = c(0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1))
Second: if you're working with data frames, it's worth learning dplyr. So install it, along with ggplot2, then load:
library(dplyr)
library(ggplot2)
Now we can count Y/N by group:
mydata %>%
count(group, drug)
# A tibble: 4 x 3
  group drug      n
  <dbl> <fct> <int>
1     0 N         3
2     0 Y         5
3     1 N         5
4     1 Y         3
And plot counts versus group. We need to convert the groups to factors, since group is a categorical variable:
mydata %>%
count(group, drug) %>%
mutate(group = factor(group)) %>%
ggplot(aes(group, n)) +
geom_col(aes(fill = drug))
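
By default geom_col stacks the Y and N counts inside each group's bar. If the goal is to have the bars side by side, one option is position = "dodge". The following is a minimal sketch that reuses the mydata frame from above; the name counts is introduced here only for illustration.

```r
library(dplyr)
library(ggplot2)

mydata <- data.frame(
  drug  = c("Y","Y","N","Y","Y","Y","N","N","N","N","N","Y","Y","Y","N","N"),
  group = c(0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1)
)

# One row per group/drug combination, with group treated as categorical
counts <- mydata %>%
  count(group, drug) %>%
  mutate(group = factor(group))

# position = "dodge" places the N and Y bars next to each other
# instead of stacking them
p <- ggplot(counts, aes(group, n, fill = drug)) +
  geom_col(position = "dodge")
print(p)
```

Swapping the aesthetics, aes(drug, n, fill = group), would instead put the two groups next to each other within each Y/N category.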

Related

How to reorder the x-axis when 3 different dataframes are used?

I want to plot 3 different lines, one from each of 3 data frames: agg1, agg2 and agg3. The Daysleft values in agg1 are ">60",">70",">80",">100". The Daysleft values in agg2 are ">80",">100". The only Daysleft value in agg3 is ">100". The x-axis of the graph I currently get is ordered ">100",">60",">70",">80". I have already converted the Daysleft column to a factor with the correct level order in all 3 data frames. However, after plotting the graph, the x-axis is jumbled up again. Thanks!
ggplot()+
geom_point(data=agg1,aes(x=Daysleft,y=AvgPrice,group=1),colour="red")+
geom_point(data=agg2,aes(x=Daysleft,y=AvgPrice,group=1),colour="blue")+
geom_point(data=agg3,aes(x=Daysleft,y=AvgPrice,group=1),colour="green")

If you provide some data, I can test this on my side. Here is my approach; please let me know if it does not work with your data.
library(dplyr)
library(ggplot2)
agg1$group <- 1
agg2$group <- 2
agg3$group <- 3
# Combine the three data frames so that a single factor controls the axis order
df <- rbind(agg1, agg2, agg3)
df %>%
  mutate(Daysleft = factor(Daysleft, levels = c(">60",">70",">80",">100")),
         group = factor(group)) %>%
  ggplot(aes(x = Daysleft, y = AvgPrice, color = group)) +
  geom_point()
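
If you would rather keep the three separate layers from the question, another option is to fix the order on the axis itself with scale_x_discrete(limits = ...). This is only a sketch: the agg1, agg2 and agg3 data frames below are made-up stand-ins (the prices are invented), and level_order is a name introduced here for illustration.

```r
library(ggplot2)

# Hypothetical stand-ins for the asker's data frames; AvgPrice values are made up
agg1 <- data.frame(Daysleft = c(">60", ">70", ">80", ">100"),
                   AvgPrice = c(120, 110, 100, 90))
agg2 <- data.frame(Daysleft = c(">80", ">100"), AvgPrice = c(105, 95))
agg3 <- data.frame(Daysleft = ">100", AvgPrice = 85)

# The order in which the categories should appear on the x-axis
level_order <- c(">60", ">70", ">80", ">100")

p <- ggplot() +
  geom_point(data = agg1, aes(x = Daysleft, y = AvgPrice), colour = "red") +
  geom_point(data = agg2, aes(x = Daysleft, y = AvgPrice), colour = "blue") +
  geom_point(data = agg3, aes(x = Daysleft, y = AvgPrice), colour = "green") +
  # limits fixes the order of the discrete axis regardless of the layers' data
  scale_x_discrete(limits = level_order)
print(p)
```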

How to plot top 5 most frequent variables by region in R

I want to plot the most commonly occurring FINAL_CALL_TYPE in my dataset by BOROUGH in NYC. The dataset has over 3 million observations. I took a sample of 2000 and then reduced it further to just the incident type and the borough it occurred in.
Essentially, I want to create a plot that shows the 5 most common call types in each borough, with the count of each call type in that borough.
Below is a brief look of how my data looks with just Call Type and Borough
> head(df)
FINAL_CALL_TYPE BOROUGH
1804978 INJURY BRONX
1613888 INJMAJ BROOKLYN
294874 INJURY BROOKLYN
1028374 DRUG BROOKLYN
1974030 INJURY MANHATTAN
795815 CVAC BRONX
This shows how many unique values there are
> str(df)
'data.frame': 2000 obs. of 2 variables:
$ FINAL_CALL_TYPE: Factor w/ 139 levels "ABDPFC","ABDPFT",..: 50 48 50 34 50 25 17 138 28 28 ...
$ BOROUGH : Factor w/ 5 levels "BRONX","BROOKLYN",..: 1 2 2 2 3 1 4 2 4 4 ...
This is the code that I have tried
> ggplot(df, aes(x=BOROUGH, y=FINAL_CALL_TYPE)) +
+ geom_bar(stat = 'identity') +
+ facet_grid(~BOROUGH)
and below is the result
I have tried a few suggestions from across this community, but I have not found any that show how to do this with 2 columns.
It would be much appreciated if someone knows a solution for this.
Thanks!

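If I understand correctly, you can use the tidyverse to do something like:
library(dplyr)
library(ggplot2)
# Count calls per borough and call type, then keep the 5 most frequent types in each borough
df <- df %>%
  group_by(BOROUGH, FINAL_CALL_TYPE) %>%
  summarise(count = n()) %>%
  top_n(n = 5, wt = count)
then plot
ggplot(df, aes(x = FINAL_CALL_TYPE, y = count)) +
  geom_col() +
  facet_wrap(~BOROUGH, scales = "free")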

creating the barplot
The first part of your problem is to create the barplot. With geom_bar you only need to supply the x variable, as the y-axis is the count of observations of that variable. You can then use the facet option to separate that count into different panels for another grouping variable.
library(ggplot2)
ggplot(data = diamonds, aes(x = color)) +
geom_bar() +
facet_grid(.~cut)
filtering to top 5 observations
The second part of your problem, limiting the data to only the top five in each group, is slightly more complex. An easy way to do this is to first tally the data, which creates a column n holding the count of observations. With the sort option the rows come out in descending order of n, so we can filter the data to the first five rows in each group. tally, like summarize, automatically drops the last grouping variable, so the result is still grouped by cut.
In the ggplot call I now use geom_col instead of geom_bar and I explicitly specify that the y-variable is n (n is created by tally).
geom_bar plots the count of observations per x-variable, geom_col plots a y-variable value for each value of the x-variable.
scales = "free_x" lets each cut panel drop x-axis values that do not occur in that panel.
library(tidyverse)
df <- diamonds %>%
group_by(cut, color) %>%
tally(sort = TRUE) %>%
filter(row_number() <= 5)
ggplot(data = df, aes(x = color, y = n)) +
geom_col() +
facet_grid(.~cut, scales = "free_x")
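
The same top-five filter can also be written with count and slice_max, which are available in dplyr 1.0.0 and later. This is a sketch of an alternative, not a correction of the code above; top5 is a name introduced here, and with_ties = FALSE is an assumption that exactly five rows per panel are wanted even when counts tie.

```r
library(dplyr)
library(ggplot2)  # also provides the diamonds example data

top5 <- diamonds %>%
  count(cut, color) %>%                       # one row per cut/color with count n
  group_by(cut) %>%
  slice_max(n, n = 5, with_ties = FALSE) %>%  # five largest counts within each cut
  ungroup()

ggplot(top5, aes(x = color, y = n)) +
  geom_col() +
  facet_grid(. ~ cut, scales = "free_x")
```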

Stacked bar chart with multiple categorical variables in ggplot2 with facet_grid

I am trying to create a stacked bar chart in ggplot2 to display the percentage of values corresponding to each categorical variable. Here's an example of the data that I am trying to work with.
sampledf <- data.frame("Death" = rep(0:1, each = 5),
"HabitA" = rep(0:1, c(3, 7)),
"HabitB" = rep(1:2, c(4, 6)),
"HabitC" = rep(0:1, c(6, 4)))
The habit columns are the ones I want to show in the stacked bar chart, and I want to use the Death column in facet_grid. I'm looking to show the percentage of values for each habit in the bar chart.
The data I think I need for the chart should say, for example, that under Death = 0, 60% of the HabitA values are 0 and 40% are 1, while under Death = 1, 100% of the HabitA values are 1.
I have produced charts like this using ggplot and group_by, summarise for only one attribute, but I am not sure how this works with multiple categorical attributes in the data.
sampledf %>%
group_by(Death, HabitA) %>%
summarise(count=n()) %>%
mutate(perc=count/sum(count))
This produces what I want for just one variable, but when I include another attribute in the group_by call, it returns counts and percentages for each combination of all 3 attributes, which is not what I am looking for. I tried using summarise_at/mutate_at, but that doesn't seem to work.
sampledf %>%
group_by(Death) %>%
mutate_at(c("HabitA", "HabitB"), Counts = n())
Is there a straightforward way to do this in R, and use the resulting data as input for ggplot2?
Edit:
I tried reshaping the data and using the long form to build my plot. Here's what I have.
long <- melt(sampledf, id.vars = c("Death"))
The resulting data is in this format.
  Death variable value
1     0   HabitA     0
2     0   HabitA     0
3     0   HabitA     0
4     0   HabitA     1
5     0   HabitA     1
6     1   HabitA     1
7     1   HabitA     1
I'm not sure how to use the value attribute to build the plot, because the ggplot I am currently trying to build is counting the total number of times each level occurs in the variable column.
ggplot(long, aes(x = variable, fill = variable)) +
geom_bar(stat = "count", position = "dodge") + facet_grid(~ Death)

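Try this; it may not be the most straightforward approach, but it works. It reshapes the data with gather, as @aosmith suggested, counts the observations for each combination of Death, habit and value (in the code below the column named habitat holds the habit name and the column named count holds the habit's value), and then divides by the number of rows in each Death + habitat group to get a percentage. Finally it summarizes to keep one row per combination.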
sampledf_edited <- sampledf %>%
tidyr::gather("habitat", "count", 2:4) %>%
group_by(Death, habitat, count) %>%
mutate(observation = n()) %>%
ungroup() %>%
group_by(Death, habitat) %>%
mutate(percent = observation/n()) %>%
ungroup() %>%
group_by(Death, habitat, count, percent) %>%
summarize()
It is necessary to convert count to a factor so that it can be used as a discrete fill.
sampledf_edited$count <- as.factor(sampledf_edited$count)
Plotting by ggplot.
ggplot(sampledf_edited, aes(habitat, percent, fill = count)) +
geom_bar(stat = "identity") +
facet_grid(~ Death)
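
For comparison, here is a shorter sketch of the same idea using pivot_longer and count, assuming a version of tidyr that provides pivot_longer (1.0.0 or later). The names habit_pct, habit and value are introduced here for illustration and are not part of the answer above.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

sampledf <- data.frame(Death  = rep(0:1, each = 5),
                       HabitA = rep(0:1, c(3, 7)),
                       HabitB = rep(1:2, c(4, 6)),
                       HabitC = rep(0:1, c(6, 4)))

habit_pct <- sampledf %>%
  # Long form: one row per Death / habit / value observation
  pivot_longer(starts_with("Habit"), names_to = "habit", values_to = "value") %>%
  count(Death, habit, value) %>%       # observations per combination
  group_by(Death, habit) %>%
  mutate(percent = n / sum(n)) %>%     # share within each Death + habit
  ungroup()

ggplot(habit_pct, aes(habit, percent, fill = factor(value))) +
  geom_col() +
  facet_grid(~ Death) +
  labs(fill = "value")
```

For Death = 0 and HabitA this gives 0.6 for value 0 and 0.4 for value 1, which matches the percentages described in the question.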

Creating a histogram in R that shows the difference in the number of errors made by three groups

I have to create a histogram in RStudio showing the number of errors for three (3) data groups: GroupA, GroupB and GroupC. Each of them has 4 variables, one of which is the "errors" variable, so it's like GroupA$errors etc.
How can I combine these 3 groups and make a plot that shows 3 bars on the x axis (one per group) and the number of errors on the y axis?
dput: http://pastebin.com/vGEPDNFf

With your data:
myData <- data.frame(case = 1:48,
group = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3),
age = c(70,68,61,68,77,72,64,65,69,67,71,75,73,68,65,69,63,70,78,73,76,78,65,68,75,65,62,69,70,71,60,69,60,66,75,70,62,63,79,79,66,76,64,61,70,67,69,63),
errors = c(9,6,7,8,10,11,4,5,5,6,12,8,9,3,7,6,8,6,12,7,13,10,8,8,11,5,9,6,9,6,9,7,5,3,6,6,7,5,9,8,6,6,3,4,7,5,4,5))
Here is the code you have to run in R:
library(ggplot2)
library(dplyr)
myData %>%
group_by(group) %>%
summarize(total.errors=sum(errors)) %>%
ggplot(aes(x=factor(group), y=total.errors)) + geom_bar(stat = "identity")
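
If you prefer to stay in base R, the same per-group totals can be sketched with tapply and barplot. The small errors_demo data frame below is a hypothetical stand-in that uses only the first two rows of each group, so the totals differ from the full data; both errors_demo and totals are names introduced here.

```r
# Hypothetical stand-in: first two error counts of each group from myData
errors_demo <- data.frame(group  = c(1, 1, 2, 2, 3, 3),
                          errors = c(9, 6, 8, 6, 5, 3))

# Sum the errors within each group; the result is a named vector
totals <- tapply(errors_demo$errors, errors_demo$group, sum)

# One bar per group, bar height = total number of errors
barplot(totals, xlab = "group", ylab = "total errors")
```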

Categorizing samples in R, and plotting them in different colors

I am new to R. I want to know how I can assign a categorical value to observations that I have read in as a data frame. For example, I have data for m variables from n samples, and I want to assign some samples to group 1, some samples to group 2, and so on. Also, how can I show the different groups in different colors when I plot them?

Let's say you have the following data:
spam = data.frame(value = runif(100))
you can assign random group membership like this:
spam[["group"]] = sample(c("group1", "group2"), nrow(spam), replace = TRUE)
> head(spam)
      value  group
1 0.1385715 group1
2 0.1785452 group1
3 0.7407510 group2
4 0.5867080 group1
5 0.1514461 group1
6 0.3009905 group1
Plotting the groups with different colors can easily be done using ggplot2:
require(ggplot2)
ggplot(aes(x = 1:nrow(spam), y = value, color = group), data = spam) +
geom_point()
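
In practice the groups usually come from a rule rather than from random assignment. The sketch below shows two such rules in base R under made-up assumptions: the first five rows form group1 and the rest group2, and a second label is derived from the measured value with cut. The names samples and level, and the 0.5 threshold, are hypothetical.

```r
set.seed(1)  # only to make the example reproducible
samples <- data.frame(value = runif(10))

# Rule 1 (assumed): assign groups by row position
samples$group <- factor(ifelse(seq_len(nrow(samples)) <= 5, "group1", "group2"))

# Rule 2 (assumed): assign a label from the value itself
samples$level <- cut(samples$value, breaks = c(0, 0.5, 1),
                     labels = c("low", "high"), include.lowest = TRUE)

# A factor passed to col is mapped to colours through its integer codes
plot(samples$value, col = samples$group, pch = 19,
     xlab = "sample", ylab = "value")
legend("topright", legend = levels(samples$group),
       col = seq_along(levels(samples$group)), pch = 19)
```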
