I can't fill a histogram in R using ggplot - r

I have a dataframe called "employee_attrition". There are two variables of my interest, the first one is called "MonthlyIncome" (with continuous data of salary) and the second one is "PerformanceRating" which takes discrete values (1,2,3 or 4). My intention is to create a histogram for the MonthlyIncome, and show the PerformanceRating in the same plot. I have this:
ggplot(data = employee_attrition, aes(x=MonthlyIncome, fill=PerformanceRating))+
geom_histogram(aes(y=..count..))+
xlab("Salario mensual (MonthlyIncome)")+
ylab("Frecuencia")+
ggtitle("Histograma: MonthlyIncome y Attrition")+
theme_minimal()
The problem is that the plot does not show the "PerformanceRating" associated with each bar of the histogram.
My data frame is something like this:
MonthlyIncome PerformanceRating
1 5993 1
2 5130 1
3 2090 4
4 2909 3
5 3468 4
6 3068 3
I want a histogram that shows the frequency of MonthlyIncome, with each bar split into 4 colours, one per PerformanceRating value.
Something like this, but with 4 colours (PerformanceRating Values)

For the fill aesthetic to work, you should first convert the grouping variable to a factor.
library(tibble)
library(tidyverse)
##---------------------------------------------------
##Creating a sample dataset simulating your dataset
##---------------------------------------------------
employee_attrition <- tibble(
MonthlyIncome = sample(3000:5993, 1000, replace = FALSE),
PerformanceRating = sample(1:4, 1000, replace = TRUE)
)
##------------------------------------
## Plot - also changing the format of
## PerformanceRating to "factor"
##-----------------------------------
employee_attrition %>%
mutate(PerformanceRating = as.factor(PerformanceRating)) %>%
ggplot(aes(x=MonthlyIncome, fill=PerformanceRating))+
geom_histogram(aes(y = after_stat(count)), bins = 20) + # after_stat() replaces the deprecated ..count.. notation
xlab("Salario mensual (MonthlyIncome)")+
ylab("Frecuencia")+
ggtitle("Histograma: MonthlyIncome y Attrition")+
theme_minimal()
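As a variation on the answer above, the conversion can also be done inline inside aes(), which leaves the data frame itself unchanged. This is only a minimal sketch on simulated data like the answer's; the seed value and the object names p and n_fill are assumptions made for illustration, and layer_data() is used solely to check how many fill groups ggplot2 created.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed, only so the simulated data is reproducible

# Simulated stand-in for the real employee_attrition data
employee_attrition <- data.frame(
  MonthlyIncome = sample(3000:5993, 1000),
  PerformanceRating = sample(1:4, 1000, replace = TRUE)
)

# factor() inside aes() makes the fill discrete without modifying the data
p <- ggplot(employee_attrition,
            aes(x = MonthlyIncome, fill = factor(PerformanceRating))) +
  geom_histogram(bins = 20) +
  labs(x = "Salario mensual (MonthlyIncome)",
       y = "Frecuencia",
       fill = "PerformanceRating") +
  theme_minimal()

# Number of distinct fill colours in the histogram layer
n_fill <- length(unique(layer_data(p)$fill))

p
```

With a numeric PerformanceRating there is no discrete grouping, so the bars are not split; with a factor, each rating gets its own stacked segment.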

Related

Creating a boxplot from two dataframes

I have two separate data frames - each representing a feature (activity, and sleep) and the amount of days that each of these features were recorded by each id number. The amount of days need to reflect on the y-axis and the feature itself needs to reflect on the x-axis.
I managed to draw the boxplots separately, showing the outliers clearly esp for the one set, however if I want to place the two boxplots next to each other, the outliers do not show up clearly. Also, how do I get the names of the two features (activity and sleep) on my x-axis?
The dataframe for the "sleep" feature:
head(idday)
A tibble: 6 x 2
id days
<dbl> <int>
1 1503960366 25
2 1644430081 4
3 1844505072 3
4 1927972279 5
5 2026352035 28
6 2320127002 1
The dataframe for the "activity" feature:
head(iddaya)
A tibble: 6 x 2
id days
<dbl> <int>
1 1503960366 31
2 1624580081 31
3 1644430081 30
4 1844505072 31
5 1927972279 31
6 2022484408 31
My attempt for sleep:
ggplot(idday, aes(y = days), boxwex = 0.05) +
stat_boxplot(geom = "errorbar",
width = 0.2) +
geom_boxplot(alpha=0.9, outlier.color="red")
and for activity:
ggplot(iddaya, aes(y = days), boxwex = 0.05) +
stat_boxplot(geom = "errorbar",
width = 0.2) +
geom_boxplot(alpha=0.9, outlier.color="red")
I then combined them:
boxplot(summary(idday$days), summary(iddaya$days))
In this final image the outliers do not show clearly, and I want to name my x-axis and y-axis.
There are several ways to achieve your task. One way could be:
If your dataframes are called df_sleep and df_activity, then we could combine them in a named list, add a new column feature, and then plot:
df_sleep
df_activity
library(tidyverse)
bind_rows(list(sleep = df_sleep, activity = df_activity), .id = 'feature') %>%
ggplot(aes(x = feature, y=days, fill=feature))+
geom_boxplot()
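To make the bind_rows() approach concrete, here is a small sketch that rebuilds the six rows shown in the question for each feature. The axis labels passed to labs() are assumptions chosen for illustration; the list names become the values of the new feature column and therefore the x-axis labels.

```r
library(dplyr)
library(ggplot2)

# First six rows of each data frame, copied from the question
df_sleep <- tibble(
  id = c(1503960366, 1644430081, 1844505072, 1927972279, 2026352035, 2320127002),
  days = c(25L, 4L, 3L, 5L, 28L, 1L)
)
df_activity <- tibble(
  id = c(1503960366, 1624580081, 1644430081, 1844505072, 1927972279, 2022484408),
  days = c(31L, 31L, 30L, 31L, 31L, 31L)
)

# Stack both data frames; .id stores the list names in a new column
combined <- bind_rows(list(sleep = df_sleep, activity = df_activity),
                      .id = "feature")

ggplot(combined, aes(x = feature, y = days, fill = feature)) +
  geom_boxplot(outlier.colour = "red") +
  labs(x = "Feature", y = "Days recorded")  # hypothetical axis titles
```

Because the rows are stacked rather than joined, every id from either data frame is kept.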
If you want to compare these two boxplots with each other, I recommend using the same range for the y-axis. To achieve this, you first have to combine both data frames, for example with inner_join() from the dplyr package.
data_combined <- inner_join(idday, iddaya,
by = "id",
suffix = c("_sleep", "_activity"))
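One thing to keep in mind is that inner_join() keeps only the ids that occur in both data frames. The sketch below uses just the six rows per data frame shown in the question (so the counts apply to that excerpt only), and uses anti_join() to list the ids that would be dropped; the names both, sleep_only and activity_only are made up for this illustration.

```r
library(dplyr)

# First six rows of each data frame, copied from the question
idday <- tibble(
  id = c(1503960366, 1644430081, 1844505072, 1927972279, 2026352035, 2320127002),
  days = c(25L, 4L, 3L, 5L, 28L, 1L)
)
iddaya <- tibble(
  id = c(1503960366, 1624580081, 1644430081, 1844505072, 1927972279, 2022484408),
  days = c(31L, 31L, 30L, 31L, 31L, 31L)
)

# Rows whose id appears in both data frames
both <- inner_join(idday, iddaya, by = "id",
                   suffix = c("_sleep", "_activity"))

# Rows that the inner join discards
sleep_only <- anti_join(idday, iddaya, by = "id")
activity_only <- anti_join(iddaya, idday, by = "id")

nrow(both)           # ids present in both excerpts
nrow(sleep_only)     # ids recorded for sleep only
nrow(activity_only)  # ids recorded for activity only
```

If the unmatched ids should stay in the plot, stacking the data frames with bind_rows() or using full_join() are possible alternatives.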
Then you need to transform your data frame into long-format with pivot_longer() from the tidyr package:
data_combined_long <- data_combined %>%
pivot_longer(days_sleep:days_activity,
names_to = "features",
names_prefix = "days_",
values_to = "days")
After that you can again use ggplot() to create your boxplot. But now you have to define that you want your x-axis to represent your features:
ggplot(data_combined_long, aes(y = days, x = features)) + # boxwex is a base-graphics argument and has no effect in ggplot()
stat_boxplot(geom = "errorbar",
width = 0.5) +
geom_boxplot(alpha=0.9, outlier.color="red")
Your plot should then look like this:

Grouped barchart in r with 4 variables

I'm a beginner in R and I've been trying to find out how to plot this graphic.
I have 4 variables (% of gravel, % of sand, % of silt in five places). I'm trying to plot the percentages of these 3 types of sediment (y) in each station (x). So it's five groups in x axis and 3 bars per group.
Station % gravel % sand % silt
1 PRA1 28.430000 70.06000 1.507000
2 PRA3 19.515000 78.07667 2.406000
3 PRA4 19.771000 78.63333 1.598333
4 PRB1 7.010667 91.38333 1.607333
5 PRB2 18.613333 79.62000 1.762000
I tried plotting a grouped barchart with
grao <- read_excel("~/Desktop/Masters/Data/grao.xlsx")
colors <- c('#999999','#E69F00','#56B4E9','#94A813','#718200')
barplot(table(grao$Station, grao$`% gravel`, grao$`% sand`, grao$`% silt`), beside = TRUE, col = colors)
But this error message keeps happening:
'height' must be a vector or matrix
I also tried
ggplot(grao, aes(Station, color=as.factor(`% gravel`), shape=as.factor(`% sand`))) +
geom_bar() + scale_color_manual(values=c('#999999','#E69F00','#56B4E9','#94A813','#718200')+ theme(legend.position="top")
But it's creating a crazy graphic.
Could someone help me, please? I've been stuck for weeks now in this one.
Cheers
I think this may be what you are looking for:
#install.packages("tidyverse")
library(tidyverse)
df <- data.frame(
station = c("PRA1", "PRA3", "PRA4", "PRB1", "PRB2"),
gravel = c(28.4, 19.5, 19.7, 7.01, 18.6),
sand = c(70.06, 78.07, 78.63, 91, 79),
silt = c(1.5, 2.4, 1.6, 1.7, 1.66)
)
df2 <- df %>%
pivot_longer(cols = c("gravel", "sand", "silt"), names_to = "Sediment_Type", values_to = "Percentage")
ggplot(df2) +
geom_bar(aes(x = station, y = Percentage, fill = Sediment_Type ), stat = "identity", position = "dodge") +
theme_minimal() # theme_minimal() ships with ggplot2 itself
provides:
You need to "pivot" your data set "longer". Part of the tidy way is ensuring that each column represents a single variable. You will notice that in your initial dataframe the column names are really values of one variable ("Sediment_Type"), and the cells in those columns are all values of another ("Percentage"). The function pivot_longer() takes a dataset and gathers the chosen columns into just two: one holding the former column names and one holding the values.
Once you've done this, ggplot will allow you to specify your x axis, and then a grouping variable via "fill". You can swap these two. If you end up with lots of data and grouping variables, faceting is also an option worth looking into!
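The faceting idea mentioned above could look roughly like this. It is only a sketch that reuses the rounded values from this answer; the choice of one panel per sediment type with free y scales is an assumption about what is useful here, not something the question asked for.

```r
library(tidyr)
library(ggplot2)

# Same rounded example values as in the answer above
df <- data.frame(
  station = c("PRA1", "PRA3", "PRA4", "PRB1", "PRB2"),
  gravel = c(28.4, 19.5, 19.7, 7.01, 18.6),
  sand = c(70.06, 78.07, 78.63, 91, 79),
  silt = c(1.5, 2.4, 1.6, 1.7, 1.66)
)

# One row per station and sediment type
df2 <- pivot_longer(df,
                    cols = c(gravel, sand, silt),
                    names_to = "Sediment_Type",
                    values_to = "Percentage")

# One panel per sediment type; free y scales keep the small silt values readable
ggplot(df2, aes(x = station, y = Percentage)) +
  geom_col() +
  facet_wrap(~ Sediment_Type, scales = "free_y") +
  theme_minimal()
```

The long data frame has 5 stations times 3 sediment types, so 15 rows in three columns.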
Hope this helps,
Brennan
barplot wants a "matrix", ideally with both dimension names. You could transform your data like this (remove first column while using it for row names):
dat <- `rownames<-`(as.matrix(grao[,-1]), grao[,1])
You will see that barplot already does the tabulation for you. However, you could also use xtabs (table is not the right function for your approach, since it counts occurrences instead of using the values).
# dat <- xtabs(cbind(X..gravel, X..sand, X..silt) ~ Station, grao) ## alternatively
I would advise you to use proper variable names, since special characters such as "%" and spaces make columns awkward to reference.
colnames(dat) <- c("gravel", "sand", "silt")
dat
# gravel sand silt
# PRA1 28.430000 70.06000 1.507000
# PRA3 19.515000 78.07667 2.406000
# PRA4 19.771000 78.63333 1.598333
# PRB1 7.010667 91.38333 1.607333
# PRB2 18.613333 79.62000 1.762000
Then barplot knows what's going on.
.col <- c('#E69F00','#56B4E9','#94A813') ## pre-define colors
barplot(t(dat), beside=T, col=.col, ylim=c(0, 100), ## barplot
main="Here could be your title", xlab="sample", ylab="perc.")
legend("topleft", colnames(dat), pch=15, col=.col, cex=.9, horiz=T, bty="n") ## legend
box() ## put it in a box
Data:
grao <- read.table(text=" Station '% gravel' '% sand' '% silt'
1 PRA1 28.430000 70.06000 1.507000
2 PRA3 19.515000 78.07667 2.406000
3 PRA4 19.771000 78.63333 1.598333
4 PRB1 7.010667 91.38333 1.607333
5 PRB2 18.613333 79.62000 1.762000 ", header=TRUE)

Not able to plot multiple geom frequency polygons

My data is
PC_Name Electors_2009 Electors_2014 Electors_2019 Voters_2009 Voters_2014
1 Amritsar 1241099 1477262 1507875 814503 1007196
2 Anandpur Sahib 1338596 1564721 1698876 904606 1086563
3 Bhatinda 1336790 1525289 1621671 1048144 1176767
4 Faridkot 1288090 1455075 1541971 930521 1032107
5 Fatehgarh Sahib 1207556 1396957 1502861 838150 1030954
6 Ferozpur 1342488 1522111 1618419 956952 1105412
7 Gurdaspur 1318967 1500337 1595284 933323 1042699
8 Hoshiarpur 1299234 1485286 1597500 843123 961297
9 Jalandhar 1339842 1551497 1617018 899607 1040762
10 Khadoor Sahib 1340145 1563256 1638842 946690 1040518
11 Ludhiana 1309308 1561201 1683325 846277 1100457
12 Patiala 1344864 1580273 1739600 935959 1120933
13 Sangrur 1251401 1424743 1529432 931247 1099467
Voters_2019
1 859513
2 1081727
3 1200810
4 974947
5 985948
6 1172033
7 1103887
8 990791
9 1018998
10 1046032
11 1046955
12 1177903
13 1105888
I have written the code
data <- read.csv(file = "Punjab data 3.csv")
data
library(ggplot2)
library(reshape2)
long <- reshape2::melt(data, id.vars = "PC_Name")
ggplot(long, aes(PC_Name, value, fill = variable)) + geom_freqpoly(stat="identity",binwidth = 500)
I am trying to plot something like this
I tried a line chart with geom_line, but I am not sure where the problem lies. I am now trying a frequency polygon, but it is not plotting. I want to compare either voters or electors (not both) across the years 2009, 2014 and 2019. Sorry for my bad English.
I want to plot PC_Name on x-axis and compare Electors_2009 with Voters_2009 and Electors_2014 with Voters_2014 and all these on same graph. So on y axis I will have 'values' after melting.
It sounds like you want PC_Name on the horizontal axis and value (after melting) on the vertical axis. Perhaps you would be interested in a barplot that compares electors and voters side by side?
As suggested by @camille, you could split your data frame's variable column after melting into two columns (one with either Electors or Voters, and the other with the year). This would provide flexibility in plot options.
Here are a couple of possibilities to start with:
You could order your variable factor how you would like (e.g., Electors_2009, Voters_2009, Electors_2014, etc. for comparison) and use geom_bar.
You could use facet_wrap to make comparisons between Electors and Voters by year.
library(ggplot2)
library(reshape2)
long <- reshape2::melt(data, id.vars = "PC_Name")
# Split electors/voters from year into 2 columns
long <- cbind(long, colsplit(long$variable, "_", c("type", "year")))
# Change order of variable factor for comparisons
long$variable <- factor(long$variable, levels =
c("Electors_2009", "Voters_2009",
"Electors_2014", "Voters_2014",
"Electors_2019", "Voters_2019"))
# Plot value vs. PC_Name using barplot (all years together)
ggplot(long, aes(PC_Name, value, fill = variable)) +
geom_bar(position = "dodge", stat = "identity")
# Show example plot faceted by year
ggplot(long, aes(PC_Name, value, fill = type)) +
geom_bar(position = "dodge", stat = "identity") +
facet_wrap(~year, ncol = 1)
Please let me know if this is what you had in mind. There would be alternative options available.
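If a line chart is still preferred, note that geom_line() needs an explicit group aesthetic when the x-axis is discrete, which may be why the earlier line-chart attempts did not draw anything useful. The following is a hedged sketch using only three of the constituencies from the question and tidyr::pivot_longer() in place of melt() plus colsplit(); the object names wide and long2 are made up for this example.

```r
library(tidyr)
library(ggplot2)

# Three rows copied from the question's data, as a small stand-in
wide <- data.frame(
  PC_Name = c("Amritsar", "Bhatinda", "Patiala"),
  Electors_2009 = c(1241099, 1336790, 1344864),
  Electors_2014 = c(1477262, 1525289, 1580273),
  Electors_2019 = c(1507875, 1621671, 1739600),
  Voters_2009 = c(814503, 1048144, 935959),
  Voters_2014 = c(1007196, 1176767, 1120933),
  Voters_2019 = c(859513, 1200810, 1177903)
)

# Split names such as "Electors_2009" into type and year while reshaping
long2 <- pivot_longer(wide,
                      cols = -PC_Name,
                      names_to = c("type", "year"),
                      names_sep = "_",
                      values_to = "value")

# group = type connects the points of each series across the discrete x-axis
ggplot(long2, aes(x = PC_Name, y = value, colour = type, group = type)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ year, ncol = 1)
```

Each panel then shows one year, with one line for electors and one for voters.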

adding rows to a tibble based on mostly replicating existing rows

I have data that only shows a variable if it is not 0. However, I would like to have gaps representing these 0s in the graph.
(I will be working from a large dataframe, but have created an example data based on how I will be manipulating it for this purpose.)
library(tidyverse)
library(ggplot2)
A <- tibble(
name = c("CTX_M", "CblA_1"),
rpkm = c(350, 4),
sample = "A"
)
B <- tibble(
name = c("CTX_M", "OXA_1", "ampC"),
rpkm = c(324, 357, 99),
sample = "B"
)
plot <- bind_rows(A, B)
ggplot()+ geom_col(data = plot, aes(x = sample, y = rpkm, fill = name),
position = "dodge")
Sample A and B both have CTX_M, however the other three "names" are only present in either sample A or sample B. When I run the code, the output graph shows two bars for sample A and three bars for sample B. The resulting graph was:
Is there a way for me to add CblA_1 to sample B with rpkm=0, and OXA_1 and ampC to sample A with rpkm=0, while maintaining sample separation? The tibble would then look like this (order not important):
and the graph would therefore look like this:
You can use complete from tidyr.
plot <- plot %>% complete(name,sample,fill=list(rpkm=0))
# A tibble: 8 x 3
name sample rpkm
<chr> <chr> <dbl>
1 ampC A 0
2 ampC B 99
3 CblA_1 A 4
4 CblA_1 B 0
5 CTX_M A 350
6 CTX_M B 324
7 OXA_1 A 0
8 OXA_1 B 357
ggplot()+ geom_col(data = plot, aes(x = sample, y = rpkm, fill = name),
position = "dodge")
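To show what complete() is doing here, the same result can be assembled by hand: build every name/sample combination, join the observed values onto it, and replace the resulting missing values with 0. This is only an illustrative sketch; plot_data, all_combos and manual are names introduced for this example.

```r
library(dplyr)
library(tidyr)

# Same example data as in the question
plot_data <- bind_rows(
  tibble(name = c("CTX_M", "CblA_1"), rpkm = c(350, 4), sample = "A"),
  tibble(name = c("CTX_M", "OXA_1", "ampC"), rpkm = c(324, 357, 99), sample = "B")
)

# Every combination of the observed names and samples
all_combos <- expand_grid(
  name = unique(plot_data$name),
  sample = unique(plot_data$sample)
)

# Attach the observed rpkm values; combinations without data become NA, then 0
manual <- all_combos %>%
  left_join(plot_data, by = c("name", "sample")) %>%
  replace_na(list(rpkm = 0))

# The one-step version from the answer, for comparison
completed <- complete(plot_data, name, sample, fill = list(rpkm = 0))
```

Both manual and completed contain one row per name and sample, so each sample gets a slot for every name in the dodged bar chart.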

R - reshaped data from wide to long format, now want to use created timevar as factor

I am working with longitudinal data and assessing the utilization of a policy over 13 months. In order to get some barplots with the different months on my x-axis, I converted my data from wide format to long format.
So now, my dataset looks like this
id month hours
1 1 13
1 2 16
1 3 20
2 1 0
2 2 0
2 3 10
I thought that after reshaping I could easily use my newly created "month" variable as a factor and plot some graphs. However, it does not work and tells me it's a list or an atomic vector. Transforming it into a factor did not work either, and I really need it as a factor.
Does anybody know how to turn it into a factor?
Thank you very much for your help!
EDIT.
The OP's graph code was posted in a comment. Here it is.
library(ggplot2)
ggplot(data, aes(x = hours, y = month)) + geom_density() + labs(title = 'Distribution of hours')
# Loading ggplot2
library(ggplot2)
# Placing example in dataframe
data <- read.table(text = "
id month hours
1 1 13
1 2 16
1 3 20
2 1 0
2 2 0
2 3 10
", header = TRUE)
# Converting month to factor
data$month <- factor(data$month, levels = 1:13) # the question mentions 13 months
# Plotting grouping by id
ggplot(data, aes(x = month, y = hours, group = id, color = factor(id))) + geom_line()
# Plotting hour density by month
ggplot(data, aes(hours, color = month)) + geom_density()
The problem seems to be in the aes. geom_density only needs an x value; if you think about it, a y value doesn't make sense. You want the density of the x values, so the vertical axis shows the values of that density, not some other variable present in the dataset.
First, read in the data.
Indirekte_long <- read.table(text = "
id month hours
1 1 13
1 2 16
1 3 20
2 1 0
2 2 0
2 3 10
", header = TRUE)
Now graph it.
library(ggplot2)
g <- ggplot(Indirekte_long, aes(hours))
g + geom_density() + labs(title = 'Distribution of hours')
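Since the question asks for bar plots with the months on the x-axis, the factor conversion and such a plot could be sketched as follows. Plotting the mean hours per month is an assumption about what the bars should show, and mean_hours is a name introduced only for this example.

```r
library(ggplot2)

# Example data from the question
Indirekte_long <- read.table(text = "
id month hours
1 1 13
1 2 16
1 3 20
2 1 0
2 2 0
2 3 10
", header = TRUE)

# Convert the reshaped time variable to a factor
Indirekte_long$month <- factor(Indirekte_long$month)
is.factor(Indirekte_long$month)

# Mean hours per month (assumed summary; other statistics work the same way)
mean_hours <- aggregate(hours ~ month, data = Indirekte_long, FUN = mean)

# Months as discrete categories on the x-axis
ggplot(mean_hours, aes(x = month, y = hours)) +
  geom_col() +
  labs(title = "Mean hours per month", x = "Month", y = "Hours")
```

With month stored as a factor, ggplot2 treats it as a discrete axis and draws one bar per month.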
