I have the following dataset (in file emp1.txt), and I would like to draw a grouped bar chart based on age range, with each bar stacked into Male and Female parts for each group.
Count Male Female Emp_group
38 10 28 Staff
38 20 18 Teacher
33 15 18 Teacher
34 17 17 Teacher
41 35 6 Staff
45 25 20 Teacher
35 17 18 Staff
39 30 9 Staff
39 9 30 Teacher
44 22 22 Staff
42 20 22 Teacher
This is what I have tried, but I have not been able to figure out the stacked portion. I would appreciate any help. Both the red and green bars should be divided into two parts, for Male and Female respectively. I would also like the legend to describe the colors used for Male and Female.
data <- read.csv("emp1.txt", sep = "\t" , header = TRUE)
df1<-tibble(data)
df1<- mutate(df1, emp_class = cut(Count, breaks = c(0, 30, 40, 50, 60, 100),
labels = c('(0-30)', '(31-40)', '(41-50)', '(51-60)', '(61-100)')))
df1 <- df1 %>%
group_by(Emp_group) %>%
add_count()
df1 <- mutate(df1, x_axis = paste(Emp_group, n, sep = "\n"))
my_ggp <- ggplot(df1, aes(x=as.factor(x_axis), fill=as.factor(emp_class)))+
geom_bar(aes( y=..count../tapply(..count.., ..x.. ,sum)[..x..]*100), position="dodge") + ylab('% Employes') +xlab("") + labs(fill = "Count group")
df1
my_ggp + theme(text = element_text(size = 20))
You need a stacking position instead of "dodge": position = "stack" for counts, or position = "fill" (used below) for proportions.
I reorganised your code slightly:
library(ggplot2)
library(dplyr)
library(tidyr) # provides pivot_longer()
data %>%
mutate(emp_class = cut(Count,
breaks = c(0, 30, 40, 50, 60, 100),
labels = c('(0-30)', '(31-40)', '(41-50)', '(51-60)', '(61-100)')
)
) %>%
pivot_longer(c(Male, Female),
names_to = "MF") %>%
group_by(Emp_group, MF) %>%
add_count() %>%
mutate(x_axis = as.factor(paste(Emp_group, n, sep = "\n"))) %>%
ggplot(aes(x = x_axis, fill = as.factor(emp_class))) +
geom_bar(aes(y = value),
position = "fill",
stat = "identity") +
labs(x = "", y = "% Employees", fill = "Age group") +
theme(text = element_text(size = 20)) +
facet_wrap(~MF) +
scale_y_continuous(labels = scales::percent_format())
This returns
Data
structure(list(Count = c(38, 38, 33, 34, 41, 45, 35, 39, 39,
44, 42), Male = c(10, 20, 15, 17, 35, 25, 17, 30, 9, 22, 20),
Female = c(28, 18, 18, 17, 6, 20, 18, 9, 30, 22, 22), Emp_group = c("Staff",
"Teacher", "Teacher", "Teacher", "Staff", "Teacher", "Staff",
"Staff", "Teacher", "Staff", "Teacher")), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -11L), spec = structure(list(
cols = list(Count = structure(list(), class = c("collector_double",
"collector")), Male = structure(list(), class = c("collector_double",
"collector")), Female = structure(list(), class = c("collector_double",
"collector")), Emp_group = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
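If the goal is one bar per age range, split into Male and Female segments, with one panel per employee group, a minimal sketch could look like the following. It assumes that Count is the column that defines the age range (as in the question's cut() call) and reuses the column names from the data above; the names emp, emp_long, Sex, n_emp and p_stack are illustrative only.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Same values as the question's emp1.txt, typed in as a plain data frame
emp <- data.frame(
  Count     = c(38, 38, 33, 34, 41, 45, 35, 39, 39, 44, 42),
  Male      = c(10, 20, 15, 17, 35, 25, 17, 30, 9, 22, 20),
  Female    = c(28, 18, 18, 17, 6, 20, 18, 9, 30, 22, 22),
  Emp_group = c("Staff", "Teacher", "Teacher", "Teacher", "Staff", "Teacher",
                "Staff", "Staff", "Teacher", "Staff", "Teacher")
)

# Bin Count into ranges, then put Male/Female into one column (long format)
emp_long <- emp %>%
  mutate(emp_class = cut(Count,
                         breaks = c(0, 30, 40, 50, 60, 100),
                         labels = c("(0-30)", "(31-40)", "(41-50)",
                                    "(51-60)", "(61-100)"))) %>%
  pivot_longer(c(Male, Female), names_to = "Sex", values_to = "n_emp")

# fill = Sex gives the Male/Female legend; the facets separate the groups
p_stack <- ggplot(emp_long, aes(x = emp_class, y = n_emp, fill = Sex)) +
  geom_col(position = "stack") +  # use "fill" instead to show proportions
  facet_wrap(~Emp_group) +
  labs(x = "Age group", y = "Employees", fill = "Sex")
p_stack
```

Mapping fill to the Male/Female column is what produces the legend entries asked for in the question; the age range then moves to the x-axis instead of the fill.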
I am looking to make a bar graph of medals in R. I have 3 distinct columns (gold, silver, bronze). The gold column has a total of 8, the silver 10, and the bronze 13.
For the code, I started writing: ggplot(data, aes(x=?)) + geom_bar()
I am not sure how to pass all 3 medal columns to the function where it shows x=?
Thanks
For plotting purposes, it is "easier" to work with long data instead of wide. Below I converted the data you mentioned in your comment to long and plotted the data as a grouped bar.
library(tidyverse)
# load data
raw_data <- structure(list(Rank = c(1, 2, 3, 4, 5, 6),
`Team/Noc` = c("United States of America", "People's Republic of China", "Japan", "Great Britain", "ROC", "Australia"),
Gold = c(39, 38, 27, 22, 20, 17),
Silver = c(41,32, 14, 21, 28, 7),
Bronze = c(33, 18, 17, 22, 23, 22),
Total = c(113, 88, 58, 65, 71, 46),
`Rank by Total` = c(1, 2, 5, 4, 3, 6)),
row.names = c(NA,-6L),
class = c("tbl_df", "tbl", "data.frame"))
# convert wide data to long
long_data <- raw_data %>%
pivot_longer(cols = -`Team/Noc`, names_to = 'Medal') %>% # convert wide data to long format
filter(Medal %in% c("Gold", "Silver", "Bronze")) # only select medal columns
# plot
ggplot(long_data) +
geom_col(aes(x = `Team/Noc`,
y = value,
fill = Medal),
position = "dodge" # grouped bars
)
Hope this gets you started!
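As a variation on the code above, the medal columns can be named directly in pivot_longer(), which avoids the later filter() step, and the Medal column can be turned into a factor so the bars appear in Gold, Silver, Bronze order. This is only a sketch on a three-row subset of the data; the names medals, medals_long, Team, n_medals and the fill colors are illustrative assumptions.

```r
library(tidyr)
library(ggplot2)

# Small subset of the medal table, with a syntactic column name for the team
medals <- data.frame(
  Team   = c("United States of America", "People's Republic of China", "Japan"),
  Gold   = c(39, 38, 27),
  Silver = c(41, 32, 14),
  Bronze = c(33, 18, 17)
)

# Pivot only the three medal columns into Medal / n_medals pairs
medals_long <- pivot_longer(medals,
                            cols = c(Gold, Silver, Bronze),
                            names_to = "Medal",
                            values_to = "n_medals")

# Fix the bar and legend order (default would be alphabetical)
medals_long$Medal <- factor(medals_long$Medal,
                            levels = c("Gold", "Silver", "Bronze"))

p_medals <- ggplot(medals_long, aes(x = Team, y = n_medals, fill = Medal)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c(Gold = "gold", Silver = "grey70",
                               Bronze = "darkorange3"))
p_medals
```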
I have the following
densityPlots <- lapply(numericCols, function(var_x){
p <- ggplot(df, aes_string(var_x)) + geom_density()
})
numericCols holds the names of the numeric columns. I want to add a line at the mean, and I have tried multiple things, such as:
densityPlots <- lapply(numericCols, function(var_x){
p <- ggplot(df, aes_string(var_x)) + geom_density() + geom_vline(aes(xintercept=mean(var_x)),
color="red", linetype="dashed", size=1)
})
The data
str(df)
tibble [9 × 4] (S3: tbl_df/tbl/data.frame)
$ A: num [1:9] 12 NA 34 45 56 67 78 89 100
$ B: num [1:9] 1 2 3 NA 5 6 7 8 9
$ C: num [1:9] 83 55 27 27 7 3 5 8 9
$ D: num [1:9] 6 2 NA 1 NA 3 4 5 6
dput(df)
structure(list(A = c(12, NA, 34, 45, 56, 67, 78, 89, 100), B = c(1,
2, 3, NA, 5, 6, 7, 8, 9), C = c(83, 55, 27, 27, 7, 3, 5, 8, 9
), D = c(6, 2, NA, 1, NA, 3, 4, 5, 6)), row.names = c(NA, -9L
), class = c("tbl_df", "tbl", "data.frame"))
print(numericCols)
[1] "A" "B" "C"
But it does not work; it just ignores the geom_vline call. Does someone have a suggestion? Thanks :)!
You should use mean(df[, var_x], na.rm=T) in geom_vline:
library(ggplot2)
df <- structure(list(A = c(12, NA, 34, 45, 56, 67, 78, 89, 100), B = c(1,
2, 3, NA, 5, 6, 7, 8, 9), C = c(83, 55, 27, 27, 7, 3, 5, 8, 9
), D = c(6, 2, NA, 1, NA, 3, 4, 5, 6)), row.names = c(NA, -9L
), class = c("tbl_df", "tbl", "data.frame"))
numericCols <- c("A","B","C")
df <- as.data.frame(df)
densityPlots <- lapply(numericCols, function(var_x) {
ggplot(df, aes_string(var_x)) + geom_density() +
geom_vline(aes(xintercept=mean(df[, var_x], na.rm=T)),
color="red", linetype="dashed", size=1)
})
gridExtra::grid.arrange(grobs=densityPlots)
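The reason the attempt in the question draws no line is that var_x is a column name, so mean(var_x) is the mean of a character string, which evaluates to NA. Since aes_string() is deprecated in current ggplot2 releases, the same loop can also be sketched with the .data pronoun, computing the mean outside aes(). This is a minimal sketch on the first three columns of the question's data; col_mean is an illustrative name.

```r
library(ggplot2)

df <- data.frame(
  A = c(12, NA, 34, 45, 56, 67, 78, 89, 100),
  B = c(1, 2, 3, NA, 5, 6, 7, 8, 9),
  C = c(83, 55, 27, 27, 7, 3, 5, 8, 9)
)
numericCols <- c("A", "B", "C")

densityPlots <- lapply(numericCols, function(var_x) {
  # Look the column up by name and compute its mean, ignoring NAs
  col_mean <- mean(df[[var_x]], na.rm = TRUE)
  ggplot(df, aes(x = .data[[var_x]])) +
    geom_density(na.rm = TRUE) +
    # A constant xintercept does not need to be wrapped in aes()
    geom_vline(xintercept = col_mean, color = "red", linetype = "dashed")
})

densityPlots[[1]]
```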
Here is an approach somewhat different from what you tried in your question: it uses dplyr and tidyr to pivot the data and then relies on ggplot mapping. Unfortunately, geom_vline doesn't summarize by group, so you have to pre-compute the means:
set.seed(3)
data <- data.frame(Category = paste0("Category",LETTERS[1:20]),
lapply(LETTERS[1:10],function(x){setNames(data.frame(runif(20,10,100)),x)}))
numericCols <- LETTERS[1:10]
library(dplyr)
library(tidyr)
library(ggplot2)
data.means <- data %>%
select(all_of(numericCols)) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "var_x") %>%
group_by(Variable) %>%
summarize(Mean = mean(var_x))
data %>%
select(all_of(numericCols)) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "var_x") %>%
ggplot(aes(x = var_x, color = Variable)) +
geom_density() +
geom_vline(data = data.means, aes(xintercept=Mean, color = Variable),
linetype="dashed", size=1)
Or you could combine with facet_wrap for multiple plots.
data %>%
select(all_of(numericCols)) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "var_x") %>%
ggplot(aes(x = var_x)) +
facet_wrap(.~Variable) +
geom_density() +
geom_vline(data = data.means, aes(xintercept=Mean, color = Variable),
linetype="dashed", size=1)
I'm trying to plot a series of demographic factors. Each plot shows the frequency distribution of a demographic variable by gender. It runs nicely, but some of the labels are in alphabetical order rather than a meaningful order, e.g. Education, Marital Status and SIC2007.
Data structure
structure(list(DMSex = c("Male", "Female", "Male", "Male"), Income = c(980,
-8, 3000, 120), IncCat = c("-1", "-8", "-1", "-1"), HrWkAc = c(-1,
-1, -1, -1), ShiftWk = c(-1, -1, -1, -1), ShiftPat = c(-1, -1,
-1, -1), SOC2010C = c("-1", "9.2.3.3", "-1", "-1"), XSOC2010 = c(-1,
9233, -1, -1), IndexNo = c(-1, 1398, -1, -1), ES2010 = c(-1,
7, -1, -1), nssec = c(-1, 13.4, -1, -1), SECFlag = c(-1, 0, -1,
-1), LSOC2000 = c("-1", "9.2.3.3", "-1", "-1"), XSOC2000 = c(-1,
9233, -1, -1), seg = c(-1, 11, -1, -1), sc = c(-1, 5, -1, -1),
SIC2007 = c(-1, 87, -1, -1), Educ = c(1, 1, -1, 2), EducCur = c(10,
1, -1, -1), FinFTEd = c(-1, -1, -1, 1), FinFTEdY = c(-1,
-1, -1, 21), HiQual = c(22, 10, -1, 1), sic20070 = c(-1,
87, -1, -1), dhhtype = c(6, 8, 7, 3), dagegrp = c(2, 3, 3,
3), dmarsta = c("Single, never married", "Single, never married",
"Interview not achieved", "Married/cohabitating"), dhiqual = c(" Secondary",
" A level or equivalent", "Item not applicable", "Degree or higher"
), dnssec8 = c(-1, 8, -1, -1), duresmc = c(14, 15, 11, 16
), dgorpaf = c(7, 8, 5, 10), dukcntr = c(1, 1, 1, 1), dnrkid04 = c(0,
0, 0, 0), dilodefr = c(3, 3, -1, 3), deconact = c(8, 8, -1,
11), dtenure = c(2, 3, 2, 3), dtotac = c(-1, -1, -1, -1),
dtotus = c(-1, -1, -1, -1), dsic = c("Item not applicable",
"Public admin, education and health", "Item not applicable",
"Item not applicable"), dsoc = c(-1, 9, -1, -1), DVAge_category = c("15 to 30",
"15 to 30", "15 to 30", "15 to 30"), Income_category = c("Less than 1000",
"Less than 1000", "1001 to 3000", "Less than 1000"), HoursWorked_category = c("Less than 20 hours",
"Less than 20 hours", "Less than 20 hours", "Less than 20 hours"
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
#Age variable
demographics$dagegrp_category<-ifelse(demographics$dagegrp_01 > 2 & demographics$dagegrp < 6, age<-"15 to 30",
ifelse(demographics$dagegrp> 6 & demographics$dagegrp < 9, age<-"31 to 45",
ifelse(demographics$dagegrp > 9 & demographics$dagegrp < 12 , age<-"46 to 60",
ifelse(demographics$dagegrp > 12 & demographics$dagegrp < 15 , age<-"61 to 75",
ifelse(demographics$dagegrp > 15 & demographics$dagegrp < 18 , age<-"76+",
age<- "zombie")))))
demographics$DVAge_category<-c("15 to 30","31 to 45", "46 to 60","61 to 75", "76+", "zombie")[findInterval(demographics$dagegrp , c(-Inf, 6, 10, 12, 15,18, Inf))]
Age<-as.vector(demographics$DVAge_category)
#Gender variable
demographics$DMSex[demographics$DMSex==1]<-"Male"
demographics$DMSex[demographics$DMSex==2]<-"Female"
Gender<-as.vector(demographics$DMSex)
#Income variable
demographics$Income_category<-ifelse(demographics$Income < 1001, income<-"Less than 1000",
ifelse(demographics$Income > 999 & demographics$Income < 3001, income<-"1001 to 3000",
ifelse(demographics$Income > 3001 & demographics$Income < 6001, income <-"3001 to 6000",
ifelse(demographics$Income > 6001 & demographics$Income < 10001 , income<-"6001 to 10000",
income<- "zombie"))))
demographics$Income_category<-c("Less than 1000","1001 to 3000", "3001 to 6000", "6001 to 10000","zombie")[findInterval(demographics$Income , c(-Inf, 1001, 3001, 6001,10001, Inf) ) ]
Income<-as.vector(demographics$Income_category)
#Marital status variable
demographics$dmarsta[demographics$dmarsta==-1]<-"Interview not achieved"
demographics$dmarsta[demographics$dmarsta==1]<-"Single, never married"
demographics$dmarsta[demographics$dmarsta==2]<-"Married/cohabitating"
demographics$dmarsta[demographics$dmarsta==3]<-"Divorced/widowed"
MaritalStatus<-as.vector(demographics$dmarsta)
#Education
demographics$dhiqual[demographics$dhiqual==-8]<-"Don't know"
demographics$dhiqual[demographics$dhiqual==-1]<-"Item not applicable"
demographics$dhiqual[demographics$dhiqual==1]<-"Degree or higher"
demographics$dhiqual[demographics$dhiqual==2]<-"Higher education"
demographics$dhiqual[demographics$dhiqual==3]<-" A level or equivalent"
demographics$dhiqual[demographics$dhiqual==4]<-" Secondary"
demographics$dhiqual[demographics$dhiqual==5]<-" Other"
Education<-as.vector(demographics$dhiqual)
#Hours worked per week in main job variable
demographics$HoursWorked_category<-ifelse(demographics$dtotac < 21, workhours<-"Less than 20 hours",
ifelse(demographics$dtotac > 20 & demographics$dtotac< 41, workhours <-"Between 21 to 40 hours",
ifelse(demographics$dtotac > 40 & demographics$dtotac < 61, workhours <-"Between 41 to 60 hours",
ifelse(demographics$dtotac > 62, workhours<-"More than 61 hours",
workhours<- "Not Applicable"))))
demographics$HoursWorked_category<-c("Less than 20 hours", "Between 21 to 40 hours", "Between 41 to 60 hours","More than 61 hours","Not Applicable")[findInterval(demographics$dtotac, c(-Inf, 21, 41, 61, 62, Inf) ) ]
WorkHours<-as.vector(demographics$HoursWorked_category)
#DV: SIC 2007 industry divisions (grouped)
demographics$dsic[demographics$dsic==-8]<-"Don't know"
demographics$dsic[demographics$dsic==-1]<-"Item not applicable"
demographics$dsic[demographics$dsic==1]<-"Agriculture, forestry and fishing"
demographics$dsic[demographics$dsic==2]<-"Manufacturing"
demographics$dsic[demographics$dsic==3]<-"Energy and water supply"
demographics$dsic[demographics$dsic==4]<-"Construction"
demographics$dsic[demographics$dsic==5]<-"Distribution, hotels and restaurants"
demographics$dsic[demographics$dsic==6]<-"Transport and communication"
demographics$dsic[demographics$dsic==7]<-"Banking and finances"
demographics$dsic[demographics$dsic==8]<-"Public admin, education and health"
demographics$dsic[demographics$dsic==9]<-"Other services"
demographics$industry_category<-c("Don't know", "Item not applicable", "Agriculture, forestry and fishing","Manufacturing","Energy and water supply",
"Construction", "Distribution, hotels and restaurants", "Transport and communication", "Banking and finances",
"Public admin, education and health", "Other service")
SIC2007<-as.vector(demographics$dsic)
# creating df
df<-data.frame(Gender, Age, Education, MaritalStatus, Income, WorkHours, SIC2007)
df %>%
#tidy, not gender
gather(variable, value, -c(Gender))%>%
#group by value, variable, then gender
group_by(value, variable, Gender) %>%
#summarise to obtain table cell frequencies
summarise(freq=n()) %>%
#Plot
ggplot(aes(x=value, y=freq, group=Gender))+geom_bar(aes(fill=Gender), stat='identity', position='dodge')+ facet_wrap(~variable, scales='free_x') + theme(legend.position="right", axis.text.x = element_text(angle = 60, hjust = 1)) + labs(x="Characteristics", y="Frequencies")
In ggplot2, a discrete axis is ordered by the factor levels of the mapped data.frame column; a character column is sorted alphabetically.
To (re)set the order in your plot, set the factor levels explicitly:
df$variable <- factor(df$variable, levels = c(...))
You could do this by first storing the data.frame, before piping to the ggplot function, and then manually setting the levels of the variables you want to change. It may be a bit inefficient, but this should do the trick:
## Make your plotting data.frame
df2 <- df %>%
gather(variable, value, -c(Gender))%>%
group_by(value, variable, Gender) %>%
summarise(freq=n())
## Apply custom order to MaritalStatus variable:
custom <- c(sort(unique(MaritalStatus))[c(4,3,1,2)],
....)
df2$variable <- factor(df2$variable, levels = c(levels(df2$variable)[!levels(df2$variable) %in% custom],
custom))
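A self-contained sketch of the same idea, assuming the labels to be reordered live in the column that is mapped to x (value in the gathered data frame). The data frame plot_df, its counts and the order in status_order are made up for illustration.

```r
library(ggplot2)

# Toy summary table; the rows are deliberately not in the desired order
plot_df <- data.frame(
  value = c("Married/cohabitating", "Divorced/widowed",
            "Interview not achieved", "Single, never married"),
  freq  = c(1, 1, 1, 2)
)

# Hypothetical "meaningful" order for the x-axis labels
status_order <- c("Single, never married", "Married/cohabitating",
                  "Divorced/widowed", "Interview not achieved")

# Converting to a factor with explicit levels fixes the axis order
plot_df$value <- factor(plot_df$value, levels = status_order)

p_status <- ggplot(plot_df, aes(x = value, y = freq)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
p_status
```

Any label that is missing from levels becomes NA, so when several variables share one value column (as after gather()), the levels vector has to list every label that occurs in any facet.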
I am trying to plot several line graphs from a list using a loop in R. The list temp.list looks like this:
> temp.list[1]
$`1`
NEW_UPC Week1 Week2 Week3 Week4 Week5 Week6 Week7 Week8 Week9 Week10 Week11 Week12
5 11410008398 3 6 11 15 15 27 31 33 34 34 34 34
Life Status Num_markets Sales
5 197 1 50 186048.1
I use only part of the data above to plot: items 2 to 13 in the list go on the y-axis, i.e. 3, 6, 11, 15, ..., 34. For the x-axis, I would like to have Week 1, Week 2, ..., Week 12 at the tick marks. As I don't know how to assign character values to x in the ggplot command, I created a variable called weeks for the x-axis as below:
weeks = c(1,2,3,4,5,6,7,8,9,10,11,12)
The code that I used to generate the plot is below:
for (i in 1:2) {
markets= temp.list[[i]][2:13]
ggplot(data = NULL,aes(x=weeks,y=markets))+geom_line()+
scale_x_continuous(breaks = seq(0,12,1))+
scale_y_continuous(breaks = seq(0,50,5))
}
This code does not generate any plot. When I run just the below lines:
ggplot(data = NULL,aes(x=weeks,y=markets))+geom_line()+
scale_x_continuous(breaks = seq(0,12,1))+
scale_y_continuous(breaks = seq(0,50,5))
I get this error:
Error: geom_line requires the following missing aesthetics: y
In addition: Warning message:
In (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
row names were found from a short variable and have been discarded
Any help to fix this will be appreciated. I looked at some related discussions here, but I am not clear how to proceed.
Any better way to generate multiple plots is also welcome. From temp.list, I am looking to generate over 300 separate line graphs (i.e., not all lines in one chart).
Here is a solution using tidyverse:
library(tidyverse)
# 1: create a function for the plot
## gather will give a long format data.frame
## then extract the number of the week and format it to numeric
## then use ggplot on this data
plot_function <- function(data) {
data %>%
gather(., key = week, value = markets, starts_with("Week")) %>%
mutate(week= as.numeric(str_sub(week, 5, -1))) %>%
ggplot(., aes(x = week, y = markets)) +
geom_line() +
scale_x_continuous(breaks = seq(0, 12, 1)) +
scale_y_continuous(breaks = seq(0, 50, 5))
}
# 2: map the function to your list
plots <- map(temp.list, plot_function)
# 3: access each plot easily
plots[[1]]
plots[[2]]
...
#Data used for this example
temp.list = list('1' = data.frame(NEW_UPC = 11410008398,
Week1 = 3,
Week2 = 6,
Week3 = 11,
Week4 = 15,
Week5 = 15,
Week6 = 27,
Week7 = 31,
Week8 = 33,
Week9 = 34,
Week10 = 34,
Week11 = 34,
Week12 = 34,
Life = 197,
Status = 1,
Num_markets = 50,
Sales = 186048.1),
'2' = data.frame(NEW_UPC = 11410008398,
Week1 = 4,
Week2 = 5,
Week3 = 8,
Week4 = 13,
Week5 = 14,
Week6 = 25,
Week7 = 29,
Week8 = 30,
Week9 = 31,
Week10 = 33,
Week11 = 34,
Week12 = 34,
Life = 201,
Status = 1,
Num_markets = 50,
Sales = 186048.1)
)
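As for why the loop in the question shows nothing: a ggplot object is only drawn when it is printed, and automatic printing happens only at top level, so inside a for loop each plot needs an explicit print() or ggsave(). The "missing aesthetics: y" error probably arises because markets was a one-row data frame rather than a numeric vector. A minimal sketch of a loop that both displays and saves each plot, assuming a shortened temp.list with four weeks; out_dir, the file names and the PDF format are illustrative choices.

```r
library(ggplot2)

# Shortened stand-in for temp.list (only four weeks per element)
temp.list <- list(
  "1" = data.frame(Week1 = 3, Week2 = 6, Week3 = 11, Week4 = 15),
  "2" = data.frame(Week1 = 4, Week2 = 5, Week3 = 8,  Week4 = 13)
)

out_dir <- tempdir()   # hypothetical output folder
saved <- character(0)  # paths of the files written so far

for (i in seq_along(temp.list)) {
  week_cols <- grep("^Week", names(temp.list[[i]]), value = TRUE)
  # Turn the one-row data frame into plain numeric vectors
  plot_df <- data.frame(
    week    = as.numeric(sub("^Week", "", week_cols)),
    markets = as.numeric(unlist(temp.list[[i]][1, week_cols]))
  )
  p <- ggplot(plot_df, aes(x = week, y = markets)) +
    geom_line() +
    # Numeric positions with "Week n" text at the tick marks
    scale_x_continuous(breaks = plot_df$week,
                       labels = paste("Week", plot_df$week))
  print(p)  # explicit print is needed inside a loop
  out_file <- file.path(out_dir, paste0("markets_", names(temp.list)[i], ".pdf"))
  ggsave(out_file, plot = p, width = 6, height = 4)
  saved <- c(saved, out_file)
}
```

For several hundred plots it may be preferable to drop the print() call and keep only ggsave(), so that nothing is drawn on screen.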