Horizontal stacked bar chart with a separate element in ggplot - r

I'm trying to come up with a way to visualize some likert scale data in a specific way. I'm not even sure how to fake what it would look like, so I hope my explanation would suffice.
I have the following (fake) data:
n questions, each with 5 answers (strongly agree, agree, don't agree, strongly don't agree, and don't know)
I want to visualize the data (ideally using ggplot) along a central axis, so that the two "agree" answers are on the left, and the two "disagree" answers are on the right, and then on a separate block off to the side, a block representing "don't know". It should look roughly like this:
Q1: *****++++++++|------!! ?????
Q2: ****++++++|----!!!!!! ???????
Q3: **++++++|---!!! ??????????
*: strongly agree, +: agree, -: don't agree, !:strongly disagree, ?: don't know
As you can see, this representation makes it possible to compare the actual numbers of agree and disagree answers without hiding how many "don't knows" there are. Where I get stuck is how to create that second element for the don't knows. Any ideas?
Here's some fake data:
structure(list(Q = structure(1:3, .Label = c("Q1", "Q2", "Q3"
), class = "factor"), SA = c(25, 18, 12), A = c(30, 25, 15),
DA = c(25, 20, 25), SDA = c(10, 18, 25), DK = c(10, 19, 23
)), row.names = c(NA, -3L), class = "data.frame")

As suggested in the comments, you can just facet out the "DK" category.
library(ggplot2)
library(tidyr)
library(scales)
df <- structure(list(Q = structure(1:3, .Label = c("Q1", "Q2", "Q3"
), class = "factor"), SA = c(25, 18, 12), A = c(30, 25, 15),
DA = c(25, 20, 25), SDA = c(10, 18, 25), DK = c(10, 19, 23
)), row.names = c(NA, -3L), class = "data.frame")
lvls <- colnames(df)[c(2,3,5,4,6)]
ggplot(
pivot_longer(df ,-1),
aes(y = Q, fill = name, group = factor(name, lvls),
x = ifelse(name %in% c("A", "SA"), -1, 1) * value)
) +
geom_col() +
facet_grid(~ name == "DK", scales = "free_x", space = "free_x") +
scale_fill_manual(
values = c(viridis_pal()(4), "grey50"),
limits = colnames(df)[-1]
)
Created on 2021-11-04 by the reprex package (v2.0.1)
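To see what the plot above is actually drawing, the reshaping and the sign flip can be computed on their own. This is only a sketch of the data the answer feeds to ggplot, not part of the original answer; the column names `signed` and `is_dk` are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Same fake data as in the question
df <- data.frame(
  Q   = factor(c("Q1", "Q2", "Q3")),
  SA  = c(25, 18, 12),
  A   = c(30, 25, 15),
  DA  = c(25, 20, 25),
  SDA = c(10, 18, 25),
  DK  = c(10, 19, 23)
)

long <- df %>%
  pivot_longer(-Q) %>%   # gives columns Q, name, value
  mutate(
    # "agree" answers get negative widths, so they extend left of zero
    signed = ifelse(name %in% c("A", "SA"), -1, 1) * value,
    # the same logical that facet_grid(~ name == "DK") splits on
    is_dk = name == "DK"
  )

filter(long, Q == "Q1")
```

Each question ends up with five rows: two negative widths, two positive widths, and one "DK" row that the facet moves into its own panel.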

Related

How to make a bar-chart by using two variables on x-axis and a grouped variable on y-axis?

I hope I asked my question in the right way this time! If not let me know!
I want to code a grouped bar chart similar to this one (which I just created in Paint):
enter image description here
I created it flipped, but it doesn't actually matter whether it's flipped or not. So a plot similar to this one would also be very useful:
Grouped barchart in r with 4 variables
Both variables, happy and lifesatisfied, are scores on a scale from 0 to 10. Working hours is a grouped variable with the categories 43+, 37-42, 33-36, 27-32, and <27.
Here is a very similar example of what my data set looks like (I just changed the values and the order; I also have many more observations):
Working hours   happy   lifesatisfied   country
37-42           7       9               DK
<27             8       8               SE
43+             7       8               DK
33-36           6       6               SE
37-42           7       5               NO
<27             4       7               NO
I tried to find similar examples and, based on those, tried to code the bar chart in the following way, but it doesn't work:
df2 <- datafilteredwomen %>%
pivot_longer(cols = c("happy", "stflife"), names_to = "var", values_to = "Percentage")
ggplot(df2) +
geom_bar(aes(x = Percentage, y = workinghours, fill = var ), stat = "identity", position = "dodge") + theme_minimal()
It gives this plot, which is not what I want:
enter image description here
Second try:
forplot = datafilteredwomen %>% group_by(workinghours, happy, stflife) %>% summarise(count = n()) %>% mutate(proportion = count/sum(count))
ggplot(forplot, aes(workinghours, proportion, fill = as.factor(happy))) +
geom_bar(position = "dodge", stat = "identity", color = "black")
gives this plot:
enter image description here
third try - used the ggplot2 builder add-in:
library(dplyr)
library(ggplot2)
datafilteredwomen %>%
filter(!is.na(workinghours)) %>%
ggplot() +
aes(x = workinghours, group = happy, weight = happy) +
geom_bar(position = "dodge",
fill = "#112446") +
theme_classic() + scale_y_continuous(labels = scales::percent)
gives this plot:
enter image description here
But none of my attempts gives what I want. I really hope that someone can help me, if it's possible!
After speaking to the OP I found his data source and came up with this solution. Apologies if it's a bit messy, I have only been using R for 6 months. For ease of reproducibility I have preselected the variables used from the original dataset.
data <- structure(list(wkhtot = c(40, 8, 50, 40, 40, 50, 39, 48, 45,
16, 45, 45, 52, 45, 50, 37, 50, 7, 37, 36), happy = c(7, 8, 10,
10, 7, 7, 7, 6, 8, 10, 8, 10, 9, 6, 9, 9, 8, 8, 9, 7), stflife = c(8,
8, 10, 10, 7, 7, 8, 6, 8, 10, 9, 10, 9, 5, 9, 9, 8, 8, 7, 7)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
Here are the packages required.
library(dplyr)   # data manipulation
library(tidyr)   # pivot_longer()
library(ggplot2) # plotting
Here I have manipulated the data and commented my reasoning.
data <- data %>%
select(wkhtot, happy, stflife) %>% #Select the wanted variables
rename(Happy = happy) %>% #Rename for graphical sake
rename("Life Satisfied" = stflife) %>%
na.omit() %>% # remove NA values
group_by(WorkingHours = cut(wkhtot, c(-Inf, 27, 32,36,42,Inf))) %>% #Create the ranges
select(WorkingHours, Happy, "Life Satisfied") %>% #Select the variables again
pivot_longer(cols = c(`Happy`, `Life Satisfied`), names_to = "Criterion", values_to = "score") %>% # pivot the df longer for plotting
group_by(WorkingHours, Criterion)
data$Criterion <- as.factor(data$Criterion) #Make criterion a factor for graphical reasons
A bit more data prep
# Creating the percentage
data.plot <- data %>%
group_by(WorkingHours, Criterion) %>%
summarise_all(sum) %>% # get the sums for score by working hours and criterion
group_by(WorkingHours) %>%
mutate(tot = sum(score)) %>%
mutate(freq =round(score/tot *100, digits = 2)) # get percentage
Creating the plot.
# Plotting
ggplot(data.plot, aes(x = WorkingHours, y = freq, fill = Criterion)) +
geom_col(position = "dodge") +
geom_text(aes(label = freq),
position = position_dodge(width = 0.9),
vjust = 1) +
xlab("Working Hours") +
ylab("Percentage")
Please let me know if there is a more concise or easier way!!
B
DataSource: https://www.europeansocialsurvey.org/downloadwizard/?fbclid=IwAR2aVr3kuqOoy4mqa978yEM1sPEzOaghzCrLCHcsc5gmYkdAyYvGPJMdRp4
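One detail of the cut() call in the answer above: cut() builds right-closed intervals by default, so the breaks c(-Inf, 27, 32, 36, 42, Inf) put exactly 27 hours into the lowest bin. The sketch below is an assumption-laden variant, not the answer's code: it assumes hours are whole numbers, shifts the first break to 26, and attaches the category names from the question as labels.

```r
# Example hours, including the values that sit on category boundaries
hours <- c(8, 26, 27, 32, 33, 36, 37, 42, 43, 50)

# Right-closed intervals: (-Inf,26], (26,32], (32,36], (36,42], (42,Inf]
working_hours <- cut(
  hours,
  breaks = c(-Inf, 26, 32, 36, 42, Inf),
  labels = c("<27", "27-32", "33-36", "37-42", "43+")
)

table(working_hours)
```

With these labels the x axis shows the question's categories instead of interval notation such as (36,42].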
Taking this example dataframe df:
df <- structure(list(Working.hours = c("37-42", "37-42", "<27", "<27",
"43+", "43+", "33-36", "33-36", "37-42", "37-42", "<27", "<27"
), country = c("DK", "DK", "SE", "SE", "DK", "DK", "SE", "SE",
"NO", "NO", "NO", "NO"), criterion = c("happy", "lifesatisfied",
"happy", "lifesatisfied", "happy", "lifesatisfied", "happy",
"lifesatisfied", "happy", "lifesatisfied", "happy", "lifesatisfied"
), score = c(7L, 9L, 8L, 8L, 7L, 8L, 6L, 6L, 7L, 5L, 4L, 7L)), row.names = c(NA,
-12L), class = c("tbl_df", "tbl", "data.frame"))
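you can proceed like this:
library(dplyr)
library(tidyr)
library(ggplot2)
# The example df above is already in long format (one row per criterion).
# If your data is still wide, with separate happy and lifesatisfied columns,
# reshape it first:
# df <- df %>%
#   pivot_longer(cols = c(happy, lifesatisfied),
#                names_to = 'criterion',
#                values_to = 'score')
df %>%
  ggplot(aes(x = Working.hours,
             y = score,
             fill = criterion)) +
  geom_col(position = 'dodge') +
  coord_flip()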
For picking colours see ?scale_fill_manual, for formatting legend etc. numerous existing answers to related questions on stackoverflow.

Removing Incorrect Labels within Tidyverse/ Limiting Actions of as_factor()

I'm working with British Election Study data. To be used in R, this first has to be converted from the .dta form provided, which I think puts labels on to a lot of variables. Most of the time this is useful, but I think a problem I've got is where this isn't the case.
Using as_factor() blindly converts all variables with labels to factors. Is there a way to specify that only certain vectors are converted ? i.e
new_df <- data %>%
as_factor(just_this_column)
Failing that, is there a good way to remove the labels of certain variables within a dataframe? I've looked at the sjlabelled package, but this does something weird and converts the data from a dataframe:
example_data<- str(sjlabelled::remove_all_labels(example_data$generalElectionVoteW19))
The reason I'm trying to do all of this is to make a histogram of number of people voting for each party (the factor) at a certain age. In this dataset, the age variable has a label which is messing up the code.
Of course, I could just convert the factor to a numeric value at the end but this seems like a messy way of achieving things !
Here is the dput:
structure(list(ageW19 = structure(c(72, 52, 39, 75, 26, 56), label = "Age", format.stata = "%8.0g", labels = c(`Not Asked` = -9,
Skipped = -8), class = c("haven_labelled", "vctrs_vctr", "double"
)), generalElectionVoteW19 = structure(c(1, 13, 3, 1, 2, 1), label = "General election vote intention (recalled vote in post-election waves)", format.stata = "%40.0g", labels = c(`I would/did not vote` = 0,
Conservative = 1, Labour = 2, `Liberal Democrat` = 3, `Scottish National Party (SNP)` = 4,
`Plaid Cymru` = 5, `United Kingdom Independence Party (UKIP)` = 6,
`Green Party` = 7, `British National Party (BNP)` = 8, Other = 9,
`Change UK- The Independent Group` = 11, `Brexit Party` = 12,
`An independent candidate` = 13, `Don't know` = 9999), class = c("haven_labelled",
"vctrs_vctr", "double"))), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"), na.action = c(`1` = 1L, `3` = 3L, `5` = 5L
))
To your first question: you need mutate() to convert a single column, e.g.
new_df <- data %>%
  mutate(factor_column = as_factor(old_column))
However, as you said you probably want to convert to numeric type, so you might want to use as.numeric instead of as_factor.
We may use base R
data$factor_column <- factor(data$old_column)
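For the second part of the question (dropping the labels on selected columns), one possible approach is haven's zap_labels(), combined with as_factor() on just the column that should become a factor. This is a sketch under assumptions: the data frame name `bes` is made up, and the label set is shortened from the dput above.

```r
library(dplyr)
library(haven)

# Small stand-in for the BES data (values from the dput, labels shortened)
bes <- tibble(
  ageW19 = labelled(
    c(72, 52, 39, 75, 26, 56),
    labels = c(`Not Asked` = -9, Skipped = -8),
    label = "Age"
  ),
  generalElectionVoteW19 = labelled(
    c(1, 13, 3, 1, 2, 1),
    labels = c(Conservative = 1, Labour = 2, `Liberal Democrat` = 3,
               `An independent candidate` = 13)
  )
)

bes_clean <- bes %>%
  mutate(
    # drop the value labels and keep age as a plain numeric vector
    ageW19 = as.numeric(zap_labels(ageW19)),
    # convert only the vote column to a factor, using its labels as levels
    generalElectionVoteW19 = as_factor(generalElectionVoteW19)
  )

str(bes_clean)
```

Because as_factor() is applied inside mutate(), only the named column is converted; every other labelled column is left as it was.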

how to make a plot to show start and end days

I have a df that looks like this. Sample data can be built with the following code:
df<-structure(list(ID = c(101, 101, 101, 101, 101, 101), AEDECOD = c("Diarrhoea",
"Vitreous floaters", "Musculoskeletal pain", "Diarrhoea", "Decreased appetite",
"Fatigue"), AESTDY = structure(c(101, 74, 65, 2, 33, 27), class = "difftime", units = "days"),
AEENDY = structure(c(105, 99, NA, 5, NA, NA), class = "difftime", units = "days")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
I would like to make a plot that looks like following:
Sorry for the blurry plot; this is the closest one I could find. Could someone give me some guidance on how to make such a plot?
Thanks.
With ggplot2, using Unicode's black "left pointer" and "right pointer" characters for the start and end arrows.
df %>%
ggplot(aes(y = AEDECOD, yend = AEDECOD, x = AESTDY, xend = AEENDY)) +
geom_point(aes(x = AESTDY), shape = "\u25BA") +
geom_point(aes(x = AEENDY), shape = "\u25C4") +
geom_segment()
This might get you started.
There are open questions about what to do with the NAs and how to interpret them. This approach simply ignores them, so events without an end day do not get a line.
For those events, the start day is indicated by a point instead.
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
df1 <-
df %>%
mutate(across(ends_with("DY"), ~ as.numeric(str_extract(.x, "\\d+"))))
ggplot(df1)+
geom_segment(aes(y = AEDECOD, yend = AEDECOD, x = AESTDY, xend = AEENDY))+
geom_point(data = filter(df1, is.na(AEENDY)), aes(y = AEDECOD, x = AESTDY))
#> Warning: Removed 3 rows containing missing values (geom_segment).
Created on 2021-04-12 by the reprex package (v2.0.0)
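As an alternative to extracting digits with str_extract(), the difftime columns can be converted directly with as.numeric(), which accepts a units argument for difftime input. This is a sketch on a three-row subset of the sample data; the name `df_num` is made up for illustration.

```r
library(dplyr)

# Three rows from the sample data in the question
df <- tibble(
  AEDECOD = c("Diarrhoea", "Vitreous floaters", "Musculoskeletal pain"),
  AESTDY  = as.difftime(c(101, 74, 65), units = "days"),
  AEENDY  = as.difftime(c(105, 99, NA), units = "days")
)

# Convert every column whose name ends in "DY" to a plain number of days
df_num <- df %>%
  mutate(across(ends_with("DY"), ~ as.numeric(.x, units = "days")))

df_num
```

Converting this way does not depend on how the values print, which a digit-matching pattern such as "\\d+" does.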

Formatting data in a table as percentage

I have a dataframe that looks like this:
Here's the code to create this DF:
structure(list(ethnicity = structure(c(1L, 2L, 3L, 5L), .Label = c("AS",
"BL", "HI", "Others", "WH", "Total"), class = "factor"), `Strongly agree` = c(30.7,
26.2, 37.4, 31.6), Agree = c(43.9, 34.5, 41, 45.4), `Neither agree nor disagree` = c(9.4,
14.3, 8.6, 8.7), Disagree = c(10, 15.5, 9.9, 9.7), `Strongly disagree` = c(6,
9.5, 3.2, 4.6)), row.names = c(NA, -4L), class = "data.frame")
I want to add data bars and display these numbers as percentages. I tried using the formattable library to do that (see my code below).
formattable(df,align=c("l","l","l","l","l","l"),
list(`ethnicity` = formatter("span", style = ~ style(color = "grey", font.weight = "bold"))
,area(col = 2:6) ~ function(x) percent(x / 100, digits = 0)
,area(col = 2:6) ~ color_bar("#DeF7E9")))
I'm facing 2 problems:
The numbers don't appear as a percentage in the table output.
The alignment seems off in the last column i.e
I would really appreciate it if someone could help me understand what I am missing here.
Here is a solution, but it requires a lot of typing. I guess it is possible to use mutate_at(), but I can't work out how to pass the column names in the percent() part. Using the . produces an error.
This works with a lot of typing:
library(dplyr)
library(formattable)
df %>%
mutate(`Strongly agree` = color_bar("#DeF7E9")(formattable::percent(`Strongly agree`/100))) %>%
mutate(`Agree` = color_bar("#DeF7E9")(formattable::percent(`Agree`/100))) %>%
mutate(`Disagree` = color_bar("#DeF7E9")(formattable::percent(`Disagree`/100))) %>%
mutate(`Neither agree nor disagree` = color_bar("#DeF7E9")(formattable::percent(`Neither agree nor disagree`/100))) %>%
mutate(`Strongly disagree` = color_bar("#DeF7E9")(formattable::percent(`Strongly disagree`/100))) %>%
formattable(.,
align=c("l","l","l","l","l","l"),
`ethnicity` = formatter("span", style = ~ style(color = "grey", font.weight = "bold")))
This does not work but might be improved:
df %>%
mutate_at(.vars = 2:6, .funs = color_bar("#DeF7E9")(formattable::percent(./100))) %>%
formattable(...)
Some more information about the "strange" structure var = color_bar(...)(var): color_bar(...) returns a formatter function, and that function is then called on the column.
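One likely reason the mutate_at() attempt fails is that .funs receives an already evaluated expression instead of a function. A possible fix, sketched here with dplyr's across() and a small helper, is to wrap the two steps in a function. The helper name `bar_pct` is made up, the data is cut down to two value columns, and this has not been checked against every formattable version.

```r
library(dplyr)
library(formattable)

# Two of the value columns from the question's data
df <- data.frame(
  ethnicity = c("AS", "BL", "HI", "WH"),
  `Strongly agree` = c(30.7, 26.2, 37.4, 31.6),
  Agree = c(43.9, 34.5, 41, 45.4),
  check.names = FALSE
)

# Hypothetical helper: format as a percentage, then wrap in a colour bar
bar_pct <- function(x) {
  color_bar("#DeF7E9")(formattable::percent(x / 100, digits = 0))
}

# Apply the helper to columns 2 and 3 in a single call
out <- df %>%
  mutate(across(2:3, bar_pct))

# Build the table; print `tbl` to render it
tbl <- formattable(out, align = "l")
```

For the full data the column selection would be 2:6, matching the mutate_at() attempt above.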

Adding a log scale on my graph is not working? 'Non-numeric argument to mathematical function'

I am using the package growthcurver to create a graph with a sigmoidal curve. It needs to have a logarithmic scale on the y axis.
This works to create a graph without a log scale:
# install.packages("growthcurver")
library("growthcurver")
gcfit <- SummarizeGrowth(curveA$time, curveA$biomass)
gcfit
plot(gcfit)
I have tried plot(gcfit, log=y) and plot(gcfit, log="curveA$biomass"). This gives me the error
'Non-numeric argument to mathematical function'.
Could it be that I am using a data frame? How do I get around this?
dput(curveA)
structure(list(time = c(1, 2, 3, 4, 5, 6, 7, 8, 17, 18, 19, 20,
21, 22, 23, 24, 25), biomass = c(0.153333333, 1.303333333, 2.836666667,
4.6, 6.21, 6.746666667, 7.283333333, 7.973333333, 8.663333333,
9.046666667, 10.19666667, 10.50333333, 11.04, 11.88333333, 11.96,
11.96, 9.966666667)), class = c("spec_tbl_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -17L), spec = structure(list(
cols = list(time = structure(list(), class = c("collector_number",
"collector")), biomass = structure(list(), class = c("collector_number",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
Without the data that you are using it is difficult to reproduce the problem (for how to produce a reproducible example, see: How to make a great R reproducible example), but the easiest solution to your problem might be to apply a log function directly to the data:
gcfit_log <- SummarizeGrowth(curveA$time, log10(curveA$biomass))
plot(gcfit_log)
The workaround is to extract the data from the model and plot using ggplot2 package:
library(ggplot2)
library(dplyr)
gcfit <- SummarizeGrowth(curveA$time, curveA$biomass)
points_data <- bind_cols(t = gcfit$data$t, n = gcfit$data$N)
line_fit <- bind_cols(x = max(gcfit$data$t) * (1 : 30) / 30,
y = NAtT(gcfit$vals$k, gcfit$vals$n0, gcfit$vals$r,
max(gcfit$data$t) * (1 : 30) / 30))
ggplot(data = points_data, aes(t, n + 1)) +
geom_point() +
geom_line() +
geom_line(data = line_fit, aes(x, y + 1), color = "red") +
scale_y_log10()
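On the original attempts: in base R graphics the log argument takes a character string naming the axes, so it is written log = "y", not log = y or a reference to a column. Whether growthcurver's plot method forwards that argument is not confirmed here, so the sketch below only plots the raw data from the question with base plot().

```r
# Data copied from the dput in the question
curveA <- data.frame(
  time = c(1, 2, 3, 4, 5, 6, 7, 8, 17, 18, 19, 20, 21, 22, 23, 24, 25),
  biomass = c(0.153333333, 1.303333333, 2.836666667, 4.6, 6.21,
              6.746666667, 7.283333333, 7.973333333, 8.663333333,
              9.046666667, 10.19666667, 10.50333333, 11.04, 11.88333333,
              11.96, 11.96, 9.966666667)
)

# log = "y" is a string; all y values must be positive for a log axis
plot(curveA$time, curveA$biomass,
     log = "y",
     xlab = "time", ylab = "biomass (log scale)")
```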

Resources