I'm still pretty new to R, I'm sorry if the question is a stupid one!
For some descriptives I created a barplot to visualize group differences in my sample. I have two groups of people - suicide attempters and non-attempters. They differ regarding their diagnoses and so far I have a plot showing me how many people per group I have with a certain diagnosis, but I'd like to have a bar representing those people per group who do not have this diagnosis.
So I'd have a bar representing the number of people with MDD in the attempters group, a bar for those without in the attempters group, a bar for those with MDD in the non-attempters and a bar for those without MDD in the non-attempters.
Regarding my data: everything is coded as 0 or 1, except for the attempter status.
My old data frame looks something like this:
code  MDD  Anxiety  PTBS  attempters
01    0    1        1     1
02    1    1        0     0
03    0    0        1     0
04    0    1        0     0
At first I changed my data from wide to long and recoded the grouping variable attempters to a factor:
# create data frame for attempters
data_attempters <- data_gesamt %>%
  pivot_longer(cols = c(MDD, Anxiety, PTBS),
               names_to = "predictors", values_to = "value") %>%
  filter(value == 1) %>%
  # convert "attempters" to factor
  mutate(attempter = as.factor(attempters)) %>%
  # rename factor levels
  mutate(attempter = recode_factor(attempter, "yes" = "0", "no" = "1")) %>%
  group_by(predictors, attempter) %>%
  summarize(severity = n(), .groups = "drop")
which got me a data frame as follows:
predictors  attempters  severity
MDD         0           1
Anxiety     1           1
Anxiety     0           2
PTBS        1           1
PTBS        0           1
and then used the following to plot:
plot_attempers <- data_attempters %>%
  ggplot(aes(x = attempter, y = severity,
             fill = attempter, group = attempter)) +
  geom_bar(stat = "identity",
           # position_dodge to avoid bars being stacked on each other
           position = position_dodge()) +
  scale_fill_manual(labels = labs, values = c("0" = "#999999", "1" = "#CC79A7")) +
  facet_grid(. ~ predictors) +
  scale_y_continuous(limits = c(0, 12), breaks = seq(0, 12, by = 1)) +
  theme(legend.position = "bottom",
        axis.text.x = element_blank()) +
  ylab("Frequency")
plot_attempers
Did I add something in the code where I converted the data which made me lose the data about those who do not have a certain diagnosis which is why it is not shown in my plot? Or what do I need to add to get the non-diagnosis-people in the plot as well? Because as I can see in the new data-frame, I did lose those who do not have a diagnosis ...
My plot looks like this so far (please ignore the diagnoses I did not mention in my explanation here. I did not include them in this post so it is a smaller sample as well):
I would want four bars per diagnosis (two per group, one of them representing people with the diagnosis and one representing the people without)
You have to summarise your data first. Here is a small example that simulates your data:
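library(dplyr)     # needed for %>%, group_by() and summarise()
library(ggplot2)
library(reshape2)
df <- data.frame(code = 1:100,
                 MDD = sample(0:1, 100, replace = TRUE, prob = c(0.3, 0.7)),
                 anxiety = sample(0:1, 100, replace = TRUE, prob = c(0.4, 0.6)),
                 PTBS = sample(0:1, 100, replace = TRUE),
                 attempters = sample(0:1, 100, replace = TRUE, prob = c(0.2, 0.8)))
# long format: one row per person and diagnosis
x <- reshape2::melt(df[, -1], id.vars = "attempters", variable.name = "diagnosis")
# count people with (sick) and without (healthy) each diagnosis per group
t <- x %>% group_by(diagnosis, attempters) %>%
  summarise(sick = sum(value == 1), healthy = sum(value == 0), .groups = "drop")
tt <- reshape2::melt(as.data.frame(t), id.vars = c("diagnosis", "attempters"))
# only the grouping variable becomes a factor; the counts in value stay numeric
tt$attempters <- factor(tt$attempters)
ggplot(tt, aes(x = attempters, y = value)) +
  geom_bar(stat = "identity", aes(fill = variable), position = "dodge") +
  facet_wrap(~diagnosis) +
  scale_fill_manual(values = c(sick = "#CC79A7", healthy = "#999999"))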
and this is the resulting plot
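The rows for people without a diagnosis disappear at the `filter(value == 1)` step in the question's code, so one option is to count both values instead of filtering. A minimal base R sketch, using the four example rows from the question; the names `diagnoses`, `long` and `counts` are made up for this illustration:

```r
# Example rows copied from the question
data_gesamt <- data.frame(
  code       = c("01", "02", "03", "04"),
  MDD        = c(0, 1, 0, 0),
  Anxiety    = c(1, 1, 0, 1),
  PTBS       = c(1, 0, 1, 0),
  attempters = c(1, 0, 0, 0)
)

diagnoses <- c("MDD", "Anxiety", "PTBS")

# Wide to long: stack() puts the diagnosis columns under each other
long <- stack(data_gesamt[diagnoses])
names(long) <- c("value", "predictors")
long$attempters <- rep(data_gesamt$attempters, times = length(diagnoses))

# Count every combination, including value == 0 and empty cells
counts <- as.data.frame(
  table(predictors = long$predictors,
        attempters = long$attempters,
        value      = long$value),
  responseName = "n"
)
counts
```

`counts` has one row per diagnosis, group and value, so something like `ggplot(counts, aes(x = attempters, y = n, fill = value))` with a dodged `geom_col()` and `facet_grid(. ~ predictors)` should give four bars per diagnosis.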
I have a data frame in which each variable is rated on a scale from 1 to 5; the table below shows the total count of each value per variable.
df<-data.frame(
speed=c(2,3,3,2,2),
race=c(5,5,4,5,5),
cake=c(5,5,5,4,4),
lama=c(2,1,1,1,2))
library(data.table)
dcast(melt(df), variable~value)
# variable 1 2 3 4 5
#1 speed 0 3 2 0 0
#2 race 0 0 0 1 4
#3 cake 0 0 0 2 3
#4 lama 3 2 0 0 0
I want to make a stacked bar chart with the scale values 1-5 on the x-axis, one bar per variable in the first column (speed, race, cake, lama), and the mean of each variable shown as well.
I tried the solution from Stacked Bar Plot in R, but it is not what I am looking for.
I had to try a few things and use some workarounds to get something very close to what you are looking for (assuming I understood the problem correctly):
library(dplyr)
library(ggplot2)
library(tidyr)
df<-data.frame(
speed=c(2,3,3,2,2),
race=c(5,5,4,5,5),
cake=c(5,5,5,4,4),
lama=c(2,1,1,1,2))
# get the data in the right shape for ggplot2
dfp <- df %>%
  # a column that identifies the rows uniquely is needed ("name of data row")
  dplyr::mutate(ID = as.factor(dplyr::row_number())) %>%
  # the data has to be reshaped into "tidy" (long) format (similar to excel pivot)
  tidyr::pivot_longer(-ID) %>%
  # order by name and ID
  dplyr::arrange(name, ID) %>%
  # group by name
  dplyr::group_by(name) %>%
  # calculate percentage and cumsum to be able to calculate label position (p2)
  dplyr::mutate(p = value/sum(value),
                c = cumsum(p),
                p2 = c - p/2,
                # the groups or x-axis values have to be recoded to numeric type
                name = recode(name, "cake" = 1, "lama" = 2, "race" = 3, "speed" = 4))
# calculate the mean value per group (or label) as you want them in the plot
sec_labels <- dfp %>%
  dplyr::summarise(m = mean(value)) %>%
  pull(m)
dfp %>%
  # build the base plot, filling by ID (the original data row)
  ggplot2::ggplot(aes(x = name, y = value, fill = ID)) +
  # make it a stacked bar chart of percentages; reverse = TRUE stacks ID 1
  # first, so the segments line up with the label positions in p2
  ggplot2::geom_bar(stat = "identity", position = position_fill(reverse = TRUE)) +
  # recode the x axis labels and add a secondary x axis with the mean labels
  ggplot2::scale_x_continuous(breaks = 1:4,
                              labels = c("cake", "lama", "race", "speed"),
                              sec.axis = sec_axis(~.,
                                                  breaks = 1:4,
                                                  labels = sec_labels)) +
  # flip the chart onto its side
  ggplot2::coord_flip() +
  # scale the y axis (horizontal after flipping) to percent
  ggplot2::scale_y_continuous(labels = scales::percent) +
  # add a layer with labels positioned according to p2
  ggplot2::geom_text(aes(label = value,
                         y = p2)) +
  # put a name to the plot
  ggplot2::ggtitle("meaningful plot name") +
  # put the legend on top
  ggplot2::theme(legend.position = "top")
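The label positions `p2` in the code above are the midpoints of each stacked segment. As a small worked example for the `cake` column only (a sketch; `cum` is a name chosen here in place of `c`):

```r
cake <- c(5, 5, 5, 4, 4)

p   <- cake / sum(cake)   # share of each segment in the bar
cum <- cumsum(p)          # upper edge of each segment
p2  <- cum - p / 2        # midpoint = upper edge minus half the segment

round(data.frame(value = cake, p, cum, p2), 3)
```

The midpoints only match the drawn segments if the bars are stacked in the same order as the cumulative sum, that is, starting with row 1 at the bottom.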
Say the categorical variables are,
Do_you_smoke -> Yes/ No
Do_you_drink -> Yes/No
Do_you_exercise -> Yes/No
All three categorical variables (Do_you_smoke, Do_you_drink, Do_you_exercise) have two categories, Yes or No. Now I want to visualize all of these categorical variables against one continuous variable, say "income", at once. How do I do this in R?
It's always better to include a reproducible example of your data so that we can ensure any possible solutions work with your own data structure. However, from your description we should be able to recreate an example data set like this:
set.seed(69)
df <- data.frame(income = runif(1000, 10000, 100000))
df$smoke <- c("Yes", "No")[1 + rbinom(1000, 1, df$income/200000)]
df$drink <- sample(c("Yes", "No"), 1000, TRUE)
df$exercise <- c("No", "Yes")[1 + rbinom(1000, 1, df$income/100000)]
So our data frame contains four columns: the income amount and either a "Yes" or a "No" for each of your three variables:
head(df)
#> income smoke drink exercise
#> 1 57767.86 Yes No Yes
#> 2 79192.70 Yes Yes Yes
#> 3 68132.37 No No No
#> 4 87873.44 Yes No No
#> 5 43199.45 Yes Yes No
#> 6 88188.83 No Yes Yes
To plot this, we need to reshape the data. Since the incomes are all different, we can't get a percentage at each individual income level, so we will have to cut the income into bins. Let's use $10,000 bins. We then need to get the proportion of "Yes" for each variable in each income band. Finally, we want to put our data into long format, so that each proportion in each bin has its own row, labelled according to which of the three categorical variables it represents. We can then plot using ggplot.
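To see the binning and proportion steps in isolation, here is a minimal base R sketch on five made-up incomes (the values and the names `bracket` and `prop_yes` are invented for illustration):

```r
income <- c(12000, 18000, 25000, 27000, 29000)
smoke  <- c("Yes", "No", "Yes", "Yes", "No")

# cut() assigns each income to a half-open bin: (10000, 20000] or (20000, 30000]
bracket <- cut(income, breaks = c(10000, 20000, 30000))

# The mean of a logical vector is the proportion of TRUE values
prop_yes <- tapply(smoke == "Yes", bracket, mean)
prop_yes
```

The first bin holds one "Yes" out of two people and the second holds two out of three, which is the quantity computed per bracket for each of the three variables.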
We need to load a few libraries to help us:
library(dplyr)
library(ggplot2)
library(scales)
library(tidyr)
And now our code looks like this:
df %>%
  mutate(income_bracket = cut(income, breaks = 1:10 * 10000)) %>%
  group_by(income_bracket) %>%
  summarise(exercise = length(which(exercise == "Yes"))/n(),
            smoke = length(which(smoke == "Yes"))/n(),
            drink = length(which(drink == "Yes"))/n()) %>%
  mutate(income = paste(dollar(1:9 * 10000),
                        dollar(2:10 * 10000), sep = " -\n")) %>%
  select(-income_bracket) %>%
  pivot_longer(1:3) %>%
  ggplot(aes(x = income, y = value, group = name, colour = name)) +
  geom_line(size = 1.3) +
  geom_point(size = 3) +
  scale_y_continuous(labels = percent, limits = c(0, 1)) +
  labs(title = "Percentage of activities by income",
       y = "Percent", x = "Income bracket", color = "Do you...")
For data called df that reads:
car suv pickup
1 2 1
2 3 4
4 1 2
5 4 2
3 1 1
total = apply(df,1,sum)
barplot(total,col= rainbow(5))
So what I have done so far is plot a barplot of the total number of vehicles, which is the sum of each row. What I want now is to present it as a stacked barplot of that sum.
For now, it just shows "total", without any lines indicating that, for example, 1 car, 2 suv and 1 pickup add up to a "total" of 4.
Note: this is different from barplot(matrix(df)), because that just divides the data by car, suv and pickup and disregards the total.
You can achieve this easily using ggplot2 and reshape2.
You will need an ID column to track the rows, so I have added that in. I melt the data to long type so that the different groups can be managed and plotted accordingly.
Then plot using geom_bar, specifying the row ids as the x axis and the groupings (fill and colour) for the stack plot and legend.
library(dplyr)     # provides the %>% pipe used below
library(reshape2)
library(ggplot2)
df <- data.frame("ID" = c(1,2,3,4,5), "car" = c(1,2,4,5,3), "suv" = c(2,3,1,4,1), "pickup" = c(1, 4, 2, 2, 1))
long_df <- df %>% melt(id.vars = c("ID"), value.name = "Number", variable.name = "Type")
ggplot(data = long_df, aes(x = ID, y = Number)) +
  geom_bar(aes(fill = Type, colour = Type),
           stat = "identity",
           position = "stack")
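With base R graphics (barplot):
wide <- dcast(melt(df, id.vars = "ID", value.name = "Number", variable.name = "Type"),
              Type ~ ID, value.var = "Number")
# barplot() needs a numeric matrix, so move Type into the row names
m <- as.matrix(wide[, -1])
rownames(m) <- as.character(wide$Type)
barplot(m, legend.text = TRUE)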
Are you after something like this?
library(tidyverse)
df %>%
  rowid_to_column("row") %>%
  gather(k, v, -row) %>%
  ggplot(aes(row, v, fill = k)) +
  geom_col()
We use a stacked barplot here, so there is no need to manually calculate the sum. The key here is to transform data from wide to long and keep track of the row.
Sample data
df <- read.table(text =
"car suv pickup
1 2 1
2 3 4
4 1 2
5 4 2
3 1 1", header = T)
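As a quick check that stacking reproduces the totals from the question, the segment values in long format can be summed back per row. A base R sketch on the sample data above (the names `long`, `row` and `stack_height` are chosen here):

```r
df <- data.frame(car    = c(1, 2, 4, 5, 3),
                 suv    = c(2, 3, 1, 4, 1),
                 pickup = c(1, 4, 2, 2, 1))

# Long format: one row per (original row, vehicle type)
long <- stack(df)                                   # columns: values, ind
long$row <- rep(seq_len(nrow(df)), times = ncol(df))

# Height of each stacked bar = sum of its segments
stack_height <- tapply(long$values, long$row, sum)

stack_height
rowSums(df)   # same totals as apply(df, 1, sum) in the question
```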
I want to plot the rolling mean of data of different time series with ggplot2. My data have the following structure:
library(dplyr)
library(ggplot2)
library(zoo)
library(tidyr)
df <- data.frame(episode=seq(1:1000),
t_0 = runif(1000),
t_1 = 1 + runif(1000),
t_2 = 2 + runif(1000))
df.tidy <- gather(df, "time", "value", -episode) %>%
separate("time", c("t", "time"), sep = "_") %>%
subset(select = -t)
> head(df.tidy)
# episode time value
#1 1 0 0.7466480
#2 2 0 0.7238865
#3 3 0 0.9024454
#4 4 0 0.7274303
#5 5 0 0.1932375
#6 6 0 0.1826925
Now, the code below creates a plot where the lines for time = 1 and time = 2 towards the beginning of the episodes do not represent the data because value is filled with NAs and the first numeric entry in value is for time = 0.
ggplot(df.tidy, aes(x = episode, y = value, col = time)) +
geom_point(alpha = 0.2) +
geom_line(aes(y = rollmean(value, 10, align = "right", fill = NA)))
How do I have to adapt my code such that the rolling-mean lines are representative of my data?
Your issue is you are applying a moving average over the whole column, which makes data "leak" from one value of time to another.
You could group_by first to apply the rollmean to each time separately:
ggplot(df.tidy, aes(x = episode, y = value, col = time)) +
  geom_point(alpha = 0.2) +
  geom_line(data = df.tidy %>%
              group_by(time) %>%
              mutate(value = rollmean(value, 10, align = "right", fill = NA)))
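The effect of grouping can be seen on a tiny example. This sketch uses base R only, with a hypothetical helper `roll_mean_right()` standing in for `rollmean(..., align = "right", fill = NA)`; the data frame `toy` is made up:

```r
# Right-aligned moving average; the first k - 1 entries are NA
roll_mean_right <- function(x, k) {
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))
}

toy <- data.frame(time  = c("0", "0", "0", "1", "1", "1"),
                  value = c(1, 2, 3, 11, 12, 13))

# Over the whole column: the first value of time "1" mixes in time "0"
toy$leaky <- roll_mean_right(toy$value, 2)

# Per group: each series starts with its own NA
toy$grouped <- ave(toy$value, toy$time,
                   FUN = function(v) roll_mean_right(v, 2))
toy
```

In the `leaky` column the fourth entry averages 3 and 11 across the two series, whereas in the `grouped` column it is NA, as it should be for the first point of a new series.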
I have a table in R looking like this. Columns are male and female. Rows are 4 variables with both a no & yes. The values are actually the proportions. So in column 1 the sum of value in row 1 and 2 sums up to 1, because this is the sum of proportions yes & no for variable 1.
propvars
prop_sum_male prop_sum_female
1_no 0.90123457 0.96296296
1_yes 0.09876543 0.03703704
2_no 0.88750000 0.96296296
2_yes 0.11250000 0.03703704
3_no 0.88750000 1.00000000
3_yes 0.11250000 0.00000000
4_no 0.44444444 0.40740741
4_yes 0.55555556 0.59259259
I want to create a stacked barplot for those 4 variables.
I used
barplot(propvars)
which gives me this:
As you can see, the distinction between male and female is correct, but it stacks all variables on top of each other. I need 4 different bars next to each other for the 4 variables, with each bar showing yes/no stacked on top of each other. So the y-axis should go from 0 to 1 instead of from 0 to 4 as it does now.
Any hints on how to do this?
This may be helpful. I arranged your data in order to draw a graph. I added row name as a column. Then, I changed the data to a long-format data.
DATA & CODE
mydf <- structure(list(prop_sum_male = c(0.90123457, 0.09876543, 0.8875,
0.1125, 0.8875, 0.1125, 0.44444444, 0.55555556), prop_sum_female = c(0.96296296,
0.03703704, 0.96296296, 0.03703704, 1, 0, 0.40740741, 0.59259259
)), .Names = c("prop_sum_male", "prop_sum_female"), class = "data.frame", row.names = c("1_no",
"1_yes", "2_no", "2_yes", "3_no", "3_yes", "4_no", "4_yes"))
library(qdap)
library(dplyr)
library(tidyr)
library(ggplot2)
mydf$category <- rownames(mydf)
df <- mydf %>%
  gather(Gender, Proportion, - category) %>%
  mutate(Gender = char2end(Gender, "_", 2)) %>%
  separate(category, c("category", "Response"))
ggplot(data = df, aes(x = category, y = Proportion, fill = Response)) +
  geom_bar(stat = "identity", position = "stack") +
  facet_grid(. ~ Gender)
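If base graphics are enough, `barplot()` stacks the rows of a matrix within each column, so reshaping one gender's column into a 2 x 4 matrix gives one stacked bar per variable. A sketch for the male column only, with values copied from the question (`male` and the labels `var1` to `var4` are names chosen here):

```r
prop_sum_male <- c(0.90123457, 0.09876543, 0.8875, 0.1125,
                   0.8875, 0.1125, 0.44444444, 0.55555556)

# Fill column-wise: each column is one variable, rows are no / yes
male <- matrix(prop_sum_male, nrow = 2,
               dimnames = list(c("no", "yes"), paste0("var", 1:4)))

colSums(male)   # each bar adds up to 1

barplot(male, legend.text = TRUE, ylim = c(0, 1), main = "male")
```

Repeating the same steps for the female column (for example after `par(mfrow = c(1, 2))`) would give a side-by-side layout similar to the faceted ggplot version.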