I have around 20 variables which are coming from 4 different sources. I want to visualize for each variable how the data across sources varies using ggplot.
I was thinking a line chart would be a good option. The x-axis could hold the individual responses, and 4 lines, one per source, would show how the data changes across the sources. I can use region as a split variable to visualize by region.
My data looks something like the following (only 2 variables are shown for simplicity):
library(data.table)
set.seed(1200)
ID <- seq(1001,1100)
region <- sample(1:10,100,replace = T)
Var1_source1 <- sample(1:100,100,replace = T)
Var1_source2 <- sample(1:100,100,replace = T)
Var1_source3 <- sample(1:100,100,replace = T)
Var1_source4 <- sample(1:100,100,replace = T)
Var2_source1 <- sample(1:100,100,replace = T)
Var2_source2 <- sample(1:100,100,replace = T)
Var2_source3 <- sample(1:100,100,replace = T)
Var2_source4 <- sample(1:100,100,replace = T)
df1 <- as.data.table(data.frame(ID,
region,
Var1_source1,
Var1_source2,
Var1_source3,
Var1_source4,
Var2_source1,
Var2_source2,
Var2_source3,
Var2_source4))
I feel this is a unique requirement, as I do not have anything specific to plot on my x-axis.
I am not entirely sure what you are hoping the plot will look like from your description, but the first part of any ggplot is getting the data into a long format.
library(tidyverse)
df2 <- gather(df1, group, value, - c(ID, region)) %>%
separate(group, c("Var", "Source"))
head(df2)
ID region Var Source value
1 1001 2 Var1 source1 92
2 1002 4 Var1 source1 44
3 1003 5 Var1 source1 15
4 1004 6 Var1 source1 42
5 1005 5 Var1 source1 39
6 1006 6 Var1 source1 48
We now have a column which we can use within ggplot. I am not entirely sure what you want plotted, but this is an example:
ggplot(df2, aes(x = region, y = value, colour = Source)) +
# fun.y was renamed to fun in ggplot2 3.3.0
stat_summary(fun = mean, geom = "line")
Or we can use a facet to split between the two variables:
ggplot(df2, aes(x = region, y = value, colour = Source)) +
# fun.y was renamed to fun in ggplot2 3.3.0
stat_summary(fun = mean, geom = "line") +
facet_grid(Var~.)
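As a side note, tidyr 1.0.0 and later offer pivot_longer(), which can do the gather() and separate() steps in one call. The following is only a minimal sketch, assuming the column names follow the Var_source pattern from the question; the small data frame `wide` and the result name `long` are made up for illustration:

```r
library(tidyr)

# hypothetical miniature of df1: value columns are named <Var>_<source>
wide <- data.frame(
  ID = 1001:1003,
  region = c(1, 2, 1),
  Var1_source1 = c(10, 20, 30),
  Var1_source2 = c(11, 21, 31),
  Var2_source1 = c(50, 60, 70),
  Var2_source2 = c(51, 61, 71)
)

# split each column name at "_" into Var and Source while reshaping to long
long <- pivot_longer(
  wide,
  cols = -c(ID, region),
  names_to = c("Var", "Source"),
  names_sep = "_",
  values_to = "value"
)
```

The resulting `long` has the same ID, region, Var, Source and value columns as df2, so the ggplot calls above should work on it unchanged.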
Related
I have a data frame whose variables are rated on a 1-5 scale. The table below shows how often each value occurs per variable.
df<-data.frame(
speed=c(2,3,3,2,2),
race=c(5,5,4,5,5),
cake=c(5,5,5,4,4),
lama=c(2,1,1,1,2))
library(data.table)
dcast(melt(df), variable~value)
# variable 1 2 3 4 5
#1 speed 0 3 2 0 0
#2 race 0 0 0 1 4
#3 cake 0 0 0 2 3
#4 lama 3 2 0 0 0
I want to draw a stacked bar chart with the mean and the scale values 1-5 on the x axis, one bar for each variable in the first column (speed, race, cake, lama).
I tried the solution from Stacked Bar Plot in R, but that is not what I am looking for.
I had to try a few things and use some workarounds to get something very close to what you are looking for (given that I understood the problem correctly):
library(dplyr)
library(ggplot2)
library(tidyr)
df<-data.frame(
speed=c(2,3,3,2,2),
race=c(5,5,4,5,5),
cake=c(5,5,5,4,4),
lama=c(2,1,1,1,2))
# get the data in right shape for ggplot2
dfp <- df %>%
# a column that identifies the rows uniquely is needed ("name of data row")
dplyr::mutate(ID = as.factor(dplyr::row_number())) %>%
# the data has to be reshaped into "tidy" format (similar to an Excel pivot)
tidyr::pivot_longer(-ID) %>%
# order by name and ID
dplyr::arrange(name, ID) %>%
# group by name
dplyr::group_by(name) %>%
# calculate percentage and cumsum to be able to calculate label position (p2)
dplyr::mutate(p = value/sum(value),
c= cumsum(p),
p2 = c - p/2,
# the groups or x-axis values have to be recoded to numeric type
name = recode(name, "cake" = 1, "lama" = 2, "race" = 3, "speed" = 4))
# calculate the mean value per group (or label) as you want them in the plot
sec_labels <- dfp %>%
dplyr::summarise(m = mean(value)) %>%
pull(m)
dfp %>%
# building base plot, telling to fill by the new name variable
ggplot2::ggplot(aes(x = name, y = value, fill = ID)) +
# make it a stacked bar chart scaled to proportions
ggplot2::geom_bar(stat = "identity", position = "fill") +
# recode the x axis labels and add a secondary x axis with the labels
ggplot2::scale_x_continuous(breaks = 1:4,
labels = c("cake", "lama","race", "speed"),
sec.axis = sec_axis(~.,
breaks = 1:4,
labels = sec_labels)) +
# flip the chart on its side
ggplot2::coord_flip() +
# scale the y axis (now after flipping x axis) to percent
ggplot2::scale_y_continuous(labels=scales::percent) +
# add a layer with labels positioned according to p2
ggplot2::geom_text(aes(label = value,
y=p2)) +
# put a name to the plot
ggplot2::ggtitle("meaningfull plot name") +
# put the legend on top
ggplot2::theme(legend.position = "top")
For data called df that reads:
car suv pickup
1 2 1
2 3 4
4 1 2
5 4 2
3 1 1
total = apply(df,1,sum)
barplot(total,col= rainbow(5))
So what I have done so far is plot a barplot of the total number of vehicles, which is the sum of each row. What I want to do now is present it as a stacked barplot of that sum.
For now, it just shows "total" without any lines indicating that, for example, 1 car, 2 suv and 1 pickup add up to a "total" of 4.
Note: it is different from barplot(matrix(df)), because that just divides it by car, suv, pickup and disregards the total number.
You can achieve this easily using ggplot2 and reshape2.
You will need an ID column to track the rows, so I have added that in. I melt the data to long type so that the different groups can be managed and plotted accordingly.
Then plot using geom_bar, specifying the row ids as the x axis and the groupings (fill and colour) for the stack plot and legend.
library(reshape2)
library(ggplot2)
df <- data.frame("ID" = c(1,2,3,4,5), "car" = c(1,2,4,5,3), "suv" = c(2,3,1,4,1), "pickup" = c(1, 4, 2, 2, 1))
# melt() is called directly because %>% is not provided by reshape2 or ggplot2
long_df <- melt(df, id.vars = "ID", value.name = "Number", variable.name = "Type")
ggplot(data = long_df, aes(x = ID, y = Number)) +
geom_bar(aes(fill = Type, colour = Type),
stat = "identity",
position = "stack")
With base R
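# barplot() needs a numeric matrix: one row per Type, one column per ID
counts <- t(as.matrix(df[, c("car", "suv", "pickup")]))
# the rows are stacked within each bar; the row names feed the legend
barplot(counts, names.arg = df$ID, legend.text = TRUE)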
Are you after something like this?
library(tidyverse)
df %>%
rowid_to_column("row") %>%
gather(k, v, -row) %>%
ggplot(aes(row, v, fill = k)) +
geom_col()
We use a stacked barplot here, so there is no need to manually calculate the sum. The key here is to transform data from wide to long and keep track of the row.
Sample data
df <- read.table(text =
"car suv pickup
1 2 1
2 3 4
4 1 2
5 4 2
3 1 1", header = T)
#data
set.seed(1)
data_foo <- data.frame(id = rep(LETTERS[1:4], times = 2), group_measure = c(rep('a_c',4),rep('b_c',4), c(rep('a_d',4),rep('b_d',4))),
value = sample(1:5, size = 16, replace = TRUE))
I would like to plot the 'a' subgroups on the x axis against the 'b' subgroups on the y axis, and one plot for each measure.
Like this:
require(tidyr)
require(ggplot2)
require(patchwork)
data_foo_long <- data_foo %>% spread( group_measure, value)
p1 <- ggplot(data_foo_long, aes(x = a_c, y = b_c)) +
geom_point()
p2 <- ggplot(data_foo_long, aes(x = a_d, y = b_d)) +
geom_point()
p1 + p2
I don't see a way with faceting (?).
But I have the impression that there must be a better, more ggplot-like way of plotting the outcomes of two subgroups within a group against one another when I have them in a long format. Needless to say - there are more measures than those two.
P.S. if someone has a suggestion for a better title of this question, please feel free to comment!
Here is one way. How well it works with "more measures" I will leave to you to decide.
Use tidyr::separate to split the group_measure into a prefix and a suffix, then spread on the prefix:
library(tidyverse)
data_foo %>%
separate(group_measure,
into = c("prefix", "suffix"),
sep = "_") %>%
spread(prefix, value)
id suffix a b
1 A c 2 2
2 A d 4 4
3 B c 2 5
4 B d 1 2
5 C c 3 5
6 C d 2 4
7 D c 5 4
8 D d 1 3
Now you can plot a versus b, faceted by suffix:
data_foo %>%
separate(group_measure,
into = c("prefix", "suffix"),
sep = "_") %>%
spread(prefix, value) %>%
ggplot(aes(a, b)) +
geom_point() +
facet_wrap(~suffix)
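In current tidyr, spread() is superseded by pivot_wider(). A sketch of the same reshaping step with pivot_wider(), assuming the prefix_suffix naming from the question; data_foo is rebuilt here with every vector at full length so the block runs on its own, and the result name `wide` is made up:

```r
library(tidyr)

set.seed(1)
data_foo <- data.frame(
  id = rep(LETTERS[1:4], times = 4),
  group_measure = rep(c("a_c", "b_c", "a_d", "b_d"), each = 4),
  value = sample(1:5, size = 16, replace = TRUE)
)

# one row per id and suffix, with the a and b prefixes as separate columns
wide <- data_foo %>%
  separate(group_measure, into = c("prefix", "suffix"), sep = "_") %>%
  pivot_wider(names_from = prefix, values_from = value)
```

Additional measures only add more suffix values, so the same facet_wrap(~suffix) call should cover them without further changes.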
I am working with categorical longitudinal data. My data has 3 simple variables such as :
id variable value
1 1 1 c
2 1 2 b
3 1 3 c
4 1 4 c
5 1 5 c
...
Here variable is basically time, and value is one of the 3 possible categories an id can take.
I am interested in producing a "parallel" longitudinal graph, similar to this, with ggplot2.
I am struggling a bit to get it right. What I have come up with so far is this:
dt0 %>% ggplot(aes(variable, value, group = id, colour = id)) +
geom_line(colour="grey70") +
geom_point(aes(colour=value, size = nn), size=4) +
scale_colour_brewer(palette="Set1") + theme_minimal()
The issue with this graph is that we can't really see the "thickness" of the "transition" (the id lines).
I wondered if you could help me with the following:
a) making the id lines visible, or making them "thicker" according to the number of ids going from one state to the other
b) resizing the points according to the number of ids in each state. I tried to do it with geom_point(aes(colour=value, size = nn), size=4) but it doesn't seem to work.
Thanks.
# data #
library(dplyr)
library(ggplot2)
library(reshape2) # provides melt(), used below
set.seed(10)
# generate random sequences #
dt = as.data.frame( cbind(id = 1:1000, replicate(5, sample( c('a', 'b', 'c'), prob = c(0.1,0.2,0.7), 1000, replace = T)) ) )
# transform to PP file #
dt = dt %>% melt(id.vars = c('id'))
# create a vector 1-0 if the activity was performed #
dt0 = dt %>% group_by(id) %>% mutate(variable = 1:n()) %>% arrange(id)
# create the number of people in that state #
dt0 = dt0 %>% count(id, variable, value)
dt0 = dt0 %>% group_by(variable, value, n) %>% mutate(nn = n())
# to produce the first graph #
library(vcrpart)
otsplot(dt0$variable, factor(dt0$value), dt0$id)
You were so close with geom_point(aes(colour=value, size = nn), size=4). The problem was that you redefined size after defining it in aes(), so ggplot overwrote the variable reference with the constant 4. Assuming you want to use nn to scale line thickness as well, you could tweak your code to this:
dt0 %>% ggplot(aes(variable, value, group = id, colour = id)) +
geom_line(colour="grey70", aes(size = nn)) +
geom_point(aes(colour=value, size = nn)) +
scale_colour_brewer(palette="Set1") + theme_minimal()
If you wanted to use a lag value for the line thickness, I would suggest adding that as a new column in dt0.
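One possible way to build such a column is to pair each state with the following one per id and count how many ids share each transition. This is only a sketch on made-up data: `toy`, `transitions`, `next_variable`, `next_value` and `n_ids` are hypothetical names, and it uses lead() rather than lag(), which describes the same pairs from the other end.

```r
library(dplyr)
library(ggplot2)

# hypothetical toy data in the same shape as dt0: one row per id and time point
toy <- data.frame(
  id = rep(1:4, each = 3),
  variable = rep(1:3, times = 4),
  value = c("a", "b", "c",
            "a", "b", "c",
            "a", "c", "c",
            "b", "b", "c")
)

# pair every state with the next state of the same id,
# then count the ids that follow each transition
transitions <- toy %>%
  arrange(id, variable) %>%
  group_by(id) %>%
  mutate(next_variable = lead(variable), next_value = lead(value)) %>%
  ungroup() %>%
  filter(!is.na(next_value)) %>%
  count(variable, value, next_variable, next_value, name = "n_ids")

# one segment per transition, thicker when more ids follow it
# (size is used as in the answer above; ggplot2 3.4.0+ prefers linewidth for lines)
p <- ggplot(transitions) +
  geom_segment(aes(x = variable, y = value,
                   xend = next_variable, yend = next_value,
                   size = n_ids),
               colour = "grey70") +
  theme_minimal()
```

Drawing one segment per transition instead of one line per id also avoids overplotting many identical lines.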
I am plotting a simple panel of data with ggplot2. Observations from the same individual (region) are from two different waves, and I want to plot my graph ordering individuals by the value of only one of the waves. However, ggplot by default orders by the mean value of both waves. Here's a basic sample of the data.
data <- read.table(text = "
ID Country time Theil0
1 AT1 2004 0.10358155
2 AT2 2004 0.08181044
3 AT3 2004 0.08238252
4 BE1 2004 0.14754138
5 BE2 2004 0.07205898
6 BE3 2004 0.09522730
7 AT1 2010 0.10901556
8 AT2 2010 0.09593889
9 AT3 2010 0.07579683
10 BE1 2010 0.16500438
11 BE2 2010 0.08313131
12 BE3 2010 0.10281853
", sep = "", header = TRUE)
And here's the code for the plot:
library(ggplot2)
pd <- position_dodge(0.4)
ggplot(data, aes(x=reorder(Country, Theil0), y=Theil0, colour = as.factor(time))) +
geom_point(size=3, position = pd)+
xlab("Region") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
ylab("Index") +
ggtitle("2004 and 2010")
And the resulting plot:
As you can see, ordering by the values of 2010 only (and not the average of both years) would make the BE2 and AT3 observations switch order, which is what I would prefer in the graph. Thank you for any help on this.
I created a reproducible example that uses generic xs and ys. Basically, you need to use the ordered function on your factor:
x <- letters[1:4]
y1 <- 1:4
y2 <- c(1, 4, 2, 5) + 1
library(ggplot2)
library(reshape2) # used to melt the dummy dataset
df <- data.frame(x = x, y1 = y1, y2 = y2)
df2 <- melt(df, id.vars = "x", variable.name = "Group", value.name = "y")
df2$Group <- factor(df2$Group)
gg1 <- ggplot(data = df2, aes( x = x, y = y, color = Group)) +
geom_point()
ggsave("eample1.jpg", gg1, width = 3, height = 3)
Gives a plot similar to what you had:
However, x may be reordered:
df2$x2 <- ordered(df2$x, x[order(y2)])
gg2 <- ggplot(data = df2, aes( x = x2, y = y, color = Group)) +
geom_point()
ggsave("eample2.jpg", gg2, width = 3, height = 3)
which gives this figure:
Also, I get tripped up on this a lot. I find adjusting levels in ggplot2 to be tricky at times.
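Applied to the data from the question, the same idea might look like the sketch below. It assumes the goal is to order regions by their 2010 values only; `wave2010` and `region_order` are names made up for this example, and the data frame is retyped with just the columns needed.

```r
# the question's data, trimmed to the columns that matter here
data <- data.frame(
  Country = rep(c("AT1", "AT2", "AT3", "BE1", "BE2", "BE3"), times = 2),
  time = rep(c(2004, 2010), each = 6),
  Theil0 = c(0.10358155, 0.08181044, 0.08238252,
             0.14754138, 0.07205898, 0.09522730,
             0.10901556, 0.09593889, 0.07579683,
             0.16500438, 0.08313131, 0.10281853)
)

# rank the regions using the 2010 wave only
wave2010 <- data[data$time == 2010, ]
region_order <- wave2010$Country[order(wave2010$Theil0)]

# fix the factor levels so ggplot keeps this order on the x axis
data$Country <- factor(data$Country, levels = region_order)
```

With the levels set this way, aes(x = Country, ...) can replace aes(x = reorder(Country, Theil0), ...) in the original plot call, since reorder() would otherwise sort by the mean over both waves.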