It's quite a basic question, I think, but I can't figure out how to do it in a few elegant steps. I have this dataset:
df <- data.frame(A=c(1,2,2,3,4,5,1,1,2,3),
B=c(4,4,2,3,4,2,1,5,2,2),
C=c(3,3,3,3,4,2,5,1,2,3),
D=c(1,2,5,5,5,4,5,5,2,3),
E=c(1,4,2,3,4,2,5,1,2,3),
dummy1=c("yes","yes","no","no","no","no","yes","no","yes","yes"),
dummy2=c("high","low","low","low","high","high","high","low","low","high"))
df1 <- data.frame(lapply(df, factor))
And I would like to make a grouped barplot in ggplot that only considers the "1" (basically plotting the frequencies of the 1s) by dummy. On the X-axis (the groups) there should be the columns (A,B,C,D,E), on the Y-axis the % of "1" in the respective column and the colors of the bars reflect the dummy considered.
This is basically what I want:
I know how to do this by creating each time a single dataframe for each dummy level and then plot them, but I'm sure there's a better and more efficient way to do this.
Thanks in advance for any suggestion!
Something like the following could work:
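library(data.table)
library(magrittr)  # provides %>%, which data.table does not export
setDT(df1)
# Share of 1s per column and dummy1 level; the denominator is the total row count
# (use mean(x == 1) instead for the share within each dummy1 level)
graph_data <- df1[, lapply(.SD, function(x) sum(x == 1) / nrow(df1)),
                  by = "dummy1",
                  .SDcols = c("A","B","C","D","E")] %>%
  melt(id.vars = "dummy1")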
library(ggplot2)
ggplot() +
geom_col(data = graph_data,
mapping = aes(x = variable, y = value, fill = dummy1),
position = "dodge")
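For comparison, here is a minimal sketch of the same idea with dplyr and tidyr. It assumes the percentage is meant within each dummy1 level, so the denominator is the group size rather than the total number of rows. The names share_data and share_of_ones are made up for this example.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

df <- data.frame(A = c(1,2,2,3,4,5,1,1,2,3),
                 B = c(4,4,2,3,4,2,1,5,2,2),
                 C = c(3,3,3,3,4,2,5,1,2,3),
                 D = c(1,2,5,5,5,4,5,5,2,3),
                 E = c(1,4,2,3,4,2,5,1,2,3),
                 dummy1 = c("yes","yes","no","no","no","no","yes","no","yes","yes"))

# mean(x == 1) is the share of 1s within each dummy1 group
share_data <- df %>%
  group_by(dummy1) %>%
  summarise(across(A:E, ~ mean(.x == 1)), .groups = "drop") %>%
  pivot_longer(A:E, names_to = "variable", values_to = "share_of_ones")

# One group of bars per column, one bar per dummy1 level
p <- ggplot(share_data, aes(x = variable, y = share_of_ones, fill = dummy1)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent)
```

Swapping dummy1 for dummy2 in group_by() and fill should give the second plot, provided that column is included in the data frame.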
What I'm currently stuck on is trying to plot each column of my dataframe as its own histogram in ggplot. I attached a screenshot below:
Ideally I would be able to compare the values in every 'Esteem' column side-by-side by plotting multiple histograms.
I tried using the melt() function to reshape my dataframe, and then feed into ggplot() but somewhere along the way I'm going wrong...
You could pivot to long, then facet by column:
library(tidyr)
library(ggplot2)
esteem81_long <- esteem81 %>%
pivot_longer(
Esteem81_1:Esteem81_10,
names_to = "Column",
values_to = "Value"
)
ggplot(esteem81_long, aes(Value)) +
geom_bar() +
facet_wrap(vars(Column))
Or for a list of separate plots, just loop over the column names:
plots <- list()
for (col in names(esteem81)[-1]) {
plots[[col]] <- ggplot(esteem81) +
geom_bar(aes(.data[[col]]))
}
plots[["Esteem81_4"]]
Example data:
set.seed(13)
esteem81 <- data.frame(Subject = c(2,6,7,8,9))
for (i in 1:10) {
esteem81[[paste0("Esteem81_", i)]] <- sample(1:4, 5, replace = TRUE)
}
esteem_long <- esteem81 %>% pivot_longer(cols = -c(Subject))
plot <- ggplot(esteem_long, aes(x = value)) +
geom_histogram(binwidth = 1) +
facet_wrap(vars(name))
plot
I'm using pivot_longer() from tidyr and ggplot2 for the plotting.
The line pivot_longer(cols = -c(Subject)) reads as "apart from the "Subject" column, all the others should be pivoted into long form data." I've left the default new column names ("name" and "value") - if you rename them then be sure to change the downstream code.
geom_histogram automates the binning and tallying of the data into histogram format - change the binwidth parameter to suit your desired outcome.
facet_wrap() allows you to specify a grouping variable (here name) and will replicate the plot for each group.
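To see what the reshaping step produces, here is a tiny sketch on a made-up two-subject data frame (the object names wide and long are only for illustration):

```r
library(tidyr)

# Two subjects, two Esteem columns, in wide form
wide <- data.frame(Subject = c(2, 6),
                   Esteem81_1 = c(4, 3),
                   Esteem81_2 = c(1, 2))

# Everything except Subject is stacked into the default "name"/"value" columns
long <- pivot_longer(wide, cols = -c(Subject))
print(long)
```

Each subject now contributes one row per Esteem column, which is the shape facet_wrap() needs in order to split the plot by name.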
I have 2 data frames:
df1 <- setNames(data.frame(as.POSIXct(c("2022-07-29 00:00:00","2022-07-29 00:05:00","2022-07-29 00:10:00","2022-07-29 00:15:00","2022-07-29 00:20:00")), c(1,2,3,4,5)), c("timeStamp", "value"))
df2 <- setNames(data.frame(as.POSIXct(c("2022-07-29 00:00:05","2022-07-29 00:05:05","2022-07-29 00:20:05")), c("a","b","c")), c("timeStamp", "text"))
I want to plot them so that the main graph is a geom_point on a numerical y scale, and then add the labels (a,b,c) from the second data frame at the correct timeStamps on the continuous time series x axis.
ggplot() +
geom_point(data=df1, aes(x=timeStamp, y= value)) +
geom_text(data=df2, aes(x=timeStamp, y= text))
The difficulty, I think, lies in the fact that the timeStamps do not perfectly match up, and I keep getting "Error: Discrete value supplied to continuous scale". Can anybody please offer some advice here?
The end result should look something like this (this is an example from a much larger data frame):
(Image: labeled time series using labels from a different time series data frame)
Thank you
The issue is not the timeStamp but that for the geom_point you are mapping a numeric or continuous variable on y while for the geom_text you map a discrete one on y. Hence you get the error
Error: Discrete value supplied to continuous scale
To fix that, map your text to the label aesthetic (which, by the way, is required for geom_text) and use the y aesthetic to specify the position where you want to add the labels:
library(ggplot2)
ggplot() +
geom_point(data=df1, aes(x=timeStamp, y= value)) +
geom_text(data=df2, aes(x=timeStamp, label = text, y = 6))
DATA
df1 <- setNames(data.frame(as.POSIXct(c("2022-07-29 00:00:00","2022-07-29 00:05:00","2022-07-29 00:10:00","2022-07-29 00:15:00","2022-07-29 00:20:00")), c(1,2,3,4,5)), c("timeStamp", "value"))
df2 <- setNames(data.frame(as.POSIXct(c("2022-07-29 00:00:05","2022-07-29 00:05:05","2022-07-29 00:20:05")), c("a","b","c")), c("timeStamp", "text"))
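As a variation on the fixed y = 6 used above, the labels can be pinned to the top edge of the panel so the position does not depend on the data range. This is only a sketch: the tz = "UTC" argument and the vjust value are assumptions added for the example.

```r
library(ggplot2)

df1 <- data.frame(
  timeStamp = as.POSIXct(c("2022-07-29 00:00:00", "2022-07-29 00:05:00",
                           "2022-07-29 00:10:00", "2022-07-29 00:15:00",
                           "2022-07-29 00:20:00"), tz = "UTC"),
  value = c(1, 2, 3, 4, 5)
)
df2 <- data.frame(
  timeStamp = as.POSIXct(c("2022-07-29 00:00:05", "2022-07-29 00:05:05",
                           "2022-07-29 00:20:05"), tz = "UTC"),
  text = c("a", "b", "c")
)

p <- ggplot() +
  geom_point(data = df1, aes(x = timeStamp, y = value)) +
  # y = Inf is a constant, so it is set outside aes();
  # vjust moves the text back down inside the panel
  geom_text(data = df2, aes(x = timeStamp, label = text), y = Inf, vjust = 1.5)
```

Because y is no longer mapped from df2, the y scale is trained on the numeric value column only, which is what avoids the discrete/continuous clash.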
Update: I removed my first answer.
I am still not sure. Also, @stefan's answer seems more correct, but maybe you are thinking of something like this:
If you want to position the labels from df2 on top of the points from df1, matched on the nearest time points between df1 and df2, then we need to use roll from data.table. This answer was adapted from "Merging two sets of data by data.table roll='nearest' function".
library(data.table)
library(tidyverse)
setDT(df1)
setDT(df2)
# Create time column by which to do a rolling join
df1[, time := timeStamp]
df2[, time := timeStamp]
setkey(df1, time)
setkey(df2, time)
set_merged <- df2[df1, roll = "nearest"]
set_merged %>%
as_tibble() %>%
ggplot(aes(x = time, y=value, group=1)) +
geom_point() +
geom_line()+
geom_text(aes(x=time, y=max(value)+0.1, label=text))+
theme_minimal()
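To make the behaviour of the rolling join visible, here is a small sketch on the same timestamps. Note that roll = "nearest" gives every point a label, even points that are minutes away from any label; the 10-second tolerance filter at the end is an assumption added for illustration, as are the names pts, labs, label_time and close.

```r
library(data.table)

pts <- data.table(
  time = as.POSIXct(c("2022-07-29 00:00:00", "2022-07-29 00:05:00",
                      "2022-07-29 00:10:00", "2022-07-29 00:15:00",
                      "2022-07-29 00:20:00"), tz = "UTC"),
  value = 1:5
)
labs <- data.table(
  time = as.POSIXct(c("2022-07-29 00:00:05", "2022-07-29 00:05:05",
                      "2022-07-29 00:20:05"), tz = "UTC"),
  text = c("a", "b", "c")
)
# Keep the label's own timestamp, since the join column takes the point's time
labs[, label_time := time]

# For every row of pts, pick the row of labs whose time is nearest
nearest <- labs[pts, on = "time", roll = "nearest"]
print(nearest)

# Optionally keep only matches within 10 seconds of a label
close <- nearest[abs(as.numeric(difftime(time, label_time, units = "secs"))) <= 10]
print(close)
```

With the filter, only the three points that sit next to a label keep their text, which is closer to the picture described in the question.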
I have a dataset in which I have one numeric variable and many categorical variables. I would like to make a grid of density plots, each showing the distribution of the numeric variable for different categorical variables, with the fill corresponding to subgroups of each categorical variable. For example:
library(tidyverse)
library(nycflights13)
dat <- flights %>%
select(carrier, origin, distance) %>%
mutate(origin = origin %>% as.factor,
carrier = carrier %>% as.factor)
plot_1 <- dat %>%
ggplot(aes(x = distance, fill = carrier)) +
geom_density()
plot_1
plot_2 <- dat %>%
ggplot(aes(x = distance, fill = origin)) +
geom_density()
plot_2
I would like to find a way to quickly make these two plots. Right now, the only way I know how to do this is to create each plot individually, and then use grid_arrange to put them together. However, my real dataset has something like 15 categorical variables, so this would be very time intensive!
Is there a quicker and easier way to do this? I believe that the hardest part about this is that each plot has its own legend, so I'm not sure how to get around that stumbling block.
This solution gives all the plots in a list. Here we make a single function that accepts a variable that you want to plot, and then use lapply with a vector of all the variables you want to plot.
fill_variables <- vars(carrier, origin)
func_plot <- function(fill_variable) {
dat %>%
ggplot(aes(x = distance, fill = !!fill_variable)) +
geom_density()
}
plotlist <- lapply(fill_variables, func_plot)
If you have no idea what those !! mean, I recommend watching this 5 minute video that introduces the key concepts of tidy evaluation. This is what you want to use when you create these sorts of wrapper functions to do things programmatically. I hope this helps!
Edit: If you want to pass a character vector instead of quosures, you can replace !!fill_variable with !!sym(fill_variable) as follows:
fill_variables <- c('carrier', 'origin')
func_plot <- function(fill_variable) {
dat %>%
ggplot(aes(x = distance, fill = !!sym(fill_variable))) +
geom_density()
}
plotlist <- lapply(fill_variables, func_plot)
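Another way to write the string-based version is the .data pronoun, which avoids !! and sym() altogether. The sketch below uses the mpg dataset that ships with ggplot2 (an assumption made so the example runs without nycflights13), and it arranges the resulting list with gridExtra only if that package is installed.

```r
library(ggplot2)

fill_variables <- c("drv", "class")

# .data[[fill_variable]] looks the column up by its name (a string)
func_plot <- function(fill_variable) {
  ggplot(mpg, aes(x = hwy, fill = .data[[fill_variable]])) +
    geom_density(alpha = 0.5) +
    labs(fill = fill_variable)
}

plotlist <- lapply(fill_variables, func_plot)
names(plotlist) <- fill_variables

# Each plot keeps its own legend when the list is arranged in a grid
if (requireNamespace("gridExtra", quietly = TRUE)) {
  gridExtra::grid.arrange(grobs = plotlist, ncol = 2)
}
```

Since every plot is built separately, the per-plot legends mentioned in the question are not a problem: each panel simply carries the legend for its own fill variable.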
Alternative solution
As @djc wrote in the comments: "I'm having trouble passing the column names into 'fill_variables'. Right now I am extracting column names using the following code..."
You can separate the categorical and continuous variables, for example cat_vars <- flights[, sapply(flights, is.character)] for the categorical variables and cont_vars <- flights[, !sapply(flights, is.character)] for the continuous ones, and then pass them into the wrapper function given by mgiormenti.
Full code is given below;
library(tidyverse)
library(nycflights13)
cat_vars <- flights[, sapply(flights, is.character)]
cont_vars<- flights[, !sapply(flights, is.character)]
# Plot one density per categorical column, filled by that column
func_plot_cat <- function(cat_var) {
  flights %>%
    ggplot(aes(x = distance, fill = .data[[cat_var]])) +
    geom_density()
}
# Plot distance against one non-character column (geom_point needs a y aesthetic)
func_plot_cont <- function(cont_var) {
  flights %>%
    ggplot(aes(x = distance, y = .data[[cont_var]])) +
    geom_point()
}
# Iterate over the column names, not over the columns themselves
plotlist_cat_vars <- lapply(names(cat_vars), func_plot_cat)
plotlist_cont_vars <- lapply(names(cont_vars), func_plot_cont)
print(plotlist_cat_vars)
print(plotlist_cont_vars)
I am trying to do the following. Consider the following dataset
trends <- c('Virtual Assistant', 'Citizen DS', 'Deep Learning', 'Speech Recognition',
'Handwritten Recognition', 'Machine Translation', 'Chatbots',
'NLP')
impact <- sample(5,8, replace = TRUE)
maturity <- sample(5,8, replace = TRUE)
strategy <- sample(5,8, replace = TRUE)
h <- sample(5,8, replace = TRUE)
df <- data.frame(trends, impact, maturity, strategy, h)
rownames(df) <- df$trends
I am trying to generate a heatmap. So far so good. That is relatively easy. For example I can use
library(tibble)  # rownames_to_column()
library(tidyr)   # gather() and the %>% pipe
dftemp = df[,c("impact", "maturity", "strategy", "h")]
dt2 <- dftemp %>%
  rownames_to_column() %>%
  gather(colname, value, -rowname)
and then
ggplot(dt2, aes(x = rowname, y = colname, fill = value)) +
geom_tile()
I know the labels on the x-axis are horizontal, but I know how to fix that. What I would like is to order the x-axis based on one specific row. For example, I would like the heatmap ordered so that the values in the row "impact" are ascending. Can anyone point me in the right direction?
Should I convert x to a factor and change the levels there?
Yes, you could convert it into factors and specify the levels. So to change it based on impact row we can do
dt2$rowname <- factor(dt2$rowname, levels = df$trends[order(df$impact)])
library(ggplot2)
ggplot(dt2, aes(x = rowname, y = colname, fill = value)) +
geom_tile()
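The reason this works is that a discrete axis in ggplot2 follows the order of the factor levels, not the order of the rows. A minimal sketch with three made-up trends and impact scores shows what the levels = ...[order(...)] expression produces:

```r
# Hypothetical values, chosen only to make the ordering easy to follow
trends <- c("Chatbots", "NLP", "Deep Learning")
impact <- c(3, 1, 2)

# order(impact) gives the row positions sorted by ascending impact
ordered_levels <- trends[order(impact)]

# The data stay in their original order; only the level order changes
x <- factor(trends, levels = ordered_levels)
print(levels(x))
```

Reversing the axis would then only need order(impact, decreasing = TRUE) in the same expression.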
I looked for an answer throughout the former threads, but with no luck.
I was wondering if it could be possible, given a data frame having a structure similar to this one
df <- data.frame(x = rep(1:100, times = 2 ),
y = c(rnorm(100), rnorm(100, 10)),
group = rep(c("a", "b"), each = 100))
to plot the difference between the observations of the two groups directly, instead of plotting the two samples using different colours, which is what I'm able to do so far using ggplot2. Of course I know I could do that using the base plotting system by simply using
plot(df[df$group == "a",]$y - df[df$group == "b",]$y)
but doing so I waste all the cool features of ggplot2.
Thanks in advance!
EB
You could try something like this:
library(reshape2)
library(ggplot2)
df <- dcast(df, x~group, value.var='y')
df$dif = df$a-df$b
ggplot(df, aes(x, dif)) + geom_line()
Or if you use data.table here is how to do it:
library(data.table)
dt=data.table(df)
dt<-dcast.data.table(dt, x~group, value.var='y')
dt[,dif:=a-b]
ggplot(dt, aes(x, dif)) + geom_line()
How does this look?
Another possibility using dplyr is the following:
library(dplyr)
# diff(y) is y[2] - y[1], i.e. b - a for this row order; use -diff(y) for a - b
ggplot(df %>% group_by(x) %>% summarise(delta = diff(y)),
       aes(x = x, y = delta)) + geom_line()
In this case you can avoid the dcast using the function diff and assuming the order between the groups, otherwise you need to sort the factors or apply a dcast on your data frame. I am quite sure that you can do something very similar using data.table.
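If relying on the row order feels fragile, a sketch with tidyr's pivot_wider() makes the subtraction explicit by naming both groups. This swaps diff() for a reshape, so it is closer to the dcast approach; set.seed(1) and the name diff_data are assumptions added for the example.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(1)
df <- data.frame(x = rep(1:100, times = 2),
                 y = c(rnorm(100), rnorm(100, 10)),
                 group = rep(c("a", "b"), each = 100))

# One row per x with columns a and b, then an explicit a - b
diff_data <- df %>%
  pivot_wider(names_from = group, values_from = y) %>%
  mutate(dif = a - b)

p <- ggplot(diff_data, aes(x, dif)) + geom_line()
```

Because the columns are addressed by name, the result does not change if the rows of df are shuffled.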
It's not completely solved, but it looks close to what I meant:
library(reshape2)
library(ggplot2)
qplot( x = x,
       y = diff,
       data = dcast( data = df,
                     value.var = "y",
                     formula = x ~ "diff",
                     fun.aggregate = function( x ) x[1] - x[2] ) )
It's quite tricky and strongly depends on what you have in your group variable, but works.
An alternative was to mutate the output of dcast, but in my case the group column was filled with TRUE and FALSE values. Thus, using mutate to obtain diff=TRUE-FALSE returned a column of 1s, which is not very useful.