I have the following sample data with three different cost-types and a year-column:
library(tidyverse)
# Sample data
costsA <- sample(100:200,30, replace=T)
costsB <- sample(100:140,30, replace=T)
costsC <- sample(20:20,30, replace=T)
year <- sample(c("2000", "2010", "2030"), 30, replace=T)
df <- data.frame(costsA, costsB, costsC, year)
My goal is to plot these costs in a stacked barplot, so that I can compare the mean-costs between the three year-categories. In order to do so I aggregated the values:
df %>% group_by(year) %>%
summarise(n=n(),
meanA = mean(costsA),
meanB = mean(costsB),
meanC = mean(costsC)) %>%
ggplot( ... ) + geom_bar()
But how can I plot the graph now? In the x-axis there should be the years and in the y-axis the stacked costs.
You have to reshape the summarised data into a tidy(-ish) long format to generate a plot like the one you describe. In the tidyverse, you can do that with the gather() function, which converts multiple columns into two columns of key-value pairs. For instance, the following code generates the stacked barplot, with one bar per year.
df %>% group_by(year) %>%
  summarise(n=n(),
            meanA = mean(costsA),
            meanB = mean(costsB),
            meanC = mean(costsC)) %>%
  gather("key", "value", - c(year, n)) %>%
  ggplot(aes(x = year, y = value, group = key, fill = key)) + geom_col()
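With gather("key", "value", - c(year, n)), the three summary columns (meanA, meanB, meanC) are converted to key-value pairs: key holds the former column name and value holds the mean, while year and n are left as they are.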
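In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). The following is a minimal sketch of the same reshaping, assuming a recent tidyr and dplyr; the seed, the value range for costsC, and the column names cost_type and mean_cost are illustrative choices, not part of the question.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(1)  # seed added only so the sketch is reproducible
df <- data.frame(
  costsA = sample(100:200, 30, replace = TRUE),
  costsB = sample(100:140, 30, replace = TRUE),
  costsC = sample(20:40, 30, replace = TRUE),   # illustrative range
  year   = sample(c("2000", "2010", "2030"), 30, replace = TRUE)
)

# Summarise per year, then stack the three mean columns into name/value pairs
means_long <- df %>%
  group_by(year) %>%
  summarise(n = n(),
            meanA = mean(costsA),
            meanB = mean(costsB),
            meanC = mean(costsC),
            .groups = "drop") %>%
  pivot_longer(starts_with("mean"),
               names_to = "cost_type",
               values_to = "mean_cost")

# One bar per year, stacked by cost type
p <- ggplot(means_long, aes(x = year, y = mean_cost, fill = cost_type)) +
  geom_col()
p
```

The result has one row per year and cost type, which is the shape geom_col() needs in order to stack the bars.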
I want to plot a 3D temperature distribution to show the trend of temperature across years and months in the same graph. The x and y axes denote the month and the year, and the z-axis shows the hourly temperature. How can I show multiple distributions to reveal the temperature trend?
This is the temperature data.
https://1drv.ms/x/s!AndXEcE6b4oxeaSk0sBJrcJuJ0c?e=6DvrFG
If you want a rotatable 3D surface plot, then plotly is your best bet. You will need to first get the monthly average of temperatures and create a matrix from them.
One option for doing this is to use the tidyverse to pivot to long format, summarize, then pivot to wide format. Here's how to do that, assuming your data frame as loaded from the csv is called temp:
library(tidyverse)
library(plotly)
temp %>%
  pivot_longer(starts_with('Hour'),
               names_to = 'Hour',
               values_to = 'Temperature') %>%
  group_by(Year, Month) %>%
  summarise(Temperature = mean(Temperature, na.rm = TRUE)) %>%
  pivot_wider(names_from = Month, values_from = Temperature) %>%
  ungroup() %>%
  select(-1) %>%
  as.matrix() %>%
  plot_ly(x = month.name, y = unique(temp$Year), z = .) %>%
  add_surface()
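To see what the reshaping step hands to plot_ly, here is a small sketch on made-up data. The linked file is not reproduced here, so the layout (Year and Month columns plus hourly columns whose names start with "Hour") is an assumption taken from the code above, and the values are random.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the linked file: one row per Year/Month,
# with three hourly readings instead of the full set
set.seed(42)
temp <- expand.grid(Month = 1:12, Year = 2015:2017)
temp$Hour1 <- rnorm(nrow(temp), mean = 10, sd = 5)
temp$Hour2 <- rnorm(nrow(temp), mean = 12, sd = 5)
temp$Hour3 <- rnorm(nrow(temp), mean = 14, sd = 5)

temp_matrix <- temp %>%
  pivot_longer(starts_with("Hour"),
               names_to = "Hour",
               values_to = "Temperature") %>%
  group_by(Year, Month) %>%
  summarise(Temperature = mean(Temperature, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = Month, values_from = Temperature) %>%
  select(-Year) %>%   # drop the Year column so only temperatures remain
  as.matrix()

dim(temp_matrix)  # rows are years, columns are months
```

Each row of the matrix is one year and each column one month, which is why the plotting call pairs x with the month names and y with the years.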
A nice alternative 2D way to show this kind of data would be with a heatmap:
temp %>%
  pivot_longer(starts_with('Hour'),
               names_to = 'Hour',
               values_to = 'Temperature') %>%
  group_by(Year, Month) %>%
  summarise(Temperature = mean(Temperature, na.rm = TRUE)) %>%
  ggplot(aes(Year, Month, fill = Temperature)) +
  geom_tile() +
  scale_fill_viridis_c(option = "A") +
  scale_y_continuous(breaks = 1:12,
                     labels = month.name) +
  coord_equal()
Consider dat created here:
set.seed(123)
ID = factor(letters[seq(6)])
time = c(100, 102, 120, 105, 109, 130)
dat <- data.frame(ID = rep(ID,time), Time = sequence(time))
dat$group <- rep(c("GroupA","GroupB"), c(322,344))
dat$values <- sample(100, nrow(dat), TRUE)
dat contains time series data for 6 individuals (6 IDs), which belong to 2 groups (GroupA and GroupB). Assume that we expect the time series within each group to have similar properties. Also note that the time series for each individual is of different length. We essentially want to create an "average" time series plot of each group, which I have done like this:
library(dplyr)
library(ggplot2)
dat %>%
  group_by(ID) %>%
  mutate(maxtime = max(Time)) %>%
  group_by(group) %>%
  mutate(maxtime = min(maxtime)) %>%
  filter(Time <= maxtime) %>%  # keep only the time span covered by every ID in the group
  group_by(group, Time) %>%
  summarize(values = mean(values)) %>%
  ggplot(aes(Time, values, colour = group))+
  geom_line()+
  facet_wrap(.~group)
How can we do the same thing, but add the original plots for each individual behind the "average" plots to illustrate the error associated with each "average"? Note that I created the "average" plot using the length of the shortest time series in each group, but when the originals are added, I would like to see each original in full if possible (so some will be longer than others).
Using a second geom_line you can plot the "raw" data in the background as e.g. grey lines.
set.seed(123)
ID = factor(letters[seq(6)])
time = c(100, 102, 120, 105, 109, 130)
dat <- data.frame(ID = rep(ID,time), Time = sequence(time))
dat$group <- rep(c("GroupA","GroupB"), c(322,344))
dat$values <- sample(100, nrow(dat), TRUE)
library(dplyr)
library(ggplot2)
d <- dat %>%
  group_by(ID) %>%
  mutate(maxtime = max(Time)) %>%
  group_by(group) %>%
  mutate(maxtime = min(maxtime)) %>%
  filter(Time <= maxtime) %>%  # average only over the span covered by every ID in the group
  group_by(group, Time) %>%
  summarize(values = mean(values))
#> `summarise()` regrouping output by 'group' (override with `.groups` argument)
ggplot()+
geom_line(data = dat, aes(Time, values, group = ID), color = "grey80", alpha = .7) +
geom_line(data = d, aes(Time, values, colour = group)) +
facet_wrap(.~group)
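If the aim is mainly to show the spread around each average, a ribbon of mean plus or minus one standard deviation is another option. This is only a sketch of that alternative, not part of the answer above; the names band, mean_value and sd_value are illustrative, and time points with a single remaining individual are dropped because no standard deviation can be computed there.

```r
library(dplyr)
library(ggplot2)

set.seed(123)
ID <- factor(letters[seq(6)])
time <- c(100, 102, 120, 105, 109, 130)
dat <- data.frame(ID = rep(ID, time), Time = sequence(time))
dat$group <- rep(c("GroupA", "GroupB"), c(322, 344))
dat$values <- sample(100, nrow(dat), TRUE)

# Mean and standard deviation per group and time point
band <- dat %>%
  group_by(group, Time) %>%
  summarise(mean_value = mean(values),
            sd_value = sd(values),
            n = n(),
            .groups = "drop") %>%
  filter(n > 1)  # sd is undefined where only one individual remains

p <- ggplot(band, aes(Time)) +
  geom_ribbon(aes(ymin = mean_value - sd_value,
                  ymax = mean_value + sd_value,
                  fill = group), alpha = 0.3) +
  geom_line(aes(y = mean_value, colour = group)) +
  facet_wrap(. ~ group)
p
```

Unlike the grey background lines, the ribbon summarises the individuals instead of showing them, so it suits cases with many more IDs per group.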
Maybe you are looking for a composed plot like this:
library(dplyr)
library(ggplot2)
library(patchwork)
G1 <- dat %>%
  group_by(ID) %>%
  mutate(maxtime = max(Time)) %>%
  group_by(group) %>%
  mutate(maxtime = min(maxtime)) %>%
  filter(Time <= maxtime) %>%  # average only over the span covered by every ID in the group
  group_by(group, Time) %>%
  summarize(values = mean(values)) %>%
  ggplot(aes(Time, values, colour = group))+
  geom_line()+
  facet_wrap(.~group)+
  ylab('Mean')
G2 <- dat %>%
  # group = ID draws one line per individual; the full-length series are kept
  ggplot(aes(Time, values, colour = group, group = ID))+
  geom_line()+
  facet_wrap(.~group)+
  ylab('Real Values')
# Compose plots: raw values on top, group means below, with a shared legend
G3 <- G2/G1+plot_layout(guides = "collect")
G3
I have a data frame and wanted to draw a density graph with geom_density_ridges from ggridges, but it returns the same line for all states. What am I doing wrong?
I would also like to add trim = TRUE, as is done here, but that returns the following message:
Ignoring unknown parameters: trim
My code:
library(tidyverse)
library(ggridges)
url <- httr::GET("https://xx9p7hp1p7.execute-api.us-east-1.amazonaws.com/prod/PortalGeral",
httr::add_headers("X-Parse-Application-Id" =
"unAFkcaNDeXajurGB7LChj8SgQYS2ptm")) %>%
httr::content() %>%
`[[`("results") %>%
`[[`(1) %>%
`[[`("arquivo") %>%
`[[`("url")
data <- openxlsx::read.xlsx(url) %>%
filter(is.na(municipio), is.na(codmun)) %>%
mutate_at(vars(contains(c("Acumulado", "Novos", "novos"))), ~ as.numeric(.))
data[,8] <- openxlsx::convertToDate(data[,8])
data <- data %>%
mutate(mortalidade = obitosAcumulado / casosAcumulado,
date = data) %>%
select(-data)
ggplot(data = data, aes(x = date, y = estado, heights = casosNovos)) +
geom_density_ridges(trim = TRUE)
You are probably not looking for density ridges but for regular ridgelines: geom_density_ridges() estimates a density from the x values, whereas geom_ridgeline() draws the heights you supply.
There are a few choices to make in terms of normalisation. If you want the ridges to resemble densities, you can divide each group's values by their sum: height = casosNovos / sum(casosNovos). Next, you can decide that you want each ridge to be scaled to fit between the lines, which you can do with the scales::rescale() function. It's your decision whether to do this per group or for the entire data. I chose the entire data below.
library(tidyverse)
library(ggridges)
url <- httr::GET("https://xx9p7hp1p7.execute-api.us-east-1.amazonaws.com/prod/PortalGeral",
httr::add_headers("X-Parse-Application-Id" =
"unAFkcaNDeXajurGB7LChj8SgQYS2ptm")) %>%
httr::content() %>%
`[[`("results") %>%
`[[`(1) %>%
`[[`("arquivo") %>%
`[[`("url")
data <- openxlsx::read.xlsx(url) %>%
filter(is.na(municipio), is.na(codmun)) %>%
mutate_at(vars(contains(c("Acumulado", "Novos", "novos"))), ~ as.numeric(.))
data[,8] <- openxlsx::convertToDate(data[,8])
data <- data %>%
mutate(mortalidade = obitosAcumulado / casosAcumulado,
date = data) %>%
select(-data) %>%
group_by(estado) %>%
mutate(height = casosNovos / sum(casosNovos))
ggplot(data = data[!is.na(data$estado),],
aes(x = date, y = estado, height = scales::rescale(height))) +
geom_ridgeline()
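Because the code above depends on a remote file, here is a small offline sketch of the same two normalisation steps. The toy data frame and its columns (state, day, cases, share) are made up for illustration and only mimic two groups with very different case counts.

```r
library(dplyr)
library(ggplot2)
library(ggridges)

# Hypothetical counts for two groups on very different scales
toy <- data.frame(
  state = rep(c("X", "Y"), each = 5),
  day   = rep(1:5, times = 2),
  cases = c(1, 2, 4, 2, 1, 100, 300, 400, 150, 50)
)

toy <- toy %>%
  group_by(state) %>%
  mutate(share = cases / sum(cases)) %>%   # shares sum to 1 within each group
  ungroup() %>%
  mutate(height = scales::rescale(share))  # map all shares onto [0, 1]

p <- ggplot(toy, aes(x = day, y = state, height = height)) +
  geom_ridgeline()
p
```

After the first step the two groups are comparable even though their raw counts differ by two orders of magnitude. Doing the second step on the ungrouped data keeps their relative peak heights, whereas rescaling per group would make every ridge reach the same maximum.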
I have the following type of data and would like to create a stacked barplot that shows the sum of Number on the y-axis for different bins of Distance on the x-axis. In effect, that would be a sort of histogram, with the sum of Number per bin on the y-axis instead of frequencies. The bars would be stacked across all categories in Dest, which would be marked with different colours.
Thanks so much.
library(ggplot2)
df <- data.frame(c(rep("A",20),rep("B",25),rep("C",35)),sample(1:30, 80,replace = TRUE),
rnorm(80,45,8))
colnames(df) <- c("Dest","Number","Distance")
ggplot(data = df, aes(x = Distance, y = Number, fill = Dest)) +
geom_histogram(colour = c("red","blue","green"))
Here are 2 solutions in case you want to be the one that specifies the (Distance) bins and not the histogram:
Option 1 (using ntile)
Here's a solution that allows you to specify the number of bins using ntile, which means that those bins will have more or less the same number of observations:
library(tidyverse)
df <- data.frame(c(rep("A",20),rep("B",25),rep("C",35)),sample(1:30, 80,replace = TRUE),
rnorm(80,45,8))
colnames(df) <- c("Dest","Number","Distance")
df %>%
group_by(bin = ntile(Distance, 3)) %>% # specify number of bins you want
mutate(DistRange = paste0(round(min(Distance)), " - ", round(max(Distance)))) %>%
ungroup() %>%
group_by(Dest, bin, DistRange = fct_reorder(DistRange, bin)) %>%
summarise(sum_number = sum(Number)) %>%
ungroup() %>%
ggplot(aes(DistRange, sum_number, fill=Dest))+
geom_col()
Option 2 (using cut)
An alternative option using cut to specify ranges:
df %>%
mutate(bin = cut(Distance, breaks = c(min(Distance)-1, 40, 50, 55, max(Distance)))) %>% # specify ranges
group_by(Dest, bin) %>%
summarise(sum_number = sum(Number)) %>%
ungroup() %>%
ggplot(aes(bin, sum_number, fill=Dest))+
geom_col()
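If equal-width bins are enough, ggplot2's cut_width() helper can replace the hand-written breaks. The sketch below assumes a bin width of 10 with bin edges at multiples of 10; both are illustrative choices, as are the seed and the name binned.

```r
library(dplyr)
library(ggplot2)

set.seed(1)  # seed added only so the sketch is reproducible
df <- data.frame(
  Dest = c(rep("A", 20), rep("B", 25), rep("C", 35)),
  Number = sample(1:30, 80, replace = TRUE),
  Distance = rnorm(80, 45, 8)
)

# Equal-width Distance bins, then the sum of Number per bin and Dest
binned <- df %>%
  mutate(bin = cut_width(Distance, width = 10, boundary = 0)) %>%
  group_by(Dest, bin) %>%
  summarise(sum_number = sum(Number), .groups = "drop")

p <- ggplot(binned, aes(bin, sum_number, fill = Dest)) +
  geom_col()
p
```

A quick sanity check for any of these binning approaches is that sum(binned$sum_number) equals sum(df$Number), which confirms that no observation fell outside the bins.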
I have a time series spanning several years, and I need to plot it with mm/dd on the x-axis and one line per year on the y-axis using plot_ly. I have generated sample data here:
date<-seq(as.Date("2010-11-22"),as.Date("2016-05-26"),by ="days")
sales = runif(2013, 2000, 6000)
df = data.frame(date,sales)
I plotted this data and get this:
plot_ly(df,x= ~date) %>% add_lines(y = ~sales,color=I("red"))
Now, I tried to plot multiple y-axis using plot_ly:
plot_ly(df, x = ~date) %>% add_lines(y = ~sales,
df$date <= "2010-12-31",color=I("red")) %>%
add_lines(y = ~sales, df$date <= "2013-12-31" &
df$date >= 2013-01-01, color = I("green"))
but I got the wrong plot:
What is the mistake here?
I want a plot like this:
To draw separate lines on the same graph, we have to split df into groups with plotly::group_by. In your case, this is achieved by using lubridate to split by year:
library(plotly)
library(lubridate)
date <- seq(as.Date("2010-11-22"), as.Date("2016-05-26"), by = "days")
# Add some variations to distinguish lines
sales <- runif(2013, 10, 20) + year(date) * 5 + yday(date) / 5
df <- data.frame(date, sales)
df %>%
mutate(year = year(date)) %>%
group_by(year) %>%
plot_ly(x = ~ yday(date)) %>%
add_lines(y = ~ sales,
color = ~ factor(year)
)
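The code above puts the day of the year (1 to 366) on the x-axis. If the axis should read mm/dd instead, one possible variation is to move every date into a single dummy year and let plotly format the ticks. This is a sketch under that assumption; the column name month_day and the choice of 2000 as the dummy year (a leap year, so Feb 29 stays valid) are illustrative.

```r
library(plotly)
library(lubridate)

date <- seq(as.Date("2010-11-22"), as.Date("2016-05-26"), by = "days")
set.seed(1)  # seed added only so the sketch is reproducible
sales <- runif(length(date), 10, 20) + year(date) * 5 + yday(date) / 5
df <- data.frame(date, sales)

# Same month and day, but all in the dummy year 2000
df$month_day <- as.Date(format(df$date, "2000-%m-%d"))
df$year <- factor(year(df$date))

# One trace per year on a shared date axis, labelled as mm/dd
p <- plot_ly(df, x = ~month_day, y = ~sales, color = ~year,
             type = "scatter", mode = "lines") %>%
  layout(xaxis = list(tickformat = "%m/%d", title = "mm/dd"))
p
```

Compared with yday(), this keeps calendar dates aligned across leap and non-leap years, at the cost of carrying an artificial year in the data.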