Subset boxplots by date, order x-axis by month [duplicate] - r

I have a year's worth of data spanning two calendar years. I want to plot boxplots for those data subset by month.
The plots will always be ordered alphabetically (if I use month names) or numerically (if I use month numbers). Neither suits my purpose.
In the example below, I want the months on the x-axis to start at June (2013) and end in May (2014).
date <- seq.Date(as.Date("2013-06-01"), as.Date("2014-05-31"), "days")
set.seed(100)
x <- as.integer(abs(rnorm(365))*1000)
df <- data.frame(date, x)
boxplot(df$x ~ months(df$date), outline = FALSE)
I could probably generate a vector of the months in the order I need (e.g. months <- months(seq.Date(as.Date("2013-06-01"), as.Date("2014-05-31"), "month"))).
Is there a more elegant way to do this? What am I missing?

Are you looking for something like this?
boxplot(df$x ~ reorder(format(df$date,'%b %y'),df$date), outline = FALSE)
I am using reorder to order the month labels by date. I am also formatting the dates to drop the day part, since you aggregate your boxplots by month.
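To see what reorder does to the factor levels, here is a minimal sketch using the dates from the question. It passes as.numeric(date) as the score, which is my own simplification (the answer above passes the dates directly), and the abbreviated month names it prints depend on your locale.

```r
# Daily dates from the question, spanning June 2013 to May 2014
date <- seq.Date(as.Date("2013-06-01"), as.Date("2014-05-31"), "days")

# Month-year labels; the abbreviated month names depend on the locale
month_label <- format(date, "%b %y")

# reorder() computes one score per label (by default the mean of the second
# argument within each label) and sorts the factor levels by that score
by_date <- reorder(month_label, as.numeric(date))

# The levels now run chronologically instead of alphabetically
levels(by_date)
```

Because boxplot draws one box per factor level, in level order, the chronological levels are what put June 2013 first on the x-axis.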
Edit:
If you want to skip the year part (but why? Personally, I find this a little confusing):
boxplot(df$x ~ reorder(format(df$date,'%B'),df$date), outline = FALSE)
Edit 2: a ggplot2 solution.
Since you are in the marketing field and are learning ggplot2 :)
library(ggplot2)
ggplot(df) +
  geom_boxplot(aes(y = x,
                   x = reorder(format(date, '%B'), date),
                   fill = format(date, '%Y'))) +
  xlab('Month') + guides(fill = guide_legend(title = "Year")) +
  theme_bw()

I had a similar problem where I wanted to order the plot from January to December. This seems to be a common source of frustration, so here is my solution:
date <- seq.Date(as.Date("2013-06-01"), as.Date("2014-05-31"), "days")
set.seed(100)
x <- as.integer(abs(rnorm(365))*1000)
months <- month.name
boxplot(x~as.POSIXlt(date)$mon,names=months, outline = FALSE)

Found an answer here - use a factor, not a date:
set.seed(100)
x <- as.integer(abs(rnorm(365))*1000)
df <- data.frame(date, x)
# create a factor whose levels are the months in the desired order
m <- months(seq.Date(as.Date("2013-06-01"), as.Date("2014-05-31"), "month"))
df$months <- factor(months(df$date), levels = m)
# plot x axis as ordered
boxplot(df$x ~ df$months, outline = FALSE)
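A possible variation on the factor approach derives the level order from the data itself rather than from hard-coded start and end dates. This is only a sketch: the names month_levels and month_factor are mine, and it assumes the data cover at most twelve consecutive months, so that each month name occurs only once.

```r
# Daily dates from the question, spanning June 2013 to May 2014
date <- seq.Date(as.Date("2013-06-01"), as.Date("2014-05-31"), "days")

# Month names in order of first appearance in the sorted dates
month_levels <- unique(months(sort(date)))

# Factor whose levels follow the chronological order of the data
month_factor <- factor(months(date), levels = month_levels)

levels(month_factor)
```

The resulting factor can be used on the right-hand side of the boxplot formula in the same way as df$months above.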

Related

break down date into year and month

I have a df with a date column "månad" in the format 2020M07.
When I plot this, the x-axis gets very crowded. Instead of plotting one continuous line, I want to create one series per year, with the x-axis containing only the month.
To do this I need a column in my df with only the year, so I can group on that variable (in ggplot), and I also need a column with only the month (as the x input in ggplot). How do I use the "månad" column to achieve this? Is there something like Excel's LEFT function that can be used with dplyr's mutate? If not, how else can I do this?
Maybe my idea isn't the best; feel free to respond to my suggested solution or to propose a better one!
Thanks!
EDIT: code that produces the df in its current state:
library(pxweb)
library(tidyverse)
library(astsa)
library(forecast)
library(scales)
library(plotly)
library(zoo)
library(lubridate)
# PXWEB query
pxweb_query_list_BAS <-
  list("Region" = c("22"),
       "Kon" = c("1+2"),
       "SNI2007" = c("A-U+US"),
       "ContentsCode" = c("000005F3"))
# Download data
px_data_BAS <-
  pxweb_get(url = "https://api.scb.se/OV0104/v1/doris/sv/ssd/START/AM/AM0210/AM0210B/ArbStDoNMN",
            query = pxweb_query_list_BAS)
# Convert to data.frame
df_syss_natt <- as.data.frame(px_data_BAS, column.name.type = "text", variable.value.type = "text") %>%
  rename(syss_natt = 'sysselsatta efter bostadens belägenhet') %>%
  filter(månad > 2020)
# Plot data
ggplot(df_syss_natt, aes(x = månad, y = syss_natt, group = 1)) +
  geom_point() +
  geom_line(color = "red")
With the current single series, the output is very crowded.
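One way to get the two helper columns asked about above is base R's substr, which plays roughly the role of Excel's LEFT and MID. The sketch below assumes every value looks like 2020M07 (four-digit year, the letter M, two-digit month). The data frame df_example, its values, and the ASCII column name manad are made up for illustration and stand in for the real data.

```r
# Hypothetical sample mimicking the format of the "månad" column
df_example <- data.frame(
  manad     = c("2020M11", "2020M12", "2021M01", "2021M02"),
  syss_natt = c(100, 102, 101, 105)
)

# Characters 1-4 hold the year; keep it as text so it can act as a group
df_example$year <- substr(df_example$manad, 1, 4)

# Characters 6-7 hold the month; convert to integer so the x-axis sorts 1..12
df_example$month <- as.integer(substr(df_example$manad, 6, 7))

df_example
```

With these columns in place, mapping x to month and both group and colour to year in ggplot should give one line per year. The same two substr calls can also be placed inside dplyr's mutate.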

grouping & plotting by textual column value

I've got a (very) basic level of competency with R when working with numbers, but when it comes to manipulating data based on text values in columns I'm stuck. For example, if I want to plot meal frequency vs. day of week (is Tuesday really for tacos?) using the following data frame, how would I do that? I've seen suggestions of tapply, aggregate, colSums, and others, but those have all been for slightly different scenarios and nothing gives me what I'm looking for. Should I be looking at something other than R for this problem? My end goal is a graph with day of week on the X-axis, count on the Y-axis, and a line plot for each meal.
df <- data.frame(meal= c("tacos","spaghetti","burgers","tacos","spaghetti",
"spaghetti"), day = c("monday","tuesday","wednesday","monday","tuesday","wednesday"))
This is as close as I've gotten, and, to be honest, I don't fully understand what it's doing:
tapply(df$day, df$meal, FUN = function(x) length(x))
It will summarize the meal counts, but a) it doesn't have column names (my understanding is that's due to tapply returning a vector), and b) it doesn't keep an association with the day of the week.
Edit: The melt() suggestion below works for this dataset, but it won't scale to the size I need. I was, however, able to get a working graph from the dataframe produced by the melt. If anybody runs across this in the future, try:
ggplot(new, aes(day, value, group=meal, col=meal)) +
  geom_line() + geom_point() +
  scale_y_continuous(breaks = function(x)
    unique(floor(pretty(seq(0, (max(x) + 1) * 1.1)))))
(The part after geom_point() is to force the Y-axis to only be integers, which is what makes sense in this case.)
I tried to cut this into smaller pieces so you can understand what's going on:
library(tidyverse)
# defining the dataframes
df <- data.frame(meal = c("tacos","spaghetti","burgers","tacos","spaghetti","spaghetti"),
                 day = c("monday","tuesday","wednesday","monday","tuesday","wednesday"))
# define a vector of days of the week (useful for displaying the x-axis in the correct order)
ordered_days <- c("sunday","monday","tuesday","wednesday",
                  "thursday","friday","saturday")
# count the number of meals per day of week
df_count <- df %>% group_by(meal,day) %>% count() %>% ungroup()
# many combinations are missing, for example no burgers on monday,
# so I am creating all combinations with count 0
fill_0 <- expand.grid(
  meal = factor(unique(df$meal)),
  day = factor(ordered_days),
  n = 0)
# append fill_0 to df_count;
# as some combinations already exist, group by again and sum n
# so there is only one row per (meal, day) combination
df_count <- rbind(df_count, fill_0) %>%
  group_by(meal, day) %>%
  summarise(n = sum(n)) %>%
  mutate(day = factor(day, levels = ordered_days, ordered = TRUE))
# plot this by grouping by meal
ggplot(df_count, aes(x=day, y=n, group=meal, col=meal)) + geom_line()
The magic is here, courtesy of @fmarm:
df_count <- df %>% group_by(meal,day) %>% count() %>% ungroup()
The fill_0 and rbind steps in the sample provided by @fmarm are also necessary to keep the plot from bombing out on unspecified combinations, but it's the line above that handles counting meals by day.
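For comparison, the count-and-fill step can also be sketched in base R. This is an alternative to the dplyr pipeline, not part of the answer above: once day is a factor with all seven levels, table() reports a zero for every combination that never occurs. The name meal_counts is my own.

```r
df <- data.frame(
  meal = c("tacos", "spaghetti", "burgers", "tacos", "spaghetti", "spaghetti"),
  day  = c("monday", "tuesday", "wednesday", "monday", "tuesday", "wednesday")
)
ordered_days <- c("sunday", "monday", "tuesday", "wednesday",
                  "thursday", "friday", "saturday")

# Declaring all seven levels up front makes table() count absent days as 0
df$day <- factor(df$day, levels = ordered_days)

# One row per (meal, day) combination, with the count in column n
meal_counts <- as.data.frame(table(meal = df$meal, day = df$day),
                             responseName = "n")

head(meal_counts)
```

The resulting data frame has the same meal, day and n columns as df_count, so it should work with the same ggplot call.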

Displaying multiple boxplots in R within a specific range

If I have a data frame df with columns yearID and payroll
boxplot(df$payroll ~ df$yearID, ylab="Payroll", xlab="Year")
displays a boxplot for every year. Is there a way to specify the range of years that are displayed? Thanks
It's helpful to have the data that powers your code; a small reproducible example makes the question easier to answer.
As rawr pointed out in the comments, you can use the subset argument of boxplot to narrow the range of years that's rendered. Passing the data frame via data lets subset find yearID:
boxplot(payroll ~ yearID, data = df, ylab="Payroll", xlab="Year", subset = yearID > 2013)
Personally, I prefer using the data management tools from dplyr in order to keep my code consistent regardless of the function I'm using. In this case you can use filter to select only the years that you want. dplyr becomes even more useful when you use pipes, but I'll keep this example simple.
library(tidyverse) # Includes dplyr and other useful packages
# Generate dummy data
yearID <- sample(1995:2016, size = 1000, replace = TRUE)
payroll <- round(rnorm(1000, mean = 50000, sd = 20000))
df <- tibble(yearID, payroll)
# Filter the data to include only the years you want
df_plot <- filter(df, yearID > 2013)
# Generate your boxplot
boxplot(df_plot$payroll ~ df_plot$yearID, ylab="Payroll", xlab="Year")
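If the range needs both a lower and an upper bound, base R subsetting with %in% is another option. The sketch below uses dummy data like that above; the bounds 2005 to 2010 and the names year_range and df_range are arbitrary choices for illustration.

```r
# Dummy data in the same shape as the question's data frame
set.seed(1)
df <- data.frame(
  yearID  = sample(1995:2016, size = 1000, replace = TRUE),
  payroll = round(rnorm(1000, mean = 50000, sd = 20000))
)

# Hypothetical range of years to display
year_range <- 2005:2010

# Keep only the rows whose year falls inside the range
df_range <- df[df$yearID %in% year_range, ]

# One box per remaining year
boxplot(payroll ~ yearID, data = df_range, ylab = "Payroll", xlab = "Year")
```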

Plot Number of Categories Per Year

I have some data that I need to graph in R. There are two columns of data. The first one is a series of years ranging from 2001 to 2011. The second column is a string. The strings can be anything. I need to make a multi-line graph (I was trying to use ggplot) where the number of occurrences of each string is on the y-axis and the year is on the x-axis.
I don't really have much of an idea where to start. This is what I had but I'm not sure if this is correct.
year <- data$year
# Idk how to get occurrences per year
# year_2001 <- data$string[data$year == 2001]
# would this work?
# ggplot + geom_line()
I know most of that is commented out but that's because I'm new to R. Any help or guidance is greatly appreciated. Thanks!
Here is one way to get it done.
library(ggplot2)
library(dplyr)
set.seed(272727)
data <- data.frame(year = sample(2001:2011, 100, replace = TRUE),
string = sample(letters[1:5], 100, replace = TRUE))
# this is what will be plotted
table(data$string, data$year)
dataSummary <- as.data.frame(xtabs(~year+string, data))
ggplot(dataSummary, aes(x = year, y = Freq, group = string, colour = string)) + geom_line()
Note my previous answer used dplyr, but it had an issue with year-string combinations that are zero length. See dplyr summarise: Equivalent of ".drop=FALSE" to keep groups with zero length in output.
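A small sketch of why xtabs avoids the zero-length problem: every year and string combination gets a row, including those that never occur. The four-row data frame here is made up for illustration. Converting year back to a number is my own addition, for cases where a continuous x-axis is preferred.

```r
# Tiny made-up data set: string "b" never occurs in 2002 or 2003
data <- data.frame(
  year   = c(2001, 2001, 2002, 2003),
  string = c("a", "b", "a", "a")
)

# xtabs() cross-tabulates all combinations, so absent ones get Freq = 0
dataSummary <- as.data.frame(xtabs(~ year + string, data))

# as.data.frame() returns year as a factor; convert via character to
# recover the numeric year values
dataSummary$year <- as.integer(as.character(dataSummary$year))

dataSummary
```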

R: scatterplot from list

I have data similar to the following:
a <- list("1999"=c(1:50), "2000"=rep(35, 20), "2001"=c(100, 101, 103))
I want to make a scatterplot where the x axis is the years (1999, 2000, 2001 as given by the list names) and the y axis is the value within each list. Is there an easy way to do this? I can achieve something close to what I want with a simple boxplot(a), but I'd like it to be a scatterplot rather than a boxplot.
You could create a data frame with the year in one column and the value in the other, and then plot the appropriate columns:
b <- data.frame(year=as.numeric(rep(names(a), sapply(a, length))), val=unlist(a))
plot(b)
This will do what you want:
do.call(rbind,
        lapply(names(a), function(year) data.frame(year = year, obs = a[[year]]))
)
Or break it up into two function calls (lapply and then do.call) to make it more understandable what's going on. It's pretty simple if you go through it. The lapply creates one dataframe per year where each year gets all the values for that year in the list. Now you have 3 dataframes. Then do.call binds these dataframes together.
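The two-step version described above might look like the following sketch; the intermediate names per_year and combined are mine.

```r
a <- list("1999" = c(1:50), "2000" = rep(35, 20), "2001" = c(100, 101, 103))

# Step 1: one data frame per year, pairing the year with each of its values
per_year <- lapply(names(a), function(year) {
  data.frame(year = year, obs = a[[year]])
})

# Step 2: stack the three data frames into a single one
combined <- do.call(rbind, per_year)

str(combined)
```

Here year ends up as text, so it needs converting with as.numeric(as.character(combined$year)) before being used as a numeric x-axis.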
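An option using tidyr/dplyr/ggplot2 (current versions of tidyr expect unnest to receive a data frame, so the list is first converted with tibble::enframe):
library(ggplot2)
library(tidyr)
library(dplyr)
library(tibble)
enframe(a, name = "year", value = "x") %>%
  unnest(x) %>%
  ggplot(aes(x = year, y = x)) +
  geom_point(shape = 1)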
