Arrange weekdays starting on Sunday - r

Hi everyone!
How can I arrange weekdays, starting on Sunday, in R? I got the weekdays using the weekdays() function, but the days appear in a seemingly random order (image attached) and I can't find a way to sort them. I tried the arrange function, but I guess it only works with numeric values. A bar chart looks very odd starting on Friday. This is what the code looks like:
my_dataset <- my_dataset %>%
  mutate(weekDay = weekdays(Date))
my_dataset %>%
  group_by(weekDay) %>%
  summarise(mean_steps = mean(TotalSteps)) %>%
  ggplot(aes(x = weekDay, y = mean_steps)) + # y must name the summarised column
  geom_bar(stat = "identity")
Thanks!
I tried the arrange function, but I guess it only works with numeric values.

Your weekDay vector is most likely of class character, and ggplot arranges character values in alphabetical order on a discrete axis. The solution is to convert the character vector into a factor whose levels are in the order you want.
There are several ways to get the x-axis into that order, and all of them come down to converting weekDay into a factor.
To stay close to your example, I first create a data frame with weekdays and some step counts. Both are sampled randomly, so a seed is set to make the code reproducible.
One method is to create the summary data frame first and then redefine weekDay in it as a factor with explicit levels.
The other is to do the conversion inside the ggplot call, when defining the aesthetics.
library(tidyverse)
set.seed(111)
myData <- data.frame(
  weekDay = sample(weekdays(Sys.Date() + 0:6), 100, replace = TRUE),
  TotalSteps = sample(1000:8000, 100)
)
# new data.frame with one row per weekday
DF <- myData %>%
  group_by(weekDay) %>%
  summarise(mean_steps = mean(TotalSteps))
# the following defines weekDay as a factor and also sets
# the sequence of factor levels. This sequence is then taken
# by ggplot to construct the x-axis.
# The level names must match what weekdays() returns in your
# locale (English names are shown here).
DF$weekDay <- factor(DF$weekDay, levels = c(
  "Sunday", "Monday",
  "Tuesday", "Wednesday",
  "Thursday", "Friday",
  "Saturday"
))
ggplot(DF, aes(x = weekDay, y = mean_steps)) +
  geom_bar(stat = "identity") +
  labs(x = "")
# the factor can also be defined within the ggplot-call
myData %>%
  group_by(weekDay) %>%
  summarise(mean_steps = mean(TotalSteps)) %>%
  ggplot(aes(x = factor(weekDay, levels = c(
    "Sunday", "Monday",
    "Tuesday", "Wednesday",
    "Thursday", "Friday",
    "Saturday"
  )), y = mean_steps)) +
  geom_bar(stat = "identity") +
  labs(x = "")
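Hard-coded level names depend on the locale, because weekdays() returns translated day names. As a hedged sketch (2023-01-01 is simply a known Sunday picked for illustration, and day_levels, dates and week_day are made-up names), the levels can be derived from weekdays() itself so that they always match:

```r
# A known Sunday; the six following dates complete the week.
sunday <- as.Date("2023-01-01")
day_levels <- weekdays(sunday + 0:6)  # day names in the current locale, Sunday first

# Hypothetical example dates, only for illustration
dates <- as.Date("2023-03-01") + 0:9
week_day <- factor(weekdays(dates), levels = day_levels)

# The factor levels now control the order used by table(), sort()
# and by ggplot when week_day is mapped to a discrete axis.
table(week_day)
```

The same day_levels vector could be passed as the levels argument in either of the two variants shown in the answer.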

Related

How to specify ggplot legend order when you have multiple variables that are not all part of one column?

I'm plotting the same data on different time scales (Week, Month, Quarter, etc.) using ggplot, so I'm pulling the data from different columns. However, I want the legend entries to appear in a specific order.
I know that if all the grouping variables were in one column, I could set it as an ordered factor, as explained here, but my data are spread across multiple columns. I also tried the suggestions here about re-ordering multiple geoms, but they didn't work.
Because my actual dataset is very complex, I've reproduced a smaller version that just has week and month data. For the final answer, please allow for a specific order, not just something like rev(), because in my actual dataset I have 6 columns that need a specific order.
Here's code to reproduce the problem. The first 3 chunks build the dataset, so only the 4th chunk, which makes the plot, should be relevant to the solution. By default, R shows 'Score - Month' first in the legend; I'd like to make it the 2nd.
library(dplyr)
library(ggplot2)
library(lubridate)
# Generates week data -- shouldn't be relevant to troubleshoot
by_week <- tibble(
  Week = seq(as.Date("2011-01-01"), as.Date("2012-07-01"), by = "weeks"),
  Week_score = c(sample(100:200, 79)),
  Month = ymd(format(Week, "%Y-%m-01"))
)
# Generates month data -- shouldn't be relevant to troubleshoot
by_month <- tibble(
  Month = seq(as.Date("2011-01-01"), as.Date("2012-07-01"), by = "months"),
  Month_score = c(sample(150:200, 19))
)
# Joins data and removes duplications of month data for easier plotting -- shouldn't be relevant to troubleshoot
all_time <- by_week %>%
  full_join(by_month) %>%
  mutate(helper = across(c(contains("Month")), ~paste(.))) %>%
  mutate(across(c(contains("Month")), ~ifelse(duplicated(helper), NA, .)), .keep = "unused") %>%
  mutate(Month = as.Date(Month))
# Makes plot - this is where I want the order in the legend to be different
all_time %>%
  ggplot(aes(x = Week)) +
  geom_line(aes(y = Week_score, colour = "Week_score")) +
  # This layer tells R just to focus on non-missing values for Month_score
  geom_line(data = all_time[!is.na(all_time$Month_score), ], aes(y = Month_score, colour = "Month_score")) +
  scale_colour_discrete(labels = c("Week_score" = "Score - Week", "Month_score" = "Score - Month"))
Here's what the current legend looks like--I want the order switched with a solution that is scalable to more than 2 options. Thank you!
As @stefan rightly mentioned in the comments, you should list the series names in the limits argument of scale_colour_discrete, in the order you want them to appear in the legend. Further columns can be added in the same way. You can use the following code:
library(dplyr)
library(ggplot2)
library(lubridate)
# Generates week data -- shouldn't be relevant to troubleshoot
by_week <- tibble(
  Week = seq(as.Date("2011-01-01"), as.Date("2012-07-01"), by = "weeks"),
  Week_score = c(sample(100:200, 79)),
  Month = ymd(format(Week, "%Y-%m-01"))
)
# Generates month data -- shouldn't be relevant to troubleshoot
by_month <- tibble(
  Month = seq(as.Date("2011-01-01"), as.Date("2012-07-01"), by = "months"),
  Month_score = c(sample(150:200, 19))
)
# Joins data and removes duplications of month data for easier plotting -- shouldn't be relevant to troubleshoot
all_time <- by_week %>%
  full_join(by_month) %>%
  mutate(helper = across(c(contains("Month")), ~paste(.))) %>%
  mutate(across(c(contains("Month")), ~ifelse(duplicated(helper), NA, .)), .keep = "unused") %>%
  mutate(Month = as.Date(Month))
# Makes plot - this is where I want the order in the legend to be different
all_time %>%
  ggplot(aes(x = Week)) +
  geom_line(aes(y = Week_score, colour = "Week_score")) +
  # This layer tells R just to focus on non-missing values for Month_score
  geom_line(data = all_time[!is.na(all_time$Month_score), ], aes(y = Month_score, colour = "Month_score")) +
  # limits fixes the legend order; labels only renames the entries
  scale_colour_discrete(
    labels = c("Week_score" = "Score - Week", "Month_score" = "Score - Month"),
    limits = c("Week_score", "Month_score")
  )
Output:
As you can see the order of the labels is changed.
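For more than two series, one way to keep labels and order in sync is to store both in a single named vector and reuse it. This is only a sketch under assumptions: Quarter_score, legend_labels and the toy data frame are made up for illustration and are not part of the original dataset.

```r
library(ggplot2)

# Hypothetical series names (vector names) and display labels (values),
# written in the order the legend should show them
legend_labels <- c(
  Week_score    = "Score - Week",
  Month_score   = "Score - Month",
  Quarter_score = "Score - Quarter"
)

# Small made-up data set, only for illustration
toy <- data.frame(
  x             = 1:5,
  Week_score    = c(3, 5, 4, 6, 7),
  Month_score   = c(4, 4, 5, 5, 6),
  Quarter_score = c(5, 5, 5, 6, 6)
)

p <- ggplot(toy, aes(x = x)) +
  geom_line(aes(y = Week_score, colour = "Week_score")) +
  geom_line(aes(y = Month_score, colour = "Month_score")) +
  geom_line(aes(y = Quarter_score, colour = "Quarter_score")) +
  # limits sets the legend order, labels sets the displayed text
  scale_colour_discrete(labels = legend_labels, limits = names(legend_labels))
```

Reordering the entries of legend_labels then reorders the legend without touching the layers.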

How can I set my own tick labels in ggplot while plotting factor values of time series?

I am plotting some time series in ggplot, with date/time data from 2008 to 2016 on the x axis. The problem is that the dates are not continuous; for instance, the last date of 2008 is
2008/05/14 19:05:12
and the next date is for 2009 something like this
2009/03/24 10:17:54
While plotting these, the result is the following
To get rid of the empty spaces, I turn my dates into factors with
dates <- factor(dates)
which gives the correct plot.
But after that I am unable to set the x tick labels, as they don't change when I use
scale_x_continuous(breaks = c(1,1724,2283,5821,8906,10112,10156,14875 ),
labels = c("2008","2009","2010","2011","2012","2013","2014","2015"))
How can I change them?
This raises a few problems, and the right solution really depends on what you're looking for. I'd suggest posting some sample data and your code so far to get a more precise answer, but here's a possibility in the meantime:
Your graph above does not show a continuous scale (though it may look like one); it shows a discrete scale with one level per unique date observation. Two problems follow from this:
Applying scale_x_continuous won't work, because the axis is discrete, and the year breaks won't be evenly spaced along it.
Your data looks evenly spread over time when it isn't, which is misleading in a visualisation.
If what you're trying to do is show change year-by-year you could sort all of your data into yearly 'bins' and plot:
library(tidyverse)
library(lubridate)
# creating random data
df <- tibble(date = as_datetime(runif(
  1000,
  as.numeric(as_datetime("2001/01/24 09:30:43")),
  as.numeric(as_datetime("2006/02/24 09:30:43"))
)))
df["val"] <- rnorm(nrow(df), 25, 5)
# use lubridate to extract year as new variable, and plot grouped years
df %>%
  mutate(year = factor(year(date))) %>%
  ggplot(aes(year, val)) +
  geom_point(position = "jitter")
Another possibility could be to use a colour scale to note your groupings by year, keeping all the dates in order but removing the gaps (and therefore not using a continuous x-axis scale):
df %>% # begin by simulating a data 'gap'
  filter(date > as_datetime("2003/07/24 09:30:43") | date < as_datetime("2002/09/24 09:30:43")) %>%
  mutate(
    year = factor(year(date)), # 'year' to select colour
    date = factor(date)
  ) %>%
  ggplot(aes(date, val, col = year)) +
  geom_point() +
  theme(
    axis.ticks.x = element_blank(), # removes all ticks and labels, as too many unique times
    axis.text.x = element_blank()
  )
If neither of those are helpful do comment below with any clarifications of what you're looking for, and I'll see if I can help!
Edit: One last idea, you could create an invisible series of points which act as the breaks for your axis ticks:
blank_labels <- tibble(
  date = as_datetime(c(
    "20020101 000000",
    "20030101 000000",
    "20040101 000000",
    "20050101 000000",
    "20060101 000000"
  )),
  col = "NA", val = 0
)
df2 <- df %>%
  filter(date > as_datetime("2003/07/24 09:30:43") | date < as_datetime("2002/09/24 09:30:43")) %>%
  mutate(col = "black") %>%
  bind_rows(blank_labels) %>%
  mutate(date_fac = factor(date))
tick_values <- left_join(blank_labels, df2, by = c("date", "col"))
df2 %>%
  ggplot(aes(date_fac, val, col = col)) +
  geom_point() +
  scale_x_discrete(breaks = tick_values$date_fac, labels = c("2002", "2003", "2004", "2005", "2006")) +
  scale_color_identity()
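A possible variation that avoids the invisible points is to use the first real observation of each year as the break. The sketch below runs on made-up data (the dates, values and the names year_breaks and year_labels are assumptions for illustration); it relies on the fact that breaks in scale_x_discrete are matched against factor levels, not against positions.

```r
library(ggplot2)

# Made-up, irregularly spaced timestamps (hypothetical data)
set.seed(1)
dates <- sort(as.POSIXct("2008-01-01", tz = "UTC") + runif(200, 0, 3 * 365 * 86400))
df_ticks <- data.frame(date = dates, val = rnorm(200))

# ISO-formatted strings sort chronologically, so the factor levels keep time order
df_ticks$date_fac <- factor(format(df_ticks$date, "%Y-%m-%d %H:%M:%S"))

# First observation of each year: its factor level becomes the tick position
yr <- format(df_ticks$date, "%Y")
first_of_year <- !duplicated(yr)
year_breaks <- as.character(df_ticks$date_fac[first_of_year])
year_labels <- yr[first_of_year]

p <- ggplot(df_ticks, aes(date_fac, val)) +
  geom_point() +
  scale_x_discrete(breaks = year_breaks, labels = year_labels)
```

The ticks then mark where each year starts in the data, so they are unevenly spaced whenever the years contain different numbers of observations.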

ggplot2 - geom_line of cumulative counts of factor levels

I want to plot the cumulative counts of level OK of factor X (*), over time (column Date). I am not sure what is the best strategy, whether or not I should create a new data frame with a summary column, or if there is a ggplot2 way of doing this.
Sample data
DF <- data.frame(
Date = as.Date(c("2018-01-01", "2018-01-01", "2018-02-01", "2018-03-01", "2018-03-01", "2018-04-01") ),
X = factor(rep("OK", 6), levels = c("OK", "NOK")),
Group = factor(c(rep("A", 4), "B", "B"))
)
DF <- rbind(DF, list(as.Date("2018-02-01"), factor("NOK"), "A"))
From similar questions I tried this:
ggplot(DF, aes(Date, col = Group)) + geom_line(stat='bin')
Using stat='count' (as the answer to this question) is even worse:
ggplot(DF, aes(Date, col = Group)) + geom_line(stat='count')
which shows the counts for factor levels (*), but not the accumulation over time.
Desperate measure - count with table
I tried creating a new data frame with counts using table like this:
cum <- as.data.frame(table(DF$Date, DF$Group))
ggplot(cum, aes(Var1, cumsum(Freq), col = Var2, group = Var2)) +
geom_line()
Is there a way to do this with ggplot2? Do I need to create a new column with cumsum? If so, how should I cumsum the factor levels, by date?
(*) Obs: I could just filter the data frame to use only the intended levels with DF[DF$X == "OK", ], but I am sure someone can find a smarter solution.
One option is to compute the cumulative count with dplyr first and then plot it with ggplot2:
library(dplyr)
library(ggplot2)
DF %>%
  group_by(Group) %>%
  arrange(Date) %>%
  mutate(Value = cumsum(X == "OK")) %>% # running count of "OK" rows within each Group
  ggplot(aes(Date, y = Value, group = Group, col = Group)) +
  geom_line()
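To make the intermediate result visible, here is a base R sketch of the same running count, using ave() in place of group_by() and mutate(). The data frame reproduces the sample data from the question, and the name DF_sorted is introduced only for this illustration.

```r
# Sample data from the question, including the extra "NOK" row
DF <- data.frame(
  Date = as.Date(c("2018-01-01", "2018-01-01", "2018-02-01", "2018-03-01",
                   "2018-03-01", "2018-04-01", "2018-02-01")),
  X = factor(c(rep("OK", 6), "NOK"), levels = c("OK", "NOK")),
  Group = factor(c(rep("A", 4), "B", "B", "A"))
)

# Sort by date, then count "OK" rows cumulatively within each Group
DF_sorted <- DF[order(DF$Date), ]
DF_sorted$Value <- ave(as.integer(DF_sorted$X == "OK"), DF_sorted$Group, FUN = cumsum)

DF_sorted
```

The "NOK" row keeps the running total of its group unchanged, which is why filtering on X beforehand is not needed.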

R: using ggplot2 with a group_by data set

I can't quite figure this out. A CSV of 200+ rows assigned to data like so:
gid,bh,p1_id,p1_x,p1_y
90467,R,543333,80.184,98.824
90467,L,408045,74.086,90.923
90467,R,543333,57.629,103.797
90467,L,408045,58.589,95.937
Trying to group by p1_id and plot the mean values for p1_x and p1_y:
grp <- data %>% group_by(p1_id)
Trying to plot geom_point objects like so:
geom_point(aes(mean(grp$p1_x), mean(grp$p1_y), color=grp$p1_id))
But that isn't showing unique plot points per distinct p1_id values.
What's the missing step here?
Why not calculate the mean first?
library(dplyr)
grp <- data %>%
  group_by(p1_id) %>%
  summarise(
    mean_p1x = mean(p1_x),
    mean_p1y = mean(p1_y)
  )
Then plot:
library(ggplot2)
ggplot(grp, aes(x = mean_p1x, y = mean_p1y)) +
  geom_point(aes(color = as.factor(p1_id)))
Edit: As per @eipi10, you can also pipe directly into ggplot
data %>%
  group_by(p1_id) %>%
  summarise(
    mean_p1x = mean(p1_x),
    mean_p1y = mean(p1_y)
  ) %>%
  ggplot(aes(x = mean_p1x, y = mean_p1y)) +
  geom_point(aes(color = as.factor(p1_id)))
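To see why the original attempt gives a single point: grp$p1_x extracts the whole column, so mean(grp$p1_x) is computed over all rows regardless of group_by(). A small base R sketch on the four sample rows from the question shows the difference (aggregate() is used here only to keep the example free of package dependencies; overall_x and per_id are made-up names):

```r
# The four sample rows from the question
data <- data.frame(
  gid   = c(90467, 90467, 90467, 90467),
  bh    = c("R", "L", "R", "L"),
  p1_id = c(543333, 408045, 543333, 408045),
  p1_x  = c(80.184, 74.086, 57.629, 58.589),
  p1_y  = c(98.824, 90.923, 103.797, 95.937)
)

# mean() on a whole column returns one number, whatever the grouping
overall_x <- mean(data$p1_x)

# One row per p1_id, which is what geom_point() needs to draw one point per id
per_id <- aggregate(cbind(p1_x, p1_y) ~ p1_id, data = data, FUN = mean)
per_id
```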

Using gganimate and ggplot for a boxplot: Cumulative not working

I'm trying to produce an animation for a simulation model, and I want to show how the distribution of results changes as the simulation runs.
I've seen gganimate used for scatter plots but not for boxplots (or ideally violin plots). Here I've provided a reprex.
When I use sim_category (which is a bucket for a certain number of simulation runs) I want the result to be cumulative of all previous runs to show the total distribution.
In this example (and my actual code), cumulative = TRUE does not do this. Why is this?
library(gganimate)
library(animation)
library(ggplot2)
df = as.data.frame(structure(list(
  ID = c(1,1,2,2,1,1,2,2,1,1,2,2),
  value = c(10,15,5,10,7,17,4,12,9,20,6,17),
  sim_category = c(1,1,1,1,2,2,2,2,3,3,3,3)
)))
df$ID <- factor(df$ID, levels = (unique(df$ID)))
df$sim_category <- factor(df$sim_category, levels = (unique(df$sim_category)))
ani.options(convert = shQuote('C:/Program Files/ImageMagick-7.0.5-Q16/magick.exe'))
p <- ggplot(df, aes(ID, value, frame = sim_category, cumulative = TRUE)) +
  geom_boxplot(position = "identity")
gganimate(p)
gganimate's cumulative doesn't accumulate the data; it only keeps what was drawn in earlier frames visible in the later ones. To achieve what you want, you have to do the accumulation before building the plot, something along the following lines:
library(tidyverse)
library(gganimate)
# tibble() replaces the deprecated data_frame()
df <- tibble(
  ID = factor(c(1,1,2,2,1,1,2,2,1,1,2,2), levels = 1:2),
  value = c(10,15,5,10,7,17,4,12,9,20,6,17),
  sim_category = factor(c(1,1,1,1,2,2,2,2,3,3,3,3), levels = 1:3)
)
p <- df %>%
  pull(sim_category) %>%
  levels() %>%
  as.integer() %>%
  # for each category k, keep the rows of categories 1..k and relabel them as k
  map_df(~ df %>% filter(sim_category %in% 1:.x) %>% mutate(sim_category = .x)) %>%
  ggplot(aes(ID, value, frame = factor(sim_category))) +
  geom_boxplot(position = "identity")
# gganimate() with the frame aesthetic is the pre-1.0 gganimate interface
gganimate(p)
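The accumulation step does not depend on the animation package, so it can be checked on its own. Below is a base R sketch of it; the names accumulated and frame are made up for illustration. The guarded part at the end assumes gganimate 1.0 or later, where transition_manual() takes the place of the frame aesthetic; it is skipped if the packages are not installed.

```r
df <- data.frame(
  ID = factor(c(1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2)),
  value = c(10, 15, 5, 10, 7, 17, 4, 12, 9, 20, 6, 17),
  sim_category = rep(1:3, each = 4)
)

# For each frame k, keep all rows from categories 1..k and label them with k
frames <- sort(unique(df$sim_category))
accumulated <- do.call(rbind, lapply(frames, function(k) {
  cbind(df[df$sim_category <= k, c("ID", "value")], frame = k)
}))

# Number of rows that feed the boxplots in each frame
table(accumulated$frame)

# Assumption: gganimate >= 1.0 and ggplot2 are installed
if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("gganimate", quietly = TRUE)) {
  library(ggplot2)
  library(gganimate)
  p <- ggplot(accumulated, aes(ID, value)) +
    geom_boxplot() +
    transition_manual(frame)
  # animate(p)  # rendering additionally needs a renderer, e.g. the gifski package
}
```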
