Combining 2 columns with values in R

I'm working on a dataset about football and have done some time series analysis on it. I've calculated the number of goals per month and the number of goals in the previous month, and now I'm trying to plot both.
I'd like a grouped bar chart with the goals of each month and the goals of the previous month next to it.
This is the code that I'm using:
df_eredivisie %>%
group_by(month= month(date, label = TRUE)) %>%
summarise(goals = sum(FTHG + FTAG)) %>%
mutate(last = lag(goals, 1))
So this is the result (Sorry can't post pictures :/ ):
month goals last
Jan 69 NA
Feb 121 69
Mar 116 121
Apr 155 116
May 78 155
Aug 88 78
Sep 124 88
Oct 91 124
Nov 91 91
Dec 128 91
Could someone help me with the grouped bar chart? I've tried to combine the columns, so I could do fill and have the goals and last with different colours. But I couldn't figure out how to do that.

Your data need to be in long format, then it's simple:
library(ggplot2)
library(tidyverse)
df <- tribble(~month, ~goals, ~last,
"Jan", 69, NA,
"Feb", 121, 69,
"Mar", 116, 121,
"Apr", 155, 116,
"May", 78, 155,
"Aug", 88, 78,
"Sep", 124, 88,
"Oct", 91, 124,
"Nov", 91, 91,
"Dec", 128, 91)
df %>%
pivot_longer(cols = 2:3, names_to = "category") %>%
mutate(month = factor(month, levels = month.abb)) %>%
ggplot(aes(x = month, y = value, fill = category)) +
geom_col(position = "dodge")
#> Warning: Removed 1 rows containing missing values (geom_col).
Created on 2020-06-07 by the reprex package (v0.3.0)
If you reverse the factors, it looks like this:
df %>%
pivot_longer(cols = 2:3, names_to = "category") %>%
mutate(month = factor(month, levels = month.abb)) %>%
ggplot(aes(x = month, y = value, fill = forcats::fct_rev(category))) +
geom_col(position = "dodge")
#> Warning: Removed 1 rows containing missing values (geom_col).
So it works, but the second column does not add any information, as you can see the previous month right next to it...

Related

How to do countif in R based on dates

I have this data in an Excel file, and there is too much of it to count by hand in Excel. I want to count how many days in each month have a value of more than 50.
I'd like to turn it into something like:
Could someone help me to solve this?
Another option is count with as.yearmon from zoo - filter the rows where 'Value' is greater than 50, then use count after converting to yearmon class with as.yearmon
library(dplyr)
library(zoo)
df %>%
filter(Value > 50) %>%
count(month_year = as.yearmon(Date))
-output
month_year n
1 Jan 2010 3
2 Feb 2010 1
data
df <- structure(list(Date = structure(c(14610, 14611, 14612, 14618,
14618, 14624, 14641), class = "Date"), Value = c(27, 35, 78,
88, 57, 48, 99)), class = "data.frame", row.names = c(NA, -7L
))
Suppose your data is given by
df <- data.frame(Date = as.Date(c("1/1/2010", "1/2/2010", "1/3/2010", "1/9/2010", "1/9/2010", "1/15/2010", "2/1/2010"), "%m/%d/%Y"),
Value = c(27, 35, 78, 88, 57, 48, 99))
To count your specific values you could use
library(dplyr)
df %>%
group_by(month_year = format(Date, "%m-%y")) %>%
summarise(count = sum(Value > 50))
which returns
# A tibble: 2 x 2
month_year count
<chr> <int>
1 01-10 3
2 02-10 1
Note: Your Date column has to contain dates (as in as.Date).
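If the dates arrive from Excel as text, they have to be converted before either approach above will work. Here is a minimal base R sketch, assuming month/day/year strings. The data frame `raw` and its values are made up for illustration:

```r
# Hypothetical raw input: dates stored as text in month/day/year order
raw <- data.frame(
  Date  = c("1/1/2010", "1/3/2010", "2/1/2010"),
  Value = c(27, 78, 99),
  stringsAsFactors = FALSE
)

# Convert the text to Date; the format string must match the input layout
raw$Date <- as.Date(raw$Date, format = "%m/%d/%Y")

# Count the days per year-month whose Value exceeds 50 (base R equivalent of a countif)
counts <- tapply(raw$Value > 50, format(raw$Date, "%Y-%m"), sum)
counts
```

Using "%Y-%m" as the key keeps the months in chronological order when sorted, which "%m-%y" does not do across several years.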

Summing ranks for variable with fewest entries

I am learning R and want to manually compute the Mann-Whitney U statistic and p-value using a normal approximation (and not use wilcox.test or equivalent). My pensioner's brain struggles with coding so it has taken me hours to produce the same answers as the textbook. However, my code to sum the 'StateRank' for the state with the fewest values is convoluted. How can I replace the commented section with more efficient code? I've hunted high and low, both here and on Google, but I don't even know which search terms to use! It won't surprise me to hear that there is a one-line solution but I'm no nearer knowing what it is.
library(tidyverse)
# Activity 9: aboriginal village size in Alaska and California
a.df <- data.frame(
Alaska = c(23, 26, 30, 33, 42, 45, 45, 50, 50.5, 96, 113, 557, NA),
Calif = c(39, 48, 53.5, 55, 57, 66, 77, 79, 108, 121, 162, 197, 309)
) %>%
pivot_longer(
cols = c("Alaska", "Calif"),
names_to = "State",
values_to = "Value",
values_drop_na = TRUE
) %>%
mutate(StateRank = rank(Value, ties.method = "average"))
# clumsy code to sort, then sum ranks (StateRank) for group with fewest values (nA)
#--------------------------------------------------------------------------------
asc_or_desc <- as.matrix(count(a.df, State))
if (as.numeric(asc_or_desc[1,2])>as.numeric(asc_or_desc[2,2])) {
a.df <- arrange(a.df, desc(State))
} else {
a.df <- arrange(a.df, State)
}
#--------------------------------------------------------------------------------
nA <- as.numeric(min(count(a.df, State, sort = TRUE)$n))
nB <- as.numeric(max(count(a.df, State, sort = TRUE)$n))
a.U <- sum(a.df$StateRank[1:nA])
a.E <- (nA*(nA+nB+1))/2 # Expectation of U
a.V <- (nA*nB*(nA+nB+1))/12 # Variance of U
a.Z <- (a.U - a.E)/sqrt(a.V)
a.P <- round((1 - round(pnorm(round(abs(a.Z), 2),
mean = 0, sd = 1) ,4)) * 2, 3)
# all the rounding is to mimic statistical tables (so that
# the answer is the same as in the textbook that I use)
Please try this code and tell me if I am on the right track.
I replaced the code you called clumsy with this:
... %>%
group_by(State) %>%
mutate(mx = max(Value)) %>%
arrange(desc(mx), desc(Value)) %>%
select(-mx)
The whole code:
library(tidyverse)
# Activity 9: aboriginal village size in Alaska and California
a.df <- data.frame(
Alaska = c(23, 26, 30, 33, 42, 45, 45, 50, 50.5, 96, 113, 557, NA),
Calif = c(39, 48, 53.5, 55, 57, 66, 77, 79, 108, 121, 162, 197, 309)
) %>%
pivot_longer(
cols = c("Alaska", "Calif"),
names_to = "State",
values_to = "Value",
values_drop_na = TRUE
) %>%
mutate(StateRank = rank(Value, ties.method = "average")) %>%
group_by(State) %>%
mutate(mx = max(Value)) %>%
arrange(desc(mx), desc(Value)) %>%
select(-mx)
# group sizes, as in the question: nA is the smaller group, nB the larger
nA <- min(count(ungroup(a.df), State)$n)
nB <- max(count(ungroup(a.df), State)$n)
a.U <- sum(a.df$StateRank[1:nA])
a.E <- (nA*(nA+nB+1))/2 # Expectation of U
a.V <- (nA*nB*(nA+nB+1))/12 # Variance of U
a.Z <- (a.U - a.E)/sqrt(a.V)
a.P <- round((1 - round(pnorm(round(abs(a.Z), 2),
mean = 0, sd = 1) ,4)) * 2, 3)
# all the rounding is to mimic statistical tables (so that
# the answer is the same as in the textbook that I use)
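Note that sorting by each state's maximum value only puts the smaller group first when that group also holds the largest value, which happens to be true for this data. A sketch that avoids sorting altogether, assuming base R only, is to sum the ranks per state and then pick the state with the fewest entries. The helper names `sizes`, `rank_sums` and `smallest` are introduced here for illustration:

```r
# Village sizes as in the question (the NA padding is simply left out)
alaska <- c(23, 26, 30, 33, 42, 45, 45, 50, 50.5, 96, 113, 557)
calif  <- c(39, 48, 53.5, 55, 57, 66, 77, 79, 108, 121, 162, 197, 309)

a.df <- data.frame(
  State = rep(c("Alaska", "Calif"), times = c(length(alaska), length(calif))),
  Value = c(alaska, calif)
)
a.df$StateRank <- rank(a.df$Value, ties.method = "average")

sizes     <- table(a.df$State)                        # number of entries per state
rank_sums <- tapply(a.df$StateRank, a.df$State, sum)  # rank sum per state
smallest  <- names(sizes)[which.min(sizes)]           # state with the fewest entries

nA  <- as.integer(min(sizes))
nB  <- as.integer(max(sizes))
a.U <- rank_sums[[smallest]]                          # rank sum of the smaller group
```

If both groups have the same size, `which.min` picks the first one alphabetically, so that case may need a deliberate choice.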

ggplot boxplot with custom X-Axis and grouping and sorting on separate values

I'm trying to create a boxplot based on timeseries data for multiple years. I want to group observations from multiple years by a variable "DAP" (similar to day of year 0-365), order them by day from November to March but only display the Month on the X-Axis.
I can create a custom order and X-Axis by creating a factor with each month, that works
level_order <- c('November', 'December', 'January', 'February', 'March')
plot <- ggplot(data = df, aes(y = y, x = factor(Month,level = level_order), group=DAP)) +
geom_boxplot(fill="grey85", width = 2.0) +
scale_x_discrete(limits = level_order)
plot
Now I'm stuck aligning the boxes on the X-Axis according to the day of the month. For example, the first data point, from November 26th, needs to sit further right, closer to December.
Changing the X-Axis to "Date" creates monthly labels for each year and also removes the grouping.
plot <- ggplot(data = df, aes(y = y, x = Date, group=DAP)) +
geom_boxplot(fill="grey85")
plot + scale_x_date(date_breaks = "1 month", date_labels = "%B")
Setting the X-Axis to "DAP" instead of the date gives me the correct order and spacing, but I need to display the month on the X-Axis. How can I combine this last graph with the X-Axis labeling of the first graph?
plot <- ggplot(data = df, aes(y = y, x = DAP, group=DAP)) +
geom_boxplot(fill="grey85")
plot
and here a sample of the dataset
DAP Date Month y
1 47 2010-11-26 November 0.6872708
21 116 2011-02-03 February 0.7643213
41 68 2011-12-17 December 0.7021531
61 137 2012-02-24 February 0.7178306
81 92 2013-01-10 January 0.7330749
101 44 2013-11-23 November 0.6610618
121 113 2014-01-31 January 0.7961012
141 68 2014-12-17 December 0.7510821
161 137 2015-02-24 February 0.7799938
181 92 2016-01-10 January 0.6861423
201 47 2016-11-26 November 0.7155526
221 116 2017-02-03 February 0.7397810
241 72 2017-12-21 December 0.7259670
261 144 2018-03-03 March 0.6725775
281 106 2019-01-24 January 0.7637322
301 65 2019-12-14 December 0.7184616
321 134 2020-02-21 February 0.6760159
The following approach uses tidyverse. The date is separated into year-month-day and those newly created columns are made numeric. In the ggplot part position_dodge2(preserve = "single") is used which keeps the boxwidth the same. scale_x_discrete helps to redefine x-axis breaks and tick labels. width = 1 controls the distance between the boxes.
library(tidyverse)
df <- tibble::tribble(
~DAP, ~Date, ~Month, ~y,
47, "2010-11-26", "November", 0.6872708,
116, "2011-02-03", "February", 0.7643213,
68, "2011-12-17", "December", 0.7021531,
137, "2012-02-24", "February", 0.7178306,
92, "2013-01-10", "January", 0.7330749,
44, "2013-11-23", "November", 0.6610618,
113, "2014-01-31", "January", 0.7961012,
68, "2014-12-17", "December", 0.7510821,
137, "2015-02-24", "February", 0.7799938,
92, "2016-01-10", "January", 0.6861423,
47, "2016-11-26", "November", 0.7155526,
116, "2017-02-03", "February", 0.7397810,
72, "2017-12-21", "December", 0.7259670,
144, "2018-03-03", "March", 0.6725775,
106, "2019-01-24", "January", 0.7637322,
65, "2019-12-14", "December", 0.7184616,
134, "2020-02-21", "February", 0.6760159
)
df$Date <- as.Date(df$Date)
df %>%
separate(Date, sep = "-", into = c("year", "month", "day")) %>%
mutate_at(vars("year":"day"), as.numeric) %>%
select(-c(year, Month)) %>%
ggplot(aes(
x = factor(month, level = c(11, 12, 1, 2, 3)), y = y,
group = DAP, color = factor(month)
)) +
geom_boxplot(width = 1, lwd = 0.2, position = position_dodge2(preserve = "single")) +
scale_x_discrete(
breaks = c(11, 12, 1, 2, 3),
labels = c("November", "December", "January", "February", "March")
) +
labs(x = "") +
theme(legend.position = "none")
Try this. To get the right order, spacing and labels I make a new date. As the year does not seem to be relevant, I set the year to 2019 for the November and December observations and to 2020 for all others.
df <- structure(list(DAP = c(
47L, 116L, 68L, 137L, 92L, 44L, 113L,
68L, 137L, 92L, 47L, 116L, 72L, 144L, 106L, 65L, 134L
), Date = c(
"2010-11-26",
"2011-02-03", "2011-12-17", "2012-02-24", "2013-01-10", "2013-11-23",
"2014-01-31", "2014-12-17", "2015-02-24", "2016-01-10", "2016-11-26",
"2017-02-03", "2017-12-21", "2018-03-03", "2019-01-24", "2019-12-14",
"2020-02-21"
), Month = c(
"November", "February", "December",
"February", "January", "November", "January", "December", "February",
"January", "November", "February", "December", "March", "January",
"December", "February"
), y = c(
0.6872708, 0.7643213, 0.7021531,
0.7178306, 0.7330749, 0.6610618, 0.7961012, 0.7510821, 0.7799938,
0.6861423, 0.7155526, 0.739781, 0.725967, 0.6725775, 0.7637322,
0.7184616, 0.6760159
)), row.names = c(NA, -17L), class = "data.frame")
library(ggplot2)
# Make a new Date to get the correct order as with DAP.
# Set year for obs November and December to 2019,
# for other obs to 2020.
df$Date1 <- gsub("20\\d{2}-(1\\d{1})", "2019-\\1", df$Date)
df$Date1 <- gsub("20\\d{2}-(0\\d{1})", "2020-\\1", df$Date1)
df$Date1 <- as.Date(df$Date1)
# Using the new date gives the correct order, spacing and labels
# Also adjusted limits
plot <- ggplot(data = df, aes(y = y, x = Date1, group = DAP)) +
geom_boxplot(fill = "grey85")
plot +
scale_x_date(date_breaks = "1 month", date_labels = "%B", limits = c(as.Date("2019-11-01"), as.Date("2020-03-31")))
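The same year shift can also be done with date formatting instead of regular expressions. A minimal sketch in base R, assuming that months 11 and 12 belong to the first year of the season and everything else to the second. The names `d`, `season_year` and `d1` are made up for illustration:

```r
# Three of the sample dates from the question
d <- as.Date(c("2010-11-26", "2011-02-03", "2018-03-03"))

# November and December go to 2019, all other months to 2020
m <- as.integer(format(d, "%m"))
season_year <- ifelse(m >= 11, 2019L, 2020L)

# Rebuild each date with the common year but the original month and day
d1 <- as.Date(sprintf("%d-%s", season_year, format(d, "%m-%d")))
d1
```

Choosing 2020 for the second year has the side benefit that it is a leap year, so an observation on 29 February would still convert to a valid date.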

Special Stacked Bar Chart R ggplot

Can you help me make the following bar chart in R? I have some simplified dummy data that I am using to recreate it, and my plan is then to manipulate the real data in the same way. No need to do the abline. The most important part is the waterfall aspect.
labels value
1 start 100
2 january 120
3 febuary 140
4 march 160
5 april 180
6 may 130
7 june 140
8 july 170
9 august 160
10 september 180
11 october 190
12 november 210
13 december 200
14 end 200
This gets you the waterfall effect:
library(tidyverse)
df <-
tibble::tribble(
~month, ~month_name, ~value,
1, "start", 100,
2, "january", 120,
3, "febuary", 140,
4, "march", 160,
5, "april", 180,
6, "may", 130,
7, "june", 140,
8, "july", 170,
9, "august", 160,
10, "september", 180,
11, "october", 190,
12, "november", 210,
13, "december", 200,
14, "end", 200
) %>%
mutate(
type = case_when(
month == min(month) ~ "Initial",
month == max(month) ~ "Final",
value > lag(value) ~ "Increase",
TRUE ~ "Decrease"
),
finish = value,
start = if_else(month == max(month), 0, replace_na(lag(value), 0))
)
df %>%
ggplot(aes(xmin = month - 0.3, xmax = month + 0.3, ymin = start, ymax = finish, fill = type)) +
geom_rect() +
scale_x_continuous(
breaks = 1:14,
labels = df %>% select(month_name) %>% pull()
) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none"
)
You should be able to take care of the formatting and colors from here ;)

how to plot multiple daily time series in one plot

I have time series data of 27 days (from 2018-04-09 to 2018-05-15 without weekends) with 7 observations per day (08:00 to 20:00 every two hours) and two variables per observation (di and eu).
I want to plot all days as line plots in one plot.
I found solutions to plot one plot per day with a ggplot facet plot and I found solutions to plot the whole timeseries in one plot (di and eu from 2018-04-09 to 2018-05-15).
But I found nothing that lets me overlay the 27 daily plots for one variable in a single 8:00 to 20:00 plot.
The first three days as example data with dput():
structure(list(date_time = structure(c(1523260800, 1523268000,
1523275200, 1523282400, 1523289600, 1523296800, 1523304000, 1523347200,
1523354400, 1523361600, 1523368800, 1523376000, 1523383200, 1523390400,
1523433600, 1523440800, 1523448000, 1523455200, 1523462400, 1523469600,
1523476800), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
di = c(75, 90, 35, 70, 75, 15, 5, 65, 55, 15, 15, 0, NA,
15, 55, 55, 5, 25, NA, 60, NA), eu = c(15, 0, 65, 30, 15,
65, 70, 40, 45, 75, 75, 100, NA, 85, 45, 30, 90, 65, NA,
20, NA)), row.names = c(NA, -21L), class = c("tbl_df", "tbl",
"data.frame"))
A plot with all 27 days in one plot may look confusing, but I'd like to try it to see whether it makes a trend in the data obvious. A plot for each weekday would be a nice addition.
You could determine the day and hour up front and then plot with respective groups like this:
library(tidyverse)
library(lubridate)
df %>%
gather(metric, value, -date_time) %>%
mutate(
hour_of_day = hour(date_time),
# use the full date, not the day of the month, so 9 April and 9 May stay separate lines
day = as_date(date_time)
) %>%
ggplot(aes(x = hour_of_day, y = value)) +
geom_line(aes(group = day)) +
facet_wrap( ~ metric)
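For the per-weekday plot mentioned in the question, one possible sketch is to facet on the weekday while still grouping the lines by calendar date. It assumes ggplot2 is installed. The data frame `obs` holds a few made-up rows in the shape of the question's data, and the ISO weekday number from `format(..., "%u")` (1 = Monday) is used so the result does not depend on the locale:

```r
library(ggplot2)

# Hypothetical observations for two days, in the shape of the question's data
obs <- data.frame(
  date_time = as.POSIXct(
    c("2018-04-09 08:00", "2018-04-09 10:00", "2018-05-09 08:00", "2018-05-09 10:00"),
    tz = "UTC"
  ),
  di = c(75, 90, 65, 55)
)

obs$day         <- as.Date(obs$date_time)                    # one line per calendar date
obs$hour_of_day <- as.integer(format(obs$date_time, "%H"))   # x position within the day
obs$weekday     <- format(obs$date_time, "%u")               # ISO weekday, 1 = Monday

# One panel per weekday, one line per date inside each panel
p <- ggplot(obs, aes(x = hour_of_day, y = di, group = day)) +
  geom_line() +
  facet_wrap(~ weekday)
p
```

With the full data set each panel would overlay all dates that fall on the same weekday, which keeps the number of lines per panel small.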
