ggplot loop over unique values of a variable within a group - r

I made a loop to make a plot for every unique value of a variable within a group. To make my code reproducible I used the nycflights13 package. Unfortunately, with this example data my code gives the desired result. In my own data, however, there are flight origins that don't occur in a certain year, giving me an empty plot for that origin in that year. I would like that within one group (in this example, year), only the origins that actually occur in that year are shown. Could somebody help me out?
library(nycflights13)
library(tidyverse)
plotter_de_plot <- function(origination, YEARR){
  eval(substitute(origination), flights)
  eval(substitute(YEARR), flights)
  flights %>%
    subset(year == YEARR) %>%
    select(month, origin, hour, year) %>%
    group_by(origin, month) %>%
    mutate(AMOUNT = sum(hour, na.rm = TRUE)) %>%
    filter(!is.na(hour),
           origin == origination, year == YEARR) %>%
    ggplot(aes(month, AMOUNT), na.rm = TRUE) +
    geom_point() +
    labs(title = origination, subtitle = YEARR)
}

for (i in unique(flights$origin)){
  plot(plotter_de_plot(i, 2013))
}

In addition to stefan's answer, which addresses the problem perfectly, I would recommend using purrr::map instead of your for loop:
my_plots <- unique(flights$origin) %>%
  set_names() %>%
  map(plotter_de_plot, YEARR = 2013)

my_plots$EWR
my_plots$LGA
my_plots$JFK
This way, you can access each plot inside a list. Another way would be to use facets.
Also, your plots are absurdly heavy (several MB each) and might take a long time to render. That is because you are using mutate() instead of summarise().
Here is an example with facets that took <1 sec to compute:
flights %>%
  filter(year == 2013) %>%
  select(month, origin, hour, year) %>%
  group_by(origin, month) %>%
  summarise(AMOUNT = sum(hour, na.rm = TRUE)) %>%
  ggplot(aes(month, AMOUNT), na.rm = TRUE) +
  geom_point() +
  labs(subtitle = "Year 2013") +
  facet_wrap(~origin)

One option would be to break your pipeline into two parts: data wrangling and plotting. Doing so, you could check whether the filtered and aggregated dataset contains any rows (e.g. nrow(d) > 0) and return NULL if it doesn't. In your for loop you could then check for NULL before plotting:
To mimic your use case I used flights$year[flights$origin == "EWR"] <- 2015 so that the example data includes an origin with no data for year 2013:
library(nycflights13)
library(tidyverse)
plotter_de_plot <- function(origination, YEARR) {
  d <- flights %>%
    select(month, origin, hour, year) %>%
    filter(!is.na(hour),
           origin == origination, year == YEARR) %>%
    group_by(month) %>%
    mutate(AMOUNT = sum(hour, na.rm = TRUE))

  if (nrow(d) > 0) {
    ggplot(d, aes(month, AMOUNT), na.rm = TRUE) +
      geom_point() +
      labs(title = origination, subtitle = YEARR)
  }
}

flights$year[flights$origin == "EWR"] <- 2015

for (i in unique(flights$origin)) {
  p <- plotter_de_plot(i, 2013)
  if (!is.null(p)) plot(p)
}

Related

Getting candlestick chart to display properly using a text / .txt file of historic stock prices in R

Hello there,
I have purchased the historic intraday prices of the S&P 500 (1-minute through 1-hour bars) going back to 2005, because most stock charting packages stop reporting intraday prices around 2016 or 2011. I have successfully imported the prices and gotten R to read only market hours, excluding premarket and aftermarket. Two problems remain. First, I need the chart not to show Saturday and Sunday. The bigger problem is that the plot is NOT showing candlesticks but bars, and they are very hard to read. I have tried increasing the size via (size = 4), but the bars overlap and are still not candlesticks. How can I get these to show as proper candlesticks? Thank you.
library(quantmod)
library(tidyquant)
library(tidyverse)
library(ggplot2)
library(readr)
library(ggforce)
library(dplyr)
library(hms)   # needed for as_hms()

dir <- "E:/Stock Trading/Historical Data/SPY_qjrt28"
setwd(dir)
data <- read_csv("SPY_30min.txt", col_names = FALSE)
names(data) <- tolower(c("DateTime", "Open", "High", "Low", "Close", "Volume"))
data

# clean the data
write_rds(data, "cleaned.rds")
read_rds("cleaned.rds")
spy30m <- read_rds("cleaned.rds")
firstwave <- filter(spy30m, datetime >= as.Date('2009-03-02'), datetime <= as.Date('2009-03-19'))

# adding more time objects to the dataset
data <- data %>%
  mutate(hour = hour(datetime),
         minute = minute(datetime),
         hms = as_hms(datetime))

# is the hour function working as expected? Yes!
data %>%
  select(datetime, hour) %>%
  sample_n(10)

# look at bins of observations at 30 minute intervals. Looks good!
data %>%
  group_by(hms) %>%
  summarise(count = n()) %>%
  arrange(hms) %>%
  print(n = 100)

# filter the dataset to only include the times during regular market hours
data_regularmkt <- data %>%
  # `filter` is the dplyr function that limits the number of observations in a data frame
  # `between` takes 3 arguments: an object/variable, a lower bound value, and an upper bound value
  filter(between(hms, as_hms("09:30:00"), as_hms("16:00:00")))

# look at it again
data_regularmkt %>%
  group_by(hms) %>%
  summarise(count = n()) %>%
  arrange(hms) %>%
  print(n = 100)

###########
firstwave <- filter(spy30m, datetime >= as.Date('2009-03-06'), datetime <= as.Date('2009-03-19'))
ggplot(firstwave, aes(x = datetime, y = close)) +
  geom_candlestick(aes(open = open, high = high, low = low, close = close, size = 3))
Say we have a data frame df with the columns date (dttm format), open, high, low, close.
To overcome the issue that non-trading hours are shown, my first idea was to use another x-axis scale. Here's with a row-index.
library(tidyverse)
library(lubridate)
library(tidyquant)

df <- df %>%
  arrange(date) %>%
  mutate(i = row_number())

# this is for the x-axis labels
df_x <- df %>%
  group_by(d = floor_date(date, "day")) %>%
  filter(date %in% c(min(date)))

df %>%
  ggplot(aes(x = i)) +
  geom_candlestick(aes(open = open, low = low, high = high, close = close)) +
  scale_x_continuous(breaks = df_x$i,
                     labels = df_x$date)
The problem then is that if a contract is halted during trading hours, there will be no data for that period either, just like at night or over the weekend, yet those are gaps you probably do want to show. One could probably play with tidyr's complete() or expand() to fill in the missing timestamps first and still use my solution of plotting over an index x-scale, as sketched below.
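A minimal sketch of that idea, assuming df holds 30-minute bars in a dttm column named date (the 30-minute step and the trading-hours window are assumptions):

library(tidyverse)
library(lubridate)
library(hms)

df_complete <- df %>%
  # insert a row for every 30-minute slot between the first and last bar ...
  complete(date = seq(min(date), max(date), by = "30 min")) %>%
  # ... then keep only regular trading hours on weekdays, so halts stay visible
  # as empty slots while nights and weekends are still dropped from the axis
  filter(between(as_hms(date), as_hms("09:30:00"), as_hms("16:00:00")),
         !wday(date) %in% c(1, 7)) %>%
  arrange(date) %>%
  mutate(i = row_number())   # re-derive the row index used as the x-axis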
An easier option could be to use the plotly library:
library(plotly)

plt <- plot_ly(data = df, x = ~date,
               open = ~open, close = ~close,
               high = ~high, low = ~low,
               type = "candlestick")
plt
This is to hide the non-trading hours:
plt %>% layout(showlegend = FALSE,
               xaxis = list(rangebreaks = list(
                              list(bounds = list(17, 9),
                                   pattern = "hour")),  # hide hours outside of 9am-5pm
                            dtick = 86400000.0 / 2,
                            tickformat = "%H:%M\n%b\n%Y"))
More information can be found here: https://plotly.com/r/time-series/#hiding-nonbusiness-hours and https://plotly.com/r/candlestick-charts/
As for you not liking the appearance of tidyquant's geom_candlestick, I also suggest you try out Plotly.

Why is my bar chart not showing all the data?

I am working on a music streaming project, and I am trying to get the top 15 global streams in 2020 and turn them into an interactive graph.
The top 15 song names show up correctly as a dataframe, but the chart fails as a bar graph; I wonder where I went wrong here? It did work after I flipped the bar graph to horizontal, but the data then look a bit off.
It looks like this as a vertical bar graph:
The horizontal bar graph looks like this, but the data seem incorrect:
Here is the code I have:
library("dplyr")
library("ggplot2")
# load the .csv into R studio, you can do this 1 of 2 ways
#read.csv("the name of the .csv you downloaded from kaggle")
spotiify_origional <- read.csv("charts.csv")
spotiify_origional <- read.csv("https://raw.githubusercontent.com/info201a-au2022/project-group-1-section-aa/main/data/charts.csv")
View(spotiify_origional)
# filters down the data
# removes the track id, explicit, and duration columns
spotify_modify <- spotiify_origional %>%
select(name, country, date, position, streams, artists, genres = artist_genres)
#returns all the data just from 2022
#this is the data set you should you on the project
spotify_2022 <- spotify_modify %>%
filter(date >= "2022-01-01") %>%
arrange(date) %>%
group_by(date)
# use write.csv() to turn the new dataset into a .csv file
write.csv(Your DataFrame,"Path to export the DataFrame\\File Name.csv", row.names = FALSE)
write.csv(spotify_2022, "/Users/oliviasapp/Documents/info201/project-group-1-section-aa/data/spotify_2022.csv" , row.names = FALSE)
# then I pushed the spotify_2022.csv to the GitHub repo
View(spotiify_origional)
spotify_2022_global <- spotify_modify %>%
filter(date >= "2022-01-01") %>%
filter(country == "global") %>%
arrange(date) %>%
group_by(streams)
View(spotify_2022_global)
top_15 <- spotify_2022_global[order(spotify_2022_global$streams, decreasing = TRUE), ]
top_15 <- top_15[1:15,]
top_15$streams <- as.numeric(top_15$streams)
View(top_15)
col_chart <- ggplot(data = top_15) +
geom_col(mapping = aes(x = name, y = streams)) +
ggtitle("Top 15 Songs Daily Streamed Globally") +
theme(plot.title = element_text(hjust = 0.5))
col_chart <- col_chart + coord_cartesian(ylim = c(999000,1000000)) + coord_flip()
col_chart
Thank you so much! Any suggestions will hugely help!
top_15 <- spotify_2022_global[order(spotify_2022_global$streams, decreasing = TRUE), ]
This code sorts in decreasing order, but the streams column here is still of character type, so a string like "999975" sorts as "higher" than "1000000", which is why your data look weird. One song had two weeks just under 1M streams, which is why it shows up with roughly 2M.
If you use this instead you'll get more what you intended:
top_15 <- spotify_2022_global[order(as.numeric(spotify_2022_global$streams), decreasing = TRUE), ]
However, this finds the highest song-weeks, not the highest songs, so in this case all 15 highest song-weeks belong to a single song.
I'd suggest you group_by(name) and then summarize() to get total streams by song, filter to the top 15, and then make name an ordered factor, e.g. with forcats::fct_reorder.
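A minimal sketch of that suggestion, assuming spotify_2022_global from the question (total_streams is just an illustrative column name):

library(dplyr)
library(forcats)
library(ggplot2)

top_15 <- spotify_2022_global %>%
  ungroup() %>%
  mutate(streams = as.numeric(streams)) %>%        # convert before summing/sorting
  group_by(name) %>%
  summarize(total_streams = sum(streams), .groups = "drop") %>%
  slice_max(total_streams, n = 15) %>%
  mutate(name = fct_reorder(name, total_streams))  # order the bars by value

ggplot(top_15, aes(x = name, y = total_streams)) +
  geom_col() +
  coord_flip() +
  ggtitle("Top 15 Songs Streamed Globally")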

Manipulating data.frame while using cycles and storing values in a list

I have two pieces of code that manipulate and filter (by date) my data.frame and that work perfectly. Now I want to run the code not just for one day, but for every day in the vector:
seq(from = as.Date('2020-03-02'), to = Sys.Date(), by = 'days')  # ... 538 days
The code I want to run for all the days between 2020-03-02 and today is:
KOKOKO <- data.frame %>%
  filter(DATE < '2020-03-02') %>%
  summarize(DATE = '2020-03-02', CZK = sum(Objem.v.CZK, na.rm = TRUE))

STAVPTF <- data.frame %>%
  filter(DATE < '2020-03-02') %>%
  group_by(CP) %>%
  summarize(mnozstvi = last(AKTUALNI_MNOZSTVI_AKCIE), DATE = '2020-03-02') %>%
  select(DATE, CP, mnozstvi) %>%
  rbind(KOKOKO) %>%
  drop_na()
So instead of '2020-03-02' I want to fill in every day since '2020-03-02', one after another. Each KOKOKO and STAVPTF created for a given day should be saved as a separate data.frame, and all of them stored in a list.
We could use map to loop over the sequence and apply the code
library(dplyr)
library(purrr)
out <- map(s1, ~ data.frame %>%
  filter(DATE < .x) %>%
  summarize(DATE = .x, CZK = sum(Objem.v.CZK, na.rm = TRUE)))
As this is a repeated cycle, a function would make it cleaner:
f1 <- function(dat, date_col, group_col, Objem_col, aktualni_col, date_val) {
  filtered <- dat %>%
    filter({{date_col}} < date_val)

  KOKOKO <- filtered %>%
    summarize({{date_col}} := date_val,
              CZK = sum({{Objem_col}}, na.rm = TRUE))

  STAVPTF <- filtered %>%
    group_by({{group_col}}) %>%
    summarize(mnozstvi = last({{aktualni_col}}),
              {{date_col}} := date_val) %>%
    select({{date_col}}, {{group_col}}, mnozstvi) %>%
    bind_rows(KOKOKO) %>%
    drop_na()

  return(STAVPTF)
}
and call as
map(s1, ~ f1(data.frame, DATE, CP, Objem.v.CZK, AKTUALNI_MNOZSTVI_AKCIE, !!.x))
where
s1 <- seq(from=as.Date('2020-03-02'), to=Sys.Date(), by='days')
It would be easier to answer your question if you provided a minimal reproducible example. That is easily done with the tidyverse's reprex package.
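For instance, a minimal sketch (the snippet inside the braces is just a placeholder):

library(reprex)

# Wrap a small, self-contained snippet (library() calls plus sample data) in
# reprex(); the rendered code and its output are copied to the clipboard,
# ready to paste into a question.
reprex({
  library(dplyr)
  df <- tibble(DATE = as.Date('2020-03-01') + 0:2, CZK = c(1, 2, 3))
  df %>% summarize(total = sum(CZK))
})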
However, your KOKOKO code can be rewritten as a simple cumulative sum:
KOKOKO <- data.frame %>%
  arrange(DATE) %>%                                         # if necessary
  group_by(DATE) %>%
  summarise(CZK = sum(Objem.v.CZK), .groups = 'drop') %>%   # summarise per DATE (if necessary)
  mutate(CZK = cumsum(CZK) - CZK)                           # cumulative sum excluding the current row (current DATE)
Even the STAVPTF code can probably be rewritten without iterations. First find the last value of AKTUALNI_MNOZSTVI_AKCIE per CP and DATE; then assign this value to the next DATE:
STAVPTF <- data.frame %>%
  group_by(CP, DATE) %>%
  summarise(mnozstvi = last(AKTUALNI_MNOZSTVI_AKCIE), .groups = 'drop_last') %>%
  arrange(DATE) %>%   # if necessary
  mutate(DATE = lead(DATE))

Passing arguments dynamically in Expss tables with user-defined functions

I have a (new) question related to expss tables. I wrote a very simple UDF (relying on a few expss functions), as follows:
library(expss)
z_indices <- function(x, m_global, std_global, weight = NULL){
  if(is.null(weight)) weight <- rep(1, length(x))
  z <- (w_mean(x, weight) - m_global) / std_global
  indices <- 100 + (z * 100)
  return(indices)
}
Reproducible example, based on the infert dataset (plus a vector of arbitrary weights):
data(infert)
infert$w <- as.vector(x=rep(2, times=nrow(infert)), mode='numeric')
infert %>%
  tab_cells(age, parity) %>%
  tab_cols(total(), education, case %nest% list(total(), education)) %>%
  tab_weight(w) %>%
  tab_stat_valid_n(label = "N") %>%
  tab_stat_mean(label = "Mean") %>%
  tab_stat_fun(label = "Z", function(x, m_global, std_global, weight = NULL){
    z_indices(x, m_global = w_mean(infert$age, infert$w), std_global = w_sd(infert$age, infert$w))
  }) %>%
  tab_pivot(stat_position = "inside_columns")
The table is computed and the output for the first line is (almost) as expected.
Then things go awry for the second line, since both arguments of z_indices explicitly refer to infert$age, where infert$parity is expected.
My question: is there a way to dynamically pass the variables of tab_cells as function arguments within tab_stat_fun, so that they match the variable being processed? I guess this happens inside the function declaration, but I have no clue how to proceed...
Thanks!
EDIT April 28th 2020:
The answer from @Gregory Demin works great in the scope of the infert dataset, although for better scalability to larger dataframes I wrote the following loop:
var_df <- data.frame("age" = infert$age, "parity" = infert$parity)
tabZ <- infert
for(each in names(var_df)){
  tabZ <- tabZ %>%
    tab_cells(var_df[each]) %>%
    tab_cols(total(), education) %>%
    tab_weight(w) %>%
    tab_stat_valid_n(label = "N") %>%
    tab_stat_mean(label = "Mean") %>%
    tab_stat_fun(label = "Z", function(x, m_global, std_global, weight = NULL){
      z_indices(x, m_global = w_mean(var_df[each], infert$w), std_global = w_sd(var_df[each], infert$w))
    })
}
tabZ <- tabZ %>% tab_pivot()
Hope this inspires other expss users in the future!
There is no universal solution for this case. The function in tab_stat_fun is always calculated inside a cell, so you can't get global values in it.
However, in your case we can calculate the z-index before summarizing. Not the most flexible solution, but it works:
# function for weighted z-score
w_z_index <- function(x, weight = NULL){
  if(is.null(weight)) weight <- rep(1, length(x))
  z <- (x - w_mean(x, weight)) / w_sd(x, weight)
  indices <- 100 + (z * 100)
  return(indices)
}

data(infert)
infert$w <- rep(2, times = nrow(infert))

infert %>%
  tab_cells(age, parity) %>%
  tab_cols(total(), education, case %nest% list(total(), education)) %>%
  tab_weight(w) %>%
  tab_stat_valid_n(label = "N") %>%
  tab_stat_mean(label = "Mean") %>%
  # here we get the z-index instead of the original variables
  tab_cells(age = w_z_index(age, w), parity = w_z_index(parity, w)) %>%
  tab_stat_mean(label = "Z") %>%
  tab_pivot(stat_position = "inside_columns")
UPDATE.
A little more scalable approach:
w_z_index <- function(x, weight = NULL){
  if(is.null(weight)) weight <- rep(1, length(x))
  z <- (x - w_mean(x, weight)) / w_sd(x, weight)
  indices <- 100 + (z * 100)
  return(indices)
}

w_z_index_df <- function(df, weight = NULL){
  df[] <- lapply(df, w_z_index, weight = weight)
  df
}

data(infert)
infert$w <- rep(2, times = nrow(infert))

infert %>%
  tab_cells(age, parity) %>%
  tab_cols(total(), education, case %nest% list(total(), education)) %>%
  tab_weight(w) %>%
  tab_stat_valid_n(label = "N") %>%
  tab_stat_mean(label = "Mean") %>%
  # here we get the z-index instead of the original variables
  # and can process a lot of variables at once
  tab_cells(w_z_index_df(data.frame(age, parity))) %>%
  tab_stat_mean(label = "Z") %>%
  tab_pivot(stat_position = "inside_columns")

dplyr not grouping correctly or else using data from previous groups

I am working with JHU data on coronavirus infections, and I'm trying to compute new cases (and deaths) by group. Here's the code:
base <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-"
world.confirmed <- read.csv(paste0(base,"Confirmed.csv"), sep=',', head=T)
world.confirmed <- gather( world.confirmed, Date, Cases, X1.22.20:X3.21.20)
world.deaths <- read.csv(paste0(base,"Deaths.csv"), sep=',', head=T)
world.deaths <- gather(world.deaths, Date, Deaths, X1.22.20:X3.21.20)
world.data <- merge(world.confirmed, world.deaths,
by=c("Province.State","Country.Region","Lat", "Long", "Date"))
world.data$Date <- as.Date(world.data$Date, "X%m.%d.%y")
world.data <- world.data %>%
group_by(Province.State,Country.Region,Date) %>%
arrange(Province.State, Country.Region, as.Date(Date))
Following solutions to this question on SO, I have tried to compute differences by group using something like this:
world.data <- world.data %>%
  group_by(Lat, Long) %>%
  mutate(New.Cases = Cases - lag(Cases))
That does not work, however, and neither does any other grouping. Here are the results at the boundary between the first two countries:
I have also tried inserting an arrange step, and even zeroing the first element of each group. Same problem. Any idea?
Update: I'm using R 3.4.4 and dplyr 0.8.5.
Probably, this might help :
library(dplyr)

world.data %>%
  mutate(Date = as.Date(Date, "X%m.%d.%y")) %>%
  arrange(Country.Region, Lat, Long, Date) %>%
  group_by(Country.Region, Lat, Long) %>%
  mutate(New_Cases = Cases - lag(Cases),
         New_deaths = Deaths - lag(Deaths))
We arrange the data by Date and compute New_Cases by subtracting yesterday's cases from today's cases for each country, and do the same for deaths.
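Note that lag() leaves an NA in the first row of each group; a minimal, assumed way to turn those into zeros (mirroring the zeroing you attempted) is dplyr::coalesce():

world.data %>%
  arrange(Country.Region, Lat, Long, Date) %>%
  group_by(Country.Region, Lat, Long) %>%
  mutate(New_Cases  = coalesce(Cases - lag(Cases), 0),     # first row of each group becomes 0
         New_deaths = coalesce(Deaths - lag(Deaths), 0))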
