I have a "trips" dataset that includes a unique trip id, and a start and end time (the specific hour and minute) of the trips. These trips were all taken on the same day. I am trying to determine the number of cars on the road at any given time and plot it as a line graph using ggplot in R. In other words, a car is "on the road" at any time in between its start and end time.
The most similar example I can find uses the following structure:
yearly_counts <- trips %>%
count(year, trip_id)
ggplot(data = yearly_counts, mapping = aes(x = year, y = n)) +
geom_line()
Would the best approach be to modify this structure to have a "minutesByHour_count" variable with a count for every minute of every hour? That seems inefficient to me, and it still doesn't solve the problem of deriving the counts from the start/end times.
Is there any easier way to do this?
Here's an example based on counting each start as an additional car, and each end as a reduction in the count:
library(tidyverse)
df %>%
  gather(type, time, c(start_hour, end_hour)) %>%
  mutate(count_chg = if_else(type == "start_hour", 1, -1)) %>%
  arrange(time) %>%
  mutate(car_count = cumsum(count_chg)) %>%
  ggplot(aes(time, car_count)) +
  geom_step()
Sample data:
df <- data.frame(
  uniqueID = 1:60,
  start_hour = seq(8, 12, length.out = 60),
  dur_hour = 0.05 * 1:60
)
df$end_hour = df$start_hour + df$dur_hour
df$dur_hour = NULL
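Side note: gather() has since been superseded in tidyr, so on a newer version the same reshaping can be written with pivot_longer() (an equivalent sketch using the column names above):

df %>%
  pivot_longer(c(start_hour, end_hour), names_to = "type", values_to = "time") %>%
  mutate(count_chg = if_else(type == "start_hour", 1, -1)) %>%
  arrange(time) %>%
  mutate(car_count = cumsum(count_chg)) %>%
  ggplot(aes(time, car_count)) +
  geom_step()

geom_step() is used rather than geom_line() so the count stays flat between events instead of being interpolated.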
I am working on a music streaming project, and I am trying to get the top 15 globally streamed songs in 2020 and make an interactive graph.
It successfully showed the top 15 song names as a data frame, but it failed to display as a bar graph, and I wonder where I went wrong. It did work after I flipped the bar graph to horizontal, but the data look a bit off.
It looks like this as a vertical bar graph:
The horizontal bar graph looks like this, but the data seem incorrect:
Here is the code I have:
library("dplyr")
library("ggplot2")
# load the .csv into R studio, you can do this 1 of 2 ways
#read.csv("the name of the .csv you downloaded from kaggle")
spotiify_origional <- read.csv("charts.csv")
spotiify_origional <- read.csv("https://raw.githubusercontent.com/info201a-au2022/project-group-1-section-aa/main/data/charts.csv")
View(spotiify_origional)
# filters down the data
# removes the track id, explicit, and duration columns
spotify_modify <- spotiify_origional %>%
  select(name, country, date, position, streams, artists, genres = artist_genres)
# returns all the data just from 2022
# this is the data set you should use on the project
spotify_2022 <- spotify_modify %>%
  filter(date >= "2022-01-01") %>%
  arrange(date) %>%
  group_by(date)
# use write.csv() to turn the new dataset into a .csv file
# template: write.csv(Your DataFrame, "Path to export the DataFrame\\File Name.csv", row.names = FALSE)
write.csv(spotify_2022, "/Users/oliviasapp/Documents/info201/project-group-1-section-aa/data/spotify_2022.csv" , row.names = FALSE)
# then I pushed the spotify_2022.csv to the GitHub repo
View(spotiify_origional)
spotify_2022_global <- spotify_modify %>%
  filter(date >= "2022-01-01") %>%
  filter(country == "global") %>%
  arrange(date) %>%
  group_by(streams)
View(spotify_2022_global)
top_15 <- spotify_2022_global[order(spotify_2022_global$streams, decreasing = TRUE), ]
top_15 <- top_15[1:15,]
top_15$streams <- as.numeric(top_15$streams)
View(top_15)
col_chart <- ggplot(data = top_15) +
  geom_col(mapping = aes(x = name, y = streams)) +
  ggtitle("Top 15 Songs Daily Streamed Globally") +
  theme(plot.title = element_text(hjust = 0.5))
col_chart <- col_chart + coord_cartesian(ylim = c(999000, 1000000)) + coord_flip()
col_chart
Thank you so much! Any suggestions will hugely help!
top_15 <- spotify_2022_global[order(spotify_2022_global$streams, decreasing = TRUE), ]
This line sorts in decreasing order, but the streams column here is still of character type, so strings like "999975" sort as "higher" than "1000000", which is why your data look weird. One song had two weeks just under 1M, which is why it shows up at roughly 2M.
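You can see the character-versus-numeric ordering with a quick check:

sort(c("999975", "1000000"), decreasing = TRUE)              # "999975" sorts first (string comparison)
sort(as.numeric(c("999975", "1000000")), decreasing = TRUE)  # 1000000 sorts first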
If you use this instead you'll get more what you intended:
top_15 <- spotify_2022_global[order(as.numeric(spotify_2022_global$streams), decreasing = TRUE), ]
However, this is finding the highest song-weeks, not the highest songs, so in this case all 15 highest song-weeks were one song.
I'd suggest you group_by(name) and then summarize to get total streams by song, filter top 15, and then make name an ordered factor, e.g. with forcats::fct_reorder.
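A rough sketch of that approach, reusing the objects and column names from your code (adjust to taste):

library(forcats)

top_15 <- spotify_2022_global %>%
  ungroup() %>%
  mutate(streams = as.numeric(streams)) %>%
  group_by(name) %>%
  summarize(total_streams = sum(streams, na.rm = TRUE)) %>%
  slice_max(total_streams, n = 15) %>%
  mutate(name = fct_reorder(name, total_streams))

ggplot(top_15, aes(x = total_streams, y = name)) +
  geom_col() +
  ggtitle("Top 15 Songs by Total Global Streams") +
  theme(plot.title = element_text(hjust = 0.5))

Putting name on the y axis avoids the coord_flip() step, and fct_reorder() keeps the bars sorted by total streams.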
I want to count the number of patients waiting to be seen by a doctor every 15 minutes over a 3.5-year time frame.
I have a first data.frame (dates) with 122880 rows of dates (one every 15 minutes).
I have another data.frame (episode) with 225000 rows, one per ER episode, with the time the patient came to the ER, the time they left, and their ID.
I have a third data.frame (care) with the time the doctor saw each patient and the patient's ID.
Here is my code:
for(hour in 1:122880){
  for(patient in 1:nrow(episode)){
    if(episode$begin[patient] < dates[hour] & episode$end[patient] > dates[hour]){
      no_episode = episode$id[patient]
      if(care$begin[care$id == no_episode] > dates[hour]){
        nb_wait = nb_wait + 1
        delay = delay + dates[hour] - episode$begin[patient]
      }
    }
  }
  nb_wait_total = rbind(nb_wait_total, nb_wait)
  delay_total = rbind(delay_total, delay)
  nb_wait = 0
  delay = 0
}
The first loop steps through the dates and writes the results.
The second loop plus the first if statement searches the episode data.frame for the patients who are in the ER during that time frame.
The second if counts the patients who haven't seen the doctor yet. I also sum how long the patients have been waiting.
It is extremely slow (estimated 30 days), and from the tests I have done, the slowest line is this one:
if(episode$begin[patient] < dates[hour] & episode$end[patient] > dates[hour])
Any idea how to clean my code so it doesn't take so much time?
I have tried cutting the episode data.frame into 4, which speeds up the process but would still take 7 days!
Thank you!
Update! Went from 30 days to 11 hours, thanks to the comments I had on the post!
Here is the code after modification:
for(hour in 1:122880){
  temp <- episode$id[episode$begin < dates[hour] & episode$end > dates[hour]]
  for(element in 1:length(temp)){
    no_episode = temp[element]
    if(care$begin[care$id == no_episode] > dates[hour]){
      nb_wait = nb_wait + 1
      delay = delay + dates[hour] - episode$begin[episode$id == no_episode]
    }
  }
  nb_wait_total = rbind(nb_wait_total, nb_wait)
  delay_total = rbind(delay_total, delay)
  nb_wait = 0
  delay = 0
}
Getting rid of the first if statement and the 2nd (longest) loop did the trick!
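For what it's worth, the remaining inner loop can usually be vectorized as well. A minimal sketch of the per-hour body, assuming care has at most one row per episode id:

m       <- match(episode$id, care$id)                               # row of care for each episode
in_er   <- episode$begin < dates[hour] & episode$end > dates[hour]  # patients present at this time
waiting <- in_er & !is.na(m) & care$begin[m] > dates[hour]          # present but not yet seen
nb_wait <- sum(waiting, na.rm = TRUE)
delay   <- sum(dates[hour] - episode$begin[waiting], na.rm = TRUE)

m does not depend on the hour, so it can be computed once outside the loop.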
Not sure if this does exactly what you need, but it uses data of similar size and calculates the wait counts at each 15 minute interval in the date range in under 1 second.
First, here's some fake data. I assume here that "id" + "date" are unique identifiers in both the "care" and "episode" tables, and that there's a 1:1 match for each. I also assume the patient always arrives before the doctor and leaves after the doctor arrives.
library(tidyverse); library(lubridate)
set.seed(42)
epi_n <- 500000 # before filtering; results in ~230k rows
care <- tibble(
  id = sample(1:99999, epi_n, replace = TRUE),
  begin = ymd_h(2010010100) + runif(epi_n, max = 3.5*365*24*60*60),
  date = as_date(begin)) %>%
  filter(hour(begin) >= 7, hour(begin) <= 17) %>%
  distinct(id, date, .keep_all = TRUE) # make "id" and "date" unique identifiers
episode <- care %>%
  transmute(id, date,
            begin = begin - rpois(nrow(care), 30)*60,
            end = begin + rgamma(nrow(care), 1)*600) %>%
  arrange(begin)
Matching the data between the two data sets takes 0.2 sec.
tictoc::tic()
combined <- left_join(
  episode %>% rename("patient_arrival" = "begin"),
  care %>% rename("dr_arrival" = "begin")
)
tictoc::toc()
Counting how many patients had arrived but doctors hadn't at each 15 minute interval takes another 0.2 sec.
Here, I isolate each patient arrival and each doctor arrival; a patient arrival adds one to the wait count at that moment, and a doctor arrival subtracts one. We can sort those events and take the cumulative sum to get the number of waiting patients over time. Finally, I collate with a table like your "dates" table, fill in the wait counts that were in effect just prior to each 15-minute interval, and keep only those interval rows.
tictoc::tic()
combined %>%
  select(id, patient_arrival, dr_arrival) %>%
  pivot_longer(-id, values_to = "hour") %>%
  mutate(wait_chg = if_else(name == "patient_arrival", 1, -1)) %>%
  arrange(hour) %>%
  mutate(wait_count = cumsum(wait_chg)) %>%
  bind_rows(.,
            tibble(name = "15 minute interval",
                   hour = seq.POSIXt(ymd_h(2010010100),
                                     to = ymd_h(2013070112),
                                     by = "15 min")) %>%
              filter(hour(hour) >= 6, hour(hour) < 20)
  ) %>%
  arrange(hour) %>%
  fill(wait_count) %>%
  replace_na(list(wait_count = 0)) %>%
  filter(name == "15 minute interval") %>%
  select(hour, wait_count)
tictoc::toc()
I am working with a call log data set from a telephone hotline service. There are three call outcomes: Answered, Abandoned & Engaged. I am trying to find out the average time taken by each caller to contact the hotline again if they abandoned the previous call. The time difference can be either seconds, minutes, hours or days but I would like to get all four if possible.
Here is some mock data with the variables I am working with:
library(wakefield)#for generating the Status variable
library(dplyr)
library(stringi)
library(Pareto)
library(uuid)
n_users<-1300
n_rows <- 365000
set.seed(1)
#data<-data.frame()
Date<-seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "1 day")
Date<-sample(rep(Date,each=1000),replace = T)
u <- runif(length(Date), 0, 60*60*12) # "noise" to add or subtract from some timepoint
CallDateTime <- as.POSIXlt(u, origin = paste0(Date, " 00:00:00"))
CallDateTime
CallOutcome<-r_sample_factor(x = c("Answered", "Abandoned", "Engaged"), n=length(Date))
CallOutcome
data<-data.frame(Date,CallDateTime,CallOutcome)
relative_probs <- rPareto(n = n_users, t = 1, alpha = 0.3, truncation = 500)
unique_ids <- UUIDgenerate(n = n_users)
data$CallerId <- sample(unique_ids, size = n_rows, prob = relative_probs, replace = TRUE)
data<-data%>%arrange(CallDateTime)
head(data)
So to reiterate, if a caller abandons their call (represented by "Abandoned" in the CallOutcome column), I would like to know the average time taken for the caller to make another call to the service, in the four time units I have mentioned. Any pointers on how I can achieve this would be great :)
For each ID, keep the rows where the current row is "Abandoned" and the next row is not "Abandoned". Find the time difference between every 2 rows to get the time the caller took to make another call to the service after an abandoned one, then take the average of those durations to get the average time.
library(dplyr)
data %>%
  #Test the answer on smaller subset
  #slice(1:1000) %>%
  arrange(CallerId, CallDateTime) %>%
  group_by(CallerId) %>%
  filter(CallOutcome == 'Abandoned' & dplyr::lead(CallOutcome) != 'Abandoned' |
         CallOutcome != 'Abandoned' & dplyr::lag(CallOutcome) == 'Abandoned') %>%
  mutate(group = rep(row_number(), each = 2, length.out = n())) %>%
  group_by(group, .add = TRUE) %>%
  summarise(avg_sec = difftime(CallDateTime[2], CallDateTime[1], units = 'secs')) %>%
  mutate(avg_sec = as.numeric(mean(avg_sec)),
         avg_min = avg_sec/60,
         avg_hour = avg_min/60,
         avg_day = avg_hour/24) -> result
result
First, I would create the lead variable (basically calculating the "next" value by group). Then it's just as easy as using whatever unit you want for difftime. A density plot can help you analyze these differences, as shown below.
data <- data %>%
  group_by(CallerId) %>%
  mutate(CallDateTime_Next = lead(CallDateTime)) %>%
  ungroup() %>%
  mutate(
    diff_days = difftime(CallDateTime_Next, CallDateTime, units = 'days'),
    diff_hours = difftime(CallDateTime_Next, CallDateTime, units = 'hours'),
    diff_mins = difftime(CallDateTime_Next, CallDateTime, units = 'mins'),
    diff_secs = difftime(CallDateTime_Next, CallDateTime, units = 'secs')
  )
data %>%
  filter(CallOutcome == 'Abandoned') %>%
  ggplot() +
  geom_density(aes(x = diff_days))
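From those columns you can also pull the averages in all four units the question asks for; a small sketch (difftime values converted with as.numeric()):

data %>%
  filter(CallOutcome == 'Abandoned') %>%
  summarise(
    avg_secs  = mean(as.numeric(diff_secs),  na.rm = TRUE),
    avg_mins  = mean(as.numeric(diff_mins),  na.rm = TRUE),
    avg_hours = mean(as.numeric(diff_hours), na.rm = TRUE),
    avg_days  = mean(as.numeric(diff_days),  na.rm = TRUE)
  )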
I'm working through this example. However, when I investigate the tk_ts output, I don't think it is taking the start/end dates I'm entering correctly, but I am unsure what the proper input is if I want it to start at 12-31-2019 and end at 7-17-2020:
daily_cases2 <- as_tibble(countrydatescases) %>%
  mutate(Date = as_date(date)) %>%
  group_by(country, Date) %>%
  summarise(total_cases = sum(total_cases))
daily_cases2$total_cases <- as.double(daily_cases2$total_cases)

# Nest
daily_cases2_nest <- daily_cases2 %>%
  group_by(country) %>%
  tidyr::nest()

# TS
daily_cases2_ts <- daily_cases2_nest %>%
  mutate(data.ts = purrr::map(.x = data,
                              .f = tk_ts,
                              select = -Date,
                              start = 2019-12-31,
                              freq = 1))
Here is what I get when I examine it closely:
When I go through the example steps with these parameters the issue is also then seen in the subsequent graph:
I've tried varying the frequency and start parameters and it's just not making sense. Any suggestions?
You've given the start and end dates, but you haven't said what frequency you want. Given that you want the series to start at the end of 2019 and end in the middle of July, 2020, I'm guessing you want a daily time series. In that case, the code should be:
daily_cases2_ts <- daily_cases2_nest %>%
  mutate(data.ts = purrr::map(.x = data,
                              .f = tk_ts,
                              select = -Date,
                              start = c(2019, 365), # day 365 of year 2019
                              freq = 365))          # daily series
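If you want to convince yourself of how start and freq are interpreted, a quick check on a plain numeric vector (not the nested tibbles above) shows the fractional-year time index a daily ts uses:

x <- ts(1:5, start = c(2019, 365), frequency = 365)
time(x) # 2019.997 2020.000 2020.003 2020.005 2020.008, i.e. 2019-12-31, 2020-01-01, ...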
I'm using the campaign contributions data from Oregon and I'm trying to make a graph that displays the cumulative amount of contributions per candidate over time. Here's what I have so far:
ggplot(aes(x = as.Date(contb_receipt_dt, "%d-%b-%y"),
           y = cumsum(contb_receipt_amt)),
       data = subset(oregon_data,
                     table(oregon_data$cand_nm)[oregon_data$cand_nm] > 1000
                     & as.Date(contb_receipt_dt, "%d-%b-%y") > as.Date("2015-01-01"))) +
  geom_line(aes(color = cand_nm), bins = 5)
This is what it looks like:
What I would like to see is a line for each candidate that starts off at 0 and slowly goes up with each additional contribution. What should I do?
I would use dplyr to calculate the cumsum column before sending it on to ggplot. This should give you enough to get started; however, you will need to pretty it up and filter the data to get the results you are looking for:
library(dplyr)
library(ggplot2)

WashingtonData <- read.csv("P00000001-WA.csv")
WashingtonData <- WashingtonData %>% arrange(contb_receipt_dt)
MyGraphData <- WashingtonData %>% group_by(cand_nm) %>% mutate(cumsum = cumsum(contb_receipt_amt))
g <- ggplot(data = MyGraphData, aes(y = cumsum, x = contb_receipt_dt, color = cand_nm)) + geom_line()
g
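One extra caveat beyond the answer above: if contb_receipt_dt is stored as text (e.g. "12-APR-15", per the "%d-%b-%y" format in the question), arranging and plotting on the raw character column will not order chronologically. Parsing it first keeps the x axis in date order; a sketch under that assumption:

MyGraphData <- WashingtonData %>%
  mutate(contb_receipt_dt = as.Date(contb_receipt_dt, "%d-%b-%y")) %>%  # parse text dates first
  arrange(contb_receipt_dt) %>%
  group_by(cand_nm) %>%
  mutate(cumsum = cumsum(contb_receipt_amt))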