R Function involving two for loops - baseball data - r

For those into sports, I am working on a function that adds a column with the pitch count for a game in a given season for a pitcher.
For example's sake, the data is a data frame called pitcher that contains a game_date and an sv_id (the date/timestamp of the pitch). My goal is to order the sv_id values in ascending order within each unique game_date and then add a column that numbers the pitches in that order. For example, if for game_date=9/9/2018 there were 3 pitches thrown with sv_id values 090918_031456, 090918_031613, and 090918_031534, I would first want to sort them into chronological order (090918_031456, 090918_031534, 090918_031613) and then have a new column with the values 1, 2, 3 respectively to act as a pitch count. Below is my function so far. I originally thought I would make a list of lists, but now I am not sure that is the right way to go about this. Please help! This is also my first time posting on here, so any advice is appreciated. Thank you!!!
pitchCount <- function(game_date, sv_id){
  gameUnique<-unique(pitcher$game_date)
  PC<-list()
  for (j in 1:length(gameUnique)){
    PCLocal<-filter(pitcher,game_date==gameUnique[j])
    PCLocal[order(PCLocal$sv_id),]
    for (i in 1:length(PCLocal$sv_id)){
      PCLocal$PC[i]=i
    }
    PC[j]=PCLocal$PC
  }
  return(PC)
}
pitch.Count <- pitchCount(pitcher$game_date,pitcher$sv_id)
pitcher$PC<-pitch.Count

So you want to count pitches as they come in order, right? Should be no need for a loop. In R, loops are rarely needed.
Check if this is what you want. A tidyverse/dplyr solution.
The sv_id variable is in a format that can be converted to POSIX (a type of date format). This makes it simple to sort in order.
library(tidyverse)
# Create the data (tibble() replaces the deprecated data_frame())
pitcher <- tibble(game_date = as.Date(c("2018-09-09", "2018-09-09", "2018-09-09")),
                  sv_id = c("090918_031456", "090918_031613", "090918_031534"))
# First, convert the sv_id strings to POSIXct (this could be done inside the pipeline below, but a separate step is clearer).
# The sv_id prefix is month-day-year (090918 = 9 September 2018), hence %m%d%y.
pitcher$sv_id <- as.POSIXct(pitcher$sv_id, format = "%m%d%y_%H%M%S", tz = "GMT")
# Create pitch count
pitcher %>%
  arrange(sv_id) %>%
  mutate(Count = 1, pitchcount = cumsum(Count), Count = NULL)
# A tibble: 3 x 3
game_date sv_id pitchcount
<date> <dttm> <dbl>
1 2018-09-09 2018-09-09 03:14:56 1
2 2018-09-09 2018-09-09 03:15:34 2
3 2018-09-09 2018-09-09 03:16:13 3
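The example above contains a single game, so it never has to restart the count. A minimal base R sketch of the per-game version follows, assuming the sv_id prefix is month-day-year as in the question's example. The second game (2018-09-10) and the column name pitch_time are made up for illustration.

```r
# Sample data: the question's three pitches plus a hypothetical second game
pitcher <- data.frame(
  game_date = as.Date(c("2018-09-09", "2018-09-09", "2018-09-09",
                        "2018-09-10", "2018-09-10")),
  sv_id = c("090918_031456", "090918_031613", "090918_031534",
            "091018_020500", "091018_020130"),
  stringsAsFactors = FALSE
)

# Parse the timestamp (assumed layout: mmddyy_HHMMSS)
pitcher$pitch_time <- as.POSIXct(pitcher$sv_id,
                                 format = "%m%d%y_%H%M%S", tz = "GMT")

# Sort by game, then chronologically within each game
pitcher <- pitcher[order(pitcher$game_date, pitcher$pitch_time), ]

# Number the pitches 1..n separately for every game_date
pitcher$pitchcount <- ave(seq_along(pitcher$sv_id),
                          pitcher$game_date, FUN = seq_along)

print(pitcher[, c("game_date", "sv_id", "pitchcount")])
```

Because ave() applies seq_along() within each game_date, the numbering restarts at 1 for every game without an explicit loop.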

Try using data.table.
library(data.table)
pitcher_dt <- data.table(pitcher)
> pitcher_dt
game_date sv_id
1: 2018-01-02 090918_031456
2: 2018-01-02 090918_031613
3: 2018-01-02 090918_031534
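We can add a Count column with := that holds the rank of each sv_id within its game. Use frank(sv_id) rather than order(sv_id): order() returns the permutation that sorts the vector, which matches the rank only by coincidence in this three-row example.
pitcher_dt[, Count := frank(sv_id), by = game_date]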
> pitcher_dt
game_date sv_id Count
1: 2018-01-02 090918_031456 1
2: 2018-01-02 090918_031613 3
3: 2018-01-02 090918_031534 2
Count holds the values (1, 3, 2) here, so to see the pitches in sequence we can sort the rows by either Count or sv_id in ascending order. Note that order() goes in the row position i, before the comma:
pitcher_dt[order(Count)] or pitcher_dt[order(sv_id)]
> pitcher_dt[order(Count)]
game_date sv_id Count
1: 2018-01-02 090918_031456 1
2: 2018-01-02 090918_031534 2
3: 2018-01-02 090918_031613 3
I find it easy to manipulate data with data.table, but you can also use dplyr.
The "Introduction to data.table" vignette is a good place to start learning about data.table.
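A note on order() versus rank(), sketched in base R: order() returns the row positions that would sort the vector, whereas rank() returns each element's position in the sorted result. The two happen to agree for the sample above but not in general. The vector below holds the same three sv_id values in a different, made-up arrangement.

```r
# Same three pitches as in the question, deliberately shuffled
sv_id <- c("090918_031613", "090918_031456", "090918_031534")

# Positions that would sort the vector: element 2 first, then 3, then 1
sort_positions <- order(sv_id)

# Rank of each element: the first element is the latest pitch, so rank 3
pitch_rank <- rank(sv_id)

print(sort_positions)  # 2 3 1
print(pitch_rank)      # 3 1 2
```

Only the rank is usable directly as a pitch count column on the unsorted data.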

I am not sure what your data looks like, but I assume the following from your description:
> df
# A tibble: 9 x 2
game_date sv_id
<date> <chr>
1 2018-09-09 090918_031456
2 2018-09-09 090918_031613
3 2018-09-09 090918_031534
4 2018-05-17 090918_031156
5 2018-05-17 090918_031213
6 2018-06-30 090918_031177
7 2018-06-30 090918_031211
8 2018-06-30 090918_031144
9 2018-06-30 090918_031203
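Then you can use dplyr to generate your target:
library(dplyr)
df <- df %>%
  group_by(game_date) %>%
  mutate(count = n_distinct(sv_id)) %>% # total number of distinct sv_id values per game_date (a per-game total, not a running count)
  arrange(sv_id) # ascending, as in the output below; arrange() ignores the grouping unless .by_group = TRUE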
The output is:
# A tibble: 9 x 3
# Groups: game_date [3]
game_date sv_id count
<date> <chr> <int>
1 2018-06-30 090918_031144 4
2 2018-05-17 090918_031156 2
3 2018-06-30 090918_031177 4
4 2018-06-30 090918_031203 4
5 2018-06-30 090918_031211 4
6 2018-05-17 090918_031213 2
7 2018-09-09 090918_031456 3
8 2018-09-09 090918_031534 3
9 2018-09-09 090918_031613 3
I hope this could help

Related

R create week numbers with specified start date

This seems like it should be straightforward but I cannot find a way to do this.
I have a sales cycle that begins ~ August 1 of each year and need to sum sales by week number. I need to create a "week number" field where week #1 begins on a date that I specify. Thus far I have looked at lubridate, baseR, and strftime, and I cannot find a way to change the "start" date from 01/01/YYYY to something else.
Solution needs to let me specify the start date and iterate week numbers as 7 days from the start date. The actual start date doesn't always occur on a Sunday or Monday.
EG Data Frame
eg_data <- data.frame(
cycle = c("cycle2019", "cycle2019", "cycle2018", "cycle2018", "cycle2017", "cycle2017", "cycle2016", "cycle2016"),
dates = as.POSIXct(c("2019-08-01" , "2019-08-10" ,"2018-07-31" , "2018-08-16", "2017-08-03" , "2017-08-14" , "2016-08-05", "2016-08-29")),
week_n = c("1", "2","1","3","1","2","1","4"))
I'd like the result to look like what is above - it would take the min date for each cycle and use that as a starting point, then iterate up week numbers based on a given date's distance from the cycle starting date.
This almost works. (Doing date arithmetic gives us durations in seconds: there may be a smoother way to convert with lubridate tools?)
secs_per_week <- 60*60*24*7
(eg_data
%>% group_by(cycle)
%>% mutate(nw=1+as.numeric(round((dates-min(dates))/secs_per_week)))
)
The results don't match for 2017, because there is an 11-day gap between the first and second observation ...
cycle dates week_n nw
<chr> <dttm> <chr> <dbl>
5 cycle2017 2017-08-03 00:00:00 1 1
6 cycle2017 2017-08-14 00:00:00 2 3
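The mismatch comes from round(): 11 days is 1.57 weeks, which rounds up to 2 and yields week 3. A minimal base R sketch with floor() instead, assuming the dates are stored as Date and that week 1 covers days 0 to 6 after the earliest date in each cycle. Working in whole days also avoids relying on the automatically chosen difftime units.

```r
# Example data from the question, with plain Date values
eg_data <- data.frame(
  cycle = c("cycle2019", "cycle2019", "cycle2018", "cycle2018",
            "cycle2017", "cycle2017", "cycle2016", "cycle2016"),
  dates = as.Date(c("2019-08-01", "2019-08-10", "2018-07-31", "2018-08-16",
                    "2017-08-03", "2017-08-14", "2016-08-05", "2016-08-29")),
  stringsAsFactors = FALSE
)

# Earliest date per cycle, as a number of days since the epoch
cycle_start <- ave(as.numeric(eg_data$dates), eg_data$cycle, FUN = min)

# Whole days elapsed since the start of the cycle
days_since_start <- as.numeric(eg_data$dates) - cycle_start

# Days 0-6 -> week 1, days 7-13 -> week 2, and so on
eg_data$week_n <- floor(days_since_start / 7) + 1

print(eg_data)
```

With floor(), the 2017 pair (11 days apart) lands in weeks 1 and 2, matching the week_n column in the question.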
If someone has a better answer, please post it, but this works.
Take the data frame from the example, eg_data:
library(dplyr)
eg_data %>%
  group_by(cycle) %>%
  mutate(
    cycle_start = as.Date(min(dates)),
    days_diff = as.numeric(as.Date(dates) - cycle_start),
    week_n = days_diff / 7,
    # floor() + 1 puts days 0-6 in week 1; ceiling() alone would put day 0 in week 0
    week_n_whole = floor(days_diff / 7) + 1 ) -> eg_data_check
(First time I've answered my own question)
library("lubridate")
eg_data %>%
as_tibble() %>%
group_by(cycle) %>%
mutate(new_week = week(dates)-31)
This doesn't quite work the same as your example, but perhaps with some fiddling based on your domain experience you could adapt it:
library(dplyr)
library(stringr) # for str_sub()
library(lubridate)
eg_data %>%
mutate(aug1 = ymd_h(paste(str_sub(cycle, start = -4), "080100")),
week_n2 = ceiling((dates - aug1)/ddays(7)))
EDIT: If you have specific known dates for the start of each cycle, it might be helpful to join those dates to your data for the calc:
library(lubridate)
cycle_starts <- data.frame(
cycle = c("cycle2019", "cycle2018", "cycle2017", "cycle2016"),
start_date = ymd_h(c(2019080100, 2018072500, 2017080500, 2016071300))
)
eg_data %>%
left_join(cycle_starts) %>%
mutate(week_n2 = ceiling((dates - start_date)/ddays(7)))
#Joining, by = "cycle"
# cycle dates week_n start_date week_n2
#1 cycle2019 2019-08-01 1 2019-08-01 1
#2 cycle2019 2019-08-10 2 2019-08-01 2
#3 cycle2018 2018-07-31 1 2018-07-25 1
#4 cycle2018 2018-08-16 3 2018-07-25 4
#5 cycle2017 2017-08-03 1 2017-08-05 0
#6 cycle2017 2017-08-14 2 2017-08-05 2
#7 cycle2016 2016-08-05 1 2016-07-13 4
#8 cycle2016 2016-08-29 4 2016-07-13 7
This is a concise solution using lubridate
library(lubridate)
eg_data %>%
group_by(cycle) %>%
mutate(new_week = floor(as.period(ymd(dates) - ymd(min(dates))) / weeks()) + 1)
# A tibble: 8 x 4
# Groups: cycle [4]
cycle dates week_n new_week
<chr> <dttm> <chr> <dbl>
1 cycle2019 2019-08-01 00:00:00 1 1
2 cycle2019 2019-08-10 00:00:00 2 2
3 cycle2018 2018-07-31 00:00:00 1 1
4 cycle2018 2018-08-16 00:00:00 3 3
5 cycle2017 2017-08-03 00:00:00 1 1
6 cycle2017 2017-08-14 00:00:00 2 2
7 cycle2016 2016-08-05 00:00:00 1 1
8 cycle2016 2016-08-29 00:00:00 4 4

How to check for continuity minding possible gaps in dates

I have a big data frame with date ranges and I need to find the first date of a continuous period, as follows:
ID ID_2 END BEG
1 55 2017-06-30 2016-01-01
1 55 2015-12-31 2015-11-12 --> Gap (required date)
1 88 2008-07-26 2003-02-24
2 19 2014-09-30 2013-05-01
2 33 2013-04-30 2011-01-01 --> Not Gap (overlapping)
2 19 2012-12-31 2011-01-01
2 33 2010-12-31 2008-01-01
2 19 2007-12-31 2006-01-01
2 19 2005-12-31 1980-10-20 --> No actual Gap(required date)
As shown, not all the date ranges overlap, and I need to return, by ID (not ID_2), the date at which the first gap appears when going backwards in time. I've tried using a for loop, but it's extremely slow (the data frame has 150k rows). I've been experimenting with dplyr and mutate as follows:
df <- df%>%
group_by(ID)%>%
mutate(END_lead = lead(END))
df$FLAG <- df$BEG - days(1) == df$END_lead
df <- df%>%
group_by(ID)%>%
filter(cumsum(cumsum(FLAG == FALSE))<=1)
But this set of instructions stops at the first overlap and so filters the wrong date. I've tried everything I could think of, ordering in decreasing or ascending order and using min and max, but could not figure out a solution.
The actual result wanted would be:
ID ID_2 END BEG
1 55 2015-12-31 2015-11-12
2 19 2005-12-31 1980-10-20
Is there a way of doing this using dplyr,tidyr and lubridate?
A possible solution using dplyr:
library(dplyr)
df %>%
mutate(across(c(END, BEG), as.Date)) %>% # across() replaces the deprecated funs()
group_by(ID) %>%
slice(which.max(BEG > ( lead(END) + 1 ) | is.na(BEG > ( lead(END) + 1 ))))
With your last data, it gives:
# A tibble: 2 x 4
# Groups: ID [2]
ID ID_2 END BEG
<int> <int> <date> <date>
1 1 55 2015-12-31 2015-11-12
2 2 19 2005-12-31 1980-10-20
What the solution does is basically:
Converts the dates to Date format (no need for lubridate);
Groups by ID;
Selects the first row from the top that satisfies your criteria: either the first row that marks a gap (TRUE) or, if there is no gap, the last row of the group, where lead(END) is missing and the comparison returns NA (this is why is.na(BEG > ( lead(END) + 1 )) is included).
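The same steps can be sketched in base R. This is only an illustration under stated assumptions: the rows are already sorted newest first within each ID, as in the question's table, and the helper name first_gap_row is made up.

```r
# Data from the question, newest range first within each ID
df <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 2, 2, 2, 2),
  ID_2 = c(55, 55, 88, 19, 33, 19, 33, 19, 19),
  END  = as.Date(c("2017-06-30", "2015-12-31", "2008-07-26", "2014-09-30",
                   "2013-04-30", "2012-12-31", "2010-12-31", "2007-12-31",
                   "2005-12-31")),
  BEG  = as.Date(c("2016-01-01", "2015-11-12", "2003-02-24", "2013-05-01",
                   "2011-01-01", "2011-01-01", "2008-01-01", "2006-01-01",
                   "1980-10-20"))
)

# Hypothetical helper: return the first row (going back in time) that
# starts more than one day after the next older range ends
first_gap_row <- function(g) {
  next_end <- c(g$END[-1], as.Date(NA))  # END of the next older row
  is_gap <- g$BEG > next_end + 1         # TRUE when at least one day is uncovered
  is_gap[is.na(is_gap)] <- TRUE          # oldest row: nothing older to connect to
  g[which(is_gap)[1], ]
}

result <- do.call(rbind, lapply(split(df, df$ID), first_gap_row))
print(result)
```

For ID 2 every range touches or overlaps the next older one, so the oldest row (BEG 1980-10-20) is returned.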
I would use the xts package, first creating an xts object for each ID you have, then using the first() and last() functions on each object.
https://www.datacamp.com/community/blog/r-xts-cheat-sheet

R Sum rows by hourly rate

I'm getting started with R, so please bear with me
For example, I have this data.table (or data.frame) object :
Time Station count_starts count_ends
01/01/2015 00:30 A 2 3
01/01/2015 00:40 A 2 1
01/01/2015 00:55 B 1 1
01/01/2015 01:17 A 3 1
01/01/2015 01:37 A 1 1
My end goal is to group the "Time" column to hourly and sum the count_starts and count_ends based on the hourly time and station :
Time Station sum(count_starts) sum(count_ends)
01/01/2015 01:00 A 4 4
01/01/2015 01:00 B 1 1
01/01/2015 02:00 A 4 2
I did some research and found out that I should use the xts library.
Thanks for helping me out
UPDATE :
I converted the type of transactions$Time to POSIXct, so the xts package should be able to use the timeseries directly.
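Using base R, we can still do the above. The only difference is that every hour label will be one less than in your desired output, because cut() labels each interval by the hour in which it starts: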
dat=read.table(text = "Time Station count_starts count_ends
'01/01/2015 00:30' A 2 3
'01/01/2015 00:40' A 2 1
'01/01/2015 00:55' B 1 1
'01/01/2015 01:17' A 3 1
'01/01/2015 01:37' A 1 1",
header = TRUE, stringsAsFactors = FALSE)
dat$Time=cut(strptime(dat$Time,"%m/%d/%Y %H:%M"),"hour")
aggregate(.~Time+Station,dat,sum)
Time Station count_starts count_ends
1 2015-01-01 00:00:00 A 4 4
2 2015-01-01 01:00:00 A 4 2
3 2015-01-01 00:00:00 B 1 1
You can use the order function to rearrange the table or even the sort.POSIXlt function:
m=aggregate(.~Time+Station,dat,sum)
m[order(m[,1]),]
Time Station count_starts count_ends
1 2015-01-01 00:00:00 A 4 4
3 2015-01-01 00:00:00 B 1 1
2 2015-01-01 01:00:00 A 4 2
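If the labels should show the end of each hour, as in the question's desired output, the timestamps can be rounded up before aggregating. A minimal base R sketch, assuming the Time strings are month/day/year and treating them as UTC; the column name Hour is made up.

```r
dat <- read.table(text = "Time Station count_starts count_ends
'01/01/2015 00:30' A 2 3
'01/01/2015 00:40' A 2 1
'01/01/2015 00:55' B 1 1
'01/01/2015 01:17' A 3 1
'01/01/2015 01:37' A 1 1",
                  header = TRUE, stringsAsFactors = FALSE)

# Parse the timestamps (assumed layout: month/day/year hour:minute)
parsed <- as.POSIXct(dat$Time, format = "%m/%d/%Y %H:%M", tz = "UTC")

# Round up to the next full hour by working in seconds since the epoch
hour_end <- as.POSIXct(ceiling(as.numeric(parsed) / 3600) * 3600,
                       origin = "1970-01-01", tz = "UTC")

# Use a plain character label as the grouping key
dat$Hour <- format(hour_end, "%Y-%m-%d %H:%M")

# Sum both count columns per hour and station
agg <- aggregate(cbind(count_starts, count_ends) ~ Hour + Station,
                 data = dat, FUN = sum)
agg <- agg[order(agg$Hour, agg$Station), ]
print(agg)
```

A timestamp that falls exactly on the hour stays in that hour, because ceiling() leaves whole numbers unchanged.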
A solution using dplyr and lubridate. The key is to use ceiling_date to convert the date time column to hourly time-step, and then group and summarize the data.
library(dplyr)
library(lubridate)
dt2 <- dt %>%
mutate(Time = mdy_hm(Time)) %>%
mutate(Time = ceiling_date(Time, unit = "hour")) %>%
group_by(Time, Station) %>%
summarise(`sum(count_starts)` = sum(count_starts),
`sum(count_ends)` = sum(count_ends)) %>%
ungroup()
dt2
# # A tibble: 3 x 4
# Time Station `sum(count_starts)` `sum(count_ends)`
# <dttm> <chr> <int> <int>
# 1 2015-01-01 01:00:00 A 4 4
# 2 2015-01-01 01:00:00 B 1 1
# 3 2015-01-01 02:00:00 A 4 2
DATA
dt <- read.table(text = "Time Station count_starts count_ends
'01/01/2015 00:30' A 2 3
'01/01/2015 00:40' A 2 1
'01/01/2015 00:55' B 1 1
'01/01/2015 01:17' A 3 1
'01/01/2015 01:37' A 1 1",
header = TRUE, stringsAsFactors = FALSE)
Explanation
mdy_hm is the function to convert the string to date-time class. It means "month-day-year hour-minute", which depends on the structure of the string. ceiling_date rounds a date-time object up based on the unit specified. group_by is to group the variable. summarise is to conduct summary operation.
There are basically two things required:
1) Round the Time down to the start of its 1-hour window:
library(data.table)
library(lubridate)
data=data.table(Time=c('01/01/2015 00:30','01/01/2015 00:40','01/01/2015 00:55','01/01/2015 01:17','01/01/2015 01:37'),
                Station=c('A','A','B','A','A'),
                count_starts=c(2,2,1,3,1),
                count_ends=c(3,1,1,1,1))
data[,Time_conv:=as.POSIXct(strptime(Time,'%d/%m/%Y %H:%M'))]
data[,Time_round:=floor_date(Time_conv,unit="1 hour")]
2) Aggregate the data table obtained above by hour and station to get the desired result:
New_data=data[,list(count_starts_sum=sum(count_starts),count_ends_sum=sum(count_ends)),by=.(Time_round,Station)]

Count event occurrence and assign it to individuals according to date and place of interview

I have two dataframes in R: A.df and B.df. The first contains N rows where each row is an event that happened in a certain date and place.
The second is a list of individuals that have been interviewed in a certain date and place.
For each individual, I would like to count the number of events that happened within a certain timeframe before the interview date in the same location of the individual's place of interview.
Let's say that the time frame is x days before the date of interview, and that I have computed that date and stored in the variable xdaysbefore.
Here is what the data frames look like:
A.df
#Event Date Place
1 2015-05-01 1
2 2015-03-11 1
3 2015-07-04 2
4 2015-05-10 3
B.df
#Individual Date of Interview Place xdaysbefore
1 2016-07-11 1 2014-09-11
2 2016-05-07 3 2014-07-04
3 2016-08-09 2 2014-03-22
4 2016-01-10 3 2014-09-17
Note that Date, Date of Interview and xdaysbefore are all in Date R class
How can I count, for each individual in B.df, the events that happened between xdaysbefore and Date of Interview, matching the place where the event happened to the individual's place of interview?
What I would expect in B.df would look like this:
B.df
#Individual Date of Interview Place xdaysbefore CountedEvents
1 2016-07-11 1 2014-09-11 2
2 2016-05-07 3 2014-07-04 1
3 2016-08-09 2 2014-03-22 1
4 2016-01-10 3 2014-09-17 1
where CountedEvents are the number of events happened in the time frame Date of Interview - xdaysbefore and in the same location where the individual i has been interviewed.
You can use apply on every row of B.df.
Take the subset of A.df where the places are equal, then count the rows whose Date falls between xdaysbefore and Date_of_Interview. Use sum() on the logical vector; length() would count every row of the subset whether or not it is in range.
B.df$CountedEvents <- apply(B.df, 1, function(x) {
  temp = A.df[A.df$Place %in% x[3],]
  sum(temp$Date < as.Date(x[2]) & temp$Date > as.Date(x[4]))
})
B.df
# Individual Date_of_Interview Place xdaysbefore CountedEvents
#1 1 2016-07-11 1 2014-09-11 2
#2 2 2016-05-07 3 2014-07-04 1
#3 3 2016-08-09 2 2014-03-22 1
#4 4 2016-01-10 3 2014-09-17 1
EDIT
If you want to access columns by name instead of by index, you can use
apply(B.df, 1, function(x) {
  temp = A.df[A.df$Place %in% x["Place"],]
  # sum() counts only the TRUE values, i.e. the events inside the window
  sum(temp$Date < as.Date(x["Date_of_Interview"]) &
      temp$Date > as.Date(x["xdaysbefore"]))
})
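For reference, here is a self-contained base R sketch of the same count that avoids apply() converting each row to character. It rebuilds the sample data from the question, with Date_of_Interview as an assumed column name, and loops over row indices with vapply().

```r
# Sample data from the question
A.df <- data.frame(
  Event = 1:4,
  Date  = as.Date(c("2015-05-01", "2015-03-11", "2015-07-04", "2015-05-10")),
  Place = c(1, 1, 2, 3)
)
B.df <- data.frame(
  Individual        = 1:4,
  Date_of_Interview = as.Date(c("2016-07-11", "2016-05-07",
                                "2016-08-09", "2016-01-10")),
  Place             = c(1, 3, 2, 3),
  xdaysbefore       = as.Date(c("2014-09-11", "2014-07-04",
                                "2014-03-22", "2014-09-17"))
)

# For each interview, count events at the same place inside the window
B.df$CountedEvents <- vapply(seq_len(nrow(B.df)), function(i) {
  sum(A.df$Place == B.df$Place[i] &
      A.df$Date > B.df$xdaysbefore[i] &
      A.df$Date < B.df$Date_of_Interview[i])
}, integer(1))

print(B.df)
```

Because the columns keep their Date and numeric types, no as.Date() conversion is needed inside the function.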
You can achieve that by using a combination of merge and aggregate:
# merge into a new dataset
AB <- merge(A, B, by = 'Place', all = TRUE)
# create a logical variable which indicates whether 'Date' falls within the range
AB$count <- AB$xdaysbefore < AB$Date & AB$Date_of_Interview > AB$Date
# aggregate into a count variable
aggregate(count ~ Individual + Date_of_Interview + xdaysbefore, AB, sum)
which gives:
Individual Date_of_Interview xdaysbefore count
1 3 2016-08-09 2014-03-22 1
2 2 2016-05-07 2014-07-04 1
3 1 2016-07-11 2014-09-11 2
4 4 2016-01-10 2014-09-17 1
Alternatively, you could use a non-equi join, available in the data.table package since version 1.9.8:
library(data.table)
# convert the dataframes to data.table's (which are enhanced dataframes)
setDT(A)
setDT(B)
# join and count
A[B, on = .(Place, Date < Date_of_Interview, Date > xdaysbefore)
][, .(count = .N), .(Individual, Place, Date_of_Interview = Date, xdaysbefore = Date.1)]
which gives:
Individual Place Date_of_Interview xdaysbefore count
1: 1 1 2016-07-11 2014-09-11 2
2: 2 3 2016-05-07 2014-07-04 1
3: 3 2 2016-08-09 2014-03-22 1
4: 4 3 2016-01-10 2014-09-17 1

Fastest way for filling-in missing dates for data.table

I am loading a data.table from CSV file that has date, orders, amount etc. fields.
The input file occasionally does not have data for all dates. For example, as shown below:
> NADayWiseOrders
date orders amount guests
1: 2013-01-01 50 2272.55 149
2: 2013-01-02 3 64.04 4
3: 2013-01-04 1 18.81 0
4: 2013-01-05 2 77.62 0
5: 2013-01-07 2 35.82 2
In the above 03-Jan and 06-Jan do not have any entries.
I would like to fill the missing entries with default values (say, zero for orders, amount, etc.), or carry the last value forward (e.g., 03-Jan reuses the 02-Jan values and 06-Jan reuses the 05-Jan values).
What is the best/optimal way to fill-in such gaps of missing dates data with such default values?
The answer here suggests using allow.cartesian = TRUE, and expand.grid for missing weekdays - it may work for weekdays (since they are just 7 weekdays) - but not sure if that would be the right way to go about dates as well, especially if we are dealing with multi-year data.
The idiomatic data.table way (using rolling joins) is this:
setkey(NADayWiseOrders, date)
all_dates <- seq(from = as.Date("2013-01-01"),
to = as.Date("2013-01-07"),
by = "days")
NADayWiseOrders[J(all_dates), roll=Inf]
date orders amount guests
1: 2013-01-01 50 2272.55 149
2: 2013-01-02 3 64.04 4
3: 2013-01-03 3 64.04 4
4: 2013-01-04 1 18.81 0
5: 2013-01-05 2 77.62 0
6: 2013-01-06 2 77.62 0
7: 2013-01-07 2 35.82 2
Here is how you fill in the gaps within subgroup
# a toy dataset with gaps in the time series
dt <- as.data.table(read.csv(textConnection('"group","date","x"
"a","2017-01-01",1
"a","2017-02-01",2
"a","2017-05-01",3
"b","2017-02-01",4
"b","2017-04-01",5')))
dt[,date := as.Date(date)]
# the desired dates by group
indx <- dt[,.(date=seq(min(date),max(date),"months")),group]
# key the tables and join them using a rolling join
setkey(dt,group,date)
setkey(indx,group,date)
dt[indx,roll=TRUE]
#> group date x
#> 1: a 2017-01-01 1
#> 2: a 2017-02-01 2
#> 3: a 2017-03-01 2
#> 4: a 2017-04-01 2
#> 5: a 2017-05-01 3
#> 6: b 2017-02-01 4
#> 7: b 2017-03-01 4
#> 8: b 2017-04-01 5
Not sure if it's the fastest, but it'll work if there are no NAs in the data:
# just in case these aren't Dates.
NADayWiseOrders$date <- as.Date(NADayWiseOrders$date)
# all desired dates.
alldates <- data.table(date=seq.Date(min(NADayWiseOrders$date), max(NADayWiseOrders$date), by="day"))
# merge
dt <- merge(NADayWiseOrders, alldates, by="date", all=TRUE)
# now carry forward last observation (alternatively, set NAs to 0)
library(zoo) # na.locf() is provided by zoo, which xts loads as a dependency
na.locf(dt)
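For the other option mentioned in the question, filling the gaps with zeros, here is a minimal base R sketch. It rebuilds the sample data as a plain data frame; the names orders_df, filled and zero_filled are made up for illustration.

```r
# Sample data from the question (03-Jan and 06-Jan are missing)
orders_df <- data.frame(
  date   = as.Date(c("2013-01-01", "2013-01-02", "2013-01-04",
                     "2013-01-05", "2013-01-07")),
  orders = c(50, 3, 1, 2, 2),
  amount = c(2272.55, 64.04, 18.81, 77.62, 35.82),
  guests = c(149, 4, 0, 0, 2)
)

# Every calendar day between the first and last observed date
all_dates <- data.frame(
  date = seq(min(orders_df$date), max(orders_df$date), by = "day")
)

# Left join: days without data get NA in the value columns
filled <- merge(all_dates, orders_df, by = "date", all.x = TRUE)

# Replace the NAs in every value column with zero
value_cols <- setdiff(names(filled), "date")
zero_filled <- filled
zero_filled[value_cols] <- lapply(zero_filled[value_cols], function(v) {
  v[is.na(v)] <- 0
  v
})

print(zero_filled)
```

Note that this also overwrites any NA values already present in the input, so it is only appropriate when a missing value really means zero.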
