Cumulative sum, with lag, by group, on date-stamped observations - r

I have this data set of batting data from the GameDay servers:
eliasID teamID gameID gameDate h hr bb so rbi ab runs t d lob sb cs sf hbp
1 430203 kca 2010/04/01/arimlb-kcamlb-1 4/1/2010 1 2 0 0 0 0 0 0 0 0
2 459714 kca 2010/04/01/arimlb-kcamlb-1 4/1/2010 1 0 0 1 0 3 1 0 0 1 0 0 0 0
3 325392 kca 2010/04/01/arimlb-kcamlb-1 4/1/2010 0 0 1 0 0 1 0 0 0 0 0 0 0 0
4 429801 kca 2010/04/01/arimlb-kcamlb-1 4/1/2010 0 0 0 1 0 3 0 0 0 2 0 0 0 0
5 456714 kca 2010/04/01/arimlb-kcamlb-1 4/1/2010 0 0 1 0 0 4 0 0 0 2 0 0 0 0
6 150449 kca 2010/04/01/arimlb-kcamlb-1 4/1/2010 0 0 0 1 1 4 0 0 0 2 0 0 0 0
ba ID gameDateFormat year Year
1 0.345 1 2010-04-01 2010 NA
2 0.250 2 2010-04-01 2010 NA
3 0.319 3 2010-04-01 2010 NA
4 0.327 4 2010-04-01 2010 NA
5 0.333 5 2010-04-01 2010 NA
6 0.217 6 2010-04-01 2010 NA
My issue is that I'd like to build a running total of at bats (ab) for each game, but total only those at bats from games with gameDate lower than the row's gameDate, and with games in the same gameYear.
I've looked at for loops, and dplyr has been suggested, but these approaches all sum the entire ab column for one player, when I need an ongoing sum attached to each game that shows the player's ab total for the year so far at that game.
I'm attempting to build the equivalent of the kind of statistics you see on baseball-ref.com.
In English, I'm looking for:
For each EliasID, gameID in Batting:
sum(ab) for the EliasID where the gameDate < this row's gameDate and the gameYear = this row's gameYear
What do you think?

Welcome to R programming. Because you did not provide complete sample data (i.e. dput() rather than just a print()), this answer makes a couple of assumptions:
your data.frame is called df. You can replace this name with the actual name.
your gameDate is an actual date vector, not just a string vector. If it is a string, change it to a date with df$gameDate <- as.Date(df$gameDate, format = "%m/%d/%Y")
It appears that what you want is a "cumulative sum with lag, by group." (I recommend that you make this your title to make it clear that this is what you want.) Let's look at both of those parts.
Cumulative sum, with lag
As suggested in this answer, an easy way to introduce a lag of 1 into cumsum() is to replace the vector x1, x2, ... xn with 0, x1, x2, ... xn-1. Thus:
cumsumLag1 <- function(x){
cumsum(c(0, head(x, n = -1))) # see ?cumsum and ?head, particularly the note on negative n
}
# test it out on first 5 counting numbers
cumsumLag1(1:5) # returns: 0 1 3 6 10
Your dataset should be in the right chronological order for the cumulative function. So you could do something with ?order like:
df <- df[order(df$gameDate), ] # the comma selects rows; without it, [ ] would index columns
but we will use arrange() in dplyr (see below) to keep things simple.
By group
There are many ways to do sum (and similar functions) by group. Perhaps the simplest syntax is %>% group_by(thing) in dplyr. You want to group by year, and perhaps other variables (maybe teamID or eliasID). One really unclear part of your question is what you're trying to group by, so please just focus on the concept here. The first challenge is that you don't have a year variable, and there are lots of ways to do this. Let's just do something like this:
df$gameYear <- as.POSIXlt(df$gameDate)$year + 1900 # see ?POSIXlt for more details
Putting it together
Using the chain operator %>%, we just sequence what we've already reviewed.
library(dplyr)
cumsumLag1 <- function(x) cumsum(c(0, head(x, n = -1)))
df %>%
mutate(gameYear = as.POSIXlt(gameDate)$year + 1900) %>%
arrange(gameDate) %>%
group_by(gameYear) %>%
mutate(priorAtBats = cumsumLag1(ab))
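Since the question asks for a per-player total, here is a minimal sketch of the same idea grouped by both player and year. It uses base R's ave() so it runs without extra packages; the toy data frame bat and its values are made up for illustration, and only the column names eliasID, gameDate and ab come from the question.

```r
cumsumLag1 <- function(x) cumsum(c(0, head(x, n = -1)))

# Hypothetical toy data: two players, one of them with games in two seasons.
bat <- data.frame(
  eliasID  = c(1, 1, 1, 1, 2, 2),
  gameDate = as.Date(c("2010-04-01", "2010-04-03", "2010-04-05",
                       "2011-04-02", "2010-04-01", "2010-04-03")),
  ab       = c(4, 3, 5, 4, 2, 3)
)

bat$gameYear <- as.integer(format(bat$gameDate, "%Y"))

# Sort chronologically so the cumulative sum runs in game order.
bat <- bat[order(bat$gameDate), ]

# One group per player-season; ave() applies the lagged sum within each group.
group_key <- paste(bat$eliasID, bat$gameYear)
bat$priorAtBats <- ave(bat$ab, group_key, FUN = cumsumLag1)

print(bat)
```

In the dplyr pipeline above, the equivalent change would be group_by(eliasID, gameYear) in place of group_by(gameYear). Note that two games on the same date (a doubleheader) are lagged in row order, not treated as simultaneous.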

Related

How to create a new variable based on condition from different dataframe in R

I have 2 data frames from an experiment. The 1st df reads a (roughly) continuous signal over 40 mins. There are 5 columns, 1:3 are binary - saying whether a button was pushed. The 4th column is a binary of if either from column 2 or 3 was pushed. The 5th column is an approximate time in seconds. Example from df below:
initiate  left  right  l or r  time
0         0     1      1       2.8225
0         0     1      1       2.82375
0         0     1      1       2.82500
0         0     1      1       2.82625
1         0     0      0       16.82000
1         0     0      0       16.82125
etc.
The 2nd data frame is session info where each row is a trial, usually 100-150 rows depending on the day. I have a column that marks trial start time and another column that marks trial end time in seconds. Example from df below (I omitted several irrelevant columns):
trial  success  t start   t end
1      0        16.64709  35.49431
2      1        41.81843  57.74304
3      0        65.54510  71.16612
4      0        82.65743  87.30914
etc.
For the 1st data frame, I want to create a column that indicates whether or not the button was pushed within a trial. This is based on those start and end times in the 2nd df. I would like it to look something like this (iti = inter-trial, wt = within trial):
initiate  left  right  l or r  time      trial
0         0     1      1       2.8225    iti
0         0     1      1       2.82375   iti
0         0     1      1       2.82500   iti
0         0     1      1       2.82625   iti
1         0     0      0       16.82000  wt
1         0     0      0       16.82125  wt
etc.
I had the idea to do something like this, but I don't have a grouping variable between the 2 data frames so it doesn't work:
df2 %>%
full_join(df1, by = "trial") %>%
mutate(in_iti = case_when(time < tstart & time > tend ~ "iti",
time > tstart & time < tend ~ "within_trial"))
Any ideas on how to label the rows in df1 based on the time condition from the df2?
Thank you!
Maybe try the following with dplyr, if your data is relatively small. This assumes the data frames are named df and df2, and that the start and end columns in df2 are named t_start and t_end. mutate creates the new column, and ifelse compares each time in the first data frame with t_start and t_end in the second.
library(dplyr)
df %>%
rowwise() %>%
mutate(trial = ifelse(any(time > df2$t_start & time < df2$t_end), "wt", "iti"))
Output
initiate left right l_or_r time trial
<int> <int> <int> <int> <dbl> <chr>
1 0 0 1 1 2.82 iti
2 0 0 1 1 2.82 iti
3 0 0 1 1 2.82 iti
4 0 0 1 1 2.83 iti
5 1 0 0 0 16.8 wt
6 1 0 0 0 16.8 wt
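For larger data, a vectorised variant may be worth trying. The sketch below uses base R's outer() to compare every time against every trial window at once; the names df1, t_start, t_end and trial_no are assumptions made for this example, and the values are copied from the question's sample rows.

```r
# Sample rows from the question, with assumed syntactic column names.
df1 <- data.frame(
  initiate = c(0, 0, 0, 0, 1, 1),
  left     = 0,
  right    = c(1, 1, 1, 1, 0, 0),
  l_or_r   = c(1, 1, 1, 1, 0, 0),
  time     = c(2.8225, 2.82375, 2.825, 2.82625, 16.82, 16.82125)
)
df2 <- data.frame(
  trial   = 1:4,
  success = c(0, 1, 0, 0),
  t_start = c(16.64709, 41.81843, 65.54510, 82.65743),
  t_end   = c(35.49431, 57.74304, 71.16612, 87.30914)
)

# inside[i, j] is TRUE when time i falls strictly within trial j.
inside <- outer(df1$time, df2$t_start, ">") & outer(df1$time, df2$t_end, "<")

# Label each row, and record which trial it belongs to (NA between trials).
df1$trial    <- ifelse(rowSums(inside) > 0, "wt", "iti")
df1$trial_no <- df2$trial[apply(inside, 1, function(r) which(r)[1])]

print(df1)
```

The comparison matrix has one cell per time/trial pair, so with a very long signal it can get large; processing the signal in chunks would be one way around that.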

Transforming longitudinal data for time-to-event analysis in R

I am trying to reformat longitudinal data for a time to event analysis. In the example data below, I simply want to find the earliest week that the result was “0” for each ID.
The specific issue I am having is how to handle patients who never convert to 0 and have only 1's or 2's. In the example data, patient J has all 1's.
#Sample data
have<-data.frame(patient=rep(LETTERS[1:10], each=9),
week=rep(0:8,times=10),
result=c(1,0,2,rep(0,6),1,1,2,1,rep(0,5),1,1,rep(0,7),1,rep(0,8),
1,1,1,1,2,1,0,0,0,1,1,1,rep(0,6),1,2,1,rep(0,6),1,2,rep(0,7),
1,rep(0,8),rep(1,9)))
patient week result
A 0 1
A 1 0
A 2 2
A 3 0
A 4 0
A 5 0
A 6 0
A 7 0
A 8 0
B 0 1
B 1 0
... .....
J 6 1
J 7 1
J 8 1
I am able to do this relatively straightforward process with the following code:
want<-aggregate(have$week, by=list(have$patient,have$result), min)
want<-want[which(want[2]==0),]
but I realize that if someone does not convert to 0, they are excluded (in this example, patient J is excluded). J should be present with a 1 in the second column and an 8 in the third column, but is of course omitted:
print(want)
Group.1 Group.2 x
A 0 1
B 0 4
C 0 2
D 0 1
E 0 6
F 0 3
G 0 3
H 0 2
I 0 1
#But also need
J 1 8
Pursuant to the guidelines on posting here, I did work on solving this and am able to get what I need, very inelegantly:
mins<-aggregate(have$week, by=list(have$patient,have$result), min)
maxs<-aggregate(have$week, by=list(have$patient,have$result), max)
want<-rbind(mins[which(mins[2]==0),],maxs[which(maxs[2]==1&maxs[3]==8),])
This returns the correct desired dataset, but the coding is terrible and not sustainable as I work with other datasets (i.e. datasets with different timeframes, since I have to manually put in maxs[3]==8, etc.).
Is there a more elegant or systematic way to approach this data manipulation issue?
We can write a function to select a row from the group.
select_row <- function(result, week) {
if(any(result == 0)) which.max(result == 0) else which.max(week)
}
This function returns the index of the first 0 value if one is present; otherwise it returns the index of the maximum value of week.
We then apply it to all groups.
library(dplyr)
have %>% group_by(patient) %>% slice(select_row(result, week))
# patient week result
# <fct> <int> <dbl>
# 1 A 1 0
# 2 B 4 0
# 3 C 2 0
# 4 D 1 0
# 5 E 6 0
# 6 F 3 0
# 7 G 3 0
# 8 H 2 0
# 9 I 1 0
#10 J 8 1
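The same selection rule can also be sketched in base R with split() and lapply(), in case dplyr is not available. The helper name first_zero_or_last is made up for this example; it sorts each patient's rows by week first, because which.max(result == 0) only gives the earliest week if the rows are in week order.

```r
# Sample data from the question.
have <- data.frame(patient = rep(LETTERS[1:10], each = 9),
                   week = rep(0:8, times = 10),
                   result = c(1,0,2,rep(0,6),1,1,2,1,rep(0,5),1,1,rep(0,7),1,rep(0,8),
                              1,1,1,1,2,1,0,0,0,1,1,1,rep(0,6),1,2,1,rep(0,6),1,2,rep(0,7),
                              1,rep(0,8),rep(1,9)))

# Hypothetical helper: pick the first week with result 0, or else the last week.
first_zero_or_last <- function(d) {
  d <- d[order(d$week), ]
  idx <- if (any(d$result == 0)) which.max(d$result == 0) else which.max(d$week)
  d[idx, ]
}

# Apply the helper to each patient's rows and stack the selected rows.
want <- do.call(rbind, lapply(split(have, have$patient), first_zero_or_last))
print(want)
```

Because the fallback uses the maximum week actually present in each group, nothing like the hard-coded 8 has to change when the follow-up length differs between datasets.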

How to convert the result of xtabs() into dataframe in R? [duplicate]

I have data like dataframe df_a, and want to have it converted to the format as in dataframe df_b.
xtabs() gives similar result, but I did not find a way to access elements as in the example code below. Accessing through xa[1,1] gives no advantage since there is a weak correlation between indexing by numbers ("1") and names ("A"). As you can see there is a sort difference in the xtabs() result, so xa[2,2]=2 and not 0 as on the df_b listing.
> df_a
ItemName Feature Amount
1 First A 2
2 First B 3
3 First A 4
4 Second C 3
5 Second C 2
6 Third D 1
7 Fourth B 2
8 Fourth D 3
9 Fourth D 2
> df_b
ItemName A B C D
1 First 6 3 0 0
2 Second 0 0 5 0
3 Third 0 0 0 1
4 Fourth 0 2 0 5
> df_b$A
[1] 6 0 0 0
> xa<-xtabs(df_a$Amount~df_a$ItemName+df_a$Feature)
> xa
df_a$Feature
df_a$ItemName A B C D
First 6 3 0 0
Fourth 0 2 0 5
Second 0 0 5 0
Third 0 0 0 1
> xa$A
Error in xa$A : $ operator is invalid for atomic vectors
There is a way of iterative conversion with for() loops, but totally inefficient in my case because my data has millions of records.
For the purpose of further processing my required output format is dataframe.
If anyone solved similar problem please share.
You can just use as.data.frame.matrix(xa)
# output
A B C D
First 6 3 0 0
Fourth 0 2 0 5
Second 0 0 5 0
Third 0 0 0 1
## or
df_b <- as.data.frame.matrix(xa)[unique(df_a$ItemName), ]
data.frame(ItemName = row.names(df_b), df_b, row.names = NULL)
# output
ItemName A B C D
1 First 6 3 0 0
2 Second 0 0 5 0
3 Third 0 0 0 1
4 Fourth 0 2 0 5
Without using xtabs you can do something like this:
library(dplyr) # provides the %>% operator
df_a %>%
dplyr::group_by(ItemName, Feature) %>%
dplyr::summarise(Sum = sum(Amount, na.rm = TRUE)) %>%
tidyr::spread(Feature, Sum, fill = 0) %>%
as.data.frame()
This will transform as you require and it stays as a data.frame
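Or you can call as.data.frame(your_xtabs_result), which also returns a data.frame, but in long format: one row per ItemName/Feature combination, with the sums in a Freq column.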
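To make the difference between the two conversions concrete, here is a small sketch on the question's data. It builds the table with the data = argument of xtabs(), which is a choice made for this example so that the resulting column names stay short; the names wide and long are likewise just illustrative.

```r
# Data from the question.
df_a <- data.frame(
  ItemName = c("First", "First", "First", "Second", "Second",
               "Third", "Fourth", "Fourth", "Fourth"),
  Feature  = c("A", "B", "A", "C", "C", "D", "B", "D", "D"),
  Amount   = c(2, 3, 4, 3, 2, 1, 2, 3, 2)
)

xa <- xtabs(Amount ~ ItemName + Feature, data = df_a)

# Wide: one row per ItemName, one column per Feature.
wide <- as.data.frame.matrix(xa)
print(wide["Fourth", "D"])  # cells can be addressed by name
print(wide$A)               # and columns with $

# Long: one row per ItemName/Feature pair, sums in the Freq column.
long <- as.data.frame(xa)
print(head(long))
```

Indexing by name, as in wide["Fourth", "D"], avoids the problem of the alphabetical row order not matching the order in df_a.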

Conditional column creation (horizontal and vertical conditions)

My starting condition is something like the df data frame
df<-data.frame(id=c(rep(2, 3), rep(4, 2)), year=c(2005:2007, 2005:2006), event=c(1,0,0,0,1))
id year event
1 2 2005 1
2 2 2006 0
3 2 2007 0
4 4 2005 0
5 4 2006 1
I have a series of actors (identified through an id) who happen to experience an event in a certain year.
What I am trying to build is a series of additional columns that describe a) the distance from events and b) whether such distance is observable.
This is what I would like to obtain.
id year event evm2 evm1 evp1 evp2 ndm2 ndm1 ndp1 ndp2
1 2 2005 1 0 0 0 0 1 1 0 0
2 2 2006 0 0 1 0 0 1 0 0 1
3 2 2007 0 1 0 0 0 0 0 1 1
4 4 2005 0 0 0 1 0 1 1 0 1
5 4 2006 1 0 0 0 0 1 0 1 1
event equals 1 when there is an event in a certain year. evm1 equals 1 when an event is observable in the year before. Similarly, evp1 is 1 when the event is in the following year - the letters p or m stand for 'plus' and 'minus' and the numbers represent the distance in years from the event.
For some of these observations the distance is not observable because the available time window is too short. This is the case of df[1,] for which we don't know if in the previous years an event took place or not. In such a case, ndm1 and ndm2 are coded 1. If we consider the case df[5,], it will be ndp1 (and ndp2) to be coded 1.
ev and nd variables work exactly in the same way. But the former tells if at a certain distance there is an event or not and the latter reveals whether such a distance is actually observable.
I tried to accomplish this using the following nested for loops, but I didn't succeed.
lag<-c(-2, -1, 1, 2)
df2<-df
df2[,4:11]<-0
colnames(df2)<-c("id", "year", "event", "evm2", "evm1", "evp1", "evp2", "ndm2", "ndm1", "ndp1", "ndp2")
for (i in length(df2$id)) {
id<-df2[i,1]
yr<-df2[i,2]
sta<-3
sta2<-7
for (j in lag){
sta<-sta+1
sta2<-sta2+1
if !is.null(df2[df2$id==id & df2$year==yr+j])==TRUE {
rw<-which(df2[df2$id==id & df2$year==yr+j])
if (df2[rw,3]==1) df2[i, sta]==1
} else {
df2[i, sta2]==1
}
}
}
Do you see anything that may be responsible for the errors? I have been going mad for two days trying to make it work and I would be really thankful if you could help.
Following my comment, here is what I had in mind as a potential rewrite:
lag.it <- function(x, n = 0L) {
l <- length(x)
neg.lag <- min(max(0L, -n), l)
pos.lag <- min(max(0L, +n), l)
c(rep(NA, +neg.lag),
head(x, -neg.lag),
tail(x, -pos.lag),
rep(NA, +pos.lag))
}
library(plyr)
ddply(df, "id", transform,
evm2 = lag.it(event, -2),
evm1 = lag.it(event, -1),
evp1 = lag.it(event, +1),
evp2 = lag.it(event, +2))
# id year event evm2 evm1 evp1 evp2
# 1 2 2005 1 NA NA 0 0
# 2 2 2006 0 NA 1 0 NA
# 3 2 2007 0 1 0 NA NA
# 4 4 2005 0 NA NA 1 NA
# 5 4 2006 1 NA 0 NA NA
Notice how I use NAs instead of using two sets of variables. While I'd recommend you keep it this way, you can easily get what you asked for by defining e.g. ndm2 as is.na(evm2) then replace the NAs by zeroes.
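A minimal sketch of that last step, deriving the nd columns from the NAs and then zero-filling the ev columns, could look like the following. It repeats lag.it so that it runs on its own and uses base R's ave() for the grouping instead of plyr; the shifts vector is a name introduced here, and the sketch assumes each id has one row per consecutive year, sorted by year.

```r
df <- data.frame(id = c(rep(2, 3), rep(4, 2)),
                 year = c(2005:2007, 2005:2006),
                 event = c(1, 0, 0, 0, 1))

# Same helper as above, repeated so this sketch is self-contained.
lag.it <- function(x, n = 0L) {
  l <- length(x)
  neg.lag <- min(max(0L, -n), l)
  pos.lag <- min(max(0L, +n), l)
  c(rep(NA, neg.lag), head(x, -neg.lag), tail(x, -pos.lag), rep(NA, pos.lag))
}

# Column suffix -> shift in years (m = minus, p = plus).
shifts <- c(m2 = -2, m1 = -1, p1 = 1, p2 = 2)

for (s in names(shifts)) {
  # Shift event within each id; NA marks a year outside the observed window.
  ev <- ave(df$event, df$id, FUN = function(x) lag.it(x, shifts[[s]]))
  df[[paste0("nd", s)]] <- as.integer(is.na(ev))  # 1 = distance not observable
  ev[is.na(ev)] <- 0
  df[[paste0("ev", s)]] <- ev
}

print(df)
```

The columns come out in nd/ev pairs rather than in the order shown in the question, so they may need reordering if the layout matters.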

Restructure Data in R

I am just starting to get beyond the basics in R and have come to a point where I need some help. I want to restructure some data. Here is what a sample dataframe may look like:
ID Sex Res Contact
1 M MA ABR
1 M MA CON
1 M MA WWF
2 F FL WIT
2 F FL CON
3 X GA XYZ
I want the data to look like:
ID SEX Res ABR CON WWF WIT XYZ
1 M MA 1 1 1 0 0
2 F FL 0 1 0 1 0
3 X GA 0 0 0 0 1
What are my options? How would I do this in R?
In short, I am looking to keep the values of the Contact column and use them as column names in the restructured data frame. I want to hold a variable set of columns constant (in the example above, I held ID, Sex, and Res constant).
Also, is it possible to control the values in the restructured data? I may want to keep the data as binary. I may want some data to have the value be the count of times each contact value exists for each ID.
The reshape package is what you want. Documentation here: http://had.co.nz/reshape/. Not to toot my own horn, but I've also written up some notes on reshape's use here: http://www.ling.upenn.edu/~joseff/rstudy/summer2010_reshape.html
For your purpose, this code should work
library(reshape)
data$value <- 1
cast(data, ID + Sex + Res ~ Contact, fun = "length")
model.matrix works great (this was asked recently, and gappy had this good answer):
> model.matrix(~ factor(d$Contact) -1)
factor(d$Contact)ABR factor(d$Contact)CON factor(d$Contact)WIT factor(d$Contact)WWF factor(d$Contact)XYZ
1 1 0 0 0 0
2 0 1 0 0 0
3 0 0 0 1 0
4 0 0 1 0 0
5 0 1 0 0 0
6 0 0 0 0 1
attr(,"assign")
[1] 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$`factor(d$Contact)`
[1] "contr.treatment"
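On the question of counts versus binary values while keeping ID, Sex and Res, one base R possibility is to tabulate with table() and merge the result back onto the constant columns. This is only a sketch: the names d, keys, counts, wide and binary are made up here, and the data are the six sample rows from the question.

```r
# Sample rows from the question.
d <- data.frame(
  ID      = c(1, 1, 1, 2, 2, 3),
  Sex     = c("M", "M", "M", "F", "F", "X"),
  Res     = c("MA", "MA", "MA", "FL", "FL", "GA"),
  Contact = c("ABR", "CON", "WWF", "WIT", "CON", "XYZ")
)

# Columns held constant, one row per ID.
keys <- unique(d[c("ID", "Sex", "Res")])

# Count how often each Contact value occurs for each ID.
counts <- as.data.frame.matrix(table(d$ID, d$Contact))
counts$ID <- as.numeric(rownames(counts))

# Count version: ID, Sex, Res followed by one column per Contact value.
wide <- merge(keys, counts, by = "ID")

# Binary version: 1 if the contact occurred at least once, else 0.
contact_cols <- setdiff(names(wide), names(keys))
binary <- wide
binary[contact_cols] <- lapply(wide[contact_cols], function(x) as.integer(x > 0))

print(wide)
print(binary)
```

In this sample the two versions are identical because no ID repeats a contact; they would differ as soon as one did.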

Resources