Frequency count with multiple conditions R - r

Have a data frame
Date Team Opponent Weather Outcome
2017-05-01 All Stars B Stars Rainy 1
2017-05-02 All Stars V Stars Rainy 1
2017-05-03 All Stars M Trade Sunny 0
.
.
2017-05-11 All Stars Vdronee Sunny 0
Where Outcome 1 indicates a win. I have used the table function to get the frequency and applied condition.
table(df$Outcome, df$Team == "All Stars")
Returns me this
FALSE TRUE
0 1005 30
1 1323 57
So frequency of win is 57/87 =0.655
Two Questions:
Rather the calculating the win frequency manually, how do I embed this directly in a formula?
and
How do I filter based on the x most recent observations? i.e something like
table(df$Outcome, df$Team == "All Stars" & df$date = filtering for the 5 most recent observations)
thanks

An option is to use data.table
libray(data.table)
dt <- data.table(df)
dt[, .(prop=sum(outcome)/.N),Team]
to get the 5 most recent observations you can to the following:
dt[,head(.SD,5),by=.(Team,Date)][,.(prop=sum(outcoume/.N),Team]

Related

I need to find the mean for the data with cells without values

I need to find the average prices for all the different weeks. I need to make a ggplot to show how the price is during the year.
When you find the mean how does the empty cells affect the mean?
I have tried several thing including using the melt() function so I only have 3 variables. The variable are factors which I want to find the mean of.
Company variable value
ns Price week 24 1749
ns Price week 24
ns Price week 24 1599
ns Price week 24
ns Price week 24
ns Price week 24 359
ns Price week 24 460
I got more than 300K obs, and would love to have a small data.frame where I only have the Company, Price of different weeks as a mean. Now I have all observations for each week and I need to use the mean for using GGplot.
When I use following code
dat %in% mutate(means=mean(value), na.rm=TRUE)
I got a warning message saying the argument is not numeric or logical: returning NA.
I am looking forward to getting your help!
Clean code from PavoDive's comment
dt[!is.na(value), mean(value), by = .(price, week)]
and even better
dt[ , mean(value, na.rm = TRUE), by = .(price, week)]
Original:
This works using data.table. The first part filters out rows that don't have a number in value. Next is to say we want the average from the value column. Final the by defines how to group the rows.
Code:
dt[value >0 | value<1, .(MeanValues = mean(`value`)), by = c("Price", "Week")][]
Input:
dt <- data.table(`Price` = c("A","B","B","A","A","B","B","A"),
`Week`= c(1,2,1,1,2,2,1,2),
`value` = c(3,7,2,NA,1,46,1,NA))
Price Week value
1: A 1 3
2: B 2 7
3: B 1 2
4: A 1 NA
5: A 2 1
6: B 2 46
7: B 1 1
8: A 2 NA
Output:
1: A 1 3.0
2: B 2 26.5
3: B 1 1.5
4: A 2 1.0

Rank most recent scores of students within a given date - 30 days window

Following is what my dataframe/data.table looks like. The rank column is my desired calculated field.
library(data.table)
df <- fread('
Name Score Date Rank
John 42 1/1/2018 3
Rob 85 12/31/2017 2
Rob 89 12/26/2017 1
Rob 57 12/24/2017 1
Rob 53 08/31/2017 1
Rob 72 05/31/2017 2
Kate 87 12/25/2017 1
Kate 73 05/15/2017 1
')
df[,Date:= as.Date(Date, format="%m/%d/%Y")]
I am trying to calculate the rank of each student at every given point in time in the data within a 30 day windows. For that, I need to fetch the most recent scores of all students at a given point in time and then pass the rank function.
In the 1st row, as of 1/1/2018, John has two more competitors in a past 30 day window: Rob with the most recent score of 85 in 12/31/2017 AND Kate with the most recent score of 87 in 12/25/2017 and both of these dates fall within the 1/1/2018 - 30 Day Window. John gets a rank of 3 with the lowest score of 42. If only one students falls within date(at a given row) - 30 day window, then the rank is 1.
In the 3rd row the date is 12/26/2017. So Rob's score as of 12/26/2017 is 89. There is only one case of another student that falls in the time window of 12/26/2017 - 30 days and that is the most recent score(87) of kate on 12/25/2017. So within the time window of (12/26/2017) - 30 , Rob's score of 89 is higher than Kate's score of 87 and therefore Rob gets rank 1.
I was thinking about using the framework from here Efficient way to perform running total in the last 365 day window but struggling to think of a way to fetch all recent score of all students at a given point in time before using rank.
This seems to work:
ranks = df[.(d_dn = Date - 30L, d_up = Date), on=.(Date >= d_dn, Date <= d_up), allow.cart=TRUE][,
.(LatestScore = last(Score)), by=.(Date = Date.1, Name)]
setorder(ranks, Date, -LatestScore)
ranks[, r := rowid(Date)]
df[ranks, on=.(Name, Date), r := i.r]
Name Score Date Rank r
1: John 42 2018-01-01 3 3
2: Rob 85 2017-12-31 2 2
3: Rob 89 2017-12-26 1 1
4: Rob 57 2017-12-24 1 1
5: Rob 53 2017-08-31 1 1
6: Rob 72 2017-05-31 2 2
7: Kate 87 2017-12-25 1 1
8: Kate 73 2017-05-15 1 1
... using last since the Cartesian join seems to sort and we want the latest measurement.
How the update join works
The i. prefix means it's a column from i in the x[i, ...] join, and the assignment := is always in x. So it's looking up each row of i in x and where matches are found, copying values from i to x.
Another way that is sometimes useful is to look up x rows in i, something like df[, r := ranks[df, on=.(Name,Date), x.r]] in which case x.r is still from the ranks table (now in the x position relative to the join).
There's also...
ranks = df[CJ(Name = Name, Date = Date, unique=TRUE), on=.(Name, Date), roll=30, nomatch=0]
setnames(ranks, "Score", "LatestScore")
# and then use the same last three lines above
I'm not sure about efficiency of one vs another, but I guess it depends on number of Names, frequency of measurement and how often measurement days coincide.
A solution that uses data.table though not sure if it is the most efficient usage:
df[.(iName=Name, iScore=Score, iDate=Date, StartDate=Date-30, EndDate=Date),
.(Rank=frank(-c(iScore[1L], .SD[Name != iName, max(Score), by=.(Name)]$V1),
ties.method="first")[1L]),
by=.EACHI,
on=.(Date >= StartDate, Date <= EndDate)]
Explanation:
1) The outer square brackets do a non-equi join within a date range (i.e. 30days ago and latest date for each row). Try studying the below output against the input data:
df[.(iName=Name, iScore=Score, iDate=Date, StartDate=Date-30, EndDate=Date),
c(.(RowGroup=.GRP),
.SD[, .(Name, Score, Date, OrigDate, iName, iScore, iDate, StartDate, EndDate)]),
by=.EACHI,
on=.(Date >= StartDate, Date <= EndDate)]
2) .EACHI is to perform j calculations for each row of i.
3) Inside j, iScore[1L] is the score for the current row, .SD[Name != iName] means taking scores not corresponding to the student in the current row. Then, we use the max(Score) for each student of those students within the 30days window.
4) Concatenate all these scores and calculate the rank for the score of the current row while taking care of ties by taking the first tie.
Note:
see ?data.table to understand what i, j, by, on and .EACHI refers to.
EDIT after comments by OP:
I would add a OrigDate column and find those that matches the latest date.
df[, OrigDate := Date]
df[.(iName=Name, iScore=Score, iDate=Date, StartDate=Date-30, EndDate=Date),
.(Name=iName, Score=iScore, Date=iDate,
Rank=frank(-c(iScore[1L],
.SD[Name != iName, Score[OrigDate==max(OrigDate)], by=.(Name)]$V1),
ties.method="first")[1L]),
by=.EACHI,
on=.(Date >= StartDate, Date <= EndDate)]
I came up with following partial solution, encountered however problem - is it possible that there will be two people occuring with the same date?
if not, have a look at following piece of code:
library(tidyverse) # easy manipulation
library(lubridate) # time handling
# This function can be added to
get_top <- function(df, date_sel) {
temp <- df %>%
filter(Date > date_sel - months(1)) %>% # look one month in the past from given date
group_by(Name) %>% # and for each occuring name
summarise(max_score = max(Score)) %>% # find the maximal score
arrange(desc(max_score)) %>% # sort them
mutate(Rank = 1:n()) # and rank them
temp
}
Now, you have to find the name in the table, for given date and return its rank.
library(data.table)
library(magrittr)
setorder(df, -Date)
fun <- function(i){
df[i:nrow(df), head(.SD, 1), by = Name] %$%
rank(-Score[Date > df$Date[i] - 30])[1]
}
df[, rank := sapply(1:.N, fun)]
This can be done by joining to df those rows of df that are within 30 days behind it or the same date and have higher or equal scores. Then for each original row and joined row Name get the joined row Name that is the most recent. The count of the remaining joined rows for each of the original df rows is the rank.
library(sqldf)
sqldf("with X as
(select a.rowid r, a.*, max(b.Date) Date
from df a join df b
on b.Date between a.Date - 30 and a.Date and b.Score >= a.Score
group by a.rowid, b.Name)
select Name, Date, Score, count(*) Rank
from X
group by r
order by r")
giving:
Name Date Score Rank
1 John 2018-01-01 42 3
2 Rob 2017-12-31 85 2
3 Rob 2017-12-26 89 1
4 Rob 2017-12-24 57 1
5 Rob 2017-08-31 53 1
6 Rob 2017-05-31 72 2
7 Kate 2017-12-25 87 1
8 Kate 2017-05-15 73 1
A tidyverse solution (dplyr + tidyr):
df %>%
complete(Name,Date) %>%
group_by(Name) %>%
mutate(last_score_date = `is.na<-`(Date,is.na(Score))) %>%
fill(Score,last_score_date) %>%
filter(!is.na(Score) & Date-last_score_date <30) %>%
group_by(Date) %>%
mutate(Rank = rank(-Score)) %>%
right_join(df)
# # A tibble: 8 x 5
# # Groups: Date [?]
# Name Date Score last_score_date Rank
# <chr> <date> <int> <date> <dbl>
# 1 John 2018-01-01 42 2018-01-01 3
# 2 Rob 2017-12-31 85 2017-12-31 2
# 3 Rob 2017-12-26 89 2017-12-26 1
# 4 Rob 2017-12-24 57 2017-12-24 1
# 5 Rob 2017-08-31 53 2017-08-31 1
# 6 Rob 2017-05-31 72 2017-05-31 2
# 7 Kate 2017-12-25 87 2017-12-25 1
# 8 Kate 2017-05-15 73 2017-05-15 1
We add all missing combinations of Date and Name
then we create a column for the last_score_date, equal to Date when score isn't NA.
by filling NAs down Score has become the latest score
we filter out NAs and keep only scores that have < 30 days of age
That's our table of valid scores by dates
From there it's easy to add ranks
and a final right_join on the original table gives us the expected output
data
library(data.table)
df <- fread('
Name Score Date
John 42 01/01/2018
Rob 85 12/31/2017
Rob 89 12/26/2017
Rob 57 12/24/2017
Rob 53 08/31/2017
Rob 72 05/31/2017
Kate 87 12/25/2017
Kate 73 05/15/2017
')
df[,Date:= as.Date(Date, format="%m/%d/%Y")]

Create data frame for each unique row in another data frame

For an assignment for my graduate program, I have been asked to extract data from datasets of English Premier League results (located here). I am very close to being done but need help on the last two outputs.
We must create a function that can receive two arguments, a date and a season. The function must return a data frame with the table of the respective season on that date. It must include wins, losses, home record, away record, etc. The only ones I have not managed to figure out are W/L streak and the results of the last 10 matches.
Here is an example of what the initial dataset looks like:
e.Date e.HomeTeam e.AwayTeam e.FTHG e.FTAG e.FTR
1 2015-08-08 Bournemouth Aston Villa 0 1 A
2 2015-08-08 Chelsea Swansea 2 2 D
3 2015-08-08 Everton Watford 2 2 D
4 2015-08-08 Leicester Sunderland 4 2 H
5 2015-08-08 Man United Tottenham 1 0 H
My plan was to get Home and Away data sorted out for each club then merge them together before doing the analysis to find streak and last 10 results.
I manipulated the data to look like this:
HomeTeam FTR Date freq
1 Arsenal L 2015-08-09 1
2 Arsenal D 2015-08-24 1
3 Arsenal W 2015-09-12 1
4 Aston Villa L 2015-08-14 1
5 Aston Villa L 2015-09-19 1
6 Aston Villa D 2015-08-29 1
And now I'm kinda lost. My idea was to run some kind of loop (for? ddply? data.table?) to create a data frame for each club with their results in it and then loop again to do whatever calculations to get the desired variables (streak and last 10) and somehow push those back into the main data frame where I am housing all of the other outputs.
I don't want to be told the answer outright since it's important I learn this on my own. However, if someone could point me in the right direction that would be great. Thanks so much.
I created some dummy data just to demonstrate a few commands and maybe give you some ideas.
set.seed(321)
dat <- data.frame(team = sample(letters[1:3], 20, replace=TRUE),
season = rep("season1", 20),
time = rnorm(20),
win_loss = sample(c("win", "loss"), 20, replace=TRUE))
Problem 1. Find win/loss streak
Take a look at the rle function example below
# 1. find wl streak of team 'a'
tmp <- dat[dat$team == "a", ]
tmp <- tmp[order(tmp$time), ]
> tmp
team season time win_loss
19 a season1 -1.12032742 loss
14 a season1 -1.07223880 loss
16 a season1 0.09500072 loss
3 a season1 0.18832552 loss
8 a season1 0.42033257 loss
4 a season1 2.44325982 win
# shows runs of 5 consecutive losses, then 1 consecutive win
rle(tmp$win_loss == "win")
Run Length Encoding
lengths: int [1:2] 5 1
values : logi [1:2] FALSE TRUE
Here's a very helpful post on rle How can I count runs in a sequence?
Problem 2. Last 3 results
I reversed the order of time and then picked the top 3 results.
# 2. find last 3 matches for team 'b'
tmp <- dat[dat$team == "b", ]
tmp <- tmp[rev(order(tmp$time)), ]
> tmp[1:3, ]
team season time win_loss
11 b season1 0.9172555 loss
9 b season1 0.5775845 win
7 b season1 0.4560691 loss

Count the number of previous occurrences using a time window, not a fixed window size

I have a dataset like the following, the last column is desired output.
DX_CD AID date2 <count.occurences.1000.days>
1 272.4 1649 2007-02-10 0 or N/A
2 V58.67 1649 2007-02-10 0<- (excluding the same day). OR 1
3 787.91 1649 2010-04-14 0
4 788.63 1649 2011-03-10 1
5 493.90 4193 2007-09-13 0 or N/A #new AID
6 787.20 6954 2010-02-25 0 or N/A #new AID
.....
I want to compute the column (count.occurences.1000.days) that counts the number of previous occurrences within X days (e.g. X=1000) by AID.
The first value is 0 or N/A because there is no previous record before record #1 for AID=1649. The second value is 0 because this event occurs on the same day as record #1. Third value is 0 because there are records older than 2010-04-14, but they are beyond 1000days. Fourth value is 1 because the record #3 happened within 1000 days. Same logic goes for AID=4193 and AID=6954
Can someone provide an idea, preferably vectorized?
If I understood correctly the question, this should do
First, a sample of the data
df <- data.frame(date2=days <-
seq(as.Date("2008-12-30"), as.Date("2015-01-03"), by="days"),
AID=sample(c(1649, 4193, 6954, 3466), 2196, replace=T),
count=(rep.int(1,2196)))
Now we group by the 1000 days from max to min
df$date.bin <- Hmisc::cut2(df$date2,
cuts=sort(seq(max(df$date2), length=10,by="-1000 days")))
Now we use cumsum on the grouped variables
res <-df %>% dplyr::arrange(date.bin, AID) %>% group_by(date.bin, AID) %>%
mutate(cumsum=cumsum(count))

R: Using values from data frame A from a date prior to populate a row in data frame B

This may be very complicated and I suspect requires advanced knowledge. I have now two different types of data.frames I need to combine:
The data:
Dataframe A:
lists all transfusion dates by patient ID. Every transfusion is represented by a separate row, patients can have multiple transfusions. Different patients can have transfusions on the same date.
Patient ID Transfusion.Date
1 01/01/2000
1 01/30/2000
2 04/01/2003
3 04/01/2003
Dataframes of Type B contain test results at other dates, also by patient ID:
Patient ID Test.Date Test.Value
1 11/30/1999 negative
1 01/15/2000 700 copies/uL
1 01/27/2000 900 copies/uL
2 03/30/2003 negative
What I would like to have is Dataframe A with the same number of rows (1 for each transfusion), and with the most recent Test.Value as a separate column. Each transfusion date should have the test result from the test performed most closely (prior) to the transfusion.
desired output:
-->
Patient ID Transfusion.Date Pre.Transfusion.Test
1 01/01/2000 negative
1 01/30/2000 900 copies/ul
2 04/01/2003 negative
3 04/01/2003 NA
I think the general strategy would be to subset the data.frames by patient IDs. Then take all transfusion dates for patient 1, check which result is closest to all available test_dates for each element and then return the value closest.
How can I explain R to do that?
Edit 1: Here is the R-code for these examples
df_A <- data.frame(MRN = c(1,1,2,3),
Transfusion.Date = as.Date(c('01/01/2000', '01/30/2000',
'04/01/2003','04/01/2003'),'%m/%d/%Y'))
df_B <- data.frame(MRN = c(1,1,1,2),
Test.Date = as.Date(c('11/30/1999', '01/15/2000', '01/27/2000',
'03/30/2003'),'%m/%d/%Y'), Test.Result = c('negative',
'700 copies/ul','900 copies/ul','negative'))
Edit 2:
To clarify, the resulting data should be: Patient A received transfusions on Day X and Day Y. (for df_A). Prior to the transfusion on day X, his most recent test result was X (closest test date to first transfusion, in df_B). Prior to the transfusion on day Y, his most recent test result was Y (prior to the second transfusion, also in df_B. df_B also contains a bunch of other test dates, which are not needed for the final output.
Here's using data.table's rolling joins:
require(data.table)
setkey(setDT(df_A), MRN, Transfusion.Date)
setkey(setDT(df_B), MRN, Test.Date)
df_B[df_A, roll=TRUE]
# MRN Test.Date Test.Result
# 1: 1 2000-01-01 negative
# 2: 1 2000-01-30 900 copies/ul
# 3: 2 2003-04-01 negative
# 4: 3 2003-04-01 NA
setDT converts data.frame to data.table by reference (without any additional copying). That'll result in df_A and df_B now being data.tables.
setkey sorts the data.table by the columns we provided, and marks those columns as key columns, which allows us to use binary search based joins.
We perform a join of the form x[i] on the key columns, where for each row of i, the matching rows of x (if any, else NA) along with i's rows are returned. This is what we call an equi-join. By adding roll = TRUE, in the event of a mismatch, the last observation is carried forward (LOCF). This is what we call a rolling join. The sorting in increasing order (due to setkey()) ensures that the last observation is the most recent date.
HTH
dfLast <- df_B[ df_B$Test.Date %in%
as.Date( tapply(df_B$Test.Date, df_B$MRN, tail,1),"1970-01-01"), ]
merge(df_A, dfLast, by=c(1:2,1:2) ,all.y=TRUE)
MRN Transfusion.Date Test.Result
1 1 2000-01-27 900 copies/ul
2 2 2003-03-30 negative
Edited. Had some logical errors and some sytactic errors. tapply returned the integer values of the Dates and as you pointed out I was using the wrong column name in the data reduction step.
OK thanks for everyone's help. It took me a lot of toil, blood, sweat, and tears, but this is the solution I came up with:
Merge both data frames:
df_AB <- merge(df_A, df_B, all.x = T)
df_AB:
MRN Transfusion.Date Test.Date Test.Result
1 1 2000-01-01 1999-11-30 negative
2 1 2000-01-01 2000-01-15 700 copies/ul
3 1 2000-01-01 2000-01-27 900 copies/ul
4 1 2000-01-30 1999-11-30 negative
5 1 2000-01-30 2000-01-15 700 copies/ul
6 1 2000-01-30 2000-01-27 900 copies/ul
7 2 2003-04-01 2003-03-30 negative
8 3 2003-04-01 <NA> <NA>
Using dplyr
df_tests <- df_AB %>%
group_by(MRN, Transfusion.Date) %>%
mutate(Time.Difference = Transfusion.Date - Test.Date) %>%
filter(Time.Difference > 0) %>%
arrange(Time.Difference) %>%
summarize(Test.Date = Test.Date[1], Test.Result = Test.Result[1])
df_tests:
MRN Transfusion.Date Test.Date Test.Result
1 1 2000-01-01 1999-11-30 negative
2 1 2000-01-30 1999-11-30 negative
3 2 2003-04-01 2003-03-30 negative
using merge again for MRN3:
df_desired <- merge(df_A, df_tests, all.x = T)
MRN Transfusion.Date Test.Date Test.Result
1 1 2000-01-01 1999-11-30 negative
2 1 2000-01-30 2000-01-27 900 copies/ul
3 2 2003-04-01 2003-03-30 negative
4 3 2003-04-01 <NA> <NA>

Resources