Vectorize calculation across relational dataframes in R

Is it possible in R to vectorize a calculation on data in a dataframe, where one criterion for the calculation comes from an external dataframe? It can be done with a for loop, but that is slow.
The full task involves asking questions of 15 years of medical laboratory data in relational format. For example: what is the lowest haemoglobin level recorded for a patient in the three months following a surgical procedure? This draws on two tables: one with dates of surgery (~6,000 rows, often multiple per patient) and one of dated haemoglobin levels (~200,000 rows, multiple per patient). A loop like the one below takes ~30 minutes per query.
In this MWE, the data are in two tables linked by an index.
## create two dataframes linked by an ID column
a <- c("ID1", "ID2", "ID3", "ID2", "ID1")
b <- c(1, 2, 3, 4, 5)
c <- as.Date(c("2005-01-01", "2002-01-01", "2003-01-01", "2004-01-01", "2001-01-01"))
df.1 <- cbind.data.frame(a, b, c, stringsAsFactors = FALSE)
d <- c("ID1", "ID2", "ID1")
e <- as.Date(c("2002-02-01", "2001-02-01", "2000-01-01"))
df.2 <- cbind.data.frame(d, e, stringsAsFactors = FALSE)
> df.1
    a b          c
1 ID1 1 2005-01-01
2 ID2 2 2002-01-01
3 ID3 3 2003-01-01
4 ID2 4 2004-01-01
5 ID1 5 2001-01-01
> df.2
    d          e
1 ID1 2002-02-01
2 ID2 2001-02-01
3 ID1 2000-01-01
out <- rep(NA, length(df.2$d))
for (i in seq_along(df.2$d)) {
  # highest b in df.1 for the same ID, dated after the df.2 date
  out[i] <- max(df.1$b[df.1$a == df.2$d[i] & df.1$c > df.2$e[i]])
}
> cbind(df.2, out)
    d          e out
1 ID1 2002-02-01   1
2 ID2 2001-02-01   4
3 ID1 2000-01-01   5

To answer your question, you can vectorize a calculation in R with Vectorize.
However, I'm not sure what "slow" means here. And there are probably better ways to accomplish your task, but I would rather read code than a word problem.
## create two dataframes
a <- c("ID1", "ID2", "ID3", "ID2", "ID1")
b <- c(1, 2, 3, 4, 5)
c <- as.Date(c("2005-01-01", "2002-01-01", "2003-01-01", "2004-01-01", "2001-01-01"))
df.1 <- cbind.data.frame(a, b, c, stringsAsFactors = FALSE)
d <- c("ID1", "ID2", "ID1")
e <- as.Date(c("2002-02-01", "2001-02-01", "2000-01-01"))
df.2 <- cbind.data.frame(d, e, stringsAsFactors = FALSE)
f <- function(i) {
  ## your code here
  max(df.1$b[df.1$a == df.2$d[i] & df.1$c > df.2$e[i]])
}
vf <- Vectorize(f)
vf(1:3)
# [1] 1 4 5
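A side note: Vectorize is just a wrapper around mapply, so it loops in much the same way the for loop does and will not change the ~30-minute runtime. For a real speed-up on the full-size tables, a non-equi join is the usual route; below is a minimal sketch using the data.table package (my suggestion, not something in the question), run on the same df.1 and df.2:
library(data.table)
dt1 <- as.data.table(df.1)
dt2 <- as.data.table(df.2)
# for each row of dt2 (by = .EACHI), find the dt1 rows with the same ID
# and a strictly later date, then take the maximum b among them
dt1[dt2, on = .(a = d, c > e), .(out = max(b)), by = .EACHI]
# the out column is 1 4 5, matching the loop
On the real tables, this replaces the per-surgery scan of all ~200,000 haemoglobin records with an indexed lookup, which should cut the runtime dramatically.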

Related

How to run a loop in R to find a unique combination of numbers within a range of 7?

I have a dataset which looks something like this:
Key Days
A 1
A 2
A 3
A 8
A 9
A 36
A 37
B 14
B 15
B 44
B 45
I would like to split the individual keys based on the days, in groups of 7. For example:
Key Days
A 1
A 2
A 3
Key Days
A 8
A 9
Key Days
A 36
A 37
Key Days
B 14
B 15
Key Days
B 44
B 45
I could use ifelse and specify buckets of 1-7, 8-14, etc. up to 63-70 (the max possible value of days). However, the issue lies with the days column: there are lots of cases where days overlap a bucket boundary. Take days 14 and 15 as an example; they would be split across two brackets (8-14 and 15-21) by the ifelse logic, even though they belong together.
The ideal way to split this would be to take a day, add 7 to it, and check how many rows actually fall within that window. I think we need a loop for this. I could do it in Excel, but I have 20,000 rows of data for 2,000 keys, hence I'm using R. I need a loop which checks each key and, within each key, buckets the day values in groups of 7 anchored at the first day of each range.
We create a grouping variable by applying %/% to the 'Days' column and then split the dataset into a list based on that 'grp'.
grp <- df$Days %/% 7
split(df, factor(grp, levels = unique(grp)))
#$`0`
# Key Days
#1 A 1
#2 A 2
#3 A 3
#$`1`
# Key Days
#4 A 8
#5 A 9
#$`5`
# Key Days
#6 A 36
#7 A 37
#$`2`
# Key Days
#8 B 14
#9 B 15
#$`6`
# Key Days
#10 B 44
#11 B 45
Update
If we also need to split by 'Key':
lst <- split(df, list(factor(grp, levels = unique(grp)), df$Key), drop = TRUE)
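Note that %/% makes fixed buckets (days 0-6, 7-13, ...) rather than the anchored windows the question describes (the first day of each run plus the next 6 days). If the anchored behaviour matters, here is a base-R sketch of that logic; the helper name anchor_group is mine, and it assumes the data frame df has columns Key and Days:
# start a new group whenever a day falls 7 or more days after the
# first day of the current group, separately within each Key
anchor_group <- function(days) {
  grp <- integer(length(days))
  g <- 1L
  start <- days[1]
  for (i in seq_along(days)) {
    if (days[i] - start >= 7) {
      g <- g + 1L
      start <- days[i]
    }
    grp[i] <- g
  }
  grp
}
df <- df[order(df$Key, df$Days), ]
df$grp <- ave(df$Days, df$Key, FUN = anchor_group)
split(df, list(df$Key, df$grp), drop = TRUE)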

R subset column of data frame based on list of dates

PeopleId Date NumItemsQuotes
(dbl) (time) (dbl)
1 2 2015-10-26 147
2 2 2015-10-27 4
3 2 2015-10-28 268
4 2 2015-10-30 55
5 2 2015-11-02 1
6 2 2015-11-03 6
Given the above data frame, I want to subset it to just the observations which occur on a weekend. So far, I've created a vector of the weekend dates because there are only a few (5 weekends), but I'm stuck with this approach.
weekends <- as.Date(c("2015-10-24", "2015-10-25", "2015-10-30", "2015-11-01",
                      "2015-11-07", "2015-11-08", "2015-11-14", "2015-11-15",
                      "2015-11-21", "2015-11-22"))
weekenders <- itemquote[, "PeopleId"][itemquote$Date == weekends]
In the interest of genuinely learning this concept: I could also create an index column with timeDate::isWeekend() and then drop the FALSE observations, correct?
Any other suggestions for approaching this?
If you want to stick to your method using the object weekends, you could use %in%. So your code would be weekenders <- itemquote[, "PeopleId"][itemquote$Date %in% weekends], or in a shorter way: weekenders <- subset(itemquote, Date %in% weekends)$PeopleId.
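Another option that avoids hard-coding the weekends vector entirely (and answers the timeDate::isWeekend() idea without an extra package) is to derive the flag from the dates themselves; a minimal sketch, assuming an English locale for day names:
# flag observations whose date falls on a Saturday or Sunday
is_weekend <- weekdays(itemquote$Date) %in% c("Saturday", "Sunday")
weekenders <- itemquote$PeopleId[is_weekend]
This keeps working as the date range grows, with no list of weekend dates to maintain.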

Create data frame for each unique row in another data frame

For an assignment for my graduate program, I have been asked to extract data from datasets of English Premier League results (located here). I am very close to being done but need help on the last two outputs.
We must create a function that can receive two arguments, a date and a season. The function must return a data frame with the table of the respective season on that date. It must include wins, losses, home record, away record, etc. The only ones I have not managed to figure out are W/L streak and the results of the last 10 matches.
Here is an example of what the initial dataset looks like:
e.Date e.HomeTeam e.AwayTeam e.FTHG e.FTAG e.FTR
1 2015-08-08 Bournemouth Aston Villa 0 1 A
2 2015-08-08 Chelsea Swansea 2 2 D
3 2015-08-08 Everton Watford 2 2 D
4 2015-08-08 Leicester Sunderland 4 2 H
5 2015-08-08 Man United Tottenham 1 0 H
My plan was to get Home and Away data sorted out for each club then merge them together before doing the analysis to find streak and last 10 results.
I manipulated the data to look like this:
HomeTeam FTR Date freq
1 Arsenal L 2015-08-09 1
2 Arsenal D 2015-08-24 1
3 Arsenal W 2015-09-12 1
4 Aston Villa L 2015-08-14 1
5 Aston Villa L 2015-09-19 1
6 Aston Villa D 2015-08-29 1
And now I'm kind of lost. My idea was to run some kind of loop (for? ddply? data.table?) to create a data frame for each club with its results in it, then loop again to do whatever calculations get the desired variables (streak and last 10), and somehow push those back into the main data frame where I am housing all of the other outputs.
I don't want to be told the answer outright since it's important I learn this on my own. However, if someone could point me in the right direction that would be great. Thanks so much.
I created some dummy data just to demonstrate a few commands and maybe give you some ideas.
set.seed(321)
dat <- data.frame(team = sample(letters[1:3], 20, replace = TRUE),
                  season = rep("season1", 20),
                  time = rnorm(20),
                  win_loss = sample(c("win", "loss"), 20, replace = TRUE))
Problem 1. Find win/loss streak
Take a look at the rle function in the example below.
# 1. find wl streak of team 'a'
tmp <- dat[dat$team == "a", ]
tmp <- tmp[order(tmp$time), ]
> tmp
team season time win_loss
19 a season1 -1.12032742 loss
14 a season1 -1.07223880 loss
16 a season1 0.09500072 loss
3 a season1 0.18832552 loss
8 a season1 0.42033257 loss
4 a season1 2.44325982 win
# shows runs of 5 consecutive losses, then 1 consecutive win
rle(tmp$win_loss == "win")
Run Length Encoding
lengths: int [1:2] 5 1
values : logi [1:2] FALSE TRUE
Here's a very helpful post on rle: How can I count runs in a sequence?
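To turn the rle output into the single current-streak value the assignment asks for, take the last run in time order; a small sketch continuing from tmp above (variable names are mine):
# the final run of the win/loss sequence is the current streak
r <- rle(as.character(tmp$win_loss))
streak_length <- tail(r$lengths, 1)
streak_result <- tail(r$values, 1)
paste0(streak_result, " streak of ", streak_length)
# e.g. "win streak of 1" for team 'a' above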
Problem 2. Last 3 results
I reversed the order of time and then picked the top 3 results.
# 2. find last 3 matches for team 'b'
tmp <- dat[dat$team == "b", ]
tmp <- tmp[rev(order(tmp$time)), ]
> tmp[1:3, ]
team season time win_loss
11 b season1 0.9172555 loss
9 b season1 0.5775845 win
7 b season1 0.4560691 loss
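The same idea extends to every team at once without writing an explicit loop; here is a small sketch using split() on the dummy dat above (the name last_n is mine; on the real data you would take the last 10 rather than 3):
# for each team, sort by time and keep the most recent 3 results
last_n <- lapply(split(dat, dat$team), function(d) {
  d <- d[order(d$time), ]
  tail(d, 3)
})
last_n[["b"]]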

Merging overlapping dataframes in R

Okay, so I have two different data frames (df1 and df2) which, to simplify, have an ID, a date, and the score on a test. In each data frame a person (ID) may have taken the test on multiple dates. Between the two data frames, some people are listed in df1 but not in df2, and vice versa, but some are listed in both, and their records can overlap in different ways.
I want to combine all the data into one frame, but the tricky part is: if an ID's scores from df1 and df2 fall within 7 days of each other (I can do this with a subtracted-dates column), I want to combine those rows.
In essence, for every ID there will be one row with both scores if the tests were taken within 7 days of each other; if not, there will be two separate rows, one with the score from df1 and one from df2, along with all the other scores that might not be listed in both.
EX:
df1
ID Date1(yyyymmdd) Score1
1 20140512 50
1 20140501 30
1 20140703 50
1 20140805 20
3 20140522 70
3 20140530 10
df2
ID Date2(yyyymmdd) Score2
1 20140530 40
1 20140622 20
1 20140702 10
1 20140820 60
2 20140522 30
2 20140530 80
Wanted_df
ID Date1(yyyymmdd) Score1 Date2(yyyymmdd) Score2
1 20140512 50
1 20140501 30
1 20140703 50 20140702 10
1 20140805 20
1 20140530 40
1 20140622 20
1 20140820 60
3 20140522 70
3 20140530 10
2 20140522 30
2 20140530 80
Alright. I feel bad about the bogus outer join answer below (which may be possible in a library I don't know about, but there are advantages to using an RDBMS sometimes...), so here is a hacky workaround. It assumes that all the joins will be at most one-to-one, which you've said is OK.
# ensure the date columns are Date type
df1$Date1 <- as.Date(as.character(df1$Date1), format = "%Y%m%d")
df2$Date2 <- as.Date(as.character(df2$Date2), format = "%Y%m%d")
# ensure the dfs are sorted
df1 <- df1[order(df1$ID, df1$Date1), ]
df2 <- df2[order(df2$ID, df2$Date2), ]
# initialize the output df3, which starts as everything from df1 and NA from df2
df3 <- cbind(df1, Date2 = NA, Score2 = NA)
library(plyr)  # for rbind.fill
for (j in 1:nrow(df2)) {
  # see if there are any rows of test1 you could join test2 to
  join_rows <- which(df3[, "ID"] == df2[j, "ID"] &
                     abs(df3[, "Date1"] - df2[j, "Date2"]) < 7)
  if (length(join_rows) > 0) {
    # if so, join it to the first one (see discussion)
    df3[min(join_rows), "Date2"] <- df2[j, "Date2"]
    df3[min(join_rows), "Score2"] <- df2[j, "Score2"]
  } else {
    # if not, add a new row of just the test2 values
    df3 <- rbind.fill(df3, df2[j, ])
  }
}
df3 <- df3[order(df3$ID, df3$Date1, df3$Date2), ]
row.names(df3) <- NULL  # i hate these
df3
# ID Date1 Score1 Date2 Score2
# 1 1 2014-05-01 30 <NA> NA
# 2 1 2014-05-12 50 <NA> NA
# 3 1 2014-07-03 50 2014-07-02 10
# 4 1 2014-08-05 20 <NA> NA
# 5 1 <NA> NA 2014-05-30 40
# 6 1 <NA> NA 2014-06-22 20
# 7 1 <NA> NA 2014-08-20 60
# 8 2 <NA> NA 2014-05-22 30
# 9 2 <NA> NA 2014-05-30 80
# 10 3 2014-05-22 70 <NA> NA
# 11 3 2014-05-30 10 <NA> NA
I couldn't get the rows in the same sort order as yours, but they look the same.
Short explanation: For each row in df2, see if there's a row in df1 you can "join" it to. If not, stick it at the bottom of the table. In the initialization and rbinding, you'll see some hacky ways of assigning blank rows or columns as placeholders.
Why this is a bad hacky workaround: for large data sets, the rbinding of df3 to itself will consume more and more memory. The loop is definitely not optimal and its search does not exploit the fact that the tables are sorted. If by some chance the test were taken twice within a week, you would see some unexpected behavior (duplicates from df2, etc).
Use an outer join with an absolute-value limit on the date difference. (An outer join of A and B keeps all rows of A and B.) For example:
library(sqldf)
sqldf("select a.*, b.* from df1 a outer join df2 b on a.ID = b.ID and abs(a.Date1 - b.Date2) <=7")
Note that your date variables will have to be true dates. If they are currently characters or integers, you need to do something like df1$Date1 <- as.Date(as.character(df1$Date1), format="%Y%m%d") etc.
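One caveat on the sqldf route: SQLite, the default engine behind sqldf, historically had no FULL OUTER JOIN, which is presumably why the other answer calls this approach bogus. If you want the matched half of the join without the row-by-row loop, a non-equi join from the data.table package is one substitute (my own suggestion, with helper columns lo/hi that are not in the original data); the unmatched rows from each side would still need to be appended afterwards:
library(data.table)
dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)
# a +/- 7 day window around each df1 date (assumes Date1/Date2 are Date class)
dt1[, `:=`(lo = Date1 - 7, hi = Date1 + 7)]
# df2 rows whose date falls inside a same-ID window, paired with that df1 row
matched <- dt2[dt1,
               .(ID, Date1 = i.Date1, Score1 = i.Score1,
                 Date2 = x.Date2, Score2 = x.Score2),
               on = .(ID, Date2 >= lo, Date2 <= hi),
               nomatch = NULL]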

Assign rows to a group based on spatial neighborhood and temporal criteria in R

I have an issue that I just cannot seem to sort out. I have a dataset that was derived from a raster in arcgis. The dataset represents every fire occurrence during a 10-year period. Some raster cells had multiple fires within that time period (and, thus, will have multiple rows in my dataset) and some raster cells will not have had any fire (and, thus, will not be represented in my dataset). So, each row in the dataset has a column number (sequential integer) and a row number assigned to it that corresponds with the row and column ID from the raster. It also has the date of the fire.
I would like to assign a unique ID (fire_ID) to all of the fires that are within 4 days of each other and in adjacent pixels from one another (within the 8-cell neighborhood) and put this into a new column.
To clarify, if there were an observation from row 3, col 3, Jan 1, 2000 and another from row 2, col 4, Jan 4, 2000, those observations would be assigned the same fire_ID.
Below is a sample dataset with "rows", which are the row IDs of the raster, "cols", which are the column IDs of the raster, and "dates" which are the dates the fire was detected.
rows <- sample(seq(1, 50, 1), 600, replace = TRUE)
cols <- sample(seq(1, 50, 1), 600, replace = TRUE)
dates <- sample(seq(from = as.Date("2000/01/01"), to = as.Date("2000/02/01"), by = "day"), 600, replace = TRUE)
fire_df <- data.frame(rows, cols, dates)
I've tried sorting the data by "rows", then "cols", then "dates" and looping through, creating a new fire_ID whenever the row and column IDs were more than one apart or the date was more than 4 days away. This obviously doesn't work: fires that should share a fire_ID get different ones whenever observations belonging to a different fire_ID sit between them in the sorted list.
fire_df2 <- fire_df[order(fire_df$rows, fire_df$cols, fire_df$dates), ]
fire_ID <- numeric(length = nrow(fire_df2))
fire_ID[1] <- 1
for (i in 2:nrow(fire_df2)) {
  # keep the previous ID if this observation is within 1 row, 1 column
  # and 4 days of the previous one; otherwise start a new ID
  fire_ID[i] <- ifelse(
    abs(fire_df2$rows[i] - fire_df2$rows[i - 1]) <= 1 &
      abs(fire_df2$cols[i] - fire_df2$cols[i - 1]) <= 1 &
      abs(fire_df2$dates[i] - fire_df2$dates[i - 1]) <= 4,
    fire_ID[i - 1],
    i)
}
length(unique(fire_ID))
fire_df2$fire_ID <- fire_ID
Please let me know if you have any suggestions.
I think this task requires something along the lines of hierarchical clustering.
Note, however, that there will be necessarily some degree of arbitrariness in the ids. This is because it is entirely possible that the cluster of fires itself is longer than 4 days yet every fire is less than 4 days away from some other fire in that cluster (and thus should have the same id).
library(dplyr)
# Create the distances
fire_dist <- fire_df %>%
  # Normalize dates so that 4 days correspond to 1 distance unit
  mutate(norm_dates = as.numeric(dates) / 4) %>%
  # Only keep the three variables of interest
  select(rows, cols, norm_dates) %>%
  # Compute distance using the L-infinity norm (maximum)
  dist(method = "maximum")
# Do hierarchical clustering with the "single" agglomeration method
fire_clust <- hclust(fire_dist, method = "single")
# Cut the tree at height 1 and obtain groups
group_id <- cutree(fire_clust, h = 1)
# First attach the group ids back to the data frame
fire_df2 <- cbind(fire_df, group_id) %>%
  # Then sort the data
  arrange(group_id, dates, rows, cols)
# Print the first 10 records
fire_df2[1:10, ]
(Make sure you have the dplyr library installed; you can run install.packages("dplyr", dep = TRUE) if not. It is a really good and very popular library for data manipulation.)
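To see why cutting at h = 1 with the maximum norm encodes the rule from the question: dividing the dates by 4 makes a 4-day gap worth exactly one unit, so two fires are linked exactly when they differ by at most 1 row, 1 column and 4 days. A quick check on the example pair from the question (row 3, col 3, Jan 1 vs row 2, col 4, Jan 4):
p1 <- c(3, 3, as.numeric(as.Date("2000-01-01")) / 4)
p2 <- c(2, 4, as.numeric(as.Date("2000-01-04")) / 4)
max(abs(p1 - p2))  # Chebyshev distance: max(1, 1, 0.75) = 1, so same fire_ID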
A couple of simple tests:
Test #1. The same forest fire moving.
rows <- 1:6
cols <- 1:6
dates <- seq(from = as.Date("2000/01/01"), to = as.Date("2000/01/06"), by = "day")
fire_df <- data.frame(rows, cols, dates)
gives me this:
rows cols dates group_id
1 1 1 2000-01-01 1
2 2 2 2000-01-02 1
3 3 3 2000-01-03 1
4 4 4 2000-01-04 1
5 5 5 2000-01-05 1
6 6 6 2000-01-06 1
Test #2. 6 different random forest fires.
set.seed(1234)
rows <- sample(seq(1, 50, 1), 6, replace = TRUE)
cols <- sample(seq(1, 50, 1), 6, replace = TRUE)
dates <- sample(seq(from = as.Date("2000/01/01"), to = as.Date("2000/02/01"), by = "day"), 6, replace = TRUE)
fire_df <- data.frame(rows, cols, dates)
output:
rows cols dates group_id
1 6 1 2000-01-10 1
2 32 12 2000-01-30 2
3 31 34 2000-01-10 3
4 32 26 2000-01-27 4
5 44 35 2000-01-10 5
6 33 28 2000-01-09 6
Test #3: one expanding forest fire
dates <- seq(from = as.Date("2000/01/01"), to = as.Date("2000/01/06"), by = "day")
rows_start <- 50
cols_start <- 50
fire_df <- data.frame(dates = dates) %>%
  rowwise() %>%
  do({
    # each day the fire front grows by one cell in every direction
    diff <- as.numeric(.$dates - as.Date("2000/01/01"))
    expand.grid(rows = seq(rows_start - diff, rows_start + diff),
                cols = seq(cols_start - diff, cols_start + diff),
                dates = .$dates)
  })
gives me (after rerunning the clustering steps above on this fire_df):
rows cols dates group_id
1 50 50 2000-01-01 1
2 49 49 2000-01-02 1
3 49 50 2000-01-02 1
4 49 51 2000-01-02 1
5 50 49 2000-01-02 1
6 50 50 2000-01-02 1
7 50 51 2000-01-02 1
8 51 49 2000-01-02 1
9 51 50 2000-01-02 1
10 51 51 2000-01-02 1
and so on. (All records are correctly identified as belonging to the same forest fire.)
