Count the number of previous occurrences using a time window, not a fixed window size

I have a dataset like the following, the last column is desired output.
  DX_CD   AID   date2       count.occurences.1000.days
1 272.4   1649  2007-02-10  0 or N/A
2 V58.67  1649  2007-02-10  0 (excluding the same day), or 1
3 787.91  1649  2010-04-14  0
4 788.63  1649  2011-03-10  1
5 493.90  4193  2007-09-13  0 or N/A   # new AID
6 787.20  6954  2010-02-25  0 or N/A   # new AID
.....
I want to compute the column (count.occurences.1000.days), which counts the number of previous occurrences within X days (e.g. X = 1000) by AID.
The first value is 0 or N/A because there is no record before record #1 for AID = 1649. The second value is 0 because this event occurs on the same day as record #1 (or 1, if same-day events are counted). The third value is 0 because the records older than 2010-04-14 are beyond 1000 days. The fourth value is 1 because record #3 happened within 1000 days. The same logic applies to AID = 4193 and AID = 6954.
Can someone provide an idea, preferably vectorized?

If I understood the question correctly, this should do it.
First, a sample of the data:
df <- data.frame(
  date2 = seq(as.Date("2008-12-30"), as.Date("2015-01-03"), by = "days"),
  AID = sample(c(1649, 4193, 6954, 3466), 2196, replace = TRUE),
  count = rep.int(1, 2196)
)
Now we bin the dates into 1000-day intervals, working back from the most recent date:
df$date.bin <- Hmisc::cut2(df$date2,
  cuts = sort(seq(max(df$date2), length = 10, by = "-1000 days")))
Now we take a cumulative sum within each bin and AID:
library(dplyr)
res <- df %>%
  arrange(date.bin, AID) %>%
  group_by(date.bin, AID) %>%
  mutate(cumsum = cumsum(count))
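Note that this counts within fixed 1000-day bins rather than computing a true rolling window per row. If the exact per-row count is needed (previous rows for the same AID within 1000 days, excluding the same day), here is a minimal non-vectorized sketch, assuming date2 is (or can be coerced to) a Date:
df$date2 <- as.Date(df$date2)
df$count_1000 <- sapply(seq_len(nrow(df)), function(i) {
  # previous occurrences for the same AID, strictly earlier, within 1000 days
  sum(df$AID == df$AID[i] &
      df$date2 <  df$date2[i] &
      df$date2 >= df$date2[i] - 1000)
})
This is O(n^2), so it is only practical for moderate data sizes.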


group_by to select first two rows, then spread()

I'm trying to reformat this so I can generate a data frame of all instances of On Hold Begins and the event immediately after each one. On Hold Begins marks the start of an event, and I'd like to capture its Timestamp and Deviation as well as the Timestamp and Deviation of the next event immediately after it (e.g. Below Threshold, Stage Enabled).
If possible, I only want to include slices that have On Hold Begins as the first event (so the ideal solution would not include rows 1 & 2 above), I do not want the additional X columns, and I would want it formatted as I described.
This is similar to: How can I spread repeated measures of multiple variables into wide format?, but I ran into errors asking for a dictionary when I tried it.
Thank you all so much for the help.
Simple solution using base R:
first_idx <- which(df$Flag == "On Hold Begins")  # rows where a hold begins
second_idx <- first_idx + 1                      # the rows immediately after
df_1 <- df[first_idx, ];  colnames(df_1) <- paste("Flag 1", colnames(df_1))
df_2 <- df[second_idx, ]; colnames(df_2) <- paste("Flag 2", colnames(df_2))
cbind(df_1, df_2)
Flag 1 Stage Flag 1 Flag Flag 1 Timestamp Flag 1 x Flag 1 Deviation Flag 2 Stage Flag 2 Flag Flag 2 Timestamp Flag 2 x Flag 2 Deviation
3 a On Hold Begins 4/29/17 15:34 1 1.200 a Below Threshold 4/29/17 15:35 1 0.0000
6 a On Hold Begins 4/29/17 21:49 5 1.200 a Below Threshold 4/29/17 21:50 5 0.0000
10 a On Hold Begins 4/29/17 23:29 6 1.200 a Below Threshold 4/29/17 23:30 6 0.0000
12 a On Hold Begins 5/16/17 17:22 8 1.774 a Stage Enabled 5/16/17 17:39 8 1.8973
15 a On Hold Begins 5/16/17 19:14 9 1.095 a Below Threshold 5/16/17 19:15 9 -0.2252
21 b On Hold Begins 4/28/17 22:05 125 1.200 b On Hold Ends 4/28/17 22:07 125 1.2000
24 b On Hold Begins 4/28/17 23:29 128 1.200 b Below Threshold 4/28/17 23:30 128 0.0000
26 b On Hold Begins 4/29/17 1:53 133 1.200 b Below Threshold 4/29/17 1:55 133 0.0000
29 b On Hold Begins 4/29/17 2:40 135 1.200 <NA> <NA> <NA> NA NA
My solution 1) assigns a common serial number to related records; and 2) groups by that serial, slices the first two records in each set, and tags them "Flag 1" or "Flag 2."
library(dplyr)
library(tidyr)    # for fill()
library(stringr)  # for str_detect()

df_tidy <- df %>%
  slice(-1) %>%
  mutate(my_serial = case_when(
    str_detect(Flag, "On Hold Begins") ~ row_number())) %>%
  fill(my_serial) %>% #< Assign serials to related records
  group_by(my_serial) %>%
  slice(1:2) %>% #< Take the first two records in each set
  mutate(flag_number = if_else(
    str_detect(Flag, "On Hold Begins"), "Flag 1", "Flag 2")) #< Tag records

df_1 <- df_tidy %>%
  filter(flag_number %in% "Flag 1") %>%
  select(1:3) %>%
  setNames(paste0("Flag 1_", names(.)))
df_2 <- df_tidy %>%
  filter(flag_number %in% "Flag 2") %>%
  select(1:3) %>%
  setNames(paste0("Flag 2_", names(.)))
bind_cols(df_1, df_2)
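One caveat: bind_cols() requires df_1 and df_2 to have the same number of rows, so if the last "On Hold Begins" has no following event (like row 29 in the base R output above), its group will be missing from df_2. Adding filter(n() == 2) after the group_by()/slice() step is one way to keep only complete pairs.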

Frequency count with multiple conditions in R

I have a data frame:
Date Team Opponent Weather Outcome
2017-05-01 All Stars B Stars Rainy 1
2017-05-02 All Stars V Stars Rainy 1
2017-05-03 All Stars M Trade Sunny 0
.
.
2017-05-11 All Stars Vdronee Sunny 0
Outcome 1 indicates a win. I have used the table function to get the frequency with a condition applied:
table(df$Outcome, df$Team == "All Stars")
This returns:
FALSE TRUE
0 1005 30
1 1323 57
So the win frequency is 57/87 = 0.655.
Two Questions:
Rather than calculating the win frequency manually, how do I embed this directly in a formula?
and
How do I filter based on the x most recent observations? I.e. something like
table(df$Outcome, df$Team == "All Stars" & df$Date == <the 5 most recent observations>)
thanks
An option is to use data.table:
library(data.table)
dt <- as.data.table(df)
dt[, .(prop = sum(Outcome) / .N), by = Team]
To get the 5 most recent observations per team, order by date first (this assumes Date is a Date-class column):
dt[order(-Date), head(.SD, 5), by = Team][, .(prop = sum(Outcome) / .N), by = Team]
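Alternatively, because Outcome is 0/1, the win frequency is just a mean in base R:
# Win proportion for one team (the mean of a 0/1 vector is the proportion of 1s)
mean(df$Outcome[df$Team == "All Stars"])
#> [1] 0.6551724
# Or for every team at once
tapply(df$Outcome, df$Team, mean)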

How to Have a COUNTIF Function dependent on the dates of the same row in R

My main problem is figuring out a way to count the number of days on which a particular item was sold. For example, in the following data frame, item A was sold on only one day during the sample, while item B was sold 3 times but on only 2 different days. My goal is a function that outputs the number of days on which each item was sold, here (A, B) = (1, 2).
row item_name date
1 A 2016-03-04 3:49
2 B 2016-05-31 16:15
3 B 2016-05-31 16:35
4 B 2016-06-08 16:05
Try this:
library(dplyr)
df1 %>%
  group_by(item_name) %>%
  summarise(n_days = n_distinct(as.Date(date)))
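If you prefer base R, a tapply() equivalent (assuming the date column parses with as.Date(), which ignores the trailing time of day):
tapply(as.Date(df1$date), df1$item_name, function(d) length(unique(d)))
#> A B
#> 1 2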

How to check if any row has negative values while leaving out selected rows?

Below is the dataframe I get by running a query. Please note that df1 is dynamic and might come back empty, or as a partial df without all of the quarters shown below:
df1
FISC_QTR_VAL Revenue
1 2014-Q1 0.00
2 2014-Q2 299111.86
3 2014-Q3 174071.98
4 2014-Q4 257655.30
5 2015-Q1 0.00
6 2015-Q2 317118.63
7 2015-Q3 145461.88
8 2015-Q4 162972.41
9 2016-Q1 96896.04
10 2016-Q2 135058.78
11 2016-Q3 111773.77
12 2016-Q4 138479.28
13 2017-Q1 169276.04
I want to check the values of all the rows in the Revenue column and see if any value is 0 or negative, excluding the 2014-Q1 row.
Also, df1 is dynamic and will contain only 12 quarters of data, i.e. when I reach the next quarter, 2017-Q2, the Revenue associated with 2014-Q2 becomes 0 and it will look like this:
df1
FISC_QTR_VAL Revenue
1 2014-Q1 0.00
2 2014-Q2 0.00
3 2014-Q3 174071.98
4 2014-Q4 257655.30
5 2015-Q1 0.00
6 2015-Q2 317118.63
7 2015-Q3 145461.88
8 2015-Q4 162972.41
9 2016-Q1 96896.04
10 2016-Q2 135058.78
11 2016-Q3 111773.77
12 2016-Q4 138479.28
13 2017-Q1 169276.04
14 2017-Q2 146253.64
In the above case, I would need to check all rows of the Revenue column, excluding 2014-Q1 and 2014-Q2.
And this goes on as the quarters progress.
I need help generating code that would dynamically exclude the stale row(s) and check only the rows that matter for the current quarter.
Currently, I am using the code below:
#Taking the first df1 into consideration which has 2017-Q1 as the last quarter
startQtr <- "2014-Q2" #This value is dynamically achieved and will change as we move ahead. Next quarter, the value changes to 2014-Q3 and so on
if (length(df1[["FISC_QTR_VAL"]][nrow(df1) - 11] == startQtr) == 1) {
  if (nrow(df1[df1$Revenue < 0, ]) == 0 & nrow(df1[df1$Revenue == 0, ]) == 0) {
    df1 <- df1 %>% slice((nrow(df1) - 11):nrow(df1))
  }
}
The first if condition checks whether there is data in df1:
If the df is empty, df1[["FISC_QTR_VAL"]][nrow(df1) - 11] == startQtr returns a zero-length vector, whose length is 0, and hence the condition fails.
If not, it proceeds to the inner if condition and checks for negative and zero values in the Revenue column. But it does so for all the rows; I want 2014-Q1 excluded in this case, and going forward to future quarters, I want the condition to stay dynamic as explained above.
Also, I do not want to slice the dataset before the if condition, as the code would throw an error if the initial data frame df1 has only 1 or 2 rows and we try to slice those further.
Thanks
Here's a solution using a few functions from the dplyr and tidyr packages.
Here's a toy data set to work with:
d <- data.frame(
FISC_QTR_VAL = c("2015-Q1", "2014-Q2", "2014-Q1", "2015-Q2"),
Revenue = c(100, 200, 0, 0)
)
d
#> FISC_QTR_VAL Revenue
#> 1 2015-Q1 100
#> 2 2014-Q2 200
#> 3 2014-Q1 0
#> 4 2015-Q2 0
Notice that FISC_QTR_VAL is intentionally out of order (as a precaution).
Next, set variables for the current year and quarter (you'll see why they are separate in a moment):
current_year <- 2014
current_quarter <- 2
Then run the following:
library(dplyr)
library(tidyr)

d %>%
  separate(FISC_QTR_VAL, c("year", "quarter"), sep = "-Q") %>%
  arrange(year, quarter) %>%
  slice(which(year == current_year & quarter == current_quarter):n()) %>%
  filter(Revenue <= 0)
#> year quarter Revenue
#> 1 2015 2 0
First, we separate() FISC_QTR_VAL into separate year and quarter variables, giving (a) a tidier data set and (b) a way to order the data in case it's out of order (as in the toy data used here). We then arrange() the data so it's ordered by year and quarter, slice() away any quarters prior to the current one, and filter() to return all rows where Revenue <= 0.
To alternatively get, for example, a count of the rows that are returned, you can pipe on something like nrow(), as in the sketch below.
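A minimal sketch, reusing the toy d and the current_year/current_quarter variables from above:
# Count of quarters at or below zero from the current quarter onward
d %>%
  separate(FISC_QTR_VAL, c("year", "quarter"), sep = "-Q") %>%
  arrange(year, quarter) %>%
  slice(which(year == current_year & quarter == current_quarter):n()) %>%
  filter(Revenue <= 0) %>%
  nrow()
#> [1] 1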
Is the subset function an option for you?
exclude.qr <- c("2014-Q1", "2014-Q2")
df <- data.frame(
  FISC_QTR_VAL = c("2014-Q1", "2014-Q2", "2014-Q3", "2014-Q4"),
  Revenue = c(0.00, 299111.86, 174071.98, 257655.30))
subset(df, !(FISC_QTR_VAL %in% exclude.qr) & Revenue > 0)
(Note that the excluded quarters need %in% rather than !=, since exclude.qr has more than one element, and both conditions belong in a single subset expression.)
You can easily create exclude.qr dynamically, e.g. via paste0() over a vector of years and quarters, as sketched below.
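A hedged sketch: build all quarter labels, then exclude everything before the dynamic start quarter (startQtr as used in the question; the year range here is illustrative):
all.qr <- paste0(rep(2014:2017, each = 4), "-Q", 1:4)
startQtr <- "2014-Q3"
exclude.qr <- all.qr[seq_len(match(startQtr, all.qr) - 1)]
exclude.qr
#> [1] "2014-Q1" "2014-Q2"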
I hope this is helpful!

Assign rows to a group based on spatial neighborhood and temporal criteria in R

I have an issue that I just cannot seem to sort out. I have a dataset that was derived from a raster in ArcGIS. The dataset represents every fire occurrence during a 10-year period. Some raster cells had multiple fires within that period (and thus have multiple rows in my dataset) and some raster cells had no fire (and thus are not represented). Each row in the dataset has a column number (sequential integer) and a row number assigned to it that correspond to the row and column IDs from the raster, along with the date of the fire.
I would like to assign a unique ID (fire_ID) to all of the fires that are within 4 days of each other and in adjacent pixels from one another (within the 8-cell neighborhood) and put this into a new column.
To clarify, if there were an observation from row 3, col 3, Jan 1, 2000 and another from row 2, col 4, Jan 4, 2000, those observations would be assigned the same fire_ID.
Below is a sample dataset with "rows", which are the row IDs of the raster, "cols", which are the column IDs of the raster, and "dates" which are the dates the fire was detected.
rows <- sample(seq(1, 50, 1), 600, replace = TRUE)
cols <- sample(seq(1, 50, 1), 600, replace = TRUE)
dates <- sample(seq(from = as.Date("2000/01/01"), to = as.Date("2000/02/01"), by = "day"),
                600, replace = TRUE)
fire_df <- data.frame(rows, cols, dates)
I've tried sorting the data by "rows", then "cols", then "dates" and looping through, creating a new fire_ID when the row and column IDs were within one unit and the date was within 4 days. But this obviously doesn't work: fires that should share a fire_ID get different fire_IDs whenever observations belonging to a different fire_ID fall between them in the sorted list.
fire_df2 <- fire_df[order(fire_df$rows, fire_df$cols, fire_df$dates), ]
fire_ID <- numeric(length = nrow(fire_df2))
fire_ID[1] <- 1
for (i in 2:nrow(fire_df2)) {
  fire_ID[i] <- ifelse(
    abs(fire_df2$rows[i] - fire_df2$rows[i - 1]) <= 1 &
      abs(fire_df2$cols[i] - fire_df2$cols[i - 1]) <= 1 &
      abs(fire_df2$dates[i] - fire_df2$dates[i - 1]) <= 4,
    fire_ID[i - 1],
    i)
}
length(unique(fire_ID))
fire_df2$fire_ID <- fire_ID
Please let me know if you have any suggestions.
I think this task calls for something along the lines of hierarchical clustering.
Note, however, that the IDs will necessarily involve some degree of arbitrariness: it is entirely possible for a cluster of fires to span more than 4 days while every fire in it is less than 4 days away from some other fire in the cluster (and thus all should share an ID). For example, adjacent-cell fires on Jan 1, Jan 4, and Jan 7 chain into one group even though the first and last are six days apart.
library(dplyr)

# Create the distances
fire_dist <- fire_df %>%
  # Normalize dates so that a 4-day gap corresponds to a distance of 1
  mutate(norm_dates = as.numeric(dates) / 4) %>%
  # Only keep the three variables of interest
  select(rows, cols, norm_dates) %>%
  # Compute distance using the L-infinity norm (maximum coordinate difference)
  dist(method = "maximum")

# Do hierarchical clustering with the "single" agglomeration method
fire_clust <- hclust(fire_dist, method = "single")

# Cut the tree at height 1 and obtain the groups
group_id <- cutree(fire_clust, h = 1)

# First attach the group ids back to the data frame
fire_df2 <- cbind(fire_df, group_id) %>%
  # Then sort the data
  arrange(group_id, dates, rows, cols)

# Print the first 10 records
fire_df2[1:10, ]
(Make sure you have the dplyr package installed. You can run install.packages("dplyr", dep = TRUE) if it is not installed. It is a really good and very popular library for data manipulation.)
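One design note: dist() materializes all pairwise distances, so memory use grows quadratically with the number of fire records; for rasters producing many thousands of fires, some coarse pre-partitioning (say, by month) before clustering may be needed.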
A couple of simple tests:
Test #1. The same forest fire moving.
rows <- 1:6
cols <- 1:6
dates <- seq(from = as.Date("2000/01/01"), to = as.Date("2000/01/06"), by = "day")
fire_df <- data.frame(rows, cols, dates)
gives me this:
rows cols dates group_id
1 1 1 2000-01-01 1
2 2 2 2000-01-02 1
3 3 3 2000-01-03 1
4 4 4 2000-01-04 1
5 5 5 2000-01-05 1
6 6 6 2000-01-06 1
Test #2. 6 different random forest fires.
set.seed(1234)
rows <- sample(seq(1, 50, 1), 6, replace = TRUE)
cols <- sample(seq(1, 50, 1), 6, replace = TRUE)
dates <- sample(seq(from = as.Date("2000/01/01"), to = as.Date("2000/02/01"), by = "day"),
                6, replace = TRUE)
fire_df <- data.frame(rows, cols, dates)
output:
rows cols dates group_id
1 6 1 2000-01-10 1
2 32 12 2000-01-30 2
3 31 34 2000-01-10 3
4 32 26 2000-01-27 4
5 44 35 2000-01-10 5
6 33 28 2000-01-09 6
Test #3: one expanding forest fire
dates <- seq(from = as.Date("2000/01/01"), to = as.Date("2000/01/06"), by = "day")
rows_start <- 50
cols_start <- 50
fire_df <- data.frame(dates = dates) %>%
  rowwise() %>%
  do({
    diff <- as.numeric(.$dates - as.Date("2000/01/01"))
    expand.grid(rows = seq(rows_start - diff, rows_start + diff),
                cols = seq(cols_start - diff, cols_start + diff),
                dates = .$dates)
  })
gives me:
rows cols dates group_id
1 50 50 2000-01-01 1
2 49 49 2000-01-02 1
3 49 50 2000-01-02 1
4 49 51 2000-01-02 1
5 50 49 2000-01-02 1
6 50 50 2000-01-02 1
7 50 51 2000-01-02 1
8 51 49 2000-01-02 1
9 51 50 2000-01-02 1
10 51 51 2000-01-02 1
and so on. (All records identified correctly to belong to the same forest fire.)
