Data.table: sum between irregular date ranges

Surveys and fires occurred at irregular intervals in different burn units
(srv = 1 means a survey was done, fire = 1 means a fire occurred).
I want to calculate how many fires were lit between surveys, i.e.,
including the year of the current survey and going back to the year just after the previous survey.
library(data.table)
nyear <- 10
units <- 4
set.seed(15)
DT <- data.table(
  unit = rep(1:units, each = nyear),
  year = 2000:(2000 + nyear - 1),
  srv = rbinom(nyear * units, 1, 0.4),
  fire = rbinom(nyear * units, 1, 0.3)
)
DT
I can calculate the years elapsed, but I have to create a new dataset and then join it back to the original one. Even then, I cannot figure out how to sum fires between the date ranges.
DT1 <- DT[srv != 0] # Drop years without surveys
DT2 <- DT1[, .(year, elapsed = year - shift(year)), by = "unit"] # Use 'shift' to find years elapsed
DT3 <- DT2[DT, on=.(unit, year)] # join dataset with elapsed time to original dataset
DT3[ , sum(fire), on = .(year >= year, year < year -(elapsed-1)), by="unit"] # Doesn't work
Example output follows, where 'nfire' is the column I'm after. In years without surveys it is NA; otherwise it gives the number of fires since the last survey, including the current survey year:
    unit year elapsed srv fire nfire
 1:    1 2000      NA   1    1     1
 2:    1 2001      NA   0    0    NA
 3:    1 2002       2   1    1     1
 4:    1 2003       1   1    0     0
 5:    1 2004      NA   0    0    NA
 6:    1 2005       2   1    0     0
 7:    1 2006       1   1    0     1
 8:    1 2007      NA   0    1    NA
 9:    1 2008       2   1    1     2
10:    1 2009       1   1    0     1
11:    2 2000      NA   0    0    NA
12:    2 2001      NA   1    1    NA

The answer from r2evans works:
DT[, grp := rev(cumsum(rev(srv == 1))), by = .(unit)][, nfire := sum(fire), by=.(unit, grp)]
Rows where surveys occurred (srv == 1) are put in reverse order and then summed cumulatively. The reverse ordering ensures that each survey is grouped with the years that precede it, and the cumulative sum assigns consecutively numbered groups. The outer 'rev' restores the original row order.
The second part of the statement, '[, nfire := sum(fire), by = .(unit, grp)]', is an example of chaining: as I understand it, simply a way of adding more operations to a data.table step without cluttering the first part of the statement. The syntax within it is reasonably intuitive.
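To see what the grouping trick does, here is a small worked example on a toy srv vector (my own illustration, not from the original post):
srv <- c(1, 0, 1, 1, 0)
rev(srv == 1)              # FALSE TRUE TRUE FALSE TRUE (reversed indicator)
cumsum(rev(srv == 1))      # 0 1 2 2 3
rev(cumsum(rev(srv == 1))) # 3 2 2 1 0
Each survey row closes a group that also contains the non-survey years just before it; trailing years with no later survey fall into group 0.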

Related

Moving average that resets based on another binary column

I have a dataset of hemoglobin levels, and I am interested in calculating a backwards-looking (i.e. not centred) moving average that restarts every time a patient's hemoglobin level drops below 7.
pt_id is the patient id, hemoglobin_level is the measured hemoglobin, and anemia_start indicates when a given patient's hemoglobin first goes below 7 (i.e. when anemia_start equals 1).
Example data:
df <- data.frame(pt_id = c(1,1,1,1,1),
                 hemoglobin_level = c(8,6,5,8,7),
                 anemia_start = c(0,1,0,0,0))
df
pt_id hemoglobin_level anemia_start
1 1 8 0
2 1 6 1
3 1 5 0
4 1 8 0
5 1 7 0
Expected output column is:
moving_average = c(8, 6, 5.5, 6.3, 6.5)
The moving average restarts once anemia starts, so the second value is 6, and the average then continues from there.
I know how to create a moving average (using the zoo package or slider), but I do not know how to make it restart conditionally based on the anemia_start column.
Thanks for any help.
Further information:
My professor did this in SAS using a series of if statements, but I have had a hard time translating it to R.
To clarify the expected output, here is a picture of my professor's SAS output that I would like to reproduce in R. He created a number of intermediary columns, and the labels are in Danish: HB = hemoglobin (what I called hemoglobin_level), ptnr slut = patient number end, ptnr start = patient number start, and hb_gennemsnit = hemoglobin average.
The hb_gennemsnit column is the moving-average column I am having a hard time reproducing in R.
Using data.table and slider:
library(data.table)
library(slider)
setDT(df)
# Helper function: builds a period index that increments at each anemia start
adder <- function(x) {
  for (i in seq_along(x)) {
    if (x[i] == 0L) {
      # carry the previous period forward (period 1 for the first row)
      x[i] <- if (i == 1L) 1 else x[i - 1]
    } else {
      # a new anemia episode starts a new period
      # (guard added so a series starting with anemia_start = 1 does not error)
      x[i] <- if (i == 1L) 1 else x[i - 1] + 1
    }
  }
  x
}
# Create period index
df[, period := adder(anemia_start), by = pt_id]
# Moving average within each (patient, period) group
df[, moving_average := slide_vec(
     .x = hemoglobin_level,
     .f = mean,
     .before = Inf),
   by = c("pt_id", "period")]
Output:
df
    pt_id hemoglobin_level anemia_start period moving_average
 1:     1                8            0      1       8.000000
 2:     1                6            1      2       6.000000
 3:     1                5            0      2       5.500000
 4:     1                8            0      2       6.333333
 5:     1                7            0      2       6.500000
 6:     2                8            0      1       8.000000
 7:     2                4            1      2       4.000000
 8:     2                3            0      2       3.500000
 9:     2                9            0      2       5.333333
10:     2                9            0      2       6.250000
The OP edited the question so that there is only a single pt_id value; the output above was generated with a two-patient version of the data. In the single-patient case you can simply drop by = pt_id everywhere, but the original solution still works.
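As an aside (my note, not part of the original answer): for a 0/1 indicator column, the looping helper is equivalent to a vectorised cumulative sum. The period labels can differ when a series starts with anemia_start = 1, but the grouping is identical:
df[, period := cumsum(anemia_start) + 1L, by = pt_id]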

Lookup observations data based on another table

I have 2 tibble data frames that I am trying to reconcile. The first tibble has over a million observations; the first few rows are as follows:
data
ID Time(Converted to number)
1 23160
1 23161
1 23162
1 23163
1 23164
1 23165
2 24251
2 24252
The second tibble is a lookup table (that has information of a particular event that has occurred), simplified version as follows:
lookup_table
ID Event_Time Event_Indicator Number_of_Cumulative_Events
1 23162 1 1
1 23164 1 2
2 24255 1 1
2 24280 0 1
I would like to create a 3rd column in the first tibble that shows the cumulative number of events at the time of each observation. The 3rd column in the above example would therefore be:
ID Time(Converted to number) Number
1 23160 0
1 23161 0
1 23162 1
1 23163 1
1 23164 2
1 23165 2
2 24251 0
2 24252 0
I am trying to avoid having to loop through the millions of observations to compare each observation's time to the Event_Time in the lookup table because of computation time.
However, I am not sure how to go about this without a loop. The issue is that lookup_table contains some IDs multiple times; if every ID appeared in lookup_table only once, I could do:
data$Event_Time <- lookup_table[match(data$ID, lookup_table$ID),"Event_Time"]
data$Number <- data %>% mutate(ifelse(Time >= Event_Time,1,0))
Any ideas how I could avoid the use of a loop and yet apply the lookup conditions for each observation? Thank you.
Edit: I am not trying to join the tables so much as compare the time columns in lookup_table and data to obtain my desired column. For example, an inefficient loop would be:
for (i in 1:nrow(data)) {
  data$Number[i] <- subset(lookup_table, ID == data$ID[i])[
    max(which(data$Time[i] >= lookup_table$Event_Time)),
    "Number_of_Cumulative_Events"]
}
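For reference, here is a minimal construction of the example tables (my addition, with column names taken from the printed output above), so the answers below can be run directly:
library(data.table)
data <- data.frame(ID = c(1,1,1,1,1,1,2,2),
                   Time = c(23160:23165, 24251, 24252))
lookup_table <- data.frame(ID = c(1,1,2,2),
                           Event_Time = c(23162, 23164, 24255, 24280),
                           Event_Indicator = c(1,1,1,0),
                           Number_of_Cumulative_Events = c(1,2,1,1))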
A possible solution is to count the cumulative events after the join. Note that an update on join is used.
library(data.table)
setDT(data)[, new := 0L][setDT(lookup_table), on = .(ID, Time = Event_Time), new := Event_Indicator][
, new := cumsum(new), by = ID][]
ID Time new
1: 1 23160 0
2: 1 23161 0
3: 1 23162 1
4: 1 23163 1
5: 1 23164 2
6: 1 23165 2
7: 2 24251 0
8: 2 24252 0
Alternatively,
setDT(data)[setDT(lookup_table), on = .(ID, Time = Event_Time), new := Event_Indicator][
is.na(new), new := 0][
, new := cumsum(new), by = ID][]
will set missing entries to zero after the join.
A completely different approach is to use a rolling join:
lookup_table[, !"Event_Indicator"][data, on = .(ID, Event_Time = Time), roll = TRUE]
ID Event_Time Number_of_Cumulative_Events
1: 1 23160 NA
2: 1 23161 NA
3: 1 23162 1
4: 1 23163 1
5: 1 23164 2
6: 1 23165 2
7: 2 24251 NA
8: 2 24252 NA
(NA's have been left untouched for illustration)
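If zeros are wanted instead, the NAs left by the rolling join can be filled in afterwards (a small addition of mine, not part of the original answer):
res <- lookup_table[, !"Event_Indicator"][data, on = .(ID, Event_Time = Time), roll = TRUE]
res[is.na(Number_of_Cumulative_Events), Number_of_Cumulative_Events := 0L]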

R data.table to determine if a person is new or existing

I have the following data.table
year Person Number_of_visits
2012 1 0
2013 1 4
2014 1 0
2015 1 1
2012 2 1
2013 2 5
...
I would like to determine, for each person, the year of their first visit. So a desired output is:
year Person Number_of_visits New?
2012 1 0 NA
2013 1 4 Yes
2014 1 0 No
2015 1 1 No
2012 2 1 NA
2013 2 5 No
I think one could perhaps use the shift function in data.table, but I can't figure out how. Once a person has had a visit, he/she is no longer new, even if there is a later year with no visits. If the first visit occurs in 2012, there should be an NA or similar entry.
I have used
test <- DT[ , NEW := c(0, (2:1)[(Number_of_visits== shift(Number_of_visits)) + 1][-1]), by = Person]
but this naturally flags every change, and I would like to register only the first change from 0 to some positive number of visits.
I would break this into the following steps (I'm sure the solution could be golfed into something much shorter, though):
setorder(dt, Person, year)  # Make sure the order is correct
dt[, New := "No"]  # Set "No" as the default
dt[dt[, .I[which.max(Number_of_visits > 0)], by = Person]$V1, New := "Yes"]  # flag each person's first visit
dt[year == 2012, New := NA_character_]  # the first year (2012) gets NA
dt
# year Person Number_of_visits New
# 1: 2012 1 0 NA
# 2: 2013 1 4 Yes
# 3: 2014 1 0 No
# 4: 2015 1 1 No
# 5: 2012 2 1 NA
# 6: 2013 2 5 No
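A note on the key step (my illustration, not the answerer's): which.max() on a logical vector returns the position of the first TRUE, which is why it picks out each person's first year with a visit:
which.max(c(0, 4, 0, 1) > 0)  # -> 2, the first year with a positive count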

Create sequential counter that restarts on a condition within panel data groups [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 6 years ago.
I have a panel data set for which I would like to create a counter that increases with each step in the panel but restarts whenever some condition occurs. In my case, I'm using country-year data and want to count the years that pass between events. Here's a toy data set with the key features of my real one:
df <- data.frame(country = rep(c("A","B"), each = 5),
                 year = rep(2000:2004, times = 2),
                 event = c(0,0,1,0,0,1,0,0,1,0),
                 stringsAsFactors = FALSE)
What I'm looking to do is to create a counter that is keyed to df$event within each country's series of observations. The clock starts at 1 when we start observing each country; it increases by 1 with the passage of each year; and it restarts at 1 whenever df$event==1. The desired output is this:
country year event clock
1 A 2000 0 1
2 A 2001 0 2
3 A 2002 1 1
4 A 2003 0 2
5 A 2004 0 3
6 B 2000 1 1
7 B 2001 0 2
8 B 2002 0 3
9 B 2003 1 1
10 B 2004 0 2
I have tried using getanID from splitstackshape and a few variations of if and ifelse but have failed so far to get the desired result.
I'm already using dplyr in the scripts where I need to do this, so I would prefer a solution that uses it or base R, but I would be grateful for anything that works. My data sets are not massive, so speed is not critical, but efficiency is always a plus.
With dplyr that would be:
df %>%
group_by(country, idx = cumsum(event == 1L)) %>%
mutate(counter = row_number()) %>%
ungroup %>%
select(-idx)
#Source: local data frame [10 x 4]
#
# country year event counter
#1 A 2000 0 1
#2 A 2001 0 2
#3 A 2002 1 1
#4 A 2003 0 2
#5 A 2004 0 3
#6 B 2000 1 1
#7 B 2001 0 2
#8 B 2002 0 3
#9 B 2003 1 1
#10 B 2004 0 2
Or using data.table:
library(data.table)
setDT(df)[, counter := seq_len(.N), by = list(country, cumsum(event == 1L))]
Edit: group_by(country, idx = cumsum(event == 1L)) is used to group by country and a new grouping index "idx". The event == 1L part creates a logical vector telling us whether the "event" column equals the integer 1 (TRUE/FALSE). Then cumsum(...) sums it up: 0 for the first 2 rows, 1 for the next 3, 2 for the next 3, and so on. We use this new column (plus country) to group the data as needed. You can check it by removing the last two pipe steps in the dplyr code, as shown below.
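That check looks like this (simply the answer's pipeline with the ungroup and select steps removed, so the intermediate idx column stays visible):
df %>%
  group_by(country, idx = cumsum(event == 1L)) %>%
  mutate(counter = row_number())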
library(splitstackshape)
df$counter <- getanID(cbind(df$country, cumsum(df$event)))[,.id]
We take advantage of the fact that you already have zeroes and ones in your event column. That makes indexing much easier. I combine the country column with cumsum(df$event). When that command is run by itself you can see its effect:
cumsum(df$event)
[1] 0 0 1 1 1 2 2 2 3 3
It will only increase with each 1 value. When combined with the country, we are able to see the increase grouped by country.
From there, we can create an id column. @AnandaMahto's splitstackshape package has the getanID function for that.
df
country year event counter
1 A 2000 0 1
2 A 2001 0 2
3 A 2002 1 1
4 A 2003 0 2
5 A 2004 0 3
6 B 2000 1 1
7 B 2001 0 2
8 B 2002 0 3
9 B 2003 1 1
10 B 2004 0 2

Select rows in data.table based on in-row calculation

The data set is similar to this:
library(data.table)
uid <- c("a","a","a","b","b","b","c","c","c")
date <- c(2001,2002,2003)
DT <- data.table(id=uid, year=rep(date,3), value= c(1,3,2,1:6))
Q1
Now I want to find which observations have the "value" column increasing year after year.
What I want is like this:
For b and c, value is increasing all the time.
4: b 2001 1
5: b 2002 2
6: b 2003 3
7: c 2001 4
8: c 2002 5
9: c 2003 6
In real data, the recording time span for each id is different.
Besides, I want to calculate, for a given id, how many years the value increases:
ID V1
1: a 1
2: b 2
3: c 2
Thanks a lot if you have some ideas about this.
I would prefer a data.table method, due to speed requirements.
I think this does what you want:
DT[order(year)][, sum(diff(value) > 0), by=id]
produces:
id V1
1: a 1
2: b 2
3: c 2
This assumes you have at most one value per year.
For your first question, if the data are not sorted, I'd do a setkey on id, year (rather than using base::order, as it's very slow). id is included as well so that you'll get the results in the same order you expect for question 2.
setkey(DT, id, year)
DT[, if (.N == 1L ||
         (.N > 1L && all(value[2:.N] - value[1:(.N-1)] > 0)))
       .SD,
   by = list(id)]
id year value
1: b 2001 1
2: b 2002 2
3: b 2003 3
4: c 2001 4
5: c 2002 5
6: c 2003 6
For your second question:
DT[, if (.N == 1L) 1L else sum(value[2:.N]-value[1:(.N-1)] > 0), by=list(id)]
id V1
1: a 1
2: b 2
3: c 2
I subtract the 1st-to-(N-1)th values from the 2nd-to-Nth (.N) values explicitly because diff, being an S3 generic, takes time to dispatch the right method (here, diff.default); it is much faster to write the operation directly in j.
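A rough way to check that claim (my sketch; absolute timings will vary by machine):
library(microbenchmark)
x <- rnorm(100)
microbenchmark(
  generic  = diff(x),                              # S3 dispatch to diff.default
  explicit = x[2:length(x)] - x[1:(length(x) - 1)]
)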
