Moving average that resets based on another binary column - r

I have a dataset containing blood percentage (hemoglobin level), and I am interested in calculating a backwards-looking (i.e. not centred) moving average that restarts every time the patient's hemoglobin level goes below 7.
pt_id represents the patient id, hemoglobin_level is the given hemoglobin level, and anemia_start is the column that indicates when a given patient's hemoglobin first goes below 7 (i.e. when anemia_start equals 1).
Example data:
df <- data.frame(pt_id = c(1,1,1,1,1),
                 hemoglobin_level = c(8,6,5,8,7),
                 anemia_start = c(0,1,0,0,0))
df
pt_id hemoglobin_level anemia_start
1 1 8 0
2 1 6 1
3 1 5 0
4 1 8 0
5 1 7 0
Expected output column is:
moving_average = c(8, 6, 5.5, 6.3, 6.5)
The moving average is restarted once anemia starts, so the second value is 6 and then the moving average continues.
I know how to create a moving average (using the zoo package or slider), but I do not know how to make it restart conditionally based on the anemia_start column.
Thanks for any help.
Further information:
My professor did this in SAS using a bunch of if statements, but I have had a hard time translating it to R.
To help explain the expected output, here's a picture of my professor's output (made in SAS) that I would like to reproduce in R. The column I'm having a hard time reproducing is the one called hb_gennemsnit (= hemoglobin average).
He created a number of intermediary columns in SAS to produce it. The names are in Danish, but HB = hemoglobin (the one I called hemoglobin_level), ptnr_slut = patient number end, ptnr_start = patient number start, and HB_gennemsnit = hemoglobin average.
The hb_gennemsnit column is the moving average column that I am trying to reproduce in R.

Using data.table and slider:
library(data.table)
library(slider)
setDT(df)

# Helper function: builds a period counter that increments each time
# anemia_start == 1 (assumes the first row of a patient is not an anemia start)
adder <- function(x) {
  for (i in 1:length(x)) {
    if (x[i] == 0L) {
      if (i != 1L) {
        x[i] <- x[i - 1]
      } else {
        x[i] <- 1
      }
    } else {
      x[i] <- x[i - 1] + 1
    }
  }
  return(x)
}

# Create period index
df[, period := adder(anemia_start), by = pt_id]

# Do moving average within each patient/period
df[, moving_average := slide_vec(
    .x = hemoglobin_level,
    .f = mean,
    .before = Inf),
  by = c("pt_id", "period")]
Output:
df
pt_id hemoglobin_level anemia_start period moving_average
1: 1 8 0 1 8.000000
2: 1 6 1 2 6.000000
3: 1 5 0 2 5.500000
4: 1 8 0 2 6.333333
5: 1 7 0 2 6.500000
6: 2 8 0 1 8.000000
7: 2 4 1 2 4.000000
8: 2 3 0 2 3.500000
9: 2 9 0 2 5.333333
10: 2 9 0 2 6.250000
The OP edited the question so that there is only a single pt_id value. In that case, you can just drop the by = pt_id everywhere, but the original solution will still work.
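As a side note, the row-by-row adder() helper can likely be replaced by a vectorised cumsum(), since the period only needs to increment on each row where anemia_start is 1. A minimal sketch, assuming anemia_start only ever contains 0 and 1:
library(data.table)
library(slider)
setDT(df)
# cumsum() bumps the period at every anemia start; the + 1 matches adder()'s 1-based index
df[, period := cumsum(anemia_start) + 1, by = pt_id]
df[, moving_average := slide_vec(hemoglobin_level, mean, .before = Inf),
  by = c("pt_id", "period")]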

Related

R - retaining sequence count if only 2 rows don't match the condition

I have a few years of data and I am trying to look at the duration of events within the dataset. For example, I would like to know the duration of "strong wind events". I can do this by:
wind.df <- data.frame(ws = c(6,7,8,9,1,7,6,1,2,3,4,10,4,1,2))
r <- rle(wind.df$ws>=6)
sequence <- unlist(sapply(r$lengths, seq))
wind.df$strong.wind.duration <- sequence
BUT, if the wind speed goes below the threshold for only two data points, I want to keep counting. If the wind speed is below the threshold for more than two, then I want to reset the counter.
So the output would look like:
## manually creating a desired output ###
wind.df$desired.output <- c(1,2,3,4,5,6,7,1,2,3,4,5,6,7,8)
You can do this with a customized function that loops over your wind speeds and counts the consecutive numbers above a threshold:
numerate <- function(nv, threshold = 6) {
  counter <- 1
  clist <- c()
  low <- TRUE
  for (i in 1:length(nv)) {
    # look ahead two values; indexing past the end yields NAs, which na.rm drops
    if (max(nv[i:(i + 2)], na.rm = TRUE) < threshold & !low) { # reset the counter
      counter <- 1
      low <- TRUE
    }
    if (nv[i] >= threshold) low <- FALSE
    clist <- c(clist, counter)
    counter <- counter + 1
  }
  return(clist)
}
wind.df <- data.frame(ws = c(6,7,8,9,1,7,6,1,2,3,4,10,4,1,2))
wind.df$desired.output = numerate(wind.df$ws)
The output of this function would be:
> print(wind.df)
ws desired.output
1 6 1
2 7 2
3 8 3
4 9 4
5 1 5
6 7 6
7 6 7
8 1 1
9 2 2
10 3 3
11 4 4
12 10 5
13 4 1
14 1 2
15 2 3
The desired output you wrote in your question is wrong: the last three elements of the wind speed are 4, 1, 2. That is more than two values below 6 after a value above 6, so the counter has to be reset.
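For what it's worth, the same counts can probably be obtained without an explicit counter by operating on rle() runs. A sketch, under the assumption that a reset means more than tol consecutive below-threshold values:
numerate_rle <- function(nv, threshold = 6, tol = 2) {
  r <- rle(nv >= threshold)
  run_id <- rep(seq_along(r$lengths), r$lengths)  # which run each row belongs to
  long_low <- !r$values & r$lengths > tol         # below-threshold runs longer than tol
  # a new count starts at row 1 and at the first row of each long low run
  new_grp <- (long_low[run_id] & !duplicated(run_id)) | seq_along(nv) == 1
  ave(seq_along(nv), cumsum(new_grp), FUN = seq_along)
}
numerate_rle(wind.df$ws)
# [1] 1 2 3 4 5 6 7 1 2 3 4 5 1 2 3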

Data.table: sum between irregular date ranges

Surveys and fires occurred at irregular intervals in different burn units.
(srv=1 means a survey was done, fire=1 means a fire occurred)
I want to calculate how many fires were lit between surveys, i.e.,
including the year of the survey and going back to (but not including) the year of the previous survey.
library(data.table)
nyear <- 10
units <- 4
set.seed(15)
DT <- data.table(
  unit = rep(1:units, each = nyear),
  year = 2000:(2000 + nyear - 1),
  srv = rbinom(nyear * units, 1, 0.4),
  fire = rbinom(nyear * units, 1, 0.3)
)
DT
I can calculate the years elapsed, but I have to create a new dataset and then join it back to the original dataset. Then I cannot figure out how to sum fires between the date ranges.
DT1 <- DT[srv != 0] # Drop years without surveys
DT2 <- DT1[, .(year, elapsed = year - shift(year)), by = "unit"] # Use 'shift' to find years elapsed
DT3 <- DT2[DT, on=.(unit, year)] # join dataset with elapsed time to original dataset
DT3[ , sum(fire), on = .(year >= year, year < year -(elapsed-1)), by="unit"] # Doesn't work
Example output follows, where 'nfire' is what I'm after -- in years without surveys it is NA; otherwise it provides the number of fires after the previous survey, up to and including the current survey year:
unit year elapsed srv fire nfire
1: 1 2000 NA 1 1 1
2: 1 2001 NA 0 0 NA
3: 1 2002 2 1 1 1
4: 1 2003 1 1 0 0
5: 1 2004 NA 0 0 NA
6: 1 2005 2 1 0 0
7: 1 2006 1 1 0 1
8: 1 2007 NA 0 1 NA
9: 1 2008 2 1 1 2
10: 1 2009 1 1 0 1
11: 2 2000 NA 0 0 NA
12: 2 2001 NA 1 1 NA
The answer of r2evans works:
DT[, grp := rev(cumsum(rev(srv == 1))), by = .(unit)][, nfire := sum(fire), by=.(unit, grp)]
Times when surveys occurred (srv == 1) are placed in reverse order and summed cumulatively. The reverse ordering ensures that each survey is grouped with the years that preceded it, and the cumulative sum assigns consecutively numbered groups. The outer 'rev' restores the original row order.
The second part of the statement, '[, nfire := sum(fire), by = .(unit, grp)]', is an example of chaining -- as I understand it, just a way of adding more operations to a data.table call without cluttering the first part of the statement. The syntax within it is reasonably intuitive.
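A quick trace with unit 1's survey vector from the table above may make the grouping clearer:
srv <- c(1, 0, 1, 1, 0, 1, 1, 0, 1, 1)   # unit 1, years 2000-2009
rev(cumsum(rev(srv == 1)))
# [1] 7 6 6 5 4 4 3 2 2 1
# every survey year gets its own group, and each non-survey year shares the
# group of the next survey, so sum(fire) by grp totals the fires since the last survey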

rolling function with variable width R

I need to summarize some data using a rolling window of different width and shift. In particular I need to apply a function (e.g. sum) over some values recorded on different intervals.
Here an example of a data frame:
library(tibble)
df <- tibble(days = c(0,1,2,3,1),
             value = c(5,7,3,4,2))
df
# A tibble: 5 x 2
days value
<dbl> <dbl>
1 0 5
2 1 7
3 2 3
4 3 4
5 1 2
The columns indicate:
days how many days elapsed from the previous observation. The first value is 0 because there is no previous observation.
value the value I need to aggregate.
Now, let's assume that I need to sum the field value every 4 days, shifting one day at a time.
I need something along these lines:
days value roll_sum rows_to_sum
0 5 15 1,2,3
1 7 10 2,3
2 3 3 3
3 4 6 4,5
1 2 NA NA
The column rows_to_sum has been added for clarity.
Here more details:
The first value (15) is the sum of the first 3 rows, because 0+1+2 = 3, which is less than the reference value 4, while adding the next row (days = 3) would bring the total day count to 6, which is more than 4.
The second value (10) is the sum of rows 2 and 3. Excluding the first row (since we are shifting one day), we only sum rows 2 and 3, because including row 4 would bring the total day count to 1+2+3 = 6, which is more than 4.
...
How can I achieve this?
Thank you
Here is one way:
library(dplyr)
library(purrr)
df %>%
  mutate(roll_sum = map_dbl(row_number(), ~ {
    i <- max(which(cumsum(days[.x:n()]) <= 4))
    # which() returns integer(0) if even the current row exceeds the window,
    # making max() return -Inf, so check for a usable index
    if (!is.finite(i)) NA_real_ else sum(value[.x:(.x + i - 1)])
  }))
# days value roll_sum
# <dbl> <dbl> <dbl>
#1 0 5 15
#2 1 7 10
#3 2 3 3
#4 3 4 6
#5 1 2 2
Performing this calculation in base R:
sapply(seq(nrow(df)), function(x) {
  i <- max(which(cumsum(df$days[x:nrow(df)]) <= 4))
  if (!is.finite(i)) NA_real_ else sum(df$value[x:(x + i - 1)])
})
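To see why the cumsum() test selects the right rows, here are the intermediate results for the first two rows of the example:
days <- c(0, 1, 2, 3, 1)
which(cumsum(days[1:5]) <= 4)  # 1 2 3 -> rows 1:3 are within the 4-day window (sum = 15)
which(cumsum(days[2:5]) <= 4)  # 1 2   -> rows 2:3, counted from row 2       (sum = 10)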

How to divide all previous observations by the last observation iteratively within a data frame column by group in R and then store the result

I have the following data frame:
data <- data.frame("Group" = c(1,1,1,1,1,1,1,1,2,2,2,2),
"Days" = c(1,2,3,4,5,6,7,8,1,2,3,4), "Num" = c(10,12,23,30,34,40,50,60,2,4,8,12))
I need to take the last value in Num and divide it by all of the preceding values. Then I need to move to the second-to-last value in Num and do the same, until I reach the first value in each group.
Edited based on the comments below:
In plain language and showing all the math, starting with the first group as suggested below, I am trying to achieve the following:
Take 60 (last value in group 1) and:
Day Num Res
7 60/50 1.2
6 60/40 1.5
5 60/34 1.76
4 60/30 2
3 60/23 2.60
2 60/12 5
1 60/10 6
Then keep only the row that has the value 2, as I don't care about the others (I want the value greater than or equal to 2 that is closest to 2), and also return the day of that value, which is 4. Then move on to 50 and do the following:
Day Num Res
6 50/40 1.25
5 50/34 1.47
4 50/30 1.67
3 50/23 2.17
2 50/12 4.17
1 50/10 5
Then keep only the row that has the value 2.17 and also return its day, which is 3. Then move on to 40 and do the same thing over again, then 34, then 30, then 23, then 12; the final value (the Day 1 value) I don't care about. Then move on to the next group's last value (12) and repeat the same approach for that group (12/8, 12/4, 12/2; 8/4, 8/2; 4/2).
I would like to store the results of these divisions, but only the most recent result that is greater than or equal to 2, and I would also like to return the day that result was achieved. Basically, I am trying to calculate a doubling time for each day, grouped by Group. Normally I would use dplyr for this, but I am not sure how to link up a loop with dplyr to take advantage of group_by; I could also be overlooking lapply or some variation thereof. My expected data frame with the results would ideally be this:
data2 <- data.frame(divres = c(NA,NA,2.3,2.5,2.833333333,3.333333333,2.173913043,2,NA,2,2,3),
                    obs_n = c(NA,NA,1,2,2,2,3,4,NA,1,2,2))
data3 <- bind_cols(data, data2)
I have tried this first loop to calculate the division, but I am lost as to how to move on to the next last value within each group. Right now this ignores the group, though I obviously have not told it to group, as I am unclear how to do that outside of dplyr.
for (i in 1:nrow(data))
  data$test[i] <- ifelse(!is.na(data$Num), last(data$Num) / data$Num[i], NA)
I also get the following error when I run it:
number of items to replace is not a multiple of replacement length
To store the division, I have tried this:
division <- function(x) {
  if (x >= 2) {
    return(x)
  } else {
    return(FALSE)
  }
}
for (i in 1:nrow(data)) {
  data$test[i] <- division(data$test[i])
}
Now, this approach works, but only if I need to run it once on the last observation and only if I apply it to one group. I have 209 groups and many days that I would need to run this over. I am not sure how to put together the first for loop with the division function, and I am also totally lost as to how to do this by group and move to the next last values. Any suggestions would be appreciated.
You can modify your division function to handle a vector and return a dataframe with two columns, divres and ind; the latter is the row index that will be used to calculate obs_n, as shown below:
division <- function(x) {
  lenx <- length(x)
  y <- vector(mode = "numeric", length = lenx)
  z <- vector(mode = "numeric", length = lenx)
  for (i in lenx:1) {
    ratios <- x[i] / x[1:i]
    # most recent earlier value that x[i] has at least doubled
    y[i] <- ifelse(length(which(ratios >= 2)) == 0, NA, ratios[max(which(ratios >= 2))])
    z[i] <- ifelse(is.na(y[i]), NA, max(which(ratios >= 2)))
  }
  df <- data.frame(divres = y, ind = z)
  return(df)
}
Check the output of the division function created above using data$Num as input. (Note that this runs over the entire Num vector across groups; it happens to work here because no cross-group ratio reaches 2 -- see the grouped sketch after the final output below.)
> division(data$Num)
divres ind
1 NA NA
2 NA NA
3 2.300000 1
4 2.500000 2
5 2.833333 2
6 3.333333 2
7 2.173913 3
8 2.000000 4
9 NA NA
10 2.000000 9
11 2.000000 10
12 3.000000 10
Use cbind to combine the above output with the dataframe data, then use pipes and mutate from dplyr to look up the obs_n value in Days using ind, and select the appropriate columns to generate the desired dataframe data2:
library(dplyr)
data2 <- cbind.data.frame(data, division(data$Num)) %>%
  mutate(obs_n = Days[ind]) %>%
  select(-ind)
Output
> data2
Group Days Num divres obs_n
1 1 1 10 NA NA
2 1 2 12 NA NA
3 1 3 23 2.300000 1
4 1 4 30 2.500000 2
5 1 5 34 2.833333 2
6 1 6 40 3.333333 2
7 1 7 50 2.173913 3
8 1 8 60 2.000000 4
9 2 1 2 NA NA
10 2 2 4 2.000000 1
11 2 3 8 2.000000 2
12 2 4 12 3.000000 2
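If cross-group doubling could occur in your real data, a grouped variant may be safer. A sketch reusing the division() function above with group_modify() from dplyr:
library(dplyr)
data2 <- data %>%
  group_by(Group) %>%
  group_modify(~ cbind(.x, division(.x$Num))) %>%  # run division() within each group
  mutate(obs_n = Days[ind]) %>%                    # ind is now a within-group row index
  ungroup() %>%
  select(-ind)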
You can create a function with a for loop to get the desired day, as given below, then use it to compute divres in a dplyr mutate call.
obs_n <- function(x, days) {
  lst <- list()
  for (i in length(x):1) {
    obs <- days[which(rev(x[i] / x[(i-1):1]) >= 2)]
    if (length(obs) == 0)
      lst[[i]] <- NA
    else
      lst[[i]] <- max(obs)
  }
  unlist(lst)
}
Then use dense_rank to obtain the row number corresponding to each obs_n. This is needed in case the days are not consecutive, i.e. have gaps.
library(dplyr)
data %>%
  group_by(Group) %>%
  mutate(obs_n = obs_n(Num, Days), divres = Num / Num[dense_rank(obs_n)])
# A tibble: 12 x 5
# Groups: Group [2]
Group Days Num obs_n divres
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 10 NA NA
2 1 2 12 NA NA
3 1 3 23 1 2.3
4 1 4 30 2 2.5
5 1 5 34 2 2.83
6 1 6 40 2 3.33
7 1 7 50 3 2.17
8 1 8 60 4 2
9 2 1 2 NA NA
10 2 2 4 1 2
11 2 3 8 2 2
12 2 4 12 2 3
Explanation of dense ranks (from Wikipedia).
In dense ranking, items that compare equally receive the same ranking number, and the next item(s) receive the immediately following ranking number.
x <- c(NA, NA, 1,2,2,4,6)
dplyr::dense_rank(x)
# [1] NA NA  1  2  2  3  4
Compare with rank (default method="average"). Note that NAs are included at the end by default.
rank(x)
[1] 6.0 7.0 1.0 2.5 2.5 4.0 5.0

Replacing the last value within groups with different values

My question is similar to this post, but the difference is that instead of replacing the last value within each group/id with all 0's, different values are used to replace the last value within each group/id.
Here is an example (I borrowed it from the above link):
id Time
1 1 3
2 1 10
3 1 1
4 1 0
5 1 9999
6 2 0
7 2 9
8 2 500
9 3 0
10 3 1
In the above link, the last value within each group/id was replaced by a zero, using something like:
df %>%
  group_by(id) %>%
  mutate(Time = c(Time[-n()], 0))
And the output was
id Time
1 1 3
2 1 10
3 1 1
4 1 0
5 1 0
6 2 0
7 2 9
8 2 0
9 3 0
10 3 0
In my case, I would like the last value within each group/id to be replaced by a different value. Originally, the last values within each group/id were 9999, 500, and 1. Now I would like 9999 replaced by 5, 500 replaced by 12, and 1 replaced by 92. The desired output is:
id Time
1 1 3
2 1 10
3 1 1
4 1 0
5 1 5
6 2 0
7 2 9
8 2 12
9 3 0
10 3 92
I tried this one:
df %>%
  group_by(id) %>%
  mutate(Time = replace(Time, n(), c(5, 12, 92)))
but it did not work.
This could be solved using an almost identical solution to the one I posted in the linked question -- just replace 0L with the desired values:
library(data.table)
indx <- setDT(df)[, .I[.N], by = id]$V1
df[indx, Time := c(5L, 12L, 92L)]
df
# id Time
# 1: 1 3
# 2: 1 10
# 3: 1 1
# 4: 1 0
# 5: 1 5
# 6: 2 0
# 7: 2 9
# 8: 2 12
# 9: 3 0
# 10: 3 92
So to add some explanations:
.I is identical to row_number() or 1:n() in dplyr for ungrouped data, e.g. 1:nrow(df) in base R.
.N is like n() in dplyr, i.e., the size of a certain group (or of the whole data set). So basically, when I run .I[.N] by group, I'm retrieving the global index of the last row of each group.
The next step is just to use this index as a row index within df while assigning the desired values to Time by reference using the := operator.
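As a quick illustration, the index vector computed in the first step is simply the global row number of each group's last row:
indx <- setDT(df)[, .I[.N], by = id]$V1
indx
# [1]  5  8 10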
Edit
Per OP's request, here's a possible dplyr solution. Your original solution doesn't work because you are working per group, and thus you were trying to pass all three values to each group.
The only way I can think of is to first calculate the group sizes, then ungroup, and then mutate on the cumulative sum of these sizes (which gives the position of each group's last row), something along these lines:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(indx = n()) %>%
  ungroup() %>%
  mutate(Time = replace(Time, cumsum(unique(indx)), c(5, 12, 92))) %>%
  select(-indx)
# Source: local data frame [10 x 2]
#
# id Time
# 1 1 3
# 2 1 10
# 3 1 1
# 4 1 0
# 5 1 5
# 6 2 0
# 7 2 9
# 8 2 12
# 9 3 0
# 10 3 92
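With current dplyr (1.0+), there is likely a more direct route that also avoids the unique()-of-group-sizes trick: pass one replacement value per group via cur_group_id(). A sketch, assuming the replacement values are ordered by id:
library(dplyr)
new_vals <- c(5, 12, 92)
df %>%
  group_by(id) %>%
  mutate(Time = replace(Time, n(), new_vals[cur_group_id()])) %>%
  ungroup()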
Another way using data.table would be to create a second data.table which contains the replacement value for each id, and then join and update by reference (simultaneously).
require(data.table) # v1.9.5+ (for 'on = ' feature)
replace <- data.table(id = 1:3, val = c(5L, 12L, 92L)) # from @David
setDT(df)[replace, Time := val, on = "id", mult = "last"]
# id Time
# 1: 1 3
# 2: 1 10
# 3: 1 1
# 4: 1 0
# 5: 1 5
# 6: 2 0
# 7: 2 9
# 8: 2 12
# 9: 3 0
# 10: 3 92
In data.table, joins are considered an extension of subsets. It's natural to think of doing whatever operation we do on subsets also on joins. Both operations do something on some rows.
For each replace$id, we find the last matching row (mult = "last") in df$id, and update that row with the corresponding val.
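For comparison, a rough dplyr translation of the same join-and-update idea (a sketch: left_join the lookup table, then replace each group's last row):
library(dplyr)
replace_df <- data.frame(id = 1:3, val = c(5, 12, 92))
df %>%
  left_join(replace_df, by = "id") %>%
  group_by(id) %>%
  mutate(Time = if_else(row_number() == n(), val, Time)) %>%
  ungroup() %>%
  select(-val)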
Installation instructions for v1.9.5 here. Hope this helps.
