Irregular interval-based representation of survival data in R

I have the following dataset:
df =
id Time A
 1    3 0
 1    5 1
 1    6 1
 2    8 0
 2    9 0
 2   12 1
I want to do two things: i) have a starting time of -1 across all ids, and ii) split the time into two columns, start and end, while preserving the time at which the individual got the observation A (setting end as the reference point). The final result should look something like this:
df =
id start end A
 1    -1   0 0
 1     0   2 1
 1     2   3 1
 2    -1   0 0
 2     0   1 0
 2     1   4 1

This does the trick with this set. I wasn't 100% sure about the question from the description, so I went off what I could see here. For future reference, please paste in the output of dput(df) as the input data :)
df <- data.frame(id = c(rep(1, 3), rep(2, 3)),
                 Time = c(3, 5, 6, 8, 9, 12),
                 A = c(0, 1, 1, 0, 0, 1))
library(data.table)
dt <- as.data.table(df)
# diff(Time) finds the interval between points;
# cumsum then adds these diffs together to take into account the previous time gaps
dt[, end := cumsum(c(0, diff(Time))), by = id]
# start is then just a shifted version of end, with the initial start filled as -1
dt[, start := shift(end, n = 1, fill = -1), by = id]
out <- as.data.frame(dt)
out
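If the columns should match the layout asked for exactly, one extra line (my addition, not part of the original answer) reorders them:
# select/reorder columns to match the requested id/start/end/A layout
out <- out[, c("id", "start", "end", "A")]
out
#   id start end A
# 1  1    -1   0 0
# 2  1     0   2 1
# 3  1     2   3 1
# 4  2    -1   0 0
# 5  2     0   1 0
# 6  2     1   4 1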


Add index to runs of positive or negative values of certain length

I have a dataframe which contains 100,000 rows. It looks like this:
Value
1
2
-1
-2
0
3
4
-1
3
I want to create an extra column (column B), which consists of 0s and 1s.
It is basically 0, but when 5 data points in a row are all positive OR all negative, it should be 1. The run has to be unbroken (e.g. when a positive run is interrupted by a negative number, the count starts again).
Value B
1 0
2 0
1 0
2 0
2 1
3 1
4 1
-1 0
3 0
I tried different loops, but it didn't work. I also tried converting the whole data frame to a list (and looping over the list), unfortunately without success.
Here's an approach that uses the rollmean function from the zoo package.
library(zoo)
set.seed(1000)
df <- data.frame(Value = sample(-9:9, 1000, replace = TRUE))
sgn <- sign(df$Value)  # named sgn to avoid masking the sign() function
rolling <- rollmean(sgn, k = 5, fill = 0, align = "right")
df$B <- as.numeric(abs(rolling) == 1)
I generated 1000 values with positive and negative runs.
Extract the sign of the values: this will be -1 for negative, 1 for positive, and 0 for 0.
Calculate the right-aligned rolling mean of 5 values (it will average x[1:5], x[2:6], ...). This will be 1 or -1 only if all five values in a window are positive or negative (respectively).
Take the absolute value and store the comparison against 1. This is a logical vector that turns into the 0s and 1s you ask for.
Note: there's no need for loops; this can all be vectorised (once we have the rolling mean calculated).
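As a quick sanity check on the small sample from the question (my own snippet, reusing the same rollmean call):
library(zoo)
d <- data.frame(Value = c(1, 2, 1, 2, 2, 3, 4, -1, 3))
sgn <- sign(d$Value)
d$B <- as.numeric(abs(rollmean(sgn, k = 5, fill = 0, align = "right")) == 1)
d$B
# [1] 0 0 0 0 1 1 1 0 0  -- matches the desired B column above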
This will work. It's not the most efficient way to do it, but the logic is pretty transparent: just check whether there's only one unique sign (i.e. +, -, or 0) across each sequence of five adjacent rows:
dat <- data.frame(Value = c(1, 2, 1, 2, 2, 3, 4, -1, 3))
dat$new_col <- NA
dat$new_col[1:4] <- 0
for (x in 5:nrow(dat)) {
  if (length(unique(sign(dat$Value[(x - 4):x]))) == 1) {
    dat$new_col[x] <- 1
  } else {
    dat$new_col[x] <- 0
  }
}
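Running that loop on the sample gives the expected flags (quick check; new_col plays the role of B):
dat$new_col
# [1] 0 0 0 0 1 1 1 0 0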
Use the cumsum(... diff(...) <condition>) idiom to create a grouping variable, and ave to calculate the indices within each group (d below holds the question's Value column plus the desired B for comparison).
d <- data.frame(Value = c(1, 2, 1, 2, 2, 3, 4, -1, 3),
                B = c(0, 0, 0, 0, 1, 1, 1, 0, 0))
d$B2 <- ave(d$Value, cumsum(c(0, diff(sign(d$Value)) != 0)), FUN = function(x) {
  as.integer(seq_along(x) > 4)
})
# Value B B2
# 1 1 0 0
# 2 2 0 0
# 3 1 0 0
# 4 2 0 0
# 5 2 1 1
# 6 3 1 1
# 7 4 1 1
# 8 -1 0 0
# 9 3 0 0
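To see what the grouping variable looks like on this data, print it on its own (illustration only):
cumsum(c(0, diff(sign(d$Value)) != 0))
# [1] 0 0 0 0 0 0 0 1 2  -- one group per unbroken run of a single sign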

How to extract all rows between start signal and end signal?

I have the following df and I would like to extract all rows based on the following start and end signals.
Start signal: when status changes from 1 to 0.
End signal: when status changes from 0 to -1.
df <- data.frame(time = rep(1:14), status = c(0,1,1,0,0,0,-1,0,1,0,0,0,-1,0))
   time status
1     1      0
2     2      1
3     3      1
4     4      0
5     5      0
6     6      0
7     7     -1
8     8      0
9     9      1
10   10      0
11   11      0
12   12      0
13   13     -1
14   14      0
Desired output:
   time status
4     4      0
5     5      0
6     6      0
10   10      0
11   11      0
12   12      0
Here's a possible solution using the data.table package. I'm basically first grouping by status == 1 appearances, then checking per group whether a status == -1 also occurred; if so, I subset the group from its second row up to the row just before the -1.
library(data.table)
setDT(df)[, indx := cumsum(status == 1)]
df[, if(any(status == -1)) .SD[2:(which(status == -1) - 1)], by = indx]
# indx time status
# 1: 2 4 0
# 2: 2 5 0
# 3: 2 6 0
# 4: 3 10 0
# 5: 3 11 0
# 6: 3 12 0
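For intuition, the helper column created by the first line looks like this on the sample data (illustration only):
df$indx
# [1] 0 1 2 2 2 2 2 2 3 3 3 3 3 3
Group 2 covers times 3-8 and group 3 covers times 9-14; within each, the rows from the second one up to just before the -1 are kept.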
We count start and end markers, then use their cumulative sums to filter rows. The (cumsum(start) - cumsum(end) > 1) test is a slight fiddle to avoid the cumulative counts being upset by row 2, which starts but doesn't end; otherwise row 14 would unwantedly get included.
require(dplyr)
df %>%
  mutate(start = (status == 1), end = (status == -1)) %>%
  filter(!start & !end & (cumsum(start) - cumsum(end) > 1)) %>%
  select(-start, -end)
# time status
# 1 4 0
# 2 5 0
# 3 6 0
# 4 10 0
# 5 11 0
# 6 12 0
A little ugly, but you can always just loop over the values and keep a flag for determining whether the element should be kept or not.
keepers <- rep(FALSE, nrow(df))
flag <- FALSE
for (i in 1:(nrow(df) - 1)) {
  if (df$status[i] == 1 && df$status[i + 1] == 0) {
    flag <- TRUE
    next  # keep signal index FALSE
  }
  if (df$status[i] == -1 && df$status[i + 1] == 0) {
    flag <- FALSE
    next  # keep signal index FALSE
  }
  keepers[i] <- flag
}
keepers[nrow(df)] <- flag  # set the last element to the final flag value
newdf <- df[keepers, ]     # subset based on the TRUE/FALSE values determined
Do you have some more data (or can you generate more data for which you know the expected outcome) to see whether this/these generalize?
Two similar approaches:
library(stringr)
df <- data.frame(time = rep(1:14), status = c(0,1,1,0,0,0,-1,0,1,0,0,0,-1,0))
dfr <- rle(df$status)

# first approach
find_seq_str <- function() {
  str_locate_all(paste(gsub("-1", "x", dfr$values), collapse = ""), "10")[[1]][, 2]
}
df[as.vector(sapply(find_seq_str(),
                    function(n) {
                      i <- sum(dfr$lengths[1:(n - 1)])
                      tail(i:(i + dfr$lengths[n]), -1)
                    })), ]

# second approach
find_seq_ts <- function() {
  which(apply(embed(dfr$values, 2), 1, function(x) all(x == c(0, 1))))
}
df[as.vector(sapply(find_seq_ts(),
                    function(n) {
                      i <- sum(dfr$lengths[1:n]) + 1
                      head(i:(i + dfr$lengths[n + 1]), -1)
                    })), ]
Both approaches need a run-length encoding of the status vector.
The first replaces each -1 with a single character ("x") so we can build an unambiguous, contiguous string, then uses str_locate_all to find the "10" pairs that tell us where the target sequence starts, and finally rebuilds the ranges of zeroes from the rle lengths.
If it needs to be base R, I can try to whip up something with regexpr.
The second builds a paired matrix with embed and compares each pair of adjacent runs against the same target sequence.
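For reference, the run-length encoding that both approaches start from looks like this on the sample data:
rle(df$status)
# Run Length Encoding
#   lengths: int [1:9] 1 2 3 1 1 1 3 1 1
#   values : num [1:9] 0 1 0 -1 0 1 0 -1 0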
Caveats:
I did no benchmarking.
Both create potentially big objects if status is long.
I'm not completely positive it generalizes (hence my initial question).
David's is far more readable, maintainable & transferable code, but you get to deal with all the "goodness" that comes with using data.table ;-)
I wrapped the approaches in functions as they could then be parameterized, but you could just as easily assign the value to a variable or shove it straight into the sapply (ugly, though).

R ffdfdply reset cumsum using data.table

Sorry for asking a basic question. I am using the ff package and used read.csv.ffdf to import the data. I have more than 50 million rows in Excel, and I want to compute a cumulative sum over one of the columns, resetting it whenever a 0 is found. I have the code below to generate the cumulative series, but I don't know how to access the current row.
idx <- ffdforder(i[c("a", "c", "b")])
ordered_i <- i[idx, ]
ordered_i$key_a_c_d <- ikey(ordered_i[c("a", "c", "d")])
cumsum_i <- ffdfdply(ordered_i, split = as.character(ordered_i$key_a_c_d), FUN = function(x) {
  x <- as.data.table(x)
  if (x[**current row**, d] == 0) {
    result <- x[, cumsum_a_c_d := 0]
  } else {
    result <- x[, cumsum_a_c_d := cumsum(d), by = list(key_a_c_d)]
  }
  as.data.frame(result)
}, trace = T)
I am using the data.table package to get the cumulative sum done. How can I access the current row in a data.table so that I can compare it with 0 and reset the cumsum? I need the expected output shown below: it is the cumulative sum of column d, reset at every 0.
a b c d Result
1 1 1 1      1
1 4 1 0      0
1 6 1 1      1
1 2 1 1      2
1 5 1 0      0
1 3 1 1      1
Thanks
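For the core reset-at-zero logic inside FUN, a common data.table idiom (a minimal sketch on toy data, not code from the post above) is to group on the running count of zeros instead of inspecting a "current row"; inside the original FUN it would additionally be grouped by key_a_c_d:
library(data.table)
x <- data.table(a = 1, b = c(1, 4, 6, 2, 5, 3), c = 1,
                d = c(1, 0, 1, 1, 0, 1))
# every 0 in d opens a new group, so cumsum(d) restarts after each 0
x[, Result := cumsum(d), by = cumsum(d == 0)]
x$Result
# [1] 1 0 1 2 0 1  -- matches the expected Result column above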

How do I apply a "patch" to a data.frame?

For lack of a better word, how do I apply a "patch" to an R data.frame? Suppose I have a master database with firm and outlet columns and an ownership-shares variable that is 1 or 0 in this example, but could be any percentage.
// master
  firm   outlet shares.pre
1 five        1          0
2 one         1          1
3 red         1          0
4 yellow      1          0
5 five        2          0
6 one         2          0
// many more
I want to let firm "one" sell outlet "1" to firm "red"; I have this transaction in another data.frame:
// delta
  firm outlet shares.delta
1 one       1           -1
2 red       1            1
What is the most efficient way in R to apply this "patch" or transaction to my master database? The end result should look like this:
// preferably master, NOT a copy
  firm   outlet shares.post
1 five        1           0
2 one         1           0   <--- was 1
3 red         1           1   <--- was 0
4 yellow      1           0
5 five        2           0
6 one         2           0
// many more
I am not particular about keeping the suffixes pre, post or delta. If they were all named shares, that would be fine too; I simply want to "add" these data frames.
UPDATE: my current approach is this:
update <- (master$firm %in% delta$firm) & (master$outlet %in% delta$outlet)
master[update, ]$shares <- master[update, ]$shares + delta$shares
Yes, I'm aware it does a vector scan to create the Boolean update vector, and that the subsetting is also not very efficient. But what I dislike most is that I have to write out the matching columns.
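A pair-aware base-R alternative (my sketch, not from the post; it follows the snippet's shares naming) avoids the independent %in% scans above, which can pair a firm from one delta row with an outlet from another:
# build a combined firm/outlet key so rows are matched as pairs
key_m <- paste(master$firm, master$outlet)
key_d <- paste(delta$firm, delta$outlet)
hit <- match(key_m, key_d)
master$shares <- master$shares + ifelse(is.na(hit), 0, delta$shares[hit])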
Another way using data.table. Assuming you've loaded your data into the data.frames df1 and df2:
require(data.table)
dt1 <- data.table(df1)
dt2 <- data.table(df2)
setkey(dt1, firm, outlet)
setkey(dt2, firm, outlet)
dt1 <- dt2[dt1]       # join delta onto master by the keys
dt1[is.na(dt1)] <- 0  # rows with no delta get 0 instead of NA
dt1[, shares.post := shares.delta + shares.pre]
# firm outlet shares.delta shares.pre shares.post
# 1: five 1 0 0 0
# 2: five 2 0 0 0
# 3: one 1 -1 1 0
# 4: one 2 0 0 0
# 5: red 1 1 0 1
# 6: yellow 1 0 0 0
I'd give a more precise answer if you had provided a reproducible example, but here's one way.
Call your first data.frame dat and your second chg.
Then you could merge the two, keeping every master row and filling the missing deltas with 0 (the all.x and NA handling are needed so non-patched rows survive):
dat <- merge(dat, chg, all.x = TRUE)
dat$shares.delta[is.na(dat$shares.delta)] <- 0
And just add:
dat$shares <- with(dat, shares.pre + shares.delta)
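A quick check of the merge route on the question's sample data (column names copied from the tables above):
dat <- data.frame(firm = c("five", "one", "red", "yellow", "five", "one"),
                  outlet = c(1, 1, 1, 1, 2, 2),
                  shares.pre = c(0, 1, 0, 0, 0, 0))
chg <- data.frame(firm = c("one", "red"), outlet = c(1, 1),
                  shares.delta = c(-1, 1))
dat <- merge(dat, chg, all.x = TRUE)  # note: merge re-sorts by the key columns
dat$shares.delta[is.na(dat$shares.delta)] <- 0
dat$shares <- with(dat, shares.pre + shares.delta)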

Filtering data without loops in R

I have a quite big data frame (a few million records).
I need to filter it according to the following rule:
- For each product, delete all records which come before the fifth record after the first record with x > 0.
So we are interested in only two columns, ID and x. The data frame is sorted by ID.
It is fairly easy to do with loops, but loops don't perform well on such a big data frame.
How can it be done in 'vector style'?
Example:
BEFORE FILTERING
ID x
1 0
1 0
1 5 # First record with x>0
1 0
1 3
1 4
1 0
1 9
1 0 # Delete all earlier records of that product
1 0
1 6
2 0
2 1 # First record with x>0
2 0
2 4
2 5
2 8
2 0 # Delete all earlier records of that product
2 1
2 3
After filtering:
ID x
1 9
1 0
1 0
1 6
2 0
2 1
2 3
For these split-apply-combine problems I like using plyr. There are alternatives if speed becomes an issue, but for most things plyr is easy to understand and use. I wrote a function that implements the logic you described above and then fed it to ddply() to operate on each chunk of the data based on ID.
fun <- function(x, column, threshold, numplus){
  whichcol <- which(x[column] > threshold)[1]           # first row with x > threshold
  rows <- seq(from = whichcol + numplus, to = nrow(x))  # keep from the numplus-th row after it
  return(x[rows, ])
}
And then feed this to ddply()
require(plyr)
ddply(dat, "ID", fun, column = "x", threshold = 0, numplus = 5)
#-----
ID x
1 1 9
2 1 0
3 1 0
4 1 6
5 2 0
6 2 1
7 2 3
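One caveat worth flagging (my note, not the original answer's): as written, fun errors on a group where no value exceeds the threshold (whichcol comes back NA), and it silently returns the wrong rows when fewer than numplus rows follow the first hit (seq counts backwards when from > to). A guarded variant could return an empty chunk instead:
fun_safe <- function(x, column, threshold, numplus){
  whichcol <- which(x[[column]] > threshold)[1]
  if (is.na(whichcol) || whichcol + numplus > nrow(x)) return(x[0, ])
  x[seq(from = whichcol + numplus, to = nrow(x)), ]
}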
