Conditional Series Fill in R

I'm looking for a way to fill in a vector with new values, conditional on the values already in that vector and on another variable in the data frame. An example of what the data looks like is pasted below.
PrsVar= c(rep(1,10),rep(2,7),rep(3,11))
IndVar = c(0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0)
OutVar = c(1,1,1,1,2,2,2,3,3,3,1,1,1,1,2,2,2,1,1,1,1,2,2,2,2,3,3,3)
exampdata <- cbind(PrsVar,IndVar,OutVar)
exampdata <- as.data.frame(exampdata)
> exampdata
PrsVar IndVar OutVar
1 1 0 1
2 1 0 1
3 1 0 1
4 1 1 1
5 1 0 2
6 1 0 2
7 1 1 2
8 1 0 3
9 1 0 3
10 1 0 3
11 2 0 1
12 2 0 1
13 2 0 1
14 2 1 1
15 2 0 2
16 2 0 2
17 2 1 2
18 3 0 1
19 3 0 1
20 3 0 1
21 3 1 1
22 3 0 2
23 3 0 2
24 3 0 2
25 3 1 2
26 3 0 3
27 3 0 3
28 3 0 3
This is time-series data and each row represents a person-day. PrsVar is an ID for an individual in the study and IndVar is an indicator that an episode has ended on that person-day. The person-day after that represents a new episode.
I'd like to create a variable that looks like OutVar using just the values from PrsVar and IndVar. This new variable OutVar labels the episode each person-day is in, incrementing by 1, and starting over at 1 for each new individual.
I could run this through a loop, but I need more efficient code to work with 3,000,000+ rows of data. Was trying to use something in dplyr or maybe mapply, but I'm stumped. Thinking a solution to this would be helpful to others and would certainly be helpful to me again in the near future.

The data.table package offers a fast and tidy way to do this. The new column is added by reference (the table is modified in place rather than copied), so a few million rows should not be a problem.
library(data.table)
patient.data <- data.table(PrsVar = c(rep(1,10), rep(2,7), rep(3,11)),
IndVar = c(0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0),
OutVar = c(1,1,1,1,2,2,2,3,3,3,1,1,1,1,2,2,2,1,1,1,1,2,2,2,2,3,3,3))
Build an episode counter EpVar from the cumulative sum of IndVar, plus 1. This increments the counter on the record where IndVar is 1, which is one record too early, so move it down one record with shift, filling the resulting missing first value with the starting count (1). The by argument does this separately for each person.
patient.data[ , EpVar:=shift(1+cumsum(IndVar), fill=1), by=PrsVar]
patient.data
PrsVar IndVar OutVar EpVar
1: 1 0 1 1
2: 1 0 1 1
3: 1 0 1 1
4: 1 1 1 1
5: 1 0 2 2
6: 1 0 2 2
7: 1 1 2 2
8: 1 0 3 3
9: 1 0 3 3
10: 1 0 3 3
11: 2 0 1 1
12: 2 0 1 1
13: 2 0 1 1
14: 2 1 1 1
15: 2 0 2 2
16: 2 0 2 2
17: 2 1 2 2
18: 3 0 1 1
19: 3 0 1 1
20: 3 0 1 1
21: 3 1 1 1
22: 3 0 2 2
23: 3 0 2 2
24: 3 0 2 2
25: 3 1 2 2
26: 3 0 3 3
27: 3 0 3 3
28: 3 0 3 3
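To see why the shift is needed, here is a minimal base R sketch of the same two steps for the first person only. The names ind, running and episode are made up for illustration, and c(1, head(running, -1)) stands in for data.table's shift(..., fill=1); it is not the code used above.

```r
# IndVar for person 1 from the example data
ind <- c(0, 0, 0, 1, 0, 0, 1, 0, 0, 0)

# Step 1: running counter; it already ticks up on the day an episode ends
running <- 1 + cumsum(ind)
running
# [1] 1 1 1 2 2 2 3 3 3 3

# Step 2: move everything down one day and start the person at episode 1
episode <- c(1, head(running, -1))
episode
# [1] 1 1 1 1 2 2 2 3 3 3
```

The second vector matches OutVar for person 1: the day on which an episode ends still belongs to that episode, and the count only goes up on the following day.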

A bit ugly, but this logic should be easily adaptable to other methods:
with(exampdata,
ave(IndVar, PrsVar, FUN=function(x) {
out <- rev(cumsum(rev(x)))
max(out) - out + 1
})
)
# [1] 1 1 1 1 2 2 2 3 3 3 1 1 1 1 2 2 2 1 1 1 1 2 2 2 2 3 3 3
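As one possible adaptation of that logic, the sketch below keeps ave but swaps the double reversal for a lagged cumulative sum: each day's episode number is 1 plus the number of episode ends seen on earlier days for the same person. The helper name lagged_count is hypothetical; the data is the example from the question.

```r
PrsVar <- c(rep(1, 10), rep(2, 7), rep(3, 11))
IndVar <- c(0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0)
OutVar <- c(1,1,1,1,2,2,2,3,3,3,1,1,1,1,2,2,2,1,1,1,1,2,2,2,2,3,3,3)

# Hypothetical helper: number of episode ends strictly before each position, plus 1
lagged_count <- function(x) 1 + cumsum(c(0, head(x, -1)))

# Apply the helper separately within each person
EpVar <- ave(IndVar, PrsVar, FUN = lagged_count)

# Compare with the target column from the question
all(EpVar == OutVar)
# [1] TRUE
```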

Related

How to label clustered groups in a vector?

Data:
set.seed (112098)
op <- data.frame(id=1:100,cluster=rbinom(100,1,0.5))
id cluster
1 1
2 1
3 1
4 0
5 1
6 1
7 0
8 0
9 1
Intended:
id cluster groups
1 1 1
2 1 1
3 1 1
4 0 0
5 1 2
6 1 2
7 0 0
8 0 0
9 1 3
Essentially, every consecutive 1 series forms a group. How could I add the group column in R?
Here is one option using rleid from data.table
library(data.table)
setDT(op)[, groups := rleid(cluster)*(cluster)
][groups!=0, groups := as.integer(factor(groups))]
head(op, 9)
# id cluster groups
#1: 1 1 1
#2: 2 1 1
#3: 3 1 1
#4: 4 0 0
#5: 5 0 0
#6: 6 1 2
#7: 7 1 2
#8: 8 0 0
#9: 9 1 3
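If data.table is not available, the same labelling can be sketched in base R by counting where each run of 1s begins. This is an alternative to the rleid approach, not a restatement of it, and the name run_start is illustrative. It uses the nine cluster values from the "Intended" table in the question.

```r
# First nine cluster values from the question's "Intended" table
cluster <- c(1, 1, 1, 0, 1, 1, 0, 0, 1)

# TRUE wherever a run of 1s begins (a 0 -> 1 step, or a 1 in first position)
run_start <- diff(c(0, cluster)) == 1

# Number the runs, then zero out the rows where cluster is 0
groups <- cumsum(run_start) * cluster
groups
# [1] 1 1 1 0 2 2 0 0 3
```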

R data.table conditional (min/max) aggregation

I'm relatively new to R and I have a question regarding how to do conditional aggregation using data.tables (or other methods) while still accessing the table columns by reference. There was an answer to a similar question here but it takes a long time on my data and takes a lot of memory. Here is some toy data:
t <- data.table(User=c(1,1,1,1,1,2,2,2,2,3,3,3,3,3,3),
Obs=c(1,2,3,4,5,1,2,3,4,1,2,3,4,5,6),
Flag=c(0,1,0,1,0,0,1,0,0,1,0,0,0,1,0))
Which looks like this:
User Obs Flag
1: 1 1 0
2: 1 2 1
3: 1 3 0
4: 1 4 1
5: 1 5 0
6: 2 1 0
7: 2 2 1
8: 2 3 0
9: 2 4 0
10: 3 1 1
11: 3 2 0
12: 3 3 0
13: 3 4 0
14: 3 5 1
15: 3 6 0
What I would like to do with this is to get, by user, the maximum observation less than or equal to the current observation where the flag is 1. The output should look like this:
User Obs Flag min.max
1: 1 1 0 NA
2: 1 2 1 2
3: 1 3 0 2
4: 1 4 1 4
5: 1 5 0 4
6: 2 1 0 NA
7: 2 2 1 2
8: 2 3 0 2
9: 2 4 0 2
10: 3 1 1 1
11: 3 2 0 1
12: 3 3 0 1
13: 3 4 0 1
14: 3 5 1 5
15: 3 6 0 5
Any help would be greatly appreciated!
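# cumsum(diff(c(0, Flag)) == 1) goes up by 1 at every row where Flag steps from 0 to 1,
# so, apart from rows before a user's first flag, each group starts at a flagged row
t[, max := Obs[Flag == 1], by = .(User, cumsum(diff(c(0, Flag)) == 1))]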
t
# User Obs Flag max
# 1: 1 1 0 NA
# 2: 1 2 1 2
# 3: 1 3 0 2
# 4: 1 4 1 4
# 5: 1 5 0 4
# 6: 2 1 0 NA
# 7: 2 2 1 2
# 8: 2 3 0 2
# 9: 2 4 0 2
#10: 3 1 1 1
#11: 3 2 0 1
#12: 3 3 0 1
#13: 3 4 0 1
#14: 3 5 1 5
#15: 3 6 0 5
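A base R way to get the same column is sketched below as an alternative, under the assumption that Obs is increasing within each User (as it is in the toy data). It carries the last flagged Obs forward with a running maximum. The names toy, flagged and last.flag are illustrative and not from the original answer.

```r
toy <- data.frame(User = c(1,1,1,1,1,2,2,2,2,3,3,3,3,3,3),
                  Obs  = c(1,2,3,4,5,1,2,3,4,1,2,3,4,5,6),
                  Flag = c(0,1,0,1,0,0,1,0,0,1,0,0,0,1,0))

# Keep Obs only on flagged rows; -Inf marks "nothing flagged here"
flagged <- ifelse(toy$Flag == 1, toy$Obs, -Inf)

# Running maximum within each user carries the last flagged Obs forward
toy$last.flag <- ave(flagged, toy$User, FUN = cummax)

# Rows before a user's first flag have no flagged Obs yet
toy$last.flag[is.infinite(toy$last.flag)] <- NA
toy$last.flag
# [1] NA  2  2  4  4 NA  2  2  2  1  1  1  1  5  5
```

Because it does not rely on 0 to 1 steps, this version also copes with two flagged rows in a row.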

Sum rows in a group, starting when a specific value occurs

I want to accumulate the values of a column till the end of the group, though starting the addition when a specific value occurs in another column. I am only interested in the first instance of the specific value within a group. So if that value occurs again within the group, the addition column should continue to add the values. I know this sounds like a rather strange problem, so hopefully the example table makes sense.
The following data frame is what I have now:
> df = data.frame(group = c(1,1,1,1,2,2,2,2,2,3,3,3,4,4,4),numToAdd = c(1,1,3,2,4,2,1,3,2,1,2,1,2,3,2),occurs = c(0,0,1,0,0,1,0,0,0,0,1,1,0,0,0))
> df
group numToAdd occurs
1 1 1 0
2 1 1 0
3 1 3 1
4 1 2 0
5 2 4 0
6 2 2 1
7 2 1 0
8 2 3 0
9 2 2 0
10 3 1 0
11 3 2 1
12 3 1 1
13 4 2 0
14 4 3 0
15 4 2 0
Thus, whenever a 1 occurs within a group, I want a cumulative sum of the values from the column numToAdd, until a new group starts. This would look like the following:
> finalDF = data.frame(group = c(1,1,1,1,2,2,2,2,2,3,3,3,4,4,4),numToAdd = c(1,1,3,2,4,2,1,3,2,1,2,1,2,3,2),occurs = c(0,0,1,0,0,1,0,0,0,0,1,1,0,0,0),added = c(0,0,3,5,0,2,3,6,8,0,2,3,0,0,0))
> finalDF
group numToAdd occurs added
1 1 1 0 0
2 1 1 0 0
3 1 3 1 3
4 1 2 0 5
5 2 4 0 0
6 2 2 1 2
7 2 1 0 3
8 2 3 0 6
9 2 2 0 8
10 3 1 0 0
11 3 2 1 2
12 3 1 1 3
13 4 2 0 0
14 4 3 0 0
15 4 2 0 0
Thus, the added column is 0 until a 1 occurs within the group, then accumulates the values from numToAdd until it moves to a new group, turning the added column back to 0. In group three, a value of 1 is found a second time, yet the cumulated sum continues. Additionally, in group 4, a value of 1 is never found, thus the value within the added column remains 0.
I've played around with dplyr, but can't get it to work. The following solution only outputs the total sum, and not the increasing cumulated number at each row.
library(dplyr)
df =
df %>%
mutate(added=ifelse(occurs == 1,cumsum(numToAdd),0)) %>%
group_by(group)
Try
df %>%
group_by(group) %>%
mutate(added= cumsum(numToAdd*cummax(occurs)))
# group numToAdd occurs added
# 1 1 1 0 0
# 2 1 1 0 0
# 3 1 3 1 3
# 4 1 2 0 5
# 5 2 4 0 0
# 6 2 2 1 2
# 7 2 1 0 3
# 8 2 3 0 6
# 9 2 2 0 8
# 10 3 1 0 0
# 11 3 2 1 2
# 12 3 1 1 3
# 13 4 2 0 0
# 14 4 3 0 0
# 15 4 2 0 0
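The one-liner works because cummax(occurs) behaves like a switch that turns on at the first 1 and stays on. A minimal sketch of the intermediate steps for group 3 of the example (where a second 1 appears) is shown here; the names switched_on and masked are illustrative.

```r
# Group 3 from the example: occurs has a second 1 after the first
numToAdd <- c(1, 2, 1)
occurs   <- c(0, 1, 1)

# Stays at 1 once a 1 has been seen
switched_on <- cummax(occurs)
switched_on
# [1] 0 1 1

# Values before the first 1 are multiplied by 0 and drop out
masked <- numToAdd * switched_on
masked
# [1] 0 2 1

# Accumulate what is left
added <- cumsum(masked)
added
# [1] 0 2 3
```

For a group such as group 4, where occurs is all zeros, the switch never turns on and added stays at 0.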
Or using data.table
library(data.table)#v1.9.5+
i1 <-setDT(df)[, .I[(rleid(occurs) + (occurs>0))>1], group]$V1
df[, added:=0][i1, added:=cumsum(numToAdd), by = group]
Or a similar option as in dplyr
setDT(df)[,added := cumsum(numToAdd * cummax(occurs)) , by = group]
You can use split-apply-combine in base R with something like:
df$added <- unlist(lapply(split(df, df$group), function(x) {
y <- rep(0, nrow(x))
pos <- cumsum(x$occurs) > 0
y[pos] <- cumsum(x$numToAdd[pos])
y
}))
df
# group numToAdd occurs added
# 1 1 1 0 0
# 2 1 1 0 0
# 3 1 3 1 3
# 4 1 2 0 5
# 5 2 4 0 0
# 6 2 2 1 2
# 7 2 1 0 3
# 8 2 3 0 6
# 9 2 2 0 8
# 10 3 1 0 0
# 11 3 2 1 2
# 12 3 1 1 3
# 13 4 2 0 0
# 14 4 3 0 0
# 15 4 2 0 0
To add another base R approach:
df$added <- unlist(lapply(split(df, df$group), function(x) {
c(x[,'occurs'][cumsum(x[,'occurs']) == 0L],
cumsum(x[,'numToAdd'][cumsum(x[,'occurs']) != 0L]))
}))
# group numToAdd occurs added
# 1 1 1 0 0
# 2 1 1 0 0
# 3 1 3 1 3
# 4 1 2 0 5
# 5 2 4 0 0
# 6 2 2 1 2
# 7 2 1 0 3
# 8 2 3 0 6
# 9 2 2 0 8
# 10 3 1 0 0
# 11 3 2 1 2
# 12 3 1 1 3
# 13 4 2 0 0
# 14 4 3 0 0
# 15 4 2 0 0
Another base R:
df$added <- unlist(lapply(split(df,df$group),function(x){
cumsum((cumsum(x$occurs) > 0) * x$numToAdd)
}))

Transforming a two-way table() into pairwise list of counts in R

Starting with some sample two-way frequency table:
a <- c(1,2,3,4,4,3,4,2,2,2)
b <- c(1,2,3,4,1,2,4,3,2,2)
tab <- table(a,b)
> tab
b
a 1 2 3 4
1 1 0 0 0
2 0 3 1 0
3 0 1 1 0
4 1 0 0 2
I need to transform the table into the following format:
goal <- data.frame(a=c(1,2,3,4),b=c(1,2,3,4),count=c(1,3,1,2))
> goal
a b count
1 1 1 1
2 2 2 3
3 3 3 1
4 4 4 2
. . . .
How can I form all pairwise combinations from the two-way table and add the frequency counts in the third column?
Intuition tells me there should be a simple kind of 'reverse' function for table, but I could not find anything on SO or Google.
Naturally, after posting the question I found the right search query for Google...
> as.data.frame(tab)
a b Freq
1 1 1 1
2 2 1 0
3 3 1 0
4 4 1 1
5 1 2 0
6 2 2 3
7 3 2 1
8 4 2 0
9 1 3 0
10 2 3 1
11 3 3 1
12 4 3 0
13 1 4 0
14 2 4 0
15 3 4 0
16 4 4 2
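Note that as.data.frame keeps the empty cells and returns a and b as factors. If only the combinations that actually occur are wanted, with a column called count as in the goal, something along these lines may help. It is a sketch; the name counts is made up, and responseName is the argument of as.data.frame for tables that names the count column.

```r
a <- c(1,2,3,4,4,3,4,2,2,2)
b <- c(1,2,3,4,1,2,4,3,2,2)
tab <- table(a, b)

# responseName sets the name of the count column (default "Freq")
counts <- as.data.frame(tab, responseName = "count")

# Keep only the combinations that actually occur
counts <- subset(counts, count > 0)

# a and b are factors; go through as.character to recover the numbers
counts$a <- as.numeric(as.character(counts$a))
counts$b <- as.numeric(as.character(counts$b))
counts
```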

Conditional counting in R

I have a question I hope some of you might help me with. I am doing a thesis on pharmaceuticals and the effect of parallel imports. I am working on this in R with a panel dataset.
I need a variable that counts, for a given original product, how many parallel importers there are in a given time period.
Product_ID PI t
1 0 1
1 1 1
1 1 1
1 0 2
1 1 2
1 1 2
1 1 2
1 1 2
2 0 1
2 1 1
2 0 2
2 1 2
2 0 3
2 1 3
2 1 3
2 1 3
Ideally, what I want here is a new column giving the number of PI products (PI=1) for an original (PI=0) at time t. So the output would look like:
Product_ID PI t nPIcomp
1 0 1 2
1 1 1
1 1 1
1 0 2 4
1 1 2
1 1 2
1 1 2
1 1 2
2 0 1 1
2 1 1
2 0 2 1
2 1 2
2 0 3 3
2 1 3
2 1 3
2 1 3
I hope I have made my issue clear :)
Thanks in advance,
Henrik
Something like this?
x <- read.table(text = "Product_ID PI t
1 0 1
1 1 1
1 1 1
1 0 2
1 1 2
1 1 2
1 1 2
1 1 2
2 0 1
2 1 1
2 0 2
2 1 2
2 0 3
2 1 3
2 1 3
2 1 3", header = TRUE)
find.count <- rle(x$PI)
count <- find.count$lengths[find.count$values == 1]
x[x$PI == 0, "nPIcomp"] <- count
Product_ID PI t nPIcomp
1 1 0 1 2
2 1 1 1 NA
3 1 1 1 NA
4 1 0 2 4
5 1 1 2 NA
6 1 1 2 NA
7 1 1 2 NA
8 1 1 2 NA
9 2 0 1 1
10 2 1 1 NA
11 2 0 2 1
12 2 1 2 NA
13 2 0 3 3
14 2 1 3 NA
15 2 1 3 NA
16 2 1 3 NA
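For readers unfamiliar with rle, the sketch below shows what it returns for the PI column and why the assignment lines up. It also makes the underlying assumption explicit: every original (PI=0) row must be followed directly by at least one importer row, otherwise the counts and the PI=0 rows no longer match one to one. The names runs and n_imports are illustrative.

```r
PI <- c(0,1,1, 0,1,1,1,1, 0,1, 0,1, 0,1,1,1)

# Run-length encoding: consecutive equal values are collapsed into runs
runs <- rle(PI)
runs$lengths
# [1] 1 2 1 4 1 1 1 1 1 3
runs$values
# [1] 0 1 0 1 0 1 0 1 0 1

# Lengths of the runs of 1s: one count per original product row
n_imports <- runs$lengths[runs$values == 1]
n_imports
# [1] 2 4 1 1 3

# The assignment in the answer relies on these two numbers being equal
sum(PI == 0) == length(n_imports)
# [1] TRUE
```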
I would use ave and your two columns Product_ID and t as grouping variables. Then, within each group, apply a function that returns the sum of PI followed by the appropriate number of NAs:
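# dat is the data frame from the question
dat <- transform(dat, nPIcomp = ave(PI, Product_ID, t,
FUN = function(z) {
n <- sum(z)
# n on the first row of the group, NA on the remaining rows
c(n, rep(NA, length(z) - 1))
}))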
The same idea can be used with the data.table package if your data is large and speed is a concern.
Roman's answer gives exactly what you want. In case you want to summarise the data instead, this would be handy, using the plyr package (df is what I have called your data.frame)...
ddply( df , .(Product_ID , t ) , summarise , nPIcomp = sum(PI) )
# Product_ID t nPIcomp
#1 1 1 2
#2 1 2 4
#3 2 1 1
#4 2 2 1
#5 2 3 3
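The same summary can also be sketched without plyr, using aggregate from base R. This is an alternative to the ddply call rather than what the answer used; the name summary_counts is made up, and the data is read in the same way as in the first answer.

```r
x <- read.table(text = "Product_ID PI t
1 0 1
1 1 1
1 1 1
1 0 2
1 1 2
1 1 2
1 1 2
1 1 2
2 0 1
2 1 1
2 0 2
2 1 2
2 0 3
2 1 3
2 1 3
2 1 3", header = TRUE)

# Sum PI within each Product_ID / t combination
summary_counts <- aggregate(PI ~ Product_ID + t, data = x, FUN = sum)

# Rename the summed column to match the question
names(summary_counts)[names(summary_counts) == "PI"] <- "nPIcomp"
summary_counts
```

The row order may differ from the ddply output, so look rows up by Product_ID and t rather than by position.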
