Conditional variable using R code - r

I have a data set named "dats".
id y i j
1 0 1 1
1 0 1 2
1 0 1 3
2 1 2 1
2 1 2 2
2 1 2 3
I want to calculate a new variable, ynew = y[i, j-1] * y[i, j], i.e. the product of each y with the previous y within the same i (y11*y12, y12*y13, and so on). I have tried it this way:
ynew <- NULL
for(p in 1)
{
for (q in ni)
{
ynew[p,q] <- dats$y[dats$i==p & dats$j==q-1]*dats$y[dats$i==p & dats$j==q]
}
}
ynew
But it throws an error!
Expected output
id y i j ynew
1 0 1 1 NA
1 0 1 2 0
1 0 1 3 0
2 1 2 1 NA
2 1 2 2 1
2 1 2 3 1
Could anybody help? TIA

Using dplyr and rollapply from the zoo package:
library(dplyr)
library(zoo)
dats %>%
group_by(id) %>%
mutate(ynew = c(NA, rollapply(y, width = 2, FUN = prod)))
#Source: local data frame [6 x 5]
#Groups: id [2]
# id y i j ynew
# (int) (int) (int) (int) (dbl)
#1 1 0 1 1 NA
#2 1 0 1 2 0
#3 1 0 1 3 0
#4 2 1 2 1 NA
#5 2 1 2 2 1
#6 2 1 2 3 1

Maybe we just need to multiply 'y' by its lag, grouped by 'id':
library(data.table)
setDT(dats)[, ynew := y * shift(y), by = id]
dats
# id y i j ynew
#1: 1 0 1 1 NA
#2: 1 0 1 2 0
#3: 1 0 1 3 0
#4: 2 1 2 1 NA
#5: 2 1 2 2 1
#6: 2 1 2 3 1
It could also be done with roll_prod from the RcppRoll package:
library(RcppRoll)
setDT(dats)[, ynew := c(NA, roll_prod(y, 2)), by = id]
dats
# id y i j ynew
#1: 1 0 1 1 NA
#2: 1 0 1 2 0
#3: 1 0 1 3 0
#4: 2 1 2 1 NA
#5: 2 1 2 2 1
#6: 2 1 2 3 1
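For readers without any of these packages, the same lag-and-multiply idea can be sketched in base R. This is only an illustration, assuming dats has the columns shown in the question; the helper name lag_prod is made up for this example.

```r
# Rebuild the example data from the question
dats <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                   y  = c(0, 0, 0, 1, 1, 1),
                   i  = c(1, 1, 1, 2, 2, 2),
                   j  = c(1, 2, 3, 1, 2, 3))

# Hypothetical helper: multiply each value by the previous one.
# The first element has no predecessor, so its result is NA.
lag_prod <- function(v) v * c(NA, head(v, -1))

# ave() applies the helper separately within each id
dats$ynew <- ave(dats$y, dats$id, FUN = lag_prod)
dats$ynew
```

Because ave() returns a vector aligned with the original rows, the result can be assigned straight back as a new column.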

Related

Conditional running count (cumulative sum) with reset in R (dplyr)

I'm trying to calculate a running count (i.e., cumulative sum) that is conditional on other variables and that can reset for particular values on another variable. I'm working in R and would prefer a dplyr-based solution, if possible.
I'd like to create a variable for the running count, cumulative, based on the following algorithm:
Calculate the running count (cumulative) within combinations of id and age
Increment running count (cumulative) by 1 for every subsequent trial where accuracy = 0, block = 2, and condition = 1
Reset running count (cumulative) to 0 for each trial where accuracy = 1, block = 2, and condition = 1, and the next increment resumes at 1 (not the previous number)
For each trial where block != 2, or condition != 1, leave the running count (cumulative) as NA
Here's a minimal working example:
mydata <- data.frame(id = c(1,1,1,1,1,1,1,1,1,1,1),
age = c(1,1,1,1,1,1,1,1,1,1,2),
block = c(1,1,2,2,2,2,2,2,2,2,2),
trial = c(1,2,1,2,3,4,5,6,7,8,1),
condition = c(1,1,1,1,1,2,1,1,1,1,1),
accuracy = c(0,0,0,0,0,0,0,1,0,0,0)
)
id age block trial condition accuracy
1 1 1 1 1 0
1 1 1 2 1 0
1 1 2 1 1 0
1 1 2 2 1 0
1 1 2 3 1 0
1 1 2 4 2 0
1 1 2 5 1 0
1 1 2 6 1 1
1 1 2 7 1 0
1 1 2 8 1 0
1 2 2 1 1 0
The expected output is:
id age block trial condition accuracy cumulative
1 1 1 1 1 0 NA
1 1 1 2 1 0 NA
1 1 2 1 1 0 1
1 1 2 2 1 0 2
1 1 2 3 1 0 3
1 1 2 4 2 0 NA
1 1 2 5 1 0 4
1 1 2 6 1 1 0
1 1 2 7 1 0 1
1 1 2 8 1 0 2
1 2 2 1 1 0 1
Here is an option using data.table. Create a binary column ('ind') by pasting together the values of 'accuracy', 'block' and 'condition' and matching them against the two patterns of interest ("121" gives 0, "021" gives 1, anything else gives NA). Then, grouped by the run-length id of 'ind' together with 'id' and 'age', take the cumulative sum of 'ind' and assign (:=) it to a new column ('Cumulative').
library(data.table)
setDT(mydata)[, ind := match(do.call(paste0, .SD), c("121", "021")) - 1,
.SDcols = c("accuracy", "block", "condition")
][, Cumulative := cumsum(ind), .(rleid(ind), id, age)
][, ind := NULL][]
# id age block trial condition accuracy Cumulative
# 1: 1 1 1 1 1 0 NA
# 2: 1 1 1 2 1 0 NA
# 3: 1 1 2 1 1 0 1
# 4: 1 1 2 2 1 0 2
# 5: 1 1 2 3 1 0 3
# 6: 1 1 2 4 2 0 NA
# 7: 1 1 2 5 1 1 0
# 8: 1 1 2 6 1 0 1
# 9: 1 1 2 7 1 0 2
#10: 1 2 2 1 1 0 1
We can use case_when to assign the value we need based on our conditions. We then add an additional group_by condition using cumsum to start a new group whenever the temp column is 0. In the final mutate step we temporarily replace the NA values in temp with 0, take the cumsum over it, and put the NA values back in their place to get the final output.
library(dplyr)
mydata %>%
group_by(id, age) %>%
mutate(temp = case_when(accuracy == 0 & block == 2 & condition == 1 ~ 1,
accuracy == 1 & block == 2 & condition == 1 ~ 0,
TRUE ~ NA_real_)) %>%
ungroup() %>%
group_by(id, age, group = cumsum(replace(temp == 0, is.na(temp), 0))) %>%
mutate(cumulative = replace(cumsum(replace(temp, is.na(temp), 0)),
is.na(temp), NA)) %>%
select(-temp, -group)
# group id age block trial condition accuracy cumulative
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 0 1 1 1 1 1 0 NA
# 2 0 1 1 1 2 1 0 NA
# 3 0 1 1 2 1 1 0 1
# 4 0 1 1 2 2 1 0 2
# 5 0 1 1 2 3 1 0 3
# 6 0 1 1 2 4 2 0 NA
# 7 0 1 1 2 5 1 0 4
# 8 1 1 1 2 6 1 1 0
# 9 1 1 1 2 7 1 0 1
#10 1 1 1 2 8 1 0 2
#11 1 1 2 2 1 1 0 1
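The four rules from the question can also be written out as an explicit loop, which may be easier to check against the expected output. This is a minimal base R sketch, not one of the answers above; the helper name running_count is made up for illustration.

```r
mydata <- data.frame(id        = c(1,1,1,1,1,1,1,1,1,1,1),
                     age       = c(1,1,1,1,1,1,1,1,1,1,2),
                     block     = c(1,1,2,2,2,2,2,2,2,2,2),
                     trial     = c(1,2,1,2,3,4,5,6,7,8,1),
                     condition = c(1,1,1,1,1,2,1,1,1,1,1),
                     accuracy  = c(0,0,0,0,0,0,0,1,0,0,0))

# Hypothetical helper: walk through the trials of one id/age group.
# Rows outside block 2 / condition 1 stay NA and do not touch the counter.
running_count <- function(accuracy, block, condition) {
  out <- rep(NA_real_, length(accuracy))
  count <- 0
  for (k in seq_along(accuracy)) {
    if (block[k] == 2 && condition[k] == 1) {
      # accuracy == 1 resets the counter, accuracy == 0 increments it
      count <- if (accuracy[k] == 1) 0 else count + 1
      out[k] <- count
    }
  }
  out
}

# Apply the helper to the row indices of each id/age combination
mydata$cumulative <- NA_real_
groups <- split(seq_len(nrow(mydata)), list(mydata$id, mydata$age), drop = TRUE)
for (rows in groups) {
  mydata$cumulative[rows] <- running_count(mydata$accuracy[rows],
                                           mydata$block[rows],
                                           mydata$condition[rows])
}
mydata$cumulative
```

Keeping the counter in a plain variable means an NA row (such as trial 4) does not interrupt the count, so trial 5 continues at 4 as in the expected output.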

Creating a new variable while using subsequent values in r

I have the following data frame:
df1 <- data.frame(id = rep(1:3, each = 5),
time = rep(1:5),
y = c(rep(1, 4), 0, 1, 0, 1, 1, 0, 0, 1, rep(0,3)))
df1
## id time y
## 1 1 1 1
## 2 1 2 1
## 3 1 3 1
## 4 1 4 1
## 5 1 5 0
## 6 2 1 1
## 7 2 2 0
## 8 2 3 1
## 9 2 4 1
## 10 2 5 0
## 11 3 1 0
## 12 3 2 1
## 13 3 3 0
## 14 3 4 0
## 15 3 5 0
I'd like to create a new indicator variable that tells me, for each of the three ids, at what point y = 0 for all subsequent responses. In the example above, for ids 1 and 2 this occurs at the 5th time point, and for id 3 this occurs at the 3rd time point.
I'm getting tripped up on id 2, where y = 0 at time point 2 but then goes back to one. I'd like the indicator variable to take subsequent time points into account.
Essentially, I'm looking for the following output:
df1
## id time y new_col
## 1 1 1 1 0
## 2 1 2 1 0
## 3 1 3 1 0
## 4 1 4 1 0
## 5 1 5 0 1
## 6 2 1 1 0
## 7 2 2 0 0
## 8 2 3 1 0
## 9 2 4 1 0
## 10 2 5 0 1
## 11 3 1 0 0
## 12 3 2 1 0
## 13 3 3 0 1
## 14 3 4 0 1
## 15 3 5 0 1
The new_col variable indicates whether or not y = 0 at that time point and at all subsequent time points.
I would use a little helper function for that.
foo <- function(x, val) {
pos <- max(which(x != val)) +1
as.integer(seq_along(x) >= pos)
}
library(dplyr)
df1 %>%
group_by(id) %>%
mutate(indicator = foo(y, 0))
# # A tibble: 15 x 4
# # Groups: id [3]
# id time y indicator
# <int> <int> <dbl> <int>
# 1 1 1 1 0
# 2 1 2 1 0
# 3 1 3 1 0
# 4 1 4 1 0
# 5 1 5 0 1
# 6 2 1 1 0
# 7 2 2 0 0
# 8 2 3 1 0
# 9 2 4 1 0
# 10 2 5 0 1
# 11 3 1 0 0
# 12 3 2 1 0
# 13 3 3 0 1
# 14 3 4 0 1
# 15 3 5 0 1
In case you want to consider NA-values in y, you can adjust foo to:
foo <- function(x, val) {
pos <- max(which(x != val | is.na(x))) +1
as.integer(seq_along(x) >= pos)
}
That way, if there's an NA after the last y=0, the indicator will remain 0.
Here is an option using data.table
library(data.table)
setDT(df1)[, indicator := cumsum(.I %in% .I[which.max(rleid(y)*!y)]), id]
df1
# id time y indicator
# 1: 1 1 1 0
# 2: 1 2 1 0
# 3: 1 3 1 0
# 4: 1 4 1 0
# 5: 1 5 0 1
# 6: 2 1 1 0
# 7: 2 2 0 0
# 8: 2 3 1 0
# 9: 2 4 1 0
#10: 2 5 0 1
#11: 3 1 0 0
#12: 3 2 1 0
#13: 3 3 0 1
#14: 3 4 0 1
#15: 3 5 0 1
Based on the comments from @docendodiscimus, if the values are not 0 for 'y' at the end of each 'id', then we can do
setDT(df1)[, indicator := {
i1 <- rleid(y) * !y
if(i1[.N]!= max(i1) & !is.na(i1[.N])) 0L else cumsum(.I %in% .I[which.max(i1)]) }, id]
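Another way to express "y is 0 here and at every later time point" is a cumulative product taken from the end of each group. The following is a base R sketch under that reading of the question; the helper name trailing_zero is made up for illustration.

```r
df1 <- data.frame(id   = rep(1:3, each = 5),
                  time = rep(1:5, times = 3),
                  y    = c(rep(1, 4), 0, 1, 0, 1, 1, 0, 0, 1, rep(0, 3)))

# Hypothetical helper: reverse the vector, keep a running product of (v == 0),
# and reverse back. The product drops to 0 at the last non-zero value,
# so only the trailing run of zeros is flagged with 1.
trailing_zero <- function(v) rev(cumprod(rev(v == 0)))

df1$new_col <- ave(df1$y, df1$id, FUN = trailing_zero)
df1$new_col
```

If a group does not end in 0, the product is 0 from the start, so the whole group gets 0 without any special case.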

How to Perform Consecutive Counts of Column by Group Conditionally Upon Another Column

I'm trying to get consecutive counts from the Noshow column grouped by the PatientID column. The code below comes very close to the result I want. However, the sum function returns the sum of the whole group. I would like it to sum only the current row and the rows directly above it that have a '1'. Basically, I'm trying to count, for each row, the number of consecutive times a patient has no-showed their appointment, and reset to 0 when they do show. It seems only a few tweaks to my code are needed, but I cannot find the answer anywhere on this site.
transform(df, ConsecNoshows = ifelse(Noshow == 0, 0, ave(Noshow, PatientID, FUN = sum)))
The above code produces the below output:
#Source: local data frame [12 x 3]
#Groups: ID [2]
#
# PatientID Noshow ConsecNoshows
# <int> <int> <int>
#1 1 0 0
#2 1 1 4
#3 1 0 0
#4 1 1 4
#5 1 1 4
#6 1 1 4
#7 2 0 0
#8 2 0 0
#9 2 1 3
#10 2 1 3
#11 2 0 0
#12 2 1 3
This is what I desire:
#Source: local data frame [12 x 3]
#Groups: ID [2]
#
# PatientID Noshow ConsecNoshows
# <int> <int> <int>
#1 1 0 0
#2 1 1 0
#3 1 0 1
#4 1 1 0
#5 1 1 1
#6 1 1 2
#7 2 0 0
#8 2 0 0
#9 2 1 0
#10 2 1 1
#11 2 0 2
#12 2 1 0
[UPDATE] I would like the consecutive count to be offset by one row down.
Thank you for any help you can offer in advance!
And here's another (similar) data.table approach
library(data.table)
setDT(df)[, ConsecNoshows := seq(.N) * Noshow, by = .(PatientID, rleid(Noshow))]
df
# PatientID Noshow ConsecNoshows
# 1: 1 0 0
# 2: 1 1 1
# 3: 1 0 0
# 4: 1 1 1
# 5: 1 1 2
# 6: 1 1 3
# 7: 2 0 0
# 8: 2 0 0
# 9: 2 1 1
# 10: 2 1 2
# 11: 2 0 0
# 12: 2 1 1
This basically groups by PatientID and the run-length encoding of Noshow, then creates a sequence within each group and multiplies it by Noshow, so that only the counts for rows where Noshow == 1 are kept.
We can use rle from base R (No packages used). Using ave, we group by 'PatientID', get the rle of 'Noshow', multiply the sequence of 'lengths' by the 'values' replicated by 'lengths' to get the expected output.
helperfn <- function(x) with(rle(x), sequence(lengths) * rep(values, lengths))
df$ConsecNoshows <- with(df, ave(Noshow, PatientID, FUN = helperfn))
df$ConsecNoshows
#[1] 0 1 0 1 2 3 0 0 1 2 0 1
As the OP seems to be using 'tbl_df', a solution in dplyr would be
library(dplyr)
df %>%
group_by(PatientID) %>%
mutate(ConsecNoshows = helperfn(Noshow))
# PatientID Noshow ConsecNoshows
# <int> <int> <int>
#1 1 0 0
#2 1 1 1
#3 1 0 0
#4 1 1 1
#5 1 1 2
#6 1 1 3
#7 2 0 0
#8 2 0 0
#9 2 1 1
#10 2 1 2
#11 2 0 0
#12 2 1 1
I would create a helper function to then use whatever implementation you're most comfortable with:
sum0 <- function(x) {x[x == 1]=sequence(with(rle(x), lengths[values == 1]));x}
#base R
transform(df, Consec = ave(Noshow, PatientID, FUN=sum0))
#dplyr
library(dplyr)
df %>% group_by(PatientID) %>% mutate(Consec=sum0(Noshow))
#data.table
library(data.table)
setDT(df)[, Consec := sum0(Noshow), by = PatientID]
# PatientID Noshow Consec
# <int> <int> <int>
# 1 1 0 0
# 2 1 1 1
# 3 1 0 0
# 4 1 1 1
# 5 1 1 2
# 6 1 1 3
# 7 2 0 0
# 8 2 0 0
# 9 2 1 1
# 10 2 1 2
# 11 2 0 0
# 12 2 1 1
The most straightforward way to group consecutive values is to use rleid from the data.table package: group the data by PatientID as well as by the rleid of the Noshow variable. You also need cumsum instead of sum to get a cumulative sum of Noshow:
library(data.table)
setDT(df)[, ConsecNoshows := ifelse(Noshow == 0, 0, cumsum(Noshow)), .(PatientID, rleid(Noshow))]
df
# PatientID Noshow ConsecNoshows
# 1: 1 0 0
# 2: 1 1 1
# 3: 1 0 0
# 4: 1 1 1
# 5: 1 1 2
# 6: 1 1 3
# 7: 2 0 0
# 8: 2 0 0
# 9: 2 1 1
#10: 2 1 2
#11: 2 0 0
#12: 2 1 1

R For Loop Not Working

My data set test is below. I want to create a new variable "indicator" which is 1 if all variables equal 1 (for example, row 3) and 0 otherwise.
id X10J X10f X10m X10ap X10myy X10junn X10julyy
1 1001 2 2 2 2 2 2 2
2 1002 1 1 -1 2 1 1 1
3 1003 1 1 1 1 1 1 1
4 1004 1 1 2 1 1 1 1
12 1012 1 2 1 1 1 1 1
I created the following for loop:
for (i in c(test$X10J,test$X10f,test$X10m,test$X10ap,test$X10myy,test$X10junn,test$X10julyy)){
if(i==1){
test$indicator=1
}else if(i==2|i==-1){
test$indicator=0
}
}
This creates a variable with all values = 1 instead of a mix of 0 and 1.
A vectorized solution:
test$indicator <- ifelse(rowSums(test[,-1] ==1)==ncol(test[,-1]),1,0)
No need for a for loop. You can use apply
> test$indicator <- apply(test[-1], 1, function(x) ifelse(all(x == 1), 1, 0))
> test
id X10J X10f X10m X10ap X10myy X10junn X10julyy indicator
1 1001 2 2 2 2 2 2 2 0
2 1002 1 1 -1 2 1 1 1 0
3 1003 1 1 1 1 1 1 1 1
4 1004 1 1 2 1 1 1 1 0
12 1012 1 2 1 1 1 1 1 0
You could just use:
indicator <- apply(test[,-1], 1, function(row)
{
ifelse(all(row==1), 1, 0)
})
Note: the second parameter of apply is 1 to apply the function over rows and 2 to apply it over columns.
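As for why the loop in the question gives all 1s: test$indicator = 1 assigns the entire column on every iteration, so the column ends up with whatever the last element visited dictates. A self-contained sketch of the row-wise check, with the data typed in from the table in the question, could look like this:

```r
test <- data.frame(id       = c(1001, 1002, 1003, 1004, 1012),
                   X10J     = c(2,  1, 1, 1, 1),
                   X10f     = c(2,  1, 1, 1, 2),
                   X10m     = c(2, -1, 1, 2, 1),
                   X10ap    = c(2,  2, 1, 1, 1),
                   X10myy   = c(2,  1, 1, 1, 1),
                   X10junn  = c(2,  1, 1, 1, 1),
                   X10julyy = c(2,  1, 1, 1, 1))

# Count, per row, how many of the non-id columns differ from 1;
# a row qualifies only when that count is zero
test$indicator <- as.integer(rowSums(test[-1] != 1) == 0)
test$indicator
```

Only the row with id 1003 has every value equal to 1, so it is the only row flagged.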

Conditional counting in R

I have a question I hope some of you might help me with. I am doing a thesis on pharmaceuticals and the effect of parallel imports. I am dealing with this in R, using a panel data set.
I need a variable that counts, for a given original product, how many parallel importers there are in a given time period.
Product_ID PI t
1 0 1
1 1 1
1 1 1
1 0 2
1 1 2
1 1 2
1 1 2
1 1 2
2 0 1
2 1 1
2 0 2
2 1 2
2 0 3
2 1 3
2 1 3
2 1 3
Ideally what I want here is a new column giving the number of PI products (PI=1) for an original (PI=0) at time t. So the output would be like:
Product_ID PI t nPIcomp
1 0 1 2
1 1 1
1 1 1
1 0 2 4
1 1 2
1 1 2
1 1 2
1 1 2
2 0 1 1
2 1 1
2 0 2 1
2 1 2
2 0 3 3
2 1 3
2 1 3
2 1 3
I hope I have made my issue clear :)
Thanks in advance,
Henrik
Something like this?
x <- read.table(text = "Product_ID PI t
1 0 1
1 1 1
1 1 1
1 0 2
1 1 2
1 1 2
1 1 2
1 1 2
2 0 1
2 1 1
2 0 2
2 1 2
2 0 3
2 1 3
2 1 3
2 1 3", header = TRUE)
find.count <- rle(x$PI)
count <- find.count$lengths[find.count$values == 1]
x[x$PI == 0, "nPIcomp"] <- count
Product_ID PI t nPIcomp
1 1 0 1 2
2 1 1 1 NA
3 1 1 1 NA
4 1 0 2 4
5 1 1 2 NA
6 1 1 2 NA
7 1 1 2 NA
8 1 1 2 NA
9 2 0 1 1
10 2 1 1 NA
11 2 0 2 1
12 2 1 2 NA
13 2 0 3 3
14 2 1 3 NA
15 2 1 3 NA
16 2 1 3 NA
I would use ave and your two columns Product_ID and t as grouping variables. Then, within each group, apply a function that returns the sum of PI followed by the appropriate number of NAs:
dat <- transform(dat, nPIcomp = ave(PI, Product_ID, t,
FUN = function(z) {
n <- sum(z)
c(n, rep(NA, n))
}))
The same idea can be used with the data.table package if your data is large and speed is a concern.
Roman's answer gives exactly what you want. In case you want to summarise the data, this would be handy, using the plyr package (df is what I have called your data.frame)...
ddply( df , .(Product_ID , t ) , summarise , nPIcomp = sum(PI) )
# Product_ID t nPIcomp
#1 1 1 2
#2 1 2 4
#3 2 1 1
#4 2 2 1
#5 2 3 3
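A variant that does not depend on the row order within each group is to compute the group sum for every row and then blank out the PI rows. This is a base R sketch only, with the data typed in from the question:

```r
x <- data.frame(Product_ID = c(rep(1, 8), rep(2, 8)),
                PI         = c(0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1),
                t          = c(1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2, 3, 3, 3, 3))

# Number of parallel imports (PI == 1) per product and time period,
# repeated on every row of the group
x$nPIcomp <- ave(x$PI, x$Product_ID, x$t, FUN = sum)

# Keep the count only on the row of the original product (PI == 0)
x$nPIcomp[x$PI == 1] <- NA
x$nPIcomp
```

This assumes exactly one original (PI = 0) row per product and period, as in the example data.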
