add "counting" column to a data frame with certain conditions - r

I have a data frame with different accounts and their win/lose records. I want to count how many times a person has lost in a row.
df <- data.frame(account_number = c(1,1,1,1,1,1,1,2,2,2,2,2,3,3),
                 win_lose = c(-1,-1,-1,1,-1,-1,-1,-1,-1,1,1,1,1,-1))
> df
account_number win_lose
1 1 -1
2 1 -1
3 1 -1
4 1 1
5 1 -1
6 1 -1
7 1 -1
8 2 -1
9 2 -1
10 2 1
11 2 1
12 2 1
13 3 1
14 3 -1
Each account represents a person. The end result should look like this:
account_number win_lose losing_streak
1 1 -1 1
2 1 -1 2
3 1 -1 3
4 1 1 0
5 1 -1 1
6 1 -1 2
7 1 -1 3
8 2 -1 1
9 2 -1 2
10 2 1 0
11 2 1 0
12 2 1 0
13 3 1 0
14 3 -1 1

One option is rleid from data.table. Convert the data.frame to a data.table (setDT(df)). Grouped by 'account_number' and rleid of 'win_lose', we take the row sequence within each run (seq_len(.N)) and multiply it by (win_lose < 0): FALSE is coerced to 0, zeroing out the winning runs, while TRUE is coerced to 1, keeping the running count within each losing run.
library(data.table)
setDT(df)[, losing_streak := seq_len(.N) * (win_lose < 0),
          by = .(account_number, rleid(win_lose))]
df
# account_number win_lose losing_streak
# 1: 1 -1 1
# 2: 1 -1 2
# 3: 1 -1 3
# 4: 1 1 0
# 5: 1 -1 1
# 6: 1 -1 2
# 7: 1 -1 3
# 8: 2 -1 1
# 9: 2 -1 2
#10: 2 1 0
#11: 2 1 0
#12: 2 1 0
#13: 3 1 0
#14: 3 -1 1
A base R option would be to use ave (for grouping) together with rle:
with(df, ave(win_lose, account_number, FUN =
  function(x) with(rle(x == -1), sequence(lengths) * rep(values, lengths))))
#[1] 1 2 3 0 1 2 3 1 2 0 0 0 0 1
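For completeness, a rough dplyr sketch of the same run-based idea (added here for illustration, not part of the original answers; assumes dplyr >= 1.0):
library(dplyr)
df %>%
  group_by(account_number) %>%
  # label runs of identical win_lose values within each account
  mutate(run = cumsum(win_lose != lag(win_lose, default = first(win_lose)))) %>%
  group_by(account_number, run) %>%
  # number the rows within each run, zeroing out the winning runs
  mutate(losing_streak = row_number() * (win_lose < 0)) %>%
  ungroup() %>%
  select(-run)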

Related

Recoding by an order in R

I have a data recoding puzzle. Here is what my sample data looks like:
df <- data.frame(
  id = c(1,1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3,3),
  scores = c(0,1,1,0,0,-1,-1, 0,0,1,-1,-1,-1, 0,1,0,1,1,0,1),
  position = c(1,2,3,4,5,6,7, 1,2,3,4,5,6, 1,2,3,4,5,6,7),
  cat = c(1,1,1,1,1,0,0, 1,1,1,0,0,0, 1,1,1,1,1,1,1))
id scores position cat
1 1 0 1 1
2 1 1 2 1
3 1 1 3 1
4 1 0 4 1
5 1 0 5 1
6 1 -1 6 0
7 1 -1 7 0
8 2 0 1 1
9 2 0 2 1
10 2 1 3 1
11 2 -1 4 0
12 2 -1 5 0
13 2 -1 6 0
14 3 0 1 1
15 3 1 2 1
16 3 0 3 1
17 3 1 4 1
18 3 1 5 1
19 3 0 6 1
20 3 1 7 1
There are three ids in the dataset and the rows are ordered by a position variable. For each id, the first row where scores becomes -1 needs to be recoded: scores should be 0 and the cat variable should be 1. For example, for id = 1 that is the row at position 6; in that row, scores should become 0 and cat should become 1. Ids that never have scores = -1 are kept as they are.
The desired output should look like below:
id scores position cat
1 1 0 1 1
2 1 1 2 1
3 1 1 3 1
4 1 0 4 1
5 1 0 5 1
6 1 0 6 1
7 1 -1 7 0
8 2 0 1 1
9 2 0 2 1
10 2 1 3 1
11 2 0 4 1
12 2 -1 5 0
13 2 -1 6 0
14 3 0 1 1
15 3 1 2 1
16 3 0 3 1
17 3 1 4 1
18 3 1 5 1
19 3 0 6 1
20 3 1 7 1
Any recommendations??
Thanks
This may be what you are after:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(i = which(scores == -1)[1]) %>%        # find the first row with scores == -1
  mutate(scores = case_when(position == i & scores != 0 ~ 0, TRUE ~ scores),  # update the score using position & i
         cat = ifelse(scores == -1, 0, 1)) %>%  # then update cat
  select(-i)                                    # remove the helper column i
After trying a few things and getting ideas from #Ricky and #e.matt, I came up with a solution.
library(dplyr)
df %>%
  filter(scores == -1) %>%                      # keep rows where scores == -1
  distinct(id, .keep_all = TRUE) %>%            # keep the first such row per id
  mutate(first = 1) %>%                         # flag it with a "first" column
  right_join(df, by = c("id", "scores", "position", "cat")) %>%  # join back onto the original dataset
  mutate(first = coalesce(first, 0)) %>%        # replace NAs with 0
  mutate(scores = case_when(
    first == 1 ~ 0,
    TRUE ~ scores)) %>%
  mutate(cat = case_when(
    first == 1 ~ 1,
    TRUE ~ cat))
This provides my desired output.
id scores position cat first
1 1 0 1 1 0
2 1 1 2 1 0
3 1 1 3 1 0
4 1 0 4 1 0
5 1 0 5 1 0
6 1 0 6 1 1
7 1 -1 7 0 0
8 2 0 1 1 0
9 2 0 2 1 0
10 2 1 3 1 0
11 2 0 4 1 1
12 2 -1 5 0 0
13 2 -1 6 0 0
14 3 0 1 1 0
15 3 1 2 1 0
16 3 0 3 1 0
17 3 1 4 1 0
18 3 1 5 1 0
19 3 0 6 1 0
20 3 1 7 1 0
Here is a data.table one-liner:
library(data.table)
setDT(df)
df[df[, .(cumsum(scores == -1) == 1), by = .(id)]$V1, `:=`(scores = 0, cat = 1)]
# id scores position cat
# 1: 1 0 1 1
# 2: 1 1 2 1
# 3: 1 1 3 1
# 4: 1 0 4 1
# 5: 1 0 5 1
# 6: 1 0 6 1
# 7: 1 -1 7 0
# 8: 2 0 1 1
# 9: 2 0 2 1
# 10: 2 1 3 1
# 11: 2 0 4 1
# 12: 2 -1 5 0
# 13: 2 -1 6 0
# 14: 3 0 1 1
# 15: 3 1 2 1
# 16: 3 0 3 1
# 17: 3 1 4 1
# 18: 3 1 5 1
# 19: 3 0 6 1
# 20: 3 1 7 1
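For intuition (a gloss added here, not part of the original answer): within each id, cumsum(scores == -1) == 1 becomes TRUE at the first -1 and stays TRUE until the next -1 appears; since the -1 values occur in consecutive runs in this data, that singles out exactly the first -1 row per id, and only that row gets scores = 0 and cat = 1. A quick check on id = 1:
scores_id1 <- c(0, 1, 1, 0, 0, -1, -1)
cumsum(scores_id1 == -1) == 1
# [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE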
You could do something along these lines using the dplyr package:
library(dplyr)
df = mutate(df, cat = ifelse(scores == -1, 1, cat),
            scores = ifelse(scores == -1, 0, scores))
Using the mutate() function, I am re-assigning the values of the scores and cat fields according to ifelse() conditional statements. For scores, if the score is -1 the value is replaced by 0; otherwise the score is kept as is. For cat, the same condition on scores is checked, but a value of 1 is assigned when it is met and the existing value of cat is kept otherwise (note that cat is listed first in the mutate call, so it still sees the original scores).
EDIT
After our discussion in the comments, I think something along these lines should be helpful (you may have to modify the logic since I don't exactly follow what the desired output is here):
for(i in 1:nrow(df)){
  # Check if score is -1 (and make sure there is a next row to update)
  if(i < nrow(df) && df[i, 'scores'] == -1){
    # Update values for the next row
    df[i+1, 'scores'] <- 0
    df[i+1, 'cat'] <- 1
  }
}
Sorry that I don't really follow the desired output, hopefully this is helpful in getting you to your answer!

How can I create a new variable which identifies rows where another variable changes sign?

I have a question regarding data preparation. I have the following data set (in long format; one row per measurement point, therefore several rows per person):
dd <- read.table(text=
"ID time
1 -4
1 -3
1 -2
1 -1
1 0
1 1
2 -3
2 -1
2 2
2 3
2 4
3 -3
3 -2
3 -1
4 -1
4 1
4 2
4 3
5 0
5 1
5 2
5 3
5 4", header=TRUE)
Now I would like to create a new variable that has a 1 in the row in which a sign change on the time variable happens for the first time for that person, and a 0 in all other rows. If a person has only negative values on time, there should not be any 1 on the new variable. For a person that has only non-negative values on time (like ID 5), the first row should have a 1 on the new variable and all other rows should be coded with 0. For my example above the new data frame should look like this:
dd <- read.table(text=
"ID time new.var
1 -4 0
1 -3 0
1 -2 0
1 -1 0
1 0 1
1 1 0
2 -3 0
2 -1 0
2 2 1
2 3 0
2 4 0
3 -3 0
3 -2 0
3 -1 0
4 -1 0
4 1 1
4 2 0
4 3 0
5 0 1
5 1 0
5 2 0
5 3 0
5 4 0", header=TRUE)
Does anyone know how to do this? I thought about using dplyr and group_by, however I am pretty new to R and could not get it to work. Any help is much appreciated!
There are 2 different operations you want done to create new.var, so you need to do them in 2 steps. I'll break this into 2 separate mutate calls for simplicity, but you can put both of them into the same mutate call.
First, we group by ID and then find the rows where the sign changes. We need to use time >= 0 instead of sign() (as recommended in this answer: R identifying a row prior to a change in sign), because you want a change to be counted when going from negative to 0, but not counted again when going from 0 to positive:
library(tidyverse)
dd2 <- dd %>%
  group_by(ID) %>%
  mutate(new.var = as.numeric((time >= 0) != (lag(time) >= 0)))
dd2
# A tibble: 23 x 3
# Groups: ID [5]
ID time new.var
<int> <int> <dbl>
1 1 -4 NA
2 1 -3 0
3 1 -2 0
4 1 -1 0
5 1 0 1
6 1 1 0
7 2 -3 NA
8 2 -1 0
9 2 2 1
10 2 3 0
# … with 13 more rows
Then we use case_when to modify the first row based on your desired rules. Due to the way lag() works, the first row of each group will always have NA (since there is no previous row to look at), which makes it a good way to pick out that first row and change it based on the time values in that group:
dd3 <- dd2 %>%
  mutate(new.var = case_when(
    !is.na(new.var) ~ new.var,
    all(time >= 0) ~ 1,
    TRUE ~ 0))
print(dd3, n = 100) #n=100 because tibbles are truncated to 10 rows by print
# A tibble: 23 x 3
# Groups: ID [5]
ID time new.var
<int> <int> <dbl>
1 1 -4 0
2 1 -3 0
3 1 -2 0
4 1 -1 0
5 1 0 1
6 1 1 0
7 2 -3 0
8 2 -1 0
9 2 2 1
10 2 3 0
11 2 4 0
12 3 -3 0
13 3 -2 0
14 3 -1 0
15 4 -1 0
16 4 1 1
17 4 2 0
18 4 3 0
19 5 0 1
20 5 1 0
21 5 2 0
22 5 3 0
23 5 4 0
You can try this:
library(dplyr)
dd %>%
  left_join(dd %>% group_by(ID) %>% summarise(index = min(which(time >= 0)))) %>%
  group_by(ID) %>%
  mutate(new.var = ifelse(row_number(ID) == index, 1, 0)) %>%
  select(-index) -> DF
# A tibble: 23 x 3
# Groups: ID [5]
ID time new.var
<int> <int> <dbl>
1 1 -4 0
2 1 -3 0
3 1 -2 0
4 1 -1 0
5 1 0 1
6 1 1 0
7 2 -3 0
8 2 -1 0
9 2 2 1
10 2 3 0
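One caveat worth noting (an observation added here, not part of the original answer): for an ID with only negative times, which(time >= 0) is empty and min() returns Inf with a warning; the ifelse() still yields all 0s for that ID, but you may want to guard against or suppress the warning.
min(which(c(-3, -2, -1) >= 0))
# Warning message: no non-missing arguments to min; returning Inf
# [1] Inf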
The following ave instruction does what the question asks for.
dd$new.var <- with(dd, ave(time, ID, FUN = function(x){
  y <- integer(length(x))
  if(any(x >= 0)) y[which.max(x[1]*x <= 0)] <- 1L
  y
}))
dd
# ID time new.var
#1 1 -4 0
#2 1 -3 0
#3 1 -2 0
#4 1 -1 0
#5 1 0 1
#6 1 1 0
#7 2 -3 0
#8 2 -1 0
#9 2 2 1
#10 2 3 0
#11 2 4 0
#12 3 -3 0
#13 3 -2 0
#14 3 -1 0
#15 4 -1 0
#16 4 1 1
#17 4 2 0
#18 4 3 0
#19 5 0 1
#20 5 1 0
#21 5 2 0
#22 5 3 0
#23 5 4 0
If the expected output is renamed dd2 then
identical(dd, dd2)
#[1] TRUE
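To unpack the ave trick (a gloss added here, not part of the original answer): x[1] * x <= 0 flags the rows whose value has the opposite sign of the first observation or is zero, which.max() picks the first such row, and the any(x >= 0) guard leaves all-negative IDs untouched. When the first value is already 0 (as for ID 5), every row is flagged and row 1 is picked. For ID 2, for example:
x <- c(-3, -1, 2, 3, 4)
x[1] * x <= 0
# [1] FALSE FALSE  TRUE  TRUE  TRUE
which.max(x[1] * x <= 0)
# [1] 3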

Order columns by year independently in a dataframe in R

Data:
set.seed(0)
Temp <- data.frame(year=rep(1:3,each=4),V1=floor(rnorm(12)*2),V2=floor(rnorm(12)*2))
year V1 V2
1 1 2 -3
2 1 -1 -1
3 1 2 -1
4 1 2 -1
5 2 0 0
6 2 -4 -2
7 2 -2 0
8 2 -1 -3
9 3 -1 -1
10 3 4 0
11 3 1 0
12 3 -2 1
I want to reorder V1 and V2 independently within each year. I can do it with 10 lines, but I believe there must be a more beautiful way to do it.
Desired output:
year V1 V2
1 1 -1 -3
2 1 2 -1
3 1 2 -1
4 1 2 -1
5 2 -4 -3
6 2 -2 -2
7 2 -1 0
8 2 0 0
9 3 -2 -1
10 3 -1 0
11 3 1 0
12 3 4 1
Using dplyr you can do
library(dplyr)
Temp %>%
  group_by(year) %>%
  mutate(V1 = sort(V1), V2 = sort(V2))
which returns
# A tibble: 12 x 3
# Groups: year [3]
year V1 V2
<int> <dbl> <dbl>
1 1 -1 -3
2 1 2 -1
3 1 2 -1
4 1 2 -1
5 2 -4 -3
6 2 -2 -2
7 2 -1 0
8 2 0 0
9 3 -2 -1
10 3 -1 0
11 3 1 0
12 3 4 1
And if you needed to do that with all columns, you could do
Temp %>%
  group_by(year) %>%
  mutate_all(sort)
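In newer dplyr versions (1.0 and later) the _all verbs are superseded by across(); an equivalent sketch (added here for illustration):
Temp %>%
  group_by(year) %>%
  mutate(across(everything(), sort))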
Using data.table:
library(data.table)
setDT(Temp)[,c("V1","V2"):=list(sort(V1),sort(V2)),year]
If you use plyr and you know the column names, you can easily do this using ddply:
library(plyr)
ddply(Temp, "year", summarize, V1=sort(V1), V2=sort(V2))
year V1 V2
1 1 -1 -3
2 1 2 -1
3 1 2 -1
4 1 2 -1
5 2 -4 -3
6 2 -2 -2
7 2 -1 0
8 2 0 0
9 3 -2 -1
10 3 -1 0
11 3 1 0
12 3 4 1
If you don't know the column names, you'd have to make a function to do it:
> ddply(Temp, "year", function(x) { as.data.frame(lapply(x, sort)) })
year V1 V2
1 1 -1 -3
2 1 2 -1
3 1 2 -1
4 1 2 -1
5 2 -4 -3
6 2 -2 -2
7 2 -1 0
8 2 0 0
9 3 -2 -1
10 3 -1 0
11 3 1 0
12 3 4 1

Conditional Series Fill in R

Looking for a way to fill in a vector with new values conditional on values within that vector and another variable in the data frame. Pasted an example of what the data looks like below.
PrsVar= c(rep(1,10),rep(2,7),rep(3,11))
IndVar = c(0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0)
OutVar = c(1,1,1,1,2,2,2,3,3,3,1,1,1,1,2,2,2,1,1,1,1,2,2,2,2,3,3,3)
exampdata <- cbind(PrsVar,IndVar,OutVar)
exampdata <- as.data.frame(exampdata)
> exampdata
PrsVar IndVar OutVar
1 1 0 1
2 1 0 1
3 1 0 1
4 1 1 1
5 1 0 2
6 1 0 2
7 1 1 2
8 1 0 3
9 1 0 3
10 1 0 3
11 2 0 1
12 2 0 1
13 2 0 1
14 2 1 1
15 2 0 2
16 2 0 2
17 2 1 2
18 3 0 1
19 3 0 1
20 3 0 1
21 3 1 1
22 3 0 2
23 3 0 2
24 3 0 2
25 3 1 2
26 3 0 3
27 3 0 3
28 3 0 3
This is time-series data and each row represents a person-day. PrsVar is an ID for an individual in the study and IndVar is an indicator that an episode has ended on that person-day. The person-day after that represents a new episode.
I'd like to create a variable that looks like OutVar using just the values from PrsVar and IndVar. This new variable OutVar labels the episode each person-day is in, incrementing by 1, and starting over at 1 for each new individual.
I could run this through a loop, but I need more efficient code to work with 3,000,000+ rows of data. Was trying to use something in dplyr or maybe mapply, but I'm stumped. Thinking a solution to this would be helpful to others and would certainly be helpful to me again in the near future.
The data.table package offers a fast, efficient, and tidy way to do this. It's all done by reference (not by value, so no copying is done) so millions of rows won't be an issue at all (under a minute, maybe).
library(data.table)
patient.data <- data.table(PrsVar = c(rep(1,10), rep(2,7), rep(3,11)),
                           IndVar = c(0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0),
                           OutVar = c(1,1,1,1,2,2,2,3,3,3,1,1,1,1,2,2,2,1,1,1,1,2,2,2,2,3,3,3))
Increment an episode counter EpVar based on the cumulative sum of IndVar (plus 1). This bumps the counter on the record where IndVar is 1, which is one record too early, so shift it down a record with shift(), filling the leading missing value with a fresh counter of 1. This can be done groupwise with the by keyword.
patient.data[ , EpVar:=shift(1+cumsum(IndVar), fill=1), by=PrsVar]
patient.data
PrsVar IndVar OutVar EpVar
1: 1 0 1 1
2: 1 0 1 1
3: 1 0 1 1
4: 1 1 1 1
5: 1 0 2 2
6: 1 0 2 2
7: 1 1 2 2
8: 1 0 3 3
9: 1 0 3 3
10: 1 0 3 3
11: 2 0 1 1
12: 2 0 1 1
13: 2 0 1 1
14: 2 1 1 1
15: 2 0 2 2
16: 2 0 2 2
17: 2 1 2 2
18: 3 0 1 1
19: 3 0 1 1
20: 3 0 1 1
21: 3 1 1 1
22: 3 0 2 2
23: 3 0 2 2
24: 3 0 2 2
25: 3 1 2 2
26: 3 0 3 3
27: 3 0 3 3
28: 3 0 3 3
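A small illustration of the shift step for person 1's IndVar (added here, not part of the original answer):
library(data.table)
ind <- c(0, 0, 0, 1, 0, 0, 1, 0, 0, 0)
1 + cumsum(ind)
# [1] 1 1 1 2 2 2 3 3 3 3   # increments one record too early
shift(1 + cumsum(ind), fill = 1)
# [1] 1 1 1 1 2 2 2 3 3 3   # matches OutVar for person 1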
A bit ugly, but this logic should be easily adaptable to other methods:
with(exampdata,
ave(IndVar, PrsVar, FUN=function(x) {
out <- rev(cumsum(rev(x)))
max(out) - out + 1
})
)
# [1] 1 1 1 1 2 2 2 3 3 3 1 1 1 1 2 2 2 1 1 1 1 2 2 2 2 3 3 3
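An equivalent base R sketch using a lagged cumulative sum instead of the reverse trick (added here for illustration, not from the original answers): a new episode starts on the row after each IndVar == 1.
exampdata$EpVar <- with(exampdata, ave(IndVar, PrsVar, FUN = function(x)
  cumsum(c(0, head(x, -1))) + 1))
exampdata$EpVar
# [1] 1 1 1 1 2 2 2 3 3 3 1 1 1 1 2 2 2 1 1 1 1 2 2 2 2 3 3 3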

Sum rows in a group, starting when a specific value occurs

I want to accumulate the values of a column till the end of the group, though starting the addition when a specific value occurs in another column. I am only interested in the first instance of the specific value within a group. So if that value occurs again within the group, the addition column should continue to add the values. I know this sounds like a rather strange problem, so hopefully the example table makes sense.
The following data frame is what I have now:
> df = data.frame(group = c(1,1,1,1,2,2,2,2,2,3,3,3,4,4,4),numToAdd = c(1,1,3,2,4,2,1,3,2,1,2,1,2,3,2),occurs = c(0,0,1,0,0,1,0,0,0,0,1,1,0,0,0))
> df
group numToAdd occurs
1 1 1 0
2 1 1 0
3 1 3 1
4 1 2 0
5 2 4 0
6 2 2 1
7 2 1 0
8 2 3 0
9 2 2 0
10 3 1 0
11 3 2 1
12 3 1 1
13 4 2 0
14 4 3 0
15 4 2 0
Thus, whenever a 1 occurs within a group, I want a cumulative sum of the values from the column numToAdd, until a new group starts. This would look like the following:
> finalDF = data.frame(group = c(1,1,1,1,2,2,2,2,2,3,3,3,4,4,4),numToAdd = c(1,1,3,2,4,2,1,3,2,1,2,1,2,3,2),occurs = c(0,0,1,0,0,1,0,0,0,0,1,1,0,0,0),added = c(0,0,3,5,0,2,3,6,8,0,2,3,0,0,0))
> finalDF
group numToAdd occurs added
1 1 1 0 0
2 1 1 0 0
3 1 3 1 3
4 1 2 0 5
5 2 4 0 0
6 2 2 1 2
7 2 1 0 3
8 2 3 0 6
9 2 2 0 8
10 3 1 0 0
11 3 2 1 2
12 3 1 1 3
13 4 2 0 0
14 4 3 0 0
15 4 2 0 0
Thus, the added column is 0 until a 1 occurs within the group, then accumulates the values from numToAdd until it moves to a new group, where the added column starts back at 0. In group 3, a value of 1 is found a second time, yet the cumulative sum continues. Additionally, in group 4, a value of 1 is never found, so the added column remains 0 throughout.
I've played around with dplyr, but can't get it to work. The following attempt only outputs the total sum, and not the increasing cumulative value at each row.
library(dplyr)
df = df %>%
  mutate(added = ifelse(occurs == 1, cumsum(numToAdd), 0)) %>%
  group_by(group)
Try
df %>%
  group_by(group) %>%
  mutate(added = cumsum(numToAdd * cummax(occurs)))
# group numToAdd occurs added
# 1 1 1 0 0
# 2 1 1 0 0
# 3 1 3 1 3
# 4 1 2 0 5
# 5 2 4 0 0
# 6 2 2 1 2
# 7 2 1 0 3
# 8 2 3 0 6
# 9 2 2 0 8
# 10 3 1 0 0
# 11 3 2 1 2
# 12 3 1 1 3
# 13 4 2 0 0
# 14 4 3 0 0
# 15 4 2 0 0
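The trick is that cummax(occurs) acts as a switch: it is 0 before the first 1 in the group and 1 from then on, so multiplying numToAdd by it zeroes out everything before the trigger row, and cumsum then accumulates from that point onward. For group 2, for instance (illustration added here):
cummax(c(0, 1, 0, 0, 0))
# [1] 0 1 1 1 1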
Or using data.table
library(data.table)  # v1.9.5+
i1 <- setDT(df)[, .I[(rleid(occurs) + (occurs > 0)) > 1], group]$V1
df[, added := 0][i1, added := cumsum(numToAdd), by = group]
Or a similar option as in dplyr
setDT(df)[,added := cumsum(numToAdd * cummax(occurs)) , by = group]
You can use split-apply-combine in base R with something like:
df$added <- unlist(lapply(split(df, df$group), function(x) {
  y <- rep(0, nrow(x))
  pos <- cumsum(x$occurs) > 0
  y[pos] <- cumsum(x$numToAdd[pos])
  y
}))
df
# group numToAdd occurs added
# 1 1 1 0 0
# 2 1 1 0 0
# 3 1 3 1 3
# 4 1 2 0 5
# 5 2 4 0 0
# 6 2 2 1 2
# 7 2 1 0 3
# 8 2 3 0 6
# 9 2 2 0 8
# 10 3 1 0 0
# 11 3 2 1 2
# 12 3 1 1 3
# 13 4 2 0 0
# 14 4 3 0 0
# 15 4 2 0 0
To add another base R approach:
df$added <- unlist(lapply(split(df, df$group), function(x) {
  c(x[,'occurs'][cumsum(x[,'occurs']) == 0L],
    cumsum(x[,'numToAdd'][cumsum(x[,'occurs']) != 0L]))
}))
# group numToAdd occurs added
# 1 1 1 0 0
# 2 1 1 0 0
# 3 1 3 1 3
# 4 1 2 0 5
# 5 2 4 0 0
# 6 2 2 1 2
# 7 2 1 0 3
# 8 2 3 0 6
# 9 2 2 0 8
# 10 3 1 0 0
# 11 3 2 1 2
# 12 3 1 1 3
# 13 4 2 0 0
# 14 4 3 0 0
# 15 4 2 0 0
Another base R:
df$added <- unlist(lapply(split(df, df$group), function(x){
  cumsum((cumsum(x$occurs) > 0) * x$numToAdd)
}))
