Multiple Conditional Cumulative Sum in R - r

This is my data frame as given below
rd <- data.frame(
Customer = rep("A",15),
date_num = c(3,3,9,11,14,14,15,16,17,20,21,27,28,29,31),
exp_cumsum_col = c(1,1,2,3,4,4,4,4,4,5,5,6,6,6,7))
I am trying to get column 3 (exp_cumsum_col), but am unable to get the correct values after trying many times. This is the code I used:
rd<-as.data.frame(rd %>%
group_by(customer) %>%
mutate(exp_cumsum_col = cumsum(row_number(ifelse(date_num[i]==date_num[i+1],1)))))
If my date_num is continuous, then I am treating that entire series as a one number, and if there is any break in my date_num, then I am increasing exp_cumsum_col by 1 ..... exp_cumsum_col would start at 1.

We can take the differece of adjacent elements, check if it is greater than 1 and get the cumsum
rd %>%
group_by(Customer) %>%
mutate(newexp_col = cumsum(c(TRUE, diff(date_num) > 1)))
# Customer date_num exp_cumsum_col newexp_col
#1 A 3 1 1
#2 A 3 1 1
#3 A 9 2 2
#4 A 11 3 3
#5 A 14 4 4
#6 A 14 4 4
#7 A 15 4 4
#8 A 16 4 4
#9 A 17 4 4
#10 A 20 5 5
#11 A 21 5 5
#12 A 27 6 6
#13 A 28 6 6
#14 A 29 6 6
#15 A 31 7 7

Related

using intervals in a column to populate values for another column

I have a dataframe:
dataframe <- data.frame(Condition = rep(c(1,2,3), each = 5, times = 2),
Time = sort(sample(1:60, 30)))
Condition Time
1 1 1
2 1 3
3 1 4
4 1 7
5 1 9
6 2 11
7 2 12
8 2 14
9 2 16
10 2 18
11 3 19
12 3 24
13 3 25
14 3 28
15 3 30
16 1 31
17 1 34
18 1 35
19 1 38
20 1 39
21 2 40
22 2 42
23 2 44
24 2 47
25 2 48
26 3 49
27 3 54
28 3 55
29 3 57
30 3 59
I want to divide the total length of Time (i.e., max(Time) - min(Time)) per Condition by a constant 'x' (e.g., 3). Then I want to use that quotient to add a new variable Trial such that my dataframe looks like this:
Condition Time Trial
1 1 1 A
2 1 3 A
3 1 4 B
4 1 7 C
5 1 9 C
6 2 11 A
7 2 12 A
8 2 14 B
9 2 16 C
10 2 18 C
... and so on
As you can see, for Condition 1, Trial is populated with unique identifying values (e.g., A, B, C) every 2.67 seconds = 8 (total time) / 3. For Condition 2, Trial is populated every 2.33 seconds = 7 (total time) /3.
I am not getting what I want with my current code:
dataframe %>%
group_by(Condition) %>%
mutate(Trial = LETTERS[cut(Time, 3, labels = F)])
# Groups: Condition [3]
Condition Time Trial
<dbl> <int> <chr>
1 1 1 A
2 1 3 A
3 1 4 A
4 1 7 A
5 1 9 A
6 2 11 A
7 2 12 A
8 2 14 A
9 2 16 A
10 2 18 A
# ... with 20 more rows
Thanks!
We can get the diffrence of range (returns min/max as a vector) and divide by the constant passed into i.e. 3 as the breaks in cut). Then, use integer index (labels = FALSE) to get the corresponding LETTER from the LETTERS builtin R constant
library(dplyr)
dataframe %>%
group_by(Condition) %>%
mutate(Trial = LETTERS[cut(Time, diff(range(Time))/3,
labels = FALSE)])
If the grouping should be based on adjacent values in 'Condition', use rleid from data.table on the 'Condition' column to create the grouping, and apply the same code as above
library(data.table)
dataframe %>%
group_by(grp = rleid(Condition)) %>%
mutate(Trial = LETTERS[cut(Time, diff(range(Time))/3,
labels = FALSE)])
Here's a one-liner using my santoku package. The rleid line is the same as mentioned in #akrun's solution.
dataframe %<>%
group_by(grp = data.table::rleid(Condition)) %>%
mutate(
Trial = chop_evenly(Time, intervals = 3, labels = lbl_seq("A"))
)

Add a unique identifier to the same column value in R data frame

I have a data frame as follows:
index val sample_id
1 1 14 5
2 2 22 6
3 3 1 6
4 4 25 7
5 5 3 7
6 6 34 7
For each row with the sample_id, I would like to add a unique identifier as follows:
index val sample_id
1 1 14 5
2 2 22 6-A
3 3 1 6-B
4 4 25 7-A
5 5 3 7-B
6 6 34 7-C
Any suggestion? Thank you for your help.
Base R
dat$id2 <- ave(dat$sample_id, dat$sample_id,
FUN = function(z) if (length(z) > 1) paste(z, LETTERS[seq_along(z)], sep = "-") else as.character(z))
dat
# index val sample_id id2
# 1 1 14 5 5
# 2 2 22 6 6-A
# 3 3 1 6 6-B
# 4 4 25 7 7-A
# 5 5 3 7 7-B
# 6 6 34 7 7-C
tidyverse
library(dplyr)
dat %>%
group_by(sample_id) %>%
mutate(id2 = if (n() > 1) paste(sample_id, LETTERS[row_number()], sep = "-") else as.character(sample_id)) %>%
ungroup()
Minor note: it might be tempting to drop the as.character(z) from either or both of the code blocks. In the first, nothing will change (here): base R allows you to be a little sloppy; if we rely on that and need the new field to always be character, then in that one rare circumstance where all rows have unique sample_id, then the column will remain integer. dplyr is much more careful in guarding against this; if you run the tidyverse code without as.character, you'll see the error.
Using dplyr:
library(dplyr)
dplyr::group_by(df, sample_id) %>%
dplyr::mutate(sample_id = paste(sample_id, LETTERS[seq_along(sample_id)], sep = "-"))
index val sample_id
<int> <dbl> <chr>
1 1 14 5-A
2 2 22 6-A
3 3 1 6-B
4 4 25 7-A
5 5 3 7-B
6 6 34 7-C
If you just want to create unique tags for the same sample_id, maybe you can try make.unique like below
transform(
df,
sample_id = ave(as.character(sample_id),sample_id,FUN = function(x) make.unique(x,sep = "_"))
)
which gives
index val sample_id
1 1 14 5
2 2 22 6
3 3 1 6_1
4 4 25 7
5 5 3 7_1
6 6 34 7_2

conditional merge or left join two dataframes in R

I am trying to add additional data from a reference table onto my primary dataframe. I see similar questions have been asked about this however cant find anything for my specific case.
An example of my data frame is set up like this
df <- data.frame("participant" = rep(1:3,9), "time" = rep(1:9, each = 3))
lookup <- data.frame("start.time" = c(1,5,8), "end.time" = c(3,6,10), "var1" = c("A","B","A"),
"var2" = c(8,12,3), "var3"= c("fast","fast","slow"))
print(df)
participant time
1 1 1
2 2 1
3 3 1
4 1 2
5 2 2
6 3 2
7 1 3
8 2 3
9 3 3
10 1 4
11 2 4
12 3 4
13 1 5
14 2 5
15 3 5
16 1 6
17 2 6
18 3 6
19 1 7
20 2 7
21 3 7
22 1 8
23 2 8
24 3 8
25 1 9
26 2 9
27 3 9
> print(lookup)
start.time end.time var1 var2 var3
1 1 3 A 8 fast
2 5 6 B 12 fast
3 8 10 A 3 slow
What I want to do is merge or join these two dataframes in a way which also includes the times in between both the start and end time of the look up data frame. So the columns var1, var2 and var3 are added onto the df at each instance where the time lies between the start time and end time.
for example, in the above case - the look up value in the first row has a start time of 1, an end time of 3, so for times 1, 2 and 3 for each participant, the first row data should be added.
the output should look something like this.
print(output)
participant time var1 var2 var3
1 1 1 A 8 fast
2 2 1 A 8 fast
3 3 1 A 8 fast
4 1 2 A 8 fast
5 2 2 A 8 fast
6 3 2 A 8 fast
7 1 3 A 8 fast
8 2 3 A 8 fast
9 3 3 A 8 fast
10 1 4 <NA> NA <NA>
11 2 4 <NA> NA <NA>
12 3 4 <NA> NA <NA>
13 1 5 B 12 fast
14 2 5 B 12 fast
15 3 5 B 12 fast
16 1 6 B 12 fast
17 2 6 B 12 fast
18 3 6 B 12 fast
19 1 7 <NA> NA <NA>
20 2 7 <NA> NA <NA>
21 3 7 <NA> NA <NA>
22 1 8 A 3 slow
23 2 8 A 3 slow
24 3 8 A 3 slow
25 1 9 A 3 slow
26 2 9 A 3 slow
27 3 9 A 3 slow
I realise that column names don't match and they should for merging data sets.
One option would be to use the sqldf package, and phrase your problem as a SQL left join:
sql <- "SELECT t1.participant, t1.time, t2.var1, t2.var2, t2.var3
FROM df t1
LEFT JOIN lookup t2
ON t1.time BETWEEN t2.\"start.time\" AND t2.\"end.time\""
output <- sqldf(sql)
A dplyr solution:
output <- df %>%
# Create an id for the join
mutate(merge_id=1) %>%
# Use full join to create all the combinations between the two datasets
full_join(lookup %>% mutate(merge_id=1), by="merge_id") %>%
# Keep only the rows that we want
filter(time >= start.time, time <= end.time) %>%
# Select the relevant variables
select(participant,time,var1:var3) %>%
# Right join with initial dataset to get the missing rows
right_join(df, by = c("participant","time")) %>%
# Sort to match the formatting asked by OP
arrange(time, participant)
This produces the output asked by OP, but it will only work for data of reasonable size, as the full join produces a data frame with number of rows equal to the product of the number of rows of both initial datasets.
Using tidyverse and creating an auxiliary table:
df <- data.frame("participant" = rep(1:3,9), "time" = rep(1:9, each = 3))
lookup <- data.frame("start.time" = c(1,5,8), "end.time" = c(3,6,10), "var1" = c("A","B","A"),
"var2" = c(8,12,3), "var3"= c("fast","fast","slow"))
lookup_extended <- lookup %>%
mutate(time = map2(start.time, end.time, ~ c(.x:.y))) %>%
unnest(time) %>%
select(-start.time, -end.time)
df2 <- df %>%
left_join(lookup_extended, by = "time")

I want to create a loop with a large dataframe in R

Problem
I want to create a loop from data in df1 it's important the data is taken one ID value at a time.
I'm unsure how this can be done with R.
#original dataset
id=c(1,1,1,2,2,2,3,3,3)
dob=c("11-08","12-04","04-03","10-04","03-07","06-02","12-09","01-01","03-08")
count=c(1,6,3,2,5,6,8,6,4)
outcome=rep(1:0,length.out=9)
df1=data.frame(id,dob,count,outcome)
#changes for each value this needs to be completed separately for each value
df2<-df1[df1$id==1,]
df2<-df2[,-4]
addition<-df2$count+45
df2<-cbind(df2,addition)
df3<-df1[df1$id==2,]
df3<-df3[,-4]
addition<-df3$count+45
df3<-cbind(df3,addition)
df4<-df1[df1$id==3,]
df4<-df4[,-4]
addition<-df4$count+45
df4<-cbind(df4,addition)
df5<-rbind(df2,df3,df4)
Expected Output
df5<-rbind(df2,df3,df4)
1 1 11-08 1 46
2 1 12-04 6 51
3 1 04-03 3 48
4 2 10-04 2 47
5 2 03-07 5 50
6 2 06-02 6 51
7 3 12-09 8 53
8 3 01-01 6 51
9 3 03-08 4 49
In the present context (could be a simplified example) it doesn't even need that to loop, as we can directly add the 'count' with a number
df1$addition <- df1$count + 45
However, if it is a complicated operation and needs to look into the 'id' separately, then do a group_by operation
library(dplyr)
df1 %>%
group_by(id) %>%
mutate(addition = count + 45)
# A tibble: 9 x 5
# Groups: id [3]
# id dob count outcome addition
# <dbl> <fct> <dbl> <int> <dbl>
#1 1 11-08 1 1 46
#2 1 12-04 6 0 51
#3 1 04-03 3 1 48
#4 2 10-04 2 0 47
#5 2 03-07 5 1 50
#6 2 06-02 6 0 51
#7 3 12-09 8 1 53
#8 3 01-01 6 0 51
#9 3 03-08 4 1 49
Also, data.table syntax would be
library(data.table)
setDT(df1)[, addition := count + 45, by = id]
or simply
setDT(df1)[, addition := count + 45]

Group dataframe rows by consecutive ascending IDs [duplicate]

This question already has answers here:
Find consecutive values in vector in R [duplicate]
(2 answers)
Closed 6 years ago.
I am fairly new to the art of programming (loops etc..) and this is something where I would be grateful if I could get an opinion whether my approach is fine or it would definitely need to be optimized if it was about to used on much bigger sample.
Currently I have approximately 20 000 observations and one of the columns is the ID of receipt. What I would like to achieve is to assign each row to a group that would consist of IDs that are ascending in a format of n+1. If this rule is broken the new group should be created until the rule is broken again.
To illustrate, lets say I have this table (Important note is that ID are not necessarily unique and can repeat, like ID 10 in my example):
MyTable <- data.frame(ID = c(1,2,3,4,6,7,8,10,10,11,17,18,19,200,201,202,2010,2011,2013))
MyTable
ID
1
2
3
4
6
7
8
10
10
11
17
18
19
200
201
202
2010
2011
2013
The result of my grouping should be following:
ID GROUP
1 1
2 1
3 1
4 1
6 2
7 2
8 2
10 3
10 3
11 3
17 4
18 4
19 4
200 5
201 5
202 5
2010 6
2011 6
2013 7
I used dplyr for ordering the ID in ascending way. Then created the variable MyData$Group which I have simply filled with 1's.
rep(1,length(MyTable$ID)
for (i in 2:length(MyTable$ID) ) {
if(MyTable$ID[i] == MyTable$ID[i-1]+1 | MyTable$ID[i] == MyTable$ID[i-1]) {
MyTable$ID[i] <- MyTable$GROUP[i-1]
} else {
MyTable$GROUP[i] <- MyTable$GROUP[i-1]+1
}
}
This code worked for me and I got the results fairly easily. However, I wonder if in eyes of more experienced programmers, this piece of code would be considered as "bad", "average", "good" or whatever rating you come up with.
EDIT: I am sure this topic has been touched already, not arguing against that. Though, as the main difference is that I would like to touch a topic of optimization here and see whether my approach meets standards.
Thanks!
To make a long story short:
MyTable$Group <- cumsum(c(1, diff(MyTable$ID) != 1))
# ID Group
#1 1 1
#2 2 1
#3 3 1
#4 4 1
#5 6 2
#6 7 2
#7 8 2
#8 10 3
#9 11 3
#10 12 3
#11 17 4
#12 18 4
#13 19 4
#14 200 5
#15 201 5
#16 202 5
#17 2010 6
#18 2011 6
#19 2013 7
You are searching all differences in your vector Mytable$ID, which are not 1, so this are your "breaks". And then you cumsum all these values. When you do not know cumsum so type ?cumsum.
That's all!
UPDATE: with repeating IDs, you can use this:
MyTable <- data.frame(ID = c(1,2,3,4,6,7,8,10,10,11,17,18,19,200,201,202,2010,2011,2013))
MyTable$Group <- cumsum(c(1, !diff(MyTable$ID) %in% c(0,1) ))
# ID Group
#1 1 1
#2 2 1
#3 3 1
#4 4 1
#5 6 2
#6 7 2
#7 8 2
#8 10 3
#9 10 3
#10 11 3
#11 17 4
#12 18 4
#13 19 4
#14 200 5
#15 201 5
#16 202 5
#17 2010 6
#18 2011 6
#19 2013 7

Resources