I have a dataset like this:
library(data.table)
customer_id <- c("1","1","1","2","2","2","2","3","3","3")
account_id <- as.character(c(11,11,11,55,55,55,55,38,38,38))
time <- c(as.Date("2017-01-01","%Y-%m-%d"), as.Date("2017-02-01","%Y-%m-%d"), as.Date("2017-03-01","%Y-%m-%d"),
as.Date("2017-12-01","%Y-%m-%d"), as.Date("2018-01-01","%Y-%m-%d"), as.Date("2018-02-01","%Y-%m-%d"),
as.Date("2018-03-01","%Y-%m-%d"), as.Date("2018-04-01","%Y-%m-%d"), as.Date("2018-05-01","%Y-%m-%d"),
as.Date("2018-06-01","%Y-%m-%d"))
tenor <- c(1,2,3,1,2,3,4,1,2,3)
variable_x <- c(87,90,100,120,130,150,12,13,15,14)
my_data <- data.table(customer_id,account_id,time,tenor,variable_x)
Now I would like to create new variables "PD_Q1" up to "PD_Q20" that equal the value of "variable_x" when "tenor" equals 1 up to 20; i.e., PD_Q1 equals variable_x's value where tenor = 1, PD_Q2 equals variable_x's value where tenor = 2, and so on, computed by customer_id and account_id. I have code for this, but only for PD_Q1, and I would like to write a loop over i = 1:20 in which I change tenor == i (this part is easy) and also refer to the columns PD_Qi, which is the part I am stuck on. The code for one value of i is here:
# store variable_x where tenor == 1 in a temporary column
my_data[tenor == 1, PD_Q1_temp := variable_x, by = c("customer_id", "account_id")]
# keep one row per customer/account with that value, rename it, and merge it back on
list_accs <- my_data[tenor == 1, c("customer_id", "account_id", "PD_Q1_temp")]
list_accs <- unique(list_accs, by = c("customer_id", "account_id"))
names(list_accs) <- c("customer_id", "account_id", "PD_Q1")
my_data <- merge(x = my_data, y = list_accs, by = c("customer_id", "account_id"), all.x = TRUE)
my_data$PD_Q1_temp <- NULL
Now, can you please advise how to make a loop from 1 to 20, in which tenor, PD_Q1_temp and PD_Q1 would change? Specifically, I don't know how to refer to column names or variables using this i index within a loop.
The expected output for i = 1 and i = 2 (creating variables PD_Q1 and PD_Q2) is here:
> my_data
customer_id account_id time tenor variable_x PD_Q1 PD_Q2
1: 1 11 2017-01-01 1 87 87 90
2: 1 11 2017-02-01 2 90 87 90
3: 1 11 2017-03-01 3 100 87 90
4: 2 55 2017-12-01 1 120 120 130
5: 2 55 2018-01-01 2 130 120 130
6: 2 55 2018-02-01 3 150 120 130
7: 2 55 2018-03-01 4 12 120 130
8: 3 38 2018-04-01 1 13 13 15
9: 3 38 2018-05-01 2 15 13 15
10: 3 38 2018-06-01 3 14 13 15
Now I want to create PD_Q3, PD_Q4, etc. in a loop, using my code above that creates one such variable.
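For reference, a minimal sketch of the kind of loop being asked about might look like this (my assumption: it uses data.table's paste0()/(col) := idiom and assumes each customer/account has at most one row per tenor; tenors that never occur simply yield NA columns):
library(data.table)
for (i in 1:20) {
  col <- paste0("PD_Q", i)                       # build the target column name from i
  my_data[, (col) := variable_x[tenor == i][1],  # the group's value at tenor i, NA if absent
          by = c("customer_id", "account_id")]
}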
Can you show your expected output?
I think you can do what you want with tidyr::spread():
library(dplyr)
library(tidyr)
my_data %>%
tbl_df() %>%
select(-time) %>%
mutate(tenor = paste0("PD_Q", tenor)) %>%
spread(tenor, variable_x)
# # A tibble: 3 x 6
# customer_id account_id PD_Q1 PD_Q2 PD_Q3 PD_Q4
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 11 87 90 100 NA
# 2 2 55 120 130 150 12
# 3 3 38 13 15 14 NA
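As a side note, gather() and spread() are superseded in current tidyr; the same reshape can be written with pivot_wider() (a sketch, assuming tidyr >= 1.0.0 and the my_data object from the question):
library(dplyr)
library(tidyr)
my_data %>%
  as_tibble() %>%
  select(-time) %>%
  mutate(tenor = paste0("PD_Q", tenor)) %>%                  # turn tenor into the new column names
  pivot_wider(names_from = tenor, values_from = variable_x)  # one row per customer/account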
The code posted at the bottom does a nice job of filling in a dataframe, using package tidyr, so that all IDs end up with the same number of periods, where a period is defined as a number of months ("Period_1" in the code below). The base dataframe testDF has ID 1 with 5 periods, and IDs 50 and 60 with only 3 periods each. The tidyr code creates additional periods ("Period_1") for IDs 50 and 60 so that they too have 5 Period_1 values. The code copies down the "Bal" and "State" fields so that all IDs end up with the same number of Period_1 values, which is correct.
However, how would I extend the calendar month expression of "Period_2" in the same manner, as illustrated immediately below?
Code:
library(tidyr)
testDF <-
data.frame(
ID = as.numeric(c(rep(1,5),rep(50,3),rep(60,3))),
Period_1 = as.numeric(c(1:5,1:3,1:3)),
Period_2 = c("2012-06","2012-07","2012-08","2012-09","2012-10","2013-06","2013-07","2013-08","2012-01","2012-02","2012-03"),
Bal = as.numeric(c(rep(10,5),21:23,36:34)),
State = c("XX","AA","BB","CC","XX","AA","BB","CC","SS","XX","AA")
)
testDFextend <-
testDF %>%
tidyr::complete(ID, nesting(Period_1)) %>%
tidyr::fill(Bal, State, .direction = "down")
testDFextend
Edit: rolling from one year to the next
A better OP example would have Period_2 = c("2012-06","2012-07","2012-08","2012-09","2012-10","2013-06","2013-07","2013-08","2012-10","2012-11","2012-12"), providing an example where extending Period_2 causes a rollover to the next year. Below I build on the tidyr/dplyr answer to roll the year over correctly:
library(tidyr)
library(dplyr)
testDF <-
data.frame(
ID = as.numeric(c(rep(1,5),rep(50,3),rep(60,3))),
Period_1 = as.numeric(c(1:5,1:3,1:3)),
Period_2 = c("2012-06","2012-07","2012-08","2012-09","2012-10","2013-06","2013-07","2013-08","2012-10","2012-11","2012-12"),
Bal = as.numeric(c(rep(10,5),21:23,36:34)),
State = c("XX","AA","BB","CC","XX","AA","BB","CC","SS","XX","AA")
)
testDFextend <-
testDF %>%
tidyr::complete(ID, nesting(Period_1)) %>%
tidyr::fill(Bal, State, .direction = "down")
testDFextend %>%
separate(Period_2, into = c("year", "month"), convert = TRUE) %>%
fill(year) %>%
group_by(ID) %>%
mutate(month = sprintf("%02d", zoo::na.spline(month))) %>%
unite("Period_2", year, month, sep = "-") %>%
# Now I add the below lines:
separate(Period_2, into = c("year", "month"), convert = TRUE) %>%
mutate(month = as.integer(sprintf("%02d", zoo::na.spline(month)))) %>%
mutate(year1 = ifelse(month > 12, year+trunc(month/12), year)) %>%
mutate(month1 = ifelse(month > 12 & month %% 12 != 0, month %% 12, month)) %>%
mutate(month1 = ifelse(month1 < 10, paste0(0,month1),month1)) %>%
unite("Period_2", year1, month1, sep = "-") %>%
select("ID","Period_1","Period_2","Bal","State")
A tidyverse solution based on zoo::na.spline. Note that it does not handle year changes. It's harder than I thought, especially because zoo::na.spline does not seem to work on yearmon format.
library(tidyr)
library(dplyr)
testDFextend %>%
separate(Period_2, into = c("year", "month"), convert = TRUE) %>%
fill(year) %>%
group_by(ID) %>%
mutate(month = sprintf("%02d", zoo::na.spline(month))) %>%
unite("Period_2", year, month, sep = "-")
output
ID Period_1 Period_2 Bal State
<dbl> <dbl> <chr> <dbl> <chr>
1 1 1 2012-06 10 XX
2 1 2 2012-07 10 AA
3 1 3 2012-08 10 BB
4 1 4 2012-09 10 CC
5 1 5 2012-10 10 XX
6 50 1 2013-06 21 AA
7 50 2 2013-07 22 BB
8 50 3 2013-08 23 CC
9 50 4 2013-09 23 CC
10 50 5 2013-10 23 CC
11 60 1 2012-01 36 SS
12 60 2 2012-02 35 XX
13 60 3 2012-03 34 AA
14 60 4 2012-04 34 AA
15 60 5 2012-05 34 AA
By ID, you can strsplit the date and use its elements to create a new data.frame to merge with.
ml <- max(with(testDF, tapply(ID, ID, length))) ## get max. period length
by(testDF, testDF$ID, \(x) {
sp <- strsplit(x$Period_2, '-')
s <- as.numeric(sp[[1]][[2]])
if (ml != nrow(x))
merge(x, data.frame(Period_2=paste0(sp[[1]][[1]], '-', sprintf('%02d', (s + nrow(x)):(s + ml - 1))),
Period_1=(nrow(x) + 1):ml,
ID=x$ID[nrow(x)], Bal=x$Bal[nrow(x)], State=x$State[nrow(x)]), all=TRUE)
else x
}) |> c(make.row.names=FALSE) |> do.call(what=rbind)
# ID Period_1 Period_2 Bal State
# 1 1 1 2012-06 10 XX
# 2 1 2 2012-07 10 AA
# 3 1 3 2012-08 10 BB
# 4 1 4 2012-09 10 CC
# 5 1 5 2012-10 10 XX
# 6 50 1 2013-06 21 AA
# 7 50 2 2013-07 22 BB
# 8 50 3 2013-08 23 CC
# 9 50 4 2013-09 23 CC
# 10 50 5 2013-10 23 CC
# 11 60 1 2012-01 36 SS
# 12 60 2 2012-02 35 XX
# 13 60 3 2012-03 34 AA
# 14 60 4 2012-04 34 AA
# 15 60 5 2012-05 34 AA
Edit
For older R versions (although it's recommended to always use updated software), do:
do.call(c(by(testDF, testDF$ID, function(x) {
sp <- strsplit(x$Period_2, '-')
s <- as.numeric(sp[[1]][[2]])
if (ml != nrow(x))
merge(x, data.frame(Period_2=paste0(sp[[1]][[1]], '-', sprintf('%02d', (s + nrow(x)):(s + ml - 1))),
Period_1=(nrow(x) + 1):ml,
ID=x$ID[nrow(x)], Bal=x$Bal[nrow(x)], State=x$State[nrow(x)]), all=TRUE)
else x
}), make.row.names=FALSE), what=rbind)
For each ID convert Period_2 to yearmon class. This represents year and month without day. Internally it uses year + fraction where fraction = 0, 1/12, ..., 11/12 for the 12 months. Expand it out using seq. Then convert it back to character or omit the format line to keep the result as a yearmon object.
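To make the year-plus-fraction representation concrete, a quick illustration (assuming zoo is installed):
library(zoo)
as.numeric(as.yearmon("2012-06"))  # 2012.417, i.e. 2012 + 5/12 for June
as.yearmon("2012-11") + 2/12       # "Jan 2013" -- adding months rolls the year over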
library(dplyr, exclude = c("filter", "lag"))
library(zoo)
testDFextend %>%
group_by(ID) %>%
mutate(Period_2 = as.yearmon(first(Period_2)) + seq(0, by=1/12, length=n())) %>%
mutate(Period_2 = format(Period_2, "%Y-%m")) %>%
ungroup
giving:
# A tibble: 15 × 5
ID Period_1 Period_2 Bal State
<dbl> <dbl> <chr> <dbl> <chr>
1 1 1 2012-06 10 XX
2 1 2 2012-07 10 AA
3 1 3 2012-08 10 BB
4 1 4 2012-09 10 CC
5 1 5 2012-10 10 XX
6 50 1 2013-06 21 AA
7 50 2 2013-07 22 BB
8 50 3 2013-08 23 CC
9 50 4 2013-09 23 CC
10 50 5 2013-10 23 CC
11 60 1 2012-01 36 SS
12 60 2 2012-02 35 XX
13 60 3 2012-03 34 AA
14 60 4 2012-04 34 AA
15 60 5 2012-05 34 AA
I think the nicest way to do this is to make use of the padr package, which is built to pad data.frames that have missing/incomplete rows.
This uses grouping and cur_data() to make the correct date sequence in Period_2.
library(dplyr)
library(tidyr)
library(padr)
n_periods <- 5
testDF %>%
  pad_int(end_val = n_periods, by = "Period_1", group = "ID") %>%
  group_by(ID) %>%
  mutate(Period_2 = as.Date(paste0(Period_2, "-01"))) %>%
  mutate(Period_2 = seq(cur_data()$Period_2[1], by = "months", length.out = n_periods) %>%
           format("%Y-%m")) %>%
  fill(Bal, State) %>%
  ungroup() %>%
  select(ID, Period_1, Period_2, Bal, State)
ID Period_1 Period_2 Bal State
<dbl> <dbl> <chr> <dbl> <chr>
1 1 1 2012-06 10 XX
2 1 2 2012-07 10 AA
3 1 3 2012-08 10 BB
4 1 4 2012-09 10 CC
5 1 5 2012-10 10 XX
6 50 1 2013-06 21 AA
7 50 2 2013-07 22 BB
8 50 3 2013-08 23 CC
9 50 4 2013-09 23 CC
10 50 5 2013-10 23 CC
11 60 1 2012-01 36 SS
12 60 2 2012-02 35 XX
13 60 3 2012-03 34 AA
14 60 4 2012-04 34 AA
15 60 5 2012-05 34 AA
Note that this will handle cases when the year rolls over to the next year during Period_2.
Finally, you could adjust n_periods if you needed a different number of periods (or use a function to figure it out automatically, like jay.sf's answer).
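If n_periods should come from the data rather than being hard-coded, one simple option (my assumption, mirroring jay.sf's max-length idea) is:
n_periods <- max(table(testDF$ID))  # the longest ID defines the target number of periods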
Let's say I have a dataframe of Name and Value; is there any way to extract BOTH the minimum and maximum values within Name in a single function?
library(dplyr)
set.seed(1)
df <- tibble(Name = rep(LETTERS[1:3], each = 3), Value = sample(1:100, 9))
# A tibble: 9 x 2
Name Value
<chr> <int>
1 A 27
2 A 37
3 A 57
4 B 89
5 B 20
6 B 86
7 C 97
8 C 62
9 C 58
The output should contains TWO columns only (Name and Value).
Thanks in advance!
You can use range to get the min and max values and use it in summarise to get two rows for each Name.
library(dplyr)
df %>%
group_by(Name) %>%
summarise(Value = range(Value), .groups = "drop")
# Name Value
# <chr> <int>
#1 A 27
#2 A 57
#3 B 20
#4 B 89
#5 C 58
#6 C 97
If you have a large dataset, using data.table might be faster.
library(data.table)
setDT(df)[, .(Value = range(Value)), Name]
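One caveat if you are on a newer dplyr (an assumption about your version): from dplyr 1.1.0, summarise() returning more than one row per group is deprecated, and reframe() is the documented replacement:
library(dplyr)
df %>%
  group_by(Name) %>%
  reframe(Value = range(Value))  # same two rows per Name, returned ungrouped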
You can use dplyr::group_by() and dplyr::summarise() like this:
library(dplyr)
set.seed(1)
df <- tibble(Name = rep(LETTERS[1:3], each = 3), Value = sample(1:100, 9))
df %>%
group_by(Name) %>%
summarise(
maximum = max(Value),
minimum = min(Value)
)
This outputs:
# A tibble: 3 × 3
Name maximum minimum
<chr> <int> <int>
1 A 68 1
2 B 87 34
3 C 82 14
What's a little odd is that my original df object looks a little different than yours, in spite of the seed:
# A tibble: 9 × 2
Name Value
<chr> <int>
1 A 68
2 A 39
3 A 1
4 B 34
5 B 87
6 B 43
7 C 14
8 C 82
9 C 59
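The mismatch is almost certainly the change to sample()'s default algorithm in R 3.6.0 rather than the seed itself; if needed, the older draw from the question can be reproduced (a sketch, assuming R >= 3.6.0):
RNGversion("3.5.0")  # switch sample() back to the pre-3.6.0 "Rounding" method (prints a warning)
set.seed(1)
sample(1:100, 9)
#> [1] 27 37 57 89 20 86 97 62 58   # the values shown in the question
RNGkind(sample.kind = "Rejection")  # restore the current default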
I'm currently using rbind() together with slice_min() and slice_max(), but I think it may not be the best way or the most efficient way when the dataframe contains millions of rows.
library(tidyverse)
rbind(df %>% group_by(Name) %>% slice_max(Value),
df %>% group_by(Name) %>% slice_min(Value)) %>%
arrange(Name)
# A tibble: 6 x 2
# Groups: Name [3]
Name Value
<chr> <int>
1 A 57
2 A 27
3 B 89
4 B 20
5 C 97
6 C 58
In base R, the output format can be created with tapply/stack: do a grouped tapply to get the output as a named list of ranges, stack it into a two-column data.frame, and change the column names if needed.
setNames(stack(with(df, tapply(Value, Name, FUN = range)))[2:1], names(df))
Name Value
1 A 27
2 A 57
3 B 20
4 B 89
5 C 58
6 C 97
Using aggregate:
aggregate(Value ~ Name, df, range)
# Name Value.1 Value.2
# 1 A 1 68
# 2 B 34 87
# 3 C 14 82
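Note that aggregate() here stores both range values in a single matrix column named Value; if ordinary columns are preferred, a common follow-up (a small sketch) is:
res <- aggregate(Value ~ Name, df, range)
do.call(data.frame, res)  # splits the matrix column into Value.1 (min) and Value.2 (max)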
I am attempting to reference existing columns in dplyr through a loop. Effectively, I would like to take the operations defined in one table (evaluation in the example below) and apply them to another table (dt in the example below). I do not want to hardcode the column names on the RHS within mutate(); I would like to control which evaluations are performed from the evaluation table below, so that the process is dynamic.
Here is a sample dataframe:
dt = data.frame(
A = c(1:20),
B = c(11:30),
C = c(21:40),
AA = rep(1, 20),
BB = rep(2, 20)
)
Here is a table of sample operations to be performed:
library(dplyr)

evaluation <- data.frame(
  New_Var = c("AA", "BB"),
  Operation = c("(A*2) > B", "(B*2) <= C"),
  Result = c("True", "False")
) %>% mutate_all(as.character)
What I am trying to do is the following:
for (i in 1:nrow(evaluation)) {
var = evaluation$New_Var[i]
dt = dt %>%
rowwise() %>%
mutate(!!var := ifelse(eval(parse(text = evaluation$Operation[i])),
evaluation$Result[i],
!!var))
}
My desired result would be something like this, except that the "AA" entries in the AA column would instead be the original numeric values of the AA column, i.e. 1, 1, 1, 1, 1.
UPDATED:
I believe my syntax in the "False" part of the ifelse statement is incorrect. What is the correct syntax to specify "!!var" in the false portion of the ifelse statement?
I know there are other ways to do it using base R, but I would rather do it through dplyr as it is cleaner code to look at. I am leveraging rowwise() to do it element by element.
Modified data to (a) enforce type consistency for columns AA and BB and (b) ensure that at least one row satisfies the second condition.
dt = tibble(
A = c(1:20),
B = c(10:29), ## Note the change
C = c(21:40),
AA = rep("a", 20), ## Note initialization with strings
BB = rep("b", 20) ## Ditto
)
To make your loop work, you need to convert your code strings into actual expressions. You can use rlang::sym() for variable names and rlang::parse_expr() for everything else.
for (i in 1:nrow(evaluation)) {
  var <- rlang::sym(evaluation$New_Var[i])
  op  <- rlang::parse_expr(evaluation$Operation[i])
  dt <- dt %>%
    rowwise() %>%
    mutate(!!var := ifelse(!!op, evaluation$Result[i], !!var))
}
# # A tibble: 20 x 5
# A B C AA BB
# <int> <int> <int> <chr> <chr>
# 1 1 10 21 a False
# 2 2 11 22 a False
# 3 3 12 23 a b
# 4 4 13 24 a b
# 5 5 14 25 a b
# 6 6 15 26 a b
# 7 7 16 27 a b
# 8 8 17 28 a b
# 9 9 18 29 a b
# 10 10 19 30 True b
# 11 11 20 31 True b
# 12 12 21 32 True b
# 13 13 22 33 True b
# 14 14 23 34 True b
# 15 15 24 35 True b
# 16 16 25 36 True b
# 17 17 26 37 True b
# 18 18 27 38 True b
# 19 19 28 39 True b
# 20 20 29 40 True b
Assuming that Felipe's answer was the functionality you desired, here's a more "tidyverse"/pipe-oriented/functional approach.
Data
library(rlang)
library(dplyr)
library(purrr)
operations <- tibble(
old_var = exprs(A, B),
new_var = exprs(AA, BB),
test = exprs(2*A > B, 2*B <= C),
result = exprs("True", "False")
)
original <- tibble(
A = sample.int(30, 10),
B = sample.int(30, 10),
C = sample.int(30, 10)
)
original
# A tibble: 10 x 3
A B C
<int> <int> <int>
1 4 20 5
2 30 29 11
3 1 27 14
4 2 21 4
5 17 19 24
6 14 25 9
7 5 22 22
8 6 13 7
9 25 4 21
10 12 11 12
Functions
# Here's your reusable functions
generic_mutate <- function(dat, new_var, test, result, old_var) {
dat %>% mutate(!!new_var := ifelse(!!test, !!result, !!old_var))
}
generic_ops <- function(dat, ops) {
pmap(ops, generic_mutate, dat = dat) %>%
reduce(full_join)
}
generic_mutate takes a single original dataframe, a single new_var, etc. It performs the test, adds the new column with the appropriate name and values.
generic_ops is the "vectorized" version. It takes the original dataframe as the first argument and a dataframe of operations as the second. It then maps in parallel over each column of new variable names, tests, etc., and calls generic_mutate on each one. That results in a list of dataframes, each with one added column, which the reduce() then combines back together with sequential full_join()s.
Results
original %>%
generic_ops(operations)
Joining, by = c("A", "B", "C")
# A tibble: 10 x 5
A B C AA BB
<int> <int> <int> <chr> <chr>
1 4 20 5 4 20
2 30 29 11 True 29
3 1 27 14 1 27
4 2 21 4 2 21
5 17 19 24 True 19
6 14 25 9 True 25
7 5 22 22 5 22
8 6 13 7 6 13
9 25 4 21 True False
10 12 11 12 True 11
The magic here is using exprs(...) so you can store NSE names and operations in a tibble without forcing their evaluation. I think this is a lot cleaner than storing names and operations in strings with quotation marks.
How's this:
evaluation = data.frame(
Old_Var = c('A', 'B'),
New_Var = c("AA", "BB"),
Operation = c("(A*2) > B", "(B*2) <= C"),
Result = c("True", "False")
) %>% mutate_all(as.character)
for (i in 1:nrow(evaluation)) {
old <- sym(evaluation$Old_Var[i])
new <- sym(evaluation$New_Var[i])
op <- sym(evaluation$Operation[i])
res <- sym(evaluation$Result[i])
dt <- dt %>%
mutate(!!new := ifelse(!!op, !!res, !!old))
}
EDIT: My last answer doesn't work because rlang tries to find a variable named !!op (e.g. named (A*2) > B) instead of evaluating the expression. I got this to work using a mix of rlang and base R. You can of course follow @Brian's advice and use this solution with pmap. I honestly don't know how well this will perform though, as I think it will evaluate the ifelse once per row, and I am not sure it's a vectorized operation...
library(dplyr)
library(rlang)

dt <- tibble(
A = c(1:20),
B = c(11:30),
C = c(21:40),
AA = rep(1, 20),
BB = rep(2, 20)
)
evaluation = tibble(
Old_Var = c('A', 'B'),
New_Var = c("AA", "BB"),
Operation = c('(A*2) > B', '(B*2) <= C'),
Result = c("True", "False")
)
for (i in 1:nrow(evaluation)) {
old <- evaluation$Old_Var[i]
new <- evaluation$New_Var[i]
op <- evaluation$Operation[i]
res <- evaluation$Result[i]
dt <- dt %>%
mutate(!!sym(new) := eval(parse(text = sprintf('ifelse(%s, "%s", %s)', op, res, old))))
}
One way is to rework the conditions first, then pass them to mutate:
library(dplyr)
library(purrr)
library(rlang)

conds <- parse(text = evaluation$Operation) %>%
  as.list() %>%
  setNames(evaluation$New_Var) %>%
  imap(~ expr(ifelse(!!., "True", !!sym(.y))))
conds
#> $AA
#> ifelse((A * 2) > B, "True", AA)
#>
#> $BB
#> ifelse((B * 2) <= C, "True", BB)
dt %>% mutate(!!!conds)
#> A B C AA BB
#> 1 1 11 21 1 2
#> 2 2 12 22 1 2
#> 3 3 13 23 1 2
#> 4 4 14 24 1 2
#> 5 5 15 25 1 2
#> 6 6 16 26 1 2
#> 7 7 17 27 1 2
#> 8 8 18 28 1 2
#> 9 9 19 29 1 2
#> 10 10 20 30 1 2
#> 11 11 21 31 True 2
#> 12 12 22 32 True 2
#> 13 13 23 33 True 2
#> 14 14 24 34 True 2
#> 15 15 25 35 True 2
#> 16 16 26 36 True 2
#> 17 17 27 37 True 2
#> 18 18 28 38 True 2
#> 19 19 29 39 True 2
#> 20 20 30 40 True 2
My data came to me like this (but with 4000+ records). The following is data for 4 patients. Every time you see surgery OR age reappear, it is referring to a new patient.
col1 = c("surgery", "age", "weight","albumin","abiotics","surgery","age", "weight","BAPPS", "abiotics","surgery", "age","weight","age","weight","BAPPS","albumin")
col2 = c("yes","54","153","normal","2","no","65","134","yes","1","yes","61","210", "46","178","no","low")
testdat = data.frame(col1,col2)
So, to say it again: every time surgery or age appears (surgery isn't always there, but age is), that record and the ones after it pertain to the same patient, until surgery or age appears again.
Thus I somehow need to add an ID column with this data:
ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,4,4,4,4)
testdat$ID = ID
I know how to transpose and melt and all that to put the data into regular format, but how can I create that ID column?
Advice on relevant tags to use is helpful!
Assuming that surgery and age will be the first two pieces of information for each patient, and that each patient will have information that is not age or surgery afterward, this is a solution.
col1 = c("surgery", "age", "weight","albumin","abiotics","surgery","age", "weight","BAPPS", "abiotics","surgery", "age","weight","age","weight","BAPPS","albumin")
col2 = c("yes","54","153","normal","2","no","65","134","yes","1","yes","61","210", "46","178","no","low")
testdat = data.frame(col1,col2)
library(dplyr)
library(tidyr)

# Use a tibble and get rid of factors.
dfTest = as_tibble(testdat) %>%
  mutate_all(as.character)
# A little dplyr magic to see find if the start of a new patient, then give them an id.
dfTest = dfTest %>%
mutate(couldBeStart = if_else(col1 == "surgery" | col1 == "age", T, F)) %>%
mutate(isStart = couldBeStart & !lag(couldBeStart, default = FALSE)) %>%
mutate(patientID = cumsum(isStart)) %>%
select(-couldBeStart, -isStart)
# # A tibble: 17 x 3
# col1 col2 patientID
# <chr> <chr> <int>
# 1 surgery yes 1
# 2 age 54 1
# 3 weight 153 1
# 4 albumin normal 1
# 5 abiotics 2 1
# 6 surgery no 2
# 7 age 65 2
# 8 weight 134 2
# 9 BAPPS yes 2
# 10 abiotics 1 2
# 11 surgery yes 3
# 12 age 61 3
# 13 weight 210 3
# 14 age 46 4
# 15 weight 178 4
# 16 BAPPS no 4
# 17 albumin low 4
# Get the data to a wide workable format.
dfTest %>% spread(col1, col2)
# # A tibble: 4 x 7
# patientID abiotics age albumin BAPPS surgery weight
# <int> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 2 54 normal NA yes 153
# 2 2 1 65 NA yes no 134
# 3 3 NA 61 NA NA yes 210
# 4 4 NA 46 low no NA 178
Using dplyr:
library(dplyr)
testdat = testdat %>%
mutate(patient_counter = cumsum(col1 == 'surgery' | (col1 == 'age' & lag(col1 != 'surgery'))))
This works by checking whether the col1 value is either 'surgery' or 'age', provided 'age' is not preceded by 'surgery'. It then uses cumsum() to get the cumulative sum of the resulting logical vector.
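One edge case worth hedging against (my addition, not part of the original answer): lag() returns NA for the first row, so data that happened to start with 'age' would turn the cumulative sum into NA; supplying a default avoids that:
library(dplyr)
testdat <- testdat %>%
  mutate(patient_counter = cumsum(
    col1 == 'surgery' | (col1 == 'age' & lag(col1 != 'surgery', default = TRUE))
  ))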
You can try the following:
keywords <- c('surgery', 'age')
lgl <- testdat$col1 %in% keywords
testdat$ID <- cumsum(c(0, diff(lgl)) == 1) + 1
col1 col2 ID
1 surgery yes 1
2 age 54 1
3 weight 153 1
4 albumin normal 1
5 abiotics 2 1
6 surgery no 2
7 age 65 2
8 weight 134 2
9 BAPPS yes 2
10 abiotics 1 2
11 surgery yes 3
12 age 61 3
13 weight 210 3
14 age 46 4
15 weight 178 4
16 BAPPS no 4
17 albumin low 4
I have a list of transactions for a lot of people. I wish to find out when each particular person has crossed a particular threshold value of total transactions.
Here is an example of what I have already done:
Example dataset:
library(data.table)  # for setorderv()

df <- data.frame(name = rep(c("a","b"), 4),
                 dates = seq(as.Date("2017-01-01"), by = "month", length.out = 8),
                 amt = 11:18)
setorderv(df, "name")
This gives me the following data frame
name dates amt
1 a 2017-01-01 11
3 a 2017-03-01 13
5 a 2017-05-01 15
7 a 2017-07-01 17
2 b 2017-02-01 12
4 b 2017-04-01 14
6 b 2017-06-01 16
8 b 2017-08-01 18
Then I wrote the following code to find the cumulative sums
df$cumsum <- ave(df$amt, df$name, FUN = cumsum)
This gives me the following data frame:
name dates amt cumsum
1 a 2017-01-01 11 11
3 a 2017-03-01 13 24
5 a 2017-05-01 15 39
7 a 2017-07-01 17 56
2 b 2017-02-01 12 12
4 b 2017-04-01 14 26
6 b 2017-06-01 16 42
8 b 2017-08-01 18 60
Now I want to know when each person crossed 20 and 40. I wrote the following code to find this out:
names <- unique(df$name)
# result_df is assumed to have been created beforehand, with one row per person
for (i in seq_along(names)) {
  x1 <- Position(function(x) x >= 20, df$cumsum[df$name == names[i]])
  x2 <- Position(function(x) x >= 40, df$cumsum[df$name == names[i]])
  result_df[i, ] <- c(names[i],
                      df[df$name == names[i], 2][x1],
                      df[df$name == names[i], 2][x2])
}
This code checks where each threshold is crossed and stores the row number in a variable. It then extracts the value from that row of the second column and stores it in another data frame.
The problem is, this code is really slow. I have over 200,000 people in my data set and over 10 million rows. This code takes about 25 seconds to execute for the first 50 users, which means it is likely to take about 30 hours for the entire dataset.
Is there a faster way to do this?
With dplyr you could group by person, filter where cumsum is above 20 or above 40, and then use slice(1) to select the first relevant row per person. It should be way faster than the for loop.
df <- read.table(text = '
name dates amt cumsum
a 2017-01-01 11 11
a 2017-03-01 13 24
a 2017-05-01 15 39
a 2017-07-01 17 56
b 2017-02-01 12 12
b 2017-04-01 14 26
b 2017-06-01 16 42
b 2017-08-01 18 60', header = T)
library(dplyr)

df %>%
  group_by(name) %>%
  filter(cumsum > 20) %>%
  slice(1)
name dates amt cumsum
<fctr> <fctr> <int> <int>
1 a 2017-03-01 13 24
2 b 2017-04-01 14 26
df %>%
  group_by(name) %>%
  filter(cumsum > 40) %>%
  slice(1)
name dates amt cumsum
<fctr> <fctr> <int> <int>
a 2017-07-01 17 56
b 2017-06-01 16 42
Of course you could subsequently rbind these dataframes and arrange on person. Does this help?
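A sketch of that combining step (with a hypothetical threshold column added so the two sets stay distinguishable):
library(dplyr)
bind_rows(
  df %>% group_by(name) %>% filter(cumsum > 20) %>% slice(1) %>% mutate(threshold = 20),
  df %>% group_by(name) %>% filter(cumsum > 40) %>% slice(1) %>% mutate(threshold = 40)
) %>%
  arrange(name, threshold)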
Using data.table, it could be something like this:
library(data.table)
dt <- data.table(df[order(df$dates), ])
dt[ ,':='(minDate20 = min(dates[cumsum(amt) > 20]), minDate40 = min(dates[cumsum(amt) > 40])), by = .(name)]
dt[dates == minDate20, ]
dt[dates == minDate40, ]
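One caveat to hedge against (my addition, not part of the original answer): min() over an empty Date vector warns and returns Inf, so people who never cross a threshold may need guarding, for example by taking the first qualifying date instead:
library(data.table)
dt <- data.table(df[order(df$dates), ])
dt[, c("minDate20", "minDate40") := .(
  dates[which(cumsum(amt) > 20)[1]],  # NA if 20 is never crossed
  dates[which(cumsum(amt) > 40)[1]]   # NA if 40 is never crossed
), by = name]
dt[dates == minDate20]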