I need a conditional lag: for each row, look back to the last row whose "level" is one lower than the current row's. Whenever type = "yes", I want to go back to the last "no" row that is one level lower and get its quantity. For example, rows 2 and 3 here are type "yes" and level 5. In that case, I'd like to go back to the last level 4 "no" row, get the quantity, and assign it to a new column. When type is "no", no lagging needs to be done.
Data:
row_id level type quantity
1 4 no 100
2 5 yes 110
3 5 yes 115
4 2 no 500
5 2 no 375
6 3 yes 250
7 3 yes 260
8 3 yes 420
Desired output:
row_id level type quantity lagged_quantity
1 4 no 100 NA
2 5 yes 110 100
3 5 yes 115 100
4 2 no 500 NA
5 2 no 375 NA
6 3 yes 250 375
7 3 yes 260 375
8 3 yes 420 375
Data:
structure(list(row_id = c(1, 2, 3, 4, 5, 6, 7, 8), level = c(4,
5, 5, 2, 2, 3, 3, 3), type = c("no", "yes", "yes", "no", "no",
"yes", "yes", "yes"), quantity = c(100, 110, 115, 500, 375, 250,
260, 420)), row.names = c(NA, -8L), class = c("tbl_df", "tbl",
"data.frame"))
Desired output:
structure(list(row_id = c(1, 2, 3, 4, 5, 6, 7, 8), level = c(4,
5, 5, 2, 2, 3, 3, 3), type = c("no", "yes", "yes", "no", "no",
"yes", "yes", "yes"), quantity = c(100, 110, 115, 500, 375, 250,
260, 420), lagged_quantity = c("NA", "100", "100", "NA", "NA",
"375", "375", "375")), row.names = c(NA, -8L), class = c("tbl_df",
"tbl", "data.frame"))
#Mossa
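A direct solution would be:
library(dplyr)
library(tidyr)
df1 %>%
mutate(
# run id that increases each time level drops (not used by the steps below)
level_id = 1 + cumsum(c(1, diff(level)) < 0)
) %>%
# keep quantity only on the "no" rows
mutate(lagged_quantity = if_else(type == "yes", NA_real_, quantity)) %>%
# carry the last "no" quantity down into the "yes" rows
fill(lagged_quantity) %>%
# "no" rows need no lag, so blank them out again
mutate(lagged_quantity = if_else(type == "no", NA_real_, lagged_quantity))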
First we keep the quantity only on the "no" rows, then the missing entries are filled downward with the last known value, and finally the "no" rows, which need no lagging, are set back to NA.
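Note that the fill-based approach takes the most recent "no" row whatever its level. If the level check matters, a minimal base R sketch along these lines may help. The helper name lag_to_lower_level is hypothetical, and the rule it assumes is "the last earlier "no" row whose level is exactly one below the current row's level".

```r
# Data copied from the question
df1 <- data.frame(
  row_id = 1:8,
  level = c(4, 5, 5, 2, 2, 3, 3, 3),
  type = c("no", "yes", "yes", "no", "no", "yes", "yes", "yes"),
  quantity = c(100, 110, 115, 500, 375, 250, 260, 420),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each "yes" row, return the quantity of the last
# earlier "no" row that sits exactly one level lower; NA otherwise.
lag_to_lower_level <- function(level, type, quantity) {
  vapply(seq_along(level), function(i) {
    if (type[i] != "yes") return(NA_real_)
    earlier <- seq_len(i - 1)
    prev <- which(type[earlier] == "no" & level[earlier] == level[i] - 1)
    if (length(prev) == 0) NA_real_ else quantity[max(prev)]
  }, numeric(1))
}

df1$lagged_quantity <- with(df1, lag_to_lower_level(level, type, quantity))
df1
```

On the sample data this gives the same column as the desired output; it only differs from the fill approach when the nearest "no" row is not one level lower.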
An option with data.table
library(data.table)
setDT(df1)[df1[, .(lagged_qty = last(quantity)), .(level, type)][,
lagged_qty := shift(lagged_qty), .(grp = cumsum(type == 'no'))],
lagged_qty := lagged_qty, on = .(level, type)]
-output
> df1
row_id level type quantity lagged_qty
<int> <int> <char> <int> <int>
1: 1 4 no 100 NA
2: 2 5 yes 110 100
3: 3 5 yes 115 100
4: 4 2 no 500 NA
5: 5 2 no 375 NA
6: 6 3 yes 250 375
7: 7 3 yes 260 375
8: 8 3 yes 420 375
Code count AA BB CC
101 1 No NO 4
101 2 Yes NO 5
101 3 Yes NO 10
102 1 Yes NO 7
102 2 Yes NO 40
102 3 Yes NO 6
102 4 No NO 12
I want to apply the following conditions:
If count is 1 within a Code, then AA should be "NO" and BB should be "NO".
For counts strictly between the min and max count within a Code, AA can be "NO" or "YES" and BB should be "NO".
For the max count within a Code, AA should be "NO" and BB should be "NO".
Code count AA BB CC
101 1 No NO 4
101 2 Yes NO 5
102 2 Yes NO 40
102 3 Yes NO 6
102 4 No NO 12
Hi @Darren Tsai, whenever the count column is 1, the row gets deleted completely. Using your code I am getting the output below:
Code count AA BB CC
101 2 Yes NO 5
102 2 Yes NO 40
102 3 Yes NO 6
102 4 No NO 12
A dplyr solution:
library(dplyr)
df %>%
group_by(Code) %>%
mutate(flag = count %in% range(count)) %>%
filter(flag & if_all(c(AA, BB), ~ toupper(.x) == 'NO') | !flag & toupper(BB) == 'NO') %>%
ungroup() %>%
select(-flag)
# # A tibble: 5 × 5
# Code count AA BB CC
# <int> <int> <chr> <chr> <int>
# 1 101 1 No NO 4
# 2 101 2 Yes NO 5
# 3 102 2 Yes NO 40
# 4 102 3 Yes NO 6
# 5 102 4 No NO 12
A base equivalent:
df |>
transform(flag = ave(count, Code, FUN = \(x) x %in% range(x))) |>
subset(flag & toupper(AA) == 'NO' & toupper(BB) == 'NO' | !flag & toupper(BB) == 'NO', -flag)
Data
df <- structure(list(Code = c(101L, 101L, 101L, 102L, 102L, 102L, 102L),
count = c(1L, 2L, 3L, 1L, 2L, 3L, 4L), AA = c("No", "Yes",
"Yes", "Yes", "Yes", "Yes", "No"), BB = c("NO", "NO", "NO", "NO",
"NO", "NO", "NO"), CC = c(4L, 5L, 10L, 7L, 40L, 6L, 12L)), class = "data.frame", row.names = c(NA,-7L))
Update with another dataset
This dataset has 12 rows with 3 IDs: 8540, 2254 and 607. After running my code, the 2nd, 4th and 12th rows are removed.
library(dplyr)
df2 <- structure(list(Unique_Id = c(8540, 8540, 2254, 2254, 607, 607, 607, 607, 607, 607, 607, 607),
AA = c("No", "Yes", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No"),
count = c(1, 2, 1, 2, 1, 2, 3, 4, 5, 6, 7, 8),
BB = c("No", "Yes", "No", "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes")),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -12L))
df2
# A tibble: 12 × 4
Unique_Id AA count BB
<dbl> <chr> <dbl> <chr>
1 8540 No 1 No
2 8540 Yes 2 Yes
3 2254 No 1 No
4 2254 No 2 Yes
5 607 No 1 No
6 607 No 2 No
7 607 No 3 No
8 607 No 4 No
9 607 No 5 No
10 607 No 6 No
11 607 No 7 No
12 607 No 8 Yes
df2 %>%
group_by(Unique_Id) %>%
mutate(flag = count %in% range(count)) %>%
filter(flag & if_all(c(AA, BB), ~ toupper(.x) == 'NO') | !flag & toupper(BB) == 'NO') %>%
ungroup() %>%
select(-flag)
# A tibble: 9 × 4
Unique_Id AA count BB
<dbl> <chr> <dbl> <chr>
1 8540 No 1 No
2 2254 No 1 No
3 607 No 1 No
4 607 No 2 No
5 607 No 3 No
6 607 No 4 No
7 607 No 5 No
8 607 No 6 No
9 607 No 7 No
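To check which rows the rule drops, the filter condition can be written out as a plain logical vector. This is only a base R sketch of the same rule; the variable names is_edge, aa_no, bb_no and keep are illustrative, and the data is the df2 shown above rebuilt as a plain data frame.

```r
df2 <- data.frame(
  Unique_Id = c(8540, 8540, 2254, 2254, 607, 607, 607, 607, 607, 607, 607, 607),
  AA = c("No", "Yes", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No"),
  count = c(1, 2, 1, 2, 1, 2, 3, 4, 5, 6, 7, 8),
  BB = c("No", "Yes", "No", "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes"),
  stringsAsFactors = FALSE
)

# TRUE where count is the minimum or maximum within its Unique_Id
# (ave() returns 0/1 here, so convert back to logical)
is_edge <- as.logical(
  ave(df2$count, df2$Unique_Id, FUN = function(x) x %in% range(x))
)

aa_no <- toupper(df2$AA) == "NO"
bb_no <- toupper(df2$BB) == "NO"

# Edge rows need AA and BB both "NO"; middle rows only need BB "NO"
keep <- (is_edge & aa_no & bb_no) | (!is_edge & bb_no)

which(!keep)   # row numbers that the filter removes
df2[keep, ]
```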
I have some data such as this:
structure(list(id = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3), dead = c(1,
1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0), futime = c(2062, 2062, 2062,
2062, 2151, 2151, 388, 388, 388, 388, 388, 388), event = c("hosp",
"out", "hosp", "out", "hosp", "out", "hosp", "out", "hosp", "out",
"hosp", "out"), event_time = c(36, 52, 775, 776, 1268, 1283,
178, 192, 271, 272, 387, 377.9)), class = "data.frame", row.names = c(NA,
-12L))
and I would like to make it look like this
structure(list(id2 = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3,
3, 3), dead2 = c(NA, NA, NA, NA, 1, NA, NA, 1, NA, NA, NA, NA,
NA, NA, NA), futime2 = c(NA, NA, NA, NA, 2062, NA, NA, 2151,
NA, NA, NA, NA, NA, NA, 388), event2 = c("hosp", "out", "hosp",
"out", "death", "hosp", "out", "death", "hosp", "out", "hosp",
"out", "hosp", "out", "censored"), event_time2 = c(36, 52, 775,
776, 2062, 1268, 1283, 2151, 178, 192, 271, 272, 387, 377.9,
388)), class = "data.frame", row.names = c(NA, -15L))
So basically, I want the dead value and the futime value to appear only in a new last row for each id, with that row's event set to "death" (dead == 1) or "censored" (dead == 0) and its event time set to futime, so that all events are listed sequentially. Thanks
I didn't put the "2"s in the column names of the result, but you can make that change easily enough if you need it.
library(dplyr)
last_rows = df %>%
select(id, dead, futime) %>%
group_by(id) %>%
slice(1) %>%
mutate(
event = ifelse(dead == 1, "death", "censored"),
event_time = futime
)
result = df %>%
mutate(
dead = NA,
futime = NA
) %>%
bind_rows(last_rows) %>%
arrange(id, event_time)
result
# id dead futime event event_time
# 1 1 NA NA hosp 36.0
# 2 1 NA NA out 52.0
# 3 1 NA NA hosp 775.0
# 4 1 NA NA out 776.0
# 5 1 1 2062 death 2062.0
# 6 2 NA NA hosp 1268.0
# 7 2 NA NA out 1283.0
# 8 2 1 2151 death 2151.0
# 9 3 NA NA hosp 178.0
# 10 3 NA NA out 192.0
# 11 3 NA NA hosp 271.0
# 12 3 NA NA out 272.0
# 13 3 NA NA out 377.9
# 14 3 NA NA hosp 387.0
# 15 3 0 388 censored 388.0
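The same steps can also be sketched in base R, in case dplyr is not available. The names terminal and events are illustrative only, and the data is copied from the question.

```r
df <- data.frame(
  id = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3),
  dead = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0),
  futime = c(2062, 2062, 2062, 2062, 2151, 2151, 388, 388, 388, 388, 388, 388),
  event = c("hosp", "out", "hosp", "out", "hosp", "out",
            "hosp", "out", "hosp", "out", "hosp", "out"),
  event_time = c(36, 52, 775, 776, 1268, 1283, 178, 192, 271, 272, 387, 377.9),
  stringsAsFactors = FALSE
)

# One terminal row per id, built from the first row of each id
terminal <- df[!duplicated(df$id), c("id", "dead", "futime")]
terminal$event <- ifelse(terminal$dead == 1, "death", "censored")
terminal$event_time <- terminal$futime

# Blank out dead/futime on the original event rows
events <- df
events$dead <- NA_real_
events$futime <- NA_real_

# Stack, then order by id and time so the terminal row ends up in sequence
result <- rbind(events, terminal[names(events)])
result <- result[order(result$id, result$event_time), ]
rownames(result) <- NULL
result
```

As in the dplyr version, the rows are ordered by event_time, so the 377.9 "out" row for id 3 sorts before the 387 "hosp" row.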
Here is one method using group_modify and add_row
library(dplyr)
library(tibble)
df1 %>%
group_by(id, futime) %>%
group_modify(~ .x %>%
add_row(dead = NA^!last(.x$dead), event_time = last(.y$futime),
event = if(last(.x$dead) == 1) "death" else "censored")) %>%
mutate(across(c(dead), ~ replace(., row_number() != n(), NA))) %>%
group_by(id) %>%
mutate(futime = replace(futime, duplicated(futime, fromLast = TRUE),
NA)) %>%
ungroup
-output
# A tibble: 15 × 5
id futime dead event event_time
<dbl> <dbl> <dbl> <chr> <dbl>
1 1 NA NA hosp 36
2 1 NA NA out 52
3 1 NA NA hosp 775
4 1 NA NA out 776
5 1 2062 1 death 2062
6 2 NA NA hosp 1268
7 2 NA NA out 1283
8 2 2151 1 death 2151
9 3 NA NA hosp 178
10 3 NA NA out 192
11 3 NA NA hosp 271
12 3 NA NA out 272
13 3 NA NA hosp 387
14 3 NA NA out 378.
15 3 388 NA censored 388
It has been a long time since I have manipulated datasets in R, and so far I have been doing it by hand in Excel, but that is cumbersome as my dataset isn't small!
I am interested in the following variables:
PATIENT_ID - appears multiple times in dataset
CYCLE_ID - unique
CYCLE_NO - chronological starting at 1 per PATIENT_ID
CALC_ACT - Yes or No
What I need is to create a new dataset which meets the following criteria:
For a single PATIENT_ID I must have one row where CALC_ACT == "Yes" and one row where CALC_ACT == "No".
The CALC_ACT == "Yes" row has to have a higher (but the closest in number) CYCLE_NO than the CALC_ACT == "No" row.
I just need one match per PATIENT_ID and ignore the rest.
For someone where it has been years since doing any programming, the complexity of this is doing my head in!!!
Thank you so much in advance.
In the example below only rows 3 and 4 should be kept. Row 4 (which is CYCLE_NO == 6) is the first time CALC_ACT is Yes and the preceding cycle is No. The rows for PATIENT_ID 222 and 456 do not meet criteria because they need to have both Yes and No for CALC_ACT. Hopefully this makes sense...
PatientData <- data.frame(PATIENT_ID = c(123, 123, 123, 123, 123, 222, 222, 222, 456, 456), CYCLE_ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), CYCLE_NO = c(1, 3, 4, 6, 7, 2, 3, 4, 1, 2), CALC_ACT = c("Yes", "No", "No", "Yes", "Yes", "No", "No", "No", "Yes", "Yes"))
PatientData
PATIENT_ID CYCLE_ID CYCLE_NO CALC_ACT
1 123 1 1 Yes
2 123 2 3 No
3 123 3 4 No
4 123 4 6 Yes
5 123 5 7 Yes
6 222 6 2 No
7 222 7 3 No
8 222 8 4 No
9 456 9 1 Yes
10 456 10 2 Yes
Here's a function which should work as per the logic.
return_rows <- function(CALC_ACT, CYCLE_NO) {
#indices of 'yes' rows
yes_ind <- which(CALC_ACT == 'Yes')
#indices of 'no' rows
no_ind <- which(CALC_ACT == 'No')
#Only if both the yes and no rows are present else don't select any row
if(length(yes_ind) && length(no_ind)) {
#Get the minimum length of yes and no rows
n <- min(length(yes_ind), length(no_ind))
#Get the first position where the yes row has a higher CYCLE_NO than the no row
ind <- which(CYCLE_NO[yes_ind[1:n]] > CYCLE_NO[no_ind[1:n]])[1]
#Return the corresponding index
sort(c(yes_ind[ind], no_ind[ind]))
} else 0
}
Apply this function for each PATIENT_ID -
PatientData %>%
arrange(PATIENT_ID, CYCLE_NO) %>%
group_by(PATIENT_ID) %>%
slice(return_rows(CALC_ACT, CYCLE_NO)) %>%
ungroup
# PATIENT_ID CYCLE_ID CYCLE_NO CALC_ACT
# <dbl> <dbl> <dbl> <chr>
#1 123 3 4 No
#2 123 4 6 Yes
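A base R variant is sketched below for comparison. It assumes that "closest" means the first "No" cycle that is immediately followed by a "Yes" cycle once each patient's rows are sorted by CYCLE_NO, and it returns no rows for patients without such a pair. The helper name first_no_yes_pair is hypothetical.

```r
PatientData <- data.frame(
  PATIENT_ID = c(123, 123, 123, 123, 123, 222, 222, 222, 456, 456),
  CYCLE_ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  CYCLE_NO = c(1, 3, 4, 6, 7, 2, 3, 4, 1, 2),
  CALC_ACT = c("Yes", "No", "No", "Yes", "Yes", "No", "No", "No", "Yes", "Yes"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rows of the first adjacent No -> Yes pair for one patient
first_no_yes_pair <- function(d) {
  d <- d[order(d$CYCLE_NO), ]
  n <- nrow(d)
  if (n < 2) return(d[0, ])
  # position i is a hit when row i is "No" and row i + 1 is "Yes"
  hit <- which(d$CALC_ACT[-n] == "No" & d$CALC_ACT[-1] == "Yes")
  if (length(hit) == 0) return(d[0, ])
  d[c(hit[1], hit[1] + 1), ]
}

pairs <- do.call(
  rbind,
  lapply(split(PatientData, PatientData$PATIENT_ID), first_no_yes_pair)
)
rownames(pairs) <- NULL
pairs
```

Patients 222 and 456 contribute nothing because their rows never go from "No" to "Yes".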
This code should work
library(data.table)
PatientData <- data.frame(PATIENT_ID = c(123, 123, 123, 123, 123, 222, 222, 222, 456, 456), CYCLE_ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
CYCLE_NO = c(1, 3, 4, 6, 7, 2, 3, 4, 1, 2), CALC_ACT = c("Yes", "No", "No", "Yes", "Yes", "No", "No", "No", "Yes", "Yes"))
d1 = data.table(PatientData[order(PatientData$PATIENT_ID, PatientData$CYCLE_NO),])
d1$CALC_ACT1 = ifelse(d1$CALC_ACT == 'Yes',1,0)
d1[ , diff := CALC_ACT1 - shift(CALC_ACT1), by = PATIENT_ID]
out = d1[c(which(d1$diff == 1), (which(d1$diff == 1) -1)),]
out
I'm trying to translate a mutate_at() to a mutate() using dplyr's new "across" function and a bit stumped.
In a nutshell, I need to compare the values in a series of columns to a "baseline" column. When the values in the columns are higher than the baseline, I need to use the baseline value. When the values in the columns are lower than or equal to the baseline, I need to keep the value. Here's an example dataset (my actual dataset is much larger):
test <- structure(list(baseline = c(5, 7, 8, 4, 9, 1, 0, 46, 47), bob = c(7,
11, 34, 9, 6, 8, 3, 49, 12), sally = c(3, 5, 2, 2, 6, 1, 3, 4,
56), rita = c(6, 4, 6, 7, 6, 0, 3, 11, 3)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -9L), spec = structure(list(
cols = list(baseline = structure(list(), class = c("collector_double",
"collector")), bob = structure(list(), class = c("collector_double",
"collector")), sally = structure(list(), class = c("collector_double",
"collector")), rita = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
My current code uses mutate_at() and works fine:
trial1 <- test %>%
mutate_at(
vars('bob','sally', 'rita'),
funs(case_when(
. > baseline ~ baseline,
. <= baseline ~ .)))
But when I try to update it to reflect across() from dplyr 1.0, I keep getting an error. Here is my attempt:
trial2 <- test %>%
mutate(across(c(bob, sally, rita),
case_when(. > baseline ~ baseline,
. <= baseline ~ .)))
And here is the error:
error: Problem with mutate() input ..1.
x . > baseline ~ baseline, . <= baseline ~ . must be length 36 or one, not 9, 4.
ℹ Input ..1 is across(...)
Any ideas what I might be doing wrong? Does case_when() work with across?
We can use the ~ to specify the anonymous function/lambda function call
library(dplyr)
test %>%
mutate(across(c(bob, sally, rita),
~ case_when(. > baseline ~ baseline,
. <= baseline ~ .)))
-output
# A tibble: 9 x 4
# baseline bob sally rita
# <dbl> <dbl> <dbl> <dbl>
#1 5 5 3 5
#2 7 7 5 4
#3 8 8 2 6
#4 4 4 2 4
#5 9 6 6 6
#6 1 1 1 0
#7 0 0 0 0
#8 46 46 4 11
#9 47 12 47 3
According to ?across, the .fns argument can be any of the following:
Functions to apply to each of the selected columns. Possible values are:
NULL, to returns the columns untransformed.
A function, e.g. mean.
A purrr-style lambda, e.g. ~ mean(.x, na.rm = TRUE)
A list of functions/lambdas, e.g. list(mean = mean, n_miss = ~ sum(is.na(.x)))
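As a small illustration of those forms, the sketch below applies the same capping once with a purrr-style lambda and once with an ordinary anonymous function written inline. The three-row tibble and the result names with_formula and with_function are made up for this example. A bare expression without ~ or function(x) is evaluated immediately rather than per column, which is what triggers the length error in the question.

```r
library(dplyr)

# Toy data, not the dataset from the question
test <- tibble(
  baseline = c(5, 7, 8),
  bob = c(7, 11, 3),
  sally = c(3, 5, 9)
)

# purrr-style lambda: .x is the current column
with_formula <- test %>%
  mutate(across(c(bob, sally), ~ pmin(.x, baseline)))

# ordinary anonymous function written inline: x is the current column
with_function <- test %>%
  mutate(across(c(bob, sally), function(x) pmin(x, baseline)))

with_formula
```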
Also, instead of case_when, we can make use of the pmin
test %>%
mutate(across(c(bob, sally, rita), ~ pmin(baseline, .)))
-output
# A tibble: 9 x 4
# baseline bob sally rita
# <dbl> <dbl> <dbl> <dbl>
#1 5 5 3 5
#2 7 7 5 4
#3 8 8 2 6
#4 4 4 2 4
#5 9 6 6 6
#6 1 1 1 0
#7 0 0 0 0
#8 46 46 4 11
#9 47 12 47 3
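Because pmin() is vectorised, the same capping can also be sketched without dplyr. The snippet below rebuilds the example columns as a plain data frame; the variable name cols is illustrative.

```r
test <- data.frame(
  baseline = c(5, 7, 8, 4, 9, 1, 0, 46, 47),
  bob = c(7, 11, 34, 9, 6, 8, 3, 49, 12),
  sally = c(3, 5, 2, 2, 6, 1, 3, 4, 56),
  rita = c(6, 4, 6, 7, 6, 0, 3, 11, 3)
)

cols <- c("bob", "sally", "rita")

# pmin() takes the element-wise minimum of each column and the baseline
test[cols] <- lapply(test[cols], pmin, test$baseline)
test
```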
I have the following dataset
structure(list(a = c(2, 1, 9, 2, 9, 8), b = c(4, 5, 1, 9, 12,
NA), c = c(50, 34, 77, 88, 33, 60)), class = "data.frame", row.names = c(NA,
-6L))
a b c
1 2 4 50
2 1 5 34
3 9 1 77
4 2 9 88
5 9 12 33
6 8 NA 60
From column b I only want values between 4 and 9, and from column c values between 50 and 80. Values outside those ranges should be replaced with NA, resulting in
structure(list(a = c(2, 1, 9, 2, 9, 8), b = c(4, 5, NA, 9, NA,
NA), c = c(50, NA, 77, NA, NA, 60)), class = "data.frame", row.names = c(NA,
-6L))
a b c
1 2 4 50
2 1 5 NA
3 9 NA 77
4 2 9 NA
5 9 NA NA
6 8 NA 60
I've tried several things with replace_with_na_at function where this seemed most logical:
test <- replace_with_na_at(data = test, .vars="c",
condition = ~.x < 2 & ~.x > 2)
However, nothing I tried works. Does somebody know why? Thanks in advance! :)
You can subset with a logical vector testing your conditions.
x$b[x$b < 4 | x$b > 9] <- NA
x$c[x$c < 50 | x$c > 80] <- NA
x
# a b c
#1 2 4 50
#2 1 5 NA
#3 9 NA 77
#4 2 9 NA
#5 9 NA NA
#6 8 NA 60
Data:
x <- structure(list(a = c(2, 1, 9, 2, 9, 8), b = c(4, 5, 1, 9, 12,
NA), c = c(50, 34, 77, 88, 33, 60)), class = "data.frame", row.names = c(NA,
-6L))
Yet another base R solution, this time with function is.na<-
is.na(test$b) <- with(test, b < 4 | b > 9)
is.na(test$c) <- with(test, c < 50 | c > 80)
A package naniar solution with a pipe could be
library(naniar)
library(magrittr)
test %>%
replace_with_na_at(
.vars = 'b',
condition = ~(.x < 4 | .x > 9)
) %>%
replace_with_na_at(
.vars = 'c',
condition = ~(.x < 50 | .x > 80)
)
You should mention the packages you are using. From googling, I'm guessing you are using naniar. The problem appears to be that you did not properly specify the condition, but the following should work:
library(naniar)
test <- structure(list(a = c(2, 1, 9, 2, 9, 8),
b = c(4, 5, 1, 9, 12, NA),
c = c(50, 34, 77, 88, 33, 60)),
class = "data.frame",
row.names = c(NA, -6L))
replace_with_na_at(test, "c", ~.x < 50 | .x > 80)
#> a b c
#> 1 2 4 50
#> 2 1 5 NA
#> 3 9 1 77
#> 4 2 9 NA
#> 5 9 12 NA
#> 6 8 NA 60
Created on 2020-06-02 by the reprex package (v0.3.0)
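To see why the form of the condition matters, here is a small base R sketch on the values of column c. Joining the two bounds with & asks for values that are below the lower bound and above the upper bound at once, which no value can satisfy; joining them with | marks everything outside the range. The names never and outside are illustrative.

```r
x <- c(50, 34, 77, 88, 33, 60)

# "below 50 AND above 80" can never be true for a single number
never <- x < 50 & x > 80

# "below 50 OR above 80" marks the values outside the range
outside <- x < 50 | x > 80

x[outside] <- NA
x
```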
You simply could use Map to replace your values with NA.
dat[2:3] <- Map(function(x, y) {x[!x %in% y] <- NA;x}, dat[2:3], list(4:9, 50:80))
dat
# a b c
# 1 2 4 50
# 2 1 5 NA
# 3 9 NA 77
# 4 2 9 NA
# 5 9 NA NA
# 6 8 NA 60
Data:
dat <- structure(list(a = c(2, 1, 9, 2, 9, 8), b = c(4, 5, 1, 9, 12,
NA), c = c(50, 34, 77, 88, 33, 60)), class = "data.frame", row.names = c(NA,
-6L))
We can use map2
library(purrr)
library(dplyr)
df1[c('b', 'c')] <- map2(df1 %>%
select(b, c), list(c(4, 9), c(50,80)), ~
replace(.x, .x < .y[1]|.x > .y[2], NA))