Suppose I have the following dataframe
df:
df=data.frame(ID=c(123100,123200,123300,123400,123500),"2014"=c(1,1,1,2,3),"2015"=c(1,1,1,2,3),"2016"=c(2,1,1,2,1), "2017"=c(2,1,1,2,1), "2018"=c(2,3,1,2,1) )
Now I want to find out for which ID the data has changed, and in which year. For example, in 2016 the value for ID 123100 changed from 1 to 2. I would like to add new columns for change (1 = change, 0 = no change), year of change, old value (1, 2 or 3) and new value (1, 2 or 3).
In the end it should look like this:
df_final=data.frame(ID=c(123100,123200,123300,123400,123500),"2014"=c(1,1,1,2,3),"2015"=c(1,1,1,2,3),"2016"=c(2,1,1,2,1), "2017"=c(2,1,1,2,1), "2018"=c(2,3,1,2,1), "change"=c(1,1,0,0,1),
"year"=c(2016, 2018, 0, 0, 2016), "before"=c(1,1,0,0,3), "after"=c(2, 3, 0, 0, 1))
I couldn't find any satisfying solution on here, so I hope you can help me.
Here's a base R method.
It may be best to have the IDs with no change as NA. If you really want zeroes, just change c(NA, NA, NA) in the following code to c(0, 0, 0)
Note that in your example data frames, if you run the code as-is, the column names for each year all start with an "X" - you can prevent this by adding the check.names = FALSE argument to the data.frame function.
cbind(df, setNames(as.data.frame(t(apply(df[-1], 1, function(x) {
y <- which(diff(x) != 0)
if(length(y)) c(as.numeric(names(y)), x[y], x[y+1])
else c(NA, NA, NA)
}))), c("Year", "Before", "After")))
#> ID 2014 2015 2016 2017 2018 Year Before After
#> 1 123100 1 1 2 2 2 2016 1 2
#> 2 123200 1 1 1 1 3 2018 1 3
#> 3 123300 1 1 1 1 1 NA NA NA
#> 4 123400 2 2 2 2 2 NA NA NA
#> 5 123500 3 3 1 1 1 2016 3 1
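To see what the anonymous function does, here is a minimal sketch applied to a single row. The vector x is just the first row of the example data typed in by hand for illustration. diff() keeps the name of the later element of each pair, so which() returns the position of the change, named by the year in which the new value first appears.

```r
# First row of the example data as a named vector (illustrative only)
x <- c("2014" = 1, "2015" = 1, "2016" = 2, "2017" = 2, "2018" = 2)

d <- diff(x)          # differences between consecutive years, named 2015..2018
y <- which(d != 0)    # position(s) in d where the value changed

change_year <- as.numeric(names(y))  # year in which the new value appears
before      <- unname(x[y])          # value just before the change
after       <- unname(x[y + 1])      # value from the change onwards

c(Year = change_year, Before = before, After = after)
```

This assumes at most one change per row, as in the example; with several changes the vectors get longer and the apply() result no longer fits into three columns.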
Data used
df <- structure(list(ID = c(123100, 123200, 123300, 123400, 123500),
`2014` = c(1, 1, 1, 2, 3), `2015` = c(1, 1, 1, 2, 3), `2016` = c(2,
1, 1, 2, 1), `2017` = c(2, 1, 1, 2, 1), `2018` = c(2, 3,
1, 2, 1)), class = "data.frame", row.names = c(NA, -5L))
Created on 2022-06-18 by the reprex package (v2.0.1)
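As a small illustration of the check.names note above, the two calls below differ only in that argument (the object names with_prefix and without_prefix are made up for this sketch):

```r
# Default: syntactically invalid names such as "2014" get an "X" prefix
with_prefix <- data.frame(ID = c(123100, 123200), "2014" = c(1, 1))

# check.names = FALSE keeps the names exactly as given
without_prefix <- data.frame(ID = c(123100, 123200), "2014" = c(1, 1),
                             check.names = FALSE)

names(with_prefix)
names(without_prefix)
```

With the unprefixed names, columns have to be referred to with backticks, for example without_prefix$`2014`.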
Here is an optional tidyverse approach:
library(tidyverse)
# join resume df to current df
dplyr::left_join(df,
# make df long to build groupings by ID
tidyr::pivot_longer(df, -ID) %>%
dplyr::group_by(ID) %>%
# order just to be sure
dplyr::arrange(ID, name) %>%
# generate year number, before and after values
dplyr::mutate(year = readr::parse_number(name),
before = lag(value),
# if there is no after value use current value
after = ifelse(is.na(lead(value)), value, lead(value))) %>%
# filter where preceding uneven current
dplyr::filter(before != value) %>%
# unselect obsolete columns
dplyr::select(-name, -value),
by = "ID") %>%
# fill up empty fields with zeros
dplyr::mutate(dplyr::across(year:after, ~ifelse(is.na(.x), 0, .x)))
ID X2014 X2015 X2016 X2017 X2018 year before after
1 123100 1 1 2 2 2 2016 1 2
2 123200 1 1 1 1 3 2018 1 3
3 123300 1 1 1 1 1 0 0 0
4 123400 2 2 2 2 2 0 0 0
5 123500 3 3 1 1 1 2016 3 1
Another possibility to solve the task within the tidyverse is to work row-wise:
dplyr::rowwise(df) %>%
dplyr::mutate(year = readr::parse_number(names(.)[stringr::str_detect(names(.), pattern = "^X")][c(0, diff(dplyr::c_across(dplyr::starts_with("x")))) != 0][1]),
before = dplyr::c_across(dplyr::starts_with("x"))[dplyr::lead(c(0, diff(dplyr::c_across(dplyr::starts_with("x")))) != 0, default = FALSE)][1],
after = dplyr::coalesce(dplyr::c_across(dplyr::starts_with("x"))[dplyr::lag(c(0, diff(dplyr::c_across(dplyr::starts_with("x")))) != 0, default = FALSE)][1],
dplyr::c_across(dplyr::starts_with("x"))[c(0, diff(dplyr::c_across(dplyr::starts_with("x")))) != 0][1])) %>%
dplyr::ungroup() %>%
dplyr::mutate(dplyr::across(year:after, ~ ifelse(is.na(.x), 0, .x)))
ID X2014 X2015 X2016 X2017 X2018 year before after
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 123100 1 1 2 2 2 2016 1 2
2 123200 1 1 1 1 3 2018 1 3
3 123300 1 1 1 1 1 0 0 0
4 123400 2 2 2 2 2 0 0 0
5 123500 3 3 1 1 1 2016 3 1
matrixStats::rowDiffs might be helpful and faster here.
z <- apply(matrixStats::rowDiffs(as.matrix(df[-1])) != 0, 1, which.max) + 1; d <- dim(df)
m <- matrix(t(df[-1])[c(z + 0:(d[2] - 2)*d[1] - 1, z + 0:(d[2] - 2)*d[1])],,2, di=list(c(), c('before', 'after')))
cbind(df, `[<-`(cbind(change=1, year=substring(names(df[-1])[z], 2), m), z == 2, 1:4, 0))
# ID X2014 X2015 X2016 X2017 X2018 change year before after
# 1 123100 1 1 2 2 2 1 2016 1 2
# 2 123200 1 1 1 1 3 1 2018 1 3
# 3 123300 1 1 1 1 1 0 0 0 0
# 4 123400 2 2 2 2 2 0 0 0 0
# 5 123500 3 3 1 1 1 1 2016 3 1
Data:
df <- structure(list(ID = c(123100, 123200, 123300, 123400, 123500),
X2014 = c(1, 1, 1, 2, 3), X2015 = c(1, 1, 1, 2, 3), X2016 = c(2,
1, 1, 2, 1), X2017 = c(2, 1, 1, 2, 1), X2018 = c(2, 3, 1,
2, 1)), class = "data.frame", row.names = c(NA, -5L))
Related
My data looks like this:
data <- data.frame(grupoaih = c("09081997", "13122006", "09081997", "22031969"),
NMM_PROC_BR = c(1, 1, 0, 1),
NMM_CID = c(0, 1, 1, 0),
CPAV_PROC_BR = c(0, 0, 0, 1),
CPAV_CID = c(1, 1, 0, 1))
grupoaih NMM_PROC_BR NMM_CID CPAV_PROC_BR CPAV_CID
1 09081997 1 0 0 1
2 13122006 1 1 0 1
3 09081997 0 1 0 0
4 22031969 1 0 1 1
How can I assign the value 1 when "grupoaih" is a duplicate so the other 4 variables get filled equally like this:
data2 <- data.frame(grupoaih = c("09081997", "13122006", "09081997", "22031969"),
NMM_PROC_BR = c(1, 1, 1, 1),
NMM_CID = c(1, 1, 1, 0),
CPAV_PROC_BR = c(0, 0, 0, 1),
CPAV_CID = c(1, 1, 1, 1))
grupoaih NMM_PROC_BR NMM_CID CPAV_PROC_BR CPAV_CID
1 09081997 1 1 0 1
2 13122006 1 1 0 1
3 09081997 1 1 0 1
4 22031969 1 0 1 1
This only applies if grupoaih is duplicated and any of the 4 variables are filled with 1. If both are 0 in all variables, they stay as they are.
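You can use group_by() and then n() to check whether there are duplicates. Inside across(), ~ marks a formula (an anonymous function) and . stands for the column's original value. Note that this sets every listed column to 1 whenever the group has more than one row, which is why CPAV_PROC_BR is left out of the column list.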
library(dplyr)
data %>%
group_by(grupoaih) %>%
mutate(across(c("NMM_PROC_BR", "NMM_CID", "CPAV_CID"), ~ifelse(n() > 1, 1, .))) %>%
ungroup()
# # A tibble: 4 × 5
# grupoaih NMM_PROC_BR NMM_CID CPAV_PROC_BR CPAV_CID
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 09081997 1 1 0 1
# 2 13122006 1 1 0 1
# 3 09081997 1 1 0 1
# 4 22031969 1 0 1 1
It could work with max after grouping
library(dplyr)
data %>%
group_by(grupoaih) %>%
mutate(across(everything(), max)) %>%
ungroup
-output
# A tibble: 4 × 5
grupoaih NMM_PROC_BR NMM_CID CPAV_PROC_BR CPAV_CID
<chr> <dbl> <dbl> <dbl> <dbl>
1 09081997 1 1 0 1
2 13122006 1 1 0 1
3 09081997 1 1 0 1
4 22031969 1 0 1 1
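The reason max works here is that the indicators are 0/1, so the maximum within a group is 1 as soon as any row of that grupoaih has a 1, and 0 otherwise. A rough base R sketch of the same idea, swapping in ave() for the dplyr pipeline (flag_cols is just an illustrative name, not something from the question):

```r
data <- data.frame(grupoaih = c("09081997", "13122006", "09081997", "22031969"),
                   NMM_PROC_BR = c(1, 1, 0, 1),
                   NMM_CID = c(0, 1, 1, 0),
                   CPAV_PROC_BR = c(0, 0, 0, 1),
                   CPAV_CID = c(1, 1, 0, 1))

# Every column except the grouping key is treated as a 0/1 indicator
flag_cols <- setdiff(names(data), "grupoaih")

# ave() computes max per grupoaih and recycles it over the rows of that group
data[flag_cols] <- lapply(data[flag_cols],
                          function(col) ave(col, data$grupoaih, FUN = max))
data
```

Groups where all rows are 0 in a column keep 0 in that column, which matches the requirement in the question.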
Or use fmax from collapse
library(collapse)
data[-1] <- fmax(data[-1], data$grupoaih, TRA = 1)
I have a database like this:
id <- c(rep(1,3), rep(2, 3), rep(3, 3))
condition <- c(0, 0, 1, 0, 0, 1, 1, 1, 0)
time_point1 <- c(1, 1, NA)
time_point2 <- c(NA, 1, NA)
time_point3 <- c(NA, NA, NA)
time_point4 <- c(1, NA, NA, 1, NA, NA, NA, NA, 1)
data <- data.frame(id, condition, time_point1, time_point2, time_point3, time_point4)
data
id condition time_point1 time_point2 time_point3 time_point4
1 1 0 1 NA NA 1
2 1 0 1 1 NA NA
3 1 1 NA NA NA NA
4 2 0 1 NA NA 1
5 2 0 1 1 NA NA
6 2 1 NA NA NA NA
7 3 1 1 NA NA NA
8 3 1 1 1 NA NA
9 3 0 NA NA NA 1
I want to make a table showing how many have condition == 1 (n_x) and how many are in each time point (n_t). If there are none, I still want a row with 0. I tried this:
data %>%
pivot_longer(cols = contains("time_point")) %>%
filter (!is.na(value)) %>%
group_by(name) %>%
mutate(n_t = n_distinct(id)) %>%
ungroup() %>%
filter(condition == 1) %>%
group_by(name) %>%
summarise(n_x = n_distinct(id), n_t = first(n_t))
Obtaining this:
name n_x n_t
<chr> <int> <int>
1 time_point1 1 3
2 time_point2 1 3
Desired Outcome: I want this type of table that considers the cases with condition and without it:
name n_x n_t
1 time_point1 2 6
2 time_point2 1 3
3 time_point3 0 0
4 time_point4 0 3
Thank you!
You can use pivot_longer() so that you can group_by() the time points, and then summarise by adding up the values. For n_x, sum condition only over the rows where the time point value is not NA.
data %>%
pivot_longer(cols=c(3:6),names_to = 'point', values_to='values') %>%
group_by(point) %>%
summarise(n_x = sum(condition[!is.na(values)]), n_t = sum(values, na.rm = TRUE))
Output:
# A tibble: 4 x 3
point n_x n_t
<chr> <dbl> <dbl>
1 time_point1 2 6
2 time_point2 1 3
3 time_point3 0 0
4 time_point4 0 3
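For comparison, the same two sums can be sketched in base R without reshaping, looping over the time point columns with sapply(). This is only an alternative illustration under the same logic; time_cols and result are names I made up.

```r
id <- c(rep(1, 3), rep(2, 3), rep(3, 3))
condition <- c(0, 0, 1, 0, 0, 1, 1, 1, 0)
data <- data.frame(id, condition,
                   time_point1 = c(1, 1, NA),      # recycled to 9 rows
                   time_point2 = c(NA, 1, NA),
                   time_point3 = c(NA, NA, NA),
                   time_point4 = c(1, NA, NA, 1, NA, NA, NA, NA, 1))

time_cols <- grep("^time_point", names(data), value = TRUE)

result <- data.frame(
  name = time_cols,
  # sum condition only over rows where the time point is not NA
  n_x = sapply(data[time_cols], function(v) sum(data$condition[!is.na(v)])),
  # add up the non-missing values of the time point itself
  n_t = sapply(data[time_cols], function(v) sum(v, na.rm = TRUE)),
  row.names = NULL
)
result
```

Because sum() of an empty or all-NA vector (with na.rm = TRUE) is 0, time_point3 shows up with zeros instead of being dropped.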
I have a dataframe with 3 columns and I want to assign values to a fourth column of this dataframe if the sum of a condition is met in another row. In this example I want to assign 1 to df[,4], if df[,3]>=2 for each row.
An example of what I want as the output is:
Any help is appreciated.
Thank you,
library(tidyverse)
data <-
tribble(
~ID, ~time1, ~time2,
'jkjkdf', 1, 1,
'kjkj', 1, 0,
'fgf', 1, 1,
'jhkj', 0, 1,
'hgd', 0,0
)
mutate(data, label = if_else(time1 + time2 >= 2, 1, 0))
#> # A tibble: 5 x 4
#> ID time1 time2 label
#> <chr> <dbl> <dbl> <dbl>
#> 1 jkjkdf 1 1 1
#> 2 kjkj 1 0 0
#> 3 fgf 1 1 1
#> 4 jhkj 0 1 0
#> 5 hgd 0 0 0
#or with n time columns
data %>%
rowwise() %>%
mutate(label = if_else(sum(across(starts_with('time'))) >= 2, 1, 0))
#> # A tibble: 5 x 4
#> # Rowwise:
#> ID time1 time2 label
#> <chr> <dbl> <dbl> <dbl>
#> 1 jkjkdf 1 1 1
#> 2 kjkj 1 0 0
#> 3 fgf 1 1 1
#> 4 jhkj 0 1 0
#> 5 hgd 0 0 0
Created on 2021-06-06 by the reprex package (v2.0.0)
Do you want to assign 1 if both time1 and time2 are 1 ?
If there are only two columns you can do -
df$label <- as.integer(df$time1 == 1 & df$time2 == 1)
If there are many such time columns we can take help of rowSums -
cols <- grep('time', names(df))
df$label <- as.integer(rowSums(df[cols] == 1) == length(cols))
df
# a time1 time2 label
#1 a 1 1 1
#2 b 1 0 0
#3 c 1 1 1
#4 d 0 1 0
#5 e 0 0 0
data
Images are not the right way to share data, provide them in a reproducible format.
df <- data.frame(a = letters[1:5],
time1 = c(1, 1, 1, 0, 0),
time2 = c(1, 0, 1, 1, 0))
We could do this in a vectorized way using tidyverse methods - select the columns whose names start with 'time' (starts_with), reduce them to a single vector by adding (+) the corresponding elements, and use the aliases from magrittr to convert the result to binary for the 'label' column. Finally, the result should be assigned (<-) back to the original object if we want the original data to be changed.
library(dplyr)
library(purrr)
library(magrittr)
df %>%
mutate(label = select(cur_data(), starts_with('time')) %>%
reduce(`+`) %>%
is_weakly_greater_than(2) %>%
multiply_by(1))
a time1 time2 label
1 a 1 1 1
2 b 1 0 0
3 c 1 1 1
4 d 0 1 0
5 e 0 0 0
data
df <- structure(list(a = c("a", "b", "c", "d", "e"), time1 = c(1, 1,
1, 0, 0), time2 = c(1, 0, 1, 1, 0)), class = "data.frame", row.names = c(NA,
-5L))
I have a quite big dataframe and I'm trying to add a new variable which is the sum of the three previous rows on a running basis, also it should be grouped by ID. The first three rows per ID should be 0. Here's what it should look like.
ID Var1 VarNew
1 2 0
1 2 0
1 3 0
1 0 7
1 4 5
1 1 7
Here's an example dataframe
ID <- c(1, 1, 1, 1, 1, 1)
Var1 <- c(2, 2, 3, 0, 4, 1)
df <- data.frame(ID, Var1)
You can use any of the package that has rolling calculation function with a window size of 3 and lag the result. For example with zoo::rollsumr.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(VarNew = lag(zoo::rollsumr(Var1, 3, fill = 0), default = 0)) %>%
ungroup
# ID Var1 VarNew
# <dbl> <dbl> <dbl>
#1 1 2 0
#2 1 2 0
#3 1 3 0
#4 1 0 7
#5 1 4 5
#6 1 1 7
You can use filter in ave.
# use x (the values of the current ID), not df$Var1; stats:: avoids dplyr's filter
df$VarNew <- ave(df$Var1, df$ID, FUN=function(x) c(0, 0, 0,
stats::filter(head(x, -1), c(1,1,1), sides=1)[-1:-2]))
df
# ID Var1 VarNew
#1 1 2 0
#2 1 2 0
#3 1 3 0
#4 1 0 7
#5 1 4 5
#6 1 1 7
or using cumsum in combination with head and tail.
df$VarNew <- ave(df$Var1, df$ID, FUN=function(x) {y <- cumsum(c(0, head(x, -1)))
c(0, 0, 0, tail(y, -3) - head(y, -3))})
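The idea behind the cumulative-sum version is that the sum of the three values before row i equals the running total up to row i - 1 minus the running total up to row i - 4. A minimal sketch on the example vector, checked against a plain loop (cs, prev3 and naive are illustrative names, and the vector is assumed to have at least four elements):

```r
x <- c(2, 2, 3, 0, 4, 1)
n <- length(x)

# cs[k + 1] holds the sum of the first k values, so cs[1] is 0
cs <- cumsum(c(0, x))

# For rows 4..n: sum of x[i-3], x[i-2], x[i-1] = cs[i] - cs[i - 3]
i <- 4:n
prev3 <- c(0, 0, 0, cs[i] - cs[i - 3])

# Same thing written out explicitly, for comparison
naive <- c(0, 0, 0, sapply(i, function(k) sum(x[(k - 3):(k - 1)])))

prev3
```

Both vectors should agree with the VarNew column in the question.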
Library runner also helps
library(runner)
df %>% mutate(var_new = sum_run(Var1, k =3, na_pad = T, lag = 1))
ID Var1 var_new
1 1 2 NA
2 1 2 NA
3 1 3 NA
4 1 0 7
5 1 4 5
6 1 1 7
The NAs can easily be replaced with 0 if desired.
I have a dataset of questionnaires filled by patients.
I want to identify them using diagnostic criteria; the criteria I'm struggling with requires at least 3 answers of >= 3 (questions are Likert questions from 1 up to 5).
A MWE of the dataset I'm working on is presented below
data <- structure(list(q1 = c(1, 2, 3, 1, 1, 1, 1, 3, 1, 1), q2 = c(1,
1, 3, 1, 1, 1, 1, 3, 1, 1), q3 = c(1, 1, 1, 1, 3, 3, 1, 1,
1, 1), q4 = c(1, 2, 2, 1, 1, 3, 1, 3, 1, 1), q5 = c(1, 1,
3, 1, 1, 1, 1, 1, 1, 1)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
I've figured out how to identify observations that match at least 1 value >=3 using the following (I do not use all_vars as my dataset is larger than the MWE):
data.match <- data %>%
filter_at(vars(q1, q2, q3, q4, q5), any_vars(. %in% c(3:5)))
data$diagnostic <- ifelse(data$id %in% data.match$id,1,0)
I then back-identified patients using the second line.
The thing is I've not been able to replicate such a strategy to identify patients meeting a determined number of pre-specified values across columns.
In this specific example, I'd like to identify patients 3 and 8. I've tried using rowSums but it seems to me that the number of possible combinations is too high.
Using dplyr, you could use rowwise with c_across :
library(dplyr)
result <- data %>%
rowwise() %>%
mutate(diagnostic = as.integer(sum(c_across(starts_with('q')) >= 3) >= 3))
result
# q1 q2 q3 q4 q5 diagnostic
# <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 1 1 1 1 1 0
# 2 2 1 1 2 1 0
# 3 3 3 1 2 3 1
# 4 1 1 1 1 1 0
# 5 1 1 3 1 1 0
# 6 1 1 3 3 1 0
# 7 1 1 1 1 1 0
# 8 3 3 1 3 1 1
# 9 1 1 1 1 1 0
#10 1 1 1 1 1 0
Perhaps, we can use rowSums
data$diagnostic <- +(rowSums(data >=3) >= 3)
data$diagnostic
#[1] 0 0 1 0 0 0 0 1 0 0
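To spell out why rowSums is enough here: comparing the data with 3 gives a logical matrix, and rowSums counts the TRUE values per row, so no combinations have to be enumerated. A small sketch that also restricts the comparison to the question columns, assuming they are all named q followed by a number (q_cols and hits are illustrative names):

```r
data <- data.frame(q1 = c(1, 2, 3, 1, 1, 1, 1, 3, 1, 1),
                   q2 = c(1, 1, 3, 1, 1, 1, 1, 3, 1, 1),
                   q3 = c(1, 1, 1, 1, 3, 3, 1, 1, 1, 1),
                   q4 = c(1, 2, 2, 1, 1, 3, 1, 3, 1, 1),
                   q5 = c(1, 1, 3, 1, 1, 1, 1, 1, 1, 1))

# Only the questionnaire items, so other columns do not affect the count
q_cols <- grep("^q[0-9]+$", names(data), value = TRUE)

# Number of answers >= 3 per patient
hits <- rowSums(data[q_cols] >= 3)

# Criterion: at least 3 such answers
data$diagnostic <- as.integer(hits >= 3)
which(data$diagnostic == 1)
```

Selecting the columns first also means the code can be re-run safely after the diagnostic column has been added.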