dummy variable columns based on strings from other columns [duplicate] - r

I have a database with patient ID numbers and the treatment they received. I would like to have a dummy column for every different INDIVIDUAL treatment (i.e., did the patient receive treatment A, B, C, D).
This example is heavily simplified: I have over 20 treatments and thousands of patients, and I can't figure out a simple way to do this.
example <- data.frame(id_number = c(0, 1, 2, 3, 4),
treatment = c("A", "A+B+C+D", "C+B", "B+A", "C"))
I would like to have something like this:
desired_result <- data.frame(id_number = c(0, 1, 2, 3, 4),
treatment = c("A", "A+B+C+D", "C+B", "B+A","C"),
A=c(1,1,0,1,0),
B=c(0,1,1,1,0),
C=c(0,1,1,0,1),
D=c(0,1,0,0,0))

A base version:
example["A"] <- as.numeric(grepl("A", example[,"treatment"]))
example["B"] <- as.numeric(grepl("B", example[,"treatment"]))
example["C"] <- as.numeric(grepl("C", example[,"treatment"]))
example["D"] <- as.numeric(grepl("D", example[,"treatment"]))
example
id_number treatment A B C D
1 0 A 1 0 0 0
2 1 A+B+C+D 1 1 1 1
3 2 C+B 0 1 1 0
4 3 B+A 1 1 0 0
5 4 C 0 0 1 0
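The grepl function tests for the presence of each pattern in each row, and as.numeric changes the logical TRUE/FALSE to 1/0. Note that grepl matches substrings, so this relies on no treatment name being contained in another.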
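With 20+ treatments, writing one grepl line per treatment gets tedious. Here is a minimal base R sketch that derives the treatment names from the data instead. The names parts, all_treatments and example_dummies are my own, and it assumes "+" is always the separator:

```r
example <- data.frame(id_number = c(0, 1, 2, 3, 4),
                      treatment = c("A", "A+B+C+D", "C+B", "B+A", "C"),
                      stringsAsFactors = FALSE)

# Split each treatment string on the literal "+" separator
parts <- strsplit(example$treatment, "+", fixed = TRUE)

# Every distinct individual treatment found in the data, sorted
all_treatments <- sort(unique(unlist(parts)))

# One 0/1 column per treatment: is that treatment among the row's parts?
dummies <- sapply(all_treatments, function(tr) {
  as.integer(vapply(parts, function(p) tr %in% p, logical(1)))
})

example_dummies <- cbind(example, dummies)
example_dummies
```

Because each row is matched against whole split parts rather than substrings, a treatment name that is contained in another one would not produce a false positive.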

One tidyverse possibility could be:
library(dplyr)
library(tidyr)
example %>%
  mutate(treatment2 = strsplit(treatment, "+", fixed = TRUE)) %>%
  unnest(treatment2) %>%
  spread(treatment2, treatment2) %>%
  mutate_at(vars(-id_number, -treatment), ~ (!is.na(.)) * 1)
id_number treatment A B C D
1 0 A 1 0 0 0
2 1 A+B+C+D 1 1 1 1
3 2 C+B 0 1 1 0
4 3 B+A 1 1 0 0
5 4 C 0 0 1 0
Or:
example %>%
  mutate(treatment2 = strsplit(treatment, "+", fixed = TRUE)) %>%
  unnest(treatment2) %>%
  mutate(val = 1) %>%
  spread(treatment2, val, fill = 0)

Related

Iterating over columns to create flagging variables

I've got a dataset that has a lot of numerical columns (in the example below these columns are x, y, z). I want to create individual flagging variables for each of those columns (x_YN, y_YN, z_YN) such that, if the numerical column is > 0, the flagging variable is = 1 and otherwise it's = 0. What might be the most efficient way to tackle this?
Thanks for the help!
x <- c(3, 7, 0, 10)
y <- c(5, 2, 20, 0)
z <- c(0, 0, 4, 12)
df <- data.frame(x,y,z)
We can compare the whole data frame against 0 to get a logical matrix, then coerce it to 0/1 with unary +:
df[paste0(names(df), "_YN")] <- +(df > 0)
Output:
> df
x y z x_YN y_YN z_YN
1 3 5 0 1 1 0
2 7 2 0 1 1 0
3 0 20 4 0 1 1
4 10 0 12 1 0 1
The dplyr alternative:
library(dplyr)
df %>%
mutate(across(everything(), ~ +(.x > 0), .names = "{col}_YN"))
output
x y z x_YN y_YN z_YN
1 3 5 0 1 1 0
2 7 2 0 1 1 0
3 0 20 4 0 1 1
4 10 0 12 1 0 1
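Both answers above assume every column is numeric. If the data also has non-numeric columns (the id column below is a hypothetical addition, not part of the question), one possible base R sketch is to pick out the numeric columns first:

```r
df <- data.frame(id = c("p1", "p2", "p3", "p4"),  # hypothetical non-numeric column
                 x = c(3, 7, 0, 10),
                 y = c(5, 2, 20, 0),
                 z = c(0, 0, 4, 12))

# Names of the numeric columns only
num_cols <- names(df)[vapply(df, is.numeric, logical(1))]

# Logical matrix (value > 0) coerced to 0/1 with unary +
df[paste0(num_cols, "_YN")] <- +(df[num_cols] > 0)
df
```

The id column is left alone and only x, y and z get a _YN flag.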

How to calculate values for the first row that meets a certain condition?

I have the following dummy dataframe:
t <- data.frame(
a= c(0,0,2,4,5),
b= c(0,0,4,6,5))
a b
0 0
0 0
2 4
4 6
5 5
I want to replace just the first non-zero value in column b. Call the row that meets this criterion i. I want to replace t$b[i] with t$b[i+2] + t$b[i+1], and the rest of t$b should remain the same. So the output would be
a b
0 0
0 0
2 11
4 6
5 5
In fact the dataset is dynamic, so I cannot point directly to a specific row; it has to be whichever row is the first with a non-zero value in column b.
How can I create this new t$b?
Here is a straightforward solution in base R:
t <- data.frame(
a= c(0,0,2,4,5),
b= c(0,0,4,6,5))
ind <- which(t$b != 0)[1L]
t$b[ind] <- t$b[ind+2L] + t$b[ind+1L]
t
a b
1 0 0
2 0 0
3 2 11
4 4 6
5 5 5
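Since the data is dynamic, it may be worth guarding against the cases where there is no non-zero value, or where fewer than two rows follow it (the plain version would then write NA). A minimal sketch, where replace_first_nonzero is a hypothetical helper name and "leave the vector unchanged" is my assumption about the desired behaviour in those cases:

```r
replace_first_nonzero <- function(b) {
  # Index of the first non-zero value (NA if there is none)
  ind <- which(b != 0)[1L]
  # Assumption: leave b unchanged when no replacement can be computed
  if (is.na(ind) || ind + 2L > length(b)) {
    return(b)
  }
  b[ind] <- b[ind + 1L] + b[ind + 2L]
  b
}

t <- data.frame(a = c(0, 0, 2, 4, 5),
                b = c(0, 0, 4, 6, 5))
t$b <- replace_first_nonzero(t$b)
t
```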
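Here is a roundabout way of getting there with a combination of group_by() and mutate(). The sum of the next two values is computed before grouping, so that lead() looks at the following rows of the whole table rather than only at rows in the same group:
library(tidyverse)
t %>%
  mutate(
    b_cond = b != 0,
    row_number = row_number(),
    b_next = lead(b, 1) + lead(b, 2) # sum of the next two rows
  ) %>%
  group_by(b_cond) %>%
  mutate(
    min_row_number = row_number == min(row_number),
    b = if_else(b_cond & min_row_number, b_next, b)
  ) %>%
  ungroup() %>%
  select(a, b) # optional, to get back to original columns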
# A tibble: 5 × 2
a b
<dbl> <dbl>
1 0 0
2 0 0
3 2 11
4 4 6
5 5 5

All possible combinations (sequential)

I am wondering what an efficient approach to the following question would be:
Suppose I have three characters in group 1 and two characters in group 2:
group_1 = c("X", "Y", "Z")
group_2 = c("A", "B")
Clearly, the "all" possible combinations for group_1 and group_2 are given by:
group_1_combs = data.frame(X = c(0,1,0,0,1,1,0,1),
Y = c(0,0,1,0,1,0,1,1),
Z = c(0,0,0,1,0,1,1,1))
group_2_combs = data.frame(A = c(0,1,0,1),
B = c(0,0,1,1))
My question is the following:
(1) How do I go from group_1 to group_1_combs efficiently (given that the character vector might be large).
(2) How do I do an "all possible" combinations of each row of group_1_combs and group_2_combs? Specifically, I want a "final" data.frame where each row of group_1_combs is "permuted" with every row of group_2_combs. This means that the final data.frame would have 8 x 4 rows (since there are 8 rows in group_1_combs and 4 rows in group_2_combs) and 5 columns (X,Y,Z,A,B).
Thanks!
You want expand.grid and merge:
Question 1:
group_1_combs <- expand.grid(setNames(rep(list(c(0, 1)), length(group_1)), group_1))
group_2_combs <- expand.grid(setNames(rep(list(c(0, 1)), length(group_2)), group_2))
Question 2:
> merge(group_1_combs, group_2_combs)
X Y Z A B
1 0 0 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 1 1 0 0 0
5 0 0 1 0 0
6 1 0 1 0 0
7 0 1 1 0 0
...
Or you can go directly to the merged data.frame:
group_12 <- c(group_1, group_2)
expand.grid(setNames(rep(list(c(0, 1)), length(group_12)), group_12))
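To check that the two routes agree, here is a small sketch. binary_grid is a hypothetical helper name wrapping the expand.grid call from the answer. Because the two data frames share no column names, merge() falls back to a cross join:

```r
group_1 <- c("X", "Y", "Z")
group_2 <- c("A", "B")

# All 0/1 combinations for a character vector of variable names
binary_grid <- function(vars) {
  expand.grid(setNames(rep(list(c(0, 1)), length(vars)), vars))
}

group_1_combs <- binary_grid(group_1)  # 2^3 = 8 rows
group_2_combs <- binary_grid(group_2)  # 2^2 = 4 rows

# No common columns, so merge() returns every pairing of rows
crossed <- merge(group_1_combs, group_2_combs)

# Direct route: one grid over all five variables
direct <- binary_grid(c(group_1, group_2))

dim(crossed)
dim(direct)
```

Keep in mind that the number of rows is 2^n for n variables, so with a large character vector the full grid quickly becomes too big to hold in memory.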

extracting unique combinations from a long list of binary variables

I have a dataframe containing a long list of binary variables. Each row represents a participant, and columns represent whether a participant made a certain choice (1) or not (0). For the sakes of simplicity, let's say there's only four binary variables and 6 participants.
df <- data.frame(a = c(0,1,0,1,0,1),
b = c(1,1,1,1,0,1),
c = c(0,0,0,1,1,1),
d = c(1,1,0,0,0,0))
>df
# a b c d
# 1 0 1 0 1
# 2 1 1 0 1
# 3 0 1 0 0
# 4 1 1 1 0
# 5 0 0 1 0
# 6 1 1 1 0
In the dataframe, I want to create a list of columns that reflect each unique combination of variables in df (i.e., abc, abd, bcd, cda). Then, for each row, I want to add value "1" if the row contains the particular combination corresponding to the column. So, if the participant scored 1 on "a", "b", and "c", and 0 on "d" he would have a score 1 in the newly created column "abc", but 0 in the other columns. Ideally, it would look something like this.
>df_updated
# a b c d abc abd bcd cda
# 1 0 1 0 1 0 0 0 0
# 2 1 1 0 1 0 1 0 0
# 3 0 1 0 0 0 0 0 0
# 4 1 1 1 0 1 0 0 0
# 5 0 0 1 0 0 0 0 0
# 6 1 1 1 0 1 0 0 0
The ultimate goal is to have an idea of the frequency of each of the combinations, so I can order them from the most frequently chosen to the least frequently chosen. I've been thinking about this issue for days now, but couldn't find an appropriate answer. I would very much appreciate the help.
Something like this?
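funCombn <- function(data){
  # for one combination of column names x, build the new column name
  # and a 0/1 vector that is 1 where all of those columns are 1
  f <- function(x, data){
    data <- data[x]
    list(
      name = paste(x, collapse = ""),
      vec = apply(data, 1, function(x) +all(as.logical(x)))
    )
  }
  # use the function argument, not the global df
  res <- combn(names(data), 3, f, simplify = FALSE, data = data)
  out <- do.call(cbind.data.frame, lapply(res, '[[', 'vec'))
  names(out) <- sapply(res, '[[', 'name')
  cbind(data, out)
}
funCombn(df)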
# a b c d abc abd acd bcd
#1 0 1 0 1 0 0 0 0
#2 1 1 0 1 0 1 0 0
#3 0 1 0 0 0 0 0 0
#4 1 1 1 0 1 0 0 0
#5 0 0 1 0 0 0 0 0
#6 1 1 1 0 1 0 0 0
Base R option using combn :
n <- 3
cbind(df, do.call(cbind, combn(names(df), n, function(x) {
setNames(data.frame(as.integer(rowSums(df[x] == 1) == n)),
paste0(x, collapse = ''))
}, simplify = FALSE))) -> result
result
# a b c d abc abd acd bcd
#1 0 1 0 1 0 0 0 0
#2 1 1 0 1 0 1 0 0
#3 0 1 0 0 0 0 0 0
#4 1 1 1 0 1 0 0 0
#5 0 0 1 0 0 0 0 0
#6 1 1 1 0 1 0 0 0
Using combn, create all combinations of column names taking n columns at a time. For each combination, assign 1 to the rows where all n columns are 1, and 0 otherwise.
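Since the stated goal is to rank the combinations by how often they were chosen, the 0/1 columns can simply be summed. A sketch of that step (the names combos, flags and freq are mine). Note that, like the answers above, it counts a row for a combination whenever all n of those columns are 1, regardless of the remaining columns:

```r
df <- data.frame(a = c(0, 1, 0, 1, 0, 1),
                 b = c(1, 1, 1, 1, 0, 1),
                 c = c(0, 0, 0, 1, 1, 1),
                 d = c(1, 1, 0, 0, 0, 0))
n <- 3

# All combinations of n column names, as a list of character vectors
combos <- combn(names(df), n, simplify = FALSE)

# One 0/1 column per combination: 1 where all n columns equal 1
flags <- sapply(combos, function(x) as.integer(rowSums(df[x] == 1) == n))
colnames(flags) <- sapply(combos, paste, collapse = "")

# Frequency of each combination, most frequent first
freq <- sort(colSums(flags), decreasing = TRUE)
freq
```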
If you are just looking for a frequency of the combinations (and they don't need to be back in the original data), then you could use something like this:
df <- data.frame(a = c(0,1,0,1,0,1),
b = c(1,1,1,1,0,1),
c = c(0,0,0,1,1,1),
d = c(1,1,0,0,0,0))
n <- names(df)
out <- sapply(n, function(x)ifelse(df[[x]] == 1, x, ""))
combs <- apply(out, 1, paste, collapse="")
sort(table(combs))
# combs
# abd b bd c abc
# 1 1 1 1 2
Ok, so let's use your data, including one row without any 1's:
df <- data.frame(
a = c(0,1,0,1,0,1,0),
b = c(1,1,1,1,0,1,0),
c = c(0,0,0,1,1,1,0),
d = c(1,1,0,0,0,0,0)
)
Now I want to paste together the names of all columns that have a 1, and then make that a wide table (so that every combination gets its own column). Of course, I fill all resulting NAs with 0's. The pipe needs dplyr (or magrittr) to be attached:
library(dplyr)
df2 <- df %>%
dplyr::mutate(
combination = paste0(
ifelse(a == 1, "a", ""), # There is possibly a way to automate this as well using across()
ifelse(b == 1, "b", ""),
ifelse(c == 1, "c", ""),
ifelse(d == 1, "d", "")
),
combination = ifelse(
combination == "",
"nothing",
paste0("comb_", combination)
),
value = ifelse(
is.na(combination),
0,
1
),
i = dplyr::row_number()
) %>%
tidyr::pivot_wider(
names_from = combination,
values_from = value,
names_repair = "unique"
) %>%
replace(., is.na(.), 0) %>%
dplyr::select(-i)
Since you want to order the original df by frequency, you can create a summary of all combinations (excluding those without anything filled in). Then you just make it a long table and pull the column for every combination (arranged by frequency) from the table.
comb_in_order <- df2 %>%
dplyr::select(
-tidyselect::any_of(
c(
names(df),
"nothing" # I think you want these last.
)
)
) %>%
dplyr::summarise(
dplyr::across(
.cols = tidyselect::everything(),
.fns = sum
)
) %>%
tidyr::pivot_longer(
cols = tidyselect::everything(),
names_to = "combination",
values_to = "frequency"
) %>%
dplyr::arrange(
dplyr::desc(frequency)
) %>%
dplyr::pull(combination)
The only thing left to do is to arrange the rows by those combination columns (most frequent first) and then select the original columns again.
df2 %>%
  dplyr::arrange(
    dplyr::across(
      tidyselect::any_of(comb_in_order),
      dplyr::desc
    )
  ) %>%
  dplyr::select(
    tidyselect::any_of(names(df))
  )
This should work for all possible combinations.

Removing a group with conditional statement in r

The conditional statement is that in any event, if there are two or more consecutive rows with values higher than 1, the group should be deleted.
For example:
Event<- c(1,1,1,1,2,2,2,2,2,2,3,3,3,3,3)
Value<- c(1,0,0,0,8,7,1,0,0,0,8,0,0,0,0)
A<- data.frame(Event, Value)
Event Value
1 1
1 0
1 0
1 0
2 8
2 7
2 1
2 0
2 0
2 0
3 8
3 0
3 0
3 0
3 0
In this example the group of event 2 should be deleted because it has two consecutive rows with values higher than 1. So it should look like:
Event Value
1 1
1 0
1 0
1 0
3 8
3 0
3 0
3 0
3 0
Any suggestion?
We can use rle by groups.
library(dplyr)
A %>%
group_by(Event) %>%
filter(!any(with(rle(Value > 1), lengths[values] > 1)))
#Opposite way using all
#filter(all(with(rle(Value > 1), lengths[values] < 2)))
# Event Value
# <dbl> <dbl>
#1 1 1
#2 1 0
#3 1 0
#4 1 0
#5 3 8
#6 3 0
#7 3 0
#8 3 0
#9 3 0
The same logic can be used in base R :
subset(A, !ave(Value > 1, Event, FUN = function(x)
any(with(rle(x), lengths[values] > 1))))
as well as data.table
library(data.table)
setDT(A)[, .SD[!any(with(rle(Value > 1), lengths[values] > 1))], Event]
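To see why the rle-based answers above work, it helps to look at what rle() returns for a single event. A small sketch using the Value vectors from the question (has_consecutive is a hypothetical helper name):

```r
# Values of event 2 from the question
v <- c(8, 7, 1, 0, 0, 0)

r <- rle(v > 1)
r$values   # TRUE FALSE: a run of "> 1" followed by a run of "<= 1"
r$lengths  # 2 4: the lengths of those runs

# TRUE if any run of values > 1 has length 2 or more
has_consecutive <- function(v) {
  r <- rle(v > 1)
  any(r$lengths[r$values] >= 2)
}

has_consecutive(c(1, 0, 0, 0))        # event 1
has_consecutive(c(8, 7, 1, 0, 0, 0))  # event 2
has_consecutive(c(8, 0, 0, 0, 0))     # event 3
```

Only event 2 has a run of two values above 1, so it is the only group that gets dropped.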
Using dplyr and lag() to compare each value with the previous one in its group:
A %>%
  group_by(Event) %>%
  # TRUE where this value and the previous one in the group are both > 1
  mutate(consec = Value > 1 & lag(Value, default = 0) > 1) %>%
  filter(!any(consec)) %>%
  select(-consec)
A base R approach:
# split Value > 1 (TRUE/FALSE) by event into a list of logical vectors
A_split <- split(A$Value > 1, A$Event)
# for each event, count adjacent pairs that are both TRUE;
# "keep" is a named logical vector (names are the A$Event values), TRUE when there are none
keep <- sapply(A_split, function(x) sum(x[-length(x)] & x[-1])) == 0
# convert keep to a numeric vector of the A$Event values to retain
keep <- as.numeric(names(keep)[keep])
# subset A based on keep vector
A[A$Event %in% keep, ]
