R: find character string from vector, create new TRUE/FALSE columns

I have a data frame like this:
df<-structure(list(MRN = c("53634", "65708", "72122", "40458", "03935",
"67473", "20281", "52479", "10261", "40945", "40630", "92295",
"43505", "80719", "39492", "44720", "70691", "21351", "03457",
"02182"), Outcome_Diagnosis_1 = c(NA, NA, NA, "Seroma of breast [N64.89]",
"Breast implant capsular contracture [T85.44XA]; Breast implant capsular contracture [T85.44XA]; Breast implant capsular contracture [T85.44XA]",
NA, NA, NA, "Acquired breast deformity [N64.89]", NA, NA, NA,
NA, "Acquired breast deformity [N64.89]", NA, NA, NA, NA, NA,
NA), Outcome_Diagnosis_2 = c(NA, NA, NA, "Extrusion of breast implant, initial encounter [T85.49XA]; Extrusion of breast implant, initial encounter [T85.49XA]; Extrusion of breast implant, initial encounter [T85.49XA]",
NA, NA, NA, NA, NA, NA, NA, NA, NA, "Capsular contracture of breast implant, subsequent encounter [T85.44XD]; Capsular contracture of breast implant, subsequent encounter [T85.44XD]; Capsular contracture of breast implant, subsequent encounter [T85.44XD]",
NA, NA, NA, NA, NA, NA), Outcome_Diagnosis_3 = c(NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Acquired breast deformity [N64.89]; Capsular contracture of breast implant, initial encounter [T85.44XA]; Capsular contracture of breast implant, initial encounter [T85.44XA]; Capsular contracture of breast implant, initial encounter [T85.44XA]",
NA, NA, NA, NA, NA, NA)), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
And I have a few vectors like this:
Infection<-c("L76","L00", "L01","L02","L03","L04", "L05","L08")
Hematoma<-c("N64.89","M79.81")
Seroma<- c("L76.34")
Necrosis<- c("N64.1","T86.821")
CapsularContracture<- c("T85.44")
MechanicalComplications<- c("T85", "T85.4", "T85.41", "T85.42", "T85.43", "T85.49")
What I'd like to do is create new columns in the data frame that are TRUE/FALSE for if that vector was found in each row. (And it would just be TRUE even if it shows up multiple times in that row, i.e. it doesn't need to "count" them).
So the output I want would be something like this:
The reason I am struggling and came to stack for help is I don't really know how to combine searching for particular strings (that might be within a longer sentence in that column) and looking over multiple columns.
Additional Info that might be important:
There are more columns than just those 3 outcome diagnoses; it'd be useful if the answer looked through the entire row regardless of how many columns there are.
Sometimes those codes aren't specific enough, and it'd probably be useful to look for the actual words like "Seroma". I imagine that'd just be a case of swapping out the characters inside the quotes, right?

You could store your vectors in a list:
lst <- list(Infection = c("L76","L00", "L01","L02","L03","L04", "L05","L08"),
Hematoma = c("N64.89","M79.81"),
Seroma = c("L76.34"),
Necrosis = c("N64.1","T86.821"),
CapsularContracture = c("T85.44"),
MechanicalComplications = c("T85", "T85.4", "T85.41", "T85.42", "T85.43", "T85.49"))
And then, using dplyr and purrr, you could do:
library(dplyr)
library(purrr)

imap(lst,
     ~ df %>%
       mutate(!!.y := reduce(across(Outcome_Diagnosis_1:Outcome_Diagnosis_3,
                                    function(y) grepl(paste(sub("\\.", "", .x), collapse = "|"),
                                                      sub("\\.", "", y))),
                             `|`))) %>%
  reduce(full_join)
MRN Outcome_Diagnos… Outcome_Diagnos… Outcome_Diagnos… Infection Hematoma Seroma Necrosis CapsularContrac…
<chr> <chr> <chr> <chr> <lgl> <lgl> <lgl> <lgl> <lgl>
1 53634 <NA> <NA> <NA> FALSE FALSE FALSE FALSE FALSE
2 65708 <NA> <NA> <NA> FALSE FALSE FALSE FALSE FALSE
3 72122 <NA> <NA> <NA> FALSE FALSE FALSE FALSE FALSE
4 40458 Seroma of breas… Extrusion of br… <NA> FALSE TRUE FALSE FALSE FALSE
5 03935 Breast implant … <NA> <NA> FALSE FALSE FALSE FALSE TRUE
6 67473 <NA> <NA> <NA> FALSE FALSE FALSE FALSE FALSE
7 20281 <NA> <NA> <NA> FALSE FALSE FALSE FALSE FALSE
8 52479 <NA> <NA> <NA> FALSE FALSE FALSE FALSE FALSE
9 10261 Acquired breast… <NA> <NA> FALSE TRUE FALSE FALSE FALSE
10 40945 <NA> <NA> <NA> FALSE FALSE FALSE FALSE FALSE

Up front
out <- lapply(manythings, function(thing) {
  rowSums(
    do.call(cbind, lapply(df[, 2:4], function(col) {
      Vectorize(grepl, vectorize.args = "pattern")(thing, col, fixed = TRUE)
    }))
  ) > 0
})
tibble(cbind(df, out))
# # A tibble: 20 x 10
# MRN Outcome_Diagnosis_1 Outcome_Diagnosis_2 Outcome_Diagnosis_3 Infection Hematoma Seroma Necrosis CapsularContrac~ MechanicalCompl~
# <chr> <chr> <chr> <chr> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
# 1 53634 <NA> <NA> <NA> FALSE FALSE FALSE FALSE FALSE FALSE
# 2 65708 <NA> <NA> <NA> FALSE FALSE FALSE FALSE FALSE FALSE
# 3 72122 <NA> <NA> <NA> FALSE FALSE FALSE FALSE FALSE FALSE
# 4 40458 Seroma of breast [N6~ Extrusion of breast ~ <NA> FALSE TRUE FALSE FALSE FALSE TRUE
# 5 03935 Breast implant capsu~ <NA> <NA> FALSE FALSE FALSE FALSE TRUE TRUE
# 6 67473 <NA> <NA> <NA> FALSE FALSE FALSE FALSE FALSE FALSE
# 7 20281 <NA> <NA> <NA> FALSE FALSE FALSE FALSE FALSE FALSE
# 8 52479 <NA> <NA> <NA> FALSE FALSE FALSE FALSE FALSE FALSE
# 9 10261 Acquired breast defo~ <NA> <NA> FALSE TRUE FALSE FALSE FALSE FALSE
# 10 40945 <NA> <NA> <NA> FALSE FALSE FALSE FALSE FALSE FALSE
# 11 40630 <NA> <NA> <NA> FALSE FALSE FALSE FALSE FALSE FALSE
# 12 92295 <NA> <NA> <NA> FALSE FALSE FALSE FALSE FALSE FALSE
# 13 43505 <NA> <NA> <NA> FALSE FALSE FALSE FALSE FALSE FALSE
# 14 80719 Acquired breast defo~ Capsular contracture~ Acquired breast defo~ FALSE TRUE FALSE FALSE TRUE TRUE
# 15 39492 <NA> <NA> <NA> FALSE FALSE FALSE FALSE FALSE FALSE
# 16 44720 <NA> <NA> <NA> FALSE FALSE FALSE FALSE FALSE FALSE
# 17 70691 <NA> <NA> <NA> FALSE FALSE FALSE FALSE FALSE FALSE
# 18 21351 <NA> <NA> <NA> FALSE FALSE FALSE FALSE FALSE FALSE
# 19 03457 <NA> <NA> <NA> FALSE FALSE FALSE FALSE FALSE FALSE
# 20 02182 <NA> <NA> <NA> FALSE FALSE FALSE FALSE FALSE FALSE
Walk-through
Same as @tmfmnk, I recommend putting all of your patterns into a named list:
manythings <- list(
Infection = c("L76","L00", "L01","L02","L03","L04", "L05","L08"),
Hematoma = c("N64.89","M79.81"),
Seroma = c("L76.34"),
Necrosis = c("N64.1","T86.821"),
CapsularContracture = c("T85.44"),
MechanicalComplications = c("T85", "T85.4", "T85.41", "T85.42", "T85.43", "T85.49"))
Also, note that while grepl is good for this, it does not vectorize its pattern= argument, so we need to do that externally. Further, since some of your patterns contain regex-special characters (e.g., ".", which matches any character), we need to guard against regex injection: if we aren't careful, the pattern "N64.89" will incorrectly match "N64989". For this I use fixed=TRUE as a safeguard. Unfortunately, that also prevents us from collapsing all the patterns into one alternation and checking them in a single step. Instead, we'll vectorize over the patterns, search for each fixed string (a single element of one of your vectors of patterns), and aggregate the results.
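To make the injection risk concrete, here is a minimal illustration (not part of the original answer) of why fixed = TRUE matters:

```r
# "." in a regex matches any character, so an unescaped ICD code can
# produce false positives:
grepl("N64.89", "N64989")                # TRUE  -- regex wildcard match
grepl("N64.89", "N64989", fixed = TRUE)  # FALSE -- literal match only
grepl("N64.89", "N64.89", fixed = TRUE)  # TRUE
```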
So let's do one of the pattern-vectors against one column of the frame:
Vectorize(grepl, vectorize.args = "pattern")(manythings[[2]], df[[2]], fixed = TRUE)
# N64.89 M79.81
# [1,] FALSE FALSE
# [2,] FALSE FALSE
# [3,] FALSE FALSE
# [4,] TRUE FALSE
# [5,] FALSE FALSE
# [6,] FALSE FALSE
# [7,] FALSE FALSE
# [8,] FALSE FALSE
# [9,] TRUE FALSE
# [10,] FALSE FALSE
# [11,] FALSE FALSE
# [12,] FALSE FALSE
# [13,] FALSE FALSE
# [14,] TRUE FALSE
# [15,] FALSE FALSE
# [16,] FALSE FALSE
# [17,] FALSE FALSE
# [18,] FALSE FALSE
# [19,] FALSE FALSE
# [20,] FALSE FALSE
Now we can reduce that so that we know if any of the patterns is found within each cell of this one column:
rowSums(
Vectorize(grepl, vectorize.args = "pattern")(manythings[[2]], df[[2]], fixed = TRUE)
) > 0
# [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Now we can iterate that process over each of the vectors of patterns within manythings:
lapply(manythings, function(thing) {
  rowSums(
    Vectorize(grepl, vectorize.args = "pattern")(thing, df[[2]], fixed = TRUE)
  ) > 0
})
# $Infection
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# $Hematoma
# [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
# $Seroma
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# $Necrosis
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# $CapsularContracture
# [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# $MechanicalComplications
# [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
All of this has been to a single column, df[[2]]. In order to apply this across multiple (selectable) columns, I'll employ some tricks with column-binding (in the code at the top). To break that down,
lapply(df[,2:4], ...) subsets the data we want to search down to just a few columns. Any way you want to select columns will fit here. This will return a list of matrices, something like:
lapply(df[,2:4], function(col) Vectorize(grepl, vectorize.args = "pattern")(thing, col, fixed = TRUE))
# $Outcome_Diagnosis_1
# N64.89 M79.81
# [1,] FALSE FALSE
# [2,] FALSE FALSE
# [3,] FALSE FALSE
# ...
# $Outcome_Diagnosis_2
# N64.89 M79.81
# [1,] FALSE FALSE
# [2,] FALSE FALSE
# [3,] FALSE FALSE
# ...
# $Outcome_Diagnosis_3
# N64.89 M79.81
# [1,] FALSE FALSE
# [2,] FALSE FALSE
# [3,] FALSE FALSE
# ...
do.call(cbind, ...) will take each of those embedded matrices and combine them into a single matrix:
do.call(cbind, lapply(df[,2:4], function(col) Vectorize(grepl, vectorize.args = "pattern")(thing, col, fixed = TRUE)))
# N64.89 M79.81 N64.89 M79.81 N64.89 M79.81
# [1,] FALSE FALSE FALSE FALSE FALSE FALSE
# [2,] FALSE FALSE FALSE FALSE FALSE FALSE
# [3,] FALSE FALSE FALSE FALSE FALSE FALSE
# [4,] TRUE FALSE FALSE FALSE FALSE FALSE
# ...
which allows us to use rowSums(.) > 0 to determine, for each row, whether any pattern matched in any column.
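As a side note, the rowSums(.) > 0 idiom is just a vectorized way of asking whether any entry in a row is TRUE; a minimal sketch of the equivalence:

```r
m <- matrix(c(TRUE,  FALSE,
              FALSE, FALSE,
              TRUE,  TRUE), nrow = 3, byrow = TRUE)
rowSums(m) > 0    # TRUE FALSE TRUE -- is any entry in each row TRUE?
apply(m, 1, any)  # same result, arguably more readable but slower on big data
```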

Here is another base R solution you could use, albeit similar to some extent. As cleverly pointed out by @r2evans, I also switched my pattern matching to fixed = TRUE, which I was not aware of in the first place:
cbind(df, as.data.frame(do.call(cbind, lst |>
  lapply(function(a) {
    sapply(a, function(b) {
      apply(df[-1], 1, function(c) as.logical(Reduce(`+`, grepl(b, c, fixed = TRUE))))
    }) |> rowSums() |> as.logical()
  }))))
Infection Hematoma Seroma Necrosis CapsularContracture MechanicalComplications
1 FALSE FALSE FALSE FALSE FALSE FALSE
2 FALSE FALSE FALSE FALSE FALSE FALSE
3 FALSE FALSE FALSE FALSE FALSE FALSE
4 FALSE TRUE FALSE FALSE FALSE TRUE
5 FALSE FALSE FALSE FALSE TRUE TRUE
6 FALSE FALSE FALSE FALSE FALSE FALSE
7 FALSE FALSE FALSE FALSE FALSE FALSE
8 FALSE FALSE FALSE FALSE FALSE FALSE
9 FALSE TRUE FALSE FALSE FALSE FALSE
10 FALSE FALSE FALSE FALSE FALSE FALSE
11 FALSE FALSE FALSE FALSE FALSE FALSE
12 FALSE FALSE FALSE FALSE FALSE FALSE
13 FALSE FALSE FALSE FALSE FALSE FALSE
14 FALSE TRUE FALSE FALSE TRUE TRUE
15 FALSE FALSE FALSE FALSE FALSE FALSE
16 FALSE FALSE FALSE FALSE FALSE FALSE
17 FALSE FALSE FALSE FALSE FALSE FALSE
18 FALSE FALSE FALSE FALSE FALSE FALSE
19 FALSE FALSE FALSE FALSE FALSE FALSE
20 FALSE FALSE FALSE FALSE FALSE FALSE
To keep the output compact, I show only the newly created columns here, but the code binds them to the original data set.
lst <- list(Infection = c("L76", "L00", "L01", "L02", "L03", "L04",
"L05", "L08"), Hematoma = c("N64.89", "M79.81"), Seroma = "L76.34",
Necrosis = c("N64.1", "T86.821"), CapsularContracture = "T85.44",
MechanicalComplications = c("T85", "T85.4", "T85.41", "T85.42",
"T85.43", "T85.49"))

Related

How to convert columns to multiple boolean columns with tidyverse

I have a group of columns for each time, and I want to convert them into many boolean columns (one per category) with mutate() and across(), like this:
data <- data.frame(category_t1 = c("A","B","C","C","A","B"),
category_t2 = c("A","C","B","B","B",NA),
category_t3 = c("C","C",NA,"B",NA,"A"))
data %>% mutate(across(starts_with("category"),
~case_when(.x == "A" ~ TRUE, !is.na(.x) ~ FALSE),
.names = "{str_replace(.col, 'category', 'A')}"),
across(starts_with("category"),
~case_when(.x == "B" ~ TRUE, !is.na(.x) ~ FALSE),
.names = "{str_replace(.col, 'category', 'B')}"),
across(starts_with("category"),
~case_when(.x == "C" ~ TRUE, !is.na(.x) ~ FALSE),
.names = "{str_replace(.col, 'category', 'C')}"))
Which makes :
  category_t1 category_t2 category_t3  A_t1  A_t2  A_t3  B_t1  B_t2  B_t3  C_t1  C_t2  C_t3
1           A           A           C  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
2           B           C           C FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE
3           C           B        <NA> FALSE FALSE    NA FALSE  TRUE    NA  TRUE FALSE    NA
4           C           B           B FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE
5           A           B        <NA>  TRUE FALSE    NA FALSE  TRUE    NA FALSE FALSE    NA
6           B        <NA>           A FALSE    NA  TRUE  TRUE    NA FALSE FALSE    NA FALSE
It works, but I would like to know if there is a better approach, because here I am repeating the same code three times instead of writing it once (and imagine if I had to repeat it 10 times...). I thought I could do it with map() but I didn't manage to make it work.
I think the problem is that the .names argument in across() cannot connect with the string I use in case_when().
I think maybe there is something to do in the ... argument, like :
data %>% mutate(across(starts_with("category"),
~case_when(.x == mod ~ TRUE, !is.na(.x) ~ FALSE),
mod = levels(as.factor(data$category_t1)),
.names = "{str_replace(.col, 'category', mod)}"))
But of course that doesn't work here. Do you know how to do that ?
Thanks a lot.
We may use table in across
library(dplyr)
library(stringr)
library(tidyr)
data %>%
mutate(across(everything(), ~ as.data.frame.matrix(table(row_number(), .x) *
NA^(is.na(.x)) > 0),
.names = "{str_remove(.col, 'category_')}")) %>%
unpack(where(is.data.frame), names_sep = ".")
-output
# A tibble: 6 × 12
category_t1 category_t2 category_t3 t1.A t1.B t1.C t2.A t2.B t2.C t3.A t3.B t3.C
<chr> <chr> <chr> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
1 A A C TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
2 B C C FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
3 C B <NA> FALSE FALSE TRUE FALSE TRUE FALSE NA NA NA
4 C B B FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE
5 A B <NA> TRUE FALSE FALSE FALSE TRUE FALSE NA NA NA
6 B <NA> A FALSE TRUE FALSE NA NA NA TRUE FALSE FALSE
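The table() trick may be easier to see on a single vector (a small sketch, not from the original answer): tabulating position against value yields one indicator column per level, and multiplying by NA^is.na(.) propagates missingness, since NA^TRUE is NA while NA^FALSE is 1.

```r
x <- c("A", "C", NA, "B")
tab <- table(seq_along(x), x)   # rows = positions 1..4, cols = levels A, B, C
# tab has a 1 where that position holds that level; the NA position (row 3)
# keeps its row label but is all zeros, because table() drops NA pairs.
res <- (tab * NA^is.na(x)) > 0  # NA^TRUE is NA, so row 3 becomes NA, not FALSE
res["1", "A"]                   # TRUE
res["3", ]                      # all NA
```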
Or use model.matrix from base R
data1 <- replace(data, is.na(data), "NA")
lvls <- lapply(data1, \(x) levels(factor(x, levels = c("NA", "A", "B", "C"))))
m1 <- model.matrix(~ 0 + ., data = data1, xlev = lvls)
out <- cbind(data, m1[, -grep("NA", colnames(m1))] > 0)
-output
out
category_t1 category_t2 category_t3 category_t1A category_t1B category_t1C category_t2A category_t2B category_t2C category_t3A category_t3B category_t3C
1 A A C TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
2 B C C FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
3 C B <NA> FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE
4 C B B FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE
5 A B <NA> TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
6 B <NA> A FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
> colnames(out)
 [1] "category_t1"  "category_t2"  "category_t3"
 [4] "category_t1A" "category_t1B" "category_t1C"
 [7] "category_t2A" "category_t2B" "category_t2C"
[10] "category_t3A" "category_t3B" "category_t3C"
Or another option with table
cbind(data, do.call(cbind.data.frame,
lapply(data, \(x) (table(seq_along(x), x)* NA^is.na(x)) > 0)))
-output
category_t1 category_t2 category_t3 category_t1.A category_t1.B category_t1.C category_t2.A category_t2.B category_t2.C category_t3.A category_t3.B
1 A A C TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
2 B C C FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
3 C B <NA> FALSE FALSE TRUE FALSE TRUE FALSE NA NA
4 C B B FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE
5 A B <NA> TRUE FALSE FALSE FALSE TRUE FALSE NA NA
6 B <NA> A FALSE TRUE FALSE NA NA NA TRUE FALSE
category_t3.C
1 TRUE
2 TRUE
3 NA
4 FALSE
5 NA
6 FALSE
Not a tidyverse option (although pipe-compatible), it is very easily doable with package fastDummies:
fastDummies::dummy_cols(data, ignore_na = TRUE)
category_t1 category_t2 category_t3 category_t1_A category_t1_B category_t1_C category_t2_A category_t2_B category_t2_C category_t3_A category_t3_B category_t3_C
1 A A C 1 0 0 1 0 0 0 0 1
2 B C C 0 1 0 0 0 1 0 0 1
3 C B <NA> 0 0 1 0 1 0 NA NA NA
4 C B B 0 0 1 0 1 0 0 1 0
5 A B <NA> 1 0 0 0 1 0 NA NA NA
6 B <NA> A 0 1 0 NA NA NA 1 0 0
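Note that fastDummies::dummy_cols() returns 0/1 integer columns rather than logicals. If TRUE/FALSE columns are needed to match the other answers, a possible follow-up step is a blanket as.logical() conversion; the stand-in data frame and the underscore-based column selection below are assumptions for illustration, adjust to your names:

```r
# Hypothetical dummy_cols()-style output: one factor column plus 0/1 dummies
dums <- data.frame(cat = c("A", "B"),
                   cat_A = c(1L, 0L),
                   cat_B = c(0L, 1L))
is_dummy <- grepl("_", names(dums), fixed = TRUE)  # locate the 0/1 columns
dums[is_dummy] <- lapply(dums[is_dummy], as.logical)
dums$cat_A  # now TRUE/FALSE instead of 1/0
```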
purrr's map_dfc could match well with your current approach:
library(dplyr)
library(purrr)
library(stringr)  # for str_replace() inside .names

bind_cols(data,
          map_dfc(LETTERS[1:3], \(letter) {
            mutate(data,
                   across(starts_with("category"),
                          ~ case_when(.x == letter ~ TRUE, !is.na(.x) ~ FALSE),
                          .names = paste0("{str_replace(.col, 'category', '", letter, "')}")),
                   .keep = "none")
          }))
Or skip the bind_cols and use .keep = ifelse(letter == "A", "all", "none").
Output:
category_t1 category_t2 category_t3 A_t1 A_t2 A_t3 B_t1 B_t2 B_t3 C_t1 C_t2 C_t3
1 A A C TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
2 B C C FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
3 C B <NA> FALSE FALSE NA FALSE TRUE NA TRUE FALSE NA
4 C B B FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE
5 A B <NA> TRUE FALSE NA FALSE TRUE NA FALSE FALSE NA
6 B <NA> A FALSE NA TRUE TRUE NA FALSE FALSE NA FALSE
A base solution with nested lapply():
cbind(data, lapply(data, \(x) {
lev <- levels(factor(x))
sapply(setNames(lev, lev), \(y) x == y)
}))
category_t1 category_t2 category_t3 category_t1.A category_t1.B category_t1.C category_t2.A category_t2.B category_t2.C category_t3.A category_t3.B category_t3.C
1 A A C TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
2 B C C FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
3 C B <NA> FALSE FALSE TRUE FALSE TRUE FALSE NA NA NA
4 C B B FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE
5 A B <NA> TRUE FALSE FALSE FALSE TRUE FALSE NA NA NA
6 B <NA> A FALSE TRUE FALSE NA NA NA TRUE FALSE FALSE

How can I remove rows with same elements in all columns of a dataframe?

I have a dataframe with the following elements:
> x[3536:3540,]
V1 V2
3536 2 6
3537 13 6
3538 9 6
3539 6 6
3540 2 2
I want to remove rows with the same elements in all columns.
My desired result is as follows:
> x[3536:3540,]
V1 V2
3536 2 6
3537 13 6
3538 9 6
I tried this:
x<-x[,1] != x[,2]
But I get only boolean values for each row, not matrix with rows with non-same elements in columns:
> x
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[15] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[29] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[43] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[57] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[71] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[99] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[113] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
Any help would be greatly appreciated.
You need to subset/filter:
Base R:
x_new <- x[x$V1 != x$V2,]
dplyr:
library(dplyr)
x_new <- x %>%
filter(V1 != V2)
Result:
x_new
V1 V2
2 1 2
3 1 3
Data:
x <- data.frame(
V1 = c(1,1,1,1),
V2 = c(1,2,3,1)
)
The below is assuming you want to subset within specific rows as per original post.
library(data.table)
setDT(df)
df <- df[3536:3540][V1 != V2]

Turns thousands of dummy variables into multinomial variable

I have a dataframe of the following sort:
a<-c('q','w')
b<-c(T,T)
d<-c(F,F)
.e<-c(T,F)
.f<-c(F,F)
.g<-c(F,T)
h<-c(F,F)
i<-c(F,T)
j<-c(T,T)
df<-data.frame(a,b,d,.e,.f,.g,h,i,j)
a b d .e .f .g h i j
1 q TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
2 w TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE
I want to collapse all the variables whose names start with a period into a single multinomial variable called Index, whose value is the name (minus the period) of the period-initial column that is TRUE in that row:
df$Index<-c('e','g')
a b d .e .f .g h i j Index
1 q TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE e
2 w TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE g
Although many rows can have a T for any of period-initial variable, each row can be T for only ONE period-initial variable.
If it were just a few items, I'd do an ifelse statement:
df$Index <- ifelse(df$_10000, '10000',...
But there are 12000 of these. (In my real data the dummy-variable names all begin with underscores rather than periods.) I feel like there must be a better way. In pseudocode I would say something like:
for every row:
for every column beginning with '_':
if value == T:
assign the name of the column without '_' to a Column 'Index'
Thanks in advance
Sample data:
df <- cbind(a = letters[1:10], b = LETTERS[1:10],
data.frame(diag(10) == 1))
names(df)[-(1:2)] <- paste0("_", 1:10)
set.seed(42)
df <- df[sample(nrow(df)),]
head(df,3)
# a b _1 _2 _3 _4 _5 _6 _7 _8 _9 _10
# 1 a A TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# 5 e E FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
# 10 j J FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
Execution:
df$Index <- apply(subset(df, select = grepl("^_", names(df))), 1,
function(z) which(z)[1])
df
# a b _1 _2 _3 _4 _5 _6 _7 _8 _9 _10 Index
# 1 a A TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 1
# 5 e E FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE 5
# 10 j J FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE 10
# 8 h H FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE 8
# 2 b B FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 2
# 4 d D FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE 4
# 6 f F FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE 6
# 9 i I FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE 9
# 7 g G FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE 7
# 3 c C FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 3
If there are more than one TRUE in a row of _-columns, the first found will be used, the remainder silently ignored. If there are none, Index will be NA for that row.
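For comparison, max.col() is another base way to take the first TRUE per row (the small data frame dd below is made up for illustration). One caveat relative to which(z)[1]: a row with no TRUE at all would silently return column 1 rather than NA, since every entry ties at zero:

```r
dd <- data.frame(a = letters[1:3],
                 `_1` = c(TRUE, FALSE, FALSE),
                 `_2` = c(FALSE, TRUE, FALSE),
                 `_3` = c(FALSE, FALSE, TRUE),
                 check.names = FALSE)
dcols <- grepl("^_", names(dd))
# coerce the logical columns to numeric so max.col() can scan them
dd$Index <- max.col(as.matrix(dd[dcols]) * 1, ties.method = "first")
dd$Index  # 1 2 3 -- position of the first TRUE in each row
```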

How to pull TRUE values out of matrices

I need to pull the genes with the TRUE values out of each column of the matrix and form a list of them for each of my contrasts (columns). How do I go about doing that?
gcQVals=qvalue(eBgcData$p.value)
print(sum(gcQVals$qvalues<=0.01))
gcQs2=gcQVals$qvalues<=0.01
print(gcQs2[1:5,1:6])
Here is the output:
[1] 17969
Contrasts
KOInfvsKOUnInf WTInfvsWTUnInf KOInfvsWTInf KOInfvsWTUnInf
1415670_at FALSE FALSE FALSE FALSE
1415671_at FALSE FALSE FALSE FALSE
1415672_at TRUE FALSE FALSE TRUE
1415673_at FALSE FALSE FALSE FALSE
1415674_a_at FALSE FALSE FALSE FALSE
Contrasts
KOUnInfvsWTInf KOUnInfvsWTUnInf
1415670_at FALSE FALSE
1415671_at FALSE FALSE
1415672_at FALSE FALSE
1415673_at FALSE FALSE
1415674_a_at FALSE FALSE
I tried to figure out what you were saying and built this MWE:
# build the example
a <- matrix(runif(9), 3, 3)>0.5
dimnames(a) <- list(letters[1:3], LETTERS[1:3])
# solution to your supposed problem
lapply(colnames(a), function(name) rownames(a)[a[,name]])
> a
A B C
a FALSE TRUE FALSE
b FALSE FALSE TRUE
c FALSE FALSE FALSE
res <- lapply(colnames(a), function(name) rownames(a)[a[,name]])
names(res) <- colnames(a)
> res
$A
character(0)
$B
[1] "a"
$C
[1] "b"

R: Generate new data frame column based on a mapping of multiple (logical) columns

Clarification of 'map' or 'ordering' at bottom of post
Imagine we have a data frame with several logical columns, and a 'map' which, for specific combinations of those logical columns, gives a value.
What is the best/most efficient way to compute the value associated with each row of the data frame.
I have three possible solutions below: ifelse(), merge() and table(). I'd appreciate any comments or alternative solutions.
[Apologies, a rather long post]
Consider the following example data frame:
# Generate example
#N <- 15
#Data <- data.frame(A=sample(c(FALSE,TRUE),N,TRUE,c(8,2)),
# B=sample(c(FALSE,TRUE),N,TRUE,c(6,4)),
# C=sample(c(FALSE,TRUE),N,TRUE,c(7,3)),
# D=sample(c(FALSE,TRUE),N,TRUE,c(7,3)))
# Specific example used in this question
Data <- structure(list(A = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE,
FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE),
B = c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE,
FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE), C = c(FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE,
FALSE, TRUE, FALSE, FALSE, FALSE), D = c(TRUE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE,
FALSE, TRUE, FALSE)), .Names = c("A", "B", "C", "D"),
class = "data.frame", row.names = c(NA,-15L))
A B C D
1 FALSE FALSE FALSE TRUE
2 FALSE FALSE FALSE FALSE
3 FALSE TRUE FALSE FALSE
4 TRUE FALSE FALSE FALSE
5 FALSE FALSE FALSE FALSE
6 FALSE TRUE FALSE FALSE
7 FALSE TRUE FALSE FALSE
8 FALSE FALSE FALSE FALSE
9 FALSE FALSE FALSE FALSE
10 TRUE FALSE TRUE TRUE
11 FALSE TRUE FALSE TRUE
12 FALSE FALSE TRUE FALSE
13 FALSE TRUE FALSE FALSE
14 FALSE FALSE FALSE TRUE
15 FALSE FALSE FALSE FALSE
Combined with the following map:
# A -> B -> C
# \_ D
### To clarify, if someone has both B & D TRUE (with C FALSE), D is higher than B
### i.e. there can be no ties
This defines an ordering of the logical columns. The final value I want is the 'highest' column within each row: if column C is TRUE we always return "C", and we return "D" only if C is FALSE and D is TRUE.
The naive way to do this would be nested ifelse statements:
Data$Highest <- with(Data, ifelse( C, "C",
ifelse( D, "D",
ifelse( B, "B",
ifelse( A, "A", "none")
)
)
)
)
But that code is difficult to read/maintain and gets very complicated for complex orderings with many columns.
I can quickly generate a mapping from the column combinations to the desired output:
Map <- expand.grid( lapply( lapply( Data[c("A","B","C","D")], unique ), sort ) )
Map$Value <- factor(NA, levels=c("A","B","C","D","none"))
Map$Value[which(Map$A)] <- "A"
Map$Value[which(Map$B)] <- "B"
Map$Value[which(Map$D)] <- "D"
Map$Value[which(Map$C)] <- "C"
Map$Value[which(is.na(Map$Value))] <- "none"
A B C D Value
1 FALSE FALSE FALSE FALSE none
2 TRUE FALSE FALSE FALSE A
3 FALSE TRUE FALSE FALSE B
4 TRUE TRUE FALSE FALSE B
5 FALSE FALSE TRUE FALSE C
6 TRUE FALSE TRUE FALSE C
7 FALSE TRUE TRUE FALSE C
8 TRUE TRUE TRUE FALSE C
9 FALSE FALSE FALSE TRUE D
10 TRUE FALSE FALSE TRUE D
11 FALSE TRUE FALSE TRUE D
12 TRUE TRUE FALSE TRUE D
13 FALSE FALSE TRUE TRUE C
14 TRUE FALSE TRUE TRUE C
15 FALSE TRUE TRUE TRUE C
16 TRUE TRUE TRUE TRUE C
Which can be used with merge():
merge( Data, Map, by=c("A","B","C","D"), all.y=FALSE )
A B C D Highest Value
1 FALSE FALSE FALSE FALSE none none
2 FALSE FALSE FALSE FALSE none none
3 FALSE FALSE FALSE FALSE none none
4 FALSE FALSE FALSE FALSE none none
5 FALSE FALSE FALSE FALSE none none
6 FALSE FALSE FALSE TRUE D D
7 FALSE FALSE FALSE TRUE D D
8 FALSE FALSE TRUE FALSE C C
9 FALSE TRUE FALSE FALSE B B
10 FALSE TRUE FALSE FALSE B B
11 FALSE TRUE FALSE FALSE B B
12 FALSE TRUE FALSE FALSE B B
13 FALSE TRUE FALSE TRUE D D
14 TRUE FALSE FALSE FALSE A A
15 TRUE FALSE TRUE TRUE C C
However, the merge() function does not currently preserve the row order. There are ways round this though.
My final idea was to use a 4-dimensional table with character entries corresponding to the map:
Map2 <- table( lapply( Data[c("A","B","C","D")], unique ) )
Map2[] <- "none"
Map2["TRUE",,,] <- "A"
Map2[,"TRUE",,] <- "B"
Map2[,,,"TRUE"] <- "D"
Map2[,,"TRUE",] <- "C"
But I find the above lines unclear (perhaps there is a better way to make the table? I thought it would be possible to turn Map into Map2, but I couldn't see how).
We then use matrix-indexing to pull out the corresponding value:
BOB <- as.matrix(Data[c("A","B","C","D")])
cBOB <- matrix(as.character(BOB),nrow=NROW(BOB),ncol=NCOL(BOB),dimnames=dimnames(BOB))
Data$Alt.Highest <- Map2[cBOB]
A B C D Highest Alt.Highest
1 FALSE FALSE FALSE TRUE D D
2 FALSE FALSE FALSE FALSE none none
3 FALSE TRUE FALSE FALSE B B
4 TRUE FALSE FALSE FALSE A A
5 FALSE FALSE FALSE FALSE none none
6 FALSE TRUE FALSE FALSE B B
7 FALSE TRUE FALSE FALSE B B
8 FALSE FALSE FALSE FALSE none none
9 FALSE FALSE FALSE FALSE none none
10 TRUE FALSE TRUE TRUE C C
11 FALSE TRUE FALSE TRUE D D
12 FALSE FALSE TRUE FALSE C C
13 FALSE TRUE FALSE FALSE B B
14 FALSE FALSE FALSE TRUE D D
15 FALSE FALSE FALSE FALSE none none
So in summary, is there a better way to achieve this 'mapping' type operation and any thoughts on the efficiency of these methods?
For the application I'm interested in, I have nine columns and an ordering chart with three branches to apply to 3000 rows. Essentially I am trying to construct a factor based on an awkward data storage format. So clarity of code is my first priority, with speed/memory efficiency my second.
Thanks in advance.
P.S. Suggestions for amending the question title also welcome.
Clarification
The real application involves a questionnaire with 9 questions asking whether the respondent has achieved a given education/qualification level. These are binary yes/no responses.
What we want is to generate a new variable 'highest qualification achieved'.
The problem is that the 9 levels don't form a simple stack. For example, professional qualifications can be achieved without going to university (especially in older respondents).
We have designed an 'map' or 'ordering' such that, for every combination of responses we have a 'highest qualification' (this order is subjective, hence the desire to make it simple to implement alternative orders).
# So given the nine responses: A, B, C, D, E, F, G, H, I
# we define an ordering as:
# D > C > B > A
# F > E
# E > A
# E == B
# I > H
# H == B
# G == B
# which has a set of order relationships. There is equality in this example
# A -> B -> C -> D
# \_ E -> F
# \_ H -> I
# \_ G
# 0 1 2 3 4
# We could then have five levels in out final 'highest' ordered factor: none, 1, 2, 3, 4
# Or we could decide to add more levels to break certain ties.
The R question is, given an ordering (and what to do with ties) that map combinations of the logical columns to a 'highest achieved' value. How best to implement this in R.
I think I might not understand your concept of 'ordering'. If no ties are allowed and you know exactly how each letter compares to all the others, then there is a strict ordering that can be written as a simple vector from highest to lowest. If that isn't true, perhaps you could give a more difficult example. If it is true, you could code this quite easily like:
order<-c('C','D','B','A')
reordered.Data<-Data[order]
Data$max<-
c(order,'none')[apply(reordered.Data,1,function(x) min(which(c(x,TRUE))))]
# A B C D max
# 1 FALSE FALSE FALSE TRUE D
# 2 FALSE FALSE FALSE FALSE none
# 3 FALSE TRUE FALSE FALSE B
# 4 TRUE FALSE FALSE FALSE A
# 5 FALSE FALSE FALSE FALSE none
# 6 FALSE TRUE FALSE FALSE B
# 7 FALSE TRUE FALSE FALSE B
# 8 FALSE FALSE FALSE FALSE none
# 9 FALSE FALSE FALSE FALSE none
# 10 TRUE FALSE TRUE TRUE C
# 11 FALSE TRUE FALSE TRUE D
# 12 FALSE FALSE TRUE FALSE C
# 13 FALSE TRUE FALSE FALSE B
# 14 FALSE FALSE FALSE TRUE D
# 15 FALSE FALSE FALSE FALSE none
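The key idiom above is min(which(c(x, TRUE))): appending a sentinel TRUE means a row with no TRUE at all falls through to position length(x) + 1, which indexes the appended 'none' label. A tiny demonstration:

```r
ord <- c("C", "D", "B", "A")  # priority order, highest first
pick <- function(x) c(ord, "none")[min(which(c(x, TRUE)))]
pick(c(FALSE, TRUE, FALSE, FALSE))   # "D" -- first TRUE in priority order
pick(c(FALSE, FALSE, FALSE, FALSE))  # "none" -- the sentinel catches this
```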
I think I now understand your concept of 'ordering'. However, I think that you can safely ignore it at first. For example, G is the same level as B. But G and B will never be compared; you can only have one of {B,E,H,G}. So, as long as each "branch" is in the correct order, it won't matter. If you provided some sample data for your new branching, I could test this, but try something like this:
order<-c(D,C,F,I,B,E,H,G,A)
levs<-c(4,3,3,3,2,2,2,2,1)
names(levs)<-order
reordered.Data<-Data[order]
Data$max<-
c(order,'none')[apply(reordered.Data,1,function(x) min(which(c(x,TRUE))))]
Data$lev<-levs[Data$max]
Here's a data.table approach:
require(data.table)
DT <- data.table(Data)
valord <- c('none','A','B','D','C')
DT[,val:={
vals <- c('none'=TRUE,unlist(.SD))[valord]
names(vals)[max(which(vals))]
},by=1:nrow(DT)]
The result is
A B C D val
1: FALSE FALSE FALSE TRUE D
2: FALSE FALSE FALSE FALSE none
3: FALSE TRUE FALSE FALSE B
4: TRUE FALSE FALSE FALSE A
5: FALSE FALSE FALSE FALSE none
6: FALSE TRUE FALSE FALSE B
7: FALSE TRUE FALSE FALSE B
8: FALSE FALSE FALSE FALSE none
9: FALSE FALSE FALSE FALSE none
10: TRUE FALSE TRUE TRUE C
11: FALSE TRUE FALSE TRUE D
12: FALSE FALSE TRUE FALSE C
13: FALSE TRUE FALSE FALSE B
14: FALSE FALSE FALSE TRUE D
15: FALSE FALSE FALSE FALSE none
If you run
class(DT) # [1] "data.table" "data.frame"
you'll see that this is a data.frame, like your "Data," and the same functions can be applied to it.
