Add column based on multiple conditions with mutate() in tidy R (grep)

I am trying to classify the rows of a wide data frame in an added column, based on whether any of several columns exceeds a threshold (>0). Previous examples here on SO require the complete column names and ifelse() statements with > and ==. I need to be able to use grep() or contains() to select the columns by a common string.
Input dataframe:
library(tidyverse)
df <- data.frame(
"ID" = c("asdf","vfdkun", "seifu", "seijd", "qweri"),
"elephant_zoo" = c(1,1,1,2,0), #Should not be useful there
"rhino_zoo" = c(1,2,3,1,0), #Should not be useful there
"hippo_zoo" = c(1,1,0,0,0),
"elephant_wild_A" = c(0,0,1,1,3),
"rhino_wild_A" = c(0,0,4,3,1),
"elephant_wild_B" = c(0,0,0,0,0),
"rhino_wild_C" = c(0,0,1,5,7),
"hippo_wild_B" = c(0,0,0,0,0)) %>%
column_to_rownames(var = "ID")
df
In reality, this has many more columns and rows!
The desired output data frame has two classification columns (ZOO and WILD) and a CLASSIFICATION column that combines them.
df_goal <- data.frame(
"ID" = c("asdf","vfdkun", "seifu", "seijd", "qweri"),
"elephant_zoo" = c(1,1,1,2,0), #Should not be useful there
"rhino_zoo" = c(1,2,3,1,0), #Should not be useful there
"hippo_zoo" = c(1,1,0,0,0),
"elephant_wild_A" = c(0,0,1,1,3),
"rhino_wild_A" = c(0,0,4,3,1),
"elephant_wild_B" = c(0,0,0,0,0),
"rhino_wild_C" = c(0,0,1,5,7),
"hippo_wild_B" = c(0,0,0,0,0)) %>%
column_to_rownames(var = "ID") %>%
add_column(ZOO = c("zoo", "zoo", "zoo", "zoo", "")) %>%
add_column(WILD = c("", "", "wild", "wild", "wild")) %>%
add_column(CLASSIFICATION = c("zoo only", "zoo only", "both", "both", "wild only"))
df_goal
I am hoping to use a combination of mutate() and case_when(), but I cannot get the selection of multiple columns right.
Examples of attempts:
# using an if else statement
df %>%
mutate(ZOO = ifelse(select(contains("zoo")) > 0, "zoo", "F"))
# using mutate and case_when
df %>%
mutate(ZOO = case_when(
select(contains("zoo")) > 0 ~ "zoo",
TRUE ~ ""))
My actual data frame has many more categories, so I need to be able to break it down into ZOO vs WILD first and then follow up with the compiled column.

You can try reduce() from the purrr package. An intermediate helper function, any_cols(), makes the code clearer when used with across():
library(tidyverse)
any_cols <- function(df) reduce(df, `|`)
df %>%
mutate(ZOO = ifelse(any_cols(across(contains("zoo"), ~`>`(.,0))), "zoo", "F"))
elephant_zoo rhino_zoo hippo_zoo elephant_wild_A rhino_wild_A elephant_wild_B rhino_wild_C hippo_wild_B ZOO
1 1 1 1 0 0 0 0 0 zoo
2 1 2 1 0 0 0 0 0 zoo
3 1 3 0 1 4 0 1 0 zoo
4 2 1 0 1 3 0 5 0 zoo
5 0 0 0 3 1 0 7 0 F
df %>%
mutate(ZOO =
case_when(any_cols(across(contains("zoo"), ~`>`(.,0))) ~ "zoo",
TRUE ~ "F"))
elephant_zoo rhino_zoo hippo_zoo elephant_wild_A rhino_wild_A elephant_wild_B rhino_wild_C hippo_wild_B ZOO
1 1 1 1 0 0 0 0 0 zoo
2 1 2 1 0 0 0 0 0 zoo
3 1 3 0 1 4 0 1 0 zoo
4 2 1 0 1 3 0 5 0 zoo
5 0 0 0 3 1 0 7 0 F
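To get all three columns from the question, one possible extension is sketched below. It assumes dplyr 1.0.4 or later for if_any(). The ends_with("_zoo") and contains("_wild_") selectors and the "neither" fallback label are assumptions chosen for this example data, not part of the answer above.

```r
library(dplyr)

# Example data from the question, with ID kept as an ordinary column
df <- data.frame(
  ID = c("asdf", "vfdkun", "seifu", "seijd", "qweri"),
  elephant_zoo = c(1, 1, 1, 2, 0),
  rhino_zoo = c(1, 2, 3, 1, 0),
  hippo_zoo = c(1, 1, 0, 0, 0),
  elephant_wild_A = c(0, 0, 1, 1, 3),
  rhino_wild_A = c(0, 0, 4, 3, 1),
  elephant_wild_B = c(0, 0, 0, 0, 0),
  rhino_wild_C = c(0, 0, 1, 5, 7),
  hippo_wild_B = c(0, 0, 0, 0, 0)
)

res <- df %>%
  mutate(
    # TRUE for a row if any selected column is > 0
    ZOO = if_else(if_any(ends_with("_zoo"), ~ .x > 0), "zoo", ""),
    WILD = if_else(if_any(contains("_wild_"), ~ .x > 0), "wild", ""),
    # Combine the two flags; "neither" is a hypothetical label for all-zero rows
    CLASSIFICATION = case_when(
      ZOO != "" & WILD != "" ~ "both",
      ZOO != "" ~ "zoo only",
      WILD != "" ~ "wild only",
      TRUE ~ "neither"
    )
  )
res
```

The selectors include the underscores because contains() ignores case by default, so a bare contains("zoo") would also pick up an existing ZOO column if the pipeline were run a second time on its own output.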

Related

extracting unique combinations from a long list of binary variables

I have a dataframe containing a long list of binary variables. Each row represents a participant, and columns represent whether a participant made a certain choice (1) or not (0). For the sakes of simplicity, let's say there's only four binary variables and 6 participants.
df <- data.frame(a = c(0,1,0,1,0,1),
b = c(1,1,1,1,0,1),
c = c(0,0,0,1,1,1),
d = c(1,1,0,0,0,0))
>df
# a b c d
# 1 0 1 0 1
# 2 1 1 0 1
# 3 0 1 0 0
# 4 1 1 1 0
# 5 0 0 1 0
# 6 1 1 1 0
In the dataframe, I want to create a list of columns that reflect each unique combination of variables in df (i.e., abc, abd, bcd, cda). Then, for each row, I want to add value "1" if the row contains the particular combination corresponding to the column. So, if the participant scored 1 on "a", "b", and "c", and 0 on "d" he would have a score 1 in the newly created column "abc", but 0 in the other columns. Ideally, it would look something like this.
>df_updated
# a b c d abc abd bcd cda
# 1 0 1 0 1 0 0 0 0
# 2 1 1 0 1 0 1 0 0
# 3 0 1 0 0 0 0 0 0
# 4 1 1 1 0 1 0 0 0
# 5 0 0 1 0 0 0 0 0
# 6 1 1 1 0 1 0 0 0
The ultimate goal is to have an idea of the frequency of each of the combinations, so I can order them from the most frequently chosen to the least frequently chosen. I've been thinking about this issue for days now, but couldn't find an appropriate answer. I would very much appreciate the help.
Something like this?
funCombn <- function(data){
f <- function(x, data){
data <- data[x]
list(
name = paste(x, collapse = ""),
vec = apply(data, 1, function(x) +all(as.logical(x)))
)
}
res <- combn(names(data), 3, f, simplify = FALSE, data = data)
out <- do.call(cbind.data.frame, lapply(res, '[[', 'vec'))
names(out) <- sapply(res, '[[', 'name')
cbind(data, out)
}
funCombn(df)
# a b c d abc abd acd bcd
#1 0 1 0 1 0 0 0 0
#2 1 1 0 1 0 1 0 0
#3 0 1 0 0 0 0 0 0
#4 1 1 1 0 1 0 0 0
#5 0 0 1 0 0 0 0 0
#6 1 1 1 0 1 0 0 0
Base R option using combn :
n <- 3
cbind(df, do.call(cbind, combn(names(df), n, function(x) {
setNames(data.frame(as.integer(rowSums(df[x] == 1) == n)),
paste0(x, collapse = ''))
}, simplify = FALSE))) -> result
result
# a b c d abc abd acd bcd
#1 0 1 0 1 0 0 0 0
#2 1 1 0 1 0 1 0 0
#3 0 1 0 0 0 0 0 0
#4 1 1 1 0 1 0 0 0
#5 0 0 1 0 0 0 0 0
#6 1 1 1 0 1 0 0 0
combn creates all combinations of the column names, taking n columns at a time. For each combination, a row gets 1 if all n of its columns are 1, and 0 otherwise.
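Since the stated goal is to rank the combinations by frequency, the same combn idea can be used to count directly, without adding columns to the data. This is a minimal sketch under the same definition as above (a row counts for a combination when all n of its columns are 1); the names combos and freq are only illustrative.

```r
df <- data.frame(a = c(0, 1, 0, 1, 0, 1),
                 b = c(1, 1, 1, 1, 0, 1),
                 c = c(0, 0, 0, 1, 1, 1),
                 d = c(1, 1, 0, 0, 0, 0))

n <- 3
# All sets of n column names, as a list of character vectors
combos <- combn(names(df), n, simplify = FALSE)

# For each set, count the rows in which every column of the set equals 1
freq <- vapply(combos,
               function(cols) sum(rowSums(df[cols] == 1) == n),
               numeric(1))
names(freq) <- vapply(combos, paste0, character(1), collapse = "")

# Most frequent combination first
sort(freq, decreasing = TRUE)
```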
If you are just looking for a frequency of the combinations (and they don't need to be back in the original data), then you could use something like this:
df <- data.frame(a = c(0,1,0,1,0,1),
b = c(1,1,1,1,0,1),
c = c(0,0,0,1,1,1),
d = c(1,1,0,0,0,0))
n <- names(df)
out <- sapply(n, function(x)ifelse(df[[x]] == 1, x, ""))
combs <- apply(out, 1, paste, collapse="")
sort(table(combs))
# combs
# abd b bd c abc
# 1 1 1 1 2
Ok, so let's use your data, including one row without any 1's:
df <- data.frame(
a = c(0,1,0,1,0,1,0),
b = c(1,1,1,1,0,1,0),
c = c(0,0,0,1,1,1,0),
d = c(1,1,0,0,0,0,0)
)
Now I want to paste all column names together if they have a 1, and then make that a wide table (so that all have a column for a combination). Of course, I fill all resulting NAs with 0's.
df2 <- df %>%
dplyr::mutate(
combination = paste0(
ifelse(a == 1, "a", ""), # There is possibly a way to automate this as well using across()
ifelse(b == 1, "b", ""),
ifelse(c == 1, "c", ""),
ifelse(d == 1, "d", "")
),
combination = ifelse(
combination == "",
"nothing",
paste0("comb_", combination)
),
value = ifelse(
is.na(combination),
0,
1
),
i = dplyr::row_number()
) %>%
tidyr::pivot_wider(
names_from = combination,
values_from = value,
names_repair = "unique"
) %>%
replace(., is.na(.), 0) %>%
dplyr::select(-i)
Since you want to order the original df by frequency, you can create a summary of all combinations (excluding those without anything filled in). Then you just make it a long table and pull the column for every combination (arranged by frequency) from the table.
comb_in_order <- df2 %>%
dplyr::select(
-tidyselect::any_of(
c(
names(df),
"nothing" # I think you want these last.
)
)
) %>%
dplyr::summarise(
dplyr::across(
.cols = tidyselect::everything(),
.fns = sum
)
) %>%
tidyr::pivot_longer(
cols = tidyselect::everything(),
names_to = "combination",
values_to = "frequency"
) %>%
dplyr::arrange(
dplyr::desc(frequency)
) %>%
dplyr::pull(combination)
The only thing left is to arrange the rows of df2 by these combination columns (in order of frequency) and then select the original columns again.
df2 %>%
dplyr::arrange(
dplyr::across(
tidyselect::any_of(comb_in_order),
dplyr::desc
)
) %>%
dplyr::select(
tidyselect::any_of(names(df))
)
This should work for all possible combinations.

Explanation on converting the factor variable labels to numeric with same labels values?

Using the ISLR library and its Default data set.
library(tidyverse)
library(ISLR)
data("Default")
Replacing "No" with 0 and "Yes" with 1 for variables default and student
DF1 <- Default %>%
select(default, student) %>%
mutate(default = factor(default, labels = c("No" = 0, "Yes" = 1)),
student = factor(student, labels = c("No" = 0, "Yes" = 1)))
Converting the variables to numeric:
DF2 <- DF1 %>%
mutate(default = as.integer(default),
student = as.numeric(student))
The output is as shown below.
> head(DF2)
default student
1 1 1
2 1 2
3 1 1
4 1 1
5 1 1
6 1 2
When converting to a numeric type, the 0 and 1 values are recoded as 1 and 2.
I found a solution in this SO answer: How to convert a factor to integer\numeric without loss of information?
DF2 <- DF1 %>%
mutate(default = as.integer(levels(default))[default],
student = as.numeric(levels(student))[student])
> head(DF2)
default student
1 0 0
2 0 1
3 0 0
4 0 0
5 0 0
6 0 1
Can you please explain a little on the statement
as.numeric(levels(student))[student])
Why doesn't the instruction below give the desired output?
as.numeric(as.factor(student))
Thanks.
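A small sketch of what happens, using a made-up three-element factor rather than the Default data. A factor stores integer codes (1, 2, ...) plus a character vector of levels. as.numeric() on a factor returns the codes, not the labels. levels(f) returns the labels as strings, and indexing a vector with a factor uses the factor's codes, so as.numeric(levels(f))[f] looks up each element's label and returns it as a number.

```r
# Hypothetical toy factor: "No"/"Yes" relabelled as "0"/"1"
f <- factor(c("No", "Yes", "No"), labels = c(0, 1))

lev <- levels(f)                  # the labels, stored as strings: "0" "1"
codes <- as.numeric(f)            # the internal codes: 1 2 1
vals <- as.numeric(levels(f))[f]  # label of each element as a number: 0 1 0

# as.factor() on something that is already a factor changes nothing,
# so this again returns the internal codes: 1 2 1
same_codes <- as.numeric(as.factor(f))

# Equivalent to the levels() idiom, via the character representation: 0 1 0
vals2 <- as.numeric(as.character(f))
```

The levels() form converts only the few distinct labels and then indexes, while the as.character() form converts every element; both give the same values.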

Repeated comparisons between current row and following rows(on the same column)

I have a data frame like this (5 rows and 1 column),
data
row1 5
row2 4
row3 12
row4 6
row5 7
I want to make a comparison between current rows and following rows, as this table display.
compare YES NO
row1<row2 0
row1<row3 1
row1<row4 1
row1<row5 1
row2<row3 1
row2<row4 1
row2<row5 1
row3<row4 0
row3<row5 0
row4<row5 1
I've also written some R code with a for loop.
for (i in 1:nrow(data)){
if (data[i,] <data[(i+1):5,]){
print("1")
} else {
print ("0")
}
}
However, I get the error: missing value where TRUE/FALSE needed
Can anyone help me to solve this problem? Or, maybe the apply function is better?
Sorry for my poor English, and big thanks for your precious time!
I'm not quite clear on what your final goal is; your expected output looks like an awkward data format. I assume that this is to adhere to some form of custom/legacy data formatting requirements.
That aside, you could use outer to do all pairwise comparisons, and then reshape the data:
library(tidyverse)
outer(df$data, df$data, FUN = function(x, y) x < y) %>%
as.data.frame() %>%
rowid_to_column("rowx") %>%
gather(rowy, val, -rowx) %>%
mutate(
rowx = paste0("row", rowx),
rowy = sub("V", "row", rowy)) %>%
filter(rowx < rowy) %>%
unite(compare, rowx, rowy, sep = "<") %>%
transmute(
compare,
Yes = if_else(val == TRUE, 1, 0),
No = if_else(val == FALSE, 1, 0))
# compare Yes No
#1 row1<row2 1 0
#2 row1<row3 1 0
#3 row2<row3 1 0
#4 row1<row4 1 0
#5 row2<row4 0 1
#6 row3<row4 0 1
#7 row1<row5 1 0
#8 row2<row5 1 0
#9 row3<row5 0 1
#10 row4<row5 1 0
Sample data
df <- read.table(text =
"data
1 0.05493405
2 0.07844055
3 0.12901255
4 0.0655028
5 0.078554925", header = T)
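As for the error in the question's loop: if() needs a single TRUE or FALSE, but data[(i+1):5, ] is a vector, and when i is 5 the index (i+1):5 is 6:5, i.e. c(6, 5), so a non-existent row 6 is read and yields NA. A base R alternative is sketched below using the question's own numbers; it assumes the rows are named row1 to row5, and the names pairs and result are only illustrative.

```r
# The data from the question, with row names row1..row5 (assumed)
data <- data.frame(data = c(5, 4, 12, 6, 7),
                   row.names = paste0("row", 1:5))

# Each column of pairs is one (i, j) index pair with i < j
pairs <- combn(nrow(data), 2)
i <- pairs[1, ]
j <- pairs[2, ]

result <- data.frame(
  compare = paste0(rownames(data)[i], "<", rownames(data)[j]),
  # 1 if the earlier row is smaller than the later row, else 0
  yes = as.integer(data$data[i] < data$data[j])
)
result
```

combn() enumerates the pairs in the same order as the table in the question (row1 against rows 2 to 5, then row2 against rows 3 to 5, and so on), so no reordering is needed.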

Multiply columns by columns using substrings

I'm relatively new to R and was struggling with potentially a very simple problem.
I have data that has multiple columns named in a similar way. Here is a sample data:
df = data.frame(PPID = 1:50,
time1 = sample(c(0,1), 50, replace = TRUE),
time2 = sample(c(0,1), 50, replace = TRUE),
time3 = sample(c(0,1), 50, replace = TRUE),
condition1 = sample(c(0:3), 50, replace = TRUE),
condition2 = sample(c(0:3), 50, replace = TRUE))
In my actual data, I have many more columns - approximately 50 for time and 10 for condition.
I want to multiply the time columns by the condition columns, e.g. in that sample data it should give me 6 extra columns, like: time1_condition1, time1_condition2, time2_condition1, time2_condition2, time3_condition1, time3_condition2.
I tried solutions that were suggested in this thread but they did not work (presumably because I didn't understand how mapply/apply worked and did not make appropriate changes) - it gave me error message that the longer argument is not a multiple of length of shorter.
Any help would be greatly appreciated!
#Get all the columns with "time" columns
time_cols <- grep("^time", names(df))
#Get all the columns with "condition" column
condition_cols <- grep("^condition", names(df))
#Multiply each "time" columns with all the condition columns
# and creating a new dataframe
new_df <- do.call("cbind", lapply(df[time_cols] , function(x) x *
df[condition_cols]))
#Combine both the dataframes
complete_df <- cbind(df,new_df)
We can also generate column names using expand.grid
new_names <- do.call("paste0",
expand.grid(names(df)[condition_cols], names(df)[time_cols]))
colnames(complete_df)[7:12] <- new_names
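If the new columns should carry exactly the names asked for in the question (time1_condition1 and so on), one possible base R variant builds the pairs first and names the products from them. This is a sketch with a small fixed data frame instead of the random sample; the names pairs and products are only illustrative.

```r
# Small deterministic stand-in for the question's data
df <- data.frame(PPID = 1:4,
                 time1 = c(0, 1, 1, 0),
                 time2 = c(1, 1, 0, 0),
                 time3 = c(0, 0, 1, 1),
                 condition1 = c(3, 0, 2, 1),
                 condition2 = c(1, 2, 0, 3))

time_cols <- grep("^time", names(df), value = TRUE)
cond_cols <- grep("^condition", names(df), value = TRUE)

# One row per (condition, time) pair; the first variable varies fastest,
# so the pairs come out as time1/condition1, time1/condition2, time2/...
pairs <- expand.grid(cond = cond_cols, time = time_cols,
                     stringsAsFactors = FALSE)

# Multiply each time column by each condition column
products <- Map(function(t, cnd) df[[t]] * df[[cnd]], pairs$time, pairs$cond)
names(products) <- paste(pairs$time, pairs$cond, sep = "_")

complete_df <- cbind(df, as.data.frame(products))
names(complete_df)
```

Because the names are derived from the same pairs table as the products, they stay aligned however many time and condition columns there are, and no hard-coded positions such as 7:12 are needed.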
Here is a tidyverse alternative
library(tidyverse)
idx.time <- grep("time", names(df), value = T)
idx.cond <- grep("condition", names(df), value = T)
bind_cols(
df,
map_dfc(transpose(expand.grid(idx.time, idx.cond, stringsAsFactors = F)),
~setNames(data.frame(df[, .x$Var1] * df[, .x$Var2]), paste(.x$Var1, .x$Var2, sep = "_"))))
# PPID time1 time2 time3 condition1 condition2 time1_condition1
#1 1 1 0 1 3 0 3
#2 2 0 1 1 0 1 0
#3 3 0 1 1 0 2 0
#4 4 0 0 1 0 3 0
#5 5 0 0 0 0 3 0
#...
Explanation: expand.grid creates all pairwise combinations of idx.time and idx.cond. transpose turns a list/data.frame inside-out and returns a list, similar to apply(..., 1, as.list); map_dfc then operates on every element of that list and column-binds results.
Using
library(tidyverse)
a = df[grep("time",names(df))]
b = df[grep("condition",names(df))]
we can do:
map(a,~.x*b)%>%
bind_cols()%>%
set_names(paste(rep(names(a),each=ncol(b)),names(b),sep="_"))
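or, with cross2() and lift() (both deprecated as of purrr 1.0.0; cross2() varies its first argument fastest, so the names are built in that order):
cross2(a,b)%>%
map(lift(`*`))%>%
set_names(paste(names(a),rep(names(b),each=ncol(a)),sep="_"))%>%
data.frame()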
time1_condition1 time2_condition1 time3_condition1 time1_condition2 time2_condition2 time3_condition2
1 3 0 3 2 0 2
2 3 3 0 1 1 0
3 0 0 0 0 0 0
4 3 3 0 0 0 0
5 0 0 2 0 0 1
6 0 0 1 0 0 1
7 2 2 0 0 0 0

Create a dataframe from a dataframe

I'd like to create a dataframe from a dataframe that I created before. My first dataframe is:
Sample motif chromosome
1 CT-G.A 1
1 TA-C.C 1
1 TC-G.C 2
2 CG-A.T 2
2 CA-G.T 2
Then I want to create a dataframe like the one below, for all motif-chromosome combinations (96 motifs * 24 chromosomes):
Sample CT-G.A,chr1 TA-C.C,chr1 TC-G.C,chr1 CG-A.T,chr1 CA-G.T,chr1 CT-G.A,chr2 TA-C.C,chr2 TC-G.C,chr2 CG-A.T,chr2 CA-G.T,chr2
1 1 1 0 0 0 0 0 1 0 0
2 0 0 0 0 0 0 0 0 1 1
Here is a possible solution using dplyr and tidyr.
We add a column value that indicates if a chromosome is present, then complete the data.frame, making sure we have rows for each motif-chromosome-Sample combination, where missing combinations get a 0 in the value column. We create a key out of the motif and chromosome columns, and sum value per Sample and key, which drops the motif and chromosome columns. Lastly, we reshape the data.frame from long to wide to get your desired format. Hope this helps!
df = read.table(text="Sample motif chromosome
1 CT-G.A 1
1 TA-C.C 1
1 TC-G.C 2
2 CG-A.T 2
2 CA-G.T 2
2 CA-G.T 2",header=T)
library(tidyr)
library(dplyr)
df %>% mutate(value=1) %>% complete(motif,chromosome,Sample,fill=list(value=0)) %>%
mutate(key=paste0(motif,',chr',chromosome)) %>%
group_by(Sample,key) %>%
summarize(value = sum(value)) %>%
spread(key,value) %>%
as.data.frame
Output:
Sample CA-G.T,chr1 CA-G.T,chr2 CG-A.T,chr1 CG-A.T,chr2 CT-G.A,chr1 CT-G.A,chr2 TA-C.C,chr1 TA-C.C,chr2 TC-G.C,chr1 TC-G.C,chr2
1 1 0 0 0 0 1 0 1 0 0 1
2 2 0 2 0 1 0 0 0 0 0 0
This seems to be a classic case of when you would want to use factors and ensure that the empty factor levels aren't dropped (which dcast and other functions might do unless explicitly told not to).
Using @Florian's sample data, you can try:
library(data.table)
cols <- c("motif", "chromosome")
setDT(df)[, (cols) := lapply(.SD, factor), .SDcols = cols][
, dcast(unique(.SD)[, value := 1L],
Sample ~ motif + chromosome, value.var = "value",
fill = 0L, drop = FALSE)]
# Sample CA-G.T_1 CA-G.T_2 CG-A.T_1 CG-A.T_2 CT-G.A_1 CT-G.A_2 TA-C.C_1 TA-C.C_2 TC-G.C_1 TC-G.C_2
# 1 1 0 0 0 0 1 0 1 0 0 1
# 2 2 0 1 0 1 0 0 0 0 0 0
I've moved "cols" and myfun() outside of the transformation to save some typing and make things look a little more tidy.
Using the "tidyverse", I'd take a slightly different approach from @Florian, perhaps something like:
library(tidyverse)
df %>%
mutate_at(c("motif", "chromosome"), factor) %>%
mutate(value = 1) %>%
distinct() %>%
mutate(key = interaction(motif, chromosome)) %>%
select(-motif, -chromosome) %>%
spread(key, value, fill = 0, drop = FALSE)
Benchmarks
Benchmarks for these approaches and @Florian's can be found at this Gist.
On 10,000 rows, and 20 resulting columns, the results look like:
This will work for you. I have used the tidyr and dplyr packages, together with expand.grid from base R to complete the missing combinations, and then unite and spread to reshape the result.
df <- read.table(text = "Sample motif chromosome
1 CT-G.A 1
1 TA-C.C 1
1 TC-G.C 2
2 CG-A.T 2
2 CA-G.T 2", header = TRUE)
#add a column to represent presence of chromosome
df$val <- 1
library(tidyr)
library(dplyr)
#Complete missing rows
df_complete <- left_join(
expand.grid(Sample=unique(df$Sample), motif=unique(df$motif),
chromosome=unique(df$chromosome)),
df, by = c("Sample", "motif", "chromosome"), copy = TRUE)
#Additional rows should have val = 0
df_complete$val[is.na(df_complete$val)] <- 0
df_complete %>%
unite(motif, c("motif", "chromosome"), sep = ",chr" ) %>%
spread(motif, val)
#Result
Sample CA-G.T,chr1 CA-G.T,chr2 CG-A.T,chr1 CG-A.T,chr2 CT-G.A,chr1 CT-G.A,chr2 TA-C.C,chr1 TA-C.C,chr2 TC-G.C,chr1 TC-G.C,chr2
1 1 0 0 0 0 1 0 1 0 0 1
2 2 0 1 0 1 0 0 0 0 0 0
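spread() has since been superseded in tidyr by pivot_wider(), so the same reshaping could also be written as sketched below. This assumes tidyr 1.0.0 or later and presence/absence output (duplicates collapsed with distinct()); it is an alternative sketch, not the method of any answer above.

```r
library(dplyr)
library(tidyr)

df <- read.table(text = "Sample motif chromosome
1 CT-G.A 1
1 TA-C.C 1
1 TC-G.C 2
2 CG-A.T 2
2 CA-G.T 2", header = TRUE)

wide <- df %>%
  distinct() %>%                      # keep one row per observed combination
  mutate(value = 1L) %>%              # mark observed combinations
  complete(Sample, motif, chromosome, # add the unobserved ones with value 0
           fill = list(value = 0L)) %>%
  mutate(key = paste0(motif, ",chr", chromosome)) %>%
  select(Sample, key, value) %>%
  pivot_wider(names_from = key, values_from = value)
wide
```

For counts instead of 0/1 flags, the distinct() step could be dropped and the values summed per Sample and key before widening, as the first answer does.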
