Match text from one column with another column (vlookup + like) - r

I'm trying to match two columns, so far without success. DF1 has two columns, Id and json. DF2 has a single column of patterns, each of which should be looked up in every row of DF1$json (something like vlookup combined with a like function).
As output, I'd like to get DF1$Id, but only for rows where any of the DF2 patterns matches DF1$json.
I've tried some combinations with str_detect, but it doesn't work on non-vector values. Maybe there are some tricks with grep or stringr functions?
For example:
str_detect(DF1$json, fixed(DF2[1,1], ignore_case = TRUE))

df1 <- data.frame(
Id = c("AA", "BB", "CC", "DD"),
json = c("{xxx:yyy:zzz};{mmm:zzz:vvv}", "{ccc:yyy:zzz};{ddd:zzz:vvv}", "{ttt:yyy:zzz};{mmm:zzz:vvv}", "{uuu:yyy:zzz};{mmm:zzz:vvv}")
)
matches <- c("mmm:zzz:vvv", "mmm:yyy:zzz")
library(stringr) # needed for str_extract_all()
Solution using data.table
library(data.table)
setDT(df1)
df1[, match := any(str_extract_all(json, "(?<=\\{).+?(?=\\})")[[1]] %in% matches), by = Id]
df1[match == TRUE, .(Id)]
Solution using dplyr
library(dplyr)
df1 %>%
group_by(Id) %>%
mutate(match = any(str_extract_all(json, "(?<=\\{).+?(?=\\})")[[1]] %in% matches)) %>%
filter(match == TRUE) %>%
select(Id)
Or just directly filter()
df1 %>%
group_by(Id) %>%
filter(any(str_extract_all(json, "(?<=\\{).+?(?=\\})")[[1]] %in% matches)) %>%
select(Id)
Output on both methods
Id
1: AA
2: CC
3: DD
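If whole-token matching is not required, a plain substring check may be enough. The following is a base R sketch, not part of the answers above: it assumes the patterns are literal strings (hence fixed = TRUE, which also means the match is case-sensitive) and keeps an Id when any pattern occurs anywhere in json. The names hit_list, any_hit and matched_ids are illustrative.

```r
df1 <- data.frame(
  Id = c("AA", "BB", "CC", "DD"),
  json = c("{xxx:yyy:zzz};{mmm:zzz:vvv}", "{ccc:yyy:zzz};{ddd:zzz:vvv}",
           "{ttt:yyy:zzz};{mmm:zzz:vvv}", "{uuu:yyy:zzz};{mmm:zzz:vvv}"),
  stringsAsFactors = FALSE
)
matches <- c("mmm:zzz:vvv", "mmm:yyy:zzz")

# One logical vector per pattern: does the pattern occur in each json string?
hit_list <- lapply(matches, grepl, x = df1$json, fixed = TRUE)

# Element-wise OR across patterns: TRUE where at least one pattern occurs
any_hit <- Reduce("|", hit_list)

matched_ids <- df1$Id[any_hit]
matched_ids
# [1] "AA" "CC" "DD"
```

Unlike the str_extract_all() versions, this would also accept a pattern that is only part of a {...} token, so it is the looser of the two checks.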

Does this give you the expected result?
my_df <- data.frame("id" = c("AA", "BB", "CC", "DD"),
"json" = c("{x:y:z};{m:z:v}", "{c:y:z};{d:z:v}", "{t:y:z};{m:z:v}", "{u:y:z};{m:z:v}"),
"pattern" = c("m:z:v", "t:y:z", "m:z:v", "t"),
stringsAsFactors = FALSE)
my_f <- function(x) {
my_var <- paste(grep(pattern = my_df[x, "pattern"], x = my_df$json), collapse = " ")
return (my_var)
}
my_df$Value <- lapply(1:nrow(my_df), my_f)

Related

Combine row values into character vector by condition

I have a data.frame where values are repeated in col1.
col1 <- c("A", "A", "B", "B", "C")
col2 <- c(1995, 1997, 1999, 2000, 2005)
df <- data.frame(col1, col2)
I want to combine values in col2 that correspond to the same letter in col1 into one cell, so that col2 shows a range of values for a particular letter in col1. I do this by splitting the data.frame by col1, applying fun, and binding the split data.frames back together.
library(tidyverse)
split_df <- split(df, df$col1)
fun <- function(df) {
if (length(unique(df$col2)) > 1) {
df$col2 <- paste(min(df$col2),
max(df$col2),
sep = "-")
df <- distinct(df)
}
return(df)
}
split_df <- lapply(split_df, fun)
df <- do.call(rbind, split_df)
This works, but I am wondering if there is a more intuitive or more efficient solution?
Base R way using aggregate -
aggregate(col2~col1, df, function(x) paste0(unique(range(x)), collapse = '-'))
# col1 col2
#1 A 1995-1997
#2 B 1999-2000
#3 C 2005
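The reason unique(range(x)) handles single-year groups without an explicit check can be seen on two small vectors. This is only an illustrative sketch; x_many and x_one are made-up inputs.

```r
x_many <- c(1995, 1997, 1996)  # several distinct years
x_one  <- 2005                 # a group with a single year

# range() returns c(min, max); unique() collapses it when min == max
many_label <- paste0(unique(range(x_many)), collapse = "-")
one_label  <- paste0(unique(range(x_one)), collapse = "-")

many_label
# [1] "1995-1997"
one_label
# [1] "2005"
```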
Same can also be written with dplyr -
library(dplyr)
df %>%
group_by(col1) %>%
summarise(col2 = paste0(unique(range(col2)), collapse = '-'))
One option would be the tidyverse, where you can accomplish this a little more succinctly. The basic idea is the same:
library(tidyverse)
new.result <- df %>%
group_by(col1) %>%
summarize(
col2 = ifelse(n() == 1, as.character(col2), paste(min(col2), max(col2), sep = '-'))
)
col1 col2
<chr> <chr>
1 A 1995-1997
2 B 1999-2000
3 C 2005
A different (but possibly overcomplicated) approach assumes that you have at most two years per grouping. We can pivot the start and end years into their own columns, and then paste them together directly. This requires a little more data transformation but avoids having to check explicitly for groups with 1 year:
df %>%
group_by(col1) %>%
mutate(n = row_number()) %>%
pivot_wider(names_from = n, values_from = col2) %>%
rowwise() %>%
mutate(
vec = list(c(`1`, `2`)),
col2 = paste(vec[!is.na(vec)], collapse = '-')
) %>%
select(col1, col2)

How to subset the next column in R

df <- data.frame(intro = c("bob","bob","bob"),
intro_score = c("Excellent","Excellent","Good"),
method = c("sally","sally","sally"),
method_score = c("Excellent","Excellent","Excellent"),
result = c("Norman","Norman","Norman"),
result_score = c("Good","Good","Good"))
If I want to look for "bob" in this data frame, how do I return the column next to "bob" (intro_score only), given that I'm not sure whether "bob" is in there? For example, if I looked for "ken", the result should be null; if I looked for "Norman", the result should be result_score.
I have tried something like this:
name <- "bob"
df_name <- df %>%
if (str_detect(intro, name)) {
select((which(colnames==str_detect(intro, name)))+1)
} else {}
Thank you for your help!
Using base R, if you need the names, you could do:
names(df[unique(which(df=="bob",TRUE)[,2]+1)])
[1] "intro_score"
Or, if you need the column values:
df[unique(which(df=="bob",TRUE)[,2]+1)]
intro_score
1 Excellent
2 Excellent
3 Good
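The same idea can be wrapped in a small helper that also covers the "not found" case from the question. This is a sketch under assumptions: next_col_names is a hypothetical name, it returns an empty character vector (rather than NULL) when the value is absent, and it silently ignores hits in the last column, which has no neighbour to the right.

```r
df <- data.frame(intro = c("bob", "bob", "bob"),
                 intro_score = c("Excellent", "Excellent", "Good"),
                 method = c("sally", "sally", "sally"),
                 method_score = c("Excellent", "Excellent", "Excellent"),
                 result = c("Norman", "Norman", "Norman"),
                 result_score = c("Good", "Good", "Good"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: names of the columns immediately to the right of
# every column that contains `value`
next_col_names <- function(data, value) {
  hits <- which(data == value, arr.ind = TRUE)  # matrix of (row, col) positions
  cols <- unique(hits[, 2]) + 1                 # shift one column to the right
  cols <- cols[cols <= ncol(data)]              # drop positions past the last column
  names(data)[cols]
}

next_col_names(df, "bob")     # "intro_score"
next_col_names(df, "Norman")  # "result_score"
next_col_names(df, "ken")     # character(0)
```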
You could reshape your data into time (intro, method, result), name, and score.
df2 <- reshape(df, direction = "long", varying = list(c(1,3,5), c(2,4,6)), v.names = c("name", "score"), times = c("intro", "method", "result"))
df2[df2$name == "Norman", "score"]
library(purrr)
search_person <- "bob"
colnames(df)[which(map_lgl(df,~all(.x == search_person))) + 1]
#[1] "intro_score"
Here is one option with select_if
library(dplyr)
library(magrittr)
df %>%
select_if(~ any(. == "bob")) %>%
names %>%
match(., names(df)) %>%
add(1) %>%
names(df)[.]
#[1] "intro_score"

Select unique values

I need to change this function so that it matches whole gene names only. For example, if I look for MAPK4, the function also matches MAPK41, AMAPK4, etc. It must select only exact matches.
Function:
library(dplyr)
df2 <- df %>%
rowwise() %>%
mutate(mutated = paste(mutated_genes[unlist(
lapply(mutated_genes, function(x) grepl(x,genes, ignore.case = T)))], collapse=","),
circuit_name = gsub("", "", circuit_name)) %>%
select(-genes) %>%
data.frame()
data:
df <-structure(list(circuit_name = c("hsa04010__117", "hsa04014__118" ), genes = c("MAP4K4,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP3*,DUSP3*,DUSP3*,DUSP3*,PPM1A,AKT3,AKT3,AKT3,ZAK,MAP3K12,MAP3K13,TRAF2,CASP3,IL1R1,IL1R1,TNFRSF1A,IL1A,IL1A,TNF,RAC1,RAC1,RAC1,RAC1,MAP2K7,MAPK8,MAPK8,MAPK8,MECOM,HSPA1A,HSPA1A,HSPA1A,HSPA1A,HSPA1A,HSPA1A,MAP4K3,MAPK8IP2,MAP4K1", "MAP4K4,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*")), class = "data.frame", row.names = c(NA, -2L))
mutated_genes <- c("MAP4K4", "MAP3K12","TRAF2", "CACNG3")
output:
circuit_name mutated
1 hsa04010__117 MAP4K4,TRAF2
2 hsa04014__118 MAP4K4
A base R approach is to split genes on "," and return those strings that exactly match an entry of mutated_genes. The pattern is anchored with ^ and $ so that partial matches are rejected.
df$mutated <- sapply(strsplit(df$genes, ","), function(x)
toString(grep(paste0("^(", paste0(mutated_genes, collapse = "|"), ")$"), x, value = TRUE)))
df[c(1, 3)]
# circuit_name mutated
#1 hsa04010__117 MAP4K4, MAP3K12, TRAF2
#2 hsa04014__118 MAP4K4
Please note that based on the mutated_genes vector, your expected output is missing MAP3K12 for hsa04010__117.
Here is a tidyverse possibility:
library(dplyr)
library(tidyr) # needed for separate_rows()
df %>%
separate_rows(genes) %>%
filter(genes %in% mutated_genes) %>%
group_by(circuit_name) %>%
summarise(mutated = toString(genes))
## A tibble: 2 x 2
# circuit_name mutated
# <chr> <chr>
#1 hsa04010__117 MAP4K4, MAP3K12, TRAF2
#2 hsa04014__118 MAP4K4
Explanation: We separate comma-separated entries into different rows, then select only those rows where genes %in% mutated_genes and summarise results per circuit_name by concatenating genes entries.
PS. Personally I'd recommend keeping the data in a tidy long format (i.e. don't concatenate entries with toString); that way you have one row per gene, which will make any post-processing of the data much more straightforward.
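The difference between partial and exact matching is easiest to see on a single string. The sketch below uses a made-up genes string that contains near-miss names such as MAP4K41; tokens, partial and exact are illustrative names.

```r
genes <- "MAP4K4,MAP4K41,AMAP4K4,TRAF2,DUSP10*"
mutated_genes <- c("MAP4K4", "MAP3K12", "TRAF2", "CACNG3")

# Split the comma-separated string into individual gene names
tokens <- strsplit(genes, ",", fixed = TRUE)[[1]]

# Substring search: also picks up MAP4K41 and AMAP4K4
partial <- tokens[grepl("MAP4K4", tokens, fixed = TRUE)]

# Exact membership test: only whole names listed in mutated_genes
exact <- tokens[tokens %in% mutated_genes]

partial
# [1] "MAP4K4"  "MAP4K41" "AMAP4K4"
exact
# [1] "MAP4K4" "TRAF2"
```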
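We can use str_extract_all, adding word boundaries (\\b) so that only whole gene names are extracted
library(stringr)
df$mutated <- sapply(str_extract_all(df$genes, paste0("\\b(", paste(mutated_genes,
collapse="|"), ")\\b")), toString)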

R dplyr method to replace all empty factors with NA

Instead of writing and reading a dataframe to fill all empty factors in this method,
na.strings=c("","NA")
I wanted to just apply a function to all the columns and substitute the empties with NA. I've selected the factor columns so far but don't know what to do next.
df %>% select_if(is.factor) %>% ....
How would I be able to do this, preferably with dplyr and/or apply methods
We can use mutate_if
df <- df %>%
mutate_if(is.factor, funs(factor(replace(., .=="", NA))))
With dplyr 0.8.0, we can also do
df %>%
mutate_if(is.factor, na_if, y = "")
Or replace funs (which is being deprecated) with list, as @Frederick mentioned in the comments
df %>%
mutate_if(is.factor, list(~ na_if(., "")))
Or using base R, we can set the empty level to NA
j1 <- sapply(df, is.factor)
df[j1] <- lapply(df[j1], function(x) {levels(x)[levels(x) == ""] <- NA; x})
data
df <- data.frame(col1 = c("", "A", "B", ""), col2 = c("A", "", "", "C"),
col3 = 1:4, stringsAsFactors = TRUE)
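As a check on the sample data, here is a minimal base R sketch of replacing empty factor entries with NA. It assumes the string columns are factors (stringsAsFactors = TRUE is set explicitly, since recent R versions default to character columns) and relies on the fact that assigning NA to a factor level turns the affected entries into NA and drops that level. The name is_fac is illustrative.

```r
df <- data.frame(col1 = c("", "A", "B", ""),
                 col2 = c("A", "", "", "C"),
                 col3 = 1:4,
                 stringsAsFactors = TRUE)

# Which columns are factors?
is_fac <- vapply(df, is.factor, logical(1))

# For each factor column, turn the "" level into NA
df[is_fac] <- lapply(df[is_fac], function(x) {
  levels(x)[levels(x) == ""] <- NA
  x
})

is.na(df$col1)
# [1]  TRUE FALSE FALSE  TRUE
levels(df$col2)
# [1] "A" "C"
```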

How to get the name of a data.frame within a list?

How can I get a data frame's name from a list? Sure, get() gets the object itself, but I want to have its name for use within another function. Here's the use case, in case you would rather suggest a work around:
lapply(somelistOfDataframes, function(X) {
ddply(X, .(idx, bynameofX), summarise, checkSum = sum(value))
})
There is a column in each data frame that goes by the same name as the data frame within the list. How can I get this name bynameofX? names(X) would return the whole vector.
EDIT: Here's a reproducible example:
df1 <- data.frame(value = rnorm(100), cat = c(rep(1,50),
rep(2,50)), idx = rep(letters[1:4],25))
df2 <- data.frame(value = rnorm(100,8), cat2 = c(rep(1,50),
rep(2,50)), idx = rep(letters[1:4],25))
mylist <- list(cat = df1, cat2 = df2)
lapply(mylist, head, 5)
I'd use the names of the list in this fashion:
dat1 = data.frame()
dat2 = data.frame()
l = list(dat1 = dat1, dat2 = dat2)
> str(l)
List of 2
$ dat1:'data.frame': 0 obs. of 0 variables
$ dat2:'data.frame': 0 obs. of 0 variables
and then use lapply + ddply (from the plyr package) like:
lapply(names(l), function(x) {
ddply(l[[x]], c("idx", x), summarise,checkSum = sum(value))
})
This remains untested without a reproducible example, but it should point you in the right direction.
EDIT (ran2): Here's the code using the reproducible example.
l <- lapply(names(mylist), function(x) {
ddply(mylist[[x]], c("idx", x), summarise,checkSum = sum(value))
})
names(l) <- names(mylist); l
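The same pattern of passing each list element together with its name can be sketched in base R with Map() and aggregate(), without plyr. This is an illustrative alternative, not the code from the answer: the data below is a small deterministic stand-in for the rnorm() example, and res is a made-up name.

```r
df1 <- data.frame(value = 1:8, cat = rep(1:2, each = 4),
                  idx = rep(c("a", "b"), 4))
df2 <- data.frame(value = 11:18, cat2 = rep(1:2, each = 4),
                  idx = rep(c("a", "b"), 4))
mylist <- list(cat = df1, cat2 = df2)

# Map() walks over the data frames and their names in parallel, so each
# call knows which column shares its name with the list element
res <- Map(function(d, nm) {
  aggregate(d["value"], by = d[c("idx", nm)], FUN = sum)
}, mylist, names(mylist))

names(res)      # list names are kept: "cat" "cat2"
names(res$cat)  # "idx" "cat" "value"
```

Because Map() takes its result names from the first argument, the separate names(l) <- names(mylist) step is not needed here.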
Here is the dplyr equivalent
library(dplyr)
catalog =
data_frame(
data = someListOfDataframes,
cat = names(someListOfDataframes)) %>%
rowwise %>%
mutate(
renamed =
data %>%
rename_(.dots =
cat %>%
as.name %>%
list %>%
setNames("cat")) %>%
list)
catalog$renamed %>%
bind_rows(.id = "number") %>%
group_by(number, idx, cat) %>%
summarize(checkSum = sum(value))
You could first store the names with names(list) -> list_name and then use list_name[1], list_name[2], etc. to get each list name. (You may also need as.numeric(list_name[x]) if your list names are numbers.)
