How to subset the next column in R

df <- data.frame(intro = c("bob","bob","bob"),
intro_score = c("Excellent","Excellent","Good"),
method = c("sally","sally","sally"),
method_score = c("Excellent","Excellent","Excellent"),
result = c("Norman","Norman","Norman"),
result_score = c("Good","Good","Good"))
If I want to look for "bob" in this dataframe, how do I return the column next to "bob" (intro_score only), given that I'm not sure whether "bob" is in it? For example, if I look for "ken", the result should be NULL. If I look for "Norman", the result should be result_score.
I have tried something like this:
name <- "bob"
df_name <- df %>%
if (str_detect(intro, name)) {
select((which(colnames==str_detect(intro, name)))+1)
} else {}
Thank you for your help!

Using base R, if you need the column names you could do:
names(df[unique(which(df=="bob",TRUE)[,2]+1)])
[1] "intro_score"
Or, if you need the column values:
df[unique(which(df=="bob",TRUE)[,2]+1)]
intro_score
1 Excellent
2 Excellent
3 Good
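When the name is absent, the snippets above return an empty result rather than NULL, and a match in the last column would index past the end of the data frame. A minimal sketch that wraps the same which(..., arr.ind = TRUE) lookup in a helper and returns NULL when nothing usable is found, as the question asks (next_col is a hypothetical name, not part of any package):

```r
# Sample data from the question
df <- data.frame(intro = c("bob", "bob", "bob"),
                 intro_score = c("Excellent", "Excellent", "Good"),
                 method = c("sally", "sally", "sally"),
                 method_score = c("Excellent", "Excellent", "Excellent"),
                 result = c("Norman", "Norman", "Norman"),
                 result_score = c("Good", "Good", "Good"),
                 stringsAsFactors = FALSE)

# next_col() is a hypothetical helper name used only for illustration
next_col <- function(data, value) {
  # column positions of every cell equal to `value`
  hits <- unique(which(data == value, arr.ind = TRUE)[, 2])
  # a hit in the last column has no neighbour to its right
  hits <- hits[hits < ncol(data)]
  if (length(hits) == 0) return(NULL)
  data[hits + 1]
}

names(next_col(df, "bob"))     # "intro_score"
names(next_col(df, "Norman"))  # "result_score"
is.null(next_col(df, "ken"))   # TRUE
```

Returning NULL explicitly makes the "not found" case easy to test with is.null() in calling code.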

You could reshape your data into time (intro, method, result), name, and score.
df2 <- reshape(df, direction = "long", varying = list(c(1,3,5), c(2,4,6)), v.names = c("name", "score"), times = c("intro", "method", "result"))
df2[df2$name == "Norman", "score"]

library(purrr)
search_person <- "bob"
colnames(df)[which(map_lgl(df,~all(.x == search_person))) + 1]
"intro_score"

Here is one option with select_if
library(dplyr)
library(magrittr)
df %>%
select_if(~ any(. == "bob")) %>%
names %>%
match(., names(df)) %>%
add(1) %>%
names(df)[.]
#[1] "intro_score"

Related

Match text from one column with another column (vlookup + like)

I'm trying to perform a match of 2 columns but without success. I have one DF1 with 2 columns, Id and JSON. In the second DF2, I have one column with a pattern to be matched in each row for DF1$json (something like vlookup + like function).
As an output, I'd like to get DF1$Id but only where any of DF2 is matched with DF1$json.
I've tried some combinations with str_detect but it doesn't work on non-vector values. Maybe some tricks with grep or stringr functions?
For example:
str_detect(DF1$json, fixed(DF2[1,1], ignore_case = TRUE))
df1 <- data.frame(
Id = c("AA", "BB", "CC", "DD"),
json = c("{xxx:yyy:zzz};{mmm:zzz:vvv}", "{ccc:yyy:zzz};{ddd:zzz:vvv}", "{ttt:yyy:zzz};{mmm:zzz:vvv}", "{uuu:yyy:zzz};{mmm:zzz:vvv}")
)
matches <- c("mmm:zzz:vvv", "mmm:yyy:zzz")
library(stringr) # needed for str_extract_all()
Solution using data.table
library(data.table)
setDT(df1)
df1[, match := any(str_extract_all(json, "(?<=\\{).+?(?=\\})")[[1]] %in% matches), by = Id]
df1[match == TRUE, .(Id)]
Solution using dplyr
library(dplyr)
df1 %>%
group_by(Id) %>%
mutate(match = any(str_extract_all(json, "(?<=\\{).+?(?=\\})")[[1]] %in% matches)) %>%
filter(match == TRUE) %>%
select(Id)
Or just directly filter()
df1 %>%
group_by(Id) %>%
filter(any(str_extract_all(json, "(?<=\\{).+?(?=\\})")[[1]] %in% matches)) %>%
select(Id)
Output on both methods
Id
1: AA
2: CC
3: DD
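If a plain substring test is enough (an assumption; the solutions above compare whole {...} segments instead), the same lookup can be sketched in base R with grepl(fixed = TRUE), one pass per pattern:

```r
# Data from the question
df1 <- data.frame(
  Id = c("AA", "BB", "CC", "DD"),
  json = c("{xxx:yyy:zzz};{mmm:zzz:vvv}", "{ccc:yyy:zzz};{ddd:zzz:vvv}",
           "{ttt:yyy:zzz};{mmm:zzz:vvv}", "{uuu:yyy:zzz};{mmm:zzz:vvv}"),
  stringsAsFactors = FALSE
)
matches <- c("mmm:zzz:vvv", "mmm:yyy:zzz")

# One logical vector per pattern, combined with OR across patterns
hit <- Reduce(`|`, lapply(matches, function(p) grepl(p, df1$json, fixed = TRUE)))

df1$Id[hit]  # "AA" "CC" "DD"
```

fixed = TRUE treats each pattern as a literal string, so characters such as { or . in the patterns need no escaping.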
Does this give you the expected result?
my_df <- data.frame("id" = c("AA", "BB", "CC", "DD"),
"json" = c("{x:y:z};{m:z:v}", "{c:y:z};{d:z:v}", "{t:y:z};{m:z:v}", "{u:y:z};{m:z:v}"),
"pattern" = c("m:z:v", "t:y:z", "m:z:v", "t"),
stringsAsFactors = FALSE)
my_f <- function(x) {
my_var <- paste(grep(pattern = my_df[x, "pattern"], x = my_df$json), collapse = " ")
return (my_var)
}
my_df$Value <- lapply(1:nrow(my_df), my_f)

Pass column names into a function using apply or map

I want to apply multiple functions to the same dataframe. However, I have not been able to successfully pass column names as a parameter in purrr::imap. I keep getting the following error:
Error in UseMethod("select") : no applicable method for 'select'
applied to an object of class "character"
I have tried many combinations for evaluation (e.g., using !!!, [[, enquo, sys.lang, and on and on). When I apply a function (e.g., check_1) directly to a dataframe, select works fine. However, it does not work when I try to pass column names as a parameter using imap and exec. The format of the column name is part of the issue (e.g., 1.1.), but I have tried quotes and single quotes, etc.
This is a follow up to a previous post, but that post and solution focused on applying multiple functions to individual columns. Now, I need to apply multiple functions, which use more than one column in the dataframe; hence, the need to specify column names in a function.
Minimal Example
Data
df <- structure(
list(
`1.1.` = c("Andrew", "Max", "Sylvia", NA, "1",
NA, NA, "Jason"),
`1.2.` = c(1, 2, 2, NA, 4, 5, 3, NA),
`1.2.1.` = c(
"cool", "amazing", "wonderful", "okay",
NA, NA, "chocolate", "fine"
)
),
class = "data.frame",
row.names = c(NA, -8L)
)
What I have Tried
library(purrr)
library(dplyr)
check_1 <- function(x, col1, col2) {
x %>%
dplyr::select(col1, col2) %>%
dplyr::mutate(row.index = row_number()) %>%
dplyr::filter(col1 == "Jason" & is.na(col2) == TRUE) %>%
dplyr::select(row.index) %>%
unlist() %>%
as.vector()
}
check_2 <- function(x, col1, col2) {
index <- x %>%
dplyr::select(col1, col2) %>%
dplyr::mutate(row.index = row_number()) %>%
dplyr::filter(col1 >= 3 & col1 <= 5 & is.na(col2) == TRUE) %>%
dplyr::select(row.index) %>%
unlist() %>%
as.vector()
return(index)
}
checks <-
list("df" = list(fn = check_1, pars = list(col1 = "1.1.", col2 = "1.2.")),
"df" = list(fn = check_2, pars = list(col1 = "1.2.", col2 = "1.2.1.")))
results <-
purrr::imap(checks, ~ exec(.x$fn, x = .y,!!!.x$pars))
Expected Output
> results
$df
[1] 8
$df
[1] 5 6
Besides the "class character" error, I also get an additional error when I try to test the check_2 function on its own, where it returns no expected values.
[1] 1.2. 1.2.1. row.index
<0 rows> (or 0-length row.names)
I have looked at many other similar SO posts (e.g., this one), but none have solved this issue for me.
The first issue is that you pass the name of the dataframe but not the dataframe itself. That's why you get the first error: you are trying to select from a character string. To solve this issue, add the dataframe to the list you are looping over.
The second issue is that when you pass the column names as character string you have to tell dplyr that these characters refer to columns in your data. This could be achieved by e.g. making use of the .data pronoun.
Finally, instead of select + unlist + as.vector you could simply use dplyr::pull:
library(purrr)
library(dplyr)
check_1 <- function(x, col1, col2) {
x %>%
dplyr::select(all_of(c(col1, col2))) %>%
dplyr::mutate(row.index = row_number()) %>%
dplyr::filter(.data[[col1]] == "Jason" & is.na(.data[[col2]]) == TRUE) %>%
dplyr::pull(row.index)
}
check_2 <- function(x, col1, col2) {
x %>%
dplyr::select(all_of(c(col1, col2))) %>%
dplyr::mutate(row.index = row_number()) %>%
dplyr::filter(.data[[col1]] >= 3 & .data[[col1]] <= 5 & is.na(.data[[col2]]) == TRUE) %>%
dplyr::pull(row.index)
}
checks <-
list(df = list(df = df, fn = check_1, pars = list(col1 = "1.1.", col2 = "1.2.")),
df = list(df = df, fn = check_2, pars = list(col1 = "1.2.", col2 = "1.2.1.")))
purrr::map(checks, ~ exec(.x$fn, x = .x$df, !!!.x$pars))
#> $df
#> [1] 8
#>
#> $df
#> [1] 5 6
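The same point, that a column name held in a string has to be looked up explicitly, can be sketched without dplyr using [[ ]]. This is only an illustration on the question's data; check_1_base and check_2_base are hypothetical names, and do.call() stands in for exec():

```r
# Data from the question; check.names = FALSE keeps names such as "1.1." intact
df <- data.frame(
  `1.1.` = c("Andrew", "Max", "Sylvia", NA, "1", NA, NA, "Jason"),
  `1.2.` = c(1, 2, 2, NA, 4, 5, 3, NA),
  `1.2.1.` = c("cool", "amazing", "wonderful", "okay", NA, NA, "chocolate", "fine"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Hypothetical base R counterparts of check_1 and check_2
check_1_base <- function(x, col1, col2) {
  # which() ignores NA results, so only rows that are fully TRUE are returned
  which(x[[col1]] == "Jason" & is.na(x[[col2]]))
}
check_2_base <- function(x, col1, col2) {
  which(x[[col1]] >= 3 & x[[col1]] <= 5 & is.na(x[[col2]]))
}

checks <- list(
  list(fn = check_1_base, pars = list(col1 = "1.1.", col2 = "1.2.")),
  list(fn = check_2_base, pars = list(col1 = "1.2.", col2 = "1.2.1."))
)

# The data frame itself (not its name) is passed as x
results <- lapply(checks, function(ch) do.call(ch$fn, c(list(x = df), ch$pars)))
results[[1]]  # 8
results[[2]]  # 5 6
```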
Use select({{col1}}, {{col2}}).
This will most probably help you.

Select unique values

I need to change this function, which does not restrict matches to exact values. For example, if I want MAPK4, the function also matches MAPK41, AMAPK4, etc. The function must select only the unique (exact) values.
Function:
library(dplyr)
df2 <- df %>%
rowwise() %>%
mutate(mutated = paste(mutated_genes[unlist(
lapply(mutated_genes, function(x) grepl(x,genes, ignore.case = T)))], collapse=","),
circuit_name = gsub("", "", circuit_name)) %>%
select(-genes) %>%
data.frame()
data:
df <-structure(list(circuit_name = c("hsa04010__117", "hsa04014__118" ), genes = c("MAP4K4,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP3*,DUSP3*,DUSP3*,DUSP3*,PPM1A,AKT3,AKT3,AKT3,ZAK,MAP3K12,MAP3K13,TRAF2,CASP3,IL1R1,IL1R1,TNFRSF1A,IL1A,IL1A,TNF,RAC1,RAC1,RAC1,RAC1,MAP2K7,MAPK8,MAPK8,MAPK8,MECOM,HSPA1A,HSPA1A,HSPA1A,HSPA1A,HSPA1A,HSPA1A,MAP4K3,MAPK8IP2,MAP4K1", "MAP4K4,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*")), class = "data.frame", row.names = c(NA, -2L))
mutated_genes <- c("MAP4K4", "MAP3K12","TRAF2", "CACNG3")
output:
circuit_name mutated
1 hsa04010__117 MAP4K4,TRAF2
2 hsa04014__118 MAP4K4
A base R approach would be to split genes on "," and return those strings which match mutated_genes.
df$mutated <- sapply(strsplit(df$genes, ","), function(x)
toString(grep(paste0(mutated_genes, collapse = "|"), x, value = TRUE)))
df[c(1, 3)]
# circuit_name mutated
#1 hsa04010__117 MAP4K4, MAP3K12, TRAF2
#2 hsa04014__118 MAP4K4
Please note that based on the mutated_genes vector, your expected output is missing MAP3K12 for hsa04010__117.
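Note that grep() with an alternation pattern still matches partially, so a token such as MAP4K41 would also be returned. A minimal sketch of exact matching with %in% on the split tokens (the gene strings below are made-up near misses for illustration, not data from the question):

```r
mutated_genes <- c("MAP4K4", "MAP3K12", "TRAF2", "CACNG3")

# Hypothetical input containing near misses (MAP4K41, AMAP4K4)
genes <- c("MAP4K41,AMAP4K4,MAP4K4,TRAF2", "AMAP4K4,MAP3K12,MAP3K12")

exact <- vapply(strsplit(genes, ",", fixed = TRUE), function(x) {
  # %in% compares whole tokens; unique() drops repeated genes
  toString(unique(x[x %in% mutated_genes]))
}, character(1))

exact  # "MAP4K4, TRAF2" "MAP3K12"
```

The same idea applies to the question's data by replacing genes with df$genes.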
Here is a tidyverse possibility
library(dplyr)
library(tidyr)  # provides separate_rows()
df %>%
separate_rows(genes) %>%
filter(genes %in% mutated_genes) %>%
group_by(circuit_name) %>%
summarise(mutated = toString(genes))
## A tibble: 2 x 2
# circuit_name mutated
# <chr> <chr>
#1 hsa04010__117 MAP4K4, MAP3K12, TRAF2
#2 hsa04014__118 MAP4K4
Explanation: We separate comma-separated entries into different rows, then select only those rows where genes %in% mutated_genes and summarise results per circuit_name by concatenating genes entries.
PS. Personally I'd recommend keeping the data in a tidy long format (i.e. don't concatenate entries with toString); that way you have one row per gene, which will make any post-processing of the data much more straightforward.
We can use str_extract_all
library(stringr)
df$mutated <- sapply(str_extract_all(df$genes, paste(mutated_genes,
collapse="|")), toString)

Dplyr Non Standard Evaluation -- Help Needed

I am making my first baby steps with non standard evaluation (NSE) in dplyr.
Consider the following snippet: it takes a tibble, sorts it according to the values inside a column and replaces the n-k lower values with "Other".
See for instance:
library(dplyr)
df <- cars%>%as_tibble
k <- 3
df2 <- df %>%
arrange(desc(dist)) %>%
mutate(dist2 = factor(c(dist[1:k],
rep("Other", n() - k)),
levels = c(dist[1:k], "Other")))
What I would like is a function such that:
df2bis<-df %>% sort_keep(old_column, new_column, levels_to_keep)
produces the same result, where old_column is "dist" (the column I use to sort the data set), new_column is "dist2" (the column I generate) and levels_to_keep is "k" (the number of values I explicitly retain).
I am getting lost in enquo, quo_name etc...
Any suggestion is appreciated.
You can do:
library(dplyr)
sort_keep=function(df,old_column, new_column, levels_to_keep){
old_column = enquo(old_column)
new_column = as.character(substitute(new_column))
df %>%
arrange(desc(!!old_column)) %>%
mutate(use = !!old_column,
!!new_column := factor(c(use[1:levels_to_keep],
rep("Other", n() - levels_to_keep)),
levels = c(use[1:levels_to_keep], "Other")),
use=NULL)
}
df%>%sort_keep(dist,dist2,3)
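With recent versions of rlang and dplyr, the same function can be sketched with the embrace operator {{ }} instead of enquo() and substitute(). This assumes a dplyr version that supports {{ }} and the "{{ name }}" := glue syntax, and that the top levels_to_keep values are distinct (duplicated factor levels would raise an error):

```r
library(dplyr)

sort_keep <- function(data, old_column, new_column, levels_to_keep) {
  data %>%
    arrange(desc({{ old_column }})) %>%
    mutate("{{ new_column }}" := {
      # the first levels_to_keep values after sorting, as labels
      top <- head(as.character({{ old_column }}), levels_to_keep)
      factor(c(top, rep("Other", n() - levels_to_keep)),
             levels = c(top, "Other"))
    })
}

res <- as_tibble(cars) %>% sort_keep(dist, dist2, 3)
levels(res$dist2)  # "120" "93" "92" "Other"
```

Column names are passed unquoted, exactly as in the call the question asks for.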
Something like this?
old_column = "dist"
new_column = "dist2"
levels_to_keep = 3
command = "df2bis<-df %>% sort_keep(old_column, new_column, levels_to_keep)"
command = gsub('old_column', old_column, command)
command = gsub('new_column', new_column, command)
command = gsub('levels_to_keep', levels_to_keep, command)
eval(parse(text=command))

How to get the name of a data.frame within a list?

How can I get a data frame's name from a list? Sure, get() gets the object itself, but I want to have its name for use within another function. Here's the use case, in case you would rather suggest a work around:
lapply(somelistOfDataframes, function(X) {
ddply(X, .(idx, bynameofX), summarise, checkSum = sum(value))
})
There is a column in each data frame that goes by the same name as the data frame within the list. How can I get this name bynameofX? names(X) would return the whole vector.
EDIT: Here's a reproducible example:
df1 <- data.frame(value = rnorm(100), cat = c(rep(1,50),
rep(2,50)), idx = rep(letters[1:4],25))
df2 <- data.frame(value = rnorm(100,8), cat2 = c(rep(1,50),
rep(2,50)), idx = rep(letters[1:4],25))
mylist <- list(cat = df1, cat2 = df2)
lapply(mylist, head, 5)
I'd use the names of the list in this fashion:
dat1 = data.frame()
dat2 = data.frame()
l = list(dat1 = dat1, dat2 = dat2)
> str(l)
List of 2
$ dat1:'data.frame': 0 obs. of 0 variables
$ dat2:'data.frame': 0 obs. of 0 variables
and then use lapply + ddply (from the plyr package) like:
lapply(names(l), function(x) {
ddply(l[[x]], c("idx", x), summarise,checkSum = sum(value))
})
This remains untested without a reproducible example, but it should point you in the right direction.
EDIT (ran2): Here's the code using the reproducible example.
l <- lapply(names(mylist), function(x) {
ddply(mylist[[x]], c("idx", x), summarise,checkSum = sum(value))
})
names(l) <- names(mylist); l
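If plyr is not available, the same per-name grouping can be sketched in base R by pairing each data frame with its list name via Map(). This is an illustration on the reproducible example above (with a seed added so the numbers are repeatable), not the answerer's code:

```r
set.seed(42)  # seed added here only to make the example repeatable
df1 <- data.frame(value = rnorm(100), cat = c(rep(1, 50), rep(2, 50)),
                  idx = rep(letters[1:4], 25))
df2 <- data.frame(value = rnorm(100, 8), cat2 = c(rep(1, 50), rep(2, 50)),
                  idx = rep(letters[1:4], 25))
mylist <- list(cat = df1, cat2 = df2)

res <- Map(function(d, nm) {
  # group by idx and by the column named after the list element
  out <- aggregate(d["value"], by = d[c("idx", nm)], FUN = sum)
  names(out)[names(out) == "value"] <- "checkSum"
  out
}, mylist, names(mylist))

names(res)  # "cat" "cat2"
head(res$cat)
```

Map() keeps the names of mylist on the result, so no separate names(l) <- ... step is needed.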
Here is the dplyr equivalent
library(dplyr)
catalog =
data_frame(
data = someListOfDataframes,
cat = names(someListOfDataframes)) %>%
rowwise %>%
mutate(
renamed =
data %>%
rename_(.dots =
cat %>%
as.name %>%
list %>%
setNames("cat")) %>%
list)
catalog$renamed %>%
bind_rows(.id = "number") %>%
group_by(number, idx, cat) %>%
summarize(checkSum = sum(value))
You could first use names(list) -> list_name and then use list_name[1], list_name[2], etc. to get each list name. (You may also need as.numeric(list_name[x]) if your list names are numbers.)
