String with values mapped from other data frame in R

I would like to build a string based on ids from other columns, where the actual values sit in a dictionary.
Ideally, this would look like:
library(tidyverse)
region_dict <- tibble(
id = c("reg_id1", "reg_id2", "reg_id3"),
name = c("reg_1", "reg_2", "reg_3")
)
color_dict <- tibble(
id = c("col_id1", "col_id2", "col_id3"),
name = c("col_1", "col_2", "col_3")
)
tibble(
region = c("reg_id1", "reg_id2", "reg_id3"),
color = c("col_id1", "col_id2", "col_id3"),
my_string = str_c(
"xxx_",
region_name,
"_",
color_name
))
#> # A tibble: 3 x 3
#> region color my_string
#> <chr> <chr> <chr>
#> 1 reg_id1 col_id1 xxx_reg_1_col_1
#> 2 reg_id2 col_id2 xxx_reg_2_col_2
#> 3 reg_id3 col_id3 xxx_reg_3_col_3
Created on 2021-03-01 by the reprex package (v0.3.0)
I know of dplyr's recode() function but I can't think of a way to use it the way I want.
I also thought about first using left_join() and then concatenating the string from the new columns. That would work, but it doesn't seem pretty to me because I would end up with columns that I'd need to remove later. In the real dataset I have 5 variables.
I'll be glad to read your ideas.

This may also be solved with a fuzzyjoin. However, since the ids share a common suffix, it makes sense to remove the prefix from the 'id' column of each dictionary, do a left_join on the remaining key, and then create 'my_string' by pasting the columns together.
library(stringr)
library(dplyr)
region_dict %>%
mutate(id1 = str_remove(id, '.*_')) %>%
left_join(color_dict %>%
mutate(id1 = str_remove(id, '.*_')), by = 'id1') %>%
transmute(region = id.x, color = id.y,
my_string = str_c('xxx_', name.x, '_', name.y))
-output
# A tibble: 3 x 3
# region color my_string
# <chr> <chr> <chr>
#1 reg_id1 col_id1 xxx_reg_1_col_1
#2 reg_id2 col_id2 xxx_reg_2_col_2
#3 reg_id3 col_id3 xxx_reg_3_col_3
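If the goal is to avoid temporary join columns altogether, one option is to turn each dictionary into a named vector and index it by id. This is only a minimal sketch, assuming every id in the data has an entry in its dictionary; `region_lookup`, `color_lookup` and `dat` are names made up for illustration.

```r
library(tibble)
library(dplyr)
library(stringr)

region_dict <- tibble(
  id = c("reg_id1", "reg_id2", "reg_id3"),
  name = c("reg_1", "reg_2", "reg_3")
)
color_dict <- tibble(
  id = c("col_id1", "col_id2", "col_id3"),
  name = c("col_1", "col_2", "col_3")
)

# deframe() turns a two-column tibble into a named vector:
# names come from the first column (id), values from the second (name)
region_lookup <- deframe(region_dict)
color_lookup <- deframe(color_dict)

# hypothetical data holding only the ids
dat <- tibble(
  region = c("reg_id1", "reg_id2", "reg_id3"),
  color = c("col_id1", "col_id2", "col_id3")
)

# index each lookup vector by id; unname() drops the id names again
result <- dat %>%
  mutate(my_string = str_c(
    "xxx_",
    unname(region_lookup[region]),
    "_",
    unname(color_lookup[color])
  ))
```

With five variables this needs one lookup vector per dictionary and no columns to clean up afterwards. An id that is missing from its dictionary yields NA, which str_c() propagates to the whole string.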

Related

Adding values from lookup-table based on condition to data frame in R

I've got a data frame containing data of participants who rated images (column image_index):
Now I want to add a new column with gender-specific values of the rated image from another data frame.
Look-up table of image data:
Final data frame:
How can I accomplish this task?
Sample data:
library(tidyverse)
participants_data <- data.frame(
ID = c(1,2,3,4),
gender = c('f','m','d','f'),
image_index = c(19,2,2,19)
)
lookup_data <- data.frame(
index = c(2,19),
male = c(100,110),
female = c(150,125),
diverse = c(130, 90)
)
complete_dataset <- data.frame(
ID = c(1,2,3,4),
gender = c('f','m','d','f'),
image_index = c(19,2,2,19),
external_value = c(125,100,130,150)
)
You need to make a few manipulations on your data to join them together.
Pivot lookup_data longer with tidyr::pivot_longer() so the gender info is in a column to help merge on.
Use dplyr::rename() to make sure the column names are the same between the two tables.
Transform the gender column so it is just 1 letter to match the other table. Here I use stringr::str_sub(x, 1,1) which just takes the first character of a string.
Then I use left_join() to merge. Because the joining column names are already the same I don't need to specify.
Finally I just reorder and sort the data to match your expected output.
library(tidyverse)
participants_data <- data.frame(
ID = c(1,2,3,4),
gender = c('f','m','d','f'),
image_index = c(19,2,2,19)
)
lookup_data <- data.frame(
index = c(2,19),
male = c(100,110),
female = c(150,125),
diverse = c(130, 90)
)
lookup_data %>%
pivot_longer(-index, names_to = "gender", values_to = "external_value") %>%
rename(image_index = index) %>%
mutate(gender = str_sub(gender, 1, 1)) %>%
left_join(., participants_data) %>%
drop_na(ID) %>%
select(ID, gender, image_index, external_value) %>%
arrange(ID)
#> Joining, by = c("image_index", "gender")
#> # A tibble: 4 x 4
#> ID gender image_index external_value
#> <dbl> <chr> <dbl> <dbl>
#> 1 1 f 19 125
#> 2 2 m 2 100
#> 3 3 d 2 130
#> 4 4 f 19 125
Created on 2022-02-18 by the reprex package (v2.0.1)
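The same steps can also be run starting from participants_data, which keeps its row order and makes drop_na() and arrange() unnecessary. This is a sketch under the assumption that the three gender codes map as shown; `gender_codes` and `lookup_long` are hypothetical helper names.

```r
library(dplyr)
library(tidyr)

participants_data <- data.frame(
  ID = c(1, 2, 3, 4),
  gender = c("f", "m", "d", "f"),
  image_index = c(19, 2, 2, 19)
)
lookup_data <- data.frame(
  index = c(2, 19),
  male = c(100, 110),
  female = c(150, 125),
  diverse = c(130, 90)
)

# explicit mapping from lookup column names to the one-letter codes (assumed)
gender_codes <- c(male = "m", female = "f", diverse = "d")

# one row per (index, gender) pair
lookup_long <- lookup_data %>%
  pivot_longer(-index, names_to = "gender", values_to = "external_value") %>%
  mutate(gender = unname(gender_codes[gender]))

# keep every participant row and attach the matching value
result <- participants_data %>%
  left_join(lookup_long, by = c("image_index" = "index", "gender"))
```

Spelling out the `by` argument avoids the "Joining, by = ..." message and makes the join keys visible in the code.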

match values in 2 columns with the corresponding position in another character column

An example dataframe:
example_df = data.frame(Gene.names = c("A", "B"),
Score = c("3.69,2.97,2.57,3.09,2.94",
"3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83"),
ResidueAA = c("S", "Y"),
ResidueNo = c(3, 3),
Sequence = c("MSSYT", "MSSYTRAP") )
I want to check whether the character in the ResidueAA column matches the character found at the position given by ResidueNo in the ‘Sequence’ column. The output should be another column, say, ‘Check’ with a Yes or No.
This is working code:
example_df$Check=sapply(1:nrow(example_df),FUN=function(i){d=example_df[i,]; substr(d$Sequence,d$ResidueNo,d$ResidueNo)==d$ResidueAA})
Is there an easier/elegant way to do this? Ideally, I want something that works within a dplyr pipe.
Also, related to this, how can I extract the corresponding value from the 'Score' column into a new column, say, 'Score_1'?
Thanks
We can use substr directly
library(dplyr)
example_df %>%
mutate(Check = substr(Sequence, ResidueNo, ResidueNo) == ResidueAA)
-output
# Gene.names Score ResidueAA ResidueNo Sequence Check
#1 A 3.69,2.97,2.57,3.09,2.94 S 3 MSSYT TRUE
#2 B 3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83 Y 3 MSSYTRAP FALSE
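This works without sapply because substr() is vectorised over its start and stop arguments as well as over the string. A small standalone sketch with made-up positions (`sequences` and `positions` are illustrative names only):

```r
sequences <- c("MSSYT", "MSSYTRAP")
positions <- c(3, 4)

# each string is cut at its own position: element i uses positions[i]
residues <- substr(sequences, positions, positions)
residues
#> [1] "S" "Y"
```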
To create a new column with matching 'Score', use match to get the corresponding index instead of == (which does an elementwise comparison) and use the index for extracting the 'Score' element
example_df %>%
mutate(Score2 = Score[match(ResidueAA,
substr(Sequence, ResidueNo, ResidueNo))])
-output
#Gene.names Score ResidueAA ResidueNo Sequence
#1 A 3.69,2.97,2.57,3.09,2.94 S 3 MSSYT
#2 B 3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83 Y 3 MSSYTRAP
# Score2
#1 3.69,2.97,2.57,3.09,2.94
#2 <NA>
Update
Based on the comments, we need to extract the element of 'Score' at position 'ResidueNo' if the substring of 'Sequence' at that position is the same as 'ResidueAA'. This can be done by splitting 'Score' with strsplit into a list, extracting the first list element ([[1]], after a rowwise operation), and then using 'ResidueNo' to pick the split value at that position
example_df %>%
rowwise %>%
mutate(Score2 = if(substr(Sequence, ResidueNo, ResidueNo) ==
ResidueAA) strsplit(Score, ",")[[1]][ResidueNo] else NA_character_) %>%
ungroup
-output
# A tibble: 2 x 6
# Gene.names Score ResidueAA ResidueNo Sequence Score2
# <chr> <chr> <chr> <dbl> <chr> <chr>
#1 A 3.69,2.97,2.57,3.09,2.94 S 3 MSSYT 2.57
#2 B 3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83 Y 3 MSSYTRAP <NA>
Or another option is separate_rows to split 'Score' and expand the data into one row per value, then group by 'Gene.names', summarise to get the corresponding 'Score2' element (similar to the previous solution), and join the result with the original dataset
library(tidyr)
example_df %>%
separate_rows(Score, sep= ",") %>%
group_by(Gene.names) %>%
summarise(Score2 = if(substr(first(Sequence), first(ResidueNo), first(ResidueNo)) ==
first(ResidueAA)) Score[first(ResidueNo)] else
NA_character_, .groups = 'drop') %>%
right_join(example_df)
To get an individual score, you would need to split the string and return the index corresponding to the position. You could vectorize this, e.g.:
getScore <- Vectorize(function(x, pos) unlist(strsplit(x, ",", TRUE), use.names = FALSE)[pos])
example_df %>% mutate(check=substr(Sequence, ResidueNo, ResidueNo) == ResidueAA,
MyScore=ifelse(check, as.numeric(getScore(Score, ResidueNo)), NA))
#> Gene.names Score ResidueAA ResidueNo
#> 1 A 3.69,2.97,2.57,3.09,2.94 S 3
#> 2 B 3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83 Y 3
#> Sequence check MyScore
#> 1 MSSYT TRUE 2.57
#> 2 MSSYTRAP FALSE NA
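For comparison, here is a sketch of the same idea that splits 'Score' once and pairs each split vector with its position through mapply, without rowwise or Vectorize. The column names `score_at_pos` and `Score_1` are chosen for illustration, and the sketch assumes ResidueNo never exceeds the number of scores in a row (otherwise the lookup returns NA).

```r
library(dplyr)

example_df <- data.frame(
  Gene.names = c("A", "B"),
  Score = c("3.69,2.97,2.57,3.09,2.94",
            "3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83"),
  ResidueAA = c("S", "Y"),
  ResidueNo = c(3, 3),
  Sequence = c("MSSYT", "MSSYTRAP"),
  stringsAsFactors = FALSE
)

result <- example_df %>%
  mutate(
    # TRUE when the residue at ResidueNo matches ResidueAA
    Check = substr(Sequence, ResidueNo, ResidueNo) == ResidueAA,
    # strsplit() returns one character vector per row;
    # mapply() picks element ResidueNo[i] from the i-th vector
    score_at_pos = mapply(function(parts, i) parts[i],
                          strsplit(Score, ",", fixed = TRUE),
                          ResidueNo),
    # keep the score only where the residue check passed
    Score_1 = if_else(Check, as.numeric(score_at_pos), NA_real_)
  )
```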

Parse and Evaluate Column of String Expressions in R?

How can I parse and evaluate a column of string expressions in R as part of a pipeline?
In the example below, I produce my desired column, evaluated. But I know this isn't the right approach. I tried taking a tidyverse approach. But I'm just very confused.
library(tidyverse)
df <- tibble(name = LETTERS[1:3],
to_evaluate = c("1-1+1", "iter+iter", "4*iter-1"),
evaluated = NA)
iter = 1
for (i in 1:nrow(df)) {
df[i,"evaluated"] <- eval(parse(text=df$to_evaluate[[i]]))
}
print(df)
# # A tibble: 3 x 3
# name to_evaluate evaluated
# <chr> <chr> <dbl>
# 1 A 1-1+1 1
# 2 B iter+iter 2
# 3 C 4*iter-1 3
As part of a pipeline, I tried:
df %>% mutate(evaluated = eval(parse(text=to_evaluate)))
df %>% mutate(evaluated = !!parse_exprs(to_evaluate))
df %>% mutate(evaluated = parse_exprs(to_evaluate))
df %>% mutate(evaluated = eval(parse_expr(to_evaluate)))
df %>% mutate(evaluated = parse_exprs(to_evaluate))
df %>% mutate(evaluated = eval(parse_exprs(to_evaluate)))
df %>% mutate(evaluated = eval_tidy(parse_exprs(to_evaluate)))
None of these work.
You can try:
df %>%
rowwise() %>%
mutate(iter = 1,
evaluated = eval(parse(text = to_evaluate))) %>%
select(-iter)
name to_evaluate evaluated
<chr> <chr> <dbl>
1 A 1-1+1 1
2 B iter+iter 2
3 C 4*iter-1 3
Following this logic, other possibilities also work, for example rlang::parse_expr():
df %>%
rowwise() %>%
mutate(iter = 1,
evaluated = eval(rlang::parse_expr(to_evaluate))) %>%
select(-iter)
On the other hand, I think it is important to quote @Martin Mächler:
The (possibly) only connection is via parse(text = ....) and all good
R programmers should know that this is rarely an efficient or safe
means to construct expressions (or calls). Rather learn more about
substitute(), quote(), and possibly the power of using
do.call(substitute, ......).
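To illustrate what that advice could look like here: if you control how the expressions are created, they can be stored as quoted calls instead of strings, so no parse(text = ...) is needed. This is only a sketch under that assumption; it does not help if the expressions arrive as text from elsewhere. `exprs` and `evaluated` are illustrative names.

```r
# expressions kept as language objects rather than character strings
exprs <- list(
  quote(1 - 1 + 1),
  quote(iter + iter),
  quote(4 * iter - 1)
)

# evaluate each call with iter supplied through a list, not a global variable
evaluated <- vapply(
  exprs,
  function(e) eval(e, list(iter = 1)),
  numeric(1)
)
evaluated
#> [1] 1 2 3
```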
Here's a slightly different way that does everything within mutate.
df %>% mutate(
evaluated = pmap_dbl(., function(name, to_evaluate, evaluated)
eval(parse(text=to_evaluate)))
)
# A tibble: 3 x 3
name to_evaluate evaluated
<chr> <chr> <dbl>
1 A 1-1+1 1
2 B iter+iter 2
3 C 4*iter-1 3
Note that values of additional variables (such as iter=1 in your case) can be passed directly to eval():
df %>%
mutate( evaluated = map_dbl(to_evaluate, ~eval(parse(text=.x), list(iter=1))) )
One advantage is that it automatically restricts the scope of the variable, keeping its value right next to where it is used.

Make column of input items with purrr::map_df using .id without duplicating inputs for named vector

I often want to map over a vector of column names in a data frame, and keep track of the output using the .id argument. But to write the column names related to each map iteration into that .id column seems to require doubling up their name in the input vector - in other words, by naming each column name with its own name. If I don't name the column with its own name, then .id just stores the index of the iteration.
This is expected behavior, per the purrr::map docs:
.id
Either a string or NULL. If a string, the output will contain a variable with that name, storing either the name (if .x is named) or the index (if .x is unnamed) of the input.
But my approach feels a little clunky, so I imagine I'm missing something. Is there a better way to get a list of the columns I'm iterating over, that doesn't require writing each column name twice in the input vector? Any suggestions would be much appreciated!
Here's an example to work with:
library(rlang)
library(tidyverse)
tb <- tibble(foo = rnorm(10), bar = rnorm(10))
cols_once <- c("foo", "bar")
cols_once %>% map_dfr(~ tb %>% summarise(avg = mean(!!sym(.x))), .id="var")
# A tibble: 2 x 2
var avg <-- var stores only the iteration index
<chr> <dbl>
1 1 -0.0519
2 2 0.204
cols_twice <- c("foo" = "foo", "bar" = "bar")
cols_twice %>% map_dfr(~ tb %>% summarise(avg = mean(!!sym(.x))), .id="var")
# A tibble: 2 x 2
var avg <-- var stores the column names
<chr> <dbl>
1 foo -0.0519
2 bar 0.204
Here's an alternative solution for your specific scenario using summarize_at and gather:
tb %>% summarize_at( cols_once, mean ) %>% gather( var, avg )
# # A tibble: 2 x 2
# var avg
# <chr> <dbl>
# 1 foo 0.374
# 2 bar 0.0397
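summarize_at() and gather() are superseded in current dplyr and tidyr. An equivalent sketch with across() and pivot_longer(), assuming dplyr >= 1.0.0 and tidyr >= 1.0.0, and using fixed values instead of rnorm() so the result is reproducible:

```r
library(dplyr)
library(tidyr)

tb <- tibble(foo = c(1, 2, 3), bar = c(10, 20, 30))
cols_once <- c("foo", "bar")

result <- tb %>%
  # one mean per selected column, in a single wide row
  summarise(across(all_of(cols_once), mean)) %>%
  # reshape to one row per column name
  pivot_longer(everything(), names_to = "var", values_to = "avg")
```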
In a more general scenario, I don't think there's a way around naming your cols_once when working with map_dfr, because of the expected behavior you pointed out in your question. However, you can use the "snake case" wrapper for setNames() to do it more elegantly:
cols_once %>% set_names %>%
map_dfr(~ tb %>% summarise(avg = mean(!!sym(.x))), .id="var")
# # A tibble: 2 x 2
# var avg
# <chr> <dbl>
# 1 foo 0.374
# 2 bar 0.0397
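In purrr 1.0.0 and later, map_dfr() is superseded in favour of map() followed by list_rbind(), which takes the names through its names_to argument. A sketch of the same pattern, assuming purrr >= 1.0.0 and using fixed values in place of rnorm():

```r
library(dplyr)
library(purrr)

tb <- tibble(foo = c(1, 2, 3), bar = c(10, 20, 30))
cols_once <- c("foo", "bar")

result <- cols_once %>%
  set_names() %>%                       # name each element with its own value
  map(function(col) {
    # .data[[col]] looks the column up by its string name
    summarise(tb, avg = mean(.data[[col]]))
  }) %>%
  list_rbind(names_to = "var")          # list names become the 'var' column
```

The set_names() step is still required; list_rbind() only changes where the names end up.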
You could create your input vector easily with:
setNames(names(tb), names(tb))
So your code would be:
setNames(names(tb), names(tb)) %>%
map_dfr(~ tb %>% summarise(avg = mean(!!sym(.x))), .id="var")
Edit following your comment:
Still not the solution you are hoping for, but when you don't use all the column names, you could still use setNames() and subset the ones you want (or subset out the ones you don't).
tb <- tibble(foo = rnorm(10), bar = rnorm(10), taz = rnorm(10))
setNames(names(tb), names(tb))[-3]

stringr: find rows where any column content matches a regex

Consider the following example
> data_text <- data.frame(text = c('where', 'are', 'you'),
blob = c('little', 'nice', 'text'))
> data_text
# A tibble: 3 x 2
text blob
<chr> <chr>
1 where little
2 are nice
3 you text
I want to print the rows that contain the regex text (that is, row 3)
The problem is that I have hundreds of columns and I don't know which one contains this string. str_detect only works with one column at a time...
How can I do that using the stringr package?
Thanks!
With stringr and dplyr you can do this.
You should use filter_all from dplyr >= 0.5.0.
I have extended the data to have a better look on the result:
library(dplyr)
library(stringr)
data_text <- data.frame(text = c('text', 'where', 'are', 'you'),
one_more_text = c('test', 'test', 'test', 'test'),
blob = c('wow', 'little', 'nice', 'text'))
data_text %>%
filter_all(any_vars(str_detect(., 'text')))
# output
text one_more_text blob
1 text test wow
2 you test text
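In newer dplyr versions the scoped verbs such as filter_all() are superseded, and the same filter can be written with if_any(). A sketch assuming dplyr >= 1.0.4, where if_any() is available:

```r
library(dplyr)
library(stringr)

data_text <- data.frame(
  text = c("text", "where", "are", "you"),
  one_more_text = c("test", "test", "test", "test"),
  blob = c("wow", "little", "nice", "text"),
  stringsAsFactors = FALSE
)

# keep rows where at least one column matches the regex 'text'
result <- data_text %>%
  filter(if_any(everything(), ~ str_detect(.x, "text")))
```

With hundreds of columns of mixed types, everything() could be replaced by where(is.character) so that only character columns are searched.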
You can treat the data.frame as a list and use purrr::map to check each column, which can then be reduced into a logical vector that filter can handle. Alternatively, purrr::pmap can iterate over all the columns in parallel:
library(tidyverse)
data_text <- tibble(text = c('where', 'are', 'you'),
blob = c('little', 'nice', 'text'))
data_text %>% filter(map(., ~.x == 'text') %>% reduce(`|`))
#> # A tibble: 1 x 2
#> text blob
#> <chr> <chr>
#> 1 you text
data_text %>% filter(pmap_lgl(., ~any(c(...) == 'text')))
#> # A tibble: 1 x 2
#> text blob
#> <chr> <chr>
#> 1 you text
matches = apply(data_text,1,function(x) sum(grepl("text",x)))>0
result = data_text[matches,]
No other packages required. Hope this helps!
