stringr: replace string by supplying pattern through a character vector

Here is my data:
df <- tibble::tribble(
~A, ~B,
"C", "G",
"D", "H",
"E", "I",
"F", "J")
value1 <- "D"
value2 <- "C"
In variable A, I want to replace D and C with "m" and "n". I tried something like this, but it's not working:
df %>% mutate(X = A %>% str_replace_all(c(value1 = "m", value2 = "n")))
My desired output is what this produces:
df %>% mutate(X = A %>% str_replace_all(c("D" = "m", "C" = "n")))
But instead of typing "D" and "C" manually, I want to supply them programmatically, using value1 and value2.
How should I do that?

You could use setNames to name the replacements "m" and "n" with the contents of value1 and value2:
library(dplyr)
library(stringr)
df %>% mutate(X = A %>% str_replace_all(setNames(c("m","n"), c(value1, value2))))
# A tibble: 4 x 3
# A B X
# <chr> <chr> <chr>
#1 C G n
#2 D H m
#3 E I E
#4 F J F
And then checking that it's equal to your desired result:
identical(
df %>% mutate(X = A %>% str_replace_all(c("D" = "m", "C" = "n"))),
df %>% mutate(X = A %>% str_replace_all(setNames(c("m","n"), c(value1, value2)))))
#[1] TRUE
I also included the other packages you use: dplyr and stringr
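To see why the original attempt fails, it may help to compare the names of the two vectors. This is a minimal base R sketch; `literal` and `lookup` are illustrative names that do not appear in the question.

```r
value1 <- "D"
value2 <- "C"

# Inside c(), the left-hand side of `=` is taken literally as the name,
# so the variables value1 and value2 are never evaluated.
literal <- c(value1 = "m", value2 = "n")
names(literal)
# [1] "value1" "value2"

# setNames() evaluates its second argument, so the names are the
# contents of value1 and value2.
lookup <- setNames(c("m", "n"), c(value1, value2))
names(lookup)
# [1] "D" "C"
```

Since str_replace_all treats the names as regular expressions by default, values containing characters such as `.` or `(` would need escaping before being used this way.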

You can create a named vector and use it as a lookup table for the replacements.
replacementVector <- c("m","n")
names(replacementVector) <- c("D","C")
Now use replacementVector in the dplyr chain along with ifelse:
df %>% mutate(X = ifelse(is.na(replacementVector[A]), A, replacementVector[A]))
# # A tibble: 4 x 3
# A B X
# <chr> <chr> <chr>
# 1 C G n
# 2 D H m
# 3 E I E
# 4 F J F
Data:
library(tidyverse)
df <- tibble::tribble(
~A, ~B,
"C", "G",
"D", "H",
"E", "I",
"F", "J")

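str_replace_all is vectorized over string, pattern and replacement, so if you put all the patterns in one vector and all the replacements in another you can run
df %>% mutate(X = A %>% str_replace_all(c("C", "D"), c("n", "m")))
Note that the patterns are paired with the rows of A by position (recycled along the column) rather than each pattern being tried on every row, so this only gives the desired result when the rows happen to line up with the patterns, as they do here. Recent stringr versions apply stricter recycling rules and may refuse to recycle a length-2 pattern along a length-4 column.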

We could use chartr, since every pattern and replacement here is a single character:
df %>%
  mutate(X = chartr('DC', 'mn', A))
# A tibble: 4 x 3
# A B X
# <chr> <chr> <chr>
#1 C G n
#2 D H m
#3 E I E
#4 F J F
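A small base R sketch of what chartr does may make its scope clearer. It translates character by character, so it also changes a "C" or "D" that sits inside a longer string; the vector `x` below is made up for illustration.

```r
# Illustrative input, not taken from the question
x <- c("C", "D", "CD", "E")

# Each "D" becomes "m" and each "C" becomes "n", wherever it occurs
translated <- chartr("DC", "mn", x)
translated
# [1] "n"  "m"  "nm" "E"
```

This fits the question's data, where every value is one character long, but it cannot map a value to a replacement of a different length.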

Related

How to crosstabulate two variables to classify a third categorical variable in R

I want to crosstabulate x by y to obtain in the table cells, the values of z.
library(tidyverse)
df <- tibble(x = c("a", "a", "b", "b"),
y = c("c", "d", "c", "d"),
z = c("e", "g", "f", "h"))
# I want to obtain this result:
# c d
# a e g
# b f h
Created on 2021-07-18 by the reprex package (v2.0.0)

I think you want tidyr::pivot_wider...
df %>% pivot_wider(names_from = y, values_from = z)
# A tibble: 2 x 3
x c d
<chr> <chr> <chr>
1 a e g
2 b f h

left_join in a for loop with different columns names

I have a data.frame called a whose structure is similar to:-
a <- data.frame(X1=c("A", "B", "C", "A", "C", "D"),
X2=c("B", "C", "D", "A", "B", "A"),
X3=c("C", "D", "A", "B", "A", "B")
)
And I have another set which is:-
b <- data.frame(Xn=c("A", "B", "C", "D"),
Feature=c("some", "more", "what", "why"))
I want to add all the Features from set b to set a, such that X1, X2 and X3 have their corresponding feature column in set a. In other words, the columns in set a become:-
colnames(a) <- c("X1", "X2", "X3", "Features1", "Features2", "Features3")
How can I do this using a left_join in a for loop??

In base R, we can unlist a dataframe and match it with b$Xn to get corresponding Feature value. We can cbind this dataframe to original dataframe to get final answer.
temp <- a
temp[] <- b$Feature[match(unlist(temp), b$Xn)]
names(temp) <- paste0('Feature', seq_along(temp))
cbind(a, temp)
# X1 X2 X3 Feature1 Feature2 Feature3
#1 A B C some more what
#2 B C D more what why
#3 C D A what why some
#4 A A B some some more
#5 C B A what more some
#6 D A B why some more
In tidyverse, we can get the data in long format, join the data and get it back to wide format.
library(dplyr)
library(tidyr)
a %>%
mutate(row = row_number()) %>%
pivot_longer(cols = -row) %>%
left_join(b, by = c('value' = 'Xn')) %>%
select(-value) %>%
pivot_wider(names_from = name, values_from = Feature) %>%
select(-row) %>%
rename_all(~paste0('Feature', seq_along(.))) %>%
bind_cols(a, .)

This can be done by using mutate_all to recode all of the columns in a:
library(tidyverse)
a %>%
mutate_all(funs(feat=recode(., !!!set_names(as.character(b$Feature), b$Xn))))
X1 X2 X3 X1_feat X2_feat X3_feat
1 A B C some more what
2 B C D more what why
3 C D A what why some
4 A A B some some more
5 C B A what more some
6 D A B why some more
You can add a rename_at to get the desired names:
a %>%
mutate_all(funs(f=recode(., !!!set_names(as.character(b$Feature), b$Xn)))) %>%
rename_at(vars(matches("f")), ~gsub(".([0-9]).*", "Feature\\1", .))
X1 X2 X3 Feature1 Feature2 Feature3
1 A B C some more what
2 B C D more what why
3 C D A what why some
4 A A B some some more
5 C B A what more some
6 D A B why some more
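In current dplyr, funs() is deprecated and mutate_all() is superseded by across(). A rough equivalent, assuming dplyr 1.0.0 or later, could look like the sketch below; `lookup` and `res` are illustrative names, and a named-vector lookup stands in for recode.

```r
library(dplyr)

a <- data.frame(X1 = c("A", "B", "C", "A", "C", "D"),
                X2 = c("B", "C", "D", "A", "B", "A"),
                X3 = c("C", "D", "A", "B", "A", "B"))
b <- data.frame(Xn = c("A", "B", "C", "D"),
                Feature = c("some", "more", "what", "why"))

# Named vector: names are the keys (Xn), values are the features
lookup <- setNames(as.character(b$Feature), as.character(b$Xn))

# For every column, look each value up by name and store the result
# in a new column called <column>_feat
res <- a %>%
  mutate(across(everything(),
                ~ unname(lookup[as.character(.x)]),
                .names = "{.col}_feat"))

names(res)
# [1] "X1"      "X2"      "X3"      "X1_feat" "X2_feat" "X3_feat"
```

Values of a that have no entry in b come back as NA with this approach.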

Join data but ignore missing values

I am having some trouble with joining data frames with dplyr, where I would like to ignore the NAs.
The data that I have is quite big, but a simplified version looks like:
id <- c("id1", "id2", "id3", "id4")
A <- c("E", "F", "G", NA)
B <- c("T", NA, "N", "T")
C <- c(NA, "T", "U", NA)
df <- data.frame(id, A, B, C)
id A B C
1 id1 E T NA
2 id2 F NA T
3 id3 G N U
4 id4 NA T NA
I have an entry that I would like to match with df, which is e.g.:
df2 <- data.frame(A = "E", B = "T", C = "M")
A B C
1 E T M
As a result I would like to obtain all rows from df that match with df2, but the NAs should be ignored. So the result should look like this:
id A B C
1 id1 E T NA
2 id4 NA T NA
I was trying to do this with semi_join, but it did not work so far:
result <- df %>%
group_by(n = seq(n())) %>%
do(modify_if(., is.na, ~NULL) %>%
semi_join(df2, by = c("A", "B", "C"))) %>%
ungroup %>%
select(-n)
Which results in:
Error: `by` can't contain join column `C` which is missing from LHS
Call `rlang::last_error()` to see a backtrace
Who knows the answer?

Here's a solution with a mix of tidyverse and base R. I think this is pretty clear, but I'd be interested in a pure tidyverse implementation that isn't completely contrived.
The idea is to first expand all entries in df and df2 and then filter through all the columns using a loop.
The data:
id <- c("id1", "id2", "id3", "id4")
A <- c("E", "F", "G", NA)
B <- c("T", NA, "N", "T")
C <- c(NA, "T", "U", NA)
df <- data.frame(id, A, B, C, stringsAsFactors = FALSE) # Make sure to use strings not factors
df2 <- data.frame(A = "E", B = "T", C = "M", stringsAsFactors = FALSE)
Code:
library(tidyr)
results <- crossing(df, df2)
select_columns <- c("A", "B", "C")
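for (col in select_columns) {
  # keep rows where df's value is NA or equals the df2 value (suffixed with 1 after crossing)
  keep <- is.na(results[[col]]) | results[[col]] == results[[paste0(col, 1)]]
  results <- results[keep, , drop = FALSE]
}
results <- results %>% dplyr::select(id, A:C) %>% distinct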
results
id A B C
1 id1 E T <NA>
2 id4 <NA> T <NA>

If you only need to do this for a single set of values, this is probably the most straightforward approach. Including NA in each set makes %in% accept missing values as well:
df[df$A %in% c("E", NA) & df$B %in% c("T", NA) & df$C %in% c("M", NA), ]
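When the set of columns is not fixed in advance, the same condition can be built per column and combined with Reduce. This is a base R sketch using the question's data; `cols`, `keep` and `result` are illustrative names.

```r
id <- c("id1", "id2", "id3", "id4")
A <- c("E", "F", "G", NA)
B <- c("T", NA, "N", "T")
C <- c(NA, "T", "U", NA)
df <- data.frame(id, A, B, C, stringsAsFactors = FALSE)
df2 <- data.frame(A = "E", B = "T", C = "M", stringsAsFactors = FALSE)

cols <- c("A", "B", "C")

# One logical vector per column: TRUE where df is NA or equals df2's value.
# TRUE | NA evaluates to TRUE, so missing values always pass.
per_column <- lapply(cols, function(col) {
  is.na(df[[col]]) | df[[col]] == df2[[col]]
})

# A row is kept only if it passes in every column
keep <- Reduce(`&`, per_column)
result <- df[keep, ]
result$id
# [1] "id1" "id4"
```

This assumes df2 has a single row; matching against several candidate rows would need a loop or a join over them.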

Another example using dplyr, tidyr and base R:
Here I expand df2 into a data frame containing every combination of values you want to accept ((E or NA) & (T or NA) & (M or NA)) and then do an inner join against that full set. There are other ways to build all possible combinations, but this one is fairly easy with tidyr.
library(dplyr)
library(tidyr)
id <- c("id1", "id2", "id3", "id4")
A <- c("E", "F", "G", NA)
B <- c("T", NA, "N", "T")
C <- c(NA, "T", "U", NA)
df <- data.frame(A, B, C, stringsAsFactors = FALSE)
df2 <- data.frame(A = "E", B = "T", C = "M",stringsAsFactors = FALSE)
df2_expanded <- df2 %>%
rowwise() %>%
mutate(combinations = list(expand.grid(A = c(A,NA),B = c(B,NA),C = c(C,NA),stringsAsFactors = FALSE))) %>%
select(-A,-B,-C) %>%
unnest(combinations)
# A tibble: 8 x 3
# A B C
# <chr> <chr> <chr>
# 1 E T M
# 2 NA T M
# 3 E NA M
# 4 NA NA M
# 5 E T NA
# 6 NA T NA
# 7 E NA NA
# 8 NA NA NA
df %>%
inner_join(df2_expanded)
# A B C
# 1 E T <NA>
# 2 <NA> T <NA>

Loop to Replace Matching Values

I'm looking for an easy and elegant way to accomplish this.
So if I have dataset x and relationship is A -> B -> Z -> Y and D -> H -> G, I would like to create dataset y. Unfortunately, they are not necessarily in order:
> x <- data.frame(
+ from = as.character(c("A", "E", "B", "D", "H", "Z")),
+ to = as.character(c("B", "E", "Z", "H", "G", "Y")))
>
> y <- data.frame(
+ from = as.character(c("A", "E", "B", "D", "H", "Z")),
+ to = as.character(c("Y", "E", "Y", "G", "G", "Y")))
>
> x
from to
1 A B
2 E E
3 B Z
4 D H
5 H G
6 Z Y
> y
from to
1 A Y
2 E E
3 B Y
4 D G
5 H G
6 Z Y
I have a fairly large dataset (currently 500k rows; will grow in the future) and actually care about the performance; I'm not sure if there are any other ways to do this without a for-loop or even to vectorize/parallelize the process.
I'm thinking about splitting and removing all rows where from == to or creating an indicator to skip certain rows so the loop does not have to go through the entire dataset each time.
I'd also like to know what the breakpoint should be if I do create a loop; I'm not sure how to define when the loop should stop.
Any suggestions would be appreciated. Thanks!

We can use dplyr to create a grouping variable by comparing adjacent elements of 'to' and 'from', and then set every value of 'to' within a group to the last element of 'to' in that group:
library(dplyr)
x %>%
group_by(grp = cumsum(lag(lead(from, default = last(from)) !=
as.character(to), default = TRUE))) %>%
mutate(to = last(to)) %>%
ungroup %>%
select(-grp)
# A tibble: 4 x 2
# from to
# <fctr> <fctr>
#1 A D
#2 B D
#3 C D
#4 E E

Another solution can be achieved using lag from dplyr and fill from tidyr as:
library(tidyverse)
x %>% arrange(from) %>%
mutate(samegroup = ifelse(from == lag(to), 1, 0)) %>%
mutate(group = ifelse(samegroup == 0 | is.na(samegroup), row_number(), NA)) %>%
fill(group) %>%
group_by(group) %>%
mutate(to = last(to)) %>%
ungroup() %>%
select(-samegroup, - group)
# A tibble: 6 x 2
# from to
# <chr> <chr>
#1 A D
#2 B D
#3 C D
#4 E E
#5 F H
#6 G H
Data used
x <- data.frame(from = as.character(c("A", "B", "F", "C", "G", "E")),
to = as.character(c("B", "C", "G", "D", "H", "E")),
stringsAsFactors = FALSE)
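On the question of when a loop should stop: one option is to follow every chain one hop at a time and stop once a pass changes nothing. The base R sketch below does this for the question's own data, which is not sorted; `lookup`, `to` and `nxt` are illustrative names, and the iteration cap is an assumption added to guard against cycles such as A -> B -> A.

```r
x <- data.frame(from = c("A", "E", "B", "D", "H", "Z"),
                to   = c("B", "E", "Z", "H", "G", "Y"),
                stringsAsFactors = FALSE)

# Named vector: lookup["A"] gives the next hop after "A"
lookup <- setNames(x$to, x$from)

to <- x$to
# A chain cannot be longer than the number of rows, so cap the passes there
for (i in seq_len(nrow(x))) {
  nxt <- unname(lookup[to])      # next hop; NA if the value never appears in 'from'
  end <- is.na(nxt)
  nxt[end] <- to[end]            # chain ends here, keep the current value
  if (identical(nxt, to)) break  # nothing changed: every chain is resolved
  to <- nxt
}

y <- data.frame(from = x$from, to = to, stringsAsFactors = FALSE)
y$to
# [1] "Y" "E" "Y" "G" "G" "Y"
```

Each pass is vectorized over all rows, so the number of passes grows with the longest chain rather than with the number of rows.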

Use result of subquery in conditional mutate

I want to compute a new column based on the result of a subquery in the same dataframe. Minimal (non) working example:
library(plyr)
library(dplyr)
df <- data.frame(
VAR1 = c("A", "A", "B", "C"),
VAR2 = c("F", "G", "E", "D"),
VAR3 = c("G", "F", "X", "D")
) %>% as_tibble
subquery <- function(v1, v2) {
dplyr::filter(df, as.character(v1) == VAR1, as.character(v2) == VAR2)
}
TEST <-
df %>%
mutate(X = case_when(
plyr::empty(subquery(VAR1, VAR3)) ~ "EMPTY",
TRUE ~ "NON EMPTY"
))
The result dataframe TEST should be
VAR1 VAR2 VAR3 X
<fctr> <fctr> <fctr> <chr>
A F G NON EMPTY
A G F NON EMPTY
B E X EMPTY
C D D NON EMPTY
but is
VAR1 VAR2 VAR3 X
<fctr> <fctr> <fctr> <chr>
A F G NON EMPTY
A G F NON EMPTY
B E X NON EMPTY
C D D NON EMPTY
Many thanks in advance!
remark: If I don't coerce the v1 and v2 to character I get the following error:
Error in mutate_impl(.data, dots) :
Evaluation error: Evaluation error: level sets of factors are different..

I would put the empty function within the subquery function in order to return TRUE or FALSE values. Then it would be possible to vectorise it in order to apply it to every row of your dataframe:
library(plyr)
library(dplyr)
df <- data.frame(
VAR1 = c("A", "A", "B", "C"),
VAR2 = c("F", "G", "E", "D"),
VAR3 = c("G", "F", "X", "D")
) %>% as_tibble
subquery <- function(v1, v2) {
empty(filter(df, as.character(v1) == VAR1, as.character(v2) == VAR2))
}
subquery = Vectorize(subquery)
df %>%
mutate(X = case_when(
subquery(VAR1, VAR3) == FALSE ~ "NON EMPTY",
TRUE ~ "EMPTY"
))
# # A tibble: 4 x 4
# VAR1 VAR2 VAR3 X
# <fct> <fct> <fct> <chr>
# 1 A F G NON EMPTY
# 2 A G F NON EMPTY
# 3 B E X EMPTY
# 4 C D D NON EMPTY
Or you can put empty and case_when within the subquery function like this:
subquery <- function(v1, v2) {
res = empty(filter(df, as.character(v1) == VAR1, as.character(v2) == VAR2))
case_when(res == FALSE ~ "NON EMPTY",
TRUE ~ "EMPTY")
}
subquery = Vectorize(subquery)
df %>% mutate(X = subquery(VAR1, VAR3))
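Vectorize calls the function once per row, which can be slow on large data. Since the subquery only asks whether the pair (VAR1, VAR3) occurs anywhere as (VAR1, VAR2), a vectorized membership test may be enough. This base R sketch builds one key per pair; the names `key_existing` and `key_wanted` and the "|" separator are assumptions, and the separator must not occur in the data.

```r
df <- data.frame(
  VAR1 = c("A", "A", "B", "C"),
  VAR2 = c("F", "G", "E", "D"),
  VAR3 = c("G", "F", "X", "D"),
  stringsAsFactors = FALSE
)

# Pairs that exist in the data, and pairs we want to look up
key_existing <- paste(df$VAR1, df$VAR2, sep = "|")
key_wanted   <- paste(df$VAR1, df$VAR3, sep = "|")

# %in% checks all rows at once
df$X <- ifelse(key_wanted %in% key_existing, "NON EMPTY", "EMPTY")
df$X
# [1] "NON EMPTY" "NON EMPTY" "EMPTY"     "NON EMPTY"
```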
