Replace strings in variable using lookup vector - r

I have a dataframe df with a character variable and the fromvec and tovec.
df <- tibble(var = c("A", "B", "C", "a", "E", "D", "b"))
fromvec <- c("A", "B", "C")
tovec <- c("X", "Y", "Z")
Use strings in fromvec, check them in df and then replace them with the corresponding strings in tovec so that "A" in df gets replaced with "X", "B" with "Y" and so on to get the desired_df.
desired_df <- tibble(var = c("X", "Y", "Z", "X", "E", "D", "Y"))
I tried following, but not getting the desired result!
from_vec <- paste(fromvec, collapse="|")
to_vec <- paste(tovec, collapse="|")
undesired_df <- df %>%
mutate(var = str_replace(str_to_upper(var), from_vec, to_vec))
i.e. this
tibble(var = c("X|Y|Z", "X|Y|Z", "X|Y|Z", "X|Y|Z", "E", "D", "X|Y|Z"))
How can I get the desired_df?

You could use chartr :
df$var <- chartr(paste(fromvec,collapse=""),
paste(tovec,collapse=""),
toupper(df$var))
# # A tibble: 7 x 1
# var
# <chr>
# 1 X
# 2 Y
# 3 Z
# 4 X
# 5 E
# 6 D
# 7 Y
Or we can use recode
library(dplyr)
df$var <- recode(toupper(df$var), !!!setNames(tovec,fromvec))
If you really want to use str_replace you could do:
library(purrr)
library(stringr)
df$var <- reduce2(fromvec, tovec, str_replace, .init=toupper(df$var))

The correct way to do this with stringr is with str_replace_all:
mutate(df,str_replace_all(str_to_upper(var),setNames(tovec, fromvec)))
(thanks, #Moody_Mudskipper!)

We can use base R
with(df, ifelse(toupper(var) %in% fromvec,
setNames(tovec, fromvec)[toupper(var)], var))
#[1] "X" "Y" "Z" "X" "E" "D" "Y"
which can be also written in two lines by creating a logical condition
i1 <- toupper(df$var) %in% fromvec
df$var[i1] <- setNames(tovec, fromvec)[toupper(df$var)[i1]]
Or using data.table
library(data.table)
setDT(df)[toupper(var) %in% fromvec, var := setNames(tovec, fromvec)[toupper(var)]]

It's not clear the result should be case insensitive.
In my opinion, replacement (update) operations that involve an indeterminate number of changes are best accomplished using JOINs. In this case, it also cements a good practice of tracking your changes in a separate dataframe.
Unfortunately, the tidyverse has no "update dataframe" function....a glaring omission. That means tidyverse-ers must use a work-around, coalesce.
#JOIN Operation
tibble(fromvec, tovec) %>% #< dataframe of changes
right_join(df, by = c("fromvec" = "var")) %>% #< join operation
transmute(var = coalesce(tovec, fromvec)) #< coalesce work-around
# A tibble: 7 x 1
var
<chr>
1 X
2 Y
3 Z
4 a
5 E
6 D
7 b
If a case insensitive operation is preferred, consider inserting str_to_upper in the pipeline:
tibble(fromvec, tovec) %>%
right_join(df %>% mutate(var = (str_to_upper(var))), #<modify case
by = c("fromvec" = "var")) %>%
transmute(var = coalesce(tovec, fromvec))
# A tibble: 7 x 1
var
<chr>
1 X
2 Y
3 Z
4 X
5 E
6 D
7 Y

Related

Trying to sort character variable into new variable with new value based on conditions

I want to sort a character variable into two categories in a new variable based on conditions, in conditions are not met i want it to return "other".
If variable x cointains 4 character values "A", "B", "C" & "D" I want to sort them into a 2 categories, 1 and 0, in a new variable y, creating a dummy variable
Ideally I want it to look like this
df <- data.frame(x = c("A", "B", "C" & "D")
y <- if x == "A" | "D" then assign 1 in y
if x == "B" | "C" then assign 0 in y
if x == other then assign NA in y
x y
1 "A" 1
2 "B" 0
3 "C" 0
4 "D" 1
library(dplyr)
df <- df %>% mutate ( y =case_when(
(x %in% df == "A" | "D") ~ 1 ,
(x %in% df == "B" | "C") ~ 1,
x %in% df == ~ NA
))
I got this error message
Error: replacement has 3 rows, data has 2
Here's the proper case_when syntax.
df <- data.frame(x = c("A", "B", "C", "D"))
library(dplyr)
df <- df %>%
mutate(y = case_when(x %in% c("A", "D") ~ 1,
x %in% c("B", "C") ~ 0,
TRUE ~ NA_real_))
df
#> x y
#> 1 A 1
#> 2 B 0
#> 3 C 0
#> 4 D 1
You're combining syntaxes in a way that makes sense in speech but not in code.
Generally you can't use foo == "G" | "H". You need to use foo == "G" | foo == "H", or the handy shorthand foo %in% c("G", "H").
Similarly x %in% df == "A" doesn't make sense x %in% df makes sense. df == "A" makes sense. Putting them together x %in% df == ... does not make sense to R. (Okay, it does make sense to R, but not the same sense it does to you. R will use its Order of Operations which evaluates x %in% df first and gets a result from that, and then checks whether that result == "A", which is not what you want.)
Inside a dplyr function like mutate, you don't need to keep specifying df. You pipe in df and now you just need to use the column x. x %in% df looks like you're testing whether the column x is in the data frame df, which you don't need to do. Instead use x %in% c("A", "D"). Aron's answer shows the full correct syntax, I hope this answer helps you understand why.

Compare two vectors within a data frame with %in% with R

Compare two vectors within a data frame with %in%
I have the following data
T1 <- data.frame( "Col1" = c("a", "b", "aa", "d"), "Col2" = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g") )
Col1
Col2
a
a,b,c
b
aa,c,d
aa
c,d,e
d
d,f,g
I want to select the rows that contain a character from this vector c("a", "e", "g"), specifying the columna
library(dplyr)
T1 %>% filter(Col1 %in% c("a", "e", "g"))
I returned
1 a a,b,c
It is correct, but if I want to compare two vectors, example:
With unlist and strsplit, I transform the value of each row to a character vector and try to compare it with the reference vector to select the rows that contain any of the values:
unlist(strsplit(T1$Col2[1],","))
[1] "a" "b" "c"
T1 %>% filter(unlist(strsplit(Col2,",")) %in% c("a", "e", "g"))
It gives me an error:
Error in filter():
! Problem while computing ..1 = unlist(strsplit(Col2, ",")) %in% c("a", "e", "g").
✖ Input ..1 must be of size 4 or 1, not size 12.
Run ]8;;rstudio:run:rlang::last_error()rlang::last_error() ]8;; to see where the error occurred.
I can do it like this:
T1[grep(c("a|e|g"), T1$Col2),]
1 a a,b,c
2 b aa,c,d
3 aa c,d,e
4 d d,f,g
But it's wrong, row 3 aa c,d,e, shouldn't be, because it's not a, it's aa
To search for the "a" alone, you would have to do:
T1[grep(c("\\<a\\>"), T1$Col2),]
I think that with this form I will end up making a mistake, it would give me more security to be able to do it comparing vector with vector:
T1 %>% filter(unlist(strsplit(Col2,",")) %in% c("a", "e", "g"))
Edited answer
You can use the syntax \\b for regular expressions word boundary. The | is for boundaries adjacent to like an or operation. You can use the following code:
T1 <- data.frame( "Col1" = c("a", "b", "aa", "d"), "Col2" = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g") )
library(dplyr)
library(stringr)
T1 %>%
filter(grepl("\\b(a|e|g)\\b", Col2))
#> Col1 Col2
#> 1 a a,b,c
#> 2 aa c,d,e
#> 3 d d,f,g
Created on 2022-07-16 by the reprex package (v2.0.1)
Note: \\b is for R version 4.1+ otherwise use \b.
old answer
It returns all rows back because you check if one of the strings exists in Col2 and you can see that in row 3, "e" exists which is one of the strings and that's why it returns also row 4. You could also use str_detect like this:
library(dplyr)
library(stringr)
T1 <- data.frame( "Col1" = c("a", "b", "aa", "d"), "Col2" = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g") )
vector <- c("a", "e", "g")
T1 %>%
filter(any(str_detect(Col2, paste0(vector, collapse="|"))))
#> Col1 Col2
#> 1 a a,b,c
#> 2 b aa,c,d
#> 3 aa c,d,e
#> 4 d d,f,g
Created on 2022-07-16 by the reprex package (v2.0.1)
If you want to check if the strings exists, one of them, in both columns. You can use the following code:
library(dplyr)
library(stringr)
T1 <- data.frame( "Col1" = c("a", "b", "aa", "d"), "Col2" = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g") )
vector <- c("a", "e", "g")
T1 %>%
filter(Reduce(`|`, across(all_of(colnames(T1)), ~str_detect(paste0(vector, collapse="|"), .x))))
#> Col1 Col2
#> 1 a a,b,c
Created on 2022-07-16 by the reprex package (v2.0.1)
Another way you could achieve this (using your original approach with strsplit) is to do it rowwise() and 'sum' the logical test.
T1 %>%
rowwise() %>%
filter(sum(unlist(strsplit(Col2,",")) %in% c("a","e","g")) >= 1)

Select for every row between two columns based on condition in another column in R

may someone help me to find the answer thread or provide a method for solution? I can not find a solution.
What I want to do:
For every row if the value in column "x" is "A" then select the value in column "y" from the same row and if the value in column "x" is "B" then select the value in column "z" from the same row.
Ideally collected in a vector to include as a new column in the df afterwards.
df <- data.frame(x = c("A", "B", "B", "A"), y = c(1,2,3,4), z = c(4,3,2,1), fix.empty.names = FALSE)
df
x y z
1 A 1 4
2 B 2 3
3 B 3 2
4 A 4 1
result
[1] 1 3 2 4
Thank you very much in advance
If we can assume x is always "A" or "B":
ifelse(df$x == "A", df$y, df$z)
More generally:
ifelse(df$x == "A", df$y, ifelse(df$x == "B", df$z, NA))
You can, of course, assign this directly as a new column: df$result <- ifelse...
If you like dplyr:
library(dplyr)
df %>%
mutate(
result = case_when(
x == "A" ~ y,
x == "B" ~ z,
TRUE ~ NA_real_
)
)

Count string followed by separate string occurrences in r

I am trying to count the number of occurrences of a string followed by another string in r. I cannot seem to get the regex worked out to count this correctly.
As an example:
v <- c("F", "F", "C", "F", "F", "C", "F", "F")
b <- str_count(v, "F(?=C)")
I would like b to tell me how many times string F was followed by string C in vector v (which should equal 2).
I have successfully implemented str_count() to count single strings, but I cannot figure out how to count string followed by a different string.
Also, I found that in regex (?=...) should indicated "followed by" however this does not seem to be sufficient.
You don't have one string. You have individual strings. Her you can test if F is followed by C by shifting using [ for subsetting.
sum(v[-length(v)] == "F" & v[-1] == "C")
#sum(v == "F" & c(v[-1] == "C", FALSE)) #Alternative
#[1] 2
To use stringr::str_count you can paste v to one string.
stringr::str_count(paste(v, collapse = ""), "F(?=C)")
#[1] 2
And for rows of a data.frame:
set.seed(42)
v <- as.data.frame(matrix(sample(c("F", "C"), 25, TRUE), 5))
stringr::str_count(apply(v, 1, paste, collapse = ""), "F(?=C)")
#[1] 1 1 2 1 1
You can use lag() from dplyr:
library(dplyr)
sum(v == "C" & lag(v) == "F", na.rm = TRUE)
(The na.rm = TRUE is because the first value of lag(v) is NA).
Your comment notes that you're also interested in applying this across each row of a data frame. This can be done by pivoting the data to be longer, then applying a grouped mutate, then pivoting the data to be wider again. On an example dataset:
example <- tibble(id = 1:3,
s1 = c("F", "F", "F"),
s2 = c("C", "F", "C"),
s3 = c("C", "C", "F"),
s4 = c("F", "C", "C"))
example %>%
pivot_longer(s1:s4) %>%
group_by(id) %>%
mutate(fc_count = sum(value == "C" & lag(value) == "F", na.rm = TRUE)) %>%
ungroup() %>%
pivot_wider(names_from = name, values_from = value)
Result:
# A tibble: 3 x 6
id fc_count s1 s2 s3 s4
<int> <int> <chr> <chr> <chr> <chr>
1 1 1 F C C F
2 2 1 F F C C
3 3 2 F C F C
Note that this assumed the data had something like an id column that uniquely identifies each original row. If it doesn't, you can add one with mutate(id = row_number()) first.

How can I make a data.frame using chr vertor [duplicate]

This question already has answers here:
aggregating unique values in columns to single dataframe "cell" [duplicate]
(2 answers)
Closed 2 years ago.
I have the data.frame below.
> Chr Chr
> A E
> A F
> A E
> B G
> B G
> C H
> C I
> D E
and... I want to convert the dataset as belows as you may be noticed.
I want to coerce all chr vectors into an row.
chr chr
A E,F
B G
C H,I
D E
they are all characters, so I tried to do several things so that I want to make.
Firstly, I used unique function for FILTER <- unique(chr[,15])1st column and try to subset them using
FILTER data that I created using rbind or bind rows function.
Secondly, I tested to check whether my idea works or not
FILTER <- unique(Top[,15])
NN <- data.frame()
for(i in 1 :nrow(FILTER)){
result = unique(Top10Data[TGT == FILTER[i]]$`NM`))
print(result)
}
to this stage, it seems to be working well.
The problem for me is that when I used both functions, the data frame only creates 1 column and ignored the others vector (2nd variables from above data.frame) all.
Only For the chr [1,1], those functions do work well, but I have chr vectors such as chr[1,n], which is unable to be coerced.
here's my code for your reference.
FILTER <- unique(Top[,15])
NN <- data.frame()
for(i in 1 :nrow(FILTER)){
CGONM <- rbind(NN,unique(Top10Data[TGT == FILTER[i]]$`NM`))
}
Base R solutions:
# Solution 1:
df_str_agg1 <- aggregate(var2~var1, df, FUN = function(x){
paste0(unique(x), collapse = ",")})
# Solution 2:
df_str_agg2 <- data.frame(do.call("rbind",lapply(split(df, df$var1), function(x){
data.frame(var1 = unique(x$var1),
var2 = paste0(unique(x$var2), collapse = ","))
}
)
),
row.names = NULL
)
Tidyverse solution:
library(tidyverse)
df_str_agg3 <-
df %>%
group_by(var1) %>%
summarise(var2 = str_c(unique(var2), collapse = ",")) %>%
ungroup()
Data:
df <- data.frame(var1 = c("A", "A", "A", "B", "B", "C", "C", "D"),
var2 = c("E", "F", "E", "G", "G", "H", "I", "E"), stringsAsFactors = FALSE)

Resources