Modify data frame based on regex using gsub in R

I have been matching text strings between two vectors in a dataframe. Several values have exactly three characters and match up as part of another word in some other string. I would like to find the regular expression for this. Here is an example:
a <- c("urban", "crabtree", "rba", "rba hks","barbara", "lederbach")
b <- c("rba", "rba", "rba", "rba", "rba", "rba")
df <- data.frame(a, b)
I would like to substitute a blank space (i.e. "") for those values where "rba" only appears as part of the word. The desired output is:
b <- c("", "", "rba", "rba", "", "")
So it's sort of like:
grep("\\b...\\b", df$a, value = TRUE)
But I want to modify column b and insert "" wherever there is no match.
I'm aware that %in% can be used for exact matches, but I was hoping for something using gsub:
funb <- function(x) gsub("\\b...\\b", "", x)
df$b <- lapply(df$b, funb)
but I haven't had much luck. Clearly something is amiss; can someone help me get the desired result? Any advice or suggestions would be much appreciated. Thanks.

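Based on @David Arenburg's comment above, the general solution to this problem (using the stringi package) is:
library(stringi)
b[!stri_detect_regex(a, paste0("\\b", b, "\\b"))] <- ""
which blanks every element of b that does not occur as a whole word in the corresponding element of a, as desired.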
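The same idea can be sketched in base R, assuming stringi is not installed. The helper name has_word is hypothetical, and the sketch assumes the values in b contain no regex metacharacters.

```r
a <- c("urban", "crabtree", "rba", "rba hks", "barbara", "lederbach")
b <- c("rba", "rba", "rba", "rba", "rba", "rba")
df <- data.frame(a, b, stringsAsFactors = FALSE)

# has_word() is a hypothetical helper: for each (text, word) pair it returns
# TRUE when `word` occurs in `text` delimited by word boundaries.
has_word <- function(text, word) {
  unname(mapply(function(t, w) grepl(paste0("\\b", w, "\\b"), t), text, word))
}

# Blank out b wherever it only appears inside a longer word (or not at all).
df$b[!has_word(df$a, df$b)] <- ""
df$b
# df$b is now c("", "", "rba", "rba", "", "")
```

mapply is used because grepl accepts only a single pattern, while here each row has its own pattern.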

Related

How to search a vector of words for words containing two specific letters

So I've got a vector of 5 letter words and I want to be able to create a function that extracts the words that contain ALL of the letters in the pattern.
For example, if my vector is ("aback", "abase", "abate", "agate", "allay") and I'm looking for words that contain BOTH "a" and "b", I want the function to return ("aback", "abase", "abate"). I don't care what position or how many times these letters occur in the words, only that the words contain both of them.
I've tried to do this by creating a function that is meant to combine grepl's with an &. But the problem here is the grepl function doesn't accept vectors as the pattern. My plan was for this function to achieve grepl("a", word_vec) & grepl("b", word_vec). I also need this to be scalable so if I want to search for all words containing "a" AND "b" AND "c", for example.
grepl_cat <- function(str, words_vec) {
  pat <- str_split(str, "")
  first_let = TRUE
  for (i in 1:length(pat)) {
    if (first_let){
      result <- sapply(pat[i], grepl, x = word_vec)
      first_let <- FALSE
    }
    print(pat[i])
    result <- result & sapply(pat[i], grepl, x = word_vec)
  }
  return(result)
}
word_vec[grepl_cat("abc", word_vec)]
The function I've written above definitely isn't doing what it's intended to do.
I'm wondering if there an easier way to do this with regex patterns or there's a way to input each letter in the str into the grepl function as non-vectors.
A possible solution in base R, using lookaheads so that the letters may appear in any order:
s <- c("aback", "abase", "abate", "agate", "allay")
subset(s, grepl("(?=.*a)(?=.*b)", s, perl = TRUE))
#> [1] "aback" "abase" "abate"
Another possible solution, based on tidyverse:
library(tidyverse)
s <- c("aback", "abase", "abate", "agate", "allay")
s %>%
  data.frame(s = .) %>%
  filter(str_detect(s, "(?=.*a)(?=.*b)")) %>%
  pull(s)
#> [1] "aback" "abase" "abate"
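For a, b and c appearing in that order, a regex solution would be:
^.*a.*b.*c.*$
You may add more letters as needed.
An alternative regex approach uses lookaheads, which match the letters in any order (in base R this needs perl = TRUE):
^(?=.*a)(?=.*b)(?=.*c).*$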
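For the scalable version the question asks for, one option is to run grepl once per letter and combine the results with &. This is a minimal sketch; contains_all is a hypothetical helper name, and it assumes every character of the search string should be treated as a literal letter.

```r
# contains_all() is a hypothetical helper: TRUE for each word that contains
# every single character of `letters_str`, in any order and any position.
contains_all <- function(letters_str, words) {
  chars <- strsplit(letters_str, "")[[1]]
  # One logical vector per letter; fixed = TRUE treats the letter literally.
  hits <- lapply(chars, grepl, x = words, fixed = TRUE)
  # Combine the per-letter results with a logical AND.
  Reduce(`&`, hits, rep(TRUE, length(words)))
}

word_vec <- c("aback", "abase", "abate", "agate", "allay")
word_vec[contains_all("ab", word_vec)]   # "aback" "abase" "abate"
word_vec[contains_all("abc", word_vec)]  # "aback"
```

Because each letter is tested separately, adding a letter to the search string needs no change to the pattern.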

Replacing a string that contains a certain word with that word

I have a column in my df with many different strings; for instance, a string might say "crossed at point a" or "came in through point a". I want to replace that entire string with just "a". How could I go about doing this?
Following a comment to the question by user Allan Cameron, here is a full solution with the suggestion I made.
df1 <- data.frame(col = c("crossed at point a",
"doesn't match though it has as",
"came in through point a",
"no"))
df1$col[grepl("\\ba\\b", df1$col)] <- "a"
df1
#                             col
#1                             a
#2 doesn't match though it has as
#3                             a
#4                            no
Edit
Following another comment by Allan Cameron I have decided to write a small function to make it easier to replace a string that contains a word by that word.
replaceWord <- function(x, word){
  pattern <- paste0("\\b", word, "\\b")
  i <- grep(pattern, x)
  x[i] <- word
  x
}
replaceWord(df1$col, "a")
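If several words need the same treatment, the function can be applied once per word. The sketch below is a hypothetical extension (replaceWords and the sample strings are illustrative, not from the original answer), and it assumes the words contain no regex metacharacters.

```r
# replaceWords() is a hypothetical extension of replaceWord(): every string
# that contains one of `words` as a whole word is replaced by that word.
replaceWords <- function(x, words) {
  for (word in words) {
    pattern <- paste0("\\b", word, "\\b")
    x[grepl(pattern, x)] <- word
  }
  x
}

col <- c("crossed at point a",
         "left via point b",
         "doesn't match though it has as",
         "no")
replaceWords(col, c("a", "b"))
# returns c("a", "b", "doesn't match though it has as", "no")
```

Words are processed in the order given, so a string containing more than one of them ends up as the last word that matched.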

Resolving a formatter string

Suppose I have the following:
format.string <- "#AB#-#BC#/#DF#" #wanted to use $ but it is problematic
value.list <- c(AB="a", BC="bcd", DF="def")
I would like to apply the value.list to the format.string so that each named value is substituted. So in this example I should end up with the string: a-bcd/def
I tried to do it like the following:
resolved.string <- lapply(names(value.list),
                          function(x) {
                            sub(x = save.data.path.pattern,
                                pattern = paste0(c("#",x,"#"), collapse=""),
                                replacement = value.list[x]) })
But it doesn't seem to be working correctly. Where am I going wrong?
The glue package is designed for this. You can change the opening and closing delimiters using .open and .close, but they have to be different. Also note that value.list has to be either a list or a dataframe:
library(glue)
format.string <- "{AB}-{BC}/{DF}"
value.list <- list(AB="a", BC="bcd", DF="def")
glue_data(value.list, format.string)
# a-bcd/def
To answer your actual question, by using lapply over names(value.list) you, as your output shows, take each of the elements of value.list and perform the replacement. However, all this happens independently, i.e., the replacements aren't ultimately combined to a single result.
To make something very similar to your approach work, we can use Reduce, which does exactly this combining:
Reduce(function(x, y) sub(paste0(c("#", y, "#"), collapse = ""), value.list[y], x),
init = format.string, names(value.list))
# [1] "a-bcd/def"
If we call the anonymous function f, then the result is
f(f(f(format.string, "AB"), "BC"), "DF")
exactly as you intended, I believe.
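To see the nested calls one step at a time, Reduce can be asked to keep its intermediate results. This is only an illustrative sketch; the names f and steps are chosen here for the example.

```r
format.string <- "#AB#-#BC#/#DF#"
value.list <- c(AB = "a", BC = "bcd", DF = "def")

# f() fills in one placeholder: x is the string so far, y is the key to replace.
f <- function(x, y) sub(paste0("#", y, "#"), value.list[[y]], x)

# accumulate = TRUE keeps every intermediate string instead of only the last.
steps <- Reduce(f, names(value.list), format.string, accumulate = TRUE)
steps
# steps holds four strings: the original format string, then the string
# after each key is substituted, ending with "a-bcd/def"
```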
We can use gsubfn, which accepts a list of key/value pairs as the replacement and swaps each matched key for its value:
library(gsubfn)
gsub("#", "", gsubfn("[^#]+", as.list(value.list), format.string))
#[1] "a-bcd/def"
NOTE: 'value.list' is a named vector and not a list, hence the as.list() call.
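A dependency-free alternative is a plain loop over the names. This is a minimal sketch; resolve_format is a hypothetical helper name, and fixed = TRUE is used so that the #KEY# placeholder is matched as literal text rather than as a regex.

```r
format.string <- "#AB#-#BC#/#DF#"
value.list <- c(AB = "a", BC = "bcd", DF = "def")

# resolve_format() is a hypothetical helper: replace every #KEY# placeholder
# in `fmt` with the value stored under KEY in the named vector `values`.
resolve_format <- function(fmt, values) {
  for (key in names(values)) {
    fmt <- gsub(paste0("#", key, "#"), values[[key]], fmt, fixed = TRUE)
  }
  fmt
}

resolve_format(format.string, value.list)
# returns "a-bcd/def"
```

Using gsub rather than sub means a placeholder that appears more than once is replaced everywhere.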

How to use regex over entire dataframe in R

new user to R so please go easy on me.
I have dataframe like:
df = data.frame(Mineral = c("Zfeldspar", "Zgranite", "ZSilica"),
                Confidence = c("ZLow", "High", "Med"),
                Coverage = c("sub", "sub", "super"),
                Aspect = c("ZPos", "ZUnd", "Neg"))
actual file is much larger and outputted from old hardware. For some reason some entries have "Z" put in front of them. How do I remove from entire dataset?
I tried df = gsub("Z", " ", df) but it just gives me nonsense. This darn thing!
[1] "1:3" "c(3, 1, 2)" "c(1, 1, 2)" "c(2, 3, 1)"
Looked on here at stackoverflow and tried stringr package but could also not get to work. Anyone know what to do?
Your approach with gsub() is not working because that function operates on vectors, and not dataframes. However, you can apply gsub() over each column of your dataframe to get what you want:
df[] <- lapply(df, function (x) {gsub("Z", "", x)})
For a stringr solution (that also uses dplyr), try:
library(tidyverse)
df <- mutate(df,
             across(everything(), ~ str_replace_all(.x, "Z", "")))
P.S. I recommend using df <- instead of df = in the future. Good luck!
EDIT: corrected typo - thanks @thelatemail
You may use a simple ^Z regex in the following way:
df = data.frame(Mineral = c("Zfeldspar", "Zgranite", "ZSilica"),
                Confidence = c("ZLow", "High", "Med"),
                Coverage = c("sub", "sub", "super"),
                Aspect = c("ZPos", "ZUnd", "Neg"))
df[] <- lapply(df, sub, pattern = '^Z', replacement ="")
> df
   Mineral Confidence Coverage Aspect
1 feldspar        Low      sub    Pos
2  granite       High      sub    Und
3   Silica        Med    super    Neg
The ^Z pattern matches the start of the string with the ^ anchor, and then Z is matched and removed using sub (as there is only one possible match in each string, there is no point in using gsub).
You are close. If you want to go with base gsub
data$Mineral = gsub("Z", "", data$Mineral)
You can do this for all columns. Or use a combination of apply strategies (see other answers!)
PS. Naming your data data is not a good idea. At least do my_data
You could do:
as.data.frame(sapply(data, function(x) {gsub("Z", "", x)}))
You asked how to do it with the stringr (or stringi) package without getting the unwanted vector of indices:
> as.data.frame(apply(df, 2,
    function(col) stringr::str_replace_all(col, '^Z', '')))
> as.data.frame(apply(df, 2,
    function(col) stringi::stri_replace_first_regex(col, '^Z', '')))
   Mineral Confidence Coverage Aspect
1 feldspar        Low      sub    Pos
2  granite       High      sub    Und
3   Silica        Med    super    Neg
(The as.data.frame() call is needed to turn the matrix returned by apply() back into a data frame; see "R: apply-like function that returns a data frame?")
As for figuring out exactly how to call a str*_replace function over an entire dataframe, I tried:
the entire df: stri_replace_first_regex(df, '^Z', '')
by rows: stri_replace_first_regex(df[1,], '^Z', '')
by columns: stri_replace_first_regex(df[,1], '^Z', '')
Only the last one works properly. Admittedly this is a design flaw in str*_replace: they should at minimum recognize an invalid object and produce a useful error message instead of spewing out indices.
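When the data frame mixes text and numbers, it may be safer to touch only the character columns. The following is a sketch under that assumption; the Depth column is hypothetical and added only to show a non-character column passing through unchanged.

```r
df <- data.frame(Mineral = c("Zfeldspar", "Zgranite", "ZSilica"),
                 Confidence = c("ZLow", "High", "Med"),
                 Depth = c(10, 20, 30),  # hypothetical numeric column
                 stringsAsFactors = FALSE)

# Find the character columns, then strip a leading "Z" from those only.
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], sub, pattern = "^Z", replacement = "")
df
```

Applying sub to every column would convert numeric columns to character, which this selection avoids.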

R: Remove consecutive duplicates from comma separated string

I'm having issues removing just the right amount of information from the following data:
18,14,17,2,9,8
17,17,17,14
18,14,17,2,1,1,1,1,9,8,1,1,1
I'm applying !duplicated in order to remove the duplicates.
SplitFunction <- function(x) {
  b <- unlist(strsplit(x, '[,]'))
  c <- b[!duplicated(b)]
  return(paste(c, collapse=","))
}
I'm having issues removing only consecutive duplicates. The result below is what I'm getting.
18,14,17,2,9,8
17,14
18,14,17,2,1,9,8
The data below is what I want to obtain.
18,14,17,2,9,8
17,14
18,14,17,2,1,9,8,1
Can you suggest a way to perform this? Ideally a vectorized approach...
Thanks,
Miguel
You can use the rle function to solve this problem. It returns a list with one vector of values per input string:
xx <- c("18,14,17,2,9,8","17,17,17,14","18,14,17,2,1,1,1,1,9,8,1,1,1")
zz <- strsplit(xx,",")
sapply(zz,function(x) rle(x)$values)
You can also refer to this related question:
How to remove/collapse consecutive duplicate values in sequence in R?
We can use rle
sapply(strsplit(x, ','), function(x) paste(inverse.rle(within.list(rle(x),
lengths <- rep(1, length(lengths)))), collapse=","))
#[1] "18,14,17,2,9,8" "17,14" "18,14,17,2,1,9,8,1"
data
x <- c('18,14,17,2,9,8', '17,17,17,14', '18,14,17,2,1,1,1,1,9,8,1,1,1')
Great rle-answers. This is just to add an alternative without rle. It gives a list of character vectors (strsplit returns strings), but can of course easily be expanded to return comma-separated strings:
numbers <- c("18,14,17,2,9,8", "17,17,17,14", "14,17,18,2,9,8,1", "18,14,17,11,8,9,8,8,22,13,6", "14,17,2,9,8", "18,14,17,2,1,1,1,1,1,1,1,1,9,8,1,1,1,1")
result <- sapply(strsplit(numbers, ","), function(x) x[x!=c(x[-1],Inf)])
print(result)
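To get comma-separated strings back, as the question asks, the run values can be pasted together again. This is a minimal sketch; collapse_runs is a hypothetical helper name.

```r
x <- c("18,14,17,2,9,8", "17,17,17,14", "18,14,17,2,1,1,1,1,9,8,1,1,1")

# collapse_runs() is a hypothetical helper: keep one value per run of
# consecutive duplicates, then paste the values back into one string.
collapse_runs <- function(strings) {
  vapply(strsplit(strings, ",", fixed = TRUE),
         function(v) paste(rle(v)$values, collapse = ","),
         character(1))
}

collapse_runs(x)
# returns "18,14,17,2,9,8", "17,14", "18,14,17,2,1,9,8,1"
```

vapply is used instead of sapply so the result is always a character vector with one element per input string.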
