Shortest way to remove duplicate words from string


I have this string:
x <- c("A B B C")
[1] "A B B C"
I am looking for the shortest way to get this:
[1] "A B C"
I have tried this, taken from the question "Removing duplicate words in a string in R":
paste(unique(x), collapse = ' ')
[1] "A B B C"
# does not work
Background:
In a data frame column I want to count only the unique words.

A regex-based approach could be shorter. Match one or more non-whitespace characters (\\S+) followed by a whitespace character (\\s), capture them as a group, and then match one or more repetitions of that group through the backreference (\\1+). In the replacement, the backreference returns a single copy of the match. Note that this only collapses duplicates that directly follow each other.
gsub("(\\S+\\s)\\1+", "\\1", x)
[1] "A B C"
Alternatively, split the string with strsplit, unlist the result, keep the unique elements and paste them back together:
paste(unique(unlist(strsplit(x, " "))), collapse = " ")
# [1] "A B C"
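
For the background use case (unique words per row of a data frame column), the strsplit/unique idea can be applied to a whole column at once. The following is a minimal sketch; the data frame and the column names text, dedup and n_unique are made up for illustration and are not from the question.

```r
# Hypothetical data; the column names are assumptions, not from the question.
df <- data.frame(text = c("A B B C", "x y x", "one"),
                 stringsAsFactors = FALSE)

# Split every string into words, then keep the unique words per row.
words <- lapply(strsplit(df$text, " ", fixed = TRUE), unique)

# Deduplicated string and number of unique words for each row.
df$dedup    <- vapply(words, paste, character(1), collapse = " ")
df$n_unique <- lengths(words)

df$dedup     # "A B C" "x y" "one"
df$n_unique  # 3 2 1
```

Using vapply with character(1) rather than sapply keeps the result a character vector even for an empty input.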

Another possible solution, based on stringr::str_split:
library(tidyverse)
str_split(x, " ") %>% unlist %>% unique
#> [1] "A" "B" "C"

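In case the duplicates do not directly follow each other, here is another gsub approach. The lookahead removes a word whenever the same word occurs again later in the string, so the last occurrence of each word is the one that is kept.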
x <- c("A B B C")
gsub("\\b(\\S+)\\s+(?=.*\\b\\1\\b)", "", x, perl=TRUE)
#[1] "A B C"
gsub("\\b(\\S+)\\s+(?=.*\\b\\1\\b)", "", "A B B A ABBA", perl=TRUE)
#[1] "B A ABBA"
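
One difference worth knowing: the lookahead pattern keeps the last occurrence of each word, while unique() keeps the first, so the word order of the result can differ. A small sketch comparing the two on a made-up input (the helper names keep_last and keep_first are hypothetical):

```r
# Lookahead approach: drop a word if the same word appears again later.
keep_last <- function(s) {
  gsub("\\b(\\S+)\\s+(?=.*\\b\\1\\b)", "", s, perl = TRUE)
}

# Split/unique approach: keep the first occurrence of every word.
keep_first <- function(s) {
  vapply(strsplit(s, " ", fixed = TRUE),
         function(w) paste(unique(w), collapse = " "),
         character(1))
}

s <- "B A B"
keep_last(s)   # "A B"
keep_first(s)  # "B A"
```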

You can use word boundaries to collapse adjacent repeats of a word:
gsub("\\b(\\w+)(?:\\W+\\1\\b)+", "\\1", x)
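
To see where the word-boundary pattern differs from a whitespace-based one such as (\\S+\\s)\\1+, here is a small sketch on made-up inputs. The helper names collapse_ws and collapse_wb are hypothetical, and perl = TRUE is passed to both calls as an assumption so that the same regex engine handles both patterns.

```r
# Whitespace-based pattern: a token plus its trailing whitespace, repeated.
collapse_ws <- function(s) gsub("(\\S+\\s)\\1+", "\\1", s, perl = TRUE)

# Word-boundary pattern: a whole word followed by repeats of itself.
collapse_wb <- function(s) {
  gsub("\\b(\\w+)(?:\\W+\\1\\b)+", "\\1", s, perl = TRUE)
}

x <- c("A B B C", "A B B", "AB B C")
collapse_ws(x)  # "A B C" "A B B" "AB C"
collapse_wb(x)  # "A B C" "A B"   "AB B C"
```

The whitespace-based version misses a duplicate at the end of the string (no trailing space to match) and can match inside a longer token ("AB B" becomes "AB"), which the \\b anchors avoid.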

Related

stringr function to concatenate vector of words separated by comma with "and" before last word

I know I can easily write one, but does anyone know if stringr (or stringi) already has a function that concatenates a vector of one or more words separated by commas, but with an "and" before the last word?

You can use the knitr::combine_words function
knitr::combine_words(letters[1:2])
# [1] "a and b"
knitr::combine_words(letters[1:3])
# [1] "a, b, and c"
knitr::combine_words(letters[1:4])
# [1] "a, b, c, and d"

Here's another solution:
enum <- function(x)
paste(c(head(x,-2), paste(tail(x,2), collapse = ", and ")), collapse = ", ")
enum(letters[1])
#> [1] "a"
enum(letters[1:2])
#> [1] "a, and b"
enum(letters[1:3])
#> [1] "a, b, and c"
enum(letters[1:4])
#> [1] "a, b, c, and d"
Created on 2019-05-11 by the reprex package (v0.2.1)
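
If the two-element case should read "a and b" rather than "a, and b", and knitr is not available, a base R variant could look like the sketch below. The function name join_words and its arguments are hypothetical, not part of stringr, stringi or knitr.

```r
# Hypothetical helper: join words with commas and "and" before the last one.
join_words <- function(x, sep = ", ", last = "and") {
  n <- length(x)
  # Zero or one element: nothing to join.
  if (n <= 1) return(paste(x, collapse = ""))
  # Two elements: no comma, just "a and b".
  if (n == 2) return(paste(x, collapse = paste0(" ", last, " ")))
  # Three or more: comma-separated, with "and" before the final element.
  paste0(paste(x[-n], collapse = sep), sep, last, " ", x[n])
}

join_words(letters[1])    # "a"
join_words(letters[1:2])  # "a and b"
join_words(letters[1:3])  # "a, b, and c"
```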

Selectively removing trailing string

I want to remove the last letter "O", except where it is part of the word "HELLO".
I've tried doing this:
a <- c("HELLO XO","DO HELLO","TWO XO","HO")
gsub("[^HELLO]O\\>","",a)
[1] "HELLO " " HELLO" "T " "HO"
but I want
"HELLO X" "D HELLO" "TW X" "H"

Try replacing using the following pattern:
\b(?!HELLO\b)(\w+)O\b
This asserts that the word is not HELLO, and then captures everything up to a final O. The match is replaced with the captured text, which drops the final O.
\b - from the start of the word
(?!HELLO\b) - assert that the word is not HELLO
(\w+)O - match a word ending in O, capturing everything except the final O
\b - end of word
The capture group, if a match happens, contains the entire word minus the final O.
Code:
a <- c("HELLO XO","DO HELLO","TWO XO","HO")
gsub("\\b(?!HELLO\\b)(\\w+)O\\b", "\\1", a, perl=TRUE)
[1] "HELLO X" "D HELLO" "TW X" "H"
Note that we must enable Perl mode (perl=TRUE) in gsub in order to use the negative lookahead.
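
The same pattern can be wrapped in a small function so that the protected word is a parameter. This is only a sketch: the name strip_final_o and the keep argument are hypothetical, and keep is assumed to contain no regex metacharacters.

```r
# Hypothetical wrapper: remove a word-final "O" except in the word `keep`.
strip_final_o <- function(x, keep = "HELLO") {
  # Build \b(?!KEEP\b)(\w+)O\b with the protected word spliced in.
  pattern <- paste0("\\b(?!", keep, "\\b)(\\w+)O\\b")
  gsub(pattern, "\\1", x, perl = TRUE)
}

a <- c("HELLO XO", "DO HELLO", "TWO XO", "HO")
strip_final_o(a)                               # "HELLO X" "D HELLO" "TW X" "H"
strip_final_o("JELLO HELLO", keep = "JELLO")   # "JELLO HELL"
```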

Use the regex alternation operator |:
a <- c("HELLO XO","DO HELLO","TWO XO","HO")
gsub("(HELLO)|O(?!\\S)", "\\1", a, perl=TRUE)
# [1] "HELLO X" "D HELLO" "TW X" "H"
The regex (HELLO)|O(?!\\S) does two things:
First, it captures every HELLO string, which the replacement \\1 puts back unchanged.
Second, it matches each remaining O that is not followed by a non-space character. For these matches the capture group is empty, so the O is removed.

Your regular expression is not correct. [^HELLO] means any single character except H, E, L and O. What you need is to exclude only an O that is preceded by exactly HELL at the start of a word. So you should use the following expression:
a <- c("HELLO XO","DO HELLO","TWO XO","HO")
gsub("(?<!\\bHELL)O\\b", "", a, perl=TRUE)
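
The lookahead, alternation and lookbehind patterns agree on the sample data but differ on edge cases. The sketch below runs all three on a few made-up inputs that are not from the question; the variable names are hypothetical.

```r
edge <- c("OTHELLO", "O", "HELLO")

# Negative lookahead: the whole word must not be HELLO, and at least one
# character must precede the final O.
res_lookahead <- gsub("\\b(?!HELLO\\b)(\\w+)O\\b", "\\1", edge, perl = TRUE)

# Alternation: any HELLO substring is preserved, even inside a longer word.
res_alternation <- gsub("(HELLO)|O(?!\\S)", "\\1", edge, perl = TRUE)

# Negative lookbehind: only an O directly after a word-initial HELL is kept.
res_lookbehind <- gsub("(?<!\\bHELL)O\\b", "", edge, perl = TRUE)

res_lookahead    # "OTHELL"  "O" "HELLO"
res_alternation  # "OTHELLO" ""  "HELLO"
res_lookbehind   # "OTHELL"  ""  "HELLO"
```

Which behaviour is right depends on whether a lone "O" and words that merely contain HELLO should be touched, which the question does not specify.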

a <- c("HELLO XO","DO HELLO","TWO XO","HO")
# remove every O (not only the trailing ones)
aa <- gsub("O","",a)
# then restore the O in HELLO
gsub("HELL", "HELLO",aa)

A little lengthy, but you can try it like this:
a <- c("HELLO XO","DO HELLO","TWO XO","HO")
b <- lapply(a, function(x) unlist(strsplit(x, " ")))
b
[[1]]
[1] "HELLO" "XO"
[[2]]
[1] "DO" "HELLO"
[[3]]
[1] "TWO" "XO"
[[4]]
[1] "HO"
c <- unlist(lapply(b, function(y) paste(ifelse( y == "HELLO", "HELLO", gsub("O", "", y)), collapse = " " )))
c
[1] "HELLO X" "D HELLO" "TW X" "H"

Extract string between spaces

I have this character vector:
df <- c("AA AAAA 1B","A BBB 1", "CC RR 1W3", "SS RGTYC 0")
[1] "AA AAAA 1B" "A BBB 1" "CC RR 1W3" "SS RGTYC 0"
and I want to extract what is between spaces.
Desired result:
[1] "AAAA" "BBB" "RR" "RGTYC"

df <- c("AA AAAA 1B","A BBB 1", "CC RR 1W3", "SS RGTYC 0")
lst <- strsplit(df," ")
sapply(lst, '[[', 2)
# [1] "AAAA" "BBB" "RR" "RGTYC"

Instead of splitting it first and then selecting the relevant split, you can also extract it straight away using the stringr-package:
library(stringr)
str_extract(df, "(?<=\\s)(.*)(?=\\s)")
# [1] "AAAA" "BBB" "RR" "RGTYC"
This solution uses a regular expression, and the pattern is built up like this:
(?<=\\s) checks that there is whitespace before the match
(?=\\s) checks that there is whitespace after the match
(.*) extracts everything in between; because .* is greedy, this is everything between the first and the last whitespace

Here is a gsub-based approach (base R). We match one or more non-whitespace characters at the start (^) of the string followed by one or more whitespace characters, or (|) one or more whitespace characters followed by non-whitespace characters at the end ($) of the string, and replace the match with an empty string ("").
gsub("^\\S+\\s+|\\s+\\S+$", "", df)
#[1] "AAAA" "BBB" "RR" "RGTYC"
There is also a convenient function, word, in stringr:
stringr::word(df, 2)
#[1] "AAAA" "BBB" "RR" "RGTYC"
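
The approaches agree when every string has exactly three tokens, but they answer slightly different questions when there are more. A base R sketch on a made-up input with four tokens (the variable names are hypothetical):

```r
x <- c("AA AAAA 1B", "A B C D")

# Second token only: split on spaces and take element 2 of each result.
second_token <- vapply(strsplit(x, " ", fixed = TRUE), `[`, character(1), 2)

# Everything between the first and the last token: trim both ends.
middle_part <- gsub("^\\S+\\s+|\\s+\\S+$", "", x)

second_token  # "AAAA" "B"
middle_part   # "AAAA" "B C"
```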

How to use sub on a vector in R?

Consider this R code and output:
> the_string <- "a, b, c"
> the_vec <- strsplit(the_string, ",")
> str(the_vec)
List of 1
$ : chr [1:3] "a" " b" " c"
> str(sub("^ +", "", the_vec))
chr "c(\"a\", \" b\", \" c\")"
Looks like sub returns a single character array instead of a vector of character arrays. I'm hoping for:
chr [1:3] "a" "b" "c"
How do I get that?
Edit: the_string will come from users, so I want to tolerate a variable number of spaces, zero to many.
Edit: the tokens may have spaces in the middle that should be preserved. So, "a, b c,d" should result in c('a', 'b c', 'd').

the_string <- "a, b, c"
the_vec <- unlist(strsplit(the_string, ", "))
Adding the space after the comma in the split pattern and unlisting the result gives you the vector.
Update:
If the string has a varying number of spaces between tokens, I would first collapse each run of excess spaces into a single space and then proceed as above. I chose an upper bound of 5 spaces, but your string may have more. I also added a second step to split tokens that are separated only by a space, without a comma.
a <- "a, b, c, d, e, f g, h,i"
a <- gsub("( {2,5})", " ",a)
a <- unlist(strsplit(a, ", |,"))
unlist(strsplit(a, " "))
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i"

strsplit creates a list where each element is a vector holding the pieces of the corresponding item in the original vector, e.g.:
strsplit( c("a, b, c", "d, e"), ",")
[[1]]
[1] "a" " b" " c"
[[2]]
[1] "d" " e"
Here you only have one item in the input vector, so the result is all in the first item of the list:
the_string <- "a, b, c"
the_list <- strsplit(the_string, ",")
sub("^ +", "", the_list[[1]])
[1] "a" "b" "c"
If you don't use [[1]] or unlist, the_list is coerced to a character vector using as.character:
as.character(the_list)
[1] "c(\"a\", \" b\", \" c\")"

One base R solution:
lapply(the_vec, function(x) sub("^ +", "", x))[[1]]
[1] "a" "b" "c"
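
For the requirements in the question's edits (any number of spaces around the commas, spaces inside a token preserved), one option is to split on the comma alone and trim each piece afterwards. A minimal sketch; the function name split_tokens is hypothetical, and trimws requires R 3.2.0 or later.

```r
# Hypothetical helper: split one string on commas and trim each token.
split_tokens <- function(s) {
  stopifnot(length(s) == 1)
  # strsplit returns a list; [[1]] extracts the character vector.
  pieces <- strsplit(s, ",", fixed = TRUE)[[1]]
  # trimws removes leading and trailing whitespace, not inner spaces.
  trimws(pieces)
}

split_tokens("a, b, c")     # "a" "b" "c"
split_tokens("a,   b c,d")  # "a" "b c" "d"
```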

Pasting two strings using paste function and its collapse argument

I am trying to paste two vectors
vector_1 <- c("a", "b")
vector_2 <- c("x", "y")
paste(vector_1, vector_2, collapse = " + ")
The output I get is
"a x + b y"
My desired output is
"a + b + x + y"

paste with more than one argument will paste together term-by-term.
> paste(c("a","b","c"),c("A","B","C"))
[1] "a A" "b B" "c C"
The result has the length of the longest vector, with shorter vectors recycled. That enables things like this to work:
> paste("A",c("1","2","BBB"))
[1] "A 1" "A 2" "A BBB"
> paste(c("1","2","BBB"),"A")
[1] "1 A" "2 A" "BBB A"
Then sep is used between the terms within each element, and collapse is used to join the elements:
> paste(c("a","b","c"),c("A","B","C"))
[1] "a A" "b B" "c C"
> paste(c("a","b","c"),c("A","B","C"),sep="+")
[1] "a+A" "b+B" "c+C"
> paste(c("a","b","c"),c("A","B","C"),sep="+",collapse="#")
[1] "a+A#b+B#c+C"
Note that once you use collapse you get a single result rather than three.
You seem to not want to combine your two vectors element-wise, so you need to turn them into one vector, which you can do with c(), giving us the solution:
> c(vector_1, vector_2)
[1] "a" "b" "x" "y"
> paste(c(vector_1, vector_2), collapse=" + ")
[1] "a + b + x + y"
Note that sep isn't needed - you are just collapsing the individual elements into one string.
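
Applied to the two vectors from the question, the four combinations look like this (a short sketch; the result variable names are made up for illustration):

```r
vector_1 <- c("a", "b")
vector_2 <- c("x", "y")

# Element-wise paste with the default sep = " ".
elementwise <- paste(vector_1, vector_2)                    # "a x" "b y"

# sep changes what goes between the terms of each element.
with_sep <- paste(vector_1, vector_2, sep = " + ")          # "a + x" "b + y"

# collapse joins the element-wise results into one string.
collapsed <- paste(vector_1, vector_2, collapse = " + ")    # "a x + b y"

# Concatenating first gives one vector, which collapse then joins.
combined <- paste(c(vector_1, vector_2), collapse = " + ")  # "a + b + x + y"
```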
