Consider this R code and output:
> the_string <- "a, b, c"
> the_vec <- strsplit(the_string, ",")
> str(the_vec)
List of 1
$ : chr [1:3] "a" " b" " c"
> str(sub("^ +", "", the_vec))
chr "c(\"a\", \" b\", \" c\")"
Looks like sub returns a single character string instead of a character vector. I'm hoping for:
chr [1:3] "a" "b" "c"
How do I get that?
Edit: the_string will come from users, so I want to tolerate a variable number of spaces, zero to many.
Edit: the tokens may have spaces in the middle that should be preserved. So, "a, b c,d" should result in c('a', 'b c', 'd').
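One way to satisfy both edits at once is to put the whitespace handling into the split pattern itself, so spaces around the commas are consumed but spaces inside tokens survive; a minimal sketch:

```r
the_string <- "a, b c,d"

# Split on a comma with any amount of surrounding whitespace;
# spaces inside tokens are left untouched.
unlist(strsplit(the_string, "\\s*,\\s*"))
# [1] "a"   "b c" "d"
```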
the_string <- "a, b, c"
the_vec <- unlist(strsplit(the_string, ", "))
If you add the space after the comma and unlist the entire thing you get the vector.
Update:
If the string has a varying amount of space between the tokens, I would collapse the excess spaces first and then run the same as above. The pattern " {2,}" matches two or more spaces, so any amount of extra whitespace is handled. I also added a second step to split tokens that have a space rather than a comma between them.
a <- "a, b, c, d, e, f g, h,i"
a <- gsub(" {2,}", " ", a)
a <- unlist(strsplit(a, ", |,"))
unlist(strsplit(a, " "))
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i"
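If the tokens are always comma-separated, trimws() (in base R since 3.2.0) can replace the space handling entirely: split on commas only, then strip the surrounding whitespace from each piece. A sketch:

```r
a <- "a,   b,c,    d"

# Split on commas, then trim leading/trailing whitespace from each token
trimws(unlist(strsplit(a, ",")))
# [1] "a" "b" "c" "d"
```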
strsplit returns a list with one element per item of the input vector, each element being that item's split pieces, e.g.:
strsplit( c("a, b, c", "d, e"), ",")
[[1]]
[1] "a" " b" " c"
[[2]]
[1] "d" " e"
Here you only have one item in the input vector, so the result is all in the first item of the list:
the_string <- "a, b, c"
the_list <- strsplit(the_string, ",")
sub("^ +", "", the_list[[1]])
[1] "a" "b" "c"
If you don't use [[1]] or unlist, sub() coerces the whole list to a character vector via as.character, which is where the single collapsed string comes from:
as.character(the_list)
[1] "c(\"a\", \" b\", \" c\")"
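The unlist() route mentioned above works the same way, flattening the one-element list into a character vector before sub() is applied:

```r
the_string <- "a, b, c"
the_list <- strsplit(the_string, ",")

# Flatten the list first, then strip leading spaces from each element
sub("^ +", "", unlist(the_list))
# [1] "a" "b" "c"
```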
One base-R solution
lapply(the_vec, function(x) sub("^ +", "", x))[[1]]
[1] "a" "b" "c"
I have this string:
x <- c("A B B C")
[1] "A B B C"
I am looking for the shortest way to get this:
[1] "A B C"
I have tried this (from "Removing duplicate words in a string in R"):
paste(unique(x), collapse = ' ')
[1] "A B B C"
# does not work
Background:
In a dataframe column I want to count only the unique word counts.
A regex-based approach could be shorter: match the non-whitespace (\\S+) followed by a whitespace character (\\s), capture it, match one or more occurrences of the backreference, and in the replacement specify the backreference to return only a single copy of the match.
gsub("(\\S+\\s)\\1+", "\\1", x)
[1] "A B C"
Alternatively, split the string with strsplit, unlist, take the unique elements, and paste them back together:
paste(unique(unlist(strsplit(x, " "))), collapse = " ")
# [1] "A B C"
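Note that the backreference pattern only collapses consecutive repeats, while the strsplit/unique route removes duplicates anywhere in the string; a quick comparison:

```r
x2 <- "A B C B"

# Backreference approach: only adjacent repeats are collapsed
gsub("(\\S+\\s)\\1+", "\\1", x2)
# [1] "A B C B"

# unique() approach: duplicates removed regardless of position
paste(unique(unlist(strsplit(x2, " "))), collapse = " ")
# [1] "A B C"
```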
Another possible solution, based on stringr::str_split:
library(tidyverse)
str_split(x, " ") %>% unlist %>% unique
#> [1] "A" "B" "C"
In case the duplicates do not follow each other, here is another gsub approach, using a lookahead (perl = TRUE) to drop a word when the same word still occurs later in the string.
x <- c("A B B C")
gsub("\\b(\\S+)\\s+(?=.*\\b\\1\\b)", "", x, perl=TRUE)
#[1] "A B C"
gsub("\\b(\\S+)\\s+(?=.*\\b\\1\\b)", "", "A B B A ABBA", perl=TRUE)
#[1] "B A ABBA"
You can use:
gsub("\\b(\\w+)(?:\\W+\\1\\b)+", "\\1", x, perl = TRUE)
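Since the background was counting unique words per row of a dataframe column, the strsplit/unique approach extends directly; a sketch (the column name `txt` is just for illustration):

```r
# Hypothetical data frame; the column name `txt` is made up for this example
df <- data.frame(txt = c("A B B C", "D D", "E"), stringsAsFactors = FALSE)

# Count the unique words in each row
lengths(lapply(strsplit(df$txt, " "), unique))
# [1] 3 1 1
```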
I know I can easily write one, but does anyone know if stringr (or stringi) already has a function that concatenates a vector of one or more words separated by commas, but with an "and" before the last word?
You can use the knitr::combine_words function
knitr::combine_words(letters[1:2])
# [1] "a and b"
knitr::combine_words(letters[1:3])
# [1] "a, b, and c"
knitr::combine_words(letters[1:4])
# [1] "a, b, c, and d"
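Per knitr's documentation, the conjunction and the Oxford comma are both configurable through the `and` and `oxford_comma` arguments (the latter in recent knitr versions):

```r
# Use "or" instead of "and"
knitr::combine_words(letters[1:3], and = " or ")
# [1] "a, b, or c"

# Drop the Oxford comma
knitr::combine_words(letters[1:3], oxford_comma = FALSE)
# [1] "a, b and c"
```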
Here's another solution:
enum <- function(x)
paste(c(head(x,-2), paste(tail(x,2), collapse = ", and ")), collapse = ", ")
enum(letters[1])
#> [1] "a"
enum(letters[1:2])
#> [1] "a, and b"
enum(letters[1:3])
#> [1] "a, b, and c"
enum(letters[1:4])
#> [1] "a, b, c, and d"
Created on 2019-05-11 by the reprex package (v0.2.1)
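One quirk of this version is the two-element output "a, and b"; if plain "a and b" is preferred there, the short cases can be special-cased. A sketch:

```r
# Variant that joins one or two elements with a bare "and"
enum2 <- function(x) {
  if (length(x) < 3) return(paste(x, collapse = " and "))
  paste(c(head(x, -1), paste("and", tail(x, 1))), collapse = ", ")
}

enum2(letters[1:2])
# [1] "a and b"
enum2(letters[1:4])
# [1] "a, b, c, and d"
```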
I have a dataframe structured like this (39 rows):
text.
"A" OR "B" OR "C"
"C" OR "D" OR "E"
and a "black list" of words that I want to delete, each beginning and ending with the quote symbol " (about 200 words). Here is an example:
blackList
"A"
"D"
I want to remove them from the starting dataframe, obtaining:
text.
OR "B" OR "C"
"C" OR OR "E"
How can I do this? I tried removeWords, but it does not handle the " symbol.
We can create a pattern by pasting all the blacklisted items together with "|" as the collapse argument, and then remove all of them at once.
df$text <- gsub(paste0(blacklist$blackList, collapse = "|"), "", df$text)
df
# text
#1 OR "B" OR "C"
#2 "C" OR OR "E"
data
df <- data.frame(text = c('"A" OR "B" OR "C"','"C" OR "D" OR "E"'))
blacklist <- data.frame(blackList = c('"A"', '"D"'))
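If any blacklisted entry could contain regex metacharacters, matching them literally with fixed = TRUE sidesteps escaping issues entirely; a sketch that loops over the list instead of building one big pattern:

```r
df <- data.frame(text = c('"A" OR "B" OR "C"', '"C" OR "D" OR "E"'),
                 stringsAsFactors = FALSE)
blacklist_words <- c('"A"', '"D"')

# Remove each blacklisted word literally, with no regex interpretation
for (w in blacklist_words) df$text <- gsub(w, "", df$text, fixed = TRUE)
df$text
# [1] " OR \"B\" OR \"C\"" "\"C\" OR  OR \"E\""
```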
Escape the quotes with a backslash and use gsub:
gsub('\"A\"', "", '"A" OR "B" OR "C"')
I am trying to paste two vectors
vector_1 <- c("a", "b")
vector_2 <- c("x", "y")
paste(vector_1, vector_2, collapse = " + ")
The output I get is
"a + b x + y "
My desired output is
"a + b + x + y"
paste with more than one argument will paste the arguments together term by term.
> paste(c("a","b","c"),c("A","B","C"))
[1] "a A" "b B" "c C"
The result has the length of the longest vector, with the shorter one recycled. That enables things like this to work:
> paste("A",c("1","2","BBB"))
[1] "A 1" "A 2" "A BBB"
> paste(c("1","2","BBB"),"A")
[1] "1 A" "2 A" "BBB A"
Then sep is used within each element, and collapse joins the elements into a single string:
> paste(c("a","b","c"),c("A","B","C"))
[1] "a A" "b B" "c C"
> paste(c("a","b","c"),c("A","B","C"),sep="+")
[1] "a+A" "b+B" "c+C"
> paste(c("a","b","c"),c("A","B","C"),sep="+",collapse="#")
[1] "a+A#b+B#c+C"
Note that once you use collapse you get a single result rather than three.
You don't want to combine your two vectors element-wise; you need to turn them into one vector first, which you can do with c(), giving the solution:
> c(vector_1, vector_2)
[1] "a" "b" "x" "y"
> paste(c(vector_1, vector_2), collapse=" + ")
[1] "a + b + x + y"
Note that sep isn't needed: you are just collapsing the individual elements into one string.
This is more a general question on the behavior of lists in R, but the specific problem is:
I have a list of groups of words from which I'm trying to remove specific words manually; no word is mentioned twice.
Currently, I'm using this method
l = strsplit(c("a b", "c d"), " ")
> l
[[1]]
[1] "a" "b"
[[2]]
[1] "c" "d"
# remove the value "d"
l = lapply(l, function(x) { x[x != "d"] })
> l
[[1]]
[1] "a" "b"
[[2]]
[1] "c"
Is there any sort of built-in list indexing method that would be preferable to use? I feel like I should be able to process the list without using lapply. If not, could someone explain why that is the case?
Thanks
You need to go through each element of the list and check whether the vector contains "d" in order to filter it out.
One reason is that a list can contain various types of data (functions, data frames, numerics, characters, logicals, other lists, objects of any class), so there can be no vectorized operations, which are, as the name suggests, for vectors.
What you do is filter on the front end, i.e. once you already have the list. It can be preferable to filter on the back end, i.e. on the vector before building the list:
l = strsplit(gsub('d','',c("a b", "c d")), " ")
#[[1]]
#[1] "a" "b"
#[[2]]
#[1] "c"
An alternative for front-end filtering (note that the pattern '[^d]' keeps every element containing a character other than "d", which works here only because the tokens are single characters):
lapply(l, grep, pattern = '[^d]', value = TRUE)
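There is no single built-in index that reaches inside every vector of a list, but setdiff() gives a compact lapply one-liner; note it also drops duplicates within each vector, which is fine here since no word is mentioned twice. A sketch:

```r
l <- strsplit(c("a b", "c d"), " ")

# setdiff(x, y) keeps everything in x that is not in y
lapply(l, setdiff, y = "d")
# [[1]]
# [1] "a" "b"
#
# [[2]]
# [1] "c"
```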