Is there a way to "undo" a paste - r

I am looking for a way to split up a string, but instead of splitting by an underscore or specific word, I would want to split from a series of words - and also not have that word deleted. For example,
a <- c("Hello", "Joe", "Simpsons", "Oh_No", "Hiya_Hi", "oh")
b <- c("sum", "sum_one")
x <- paste(a, b, sep = "_")
I then would like a way to separate x into a and b.

This is a bit tricky because the 4th and 5th values contain the very character you are using to paste the strings. The strsplit() function splits strings on a given separator, but here it runs into trouble: to avoid splitting in the wrong place you need to know at least what b looks like (or use a unique separator):
strsplit(x, split = "_")
[[1]]
[1] "Hello" "sum"
[[2]]
[1] "Joe" "sum" "one"
[[3]]
[1] "Simpsons" "sum"
[[4]]
[1] "Oh" "No" "sum" "one"
[[5]]
[1] "Hiya" "Hi" "sum"
[[6]]
[1] "oh" "sum" "one"
The result is a list in which each element is a character vector, and the vectors have different lengths.
An option can be to use the value of b as splitter:
rd <- strsplit(x, split = paste0(paste0("_",b), collapse = "|"))
rd
[[1]]
[1] "Hello"
[[2]]
[1] "Joe"
[[3]]
[1] "Simpsons"
[[4]]
[1] "Oh_No"
[[5]]
[1] "Hiya_Hi"
[[6]]
[1] "oh"
# convert this to a vector:
a <- unlist(rd)
a
[1] "Hello" "Joe" "Simpsons" "Oh_No" "Hiya_Hi" "oh"
Now you can use this information the other way around:
b <- unique(gsub(paste0(paste0(a, "_"), collapse = "|"),"", x))
b
[1] "sum" "sum_one"
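The "unique separator" option mentioned above can be sketched as follows. This is only an illustration, assuming that the character "|" never occurs in a or b; the names sep, x2, a2 and b2 are made up for the example.

```r
a <- c("Hello", "Joe", "Simpsons", "Oh_No", "Hiya_Hi", "oh")
b <- c("sum", "sum_one")

# Assumption: "|" never appears in a or b, so it is safe as a separator.
sep <- "|"
x2 <- paste(a, b, sep = sep)

# fixed = TRUE treats the separator literally rather than as a regex.
parts <- strsplit(x2, sep, fixed = TRUE)
a2 <- vapply(parts, `[`, character(1), 1)  # first piece of every string
b2 <- vapply(parts, `[`, character(1), 2)  # second piece of every string
```

Note that b2 is the recycled version of b (one value per element of x2), so unique(b2) is needed to get back the original two values.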

As @Gregor Thomas already said in the comments, the information is lost once the strings are pasted. However, depending on the context, you can store it in an attribute using a home-grown my_paste function, for which we also write a print method and a my_unpaste function.
Here is a sketch of the idea:
my_paste <- \(..., sep=" ", collapse=NULL, recycle0=FALSE) { ## new paste fun
o <- `attr<-`(paste(..., sep=sep), 'unpaste', list(...))
return(structure(o, class=c('character', 'my_paste')))
}
print.my_paste <- function(x, ...) { ## print method for class `my_paste'
print(as.character(x), ...) ## as.character() drops the attributes before printing
}
my_unpaste <- \(x, warn=TRUE) { ## the un-paste function
if (!inherits(x, 'my_paste')) {
if (warn) warning('Nothing to unpaste.')
return(x)
} else {
return(attr(x, 'unpaste'))
}
}
Usage
x <- my_paste(a, b, sep='_')
Looks like this,
str(x)
# 'my_paste' chr [1:6] "Hello_sum" "Joe_sum_one" "Simpsons_sum" "Oh_No_sum_one" "Hiya_Hi_sum" "oh_sum_one"
# - attr(*, "unpaste")=List of 2
# ..$ : chr [1:6] "Hello" "Joe" "Simpsons" "Oh_No" ...
# ..$ : chr [1:2] "sum" "sum_one"
but prints normally:
x ## or more verbose `print(x)`
# [1] "Hello_sum" "Joe_sum_one" "Simpsons_sum" "Oh_No_sum_one" "Hiya_Hi_sum" "oh_sum_one"
Now un-paste!
my_unpaste(x)
# [[1]]
# [1] "Hello" "Joe" "Simpsons" "Oh_No" "Hiya_Hi" "oh"
#
# [[2]]
# [1] "sum" "sum_one"
Calling it on a plain vector gives a warning:
my_unpaste(a)
# [1] "Hello" "Joe" "Simpsons" "Oh_No" "Hiya_Hi" "oh"
# Warning message:
# In my_unpaste(a) : Nothing to unpaste.
my_unpaste(a, warn=FALSE)
# [1] "Hello" "Joe" "Simpsons" "Oh_No" "Hiya_Hi" "oh"
Note: R >= 4.1 used.
Data:
a <- c("Hello", "Joe", "Simpsons", "Oh_No", "Hiya_Hi", "oh")
b <- c("sum", "sum_one")
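As a rough check of the attribute idea, the sketch below re-pastes the stored pieces and compares them with the pasted strings. It uses a compact re-statement of my_paste (same attribute name, without the unused arguments); the names rebuilt, round_trip_ok, x_sub and still_my_paste are made up for the illustration. It also shows a caveat worth keeping in mind: default subsetting in R keeps only names, dim and dimnames, so the stored pieces do not survive x[1:2] unless a `[` method is written for the class.

```r
a <- c("Hello", "Joe", "Simpsons", "Oh_No", "Hiya_Hi", "oh")
b <- c("sum", "sum_one")

# Compact re-statement of the my_paste idea (same "unpaste" attribute).
my_paste <- function(..., sep = " ") {
  structure(paste(..., sep = sep),
            unpaste = list(...),
            class = c("character", "my_paste"))
}

x <- my_paste(a, b, sep = "_")

# Round trip: pasting the stored pieces again reproduces the strings.
rebuilt <- do.call(paste, c(attr(x, "unpaste"), sep = "_"))
round_trip_ok <- identical(rebuilt, as.character(x))

# Caveat: after default subsetting the class and the attribute are gone.
x_sub <- x[1:2]
still_my_paste <- inherits(x_sub, "my_paste")
```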

Related

Iteratively extract repeated word forms across speaking turns

I'm working on speaking turns in conversation. My interest is in the words that get repeated from a prior turn to a next turn:
turnsX <- data.frame(
speaker = c("A","B","A","B"),
speech = c("let's have a look",
"yeah let's take a look",
"yeah okay so where to start",
"let's start here"), stringsAsFactors = F
)
I want to extract the repeated word forms. To this end I've run a for loop, iteratively defining each speech turn as a regex pattern for the next speech turn and str_extracting the words that get repeated from turn to turn:
library(stringr)
pattern <- c()
extracted <- c()
for(i in 1:nrow(turnsX)){
pattern[i] <- paste0(unlist(str_split(turnsX$speech[i], " ")), collapse = "|")
extracted[i+1] <- str_extract_all(turnsX$speech[i+1], pattern[i])
}
The result however is partly incorrect:
extracted
[[1]]
NULL
[[2]]
[1] "a" "let's" "a" "a" "look"
[[3]]
[1] "yeah" "a" "a"
[[4]]
[1] "start"
[[5]]
[1] NA
The correct result should be:
extracted
[[1]]
NULL
[[2]]
[1] "let's" "a" "look"
[[3]]
[1] "yeah"
[[4]]
[1] "start"
Where's the mistake? How can the code be mended, or what other approach is there, to get the correct result?
Maybe you can use Map and %in%.
x <- strsplit(turnsX$speech, " ")
Map(function(y,z) y[y %in% z], x[-length(x)], x[-1])
#[[1]]
#[1] "let's" "a" "look"
#
#[[2]]
#[1] "yeah"
#
#[[3]]
#[1] "start"
Here's a base R approach using Map :
tmp <- strsplit(turnsX$speech, ' ')
c(NA, Map(intersect, tmp[-1], tmp[-length(tmp)]))
#[[1]]
#[1] NA
#[[2]]
#[1] "let's" "a" "look"
#[[3]]
#[1] "yeah"
#[[4]]
#[1] "start"
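The two Map answers above agree on this data, but they can differ when a word occurs more than once in a turn: subsetting with %in% keeps every occurrence, while intersect() returns each word once. A small sketch with made-up turns (prev_turn and next_turn are hypothetical, not from the question):

```r
prev_turn <- c("so", "so", "what", "now")  # hypothetical turn with a repeated word
next_turn <- c("so", "now", "then")

# %in% keeps every matching element, duplicates included.
kept_with_in <- prev_turn[prev_turn %in% next_turn]

# intersect() treats its arguments as sets, so duplicates are dropped.
kept_with_intersect <- intersect(prev_turn, next_turn)
```

Which behaviour is wanted depends on whether repeated word forms should be counted once or once per occurrence.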
You want the word-boundary anchor "\\b":
library(stringr)
pattern <- c()
extracted <- c()
for(i in 2:nrow(turnsX)){
pattern[i - 1] <- paste0(unlist(str_split(turnsX$speech[i - 1], " ")), collapse = "|\\b")
extracted[i] <- str_extract_all(turnsX$speech[i], pattern[i - 1])
}
# [[1]]
# NULL
#
# [[2]]
# [1] "let's" "a" "look"
#
# [[3]]
# [1] "yeah"
#
# [[4]]
# [1] "start"
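The pattern above puts "\\b" only in front of each alternative. A stricter variant, sketched here in base R, wraps the whole alternation in word boundaries on both sides so that only whole words match. It assumes the words contain no regex metacharacters (true for this data, but they would need escaping in general); the names turns and repeated are made up for the illustration.

```r
turns <- c("let's have a look",
           "yeah let's take a look",
           "yeah okay so where to start",
           "let's start here")

# For each turn from the second onwards, extract the whole words
# that also occur in the previous turn.
repeated <- lapply(seq_along(turns)[-1], function(i) {
  words <- unique(strsplit(turns[i - 1], " ", fixed = TRUE)[[1]])
  # Assumption: words contain no regex metacharacters.
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  regmatches(turns[i], gregexpr(pattern, turns[i], perl = TRUE))[[1]]
})
```

Here repeated[[1]] holds the words of turn 2 that repeat turn 1, and so on.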

ft_tokenizer tokenizes words to lower, I want it to be as they are

I am using ft_tokenizer on a Spark data frame in R.
It tokenizes each word and converts it to lowercase, but I want the words to stay in their original case.
text_data <- data_frame(
x = c("This IS a sentence", "So is this")
)
tokenized <- text_data_tbl %>%
ft_tokenizer("x", "word")
tokenized$word
## [[1]]
## [[1]][[1]]
## [1] "this"
##
## [[1]][[2]]
## [1] "is"
##
## [[1]][[3]]
## [1] "a"
I want:
tokenized$word
## [[1]]
## [[1]][[1]]
## [1] "This"
##
## [[1]][[2]]
## [1] "IS"
##
## [[1]][[3]]
## [1] "a"
I guess it is not possible with ft_tokenizer. From ?ft_tokenizer
A tokenizer that converts the input string to lowercase and then splits it by white spaces.
So its basic feature is to convert the string to lowercase and split on whitespace, which I guess cannot be changed. Consider doing
text_data$new_x <- lapply(strsplit(text_data$x, "\\s+"), as.list)
which will give the same output as expected and you can continue your process as it is from here.
text_data$new_x
#[[1]]
#[[1]][[1]]
#[1] "This"
#[[1]][[2]]
#[1] "IS"
#[[1]][[3]]
#[1] "a"
#[[1]][[4]]
#[1] "sentence"
#[[2]]
#[[2]][[1]]
#[1] "So"
#[[2]][[2]]
#[1] "is"
#[[2]][[3]]
#[1] "this"

Filter list in R which has nchar > 1

I have a list of names
> x <- c("Test t", "Cuma Nama K", "O", "Test satu dua t")
> name <- strsplit(x, " ")
> name
[[1]]
[1] "Test" "t"
[[2]]
[1] "Cuma" "Nama" "K"
[[3]]
[1] "O"
[[4]]
[1] "Test" "satu" "dua" "t"
How can I filter a list so that it can become like this?
I am trying to find out how to filter the list which has nchar > 1
> name
[[1]]
[1] "Test"
[[2]]
[1] "Cuma" "Nama"
[[4]]
[1] "Test" "satu" "dua"
lapply(name, function(x) x[nchar(x)>1])
Results in:
[[1]]
[1] "Test"
[[2]]
[1] "Cuma" "Nama"
[[3]]
character(0)
[[4]]
[1] "Test" "satu" "dua"
We can loop over the list elements, subset the elements that have nchar greater than 1, and use Filter to remove the list elements that are left with 0 elements
Filter(length,lapply(name, function(x) x[nchar(x) >1 ]))
#[[1]]
#[1] "Test"
#[[2]]
#[1] "Cuma" "Nama"
#[[3]]
#[1] "Test" "satu" "dua"
If we want to remove the words with one character from the string, we can also do this without splitting
setdiff(gsub("(^| ).( |$)", "", x), "")
#[1] "Test" "Cuma Nama" "Test satu dua"
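The desired output in the question keeps the original index [[4]], whereas Filter renumbers the remaining elements. One possible way to keep track of the original positions, sketched under the assumption that storing them as names is acceptable (long_words and res are made-up names):

```r
x <- c("Test t", "Cuma Nama K", "O", "Test satu dua t")
name <- strsplit(x, " ")

# Label each element with its original position before filtering.
names(name) <- seq_along(name)

# Keep only words longer than one character, then drop empty elements.
long_words <- lapply(name, function(w) w[nchar(w) > 1])
res <- Filter(length, long_words)
```

The remaining elements can then be looked up by their original position, for example res[["4"]].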

Separate string into list in r

I have a string in R that looks like this:
"{[PP]}{[BGH]}{[AC]}{[ETL]}....{[D]}"
I want to convert it into a list so that:
List[[1]] = {[PP]}
List[[2]] = {[BGH]}
....
List[[N]] = {[D]}
If there were commas you could do strsplit but I want to keep the brackets and not get rid of them. Not sure how to do this in R
Without regular expressions:
s <- "{[PP]}{[BGH]}{[AC]}{[ETL]}{[D]}"
# fixed = TRUE splits on the literal "{"; [-1] drops the empty first piece
as.list(paste("{", strsplit(s, "{", fixed = TRUE)[[1]][-1], sep = ""))
[[1]]
[1] "{[PP]}"
[[2]]
[1] "{[BGH]}"
[[3]]
[1] "{[AC]}"
[[4]]
[1] "{[ETL]}"
[[5]]
[1] "{[D]}"
strsplit still works if you pass this regular expression (?<=})(?={) which constrains the position to split on:
strsplit(s, "(?<=})(?={)", perl = TRUE)
# [[1]]
# [1] "{[PP]}" "{[BGH]}" "{[AC]}" "{[ETL]}" "{[D]}"
Or as @thelatemail suggested:
strsplit(s, "(?<=})", perl = TRUE)
obligatory stringi answer:
library(stringi)
dat <- "{[PP]}{[BGH]}{[AC]}{[ETL]}{[more]{[D]}"
as.list(stri_match_all_regex(dat, "(\\{\\[[[:alpha:]]+\\]\\})")[[1]][,2])
## [[1]]
## [1] "{[PP]}"
##
## [[2]]
## [1] "{[BGH]}"
##
## [[3]]
## [1] "{[AC]}"
##
## [[4]]
## [1] "{[ETL]}"
##
## [[5]]
## [1] "{[D]}"
There is a convenient function in qdap for this i.e. bracketXtract
library(qdap)
setNames(as.list(bracketXtract(s, "curly", TRUE)), NULL)
#[[1]]
#[1] "{[PP]}"
#[[2]]
#[1] "{[BGH]}"
#[[3]]
#[1] "{[AC]}"
#[[4]]
#[1] "{[ETL]}"
#[[5]]
#[1] "{[D]}"
By default, with = FALSE. So without using with = TRUE, it will remove the bracket.
data
s <- "{[PP]}{[BGH]}{[AC]}{[ETL]}{[D]}"
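Instead of splitting, the groups can also be matched directly in base R with gregexpr() and regmatches(), similar in spirit to the stringi answer. This is a minimal sketch that assumes the groups are not nested, so each one runs from "{" to the next "}"; m and pieces are made-up names.

```r
s <- "{[PP]}{[BGH]}{[AC]}{[ETL]}{[D]}"

# Match an opening brace, any run of non-brace characters, and a closing brace.
m <- gregexpr("\\{[^{}]*\\}", s, perl = TRUE)

# regmatches() returns one character vector per input string.
pieces <- as.list(regmatches(s, m)[[1]])
```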

R, split string to pairs of character

How can I split a string in R in the following way? Please look at the example:
example:
c("ex", "xa", "am", "mp", "pl", "le") ?
x = "example"
substring(x, first = 1:(nchar(x) - 1), last = 2:nchar(x))
# [1] "ex" "xa" "am" "mp" "pl" "le"
You could, of course, wrap it into a function, maybe omit non-letters (I don't know if the colon was supposed to be part of your string or not), etc.
To do this to a vector of strings, you can use it as an anonymous function with lapply:
lapply(month.name, function(x) substring(x, first = 1:(nchar(x) - 1), last = 2:nchar(x)))
# [[1]]
# [1] "Ja" "an" "nu" "ua" "ar" "ry"
#
# [[2]]
# [1] "Fe" "eb" "br" "ru" "ua" "ar" "ry"
#
# [[3]]
# [1] "Ma" "ar" "rc" "ch"
# ...
Or make it into a named function and use it by name. This would make sense if you'll use it somewhat frequently.
str_split_pairs = function(x) {
substring(x, first = 1:(nchar(x) - 1), last = 2:nchar(x))
}
lapply(month.name, str_split_pairs)
## same result as above
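One thing to watch with 1:(nchar(x) - 1) is a string of fewer than two characters, where the sequence 1:0 counts down instead of being empty. A guarded variant might look like the sketch below; str_split_pairs_safe is a hypothetical name, and returning character(0) for short strings is an assumption about the desired behaviour.

```r
str_split_pairs_safe <- function(x) {
  n <- nchar(x)
  # Assumption: strings shorter than two characters have no pairs.
  if (n < 2) return(character(0))
  starts <- seq_len(n - 1)  # start position of every pair
  substring(x, first = starts, last = starts + 1)
}

pairs_example <- str_split_pairs_safe("example")
pairs_single  <- str_split_pairs_safe("a")
```

Like the original, this works on one string at a time, so it would still be used with lapply for a vector of strings.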
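Here's another option (though it's slower than @Gregor's answer):
library(dplyr) # for lead()
x=c("example", "stackoverflow", "programming")
lapply(x, function(i) {
i = unlist(strsplit(i,"")) # split into single characters
paste0(i, lead(i))[-length(i)] # pair each character with the next, drop the last (NA) pair
})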
[[1]]
[1] "ex" "xa" "am" "mp" "pl" "le"
[[2]]
[1] "st" "ta" "ac" "ck" "ko" "ov" "ve" "er" "rf" "fl" "lo" "ow"
[[3]]
[1] "pr" "ro" "og" "gr" "ra" "am" "mm" "mi" "in" "ng"

Resources