Relocate and duplicate a string in R - r

I aim to relocate words and copy&paste them in certain pattern.
a = 'blahblah (Peter|Sally|Tom)'
b = 'word (apple|grape|tomato) vocabulary (rice|mice|lice)'
c = 'people person (you|me|us) do not know how (it|them) works'
I can relocate a string placed before '(' by using gsub
gsub('\\s*(\\S+)\\s*\\(', '(\\1 ', a)
With the function, I can make string sets below.
a
[1]'(blahblah Peter|Sally|Tom)'
b
[1]'(word apple|grape|tomato) (vocabulary rice|mice|lice)'
c
[1]'people (person you|me|us) do not know (how it|them) works'
However, I have no idea how to copy '\\1' and paste it after '|' like this
a
[1]'(blahblah Peter|blahblah Sally|blahblah Tom)'
b
[1]'(word apple|word grape|word tomato) (vocabulary rice|vocabulary mice|vocabulary lice)'
c
[1]'people (person you|person me|person us) do not know (how it|how them) works'
Is there any way to make this possible?

We can use strsplit
sapply(strsplit(a, "[| ]|\\(|\\)"), function(x) {
x1 <- x[nzchar(x)]
paste0("(", paste(x1[1], x1[-1], collapse="|"), ")")})
#[1] "(blahblah Peter|blahblah Sally|blahblah Tom)"
For multiple cases
paste(sapply(strsplit(b, "(?<=\\))\\s+", perl = TRUE)[[1]],
function(x) sapply(strsplit(x, "[| ]|\\(|\\)"), function(y) {
x1 <- y[nzchar(y)]
paste0("(", paste(x1[1], x1[-1], collapse="|"), ")") })), collapse=' ')
#[1] "(word apple|word grape|word tomato) (vocabulary rice|vocabulary mice|vocabulary lice)"
Another option is str_extract
library(stringr)
m1 <- matrix(str_extract_all(b, "\\w+")[[1]], ncol=2)
do.call(sprintf, c(do.call(paste, c(as.data.frame(matrix(paste(m1[1,][col(m1[-1,])],
m1[-1,]), nrow=2, byrow=TRUE)), sep="|")), list(fmt = "(%s) (%s)")))
#[1] "(word apple|word grape|word tomato) (vocabulary rice|vocabulary mice|vocabulary lice)"
Update
Based on the new pattern showed in the OP's post, we create a more general approach
funPaste <- function(str1){
v1 <- strsplit(str1, "\\s+")[[1]]
i1 <- grep("\\(", v1)
v1[i1] <- mapply(function(x,y) paste0("(", paste(x, y, collapse="|"), ")"),
v1[i1-1], str_extract_all(v1[i1], "\\w+"))
paste(v1[-(i1-1)], collapse=" ")
}
funPaste(a)
#[1] "(blahblah Peter|blahblah Sally|blahblah Tom)"
funPaste(b)
#[1] "(word apple|word grape|word tomato) (vocabulary rice|vocabulary mice|vocabulary lice)"
funPaste(c)
#[1] "people (person you|person me|person us) do not know (how it|how them) works"
Update2
We can also make use of gsubfn
library(gsubfn)
funPaste2 <- function(str1){
gsubfn("(\\w+)\\s+[(]([^)]+)[)]", function(x,y)
paste0("(", paste(x, unlist(strsplit(y, "[|]")), collapse="|"), ")"), str1)
}
funPaste2(c(a, b, c))
#[1] "(blahblah Peter|blahblah Sally|blahblah Tom)"
#[2] "(word apple|word grape|word tomato) (vocabulary rice|vocabulary mice|vocabulary lice)"
#[3] "people (person you|person me|person us) do not know (how it|how them) works"

another method: (with as less regex as possible) - since I don't know much :)
c=unlist(strsplit(b, " "))[c(T,F)] # extract all the single words
# c
# [1] "blahblah"
# [1] "word" "vocabulary"
d=unlist(strsplit)(b, " ")[c(F,T)] # extract the grouped words
# d
# [1] "(Peter|Sally|Tom)"
# [1] "(apple|grape|tomato)" "(rice|mice|lice)"
# now iterate through each 'd', split it on `|` and then clear it on `()` this output is then pasted with contents of 'c'
sapply(seq_along(d), function(x) paste("(", paste(c[x],gsub("(\\(|\\))", "",unlist(strsplit(d[x], "\\|"))),
collapse = "|"),")"))
# [1] "( blahblah Peter|blahblah Sally|blahblah Tom )"
# [1] "( word apple|word grape|word tomato )" "( vocabulary rice|vocabulary mice|vocabulary lice )"

Related

String splitting with a stop character in R

My data is as follows:
“Louis Hamilton”
“Tiger Wolf”
“Sachin Tendulkar”
“Lebron James”
“Michael Shoemaker”
“Hollywood – Career as an Actor”
I need to extract all the characters until a space or a dash(-) is reached
I need to extract no more than 10 characters
My desired output is
“Louis”
“Tiger”
“Sachin”
“Lebron”
“Michael”
“Hollywood”
I tried using below function, but it didn’t work
Sportstars<-function(charvec)
{min.length < 10, (x, hyph.pattern = Null)}
Can anyone help, please?
We can use sub
sub("^([^- ]+).*", "\\1", v1)
#[1] "Louis" "Tiger" "Sachin" "Lebron" "Michael" "Hollywood"
Or another version with the length condition as well
grep("^.{1,10}$", sub("\\s+.*", "", v1), value = TRUE)
#[1] "Louis" "Tiger" "Sachin" "Lebron" "Michael" "Hollywood"
Or with word from stringr
library(stringr)
word(v1, 1)
#[1] "Louis" "Tiger" "Sachin" "Lebron" "Michael" "Hollywood"
Also, if we need to implement the last condition as well
sapply(strsplit(v1, "[– -]"), function(x) {
x1 <- setdiff(x, "")
x1[1][nchar(x1[1]) < 10]})
#[1] "Louis" "Tiger" "Sachin" "Lebron" "Michael" "Hollywood"
data
v1 <- c( "Louis Hamilton", "Tiger Wolf", "Sachin Tendulkar",
"Lebron James", "Michael Shoemaker", "Hollywood – Career as an Actor")

Insert blank space between letters of word

I'm trying to create a function able to return various versions of the same string but with blank spaces between the letters.
something like:
input <- "word"
returning:
w ord
wo rd
wor d
We first break the string into every character using strsplit. We then append an empty space at every position using sapply.
input <- "word"
input_break <- strsplit(input, "")[[1]]
c(input, sapply(seq(1,nchar(input)-1), function(x)
paste0(append(input_break, " ", x), collapse = "")))
#[1] "word" "w ord" "wo rd" "wor d"
?append gives us append(x, values, after = length(x))
where x is the vector, value is the value to be inserted (here " " ) and after is after which place you want to insert the values.
Here is an option using sub
sapply(seq_len(nchar(input)-1), function(i) sub(paste0('^(.{', i, '})'), '\\1 ', input))
#[1] "w ord" "wo rd" "wor d"
Or with substring
paste(substring(input, 1, 1:3), substring(input, 2:4, 4))
#[1] "w ord" "wo rd" "wor d"

how to remove duplicate words in a certain pattern from a string in R

I aim to remove duplicate words only in parentheses from string sets.
a = c( 'I (have|has|have) certain (words|word|worded|word) certain',
'(You|You|Youre) (can|cans|can) do this (works|works|worked)',
'I (am|are|am) (sure|sure|surely) you know (what|when|what) (you|her|you) should (do|do)' )
What I want to get is just like this
a
[1]'I (have|has) certain (words|word|worded) certain'
[2]'(You|Youre) (can|cans) do this (works|worked)'
[3]'I (am|are) pretty (sure|surely) you know (what|when) (you|her) should (do|)'
In order to get the result, I used a code like this
a = gsub('\\|', " | ", a)
a = gsub('\\(', "( ", a)
a = gsub('\\)', " )", a)
a = vapply(strsplit(a, " "), function(x) paste(unique(x), collapse = " "), character(1L))
However, it resulted in undesirable outputs.
a
[1] "I ( have | has ) certain words word worded"
[2] "( You | Youre ) can cans do this works worked"
[3] "I ( am | are ) sure surely you know what when her should do"
Why did my code remove parentheses located in the latter part of strings?
What should I do for the result I want?
We can use gsubfn. Here, the idea is to select the characters inside the brackets by matching the opening bracket (\\( have to escape the bracket as it is a metacharacter) followed by one or more characters that are not a closing bracket ([^)]+), capture it as a group within the brackets. In the replacement, we split the group of characters (x) with strsplit, unlist the list output, get the unique elements and paste it together
library(gsubfn)
gsubfn("\\(([^)]+)", ~paste0("(", paste(unique(unlist(strsplit(x,
"[|]"))), collapse="|")), a)
#[1] "I (have|has) certain (words|word|worded) certain"
#[2] "(You|Youre) (can|cans) do this (works|worked)"
#[3] "I (am|are) (sure|surely) you know (what|when) (you|her) should (do)"
Take the answer above. This is more straightforward, but you can also try:
library(stringi)
library(stringr)
a_new <- gsub("[|]","-",a) # replace this | due to some issus during the replacement later
a1 <- str_extract_all(a_new,"[(](.*?)[)]") # extract the "units"
# some magic using stringi::stri_extract_all_words()
a2 <- unlist(lapply(a1,function(x) unlist(lapply(stri_extract_all_words(x), function(y) paste(unique(y),collapse = "|")))))
# prepare replacement
names(a2) <- unlist(a1)
# replacement and finalization
str_replace_all(a_new, a2)
[1] "I (have|has) certain (words|word|worded) certain"
[2] "(You|Youre) (can|cans) do this (works|worked)"
[3] "I (am|are) (sure|surely) you know (what|when) (you|her) should (do)"
The idea is to extract the words within the brackets as unit. Then remove the duplicates and replace the old unit with the updated.
a longer but more elaborate try
a = c( 'I (have|has|have) certain (words|word|worded|word) certain',
'(You|You|Youre) (can|cans|can) do this (works|works|worked)',
'I (am|are|am) (sure|sure|surely) you know (what|when|what) (you|her|you) should (do|do)' )
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
# blank output
new_a <- c()
for (sentence in 1:length(a)) {
split <- trim(unlist(strsplit(a[sentence],"[( )]")))
newsentence <- c()
for (i in split) {
j1 <- as.character(unique(trim(unlist(strsplit(gsub('\\|'," ",i)," ")))))
if( length(j1)==0) {
next
} else {
ifelse(length(j1)>1,
newsentence <- c(newsentence,paste("(",paste(j1,collapse="|"),")",sep="")),
newsentence <- c(newsentence,j1[1]))
}
}
newsentence <- paste(newsentence,collapse=" ")
print(newsentence)
new_a <- c(new_a,newsentence)}
# [1] "I (have|has) certain (words|word|worded) certain"
# [2] "(You|Youre) (can|cans) do this (works|worked)"
# [3] "I (am|are) (sure|surely) you know (what|when) (you|her) should do"

R: Delete first and last part of string based on pattern

This string is a ticker for a bond: OAT 3 25/32 7/17/17. I want to extract the coupon rate which is 3 25/32 and is read as 3 + 25/32 or 3.78125. Now I've been trying to delete the date and the name OAT with gsub, however I've encountered some problems.
This is the code to delete the date:
tkr.bond <- 'OAT 3 25/32 7/17/17'
tkr.ptrn <- '[0-9][[:punct:]][0-9][[:punct:]][0-9]'
gsub(tkr.ptrn, "", tkr.bond)
However it gets me the same string. When I use [0-9][[:punct:]][0-9] in the pattern I manage to delete part of the date, however it also deletes the fraction part of the coupon rate for the bond.
The tricky thing is to find a solution that doesn't involve the pattern of the coupon because the tickers have this form: Name Coupon Date, so, using a specific pattern for the coupon may limit the scope of the solution. For example, if the ticker is this way OAT 0 7/17/17, the coupon is zero.
Just replace first and last word with an empty string.
> tkr.bond <- 'OAT 3 25/32 7/17/17'
> gsub("^\\S+\\s*|\\s*\\S+$", "", tkr.bond)
[1] "3 25/32"
OR
Use gsubfn function in-order to use a function in the replacement part.
> gsubfn("^\\S+\\s+(\\d+)\\s+(\\d+)/(\\d+).*", ~ as.numeric(x) + as.numeric(y)/as.numeric(z), tkr.bond)
[1] "3.78125"
Update:
> tkr.bond1 <- c(tkr.bond, 'OAT 0 7/17/17')
> m <- gsub("^\\S+\\s*|\\s*\\S+$", "", tkr.bond1)
> gsubfn(".+", ~ eval(parse(text=x)), gsub("\\s+", "+", m))
[1] "3.78125" "0"
Try
eval(parse(text=sub('[A-Z]+ ([0-9]+ )([0-9/]+) .*', '\\1 + \\2', tkr.bond)))
#[1] 3.78125
Or you may need
sub('^[A-Z]+ ([^A-Z]+) [^ ]+$', '\\1', tkr.bond)
#[1] "3 25/32"
Update
tkr.bond1 <- c(tkr.bond, 'OAT 0 7/17/17')
v1 <- sub('^[A-Z]+ ([^A-Z]+) [^ ]+$', '\\1', tkr.bond1)
unname(sapply(sub(' ', '+', v1), function(x) eval(parse(text=x))))
#[1] 3.78125 0.00000
Or
vapply(strsplit(tkr.bond1, ' '), function(x)
eval(parse(text= paste(x[-c(1, length(x))], collapse="+"))), 0)
#[1] 3.78125 0.00000
Or without the eval(parse
vapply(strsplit(gsub('^[^ ]+ | [^ ]+$', '', tkr.bond1), '[ /]'), function(x) {
x1 <- as.numeric(x)
sum(x1[1], x1[2]/x1[3], na.rm=TRUE)}, 0)
#[1] 3.78125 0.00000
Similar to akrun's answer, using sub with a replacement. How it works: you put your "desired" pattern inside parentheses and leave the rest out (while still putting regex characters to match what's there and that you don't wish to keep). Then when you say replacement = "\\1" you indicate that the whole string must be substituted by only what's inside the parentheses.
sub(pattern = ".*\\s(\\d\\s\\d+\\/\\d+)\\s.*", replacement = "\\1", x = tkr.bond, perl = TRUE)
# [1] "3 25/32"
Then you can change it to numerical:
temp <- sub(pattern = ".*\\s(\\d\\s\\d+\\/\\d+)\\s.*", replacement = "\\1", x = tkr.bond, perl = TRUE)
eval(parse(text=sub(" ","+",x = temp)))
# [1] 3.78125
You can also use strsplit here. Then evaluate components excluding the first and the last. Like this
> tickers <- c('OAT 3 25/32 7/17/17', 'OAT 0 7/17/17')
>
> unlist(lapply(lapply(strsplit(tickers, " "),
+ function(x) {x[-length(x)][-1]}),
+ function(y) {sum(
+ sapply(y, function (z) {eval(parse(text = z))}) )} ) )
[1] 3.78125 0.00000

How to look for a certain part in a string and only keep that part

What is the cleanest way of finding for example the string ": [1-9]*" and only keeping that part?
You can work with regexec to get the starting points, but isn't there a cleaner way just to get immediately the value?
For example:
test <- c("surface area: 458", "bedrooms: 1", "whatever")
regexec(": [1-9]*", test)
How do I get immediately just
c(": 458",": 1", NA )
You can use base R which handles this just fine.
> x <- c('surface area: 458', 'bedrooms: 1', 'whatever')
> r <- regmatches(x, gregexpr(':.*', x))
> unlist({r[sapply(r, length)==0] <- NA; r})
# [1] ": 458" ": 1" NA
Although, I find it much simpler to just do...
> x <- c('surface area: 458', 'bedrooms: 1', 'whatever')
> sapply(strsplit(x, '\\b(?=:)', perl=T), '[', 2)
# [1] ": 458" ": 1" NA
library(stringr)
str_extract(test, ":.*")
#[1] ": 458" ": 1" NA
Or for a faster approach stringi
library(stringi)
stri_extract_first_regex(test, ":.*")
#[1] ": 458" ": 1" NA
If you need the keep the values of the one that doesn't have the match
gsub(".*(:.*)", "\\1", test)
#[1] ": 458" ": 1" "whatever"
Try any of these. The first two use the base of R only. The last one assumes that we want to return a numeric vector.
1) sub
s <- sub(".*:", ":", test)
ifelse(test == s, NA, s)
## [1] ": 458" ": 1" NA
If there can be more than one : in a string then replace the pattern with "^[^:]*:" .
2) strsplit
sapply(strsplit(test, ":"), function(x) c(paste0(":", x), NA)[2])
## [1] ": 458" ": 1" NA
Do not use this one if there can be more than one : in a string.
3) strapplyc
library(gsubfn)
s <- strapplyc(test, "(:.*)|$", simplify = TRUE)
ifelse(s == "", NA, s)
## [1] ": 458" ": 1" NA
We can omit the ifelse line if "" is ok instead of NA.
4) strapply If the idea is really that there are some digits on the line and we want to return the numbers or NA then try this:
library(gsubfn)
strapply(test, "\\d+|$", as.numeric, simplify = TRUE)
## [1] 458 1 NA

Resources