My data is as follows:
“Louis Hamilton”
“Tiger Wolf”
“Sachin Tendulkar”
“Lebron James”
“Michael Shoemaker”
“Hollywood – Career as an Actor”
I need to extract all the characters up to the first space or dash (-).
The extracted text should be no more than 10 characters long.
My desired output is
“Louis”
“Tiger”
“Sachin”
“Lebron”
“Michael”
“Hollywood”
I tried using the function below, but it didn’t work:
Sportstars<-function(charvec)
{min.length < 10, (x, hyph.pattern = Null)}
Can anyone help, please?
We can use sub
sub("^([^- ]+).*", "\\1", v1)
#[1] "Louis" "Tiger" "Sachin" "Lebron" "Michael" "Hollywood"
Or another version with the length condition as well
grep("^.{1,10}$", sub("\\s+.*", "", v1), value = TRUE)
#[1] "Louis" "Tiger" "Sachin" "Lebron" "Michael" "Hollywood"
Or with word from stringr
library(stringr)
word(v1, 1)
#[1] "Louis" "Tiger" "Sachin" "Lebron" "Michael" "Hollywood"
Also, if we need to enforce the 10-character limit after splitting on a space, hyphen, or en dash
sapply(strsplit(v1, "[– -]"), function(x) {
x1 <- setdiff(x, "")
x1[1][nchar(x1[1]) <= 10]})
#[1] "Louis" "Tiger" "Sachin" "Lebron" "Michael" "Hollywood"
data
v1 <- c( "Louis Hamilton", "Tiger Wolf", "Sachin Tendulkar",
"Lebron James", "Michael Shoemaker", "Hollywood – Career as an Actor")
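If the 10-character limit is meant as a truncation rather than a filter (an assumption, since the question does not say), both conditions can be folded into one pattern. Here is a minimal base R sketch using regexpr and regmatches. v1 is redefined with an escaped en dash so the block runs on its own, and the last entry is a made-up name added only to show the truncation:

```r
v1 <- c("Louis Hamilton", "Tiger Wolf", "Sachin Tendulkar",
        "Lebron James", "Michael Shoemaker",
        "Hollywood \u2013 Career as an Actor",
        "Schwarzenegger-Smith")  # hypothetical entry, not in the original data

# Up to 10 leading characters that are not a space, an en dash, or a hyphen
pattern <- "^[^ \u2013-]{1,10}"

# regmatches() keeps only the elements that matched; every element matches here
first_part <- regmatches(v1, regexpr(pattern, v1))
first_part
```

Note that regmatches silently drops elements with no match, so the result can be shorter than the input when a string starts with a separator.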
Related
What is the best way to extract the initials from a string (except for the last word)? For example convert "GEORGE SMITH BROGAN" to "GS BROGAN"
NAMES <- data.frame(ID = c("GEORGE SMITH BROGAN","ADAM STEVE WILLIS","UNITED INTERNATIONAL SHIPPING STATION"))
The desired output for the above names would be GS BROGAN, AS WILLIS, UIS STATION.
We can try with gsub
gsub("\\s+(?=[A-Z]\\b)", "",
gsub("\\b([A-Z])\\w+\\s|\\s(\\w+)$", "\\1 \\2", NAMES$ID), perl = TRUE)
#[1] "GS BROGAN" "AS WILLIS" "UIS STATION"
Or use strsplit with paste
sapply(strsplit(as.character(NAMES$ID), "\\s+"),
function(x) paste(paste(substr(x[-length(x)], 1, 1), collapse=""),
x[length(x)]))
#[1] "GS BROGAN" "AS WILLIS" "UIS STATION"
Here is a different method using gsub:
gsub('\\s(?![A-Z]+$)', '',
gsub('(?<!\\s|^)[A-Z]+\\s', ' ', NAMES$ID,
perl = TRUE), perl = TRUE)
# [1] "GS BROGAN" "AS WILLIS" "UIS STATION"
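The strsplit approach can be wrapped in a small helper. This is only a sketch: the function name initials_except_last is made up for illustration, and the handling of single-word names (returned unchanged) is an assumption, since the question does not cover that case.

```r
# Hypothetical helper: initials of every word except the last, then the last word
initials_except_last <- function(x) {
  words <- strsplit(trimws(as.character(x)), "\\s+")
  vapply(words, function(w) {
    n <- length(w)
    # Assumption: a name with zero or one word is returned as-is
    if (n <= 1L) return(paste(w, collapse = ""))
    paste(paste(substr(w[-n], 1, 1), collapse = ""), w[n])
  }, character(1))
}

res <- initials_except_last(c("GEORGE SMITH BROGAN", "ADAM STEVE WILLIS",
                              "UNITED INTERNATIONAL SHIPPING STATION",
                              "MADONNA"))
res
```

Using vapply with character(1) guarantees a character vector of the same length as the input, which makes the result safe to assign back to a data frame column.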
I am balancing several versions of R and want to change my R libraries loaded depending on which R and which operating system I'm using. As such, I want to stick with base R functions.
I was reading this page to see what the base R equivalent to stringr::str_extract was:
http://stat545.com/block022_regular-expression.html
It suggested I could replicate this functionality with grep. However, I haven't been able to get grep to do more than return the whole string if there is a match. Is this possible with grep alone, or do I need to combine it with another function? In my case I'm trying to distinguish between CentOS versions 6 and 7.
grep(pattern = "release ([0-9]+)", x = readLines("/etc/system-release"), value = TRUE)
1) strcapture If you want to extract a string of digits and dots from "release 1.2.3" using base then
x <- "release 1.2.3"
strcapture("([0-9.]+)", x, data.frame(version = character(0)))
## version
## 1 1.2.3
2) regexec/regmatches There is also regmatches and regexec but that has already been covered in another answer.
3) sub Also it is often possible to use sub:
sub(".* ([0-9.]+).*", "\\1", x)
## [1] "1.2.3"
3a) If you know the match is at the beginning or end then delete everything after or before it:
sub(".* ", "", x)
## [1] "1.2.3"
4) gsub Sometimes we know that the field to be extracted has certain characters and they do not appear elsewhere. In that case simply delete every occurrence of every character that cannot be in the string:
gsub("[^0-9.]", "", x)
## [1] "1.2.3"
5) read.table One can often decompose the input into fields and then pick off the desired one by number or via grep. strsplit, read.table or scan can be used:
read.table(text = x, as.is = TRUE)[[2]]
## [1] "1.2.3"
5a) grep/scan
grep("^[0-9.]+$", scan(textConnection(x), what = "", quiet = TRUE), value = TRUE)
## [1] "1.2.3"
5b) grep/strsplit
grep("^[0-9.]+$", strsplit(x, " ")[[1]], value = TRUE)
## [1] "1.2.3"
6) substring If we know the character position of the field we can use substring like this:
substring(x, 9)
## [1] "1.2.3"
6a) substring/regexpr or we may be able to use regexpr to locate the character position for us:
substring(x, regexpr("\\d", x))
## [1] "1.2.3"
7) read.dcf Sometimes it is possible to convert the input to dcf form in which case it can be read with read.dcf. Such data is of the form name: value
read.dcf(textConnection(sub(" ", ": ", x)))
## release
## [1,] "1.2.3"
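For the CentOS case in the question, method 2 (regexec with regmatches) can be sketched as follows. The function name extract_major and the sample lines are assumptions for illustration; on a real system the input would come from readLines("/etc/system-release").

```r
# Hypothetical helper: return the first capture group, or NA when nothing matches
extract_major <- function(lines, pattern = "release ([0-9]+)") {
  m <- regmatches(lines, regexec(pattern, lines))
  vapply(m, function(g) if (length(g) >= 2L) g[2] else NA_character_,
         character(1))
}

# Made-up sample input standing in for the contents of /etc/system-release
lines <- c("CentOS Linux release 7.9.2009 (Core)", "no version here")
major <- extract_major(lines)
major
```

regexec returns the full match first and the capture groups after it, so element 2 of each result is the text matched by the parentheses.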
You could do
txt <- c("foo release 123", "bar release", "foo release 123 bar release 123")
pattern <- "release ([0-9]+)"
stringr::str_extract(txt, pattern)
# [1] "release 123" NA "release 123"
sapply(regmatches(txt, regexec(pattern, txt)), "[", 1)
# [1] "release 123" NA "release 123"
txt <- c("foo release 123", "bar release", "foo release 123 bar release 123")
pattern <- "release ([0-9]+)"
Extract first match
sapply(
X = txt,
FUN = function(x){
tmp = regexpr(pattern, x)
m = attr(tmp, "match.length")
st = unlist(tmp)
if (st == -1){NA}else{substr(x, start = st, stop = st + m - 1)}
},
USE.NAMES = FALSE)
#[1] "release 123" NA "release 123"
Extract all matches
sapply(
X = txt,
FUN = function(x){
tmp = gregexpr(pattern, x)
m = attr(tmp[[1]], "match.length")
st = unlist(tmp)
if (st[1] == -1){
NA
}else{
sapply(seq_along(st), function(i) substr(x, st[i], st[i] + m[i] - 1))
}
},
USE.NAMES = FALSE)
#[[1]]
#[1] "release 123"
#[[2]]
#[1] NA
#[[3]]
#[1] "release 123" "release 123"
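The manual substr bookkeeping above can usually be replaced by regmatches, which reads the match positions and lengths directly from the result of gregexpr. A short sketch, with one difference in behaviour: elements without a match come back as character(0) rather than NA.

```r
txt <- c("foo release 123", "bar release", "foo release 123 bar release 123")
pattern <- "release ([0-9]+)"

# One list element per input string, holding every match found in it
all_matches <- regmatches(txt, gregexpr(pattern, txt))

# Number of matches per input string
n_matches <- lengths(all_matches)
n_matches
```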
Let's say I have two vectors like so:
a <- c("this", "is", "test")
b <- c("that", "was", "boy")
I also have a string variable like so:
string <- "this is a story about a test"
I want to replace values in string so that it becomes the following:
string <- "that was a story about a boy"
I could do this using a for loop but I want this to be vectorized. How should I do this?
If you're open to using a non-base package, stringi will work really well here:
stringi::stri_replace_all_fixed(string, a, b, vectorize_all = FALSE)
#[1] "that was a story about a boy"
Note that this also works the same way for input strings of length > 1.
To be on the safe side, you can adapt this - similar to RUser's answer - to check for word boundaries before replacing:
stringi::stri_replace_all_regex(string, paste0("\\b", a, "\\b"), b, vectorize_all = FALSE)
This would ensure that you don't accidentally replace his with hwas, for example.
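The difference is easy to see in base R as well. A small sketch with a made-up test sentence, comparing a fixed replacement with one anchored by word boundaries:

```r
s <- "his list is this"  # made-up sentence with "is" inside other words

# Fixed replacement touches every occurrence, including those inside words
unanchored <- gsub("is", "was", s, fixed = TRUE)

# \\b restricts the match to "is" as a whole word
anchored <- gsub("\\bis\\b", "was", s)

unanchored  # "hwas lwast was thwas"
anchored    # "his list was this"
```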
Here are some solutions. They each will work even if string is a character vector of strings in which case substitutions will be done on each component of it.
1) Reduce This uses no packages.
Reduce(function(x, i) gsub(paste0("\\b", a[i], "\\b"), b[i], x), seq_along(a), string)
## [1] "that was a story about a boy"
2) gsubfn gsubfn is like gsub but the replacement argument can be a list of substitutions (or certain other objects).
library(gsubfn)
gsubfn("\\w+", setNames(as.list(b), a), string)
## [1] "that was a story about a boy"
3) loop This isn't vectorized, but I have added it for comparison. No packages are used.
out <- string
for(i in seq_along(a)) out <- gsub(paste0("\\b", a[i], "\\b"), b[i], out)
out
## [1] "that was a story about a boy"
Note: There is some question of whether cycles are possible. For example, if
a <- c("a", "A")
b <- rev(a)
do we want
"a" to be replaced with "A" and then back to "a" again, or
"a" and "A" to be swapped.
All the solutions shown above assume the first case. If we wanted the second case then perform the operation twice. We will illustrate with (2) because it is the shortest but the same idea applies to them all:
# swap "a" and "A"
a <- c("a", "A")
b <- rev(a)
tmp <- gsubfn("\\w+", setNames(as.list(seq_along(a)), a), string)
gsubfn("\\w+", setNames(as.list(b), seq_along(a)), tmp)
## [1] "this is A story about A test"
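The swap case can also be handled in a single pass with base R only, by replacing each word according to a lookup instead of applying the substitutions one after another. This is a sketch, not one of the solutions above; the function name replace_words is made up for illustration.

```r
# Hypothetical helper: replace whole words in one pass using a lookup table
replace_words <- function(string, from, to) {
  stopifnot(length(from) == length(to))
  m <- gregexpr("\\w+", string)
  regmatches(string, m) <- lapply(regmatches(string, m), function(w) {
    idx <- match(w, from)    # position of each word in the lookup, NA if absent
    hit <- !is.na(idx)
    w[hit] <- to[idx[hit]]   # words not in the lookup are left unchanged
    w
  })
  string
}

string <- "this is a story about a test"
out1 <- replace_words(string, c("this", "is", "test"), c("that", "was", "boy"))
out2 <- replace_words(string, c("a", "A"), c("A", "a"))
out1  # "that was a story about a boy"
out2  # "this is A story about A test"
```

Because every word is looked up exactly once in the original text, a replacement can never be replaced again, which is what makes the swap work.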
> library(stringi)
> stri_replace_all_regex(string, "\\b" %s+% a %s+% "\\b", b, vectorize_all=FALSE)
#[1] "that was a story about a boy"
Chipping in as well with a little function that relies only on R base:
repWords <- function(string,toRep,Rep,sep='\\s'){
wrds <- unlist(strsplit(string,sep))
ix <- match(toRep,wrds)
wrds[ix] <- Rep
return(paste0(wrds,collapse = ' '))
}
a <- c("this", "is", "test")
b <- c("that", "was", "boy")
string <- "this is a story about a test"
> repWords(string,a,b)
[1] "that was a story about a boy"
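Note:
This assumes toRep and Rep have the same length and that every word in toRep occurs in the string. Because match() returns only the first position of each word, a word that appears several times is replaced only once. You can define the separator with sep.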
Talking of external packages, here's another one:
a <- c("this", "is", "test")
b <- c("that", "was", "boy")
x <- "this is a story about a test"
library(qdap)
mgsub(a,b,x)
which gives:
"that was a story about a boy"
I want to relocate words and copy and paste them in a certain pattern.
a = 'blahblah (Peter|Sally|Tom)'
b = 'word (apple|grape|tomato) vocabulary (rice|mice|lice)'
c = 'people person (you|me|us) do not know how (it|them) works'
I can relocate a string placed before '(' by using gsub
gsub('\\s*(\\S+)\\s*\\(', '(\\1 ', a)
With the function, I can make string sets below.
a
[1]'(blahblah Peter|Sally|Tom)'
b
[1]'(word apple|grape|tomato) (vocabulary rice|mice|lice)'
c
[1]'people (person you|me|us) do not know (how it|them) works'
However, I have no idea how to copy '\\1' and paste it after '|' like this
a
[1]'(blahblah Peter|blahblah Sally|blahblah Tom)'
b
[1]'(word apple|word grape|word tomato) (vocabulary rice|vocabulary mice|vocabulary lice)'
c
[1]'people (person you|person me|person us) do not know (how it|how them) works'
Is there any way to make this possible?
We can use strsplit
sapply(strsplit(a, "[| ]|\\(|\\)"), function(x) {
x1 <- x[nzchar(x)]
paste0("(", paste(x1[1], x1[-1], collapse="|"), ")")})
#[1] "(blahblah Peter|blahblah Sally|blahblah Tom)"
For multiple cases
paste(sapply(strsplit(b, "(?<=\\))\\s+", perl = TRUE)[[1]],
function(x) sapply(strsplit(x, "[| ]|\\(|\\)"), function(y) {
x1 <- y[nzchar(y)]
paste0("(", paste(x1[1], x1[-1], collapse="|"), ")") })), collapse=' ')
#[1] "(word apple|word grape|word tomato) (vocabulary rice|vocabulary mice|vocabulary lice)"
Another option is str_extract_all
library(stringr)
m1 <- matrix(str_extract_all(b, "\\w+")[[1]], ncol=2)
do.call(sprintf, c(do.call(paste, c(as.data.frame(matrix(paste(m1[1,][col(m1[-1,])],
m1[-1,]), nrow=2, byrow=TRUE)), sep="|")), list(fmt = "(%s) (%s)")))
#[1] "(word apple|word grape|word tomato) (vocabulary rice|vocabulary mice|vocabulary lice)"
Update
Based on the new pattern shown in the OP's post, we create a more general approach
funPaste <- function(str1){
v1 <- strsplit(str1, "\\s+")[[1]]
i1 <- grep("\\(", v1)
v1[i1] <- mapply(function(x,y) paste0("(", paste(x, y, collapse="|"), ")"),
v1[i1-1], str_extract_all(v1[i1], "\\w+"))
paste(v1[-(i1-1)], collapse=" ")
}
funPaste(a)
#[1] "(blahblah Peter|blahblah Sally|blahblah Tom)"
funPaste(b)
#[1] "(word apple|word grape|word tomato) (vocabulary rice|vocabulary mice|vocabulary lice)"
funPaste(c)
#[1] "people (person you|person me|person us) do not know (how it|how them) works"
Update2
We can also make use of gsubfn
library(gsubfn)
funPaste2 <- function(str1){
gsubfn("(\\w+)\\s+[(]([^)]+)[)]", function(x,y)
paste0("(", paste(x, unlist(strsplit(y, "[|]")), collapse="|"), ")"), str1)
}
funPaste2(c(a, b, c))
#[1] "(blahblah Peter|blahblah Sally|blahblah Tom)"
#[2] "(word apple|word grape|word tomato) (vocabulary rice|vocabulary mice|vocabulary lice)"
#[3] "people (person you|person me|person us) do not know (how it|how them) works"
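The same transformation can be sketched without gsubfn by using gregexpr with the replacement form of regmatches. The function name distribute_prefix is made up for illustration, and the sketch assumes the groups are never nested.

```r
# Hypothetical helper: turn "word (x|y)" into "(word x|word y)" using base R only
distribute_prefix <- function(s) {
  pat <- "(\\w+)\\s+\\(([^)]+)\\)"
  m <- gregexpr(pat, s)
  regmatches(s, m) <- lapply(regmatches(s, m), function(hits) {
    vapply(hits, function(h) {
      prefix <- sub(pat, "\\1", h)                                # word before "("
      alts <- strsplit(sub(pat, "\\2", h), "|", fixed = TRUE)[[1]] # options inside
      paste0("(", paste(prefix, alts, collapse = "|"), ")")
    }, character(1), USE.NAMES = FALSE)
  })
  s
}

inputs <- c("blahblah (Peter|Sally|Tom)",
            "word (apple|grape|tomato) vocabulary (rice|mice|lice)",
            "people person (you|me|us) do not know how (it|them) works")
res <- distribute_prefix(inputs)
res
```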
Another method (with as little regex as possible, since I don't know much :)
c=unlist(strsplit(b, " "))[c(T,F)] # extract all the single words
# c
# [1] "blahblah"
# [1] "word" "vocabulary"
d=unlist(strsplit(b, " "))[c(F,T)] # extract the grouped words
# d
# [1] "(Peter|Sally|Tom)"
# [1] "(apple|grape|tomato)" "(rice|mice|lice)"
# now iterate through each 'd', split it on `|` and then clear it on `()` this output is then pasted with contents of 'c'
sapply(seq_along(d), function(x) paste("(", paste(c[x],gsub("(\\(|\\))", "",unlist(strsplit(d[x], "\\|"))),
collapse = "|"),")"))
# [1] "( blahblah Peter|blahblah Sally|blahblah Tom )"
# [1] "( word apple|word grape|word tomato )" "( vocabulary rice|vocabulary mice|vocabulary lice )"
This string is a ticker for a bond: OAT 3 25/32 7/17/17. I want to extract the coupon rate which is 3 25/32 and is read as 3 + 25/32 or 3.78125. Now I've been trying to delete the date and the name OAT with gsub, however I've encountered some problems.
This is the code to delete the date:
tkr.bond <- 'OAT 3 25/32 7/17/17'
tkr.ptrn <- '[0-9][[:punct:]][0-9][[:punct:]][0-9]'
gsub(tkr.ptrn, "", tkr.bond)
However it gets me the same string. When I use [0-9][[:punct:]][0-9] in the pattern I manage to delete part of the date, however it also deletes the fraction part of the coupon rate for the bond.
The tricky thing is to find a solution that doesn't involve the pattern of the coupon because the tickers have this form: Name Coupon Date, so, using a specific pattern for the coupon may limit the scope of the solution. For example, if the ticker is this way OAT 0 7/17/17, the coupon is zero.
Just replace first and last word with an empty string.
> tkr.bond <- 'OAT 3 25/32 7/17/17'
> gsub("^\\S+\\s*|\\s*\\S+$", "", tkr.bond)
[1] "3 25/32"
OR
Use the gsubfn function (from the gsubfn package) in order to use a function in the replacement part.
> gsubfn("^\\S+\\s+(\\d+)\\s+(\\d+)/(\\d+).*", ~ as.numeric(x) + as.numeric(y)/as.numeric(z), tkr.bond)
[1] "3.78125"
Update:
> tkr.bond1 <- c(tkr.bond, 'OAT 0 7/17/17')
> m <- gsub("^\\S+\\s*|\\s*\\S+$", "", tkr.bond1)
> gsubfn(".+", ~ eval(parse(text=x)), gsub("\\s+", "+", m))
[1] "3.78125" "0"
Try
eval(parse(text=sub('[A-Z]+ ([0-9]+ )([0-9/]+) .*', '\\1 + \\2', tkr.bond)))
#[1] 3.78125
Or you may need
sub('^[A-Z]+ ([^A-Z]+) [^ ]+$', '\\1', tkr.bond)
#[1] "3 25/32"
Update
tkr.bond1 <- c(tkr.bond, 'OAT 0 7/17/17')
v1 <- sub('^[A-Z]+ ([^A-Z]+) [^ ]+$', '\\1', tkr.bond1)
unname(sapply(sub(' ', '+', v1), function(x) eval(parse(text=x))))
#[1] 3.78125 0.00000
Or
vapply(strsplit(tkr.bond1, ' '), function(x)
eval(parse(text= paste(x[-c(1, length(x))], collapse="+"))), 0)
#[1] 3.78125 0.00000
Or without eval(parse())
vapply(strsplit(gsub('^[^ ]+ | [^ ]+$', '', tkr.bond1), '[ /]'), function(x) {
x1 <- as.numeric(x)
sum(x1[1], x1[2]/x1[3], na.rm=TRUE)}, 0)
#[1] 3.78125 0.00000
Similar to akrun's answer, using sub with a backreference. How it works: wrap the part you want to keep in parentheses and match the rest of the string outside them, so that the pattern covers the whole string. Setting replacement = "\\1" then replaces the entire match with just the text captured by the parentheses.
sub(pattern = ".*\\s(\\d+\\s\\d+\\/\\d+)\\s.*", replacement = "\\1", x = tkr.bond, perl = TRUE)
# [1] "3 25/32"
Then you can convert it to a number:
temp <- sub(pattern = ".*\\s(\\d+\\s\\d+\\/\\d+)\\s.*", replacement = "\\1", x = tkr.bond, perl = TRUE)
eval(parse(text=sub(" ","+",x = temp)))
# [1] 3.78125
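To make the backreference idea concrete, here is a small sketch with three capture groups, where the replacement string decides which field is kept. The pattern assumes the Name Coupon Date layout described in the question, so it does not depend on the shape of the coupon itself. The variable names are made up for illustration.

```r
tkr.bond1 <- c("OAT 3 25/32 7/17/17", "OAT 0 7/17/17")

# Group 1: first field, group 2: everything in between, group 3: last field
pat <- "^(\\S+)\\s+(.*)\\s+(\\S+)$"

name_field   <- sub(pat, "\\1", tkr.bond1)
coupon_field <- sub(pat, "\\2", tkr.bond1)
last_field   <- sub(pat, "\\3", tkr.bond1)

coupon_field  # "3 25/32" "0"
```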
You can also use strsplit here. Then evaluate components excluding the first and the last. Like this
> tickers <- c('OAT 3 25/32 7/17/17', 'OAT 0 7/17/17')
>
> unlist(lapply(lapply(strsplit(tickers, " "),
+ function(x) {x[-length(x)][-1]}),
+ function(y) {sum(
+ sapply(y, function (z) {eval(parse(text = z))}) )} ) )
[1] 3.78125 0.00000
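Since eval(parse()) runs whatever text it is given, it may be worth avoiding when the tickers come from an external source. A sketch of the same strsplit idea that parses the fraction by hand instead; the function name coupon_rate is made up for illustration, and the Name Coupon Date layout is assumed.

```r
# Hypothetical helper: numeric coupon from "Name Coupon Date" tickers, no eval()
coupon_rate <- function(tickers) {
  vapply(strsplit(tickers, "\\s+"), function(tok) {
    parts <- tok[-c(1, length(tok))]  # drop the name and the date
    vals <- vapply(strsplit(parts, "/", fixed = TRUE), function(p) {
      p <- as.numeric(p)
      # "25/32" becomes 25 / 32, a plain number is kept as it is
      if (length(p) == 2L) p[1] / p[2] else p[1]
    }, numeric(1))
    sum(vals)
  }, numeric(1))
}

rates <- coupon_rate(c("OAT 3 25/32 7/17/17", "OAT 0 7/17/17"))
rates
```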