I have to extract parts of a string in R based on a symbol and a word. I have a name such as
s <-"++can+you+please-help +me"
and the output would be:
"+can" "+you" "+please" "-help" "+me"
where each word is shown with the symbol that precedes it. I've tried the strsplit and sub functions, but I'm struggling to get the output I want. Can you please help me? Thanks!
Do
library(stringi)
result = unlist(stri_match_all(regex = "\\W\\w+",str = s))
Result
> result
[1] "+can" "+you" "+please" "-help" "+me"
No symbols
If you only want the words (no symbols), do:
result = unlist(stri_match_all(regex = "\\w+",str = s))
result
[1] "can" "you" "please" "help" "me"
Here is one option using base R
regmatches(s, gregexpr("[[:punct:]]\\w+", s))[[1]]
#[1] "+can" "+you" "+please" "-help" "+me"
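Both patterns above accept any non-word or punctuation character as the leading symbol, and \\W also matches whitespace. If only + and - should count as symbols, a narrower character class can be used. This is a minimal sketch in base R; the second input string is a made-up example, not from the question.

```r
s <- "++can+you+please-help +me"

# Match exactly one '+' or '-' followed by one or more word characters.
signed_words <- regmatches(s, gregexpr("[+-]\\w+", s))[[1]]
print(signed_words)
# [1] "+can"    "+you"    "+please" "-help"   "+me"

# Made-up input: words separated by spaces are skipped unless they carry a sign.
s2 <- "plain words +signed"
only_signed <- regmatches(s2, gregexpr("[+-]\\w+", s2))[[1]]
print(only_signed)
# [1] "+signed"
```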
I have strings that look like these:
> ABCD.1:f_HJK
> ABFD.1:f_HTK
> CJD:f_HRK
> QQYP.2:f_HDP
So basically, I always have a string in the first part, optionally followed by a '.' and a number, and after that there is always a ':' and another string.
I would like to remove the '.' and the number when they are present, using R.
I know that regular expressions could be useful, but I have no idea how to apply them in this context. I know that I can substitute the '.' with gsub, but I have no idea how to include the number and the ':'.
Thank you for your help.
Does this work:
v <- c('ABCD.1:f_HJK','ABFD.1:f_HTK','CJD:f_HRK','QQYP.2:f_HDP')
v
[1] "ABCD.1:f_HJK" "ABFD.1:f_HTK" "CJD:f_HRK" "QQYP.2:f_HDP"
gsub('([A-Z]{,4})(\\.\\d)?(:.*)','\\1\\3',v)
[1] "ABCD:f_HJK" "ABFD:f_HTK" "CJD:f_HRK" "QQYP:f_HDP"
You could also use any of the following depending on the structure of your string
If no other period and numbers in the string
sub("\\.\\d+", "", v)
[1] "ABCD:f_HJK" "ABFD:f_HTK" "CJD:f_HRK" "QQYP:f_HDP"
If you are only interested in the first pattern matched.
sub("^([A-Z]+)\\.\\d+:", "\\1:", v)
[1] "ABCD:f_HJK" "ABFD:f_HTK" "CJD:f_HRK" "QQYP:f_HDP"
Same as above, but with perl = TRUE and \K, i.e. without capture groups
sub("^[A-Z]+\\K\\.\\d+", "", v, perl = TRUE)
[1] "ABCD:f_HJK" "ABFD:f_HTK" "CJD:f_HRK" "QQYP:f_HDP"
If I understood your explanation correctly, this should do the trick:
gsub("(\\.\\d+)", "", string)
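One caveat with a bare gsub on "\\.\\d+" is that it removes every period-plus-digits run, not just the one before the colon. A hedged sketch of a stricter variant uses a lookahead so that only a run sitting directly before ':' is dropped. The third element of v below is an invented edge case, not from the question.

```r
v <- c("ABCD.1:f_HJK", "CJD:f_HRK", "AB.1:f_1.5")  # last element is a made-up edge case

# gsub() without an anchor strips every ".digits" run, including the one after ':'.
stripped_all <- gsub("\\.\\d+", "", v)
print(stripped_all)
# [1] "ABCD:f_HJK" "CJD:f_HRK"  "AB:f_1"

# The lookahead (?=:) only accepts a ".digits" run that sits right before ':'.
cleaned <- gsub("\\.\\d+(?=:)", "", v, perl = TRUE)
print(cleaned)
# [1] "ABCD:f_HJK" "CJD:f_HRK"  "AB:f_1.5"
```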
I'm trying to split tons of strings as below:
x = "�\001�\001�\001�\001�\001\002CN�\001\bShandong�\001\004Zibo�\002$ABCDEFGHIJK�\002\aIMG_HAS�\002�\002�\002�\002�\002�\002�\002�\002\02413165537405763268743�\002\001�\002�\002�\002�\003�\003�\003����\005�\003�\003�\003�\003"
into four pieces
'CN', 'Shandong', 'Zibo', 'ABCDEFGHIJK'
I've tried
stringr::str_split(x, '\\00.')
which returns the original x unchanged.
Also,
trimws(gsub("�\\00?", "", x, perl = T))
which only removes the unknown character �.
Could someone help me with this? Thanks for doing so.
You can try with str_extract_all :
stringr::str_extract_all(x, '[A-Za-z_]+')[[1]]
[1] "CN" "Shandong" "Zibo" "ABCDEFGHIJK" "IMG_HAS"
With base R :
regmatches(x, gregexpr('[A-Za-z_]+', x))[[1]]
Here we extract every run of upper-case letters, lower-case letters, or underscores. Everything else is ignored, so characters like �\\00? do not appear in the final output.
We can use strsplit from base R
setdiff(strsplit(x, "[^A-Za-z]+")[[1]], "")
#[1] "CN" "Shandong" "Zibo" "ABCDEFGHIJK" "IMG" "HAS"
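Note that the extraction approaches return "IMG_HAS" (or "IMG" and "HAS") as well, while the question asks for four pieces only. Assuming the wanted fields are always the first four letter runs, they can be picked with head(). The string x_demo below is a simplified stand-in for x that contains only ASCII control characters, not the original bytes.

```r
# Simplified stand-in for x (assumption: same field order, ASCII control bytes only).
x_demo <- "\001\002CN\001\bShandong\001\004Zibo\002$ABCDEFGHIJK\002\aIMG_HAS\002"

# Extract every run of letters or underscores.
words <- regmatches(x_demo, gregexpr("[A-Za-z_]+", x_demo))[[1]]

# Keep only the first four runs, which are the pieces the question asks for.
first_four <- head(words, 4)
print(first_four)
# [1] "CN"          "Shandong"    "Zibo"        "ABCDEFGHIJK"
```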
I'm analyzing twitter data and would like to extract all the hashtags in tweets. I used to extract hashtags like this:
tweet = 'I like #apple #orange'
str_extract_all(tweet,"#\\S+")
This works in most situations. But sometimes two hashtags are next to each other.
tweet = 'I like #apple#orange'
str_extract_all(tweet,"#\\S+")
what I got is this:
[[1]]
[1] "#apple#orange"
Does anyone know how I can properly extract hashtags when they are either separated or next to each other?
You are overmatching with \S because it matches any non-whitespace character, including #.
You could use a negated character class that matches neither a whitespace character nor a #:
#[^#\\s]+
Your code might look like
tweet = 'I like #apple#orange'
str_extract_all(tweet,"#[^#\\s]+")
Result
[[1]]
[1] "#apple" "#orange"
My guess is that this simple expression might work:
#([^#\s]+)
which excludes spaces and #s after the first #.
Another (arguably less concise) base R possibility:
gsub("([a-z](?=#))(#\\w)","\\1 \\2",
strsplit(tweet," (?=#+)",perl = TRUE)[[1]][2], perl=TRUE)
[1] "#apple #orange"
If you need them separated:
strsplit(gsub("([a-z](?=#))(#\\w)","\\1 \\2",
strsplit(tweet," (?=#+)",perl = TRUE)[[1]][2], perl=TRUE),
" ")
[[1]]
[1] "#apple" "#orange"
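The negated character class also works in base R without stringr. The sketch below uses the POSIX class [:space:] inside the brackets, on the assumption that \\s is not reliably interpreted inside a bracket expression by R's default regex engine. It handles both tweet forms from the question in one call.

```r
tweets <- c("I like #apple #orange", "I like #apple#orange")

# '#' followed by one or more characters that are neither '#' nor whitespace.
hashtags <- regmatches(tweets, gregexpr("#[^#[:space:]]+", tweets))
print(hashtags)
# [[1]]
# [1] "#apple"  "#orange"
#
# [[2]]
# [1] "#apple"  "#orange"
```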
In excel (and Excel VBA) it is really helpful to connect text and variable using "&":
a = 5
msgbox "The value is: " & a
will give
"The value is: 5"
How can I do this in R? I know there is a way to use "paste". However I wonder if there isn't any trick to do it as simple as in Excel VBA.
Thanks in advance.
This blog post suggests defining your own concatenation operator, which is similar to what VBA (and Javascript) has, but it retains the power of paste:
"%+%" <- function(...) paste0(...)
"Concatenate hits " %+% "and this."
# [1] "Concatenate hits and this."
I am not a big fan of this solution though because it kind of obscures what paste does under the hood. For instance, is it intuitive to you that this would happen?
"Concatenate this string " %+% "with this vector: " %+% 1:3
# [1] "Concatenate this string with this vector: 1"
# [2] "Concatenate this string with this vector: 2"
# [3] "Concatenate this string with this vector: 3"
In Javascript for instance, this would give you Concatenate this string with this vector: 1,2,3, which is quite different. I cannot speak for Excel, but you should think about whether this solution is not more confusing to you than it is useful.
If you need Javascript-like solution, you can also try this:
"%+%" <- function(...) {
  dots = list(...)
  dots = rapply(dots, paste, collapse = ",")
  paste(dots, collapse = "")
}
"Concatenate this string " %+% "with this string."
# [1] "Concatenate this string with this string."
"Concatenate this string " %+% "with this vector: " %+% 1:3
# [1] "Concatenate this string with this vector: 1,2,3"
But I haven't tested it extensively, so be on the lookout for unexpected results.
Another possibility is to use sprintf:
a <- 5
cat(sprintf("The value is %d\n",a))
## The value is 5
The %d denotes integer formatting (%f would give "The value is 5.000000"). The \n denotes a newline at the end of the string.
sprintf() can be more convenient than paste or paste0 when you want to put together a lot of pieces, e.g.
sprintf("The value of a is %f (95%% CI: {%f,%f})",
        a_est, a_lwr, a_upr)
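For the simple case in the question, paste0 is probably the closest counterpart to the VBA & operator. The sketch below shows it next to a sprintf call with made-up numbers (1.234, 1.0 and 1.5 are illustrative values only). A literal percent sign in a sprintf format string has to be written as %%.

```r
a <- 5

# paste0() concatenates its arguments with no separator, much like "&" in VBA.
msg <- paste0("The value is: ", a)
print(msg)
# [1] "The value is: 5"

# sprintf() with fixed decimals; "%%" produces a literal percent sign.
ci <- sprintf("%.2f (95%% CI: %.2f, %.2f)", 1.234, 1.0, 1.5)
print(ci)
# [1] "1.23 (95% CI: 1.00, 1.50)"
```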
I am trying to get the text between two words in a sentence.
For example the sentence is -
x <- "This is my first sentence"
Now I want the text between "This" and "first", which is "is my".
I have tried various R functions such as grep, grepl, pmatch and str_split. However, I could not get exactly what I want.
This is the closest I have got, using gsub.
gsub(".*This\\s*|first*", "", x)
The output it gives is
[1] "is my sentence"
In reality, what I need is only
[1] "is my"
Any help would be appreciated.
You need .* at the end to match zero or more characters after the 'first'
gsub('^.*This\\s*|\\s*first.*$', '', x)
#[1] "is my"
Another approach using rm_between from the qdapRegex package.
library(qdapRegex)
rm_between(x, 'This', 'first', extract=TRUE)[[1]]
# [1] "is my"
Since this question is used as a reference, I'll add some possible solutions to build a complete overview. Both are based on a look-ahead/look-behind regex pattern.
base R
regmatches( x, gregexpr("(?<=This ).*(?= first)", x, perl = TRUE ) )
stringr
stringr::str_extract_all( x, "(?<=This ).+(?= first)" )
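Two further base R variants, sketched here for comparison: a capture group with sub, and regexpr (single match) instead of gregexpr, which returns a plain character vector rather than a list. Note that sub returns the input unchanged when the pattern does not match, so the capture-group form assumes both words are present.

```r
x <- "This is my first sentence"

# Capture the text between the two words and replace the whole string with it.
between <- sub(".*This\\s+(.*?)\\s+first.*", "\\1", x, perl = TRUE)
print(between)
# [1] "is my"

# Lookbehind/lookahead with regexpr(): the first match as a character vector.
look <- regmatches(x, regexpr("(?<=This ).*(?= first)", x, perl = TRUE))
print(look)
# [1] "is my"
```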