I'm a beginner in R and I'm confused by formatting part of this code:
print(sprintf("%03d / %03d", j, n))
I familiarized myself with ?sprintf and the R documentation. I understand each symbol, but I don't understand how to read it all together, especially with the '/'. What does this formatting mean?
The "/" is just a literal forward slash that should appear in the output string.
To break the code down, it means "Write the integer j with leading zeros if necessary, so that it is at least 3 characters long (%03d), then write the literal string " / ", then write the integer n with leading zeros if necessary so that it is at least 3 characters long (%03d)"
For example:
sprintf("%03d / %03d", 4, 2)
#> [1] "004 / 002"
In other words, the forward slash could be any text:
sprintf("%03d banana %03d", 4, 2)
#> [1] "004 banana 002"
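To see the format string at work, here is a small sketch. The values of j and n are made up for illustration. It also contrasts %03d with %3d, which pads with spaces instead of zeros, and shows that the width is only a minimum.

```r
# Hypothetical values for illustration: 12 items, reporting on items 1, 7 and 12
n <- 12L
j <- c(1L, 7L, 12L)

# sprintf() is vectorised, so one call formats all three counters
progress <- sprintf("%03d / %03d", j, n)
print(progress)
#> [1] "001 / 012" "007 / 012" "012 / 012"

# Without the 0 flag the field is padded with spaces instead of zeros
padded <- sprintf("%3d", 4L)
print(padded)
#> [1] "  4"

# The width is a minimum: longer numbers are never truncated
wide <- sprintf("%03d", 1234L)
print(wide)
#> [1] "1234"
```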
I want to extract all substrings that begin with M and are terminated by a *
The string below as an example;
vec<-c("SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ")
Would ideally return;
MGMTPRLGLESLLE
MTPRLGLESLLE
I have tried the code below;
regmatches(vec, gregexpr('(?<=M).*?(?=\\*)', vec, perl=T))[[1]]
but this drops the first M and only returns the first string rather than all substrings within.
"GMTPRLGLESLLE"
You can use
(?=(M[^*]*)\*)
See the regex demo. Details:
(?= - start of a positive lookahead that matches a location that is immediately followed with:
(M[^*]*) - Group 1: M, zero or more chars other than a * char
\* - a * char
) - end of the lookahead.
See the R demo:
library(stringr)
vec <- c("SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ")
matches <- stringr::str_match_all(vec, "(?=(M[^*]*)\\*)")
unlist(lapply(matches, function(z) z[,2]))
## => [1] "MGMTPRLGLESLLE" "MTPRLGLESLLE"
If you prefer a base R solution:
vec <- c("SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ")
matches <- regmatches(vec, gregexec("(?=(M[^*]*)\\*)", vec, perl=TRUE))
unlist(lapply(matches, tail, -1))
## => [1] "MGMTPRLGLESLLE" "MTPRLGLESLLE"
This could be done instead with a for loop over a character vector converted from your string.
When you encounter an M, you start concatenating characters to a new string until you encounter a *. When you do encounter a *, you push the new string to a vector of strings and start over from the first step until you reach the end of the loop.
It's not quite as interesting as using a regex, but it's fail-safe.
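Taken literally, a loop that builds one string at a time would not return the nested MTPRLGLESLLE. The sketch below is one possible reading of the idea, under the assumption that every M opens its own substring and every * closes all open ones. The function name extract_m_to_star is hypothetical.

```r
# Hypothetical helper: collect every substring that starts at an M
# and ends just before the next *
extract_m_to_star <- function(s) {
  chars <- strsplit(s, "")[[1]]
  open <- character(0)  # substrings started at an M and not yet closed
  out <- character(0)   # finished substrings
  for (ch in chars) {
    if (ch == "*") {
      out <- c(out, open)   # every open substring ends at this *
      open <- character(0)
    } else {
      if (length(open) > 0) open <- paste0(open, ch)  # extend open substrings
      if (ch == "M") open <- c(open, "M")             # an M opens a new one
    }
  }
  out  # substrings still open at the end have no closing * and are dropped
}

vec <- "SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ"
found <- extract_m_to_star(vec)
print(found)
#> [1] "MGMTPRLGLESLLE" "MTPRLGLESLLE"
```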
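A single ordinary match cannot return both substrings here, because each match consumes the characters it covers, so a substring nested inside an earlier match is never reported separately. Overlapping results require a capturing group inside a lookahead.
stringr::str_extract_all("abaca", "a[^a]*a") only gives you aba but not the surrounding abaca.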
The first M was dropped, because (?<=M) is a positive look behind which is by definition not part of the match, but just behind it.
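A short sketch of that difference, using the string from the question: moving the M out of the lookbehind and into the pattern keeps it in the result. The nested substring is still missing either way, since both matches consume their characters.

```r
vec <- "SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ"

# Lookbehind: the M must precede the match but is not part of it
without_m <- regmatches(vec, gregexpr("(?<=M).*?(?=\\*)", vec, perl = TRUE))[[1]]
print(without_m)
#> [1] "GMTPRLGLESLLE"

# Matching the M itself keeps it in the result
with_m <- regmatches(vec, gregexpr("M.*?(?=\\*)", vec, perl = TRUE))[[1]]
print(with_m)
#> [1] "MGMTPRLGLESLLE"
```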
I have the following phone number formats, where 33 is the area code:
+331234567
+3301234567
00331234567
003301234567
0331234567
033-123-456-7
0033.1234567
where I'm expecting only 331234567
What I have tried to clean those numbers using R:
stringr::str_replace_all(c("+331234567", "033-123-456-7", "0033.1234567"), pattern = "[^0-9.]", replacement = "") # removing non-numeric characters (the dot is kept)
stringr::str_replace_all("0331234567", pattern = "^0", replacement = "") # removing the leading 0
stringr::str_replace_all("00331234567", pattern = "^00", replacement = "") # removing the leading 00
my question is how to remove the zeros in between: 3301234567 or 003301234567 or +3301234567 or 03301234567
Appreciate any help
You can use
gsub("^(?:00?|\\+)330?|\\W", "", x, perl=TRUE)
See the regex demo. See the R demo online.
If there can be more 0s after 33 before the number you need to extract, replace 0? with 0*.
Details
^ - start of string
(?:00?|\+) - 00, 0 or +
330? - 33 or 330
| - or
\W - any non-word char.
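Run against the numbers from the question, the pattern as written also removes the 33 itself. If the area code should stay, as the question expects, one possible variant (an assumption, not part of the answer) captures 33 in a group and puts it back with \1. For the \W alternative the group is empty, so those characters are simply dropped.

```r
x <- c("+331234567", "+3301234567", "00331234567", "003301234567",
       "0331234567", "033-123-456-7", "0033.1234567")

# Pattern from the answer: prefix, 33 and an optional 0 are all removed
stripped <- gsub("^(?:00?|\\+)330?|\\W", "", x, perl = TRUE)
print(unique(stripped))
#> [1] "1234567"

# Assumed variant: capture the 33 and restore it in the replacement
cleaned <- gsub("^(?:00?|\\+)(33)0?|\\W", "\\1", x, perl = TRUE)
print(unique(cleaned))
#> [1] "331234567"
```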
You can use ^\+?0*3*0*|[^\s\d]
Pattern explanation:
^ - match beginning of the string
\+? - match + literally, zero or one time.
0* - match zero or more 0
3* - match zero or more 3
| - alternation
[^\s\d] - negated character class - match any character other than whitespace and digits (you could remove \s if you handle one number at a time; it just prevents matching the newline in the demo)
Regex demo
It will match unwanted parts separately. First part will clean beginning of a number if it starts with + or 0, second part will clean non-digits inside the number.
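A quick check of this pattern in R, as a sketch. Two things are worth knowing: it strips the 33 along with the prefix, and because 3* is greedy it also removes any 3s that the subscriber number itself starts with. The third input below is a made-up number chosen to show that.

```r
x <- c("+3301234567", "033-123-456-7", "+33312345")

# perl = TRUE so that \s and \d are available inside the character class
out <- gsub("^\\+?0*3*0*|[^\\s\\d]", "", x, perl = TRUE)
print(out)
#> [1] "1234567" "1234567" "12345"
```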
I have phonemic transcriptions of English words such as these:
test <- c("ˈsɜːtnli", "ˈtwɛnti", "ˈfɒksi", "kɑːnt", "ʧeɪnʤd", "vɪkˈtɔːrɪə", "wɒznt", "ðeər", "dɪdnt",
"ˈdɪzni", "ˈəʊnli", "ˈfæbrɪks", "sɪˈkjʊərɪti", "ˈnjuːzˌpeɪpər", "ɑhɑː")
I'd like to match mono-syllabic words, i.e., words that contain a single vowel. My set of phonemic vowels is this:
vowel <- "iː|aɪ|ɔː|ɔɪ|əʊ|ɛə|eɪ|aʊ|eə|uː|ɑː|ɪə|ɜː|ʊə|ə|ɪ|ɒ|ʊ|ʌ|æ|e|ɑ|ɛ|i"
Using str_count and the vector vowel as pattern, I'm able to match a fairly good set of words:
library(stringr)
test[str_count(test, vowel) == 1]
[1] "kɑːnt" "ʧeɪnʤd" "wɒznt" "ðeər" "dɪdnt"
However, wɒznt and dɪdnt can be seen as bi-syllabic (as the n sound can replace a vowel, so that nt counts as a second vowel). So the question is, how can I match mono-syllabic words except those that end in nt?
What I've tried so far is this set operation, which works well but looks clumsy:
setdiff(test[str_count(test, vowel) == 1], test[str_count(test, paste0("[^", vowel, "]nt$")) == 1])
[1] "kɑːnt" "ʧeɪnʤd" "ðeər"
I'd much rather have a single more concise regex. Any ideas?
You can use
test <- c("ˈsɜːtnli", "ˈtwɛnti", "ˈfɒksi", "kɑːnt", "ʧeɪnʤd", "vɪkˈtɔːrɪə", "wɒznt", "ðeər", "dɪdnt",
"ˈdɪzni", "ˈəʊnli", "ˈfæbrɪks", "sɪˈkjʊərɪti", "ˈnjuːzˌpeɪpər", "ɑhɑː")
vowel <- "iː|aɪ|ɔː|ɔɪ|əʊ|ɛə|eɪ|aʊ|eə|uː|ɑː|ɪə|ɜː|ʊə|ə|ɪ|ɒ|ʊ|ʌ|æ|e|ɑ|ɛ|i"
library(stringr)
p <- paste0("^(?!.*(?<!",vowel,")nt$)(?:(?!",vowel,").)*(?:",vowel,")(?:(?!",vowel,").)*$")
test[str_detect(test, p)]
## => [1] "kɑːnt" "ʧeɪnʤd" "ðeər"
See the online R demo. See the regex demo. The pattern means
^ - start of string
(?!.*(?<!",vowel,")nt$) - immediately to the right, there must not be any 0+ chars other than line break chars as many as possible followed with nt (not preceded with any of the specified vowel sound sequences) and end of string
(?:(?!",vowel,").)* - any char but a line break char, zero or more times as many as possible, that does not start a vowel char sequence
(?:",vowel,") - any of the specified vowel sound sequences
(?:(?!",vowel,").)* - any char but a line break char, zero or more times as many as possible, that does not start a vowel char sequence
$ - end of string.
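To make the structure of the pattern easier to follow, here is the same construction on a toy example. The ASCII vowel set and the word list are assumptions made purely for illustration, and base R's grepl with perl = TRUE stands in for str_detect.

```r
# Toy vowel set (hypothetical); longer alternatives come first, as in the original
vowel <- "ee|a|e|i|o|u"
words <- c("cant", "didnt", "tree", "banana")

# Same building blocks as above:
#  - no final "nt" that lacks a preceding vowel
#  - non-vowel chars, exactly one vowel sequence, non-vowel chars
p <- paste0("^(?!.*(?<!", vowel, ")nt$)",
            "(?:(?!", vowel, ").)*(?:", vowel, ")(?:(?!", vowel, ").)*$")

mono <- words[grepl(p, words, perl = TRUE)]
print(mono)
#> [1] "cant" "tree"
```

Here "didnt" is rejected by the first lookahead and "banana" by the single-vowel requirement.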
This is a somewhat concise solution (thanks to @G5W for the decisive hint):
vowel_cc <- paste0(unique(unlist(strsplit(gsub("\\|", "", vowel), ""))), collapse = "")
vowel_cc
[1] "iːaɪɔəʊɛeuɑɜɒʌæ"
test[str_count(test, paste0(vowel, "|[^", vowel_cc, "]+nt$")) == 1]
[1] "kɑːnt" "ʧeɪnʤd" "ðeər"
This solution uses a vector vowel_cc consisting of all unique characters in vowels. These serve as input for a negated character class. The pattern specifies nt as one of the vowel alternatives on the condition that it be preceded by one or more non-vowel_ccs and occur at string end.
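The counting idea can be sketched in base R on a toy example. The ASCII vowel set and the word list are assumptions for illustration, and lengths(regmatches(...)) is used here in place of str_count.

```r
# Toy vowel set and words (hypothetical)
vowel <- "ee|a|e|i|o|u"
words <- c("cant", "didnt", "tree", "banana")

# All unique characters that occur in any vowel alternative
vowel_cc <- paste0(unique(unlist(strsplit(gsub("\\|", "", vowel), ""))), collapse = "")
print(vowel_cc)
#> [1] "eaiou"

# A final "nt" preceded by non-vowel characters counts as one more "vowel"
pat <- paste0(vowel, "|[^", vowel_cc, "]+nt$")
counts <- lengths(regmatches(words, gregexpr(pat, words, perl = TRUE)))
print(counts)
#> [1] 1 2 1 3

mono <- words[counts == 1]
print(mono)
#> [1] "cant" "tree"
```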
I am trying to remove hashtags from the end of strings in R.
For example:
x<- "I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"
I want to remove the hashtags at the end of string which are #lateNightThoughts and #movie. Result:
- "I didn't know it could be #boring. guess I need some fun"
I tried :
stringi::stri_replace_last_regex(x,'#\\S+',"")
but it removes only the very last hashtag.
- "I didn't know it could be #boring. guess I need some fun #movie "
Any idea how to get the expected result?
Edit:
How about removing hashtags from the beginning of the text?
eg:
x<- "#Thomas20 I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"
You may use
> x<- "I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"
> sub("\\s*\\B#\\w+(?:\\s*#\\w+)*\\s*$", "", x)
[1] "I didn't know it could be #boring. guess I need some fun"
Or, if you do not care about the context of the first # you want to start matching from, you may even use
sub("(?:\\s*#\\w+)+\\s*$", "", x)
See the regex demo.
Details
\s* - zero or more whitespaces
\B - right before the current location, there can be start of string or a non-word char (this is usually used to ensure you do not match # inside a "word", so if you do not need it, you may remove this non-word boundary)
# - a # char
\w+ - 1 or more word chars (letters, digits or _)
(?:\s*#\w+)* - zero or more occurrences of:
\s* - zero or more whitespaces
# - a # char
\w+ - 1+ word chars
\s* - zero or more whitespaces
$ - end of string.
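For the follow-up in the question's edit (hashtags at the beginning of the text), one possible approach is to mirror the pattern and anchor it at the start with ^. This is a sketch under that assumption, not part of the answer above; it is chained with the simpler end-of-string pattern to trim both edges.

```r
x <- "#Thomas20 I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"

# Assumed mirror pattern: one or more hashtags at the very start,
# together with the whitespace around them
no_leading <- sub("^\\s*(?:#\\w+\\s*)+", "", x, perl = TRUE)
print(no_leading)
#> [1] "I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"

# Then remove the trailing run of hashtags with the pattern from the answer
no_edges <- sub("(?:\\s*#\\w+)+\\s*$", "", no_leading, perl = TRUE)
print(no_edges)
#> [1] "I didn't know it could be #boring. guess I need some fun"
```

The hashtag in the middle of the sentence (#boring) is left alone by both patterns.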
I have a vector of strings that look like this:
a - bc/def_g - A/mn/us/ww
opq - rs/ts_uf - BC/wx/yza
Abc - so/dhie7u - XYZ/En/xy/jkq - QWNE
I'd like to get the text after the 2nd dash (-) but before the first slash (/) that follows it, i.e. the result should look like
A
BC
XYZ
What is the best way to do it? (The vector has more than 500K rows.)
Thanks
Suppose your string is defined like this:
string <- c("a - bc/def_g - A/mn/us/ww",
"opq - rs/ts_uf - BC/wx/yza",
"Abc - so/dhie7u - XYZ/En/xy/jkq - QWNE")
Then you can use sub
> sub(".*\\-\\s+([A-Z]+)/.*", "\\1", string)
[1] "A" "BC" "XYZ"
See regex in use here
^[^-]*-[^-]*-\s*\K[^/]+
^ Assert position at the start of the line
[^-]* Match any character except - any number of times
- Match this literally
[^-]* Match any character except - any number of times
- Match this literally
\s* Match any number of whitespace characters
\K Resets the starting point of the pattern. Any previously consumed characters are no longer included in the final match
[^/]+ Match any character except / one or more times
Alternatively, as suggested by Jan in the comments below (I believe it has since been deleted), ^(?:[^-]*-){2}\s*\K[^/]+ may be used. It's shorter and easily scalable, but adds more steps.
See code in use here
x <- c("a - bc/def_g - A/mn/us/ww", "opq - rs/ts_uf - BC/wx/yza", "Abc - so/dhie7u - XYZ/En/xy/jkq - QWNE")
m <- regexpr("^[^-]*-[^-]*-\\s*\\K[^/]+", x, perl=TRUE)
regmatches(x, m)
Result: [1] "A" "BC" "XYZ"
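The shorter {2} form mentioned above can be used the same way. One detail matters for a long vector: regmatches() silently drops elements that do not match, so the result can end up shorter than the input. The sketch below keeps the output aligned by filling an NA vector. The extra "no dashes here" element is made up to show this.

```r
x <- c("a - bc/def_g - A/mn/us/ww",
       "opq - rs/ts_uf - BC/wx/yza",
       "Abc - so/dhie7u - XYZ/En/xy/jkq - QWNE",
       "no dashes here")  # hypothetical non-matching row

# Skip two "anything up to a dash" groups, drop them with \K, keep text up to /
m <- regexpr("^(?:[^-]*-){2}\\s*\\K[^/]+", x, perl = TRUE)

# regexpr() returns -1 where nothing matched; keep those positions as NA
out <- rep(NA_character_, length(x))
out[m > 0] <- regmatches(x, m)
print(out)
#> [1] "A"   "BC"  "XYZ" NA
```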