I'm learning about the sub and gsub functions,
and after reading the definition, I still don't understand what these mean:
".*" , "\s"
Specifically, the question asks what the following code chunk returns, and I have no clue how it works:
awards <- c("Won 1 Oscar.",
"Won 1 Oscar. Another 9 wins & 24 nominations.",
"1 win and 2 nominations.",
"2 wins & 3 nominations.",
"Nominated for 2 Golden Globes. 1 more win & 2 nominations.",
"4 wins & 1 nomination.")
sub(".*\\s([0-9]+)\\snomination.*$", "\\1", awards)
".*" = . means any character and * is 0 or more of the previous.
"\s" = means any white space
So
sub(".* # match any character 0 or more times
\\s # followed by a space (whitespace)
([0-9]+) # then at least 1 digit; the () captures it as group 1
\\s # followed by another space
nomination # followed by the word "nomination"
.*$", # then 0 or more characters up to the end of the string
"\\1", awards) # \\1 means replace the whole match with the first captured group
Given your sample of strings, the first string does not have the word "nomination" in it, so the original string is returned. The other strings all match, so the number immediately preceding the word "nomination" is returned.
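To check the walk-through, the call can simply be run. This is a minimal sketch that reuses the awards vector from the question; the name result is only illustrative.

```r
awards <- c("Won 1 Oscar.",
            "Won 1 Oscar. Another 9 wins & 24 nominations.",
            "1 win and 2 nominations.",
            "2 wins & 3 nominations.",
            "Nominated for 2 Golden Globes. 1 more win & 2 nominations.",
            "4 wins & 1 nomination.")

# Elements that match are replaced by the captured digits;
# elements that do not match are returned unchanged.
result <- sub(".*\\s([0-9]+)\\snomination.*$", "\\1", awards)
print(result)
# [1] "Won 1 Oscar." "24" "2" "3" "2" "1"
```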
Hope this helps.
Related
I have a single long string variable with 3 obs. I was trying to create a field prob to extract a specific string from the long string. The code and message are below.
data aa: "The probability of being a carrier is 0.0002422359 " " an BRCA1 carrier 0.0001061067 "
" an BRCA2 carrier 0.00013612 "
aa$prob <- ifelse(grepl("The probability of being a carrier is", xx)==TRUE, word(aa, 8, 8), ifelse(grepl("BRCA", xx)==TRUE, word(aa, 5, 5), NA))
Warning message: In aa$prob <- ifelse(grepl("The probability of being a carrier is", : Coercing LHS to a list
Here is my previous answer, updated to reflect a data.frame.
library(dplyr)
aa <- data.frame(aa = c("...", "...", "The probability of being a carrier is 0.0002422359 ", " an BRCA1 carrier 0.0001061067 ", " an BRCA2 carrier 0.00013612 ", "..."))
aa %>%
mutate(prob = as.numeric(if_else(grepl("(probability|BRCA[12] carrier)", aa),
gsub("^.*?\\b([0-9]+\\.?[0-9]*)\\s*$", "\\1", aa), NA_character_)))
# aa prob
# 1 ... NA
# 2 ... NA
# 3 The probability of being a carrier is 0.0002422359 0.0002422359
# 4 an BRCA1 carrier 0.0001061067 0.0001061067
# 5 an BRCA2 carrier 0.00013612 0.0001361200
# 6 ... NA
Regex walk-through:
^ and $ are beginning and end of string, respectively; \\b is a word-boundary; none of these "consume" any characters, they just mark beginnings and endings
. matches any single character
? means "zero or one", aka optional; * means "zero or more"; + means "one or more"; all refer to the previous character/class/group
\\s is whitespace, including spaces and tabs
[0-9] is a class, meaning any character between 0 and 9; similarly, [a-z] is all lowercase letters, [a-zA-Z] are all letters, [0-9A-F] are hexadecimal digits, etc
(...) is a saved group; it's not uncommon in a group to use | as an "or"; this group is used later in the replacement= part of gsub as numbered groups, so \\1 recalls the first group from the pattern
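A small sketch of the quantifiers, classes and groups listed above, using toy strings that are not part of the original question (all variable names are illustrative):

```r
x <- c("ac", "abc", "abbc")

opt  <- grepl("^ab?c$", x)  # ? : zero or one "b"   -> TRUE  TRUE  FALSE
star <- grepl("^ab*c$", x)  # * : zero or more "b"  -> TRUE  TRUE  TRUE
plus <- grepl("^ab+c$", x)  # + : one or more "b"   -> FALSE TRUE  TRUE

# A character class: only hexadecimal digits from start to end
hex <- grepl("^[0-9A-F]+$", c("1F", "G1"))  # TRUE FALSE

# A saved group with | as "or", recalled as \\1 in the replacement
pet <- sub("^(cat|dog) food$", "\\1", "dog food")  # "dog"
```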
So grouped and summarized:
"^.*?\\b([0-9]+\\.?[0-9]*)\\s*$"
1       ^^^^^^^^^^^^^^^^^^
2    ^^^
3 ^^^
4                         ^^^^
1. This is the "number" part, which allows for one or more digits, an optional decimal point, and zero or more digits. This is saved in group "1".
2. The word boundary guarantees that we include leading digits (it's possible, depending on a few things, for "12.345" to be parsed as "2.345" without this).
3. Anything before the number-like string.
4. Some or no blank space after the number.
Grouped logically, in an organized way
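The same extraction can be sketched in base R without dplyr. This is not the answer's original code: the perl = TRUE argument is an assumption added here to pin the regex engine to PCRE, and has_prob and pattern are illustrative names.

```r
aa <- data.frame(
  aa = c("...",
         "The probability of being a carrier is 0.0002422359 ",
         " an BRCA1 carrier 0.0001061067 ",
         " an BRCA2 carrier 0.00013612 "),
  stringsAsFactors = FALSE
)

pattern <- "^.*?\\b([0-9]+\\.?[0-9]*)\\s*$"

# Only rows that mention a probability or a BRCA carrier get a value
has_prob <- grepl("(probability|BRCA[12] carrier)", aa$aa)

aa$prob <- NA_real_
# perl = TRUE is an assumption (not in the original answer)
aa$prob[has_prob] <- as.numeric(sub(pattern, "\\1", aa$aa[has_prob], perl = TRUE))
print(aa$prob)
```

Rows that fail the grepl test keep NA, which mirrors the if_else branch in the dplyr version.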
Regex isn't unique to R; it's a parsing language that R (and most other programming languages) supports in one way or another.
I have a txt file with this data in it:
1 message («random_choice»)[5];
2 reply («принято»)[2][3];
3 regulate («random_choice»)[5];
4 Early reg («for instance»)[2][3][4];
4xx: Success (загрузка):
6 OK («fine»)[2][3];
I want to turn it into a data frame with three columns: ID, Message, Comment.
I also want to remove the unnecessary numbers in square brackets at the end.
Also, some values in the ID column contain strings (usually xx). In these cases, the ID must be left empty.
So the desired result should look like this:
ID Message Comment
1 message random_choice
2 reply принято
3 regulate random_choice
4 Early reg for instance
Success загрузка
6 OK fine
How could I do that? Even when I try to read this txt file, I get a strange error:
df <- read.table("data_received.txt", header = TRUE)
The error I get:
Error in read.table("data_received.txt", header = TRUE) :
more columns than column names
You can use strcapture for this.
Fake data, you'll likely do txt <- readLines("data_received.txt"). (Since my locale on windows is not being friendly to those strings, I'll replace with straight ascii, assuming it'll work just fine on your system.)
txt <- readLines(textConnection("1 message («random_choice»)[5];
2 reply («asdf»)[2][3];
3 regulate («random_choice»)[5];
4 Early reg («for instance»)[2][3][4];
4xx: Success (something):
6 OK («fine»)[2][3];"))
The breakout:
out <- strcapture("^(\\S+)\\s+([^(]+)\\s+\\((.*)\\).*$", txt,
proto = data.frame(ID=0L, Message="", Comment=""))
# Warning in fun(mat[, i]) : NAs introduced by coercion
out
# ID Message Comment
# 1 1 message «random_choice»
# 2 2 reply «asdf»
# 3 3 regulate «random_choice»
# 4 4 Early reg «for instance»
# 5 NA Success something
# 6 6 OK «fine»
The proto= argument indicates what type of columns are generated. Since I set the ID=0L, it assumes it'll be integer, so anything that does not convert to integer becomes NA (which satisfies your fifth row omission).
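A smaller sketch of how proto= drives the column types, using made-up input (x, out and the column names ID and Item are illustrative, not from the question). The warning about coercion is suppressed here only to keep the output quiet.

```r
x <- c("1 apple", "2 pear", "x9 plum")

# ID is declared as integer in proto, so "x9" cannot be converted and becomes NA.
out <- suppressWarnings(
  strcapture("^(\\S+)\\s+(.*)$", x,
             proto = data.frame(ID = integer(), Item = character(),
                                stringsAsFactors = FALSE))
)
print(out)
```

Declaring ID as character() instead would keep "x9" as text, so the choice of prototype decides whether bad IDs are dropped to NA.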
Explanation on the regex:
in general:
* means zero-or-more of the previous character (or character class)
+ means one-or-more
? (not used, but useful nonetheless) means zero or one
^ and $ mean the beginning and end of the string, respectively (a ^ within [..] is different)
(...) is a capture group: anything within the non-escaped parens is stored, anything not is discarded
[...] is a character group, any of the characters is a match; if this is instead [^..], then it is inverted: anything except what is listed
[[:...:]] is a POSIX character class used inside a bracket expression, e.g. [[:space:]] or [[:digit:]]
^(\\S+), start with (^) one or more (+) non-space characters (\\S);
\\s+ one or more space characters (\\s) (discarded);
([^(]+) one or more characters that are not a left-paren;
\\s+ one or more space characters again (discarded);
\\((.*)\\) a literal left-paren (\\() and then zero or more of anything (.*), all the way to a literal right-paren (\\));
.*$ anything else up to the end of the string ($), which discards the trailing bracketed numbers.
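To see what each capture group holds, the pattern from the strcapture call can be applied to a single line with regexec and regmatches. A minimal sketch on a plain-ASCII line (line, parts and groups are illustrative names):

```r
line <- "4 Early reg (for instance)[2][3][4];"
pattern <- "^(\\S+)\\s+([^(]+)\\s+\\((.*)\\).*$"

# regexec returns the full match first, then one entry per capture group
parts <- regmatches(line, regexec(pattern, line))[[1]]
groups <- parts[-1]
print(groups)
# [1] "4" "Early reg" "for instance"
```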
It should be noted that \\s and \\S are non-POSIX regex characters, where it is generally suggested to use [^[:space:]] for \\S (no space chars) and [[:space:]] for \\s. Those are equivalent but I went with code-golf initially. With this replacement, it looks like
out <- strcapture("^([^[:space:]]+)[[:space:]]+([^(]+)[[:space:]]+\\((.*)\\).*$", txt,
proto = data.frame(ID=0L, Message="", Comment=""))
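A quick sketch of the equivalence between the escapes and the POSIX classes, on toy input (all names are illustrative):

```r
x <- c("a b", "a\tb", "ab")

with_escape <- grepl("\\s", x)          # shorthand escape
with_posix  <- grepl("[[:space:]]", x)  # POSIX class

# Same idea for the negated form
non_ws_escape <- sub("^(\\S+).*$", "\\1", "id42 rest")
non_ws_posix  <- sub("^([^[:space:]]+).*$", "\\1", "id42 rest")
```

Both pairs give the same result here: the first two strings contain whitespace, the third does not, and both sub calls return "id42".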
We can use {unglue}. Here you have two patterns: one contains "«" and an ID, the other doesn't. {unglue} will use the first pattern that matches. Any {foo} or {} expression matches the regex ".*?", and a data.frame is built from the names put between the braces.
txt <- c(
"1 message («random_choice»)[5];", "2 reply («asdf»)[2][3];",
"3 regulate («random_choice»)[5];", "4 Early reg («for instance»)[2][3][4];",
"4xx: Success (something):", "6 OK («fine»)[2][3];")
library(unglue)
patterns <-
c("{id} {Message} («{Comment}»){}",
"{} {Message} ({Comment}){}")
unglue_data(txt, patterns)
#> id Message Comment
#> 1 1 message random_choice
#> 2 2 reply asdf
#> 3 3 regulate random_choice
#> 4 4 Early reg for instance
#> 5 <NA> Success something
#> 6 6 OK fine
I want to know with how many spaces a string starts. Here are some examples:
string.1 <- "    starts with 4 spaces"
string.2 <- "  starts with only 2 spaces"
My attempt was the following but this leads to 1 in both cases and I understand why this is the case.
stringr::str_count(string.1, "^ ")
stringr::str_count(string.2, "^ ")
I'd prefer if there was a solution completely like this but with another regex.
The "^ " pattern matches a single space at the start of the string, which is why both test cases return 1.
To match consecutive spaces at the start of the string, you may use
stringr::str_count(string.1, "\\G ")
Or, to count any whitespaces,
stringr::str_count(string.1, "\\G\\s")
The "\\G " pattern matches a space at the start of the string and then each space that immediately follows the previous successful match, because the \G anchor asserts the position where the previous match ended.
Another approach: count the length of ^\s+ matches (1 or more whitespace chars at the start of the string):
strings <- c("    starts with 4 spaces", "  starts with only 2 spaces")
matches <- regmatches(strings, regexpr("^\\s+", strings))
sapply(matches, nchar)
# => 4 2
One approach might be to take the nchar of the input string, with all content from the first non whitespace character until the end stripped.
string.1 <- "    starts with 4 spaces"
nchar(sub("\\S.*$", "", string.1))
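The sub-based approach can be wrapped in a small vectorised helper. This is only a sketch; count_leading_ws is a hypothetical name, not an existing function. Unlike the regmatches version, which drops elements that have no leading whitespace, it returns one count per input string.

```r
# Strip everything from the first non-whitespace character onward,
# then count what is left (the leading whitespace).
count_leading_ws <- function(x) nchar(sub("\\S.*$", "", x))

x <- c("    starts with 4 spaces", "  starts with only 2 spaces", "none", "   ")
counts <- count_leading_ws(x)
print(counts)
# [1] 4 2 0 3
```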
I have a simple character string like:
test<-c("two words", "three more words", "something else", "this has a lot of words", "more of this", "pick me")
I would need a function that returns the indices of test where there are only 2 words in the element (in this example this would be index 1, 3 and 6, but 2, 4 and 5 are completely uninteresting). More context: I am searching for "real" names of persons among a large vector that is mixed also with company names (which have often 3 or more words). I have no clue how to perhaps get regex (or any other technique) to do this...
We can use grep to match the word (\\w+) followed by a space followed by other word (\\w+) from the start (^) to end ($) of the string
grep("^\\w+ \\w+$", test)
#[1] 1 3 6
Or with str_count
library(stringr)
which(str_count(test, "\\w+") == 2)
#[1] 1 3 6
One option involving stringr could be.
which(is.na(word(test, 1, 3, fixed(" "))))
#[1] 1 3 6
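If the names can contain characters that \\w does not match, such as a hyphen (an assumption about the data, since the question does not say), counting whitespace-separated tokens is one alternative. A sketch in base R; "Jean-Luc Picard" is a made-up extra element and the variable names are illustrative:

```r
test <- c("two words", "three more words", "something else",
          "this has a lot of words", "more of this", "pick me",
          "Jean-Luc Picard")  # extra, hypothetical element

# Split on runs of whitespace and keep elements with exactly two tokens
n_tokens <- lengths(strsplit(trimws(test), "[[:space:]]+"))
two_word_idx <- which(n_tokens == 2)
print(two_word_idx)
# [1] 1 3 6 7

# The \\w-based pattern skips the hyphenated name
word_idx <- grep("^\\w+ \\w+$", test)
print(word_idx)
# [1] 1 3 6
```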
Could someone explain why "Won 1 Oscar." can be picked out according to the regular expression given as below
awards <- c("Won 1 Oscar.",
"Won 1 Oscar. Another 9 wins & 24 nominations.",
"1 win and 2 nominations.",
"2 wins & 3 nominations.",
"Nominated for 2 Golden Globes. 1 more win & 2 nominations.",
"4 wins & 1 nomination.")
sub(".*\\s([0-9]+)\\snomination.*$", "\\1", awards)
I can only get that the pattern is "abcd (any number 0 -9 ) nominationabcd". Once the pattern is matched, the number will replace the whole string. The matched "Won 1 Oscar" comes from the second element. What I am confused is that there is no nomination.* following "Won 1 " and why there seems to be no replacement.
The sub function takes the regex (or a plain string if you use fixed=TRUE) and tries to find a match in each element of the input character vector. If a match is found, the first match is replaced with the replacement string/pattern (gsub works the same way but replaces every match). If no match is found, the current element (string) is returned unchanged.
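A short sketch of that behaviour on toy strings (not from the question; all names are illustrative):

```r
first_only <- sub("o", "0", "foo boo")   # first match only -> "f0o boo"
every_one  <- gsub("o", "0", "foo boo")  # every match      -> "f00 b00"
no_match   <- sub("z", "0", "foo boo")   # no match         -> "foo boo"

# With fixed = TRUE the pattern is a plain string, so "." is a literal dot
as_regex   <- sub(".", "!", "a.b")                # "!.b"
as_literal <- sub(".", "!", "a.b", fixed = TRUE)  # "a!b"
```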
Since you only want the nominations value from each element of the character vector, you need to extract it rather than replace the matches.
You may rely on the stringr str_extract:
> library(stringr)
> str_extract(awards, "[0-9]+(?=\\s*nomination)")
[1] NA "24" "2" "3" "2" "1"
The [0-9]+(?=\\s*nomination) pattern finds 1 or more digits, but only those followed by 0 or more whitespace characters and the character sequence "nomination". The whitespace and the word "nomination" are excluded from the match because they sit inside a positive lookahead, (?=...), which is non-consuming: the text it checks is not added to the match value.
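The same lookahead can be sketched in base R when stringr is not available. Lookaheads need perl = TRUE there, and regmatches only returns the elements that matched, so the result is filled into a vector of NA values first (m and out are illustrative names):

```r
awards <- c("Won 1 Oscar.",
            "Won 1 Oscar. Another 9 wins & 24 nominations.",
            "1 win and 2 nominations.",
            "2 wins & 3 nominations.",
            "Nominated for 2 Golden Globes. 1 more win & 2 nominations.",
            "4 wins & 1 nomination.")

# regexpr gives -1 for elements without a match
m <- regexpr("[0-9]+(?=\\s*nomination)", awards, perl = TRUE)

out <- rep(NA_character_, length(awards))
out[m != -1] <- regmatches(awards, m)
print(out)
# [1] NA "24" "2" "3" "2" "1"
```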