Split a string on every odd-positioned space in R

I have strings of varying lengths in this format:
"/S498QSB 0 'Score=0' 1 'Score=1' 2 'Score=2' 3 'Score=3' 7 'Not administered'"
The first item is a column name, and the remaining items describe how that column is encoded.
I want the following output:
/S498QSB
0 'Score=0'
1 'Score=1'
2 'Score=2'
3 'Score=3'
7 'Not administered'
str_split should do it, but it's not working for me:
str_split("/S498QSB 0 'Score=0' 1 'Score=1' 2 'Score=2' 3 'Score=3' 7 'Not administered'",
"([ ].*?[ ].*?)[ ]")

You can use
str_split(x, "\\s+(?=\\d+\\s+')")
Details:
\s+ - one or more whitespaces
(?=\d+\s+') - a positive lookahead that requires the following sequence of patterns immediately to the right of the current location:
\d+ - one or more digits
\s+ - one or more whitespaces
' - a single quotation mark.
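
For anyone without stringr installed, here is a minimal sketch of the same split in base R. It assumes strsplit with perl = TRUE (needed for the lookahead) is an acceptable stand-in for str_split; the names x, parts, column and codes are only illustrative.

```r
# Sample string from the question
x <- "/S498QSB 0 'Score=0' 1 'Score=1' 2 'Score=2' 3 'Score=3' 7 'Not administered'"

# Split on whitespace only when it is followed by digits, whitespace and a quote.
# perl = TRUE is required here because the lookahead is a Perl-style construct.
parts <- strsplit(x, "\\s+(?=\\d+\\s+')", perl = TRUE)[[1]]

column <- parts[1]   # the column name
codes  <- parts[-1]  # one "value 'label'" entry per element
print(column)
print(codes)
```

The space inside 'Not administered' is left alone because it is not followed by digits, whitespace and a quote.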

Related

gsub extracting string

My sample data is:
c("2\tNO PEMJNUM\t 2\tALTOGETHER HOW MANY JOBS\t216 - 217",
"1\tREFERENCE PERSON 2\tSPOUSE 3\tCHILD 4\tOTHER RELATIVE (PRIMARY FAMILY & UNREL) PRFAMTYP\t2\tFAMILY TYPE RECODE\t155 - 156",
"5\tUNABLE TO WORK PUBUS1\t 2\tLAST WEEK DID YOU DO ANY\t184 - 185",
"2\tNO PEIO1COW\t 2\tINDIVIDUAL CLASS OF WORKER CODE\t432 - 433")
From each line, I want to extract the variable name:
Line 1: "PEMJNUM"
Line 2: "PRFAMTYP"
Line 3: "PUBUS1"
Line 4: "PEIO1COW"
My initial plan was to use gsub to strip the characters to the left and right of each variable name, leaving just the names. I was able to match everything to the right of the variable name but had trouble matching the characters to its left (as shown here: https://regexr.com/67r6j).
Not sure if there's a better way to do this!
You can use sub in the following way:
x <- c("2\tNO PEMJNUM\t 2\tALTOGETHER HOW MANY JOBS\t216 - 217",
"1\tREFERENCE PERSON 2\tSPOUSE 3\tCHILD 4\tOTHER RELATIVE (PRIMARY FAMILY & UNREL) PRFAMTYP\t2\tFAMILY TYPE RECODE\t155 - 156",
"5\tUNABLE TO WORK PUBUS1\t 2\tLAST WEEK DID YOU DO ANY\t184 - 185",
"2\tNO PEIO1COW\t 2\tINDIVIDUAL CLASS OF WORKER CODE\t432 - 433")
sub("^(?:.*\\b)?(\\w+)\\s*\\b2\\b.*", "\\1", x, perl=TRUE)
# => [1] "PEMJNUM" "PRFAMTYP" "PUBUS1" "PEIO1COW"
Details:
^ - start of string
(?:.*\b)? - an optional non-capturing group that matches any zero or more chars, as many as possible, and then a word boundary position (with perl=TRUE the dot does not match line break chars; to match line breaks too, add (?s) at the start of the pattern)
(\w+) - Group 1 (\1): one or more word chars
\s* - zero or more whitespaces
\b - a word boundary
2 - a 2 digit
\b - a word boundary
.* - the rest of the line/string.
If there are always whitespaces before 2, the regex can be written as "^(?:.*\\b)?(\\w+)\\s+2\\b.*".
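
As a quick check of that last remark, here is a small sketch that runs both patterns on two of the sample strings and compares the results. The variable names (p_any, p_ws and so on) are only illustrative.

```r
# Two of the sample strings from the question
x <- c("2\tNO PEMJNUM\t 2\tALTOGETHER HOW MANY JOBS\t216 - 217",
       "1\tREFERENCE PERSON 2\tSPOUSE 3\tCHILD 4\tOTHER RELATIVE (PRIMARY FAMILY & UNREL) PRFAMTYP\t2\tFAMILY TYPE RECODE\t155 - 156")

p_any <- "^(?:.*\\b)?(\\w+)\\s*\\b2\\b.*"  # whitespace before 2 is optional
p_ws  <- "^(?:.*\\b)?(\\w+)\\s+2\\b.*"     # whitespace before 2 is required

vars_any <- sub(p_any, "\\1", x, perl = TRUE)
vars_ws  <- sub(p_ws,  "\\1", x, perl = TRUE)

print(vars_any)
print(identical(vars_any, vars_ws))
```

Because the leading .* is greedy, both patterns pick the last standalone 2 in the string, which is why the second string yields PRFAMTYP and not PERSON.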

grep formatted number using r

I have a string format that I would like to select from a character vector. The form is
123 123 1234
where the two spaces can also be a hyphen. i.e. 3 digits followed by space or hyphen, followed by 3 digits, followed by space or hyphen, followed by 4 digits
I am trying to do this by the following:
grep("^([0-9]{3}[ -.])([0-9]{3}[ -.])([0-9]{4}$)",mytext)
however this yields:
integer(0)
What am I doing wrong?
Your string has a whitespace character at the end, so you can either account for that whitespace, like so:
grep("^([0-9]{3}[ -.])([0-9]{3}[ -.])([0-9]{4} $)",mytext)
Or remove the end-of-line assertion "$", like so:
grep("^([0-9]{3}[ -.])([0-9]{3}[ -.])([0-9]{4})",mytext)
Also, as pointed out by Wiktor Stribiżew, the character class [ -.] matches any character in the range from " " to ".", which includes "-" but also characters such as "+" and ",". To match only "-", "." and " ", put the "-" at the end of the class, as in [ .-], or escape it, as in [ \-.] (in R, the escaped form needs perl=TRUE and is written "[ \\-.]" in a string literal).
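
To make both points concrete, here is a small sketch. The vector mytext is made-up sample data, not the asker's, and perl = TRUE is used so that the range is interpreted by code point.

```r
# Hypothetical sample data; the second entry has a trailing space
mytext <- c("123 456 7890", "123-456-7890 ", "123+456+7890", "12 456 7890")

loose  <- "^[0-9]{3}[ -.][0-9]{3}[ -.][0-9]{4}$"  # " " to "." is a range
strict <- "^[0-9]{3}[ .-][0-9]{3}[ .-][0-9]{4}$"  # "-" last, so it is literal

# Trimming surrounding whitespace lets the $ anchor succeed
loose_hits  <- grepl(loose,  trimws(mytext), perl = TRUE)
strict_hits <- grepl(strict, trimws(mytext), perl = TRUE)

# Without trimming, the entry with the trailing space no longer matches
untrimmed_hits <- grepl(strict, mytext, perl = TRUE)

print(loose_hits)
print(strict_hits)
print(untrimmed_hits)
```

The range version accepts "123+456+7890" because "+" lies between " " and "." in ASCII, while the version with the hyphen last rejects it.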

Regular expression to exclude the substring "job corps" and require at least 1 upper case letter, 1 lower case letter, 1 number and 1 symbol except "#"

Regular Expression To exclude sub-string name(job corps)
Includes at least 1 upper case letter, 1 lower case letter, 1 number and 1 symbol except "#"
I have written something like below :
^((?!job corps).)(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[!#$%^&*]).*$
I tested the regular expression above, but it does not work for the special-character requirement.
Can anyone guide me on this?
If I understand your requirements correctly, you can use this pattern:
^(?![^a-z]*$|[^A-Z]*$|[^0-9]*$|[^!#$%^&*]*$|.*?job corps)[^#]*$
If you only want to allow characters from [a-zA-Z0-9!$%^&*], change the pattern to:
^(?![^a-z]*$|[^A-Z]*$|[^0-9]*$|[^!#$%^&*]*$|.*?job corps)[a-zA-Z0-9!$%^&*]*$
details:
^ # start of the string
(?! # not followed by any of these cases
[^a-z]*$ # non lowercase letters until the end
|
[^A-Z]*$ # non uppercase letters until the end
|
[^0-9]*$ # non digits until the end
|
[^!#$%^&*]*$ # none of the listed symbols until the end
|
.*?job corps # any characters and "job corps"
)
[^#]* # characters that are not a #
$ # end of the string
Note: you can write the range #$%& as #-& to save a character.
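
The question does not say which regex engine is in use, so the following is only a sketch that exercises the first pattern with R's Perl-compatible engine (perl = TRUE). The candidate strings are made up for illustration.

```r
pattern <- "^(?![^a-z]*$|[^A-Z]*$|[^0-9]*$|[^!#$%^&*]*$|.*?job corps)[^#]*$"

# Hypothetical test inputs
candidates <- c("Abc123!",         # meets every requirement
                "Xy9$ ok",         # also fine, spaces are allowed by [^#]*
                "abc123!",         # no upper case letter
                "Abcdef1",         # no symbol
                "Abc123#",         # contains the forbidden #
                "Abc1!job corps")  # contains the excluded substring

ok <- grepl(pattern, candidates, perl = TRUE)
names(ok) <- candidates
print(ok)
```

Each alternative inside the negative lookahead describes one way a string can fail, so a string passes only if none of them applies and it contains no #.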
stribizhev, your answer is correct:
^(?!.*job corps)(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[!#$%^&*])(?!.*#).*$
You can verify the expression at the following URL:
http://www.freeformatter.com/regex-tester.html

Regular expression to match version numbers

I need a regular expression that matches the criteria below.
For example, these should match:
1
1134
1.1
1.4.5.6
Those below should not match:
.1
1.
1..6
You can use
^\d+(\.\d+)*$
^ - beginning of string
\d+ - 1 or more digits
(\.\d+)* - a group matching 0 or more sequences of . + 1 or more digits
$ - end of string.
You can use a non-capturing group, too: ^\d+(?:\.\d+)*$, but it is not so necessary here.
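
A quick way to check the pattern against the examples from the question, sketched in R (the question does not name a language, so the use of grepl with perl = TRUE is an assumption):

```r
# The seven examples from the question: the first four should match
versions <- c("1", "1134", "1.1", "1.4.5.6", ".1", "1.", "1..6")

is_version <- grepl("^\\d+(\\.\\d+)*$", versions, perl = TRUE)
names(is_version) <- versions
print(is_version)
```

The backslashes are doubled only because the pattern sits inside an R string literal; the regex itself is unchanged.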

Regular expression to validate a string representing either number or addition of numbers

I need help with a regular expression that validates a string representing either a number or an addition of numbers.
e.g:
2 OK
22 + 3 OK
2+3 not OK
2 +3 not OK
2 + 34 + 45 OK
2 + 33 + not OK
2 + 33+ 4 not OK
This would be quite a simple pattern
^\d+(?: \+ \d+)*$
^ anchor for the start of the string
$ anchor for the end of the string
The anchors are needed, otherwise the pattern would also match a valid part of an otherwise invalid string
\d+ is at least one digit
(?: \+ \d+)* is a non capturing group that can be there 0 or more times (because of the * quantifier at the end)
Try:
/^\d+(\s+\+\s+\d+)*$/
This matches a number followed by an optional plus sign and number, which can then be repeated.
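
The two patterns above are not quite equivalent. Here is a sketch in R (an assumed language, since the question does not name one) that runs both on the examples from the question plus one extra, made-up input with double spaces:

```r
inputs <- c("2", "22 + 3", "2+3", "2 +3", "2 + 34 + 45",
            "2 + 33 +", "2 + 33+ 4",
            "2  +  3")  # extra input, not from the question

# Exactly one space on each side of every plus sign
strict <- grepl("^\\d+(?: \\+ \\d+)*$", inputs, perl = TRUE)

# One or more whitespace characters on each side of every plus sign
loose <- grepl("^\\d+(\\s+\\+\\s+\\d+)*$", inputs, perl = TRUE)

print(data.frame(inputs, strict, loose))
```

Both agree on the seven examples from the question; they differ only on the last input, where \s+ tolerates the extra spaces and the literal-space version does not.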
