Insert characters when a string changes its case in R

I would like to insert characters at the places where a string changes its case. I tried the following, which inserts a '\n' after a fixed number of characters and then a ' ', because I could not figure out how to detect the case change:
s <-c("FloridaIslandE7", "FloridaIslandE9", "Meta")
gsub('^(.{7})(.{6})(.*)$', '\\1\\\n\\2 \\3', s )
[1] "Florida\nIsland E7" "Florida\nIsland E9" "Meta"
This works because the positions are fixed, but I would like to know how to do it in the general case.

Surely there's a less convoluted regex for this, but you could try:
gsub('([A-Z][0-9])', ' \\1', gsub('([a-z])([A-Z])', '\\1\n\\2', s))
Output:
[1] "Florida\nIsland E7" "Florida\nIsland E9" "Meta"

Here is an option using str_replace_all from the stringr package:
library(stringr)
str_replace_all(s, "(?<=[a-z])(?=[A-Z])", "\n")
#[1] "Florida\nIsland\nE7" "Florida\nIsland\nE9" "Meta"

If you really want to insert \n, try this:
gsub("([a-z])([A-Z])", "\\1\\\n\\2", s)
[1] "Florida\nIsland\nE7" "Florida\nIsland\nE9" "Meta"
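For the general case, the capture-group approach above can be wrapped in a small helper. This is only a sketch: the function name insert_at_case_change and its sep argument are made up for illustration, and a "case change" is assumed to mean a lowercase ASCII letter immediately followed by an uppercase one.

```r
# Hypothetical helper (name and arguments are illustrative, not from any package).
# Inserts `sep` wherever a lowercase ASCII letter is directly followed by an
# uppercase one. Assumes `sep` contains no backslashes, since gsub() would
# interpret them in the replacement string.
insert_at_case_change <- function(x, sep = "\n") {
  gsub("([a-z])([A-Z])", paste0("\\1", sep, "\\2"), x, perl = TRUE)
}

s <- c("FloridaIslandE7", "FloridaIslandE9", "Meta")
insert_at_case_change(s)             # newline at every lower-to-upper change
insert_at_case_change(s, sep = " ")  # same positions, but with a space
```

Strings without a lower-to-upper transition, such as "Meta", are returned unchanged.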

Related

Remove all punctuation except underline between characters in R with POSIX character class

I would like to use R to remove all underlines except those between words. In the end, the code should remove underlines at the beginning or at the end of a word.
The result should be
'hello_world and hello_world'.
I want to use the pre-built POSIX classes. Right now I have learned how to exclude particular characters with the following code, but I don't know how to use the word boundary sequences.
test<-"hello_world and _hello_world_"
gsub("[^_[:^punct:]]", "", test, perl=TRUE)
You can use
gsub("[^_[:^punct:]]|_+\\b|\\b_+", "", test, perl=TRUE)
Details:
[^_[:^punct:]] - any punctuation except _
| - or
_+\b - one or more _ at the end of a word
| - or
\b_+ - one or more _ at the start of a word
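As a rough illustration of the three alternatives working together, the pattern can be wrapped in a function and tried on input that also contains other punctuation. The function name below is hypothetical and chosen only for this sketch.

```r
# Hypothetical wrapper around the pattern above; the function name is illustrative.
strip_punct_keep_inner_underscores <- function(x) {
  # perl = TRUE is required for the negated POSIX class [:^punct:]
  gsub("[^_[:^punct:]]|_+\\b|\\b_+", "", x, perl = TRUE)
}

test <- "hello_world and _hello_world_"
res1 <- strip_punct_keep_inner_underscores(test)
res2 <- strip_punct_keep_inner_underscores("_a_b_, (c)")
res1  # underscores inside words are kept, those at word edges are removed
res2  # the comma and parentheses are removed as well
```

Because _ counts as a word character, \b only matches next to an underscore when its other neighbour is a non-word character or the edge of the string, which is why the underscores inside words are left alone.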
One non-regex way is to split and use trimws by setting the whitespace argument to _, i.e.
paste(sapply(strsplit(test, ' '), function(i)trimws(i, whitespace = '_')), collapse = ' ')
#[1] "hello_world and hello_world"
We can remove all the underscores that have a word boundary on either side. We use positive lookbehind and lookahead to find such underscores. To remove underscores at the start and end of the string we use trimws.
test<-"hello_world and _hello_world_"
gsub("(?<=\\b)_|_(?=\\b)", "", trimws(test, whitespace = '_'), perl = TRUE)
#[1] "hello_world and hello_world"
You could use:
test <- "hello_world and _hello_world_"
output <- gsub("(?<![^\\W])_|_(?![^\\W])", "", test, perl=TRUE)
output
[1] "hello_world and hello_world"
Explanation of regex:
(?<![^\\W]) assert that what precedes is a non word character OR the start of the input
_ match an underscore to remove
| OR
_ match an underscore to remove, followed by
(?![^\\W]) assert that what follows is a non word character OR the end of the input

Regex, match year listed as range

I have a list of years like this:
2018-
2001-2020
1999-
2005-
I would like to create a regex to match the year with these criteria:
xxxx- matches xxxx
yyyy-nnnn matches nnnn
Can you please help me?
I've tried [[:digit:]]{4}$, or alternatively [[:digit:]]{4}-$, but they only partially work.
To get the last year in the "range" delimited by the - character, the cleanest way is to split on it:
my $year = (split /-/, $range)[-1];
If there isn't anything after the last delimiter then the last returned element by split is what is before it, so the last element in its return list (obtained with index -1) is either the second given year -- as in 2001-2020 -- or the only one, as in other examples. This performs no checking of input.
With a regex, one way is to seek the last number in the string
my ($year) = $range =~ /([0-9]+)[^0-9]*$/;
where if you use [0-9]{4} then there is a small additional measure of checking.
The POSIX character class [[:digit:]] and its negation [[:^digit:]] (or \P{PosixDigit}) can be used instead if desired, but note that these match all manner of Unicode "digit characters," just like \d and \D do (a few hundred), on top of the ascii [0-9] (unless /a modifier is used).
A full test program, for both
use warnings;
use strict;
use feature 'say';
my @ranges = qw(2018- 2001-2020 1999- 2005-);
foreach my $range (@ranges) {
my $year = (split /-/, $range)[-1];
# Or, using regex
# my ($year) = $range =~ /([0-9]+)[^0-9]*$/;
say $year;
}
Prints as desired.
We can capture the four digits as a group, optionally followed by a - at the end ($) of the string, and replace the whole match with the backreference (\\1) to the captured group
sub(".*(\\d{4})-?$", "\\1", str1)
#[1] "2018" "2020" "1999" "2005"
data
str1 <- c("2018-", "2001-2020", "1999-", "2005-")
You can split the text on "-" and get the last number.
str1 <- c("2018-", "2001-2020", "1999-", "2005-")
sapply(strsplit(str1, '-', fixed = TRUE), tail, 1)
#[1] "2018" "2020" "1999" "2005"
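The "last number in the string" idea from the Perl answer can also be sketched in R. The helper name last_year is hypothetical, and the sketch assumes every input contains at least one run of four digits; strings that do not match are returned unchanged by sub().

```r
# Sketch: keep the last run of four digits, ignoring any non-digits after it.
# `last_year` is a hypothetical name chosen for this example.
last_year <- function(x) {
  # .* is greedy, so the capture group lands on the last four digits
  sub("^.*([0-9]{4})[^0-9]*$", "\\1", x, perl = TRUE)
}

str1 <- c("2018-", "2001-2020", "1999-", "2005-")
last_year(str1)
```

Since the separator is only described as "anything that is not a digit", this variant does not depend on which dash character or how much whitespace surrounds the years.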

How do I replace all the punctuation in a string with '\\W'?

string = 'Hello, how are you?'
What I want to achieve:
Hello\\W how are you\\W
What I've done: Substituting all characters that are not alphanumeric with '\\W'
gsub('(\\W)+[^\\S]+','\\\\W',string,perl=TRUE)
[1] "Hello\\Whow are you?"
I'm not sure why the question mark at the end of the sentence wasn't substituted with '\\W', or why the first space was substituted. Could anyone help me out with this? Thank you!
We can do
gsub("[,?]", "\\\\W", string)
#[1] "Hello\\W how are you\\W"
If there are other characters, use [[:punct:]]
gsub("[[:punct:]]", "\\\\W", string)
#[1] "Hello\\W how are you\\W"
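As for why the original attempt behaved that way: the pattern (\\W)+[^\\S]+ requires one or more non-word characters followed by at least one whitespace character ([^\\S] is just another way of writing whitespace). The ", " after Hello matches and is replaced as a whole, which swallows the space, while the final ? has no whitespace after it and so never matches. The sketch below also shows that the doubled backslash in the printed result is only how print() displays a single backslash.

```r
string <- "Hello, how are you?"

# "\\\\W" in R source is the three characters \\W, which gsub() reads as an
# escaped backslash followed by W, so one backslash is inserted per match.
out <- gsub("[[:punct:]]", "\\\\W", string)

print(out)      # [1] "Hello\\W how are you\\W"  (print() escapes the backslash)
cat(out, "\n")  # Hello\W how are you\W          (cat() shows the raw characters)
nchar(out)      # 21: the 19 original characters, with 2 of them replaced by 2 each
```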

Is there a "quote words" operator in R? [duplicate]

Is there a "quote words" operator in R, analogous to qw in Perl? qw is a quoting operator that allows you to create a list of quoted items without having to quote each one individually.
Here is how you would do it without qw (i.e. using dozens of quotation marks and commas):
#!/bin/env perl
use strict;
use warnings;
my @NAM_founders = ("B97", "CML52", "CML69", "CML103", "CML228", "CML247",
"CML322", "CML333", "Hp301", "Il14H", "Ki3", "Ki11",
"M37W", "M162W", "Mo18W", "MS71", "NC350", "NC358",
"Oh7B", "P39", "Tx303", "Tzi8",
);
print(join(" ", @NAM_founders)); # Prints array, with elements separated by spaces
Here's doing the same thing, but with qw it is much cleaner:
#!/bin/env perl
use strict;
use warnings;
my @NAM_founders = qw(B97 CML52 CML69 CML103 CML228 CML247 CML277
CML322 CML333 Hp301 Il14H Ki3 Ki11 Ky21
M37W M162W Mo18W MS71 NC350 NC358 Oh43
Oh7B P39 Tx303 Tzi8
);
print(join(" ", @NAM_founders)); # Prints array, with elements separated by spaces
I have searched but not found anything.
Try using scan and a text connection:
qw=function(s){scan(textConnection(s),what="")}
NAM=qw("B97 CML52 CML69 CML103 CML228 CML247 CML277
CML322 CML333 Hp301 Il14H Ki3 Ki11 Ky21
M37W M162W Mo18W MS71 NC350 NC358 Oh43
Oh7B P39 Tx303 Tzi8")
This will always return a vector of strings even if the data in quotes is numeric:
> qw("1 2 3 4")
Read 4 items
[1] "1" "2" "3" "4"
I don't think you'll get much simpler, since space-separated bare words aren't valid syntax in R, even wrapped in curly brackets or parens. You've got to quote them.
For R, the closest thing that I can think of, or that I've found so far, is to create a single block of text and then break it up using strsplit, thus:
#!/bin/env Rscript
NAM_founders <- "B97 CML52 CML69 CML103 CML228 CML247 CML277
CML322 CML333 Hp301 Il14H Ki3 Ki11 Ky21
M37W M162W Mo18W MS71 NC350 NC358 Oh43
Oh7B P39 Tx303 Tzi8"
NAM_founders <- unlist(strsplit(NAM_founders,"[ \n]+"))
print(NAM_founders)
Which prints
[1] "B97" "CML52" "CML69" "CML103" "CML228" "CML247" "CML277" "CML322"
[9] "CML333" "Hp301" "Il14H" "Ki3" "Ki11" "Ky21" "M37W" "M162W"
[17] "Mo18W" "MS71" "NC350" "NC358" "Oh43" "Oh7B" "P39" "Tx303"
[25] "Tzi8"
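A slightly more compact variant of the scan idea is sketched below. It assumes that the text argument of scan is acceptable in place of an explicit textConnection, and the name qw is simply borrowed from Perl; nothing like it is built into R.

```r
# Hypothetical qw() helper: split a block of text on any whitespace.
# scan(text = ...) avoids managing a text connection by hand, and
# quiet = TRUE suppresses the "Read N items" message.
qw <- function(s) scan(text = s, what = "", quiet = TRUE)

NAM_founders <- qw("B97 CML52 CML69
                    CML103 CML228")
NAM_founders          # a character vector with one element per word
length(NAM_founders)  # 5
```

Line breaks and indentation inside the quoted block are treated like any other whitespace, so the list can be wrapped freely.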

grep on two strings

I'm trying to grab two different kinds of elements from a character vector.
The strings look like this:
str <- c('a_abc', 'b_abc', 'abc', 'z_zxy', 'x_zxy', 'zxy')
I have tried the different options in ?grep, but I can't get it right. I'm doing something like this:
grep('[_abc]:[_zxy]',str, value = TRUE)
and what I would like is,
[1] "a_abc" "b_abc" "z_zxy" "x_zxy"
any help would be appreciated.
Use normal parentheses ( ) to group the alternatives, not square brackets [ ], which define a character class:
grep('_(abc|zxy)',str, value = TRUE)
[1] "a_abc" "b_abc" "z_zxy" "x_zxy"
To make the grep a bit more flexible, you could do something like:
grep('_.{3}$',str, value = TRUE)
Which will match an underscore _ followed by any character . three times {3} followed immediately by the end of the string $
This should work: grep('_abc|_zxy', str, value = TRUE)
X|Y matches when either X matches or Y matches
In this case just doing:
str[grep("_",str)]
will work... is it more complicated in your specific case?
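To see why the original attempt returns nothing, it may help to run both patterns side by side. This is a minimal sketch; the variable names no_hits and hits are illustrative, and the trailing $ is an addition that is not part of the answers above.

```r
str <- c("a_abc", "b_abc", "abc", "z_zxy", "x_zxy", "zxy")

# Square brackets define a character class: this pattern asks for one character
# from "_abc", then a literal ":", then one character from "_zxy".
no_hits <- grep("[_abc]:[_zxy]", str, value = TRUE)

# Parentheses group alternatives; the trailing $ (an addition here) anchors
# the suffix to the end of the string.
hits <- grep("_(abc|zxy)$", str, value = TRUE)

no_hits  # empty, because none of the strings contains a ":"
hits     # the four elements that have an underscore before abc or zxy
```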
