How can I use R Regular Expressions to catch a Hebrew word?

I've been trying to catch the word
עונה
plus the subsequent number after it in a string such as
כל הילדים אוכלים, עונה 2 , פרק 8-לזניית ירקות וסלמון בדבש
Demonstrating it on Regex101.com was straightforward enough, with עונה(\s+\d+|\d+), but with R I came up empty.
library(stringr)  # provides str_extract_all()
str<-"כל הילדים אוכלים, עונה 2 , פרק 8-לזניית ירקות וסלמון בדבש"
exp<-"עונה(\\s+\\d+|\\d+)"
str_extract_all(str,exp)
Output:
[[1]]
character(0)

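You can match Hebrew characters by their Unicode block (U+0590 to U+05FF). In R, write the pattern as a string, without the / delimiters:
[\u0590-\u05FF]+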
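As a rough sketch of how this could be applied to the string in the question: spelling the Hebrew word with \u escapes keeps the pattern independent of how the script file is encoded. That encoding is a common reason such a match silently fails, though this is an assumption about the cause here, not something the question confirms. The shortened sample string and the variable names are illustrative only.

```r
# "\u05e2\u05d5\u05e0\u05d4" spells the word from the question (ayin, vav, nun, he)
season_word <- "\u05e2\u05d5\u05e0\u05d4"

# Shortened stand-in for the original sentence, also written with escapes
x <- paste0("\u05db\u05dc, ", season_word, " 2 , \u05e4\u05e8\u05e7 8")

# Optional whitespace, then one or more digits; equivalent to (\s+\d+|\d+)
pattern <- paste0(season_word, "\\s*\\d+")

# regexpr() locates the first match; regmatches() extracts it
hit <- regmatches(x, regexpr(pattern, x, perl = TRUE))

# Keep only the digits to get the number after the word
season_number <- as.integer(gsub("[^0-9]+", "", hit))
```

The same pattern string can be passed to str_extract_all() if you prefer stringr.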

Related

Insert characters when a string changes its case R

I would like to insert characters at the places where a string changes its case. I tried inserting a '\n' after a fixed number of characters and then a ' ', since I can't figure out how to detect the case change:
s <-c("FloridaIslandE7", "FloridaIslandE9", "Meta")
gsub('^(.{7})(.{6})(.*)$', '\\1\\\n\\2 \\3', s )
[1] "Florida\nIsland E7" "Florida\nIsland E9" "Meta"
This works because the positions are fixed but I would like to know how to do it for the general case.
Surely there's a less convoluted regex for this, but you could try:
gsub('([A-Z][0-9])', ' \\1', gsub('([a-z])([A-Z])', '\\1\n\\2', s))
Output:
[1] "Florida\nIsland E7" "Florida\nIsland E9" "Meta"
Here is an option using the stringr package:
str_replace_all(s, "(?<=[a-z])(?=[A-Z])", "\n")
#[1] "Florida\nIsland\nE7" "Florida\nIsland\nE9" "Meta"
If you really want to insert \n, try this:
gsub("([a-z])([A-Z])", "\\1\\\n\\2", s)
[1] "Florida\nIsland\nE7" "Florida\nIsland\nE9" "Meta"
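For the general case, the lookaround idea also works in base R if you pass perl = TRUE. A minimal sketch; the helper name break_on_case_change is made up for illustration:

```r
s <- c("FloridaIslandE7", "FloridaIslandE9", "Meta")

# Zero-width match at every position that is preceded by a lowercase letter
# and followed by an uppercase one. Lookarounds need perl = TRUE in base R.
break_on_case_change <- function(x, sep = "\n") {
  gsub("(?<=[a-z])(?=[A-Z])", sep, x, perl = TRUE)
}

broken <- break_on_case_change(s)
print(broken)
```

Because the match is zero-width, no letters are consumed, so no capture groups are needed in the replacement.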

RStudio truncates long strings when pasting into IDE - workaround?

I'm using R 3.6.1 and RStudio 1.3.1056.
When pasting a long character string into RStudio using either paste() or paste0() or even c() or just simply assigning the string to an object, I get some strange results, and it looks to be unique to RStudio (if you place this code directly into R it works just fine; the correct value is given for nchar()):
thing <- paste0("(blah blah
'100869017', '100895297', '100937037', '100952542', '100953872', '100958290', '100977291', '100978521', '100982570', '100983764', '100986439', '100987969', '100988635', '100988637', '100989748', '100992594', '100998300', '100998306', '101000068', '101000556', '101002036', '101002550', '101002813', '101002871', '101002872', '101003492', '101003787', '101003789', '101003830', '101004348', '101004349', '101004400', '101004401', '101005323', '101005738', '101006388', '101006411', '101006413', '101006414', '101006416', '101006417', '101006419', '101006440', '101006441', '101006442', '101006443', '101006444', '101006445', '101006446', '101006447', '101006448', '101006449', '101006450', '101006451', '101006452', '101006453', '101006454', '101006455', '101006456', '101006457', '101006458', '101006554', '101006588', '101006608', '101008736', '101009658', '101011518', '101011680', '101011681', '101012457', '101012495', '101014157', '101014197', '101014240', '101014244', '101014248', '101014301', '101014302', '101014303', '101014304', '101014358', '101014480', '101014481', '101015219', '101017560', '101019383', '101019396', '101019454', '101019480', '101019481', '101019567', '101020977', '101022585', '101024007', '101024436', '101028376', '101028377', '101028405', '101029814', '101030739', '101030940', '101031364', '101031368', '101032356', '101032399', '101032440', '101032441', '101032442', '101032443', '101032444', '101032462', '101032468', '101032482', '101032483', '101032484', '101032485', '101032486', '101032487', '101032488', '101032489', '101032590', '101032591', '101032735', '101032987', '101033456', '101036227', '101037275', '101038196', '101038279', '101038930', '101038932', '101038938', '101039576', '101041116', '101041233', '101041288', '101042815', '101043166', '101043280', '101043281', '101043282', '101043285', '101043288', '101043307', '101043302', '101043329', '101043405', '101043837', '101045392', '101045635', '101046419', '101046440', '101046441', 
'101047082', '101047224', '101047227', '101047275', '101047281', '101047286', '101047287', '101047288', '101047290', '101047293', '101047295', '101047297', '101047304', '101048355', '101048439', '101048480', '101048905', '101048921', '101050905', '101052305', '101052442', '101052448', '101052449', '101052480', '101052481', '101052485', '101052487', '101052489', '101052522', '101052550', '101052551', '101053187', '101055017', '101055036', '101055039', '101055220', '101055258', '101055313', '101055316', '101055317', '101059567', '101060256', '101060554', '101060810', '101060817', '101061738', '101061739', '101061762', '101062369', '101062469', '101063528', '101063909', '101065440', '101065471', '101065473', '101065536', '101065760', '101065784', '101065805', '101065813', '101068343', '101068346', '101069329', '101069472', '101069478', '101069771', '101069871', '101069902', '101070895', '101071301', '101071303', '101072911', '101072914', '101072915', '101072944', '101072946', '101072949', '101072972', '101072981', '101072984', '101072985', '101073389', '101073806', '101074467', '101074469', '101074650', '101074709', '101074721', '101074869', '101075639', '101075881', '101075887', '101075888', '101076841', '101076843', '101076884', '101076885', '101076889', '101076930', '101077036', '101077872', '101077877', '101078006', '101078141', '101078834', '101079626', '101079624', '101079658', '101080128', '101080146', '101080341', '101080389', '101080732', '101080738', '101080931', '101081744', '101081745', '101082123', '101082443', '101082445', '101082447', '101085919', '101086763', '101086774', '101086801', '101086915', '101086964', '101086965', '101087006', '101088057', '101088465', '101089884', '101089915', '101089945', '101090159', '101090197', '101090225', '101090226', '101090227', '101090229', '101090293', '101091218', '101091232', '101091238', '101091239', '101091635', '101091655', '101092773', '101092997', '101093029', '101093064', '101093067', '101093255', 
'101093344', '101097283', '101097668', '101098444', '101098514', '101099068', '101099073', '101099076', '101099141', '101099170', '101099172', '101099173', '101099175', '101099177', '101099178', '101099194', '101099204', '101099206', '101099581', '101099666', '101100002', '101100179', '101100492', '101100617', '101101080', '101101088', '101101091', '101101092', '101101115', '101101117', '101101150', '101101158', '101102050', '101102086', '101102101', '101102108', '101102169', '101102650', '101102712', '101103376', '101106299', '101106618', '101107257', '101107277', '101108114', '101108119', '101108670', '101108702', '101108707', '101109772', '101109774', '101109779', '101111022', '101111029', '101113873', '101114376', '101114390', '101115163', '101115246', '101115247', '101115357', '101115358', '101116813', '101116819', '101116870', '101116877', '101118108', '101118175', '101118178', '101118277', '101118441', '101118449', '101118471', '101118505', '101118631', '101119051', '101119448', '101119914', '101120073', '101120076', '101120127', '101120292', '101120334', '101120387', '101120389', '101122367', '101122822', '101122881', '101122886', '101124670', '101124838', '101125490', '101125610', '101126329', '101127340', '101127341', '101127342', '101127343', '101127346', '101127347', '101127360', '101127853', '101127855', '101127856', '101127857', '101128128', '101128126', '101128132', '101128135', '101130135', '101131523', '101132622', '101132648', '101132850', '101132870', '101132931', '101132938', '101132990', '101132994', '101133104', '101133206', '101133248', '101134597', '101134599', '101134611', '101134649', '101134661', '101134704', '101134771', '101135221', '101135276', '101135278', '101135409', '101135444', '101135518', '101135630', '101135633', '101135632', '101137571', '101137750', '101137812', '101137875', '101138237', '101139907', '101139931', '101139968', '101140076', '101140148', '101140181', '101140250', '101140253', '101140460', '101140462', 
'101140466', '101140469', '101140518', '101150986', '101150987', '101150990', '101150994', '101150995', '101151373', '101151376', '101151416', '101151418', '101151419', '101151434', '101151437', '101151891', '101151974', '101151978', '101151979', '101151996', '101152030', '101152031', '101152032', '101152037', '101152062', '101152063', '101152066', '101152068', '101152069', '101152070', '101152072', '101152073', '101152074', '101152077', '101152078', '101152080', '101152081', '101152083', '101152085', '101152087', '101152088', '101152089', '101152100', '101152103', '101152105', '101153684', '101153944', '101153966', '101153996', '101153999', '101155013', '101155141', '101155149', '101155311', '101155560', '101155880', '101155882', '101155883', '101155884', '101155905', '101156458', '101156459', '101156511', '101156524', '101156546', '101156547', '101156596', '101156611', '101156641', '101156664', '101156752', '101156786', '101156801', '101156842', '101156885', '101156888', '101156892', '101157753', '101157844', '101157881', '101157905', '101157927', '101158001', '101158011', '101158025', '101158028', '101158034', '101158061', '101158081', '101158084', '101158103', '101158107', '101159736', '101160183', '101160203', '101160234', '101160373', '101160377', '101160381', '101160378', '101160451', '101160551', '101162202', '101162245', '101162247', '101162249', '101162492', '101162538', '101162585', '101162595', '101162627', '101162630', '101162634', '101162792', '101162848', '101162876', '101162904', '101164138', '101164337', '101165132', '101165133', '101165134'
blah blah)")
nchar(thing)
The output of nchar() is 4130. In reality, nchar() should be showing 7603.
Why would I do something like this in a script? In this case, it was a SQL query written into a script and being run via RStudio.
Even stranger is removing the "blah blah" at the top and bottom of the string:
thing <- paste0("'100869017', '100895297', '100937037', '100952542', '100953872', '100958290', '100977291', '100978521', '100982570', '100983764', '100986439', '100987969', '100988635', '100988637', '100989748', '100992594', '100998300', '100998306', '101000068', '101000556', '101002036', '101002550', '101002813', '101002871', '101002872', '101003492', '101003787', '101003789', '101003830', '101004348', '101004349', '101004400', '101004401', '101005323', '101005738', '101006388', '101006411', '101006413', '101006414', '101006416', '101006417', '101006419', '101006440', '101006441', '101006442', '101006443', '101006444', '101006445', '101006446', '101006447', '101006448', '101006449', '101006450', '101006451', '101006452', '101006453', '101006454', '101006455', '101006456', '101006457', '101006458', '101006554', '101006588', '101006608', '101008736', '101009658', '101011518', '101011680', '101011681', '101012457', '101012495', '101014157', '101014197', '101014240', '101014244', '101014248', '101014301', '101014302', '101014303', '101014304', '101014358', '101014480', '101014481', '101015219', '101017560', '101019383', '101019396', '101019454', '101019480', '101019481', '101019567', '101020977', '101022585', '101024007', '101024436', '101028376', '101028377', '101028405', '101029814', '101030739', '101030940', '101031364', '101031368', '101032356', '101032399', '101032440', '101032441', '101032442', '101032443', '101032444', '101032462', '101032468', '101032482', '101032483', '101032484', '101032485', '101032486', '101032487', '101032488', '101032489', '101032590', '101032591', '101032735', '101032987', '101033456', '101036227', '101037275', '101038196', '101038279', '101038930', '101038932', '101038938', '101039576', '101041116', '101041233', '101041288', '101042815', '101043166', '101043280', '101043281', '101043282', '101043285', '101043288', '101043307', '101043302', '101043329', '101043405', '101043837', '101045392', '101045635', '101046419', '101046440', 
'101046441', '101047082', '101047224', '101047227', '101047275', '101047281', '101047286', '101047287', '101047288', '101047290', '101047293', '101047295', '101047297', '101047304', '101048355', '101048439', '101048480', '101048905', '101048921', '101050905', '101052305', '101052442', '101052448', '101052449', '101052480', '101052481', '101052485', '101052487', '101052489', '101052522', '101052550', '101052551', '101053187', '101055017', '101055036', '101055039', '101055220', '101055258', '101055313', '101055316', '101055317', '101059567', '101060256', '101060554', '101060810', '101060817', '101061738', '101061739', '101061762', '101062369', '101062469', '101063528', '101063909', '101065440', '101065471', '101065473', '101065536', '101065760', '101065784', '101065805', '101065813', '101068343', '101068346', '101069329', '101069472', '101069478', '101069771', '101069871', '101069902', '101070895', '101071301', '101071303', '101072911', '101072914', '101072915', '101072944', '101072946', '101072949', '101072972', '101072981', '101072984', '101072985', '101073389', '101073806', '101074467', '101074469', '101074650', '101074709', '101074721', '101074869', '101075639', '101075881', '101075887', '101075888', '101076841', '101076843', '101076884', '101076885', '101076889', '101076930', '101077036', '101077872', '101077877', '101078006', '101078141', '101078834', '101079626', '101079624', '101079658', '101080128', '101080146', '101080341', '101080389', '101080732', '101080738', '101080931', '101081744', '101081745', '101082123', '101082443', '101082445', '101082447', '101085919', '101086763', '101086774', '101086801', '101086915', '101086964', '101086965', '101087006', '101088057', '101088465', '101089884', '101089915', '101089945', '101090159', '101090197', '101090225', '101090226', '101090227', '101090229', '101090293', '101091218', '101091232', '101091238', '101091239', '101091635', '101091655', '101092773', '101092997', '101093029', '101093064', '101093067', 
'101093255', '101093344', '101097283', '101097668', '101098444', '101098514', '101099068', '101099073', '101099076', '101099141', '101099170', '101099172', '101099173', '101099175', '101099177', '101099178', '101099194', '101099204', '101099206', '101099581', '101099666', '101100002', '101100179', '101100492', '101100617', '101101080', '101101088', '101101091', '101101092', '101101115', '101101117', '101101150', '101101158', '101102050', '101102086', '101102101', '101102108', '101102169', '101102650', '101102712', '101103376', '101106299', '101106618', '101107257', '101107277', '101108114', '101108119', '101108670', '101108702', '101108707', '101109772', '101109774', '101109779', '101111022', '101111029', '101113873', '101114376', '101114390', '101115163', '101115246', '101115247', '101115357', '101115358', '101116813', '101116819', '101116870', '101116877', '101118108', '101118175', '101118178', '101118277', '101118441', '101118449', '101118471', '101118505', '101118631', '101119051', '101119448', '101119914', '101120073', '101120076', '101120127', '101120292', '101120334', '101120387', '101120389', '101122367', '101122822', '101122881', '101122886', '101124670', '101124838', '101125490', '101125610', '101126329', '101127340', '101127341', '101127342', '101127343', '101127346', '101127347', '101127360', '101127853', '101127855', '101127856', '101127857', '101128128', '101128126', '101128132', '101128135', '101130135', '101131523', '101132622', '101132648', '101132850', '101132870', '101132931', '101132938', '101132990', '101132994', '101133104', '101133206', '101133248', '101134597', '101134599', '101134611', '101134649', '101134661', '101134704', '101134771', '101135221', '101135276', '101135278', '101135409', '101135444', '101135518', '101135630', '101135633', '101135632', '101137571', '101137750', '101137812', '101137875', '101138237', '101139907', '101139931', '101139968', '101140076', '101140148', '101140181', '101140250', '101140253', '101140460', 
'101140462', '101140466', '101140469', '101140518', '101150986', '101150987', '101150990', '101150994', '101150995', '101151373', '101151376', '101151416', '101151418', '101151419', '101151434', '101151437', '101151891', '101151974', '101151978', '101151979', '101151996', '101152030', '101152031', '101152032', '101152037', '101152062', '101152063', '101152066', '101152068', '101152069', '101152070', '101152072', '101152073', '101152074', '101152077', '101152078', '101152080', '101152081', '101152083', '101152085', '101152087', '101152088', '101152089', '101152100', '101152103', '101152105', '101153684', '101153944', '101153966', '101153996', '101153999', '101155013', '101155141', '101155149', '101155311', '101155560', '101155880', '101155882', '101155883', '101155884', '101155905', '101156458', '101156459', '101156511', '101156524', '101156546', '101156547', '101156596', '101156611', '101156641', '101156664', '101156752', '101156786', '101156801', '101156842', '101156885', '101156888', '101156892', '101157753', '101157844', '101157881', '101157905', '101157927', '101158001', '101158011', '101158025', '101158028', '101158034', '101158061', '101158081', '101158084', '101158103', '101158107', '101159736', '101160183', '101160203', '101160234', '101160373', '101160377', '101160381', '101160378', '101160451', '101160551', '101162202', '101162245', '101162247', '101162249', '101162492', '101162538', '101162585', '101162595', '101162627', '101162630', '101162634', '101162792', '101162848', '101162876', '101162904', '101164138', '101164337', '101165132', '101165133', '101165134'")
In this case the console hangs with the + as if awaiting further input.
Again, take those examples and put them directly into R and nchar() provides the correct count.
The worst part of this is that in the first example, the object is created but is truncated, and the final part of the string is retained, i.e. the ending "blah blah". This has resulted in SQL queries excluding some criteria - over 3400 characters worth of criteria!
If anyone were to say that this is an ugly way to write SQL queries, I'd agree. And I certainly wouldn't want a single string that long anywhere in my code, but there's a solid chance one could show up in a team environment where a user is less familiar with R and RStudio.
RStudio gives absolutely no indicator or warning that this is done as far as I can tell.
Is there any way to avoid this behavior besides splitting strings or sourcing SQL scripts or text files?
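Not a real fix, but one possible workaround, sketched under assumptions: the R documentation notes that command lines entered at the console are limited to about 4095 bytes, so very long lines sent to the console may be the trigger here (not confirmed for this RStudio version). If so, building the list from a vector keeps each source line short. The three IDs are the first ones from the question; the variable names are illustrative.

```r
# Keep the IDs in a character vector, split over as many short lines as needed
ids <- c(
  "100869017", "100895297",
  "100937037"
)

# Quote each ID and join with ", " to form the body of the SQL list
in_list <- paste0("'", ids, "'", collapse = ", ")

# Assemble the full string; "blah blah" stands in for the rest of the query
thing <- paste0("(blah blah\n", in_list, "\nblah blah)")

# Sanity check: 11 characters per quoted ID plus 2 per separator
stopifnot(nchar(in_list) == 11 * length(ids) + 2 * (length(ids) - 1))
```

Checking nchar() against the expected length, as in the last line, at least turns a silent truncation into an error.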

R regex match things other than known characters

For a text field, I would like to flag the values that contain invalid characters. The list of invalid characters is unknown; I only know the list of accepted ones.
For example, for French, the accepted list is
A-z, 1-9, [:punct:], space, àéèçè, hyphen, etc.
Since the list of invalid characters is unknown, I want anything unusual to surface. For example, I would want
This is an 2-piece à-la-carte dessert to pass, while
'Ã this Øs an apple' shows up as an anomaly.
The 'not contain' notion in R does not behave as I would like. For example,
grep("[^(abc)]",c("abcdef", "defabc", "apple") )
(meant as: those that do not contain 'abc') matches all three, while
grep("(abc)",c("abcdef", "defabc", "apple") )
behaves correctly and matches only the first two. Am I missing something?
How can we do that in R? Also, how can we include the hyphen in the list of accepted characters?
[a-z1-9[:punct:] àâæçéèêëîïôœùûüÿ-]+
The above regex matches any of the following (one or more times). Note that the parameter ignore.case=T used in the code below allows the following to also match uppercase variants of the letters.
a-z Any lowercase ASCII letter
1-9 Any digit in the range from 1 to 9 (excludes 0)
[:punct:] Any punctuation character
The space character
àâæçéèêëîïôœùûüÿ Any valid French character with a diacritic mark
- The hyphen character
See code in use here
x <- c("This is an 2-piece à-la-carte dessert", "Ã this Øs an apple")
gsub("[a-z1-9[:punct:] àâæçéèêëîïôœùûüÿ-]+", "", x, ignore.case=T)
The code above replaces all valid characters with nothing. The result is all invalid characters that exist in the string. The following is the output:
[1] "" "ÃØ"
If by "expose the invalid characters" you mean delete the "accepted" ones, then a regex character class should be helpful. From the ?regex help page we can see that the hyphen is already part of the punctuation class:
[:punct:]
Punctuation characters:
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
So the code could be:
x <- 'Ã this Øs an apple'
gsub("[A-z1-9[:punct:] àéèçè]+", "", x)
#[1] "ÃØ"
Note that regex has a predefined, locale-specific "[:alpha:]" named character class that would probably be both safer and more compact than the expression "[A-zàéèçè]" especially since the post from ctwheels suggests that you missed a few. The ?regex page indicates that "[0-9A-Za-z]" might be both locale- and encoding-specific.
If by "expose" you instead meant "identify the position within the string", then you could use the negation operator "^" inside the character class and apply gregexpr:
gregexpr("[^A-z1-9[:punct:] àéèçè]+", x)
[[1]]
[1] 1 8
attr(,"match.length")
[1] 1 1
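To tie this back to the two grep() calls in the question: inside square brackets, parentheses are ordinary characters, so "[^(abc)]" means "any single character other than (, a, b, c or )", not "does not contain abc". A small sketch of both ideas follows. The accepted set here is only an assumption pieced together from the lists in this thread, written with \u escapes so the code does not depend on file encoding, and it uses A-Za-z because the range A-z also covers the ASCII characters [ \ ] ^ _ and the backtick.

```r
v <- c("abcdef", "defabc", "apple")

# "Does not contain abc": negate the match result instead of the pattern
lacks_abc <- !grepl("abc", v, fixed = TRUE)

# The two sample strings from the question
x <- c("This is an 2-piece \u00e0-la-carte dessert",
       "\u00c3 this \u00d8s an apple")

# Negated class: matches any single character that is NOT in the accepted set
# (assumed set: ASCII letters, digits, punctuation, space, French accented letters)
not_accepted <- paste0(
  "[^A-Za-z0-9[:punct:] ",
  "\u00e0\u00e2\u00e6\u00e7\u00e9\u00e8\u00ea\u00eb",
  "\u00ee\u00ef\u00f4\u0153\u00f9\u00fb\u00fc\u00ff]"
)

# TRUE for strings holding at least one character outside the accepted set
has_invalid <- grepl(not_accepted, x)
print(has_invalid)
```

Uppercase accented letters are not in this assumed set, so add them to the class if they should pass.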

Is there a "quote words" operator in R? [duplicate]

This question already has answers here:
Does R have quote-like operators like Perl's qw()?
(6 answers)
Closed 5 years ago.
Is there a "quote words" operator in R, analogous to qw in Perl? qw is a quoting operator that allows you to create a list of quoted items without having to quote each one individually.
Here is how you would do it without qw (i.e. using dozens of quotation marks and commas):
#!/bin/env perl
use strict;
use warnings;
my @NAM_founders = ("B97", "CML52", "CML69", "CML103", "CML228", "CML247",
"CML322", "CML333", "Hp301", "Il14H", "Ki3", "Ki11",
"M37W", "M162W", "Mo18W", "MS71", "NC350", "NC358",
"Oh7B", "P39", "Tx303", "Tzi8",
);
print(join(" ", @NAM_founders)); # Prints array, with elements separated by spaces
Here's doing the same thing, but with qw it is much cleaner:
#!/bin/env perl
use strict;
use warnings;
my @NAM_founders = qw(B97 CML52 CML69 CML103 CML228 CML247 CML277
CML322 CML333 Hp301 Il14H Ki3 Ki11 Ky21
M37W M162W Mo18W MS71 NC350 NC358 Oh43
Oh7B P39 Tx303 Tzi8
);
print(join(" ", @NAM_founders)); # Prints array, with elements separated by spaces
I have searched but not found anything.
Try using scan and a text connection:
qw=function(s){scan(textConnection(s),what="")}
NAM=qw("B97 CML52 CML69 CML103 CML228 CML247 CML277
CML322 CML333 Hp301 Il14H Ki3 Ki11 Ky21
M37W M162W Mo18W MS71 NC350 NC358 Oh43
Oh7B P39 Tx303 Tzi8")
This will always return a vector of strings even if the data in quotes is numeric:
> qw("1 2 3 4")
Read 4 items
[1] "1" "2" "3" "4"
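A slightly tidier variant of the same idea, as a sketch: scan() accepts the string directly through its text argument, so no connection is left open, and quiet = TRUE suppresses the "Read n items" message. The shortened word list is only for illustration.

```r
# what = "" makes scan() return character values; whitespace (including
# newlines) separates the items
qw <- function(s) scan(text = s, what = "", quiet = TRUE)

nam <- qw("B97 CML52 CML69
Ki3 Ki11")
print(nam)
```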
I don't think you'll get much simpler, since space-separated bare words aren't valid syntax in R, even wrapped in curly brackets or parens. You've got to quote them.
For R, the closest thing that I can think of, or that I've found so far, is to create a single block of text and then break it up using strsplit, thus:
#!/bin/env Rscript
NAM_founders <- "B97 CML52 CML69 CML103 CML228 CML247 CML277
CML322 CML333 Hp301 Il14H Ki3 Ki11 Ky21
M37W M162W Mo18W MS71 NC350 NC358 Oh43
Oh7B P39 Tx303 Tzi8"
NAM_founders <- unlist(strsplit(NAM_founders,"[ \n]+"))
print(NAM_founders)
Which prints
[1] "B97" "CML52" "CML69" "CML103" "CML228" "CML247" "CML277" "CML322"
[9] "CML333" "Hp301" "Il14H" "Ki3" "Ki11" "Ky21" "M37W" "M162W"
[17] "Mo18W" "MS71" "NC350" "NC358" "Oh43" "Oh7B" "P39" "Tx303"
[25] "Tzi8"

grep on two strings

I'm trying to grab two different elements in a string.
The strings look like this:
str <- c('a_abc', 'b_abc', 'abc', 'z_zxy', 'x_zxy', 'zxy')
I have tried the different options in ?grep, but I can't get it right. I'm doing something like this:
grep('[_abc]:[_zxy]',str, value = TRUE)
and what I would like is:
[1] "a_abc" "b_abc" "z_zxy" "x_zxy"
Any help would be appreciated.
Use normal parentheses (, not the square brackets [
grep('_(abc|zxy)',str, value = TRUE)
[1] "a_abc" "b_abc" "z_zxy" "x_zxy"
To make the grep a bit more flexible, you could do something like:
grep('_.{3}$',str, value = TRUE)
Which will match an underscore _ followed by any character . three times {3} followed immediately by the end of the string $
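To make the difference between the two bracket types concrete, here is a small sketch on the vector from the question (renamed x so it does not shadow base R's str()). Square brackets define a set of single characters; parentheses group alternatives made of whole sequences.

```r
x <- c("a_abc", "b_abc", "abc", "z_zxy", "x_zxy", "zxy")

# Character class: any ONE of "_", "a", "b", "c" anywhere in the string
class_hits <- grep("[_abc]", x, value = TRUE)

# Group with alternation: "_" followed by the sequence "abc" or "zxy"
group_hits <- grep("_(abc|zxy)", x, value = TRUE)

# Underscore plus exactly three characters at the end of the string
suffix_hits <- grep("_.{3}$", x, value = TRUE)

print(class_hits)
print(group_hits)
```

The character class keeps "abc" as well, since that string contains the letters a, b and c, which is why it does not give the wanted result.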
This should work: grep('_abc|_zxy', str, value=T)
X|Y matches when either X matches or Y matches
In this case just doing:
str[grep("_",str)]
will work... is it more complicated in your specific case?
