R function to check for multiple texts in a string

I am facing a problem finding a solution in R.
I have to find the strings that contain any of these 4 texts:
1. " { M/s ",
2. " { M/s. ",
3. " ( S/O - ",
4. " ( W/O - "
and use the output in an if statement in R.
dd<- data.frame(narr=c("Ratnakar:LIMITED::::CNAAJPIOP0::::Ratnakar:LIMITED",
"BAR-BOKALAWA:::Kl RAM I:: { M/s. REJOICE CONFECTIONARS ::BARBOKALAWA:::Kl RAM I",
"P2A:::REFUND::: { M/s AANCHAL SAREES :::1(NETPREM KUMAR SINGH)",
"P2A:: SUNDER ( S/O - JITENDER PAL ::REFUND:::::rajdhani:lawn",
"SAA::PRUD:::P2A::::SAA::PRUD",
"SAA-NOON:MOO: RAJNI ( W/O - RAM NIVAS::P2A::REFUND::SAA:NOON:MOO",
"CMS.CAR:::SAA:::CMS::CAR"))
This runs fine: str_detect(dd$narr, " M/s | M/s.| W/O | C/O | S/O ")
But this does not run: str_detect(dd$narr, " { M/s | { M/s.| ( W/O | ( C/O | ( S/O ")
The error is:
Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)) :
Error in {min,max} interval. (U_REGEX_BAD_INTERVAL)
Please help me out.

Escape the metacharacters ({, ( and .) with backslashes:
str_detect(dd$narr, " \\{ M/s | \\{ M/s\\.| \\( W/O | \\( C/O | \\( S/O ")

?regexp says: Any metacharacter with special meaning may be quoted by preceding it with a backslash.
stringr::str_detect(dd$narr, " \\{ M/s | \\{ M/s\\.| \\( W/O | \\( C/O | \\( S/O ")
#[1] FALSE TRUE TRUE TRUE FALSE TRUE FALSE
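To feed that result into conditional logic, as the question asks, one option (a sketch, not from the original answer) is the vectorised ifelse(), or any() when if() needs a single TRUE/FALSE:
library(stringr)
pattern <- " \\{ M/s | \\{ M/s\\.| \\( W/O | \\( C/O | \\( S/O "
hits <- str_detect(dd$narr, pattern)
# Row-wise flag for each narration
dd$flag <- ifelse(hits, "match", "no match")
# if() needs a single logical value, so collapse with any()
if (any(hits)) {
  message("at least one narration matches")
}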


regex, R and [:punct:] in grep --> return items from a list not containing any [:punct:]

I have this list of strings:
stringg <- c("csv.asef", "ac ed", "asdf$", "asdf", "dasf]", "sadf {sadf")
If I want to get all strings containing special characters, like so:
grep("[:punct:]+", stringg, value = TRUE)
Result:
[1] "csv.asef" "ac ed"
What I should get is:
[1] "csv.asef" "asdf$" "dasf]" "sadf {sadf"
If I use:
grep("[!\\"#$%&’()*+,-./:;<=>?#[]^_`{|}~.]+", stringg, value = TRUE)
The result is an error.
I want these special characters: € ! " # $ % & ’ ( ) * + , - . / : ; < = > ? @ [ ] ^ _ ` { | } ~. which [:punct:] doesn't have
I know that if I want the strings not containing any of those characters, then I would use:
[^ € ! " # $ % & ’ ( ) * + , - . / : ; < = > ? # [ ] ^ _ ` { | } ~.]
but how do I do it with [:punct:]:
[^:punct:]?
[^:punct:]{0}?
and how could I combine ^[:punct:] | ^€ ?
many thanks
According to ?regex
Most metacharacters lose their special meaning inside a character class. To include a literal ], place it first in the list.
grep("[[:punct:]]+", stringg, value = TRUE)
#[1] "csv.asef" "asdf$" "dasf]" "sadf {sadf"
If we want the opposite, use invert = TRUE
grep("[[:punct:]€]", stringg, value = TRUE, invert = TRUE)
#[1] "ac ed" "asdf"

Including ASCII art in R

I'm writing a small program and wanted to know if there is a way to include ASCII art in R. I was looking for an equivalent of Python's triple quotes (""" or ''').
I tried using cat or print with no success.
Unfortunately, R can only represent literal strings using single or double quotes, which makes representing ASCII art awkward; however, you can do the following to get a text representation of your art that can be output using R's cat function.
1) First put your art in a text file:
# ascii_art.txt is our text file with the ascii art
# For test purposes we use the output of say("Hello") from cowsay package
# and put that in ascii_art.txt
library(cowsay)
writeLines(capture.output(say("Hello"), type = "message"), con = "ascii_art.txt")
2) Then read the file in and use dput:
art <- readLines("ascii_art.txt")
dput(art)
which gives this output:
c("", " -------------- ", "Hello ", " --------------", " \\",
" \\", " \\", " |\\___/|", " ==) ^Y^ (==",
" \\ ^ /", " )=*=(", " / \\",
" | |", " /| | | |\\", " \\| | |_|/\\",
" jgs //_// ___/", " \\_)", " ")
3) Finally in your code write:
art <- # copy the output of `dput` here
so your code would contain this:
art <-
c("", " -------------- ", "Hello ", " --------------", " \\",
" \\", " \\", " |\\___/|", " ==) ^Y^ (==",
" \\ ^ /", " )=*=(", " / \\",
" | |", " /| | | |\\", " \\| | |_|/\\",
" jgs //_// ___/", " \\_)", " ")
4) Now if we simply cat the art variable it shows up:
> cat(art, sep = "\n")
--------------
Hello
--------------
\
\
\
|\___/|
==) ^Y^ (==
\ ^ /
)=*=(
/ \
| |
/| | | |\
\| | |_|/\
jgs //_// ___/
\_)
Added
This is an addition several years later. In R 4.0 there is a new syntax that makes this even easier. See ?Quotes
Raw character constants are also available using a syntax similar
to the one used in C++: ‘r"(...)"’ with ‘...’ any character
sequence, except that it must not contain the closing sequence
‘)"’. The delimiter pairs ‘[]’ and ‘{}’ can also be used, and ‘R’
can be used in place of ‘r’. For additional flexibility, a number
of dashes can be placed between the opening quote and the opening
delimiter, as long as the same number of dashes appear between the
closing delimiter and the closing quote.
For example:
hello <- r"{
--------------
Hello
--------------
\
\
\
|\___/|
==) ^Y^ (==
\ ^ /
)=*=(
/ \
| |
/| | | |\
\| | |_|/\
jgs //_// ___/
\_)
}"
cat(hello)
giving:
--------------
Hello
--------------
\
\
\
|\___/|
==) ^Y^ (==
\ ^ /
)=*=(
/ \
| |
/| | | |\
\| | |_|/\
jgs //_// ___/
\_)
Alternative approach: use an API
Service: artii
Steps:
1. Fetch the data: ascii_request <- httr::GET("http://artii.herokuapp.com/make?text=this_is_your_text&font=ascii___")
2. Retrieve the response: ascii_response <- httr::content(ascii_request, as = "text", encoding = "UTF-8")
3. cat it out: cat(ascii_response)
If you are not connected to the web, you can set up your own server. Read more here.
Thanks to @johnnyaboh for setting up this amazing service.
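Putting those steps together in one snippet (a sketch only; the endpoint, text and font values are taken from the steps above, and the service may no longer be reachable):
library(httr)
# Ask the artii service to render the text in the requested font
ascii_request <- GET("http://artii.herokuapp.com/make?text=this_is_your_text&font=ascii___")
# Pull the body out as plain text and print it
ascii_response <- content(ascii_request, as = "text", encoding = "UTF-8")
cat(ascii_response)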
Try the cat function; something like this should work:
cat(" \"\"\" ")

parameterize UNIX statements

I have a bunch of UNIX statements that I want to loop over, using parameterized values for their calculations.
more /var/xacct_data/xxxx/log_flattener/xxxx/logfile_current | grep " F "   // E, I, D
xxxx = mpay, mmg, tvr
/var/xacct_data/faff/faff1/log_flattener/faffsnp1/logfile_current | grep " F "   // E, I, D also
/var/xacct_data/faff/faff1/log_flattener/faffdbt1/logfile_current | grep " F "   // E, I, D also
/var/xacct_data/faff/faff1/log_flattener/fafftxn1/logfile_current | grep " F "   // E, I, D also
/var/xacct_data/faff/faff2/log_flattener/faffdbt2/logfile_current | grep " F "   // E, I, D also
I want to store these paths in a file, read them from the file in a UNIX shell script, and run the commands on those paths while substituting some values into each path.
For example, in the top-most path in the block above, I want to replace the xxxx with the three values given: mpay, mmg and tvr. How do I go about it?
For every grep " F " I want to use E, I and D as parameters for the current path. How do I do that?
The left part of the pipe seems truncated, but for the grep side I think you are looking for
... | grep " [FEID] "
This should get you started. I won't write the entire script for you.
In bash, zsh, etc...
for directory in mpay mmg tvr; do
  for char in F E I D; do
    echo "Looking for lines containing ${char} in ${directory} directory..."
    grep "${char}" /var/xacct_data/${directory}/log_flattener/${directory}/logfile_current
  done
done
There is no need for more here; grep takes a filename as input.

How to strsplit using the '|' character? It behaves unexpectedly

I would like to split a character string at the pattern "|", but:
unlist(strsplit("I am | very smart", " | "))
[1] "I" "am" "|" "very" "smart"
or
gsub(pattern="|", replacement="*", x="I am | very smart")
[1] "*I* *a*m* *|* *v*e*r*y* *s*m*a*r*t*"
The problem is that by default strsplit interprets " | " as a regular expression, in which | has special meaning (as "or").
Use the fixed argument:
unlist(strsplit("I am | very smart", " | ", fixed=TRUE))
# [1] "I am" "very smart"
A side effect is faster computation.
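If you want to verify that speed claim yourself, here is a rough sketch (assuming the microbenchmark package is installed; exact timings will vary by machine):
microbenchmark::microbenchmark(
  regex = strsplit("I am | very smart", " \\| "),
  fixed = strsplit("I am | very smart", " | ", fixed = TRUE)
)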
stringr alternative:
unlist(stringr::str_split("I am | very smart", fixed(" | ")))
| is a metacharacter. You need to escape it (using \\ before it).
> unlist(strsplit("I am | very smart", " \\| "))
[1] "I am" "very smart"
> sub(pattern="\\|", replacement="*", x="I am | very smart")
[1] "I am * very smart"
Edit: The reason you need two backslashes is that the single-backslash prefix is reserved for special symbols such as \n (newline) and \t (tab). For more information, see the help page ?regex. The other metacharacters are . \ | ( ) [ { ^ $ * + ?
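The same doubling applies to any of those metacharacters, for example the dot (a quick sketch):
unlist(strsplit("3.14.159", "\\."))
#[1] "3" "14" "159"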
If you are parsing a table, then calling read.table might be a better option. Tiny example:
> txt <- textConnection("I am | very smart")
> read.table(txt, sep='|')
V1 V2
1 I am very smart
So I would suggest fetching the wiki page with RCurl, grabbing the interesting part of the page with XML (which also has a really neat function to parse HTML tables) and, if an HTML format is not available, calling read.table with the sep specified. Good luck!
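A rough sketch of that RCurl + XML route (the URL here is only a placeholder, and readHTMLTable() is the HTML-table helper alluded to above):
library(RCurl)
library(XML)
# Hypothetical page URL - substitute the wiki page you actually need
html <- getURL("https://en.wikipedia.org/wiki/Example", .encoding = "UTF-8")
# readHTMLTable() returns one data frame per <table> found in the page
tables <- readHTMLTable(html, stringsAsFactors = FALSE)
str(tables[[1]])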
Pipe '|' is a metacharacter, used as an 'OR' operator in regular expressions.
Try
unlist(strsplit("I am | very smart", "\\s+\\|\\s+"))

How to gather characters usage statistics in text file using Unix commands?

I have got a text file created using OCR software - about one megabyte in size.
Some uncommon characters appear all over the document and most of them are OCR errors.
I would like to find all characters used in the document to easily spot errors (like the uniq command, but for characters, not lines).
I am on Ubuntu.
What Unix command should I use to display all characters used in a text file?
This should do what you're looking for:
cat inputfile | sed 's/\(.\)/\1\n/g' | sort | uniq -c
The premise is that the sed puts each character in the file onto a line by itself, then the usual sort | uniq -c sequence strips out all but one of each unique character that occurs, and provides counts of how many times each occurred.
Also, you could append | sort -n to the end of the whole sequence to sort the output by how many times each character occurred. Example:
$ echo hello | sed 's/\(.\)/\1\n/g' | sort | uniq -c | sort -n
1
1 e
1 h
1 o
2 l
This will do it:
#!/usr/bin/perl -n
#
# charcounts - show how many times each code point is used
# Tom Christiansen <tchrist@perl.com>
use open ":utf8";
++$seen{ ord() } for split //;
END {
    for my $cp (sort { $seen{$b} <=> $seen{$a} } keys %seen) {
        printf "%04X %d\n", $cp, $seen{$cp};
    }
}
Run on itself, that program produces:
$ charcounts /tmp/charcounts | head
0020 46
0065 20
0073 18
006E 15
000A 14
006F 12
0072 11
0074 10
0063 9
0070 9
If you want the literal character and/or name of the character, too, that’s easy to add.
If you want something more sophisticated, this program figures out characters by Unicode property. It may be enough for your purposes, and if not, you should be able to adapt it.
#!/usr/bin/perl
#
# unicats - show character distribution by Unicode character property
# Tom Christiansen <tchrist@perl.com>
use strict;
use warnings qw<FATAL all>;
use open ":utf8";
my %cats;
our %Prop_Table;
build_prop_table();
if (@ARGV == 0 && -t STDIN) {
warn <<"END_WARNING";
$0: reading UTF-8 character data directly from your tty
\tSo please type stuff...
\t and then hit your tty's EOF sequence when done.
END_WARNING
}
while (<>) {
for (split(//)) {
$cats{Total}++;
if (/\p{ASCII}/) { $cats{ASCII}++ }
else { $cats{Unicode}++ }
my $gcat = get_general_category($_);
$cats{$gcat}++;
my $subcat = get_general_subcategory($_);
$cats{$subcat}++;
}
}
my $width = length $cats{Total};
my $mask = "%*d %s\n";
for my $cat(qw< Total ASCII Unicode >) {
printf $mask, $width => $cats{$cat} || 0, $cat;
}
print "\n";
my @catnames = qw[
L Lu Ll Lt Lm Lo
N Nd Nl No
S Sm Sc Sk So
P Pc Pd Ps Pe Pi Pf Po
M Mn Mc Me
Z Zs Zl Zp
C Cc Cf Cs Co Cn
];
#for my $cat (sort keys %cats) {
for my $cat (@catnames) {
next if length($cat) > 2;
next unless $cats{$cat};
my $prop = length($cat) == 1
? ( " " . q<\p> . $cat )
: ( q<\p> . "{$cat}" . "\t" )
;
my $desc = sprintf("%-6s %s", $prop, $Prop_Table{$cat});
printf $mask, $width => $cats{$cat}, $desc;
}
exit;
sub get_general_category {
my $_ = shift();
return "L" if /\pL/;
return "S" if /\pS/;
return "P" if /\pP/;
return "N" if /\pN/;
return "C" if /\pC/;
return "M" if /\pM/;
return "Z" if /\pZ/;
die "not reached one: $_";
}
sub get_general_subcategory {
my $_ = shift();
return "Lu" if /\p{Lu}/;
return "Ll" if /\p{Ll}/;
return "Lt" if /\p{Lt}/;
return "Lm" if /\p{Lm}/;
return "Lo" if /\p{Lo}/;
return "Mn" if /\p{Mn}/;
return "Mc" if /\p{Mc}/;
return "Me" if /\p{Me}/;
return "Nd" if /\p{Nd}/;
return "Nl" if /\p{Nl}/;
return "No" if /\p{No}/;
return "Pc" if /\p{Pc}/;
return "Pd" if /\p{Pd}/;
return "Ps" if /\p{Ps}/;
return "Pe" if /\p{Pe}/;
return "Pi" if /\p{Pi}/;
return "Pf" if /\p{Pf}/;
return "Po" if /\p{Po}/;
return "Sm" if /\p{Sm}/;
return "Sc" if /\p{Sc}/;
return "Sk" if /\p{Sk}/;
return "So" if /\p{So}/;
return "Zs" if /\p{Zs}/;
return "Zl" if /\p{Zl}/;
return "Zp" if /\p{Zp}/;
return "Cc" if /\p{Cc}/;
return "Cf" if /\p{Cf}/;
return "Cs" if /\p{Cs}/;
return "Co" if /\p{Co}/;
return "Cn" if /\p{Cn}/;
die "not reached two: <$_> " . sprintf("U+%vX", $_);
}
sub build_prop_table {
for my $line (<<"End_of_Property_List" =~ m{ \S .* \S }gx) {
L Letter
Lu Uppercase_Letter
Ll Lowercase_Letter
Lt Titlecase_Letter
Lm Modifier_Letter
Lo Other_Letter
M Mark (combining characters, including diacritics)
Mn Nonspacing_Mark
Mc Spacing_Mark
Me Enclosing_Mark
N Number
Nd Decimal_Number (also Digit)
Nl Letter_Number
No Other_Number
P Punctuation
Pc Connector_Punctuation
Pd Dash_Punctuation
Ps Open_Punctuation
Pe Close_Punctuation
Pi Initial_Punctuation (may behave like Ps or Pe depending on usage)
Pf Final_Punctuation (may behave like Ps or Pe depending on usage)
Po Other_Punctuation
S Symbol
Sm Math_Symbol
Sc Currency_Symbol
Sk Modifier_Symbol
So Other_Symbol
Z Separator
Zs Space_Separator
Zl Line_Separator
Zp Paragraph_Separator
C Other (means not L/N/P/S/Z)
Cc Control (also Cntrl)
Cf Format
Cs Surrogate (not usable)
Co Private_Use
Cn Unassigned
End_of_Property_List
my($short_prop, $long_prop) = $line =~ m{
\b
( \p{Lu} \p{Ll} ? )
\s +
( \p{Lu} [\p{L&}_] + )
\b
}x;
$Prop_Table{$short_prop} = $long_prop;
}
}
For example:
$ unicats book.txt
2357232 Total
2357199 ASCII
33 Unicode
1604949 \pL Letter
74455 \p{Lu} Uppercase_Letter
1530485 \p{Ll} Lowercase_Letter
9 \p{Lo} Other_Letter
10676 \pN Number
10676 \p{Nd} Decimal_Number
19679 \pS Symbol
10705 \p{Sm} Math_Symbol
8365 \p{Sc} Currency_Symbol
603 \p{Sk} Modifier_Symbol
6 \p{So} Other_Symbol
111899 \pP Punctuation
2996 \p{Pc} Connector_Punctuation
6145 \p{Pd} Dash_Punctuation
11392 \p{Ps} Open_Punctuation
11371 \p{Pe} Close_Punctuation
79995 \p{Po} Other_Punctuation
548529 \pZ Separator
548529 \p{Zs} Space_Separator
61500 \pC Other
61500 \p{Cc} Control
As far as using *nix commands goes, the answer above is good, but it doesn't give usage stats.
However, if you actually want stats (like the rarest used, the median, the most used, etc.) for the file, this Python should do it:
def get_char_counts(fname):
    # Count how many times each character occurs in the file
    usage = {}
    with open(fname) as f:
        for c in f.read():
            if c not in usage:
                usage[c] = 1
            else:
                usage[c] += 1
    return usage
