Regular expression to match IDs and classes in CSS page - css

I'm trying to analyze HTML code and extract all CSS classes and IDs from the source. So I need to extract whatever is between two quotation marks, which can be preceded by either class or id:
id="<extract this>"
class="<extract this>"

/(?:id|class)="([^"]*)"/gi
Replacement expression: $1
This regex in English: match either "id" or "class", then an equals sign and a quote, then capture everything that is not a quote, up to the closing quote. Do this globally and case-insensitively.
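For readers who want to try the pattern outside a regex tester, here is a minimal Python sketch. The function name and the sample markup are made up for illustration; `re.findall` plays the role of the global flag and `re.IGNORECASE` that of the `i` flag.

```python
import re

# Same pattern as above; group 1 captures the attribute value.
ATTR_RE = re.compile(r'(?:id|class)="([^"]*)"', re.IGNORECASE)

def extract_ids_and_classes(html):
    """Return every double-quoted id/class value, in document order."""
    return ATTR_RE.findall(html)

sample = '<div id="main" class="box wide"><p CLASS="note">hi</p></div>'
print(extract_ids_and_classes(sample))  # ['main', 'box wide', 'note']
```

Note that this only sees double-quoted values written without spaces around the equals sign, and a class value holding several class names comes back as one string.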

Since you prefer using a regular expression, here is one way:
\b(?:id|class)\s*=\s*"([^"]*)"
Regular expression:
\b # the boundary between a word char (\w) and not a word char
(?: # group, but do not capture:
id # 'id'
| # OR
class # 'class'
) # end of grouping
\s* # whitespace (\n, \r, \t, \f, and " ") (0 or more times)
= # '='
\s* # whitespace (\n, \r, \t, \f, and " ") (0 or more times)
" # '"'
( # group and capture to \1:
[^"]* # any character except: '"' (0 or more times)
) # end of \1
" # '"'

You may want to try this:
<?php
$css = <<< EOF
id="<extract this>"
class="<extract this>"id="<extract this2>"
class="<extract this3>"id="<extract this4>"
class="<extract this5>"id="<extract this6>"
class="<extract this7>"id="<extract this8>"
class="<extract this9>"
EOF;
preg_match_all('/(?:id|class)="(.*?)"/sim', $css , $classes, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($classes[1]); $i++) {
echo $classes[1][$i]."\n";
}
/*
<extract this>
<extract this>
<extract this2>
<extract this3>
<extract this4>
<extract this5>
<extract this6>
<extract this7>
<extract this8>
<extract this9>
*/
?>
DEMO:
http://ideone.com/Nr9FPt


Find closing parenthesis with regex in r

I have several strings with unmatched parentheses (opened but never closed, or closed but never opened). I managed to remove the opening parenthesis when there is no closing one, but I cannot manage to remove the closing parenthesis when there is no opening one. I want to leave strings with matching parentheses alone.
string1 = "This (is solved"
string2 = "This is (fine)"
string3 = "This is the problem)"
This is what I used to handle the first problem case (an opening parenthesis with no closing one):
str_remove(data, "[(](?!.*[)])")
But I cannot seem to turn it around. The following grabs all closing parentheses, but not the one without an opening one.
"(?!.*[(])[)]"
Any ideas are appreciated!
If you do not need to handle nested paired (balanced) parentheses, you can use
gsub("(\\([^()]*\\))|[()]", "\\1", string)
See the regex demo. Details:
(\([^()]*\)) - Group 1 (\1 refers to this group value): (, then zero or more chars other than ( and ), and then a ) char
| - or
[()] - a ( or ) char.
See the R demo:
x <- c("This (is solved", "This is (fine)", "This is the problem)")
gsub("(\\([^()]*\\))|[()]", "\\1", x)
# => [1] "This is solved" "This is (fine)" "This is the problem"
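The same idea carries over to other regex engines. Below is a small Python sketch of the non-nested case; the function name is hypothetical, and a callable replacement is used so the behaviour does not depend on how an unmatched group is expanded.

```python
import re

# Group 1 matches a balanced, non-nested (...) pair; the alternative matches a stray paren.
PAIR_OR_STRAY = re.compile(r"(\([^()]*\))|[()]")

def drop_unpaired_parens(text):
    # Put back the balanced pair when group 1 matched, otherwise delete the stray paren.
    return PAIR_OR_STRAY.sub(lambda m: m.group(1) or "", text)

for s in ["This (is solved", "This is (fine)", "This is the problem)"]:
    print(drop_unpaired_parens(s))
# This is solved
# This is (fine)
# This is the problem
```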
If the parentheses can be nested, you can use
gsub("(\\((?:[^()]++|(?1))*\\))|[()]", "\\1", string, perl=TRUE)
See this regex demo. Details:
(\((?:[^()]++|(?1))*\)) - Group 1:
\( - a ( char
(?:[^()]++|(?1))* - zero or more sequences of either one or more chars other than ( and ), or the whole Group 1 pattern that is recursed
\) - a ) char
|[()] - or a ( / ) char.
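Engines without recursion support (Python's built-in `re`, for instance) cannot run the nested pattern. A stack-based scan is a different technique that aims for the same result; the sketch below is an illustration under that assumption, not a translation of the regex, and the function name is made up.

```python
def drop_unbalanced_parens(text):
    """Remove parentheses that have no partner, keeping nested balanced pairs."""
    keep = [True] * len(text)
    open_positions = []              # indexes of "(" still waiting for a ")"
    for i, ch in enumerate(text):
        if ch == "(":
            open_positions.append(i)
        elif ch == ")":
            if open_positions:
                open_positions.pop()  # pairs with the most recent opener
            else:
                keep[i] = False       # closer with no opener
    for i in open_positions:
        keep[i] = False               # openers that were never closed
    return "".join(ch for ch, k in zip(text, keep) if k)

print(drop_unbalanced_parens("a (b (c) d"))        # a b (c) d
print(drop_unbalanced_parens("This is ((fine))"))  # This is ((fine))
```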

Regex pattern for a number inside square bracket

I am trying to replace strings which start with a number inside a square bracket with a word.
[1]FirstWord!FF >> FirstReplace!!FF
[2]SecondWord!FF >> SecondReplace!!FF
This code only works if there is no [number] at the beginning.
'test!FF' %>% str_replace_all("test(['\"])?!","replace!")
But how can I update the pattern if I have something like below:
'[1]test!FF' %>% str_replace_all("[1]test(['\"])?!","replace!")
You want
'[1]test!FF' %>% str_replace_all("\\[\\d+]test['\"]?!", "replace!")
## => [1] "replace!FF"
Note: If the [number] part is optional, wrap it with an optional non-capturing group: "(?:\\[\\d+])?test['\"]?!".
Details:
\[ - a [ char
\d+ - one or more digits
] - a ] char
test - a test word
['\"]? - an optional " or '
! - a ! char.
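The pattern is not specific to stringr. As a rough cross-check, here is the same substitution in Python; the variable names are illustrative only, and the second pattern is the optional-prefix variant from the note above.

```python
import re

# "[" must be escaped; the closing "]" needs no escape outside a character class.
WITH_NUMBER = re.compile(r"""\[\d+]test['"]?!""")
# Variant where the [number] prefix is optional.
OPTIONAL_NUMBER = re.compile(r"""(?:\[\d+])?test['"]?!""")

print(WITH_NUMBER.sub("replace!", "[1]test!FF"))      # replace!FF
print(OPTIONAL_NUMBER.sub("replace!", "test!FF"))     # replace!FF
print(OPTIONAL_NUMBER.sub("replace!", "[12]test'!FF"))  # replace!FF
```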

calculate the percentage of not null recs in a file in unix

How do I figure out the percentage of non-null records in my file in UNIX?
My file looks like the sample below. I want to know the number of records and the percentage of non-null records. I tried a whole lot of grep and cut commands, but nothing seems to work. Can anyone help me here, please?
"name","country","age","place"
"sam","US","30","CA"
"","","",""
"joe","UK","34","BRIS"
,,,,
"jake","US","66","Ohio"
Perl solution:
#!/usr/bin/perl
use warnings;
use strict;
use 5.012; # say, keys @arr
use Text::CSV_XS qw{ csv };
my ($count_all, @count_nonempty);
csv(in => shift,
    out => \ 'skip',
    headers => 'skip',
    on_in => sub {
        my (undef, $columns) = @_;
        ++$count_all;
        length $columns->[$_] and $count_nonempty[$_]++
            for 0 .. $#$columns;
    },
);
for my $column (keys @count_nonempty) {
    # "// 0" covers a column that was empty in every record.
    say "Column ", 1 + $column, ": ",
        100 * ($count_nonempty[$column] // 0) / $count_all, '%';
}
It uses Text::CSV_XS to read the CSV file. It skips the header line, and for each subsequent line it calls the callback specified in on_in, which increments the count of all lines and, for each column, the count of non-empty fields whenever the field has a non-zero length.
Along with choroba, I would normally recommend using a CSV parser on CSV data.
But in this case, all we want to look for is that a record contains any character that is not a comma or quote: if a record contains only commas and/or quotes, it is a "null" record.
awk '
/[^",]/ {nonnull++}
END {printf "%d / %d = %.2f\n", nonnull, NR, nonnull/NR}
' file
To handle leading/trailing whitespace
awk '
{sub(/^[[:blank:]]+/,""); sub(/[[:blank:]]+$/,"")}
/[^",]/ {nonnull++}
END {printf "%d / %d = %.2f\n", nonnull, NR, nonnull/NR}
' file
If a record whose fields contain only whitespace, such as
" ","",,," "
should also count as a null record, we can simply ignore all whitespace:
awk '
/[^",[:blank:]]/ {nonnull++}
END {printf "%d / %d = %.2f\n", nonnull, NR, nonnull/NR}
' file
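For comparison, the per-record count can also be done with a CSV parser. The Python sketch below is an illustration under two assumptions that differ from the awk commands: the header line is excluded from the total, and the result is reported as a percentage. The function name is made up.

```python
import csv

def non_null_stats(lines, has_header=True):
    """Return (non_null, total, percent) for an iterable of CSV lines."""
    rows = list(csv.reader(lines))
    if has_header:
        rows = rows[1:]                      # assumption: first line is a header
    total = len(rows)
    # A record is non-null if any field holds something besides whitespace.
    non_null = sum(1 for row in rows if any(field.strip() for field in row))
    percent = 100.0 * non_null / total if total else 0.0
    return non_null, total, percent

sample_lines = [
    '"name","country","age","place"\n',
    '"sam","US","30","CA"\n',
    '"","","",""\n',
    '"joe","UK","34","BRIS"\n',
    ',,,,\n',
    '"jake","US","66","Ohio"\n',
]
print(non_null_stats(sample_lines))  # (3, 5, 60.0)
```

To run it on a real file, pass an open file object instead of the list, for example `non_null_stats(open("file", newline=""))`.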

Why does XQuery add an extra space?

XQuery adds a space and I don't understand why. I have the following simple query :
declare option saxon:output "method=text";
for $i in 1 to 10
return concat(".", $i, " ", 100, "
", ".")
I ran it with Saxon (SaxonEE9-5-1-8J and SaxonHE9-5-1-8J):
java net.sf.saxon.Query -q:query.xq -o:result.txt
The result is the following:
.1 100
. .2 100
. .3 100
. .4 100
. .5 100
. .6 100
. .7 100
. .8 100
. .9 100
. .10 100
.
My question is about the extra space between the dots. The first line is OK, but the following lines (2 to 10) have that space and I don't understand why. What looks like a space between the digits is in fact a tab inserted by the character reference.
Could you enlighten me about that behavior?
PS: I have added saxon as a tag for the question even if the question is not specific to Saxon.
I think your query returns a sequence of string values which are then by default concatenated with a space (see http://www.w3.org/TR/xslt-xquery-serialization/#sequence-normalization where it says "For each subsequence of adjacent strings in S2, copy a single string to the new sequence equal to the values of the strings in the subsequence concatenated in order, each separated by a single space"). If you don't want that then you can use
string-join(for $i in 1 to 10
return concat(".", $i, " ", 100, "
", "."), '')
The space between the dots is a separator introduced between the items of the sequence that you are constructing. It would seem that Saxon's text serializer inserts that space character on output so that you can tell the individual items apart.
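The effect of that separator rule can be mimicked outside XQuery. The Python sketch below is only an analogy (it does not call Saxon or any XQuery engine): it builds the same kind of strings for three items and joins them once with a single space, as sequence normalization does for adjacent strings, and once with an empty separator, as string-join(..., '') does.

```python
# Each item looks like one result of concat(".", $i, tab, 100, newline, ".").
items = [".{}\t100\n.".format(i) for i in range(1, 4)]

serialized_default = " ".join(items)  # adjacent strings separated by one space
serialized_joined = "".join(items)    # explicit join with an empty separator

print(repr(serialized_default))  # '.1\t100\n. .2\t100\n. .3\t100\n.'
print(repr(serialized_joined))   # '.1\t100\n..2\t100\n..3\t100\n.'
```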
Considering your code:
declare option saxon:output "method=text";
for $i in 1 to 10
return
concat(".", $i, " ", 100, "
", ".")
The result of for $i in 1 to 10 return is a sequence of 10 xs:string items. From your output you can determine that the space is interspersed between each evaluation of concat(".", $i, " ", 100, "
", ".").
If you want to check that you can rewrite your query as:
for $i in 1 to 10
return
<x>{concat(".", $i, " ", 100, "
", ".")}</x>
And you will see your 10 distinct items with no spaces between.
If you are trying to create a single text string, as you are already controlling the line-breaks, then you could also join all of the 10 xs:string items together yourself, which would have the effect of eliminating the spaces you are seeing between the sequence items. For example:
declare option saxon:output "method=text";
string-join(
for $i in 1 to 10
return
(".", string($i), " ", "100", "
", ".")
, "")

awk field separator , when the separator shows up in double quote

I am trying to use awk to read the third field ($3) of some input; field 3 is a string:
awk -F'","' '{print $1}' input.txt
My file input.txt looks like this:
field1,field2,field3,field4,field5
The problem is that these fields are separated by commas, and some of them are double-quoted while others are not. Field 5 is double-quoted and can contain every type of symbol. Example:
imfield1,imfield2,"imfield3",imfield4,"im"",""fi"",el,""d5"
Can awk handle a situation like this?
More generally, how can I get the whole string by typing $5?
You can use Lorance Stinson's Awk CSV parser, in which case it's as simple as:
function parse_csv(..) {
..
}
{
num_fields = parse_csv($0, csv, ",", "\"", "\"", "\\n", 1);
print csv[2]
}
If you're not hell-bent on Awk, Python also comes with a nice CSV parser:
import csv, sys
for row in csv.reader(sys.stdin):
    print(row[2])
Or from the command line (bit tricky in one line):
python -c 'import csv,sys;[sys.stdout.write(row[2]+"\n") for row in csv.reader(sys.stdin)]' < input.txt
The separator is a simple comma, not a comma between quotes. If the fields do not contain commas, then awk may be up for the task:
awk -F , '
{
    if ($3 ~ /^".*"$/) {
        $3 = substr($3, 2, length($3)-2);   # strip the surrounding quotes
        gsub(/""/, "\"", $3);               # a doubled quote is an escaped quote
    }
    print $3;
}' input.txt
This is already getting pretty complicated. If there can be commas inside fields, use a proper CSV parser, for example in Perl or Python. See https://unix.stackexchange.com/questions/7425/is-there-a-robust-command-line-tool-for-processing-csv-files
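To make the "proper CSV parser" route concrete, here is a minimal Python sketch that feeds the example line from the question to the standard csv module. The variable names are illustrative; field 5 is index 4 because Python lists are zero-based.

```python
import csv

line = 'imfield1,imfield2,"imfield3",imfield4,"im"",""fi"",el,""d5"'

# csv.reader accepts any iterable of lines; here it is a one-element list.
fields = next(csv.reader([line]))

print(len(fields))  # 5
print(fields[2])    # imfield3
print(fields[4])    # im","fi",el,"d5
```

The parser strips the surrounding quotes, turns each doubled quote back into a single one, and keeps the embedded commas inside field 5.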
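You can parse the line in awk by setting an empty field separator, so that each character becomes its own field (this works in gawk and mawk; POSIX leaves an empty FS unspecified). Instead of calling printf("%s",$i), you can append $i to a variable and print it when inda == 0.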
# echo "\"AAA,BBB\",\"CCC\",\"DDD, EEE, FFF\"" > uno
awk 'BEGIN { FS="" }
{
    for (i = 1; i <= NF; i++) {      # <= so the last character is not dropped
        if ($i == "\"")
            inda = !inda             # toggle the inside-quotes flag
        if ($i == "," && inda == 0)
            $i = "|"                 # comma outside quotes: swap the separator
        printf("%s", $i)
    }
    printf("\n")
}' uno
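The quote-toggling idea is not tied to awk. Below is a rough Python sketch of the same state machine that splits on separators outside quotes instead of rewriting them; the function name is hypothetical, and, like the awk version, it keeps the quote characters and does no unescaping.

```python
def split_outside_quotes(line, sep=","):
    """Split on sep, ignoring any sep that sits between double quotes."""
    fields, current, in_quotes = [], [], False
    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes   # toggle on every quote character
            current.append(ch)
        elif ch == sep and not in_quotes:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))      # last field has no trailing separator
    return fields

print(split_outside_quotes('"AAA,BBB","CCC","DDD, EEE, FFF"'))
# ['"AAA,BBB"', '"CCC"', '"DDD, EEE, FFF"']
```

A doubled quote toggles the flag twice, so escaped quotes inside a field do not confuse the split.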
