I'm new to Unix. I have a file with network connection details and I am trying to extract only the hostname and port number from it using a shell script. The data looks like this:
(example.easyway.com=(description=(address_list=(protocol=tcp)(host=184.43.35.345)(port=1234))(connect=port))
I have 100 lines of connection information like this. I have to extract only the hostname and port and write them to a new file. Can anyone guide me on how to do this?
There are different ways to do this in Unix, for example:
sed 's/^..\([^=]*\)=.*port=\([^)]*\).*/\1 \2/' file
If you do not understand this yet and want something easier, you can try it in steps, checking the output after each step:
cut -d= -f1,7 file | cut -d")" -f1 | cut -c2-
The easiest way, when you are unfamiliar with these tools, is to open the file in an editor, globally replace the string =(description=(address_list=(protocol=tcp)(host= with a space (or use regular expressions in your editor), do the same for ))(connect=port)), and then sit down for 10 minutes to edit the remaining part of the 100 lines.
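Those two global replacements can also be scripted; here is a minimal sed sketch of the same idea, assuming every line has exactly the structure shown above:
sed -e 's/=(description=(address_list=(protocol=tcp)(host=/ /' -e 's/))(connect=port))//' file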
That looks like Oracle TNS configuration to me. Presuming that host always comes before port, this call out to Perl would do the trick:
perl -ne 'print "$1:$2\n" if(/host=([\w\.-]+).*port=(\d+)/)' < my-tns-config.txt
If the order of port and host is unpredictable, use named captures (the original alternation would leave $1 and $2 empty whenever port comes first):
perl -ne 'print "$+{h}:$+{p}\n" if(/host=(?<h>[\w.-]+).*port=(?<p>\d+)/ || /port=(?<p>\d+).*host=(?<h>[\w.-]+)/)' < my-tns-config.txt
Check https://regex101.com/ or https://regexper.com for an explanation of those regular expressions.
M.
I have a downloaded webpage I would like to scrape using sed or awk. A colleague has told me that what I'm trying to achieve isn't possible with sed, and that is probably correct, seeing as he is a bit of a Linux guru.
What I am trying to achieve:
I am trying to scrape a locally stored HTML document for every value within a label of the class below, which appears hundreds of times on the page.
For example:
<label class="css-l019s7">54321</label>
<label class="css-l019s7">55555</label>
This label class never changes, so it seems the perfect anchor for scraping out the values:
54321
55555
There are hundreds of occurrences of this data and I need to get a list of them all.
As sed probably isn't capable of this, I would be forever grateful if someone could demonstrate awk or something else?
Thank you.
Things I've tried:
sed -En 's#(^.*<label class=\"css-l019s7\">)(.*)(</label>.*$)#\2#gp' TS.html
The code above managed to extract only about 40 of the 320 numbers. Because the pattern is anchored to the whole line, it captures at most one value per line (the last one, since the leading .* is greedy), and it misses labels that are split across lines.
Use a parser like xmllint:
xmllint --html --recover --xpath '//label[@class="css-l019s7"]/text()' TS.html
As an interest in sed was expressed (note that html can use newlines instead of spaces, for example, so this is not very robust):
sed 't a
s/<label class="css-l019s7">\([^<]*\)/\
\1\
/;D
:a
P;D' TS.html
Using awk:
awk '$1~/^label class="css-l019s7"$/ { print $2 }' RS='<' FS='>' TS.html
or:
awk '$1~/^[\t\n\r ]*label[\t\n\r ]/ &&
$1~/[\t\n\r ]class[\t\n\r ]*=[\t\n\r ]*"css-l019s7"[\t\n\r ]*([\t\n\r ]|$)/ {
print $2
}' RS='<' FS='>' TS.html
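With the two sample lines above, either command prints:
54321
55555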
someone could demonstrate AWK or something else?
This task seems to me a perfect fit for a CSS selector. If you are allowed to install tools, you might use Element Finder in the following way:
elfinder -s "label.css-l019s7"
which will search for labels with class css-l019s7 in the files in the current directory.
With grep you can extract the values (this assumes they are always exactly five digits):
grep -Eo '([[:digit:]]{5})' file
54321
55555
With awk you can constrain where the values are matched; here, to lines with label at the beginning or at the end:
awk '/^<label|\/label>$/ {if (match($0,/[[:digit:]]{5}/)) { pattern = substr($0,RSTART,RLENGTH); print pattern}}' file
54321
55555
using GNU awk and gensub:
awk '/label class/ && /css-l019s7/ { str=gensub("(<label class=\"css-l019s7\">)(.*)(</label>)","\\2",$0);print str}' file
Search for lines containing "label class" and "css-l019s7". Split the line into three sections and substitute the whole line with the second section, reading the result into the variable str. Print str.
I'm dealing with very big files (~10 GB) containing words with ASCII representations of Unicode characters:
Nuray \u00d6zdemir
Erol \u010colakovi\u0107 \u0160ehi\u0107
I want to transform them into Unicode before inserting them into a database, like this:
Nuray Özdemir
Erol Čolaković Šehić
I've seen how to do it with Vim, but it's very slow for very large files. I thought copying and pasting the regex into sed would work, but it doesn't.
I actually get things like this:
$ echo "Nuray \u00d6zdemir" | sed -E 's/\\\u(.)(.)(.)(.)/\x\1\x\2\x\3\x\4/g'
Nuray x0x0xdx6zdemir
How can I concatenate the \x and the value of \1 \2...?
I don't want to call echo or an external program for every line, given the size of the file; I want something efficient.
Assuming the code points in your file are all within the BMP (16-bit), how about:
perl -pe 'BEGIN {binmode(STDOUT, ":utf8")} s/\\u([0-9a-fA-F]{4})/chr(hex($1))/ge' input_file > output_file
Output:
Nuray Özdemir
Erol Čolaković Šehić
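Since the real file is ~10 GB, it is worth sanity-checking the command on a small sample first, e.g.:
head -n 1000 input_file | perl -pe 'BEGIN {binmode(STDOUT, ":utf8")} s/\\u([0-9a-fA-F]{4})/chr(hex($1))/ge' | less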
I have generated a 6 GB file to test the speed.
It took approx. 10 minutes to process the entire file on my 6-year-old laptop.
I hope it will be acceptable to you.
I am not a MongoDB expert at all, but what I can tell you is the following:
If there is a way to do the conversion at import time, directly within the DB engine, that solution should be used. If that feature is not available, you can either use a naive approach to solve it:
while read -r line; do echo -e "$line"; done < input_file
INPUT:
cat input_file
Nuray \u00d6zdemir
Erol \u010colakovi\u0107 \u0160ehi\u0107
OUTPUT:
Nuray Özdemir
Erol Čolaković Šehić
But as you have spotted yourself, executing the loop body once per line forces the shell through a full read/expand/execute cycle millions of times, which is far too slow for 10 GB files.
Or go for a smarter approach, using tools that should already be available in your distro, for example:
whatis ascii2uni
ascii2uni (1) - convert 7-bit ASCII representations to UTF-8 Unicode
Command:
ascii2uni -a U -q input_file
Nuray Özdemir
Erol Čolaković ᘎhić
You can also split the input file into pieces (e.g. with the split command), run the conversion step on each piece in parallel, and import each converted piece as soon as it is available, to shorten the total execution time.
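A minimal sketch of that approach, reusing the Perl one-liner from the answer above (the split size and file names are illustrative, and this backgrounds one job per piece):
split -l 1000000 input_file part_    # cut into pieces of 1M lines each
for f in part_*; do                  # convert all pieces in parallel
  perl -pe 'BEGIN {binmode(STDOUT, ":utf8")} s/\\u([0-9a-fA-F]{4})/chr(hex($1))/ge' "$f" > "$f.out" &
done
wait                                 # let all background jobs finish
cat part_*.out > output_file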
I have a file with the name
ROCKET_25_08:00.csv
I want to trim the name of the file to
ROCKET_25_.csv
I tried mv, but mv alone is not what I need, because there will be cases where there is more than one such file.
I want to keep the name up to the second _.
How can I do that in Unix?
Please advise.
There are some utilities that provide more flexible renaming, but one solution that uses nothing other than standard Unix tools (like sed) would be:
ls -d * | sed -re 's/^([^_]*_[^_]*_)(.*)(\....)$/mv -v \1\2\3 \1\3/' | bash
This will only work in one directory (it won't process subdirectories), and it will break on file names containing spaces or other shell metacharacters. You can drop the final | bash to preview the generated mv commands first.
It's not at all clear what you are actually trying to do, but if you just want to remove the text between the last underscore and the period, you can do:
f=ROCKET_25_08:00.csv
echo ${f%_*}_.csv
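To actually rename every matching file, here is a minimal sketch using the same parameter expansion (assuming all the affected files end in .csv and are in the current directory):
for f in *_*_*.csv; do
  new="${f%_*}_.csv"            # drop everything after the last underscore
  [ "$f" = "$new" ] && continue # skip names that are already trimmed
  mv -v -- "$f" "$new"
done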
I want to remove lots of temporary PS datasets with dataset names like MYTEST.**, but I still can't find an easy way to handle the task.
I meant to use a shell command like the one below to remove them:
cat "//'dataset.list'"| xargs -I '{}' tsocmd "delete '{}'"
However, first I have to save the dataset list into a PS dataset or a Unix file. In Unix, we can redirect the output of ls into a text file ("ls MYTEST.* > dslist"), but in TSO or on an ISPF panel there seems to be no simple command to do that.
Does anyone have any clue on this? Your comments would be appreciated.
The Rexx ISPF option is probably the easiest and can be reused in the future, but options include:
Use the save command in ISPF 3.4 to save the list to a file, then run a Rexx program against the file created by the save command
The listcat command, in particular
listcat lvl(MYTEST) ofile(ddname)
then write a rexx program to do the actual delete
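A minimal sketch of that last step (assuming the listcat output went to a pre-allocated DD named DSLIST, and that the dataset name is the third blank-delimited word of each line, as the grep/cut pipeline further down also assumes):
/* Rexx sketch: delete every dataset named in DD DSLIST */
"EXECIO * DISKR DSLIST (STEM line. FINIS"
do i = 1 to line.0
  parse var line.i . . dsn .        /* 3rd word = dataset name */
  if dsn <> '' then "DELETE '"dsn"'"
end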
Alternatively you can use the ISPF services LMDINIT, LMDLIST and LMDFREE in a Rexx program running under ISPF, i.e.:
/* Rexx ispf program to process datasets */
Address ispexec
"LMDINIT LISTID(lidv) LEVEL(MYTEST)"
"LMDLIST LISTID("lidv") OPTION(list) dataset(dsvar) stats(yes)"
do while rc = 0
/* Delete or whatever */
end
"LMDFREE LISTID("lidv")"
For all these methods you need to fully specify the first high-level qualifier.
Learning Rexx / ISPF will serve you well into the future. In the ISPF editor, you can use the model command to get templates / information for all the ISPF services:
Command ====> Model LMDINIT
will add a template for the LMDINIT service. There are templates for Rexx, COBOL, PL/I, ISPF panels, ISPF skeletons, messages, etc.
Thanks Bruce for the comprehensive answer. Following Bruce's tips, I worked out a one-line shell command, as below:
tsocmd "listcat lvl(MYTEST) " | grep -E "MYTEST(\..+)+" | cut -d' ' -f3 | xargs -I '{}' tsocmd "delete '{}'"
The above command works perfectly.
Update - The IDCAMS DELETE command has had the MASK operand for a while. You use it like:
DELETE 'MYTEST.**' MASK
It is documented in the z/OS 2.1 DFSMS Access Method Services documentation, under the DELETE command.
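With that operand, the whole pipeline above collapses to a single call from the z/OS UNIX shell (assuming your TSO DELETE accepts the MASK operand; try it on a test qualifier first):
tsocmd "DELETE 'MYTEST.**' MASK"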
I have a text file (more correctly, a “German style“ CSV file, i.e. semicolon-separated, decimal comma) which has a date and the value of a measurement on each line.
There are stretches of faulty values which I want to remove before further work. I'd like to store these cuts in some script so that my corrections are documented and I can replay those corrections if necessary.
The lines look like this:
28.01.2005 14:48:38;5,166
28.01.2005 14:50:38;2,916
28.01.2005 14:52:38;0,000
28.01.2005 14:54:38;0,000
(long stretch of values that should be removed; could also be something else beside 0)
01.02.2005 00:11:43;0,000
01.02.2005 00:13:43;1,333
01.02.2005 00:15:43;3,250
Now I'd like to store a list of begin and end patterns like 28.01.2005 14:52:38 + 01.02.2005 00:11:43, and the script would cut the lines matching these begin/end pairs and everything that's between them.
I'm thinking about hacking an awk script, but perhaps I'm missing an already existing tool.
Have a look at sed:
sed '/start_pat/,/end_pat/d'
will delete lines between start_pat and end_pat (inclusive).
To delete multiple such pairs, you can combine them with multiple -e options:
sed -e '/s1/,/e1/d' -e '/s2/,/e2/d' -e '/s3/,/e3/d' ...
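For example, with the sample data above (dots escaped, and assuming each timestamp occurs only once; data.csv is a placeholder name for your file), the first cut would be:
sed '/28\.01\.2005 14:52:38/,/01\.02\.2005 00:11:43/d' data.csv
Since you want the corrections documented and replayable, you can also keep one /begin/,/end/d line per cut in a sed script file and apply it with -f:
sed -f cuts.sed data.csv > data.cleaned.csv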
Firstly, why do you need to keep a record of what you have done? Why not keep a backup of the original file, take a diff between the old and new files, or put it under source control?
For the actual changes I suggest using Vim.
The Vim :global command (abbreviated to :g) can be used to run :ex commands on lines that match a regex. This is in many ways more powerful than awk since the commands can then refer to ranges relative to the matching line, plus you have the full text processing power of Vim at your disposal.
For example, this will do something close to what you want (untested, so caveat emptor):
:g!/^\d\d\.\d\d\.\d\d\d\d/ -1 write >> tmp.txt | delete
This matches lines that do NOT start with a date (the ! negates the match), appends the previous line to the file tmp.txt, and then deletes the current line.
You will probably end up with duplicate lines in tmp.txt, but they can be removed by running the file through uniq.
You can also use awk; skip the lines between (and including) the start and end patterns and print everything else:
awk '/start/,/end/ {next} {print}' file
I would seriously suggest learning the basics of perl (i.e. not the OO stuff). It will repay you in bucket-loads.
It is fast and simple to write a bit of perl to do this (and many other such tasks) once you have grasped the fundamentals, which if you are used to using awk, sed, grep etc are pretty simple.
You won't have to remember how to use lots of different tools, and where you would previously have piped multiple tools together to solve a problem, you can just use a single Perl script (usually much faster to execute).
And, perl is installed on virtually every unix/linux distro now.
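For example, the same range cut as in the sed answer, written in Perl with the flip-flop operator (timestamps from the sample above; data.csv is a placeholder file name):
perl -ne 'print unless /28\.01\.2005 14:52:38/ .. /01\.02\.2005 00:11:43/' data.csv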
(that sed is neat though :-)
use grep -v (print non-matching lines)
Sorry - I thought you just wanted the lines without 0,000 at the end