I am regularly given a large HTML report from another department that requires a fair amount of manual work to edit into the required format.
I'd like to work a bit smarter. I can download the page via:
wget -qO- https://the_page.html
However, I just want to carve out a table that begins:
<!-- START Daily Keystroke
It goes on and on for many lines of html and always ends:
</table>
</div>
</div>
before the next load of data begins. I need everything in between these patterns in one chunk of text/file.
I played around with sed and awk, which I am not really familiar with, but it seems that without knowing how many lines will be in the file each time, these tools are not appropriate for this task. Something that works on specific patterns rather than line counts seems more suitable.
That being the case, I can potentially install other utilities. Does anyone have experience of something that might work?
I played around with sed and awk
Be warned that these are best suited to things that can be described by regular expressions, and HTML cannot be. HTML parsers are the tools designed for HTML documents. In general, you should avoid using regular expressions to deal with Chomsky Type-2 (context-free) languages.
That being the case I can install other utilities potentially. If
anyone has any experience of something that might work?
I suggest trying hxselect, as it allows easy extraction of the element(s) matching a CSS selector. It reads from stdin, so you can pipe output into it. Consider the following example: I want to download the www.example.com page and extract its title tag, so I can do:
wget -q -O - https://www.example.com | hxselect -i 'title'
If you encounter some ill-formed HTML, you can run it through hxclean first, which will try to make it acceptable to hxselect, like so:
wget -q -O - https://www.example.com | hxclean | hxselect -i 'title'
If either of the above works with your URL, then you can start looking for a CSS selector that describes only the table you want to extract. See the CSS selectors reference for the available features. I am unable to craft the selector without seeing the whole source of the page.
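As a very rough sketch only (the div id below is a pure assumption, since the real page source isn't shown), the final command might end up looking something like:
wget -qO- https://the_page.html | hxclean | hxselect -i 'div#daily-keystroke table'
If the surrounding markup has no usable id or class, you will need to inspect the downloaded source and build the selector from whatever structure is actually there.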
Suggestion: use gawk to cut the input at the first multi-line record, then use sed to trim the head up to the <!-- START Daily Keystroke marker.
gawk 'NR==1{print}' RS="</table>\n</div>\n</div>" input.html | sed '0,/<!-- START Daily Keystroke/d'
Or, without an intermediate file:
wget -qO- https://the_page.html | \
gawk 'NR==1{print}' RS="</table>\n</div>\n</div>" | \
sed '0,/<!-- START Daily Keystroke/d'
This script was tested against the provided sample text.
gawk Explanation:
The gawk script cuts the input text at the first occurrence of:
</table>
</div>
</div>
Aligned to the left margin.
NR==1{print}
Print gawk record number 1 only.
The first record consists of all the text (possibly many lines) up to the first match of the pattern in the RS variable.
RS="</table>\n</div>\n</div>"
A regular expression (regexp) that gawk matches as the multi-line record separator.
In case the closing tags are indented with whitespace in the input, try:
</table>
</div>
</div>
RS="[[:space:]]*</table>[[:space:]]*\n[[:space:]]*</div>[[:space:]]*\n[[:space:]]*</div>"
sed Explanation:
Remove all lines up to and including the first occurrence of the regexp <!-- START Daily Keystroke.
0,/<!-- START Daily Keystroke/
A sed line range, starting from line 0 (a GNU sed extension) and ending at the first line that matches <!-- START Daily Keystroke.
d
Delete all lines in the range.
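To see the whole pipeline end to end, here is a minimal self-contained check; the sample HTML below is my own abbreviated stand-in for the report, not the real page:
cat <<'EOF' > input.html
<div>
<!-- START Daily Keystroke -->
<table>
<tr><td>42</td></tr>
</table>
</div>
</div>
<p>next load of data</p>
EOF
gawk 'NR==1{print}' RS="</table>\n</div>\n</div>" input.html | sed '0,/<!-- START Daily Keystroke/d'
This prints only the lines between the marker comment and the closing tags; the closing </table>, </div>, </div> themselves are consumed as the record separator.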
Related
I need some help with using sed in Unix.
I need to use the standard Unix command sed to process the input stream and remove all HTML tags, so that, for example:
This is my link.
will be replaced by
This is my link.
I tried
sed -r 's/<[^>]*>//g'
but it didn't work.
This is extremely bare-bones and unlikely to catch all of the scenarios that HTML will throw at you, but if you are just looking to trim a leading and trailing < and >, then something like this might work:
sed 's/<[^>]*>//g'
But seriously, I'd use a parser.
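For instance (the sample sentence here is my own guess at the kind of input intended, not taken from the question):
echo 'This is <a href="http://example.com">my link</a>.' | sed 's/<[^>]*>//g'
prints: This is my link.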
In the general case you cannot parse HTML with regular expressions.
But for a simple case, and assuming that no tag spans more than two lines, you can use:
sed -e 's/<[^<>]*>//g' -e 's/<[^<>]*$//' -e 's/^[^<>]*>//'
The first regex finds and deletes tags contained on one line. The second takes care of tags which begin on a line but end on the next. The third deletes the tails of tags which began on the previous line. If a tag can span more than two lines then something more complicated (or a better tool) is needed.
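A quick illustration on made-up input (the sample lines are mine, purely to exercise the two-line case):
printf '%s\n' 'Hello <b>world</b>,' 'a <a href="x"' '>link</a> here.' | sed -e 's/<[^<>]*>//g' -e 's/<[^<>]*$//' -e 's/^[^<>]*>//'
This strips the single-line tags on the first line and both halves of the tag that spans the second and third lines.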
I've been working on this problem for most of today, as I'm a newbie.
I have looked at many examples but still get undesired output.
A snippet of what the file looks like now:
(screenshot: File)
And it needs to look like this:
(screenshot: Desired output)
I started with deleting unnecessary files.
However, when I try to move the column over to the right I get a screen full of numbers.
I've been working on getting rid of the white space and I'm pretty sure I've tried everything on this site. :)
Thanks!
To remove extra whitespace from a file you may use the following simple approach with the sed command:
sed -ri 's/\s{2,}/ /g' testfile
The -r option enables extended regular expressions (so {2,} can be written unescaped).
The -i option modifies the file in place.
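As a quick check on throwaway input before touching the real file (the sample text is mine, and -i is dropped so nothing is modified):
printf 'a   b  c      d\n' | sed -r 's/\s{2,}/ /g'
prints: a b c d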
I'm attempting to migrate a bunch of my data from one webservice to another, and in the process I want to make sure I do it right so I won't be obsessing about something not being right or out of place.
One of the things I want to do is find pound signs within a single file, extract the word immediately following each of them, and then print those words back comma-separated.
So for example, at some points in the file there'll be "#word - #word2 : #word3" - with completely random stuff between them, mind you - and then I'd like to be able to kick that back out as
words='word,word2,word3'
ditching the pound signs and any other gibberish around them.
I'm completely useless at anything beyond basic scripting. Any help would be greatly appreciated.
You can try:
grep -o "#[^ ]*" file | tr -d '#' | tr '\n' ','
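That leaves a trailing comma and no words='...' wrapper; one way to get the exact form asked for (my own variation, not part of the answer above) is:
words=$(grep -o '#[^ ]*' file | tr -d '#' | paste -sd, -)
printf "words='%s'\n" "$words"
paste -s joins all the lines with the given delimiter and does not leave a trailing one.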
We are having a discussion at work about the best Unix command-line tool for viewing log files. One side says use less, the other says use more. Is one better than the other?
A common problem is that logs have too many processes writing to them, so I prefer to filter my log files and control the output using:
tail -f /var/log/<some logfile> | grep <some identifier> | more
This combination of commands allows you to watch an active log file without getting overwhelmed by the output.
I opt for less. One reason is that (with the aid of lessopen) it can read gzipped logs (as archived by logrotate).
As an example, with this single command I can read the dpkg logs in time order, without treating the gzipped ones differently:
less $(ls -rt /var/log/dpkg.log*) | less
Multitail is the best option, because you can view multiple logs at the same time. It also colors stuff, and you can set up regex to highlight entries you're looking for.
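A minimal invocation (the file names are just examples) might look like:
multitail /var/log/syslog /var/log/auth.log
which follows both logs in split windows; the colouring and regex highlighting mentioned above are set up with additional options or in its configuration file.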
You can use any program: less, nano, vi, tail, cat, etc.; they differ in functionality.
There are also many log viewers: gnome-system-log, kiwi, etc. (they can sort logs by date, type, and so on).
Less is more. Although, when I'm looking at my logs I'm typically searching for something specific or just interested in the last few events, so I find myself using cat, pipes and grep or tail rather than more or less.
less is the best, imo. It is lightweight compared to an editor, it allows forward and backward navigation, it has powerful search capabilities, and much more. Hit 'h' for help. It's well worth the time getting familiar with it.
On my Mac, using the standard terminal windows, there's one difference between less and more, namely, after exiting:
less leaves less mess on my screen
more leaves more useful information on my screen
Consequently, if I think I might want to do something with the material I'm viewing after the viewer finishes (for example, copy'n'paste operations), I use more; if I don't want to use the material after I've finished, then I use less.
The primary advantage of less is the ability to scroll backwards; therefore, I tend to use less rather than more, but both have uses for me. YMMV (YMWV; W = Will in this case!).
As your question was generically about 'Unix systems', keep in mind that in some cases you have no choice: on old systems only more is available, not less.
less is part of the GNU tools; more comes from the UCB times.
Turn on grep's line buffering mode.
Using tail (Live monitoring)
tail -f fileName
Using less (Live monitoring)
less +F fileName
Using tail & grep
tail -f fileName | grep --line-buffered my_pattern
Using less & grep
less +F fileName | grep --line-buffered my_pattern
Using watch & tail to highlight new lines
watch -d tail fileName
Note: for Linux systems.
I have a text file (more correctly, a “German style” CSV file, i.e. semicolon-separated, decimal comma) which has a date and the value of a measurement on each line.
There are stretches of faulty values which I want to remove before further work. I'd like to store these cuts in some script so that my corrections are documented and I can replay those corrections if necessary.
The lines look like this:
28.01.2005 14:48:38;5,166
28.01.2005 14:50:38;2,916
28.01.2005 14:52:38;0,000
28.01.2005 14:54:38;0,000
(long stretch of values that should be removed; could also be something else besides 0)
01.02.2005 00:11:43;0,000
01.02.2005 00:13:43;1,333
01.02.2005 00:15:43;3,250
Now I'd like to store a list of begin and end patterns like 28.01.2005 14:52:38 + 01.02.2005 00:11:43, and the script would cut the lines matching these begin/end pairs and everything that's between them.
I'm thinking about hacking an awk script, but perhaps I'm missing an already existing tool.
Have a look at sed:
sed '/start_pat/,/end_pat/d'
will delete lines between start_pat and end_pat (inclusive).
To delete multiple such pairs, you can combine them with multiple -e options:
sed -e '/s1/,/e1/d' -e '/s2/,/e2/d' -e '/s3/,/e3/d' ...
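Applied to the sample in the question (dots escaped so they match literally; the file name is a placeholder of mine):
sed '/28\.01\.2005 14:52:38/,/01\.02\.2005 00:11:43/d' measurements.csv
Each begin/end pair you want to document can then live as one more -e expression in a small script.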
Firstly, why do you need to keep a record of what you have done? Why not keep a backup of the original file, or take a diff between the old & new files, or put it under source control?
For the actual changes I suggest using Vim.
The Vim :global command (abbreviated to :g) can be used to run :ex commands on lines that match a regex. This is in many ways more powerful than awk since the commands can then refer to ranges relative to the matching line, plus you have the full text processing power of Vim at your disposal.
For example, this will do something close to what you want (untested, so caveat emptor):
:g!/^\d\d\.\d\d\.\d\d\d\d/-1write >> tmp.txt | delete
This matches lines that do NOT start with a date (the ! negates the match), appends the previous line to the file tmp.txt, then deletes the current line.
You will probably end up with duplicate lines in tmp.txt, but they can be removed by running the file through uniq.
You can also use awk:
awk '/start/,/end/' file
I would seriously suggest learning the basics of perl (i.e. not the OO stuff). It will repay you in bucket-loads.
It is fast and simple to write a bit of perl to do this (and many other such tasks) once you have grasped the fundamentals, which if you are used to using awk, sed, grep etc are pretty simple.
You won't have to remember how to use lots of different tools, and where you would previously have piped several tools together to solve a problem, you can just use a single perl script (usually much faster to execute).
And, perl is installed on virtually every unix/linux distro now.
(that sed is neat though :-)
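For this particular task, a perl one-liner (my own sketch under the same begin/end assumption as the sed answer above, not something from this answer) could be:
perl -ne 'print unless /28\.01\.2005 14:52:38/ .. /01\.02\.2005 00:11:43/' measurements.csv
The .. flip-flop operator is true from the line matching the first pattern through the line matching the second, so those lines are skipped, inclusive.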
Use grep -v (print only non-matching lines).
Sorry - I thought you just wanted the lines without 0,000 at the end.