Removing leading white space

I've been working this problem for most of today as I'm a newbie.
Have looked at many examples but still get undesired output.
Here is a snippet of what the file looks like now:
(image: File)
And it needs to look like this:
(image: Desired output)
I started with deleting unnecessary files.
However, when I've tried to manipulate the column over to the right I get a screen full of numbers.
I've been working on getting rid of white space and I'm pretty sure I've tried everything on this site.
Thanks!

To remove extra whitespace from a file you can use the following simple approach with the sed command:
sed -ri 's/\s{2,}/ /g' testfile
-r option enables extended regular expressions (ERE) in the substitution
-i option allows in-place file modification
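Since the question is specifically about leading white space, a minimal variation of the same idea (assuming GNU sed, as above) that strips only the white space at the start of each line would be:
sed -ri 's/^\s+//' testfile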

Download web page and remove content except for one html table

I am given a large html report from another department quite regularly, and it requires a fair amount of manual work to edit into the required format.
I'd like to work a bit smarter. I can download the page via:
wget -qO- <https://the_page.html>
However, I just want to carve out the table that begins:
<!-- START Daily Keystroke
It goes on and on for many lines of html and always ends:
</table>
</div>
</div>
Before the next load of data begins. I need everything between these patterns in one chunk of text/file.
I played around with sed and awk, which I am not really familiar with, but it seems that without knowing how many lines will be in the file each time, these tools are not appropriate for this task. Something that can work on specific patterns seems more appropriate.
That being the case, I can potentially install other utilities. Does anyone have experience with something that might work?
I played around with sed and awk
Be warned that these are best suited to data that can be described by regular expressions, and HTML cannot be. HTML parsers are the tools designed for HTML documents. In general, you should avoid using regular expressions on Chomsky Type-2 (context-free) languages.
That being the case, I can potentially install other utilities. Does anyone have experience with something that might work?
I suggest trying hxselect, as it allows easy extraction of the element(s) matching a CSS selector. It reads from stdin, so you can pipe output into it. Consider the following example: to download the www.example.com page and extract its title tag, I can do:
wget -q -O - https://www.example.com | hxselect -i 'title'
If you encounter ill-formed HTML, you can use hxclean, which will try to make it acceptable to hxselect, like so:
wget -q -O - https://www.example.com | hxclean | hxselect -i 'title'
If either of the above works with your URL, then you can start looking for a CSS selector which describes only the table you want to extract. See a CSS selectors reference for the available features. I am unable to craft the selector without seeing the whole source of the page.
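For instance, if the table you need happened to be the only table inside a div with a known class, a sketch could look like the following (div.report is a purely hypothetical selector; you would have to adjust it to the real structure of your page):
wget -qO- 'https://the_page.html' | hxclean | hxselect -i 'div.report table'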
I suggest using gawk to cut at the first multi-line record, followed by sed to trim the head up to <!-- ....
gawk 'NR==1{print}' RS="</table>\n</div>\n</div>" input.html | sed '0,/<!-- START Daily Keystroke/d'
Or, without an intermediate file:
wget -qO- <https://the_page.html>| \
gawk 'NR==1{print}' RS="</table>\n</div>\n</div>" | \
sed '0,/<!-- START Daily Keystroke/d'
This script was tested to work with the provided sample text.
gawk Explanation:
The gawk script cuts the input text at the first occurrence of:
</table>
</div>
</div>
Aligned to the left margin.
NR==1{print}
Print only gawk record number 1.
The first record is identified as all the text (possibly many lines) up to the first match of the pattern in the RS variable.
RS="</table>\n</div>\n</div>"
A regular expression (RegExp) that matches the gawk multi-line record separator.
In case you want the RegExp to also tolerate indenting whitespace around the closing tags, for example:
    </table>
  </div>
</div>
try:
RS="[[:space:]]*</table>[[:space:]]*\n[[:space:]]*</div>[[:space:]]*\n[[:space:]]*</div>"
sed Explanation:
Remove all lines up to the first occurrence of the RegExp <!-- START Daily Keystroke.
0,/<!-- START Daily Keystroke/
A sed line range, starting from line 0 up to the first line that matches <!-- START Daily Keystroke.
d
Delete/ignore all lines in range.

How to work with files and directories with spaces in them

I have run into this problem many times and searched and searched, but cannot find out how to use, open, remove, etc. files that have spaces in them. For example, I have a file on my desktop named My Text File.txt. How can I do something with it, e.g. nano My Text File.txt? Whenever I try something like this, I get three errors (or however many separate words there are in the file name), each stating the file could not be found, because it looks for a file My, then Text, and finally File.txt. Is there a way to do this without getting errors, or is it possible to create a program to allow it? Any help or advice would be great. Thanks!
The appropriate command for opening My Text File.txt would be:
nano "My Text File.txt"
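Alternatively, you can escape each space with a backslash instead of quoting; the shell treats both forms the same way, and tab completion will normally insert the escapes for you:
nano My\ Text\ File.txt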

Does grep process line by line or entire file?

As I'm learning more about UNIX commands, I started working with sed at work. By design, sed reads a file line by line and executes commands on each line individually.
How does grep process files? I've tried various ways of googling "does grep process line by line" and nothing really concrete shows up.
From Why GNU grep is fast:
Moreover, GNU grep AVOIDS BREAKING THE INPUT INTO LINES. Looking for newlines would slow grep down by a factor of several times, because to find the newlines it would have to look at every byte!
and then
Don't look for newlines in the input until after you've found a match.
EDIT:
I will correct myself: it is neither line by line nor the full file; it works in terms of chunks of data which are placed into a buffer.
More details are here http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
The regular expression you pass to grep doesn't have any way of specifying newlines (although you can specify matches against the start or end of a line).
So it appears to work line by line, even though internally it may not treat line endings differently from other characters.
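As a quick illustration of those line anchors (a throwaway example, not specific to any file):
printf 'foo\nbar\nbaz\n' | grep '^ba.$'
This prints bar and baz but not foo, even though grep may well have read all three lines in a single buffered chunk.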

Use grep (or anything) to find keywords led by a symbol, then output them comma-separated

I'm attempting to migrate a bunch of my data from one webservice to another, and in the process I want to make sure I do it right so I won't be obsessing about something not being right or out of place.
One of the things I want to do is find words led by a pound sign within a single file, extract the word immediately following each one, and then print them back comma-separated.
So for example, at some points in the file there'll be "#word - #word2 : #word3" (with completely random stuff between them, mind you), and then I'd like to be able to kick that back out as
words='word,word2,word3'
ditching the poundsign and any other gibberish around them.
I'm completely useless at anything beyond basic scripting. Any help would be greatly appreciated.
You can try:
grep -o "#[^ ]*" file | tr -d '#' | tr '\n' ','
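Note that this leaves a trailing comma after the last word. If that matters, one variation (assuming paste is available, which it is on practically every system) is:
grep -o '#[^ ]*' file | tr -d '#' | paste -sd, -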

Remove lines which are between given patterns from a file (using Unix tools)

I have a text file (more correctly, a “German style” CSV file, i.e. semicolon-separated, decimal comma) which has a date and the value of a measurement on each line.
There are stretches of faulty values which I want to remove before further work. I'd like to store these cuts in some script so that my corrections are documented and I can replay those corrections if necessary.
The lines look like this:
28.01.2005 14:48:38;5,166
28.01.2005 14:50:38;2,916
28.01.2005 14:52:38;0,000
28.01.2005 14:54:38;0,000
(long stretch of values that should be removed; could also be something else beside 0)
01.02.2005 00:11:43;0,000
01.02.2005 00:13:43;1,333
01.02.2005 00:15:43;3,250
Now I'd like to store a list of begin and end patterns like 28.01.2005 14:52:38 + 01.02.2005 00:11:43, and the script would cut the lines matching these begin/end pairs and everything that's between them.
I'm thinking about hacking an awk script, but perhaps I'm missing an already existing tool.
Have a look at sed:
sed '/start_pat/,/end_pat/d'
will delete lines between start_pat and end_pat (inclusive).
To delete multiple such pairs, you can combine them with multiple -e options:
sed -e '/s1/,/e1/d' -e '/s2/,/e2/d' -e '/s3/,/e3/d' ...
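Applied to the sample data above, a single cut could look like this (data.csv is just a placeholder for your file; the dates are the ones from the question):
sed '/^28\.01\.2005 14:52:38/,/^01\.02\.2005 00:11:43/d' data.csv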
Firstly, why do you need to keep a record of what you have done? Why not keep a backup of the original file, or take a diff between the old & new files, or put it under source control?
For the actual changes I suggest using Vim.
The Vim :global command (abbreviated to :g) can be used to run :ex commands on lines that match a regex. This is in many ways more powerful than awk since the commands can then refer to ranges relative to the matching line, plus you have the full text processing power of Vim at your disposal.
For example, this will do something close to what you want (untested, so caveat emptor):
:g!/^\d\d\.\d\d\.\d\d\d\d/ -1 write >> tmp.txt | delete
This matches lines that do NOT start with a date (the ! negates the match), appends the previous line to the file tmp.txt, then deletes the current line.
You will probably end up with duplicate lines in tmp.txt, but they can be removed by running the file through uniq.
You can also use awk. To delete the range (rather than print it), skip the matching lines and print everything else:
awk '/start/,/end/{next} 1' file
I would seriously suggest learning the basics of perl (i.e. not the OO stuff). It will repay you in bucket-loads.
It is fast and simple to write a bit of perl to do this (and many other such tasks) once you have grasped the fundamentals, which if you are used to using awk, sed, grep etc are pretty simple.
You won't have to remember how to use lots of different tools, and where you would previously have used multiple tools piped together to solve a problem, you can just use a single perl script (usually much faster to execute).
And, perl is installed on virtually every unix/linux distro now.
(that sed is neat though :-)
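As a small illustration (reusing the start_pat/end_pat placeholders from the sed answer above, with data.csv again standing in for your file), the equivalent perl one-liner uses the range (flip-flop) operator:
perl -ne 'print unless /start_pat/ .. /end_pat/' data.csv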
Use grep -v (print non-matching lines).
Sorry - I thought you just wanted the lines without 0,000 at the end.
