Man page, as structured data (csv, database, ...) - unix

To simplify my question, I mainly consider man pages of commands, e.g. «man grep».
Man pages are, more or less, structured. Most sections and their presentation are standard, and an explanation can be found at https://www.tldp.org/HOWTO/Man-Page/q3.html
(And the source of a man page, in groff, is not really hard to understand, even without knowing groff.)
My question is: is there already a database of the (more standard) man pages? Or at least a program that takes a man page as input (as a groff file, probably) and outputs such a database?
Here, I mean database in a very vague sense. SQLite or MySQL would be perfect, but a zip of CSV files would also be great.
Let me give you an example using man grep.
The database would have an option table, with an entry for each option. This entry would contain:
- the actual option name(s)
- The abbreviation(s),
- The description of what this option does
- The enclosing section.
In CSV, an entry would be:
--extended-regexp, -E, Interpret PATTERN as an extended regular expression (ERE\, see below). (-E is specified by POSIX .), Matcher Selection
It would have an "exit" table, with:
0, selected lines are found
1, otherwise
2, an error occurred\, unless the -q or --quiet or --silent option is used and a selected line is found.
And so on for each standard kind of sections of the man page.
And a table with all the text that could not be put into any other table.
I expect that some parts would be simple to parse, for example the option table, but other parts would be quite hard, for example the exit status. Which is why I really want to know whether something like this has already been done, so that I don't have to do it myself.

You can download man pages with
git clone http://git.kernel.org/pub/scm/docs/man-pages/man-pages
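Once you have the sources, a very rough first pass at the option table could look like the sketch below. Everything about the layout is an assumption: it expects a GNU-style page such as grep.1 (a .SH OPTIONS section, .SS subsections, a .TP macro before each option), it drops other macros inside descriptions (so words set with font macros on their own lines go missing), the path assumes an installed gzip-compressed page, and it uses "|" rather than "," as the separator to sidestep CSV quoting.

#!/bin/sh
# Sketch only: print "options|description|subsection" lines for one man source.
zcat /usr/share/man/man1/grep.1.gz | awk '
  function clean(s) { gsub(/\\f[BIRP]/, "", s); gsub(/\\-/, "-", s); gsub(/"/, "", s); return s }
  function flush()  { if (opt != "") printf "%s|%s|%s\n", opt, desc, subsec; opt = desc = "" }
  /^\.SH/  { flush(); in_opts = ($0 ~ /OPTIONS/); subsec = ""; next }
  !in_opts { next }
  /^\.SS/  { flush(); subsec = clean($0); sub(/^\.SS +/, "", subsec); next }
  /^\.TP/  { flush(); grab = 1; next }
  grab     { opt = clean($0); sub(/^\.[A-Z]+ +/, "", opt); grab = 0; next }
  /^\./    { next }    # other macros inside a description are simply dropped
           { desc = (desc == "" ? "" : desc " ") clean($0) }
  END      { flush() }
'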

Related

Open an XML file not knowing the complete name and parse the XML

I am using Robot Framework with RIDE, and for a test I need to find an XML file on my computer and open it to parse the XML and be able to use the data.
The thing is that I don't know the exact name of the file; the format is numberNameOfTheFile, so it could be 1NameOfTheFile or 25NameOfTheFile.
How can I use regexp in my keyword? Or any other way to achieve this?
Thank you
How would you do it manually - how would you pick the file to use for the verification?
I presume you are going to look at all the files that match a specific name pattern; in Robot Framework you can do that with OperatingSystem's List Files In Directory keyword, which supports passing a name pattern:
${the files}= List Files In Directory /the/path/to/the/dir *NameOfTheFile.xml
Now you have a list object with the filenames that match; if it's empty - there's no such file, which may be a problem (depends on your test/reqs, I don't know). If it has a single member - great, that's your file.
And if there are multiple files - that's another "problem". How would you pick the right file manually? It could be that the newest file is the target one - for that you would go over all of them and find it through OperatingSystem's Get Modified Time; or it could be the largest; or the one whose number prefix is the biggest. This really depends on your requirements, and what you are trying to achieve.
"How would you do it manually" is probably the most important question to ask. Think and break down to steps the individual tasks you would do, and now you have the algorithm; see how to put that in code - and presto, the implementation. This applies to scripts, test cases, and business process automation (e.g. software).
I was tempted to mark the question for closing, because precisely this - the algorithm - was missing, only the end goal is stated - while SO is for helping in the implementation part. But, here we are :)

Is there a way to make R code only able to be run and not edited? Essentially read-only?

I am in the midst of writing some scripts to perform data analysis on large Excel sheets faster than by hand. However, my company has a strict quality review system where the program used needs to be validated and secure (i.e. no one can edit it, there is proof of what code was run, etc.). So essentially I would like my code to be able to be run by my coworkers without them being able to edit the script. I was also interested in inserting prompts that they can fill in (e.g. "Which column would you like to analyze?").
Is all of this possible? I have read a few things online about file permissions but I know that these can easily be changed by the user. I also read about obfuscators but am entirely unfamiliar with their use.
One thought I have is to use Rmarkdown as a method of displaying which lines were run for which results. However, I believe that document could be edited as well. This would also leave the issue of the script itself still being editable.

How to access a specific part of data as input to AWK

Suppose I want to access an online dictionary and need to look up a specific word. I would just like to have the specific part of the data, i.e. the part related to the word and its translation, as input to AWK. Any idea?
In other words, I just want to have a small portion of the data on my machine. How can I avoid downloading all of it and hopefully save space and time? Is there any way to do so without downloading all the data to my local machine?
This question is related to my last question here.
Edit 1:
I selected a dictionary as an example because when you want to look up a word, it is enough to access a specific part of the data and there is no need to process all of it.
I am not an expert in programming, so I was thinking I could modify this answer to make it work (that is why I added the AWK tag again). I don't use any specific OS or tool; this is just a basic idea to see what the possibilities are, so I don't know how I can improve the tags.
awk cannot download. You must download the file and pipe it into a command that terminates as soon as it finds a result:
wget -qqO- http://example.com/path |grep -wim1 "word"
wget -qqO- URL will have no output other than the content of the given URL, which is placed on standard out so you can then parse it. grep -wim1 "word" will find the first bounded word matching "word" and then terminate. If you don't need it outputted, you can use -wiq instead. If the dictionary has one word per line (and nothing else), you're better off with -x instead of -w so that you can match "can" in its entirety rather than "can't" (' is a word boundary). Remove the -i if you want to match case.
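For instance, against a plain one-word-per-line list, the whole-line variant would be (the URL is a placeholder):
wget -qqO- http://example.com/wordlist.txt | grep -xim1 "can"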
In the comments, you asked:
it may improve things to jump to the start of the "w" section, so as not to download the whole data from "a" to "w". Is it possible? I guess not
Some programs can "resume" downloads and you may be able to play with that, but you'd have to guess where to start. This would be a lot of work and you might seek too far and therefore fail to get a match.
If you are querying this dictionary more than once, I'd recommend downloading it and saving it so you can query it locally. Even the largest dictionary I know of is only 213MB (compressed, search with zgrep), though I am assuming you're talking about a traditional word list rather than a hash table or other arbitrary data form. Of course, anything longer would take such a long time to download that you'd only want to do it once.
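If you do go that route, it can be as simple as this (the URL and file name are placeholders):
wget -qO dictionary.txt.gz http://example.com/dictionary.txt.gz   # one-time download
zgrep -wim1 "word" dictionary.txt.gz                              # fast local lookups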
If you really don't want to store it locally, you should probably consider a database rather than a flat file.

Log parser/analyzer in Unix

What's the popular tool people use in Unix to parse/analyze log files? Counting, finding unique values, selecting/copying certain lines which match certain patterns. Please advise some tools or some keywords. I believe similar questions must have been asked before, but I don't have any idea about the keywords. Thanks.
I find it a huge failing that many log formats do not separate columns with a proper, unique field separator - not because that is best, but because it is the basic premise of the Unix textutils that operate on tabular data. Instead they tend to use spaces as separators and quote fields that might contain spaces.
One of the most practical simple changes I made to web log analysis was to move away from the default NCSA log format produced by the nginx web server and instead use tab as the field separator.
Suddenly I could use all of the primitive unix textutils for quick lookups, but especially awk! Print only lines where the user-agent field contains Googlebot:
awk 'BEGIN {FS="\t"} $7 ~ /Googlebot/ { print; }' < logfile
Find the number of requests for each unique request:
awk 'BEGIN {FS="\t"} { print $4; }' < logfile | sort | uniq -c | sort -n
And of course lots of combinations to find specific visitors.
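For example, counting requests per client for the Googlebot hits (this assumes the same tab-separated layout as above, with the client address in field 1):
awk 'BEGIN {FS="\t"} $7 ~ /Googlebot/ { print $1 }' < logfile | sort | uniq -c | sort -n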
For regular, nightly checking there is logwatch, which has several different scripts in /usr/share/logwatch/scripts/services that check for specific things (like web server stuff, ftp server stuff, sshd-related stuff, etc.) in syslog. The default install enables most of them, but you are able to enable/disable them as you like or even write your own scripts.
For real-time watching there is multitail.
You might want to try out lnav, a curses-based log analyzer. It has most of the features you would expect from a log parser, like chronological arrangement of log messages from multiple log files, support for multiple log formats, highlighting of error/warning messages, hotkeys for navigating between error/warning messages, support for SQL queries, and lots more. Take a look at the project's website for screenshots and a detailed list of features.
Take a look at some of the generic log parsers listed here. If you use something like syslog, you can probably get a custom parser/analyzer too. Otherwise, for trivial searches, any scripting language like perl, python or even awk suffices.
Any programming language that allows you to open and read files and do string/text manipulation can be used, e.g. Perl, Python, (g)awk, Ruby, PHP, even Java, etc. They have modules for the file formats you are parsing, e.g. CSV.

How to learn to work effectively with the Unix CLI

Do you know any resources that teach good habits for working in the UNIX command line?
EDIT: I don't mean general books about the shell or man pages. I mean the things that you can only see watching professionals working with the command line. For example, when changing frequently between two directories they use the "pushd" command; when repeating a command they use "history". I can read about these commands, but I want to make it a habit to use them effectively.
I am speaking from my own experience, so it may not apply to you:
The best way to become efficient is actually to use it on a daily basis, instead of using graphical tools even if they make things look easy. You will then become aware of the most common tasks you care about, and instead of trying to grok it all at once, you have a fairly good starting point for learning. Man pages are the first thing to look at, but there will be non-obvious tricks which you need to search for anyway. Knowing exactly what you want infinitely increases the probability of finding it.
For example, it is easier to find how to search for all mp3 files in the man page of "find" than how to deal with files in general (where would you even start?).
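The resulting one-liner is then something like (the path is just an example):
find ~/Music -type f -name '*.mp3'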
Some common bash command line actions, not in order:
Command line editing: you'll want to be good with emacs or vi and apply that to editing your commands.
Completion: use TAB to expand file names and paths.
note: There is a huge set of file, command, and history completion functions, and it is configurable. Big topic.
"cd -" : go back to the last directory you were in
~ = home directory (or ~user for users home dir)
"ESC ." : expands to the final arg from the previous command
"!string" : execute the last command starting with string
learn find, grep, sed, piping "|" and redirection ">". You'll often combine these to do useful things (see the example after this list).
Loops from the shell prompt, e.g. "for" loop - to do repetitive actions
Learn your regular expressions! Often used for matching files.
example: ls x[0-5]*.{zip,tar} = list files starting with x, followed by a number 0 through 5, followed by any string ending in .zip or .tar
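Putting the find/grep/pipe/redirection items above together, a typical combination might be (the directory, pattern, and output file are just examples):
find /var/log -name '*.log' -exec grep -il "error" {} + | sort > files_with_errors.txt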
If possible ask others for their favorite tricks, read the manual, and practice.
For the more advanced stuff, this seems to be fairly comprehensive.
this is a great resource: "Rute User's Tutorial and Exposition" (http://rute.2038bug.com/index.html.gz)
stackoverflow.com esp. the bash tag ;-)
(and of course the bash man page)
If you want things that you can "only see watching professionals working with command line," then you've answered your own question: Watch professionals working with the command line. I don't personally find that very useful unless the other person is doing the same thing multiple times; it's hard to pick something up after just one session because it's hard to watch the screen and the keyboard at the same time.
I think the key is to not try to become an expert right away. Just use the command line frequently, and be aware that you might not be using it as well as you could, but don't let that discourage you from using it anyway.
Browse through the man page of your shell, and through lists of tips, not with the goal of memorizing everything in them, but just to pick out a couple of things to try out. Skim through until something catches your eye and makes you think, "Gee, that sounds useful." Then try it out. Not everything is going to be useful immediately; you might have to wait a while before you encounter a situation where you can try something out. Maybe you could write down some things on Post-It notes by your desk to remind you that certain feats are possible, so when you encounter a situation where a more obscure feature could be handy, you'll be more likely to remember to try it.
Frankly, it's impossible to learn this stuff in a vacuum. You need to have problems to solve.
While it certainly helps to have familiarity with the tools available (of which there are a myriad), "learning" it requires applying it. And applying it requires "real" problems to solve.
For example, the skillset of a System Admin may be different from someone who works with databases because their roles are different.
I use them for data processing, mostly with one-off files. /tmp/x.sh and /tmp/x.x are worn bare in the directory.
My hammers tend to lean towards: ls, find, sort, sed, vi, awk, grep, and comm. Combined with simple shell scripting like: for i in $(cat /tmp/list); do ... done
But I do a lot of ETL work, and very few script files, which is why my shell scripting skills are so weak.
I do rely on one script, however:
#!/bin/sh
# latest -- show latest files
ls -lt "$@" | head
As 95% of the time the files I'm working on are in the top 10 latest files. And "latest *.txt" works a peach.
So, bottom line, you need problems to solve. You need to learn the 'man' command; man -k is nice for finding things. You also need to leverage the "See Also" at the bottom of most man pages. That's a treasure trove of "I didn't know you could do that".
Then, just start solving problems. Start figuring out "what would be nice to have" and then see if it exists (it very well may). If not, awk, perl, or python can make those "nice to haves" out of thin air.
Join a LUG. That is where I learned most things early on. Ask the organizers to do a "Bash Tips And Tricks Night".
Deft shell users love to show off.
apropos is a really good tool for this sort of thing. Whenever you find yourself unsure of the best way to do something, or wishing you weren't repeating yourself, just use apropos with a keyword or two to find other commands that can help. In distros like debian, you can also install web-based help tools that search all of the manuals available on the system: texinfo, man pages, html, and pdf etc.
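For example, when you cannot remember which tool handles a task (the keyword is arbitrary):
apropos compress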
Aside from that, yep, read your shell's manual right through at least once; preferably, go back to it repeatedly as you learn more, reach limits, and want to be more efficient.
The join a LUG idea is also good; you'll definitely learn from others' demos.
