Using awk/cut/sed with lines containing different numbers of fields - unix

I'm using the pinboard.in API to get a list of my current bookmarks. The results look like this:
<post href="https://www.nocc.meezy.com/doc/view.cgi?id=715" time="2013-02-11T17:38:10Z" description="Disk Errors Process Flow Chart" extended="" tag="nocc work" hash="a3419515b2e956e86886ba630b6028b7" meta="d793aeef6133a26e361695181eb57b9d" />
<post href="https://www.nocc.meezy.com/doc/view.cgi?id=39" time="2013-02-11T17:38:08Z" description="Using socat" extended="" tag="socat work" hash="fd60523bf841b2b95674a0e1d4401f4d" meta="5f2b6ad395fe4da05b2987d199b675ea" />
<post href="https://agora.meezy.com/wiki/Development_Tools" time="2013-02-11T17:38:06Z" description="Development Tools - meezyWiki" extended="" tag="devtools work" hash="dcf904433987a125c00a88bcaf31cad27" meta="5e744562282561390a0417223d323aee" />
I'm only interested in the URL, description, and tags, so I'd like to have the results look like this:
https://www.nocc.meezy.com/doc/view.cgi?id=715 description="Disk Errors Process Flow Chart" tag="nocc work"
https://www.nocc.meezy.com/doc/view.cgi?id=39 description="Using socat" extended="" tag="socat work"
https://agora.meezy.com/wiki/Development_Tools description="Development Tools - meezyWiki" tag="devtools work
I know a little bit about awk/cut/sed but not enough to tell them how to count the fields correctly when the description and tag fields contains spaces and different numbers of strings.
I could probably hack together some really crappy solution if my life depended on it but I'd rather get a proper solution by someone who knows them much better than I do.
Thanks

when you playing with xml with regex/awk/sed.. you should know the risk. here is sed one-liner for your requirement:
sed -r 's/^.*"(http)/\1/; s/" time=.*( desc)/ \1/; s/extended=.*( tag=")/\1/; s/hash=.*//' file
test with your example:
kent$ sed -r 's/^.*"(http)/\1/; s/" time=.*( desc)/ \1/; s/extended=.*( tag=")/\1/; s/hash=.*//' file
https://www.nocc.meezy.com/doc/view.cgi?id=715 description="Disk Errors Process Flow Chart" tag="nocc work"
https://www.nocc.meezy.com/doc/view.cgi?id=39 description="Using socat" tag="socat work"
https://agora.meezy.com/wiki/Development_Tools description="Development Tools - meezyWiki" tag="devtools work"

Related

Atom data-grammar syntax for keybindings

Can someone give a full explanation of the syntax for Atom's data-grammar attribute (used in keybinding selectors)?
For instance, what is the difference between
[data-grammar='source example']
and
[data-grammar~='source example']
?
Also, how do you specify multiple grammars? For instance, how would you specify that a key binding should be limited to html or xml formats?
If there already exists documentation on this somewhere, I have not yet found it, but would appreciate being pointed to it.
Quick Example:
keymap.cson:
"atom-text-editor[data-grammar='text tex latex']":
'f5':'latex:build'
Grammar Information & Documentation
I began by looking at the file-types package. source and text categorize languages - source deals with development languages, while text deals with documentation/logs formats.
You can add and customize language recognition by reading the flight manual. I've linked some specific sections below that are helpful for that.
Flight Manual | Basic Customization:
Language Recognition
Language Specific Settings
Working with [data-grammar]:
The little doocumentation given is listed under the Keymaps in Depth section.
Flight Manual | Keymaps in Depth
Selectors and Custom Packages.
This also describes the not([...]) functionality used below and how to manipulate various rules.
Although in the above, grammars are listed in a dot format, ie source.c, to use them in the [data-grammar='<name>'] format spaces are instead required.
An example of how I might use the data grammar option in my keymap.cson config is as such (here I'm using the latex package):
"atom-text-editor[data-grammar='text tex latex']":
'f5':'latex:build'
The ~ is not the correct syntax for desired functionality with data-grammar. Instead, use something like "atom-text-editor:not([data-grammar='<name>'])":
Note that you wouldn't use data-grammar in something like config.cson. The syntax for language specifics looks something like this instead:
# **config.cson**
".latex.tex.text":
editor:
softWrap: true
Extra useful information - List of registered Grammars
A dump of the output of Object.keys(atom.grammars.grammarsByScopeName).sort().join('\n') through the Dev Console (View > Developer > Toggle Developer Options > Console)
source.c
source.cake
source.clojure
source.coffee
source.cpp
source.cs
source.css
source.css.less
source.css.scss
source.csx
source.diff
source.gfm
source.git-config
source.go
source.gotemplate
source.java
source.java-properties
source.js
source.js.rails source.js.jquery
source.js.regexp
source.js.regexp.replacement
source.json
source.litcoffee
source.makefile
source.nant-build
source.objc
source.objcpp
source.perl
source.perl6
source.plist
source.python
source.python.django
source.regexp.python
source.ruby
source.ruby.gemfile
source.ruby.rails
source.ruby.rails.rjs
source.sass
source.shell
source.sql
source.sql.mustache
source.sql.ruby
source.strings
source.toml
source.verilog
source.yaml
text.bibtex
text.git-commit
text.git-rebase
text.html.basic
text.html.erb
text.html.gohtml
text.html.jsp
text.html.mustache
text.html.php
text.html.ruby
text.hyperlink
text.junit-test-report
text.log.latex
text.plain
text.plain.null-grammar
text.python.console
text.python.traceback
text.shell-session
text.tex
text.tex.latex
text.tex.latex.beamer
text.tex.latex.memoir
text.todo
text.xml
text.xml.plist
text.xml.xsl
To complement Mr G's answer, atom-text-editor[data-grammar~='html'] with the ~= means match an <atom-text-editor> HTML element with a data-grammar attribute that contains "html" amongst any other possible whitespace separated words.
For example, if the current language of the file is PHP, then the text-editor HTML element will look something like this:
<atom-text-editor data-grammar="text html php">
And atom-text-editor[data-grammar~='html'] will match this.
More info on attribute selectors: https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors
As for trying to select multiple grammars, I don't think it's possible unless they share a common word in the data-grammar attribute, e.g., both HTML and PHP share "html", or both C and JavaScript share "source" (but in this case many other grammars share "source"). The only other way is to specify a keymap for each grammar individually even if it's the same key combination.

Notepad++ Function list for SQL

quick question .. I'm trying to get the function list option in Notepad++ going ...
Now, I found this thread:
Notepad++ Function List for PL/SQL
which helped get me started, however, I'm still struggling with something, and I can't seem to wrap my monkey-brain around it.
It'll be this section I need to focus:
<function
mainExpr="^[\t ]*(FUNCTION|PROCEDURE)[\s]*[\w]*[\s]*(\(|IS|AS)*"
displayMode="$functionName">
<functionName>
<nameExpr expr="[\w]+[\s]*(\(|IS|AS)"/>
</functionName>
</function>
That works perfectly fine .. so far.
However, I would like to also see PACKAGE header and PACKAGE BODY names in there as well .. just to help tidy things up.
I figured it'd be easy to tweak the RegExp, however, nothing I've tried is working
So I'm trying to pick out these kinds of scenarios:
CREATE PACKAGE aaa
CREATE OR REPLACE PACKAGE bbb
CREATE PACKAGE BODY ccc
CREATE OR REPLACE PACKAGE BODY ddd
all 4: aaa, bbb, ccc, and ddd.
I can't even get it to pull back one yet.. :(
Hoping I could get some help/hints/something ..
I know this is the main "logic":
mainExpr="^[\t ]*(FUNCTION|PROCEDURE)[\s]*[\w]*[\s]*(\(|IS|AS)*"
that finds the line(s) ..
And trying to matchup the logic with what it finds for .. say, FUNCTIONs, and what I want for PACKAGE ... I tried this:
mainExpr="^[\t ]*(FUNCTION|PROCEDURE|CREATE OR REPLACE PACKAGE)[\s]*[\w]*[\s]*(\(|IS|AS)*"
but even that doesn't pick out the header! O.o
I'm sure there's something I need to do with the part .. but again, not really understanding how it works ??
I've read this :
https://notepad-plus-plus.org/features/function-list.html
but there's obviously something about the syntax/usage of this thing I'm not fully understanding ..
hoping somebody can help me out?
I think your problem is coming from the Regex rather than anything you're doing incorrectly. I've made a new parser based on the one I found here: http://www.hermanmol.nl/?p=240
<parser id="plsql_func" displayName="PL/SQL" commentExpr="((/\*.*?\*)/|(--.*?$))">
<function
mainExpr="^[\w\s]{0,}(PACKAGE BODY|PACKAGE|FUNCTION|PROCEDURE)[\s]{1,}[\w_]{1,}">
<functionName>
<nameExpr expr="^[\w\s]{0,}(PACKAGE BODY|PACKAGE|FUNCTION|PROCEDURE)[\s]{1,}\K[\w_]{1,}"/>
</functionName>
</function>
</parser>
For me this seems to correctly pull out the Package, Procedures and Functions.
One thing to note however, I could not get this to work using a file extension assocation, and used the following instead to test on a text file: <association langID= "0" id="plsql_func" />
I also placed the updated functionList.xml file in both the Program Files (x86)\Notepad++ and the Users\xxxxx\AppData\Roaming\Notepad++ directories.
Edit - a short explanation of the Regex, I'm not great at Regex but it was requested in the comments
^[\w\s]{0,} - From the beginning of the line, find 0 or more letters or white space characters
(PACKAGE BODY|PACKAGE|FUNCTION|PROCEDURE) - followed by any of these
[\s]{1,}[\w_]{1,} - followed by one or more spaces, followed by one or more words
Thanks Chrisrs2292,
I was helped by the location of the functionList.xml file in Users\xxxxx\AppData\Roaming\Notepad++ directories.
RegEX for T-SQL:
<association id= "T-SQL_func" langID="17"/>
<!-- T-SQL-->
<parser displayName="T-SQL" id="T-SQL_func" commentExpr="(?s:/\*.*?\*/)|(?m-s:--.*?$)">
<function mainExpr='(?im)^\h*(create|alter)\s+(function|procedure)\s+((\[|")?[\w_]+(\]|")?\.?)?((\[|")?[\w_]+(\]|")?)?'
displayMode="$functionName">
<functionName>
<nameExpr expr='(?im)(function|procedure)\s+((\[|")?[\w_]+(\]|")?\.?)?((\[|")?[\w_]+(\]|")?)?' />
</functionName>
</function>
Working from what Chrisrs2292 has provided from above, I played around with some PL/SQL code on regex101.com to find a regular expression to find the functions/procedures/etc, and got put that into the functionList.xml.
<parser
displayName="SQL Node"
id="sql_node"
commentExpr="((/\*.*?\*)/|(--.**$))"
>
<function
mainExpr="^[\w\s]+(PACKAGE BODY|PACKAGE|PROCEDURE|FUNCTION)\s+[\w"\.]+"
>
<functionName>
<nameExpr expr="^[\w\s]+(PACKAGE BODY|PACKAGE|PROCEDURE|FUNCTION)\s+\K[\w"\.]+"/>
</functionName>
</function>
</parser>
The big change was I had include the double quotes (using the XML code of ") and dot (the \.) since many times I use the quotes and like to use the fully qualified name (schema.procedure|function|etc). Minor changes are replacing the {0,} with * and {1,} with +. These are minor, cosmetic changes are should be interchangeable.

Can MeCab be configured / enhanced to give me the reading of English words too?

If I begin with a wholly Japanese sentence and run it through MeCab, I get something like this:
$ echo "吾輩は猫である" | mecab
吾輩 名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
猫 名詞,一般,*,*,*,*,猫,ネコ,ネコ
で 助動詞,*,*,*,特殊・ダ,連用形,だ,デ,デ
ある 助動詞,*,*,*,五段・ラ行アル,基本形,ある,アル,アル
EOS
If I smash together everything I get from the last column, I get "ワガハイワネコデアル", which I can then feed into a speech synthesis program and get output. Said program, however, doesn't handle English words.
I throw English into MeCab, it manages to tokenise it (probably naively at the spaces), but gives no reading:
$ echo "I am a cat" | mecab
I 名詞,固有名詞,組織,*,*,*,*
am 名詞,一般,*,*,*,*,*
a 名詞,一般,*,*,*,*,*
cat 名詞,固有名詞,組織,*,*,*,*
EOS
I want to get readings for these as well, even if they're not perfect, so that I can get something along the lines of "アイアムアキャット".
I have already scoured the web for solutions and whereas I do find a bunch of web sites which have transliteration that appears to be adequate, I can't find any way to do it in my own code. In a couple of cases, I emailed the site authors and got no response yet after waiting for a few weeks. (Just how far behind on their inboxes are these people?)
There are a number of directions I can go but I hit dead ends on all of them so far, so this is my compound question:
MeCab takes custom dictionaries. Is there a custom dictionary which fills in the English knowledge somewhat?
Is there some other library or tool that can take English and spit out Katakana?
Is there some library or tool that can take IPA (International Phonetic Alphabet) and spit out Katakana? (I know how to get from English to IPA.)
As an aside, I find that the software "VOICEROID" can speak English text (poorly, but adequately for my purposes). This software uses MeCab too (or at least its DLL and dictionary files are included in the install.) It also uses another library, Cabocha, which as far as I can tell by running it does the exact same thing as MeCab. It could be using custom dictionaries for either of these two libraries to do the job, or the code to do it could be in the proprietary AITalk library they are using. More research is needed and I haven't figured out how to run either tool against their dictionaries to test it out directly either.

Pipe file to stdin and keep alive, waiting for new data

A semi-newbie to UNIX piping, so apologies if I'm asking anything obvious here. I'm using a program called CCExtractor to grab the closed captions from a video file. It has the option to receive a file from stdin, and it works great if I do the following:
./ccextractor -stdin < myvideofile.wtv
However, I want to try using it "live" - as a video is recording, it'll transcribe the subtitles. From my understanding, < won't do that, as it'll stop as soon as it reaches the current end of the file. Following this answer on Stack Overflow, it seems like:
tail -c +1 -f myvideofile.wtv | ./ccextractor -stdin
should work - but it doesn't process any part of the video at all (it should, at least, work as well as the previous command and parse the existing data). I figured I'd take a step back and use a simple cat:
cat myvideofile.wtv | ./ccextractor -stdin
and that doesn't work either. I was of the belief that the first and third commands ought to be roughly equivalent, but that's obviously not the case. What are the differences, and how could I get this to work?

How to get alt text from images in a docx file using Open XML and C#

I am creating a web form that will do a 508 compliance check on word documents. I am looking through MSDN and other sites for getting the information I need from a file the user selects. The one thing I can't find is how to find images, and check to see if they have alternative text. Any help would be GREATLY appreciated!
Images inserted into 2007+ Word documents are Drawing objects. So you can traverse the XML for w:drawing members.
http://msdn.microsoft.com/en-us/library/documentformat.openxml.wordprocessing.drawing.aspx
The w:drawing member will have a child called w:inline which is a part of the Inline class.
http://msdn.microsoft.com/en-us/library/documentformat.openxml.drawing.wordprocessing.inline.aspx
The w:inline member will have a member called wd:docPr.
http://msdn.microsoft.com/en-us/library/documentformat.openxml.drawing.wordprocessing.docproperties.aspx
The wd:docPr member may have a field called title which houses the alternative text title and a field called descr which houses all the alternative text.
Example XML:
<w:drawing xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
<wp:inline distT="0" distB="0" distL="0" distR="0" wp14:anchorId="357A850A" wp14:editId="384E9053" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing">
<wp:extent cx="5943600" cy="4457700" />
<wp:effectExtent l="0" t="0" r="0" b="0" />
<wp:docPr id="1" name="Picture 1" descr="ALL TEXT HERE" title="ALT TEXT TITLE HERE"/>
...
I highly recommend you use the OpenXML Productivity Tool that comes with the OpenXML SDK.
You can do the same thing slightly more easily with unzip and a copy of lxprintf (part of the LTXML2 toolkit), by unzipping the slides in a loop and running lxprintf on each one to locate the wp:docPr element and output the values of #descr and #title, eg
for f in `unzip -l demo.pptx | grep ppt/slides/slide.*\.xml | awk '{print $NF}'`; do
unzip -p demo.pptx $f |\
lxprintf -e 'w:drawing/wp:inline/wp:docPr' "%s, %s\n" #descr #title -
done

Resources