How to access a specific part of data as input to AWK - unix

Suppose I want to access an online dictionary and look up a specific word. I would like to feed only the relevant part of the data, namely the word and its translation, as input to AWK. Any ideas?
In other words, I only want a small slice of the data on my machine. How can I avoid downloading all of it, and hopefully save space and time? Is there any way to do this without downloading all the data to the local machine?
This question is related to my last question here.
Edit 1:
I chose a dictionary as an example because when you look up a word, it is enough to access a specific part of the data; there is no need to process all of it.
I am not an expert in programming, so I was thinking I could modify this answer to make it work (that is why I added the AWK tag again). I am not tied to any specific OS or tool; this is just a basic idea to see what the possibilities are, so I don't know how to improve the tags.

awk cannot download. You must download the file and pipe it into a command that terminates as soon as it finds a result:
wget -qqO- http://example.com/path | grep -wim1 "word"
wget -qqO- URL produces no output other than the content of the given URL, which is written to standard output so you can parse it. grep -wim1 "word" finds the first whole-word, case-insensitive match for "word" and then terminates. If you don't need the matching line printed, you can use -wiq instead. If the dictionary has one word per line (and nothing else), you're better off with -x instead of -w, so that "can" matches only in its entirety rather than also matching "can't" (the apostrophe is a word boundary). Remove the -i if you want the match to be case-sensitive.
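If you would rather do the same thing from a script, here is a rough sketch in Python (using the third-party requests library; the URL and word are placeholders) that streams the download and stops as soon as it finds a match, so the rest of the file is never fetched:
import requests  # third-party; pip install requests

def find_first(url, word):
    # Stream the response and stop at the first whole-line, case-insensitive match.
    # Leaving the loop early closes the connection, so the rest is never downloaded.
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            if line and line.strip().lower() == word.lower():
                return line.strip()
    return None

print(find_first("http://example.com/path", "word"))  # placeholder URL and word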
In the comments, you asked:
it may improve things to jump to the start of the "w" entries so as not to download all the data from "a" to "w". Is that possible? I guess not.
Some programs can "resume" downloads and you may be able to play with that, but you'd have to guess where to start. This would be a lot of work and you might seek too far and therefore fail to get a match.
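For illustration only, this is what such a guess might look like with an HTTP Range request in Python; the URL and byte offset are made up, and the server must support partial content for this to work at all:
import requests  # third-party; pip install requests

# Guess that the "w" entries begin somewhere after byte 1,000,000.
# If the guess is too high, the word is simply missed.
resp = requests.get("http://example.com/dictionary.txt",      # placeholder URL
                    headers={"Range": "bytes=1000000-"})
if resp.status_code == 206:   # 206 Partial Content: the server honoured the range
    print("word" in resp.text)
else:
    print("Server ignored the Range header and sent the whole file.")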
If you are querying this dictionary more than once, I'd recommend downloading it and saving it so you can query it locally. Even the largest dictionary I know of is only 213 MB (compressed; search it with zgrep), though I am assuming you're talking about a traditional word list rather than a hash table or other arbitrary data structure. Of course, anything larger would take such a long time to download that you'd only want to do it once.
If you really don't want to store it locally, you should probably consider a database rather than a flat file.

Related

Man page, as structured data (csv, database, ...)

In order to simplify my question, I mainly consider man pages of commands, e.g. «man grep».
Man pages are, more or less, structured. Most sections and their presentation are standard, and an explanation can be found on https://www.tldp.org/HOWTO/Man-Page/q3.html
(And the source of man page, in groff, is not really hard to understand, even without knowing groff)
My question is: is there already a database of the (more standard) man pages? Or at least a program that takes a man page as input (as a groff file, probably) and outputs such a database?
Here, I mean database in a very vague sense. SQLite or MySQL would be perfect, but a zip of CSV files would also be great.
Let me give you an example using man grep.
The database would have an option table, with an entry for each option. This entry would contain:
- the actual option name(s)
- the abbreviation(s)
- the description of what this option does
- the enclosing section
In CSV, an entry would be:
--extended-regexp, -E, Interpret PATTERN as an extended regular expression (ERE\, see below). (-E is specified by POSIX.), Matcher Selection
It would have an "exit" table, with:
0, selected lines are found
1, otherwise
2, an error occurred\, unless the -q or --quiet or --silent option is used and a selected line is found.
And so on for each standard kind of section of the man page.
And a table with all the text that was not successfully put into some other table.
I expect that some parts would be simple to parse, for example the option table, but other parts would be quite hard, for example the exit status. That is why I really want to know whether something like this has already been done, so that I do not have to do it myself.
You can download man pages with
git clone http://git.kernel.org/pub/scm/docs/man-pages/man-pages
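I am not aware of a ready-made database, but as a rough illustration of the "simple part", here is a heuristic sketch in Python that renders a man page and scrapes its OPTIONS section into CSV rows (option, description, section). Man page layouts vary a lot, so treat this as a starting point, not a general parser:
import csv
import re
import subprocess
import sys

# Render the man page to plain text (no terminal markup when piped).
text = subprocess.run(["man", "grep"], capture_output=True, text=True).stdout

rows, current, in_options = [], None, False
for line in text.splitlines():
    if re.match(r"^OPTIONS\b", line):                 # start of the OPTIONS section
        in_options = True
        continue
    if in_options and re.match(r"^[A-Z][A-Z ]+$", line.rstrip()):
        break                                         # next all-caps section header
    if not in_options:
        continue
    m = re.match(r"^\s{3,8}(--?\S.*)$", line)         # e.g. "       -E, --extended-regexp"
    if m:
        if current:
            rows.append(current)
        current = [m.group(1).strip(), ""]
    elif current and line.strip():                    # continuation of the description
        current[1] = (current[1] + " " + line.strip()).strip()
if current:
    rows.append(current)

writer = csv.writer(sys.stdout)
for name, desc in rows:
    writer.writerow([name, desc, "OPTIONS"])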

Comments in aspell .dic files?

They look like this:
abaft
abbreviation/M
abdicate/DNGSn
Abelard/M
abider/M
Abidjan
ablaze
abloom
I am using this kind of dictionary with a Node.js application, but I need it to be smarter. Specifically, I want to remember the occurrence probability of every word based on already-processed text. I'd like to save this information in the existing .dic file - but how can I do that without making it invalid?
Is there any comment syntax that would allow me to store additional data next to the words in the file, such that a normal dictionary parser will ignore it?
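For what it's worth, one possible workaround (just a sketch, not an aspell feature) is to leave the .dic file untouched and keep the statistics in a separate sidecar file keyed by the bare word. Shown here in Python, but the same idea carries over directly to Node.js:
import json
from collections import Counter

def load_words(dic_path):
    # Read the .dic file, dropping affix flags such as "/DNGSn" so that
    # "abbreviation/M" and "abbreviation" map to the same key.
    words = []
    with open(dic_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                words.append(line.split("/", 1)[0])
    return words

def update_counts(text, counts_path="word_counts.json"):
    # Keep occurrence counts in a sidecar JSON file instead of the .dic file,
    # so the dictionary itself stays valid for any parser.
    try:
        with open(counts_path, encoding="utf-8") as f:
            counts = Counter(json.load(f))
    except FileNotFoundError:
        counts = Counter()
    counts.update(text.lower().split())
    with open(counts_path, "w", encoding="utf-8") as f:
        json.dump(counts, f, ensure_ascii=False, indent=1)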

How to replace a list of words in a Word document automatically?

I have a custom dictionary I made in Word, with about 200 words. It contains common errors that some people mistype and that do not exist in the original Word dictionary.
How can I replace any word from my list, in any document, automatically? I mean, whenever Word sees the wrong word, it should replace it with the one from my list.
I don't want to auto-correct every single word every time I open a document.
I used to do something like this some years ago with Windows Scripting Host and OLE.
Today this would be kind of old-fashioned, but I bet it still works. However, it requires you to have Word installed.
Since Word can be remote-controlled via COM, you could use nearly any programming language (C++ works natively but is a native nightmare) to drive it. Then it would be easy to traverse the document and replace the words, or to invoke the search-and-replace function multiple times.
200 words can easily be held in your source code.
As a built-in option, Windows Scripting Host is fine - you can choose between JScript and VBScript. Just create a .js file and by default it should be executed with WSH. But why not use macros? (Essentially the same functionality, only accessed through the Word GUI.) You could also use C# or Delphi.

Change Word 2013 autocorrect behaviour

This question involves bending Microsoft Word 2013 to one's will.
I have been asked to help fix a problem with Word 2013's autocorrect.
We are working on a spell checker for my native language (Afrikaans), and many Afrikaans words contain a diacritic/umlaut (ë, ö, Ü, etc.).
The spell checker consists of a .dic file which is basically just a text file that contains about 508 000 words, and an autocorrect list (.acl) file that is used to automatically replace text as you type.
The spell checker works very well for the most part. It replaces the text as you type, which is the desired effect. The problem is that autocorrect doesn't work with all words.
For example, if I want to type the Afrikaans word 'pêrels' (which means 'pearls'), I should only have to type 'perels' (without the circumflex on the 'e'), and autocorrect should automatically change it to the correct form.
Same with 'reën' (rain). If I type 'reen' (without the umlaut), it is supposed to automatically correct it.
However, in both of the above cases, the words remain unchanged. A red line appears under the words, and when you right-click, you can select the correct word from the pop-up autocorrect menu as shown in the image below.
As you can see, the correct form of the word is the first one in the context menu. I need autocorrect to automatically change the wrong word into the first word that appears in said menu. It should completely ignore the other menu items, and just go with the first word.
My initial instinct was to manually add the words to the *.acl file using a text editor, but the file is encrypted and not readable (I used Notepad++).
I then tried adding them inside Word's autocorrect options menu. However, Word 2013 has a maximum autocorrect memory of 64KB, and the size of the file is already at that maximum. Whenever I add more words, it bombs out and basically wipes the file contents. This doesn't seem like the most efficient strategy anyway, since I would need to manually enter hundreds, if not thousands of autocorrect cases. Ain't nobody got time for that!
What makes this even more complicated (ironically), is that there is no real "program". In other words, this isn't a C# program with source code that I can manipulate. I have the two files mentioned above, and Word's built-in options (which I have already explored). That's it. Nothing else.
I'm stuck. Does anyone have any ideas?
Is it perhaps possible for me to hack Word to increase the autocorrect memory to, let's say, 128 KB? Google hasn't turned up anything of use.
Or, is there a way to set Word to not give the autocorrect context menu, and instead default to the first matching word in the dictionary, as mentioned above?
I can probably write a batch script, C# program, or edit the registry if need be. I just need to know where to start.
Thanks for any help!
In case you are still looking for a solution, you might consider using AutoHotkey (http://www.autohotkey.com). It is a very powerful free open-source utility and can handle substitutions similar to AutoCorrect. Whenever the built-in features of Word and other programs fail to handle my needs, I use AutoHotkey. It has the added benefit of not being tied to any specific program (e.g. Word), so the substitutions can occur anywhere they are needed. I hope it helps you. I have used and depended on AutoHotkey for years, across new Windows versions and new Office versions, and highly recommend having a look. You might even get new ideas about time-saving automation with AutoHotkey. Good luck!

Converting words to hashtags when posting to Twitter

I currently use LinqToTwitter to send posts to Twitter. I'd like to convert words in the title of the post to hashtags when it gets fired off as a tweet, so that, for example, the blog post title "Firefox is cool" becomes "#Firefox is cool http://myshortu.rl/dhsgeh" on Twitter.
So far, the way I see it, I need a database table with the words I want to convert to hashtags. I'd have to parse the title, compare the words to those in the DB, and prepend the pound sign. Is a DB table the best way? Or can I do it with an in-memory collection, or keep the words in web.config? Thanks.
The decision on whether to use a database or a file (such as web.config) might depend on whether you want to write code that lets you maintain the list, e.g. add, modify, and remove entries. If so, then a DB sounds like the easiest option. If the list is small and doesn't change, then adding a delimited list to web.config would work fine.
Since you're using ASP.NET you can't hold it in a memory variable, but you can hold the list in Cache. This can make for some very fast lookups, rather than multiple file or DB queries.
Just to put this into perspective though, it's tough to recommend a proper design in a forum because there might be details that aren't known. So, it's best to take my answer as something that helps think about what the tradeoffs are, rather than a definitive recommendation on what you should do.
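Whichever storage you pick, the conversion itself is just a set lookup. Here is a language-neutral sketch of that logic (in Python, with a made-up word list) that you would port to your C# code:
HASHTAG_WORDS = {"firefox", "linux", "javascript"}   # hypothetical list, loaded from DB/config/cache

def hashtagify(title):
    # Prefix any whitelisted word with '#'; everything else passes through unchanged.
    out = []
    for token in title.split():
        if token.strip(".,!?").lower() in HASHTAG_WORDS:
            out.append("#" + token)
        else:
            out.append(token)
    return " ".join(out)

print(hashtagify("Firefox is cool"))   # -> "#Firefox is cool"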
