I am looking for a word dictionary for different languages (English, Spanish, ...). However, almost all dictionaries that I could find are either provided by a program or on a website.
I want to get this word dictionary as a text file, and the file should be publicly available. Are there any publicly available word dictionaries in text files?
You can download Wiktionary for offline processing; instructions are here in the FAQ:
https://en.wiktionary.org/wiki/Help:FAQ#Downloading_Wiktionary
If you're a Linux user, you can use one of the word lists found under the following directories (depending on your distribution):
/usr/share/dict/
/var/lib/dict/
The GNU aspell program makes use of these word lists.
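These word lists are plain newline-separated text, so any program can consume them directly. A minimal Java sketch, assuming the classic words file exists at /usr/share/dict/words (the exact file name varies by distribution):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class WordList {
    public static void main(String[] args) throws IOException {
        // One word per line; some distributions ship Latin-1 word lists,
        // so adjust the charset if reading fails.
        List<String> words = Files.readAllLines(
                Paths.get("/usr/share/dict/words"), StandardCharsets.UTF_8);
        System.out.println(words.size() + " words, first: " + words.get(0));
    }
}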
I have some problems installing a German spell-checker dictionary for IntelliJ.
I downloaded the German .dic file from
http://extensions.libreoffice.org/extension-center/german-de-de-frami-dictionaries
(The ZIP contains the .dic file.)
After importing the dictionary into IntelliJ I try to use it, but I see the results with suffixes, i.e. 1:1 as they appear in the .dic file.
So the question is: should there be a different dictionary, or is IntelliJ not capable of handling the Hunspell .dic file format?
A Hunspell plugin is available now; install it and you will be able to add Hunspell dictionaries to the IntelliJ spellchecker.
When the Hunspell plugin is installed and enabled, navigate to Settings/Preferences | Editor | Spelling, switch to the Dictionaries tab, and select either the dictionary folder containing the Hunspell dictionary files (.dic and .aff) (up to IDE version 2018.1) or the dictionary file itself (.dic) (from IDE version 2018.1 on).
After this step is done, the Hunspell dictionary is bound to the spellchecker system: it will detect and highlight typos in your text and propose corrections for them, just as the plain word-list dictionaries do.
Per IntelliJ IDEA's Spelling Settings documentation and the general Spellchecking documentation, the files it uses are just plain text files "containing words separated with a newline". I can find no indication that it is able to handle files in Hunspell format.
You can also refer to this older JetBrains Support forum thread.
You need to either convert the dictionary file you want to use into a format that contains plain words only, or find another source of a dictionary for the words you want to add.
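If you go the conversion route: a Hunspell .dic file is already close to a plain word list. Its first line is a word count, and each following line is a word, optionally followed by a slash and affix flags. A rough Java sketch of the conversion (file names are made up for the example; note that stripping the flags also loses the inflected forms the affix rules would generate):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public class DicToWordList {
    public static void main(String[] args) throws IOException {
        // Check the SET line in the matching .aff file for the real encoding;
        // many German dictionaries are ISO-8859-1 rather than UTF-8.
        List<String> lines = Files.readAllLines(
                Paths.get("de_DE_frami.dic"), StandardCharsets.ISO_8859_1);
        List<String> words = lines.stream()
                .skip(1)                             // the first line is the word count
                .map(l -> l.split("/", 2)[0].trim()) // drop the affix flags after '/'
                .filter(w -> !w.isEmpty())
                .collect(Collectors.toList());
        Files.write(Paths.get("de_DE.txt"), words, StandardCharsets.UTF_8);
    }
}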
Trying to create custom types, aspects and properties for Alfresco, I followed the Alfresco Developer Series guide. When I reached the localization section, I found out that Alfresco does not handle UTF-8 encoding in the .properties files that you create: Greek characters are not displayed correctly in Share.
Checking out other built-in .properties files (/opt/alfresco-4.0.e/tomcat/webapps/alfresco/WEB-INF/classes/alfresco/messages) I noticed that in Japanese, for example, the characters are in this notation: \u3059\u3079\u3066\u306e...
So, the question is: do I have to convert the Greek words into the above-mentioned notation for Share to display them correctly, or is there another, more elegant, way to do it?
The \u#### form is the Java Unicode escape sequence, and is used to reference Unicode characters without having to worry about the encoding of the file storing them.
This question has some information on how to create and decode them.
Another way, which is what Alfresco developers tend to use, is the native2ascii tool that ships with Java itself. With it, you can initially write your strings in a UTF-8 (for example) file, then use the tool to turn them into their escaped form.
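For example, assuming your Greek strings live in a UTF-8 file named messages_el.properties (the file name is made up for the example), the conversion is a one-liner:

native2ascii -encoding UTF-8 messages_el.properties messages_el_escaped.properties

The -reverse flag performs the opposite conversion, which is handy when you need to edit an already escaped bundle.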
Is there a way to translate a Qt app into different languages without defining the texts directly in the source? I want to separate the text from the source. This would result in some kind of resource file for ALL languages, including the default language (e.g. English).
You won't be able to leave the English (or your source language, not necessarily English) text out of the XML (.ts) files, as lupdate will put it there each time you run it. However, as long as a translation exists for the chosen language, the source text will be ignored; if there is no translation text, it will default to the source text. This is useful since you're guaranteed to get some sort of text in your translation, but it will be up to your test team to ensure that the translations actually exist.
I wrote a Python script to automate the checking of the translation files, since we have 9 languages and nearly 1k strings per translation. To test for missing translations, we used a very simple sed script to create pseudo-localized source strings, so any untranslated string showed up as very evident pseudo-loc text.
Regarding the process for editing the .ts files, we farmed the translations out to individual translators, providing them with the .ts file for their language and usually about an hour's worth of hands-on instruction in using Qt Linguist. If the translator was onsite and wanted to see their translations on our device immediately, I wrote an autorun script that would place the resultant .qm file in the right place in our embedded file system and restart the application to display the new translations. If they weren't onsite, we'd run the file through the Python script mentioned above to check for a number of different problems, then simply check in the .ts file so it would get built the next time around.
HTH
You might be able to use the QT_TRANSLATE_NOOP family of macros to do what you want. Those simply mark some text as "needs to be translated" so that lupdate picks it up, but you can refer to those translated values by named constants in your code. So ...
const char *kHelloWorld = QT_TRANSLATE_NOOP("SomeScope", "Hello world");
Later...
return qApp->translate("SomeScope", kHelloWorld);
This is still in your source code somewhere, but it is at least one step removed. You could theoretically have the QT_TRANSLATE_NOOP stuff in an entirely different Qt project, but then you'd need some way in your real project to know what you are supposed to translate. (You still need the char* constants somewhere.)
What are you trying to accomplish exactly by not having the English text in the source? That might help people provide more appropriate answers.
EDIT
Based on your comment below about the intended usage:
OK, then this is precisely what the normal Qt translation process does. Running lupdate creates those XML files (they have a .ts extension). They can be opened by translators in the very-easy-to-use Qt Linguist. They are translated, then sent back to you (the programmer), where you run lrelease on them to create the binary translation files you will ship with the app. The translations are resolved at runtime, so there is no need to recompile.
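Assuming a qmake project file named myapp.pro with a TRANSLATIONS entry (names made up for the example), the round trip on the command line looks like this:

lupdate myapp.pro    # extract/refresh the source strings into the .ts files
lrelease myapp.pro   # compile each .ts into a binary .qm file for shipping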
If you wanted to enable a user to do this, you would:
Ship your application with an empty (untranslated) .ts file and the lrelease program.
Provide instructions on how to use Qt Linguist to translate. (They could use a text editor and modify the XML directly, but it's a lot easier with Linguist.)
Explain how to run lrelease and where to drop the binary translation files so that your application pulls them in.
On this last step, you could theoretically provide a nice wizard-like app that hides the implementation details.
What we will do is:
* Include a translation for the former default language, and use this .ts file to auto-generate the other .ts files. This is required because we keep the translations outside the Qt environment, since they are shared with other projects not related to Qt.
Then we only have to make sure this translation contains the correct values.
In the future we can define IDs in the code which represent text in the default translation, like translating TXT_ID_ABOUT to "About".
I'm looking for a way to include a full-blown English dictionary in an iPhone app (a word game). The database must include all conjugation possibilities for verbs as well as singular and plural spellings, so my app can query the database to check whether a spelling is correct.
Is there a free or commercial database that includes those data?
You can use UITextChecker for spell-checking.
Regarding a dictionary: when I built an iOS dictionary library some time ago (www.lexicontext.com), I used WordNet. WordNet contains a lot of interesting semantic info ...
NSSpellChecker is your easiest option, but it might be more complete to use the official online Scrabble dictionary as well and check against both (only one match required).
You could do a web-service request using http://www.hasbro.com/scrabble/en_US/search.cfm
http://www.a2zwordfinder.com/cgi-bin/scrabble.cgi?Letters=&Pattern=______&MatchType=Exactly&MinLetters=3&SortBy=Alpha&SearchType=Scrabble
Change the MinLetters parameter to get different results.
The best place to find a database for a spell-checker is probably a free text-processing application, so I'd try the OpenOffice equivalent of Word. Download it, install it and simply find the dictionary file.
OpenOffice is licensed under the LGPL, so it should be fine; just check whether the license covers the data as well (i.e. the dictionary file).
Maybe this English corpus helps: http://www.wordfrequency.info/free.asp
My application allows users to upload PDF files and store them on the web server for later viewing. I store the name of the file, location, size, upload date, user name, etc. in a SQL Server database.
I'd like to be able to programmatically, just after a file is uploaded, generate a list of keywords (maybe everything except common words) and store them in the SQL database as well, so that subsequent users can do keyword searches...
Suggestions on how to approach this task? Does this type of routine already exist?
EDIT: Just to clarify my requirements, I wouldn't be concerned with doing OCR. I don't know the insides of PDFs, but I understand that if one was generated by an app, such as printing Word to PDF, the text of the document is searchable... so really my first task, and the intent of my question, is: how do I access the text of a PDF file from an ASP.NET app? OCR on scanned PDFs is probably beyond my requirements at this point.
As a first step you should extract all text from the PDF.
Ghostscript and pdftotext can do this; PDFBox is another option.
There are certainly other tools as well.
Then you can remove all stopwords and duplicates and write it to the database.
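As a rough sketch of those two steps using Apache PDFBox 2.x (the stop-word list here is a stand-in; use real lists for your languages):

import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.stream.Collectors;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfKeywords {
    private static final Set<String> STOP_WORDS =
            new LinkedHashSet<>(Arrays.asList("the", "a", "an", "and", "of", "to", "in"));

    public static void main(String[] args) throws IOException {
        try (PDDocument doc = PDDocument.load(new File(args[0]))) {
            // Extract the embedded text layer (no OCR involved).
            String text = new PDFTextStripper().getText(doc);
            // Lower-case, tokenize, then drop stop words and duplicates.
            Set<String> keywords = Arrays.stream(text.toLowerCase().split("\\W+"))
                    .filter(w -> w.length() > 2 && !STOP_WORDS.contains(w))
                    .collect(Collectors.toCollection(LinkedHashSet::new));
            System.out.println(keywords); // ready to be written to the database
        }
    }
}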
It has been mentioned that this does not work for scanned PDF documents, but that is only half the truth. On the one hand, there are lots of scanned PDFs which additionally have the text embedded, because that is what some scanner drivers do (Canon CanoScan drivers perform OCR and generate searchable PDFs). On the other hand, documents generated with LaTeX that contain non-ASCII characters return garbage in my experience (even when I copy and paste in Acrobat).
The only problem I foresee with grabbing every non-common word is that you'll dilute your search results and have to query the DB for more PDFs. One website to look at is Scribd, which does something similar to what you are talking about doing: users upload files and people can view them online via a Flash app.
That is a very interesting topic. The question is how many keywords you need to define one PDF. If you say:
3 to 10 - I would check text-categorization methods such as a Bayesian classifier or k-NN (which will group PDF files into clusters of similar documents). I know similar algorithms are used to filter spam. But these are systems that need input; for example, if you add keywords to 100 PDFs, the system will learn the schemas. I am not an expert, but this is one way to do it.
more than 10 - then I would suggest brute force: filter common words, then take the most frequent words of a specific document (a sketch follows below).
I would explore the first option. You should definitely check methods such as "text categorization", "auto tagging", "text mining" and "automatic keyword extraction".
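The brute-force variant is easy to sketch in Java: count word frequencies after dropping common words and keep the top N (the stop-word handling is again only a stand-in):

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class TopKeywords {
    // Returns the n most frequent non-common words in the given text.
    static List<String> topKeywords(String text, Set<String> commonWords, int n) {
        Map<String, Long> freq = Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(w -> w.length() > 2 && !commonWords.contains(w))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
        return freq.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}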
Some links:
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
Keyword Extraction Using Naive Bayes
If you are planning on indexing PDF documents, you should consider using a dedicated text search engine like Lucene. Lucene provides features that will be difficult to implement using only SQL and a relational database. You will still need to extract the text from the PDF documents, but won't have to worry about filtering out common words. By filtering out common words, you will completely lose the ability to do phrase searches.
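As an illustration, a minimal indexing sketch against Lucene's Java API (the exact imports are version-dependent; this shape matches Lucene 5 and later):

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class PdfIndexer {
    // Adds one extracted PDF's text to the index. Indexing the full text
    // (rather than a pre-filtered keyword list) preserves phrase searches.
    public static void index(String fileName, String extractedText) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("pdf-index"));
        try (IndexWriter writer =
                new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("filename", fileName, Field.Store.YES));
            doc.add(new TextField("contents", extractedText, Field.Store.NO));
            writer.addDocument(doc);
        }
    }
}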