Where can I download the ispell *.dict and *.affix files?

I am quite new to PostgreSQL full text search and I am setting up the configuration as follows (exactly as in the docs):
CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE = ispell,
    DictFile = english,
    AffFile = english,
    StopWords = english
);
So this, I think, expects the files english.dict and english.affix to be in, for example:
/usr/share/postgresql/9.2/tsearch_data
But these files are not there. I just have ispell_sample.dict and ispell_sample.affix, which work fine when used in the definition above.
So... I followed this post and downloaded the required dictionary from the OpenOffice people, renaming the .dic file to .dict and the .aff file to .affix. I have also checked the encodings (using file -bi english.affix and file -bi english.dict): both are UTF-8.
When I then run the CREATE TEXT SEARCH DICTIONARY statement above, I get the error:
ERROR: wrong affix file format for flag
CONTEXT: line 2778 of configuration file "/usr/share/postgresql/9.2/tsearch_data/english.affix": "COMPOUNDMIN 1
"
I was wondering if anyone has clues on how to solve this problem, or has encountered it before.
Thanks.
UPDATE 1: I guess the question can be rephrased as follows:
where can I download the ispell *.dict and *.affix files for Postgres?

Here's a good reference: https://www.cs.hmc.edu/~geoff/ispell-dictionaries.html It is a good resource for ispell dictionaries in just about any language.
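Once english.dict and english.affix from one of those distributions are in the tsearch_data directory and the definition loads, you can sanity-check it from psql. A minimal sketch (ts_lexize is PostgreSQL's standard function for testing a single dictionary; it returns the lexemes for a known word and NULL for an unknown one):
SELECT ts_lexize('english_ispell', 'books');
-- expected: {book} once the dictionary loads correctly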

Related

How can I get the "canonical" name of the text encoding of the current Atom active editor?

I recently added a fix to the atom-script package to resolve garbled text output when compiling and running Java code in non-English Windows environments. (Issue #1166, PR #2471)
Since then, as of script package release 3.32.1, javac is invoked with both -J-Dfile.encoding=UTF-8 and -encoding UTF-8 (the diff is here). I have just realized that it would be better to pass the actual encoding of the current active editor, which holds the target source code, to the -encoding option. After some research, I learned that the encoding can be acquired by
atom.workspace.getActiveTextEditor().getEncoding()
but after testing on a Japanese edition of Windows, it returns shiftjis, which is not a valid encoding name for the javac command-line option (it should be MS932, SJIS, or something similar). I have no idea where I can get this type of encoding name without writing a large conversion table covering all possible encoding names. Is there any good utility for this purpose?
EDIT:
To demonstrate what I intend to do, I have created a branch on my fork. The diff is here.
I get the current source editor's encoding with
const fileEncoding = getJavaTextEncodingName(atom.workspace.getActiveTextEditor().getEncoding())
and pass it to javac via
const cmd = `javac -encoding ${fileEncoding} -J-Dfile.encoding=UTF-8 -Xlint -sourcepath '${sourcePath}' -d '${tempFolder}' '${filepath}' && java -Dfile.encoding=UTF-8 -cp '${tempFolder}' ${classPackages}${className}`
The function getJavaTextEncodingName() is the core of the question.
function getJavaTextEncodingName(atomTextEncodingName) {
  switch (atomTextEncodingName) {
    case "shiftjis":
      return "MS932"
  }
  return "UTF-8"
}
Obviously this converts "shiftjis" to "MS932", but implementing every possible encoding-name conversion this way would not be pretty, so I am seeking a better alternative.
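One table-driven alternative is a plain lookup object, so each new mapping is one line. This is only a sketch: the keys follow Atom's encoding identifiers and the values are Java charset names, and every entry other than shiftjis -> MS932 is an assumption that should be verified (e.g. against java.nio.charset.Charset.availableCharsets()):
// Sketch: map Atom encoding names to Java charset names.
// Entries other than shiftjis are assumptions and need verification.
const ATOM_TO_JAVA_ENCODING = {
  shiftjis: "MS932",
  eucjp: "EUC-JP",
  euckr: "EUC-KR",
  windows1251: "windows-1251"
}

function getJavaTextEncodingName(atomTextEncodingName) {
  // Fall back to UTF-8 when no explicit mapping is known
  return ATOM_TO_JAVA_ENCODING[atomTextEncodingName] || "UTF-8"
}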

How to save a file into a path containing special characters such as '&'? (a '&' which is different from the '&' typed on an English keyboard)

I need to write out a file in R to a certain path that contains a special character. The path is something like this: C:/Users/Technology & Innovation/Webscraping files/US_data/data
It works totally fine when I access this path through Python, but I cannot access the same path in R. And I cannot change this path name or remove the '&', as this path is used by a lot of people. Does anyone have a good idea of how to solve this?
I found out that the character is a '&' which is subtly different from the '&' we usually type on an English keyboard. Maybe that is the reason for the problem?
Here is what I have tried:
write.csv(df, 'C:/Users/Technology & Innovation/Webscraping files/US_data/data/file.csv')
write.csv(df, 'C:\\Users\\Technology & Innovation\\Webscraping files\\US_data/data/file.csv')
No matter whether I try to read or write a file, it does not work in my case.
I also tried resetting the working directory and got this error message:
Error in setwd("C:/Users/Technology & Innovation/Webscraping files/US_data/data") : cannot change working directory
Write it like this
C:\\Users\\Technology & Innovation\\Webscraping files\\US_data\\data
Also, you can change your current directory.
Changing your current directory helps because you can then write read.csv("filename.csv") or write.csv(name_of_file, "filename.csv") as-is, without mentioning the path.
If you do write to a full path, quote it properly:
write.csv(df, "C:\\Users\\Technology & Innovation\\Webscraping files\\US_data\\data\\filename.csv")
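If the troublesome character really is a non-ASCII lookalike, for example the fullwidth ampersand U+FF06, which renders almost like '&', one way to sidestep editor and encoding issues is to spell it as a Unicode escape in the R source. A sketch, assuming the character is U+FF06 and an existing data frame df:
# Sketch: build the path with an explicit Unicode escape (\uFF06 is an
# assumption) so the special ampersand survives regardless of the
# script file's own encoding.
dir_path <- "C:/Users/Technology \uFF06 Innovation/Webscraping files/US_data/data"
write.csv(df, file.path(dir_path, "file.csv"), row.names = FALSE)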

Avoid rendering of specific .md files from blogdown::serve_site()

I have a file located at
content/post/data_for_posts/my_file.md
I have it there because it's quite easy to do htmltools::includeMarkdown("data_for_posts/my_file.md") and recycle this file in different posts.
My problem is that when I run serve_site(), this creates public/post/data_for_posts/index.html, which means it gets posted to my website (dated January 1 of the year 0001). I guess I could change the date to the year 10000, but I would rather handle it the way I handle the .Rmd and other files, as suggested here.
I have tried to modify my config.toml but have not managed to solve the issue.
ignoreFiles = ["\\.Rmd$", "\\.Rmarkdown$", "_files$", "_cache$", "content/post/data_for_posts/my_file.md"]
Here are a couple of techniques that I use to do this:
Rename data_for_posts/my_file.md so it uses a file extension that Hugo does not interpret as a known markup language, for example change .md to .markd or .mdn.[*]
Rename data_for_posts/my_file.md so it includes a string that you will never use in a real content file, for example data_for_posts-UNPUBLISHED/my_file.md. Then add that string (UNPUBLISHED or whatever) to your config's ignoreFiles list (see the sketch after these notes).[**]
[*] In the content/ directory, a file with one of the following file extensions will be interpreted by hugo as containing a known markup language: .ad, .adoc, .asciidoc, .htm, .html, .markdown, .md, .mdown, .mmark, .pdc, .pandoc, .org, or .rst (this is an excerpt of something I wrote).
[**] The strings listed in ignoreFiles seem to be case-sensitive, so I like to use all-upper-case characters in my ignored file names (because I never use upper-case characters in real content file names). Also note that there is no need to specify the path, and my experience is that path delimiters (/ or \) cause problems.
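Putting technique [**] together, the relevant config.toml line might look like this (a sketch; UNPUBLISHED is just an arbitrary marker, and the file itself would be renamed to content/post/data_for_posts-UNPUBLISHED/my_file.md):
ignoreFiles = ["\\.Rmd$", "\\.Rmarkdown$", "_files$", "_cache$", "UNPUBLISHED"]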

Testable map base xml to flat file fails in BTS2013r2

I have lots of BTS2010 unit tests that check that an XML file can be mapped to a flat file.
I have developed the first such test on BTS2013r2, but on executing TestableMapBase.TestMap(_inputFilename, _inputType, outputFilename, _outputType), I get the error "Generate schema instance failure".
I've used Reflector to debug the MS assemblies and got as far as the following line within CFrameworkSchemaTreeExtensions.cs of Microsoft.BizTalk.TOM.Adapter:
infoArray = instanceGenerator.GenerateInstance(filename, xmlInstance);
On executing, the infoArray is populated with the following error:
ErrorInfo: "hexadecimal value 0x00, is an invalid character. Line 2, position 1."
Prior to executing, I took the content of xmlInstance, pasted it into Notepad++, and used the Hex plugin to search for null characters (hex 0x00); there are none.
I have tried many different XML inputs to the maps on two different BizTalk development laptops and get the same result.
Has anyone been able to successfully run tests of XML to flat file in BTS2013r2?
Today I created the most basic of solutions (1 BizTalk project + 1 unit test project) in order to test whether this really is a Microsoft bug. It does seem that way, because I got the same error when running this very simple test on a third BizTalk development laptop. I have added the source code to the following GitHub repo: https://github.com/RobBowman/FFMapFailBTS2013r2
Make sure it is not an encoding issue. Finding a 0x00 at that position sounds like the input file is in UTF-16 format, while the processor is expecting UTF-8 or another single-byte encoding.
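One quick way to check is to inspect the first two bytes of the instance file for a UTF-16 byte-order mark. A minimal C# sketch (the path argument is a placeholder for your test input file):
using System;
using System.IO;

// Sketch: a UTF-16 file starts with a byte-order mark (FF FE for
// little-endian, FE FF for big-endian) and pads ASCII characters with
// 0x00 bytes, which would match the reported error at line 2, position 1.
class BomCheck
{
    static void Main(string[] args)
    {
        var head = new byte[2];
        using (var fs = File.OpenRead(args[0]))
            fs.Read(head, 0, head.Length);

        bool utf16 = (head[0] == 0xFF && head[1] == 0xFE)
                  || (head[0] == 0xFE && head[1] == 0xFF);
        Console.WriteLine(utf16
            ? "UTF-16 BOM found - re-save the instance as UTF-8"
            : "No UTF-16 BOM detected");
    }
}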
Microsoft have published a hotfix for this - see: https://social.msdn.microsoft.com/Forums/en-US/cacecbfd-8b71-409c-bd59-2eed26950f25/test-map-to-flat-file-in-bts-2013r2-does-this-ever-work?forum=biztalkgeneral

How to import Geonames into SQLite?

I need to import the Geonames database (http://download.geonames.org/export/dump/) into SQLite (file is about a gigabyte in size, ±8,000,000 records, tab-delimited).
I'm using the built-in SQLite capabilities of Mac OS X, accessed through Terminal. All goes well until record 381174 (tested with an older file; the exact number varies slightly depending on the exact version of the Geonames database, as it is updated every few days), where the error "expected 19 columns of data but found 18" is displayed.
The exact line causing the problem is:
126704 Gora Kyumyurkey Gora Kyumyurkey Gora Kemyurkey,Gora Kyamyar-Kup,Gora Kyumyurkey,Gora Këmyurkëy,Komur Qu",Komur Qu',Komurkoy Dagi,Komūr Qū’,Komūr Qū”,Kummer Kid,Kömürköy Dağı,kumwr qwʾ,كُمور قوء 38.73335 48.24133 T MT AZ AZ 00 0 2471 Asia/Baku 2014-03-05
I've tested various countries separately, and the western countries all imported completely without a problem, leading me to believe the problem lies somewhere in the exotic characters used in some entries. (I've put this line into a separate file and tested it with several other database programs; some gave an error, some imported without a problem.)
How do I solve this error, or are there other ways to import the file?
Thanks for your help and let me know if you need more information.
Regarding the question title, a preliminary search resulted in:
the GeoNames format description ("tab-delimited text in utf8 encoding")
https://download.geonames.org/export/dump/readme.txt
some libraries (untested):
Perl: https://github.com/mjradwin/geonames-sqlite (+ autocomplete demo JavaScript/PHP)
PHP: https://github.com/robotamer/geonames-to-sqlite
Python: https://github.com/commodo/geonames-dump-to-sqlite
GUI (mentioned by #charlest):
https://github.com/sqlitebrowser/sqlitebrowser/
The SQLite tools have import capability as well:
https://sqlite.org/cli.html#csv_import
It looks like a bi-directional text issue. "كُمور قوء" is expected to be at the end of the comma-separated alternate name list. However, on account of it being dextrosinistral (or RTL), it's displaying on the wrong side of the latitude and longitude values.
I don't have visibility of your import method, but it seems likely to me that that's why it thinks a column is missing.
I found the same problem using the script from the geonames forum here: http://forum.geonames.org/gforum/posts/list/32139.page
Despite adjusting the script to run on Mac OS X (Sierra 10.12.6), I was getting the same errors. But thanks are due to the script author, since the script helped me get the SQLite database file created.
After a little while I decided to use DB Browser for SQLite (version 3.11.2) rather than continue with the script.
I had errors with this method as well, and found that I had to set the "Quote character" setting in the import dialog to the blank state. Once that was done, the import of the full allCountries.txt file ran to completion, taking just under an hour on my MacBook Pro (an old one, but with an SSD).
Although I have not dug in deeper, I assume the geonames text files must not be quote-parsed in any way; each line simply needs to be handled as tab-delimited UTF-8 strings.
At the time of writing allCountries.txt is 1.5GB with 11,930,517 records. SQLite database file is just short of 3GB.
Hope that helps.
UPDATE 1:
Further investigation has revealed that it is indeed due to the embedded quotes in the geonames files; https://sqlite.org/quirks.html#dblquote shows that SQLite has problems with double quotes. Hence you need to be able to switch off quote parsing in SQLite.
Despite the 3.11.2 version of DB Browser being based on SQLite 3.27.2, which does not have the required modifications to ignore the quotes, I can only assume it must be escaping the quotes when you set the "Quote character" to blank.
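For what it's worth, the same no-quote-parsing import can be scripted directly. A sketch in Python (standard library only; the table layout follows the 19 columns described in the geonames readme, and the file and database names are placeholders):
import csv
import sqlite3

conn = sqlite3.connect("geonames.sqlite")
conn.execute("""CREATE TABLE IF NOT EXISTS geoname (
    geonameid INTEGER PRIMARY KEY, name TEXT, asciiname TEXT,
    alternatenames TEXT, latitude REAL, longitude REAL,
    fclass TEXT, fcode TEXT, country TEXT, cc2 TEXT,
    admin1 TEXT, admin2 TEXT, admin3 TEXT, admin4 TEXT,
    population INTEGER, elevation INTEGER, dem INTEGER,
    timezone TEXT, moddate TEXT)""")

with open("allCountries.txt", encoding="utf-8", newline="") as f:
    # QUOTE_NONE is the key point: embedded " characters are kept as
    # ordinary data instead of being treated as field quoting.
    rows = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
    conn.executemany(
        "INSERT INTO geoname VALUES (%s)" % ",".join("?" * 19), rows)
conn.commit()
conn.close()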
