How to import Geonames into SQLite? - sqlite

I need to import the Geonames database (http://download.geonames.org/export/dump/) into SQLite (file is about a gigabyte in size, ±8,000,000 records, tab-delimited).
I'm using the built-in SQLite-possibilities of Mac OS X, accessed through terminal. All goes well, until record 381174 (tested with older file, the exact number varies slightly depending on the exact version of the Geonames database, as it is updated every few days), where the error "expected 19 columns of data but found 18" is displayed.
The exact line causing the problem is:
126704 Gora Kyumyurkey Gora Kyumyurkey Gora Kemyurkey,Gora
Kyamyar-Kup,Gora Kyumyurkey,Gora Këmyurkëy,Komur Qu",Komur
Qu',Komurkoy Dagi,Komūr Qū’,Komūr Qū”,Kummer Kid,Kömürköy Dağı,kumwr
qwʾ,كُمور
قوء 38.73335 48.24133 T MT AZ AZ 00 0 2471 Asia/Baku 2014-03-05
I've tested various countries separately, and the western countries all completely imported without a problem, causing me to believe the problem is somewhere in the exotic characters used in some entries. (I've put this line into a separate file and tested with several other database-programs, some did give an error, some imported without a problem).
How do I solve this error, or are there other ways to import the file?
Thanks for your help and let me know if you need more information.

Regarding the question title, a preliminary search resulted in
the GeoNames format description ("tab-delimited text in utf8 encoding")
https://download.geonames.org/export/dump/readme.txt
some libraries (untested):
Perl: https://github.com/mjradwin/geonames-sqlite (+ autocomplete demo JavaScript/PHP)
PHP: https://github.com/robotamer/geonames-to-sqlite
Python: https://github.com/commodo/geonames-dump-to-sqlite
GUI (mentioned by @charlest):
https://github.com/sqlitebrowser/sqlitebrowser/
The SQLite tools have import capability as well:
https://sqlite.org/cli.html#csv_import

It looks like a bi-directional text issue. "كُمور قوء" is expected to be at the end of the comma-separated alternate name list. However, on account of it being dextrosinistral (or RTL), it's displaying on the wrong side of the latitude and longitude values.
I don't have visibility of your import method, but it seems likely to me that that's why it thinks a column is missing.
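One way to check whether a field is really missing, independent of how a terminal or editor draws right-to-left text, is to count the tab-separated fields in the raw line. The following is a minimal Python sketch; the sample record is made up for illustration (it is not the actual GeoNames line) and `count_fields` is a hypothetical helper name.

```python
def count_fields(line: str, delimiter: str = "\t") -> int:
    """Return the number of delimiter-separated fields in one raw line."""
    return len(line.rstrip("\n").split(delimiter))


# Made-up 5-field record: the Arabic name is logically the third field,
# even if a bidi-aware display draws it elsewhere on the line.
sample = "1\tGora\tكُمور قوء\t38.7\t48.2\n"
print(count_fields(sample))
```

If the count matches the expected number of columns, the file itself is fine and the cause lies in how the importer parses the line.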

I found the same problem using the script from the geonames forum here: http://forum.geonames.org/gforum/posts/list/32139.page
Despite adjusting the script to run on Mac OS X (Sierra 10.12.6) I was getting the same errors. But thanks to the script author since it helped me get the sqlite database file created.
After a little while I decided to use DB Browser for SQLite (version 3.11.2) rather than continue with the script.
I had errors with this method as well and found that I had to set the "Quote character" setting in the import dialog to the blank state. Once that was done the import from the FULL allCountries.txt file ran to completion taking just under an hour on my MacBookPro (an old one but with SSD).
Although I have not dived in deeper I am assuming that the geonames text files must not be quote parsed in any way. Each line simply needs to be handled as tab delimited UTF-8 strings.
At the time of writing allCountries.txt is 1.5GB with 11,930,517 records. SQLite database file is just short of 3GB.
Hope that helps.
UPDATE 1:
Further investigation has revealed that it is indeed due to the embedded quotes in the geonames files, and looking here: https://sqlite.org/quirks.html#dblquote shows that SQLite has problems with quotes. Hence you need to be able to switch off quote parsing in SQLite.
Despite the 3.11.2 version of DB Browser being based on SQLite 3.27.2 which does not have the required mods to ignore the quotes, I can only assume it must be escaping the quotes when you set the "Quote character" to blank.
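As an alternative to the GUI, the "no quote parsing" idea can be sketched in Python using only the standard library. This is a rough sketch under assumptions, not a tested GeoNames importer: the table name `geoname` and the placeholder column names `c1` to `c19` are invented here (the count of 19 comes from the error message in the question), and silently skipping lines with a different field count is a design choice you may not want.

```python
import csv
import sqlite3


def import_tsv(tsv_path, db_path, table="geoname", n_columns=19):
    """Load a tab-delimited UTF-8 file into SQLite without quote parsing."""
    # Placeholder column names c1..cN, all TEXT (an assumption).
    columns = ", ".join(f"c{i} TEXT" for i in range(1, n_columns + 1))
    placeholders = ", ".join("?" * n_columns)
    con = sqlite3.connect(db_path)
    try:
        con.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
        with open(tsv_path, encoding="utf-8", newline="") as handle:
            # QUOTE_NONE makes " and ' ordinary characters inside a field.
            reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
            # Lines with an unexpected field count are skipped.
            rows = (row for row in reader if len(row) == n_columns)
            with con:  # commit the whole import as one transaction
                con.executemany(
                    f'INSERT INTO "{table}" VALUES ({placeholders})', rows
                )
        return con.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    finally:
        con.close()
```

Calling `import_tsv("allCountries.txt", "geonames.db")` (the database file name is hypothetical) would return the number of rows in the table afterwards; real column names and types can be taken from the GeoNames readme.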

Related

How do I tell Vim to use any/all dictionary files that fit a filepath wildcard glob?

I am trying to set the dictionary option (to allow autocompletion of certain words of my choosing) using wildcards in a filename glob, as follows:
:set dict+=$VIM/dict/dict*.lst
The hope is that, with this line in the initially sourced .vimrc (or, in my case of Windows 10, _vimrc), I can add different dictionary files to my $VIM/dict directory later, and each new invocation of Vim will use those dictionary files, without me needing to modify my .vimrc settings.
However, an error message says that there is no such file. When I give a specific filename (as in :set dict+=$VIM/dict/dict01.lst ), then it works.
The thing is, I could swear that this used to work. I had this setting in my .vimrc files since I started using Vim 7.1, and I don't recall any such error message until recently. Now it shows up on my Linux laptop as well as my Windows 7 and Windows 10 laptops. I can't remember exactly when this started happening.
Yes, I tried using backslashes (as in :set dict+=$VIM\dict\dict*.lst ) in case it was a Windows compatibility issue, but that still doesn't work. (Also this is happening on my Linux laptop, too, and that doesn't use backslashes for filepaths.)
Am I going senile? Or is there some other mysterious force going on?
Assuming for now that it is a change in the latest version of Vim, is there some way to specify "use all the dictionary files that fit this glob"?
-- Edited 2021-02-14 06:17:07
I also checked to see if it was due to having more than one file that fits the wildcard glob. (I thought that if I had more than one file that fit the wildcard, the glob would turn into two filenames, equivalent to saying dict+=$VIM/dict/dict01.lst dict02.lst which would not be syntactically valid.) But it still did not work after removing the extra files so that only one file fit my pathname of $VIM/dict/dict*.lst . (I had previously put another Addendum here happily explaining that this was how I solved my problem, but it turned out to be premature.)
You must expand wildcards before setting an option. Multiple file names must be separated by commas. For example,
let &dictionary = tr(expand("$VIM/dict/dict*.lst"), "\n", ",")
If adding a value to a non-empty option, don't forget to add a comma too (let is more universal than set, so it's less forgiving):
let &dictionary .= "," . tr(expand(...)...)

Testable map base xml to flat file fails in BTS2013r2

I have lots of BTS2010 unit tests that check an XML file can be mapped to flat file.
I have developed my first of such tests on BTS2013r2 but on executing TestableMapBase.TestMap(_inputFilename, _inputType, outputFilename, _outputType), I get the error "Generate schema instance failure"
I've used reflector to debug the MS assemblies and got as far as the following line within CFrameworkSchemaTreeExtensions.cs of Microsoft.BizTalk.TOM.Adapter :
infoArray = instanceGenerator.GenerateInstance(filename, xmlInstance);
on executing, the infoArray is populated with the following error
ErrorInfo: hexadecimal value 0x00, is an invalid character. Line 2, position 1."
Prior to executing I have taken the content of xmlInstance, pasted into Notepad++ and used the Hex plugin to search for null characters (hex 0x00), there are none.
I have tried many different XML inputs to the maps on two different BizTalk development laptops and get the same result.
Has anyone been able to successfully run tests of XML to flat file in BTS2013r2?
Today I have created the most basic of solutions (1 BizTalk project + 1 unit test project) in order to test if this really is a Microsoft bug. It does seem that way because I got the same error when running this very simple test on a third BizTalk development laptop. I have added the source code to the following github repo: https://github.com/RobBowman/FFMapFailBTS2013r2
Make sure it is not an encoding issue. Finding a 0x00 at that position sounds like the input file is in UTF-16 format, while the processor is expecting UTF-8 or a single-byte encoding.
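A quick way to test this hypothesis outside BizTalk is to look at the first bytes of the input file. The sketch below is a generic heuristic written in Python, not part of any BizTalk API; `guess_encoding` is a hypothetical helper, it ignores UTF-32, and the final "utf-8" answer only means that no sign of UTF-16 was found.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Guess a text encoding from a byte-order mark or NUL-byte pattern."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    head = data[:64]
    if b"\x00" in head:
        # ASCII-range text in UTF-16 has a NUL in every other byte:
        # at odd offsets for little-endian, at even offsets for big-endian.
        return "utf-16-le" if head[1:2] == b"\x00" else "utf-16-be"
    # No evidence of UTF-16; assume UTF-8 or a single-byte encoding.
    return "utf-8"
```

For a file on disk, something like `guess_encoding(open(path, "rb").read(64))` would be enough, since only the first bytes are inspected.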
Microsoft have published a hotfix for this - see: https://social.msdn.microsoft.com/Forums/en-US/cacecbfd-8b71-409c-bd59-2eed26950f25/test-map-to-flat-file-in-bts-2013r2-does-this-ever-work?forum=biztalkgeneral

how to display chinese character properly in sqlite console?

Here is a sample CSV file in UTF-8 format. It can be opened in Windows 7's Notepad, where the Chinese characters display properly. Please download it:
http://pan.baidu.com/s/1sj0ia4H
Open cmd and set chcp 65001.
C:\Users\pengsir>sqlite3 e:\\test.db
SQLite version 3.8.4.3 2014-04-03 16:53:12
Enter ".help" for usage hints.
sqlite> create table ipo(name TEXT,method TEXT);
sqlite> .separator ","
sqlite> .import "e:\\tmp.csv" ipo
sqlite> select * from ipo;
000001,公开招募
000002,申请表抽签é™é¢è®¤è´­
000004,定å‘å‘è¡Œ
000005,银行储蓄存å•æ–¹å¼
000006,申请表抽签é™é¢è®¤è´­
000007,自办å‘è¡Œ
000008,自办å‘è¡Œ
000009,定å‘å‘è¡Œ
000010,定å‘å‘è¡Œ
000011,申请表抽签等é¢è®¤è´­
sqlite>
Why does the same SQLite command display properly in sqlitemanager?
And how can I get Chinese characters to display in the sqlite console?
With pysqlite3, the display is correct in the Python console.
>>> import sqlite3
>>> con=sqlite3.connect("e:\\test.db")
>>> cur=con.cursor()
>>> cur.execute("select * from ipo;")
<sqlite3.Cursor object at 0x01751720>
>>> print(cur.fetchall())
[('000001', '公开招募'), ('000002', '申请表抽签限额认购'), ('000004', '定向发行'
), ('000005', '银行储蓄存单方式'), ('000006', '申请表抽签限额认购'), ('000007',
'自办发行'), ('000008', '自办发行'), ('000009', '定向发行'), ('000010', '定向发
行'), ('000011', '申请表抽签等额认购')]
>>>
This issue concerns how the Command Prompt window shows the characters, not how sqlite3 prints the output.
As a simple demonstration, we leave sqlite3 out entirely and look at the files with the type command:
Let's see what happens on another OS, for example OS X:
ISO-8859-1 corresponds to "Windows latino 1"; equivalent Windows code page setting: chcp 819
UTF8 corresponds to Unicode (UTF-8); equivalent Windows code page setting: chcp 65001
Much the same behavior occurs on Windows:
use the chcp command to inspect or set your current code page.
NOTICE: this is a screenshot of an Italian Windows XP and, as you can see, there is still no luck! :-( In this case the cause is a lack of available fonts that can be configured in the command prompt properties on my "Windows XP" box:
I hope this is not the case on your "Windows Seven" box (but if it is, please leave me a comment so I can be more specific in this part of the answer).
When the problem comes down to the available fonts, additional language support has to be installed, and you still need to force UTF-8 with chcp 65001:
How to get proper fonts
Here are the steps I followed to get the result shown in the above screenshot on Italian Windows XP SP2:
Step 1: Install East Asian language files on your computer
Reference link: how to install East Asian language files on your computer
In summary, both of these options were checked,
and in the "Advanced" tab I selected Chinese:
Step 2: Switch from the raster font to a Chinese font in the Command Prompt window
Step 3 (optional): Check the font in Notepad
Notepad can be useful for inspecting fonts; for example, open temp.csv and experiment with the fonts, but be aware of: Necessary criteria for fonts to be available in a command window
Well the obvious problem is that Windows (pretty much in general) has a problem in dealing with UTF-8. Especially the command line tool is by default set to a country specific codepage rather than unicode.
Usually you can (temporarily) fix it by setting the codepage for the command-line session to utf-8, for example by typing:
chcp 65001
But the problem is that in your case this does not really fix it, since sqlite seems to still run with the default charset, and there does not seem to be any option to set the current sqlite3 session to unicode.
Still, the good news is that your data is correct, and you can work with it using sqlitemanager or similar tools that handle Unicode appropriately.
To further substantiate this: If you open your original csv with Excel it probably also will give you messed up characters (since it usually does not default to unicode). Whereas LibreOffice will typically ask you for the encoding to use, and given unicode will show the correct text, but given a different encoding (eg: western europe, etc.) will give you the same result as excel (you can preview it there quite nicely, give it a shot).
Hope this helps!
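The claim that the stored data is intact and only the display is wrong can be illustrated with a short Python sketch. It assumes the console misreads UTF-8 bytes as a single-byte code page; Latin-1 stands in for that code page here purely because it can decode any byte, and `misdecode` and `recover` are hypothetical helper names.

```python
def misdecode(text: str, wrong: str = "latin-1") -> str:
    """Show how UTF-8 bytes look when read with the wrong code page."""
    return text.encode("utf-8").decode(wrong)


def recover(garbled: str, wrong: str = "latin-1") -> str:
    """Undo misdecode(): the underlying bytes were never damaged."""
    return garbled.encode(wrong).decode("utf-8")


original = "定向发行"
garbled = misdecode(original)
# Each Chinese character becomes three single-byte "characters".
print(len(original), len(garbled))
print(recover(garbled) == original)
```

Because the round trip restores the original text, garbled console output alone is not evidence that the import corrupted the database.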

Is there a working distribution of sqlite available for OpenVMS?

I am looking for a working distribution of SQLite for OpenVMS. I tried building SQLite 3.7.9 from the amalgamation file, using patches I found in a mailing list, but it does not quite work.
I am using HP C V7.1-015 on OpenVMS Alpha 7.3-2.
Since I cannot install python, which seems to include SQLite3, I have to build from sources.
I compile using the following commands:
$ CC /OPTIMIZE -
/DEFINE=(SQLITE_THREADSAFE=0, -
SQLITE_OMIT_LOAD_EXTENSION=1, -
SQLITE_OMIT_COMPILEOPTION_DIAGS=1, -
SQLITE_OMIT_MEMORYDB=1, -
SQLITE_OMIT_TEMPDB=1, -
SQLITE_OMIT_DEPRECATED=1, -
SQLITE_OMIT_SHARED_CACHE=1, -
_USE_STD_STAT=ENABLE) -
/FLOAT=IEEE_FLOAT -
sqlite3.c
$ CC /OPTIMIZE -
/DEFINE=(SQLITE_THREADSAFE=0, -
SQLITE_OMIT_LOAD_EXTENSION=1, -
SQLITE_OMIT_COMPILEOPTION_DIAGS=1, -
SQLITE_OMIT_MEMORYDB=1, -
SQLITE_OMIT_TEMPDB=1, -
SQLITE_OMIT_DEPRECATED=1, -
SQLITE_OMIT_SHARED_CACHE=1, -
_USE_STD_STAT=ENABLE) -
/FLOAT=IEEE_FLOAT -
shell.c
I copied the defines from the mailing list, and added /FLOAT=IEEE_FLOAT to get rid of most warnings regarding floating points (related to overflows due to exponent 308).
During compilation I got some informationals and warnings.
I get the following messages while linking:
$ LINK shell.obj,sqlite3.obj
...
%LINK-W-NUDFSYMS, 2 undefined symbols:
%LINK-I-UDFSYM, __STD_FSTAT
%LINK-I-UDFSYM, __STD_STAT
...
Since I am a little bit lost here, I would rather have SQLite3 sources that compile on OpenVMS.
The specific problem you're getting from the linker arises from the fact that you've requested capability at compile time that your system doesn't have. I believe the _USE_STD_STAT option first became available in OpenVMS v8.2, yet you're on 7.3-2. Your compiler and your headers know what to do when _USE_STD_STAT is defined, but the functions to process the X/Open-compliant stat structure do not exist in the C run-time (CRTL in VMS parlance) on your system, and your linker is telling you, "ain't got those functions."
Ideally you would be able to upgrade your operating system. Current as of this writing is v8.4. v7.3-2 was released eight and a half years ago and v8.2 over seven years ago. I understand that there are technical, budgetary, and even political reasons that upgrades aren't always possible. If it were me, and I were stuck on OpenVMS Alpha v7.3-2, I would try removing the _USE_STD_STAT=ENABLE from the compilation and see what blows up.
One of the side effects of enabling _USE_STD_STAT is that you also get _LARGEFILE along with it. If that's the only reason SQLite needs the option, you may be fine but limited to 4GB databases. I suspect there's more to it than that, i.e., SQLite very likely makes use of elements in the stat structure that do actually require the newer structure.
You can read up on the differences in the traditional and standards-compliant stat structures at http://h71000.www7.hp.com/doc/84final/5763/5763profile_062.html#index_x_1699.
I've recently improved my VMSish patch for SQLite and made it available for SQLite version 3.7.14.1: http://www.mail-archive.com/sqlite-users@sqlite.org/msg73570.html (or http://sqlite.1065341.n5.nabble.com/Building-SQLite-3-7-14-1-for-OpenVMS-td65277.html).
POSIX locking still doesn't work though, and I was unable to find out why.
Well, there was a message on the sqlite-users mailing list on getting SQLite 3.7.9 working on OpenVMS. I don't know how relevant that is to the version you've got (or if the patch was adopted by the SQLite developers; they're a bit picky for legal reasons IIRC) but it looks likely to be useful. Good luck.

console print w/o scrolling

I have seen console apps print colors, and apps such as ffmpeg print text over itself instead of on a new line. How do I print over an existing line? I want to display fps in my console app at either the very top or the very bottom, and have regular printfs go there and scroll normally.
I need this for windows, but this is meant to be cross platform, so I will eventually have a linux and mac implementation.
There are two simple possibilities which work on Linux as well as Windows, but only for one line:
printf("\b"); moves the cursor back one character, so you can count how many characters you want to backspace over and call this in a loop, or, if you know that you only write n characters, do it like printf("\b\b\b\b\b\b\b\b\b\b");
printf("text to be overwritten by next printf\r"); this will return the cursor to the beginning of the line, so any next printf will overwrite it. Make sure to write a string of same length or longer so you overwrite it entirely.
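The carriage-return approach can be sketched as follows. Python is used here only to keep the example short; the same "\r" character works from C's printf. `status_line` and `show_fps` are hypothetical helper names, and the width of 40 is an arbitrary assumption that should be at least as long as the longest message.

```python
import sys
import time


def status_line(text: str, width: int = 40) -> str:
    """Build a string that redraws the current console line in place."""
    # "\r" moves the cursor to column 0; the padding blanks out
    # leftovers from a previous, longer message.
    return "\r" + text.ljust(width)


def show_fps(samples, delay=0.0, out=sys.stdout):
    """Print each fps value over the previous one on the same line."""
    for fps in samples:
        out.write(status_line(f"fps: {fps:.1f}"))
        out.flush()  # no newline is written, so flush explicitly
        time.sleep(delay)
    out.write("\n")  # finish the line so later output starts cleanly
```

For example, `show_fps([59.9, 60.0, 58.7], delay=0.5)` would update a single line three times. This only covers one line; keeping a fixed status line while other output scrolls needs cursor addressing, as discussed in the rest of the answers.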
If you want to rewrite several lines, there is nothing so portable as ncurses; there are libraries for it on practically every operating system, and you don't have to take care of the ANSI differences.
edit: added link to ncurses wikipedia page, gives great overview and introduction, as well as link list and maybe a translation to your preferred language
Check out ncurses. It has bindings for most scripting languages.
You can use '\r' instead of '\n'.
The ASCII character number 8 (a.k.a. Ctrl-H, BS or Backspace) lets you back up one character. ASCII character number 13 (a.k.a. Ctrl-M, CR or Carriage Return) returns the cursor to the beginning of the line.
If you are working in C try putchar(8); and putchar(13);
The magic of colors, cursor positioning, blinking and so on lies in ANSI escape codes. Any text console capable of handling ANSI codes can use them simply by printing them to the console (e.g. by means of echo in a bash script or the printf() function in C).
Unix terminals support ANSI escape sequences, and the Windows world used to support them back in the old MS-DOS days, but the multibyte console support put an end to this. There is more information here. However, on Windows there are other options besides printing ANSI sequences. Moreover, if you have Cygwin installed on your Windows machine, ANSI codes work just as well as on any Unix terminal.
Many people mention the ncurses library, which is the de facto standard for GUI-like text-based applications. What this library does is hide all the terminal differences (Windows/Unix flavours) so that the same information is presented as identically as possible across all platforms, though from my own experience this is not always true (e.g. typical text window frames change because the special characters are not available under all character encodings). The downside of ncurses is that it is a complete API, and it is much harder to get started with than simply writing out some ANSI escape sequences for simple things such as changing the font color, clearing the screen or moving the cursor to an arbitrary position.
For the sake of completeness I paste an example of use of ANSI sequence under Linux that changes the prompt to blue and shows the date:
PS1="\[\033[34m\][\$(date +%H%M)][\u@\h:\w]$ "
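The same kind of sequence can be produced from a program. Below is a minimal Python sketch that builds the color and cursor-positioning sequences as strings; `colorize` and `move_cursor` are hypothetical helper names, and the output only looks right on a terminal that interprets ANSI escape codes.

```python
CSI = "\033["  # Control Sequence Introducer: ESC followed by "["


def colorize(text: str, sgr_code: int) -> str:
    """Wrap text in an SGR color sequence and reset attributes afterwards."""
    return f"{CSI}{sgr_code}m{text}{CSI}0m"


def move_cursor(row: int, col: int) -> str:
    """Return the sequence that moves the cursor to a 1-based row/column."""
    return f"{CSI}{row};{col}H"


# 34 is the blue foreground code, the same one used in the PS1 example.
print(colorize("blue text", 34))
```

Writing `move_cursor(1, 1)` before a status string would place it at the top-left corner, which is one way to keep an fps counter in a fixed spot on ANSI-capable terminals.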
You can use Ncurses -
ncurses package is a subroutine library for terminal-independent screen-painting and input-event handling which presents a high level screen model to the programmer, hiding differences between terminal types and doing automatic optimization of output to change one screenfull of text into another
Depending on the platform which you are developing on there's probably a more powerful API which you could use, rather than old ASCII control codes.
e.g. If you are working on Win32 you can actually manipulate the console screen buffer directly.
A good place to start might be here
http://msdn.microsoft.com/en-us/library/ms683171(VS.85).aspx
I have been looking for similar functions/API which would allow me to access the console as something other than a stream of text for other platforms. Haven't found anything yet, but then again, I haven't been looking that hard.
Hope it helps.
