Using only UTF8 encoding in SQLite, what can I trim out of the ICU dataset?

ICU provides a way of cutting down the size of the .dat file. I'm almost certain I don't need most of the encodings that are included by default. If I want to build a CJK-specific .dat file for SQLite, which ones can I cut out?
I just need the tokenizer to work, and possibly collation. It seems that all those character conversions may not really be necessary. At 17 MB, the file is too fat! For all databases, we use
PRAGMA encoding = UTF8;
Data Customizer Link: http://apps.icu-project.org/datacustom/
To put it another way, if I'm using UTF8 in SQLite to collate and index, what parts of the .dat file do I really need? I bet the majority is never used. I suspect I don't need the Charset Mapping Tables, and maybe not some of the Miscellaneous data.
From the Data Customizer page: "This tool will generate a data library that can only be used with the 4.8 series of ICU. The help page provides information on how to use this tool."
Charset Mapping Tables (4585 KB) <-- axe?
Break Iterator (1747 KB) <-- seems like i need this
Collators (3362 KB) <-- seems like i need this for sorting (but maybe not)
Rule Based Number Format (292 KB) <-- axe?
Transliterators (555 KB) <-- axe?
Formatting, Display Names and Other Localized Data (856 KB) <-- axe?
Miscellaneous Data (5682 KB) <-- axe?
Base Data (311 KB) <-- seems basic
Update. It seems that everything can be removed except for Base Data and Break Iterator. Regarding the Collators, from http://userguide.icu-project.org/icudata:
The largest part of the data besides conversion tables is in collation
for East Asian languages. You can remove the collation data for those
languages by removing the CollationElements entries from those
source/data/locales/*.txt files. When you do that, the collation for
those languages will become the same as the Unicode Collation
Algorithm.
This seems "good enough".
On Collation
Starting in release 1.8, the ICU Collation Service is updated to be
fully compliant to the Unicode Collation Algorithm (UCA)
(http://www.unicode.org/unicode/reports/tr10/) and conforms to ISO
14651. There are several benefits to using the collation algorithms defined in these standards. Some of the more significant benefits
include:
Unicode contains a large set of characters. This can make it difficult
for collation to be a fast operation or require collation to use
significant memory or disk resources. The ICU collation implementation
is designed to be fast, have a small memory footprint and be highly
customizable.
The algorithms have been designed and reviewed by experts in
multilingual collation, and therefore are robust and comprehensive.
Applications that share sorted data but do not agree on how the data
should be ordered fail to perform correctly. By conforming to the
UCA/14651 standard for collation, independently developed
applications, such as those used for e-business, sort data identically
and perform properly.
The ICU Collation Service also contains several enhancements that are
not available in UCA. For example:
Additional case handling: ICU allows case differences to be ignored or
flipped. Uppercase letters can be sorted before lowercase letters, or
vice-versa.
Easy customization: Services can be easily tailored to address a wide
range of collation requirements.
Flexibility: ICU offers both sort key generation and fast incremental
string comparison. It also provides low-level access to collation data
through the collation element iterator.
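On the SQLite side, this collation support is exposed by the ICU extension through the icu_load_collation() SQL function. A minimal sketch in Python, assuming the sqlite3 library underneath was built with SQLITE_ENABLE_ICU (stock builds usually are not):
import sqlite3

conn = sqlite3.connect(":memory:")
# Register an ICU collator for the ja_JP locale under the name "japanese".
conn.execute("SELECT icu_load_collation('ja_JP', 'japanese')")
conn.execute("CREATE TABLE words(w TEXT)")
conn.executemany("INSERT INTO words VALUES (?)", [("ばなな",), ("りんご",), ("apple",)])
for (w,) in conn.execute("SELECT w FROM words ORDER BY w COLLATE japanese"):
    print(w)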
Update 2. If Break Iterator is removed from the .dat, the following occurs:
sqlite> CREATE VIRTUAL TABLE test USING fts4(tokenize=icu);
sqlite> CREATE VIRTUAL TABLE testaux USING fts4aux(test);
sqlite> .import test.csv test
Error: SQL logic error or missing database

(We're talking about the Data Customizer page.)
I started with the biggest items, and was able to omit these entirely:
Charset mapping tables
Miscellaneous Data
I had to include Collators, but only the languages I was supporting.
I tried to trim Break Iterator, but it broke, so I stopped there. Nothing else is nearly as big.
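Before shipping the trimmed .dat, a smoke test along these lines is cheap insurance. It assumes an ICU-enabled SQLite underneath Python and that ICU_DATA points at the directory holding the custom .dat (the path and sample text are illustrative):
import os
import sqlite3

os.environ["ICU_DATA"] = "/path/to/trimmed-data-dir"   # must be set before ICU first initializes
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE test USING fts4(body, tokenize=icu zh_CN)")
conn.execute("INSERT INTO test(body) VALUES ('我能吞下玻璃而不伤身体')")
hits = conn.execute("SELECT count(*) FROM test WHERE test MATCH '玻璃'").fetchone()[0]
print("ICU tokenizer OK" if hits else "ICU tokenizer produced no matches")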

Related

What is the maximum length of Version in Flyway database deployment scripts

A Flyway deployment script name starts with a version. What is the maximum length one can use? I see that the table column holding the version is 50 characters long.
There are a number of limits:
Version must be 50 characters or less
Description must be 200 characters or less
Migration filenames must be compatible with the OS's filename-length limits
Do you have a specific use case for a version string longer than 50 characters? We're in the middle of work for Flyway 7 and this is a chance for us to change the history table if there's a good reason to do so.
If you read the documentation located here, you'll find that the limit is not one imposed by Flyway. Rather, the limit on the length of the version comes from the OS and its limit on the size of a file name. You must ensure that you're incrementing your numbers in an appropriate order. However, as you can see in the docs, Flyway supports a wide variety of formats, and the length of the string defining the version number is not an issue you need to worry about.
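For reference, here is a quick, illustrative check of those limits against Flyway's V<version>__<description>.sql naming convention (the filename is made up):
import re

name = "V2021.04.05.1030__Add_users_table.sql"   # hypothetical migration file
m = re.match(r"V(?P<version>.+?)__(?P<description>.+)\.sql$", name)
assert m is not None
assert len(m["version"]) <= 50        # version limit mentioned above
assert len(m["description"]) <= 200   # description limit mentioned above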

How do you generate a "safe" password for use in web.config?

I need to programmatically generate a password for a SQL Server connection string that will be stored in web.config.
The password contained <, which interfered with the build and deployment process. I would like to prevent this and other "problem" characters from being used by the password generator. Is there a list of such characters, or a code snippet that will generate a secure but XML-safe password?
Base64 encoded characters are all considered safe.
The particular set of 64 characters chosen to represent the 64 place-values for the base varies between implementations. The general strategy is to choose 64 characters that are both members of a subset common to most encodings, and also printable. This combination leaves the data unlikely to be modified in transit through information systems, such as email, that were traditionally not 8-bit clean. For example, MIME's Base64 implementation uses A–Z, a–z, and 0–9 for the first 62 values. Other variations share this property but differ in the symbols chosen for the last two values; an example is UTF-7.
This is also what is used to embed image data inside an HTML document (data: URIs).
You can easily encode to Base64 using standard libraries or online tools, and the encoding handles special characters like spaces, commas, and brackets. For example:
hello there, how (are) <you> doing?
Encodes to
aGVsbG8gdGhlcmUsIGhvdyAoYXJlKSA8eW91PiBkb2luZz8=
It's also worth noting that .NET probably has libraries for generating safe hashes and passwords.
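As a sketch of the idea in Python (.NET's RandomNumberGenerator plus Convert.ToBase64String gives the same result): generate random bytes and Base64-encode them, so the output contains only XML-safe characters.
import base64
import secrets

def generate_password(num_bytes=24):
    # 24 random bytes -> 32 Base64 characters drawn from A-Z, a-z, 0-9, '+', '/'
    # (plus '=' padding). None of these need escaping in an XML attribute value.
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")

print(generate_password())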

data structure used to store file on Unix system

I just made a simple text editor using the built-in functions for opening, writing, and overwriting files; I used Python with tkinter. But I also want to extend the text editor with new features like search and replace, and to do that efficiently I need to know what data structure Unix uses to store a file's data, so I can work out the time complexity of a search.
A text file is stored as a stream of bytes. Depending on the encoding used (ASCII, UTF-8, UTF-16, etc.), each character occupies either a fixed number of bytes or, in the case of UTF-8 and some other encodings, a varying number of bytes.
The best search algorithms have a worst-case complexity of O(n + m), where n is the length of the string you're searching for and m is the length of the string you're searching in; Knuth-Morris-Pratt is a good example, and Boyer-Moore is typically even faster in practice. If the file you're working with is larger than will fit in memory, then you have to worry about buffering and such, which is an added complication but doesn't change the efficiency of the search. You will have to be creative about buffering the input so that you don't miss a string that crosses input buffer boundaries.
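Here is a minimal sketch of that buffering idea in Python (the filename is illustrative; bytes.find is already a tuned substring search). Keeping an overlap of len(pattern) - 1 bytes ensures a match that straddles two chunks is still found:
def find_in_file(path, pattern, chunk_size=1 << 20):
    # Yields the byte offset of every occurrence of pattern (a bytes object),
    # reading the file in chunks rather than loading it all into memory.
    overlap = len(pattern) - 1
    offset = 0                    # file offset corresponding to window[0]
    window = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            window += chunk
            pos = window.find(pattern)
            while pos != -1:
                yield offset + pos
                pos = window.find(pattern, pos + 1)
            # Keep only the tail that could begin a match spanning two chunks.
            keep = window[-overlap:] if overlap else b""
            offset += len(window) - len(keep)
            window = keep

for off in find_in_file("notes.txt", b"needle"):
    print(off)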

NVARCHAR2 data types in an Oracle database with AL32UTF8 character set

I have inherited an Oracle 11g database that has a number of tables with NVARCHAR2 columns.
Is the NVARCHAR2 data type necessary to store Unicode if the database already has the AL32UTF8 character set, and if not, can these columns be converted to VARCHAR2?
Thanks.
If the database character set is AL32UTF8, a VARCHAR2 column will store Unicode data. Most likely, the columns should be converted to VARCHAR2.
Assuming that the national character set is AL16UTF16, which is the default and the only sensible national character set when the database character set already supports Unicode, it is possible that the choice to use NVARCHAR2 was intentional because there is some benefit to the UTF-16 encoding. For example, if those columns are storing primarily Japanese or Chinese data, UTF-16 would generally use 2 bytes of storage per character rather than 3 bytes in UTF-8. There may be other reasons that one would prefer one Unicode encoding to another that might come into play here as well. Most of the time, though, people creating NVARCHAR2 columns in a database that supports Unicode are doing so unintentionally, not because they did a thorough analysis of the benefits of different Unicode encodings.
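As a quick illustration of that storage difference, here is a Python one-off that just counts encoded bytes for a short Japanese string:
text = "日本語のテキスト"                  # eight Japanese characters
print(len(text.encode("utf-8")))       # 24 bytes: 3 bytes per character in UTF-8
print(len(text.encode("utf-16-le")))   # 16 bytes: 2 bytes per character in UTF-16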

Disassemble to identify encryption algorithm

Goal (General)
My ultimate (long-term) goal is to write an importer for a binary file into another application.
Question Background
I am interested in two fields within a binary file format. One is
encrypted, and the other is compressed and possibly also encrypted
(See how I arrived at this conclusion here).
I have a viewer program (I'll call it viewer.exe) which can open these files for viewing. I'm hoping this can offer up some clues.
I will (soon) have a correlated deciphered output to compare and have values to search for.
This is the most relevant stackoverflow Q/A I have found
Question Specific
What is the best strategy given the resources I have to identify the algorithm being used?
Current Ideas
I realize that, without the key, identifying the algorithm from the data alone is practically impossible.
Since I have both a file and viewer.exe, I must have the key somewhere. Whether it's public, private, symmetric, etc. would be nice to figure out.
I would like to disassemble viewer.exe using OllyDbg with the findcrypt plugin as a first step. I'm just not proficient enough in this kind of thing to accomplish that yet.
Resources
full example file
extracted binary from the field I am interested in
decrypted data: in this zip archive there is a binary list of floats representing x, y, z (model2.vertices) and a binary list of integers (model2.faces). I have also included an "stl" file, which you can view with many free programs, but because of the weird way the data is stored in STLs, this is not what we expect to come out of the original file.
Progress
1. I disassembled the program with Olly, then did the only thing I know how to do at this point and "searched for all referenced text" after pausing the program right before it imports one of the files. Then I searched for strings like "crypt", "hash", "AES", "encrypt", "SHA", etc. I came up with a bunch of things, most notably "Blowfish64", which fits nicely with the fact that my data is occasionally 4 bytes too long (and since the plaintext is guaranteed to be a multiple of 12 bytes, this looks to me like padding for a 64-bit block size: odd numbers of vertices result in byte counts that are not a multiple of 8). I also found error messages like...
"Invalid data size, (Size-4) mod 8 must be 0"
After reading Igor's response below, here is the output from signsrch. I've updated this image with green dots (no problems when replaced by int3), red (the program can't start), and orange (it fails when loading a file of interest). No dot means I haven't tested it yet.
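That padding observation is easy to sanity-check with a few lines of Python: the plaintext is 12*n bytes for n vertices, and rounding it up to an 8-byte block boundary adds exactly 4 bytes whenever n is odd:
for n in range(1, 7):
    plain = 12 * n                      # three 4-byte floats per vertex
    padded = (plain + 7) // 8 * 8       # round up to the 64-bit block size
    print(n, plain, padded, padded - plain)   # padding is 4 for odd n, 0 for even n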
Accessory Info
I'm using Windows 7 64-bit
viewer.exe is a Win32 x86 application
The data is base64 encoded as well as encrypted
The deciphered data is groups of 12 bytes representing 3 floats (x,y,z coordinates)
I have OllyDbg v1.1 with the findcrypt plugin, but my usage is limited to following along with this guy's YouTube videos
Many encryption algorithms use very specific constants to initialize the encryption state. You can check if the binary has them with a program like signsrch. If you get any plausible hits, open the file in IDA and search for the constants (Alt-B (binary search) would help here), then follow cross-references to try and identify the key(s) used.
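To try the same idea by hand, here is a rough Python scan for a few well-known initialization constants (the first Blowfish P-array words, the start of the AES S-box, the SHA-256 initial state). The offsets it prints are file offsets, not virtual addresses, and 32-bit constants may also appear byte-swapped (little-endian) in the binary:
import mmap

SIGNATURES = {
    "Blowfish P-array (digits of pi)": bytes.fromhex("243f6a8885a308d3"),
    "AES S-box (first 8 bytes)":       bytes.fromhex("637c777bf26b6fc5"),
    "SHA-256 initial state H0, H1":    bytes.fromhex("6a09e667bb67ae85"),
}

def scan(path):
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mem:
        for name, sig in SIGNATURES.items():
            pos = mem.find(sig)
            while pos != -1:
                print(f"{name} at file offset 0x{pos:x}")
                pos = mem.find(sig, pos + 1)

scan("viewer.exe")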
You can't differentiate good encryption (AES with XTS mode for example) from random data. It's not possible. Try using ent to compare /dev/urandom data and TrueCrypt volumes. There's no way to distinguish them from each other.
Edit: Re-reading your question. The best way to determine which symmetric algorithm, hash and mode is being used (when you have a decryption key) is to try them all. Brute-force the possible combinations and have some test to determine if you do successfully decrypt. This is how TrueCrypt mounts a volume. It does not know the algo beforehand so it tries all the possibilities and tests that the first few bytes decrypt to TRUE.
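Once you have a key candidate, a brute-force harness along these lines is one way to run that test. This is only a sketch: it assumes pycryptodome, sticks to ECB to stay short (real code would also iterate over modes and IVs), and uses the "three finite floats per 12 bytes" structure of the expected plaintext as the plausibility check:
import math
import struct
from Crypto.Cipher import AES, Blowfish, DES3   # pycryptodome

def plausible(plaintext):
    # The decrypted field should be groups of 12 bytes = three IEEE-754 floats.
    x, y, z = struct.unpack("<3f", plaintext[:12])
    return all(math.isfinite(v) and abs(v) < 1e6 for v in (x, y, z))

def try_decrypt(ciphertext, key):
    # ciphertext is the raw bytes after base64-decoding the field.
    for name, module in (("Blowfish", Blowfish), ("AES", AES), ("3DES", DES3)):
        try:
            cipher = module.new(key, module.MODE_ECB)
            plain = cipher.decrypt(ciphertext[: 4 * module.block_size])
        except ValueError:      # bad key length for this algorithm, or input too short
            continue
        if plausible(plain):
            print("plausible candidate:", name)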
