Proper method of converting base64 image strings to binary? - sqlite

I have a table in a SQLite database that contains about 15,000 single-page document scans stored as base64 strings. If I understand correctly, converting these to binary would reduce the size of the table by 25%.
Am I right that SQLite cannot convert the images to binary directly, and that the base64 strings must first be turned into images and then into binary? If so, will creating an image in Tcl from each base64 string and converting it to binary suffice? And are there any tricky points a novice is likely to overlook in attempting this?
When the test code below is executed, it appears that img_binary is binary data, but is this the correct approach?
Thank you.
set db "database_name"
sqlite3 dbws $db
#Base64 strings in database are prefixed with "data:image/gif;charset=utf-8;base64,"
set l [expr {[string length {data:image/gif;charset=utf-8;base64,}] - 1}]
dbws eval { select img_base64 from lexi_raw where img_no = $nbr } {
    image create photo ::img::lexi -data [string replace $img_base64 0 $l]
    set img_binary [::img::lexi data -format png]; # Does this return binary to be written to SQLite?
    puts $img_binary
}

SQLite doesn't have a built-in base64 decoder, but you can add one.
Try this:
package require sqlite3
sqlite3 db :memory:
db function base64decode -argcount 1 -deterministic -returntype blob {binary decode base64}
db eval {SELECT base64decode('SGVsbG8sIFdvcmxk') AS message}
The trick is the function method, which creates a new function (called base64decode) that is implemented by the given Tcl script fragment (binary decode base64; the argument is appended as a word to that). I'm passing -argcount 1 because we only ever want to pass a single argument here, -deterministic because the result is always the same for the same input, and -returntype blob because we know the result is binary.
If you want to do more complex processing (such as stripping a prefix as well) then it's best to implement by calling a procedure:
db function base64decode -argcount 1 -deterministic -returntype blob myDecoder
proc myDecoder value {
    # Strip the leading prefix
    regsub {^data:image/gif;charset=utf-8;base64,} $value "" value
    # Decode the rest
    return [binary decode base64 $value]
}
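Applied to the question's table, the one-off conversion could then run entirely inside SQLite. This is a sketch, assuming the function is registered on the question's dbws handle and that adding a new img_blob column is acceptable:
# Assumed migration: register the decoder on the question's connection,
# add a BLOB column, and decode every row into it.
dbws function base64decode -argcount 1 -deterministic -returntype blob myDecoder
dbws eval {ALTER TABLE lexi_raw ADD COLUMN img_blob BLOB}
dbws eval {UPDATE lexi_raw SET img_blob = base64decode(img_base64)}
Note that no intermediate Tk image is needed: decoding the base64 text recovers the original GIF bytes exactly, whereas round-tripping through image create photo and data -format png would re-encode the image.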

Related

How do I save an image to the database using caveman and sxql?

I am trying to build a website that takes an uploaded image and saves it in the PostgreSQL database.
From caveman I can do:
(caveman2:request-raw-body caveman2:*request*)
Which gives me a circular stream: CIRCULAR-STREAMS:CIRCULAR-INPUT-STREAM.
I suppose I can then use a read-sequence to put the contents into a byte array
(let ((buffer (make-array 5 :adjustable t :fill-pointer 5)))
  (read-sequence buffer (caveman2:request-raw-body caveman2:*request*))
  (add-picture-to-db buffer))
The problem occurs when I try to save this byte array into the database using sxql.
(defun add-picture-to-db (picture)
  (with-connection (db)
    (datafly:execute
     (sxql:update :testpictures
       (sxql:set= :picture picture)
       (sxql:where (:= :id 1))))))
I guess it is failing because sxql ultimately generates an SQL query string, which won't work well with binary data. Is there something I'm missing? How can I make this work?
Ideally, the way to verify the solution would be to retrieve the saved image from the db, serve it as the response of a http request and see if the client gets the image.
It would be much better to use Postmodern for this, as it supports byte data with PostgreSQL directly.
However, it is possible to work around the limitations of sxql. The first thing to understand is that sxql will ultimately generate an SQL query string, which will cause problems if you insert byte data directly into it.
It is necessary to convert the bytes of the file you want to store into HEX so that it can be used in sxql.
(format nil "~{~2,'0X~}" list-of-bytes-from-file)
Running this format over all the bytes of the file gives you a string with two hex digits per byte. This is important because other ways of converting bytes to hex may not keep the two-digit padding, leaving you with an odd number of hex digits.
For example:
(write-to-string 0 :base 16)
This returns a single hex digit, with no leading zero.
Next, you store the resulting string as you normally would into a bytea type column in the db using sxql.
When retrieving the file from the database, you get a byte array representing the hex string. You can convert it back to a hex string using:
(flexi-streams:octets-to-string byte-array :external-format :utf-8)
The next step is to split the resulting string into pairs of hex digits, e.g. ("FF" "00" "A2"), then convert each pair back into a byte:
(parse-integer pair :radix 16)
Store those bytes in an array of element type (unsigned-byte 8), and finally return that array as the body of the response in caveman2 (remembering to also set the corresponding Content-Type header).
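A minimal sketch of both conversions (the helper names are mine; nothing here comes from sxql or caveman):
(defun bytes-to-hex-string (bytes)
  "Render each byte as exactly two hex digits, preserving zero padding."
  (format nil "~{~2,'0X~}" (coerce bytes 'list)))

(defun hex-string-to-bytes (hex)
  "Inverse of BYTES-TO-HEX-STRING: parse consecutive two-digit pairs."
  (let ((bytes (make-array (/ (length hex) 2)
                           :element-type '(unsigned-byte 8))))
    (loop for i from 0 below (length hex) by 2
          for j from 0
          do (setf (aref bytes j)
                   (parse-integer hex :start i :end (+ i 2) :radix 16)))
    bytes))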

Teradata query returns bad characters in string column but exporting to CSV from assistant console works

I am using the DBI package in R to connect to Teradata like this:
library(teradatasql)
query <- "
SELECT sku, description
FROM sku_table
WHERE sku = '12345'
"
dbconn <- DBI::dbConnect(
  teradatasql::TeradataDriver(),
  host = teradataHostName, database = teradataDBName,
  user = teradataUserName, password = teradataPassword
)
dbFetch(dbSendQuery(dbconn, query), -1)
It returns a result as follows:
SKU DESCRIPTION
12345 18V MAXâ×¢ Collated Drywall Screwgun
Notice the bad characters â×¢ above. This is supposed to be superscript TM for trademarked.
When I use SQL assistant to run the query, and export the query results manually to a CSV file, it works fine as in the DESCRIPTION column has correct encoding.
Any idea what is going on and how I can fix this? Obviously, I don't want a manual step of exporting to CSV and re-reading the results back into an R data frame.
The Teradata SQL Driver for R (teradatasql package) only supports the UTF8 session character set, and does not support using the ASCII session character set with a client-side character set for encoding and decoding.
If you have stored non-LATIN characters in a CHARACTER SET LATIN column in the database, and are using a client-side character set to encode and decode those characters for the "good" case, that will not work with the teradatasql package.
On the other hand, if you used the UTF8 or UTF16 session character set to store Unicode characters into a CHARACTER SET UNICODE column in the database, then you will be able to retrieve those characters successfully using the teradatasql package.
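If the data cannot be re-stored in a proper CHARACTER SET UNICODE column, a client-side repair is sometimes possible: when a LATIN column actually holds UTF-8 bytes, the value arrives double-encoded, and one layer can be undone in R. This is a sketch under that assumption only, not a general fix:
# Hypothetical repair for double-encoded text: map each character back
# to the Latin-1 byte it came from, then declare those bytes UTF-8.
# Only valid if the LATIN column really holds UTF-8 byte sequences.
fix_double_encoding <- function(x) {
  repaired <- iconv(x, from = "UTF-8", to = "latin1")
  Encoding(repaired) <- "UTF-8"
  ifelse(is.na(repaired), x, repaired)  # fall back if conversion fails
}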

How to know whether the initial data type is UTF-8 or UTF-16 in SQLite?

The sqlite3_column_type function can tell me whether the initial data type of a result column is text or not, but not whether that text is UTF-8 or UTF-16. Is there a way to know that?
Thanks
If you have a brand-new empty database, before any tables are created, you can set the internal encoding used for Unicode text with the encoding pragma, and later use the same pragma to see which encoding is in use (it defaults to UTF-8).
When storing or retrieving TEXT values, sqlite will automatically convert if needed between UTF-8 and UTF-16, so it doesn't matter too much which one is being used internally unless you're trying to get every last tiny bit of performance out of it.
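For example, in a fresh database:
PRAGMA encoding = 'UTF-16le';  -- only honored before any tables exist
CREATE TABLE t(s TEXT);
PRAGMA encoding;               -- reports the encoding actually in use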
The documentation you linked says explicitly:
const unsigned char *sqlite3_column_text(sqlite3_stmt*, int iCol);
const void *sqlite3_column_text16(sqlite3_stmt*, int iCol);
sqlite3_column_text → UTF-8 TEXT result
sqlite3_column_text16 → UTF-16 TEXT result
These routines return information about a single column of the current
result row of a query. In every case the first argument is a pointer
to the prepared statement that is being evaluated (the sqlite3_stmt*
that was returned from sqlite3_prepare_v2() or one of its variants)
and the second argument is the index of the column for which
information should be returned. The leftmost column of the result set
has the index 0. The number of columns in the result can be determined
using sqlite3_column_count().
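In other words, the stored encoding is not exposed at the API level; the accessor you call decides what you receive. A minimal sketch (the t/s table and column names are placeholders, error handling trimmed):
#include <stdio.h>
#include <sqlite3.h>

/* Whichever accessor is called determines the encoding returned;
   SQLite converts internally regardless of the stored form. */
void dump_text(sqlite3 *db) {
    sqlite3_stmt *stmt;
    sqlite3_prepare_v2(db, "SELECT s FROM t", -1, &stmt, NULL);
    while (sqlite3_step(stmt) == SQLITE_ROW) {
        const unsigned char *u8 = sqlite3_column_text(stmt, 0);   /* UTF-8 */
        /* const void *u16 = sqlite3_column_text16(stmt, 0); */   /* UTF-16 */
        printf("%s\n", u8);
    }
    sqlite3_finalize(stmt);
}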

SQLite: insert binary data from command line

I have this SQLite table:
create table mytable (
aid INTEGER NOT NULL PRIMARY KEY,
bid INTEGER NOT NULL,
image BLOB
);
And I want to insert a binary file into the image field in this table. Is it possible to do it from the sqlite3 command line interface? If so, how? I'm using Ubuntu.
Thank you!
The sqlite3 command line interface adds the following two “application-defined” functions:
readfile
which typically is used as: INSERT INTO table(blob) VALUES (readfile('myimage.jpg'))
writefile
which writes a file with the contents of a database blob and returns the count of bytes written.
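For example, with the question's table (file names are placeholders):
INSERT INTO mytable(aid, bid, image) VALUES (1, 1, readfile('myimage.jpg'));
SELECT writefile('copy.jpg', image) FROM mytable WHERE aid = 1;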
You may use a syntax like:
echo "insert into mytable values(1,1, \"`cat image`\")" | sqlite3 yourDb
I'm not sure about the quoting around the blob's value. Note the backticks around the cat command: the shell runs cat and substitutes its output before echo executes.
[EDIT]
Blobs are written as hex literals with an X prefix (e.g. X'89504E47'). You can use the hexdump Unix command to produce the hex string, but it would be better to write a command-line tool that reads the image and does the insert.
More details on this post : http://comments.gmane.org/gmane.comp.db.sqlite.general/64149
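A sketch of that approach with standard tools (xxd -p emits a plain hex dump; file and database names are placeholders):
HEX=$(xxd -p image.gif | tr -d '\n')
echo "INSERT INTO mytable VALUES (1, 1, X'$HEX');" | sqlite3 yourDb
For large files the command line gets unwieldy, which is one more reason to prefer the readfile() route above.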

How do I find the length (size) of a binary blob?

I have an SQLite table that contains a BLOB I need to do a size/length check on. How do I do that?
According to the documentation, length(blob) only works on text and stops counting after the first NULL. My tests confirmed this. I'm using SQLite 3.4.2.
I haven't had this problem, but you could try length(hex(blob))/2
Update (Aug-2012):
For SQLite 3.7.6 (released April 12, 2011) and later, length(blob_column) works as expected with both text and binary data.
For me, length(blob) works just fine; it gives the same results as the other approaches.
As an additional answer: a common problem is that SQLite effectively ignores the declared column type, so if you store a string in a blob column, that value is a string for that row. Since length works differently on strings, it will then only return the number of characters before the first 0 octet. It's easy to store strings in blob columns by accident, because you normally have to cast explicitly to insert a blob:
insert into table values ('xxxx');               -- string insert
insert into table values (cast('xxxx' as blob)); -- blob insert
To get the correct length for values stored as strings, you can cast the length argument to blob:
select length(string-value-from-blob-column);    -- treats blob column as string
select length(cast(blob-column as blob));        -- correctly returns blob length
The reason why length(hex(blob-column))/2 works is that hex doesn't stop at internal 0 octets, and the generated hex string doesn't contain 0 octets anymore, so length returns the correct (full) length.
Example of a select query that does this, getting the length of the blob in column myblob, in table mytable, in row 3:
select length(myblob) from mytable where rowid=3;
The LENGTH() function in SQLite 3.7.13 on Debian 7 does not work on blobs, but LENGTH(HEX())/2 works fine:
# sqlite --version
3.7.13 2012-06-11 02:05:22 f5b5a13f7394dc143aa136f1d4faba6839eaa6dc
# sqlite xxx.db "SELECT docid, LENGTH(doccontent), LENGTH(HEX(doccontent))/2 AS b FROM cr_doc LIMIT 10;"
1|6|77824
2|5|176251
3|5|176251
4|6|39936
5|6|43520
6|494|101447
7|6|41472
8|6|61440
9|6|41984
10|6|41472
