I have an SQLite database in which I altered a table to add a column that will contain a kind of permanently unique ID for each row (in addition to the existing INTEGER PRIMARY KEY, which might be reassigned and is thus not permanent). I also want to avoid accidentally mixing up the normal IDs and the new "permanent IDs", so I decided to use a TEXT column and give each value a prefix, for example pid-.
So I simply added a column named perma_id with the type TEXT and ran UPDATE mytable SET perma_id = 'pid-' || _rowid_ to assign values for the existing rows. I then saved and compacted/vacuumed the database and compressed it into a zip-file because I will include it in an Android APK.
I noticed that the filesize had gone up from 379kB to 417kB after adding the new column. This is of course expected. But as an experiment, I thought maybe I could reduce the filesize by just using p... instead of pid-... for the perma_id column values, so I reassigned all the values. But to my surprise, the filesize had instead increased to 420kB! I experimented a bit further, and I can consistently get the (compressed) filesize to become 417kB with pid-... and 420kB with p.... As expected, using an INTEGER column reduces the filesize further, but only to 414kB.
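For completeness, this is roughly the measurement loop I run for each prefix (a Python sketch; mydb.sqlite and the zip name are placeholders, and the APK's zip settings may of course compress slightly differently):

import os
import sqlite3
import zipfile

DB_PATH = "mydb.sqlite"   # placeholder -- the actual database file
PREFIX = "pid-"           # the prefix being tested

# Reassign the permanent IDs with the chosen prefix and reclaim free pages.
con = sqlite3.connect(DB_PATH)
con.execute("UPDATE mytable SET perma_id = ? || _rowid_", (PREFIX,))
con.commit()
con.execute("VACUUM")
con.close()

# Zip the database the same way it gets packaged and report the compressed size.
with zipfile.ZipFile("mydb.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write(DB_PATH)
print(PREFIX, os.path.getsize("mydb.zip"), "bytes compressed")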
This makes me wonder - what is the black magic behind the smaller file size when using a longer string as a prefix in the perma_id column? And is there a way to determine which string would produce the smallest filesize?
Edit
Just tried using the prefix perma-id-..., which results in a compressed file size of 414kB - i.e. same as using an INTEGER column with just the number after the prefix. So I tried very-long-permanent-id-with-the-value-... as prefix - 413kB. Mind = blown.
Did you try running the VACUUM command on the database before zipping each time?
When you shortened the perma_id values, it may have reduced the size of the data but kept the .db file the same size, as SQLite doesn't automatically shrink the file; it just marks chunks of the file as 'overwriteable'. Until, that is, you run VACUUM to throw away all this spare space.
I'm guessing the 'overwriteable' proportion of your file was hard to zip. Then when it got filled up with lots of repeating text saying "permanent-id-with-the-value-", it got easier to zip!
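A quick way to see how much difference the content of those pages makes to the compressor (just an illustration, not your real pages):

import os
import zlib

PAGE = 4096  # a typical SQLite page size

# Rough illustration only: random bytes stand in for whatever was left in the
# free pages, versus a page overwritten with the long repetitive prefix text.
leftover = os.urandom(PAGE)
repetitive = (b"very-long-permanent-id-with-the-value-1234" * 100)[:PAGE]

print("leftover-style page  :", len(zlib.compress(leftover, 9)), "bytes compressed")
print("repetitive text page :", len(zlib.compress(repetitive, 9)), "bytes compressed")

The random page barely shrinks at all, while the repetitive page collapses to a few dozen bytes, which would explain the counter-intuitive zip sizes.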
Related
I have some very large CSV files (~183 million rows by 8 columns) that I want to load into a database using R. I use duckdb for this and its built-in function duckdb_read_csv, which is supposed to auto-detect data types for each column. If I enter the following code:
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb(), dbdir = "testdata.duckdb", read_only = FALSE)
duckdb_read_csv(con, "d15072021", "mydata.csv",
                header = TRUE)
It produces this error:
Error: rapi_execute: Failed to run query
Error: Invalid Input Error: Could not convert string '2' to BOOL between line 12492801 and 12493825 in column 9. Parser options: DELIMITER=',', QUOTE='"', ESCAPE='"' (default), HEADER=1, SAMPLE_SIZE=10240, IGNORE_ERRORS=0, ALL_VARCHAR=0
I've looked at the rows in question and I can't find any irregularities in column 9. Unfortunately, I cannot post the dataset because it's confidential. But the entire column is filled with either FALSE or TRUE.
If I set the parameter nrow.check to something larger than 12493825 it doesn't produce the same error but takes very long and simply converts the column to VARCHAR instead of a logical. Setting nrow.check to -1 (meaning it checks every row for a pattern) crashes R and my PC completely.
The weird thing: This isn't consistent. Earlier I imported the dataset whilst keeping the default value for nrow.check at 500 and it read the file with no issue (though still converting column 9 to VARCHAR). I have to read a lot of files that are the same pattern so I need to have a reliable way of reading them. Anyone know how duckdb_read_csv actually works and why I might get this error?
Note that reading the files into memory and then into a database isn't an option because I run out of memory instantly.
The way the sniffer works is by sampling nrow.check rows to figure out each column's data type, so the result can differ between runs if you get unlucky. Increasing it will reduce the chances of failure, mainly because the sniffer looks at more rows.
If increasing the number of rows is not possible due to performance issues, you can of course first define the schema of the CSV file. But then you must know the schema beforehand.
As an example of how you can define the schema and turn off the sniffer:
SELECT * FROM read_csv('test.csv', COLUMNS=STRUCT_PACK(a := 'INTEGER', b := 'INTEGER'), auto_detect='false')
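If driving it from Python (the duckdb Python package) is an option, the same idea, spelling out the column types yourself and turning the sniffer off, looks roughly like this; the table name, file name and column list are placeholders you'd replace with the real 8-column schema:

import duckdb

con = duckdb.connect("testdata.duckdb")
# Placeholder schema: list every column of the real file and declare the
# problematic one explicitly as BOOLEAN (or VARCHAR if the values are messy).
con.execute("""
    CREATE TABLE d15072021 AS
    SELECT * FROM read_csv('mydata.csv',
        COLUMNS = STRUCT_PACK(id := 'BIGINT', flag := 'BOOLEAN'),
        HEADER = TRUE,
        auto_detect = 'false')
""")
con.close()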
I'm importing an .xls file using the following connection string:
If _
SetDBConnect( _
"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" & filepath & _
";Extended Properties=""Excel 8.0;HDR=Yes;IMEX=1""", True) Then
This has been working well for parsing through several Excel files that I've come across. However, with this particular file, when I SELECT * into a DataTable, there is a whole column of data, Item Description, missing from the DataTable. Why?
Here are some things that may set this particular workbook apart from the others that I've been working with:
The workbook has a freeze pane consisting of the first 24 rows (however, all of these rows appear in the DataTable)
There is some weird cell highlighting going on throughout the workbook
That's pretty much it. I can't see anything that would make the Item Description column not import correctly. Its data consists entirely of strings with no special characters apart from &. Additionally, each entry in this column is at most 20 characters. What is happening? Is there any other way I can get all of the data? Keep in mind I have to use the original file and I cannot alter it, as I want this to ultimately be an automated process.
Thanks!
Some initial thoughts/questions: Is the missing column the very first column? What happens if you remove the space within "Item Description"? Stupid question, but does that column have a column header?
-- EDIT 1 --
If you delete that column, does the problem move to another column (the new index 4), or is the import complete? My reason for asking: is the problem specific to the data in that column/header, or is the problem more general, tied to index 4?
-- EDIT 2 --
Ok, so since we know it's that column, we know it's either the header or the rows. Let's concentrate on rows for now. Start with that ampersand; dump it, and see what happens. Next, work with the first 50% of rows. Does deleting that subset affect anything? What about the latter 50% of rows? If one of those subsets changes the result, you ought to be able to narrow it down to an individual row (hopefully not plural) by halving your selection each time.
My guess is that you're going to find a unicode character or something else funky in one of the cells. Maybe there's a formula or, as you mentioned, some of that "weird cell highlighting."
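If you want to test that guess without eyeballing thousands of rows, a throwaway scan of the column for non-ASCII characters can help. This sketch uses the Python xlrd package instead of OLEDB; the file path and column index are assumptions to adjust:

import xlrd  # handles the legacy .xls format

book = xlrd.open_workbook("workbook.xls")  # placeholder path
sheet = book.sheet_by_index(0)
COL = 4  # zero-based index of the Item Description column -- adjust as needed

for row in range(1, sheet.nrows):  # skip the header row
    value = sheet.cell_value(row, COL)
    if isinstance(value, str) and any(ord(ch) > 127 for ch in value):
        print("row", row + 1, repr(value))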
It's been years since I worked with Excel data access, but I recall some problems with Excel grouping content into areas that act as tables inside each sheet. Try copying/pasting the content from the problematic sheet into a new workbook and connect to that workbook. If this works, you may be able to investigate those areas a bit further.
When I assign a value to cell IW4 using the PHPExcel library, the value does not get generated there.
Steps:
We are using the following code to set a cell value in PHPExcel:
$objPHPExcel->getActiveSheet()->setCellValue('A1', 'cell value here');
When I use it to set a value in cell IW4, the value does not get generated:
$objPHPExcel->getActiveSheet()->setCellValue('IW4', 'cell value here');
Please help me find a solution.
BIFF-format Excel files only allow 256 columns (up to IV); OfficeOpenXML allows more.
If you set a value in a column beyond the limit, PHPExcel only knows it's invalid at the point where you save (when it knows whether you're saving as an Excel5 or Excel2007 file). Rather than trigger an exception at that point (which would be much more frustrating in a long-running script), it silently discards the invalid columns or rows.
This is similar behaviour to Excel itself, if you open an xlsx file in an earlier version of Excel that doesn't support as many rows and columns.
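To see where IW falls relative to that limit, you can convert the column letters to a 1-based index; a quick sketch:

def col_index(letters):
    """Convert an Excel column name ('A', 'IV', 'IW', ...) to a 1-based index."""
    n = 0
    for ch in letters.upper():
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n

print(col_index("IV"))  # 256 -- the last column a BIFF (Excel5 / .xls) file can hold
print(col_index("IW"))  # 257 -- one past the limit, so the value is silently dropped on save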
I am trying to convert data from Act 2000 to a MySQL database. I have successfully imported the DBF files into individual MySQL tables. However I am having issues with the *.BLB file, which seems to be a non-standard memo file.
The DBF files identify themselves as dBase III Plus, no-memo format. There is a single *.BLB file, which is a memo file shared by multiple DBFs for BLOB data.
If you read this document: http://cicorp.com/act/sdk/ACT6-SDK-ChapterA.htm#_Toc483994053
you can see that the REGARDING column is a 6-character one. Its description is: "This 6-byte field is supplied by the system and contains a reference to a field in the Binary Large Object (BLOB) Database."
Now upon opening the *.BLB I can see that the block size is 64 bytes. All the blocks of text are NULL padded out to that size.
Where I am stumbling is converting the values stored in the REGARDING column to block locations in the BLB file. My assumption is that the 6-character field is an offset.
For example, one value for REGARDING is, (ignoring the square brackets): [ ",J$]
In my Googling, I found this: http://ulisse.elettra.trieste.it/services/doc/dbase/DBFstruct.htm#C1.5
It explains that in memo fields (in normal DBF files at least) the space character is ignored (i.e. it's padding out the column).
Therefore if I'm correct (again, square brackets) [",J$] should be the offset in my BLB file. Luckily I've still got access to the original ACT2000 software, so I can compare the full text in the program / MySQL and BLB file.
Using my example value, I know that the DB row with REGARDING value of [ ",J$] corresponds to a 1024 byte offset (or 16 blocks, assuming my guess of a 64 byte sized block).
I've tried reading some Python code for open source projects that read DBF files - but I'm in over my head.
I think what I need to do is unpack the characters to binary, but am not sure.
How can I find the 64-block based spot to read from based on what's found in the DBF files?
EDIT by Jerry Dodge
I've attempted to reverse-engineer the strings in this field to hexadecimal values, and then to an integer value using StrToInt64, but the result still does not match up with the blob file. I've also tried multiplying this integer value by 64 and not multiplying, but the result keeps winding up outside of the size of the blob file, not actually finding any data.
For example, a value of ___/BD (_ = space) translates to $2f4244 hexadecimal, which in turn translates to the integer value 3097156, but does not correspond with any relevant portion of data in the blob file, even when multiplied or divided by 64.
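For what it's worth, this is the kind of brute-force check I've been running: take a REGARDING value whose text I can see in ACT, try a few byte interpretations, and dump whatever sits at each candidate offset in the .BLB. The file name and the list of interpretations below are guesses, not a known decoding:

BLOCK = 64              # block size observed in the .BLB
regarding = ' ",J$'     # example value from the DBF, square brackets removed

with open("ACT.BLB", "rb") as f:  # placeholder file name
    blb = f.read()

def dump(label, offset):
    """Print the block at a candidate offset, or note that it falls outside the file."""
    if 0 <= offset < len(blb):
        print(label, offset, blb[offset:offset + BLOCK])
    else:
        print(label, offset, "outside the file")

raw = regarding.strip().encode("ascii")   # guess: surrounding spaces are padding
be = int.from_bytes(raw, "big")           # interpretation 1: big-endian integer
le = int.from_bytes(raw, "little")        # interpretation 2: little-endian integer

for label, n in (("big-endian", be), ("little-endian", le)):
    dump(label + " as byte offset", n)
    dump(label + " as 64-byte block number", n * BLOCK)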
According to the SDK you linked, the following happens, as I understand it:
There is a TYPE field (right behind REGARDING) that encodes what REGARDING is used for (see the second table of the linked chapter). So I'd assume that if TYPE=6 (meeting not held), REGARDING is either irrelevant or only contains a meeting ID reference to some other table. On that line of thought, I would only expect REGARDING to be a BLB offset if TYPE=101 (or possibly 100). I'd also not abandon the thought that in those relevant cases the 6-byte value might be a concatenation of BLB file index and offset (because there is a mention that each file must not be longer than 30K chars, and I really expect to be able to store much more data even in one table).
I have a page that allows users to upload an Excel file and insert its data into SQL Server. Now I have a small issue: there is a column in the Excel file with values such as "001", "029", "236". When it's inserted into SQL Server, the leading zero is ignored, so the data becomes "1", "29", "236". The data type for the column in SQL is varchar(10). How do I solve this?
Excel seems to automatically convert cell values to numbers. Try prefixing the cell contents with a single quote in the Excel sheet prior to processing. Eg '001. If you can't trust the users to do that, use a string formatting routine to left pad the numbers with zeroes.
Something must be converting the data in the excel cell from a string to an integer. How are you performing the insert?
If a user enters 001 into Excel, it will be converted to the number 1.
If the user enters '001 into Excel, it will be saved in the cell as text.
If the cell is pre-formatted with the number format "#", then when the user enters 001 into the cell it will be entered as the text "001". The "#" number format tells Excel that the cell is a text cell and any entry (whether it looks like a number, date, time, fraction, etc...) should simply be placed in the cell as is - as a text cell.
Can you tell your users to pre-format this column with "#"? This is generally the most reliable way to handle this since the user does not have to remember to enter '001.
Maybe setting up the datatype "Text" for an Excel cell will help.
Excel is probably the culprit here. Try converting your file to CSV and see how it comes out. If the leading zeros are gone in the new CSV file, Excel is the problem.
Excel always does this, and it's a nuisance. There are three workarounds I know of:
BEFORE entering the data in any cell, format the cell as text in Excel (you can do a whole column if needed). This only works if you control the spreadsheets and users, which is basically never :-).
Assume you'll get a mix of numbers and/or text in the Excel data, and fix it in Excel before import: add a column to the spreadsheet and use the TEXT() function to convert the number to text, as in =TEXT(A2, "000"); fill down. Also assumes you can edit the worksheet.
Assume you have to fix the numbers upon insert in your code. Depending on how you are loading the data, that could happen in T-SQL or in your other code. In T-SQL this expression pads with zeros to a width of 3 characters: RIGHT('000' + CAST(2 AS varchar(3)), 3)
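If the fix ends up living in the import code instead of T-SQL, the padding is equally short there; a sketch assuming the cell value arrives as a number or numeric string and a fixed width of 3:

def pad_id(value, width=3):
    """Re-pad an identifier that Excel turned into a number, e.g. 1 -> '001'."""
    return str(int(value)).zfill(width)

print(pad_id(1))      # '001'
print(pad_id("29"))   # '029'
print(pad_id(236.0))  # '236'  (Excel often hands numeric cells over as floats)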