Does SQLite checksum its data? - sqlite

Hard drive bit rot does happen. I'm using SQLite for a project with fairly critical data. Obviously, I'll be taking regular backups of the database, but does SQLite checksum its data?
I've read about PRAGMA integrity_check, but I can't tell whether it checks the integrity of the actual data. The page "How To Corrupt An SQLite Database File" doesn't mention bit rot on a hard drive, which is why I'm asking.
Also, the database I am dealing with will be an indexable append-only log. One option would be for me to rotate the database regularly and create an MD5 sum of each rotated file. But maybe that's too much work...
Any input appreciated.

From reading the integrity_check documentation, I would say it would not be guaranteed to detect corruption that only affects user data (due to undetected bit errors on media).
Since your data is an append-only log, you've got it pretty easy. One way would be to write a text file log on a separate hard drive that contains hashes (MD5 or whatever) of every row of your data. Then you can use that hash log to verify the contents of the real database. Obviously backups will be an integral part of your plan.
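The hash-log idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `log` table with columns `id`, `ts` and `message`, the function names, and the choice of SHA-256 instead of MD5 are all hypothetical and not part of the original answer.

```python
import hashlib
import sqlite3


def row_digest(row):
    """Return a SHA-256 hex digest of one row's values."""
    # repr() of a tuple of int/str values gives a stable text form to hash.
    return hashlib.sha256(repr(tuple(row)).encode("utf-8")).hexdigest()


def append_with_hash(conn, hash_log_path, ts, message):
    """Insert one log row and append its digest to the external hash log."""
    cur = conn.execute(
        "INSERT INTO log (ts, message) VALUES (?, ?)", (ts, message)
    )
    conn.commit()
    rowid = cur.lastrowid
    # The hash log is a plain text file, ideally kept on a separate drive.
    with open(hash_log_path, "a", encoding="utf-8") as f:
        f.write(f"{rowid} {row_digest((rowid, ts, message))}\n")
    return rowid


def verify(conn, hash_log_path):
    """Return ids of rows that are missing or no longer match the hash log."""
    bad = []
    with open(hash_log_path, encoding="utf-8") as f:
        for line in f:
            rowid, expected = line.split()
            row = conn.execute(
                "SELECT id, ts, message FROM log WHERE id = ?", (int(rowid),)
            ).fetchone()
            if row is None or row_digest(row) != expected:
                bad.append(int(rowid))
    return bad
```

Running `verify` periodically reports which rows need to be restored from a backup; it detects corruption but does not repair it.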

Just stumbled upon this; I could be using the fzec Python package to recover broken data. Each row would have multiple "fzec block columns" to recover from corruption. Seems pretty neat.

Related

Why is the SQLite file checksum not the same after reversing edits?

Obviously, editing any column value will change the checksum of the database file.
But saving the original value back does not return the file to its original checksum.
I ran VACUUM before and after, so it isn't due to buffer size.
I don't have any indexes referencing the column, and rows are not added or removed, so the primary key index shouldn't need to change either.
I tried turning off the rollback journal, but that is a separate file, so I'm not surprised it had no effect.
I'm not aware of an internal log or modification dates that would explain why the same content does not produce the same file bytes.
I'm looking for insight into what is happening inside the file to explain this, and whether there is a way to make it behave (I don't see a relevant PRAGMA).
Granted, https://sqlite.org/dbhash.html exists to work around this problem, but I don't see any of the listed conditions being triggered, and "... and so forth" is a pretty vague cause.
Database files contain (the equivalent of) a timestamp of the last modification so that other processes can detect that the data has changed.
There are many other things that can change in a database file (e.g., the order of pages, the B-tree structure, random data in unused parts) without a difference in the data as seen at the SQL level.
If you want to compare databases at the SQL level, you have to compare a canonical SQL representation of that data, such as the .dump output, or use a specialized tool such as dbhash.
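As a rough illustration of comparing at the SQL level, the sketch below hashes the dump text that Python's `sqlite3` module produces through `Connection.iterdump()`, which is similar to the shell's .dump output. It is not the dbhash tool and will not produce the same hash values as dbhash; the function name `content_hash` is made up for this example.

```python
import hashlib
import sqlite3


def content_hash(conn):
    """Hash the SQL text dump of a database instead of its raw file bytes."""
    h = hashlib.sha256()
    # iterdump() yields the schema and data as SQL statements, one at a time.
    for statement in conn.iterdump():
        h.update(statement.encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()
```

Under this approach, changing a value and then saving the original value back should give the same hash again, because page layout, change counters and unused space never appear in the dump text.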

Can you get a log file of 'reads' on specific RECID(Tablename) in Progress-4GL/Openedge at RunTime without access to Source Code?

I want to know which tables are being read by a query.
for each Customer where CustomerID = 12345.
Eventually this customer will be found in the example above, but Progress must 'read' many tables before getting to customer 12345.
How do I know exactly which tables are read (by CustomerID) prior to getting to customer 12345?
*NOTE: I do not have access to modify the code being run for this selection. Ideally I would run a separate set of code that is executed at the same time as the customer query above to track the reads.
EDIT: More clearly - Can you track reads from a given program (.p) OR ProcessID and output either a RECID or the PrimaryKey to a file?
I understand the information is being read off the Disk and probably stored in a database buffer. So how would I get at the information in the database buffer?
You seem to be mixing up a few different things.
In a situation like your example where you FIND a specific record in one, and only one table then there is just a single record read. Progress will find that record by first scanning a relevant index. That might be 2 or 3 "logical reads" of the b-tree to get to the proper node. The record block and index blocks may, or may not be read from disk - that depends on what has happened previously.
There are "Virtual System Tables" available that can tell you how many READ operations take place against a particular table or index. But they do not trace the specific ROWID or other identifying data. _TableStat and _IndexStat are aggregates for all users on the system, _UserTableStat and _UserIndexStat are specific to a particular user's activity. You do need to set the -tablerangesize and -indexrangesize parameters adequately to take advantage of these.
If you have enabled the table and index statistics then you can use a tool like ProTop - http://protop.wss.com to get insight into this activity. Or you can write your own code.
OpenEdge Auditing does not track reads. That would be prohibitively expensive.
It's probably not really a good idea but, in theory, you could write FIND triggers for the tables you are interested in. That doesn't require access to the application source but you would need a development license. It will probably kill performance to do this though - so unless this is a non-production test environment that you just want to fiddle with I wouldn't really do that.
You mention wanting to know how you got to that point. That sounds more like you might need to have a "4gl trace". One easy way to get the stack trace of a running process is to execute:
$DLC/bin/proGetStack PID (UNIX)
or
%DLC%\bin\proGetStack PID (Windows)
This command will generate a "protrace.pid" file containing a 4gl stack trace and other interesting information.
There are also more complicated ways to get that info like using PROMON and the "client statement cache" or setting various log entry types at session startup. But proGetStack is pretty convenient and requires no code or scripting changes.
Some great options from Tom above. And all of them may be relevant to you. The option he only skirts around is the logging options. I feel obliged to expand on this because I'm giving a talk on it in a couple of weeks!
Assuming you are running a modern version of Progress, or even 10.2B08, then you have client logging available to you. Start your session with these additional options:
-clientlog "\somefolder\somefile.txt"
-logentrytypes "QryInfo:3"
This will log all the info of all the queries in your session to the file you specified above. If you navigate to the point in the system where you want to analyse your query and empty the logfile and save it, you can then run the offending query and see all the detail you need.
The output tells you all sorts of useful info, including the number of reads on each table, compared with the number returned to the user. You also get the index selected.
Using Tom's advice and/or this will get you what you need.

Knime too slow - performance

I just started to use KNIME. It is supposed to handle a huge amount of data, but it doesn't: it's slow and often unresponsive. I'll need to handle more data than I'm using now. What am I doing wrong?
I set in my configuration file "knime.ini":
-XX:MaxPermSize=1024m
-Xmx2048m
I also read data from a database node (millions of rows) but I can't limit it by SQL (I don't really mind, I need this data).
SELECT * FROM foo LIMIT 1000
error:
WARN Database Reader com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'LIMIT 0' at line 1
I had the same issue and was able to solve it really simply. KNIME has a KNIME.ini file, which holds the parameters KNIME uses when it starts.
The real issue is that the JDBC driver is set to a fetch size of 10. By default, when Oracle JDBC runs a query, it retrieves the result set 10 rows at a time from the database cursor. This is the default Oracle row fetch size, so whenever you read from a database you will spend a long time waiting to retrieve all the rows.
The fix is simple: go to the folder where KNIME is installed, look for the file KNIME.ini, open it and add the following lines at the bottom. They override the default JDBC fetch size, and you will then get the data in literally seconds.
-Dknime.database.fetchsize=50000
-Dknime.url.timeout=9000
Hope this helps.
See http://tech.knime.org/forum/knime-users/knime-performance-reading-from-a-database for the rest of this discussion and other solutions.
I'm not sure if your question is about the performance problem or the SQL problem.
For the former, I had the same issue and only found a solution when I started searching for Eclipse performance fixes rather than KNIME performance fixes. It's true that increasing the Java heap size is a good thing to do, but my performance problem (and perhaps yours) was caused by something bad going on in the saved workspace metadata. Solution: Delete contents of the knime/workspace/.metadata directory.
As for the latter, not sure why you're getting that error; maybe try adding a semicolon at the end of the SQL statement.

Is there a timeout on SQLite transactions?

Are there any limitations on SQLite transactions? For example, can inserting a large amount of data in one transaction cause a problem?
No, you can make a transaction as big as you like (as long as you have disk space) and as long-running as you like (as long as nobody else wants to access the database).
First, I would like to point out that SQLite is not a full-fledged database. It is there to replace file.open(); think of it as writing structured data to a local file.
It is not advisable to load (migrate) large amounts of data in big transactions. If you want transactions at all, use smaller batches. A transaction can lock the database and make other queries block.
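If blocking other readers and writers is the concern, one way to split a large load into smaller transactions is sketched below. The `items` table, its `value` column and the batch size are assumptions made for this illustration; as the first answer notes, a single large transaction also works when nothing else needs the database.

```python
import sqlite3
from itertools import islice


def insert_in_batches(conn, rows, batch_size=10_000):
    """Insert rows (1-tuples) in several smaller transactions.

    Returns the number of rows inserted.
    """
    it = iter(rows)
    total = 0
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        # "with conn" commits on success and rolls back if an exception occurs,
        # so each batch is its own transaction.
        with conn:
            conn.executemany("INSERT INTO items (value) VALUES (?)", batch)
        total += len(batch)
    return total
```

The trade-off is that a failure partway through leaves the earlier batches committed, so the load is no longer all-or-nothing.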

Updating a local sqlite db that is used for local metadata & caching from a service?

I've searched through the site and haven't found a question/answer that quite answer my question, the closest one I found was: Syncing objects between two disparate systems best approach.
Anyway, to begin: because there are no RSS feeds available, I'm screen scraping a webpage. The scraper fetches the page, goes through it to scrape out all of the information I'm interested in, and dumps that information into a SQLite database so that I can query it at my leisure without repeatedly fetching from the website.
However I'm also storing various metadata on the data itself that is stored in the sqlite db, such as: have I looked at the data, is the data new/old, bookmark to a chunk of data (Think of it as a collection of unrelated data, and the bookmark is just a pointer to where I am in processing/reading of the said data).
So right now my current problem is trying to figure out how to update the local sqlite database with new data and/or changed data from the website in a manner that is effective and straightforward.
Here's my current idea:
Download the page itself
Create a temporary table for the parsed data to go into
Do a comparison between the official and the temporary table and copy updates and/or new information to the official table
This process seems kind of complicated because I would have to figure out how to determine whether the data in the temporary table is new, updated, or unchanged. So I am wondering if there is a better approach, or if anyone has any suggestions on how to architect/structure such a system?
Edit 1:
I'm not sure where to put the additional information, in a comment or as an edit, so I'm going to add it here.
This expands a bit on the bookmarking metadata. Basically, the data source can create new data or additions to the current data, so one reason I was thinking of the temporary table idea was so that I would be able to determine whether a data source that has been "bookmarked" has any new data or not.
Is it really important to determine whether the data in the temporary table is new, updated or unchanged? Do you really need to keep a history of the changes?
NO: don't use the temporary table; just mark your old records as old (with a timestamp), don't do updates, and just insert your new data.
YES: your idea seems correct to me, but it all depends on how much data you need to process each time; I don't think it is feasible with a large amount of data.
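For the YES case, the temporary-table comparison might look something like the sketch below. The schema is hypothetical: an `items` table with a unique `key`, a non-NULL `content` column and a `seen` metadata flag, plus a `staging` temporary table. Everything here is an assumption for illustration, not the asker's actual design.

```python
import sqlite3


def sync(conn, scraped):
    """Merge freshly scraped (key, content) pairs into the items table.

    Returns (new_count, changed_count). Unchanged rows keep their metadata.
    """
    with conn:  # one transaction for the whole merge
        conn.execute(
            "CREATE TEMP TABLE IF NOT EXISTS staging "
            "(key TEXT PRIMARY KEY, content TEXT)"
        )
        conn.execute("DELETE FROM staging")
        conn.executemany(
            "INSERT INTO staging (key, content) VALUES (?, ?)", scraped
        )
        # New: keys present in staging but not yet in items.
        new_count = conn.execute(
            "INSERT INTO items (key, content, seen) "
            "SELECT s.key, s.content, 0 FROM staging s "
            "WHERE NOT EXISTS (SELECT 1 FROM items i WHERE i.key = s.key)"
        ).rowcount
        # Changed: same key, different content; reset the 'seen' flag.
        changed_count = conn.execute(
            "UPDATE items SET "
            "content = (SELECT s.content FROM staging s WHERE s.key = items.key), "
            "seen = 0 "
            "WHERE EXISTS (SELECT 1 FROM staging s "
            "WHERE s.key = items.key AND s.content <> items.content)"
        ).rowcount
    return new_count, changed_count
```

Rows that are identical in both tables are never touched, so flags such as `seen` or a bookmark survive a re-scrape; only new or changed rows are marked as unseen.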
