I'm going to run PRAGMA quick_check on a very large SQLite database and would like to estimate the time it will take to complete. Is there a (ballpark) way to do that, assuming a reasonably fast HDD or SSD? Is it O(n) or worse?
I'm obviously not looking for an accurate prediction, just something like "1 to 5 hours per 10 GB".
quick_check checks for out-of-order records, missing pages, malformed records, and CHECK and NOT NULL constraint violations.
It can be very slow.
This is not intended as an answer, but as a reference point, until someone more knowledgeable than I am can help out with a more general answer.
A 90 GB sqlite3 database (1 table, 1 index, 20M rows) took 13 hours on my mid-grade SSD with 16 GB RAM, running Windows 7/NTFS. The process was clearly disk-bound.
Assuming a linear dependency, that works out to roughly 9 minutes per gigabyte (13 h / 90 GB ≈ 8.7 min/GB).
According to a few pages I found online, a full PRAGMA integrity_check takes roughly 8 times longer (about 1 h/GB).
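As a rough way to collect your own data point, you can time quick_check directly. A minimal sketch using Python's sqlite3 module (the example.db path is a placeholder):

```python
import sqlite3
import time

def timed_quick_check(db_path):
    """Run PRAGMA quick_check and report elapsed wall-clock time."""
    conn = sqlite3.connect(db_path)
    try:
        start = time.monotonic()
        # quick_check returns a single row reading 'ok' on success,
        # or rows describing the problems found otherwise.
        rows = conn.execute("PRAGMA quick_check").fetchall()
        elapsed = time.monotonic() - start
        return rows, elapsed
    finally:
        conn.close()

rows, elapsed = timed_quick_check("example.db")
print(f"quick_check finished in {elapsed:.1f} s: {rows}")
```

Running it on a small sample of your data first gives you a per-gigabyte rate for your particular disk, which should extrapolate roughly linearly.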
I have the following situation:
library(RODBC)
channel <- odbcConnect("ABC", uid="DEF", pwd="GHI")
query <- "SELECT ..."  # some SELECT statement
df <- sqlQuery(channel, query)
The number of rows is 10M+. Is there any faster way to read the data?
The data is in an Oracle database.
This should really be a comment, but it is too long for one.
When executing SQL there are a few likely bottlenecks:
1. Executing the query itself
2. Downloading the data from the database
3. Converting the data to align with language-specific types (e.g. R integers rather than BIGINT, etc.)
If your query runs fast when executed directly in the database UI, the bottleneck is unlikely to be the query execution itself. That is also immediately clear if your query contains only simple [RIGHT/LEFT/INNER/OUTER] JOINs, as these are not "complex" query operators as such. Slowness in step 1 is more often caused by complex nested queries using WITH clauses or window functions. The solution there would be to create a VIEW so that the query is pre-optimized.
Now what is more likely to be the problem is 2. and 3. You state that your table has 10M+ rows. Let's assume your table is financial and has only 5 columns, all 8-byte floats (FLOAT(8)) or 8-byte integers. The amount of data to be downloaded is then 8 * 5 * 10M bytes = 400 MB ≈ 3.2 Gbit, which itself takes time to download depending on your connection. On a 10 Mbit/s connection, the download under optimal conditions would take at least 320 seconds, over five minutes. And that is the case where your data has only 5 columns! It is not unlikely that you have many more.
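The back-of-the-envelope transfer estimate can be sketched as follows. Note that column values are measured in bytes, so they are multiplied by 8 to compare against a bits-per-second link speed; protocol overhead and compression are ignored:

```python
def transfer_seconds(rows, cols, bytes_per_value, mbit_per_s):
    """Lower-bound wire-transfer time for a result set,
    ignoring protocol overhead and driver conversion cost."""
    total_bits = rows * cols * bytes_per_value * 8
    return total_bits / (mbit_per_s * 1_000_000)

# 10M rows x 5 columns of 8-byte values over a 10 Mbit/s link:
t = transfer_seconds(10_000_000, 5, 8, 10)
print(f"~{t:.0f} s ({t/60:.1f} min)")  # → ~320 s (5.3 min)
```

Plugging in your real column count and link speed gives a floor on query time that no driver choice can beat.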
Now for 3., it is more difficult to predict the time spent without careful code profiling. This is the step where RODBC, odbc, or RJDBC has to convert the data into types that R understands. I am sorry to say that here it becomes a question of trial and error to figure out which packages work best. However, for Oracle specifically, I would assume DBI + odbc + ROracle (which appears to be developed by Oracle themselves) would be a rather safe bet for a good contender.
Do however keep in mind that the total time spent getting data from any database into R is an aggregate of the measures above. Some databases provide optimized methods for exporting queries/tables as flat files (csv, parquet, etc.), which can in some cases speed things up significantly, but at the cost of having to read from disk. This also tends to be more complex than executing the query directly, so one has to evaluate whether it is worth the trouble, or whether it is better to simply wait for the original query to finish within R.
Does the average data and instruction access time of the CPU depend on the execution time of an instruction?
For example, if the miss ratio is 0.1, 50% of instructions need memory access, the L1 access time is 3 clock cycles, the miss penalty is 20 cycles, and instructions execute in 1 cycle, what is the average memory access time?
I assume you're talking about a CISC architecture where compute instructions can have memory references. If you have a sequence of ADDs that access memory, then memory requests will come more often than in a sequence of the same number of DIVs, because the DIVs take longer. This won't affect the time of the memory access -- only locality of reference will affect the average memory access time.
If you're talking about a RISC arch, then we have separate memory access instructions. If memory instructions have a miss rate of 10%, then the average access latency will be the L1 access time (3 cycles for hit or miss) plus the L1 miss penalty times the miss rate (0.1 * 20), totaling an average access time of 5 cycles.
If half of your instructions are memory instructions, then that would factor into clocks per instruction (CPI), which would depend on miss rate and also dependency stalls. CPI will also be affected by the extent to which memory access time can overlap computation, which would be the case in an out-of-order processor.
I can't answer your question much better because you're not being very specific. To do well in a computer architecture class, you will need to learn how to compute average access times and CPI.
Well, I'll go ahead and answer your question, but then, please read my comments below to put things into a modern perspective:
Time = Cycles * (1/Clock_Speed) [ unit check: seconds = clocks * seconds/clocks ]
So, to get the exact time you'll need to know the clock speed of your machine. For now, my answer will be in terms of cycles.
Avg_mem_access_time_in_cycles = cache_hit_time + miss_rate*miss_penalty
= 3 + 0.1*20
= 5 cycles
Remember, here I'm assuming your miss rate of 0.1 means 10% of cache accesses miss the cache. If you mean 10% of instructions, then you need to halve that (because only 50% of instructions are memory ops).
Now, if you want the average CPI (cycles per instr)
CPI = mem_instr% * Avg_mem_access_time + non_mem_instr% * Avg_instr_execution_time
= 0.5*5 + 0.5*1 = 3 cycles per instruction
Finally, if you want the average instr execution time, you need to multiply 3 by the reciprocal of the frequency (clock speed) of your machine.
Comments:
Comp. arch. classes basically teach you a very simplified model of what the hardware is doing. Current architectures are much, much more complex, and such a model (i.e. the equations above) is very unrealistic. For one thing, access time to the various levels of cache can be variable (depending on where the responding cache physically sits on a multi- or many-core CPU); access time to memory (which typically takes 100s of cycles) is also variable depending on contention for resources (e.g. bandwidth), etc. Finally, in modern CPUs instructions typically execute in parallel (ILP), depending on the width of the processor pipeline. This means that simply adding up instruction execution latencies is basically wrong (unless your processor is a single-issue processor that executes only one instruction at a time and blocks other instructions on miss events such as cache misses and branch mispredicts). However, for educational purposes and for "average" results, the equations are okay.
One more thing, if you have a multi-level cache hierarchy, then the miss_penalty of level 1 cache will be as follows:
L1$ miss penalty = L2_access_time + L2_miss_rate*L2_miss_penalty
If you have an L3 cache, you expand L2_miss_penalty the same way, and so on.
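The equations above can be checked numerically. The single-level numbers are the ones from the question; the two-level figures below are illustrative assumptions, not given in the question:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: hit time plus expected miss cost."""
    return hit_time + miss_rate * miss_penalty

# Single-level example from the question: 3-cycle L1, 10% miss rate,
# 20-cycle miss penalty.
l1_amat = amat(3, 0.1, 20)        # 5 cycles

# CPI when 50% of instructions access memory and the rest take 1 cycle.
cpi = 0.5 * l1_amat + 0.5 * 1     # 3 cycles per instruction

# Two-level hierarchy: the L1 miss penalty is itself an AMAT over L2.
# (L2 numbers here are made up for illustration.)
l2_access, l2_miss_rate, l2_miss_penalty = 10, 0.05, 100
l1_miss_penalty = amat(l2_access, l2_miss_rate, l2_miss_penalty)   # 15 cycles
l1_amat_2level = amat(3, 0.1, l1_miss_penalty)                     # 4.5 cycles

print(l1_amat, cpi, l1_amat_2level)
```

The nesting in the two-level case mirrors the L1$ miss penalty formula: each level's miss penalty expands into the next level's access time plus its expected miss cost.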
There's an SQLite database being used to store static-sized data in a round-robin fashion.
For example, 100 days of data are stored. On day 101, day 1 is deleted and then day 101 is inserted.
The number of rows is the same between days. The individual fields in the rows are all integers (32-bit or smaller) and timestamps.
The database is stored on an SD card with poor I/O speed,
something like a read speed of 30 MB/s.
VACUUM is not allowed because it can introduce a wait of several seconds
and the writers to that database can't be allowed to wait for write access.
So the concern is fragmentation, because I'm inserting and deleting records constantly
without VACUUMing.
But since I'm deleting/inserting the same set of rows each day,
will the data get fragmented?
Is SQLite fitting day 101's data in day 1's freed pages?
And although the set of rows is the same,
the integers may be 1 byte one day and then 4 bytes another.
The database also has several indexes, and I'm unsure where they're stored
and if they interfere with the perfect pattern of freeing pages and then re-using them.
(SQLite is the only technology that can be used. Can't switch to a TSDB/RRDtool, etc.)
SQLite will reuse free pages, so you will get fragmentation (if you delete so much data that entire pages become free).
However, SD cards are likely to have a flash translation layer, which introduces fragmentation whenever you write to some random sector.
Whether the first kind of fragmentation is noticeable depends on the hardware, and on the software's access pattern.
It is not possible to make useful predictions about that; you have to measure it.
In theory, WAL mode is append-only, and thus easier on the flash device.
However, checkpoints would be nearly as bad as VACUUMs.
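To see whether your deletes are actually freeing whole pages for reuse, you can watch the freelist before and after a day's delete. A minimal sketch using Python's sqlite3 module (the table and column names are made up):

```python
import sqlite3

def page_stats(conn):
    """Total pages in the database file and pages currently on the freelist."""
    total = conn.execute("PRAGMA page_count").fetchone()[0]
    free = conn.execute("PRAGMA freelist_count").fetchone()[0]
    return total, free

conn = sqlite3.connect(":memory:")  # use your file path for a real database
conn.execute("CREATE TABLE samples (ts INTEGER, value INTEGER)")
conn.executemany("INSERT INTO samples VALUES (?, ?)",
                 ((i, i % 100) for i in range(50_000)))
conn.commit()
before = page_stats(conn)

# Delete a whole "day" of rows: fully-emptied pages go to the freelist
# and are reused by later inserts instead of growing the file.
conn.execute("DELETE FROM samples WHERE ts < 25000")
conn.commit()
after = page_stats(conn)
print(before, after)
```

If freelist_count rises after a delete and falls back toward zero after the next day's inserts, pages are being reused as hoped; the file size (page_count) should stay flat without any VACUUM.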
I am trying to create a local SPARQL endpoint for Freebase for running some local experiments. While using Virtuoso 7, I regularly see the server getting killed by the OOM killer. I have followed all the required steps as mentioned here. I have also made the required changes to my virtuoso.ini file as mentioned in RDF Performance Tuning.
My system configuration is:
8 CPUs, 2.9 GHz
16 GB RAM
I have enough hard disk too.
Regarding data dumps, I have split the freebase data dump (23GB gzipped, approx 250 GB uncompressed) into 10 smaller gzipped files containing 200,000,000 triples each.
Following are the changes I made to virtuoso.ini
NumberOfBuffers = 1360000
MaxDirtyBuffers = 1000000
MaxCheckpointRemap = 340000 # (1/4th of NumberOfBuffers)
Along with this I have set vm.swappiness = 10 as mentioned in 2.
Am I missing something obvious?
P.S.:
I did try virtuoso-opensource-6.1 too. But it appeared to be too slow.
One interesting observation was that during the bulk loading process, virtuoso-6.1's memory consumption rose very slowly, but that might be because the indexing itself was too slow.
Another observation was that virtuoso-6.1 occupies almost negligible memory at start time (on the order of 500 MB), whereas virtuoso-7 starts at approx. 6500 MB and grows quickly.
Any help in this regard would be highly appreciated.
The number of buffers you are using is a little too high. Do not forget that some memory is also consumed by the OS and other processes.
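For scale: Virtuoso's database page buffers are 8 KB each (per the RDF Performance Tuning guide), so you can estimate how much RAM a given NumberOfBuffers setting pins:

```python
def buffer_memory_gb(num_buffers, buffer_bytes=8192):
    """Approximate RAM pinned by Virtuoso's page buffers,
    assuming the documented 8 KB page buffer size."""
    return num_buffers * buffer_bytes / 1024**3

print(f"{buffer_memory_gb(1_360_000):.1f} GB")  # → 10.4 GB
```

On a 16 GB machine, ~10.4 GB of buffers plus Virtuoso's own working memory leaves little headroom for the OS and other processes, which is consistent with the OOM killer firing during bulk load.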
Which exact version do you use? (development or stable branch?)
Do you use disk striping?
I loaded Freebase into Virtuoso 7 too, but I used smaller files: circa 260 gzipped files, 10 million triples each (circa 100 MB per file). A commit is executed after every file load.
Maybe it would be easier for you to use images with Freebase preloaded into Virtuoso.
I recently implemented an algorithm in Java that used a hash table. I compared it to a few other algorithms with rather large data input sizes such as 100000.
The thing that has struck me is that once my data input size exceeds 10000 the performance of the hash table drops dramatically. To emphasise this drop, what took 4000 ms with input size 1000 suddenly goes up to 172000 ms for input size 5000.
Can anyone please explain to me what the reason for this is? I'd really like to know.
Thanks!
This question is way too ambiguous for anyone to give a definitive answer, but if I had to guess I would say that you are encountering collisions. The stock implementation of Java's HashMap uses linked lists to hold the entries whose keys' hashes collide, which will certainly happen if the hashCode method has been incorrectly defined, perhaps returning a constant value.
Having said that, if you're just measuring elapsed time, that doesn't tell you too much. Perhaps you crossed a threshold that caused a major garbage collection to occur. You should try to measure performance after your JVM and hash table are sufficiently warmed up, and take lots of measurements and consider their average, before coming to any conclusions.
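To illustrate the collision pathology described above (sketched here in Python with made-up key classes; the Java analogue is a hashCode() that returns a constant), inserts into a hash table degrade from expected O(1) to O(n) when every key lands in the same bucket:

```python
import time

class BadKey:
    """Every instance hashes to the same bucket, so each dict insert
    must compare against all previously inserted colliding keys."""
    def __init__(self, v):
        self.v = v
    def __hash__(self):
        return 42  # constant hash: everything collides
    def __eq__(self, other):
        return isinstance(other, BadKey) and self.v == other.v

class GoodKey(BadKey):
    def __hash__(self):
        return hash(self.v)  # distinct hashes: expected O(1) inserts

def insert_n(key_type, n):
    """Insert n distinct keys and return (table size, elapsed seconds)."""
    start = time.monotonic()
    d = {key_type(i): i for i in range(n)}
    return len(d), time.monotonic() - start

n = 2000
good_len, good_t = insert_n(GoodKey, n)
bad_len, bad_t = insert_n(BadKey, n)
print(f"good: {good_t:.4f}s  bad: {bad_t:.4f}s")
```

With constant hashes the total work is quadratic (~n²/2 equality checks), so the timing gap widens sharply with input size, which matches the abrupt slowdown you observed.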