Comparing two LINQ applications: unexpected result - ASP.NET

I built two ASP.NET applications using LINQ. One connects to MS SQL Server, the other to a proprietary in-memory structure.
Both applications work with a table of 3 int fields containing 500,000 records (the memory structure is identical to the SQL Server table). The controls used are standard: GridView and ObjectDataSource.
In each application I measure the average time needed to process a paging click.
The LINQ + MS SQL application takes 0.1 sec per page change.
The LINQ + memory structure application takes 0.8 sec per page change.
This is a shocking result. Why does the application handling data in memory work 8 times slower than the application using the hard drive?

The primary factor will probably be algorithmic efficiency. LINQ-to-Objects works with IEnumerable<T> inputs and outputs, which are generally processed sequentially, whereas the database may have indexes that induce substantial speed-ups.

I can think of at least three reasons:
indexes
caching
special optimizations (e.g. TOP N SORT)
Indexes
There are many types of queries that will run very fast on a correctly indexed database but very slow if you iterate through a list in memory. For example, a lookup by ID (primary key) is almost instant in a database because the results are stored in a B-tree of very small height. Finding the same element in an in-memory list requires scanning the entire list.
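A quick illustration of that gap (a hypothetical C# sketch; the Dictionary plays the role of the database index here):

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

// 500,000 rows, mirroring the question's table size.
var rows = Enumerable.Range(0, 500_000)
                     .Select(i => new { Id = i, Value = i * 2 })
                     .ToList();

// An index-like structure: built once, then lookups are O(1).
var byId = rows.ToDictionary(r => r.Id);

var sw = Stopwatch.StartNew();
var viaIndex = byId[499_999];                   // hash lookup: effectively instant
Console.WriteLine($"index: {sw.Elapsed}");

sw.Restart();
var viaScan = rows.First(r => r.Id == 499_999); // linear scan of the whole list
Console.WriteLine($"scan:  {sw.Elapsed}");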
Caching
Your assumption is that the database always hits the disk. This is not always true. The database will try to hold as much data in memory as it can, so when you ask it for data it already has the answer ready for you. In particular it will hold commonly used indexes in memory and only hit the disk when necessary. The way the data is stored on disk and in memory is also carefully optimized to reduce disk seeks and page misses.
Optimizations
Even without indexes the database still knows many tricks that can speed things up. For example, if you run the following query against SQL Server (here via LINQ to SQL):
list.OrderBy(x => x.Value).Take(1)
it will be almost instant if there is an index on the Value column, but even without the index it will use a special optimization called TOP N SORT that runs in linear time. Check the execution plan for your query to see whether this optimization is being used. Note that this optimization is not implemented for LINQ to Objects. We can see this by running this code:
using System;
using System.Collections.Generic;
using System.Linq;

// Build a list of 10 million items with random ids.
Random random = new Random();
List<Foo> list = new List<Foo>();
for (int i = 0; i < 10000000; ++i)
{
    list.Add(new Foo { Id = random.Next() });
}

DateTime now = DateTime.UtcNow;
Foo smallest = list.OrderBy(foo => foo.Id).First(); // sorts all items just to take the first
Console.WriteLine(DateTime.UtcNow - now);

class Foo { public int Id; }
This code takes about 30 seconds to execute, and the execution time grows slightly faster than linearly as more items are added. Replacing the query with the following makes it take less than one second:
int smallestId = list.Min(foo => foo.Id);
This is because in LINQ to Objects, OrderBy is implemented with an O(n log n) algorithm, while Min uses an O(n) algorithm. However, when executed against SQL Server, both of these queries produce the same SQL, and both run in linear time, O(n).
So running a paging query like OrderBy(x => x.Something).Skip(50).Take(10) is faster in a database because far more engineering effort has gone into making exactly this kind of query fast. After all, the speed of such queries is a major selling point for databases.
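To make the cost concrete, here is a hypothetical sketch of what the in-memory paging path does on every page click (the GetPage name mirrors what an ObjectDataSource SelectMethod might call; it is illustrative, not from the question):

using System.Collections.Generic;
using System.Linq;

class Row { public int A; public int B; public int C; }

static class InMemoryPaging
{
    // Called once per paging click by the data source.
    public static List<Row> GetPage(List<Row> rows, int startRowIndex, int maximumRows)
    {
        // LINQ to Objects has no index to lean on: OrderBy sorts all
        // 500,000 rows (O(n log n)) before Skip/Take can discard
        // everything outside the requested page.
        return rows.OrderBy(r => r.A)
                   .Skip(startRowIndex)
                   .Take(maximumRows)
                   .ToList();
    }
}

SQL Server, by contrast, can compile the same logical query into a plan that walks an index or uses a TOP N SORT, touching far fewer rows per page.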

Related

AWS DAX Performance issues with table scan

Hi, I am working on a project that requires bringing all DynamoDB documents into memory. I am using the boto3 table.scan() method, which takes nearly 33 seconds for all 10k records.
I have configured DAX and am using it for the table scan, which takes nearly 42 seconds for the same 10k records with the same Lambda configuration. I tried multiple times; the results are the same.
I tried the code below:
import time
import amazondax

daxclient = amazondax.AmazonDaxClient.resource(endpoint_url="...")
table = daxclient.Table('table_name')

start_time = time.perf_counter()
scan_args = {}
retry = True
while retry:
    try:
        response = table.scan(**scan_args)  # TableName is implied by the Table resource
        retry = 'LastEvaluatedKey' in response
        scan_args['ExclusiveStartKey'] = response.get('LastEvaluatedKey')
    except Exception as e:
        print(e)
        break  # avoid spinning forever on a persistent error
print(time.perf_counter() - start_time)
I tried the boto3 get_item() method; that becomes faster: the first call takes 0.4 seconds and subsequent calls take 0.01 seconds.
Not sure why it is not working with the table scan method.
Please suggest.
DAX doesn't cache scan results. You therefore shouldn't expect a performance boost and, since you're bouncing through an extra server on the way to the database, you can expect a performance penalty.
You must have very large items to see these performance numbers. And are you doing scans a lot? You might want to double-check that DynamoDB is the right fit.

AWS Neptune Gremlin query slowness on cold call

I'm currently running some queries with a big performance gap between the first call (up to 2 minutes) and the following ones (around 5 seconds).
This duration difference can be seen through the Gremlin REST API in both execution and profile mode.
As the query loads a large amount of data, I suspect the issue comes from the caching behavior of Neptune in its default configuration. I was not able to find any way to improve this behavior through configuration and would be glad to have some advice on reducing the duration of the first call.
Context :
The Neptune database is running on a db.r5.8xlarge instance, and during execution CPU always stays below 20%. I'm also the only user on this instance during the tests.
As we don't have differential inputs, the database is recreated on a weekly basis and switched to production once the loader has loaded everything. Our database therefore has a short lifetime.
The database contains slightly more than 1,000,000,000 nodes and far more edges (probably around 10,000,000,000). Those edges are split across 10 edge labels, and most of them are not used in the current query.
Query :
// recordIds is a list of 50 ids.
g.V(recordIds).hasLabel('record')
// Convert local id to Neptune id.
.out('local_id')
// Go to the tree parent link (either myself if the edge comes back, or the real parent).
.bothE('tree_top_parent').inV()
// Clean duplicates.
.dedup()
// Follow the tree parent link backward to get all children; this step loads a large number of nodes belonging to the same tree.
.in('tree_top_parent')
.not(values('some flag').is('Q'))
// Limit not reached; the result is between 80k and 100k nodes.
.limit(200000)
// Convert back to local ids for the 80k to 100k selected nodes.
.in('local_id')
.id()
Neptune's architecture consists of a shared cluster "volume" (where all data is persisted and replicated six times across three availability zones) and a series of decoupled compute instances (one writer and up to 15 read replicas in a single cluster). No data is persisted on the instances; however, approximately 65% of an instance's memory capacity is reserved for a buffer pool cache. As data is read from the underlying cluster volume, it is stored in the buffer pool cache until the cache fills. Once the cache fills, a least-recently-used (LRU) eviction policy clears buffer pool cache space for newer reads.
It is common for first reads to be slower due to the need to fetch objects from the underlying storage. One can improve this by issuing "prefetch" queries that pull in objects likely to be needed in the near future.
If you have a use case that is filling buffer pool cache and constantly seeing buffer pool cache misses (a metric one can see in the CloudWatch metrics for Neptune), then you may also want to consider using one of the "d" instance types (ex: r5d.8xlarge) and enabling the Lookup Cache feature [1]. This feature specifically focuses on improving access to property values/literals at query time by keeping them in a directly attached NVMe store on the instance.
[1] https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-lookup-cache.html

What could cause a sqlite application to slow down over time with high load?

I'll definitely need to update this based on feedback so I apologize in advance.
The problem I'm trying to solve is roughly this.
The graph shows disk utilization in the Windows Task Manager. My SQLite application is a web server that takes in JSON requests with timestamps, looks up the existing entry in a 2-column key/value table, merges the request into the existing item (items don't grow over time), and then writes it back to the database.
The db is created as follows. I've experimented with and without WAL with no difference.
createStatement().use { it.executeUpdate("CREATE TABLE IF NOT EXISTS items ( key TEXT NOT NULL PRIMARY KEY, value BLOB );") }
The write/set is done as follows
try {
    val insertStatement = "INSERT OR REPLACE INTO items (key, value) VALUES (?, ?)"
    prepareStatement(insertStatement).use {
        it.setBytes(1, keySerializer.serialize(key))
        it.setBytes(2, valueSerializer.serialize(value))
        it.executeUpdate()
    }
    commit()
} catch (t: Throwable) {
    rollback()
    throw t
}
I use a single database connection the entire time which seems to be ok for my use case and greatly improves performance relative to getting a new one for each operation.
val databaseUrl = "jdbc:sqlite:${System.getProperty("java.io.tmpdir")}/$name-map-v2.sqlite"
if (connection?.isClosed == true || connection == null) {
connection = DriverManager.getConnection(databaseUrl)
}
I'm effectively serializing access to the db. I'm pretty sure the default threading mode for the SQLite driver is to serialize, and I'm also doing some serializing in Kotlin coroutines (via actors).
I'm load testing the application locally and I notice that disk utilization spikes around the one-minute mark, but I can't determine why. I do know that throughput plummets when that happens. I expect the server to chug along at a more or less constant rate. The db in these tests is pretty small too; it hardly reaches 1 MB.
Hoping people can recommend some next steps or set me straight as far as performance expectations. I'm assuming there is some SQLite-specific thing that happens when throughput is very high for too long, but I would have thought it would be related to WAL or something (which I'm not using).
I have a theory, but it's a bit far-fetched.
The fact that you hit a performance wall after some time makes me think that either a buffer somewhere is filling up, or some other kind of data-accumulation threshold is being reached.
Where exactly the culprit is, I'm not sure.
So, I'd run the following tests.
// At the beginning
connection.setAutoCommit(true);
If the problem is on the driver side, in the rollback transaction buffer, then this will slightly (hopefully) slow down operations, "spreading" the impact away from the one-minute mark. Instead of getting fast operations for 59 seconds and then some seconds of full stop, you get not-so-fast operations the whole time.
In case the problem is further down the line, try
PRAGMA JOURNAL_MODE=MEMORY
PRAGMA SYNCHRONOUS=OFF
The latter disables rollback journal synchronization (the data will be more at risk in case of a catastrophic power-down).
Finally, another possibility is that the page translation buffer fills up after a sufficient number of different keys have been entered. You can test this directly with these two tests:
1) pre-fill the database with all the keys in ascending order and a large request, then start updating those same keys.
2) run the test with only very few keys.
If the slowdown does not occur in either case, then it's either TLB buffer management that's not up to the challenge, or database fragmentation is the problem.
It might be that issuing
PRAGMA PAGE_SIZE=32768
upon database creation solves or mitigates the problem. Conversely, PRAGMA PAGE_SIZE=1024 could "spread" the problem, avoiding performance bottlenecks.
Another thing to try is closing the database connection and reopening it when it gets older than, say, 30 seconds. If this works, we'll still need to understand why it works (in this case I would suspect the JDBC driver).
First of all, I want to say that I do not use exactly your SQLite driver, and I work with different devices (but how different are they really?).
From what I see, correct me if I'm wrong, you use one transaction per insert statement. For every request you hit the disk and memory, open, close, etc. That can't be fast.
The first thing I do when I have to do inserts in SQLite is to group them and use a single transaction to perform them. That way, you use your resources in batches.
One transaction, many insert statements, a single commit. If there is a problem with a batch, handle the valid entries separately, log the faulty ones, and move on to the next batch of requests.
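A minimal sketch of that batching pattern, written in C# with Microsoft.Data.Sqlite for illustration (the question itself uses Kotlin/JDBC, where the shape is the same; the items table is taken from the question):

using System.Collections.Generic;
using Microsoft.Data.Sqlite;

static class BatchWriter
{
    public static void WriteBatch(SqliteConnection connection, List<(byte[] Key, byte[] Value)> batch)
    {
        // One transaction for the whole batch: one journal sync per commit
        // instead of one per row.
        using var transaction = connection.BeginTransaction();
        var command = connection.CreateCommand();
        command.Transaction = transaction;
        command.CommandText = "INSERT OR REPLACE INTO items (key, value) VALUES ($key, $value)";
        var keyParam = command.Parameters.Add("$key", SqliteType.Blob);
        var valueParam = command.Parameters.Add("$value", SqliteType.Blob);

        foreach (var (key, value) in batch)
        {
            keyParam.Value = key;
            valueParam.Value = value;
            command.ExecuteNonQuery();  // reuses the prepared statement
        }

        transaction.Commit();  // single commit for many inserts
    }
}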

Embedded key-value db vs. just storing one file per key?

I'm confused about the advantage of embedded key-value databases over the naive solution of just storing one file on disk per key. For example, databases like RocksDB, Badger, and SQLite use fancy data structures like B+ trees and LSM trees, but seem to get roughly the same performance as this simple solution.
For example, Badger (the fastest Go embedded db) takes about 800 microseconds to write an entry. In comparison, creating a new file from scratch and writing some data to it takes about 150 microseconds with no optimization.
EDIT: to clarify, here's the simple implementation of a key-value store I'm comparing with the state-of-the-art embedded dbs. Just hash each key to a string filename, and store the associated value as a byte array at that filename. Reads and writes are ~150 microseconds each, which is faster than Badger for single operations and comparable for batched operations. Furthermore, the disk-space usage is minimal, since we don't store any extra structure besides the actual values.
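For concreteness, here is a minimal sketch of that naive store, written in C# for illustration (the hashing choice and file layout are assumptions, not from the original post):

using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

class FilePerKeyStore
{
    private readonly string dir;

    public FilePerKeyStore(string dir)
    {
        this.dir = dir;
        Directory.CreateDirectory(dir);
    }

    // Hash the key to a fixed-length hex filename.
    private string PathFor(string key)
    {
        byte[] digest = SHA256.HashData(Encoding.UTF8.GetBytes(key));
        return Path.Combine(dir, Convert.ToHexString(digest));
    }

    public void Put(string key, byte[] value) => File.WriteAllBytes(PathFor(key), value);

    public byte[] Get(string key) => File.ReadAllBytes(PathFor(key));
}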
I must be missing something here, because the solutions people actually use are super fancy and optimized, using things like Bloom filters and B+ trees.
But Badger is not about writing "an" entry:
My writes are really slow. Why?
Are you creating a new transaction for every single key update? This will lead to very low throughput.
To get best write performance, batch up multiple writes inside a transaction using single DB.Update() call.
You could also have multiple such DB.Update() calls being made concurrently from multiple goroutines.
That leads to issue 396:
I was looking for fast storage in Go and so my first try was BoltDB. I need a lot of single-write transactions. Bolt was able to do about 240 rq/s.
I just tested Badger and I got a crazy 10k rq/s. I am just baffled
That is because:
LSM tree has an advantage compared to B+ tree when it comes to writes.
Also, values are stored separately in value log files so writes are much faster.
You can read more about the design here.
One of the main points (hard to replicate with simple reads/writes of files) is:
Key-Value separation
The major performance cost of LSM-trees is the compaction process. During compactions, multiple files are read into memory, sorted, and written back. Sorting is essential for efficient retrieval, for both key lookups and range iterations. With sorting, the key lookups would only require accessing at most one file per level (excluding level zero, where we’d need to check all the files). Iterations would result in sequential access to multiple files.
Each file is of fixed size, to enhance caching. Values tend to be larger than keys. When you store values along with the keys, the amount of data that needs to be compacted grows significantly.
In Badger, only a pointer to the value in the value log is stored alongside the key. Badger employs delta encoding for keys to reduce the effective size even further. Assuming 16 bytes per key and 16 bytes per value pointer, a single 64MB file can store two million key-value pairs.
Your question assumes that the only operations needed are single random reads and writes. Those are the worst-case scenarios for log-structured merge (LSM) approaches like Badger or RocksDB. A range query, where all keys or key-value pairs in a range are returned, leverages sequential reads (due to the adjacency of sorted kv pairs within files) to read data at very high speed. For Badger, you mostly get that benefit when doing key-only or small-value range queries, since keys are stored in an LSM tree while large values are appended to a not-necessarily-sorted log file. For RocksDB, you get fast kv-pair range queries.
The previous answer somewhat addresses the advantage on writes: the use of buffering (see the toy sketch below). If you write many kv pairs, rather than storing each in a separate file, LSM approaches hold them in memory and eventually flush them in a single file write. There's no free lunch, so asynchronous compaction must be done to remove overwritten data and prevent queries from having to check too many files.
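A toy sketch of that buffering idea, in C# (purely illustrative; real LSM engines add a write-ahead log, sorted on-disk tables, and compaction on top of this, and the segment file naming is hypothetical):

using System.Collections.Generic;
using System.IO;
using System.Text;

class ToyMemtable
{
    private readonly SortedDictionary<string, byte[]> buffer = new();
    private int flushCount;

    public void Put(string key, byte[] value)
    {
        buffer[key] = value;               // absorb writes in memory
        if (buffer.Count >= 1000) Flush(); // flush one sorted batch as a single file
    }

    private void Flush()
    {
        using var file = new BinaryWriter(File.Create($"segment-{flushCount++}.dat"));
        foreach (var (key, value) in buffer)  // keys come out in sorted order
        {
            byte[] k = Encoding.UTF8.GetBytes(key);
            file.Write(k.Length); file.Write(k);
            file.Write(value.Length); file.Write(value);
        }
        buffer.Clear();
    }
}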
Previously answered here. Mostly similar to the other answers provided here, but it makes one important additional point: files in a filesystem can't occupy the same block on disk. If your records are, on average, significantly smaller than the typical disk block size (4-16 KiB), storing them as separate files will incur substantial storage overhead. For example, a 100-byte record stored in its own 4 KiB block wastes roughly 40 times its size.

How to store big data for use in computation?

I have recently started working on big data. Specifically, I have several GBs of data and I have to do computation (addition, modification) on it frequently. Since any computation on the data takes a lot of time, I have been thinking about how to store the data for quick computation. Here are the options I have looked into:
Plain text file: The only advantage of this technique is that inserting data is very easy. Changes to existing data are pretty slow, since there is no way to search for records efficiently.
Database: Insertion and modification of data are simplified. However, since this is an ongoing research project, the schema may need to be updated frequently depending upon experimental results (this has NOT happened up to now, but will definitely be something that may happen in the near future). Besides, moving data around is not as simple as with plain files. Moreover, I have noticed that querying the data is not as quick as when it is stored in XML.
XML: Using BeautifulSoup, merely loading the XML file containing all the data takes around 15 minutes and consumes ~15 GB of RAM. Since it is quite normal to run scripts multiple times a day, ~15 minutes per invocation seems awfully long. The advantage is that once the data is loaded, I can search/modify elements (tags) fairly quickly.
JSON and YAML: I have not looked into them deeply. They can surely reduce the disk space needed to store the file (relative to XML). However, I have found no way to query records stored in these formats (unlike a database or XML).
What do you suggest I do? Do you have any other option in mind?
If you're looking for a flexible database for a large amount of data, MongoDB may be the technology you are looking for.
MongoDB belongs to the family of the NoSQL database systems and is:
based on JSON-alike documents
highly performant even with large amounts of data
schema-free
document-based
open-source
queryable
indexable
It allows you to modify your schema in the future in a very flexible way, makes it easy to insert data (1) and to modify the data and its structure (2), is faster than XML (3), and is JSON-based for efficient storage (4).
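A small sketch of what that looks like, assuming the official MongoDB .NET driver (the database, collection, and field names are hypothetical):

using MongoDB.Bson;
using MongoDB.Driver;

class MongoDemo
{
    static void Main()
    {
        var client = new MongoClient("mongodb://localhost:27017");
        var collection = client.GetDatabase("research")
                               .GetCollection<BsonDocument>("results");

        // Insert: no schema to declare up front; documents can gain fields later.
        collection.InsertOne(new BsonDocument { { "run", 42 }, { "score", 3.14 } });

        // Index the field you query on, then query it.
        collection.Indexes.CreateOne(
            new CreateIndexModel<BsonDocument>(
                Builders<BsonDocument>.IndexKeys.Ascending("run")));
        var doc = collection.Find(Builders<BsonDocument>.Filter.Eq("run", 42)).FirstOrDefault();
    }
}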

How to handle integers of more than 20 digits in C++?

The size of int is 4 bytes. A long long int is 8 bytes and can hold about 19 digits; an unsigned long long int is also 8 bytes and handles larger values than long long int, but still cannot hold all 20-digit values. Is there any way to handle values of more than 20 digits?
#include<iostream>
using namespace std;
int main()
{
    unsigned long long int a;
    cin >> a;
    // This literal has 21 digits and does not fit in any built-in
    // integer type, so this comparison will not even compile.
    if (a > 789456123789456123123) // want to accept numbers even larger than this
    {
        cout << "a is larger and big data" << endl;
    }
}
I searched for a while but didn't find helpful content; everything is about Java's BigInteger.
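For comparison, a sketch of the same check using an arbitrary-precision type; this is C# with System.Numerics.BigInteger (in C++, a library such as Boost.Multiprecision's cpp_int plays the same role):

using System;
using System.Numerics;

class BigIntDemo
{
    static void Main()
    {
        // BigInteger has no fixed size, so 21-digit (or larger) values are fine.
        BigInteger a = BigInteger.Parse(Console.ReadLine());
        BigInteger threshold = BigInteger.Parse("789456123789456123123");
        if (a > threshold)
        {
            Console.WriteLine("a is larger and big data");
        }
    }
}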
