I have a general question on memory consumption. I have a program that uses a QAbstractTableModel storing data in a QVector. This is used to populate a Table View for an excel like grid. It is 18 columns and the data consists of general text strings.
When loading approximately 300000 rows, my application uses around 1.1GB of RAM.
Is this normal? If not, any suggestions for reducing memory usage. I do need to see the entire dataset. I can open an equivalent file in Excel and it uses 200MB, so I assume there must be a way to handle this.
Thanks for any suggestions.
Related
I have the following situation:
library(RODBC)
channel <- odbcConnect("ABC", uid="DEF", pwd="GHI")
df <- sqlQuery(channel,query)
The number of rows is 10M+. Is there any faster way to read the data?
The data is in oracle database.
This definitely should be a comment but will be too long for the purposes.
When executing SQL there are a few likely bottlenecks
Executing the query itself
Download the data from the database
Converting the data to align with language specific types (eg. R integers rather than BIGINT etc.
If your query runs fast when executed directly on the database UI, it is unlikely that the bottleneck comes when executing the query itself. This is also immediately clear if your query only contains simple [RIGHT/LEFT/INNER/OUTER] JOIN, as these are not "complex" query operators as such. This is more often caused by more complex nested queries using WITH clauses or window functions. The solution here would be to create a VIEW such that the query will be pre-optimized.
Now what is more likely to be the problem is 2. and 3. You state that your table has 10M data points. Lets assume your table is financial and has only 5 columns, which are all 8bit floats ( FLOAT(8) ) or 8 bit integers, the amount of data to be downloaded is (8 * 5 * 10 M / 1024^3) Gbit = 0.37 Gbit, which itself will take some time to download depending on your cconnection. Assuming you have a 10 Mbit connection the download under optimal condition would be taking at least 37 seconds. And this is the case where your data has only 5 columns! It is not unlikely that you have many more.
Now for 3. it is more difficult to predict the amount of time spent without careful code profiling. This is the step where RODBC, odbc or RJDBC will have to convert the data into types that R understands. I am sorry to say that here it becomes a question of "trial and error" to figure out which packages work best. However for oracle specifics, I would assume DBI + odbc + ROracle (seems to be developed by oracle themselves??) would be a rather safe bet for a good contender.
Do however keep in mind, that the total time spent on getting data imported from any database into R is an aggregate of the above measures. Some databases provide optimized methods for downloading queries/tables as flat-files (csv, parquet etc) and this can in some cases speed up the query quite significantly, but at the cost of having to read from disk. This often also becomes more complex compared to executing the query itself, so one has to evaluate whether it is worth the trouble, or whether it is worth just waiting for the original query to finish executing within R.
I am new to Big Data, trying to understand different file formats from better query executions perspective , while trying to understand more about layout of the files of ORC & Parquet, came across the below terminology and wanted to validate my understanding.
Stripes - as per my understanding they are group of rows which belongs to same data set, for example if we have data for a different countries in a
file , the data for each country is in one Stripe , this helps in reducing the amount of data transferred from disk to memory as the Stripes not relevant can be skipped using Predicate push down.
VectorizedRowBatch , this helps in vectorization of columnar data for efficient CPU processing . it processes Batch by batch. in this context I could not understand the difference between Stripes & Batch. As I understood stripes helps to reduce the amount of data being transferred from disk to Memory , while Batch helps in processing the records paralelly using vectorization approach.
3.Stripes are only specific to ORC file format , similar concept for Parquet files is "Partitions" which is related data stored in RowGroups. which helps in Predicate push down.
for Parquet we do not have vectorization support , does that mean that ORC is better compared to Parquet for better CPU utilization.
Please correct me if my understanding is wrong for the points mentioned.
I have recently started working on big data. Specifically, I have several GBs of data and I have to do computation (addition, modification) on it frequently. Since any computation on the data takes a lot of time, I been thinking of how to store the data for quick computation. Following are the options I have looked into:
Plain text file: The only advantage of this technique is inserting data is very easy. Changes to existing data are pretty slow, since there is no way to search for records efficiently.
Database: Insertion and modification of data are simplified. However, since this is a ongoing research project, schema may need to be updated frequently depending upon experimental results (this has NOT happened uptil now, but will definitely be something that may happen in near future). Besides, moving data around is not simple (as compared to a simple files). Moreover, I have noticed that querying data is not that quick as compared to when data is stored in XML.
XML: Using BeautifulSoup, only loading the XML file containing all the data takes around ~15 minutes and takes up ~15GB of RAM. Since it is quite normal to run scripts multiple times in a day, ~15 minutes for every invocation seems awfully long. The advantage is once the data is loaded, I can search/modify elements (tags) fairly quickly.
JSON and YAML: I have not looked into it deeply. They can surely compress the disk space needed to store the file (relative to XML). However, I have found no way to query records when data is stored in these formats (unlike database or XML).
What do you suggest I do? Do you have any other option in mind?
If you're looking for a flexible database for a large amount of data, MongoDB may be the technology you are looking for.
MongoDB belongs to the family of the NoSQL database systems and is:
based on JSON-alike documents
highly performant even with large amounts of data
schema-free
document-based
open-source
queryable
indexable
It allows you to modify your schema in the future in a very flexible way, quite easy to insert data (1.), modify the data and its structure (2.), faster than XML (3.) and JSON-based for efficient storage (4.).
size of integer is 4,long long int is 8 byte and it can access about 19 digits data and for unsinged long long int size also 8 byte but handle larger value than long long int but this is less than 20 digits.Is there any way that can hangle over 20 digits data.
#include<iostream>
using namespace std;
int main()
{
unsigned long long int a;
cin>>a;
if(a>789456123789456123123)//want to take a higher thand this digits
{
cout<<"a is larger and big data"<<endl;
}
}
I searched about it for a while but didn't find helpful contents.all about is java biginteger..
The data contains information like billions of ID-scores pairs. To quickly access these paired information, I plan to use the hash-table container since its time complexity of search is O(1). Considering the the raw data is around 80G, I don't want to load the data into RAM every time when I need to run search application. What I want to do is to generate the hash-table once and then store it in RAM with persistence of filesystem lifetime (the expense of RAM is not a criteria), and search it with different applications.
Based on my limited understanding, I could use "Memory Mapped Files" (boost C++ libraries). But I have questions:
1) Is it possible to keep the hash-table data structure when write it to the mapped file?
2) How much time it will cost to map the existed file to RAM?
Any answers/comments/suggestions are most welcomed!
Thanks,
1) Yes. The file is just bytes, just like memory.
2) Creating the mapping will be effectively instantaneous. Node that you won't be able to map all of it contiguously at once except on a 64-bit OS. Of course, if the file cache can't hold the portion of the map you're using, it will have to be read from disk.
How big are IDs? How big are pairs? How much locality of reference do you have? (Are there heavily-used pair and lightly used pairs?) How often will you be searching for pairs that aren't present? Is the data read-mostly? There may be better ways to do it. I'd strongly suggest starting with a broader question to make sure you're not stuck on a sub-optimal path.
Updated question to be less vague.
I plan to log sensor data by time so something like sqlite would be perfect, but it requires too much resources in something like an atmega328p. Most of the searching will be done off the uC.
What do other people use? Flat text files? XML? A more complicated data structure?
Thanks for the feedback. It is good to know what other people are using. I've decided to serialize my data structures and save them in a binary file to eliminate string processing on the uC for now.
I've used flat text files for similar projects, albiet years ago, but I believe it's still a good approach for that environment. Since you don't need to process the data on-chip, you want it to be as efficient as possible (as little overhead as possible).
However, if you want more flexibility and weren't as concerned about space, perhaps saving JSON objects would be better, where each field is identified clearly. A tiny bit of overhead for creating the objects, but allows you to add and remove fields without complex logic on the interpreting side. I would pick JSON over XML just because you have about half the overhead (in space, and probably in processing).
With a small micro-controller like the 328, it is very important to determine the space requirements.
How big is each record? How many records do you want to store? How will you get the records off of the micro-controller?
Like Doug, I usually use a flat text to store data. So each record might contain year, day of the year and a value if I am storing a value once a day.
The file would look like:
11,314,100<cr>
11,315,99<cr>
11,316,98<cr>
11,317,220<cr>
You could store approximately 90-100 records, requiring you to dump the data every three months
If you need more then the 1kEEprom holds (200 5 byte records, 100 10 byte records or simliar) then you will need additional memory using an IC, SD or Flash.
If you want to unplug the memory and plug it into a PC, then the SD or Flash would be best.
You could use a vinculum chip from FTDIChip.com to simplify writing the fat files to flash drive.