GoodData Dataset Writer Maximum Rows?

Is there a maximum number of records that can be uploaded using a GoodData dataset writer in a single load? I have looked around and I do not see a documented value for this.

There is no limit specified!
However, expect things to get significantly slower somewhere between 10 and 100 million rows, especially if data relationships, such as keys in the table, are involved.

Related

Azure Analysis Services - partition to refresh modified rows only?

I have an AS tabular model that contains a fact table with 20 million rows. I have partitioned it so that only the new rows get added each day... however, occasionally a historical row (from years ago) will be modified. I can identify the modified row in SQL (using the last-modified timestamp), but would it be possible to refresh that row in SSAS to reflect the change without refreshing my entire data model? How would I achieve this?
First, 20 million rows is not a lot. I’m expecting that will only take 5-10 minutes to process unless your SQL queries are very inefficient or very wide. So why bother to optimize something which may be fast enough already?
If you do need to optimize it, you will first want to partition the large fact table by some date element. Since you only have 20 million rows I would suggest partitioning by year. Optimal compression will be achieved with around 8 million rows per partition. Over-partitioning (such as creating thousands of daily partitions) is counter-productive.
When a new row is added you could perform a ProcessAdd to insert just the new records into the partitions in question. However, I would recommend just doing a ProcessFull on any year partitions which have any inserts, updates or deletes in SQL.
SSAS doesn’t support updating a specific row. Thus you have to follow the ProcessFull advice above.
There are several code examples including this one which may help you.
Again this may be overkill if you only have 20 million rows.
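As a rough illustration (not the linked example above), a ProcessFull scoped to a single year partition can be expressed as a TMSL refresh command. The sketch below only builds that JSON; the database, table and partition names are placeholders, and the resulting script would be run through an XMLA query window in SSMS or the Invoke-ASCmd PowerShell cmdlet.

```python
import json

# Hypothetical names -- replace with your own model objects.
DATABASE = "SalesModel"
TABLE = "FactSales"
PARTITION = "FactSales 2016"   # the year partition containing the modified row

# TMSL "refresh" command; type "full" reprocesses just this partition,
# i.e. the equivalent of ProcessFull scoped to one partition
# (type "add" would be the ProcessAdd case for append-only days).
tmsl = {
    "refresh": {
        "type": "full",
        "objects": [
            {"database": DATABASE, "table": TABLE, "partition": PARTITION}
        ],
    }
}

# Paste the printed JSON into an XMLA window, or pass it to Invoke-ASCmd.
print(json.dumps(tmsl, indent=2))
```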

CSV/parquet to Dynamo, small file of ~500k rows, only two columns

I am looking for a way to upload CSV/parquet data to DynamoDB without having to create a data pipeline.
I have a small (12 MB parquet / 30 MB CSV) file which consists of two columns. It gets generated daily, and the DynamoDB table needs a full refresh each day.
At first I decided to use AWS Athena, which was very easy to set up. But reads are slow (each query takes 1.5 to 4 seconds). This process may be used by others in the company in the near future, so I am now seeking something quicker.
I looked into DynamoDB's BatchWriteItem operation, but it feels extremely inefficient to make about 500,000/25 = 20,000 calls a day to refresh this relatively small table.
What is a bit frustrating is that a single BatchWriteItem call has a maximum size of 16 MB, with 400 KB per item, which is almost the size of the file itself.
I looked into perhaps sending the data as one long row and splitting it, but I could not find such an operation. Curious if anybody has any input on this.
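For what it's worth, boto3's batch_writer handles the 25-item chunking and retries for you, so the daily refresh can be a short script. A minimal sketch, assuming a table named "daily-table" with partition key "pk" and a second attribute "value" (all names and the file path are placeholders):

```python
import boto3
import pandas as pd

TABLE_NAME = "daily-table"          # placeholder table name
PARQUET_PATH = "daily.parquet"      # placeholder path to the daily file

df = pd.read_parquet(PARQUET_PATH)  # ~500k rows, two columns: pk, value
table = boto3.resource("dynamodb").Table(TABLE_NAME)

# batch_writer() buffers items and issues BatchWriteItem requests of up to
# 25 items each, retrying unprocessed items automatically.
with table.batch_writer(overwrite_by_pkeys=["pk"]) as batch:
    for row in df.itertuples(index=False):
        # DynamoDB rejects native floats, so values are written as strings here.
        batch.put_item(Item={"pk": str(row.pk), "value": str(row.value)})
```

The per-request limits still apply underneath, so throughput is governed mostly by the table's write capacity (or on-demand mode) rather than by the number of calls you issue yourself.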

What are the maximum limits of MariaDB ColumnStore?

I want to create a wide table with thousands of columns in MariaDB ColumnStore. I couldn't find any documentation on the maximum number of columns allowed by the storage engine. I would also like to know how ColumnStore would perform with 1000 integer columns.
(Caveat: This 'Answer' is based on my understanding of the design, not on any 'facts'.)
The disk footprint of a table should be proportional to the number of columns.
As with most things in MariaDB, there is probably a hard limit on the number of columns, but I can think of no reason for it to be under 1000. Perhaps, instead, some larger power of 2.
When referencing only a small number of columns, it should not matter how many columns there are in the table. The way the data is structured should allow fetching each column with a relatively fixed amount of effort.
For filtering, I would expect the effort taken to depend on the number of columns used for filtering, and their distribution. If your WHERE clause references a lot of columns, I would not expect good performance.
With any engine, having lots of columns is not necessarily a wise design. In general, when you have lots of columns that are not used for filtering or sorting (WHERE, ORDER BY), you may as well toss them into a JSON string (or other structure) and store it as a single TEXT or BLOB column. Then let the application parse the string to get the individual columns.
ColumnStore shines for 'filtering'. It is also very good at compressing data, and my JSON suggestion would defeat this. But now you are into speed-vs-space tradeoffs that are very data-specific.
Would you care to describe your proposed dataset?
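To make the JSON suggestion above concrete, here is a minimal sketch; the table and column names (readings, sensor_id, measured_at, metrics_json) and the sample values are made up. Only the columns used for filtering stay as real columns; the rest get packed into one TEXT column.

```python
import json

# Hypothetical reading with ~1000 integer measurements; only sensor_id and
# measured_at are ever used in WHERE clauses.
reading = {
    "sensor_id": 42,
    "measured_at": "2016-05-01 10:00:00",
    "metrics": {f"m{i}": i * 3 for i in range(1000)},   # the wide part
}

# Filter columns stay as real columns; everything else goes into a single
# JSON TEXT column. The INSERT is plain MariaDB SQL with placeholders, to be
# executed through any connector (pymysql, mariadb, ...).
sql = "INSERT INTO readings (sensor_id, measured_at, metrics_json) VALUES (%s, %s, %s)"
params = (reading["sensor_id"], reading["measured_at"], json.dumps(reading["metrics"]))
# cursor.execute(sql, params)   # uncomment once a connection/cursor exists
print(sql)
```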
There is no limit on the number of columns for ColumnStore, but data ingestion performance is not the best at the moment. We will be reducing ingestion times significantly in the near future.
When I tried to create a ColumnStore table with 2310 columns, it returned "Error Code: 1117. Table definition is too large".
I decreased the number of columns and tried again.
It looks like the maximum number of columns for MariaDB ColumnStore is 2201.

MariaDB partitioning on the last 3 months

First, let me explain my problem:
This table will contain approximately 5,000,000 records per year, and these records will be kept for at least 10 years (the retention period is not yet defined). We are talking about production machine events. I generate a report plus a dashboard displaying various relatively complex information (average number of events per 10 minutes over a month, graphics, ...) and also want to see the records themselves. The data displayed will, in the large majority of cases, come from the last 2 months; viewing the rest of the data must always be possible, but a lower access speed is acceptable there.
I work on MariaDB v10.1.12.
The idea was to partition on the last 3 months. I realize now that this is not so easy; I have not found any solution for this partitioning. In fact, it is impossible to define a partition based on NOW(), CURRENT_DATE(), etc., either directly or indirectly via a computed column.
Do you have any ideas for me? Perhaps another solution than partitioning.
Thank you in advance.
I recommend PARTITION BY RANGE(TO_DAYS(...)). If you are only now breaking the table into partitions, I would recommend annual partitions for data before this year, then quarterly or monthly partitions henceforth. Yes, in theory that leads to an ever-growing number of partitions, but I predict that you will revamp the data structure within a few years.
20-50 partitions is a good number. More than that leads to inefficiencies due to the multitude of partitions; less than that leads to asking "why bother".
Use InnoDB. Design the PRIMARY KEY carefully, since it may be useful as the primary index into the data.
Usually it is best to put the date/timestamp column last in any indexes. Putting it first would be redundant since partition pruning comes first.
More on partitioning.
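A minimal sketch of what that DDL could look like, assuming a hypothetical table events with a DATETIME column event_ts (the names, start year and granularity are placeholders to adapt); the helper just generates the ALTER TABLE statement:

```python
from datetime import date

def partition_ddl(start_year=2010, yearly_until=2016, monthly_year=2016, monthly_months=12):
    """Build yearly partitions for history, then monthly partitions going forward."""
    parts = []
    for y in range(start_year, yearly_until):
        parts.append(f"PARTITION p{y} VALUES LESS THAN (TO_DAYS('{y + 1}-01-01'))")
    for m in range(1, monthly_months + 1):
        nxt = date(monthly_year + (m // 12), m % 12 + 1, 1)   # first day of next month
        parts.append(f"PARTITION p{monthly_year}{m:02d} VALUES LESS THAN (TO_DAYS('{nxt}'))")
    parts.append("PARTITION pfuture VALUES LESS THAN MAXVALUE")
    return (
        "ALTER TABLE events\n"
        "PARTITION BY RANGE (TO_DAYS(event_ts)) (\n  "
        + ",\n  ".join(parts)
        + "\n);"
    )

print(partition_ddl())   # run the printed DDL through any MariaDB client
```

Future months are added by reorganizing the pfuture partition, and old data can be removed cheaply with ALTER TABLE ... DROP PARTITION.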
It sounds like a main purpose for the table is to summarize the data for graphing, etc. In that case, it may be very beneficial to build and maintain "Summary table(s)" of counts and subtotals over selected time intervals. Do roughly 100 rows get added up for each 10-minute interval? If so, then the summary table based on 10-minute intervals would have 1/100th as many rows, and the queries would be much faster. Plus, you could 'denormalize' the summary tables to make them even simpler.
More on Summary tables.
It might be worth it to gather data for 10 minutes into a staging table, then summarize it into the summary table. And also throw the raw data into the big table.
Or, if the summary tables have everything you need, you could abandon the big table. Or, as a compromise, keep 12 month's worth of data (partitioned by month), and DROP PARTITION for older data. Meanwhile, the summary tables can continue to grow (although they will be much smaller).
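A minimal sketch of the 10-minute summary-table idea, with made-up table and column names (events, events_summary, machine_id, event_ts); the rollup could equally read from a staging table as described above:

```python
# CREATE statement for the summary table: one row per machine per 10-minute bucket.
CREATE_SUMMARY = """
CREATE TABLE events_summary (
  bucket_start DATETIME NOT NULL,
  machine_id   INT      NOT NULL,
  event_count  INT      NOT NULL,
  PRIMARY KEY (machine_id, bucket_start)
)
"""

# Roll the last 10 minutes of raw rows up into the summary table.
ROLLUP = """
INSERT INTO events_summary (bucket_start, machine_id, event_count)
SELECT
  FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(event_ts) / 600) * 600) AS bucket_start,
  machine_id,
  COUNT(*)
FROM events
WHERE event_ts >= NOW() - INTERVAL 10 MINUTE
GROUP BY bucket_start, machine_id
ON DUPLICATE KEY UPDATE event_count = event_count + VALUES(event_count)
"""

# Execute both statements through any MariaDB connector, e.g.:
# cursor.execute(CREATE_SUMMARY); cursor.execute(ROLLUP)
print(CREATE_SUMMARY, ROLLUP)
```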
Table partitioning is an advanced feature; it is not indexing, but a rearrangement of the table's data. So it is not a "duplicate"; new data will be stored according to the predefined partitioning ranges.
You must also specify the month range criteria as usual, and you MUST create an index if those columns are not used in the partition range. When you run a SELECT, the algorithm associated with the partitioned table will handle any merging (if required) in the background, so you can treat a partitioned table exactly like a typical table.
For more details, please check the MariaDB partitioning overview.

Database design question: How to handle a huge amount of data in Oracle?

I have over 1,500,000 data entries, and the number is going to increase gradually over time. This data would come from 150 regions.
Now, should I create 150 tables to manage this growing volume of data? Will this be efficient? I need fast operations. ASP.NET and Oracle will be used.
If all the data is the same, don't split it into different tables. Take a look at Oracle's table partitioning. One hundred fifty partitions (or more), split out by region, is probably more in line with what you're looking for.
I would also recommend you look at the Oracle Database Performance Tuning Tips & Techniques book and browse Ask Tom on Oracle's website.
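As a rough sketch of what region-based partitioning could look like, here is a list-partitioned table; the table name region_data, the columns, and the region values are all made up:

```python
# Hypothetical Oracle DDL for list-partitioning a fact table by region.
# Run the printed statement through SQL*Plus, SQL Developer, or any client.
DDL = """
CREATE TABLE region_data (
  id         NUMBER PRIMARY KEY,
  region_id  NUMBER NOT NULL,
  payload    VARCHAR2(4000),
  created_at DATE DEFAULT SYSDATE
)
PARTITION BY LIST (region_id) (
  PARTITION p_region_001 VALUES (1),
  PARTITION p_region_002 VALUES (2),
  -- ... one partition per region, or several regions grouped per partition ...
  PARTITION p_other      VALUES (DEFAULT)
)
"""
print(DDL.strip())
```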
Only 1.5 M rows? Not a lot really...
Use one table; working out how to write a 150-way union across 150 tables will be murder.
1.5 million rows doesn't really seem like that much. How many people are accessing the table(s) at any given point? Do you have any indexes set up? If you expect it to grow much larger, you may want to look into database partitioning.
FWIW, I work with databases on a regular basis with 100M+ rows. It shouldn't be this bad unless you have thousands of people using it at a time.
One table per region is far from normalized; you're probably going to lose a bunch of efficiency there. One table per data-entry site is pretty unusual too. Normalization is huge; it will save you a ton of time down the road, so I'd make sure you're not storing any duplicate data.
If you're using Oracle, you shouldn't need multiple tables. It'll support a lot more than 1.5 million rows. If you need to speed up data access, you can try a snowflake schema to pull in commonly accessed data.
If you mean 1,500,000 rows in a table then you do not have much to worry about. Oracle can handle much larger loads than that with ease.
If you need to identify the regions that the data came in, you can create a Region table and tie the ID from that to the big data table.
IMHO, you should post more details and we can help you better.
A database with 2,000 rows can be slow. It all depends on your database design, indexes, keys and, most importantly, the hardware configuration your database server is running on. The way your application uses the data is also important. Is it a read-intensive database or a transaction-intensive one? There is no single right answer to what you are asking right now.
You first need to consider what operations are going to access the table. How will inserts be performed? Will the existing rows be updated, and if so how? By how much will the rows grow, and what percentage of them will grow? Will rows get deleted? By what criteria? How will you be selecting data? By what criteria and how many per query?
Data partitioning can be used for volumes of data much larger than 1.5M rows. Look into optimizing the SQL queries, batch processing, and the storage of the data.
