Improve import speed of Azure Analysis Services model - azure-analysis-services

I currently have an Azure Analysis Services model deployed on an S4 instance that pulls data from a Synapse data warehouse.
The data warehouse is at the DWU5000c level.
The database connection runs in the xlargerc resource class, which permits it to use up to 70% of the available DWUs to pull in data.
Across the entire model there are only 200 columns in total, and the table sizes are as follows:
[Basket] -- 42,628,410 rows, 9 columns
[Coupon] -- 238,686,562 rows, 37 columns
[Dim_Calendar] -- 14,245 rows, n/a
[Dim_Product] -- 40,905,055 rows, 12 columns
[Discount] -- 9,096 rows, n/a
[Item] -- 408,550,310 rows, 53 columns
[Tender] -- 54,053,087 rows, 39 columns
At present, when I try to import a single table, say [Basket], it will only import just under 150,000 records per second.
Is there a mass bulk import setting I am missing? How can I alter the settings to improve the import speed?
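For reference, one setting worth double-checking on the Synapse side is which resource class the account used by the Analysis Services connection actually resolves to, since that is what caps the memory and concurrency granted to the extraction queries. A minimal sketch, assuming a hypothetical database user named aas_loader for that connection:

-- Run in the Synapse (SQL DW) database; aas_loader is a hypothetical user for the AAS data source
EXEC sp_addrolemember 'xlargerc', 'aas_loader';

-- List which resource-class roles each user belongs to, to confirm the assignment
SELECT r.name AS resource_class, m.name AS member_name
FROM sys.database_role_members AS rm
JOIN sys.database_principals AS r ON rm.role_principal_id = r.principal_id
JOIN sys.database_principals AS m ON rm.member_principal_id = m.principal_id
WHERE r.name LIKE '%rc';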

Related

InfluxDB optimize storage for 2.7 billion series and more

We're looking to migrate some data into InfluxDB. I'm working with InfluxDB 2.0 on a test server to determine the best way to store our data.
As of today, I have about 2.7 billion series to migrate to InfluxDB but that number will only go up.
Here is the structure of the data I need to store:
ClientId (332 values as of today, string of 7 characters)
Driver (int, 45k values as of today, will increase)
Vehicle (int, 28k values as of today, will increase)
Channel (100 values, should not increase, string of 40 characters)
value of the channel (float, 1 value per channel/vehicle/driver/client at a given timestamp)
At first, I thought of storing my data this way:
One bucket (as all data have the same data retention)
Measurements = channels (so 100 kinds of measurements are stored)
Tag Keys = ClientId
Fields = Driver, Vehicle, Value of channel
This gave me a cardinality of 1 * 100 * 332 * 3 = 99,600 according to this article.
But then I realized that InfluxDB handles duplicates based on "measurement name, tag set, and timestamp".
So for my data this will not work, as I need duplicates to be distinguished by ClientId, Channel, and Vehicle at a minimum.
But if I change my data structure to be stored this way:
One bucket (as all data have the same data retention)
Measurements = channels (so 100 kinds of measurements are stored)
Tag Keys = ClientId, Vehicle
Fields = Driver, Value of channel
then I'll get a cardinality of 2,788,800,000.
I understand that I need to keep cardinality as low as possible. (And ideally I would even need to be able to search by driver as well as by vehicle.)
My questions are:
If I split the data into different buckets (e.g. one bucket per ClientId), will that decrease my cardinality?
What would be the best way to store data for such a large number of series?

Teradata Fastload - Sequential order from flat file

A flat file is being ingested into a Teradata staging area using the Fastload utility. After this process, a merge operation will insert/update a target table based on the latest timestamp. I encountered a problem when the timestamp was the same for a customer. Let me explain that using the following data in the flat file:
Cust1 | 123 | 15-May-2018 13:01:01
Cust1 | 234 | 15-May-2018 13:01:01
Cust2 | 111 | 15-May-2018 13:02:01
This is the order of data in the flat file. As you can see, both records for Cust1 have the same timestamp, but the second record in the flat file is the latest, since the sequential write placed it in the second row.
How do I fetch this record to be used in the MERGE statement? Currently my MERGE statement partitions the data based on the TIMESTAMP value. Is there any way to capture the sequential order when Fastload runs, or some kind of row_id that could be used?
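One way to sketch this, assuming a running record number (seq_no, hypothetical name) can be appended to each line of the flat file and loaded into the staging table along with the data: the file order then becomes an explicit column, and the MERGE source can use it to break timestamp ties, e.g. with Teradata's QUALIFY clause:

-- Hypothetical staging columns: cust_id, some_value, load_ts, seq_no (file order)
-- Keep only the last record per customer, using seq_no to break timestamp ties
SELECT cust_id, some_value, load_ts
FROM stg_customer
QUALIFY ROW_NUMBER() OVER (
        PARTITION BY cust_id
        ORDER BY load_ts DESC, seq_no DESC) = 1;

Whether the sequence number is generated upstream or while preparing the file is a separate choice; Fastload itself just loads whatever columns are defined in the layout.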

Can I reliably query the Firebase intraday tables in BigQuery and get 100% of the event data?

I have two Firebase projects (one iOS and one Android) feeding into BigQuery. I need to combine, flatten, and aggregate some specific data from both projects into one combined table so that I can report off of it without querying all bazillion rows across all daily tables.
In order to populate this aggregate table, I currently have two python scripts querying the iOS and Android intraday tables every 5 minutes. The script gets the max timestamp from the aggregate table, then queries the intraday table to get any records with a greater timestamp (I track the max timestamp separately for iOS and Android because they frequently differ).
I am querying the intraday table with this (abbreviated) wildcard syntax:
SELECT yadda, yadda, timestamp_micros, 'ios' AS platform
FROM `myproject.iOSapp.app_events_intraday*`
WHERE timestamp_micros > (SELECT MAX(timestamp_micros)
                          FROM myAggregateTable
                          WHERE platform = 'ios')
Is there any danger that when the intraday table flips over to the new day, I will miss any records when my script runs at 23:57 and then again at 00:02?
I thought I would post the results of my testing this for a few months. Here are the basic mechanics as I see them:
New DAY1 intraday table is created at midnight GMT (xyz.app_events_intraday_20180101)
New DAY2 intraday table is created 24 hours later (xyz.app_events_intraday_20180102), but DAY1 intraday table sticks around for a few hours
Eventually, DAY1 table is "renamed" to xyz.app_events_20180101 and you are left with a single (current) intraday table
My tests have shown that additional data is added to the app_events_* tables even after step 3 has taken place, so it is NOT safe to assume that the data is stable/static once the name has changed. I have seen new data appear up to 2 or 3 days later.
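Given that, one hedged workaround is to have a periodic catch-up pass re-scan the last few daily tables as well as the intraday table, instead of treating the rename as a cutoff. A sketch using the same (abbreviated) columns as above; the 3-day window is only an assumption based on the lag observed:

SELECT yadda, yadda, timestamp_micros, 'ios' AS platform
-- app_events_* matches both app_events_YYYYMMDD and app_events_intraday_YYYYMMDD
-- (the 'intraday_...' suffix sorts above any date string, so it passes the filter below)
FROM `myproject.iOSapp.app_events_*`
WHERE _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))
  AND timestamp_micros > (SELECT MAX(timestamp_micros)
                          FROM myAggregateTable
                          WHERE platform = 'ios')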

Update a column in a table from other table with 6billion rows in oracle 11g

I have 2 tables, Table A and Table B. Both tables are about 500 GB in size. Some of the columns of the tables are as below.
Table A
ID
Type
DateModified
I added a new column CID to Table A, which is available in Table B.
Table B
ID
CID
DateGenerated
Table A is partitioned on DateModified; Table B is not partitioned. My task is to get the CID from Table B and update it in Table A. Both tables have billions of records.
I have tried MERGE/SQL, but it is too slow and cannot be completed within 2 days.
Adding a new column to an existing table causes row fragmentation. Updating the new column to some value will probably cause massive row chaining, partitioned or not. And yes, that is slow, even when there are sufficient indexes, etc.
Recommended approach:
You are on Enterprise Edition since you have partitioning, so you might be able to solve this using the schema versions functionality.
But if this is a one-time action and you do not know how to use it well, I would use a "create table ... as" approach: build the table from scratch and then switch it in when ready. Take care not to miss any trickle-loaded transactions. With partitioning it will be fast (writing 500 GB at, say, 50 MB/sec on a strong server is not unrealistic, taking about 3 hours).
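A minimal sketch of that approach, with hypothetical names and the partitioning/index/grant recreation left out; the LEFT JOIN keeps Table A rows that have no match in Table B:

-- Build the replacement table in one parallel, non-logged pass (sketch only)
CREATE TABLE table_a_new NOLOGGING PARALLEL 8 AS
SELECT /*+ PARALLEL(a 8) PARALLEL(b 8) */
       a.id, a.type, a.datemodified, b.cid
FROM   table_a a
LEFT JOIN table_b b
       ON b.id = a.id;

-- After validating row counts and rebuilding indexes/constraints, swap the names:
-- ALTER TABLE table_a RENAME TO table_a_old;
-- ALTER TABLE table_a_new RENAME TO table_a;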

Importing fields from multiple columns in an Excel spreadsheet into a single row in Access

We get new data for our database from an online form that outputs as an Excel sheet. To normalize the data for the database, I want to turn the multiple phone columns into one row per phone number.
Example, I want data like this:
ID | Home Phone | Cell Phone | Work Phone
1 .... 555-1234 ...... 555-3737 ... 555-3837
To become this:
PhoneID | ID | Phone Number | Phone type
1 ............ 1 ....... 555-1234 ....... Home
2 ............ 1 ....... 555-3737 ....... Cell
3 ............ 1 ....... 555-3837 ...... Work
To import the data, I have a button that finds the spreadsheet and then runs a bunch of queries to add the data.
How can I write a query to append this data to the end of an existing table without ending up with duplicate records? The data pulled from the website is all stored and archived in an Excel sheet that will be updated without removing the old data (we don't want to lose this extra backup), so with each import, I need it to disregard all of the previously entered data.
I was able to make a query that lists everything out in the correct format from the original spreadsheet (I imported the external spreadsheet into an unnormalized table in Access to test it), but when I try to append it to the phone number table, it adds all of the data again each time. I can remove it with a query that deletes duplicate data, but I'd rather not leave it like that.
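For the wide-to-long step itself, a minimal sketch of a union query, assuming the spreadsheet has been linked or imported as a staging table named tblImport with the column names shown above (hypothetical); PhoneID would come from an AutoNumber field in the destination table:

SELECT ID, [Home Phone] AS PhoneNumber, "Home" AS PhoneType FROM tblImport WHERE [Home Phone] Is Not Null
UNION ALL
SELECT ID, [Cell Phone], "Cell" FROM tblImport WHERE [Cell Phone] Is Not Null
UNION ALL
SELECT ID, [Work Phone], "Work" FROM tblImport WHERE [Work Phone] Is Not Null;

Saving that as a query gives a normalized row set that an append query can then pull from.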
There are several possible approaches to this problem; which one you choose may depend on the size of the dataset relative to the number of updates being processed. Basically, the choices are:
1) Add a unique index to the destination table, so that Access will refuse to add a duplicate record. You'll need to handle the possible warning ("Access was unable to add xxx records due to index violations" or similar).
2) Import the incoming data to a staging table, then outer join the staging table to the destination table and append only records where the key field(s) in the destination table are null (i.e., there's no matching record in the destination table).
I have used both approaches in the past - I like the index approach for its simplicity, and I like the staging approach for its flexibility, because you can do a lot with the incoming data before you append it if you need to.
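As a sketch of option 2, assuming hypothetical names: a staging table tblStaging already unpivoted to one row per phone number, and a destination table tblPhoneNumbers keyed on ID plus PhoneNumber:

INSERT INTO tblPhoneNumbers (ID, PhoneNumber, PhoneType)
SELECT s.ID, s.PhoneNumber, s.PhoneType
FROM tblStaging AS s
LEFT JOIN tblPhoneNumbers AS p
  ON (s.ID = p.ID) AND (s.PhoneNumber = p.PhoneNumber)
WHERE p.ID Is Null;

Only rows with no match in the destination survive the Is Null test, so re-importing the full archive sheet does not create duplicates.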
You could run a delete query on the table where you store the queried data and then run your imports, assuming that the data is only being updated. The delete query will remove all records, and then you can run the import to repopulate the table - therefore no duplicates.
