BigQuery streaming best practice - bigdata

I have been using Google BigQuery for some time now, uploading files (load jobs).
As I get some delays with this method, I am now trying to convert my code to streaming inserts.
I'm looking for the best solution here. What is the more correct way of working with BQ:
1. Using multiple (up to 40) different streaming machines, or directing traffic to a single endpoint (or a few) to upload the data?
2. Uploading one row at a time, or stacking 100-500 events into a list and uploading that?
3. Is streaming the way to go, or should I stick with file uploads, in terms of high volumes?
Some more data:
- we are uploading ~1,500-2,500 rows per second.
- using the .NET API.
- data needs to be available within ~5 minutes.
I didn't find a reference for this elsewhere.

The big difference between streaming data and uploading files is that streaming is intended for live data that is produced in real time as it is streamed, whereas with file uploads you load data that was stored previously.
In your case, I think streaming makes more sense. If something goes wrong, you only need to re-send the failed rows instead of the whole file, and it adapts better to the steadily growing data volume you seem to have.
The best practices in any case are:
- Reduce the number of sources that send the data.
- Send bigger chunks of data in each request instead of many tiny chunks.
- Use exponential back-off to retry requests that fail due to server errors (these are common and should be expected); see the sketch below.
There are certain limits that apply to load jobs as well as to streaming inserts.
For example, when streaming you should insert at most 500 rows per request and up to 10,000 rows per second per table.
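For illustration only (the question uses the .NET API, but the same pattern applies there), below is a minimal Python sketch of batching rows into chunks of up to 500 and retrying failed inserts with exponential back-off via the google-cloud-bigquery client. The table and field names are made up for the example, and in production you would re-send only the rows reported as failed rather than the whole batch.

import random
import time

from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "my-project.my_dataset.events"  # hypothetical table

def stream_rows(rows, batch_size=500, max_retries=5):
    # Insert rows in batches of <= 500, retrying with exponential back-off.
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        for attempt in range(max_retries):
            errors = client.insert_rows_json(TABLE_ID, batch)
            if not errors:
                break  # batch accepted
            time.sleep((2 ** attempt) + random.random())  # 1s, 2s, 4s, ... plus jitter
        else:
            raise RuntimeError(f"Batch at row {start} still failing after {max_retries} retries: {errors}")

# Example: a buffer of ~2,000 events collected over one second
stream_rows([{"device_id": i, "value": 42} for i in range(2000)])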

Related

Mule 4 : Design : How to process data[files/ database records] in Mule 4 without getting "out-of-memory" error?

Scenario:
I have a database that contains 100k records, which take up about 10 GB in memory.
My objective is to
fetch these records,
segregate the data based on certain conditions
then generate CSV files for each group of data
write these CSV files to a NAS (storage drive accessible over the same network)
To achieve this, I am thinking of the design as follows:
Use a Scheduler component that triggers the flow daily (at 9 am, for example)
Use a database select operation to fetch the records
Use a batch processing scope
In the batch step, use a reduce function in a Transform Message component and segregate the data in the aggregator in a format like:
{
"group_1" : [...],
"group_2" : [...]
}
In the On Complete step of the batch processing, use a File component to write the data to files on the NAS drive
Questions/Concerns:
Case 1: When reading from the database, the select operation loads all 100k records into memory.
Question: How can I optimize this step so that I can still process all 100k records without a spike in memory usage?
Case 2: When segregating the data, I am storing the isolated data in the aggregator object in the reduce operation, and that object stays in memory until I write it into files.
Question: Is there a way to segregate the data and write it directly to files in the batch aggregator step, and quickly free the memory held by the aggregator object?
Please treat this as a design question for Mule 4 flows and help me. Thanks to the community for your help and support.
Don't load 100K records into memory. Loading high volumes of data into memory will probably cause an out-of-memory error. You are not providing configuration details, but the Database connector 'streams' pages of records by default, so that part is already taken care of. Use the fetchSize attribute to tune the number of records read per page; the default is 10. The Batch scope buffers data on disk to avoid using RAM, and it also has parameters to help tune the number of records processed per step, for example the batch block size and the batch aggregator size. With default values you would never be anywhere near 100K records in memory. Also be sure to control concurrency to limit resource usage.
Note that even with reduced configuration values there will still be some spike when processing; any processing consumes resources. The idea is to have a predictable, controlled spike instead of an uncontrolled one that can exhaust the available resources.
This question is not clear. You can't control the aggregator memory other than through the aggregator size, but it looks like it only keeps the most recently aggregated records, not all of them. Are you actually having a problem with that, or is this a theoretical question?
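The Mule flow itself is configured in XML, but the underlying idea, reading the table in pages and flushing each group to its CSV file as you go instead of materializing all 100k records, can be sketched outside of Mule. Below is a hypothetical Python illustration using the DB-API cursor's fetchmany, roughly analogous to the connector's fetchSize; the table, column names, and sqlite3 driver are stand-ins only.

import csv
import sqlite3  # stand-in for whichever database is actually used

PAGE_SIZE = 1000  # analogous to the Database connector's fetchSize

def export_groups(db_path, nas_dir):
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute("SELECT id, category, payload FROM records")  # hypothetical table/columns
    writers = {}  # one open CSV writer per group, so rows never pile up in memory
    while True:
        page = cur.fetchmany(PAGE_SIZE)  # stream one page at a time
        if not page:
            break
        for record_id, category, payload in page:
            if category not in writers:
                f = open(f"{nas_dir}/group_{category}.csv", "w", newline="")
                writers[category] = (f, csv.writer(f))
            writers[category][1].writerow([record_id, payload])
    for f, _ in writers.values():
        f.close()
    conn.close()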

Exporting all Marketo Leads in a CSV?

I am trying to export all of my leads from Marketo (we have 20M+) into a CSV file, but there is a 10k row limit per CSV export.
Is there any other way that I can export a CSV file with more than 10k rows? I tried searching for various data loader tools on Marketo LaunchPoint but couldn't find one that would work.
Have you considered using the API? It may not be practical unless you have a developer on your team (I'm a programmer).
marketo lead api
If your leads are in Salesforce and Marketo/Salesforce are in parity, then instead of exporting all your leads, do a sync from Salesforce to the new MA tool (if you are switching). It's a cleaner, easier sync.
For important campaigns etc, you can create smart lists and export those.
There is no 10k row limit for exporting Leads from a list. However, there is a practical limit, especially if you choose to export all columns (instead of only the visible columns). I would generally advise on exporting a maximum of 200,000-300,000 leads per list, so you'd need to create multiple Lists.
As Michael mentioned, the API is also a good option. I would still advise to create multiple Lists, so you can run multiple processes in parallel, which will speed things up. You will need to look at your daily API quota: the default is either 10,000 or 50,000. 10,000 API calls allow you to download 3 million Leads (batch size 300).
I am trying out Data Loader for Marketo on Marketo LaunchPoint to export my lead and activity data to my local database. Although it cannot transfer Marketo data to a CSV file directly, you can download Leads to your local database and then export from there to get a CSV file. For reference, we have 100K leads and 1 billion activity records.
You might have to run it multiple times for 20M leads, but the tool is quite easy and convenient to use, so it may be worth a try.
There are four steps to bulk-extract leads from Marketo:
1. Create an export job
2. Enqueue the export lead job
3. Poll the job status
4. Retrieve your data
http://developers.marketo.com/rest-api/bulk-extract/bulk-lead-extract/
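For reference, here is a rough Python sketch of those four steps against the bulk extract REST endpoints documented at the link above. The instance URL, access token handling, field list, and date filter are placeholders to adapt; error and "Failed" status handling are omitted for brevity.

import time
import requests

BASE = "https://<munchkin-id>.mktorest.com"   # placeholder instance URL
AUTH = {"access_token": "<access token>"}     # obtained from the identity endpoint

# 1. Create an export job (fields and filter are examples only)
create = requests.post(
    f"{BASE}/bulk/v1/leads/export/create.json", params=AUTH,
    json={"fields": ["firstName", "lastName", "email"],
          "filter": {"createdAt": {"startAt": "2023-01-01T00:00:00Z",
                                   "endAt": "2023-01-31T00:00:00Z"}}}).json()
export_id = create["result"][0]["exportId"]

# 2. Enqueue the export lead job
requests.post(f"{BASE}/bulk/v1/leads/export/{export_id}/enqueue.json", params=AUTH)

# 3. Poll the job status until it completes
while True:
    status = requests.get(f"{BASE}/bulk/v1/leads/export/{export_id}/status.json",
                          params=AUTH).json()["result"][0]["status"]
    if status == "Completed":
        break
    time.sleep(30)

# 4. Retrieve the data as CSV
csv_text = requests.get(f"{BASE}/bulk/v1/leads/export/{export_id}/file.json", params=AUTH).text
with open("leads.csv", "w") as f:
    f.write(csv_text)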

Loading Bulk data in Firebase

I am trying to use the set API to write an object to Firebase. The object is fairly large; the serialized JSON is 2.6 MB in size. The root node has around 90 children, and in all there are around 10,000 nodes in the JSON tree.
The set api seems to hang and does not call the callback.
It also seems to cause problems with the firebase instance.
Any ideas on how to work around this?
Since this is a commonly requested feature, I'll go ahead and merge Robert and Puf's comments into an answer for others.
There are some tools available to help with big data imports, like firebase-streaming-import. What they do internally can also be engineered fairly easily for the do-it-yourselfer:
1) Get a list of keys without downloading all the data, using a GET request and shallow=true. Possibly do this recursively depending on the data structure and dynamics of the app.
2) In some sort of throttled fashion, upload the "chunks" to Firebase using PUT requests or the API's set() method.
The critical things to keep in mind here are that the number of bytes in a request and the frequency of requests will have an impact on performance for others viewing the application, and will also count against your bandwidth.
A good rule of thumb is that you don't want to do more than ~100 writes per second during your import, preferably fewer than 20 to maximize realtime speeds for other users, and that you should keep the data chunks in the low MBs, certainly not GBs per chunk. Keep in mind that all of this has to go over the internet.
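To make that concrete, here is a minimal Python sketch of the two-step approach against the Realtime Database REST API: a shallow GET to list keys that already exist, then throttled per-child PUTs. The database URL, node path, credential, and load_local_json helper are placeholders for the example.

import time
import requests

DB_URL = "https://<your-db>.firebaseio.com"    # placeholder database URL
AUTH = {"auth": "<database-secret-or-token>"}  # placeholder credential

# 1) List the top-level keys under the node without downloading the data itself
existing = requests.get(f"{DB_URL}/bigNode.json",
                        params={**AUTH, "shallow": "true"}).json() or {}

# local_data is the large dict to import, keyed by child name
local_data = load_local_json()  # hypothetical helper that reads the 2.6 MB JSON file

# 2) Upload one child per request, throttled well under ~100 writes/second
for key, chunk in local_data.items():
    if key in existing:
        continue  # already uploaded on a previous run; skip
    resp = requests.put(f"{DB_URL}/bigNode/{key}.json", params=AUTH, json=chunk)
    resp.raise_for_status()
    time.sleep(0.1)  # ~10 writes/second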

Is Xively a good fit where data is simple/infrequent, and processing of that data is done externally to it?

I'm looking to design a solution with a rather large number of Arduino devices all returning a very simple data point (let's say temperature for now so as to not release too much information). The single data point is collected only once a day and sent to a central site, from which reports can be generated.
All of the devices will have some device-specific data (a location ID and device ID, in combination unique across the entire network of devices) burnt into EEPROM. The data collected is simply that device-specific data and the temperature itself (but see question 2 below). So, a very simple payload.
Initial investigations into Xively suggest that every device must be created within Xively itself, but that's going to be a serious pain given the many hundreds we expect, even in the pilot program.
And, given that each device uploads its unique ID along with the temperature, it seems to make little sense to have to configure all of them within Xively when the data itself can be easily segregated and reported on at the back end based on the device-specific data.
The following diagram should illustrate what we're looking at:
So, a few questions:
1/ Is Xively a good fit for this sort of scheme? In other words, is it worth using as just a data collector from which we can access the data at the back end and make nice reports? We have no real interest (yet) in using Xively as the interface - for now, it's enough to collect the data at the central site, generate a PDF file and mail it out.
2/ Is it acceptable in Xively to define your single device (an uber-device) as "my massive cluster of Arduino nodes" and then have each node post its data as the uber-device? They seem to just refer to "device" without actually specifying any restrictions.
3/ Given that timestamp information is important to us, can Xively inject that information into its data when the API call is made to upload the data? That may remove the need for us to provide on-board clocks for the devices.
4/ Have people with Arduino experience implemented any other schemes like this (once a day upload)? The business prefers Xively so they don't have to set up their own servers to receive the data, but there may be other options with the same result.
Here we go:
Yes, this is exactly what Xively was designed for: a massive data broker, or, as the buzzword IoT guys like to call it, a Device Cloud, and one of the simplest and easiest to use on the market today.
I'm not sure whether there is any restriction on the number of datastreams a single feed can handle, but having thousands of datastreams per feed does not seem to me the smartest way of using Xively. Creating an individual feed for each physical device is the idea behind it: devices can auto-register and activate a pre-registered feed. Read the samples in the Xively tutorial; this is not difficult at all, and serial numbers can also be added/created in batches from text files.
Sure, you may provide timestamp information while uploading. If you do NOT provide it, Xively will assume it is a real-time feed and will add the current upload time to the data.
Surely it has been implemented before. It is important to note that Xively does not care who or what is providing data to a feed; you may share one key and feed number with thousands of devices, and they can all upload to the same feed or even to the same datastream. However, data uploaded this way can become very messy to manage because of the lack of granularity and fine control.

Out of Memory Exception in Matrix RDLC

I'm working on an RDLC report where I'm using a matrix to display the data.
The problem is that when a large amount of data is loaded, the report does not open and instead shows the error System.OutOfMemoryException.
Reports without the matrix work fine with the same large data.
I'm trying to load around 80,000 records. Has anyone faced the same problem?
The computer does not have sufficient memory to complete the requested operation when one or more of the following conditions are true:
A report is too large or too complex.
The overhead of the other running processes is very high.
The physical memory of the computer is too small.
A report is processed in two stages. The two stages are execution and rendering. This issue can occur during the execution stage or during the rendering stage.
If this issue occurs during the execution stage, this issue most likely occurs because too much memory is consumed by the data that is returned in the query result. Additionally, the following factors affect memory consumption during the execution stage:
Grouping
Filtering
Aggregation
Sorting
Custom code
If this issue occurs during the rendering stage, the cause is related to what information the report displays and how the report displays the information.
Solution:
Configure SQL Server to use more than 2 GB of physical memory.
Schedule reports to run at off-hours when memory constraints are lower.
Adjust the MemoryLimit setting accordingly.
Upgrade to a 64-bit version of Microsoft SQL Server 2005 Reporting Services.
Redesign the report, for example:
Return less data in the report queries.
Use a better restriction on the WHERE clause of the report queries.
Move complex aggregations to the data source.
Export the report to a different format. You can reduce memory consumption by rendering the report in a different format, such as Excel or PDF.
Simplify the report design: include fewer data regions or controls in the report, or use a drillthrough report to display details.
In my case it was not a problem of how big the dataset is (how many rows); it was a question of matrix report design. Problems arise if the variable in the columns part of the matrix has a large value domain (more than 300, let's say) and you already have a variable with a large value domain in the rows part of the matrix. It is not a problem when both variables with large value domains are in the rows part, or both in the columns part. So either use a different design, or create a dataset that takes into account the value domains of those variables.
