Context: We store historical data in Azure Data Lake as versioned parquet files from our existing Databricks pipeline where we write to different Delta tables. One particular log source is about 18 GB a day in parquet. I have read through the documentation and executed some queries using Kusto.Explorer on the external table I have defined for that log source. In the query summary window of Kusto.Explorer I see that I download the entire folder when I search it, even when using the project operator. The only exception to that seems to be when I use the take operator.
Question: Is it possible to prune columns to reduce the amount of data being fetched from external storage? Whether during external table creation or using an operator at query time.
Background: The reason I ask is that in Databricks it is possible to use the SELCECT statement to only fetch the columns I'm interested in. This reduces the query time significantly.
As David wrote above, the optimization does happen on Kusto side, but there's a bug with the "Downloaded Size" metric - it presents the total data size, regardless of the selected columns. We'll fix. Thanks for reporting.
I'm getting the following error when trying to connect Power BI to my tabular model in AS:
AnalysisServices: Cannot query internal supporting structures for column 'table'[column] because they are not processed. Please refresh or recalculate the table 'table'
It is not a calculated column and the connection seems to work fine on the local copy. I would appreciate any help with this!
This would depend on how you are processing the data within your model. If you have just done a Process Data, then the accompanying meta objects such as relationships have not yet been built.
Every column of data that you load needs to also be processed in this way regardless of whether it is a calculated column or not.
This can be achieved by running a Process Recalc on the Database or by loading your tables or table partitions with a Process Full/Process Default rather than just a Process Data, which automatically runs the Process Recalc once the data is loaded.
If you have a lot of calculated columns and tables that result in a Process Recalc taking a long time, you will need to factor this in to your refreshes and model design.
If you run a Process Recalc on your database or a Process Full/Process Default on your table now, you will no longer have those errors in Power BI.
More in depth discussion on this can be found here: http://bifuture.blogspot.com/2017/02/ssas-processing-tabular-model.html
If this question is deemed inappropriate because it does not have a specific code question and is more "am I barking up the right tree," please advise me on a better venue.
If not, I'm a full stack .NET Web developer with no SSRS experience and my only knowledge comes from the last 3 sleepless nights. The app my team is working on requires end users to be able to create as many custom dashboards as they would like by creating instances of a dozen or so predefined widget types. Some widgets are as simple as a chart or table, and the user configures the widget to display a subset of possible fields selected from a larger set. We have a few widgets that are composites. The Web client is all angular and consumes a restful Web api.
There are two more requirements, that a reasonable facsimile of each widget can be downloaded as a PDF report upon request or at scheduled times. There are several solutions to this requirement, so I am not looking for alternate solutions. If SSRS would work, it would save us from having to build a scheduler and either find a way to leverage the existing angular templates or to create views based off of them, populate them and convert that to a pdf. What I am looking for is he'll in understanding how report generation best practices and how they interact witg .NET assemblies.
My specfic task is to investige if SSRS can create a report based on a composite widget and either download it as a PDF or schedule it as one, and if so create a POC based on a composite widget that contains 2 line graphs and a table. The PDF versions do not need to be displayed the same way as the UI where the graphs are on the same row and the table is below. I can show each graph on its' own as long as the display order is in reading order. ( left to right, then down to the next line)
An example case could be that the first graph shows the sales of x-boxes over the course of last year. The line graph next to it shows the number of new releases for the X-Box over the course of last year. The report in the table below shows the number of X-box accessories sold last year grouped by accessory type (controller, headset, etc,) and by month, ordered by the total sales amount per month.
The example above would take 3 queries. The queries are unique to that users specific instance of that widget on that specific dashboard. The user can group, choose sort columns and anything else that is applicable.
How these queries are created is not my task (at least not yet.) So there is an assumption that a magic query engine creates and stores these sql queries correctly in the database.
My target database is sql 2012 and its' reporting service. I'm disappointed it only supports the 2.0 clr.
OI have the rough outline of a plan, but given my lack of experience any help with this would be appreciated.
It appears I can use the Soap service for scheduling and management. That's straight forward.
The rest of my plan sounds pretty crazy. Any corrections, guidance and better suggestions would be welcome. Or maybe a different methodology. The report server is a big security hole, and if I can accomplish the requirements by only referencing the reporting names paces please point me in the right direction. If not, this is the process I have cobbled together after 3 days of research and a few msdn simple tutorials. Here goes:
To successfully create the report definition, I will need to reference every possible field in the entire superset available. It isn't clear yet if the superset for a table is the same as the superset for a graph , but for this POC I will assume they are. This way, I will only need a single stored procedure with an input parameter that identifies the correct query, which I will select and execute. The result set will be a small subset of the possible fields, but the stored procedure will return every field, with nulls for each row of the omitted fields so that the report knows about every field. Terrible. I will probably be returning 5 columns with data and 500 full of nulls. There has to be a better way. Thinking about the performance hit is making me queasy, but that was pretty easy. Now I have a deployable report. I have no idea how I would handle summaries. Would they be additional queries that I would just append to the result set? Maybe the magic query engine knows.
Now for some additional ugliness. I have to request the report url with a query string that identifies the correct query. I am guessing I can also set the scheduler up with the correct parameter. But man do I have issues. I could call the url using httpWebRequest for my download, but how exactly does the scheduler work? I would imagine it would create the report in a similar fashion, and I should be able to tell it in what format to render. But for the download I would be streaming html. How would I tell the report server to convert it to a pdf and then stream it as such? Can that be set in the reports definition before deploying it? It has no problem with the conversion when I play around on the report server. But at least I've found a way to secure the report server by accessing it through the Web api.
Then there is the issue of cleaning up the null columns. There are extension points, such as data processing extensions. I think these are almost analogous to a step in the Web page life cycle but not sure exactly or else they would be called events. I would need to find the right one so that I can remove the null data column or labels on a pie chart at null percent, if that doesn't break the report. And I need to do it while it is still rdl. And just maybe if I still haven't found a way, transform the rdl to a pdf and change the content type. It appears I can add .net assemblies at the extension points. But is any of this correct? I am thinking like a developer, not like a seasoned SSRS pro. I'm trying, but any help pushing me in the right direction would be greatly appreciated.
I had tried revising that question a dozen times before asking, and it still seems unintelligible. Maybe my own answer will make my own question clear, and hopefully save someone else having to go through what I did, or at least be a quick dive into SSRS from a developer standpoint.
Creating a typical SSRS report involves (quick 40,000 foot overview)
1. Creating your data connection
2. Creating a SQL query or Queries which can be parameterized.
3. Datasets that the query result will fill
4. Mapping Dataset columns to Report Items; charts, tables, etc.
Then you build the report and deploy it to your report server, where the report can be requested by url with any SQL parameters Values added as a querystring:
http://reportserver/reportfolder/myreport?param1=data
How this works is that an RDL file (Report Definition Language) which is just an XML document with a specific schema is generated. The RDL has two elements that were relevant to me, and . As the names infer, the first contains the queries and the latter contains the graphs, charts, tables, etc. in the report and the mappings to the columns in the dataset.
When the report is requested, it goes through a processing pipeline on the report server. By implementing Interfaces in the reporting services namespace, one could create .NET assemblies that could transform the RDL at various stages in the pipeline.
Reporting Services also has two reporting API's. One for managing reports, and another for rendering. There is also the reportserver control which is a .NET Webforms control which is pretty rich in functionality and could be used to create and render reports without even needing a report server instance. The report files the control could generate were RDLC files, with the C standing for client.
Armed with all of this knowledge, I found several solution paths, but all of them were not optimal for my purposes and I have moved on to a solution that did not involve reporting services or RDL at all. But these may be of use to someone else.
I could transform the RDL file as it went through the pipeline. Not very performant, as this involved writing to the actual physical file, and then removing the modifications after rendering. I was also using SQL Server 2012, which only supported the 2.0/3.5 framework.
Then there were the services. Using either service, I could retrieve an RDL template as a byte array from my application. I wasn't limited by the CLR version here. With the management server, I could modify the RDL and deploy that to the Report Server. I would only need to modify the RDL once, but given the number of files I would need and having to manage them on the remote server, creating file structures by client/user/Dashboard/ReportWidget looked pretty ugly.
Alternatively, I instead of deploying the RDL templates, why not just store them in the database in byte array format. When I needed a specific instance, I could fetch the RDL template, add my queries and mappings to the template and then pass them to the execution service which would then render them. I could then save the resulting RDL in the database. It would be much easier for me to manage there. But now the report server would be useless, I would need my own services for management and to create subscriptions and to mail them I would need a queue service and an SMTP mailer, removing all the extras I would get from the report server, need to write a ton of custom code, and still be bound by RDL. So I would be creating RDLM, RDL mess.
It was the wrong tool for the job, but it was an interesting exercise, I learned more about Reporting Services from every angle, and was paid for most of that time. Maybe a blog post would be a better venue, but then I would need to go into much greater detail.
I am using Google BigQuery for sometime now, using upload files,
As I get some delays with this method I am now trying to convert my code into streaming.
Looking for best solution here, what is more correct working with BQ:
1. Using multiple (up to 40) different streaming machines ? or directing traffic to single or more endpoints to upload data?
2. Uploading one row at a time or stacking to a list of 100-500 events and uploading it.
3. is streaming the way to go, or stick with files uploading - in terms of high volumes.
some more data:
- we are uploading ~ 1500-2500 rows per second.
- using .net API.
- Need data to be available within ~ 5 minutes
Didn't find such reference elsewhere.
The big difference between streaming data and uploading files is that streaming is intended for live data that is being produced on real time while being streamed, whereas with uploading files, you would upload data that was stored previously.
In your case, I think Streaming makes more sense. If something goes wrong, you would only need to re-send the failed rows, instead of the whole file. And it adapts more to the growing files that I think you're getting.
The best practices in any case are:
Trying to reduce the number of sources that send the data.
Sending bigger chunks of data in each request instead of multiple tiny chunks.
Using exponential back-off to retry those requests that could fail due to server errors (These are common and should be expected).
There are certain limits that apply to Load Jobs as well as to Streaming inserts.
For example, when using streaming you should insert less than 500 rows per request and up to 10,000 rows per second per table.
We are using mondrian olap schema with saiku to analyse our records.We are using star schema model.We have one fact table which contains around 3000000 records. We have four dimension tables timestamp,rank,path and domain. Timestamp is almost unique for each entry . Now after deploying schema in saiku when we are performing analysis saiku takes a lot of time to return results. It takes 10 minutes to fetch 3000 records and if number of records are more than 50000 saiku dies.Please suggest me on what should I do in order to boost performance of saiku and mondrian.
You can easily figure out if this is database issue or saiku/mondrian problem:
Enable sql logging facility in saiku-server/tomcat/webapps/saiku/WEB-INF/classes/log4j.xml (uncomment section below Special Log File specifically for Mondrian SQL Statements text
Restart server
Do couple of typical analysis in Saiku
Get used queries from log
Analyze performance of queries directly in database (e.g. for PostgreSQL there is explain analyze command)
If performance of queries is as slow as in Saiku then you have identified your problem.
Btw. If you really have dimension for timestamp (by second?) than you should consider splitting it into two dimensions with days and seconds.
It's hard to understand what is your particular problem.
Two things helped us when we struggle with saiku performance problems:
indices on all fields and sometimes their combinations that may be
used as dimensions - helps like everywhere in DB
we avoided joines with other tables denormalizing our data