Partition Limit in AAS [closed] - azure-analysis-services

Noob Question
Is there a limit to the number of partitions that can be created in AAS?
We would like to create date-wise partitions in Azure Analysis Services to speed up the incremental loads we plan to run, since the data arrives from multiple sources and is constantly updated.

A limit to the number of partitions is not listed here. I would recommend leaning towards dozens or hundreds of partitions per table, not thousands. I would also lean towards ensuring that partitions generally have at least a million rows for optimal performance. Why? Because a million rows is the size of a segment in the VertiPaq compression scheme, and if your partitions are all much smaller (say 50,000 rows) you will cap the maximum segment size and make compression and performance worse.
That being said, partitioning is mainly about processing performance. So if partitioning by day and incrementally processing one or a few days of data reduces processing time significantly compared with partitioning weekly or monthly, then that sounds like a great partitioning scheme.
I suppose you could merge older daily partitions into monthly partitions once they are past the window in which they are frequently processed. I would recommend checking in DAX Studio whether merging actually reduces the number of segments, though. (I can't recall off the top of my head whether it does.)
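As a rough illustration of the segment-size point above (using the one-million-row figure from the answer and a hypothetical 50,000 rows per day), in R:

# Back-of-the-envelope check; the daily row count is made up.
segment_rows <- 1e6   # approximate VertiPaq segment size cited above
rows_per_day <- 5e4   # hypothetical daily load

rows_per_day / segment_rows        # 0.05 -> a daily partition fills only 5% of a segment
rows_per_day * 30 / segment_rows   # 1.5  -> a merged monthly partition spans full segments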

Related

How to make a choice between OpenTSDB and InfluxDB or other TSDS? [closed]

Both are open-source distributed time-series databases: OpenTSDB is for metrics and is built on HBase, while InfluxDB handles metrics and events and has no external dependencies.
Is there any other comparison between them?
And if I want to store and query/analyze time-series metrics in real time, with no loss or degradation of data over time, which would be better?
At one of the conferences I've heard people running something like Graphite/OpenTSDB for collecting metrics centrally and InfluxDB locally on each server to collect metrics only for this server. (InfluxDB was chosen for local storage as it is easy to deploy and lightweight on memory).
This is not directly related to your question, but the idea appealed to me a lot, so I wanted to share it.
Warp 10 is another option worth considering (I'm part of the team building it), check it out at http://www.warp10.io/.
It is based on HBase but also has a standalone version that works fine for volumes in the low hundreds of billions of datapoints, so it should fit most use cases out there.
Among the strengths of Warp 10 is the WarpScript language which is built from the ground up for manipulating (Geo) Time Series.
Yet another open-source option is blueflood: http://blueflood.io.
Disclaimer: like Paul Dix, I'm biased by the fact that I work on Blueflood.
Based on your short list of requirements, I'd say Blueflood is a good fit. Perhaps if you can specify the size of your dataset, the type of analysis you need to run or any other requirements that you think make your project unique, we could help steer you towards a more precise answer. Without knowing more about what you want to do, it's going to be hard for us to answer more meaningfully.

Estimating CPU and Memory Requirements for a Big Data Project [closed]

I am working on an analysis of big data, which is based on social network data combined with data on the social network users from other internal sources, such as a CRM database.
I realize there are a lot of good memory profiling, CPU benchmarking, and HPC packages and code snippets out there. I'm currently using the following:
system.time() to measure the current CPU usage of my functions
Rprof(tf <- "rprof.log", memory.profiling=TRUE) to profile memory usage
Rprofmem("Rprofmem.out", threshold = 10485760) to log objects that exceed 10 MB
require(parallel) to give me multicore and parallel functionality for use in my functions
source('http://rbenchmark.googlecode.com/svn/trunk/benchmark.R') to benchmark CPU usage differences in single-core and parallel modes
sort(sapply(ls(), function(x){format(object.size(get(x)), units = "Mb")})) to list object sizes
print(object.size(x=lapply(ls(), get)), units="Mb") to give me total memory used at the completion of my script
The tools above give me lots of good data points, and I know that many more tools exist to provide related information as well as to minimize memory use and make better use of HPC/cluster technologies, such as those mentioned in this StackOverflow post and in CRAN's HPC task view. However, I don't know a straightforward way to synthesize this information and forecast my CPU, RAM and/or storage requirements as the size of my input data grows with increasing usage of the social network that I'm analyzing.
Can anyone give examples or make recommendations on how to do this? For instance, is it possible to build a chart or a regression model that shows how many CPU cores I will need as the size of my input data increases, holding CPU speed and the allowed run time constant?
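One way to approach this is to benchmark the pipeline at several input sizes, fit a scaling model, and extrapolate. A minimal R sketch of that idea (run_analysis() is a placeholder for the real workload, the sizes are made up, and memory is only measured for the raw input object):

# Placeholder workload standing in for the real analysis pipeline.
run_analysis <- function(n) {
  x <- rnorm(n)
  summary(lm(x ~ seq_along(x)))
}

sizes   <- c(1e5, 2e5, 5e5, 1e6, 2e6)
elapsed <- sapply(sizes, function(n) system.time(run_analysis(n))["elapsed"])
mem_mb  <- sapply(sizes, function(n) as.numeric(object.size(rnorm(n))) / 2^20)

# Fit simple scaling models; swap in log-log fits if growth looks nonlinear.
time_fit <- lm(elapsed ~ sizes)
mem_fit  <- lm(mem_mb ~ sizes)

# Extrapolate to an expected future input size, e.g. 10 million rows.
future <- data.frame(sizes = 1e7)
predict(time_fit, newdata = future)   # projected seconds per run
predict(mem_fit,  newdata = future)   # projected MB for the input object

Plotting elapsed against sizes together with the fitted line gives the kind of chart asked about, and the same model can be refit as real usage data accumulates.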

What is the performance of subqueries vs two separate select queries? [closed]

Is one generally faster than the other with SQLite file databases?
Do subqueries benefit from some kind of internal optimization, or are they handled internally as if you ran two separate queries?
I did some testing but I don't see much difference, probably because my table is still too small (fewer than 100 records)?
It depends on many factors. Two separate queries mean two requests. A request has a little overhead, which weighs more heavily if the database is on a different server. For subqueries and joins, the data needs to be combined. Small amounts can easily be combined in memory, but if the data gets bigger it might not fit, forcing temporary data to be swapped to disk and degrading performance.
So, there is no general rule to say which one is faster. It's good to do some testing and find out about these factors. Do some tests with 'real' data, and with the amount of data you are expecting to have in a year from now.
And keep testing in the future. A query that performs well, might suddenly become slow when the environment or the amount of data changes.
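To make the "do some testing" advice concrete, here is a minimal R sketch (assuming the DBI and RSQLite packages; the table layout and row counts are invented) that times a subquery against the two-query approach on the same SQLite data:

library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Toy data standing in for real tables.
dbWriteTable(con, "customers",
             data.frame(id = 1:1000,
                        country = sample(c("NL", "DE", "FR"), 1000, replace = TRUE)))
dbWriteTable(con, "orders",
             data.frame(id = 1:50000,
                        customer_id = sample(1:1000, 50000, replace = TRUE),
                        amount = runif(50000, 1, 500)))

# Variant 1: one query containing a subquery.
t_sub <- system.time(
  dbGetQuery(con, "SELECT * FROM orders
                   WHERE customer_id IN (SELECT id FROM customers WHERE country = 'NL')")
)

# Variant 2: two separate queries, combined on the client side.
t_two <- system.time({
  ids <- dbGetQuery(con, "SELECT id FROM customers WHERE country = 'NL'")$id
  dbGetQuery(con, sprintf("SELECT * FROM orders WHERE customer_id IN (%s)",
                          paste(ids, collapse = ",")))
})

rbind(subquery = t_sub, two_queries = t_two)

# SQLite can also report how it plans the subquery version.
dbGetQuery(con, "EXPLAIN QUERY PLAN
                 SELECT * FROM orders
                 WHERE customer_id IN (SELECT id FROM customers WHERE country = 'NL')")

dbDisconnect(con)

Rerun the same script against a database with your expected future volume to see whether the difference starts to matter.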

Scrum estimation unit [closed]

My team estimates tasks in hours, which matches the TFS Scrum template nomenclature. However, I've heard recently that tasks should be estimated in some abstract unit and that using hours is evil. What is the recommended way?
You can estimate in hours provided your team velocity is also based on hours since that's how you decide how many backlog items are likely to be delivered in a sprint.
But it's not necessary to use hours and it can sometimes give a false sense of exactness.
If you use an abstract unit for both estimating and velocity, you (or, more correctly, stakeholders and others who don't understand Agile) won't be confused into thinking that hours is an exact measure.
The confusion will stem from the fact that velocity is work-units-per-sprint, and "hours per sprint" will be unchanging if your sprints are always a fixed size (say four weeks, which will always be expected to be four weeks times forty hours times some number of workers).
However, your velocity is actually expected to change over time as the team becomes more adept, or experienced people get replaced with those with less experience, or half the team takes a month off on holidays.
That's why the whole concept of story points exists. They provide such an abstract measure so as to avoid this confusion. You simply estimate your backlog items in story points and keep a record of how many points the team delivers each sprint (the velocity).
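As a small worked example of that bookkeeping (all numbers made up), in R:

# Story points delivered in the last five sprints.
points_per_sprint <- c(21, 18, 25, 23, 22)
velocity <- mean(points_per_sprint)            # about 21.8 points per sprint

backlog_points    <- 130                       # estimated points left in the backlog
sprints_remaining <- ceiling(backlog_points / velocity)
sprints_remaining                              # 6 sprints at the current velocity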

Which R functions are useful for analysis of an investment strategy's profitability? [closed]

I have multiple variations of an automated strategy for trading certain investment vehicles. For each of these variations I have cross-validated backtests on historic data. I need to pick the best-performing test. There is significant variation between the tests in terms of trades per day, net position size, etc. This makes it difficult to compare one to another.
The nature of the test relies on the predictions of a multidimensional nearest-neighbor search.
Having been recently acquainted with R, I am looking for packages/functions/tests to help me analyze various elements of my strategies' performance. Particularly, I am interested in two things:
1. Packages/functions/metrics that gauge the efficacy of my predictor.
2. Packages/functions/metrics that gauge the relative "profitability" of one variation to another.
If you know something that I should take a look at, please do not hesitate to post it!
I would definitely take a look at these two R Task views:
Taskview Econometrics
Taskview Finance
They provide a broad overview of the kind of packages that are used in these fields. Googling for:
using R for financial analysis
also got me a lot of hits that are relevant for your situation. Good luck!
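As a minimal sketch of the kind of comparison those packages support (the return series below are simulated stand-ins for two backtest variants, and the xts and PerformanceAnalytics packages from the Finance task view are assumed to be installed):

library(xts)
library(PerformanceAnalytics)

# Simulated daily returns for two strategy variants (placeholders for real backtests).
dates <- seq(as.Date("2020-01-01"), by = "day", length.out = 250)
set.seed(1)
returns <- xts(cbind(variantA = rnorm(250, 0.0005, 0.010),
                     variantB = rnorm(250, 0.0003, 0.007)),
               order.by = dates)

table.AnnualizedReturns(returns)     # annualized return, volatility, Sharpe ratio
maxDrawdown(returns)                 # worst peak-to-trough loss per variant
charts.PerformanceSummary(returns)   # cumulative return and drawdown plots

For gauging the nearest-neighbour predictor itself (your first point), the caret package's cross-validation and accuracy metrics are also worth a look.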
