SKU comparison for ADX VM - azure-data-explorer

The linked page shows a comparison of the various VM SKUs available for an ADX cluster. My question is about the following two SKUs:
1. D14 v2 (category: compute-optimized), SSD: 614 GB, cores: 16, RAM: 112 GB
2. DS14 v2 + 4 TB PS (category: storage-optimized), SSD: 4 TB, cores: 16, RAM: 112 GB
Purely looking at the numbers (SSD, RAM, cores), it looks like #2 has everything #1 has, plus 4 TB of SSD instead of only 614 GB. Based on that, I would always choose #2 over #1. So what does the category mean here? #1 falls in the category "compute-optimized" whereas #2 is "storage-optimized". If the category is decided on the basis of the configuration listed here, then #2 should qualify as both storage- and compute-optimized, because it has the same compute as #1 plus something extra. So why is #2 listed only as storage-optimized? I am trying to understand whether #1 has an additional edge over #2 for compute-intensive jobs, because looking only at these numbers I don't see any reason (apart from cost, which is not very different either) why I shouldn't always use #2 over #1. Perhaps #1 has something unique that #2 is missing which is not specified on that page.

Based on your question, it appears you're largely disregarding cost. The table in the same doc you've linked to summarizes the main considerations for choosing a SKU; you can see that one of them is cost per GB of cache per core.
Another example: let's assume you can reach the same total cache (SSD) size with either of the SKUs you mentioned. With one, your cluster will have X nodes, and with the other Y nodes. If Y > X, the data in the second cluster will be distributed across more nodes, allowing more parallelism during ingestion and queries. Of course, the cost of the two options could differ.
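To make the node-count point concrete, here is a rough back-of-the-envelope sketch (my own illustration, not from the doc; the cache sizes come from the question, while the hourly prices are made-up placeholders you should replace with figures from the cost estimator):

import math

# Cache sizes are taken from the question; the hourly prices are made-up
# placeholders, to be replaced with real figures from the cost estimator.
skus = {
    "D14 v2":            {"cache_gb": 614,  "cores": 16, "price_per_hour": 1.0},
    "DS14 v2 + 4 TB PS": {"cache_gb": 4096, "cores": 16, "price_per_hour": 1.5},
}

target_cache_gb = 8000  # example: you want roughly 8 TB of hot cache in total

for name, sku in skus.items():
    nodes = math.ceil(target_cache_gb / sku["cache_gb"])  # nodes needed to hold the cache
    print(f"{name}: {nodes} nodes, {nodes * sku['cores']} cores in total, "
          f"~${nodes * sku['price_per_hour']:.2f}/hour with the placeholder prices")

With these numbers, the compute-optimized SKU reaches the target cache with many more nodes (and cores) than the storage-optimized one, which is exactly the parallelism-versus-cost trade-off described above.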
Last, given that cost isn't irrelevant in your case, I would strongly recommend that you use the cost estimator and see how a different choice of SKU affects the total estimated cost of your cluster (assuming you know the volumes of data you're dealing with).

Related

Confusion about raft algorithm

In the paper "In Search of an Understandable Consensus Algorithm", Figure 8 shows a problem in (d) and (e): some old log entries may be overwritten and never come back.
In section 5.4.2, it says “To eliminate problems like the one in Figure 8, Raft never commits log entries from previous terms by counting replicas. Only log entries from the leader’s current term are committed by counting replicas; once an entry from the current term has been committed in this way, then all prior entries are committed indirectly because of the Log Matching Property”.
I'm confused about that part: how does it work in Figure 8? What will happen and what will not?
Let's apply the rule to Figure 8:
Raft never commits log entries from previous terms by counting replicas.
Now that we never commit log entries from previous terms by counting replicas, let's see what happens in Figure 8. I modified Figure 8 to show the situation after applying the rule.
(a) and (b) work the same.
Starting from (c): the log entry at index 2 was appended in term 2 back in step (a) (where I drew a yellow circle), so it is from a previous term. Thus the leader will not replicate that entry by itself (the yellow 2 with my black cross) according to the rule; it must start replicating from the entry at index 3.
This is also reflected in Figure 2, "Rules for Servers", leaders' rule 3.1:
Send AppendEntries RPC with entries starting at nextIndex.
nextIndex is initialized to last log index + 1, so in (c) it should start at log index 3, not index 2.
So for the hypothetical procedure in the original (c), it is impossible for the entry at index 2 to reach a majority before the entry at index 3 (the pink one appended in term 4) is replicated to a majority, and (d) will not happen.
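As a small illustration of the nextIndex mechanics described above (a hedged sketch with made-up names, not code from the paper): a leader always sends a suffix of its log starting at the follower's nextIndex, so the old term-2 entry at index 2 can never travel to a follower without the newer term-4 entry at index 3 coming along with it.

# Illustrative names only, not code from the paper.
log = [("t1", "cmd1"), ("t2", "cmd2"), ("t4", "cmd3")]  # S1's log in (c), 1-indexed

def entries_to_send(next_index):
    return log[next_index - 1:]  # the suffix of the log starting at nextIndex

next_index = len(log) + 1        # initialized to last log index + 1 after election
while next_index > 1:
    print(next_index, entries_to_send(next_index))
    next_index -= 1              # backed off when the AppendEntries consistency check fails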
UPDATE: 2020-12-04
@coderx and @OrlandoL have discussed (a), (b) and how S5 might not be able to become leader in the comments. Their discussion makes this answer more complete, so I put a reference here.
Basically, (a) and (b) are not guaranteed to happen; there are cases where S5 won't be elected leader, for example S3 and S4 have the same chance of becoming leader (please see the comments for details).
That observation is correct: S5 may not become leader, in which case the subsequent procedure won't happen.
But let's go back to Figure 8 in the paper and read the figure's caption:
"A time sequence showing why a leader cannot determine commitment using log entries from older terms. In (a) S1 is leader and partially replicates the log entry at index 2. In (b) S1 crashes; S5 is elected leader for term 3 with votes from S3, S4, and itself, and accepts a different entry at log index 2."
IMO, the author is talking about the case where S5 is elected leader, so the whole procedure makes sense.
As @OrlandoL mentioned, in the MIT 6.824 labs you should consider all conditions in order to have a correct Raft implementation.
Hope this helps.
Raft doesn't commit entries from previous terms by counting replicas, because such entries could still be overwritten by a future leader, just like leader S5 does in (d).
Suppose leader S1 in (c) committed the entry at index 2 of term 2; that entry would then be applied by S1, S2 and S3. If S1 then crashes, it is entirely possible for S5 to become leader as in (d), because its log is more up-to-date than those of S2, S3 and S4. S5 would overwrite the term-2 entry at index 2 with its own term-3 entry. This means leader S5 overwrites a committed entry! Some servers (S1, S2 and S3) have applied the term-2 entry, while others (S4, S5) would apply the term-3 entry at index 2, which violates the State Machine Safety property in Figure 3.
So leader S1 of term 4 in (c) cannot commit the term-2 entry at index 2 on its own; it can only commit an entry of its own term, such as the term-4 entry at index 3, as in (e). Once the entry at index 3 of term 4 is committed, the term-2 entry at index 2 is committed indirectly, and neither will ever be overwritten by a future leader. (A candidate can become leader only if its log contains all committed entries from previous terms.)
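To make the rule concrete, here is a minimal sketch (my own illustration, not from either answer) of how a leader might advance its commit index while only counting replicas for entries of its current term; all names are illustrative.

# Hedged sketch: committing a current-term entry indirectly commits everything before it.
def advance_commit_index(log, current_term, commit_index, match_index, cluster_size):
    """log: list of (term, command), 1-indexed via log[i - 1].
    match_index: highest log index known to be replicated on each follower."""
    for n in range(len(log), commit_index, -1):        # try the highest index first
        if log[n - 1][0] != current_term:
            continue                                    # never commit old-term entries by counting replicas
        replicas = 1 + sum(1 for m in match_index.values() if m >= n)  # leader itself + followers
        if replicas * 2 > cluster_size:                 # replicated on a majority
            return n                                    # entries 1..n are now committed
    return commit_index

# S1 as leader of term 4 in (c): index 2 holds a term-2 entry, so replicating it
# to a majority alone does not advance the commit index.
log = [(1, "x"), (2, "y"), (4, "z")]
print(advance_commit_index(log, 4, 1, {"S2": 2, "S3": 2, "S4": 1, "S5": 1}, 5))  # -> 1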

Display used CPU hours with slurm

I have a user account on a supercomputer where jobs are handled with Slurm.
I would like to know the total number of CPU hours that I have consumed on this supercomputer. I think that's an understandable question, because only a limited number of CPU hours is available per project. I'm surprised that an answer is not easy to find.
I know that there are all these commands like sacct, sreport, sshare, etc., but it seems that there is no simple command that displays the used CPU hours.
Can someone help me out?
As others have commented, sacct should give you that information. You will need to look at the man page to get information for past jobs. You can specify a --starttime and --endtime to restrict your query to match your allocation as it ends/renews. The -l option gives you more information than you need, so you can request a smaller set of fields with --format.
In your case, the correct answer is to ask the administrators. You have been given an allocation of time to draw from, and they likely have a system that shows you your balance, which you can reconcile against the output of sacct. Also, if the system you are using has different node types, such as high-memory, GPU, MIC, or older nodes, they will likely charge you differently for those resources.
You can get an overview of the used CPU hours with the following:
sacct -S YYYY-MM-DD -u username -o jobid,nodelist,state,start,end,alloccpus,cputime | column -t
You could calculate the total accounting units (SBU in our system) by multiplying CPUTime by AllocCPUS, i.e. multiplying the total (system + user) CPU time by the number of CPUs used.
An example:
JobID NodeList State Start End AllocCPUS CPUTime
------------ --------------- ---------- ------------------- ------------------- ---------- ----------
6328552 tcn[595-604] CANCELLED+ 2019-05-21T14:07:57 2019-05-23T16:48:15 240 506-17:12:00
6328552.bat+ tcn595 CANCELLED 2019-05-21T14:07:57 2019-05-23T16:48:16 24 50-16:07:36
6328552.0 tcn[595-604] FAILED 2019-05-21T14:10:37 2019-05-23T16:48:18 240 506-06:44:00
6332520 tcn[384,386,45+ COMPLETED 2019-05-23T16:06:04 2019-05-24T00:26:36 72 25-00:38:24
6332520.bat+ tcn384 COMPLETED 2019-05-23T16:06:04 2019-05-24T00:26:36 24 8-08:12:48
6332520.0 tcn[384,386,45+ COMPLETED 2019-05-23T16:06:09 2019-05-24T00:26:33 60 20-20:24:00
6332530 tcn[37,41,44,4+ FAILED 2019-05-23T17:11:31 2019-05-25T09:13:34 240 400-08:12:00
6332530.bat+ tcn37 FAILED 2019-05-23T17:11:31 2019-05-25T09:13:34 24 40-00:49:12
6332530.0 tcn[37,41,44,4+ CANCELLED+ 2019-05-23T17:11:35 2019-05-25T09:13:34 240 400-07:56:00
The available fields are described in the man page. They can be requested as -oOPTION (in lower case) or in proper POSIX notation --format='Option,AnotherOption,...' (the full list is in the man page).
So far so good. But there is a big caveat here:
What you see here is perfect for getting an idea of what you have run, or what to expect in terms of CPU hours. But it will not necessarily reflect your real budget status, because in many cases each node / partition has an extra parameter, a weight, which is set for accounting purposes and is not part of Slurm itself. For instance, the GPU nodes may have a weight of 3, which means that each GPU hour is charged as 3 SBU instead of 1 for budgetary purposes. In short, you can use sacct to gain insight into the CPU times, but this will not necessarily tell you how many SBU credits you still have.
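If you want a single number, here is a hedged sketch (my own addition, not part of the answer above) that sums core-hours from sacct's machine-readable CPUTimeRAW field (core-seconds); per the caveat above, this still ignores any site-specific weighting. The -n, -P and -X flags suppress the header, produce '|'-separated output, and count only job allocations (not individual steps).

import subprocess

def used_core_hours(user, start="2019-01-01"):
    # Query sacct for job allocations since `start` and sum CPUTimeRAW (core-seconds).
    out = subprocess.run(
        ["sacct", "-u", user, "-S", start, "-n", "-P", "-X",
         "--format=JobID,CPUTimeRAW"],
        capture_output=True, text=True, check=True,
    ).stdout
    total_seconds = 0
    for line in out.splitlines():
        fields = line.split("|")
        if len(fields) >= 2 and fields[1].isdigit():
            total_seconds += int(fields[1])
    return total_seconds / 3600.0  # core-hours

print(f"{used_core_hours('username'):.1f} core-hours")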

Cluster analysis - multiparametric

I have the following problem I'm trying to solve.
I have hundreds of particles with their corresponding chemical composition (elements with their weight percentages).
As an example, here are some made-up simplified particles:
Particle 1 - S (32%), K (25%), C (43%)
Particle 2 - S (33%), K (12%), C (15%), O (40%)
Particle 3 - Ti (18%), S (72%)
Particle 4 - Ti (10%), S (79%), K (12%)
In reality there are hundreds of them, some quite different from one another, some quite similar. As you can see, some particles do not contain certain elements (i.e. those could be treated as 0%).
What I would like to achieve is a cluster analysis that groups similar particles together and gives me some averages of each cluster's element composition.
I was looking at how cluster analysis works, but the examples usually use only 2 parameters, whereas I have many elements per particle and want to take more than one element into account while clustering. I am not so much interested in an exact match on all the elements contained. In other words, if two particles were quite similar except that one contained one extra element in a very small quantity, that would be OK too; very low percentages are sometimes just background noise from the measurement.
Once I know which strategy to use, I would ideally do it in R. But just a hint on how to go about it, or a link, would be enough.
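A minimal sketch of one possible setup (my own illustration, not an answer from the thread; shown in Python, though the same idea maps directly onto R's dist/hclust): build a particle × element matrix with 0 for missing elements, cluster it hierarchically, and look at per-cluster mean compositions. The choice of two clusters is arbitrary.

import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

particles = [
    {"S": 32, "K": 25, "C": 43},            # Particle 1
    {"S": 33, "K": 12, "C": 15, "O": 40},   # Particle 2
    {"Ti": 18, "S": 72},                     # Particle 3
    {"Ti": 10, "S": 79, "K": 12},            # Particle 4
]
X = pd.DataFrame(particles).fillna(0.0)      # particle x element matrix, 0% for missing elements

Z = linkage(X.values, method="ward")         # hierarchical clustering
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters (arbitrary)

print(X.assign(cluster=labels).groupby("cluster").mean())  # average composition per cluster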

Generating CPU utilization levels

First I would like to let you know that I recently asked this question already; however, it was considered unclear, see Linux: CPU benchmark requiring longer time and different CPU utilization levels. This is a new attempt to formulate the question using a different approach.
What I need: In my research, I look at the CPU utilization of a computer and analyze the CPU utilization pattern within a period of time. For example, a CPU utilization pattern within time period 0 to 10 has the following form:
time, % CPU used
0 , 21.1
1 , 17
2 , 18
3 , 41
4 , 42
5 , 60
6 , 62
7 , 62
8 , 61
9 , 50
10 , 49
I am interested in finding a simple representation for a given CPU utilization pattern. For the evaluation part, I need to create some CPU utilization patterns on my laptop, which I will then record and analyse. These CPU utilization patterns should
cover a time period of more than 5 minutes, ideally about 20 minutes;
show "some kind of dynamic behaviour", i.e. the % CPU used should not be (almost) constant over time, but should vary over time.
My question: How can I create such a utilization pattern? Of course, I could just run an arbitrary program on my laptop and obtain some CPU pattern. However, this solution is not ideal, since a reader of my work has no means to repeat the experiment: they have no access to the program I used. It would therefore be much more beneficial to use something standard instead of an arbitrary program on my laptop (in my previous post I was thinking about open-source CPU benchmarks, for example). Can anyone recommend something?
Many thanks!
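For what it's worth, one reproducible way to generate such a pattern without any external benchmark is a small duty-cycle load generator; the sketch below is my own illustration (all parameters are arbitrary). Each second it busy-waits for a fraction of the time and sleeps for the rest, with the target fraction following a slow sine wave over a 20-minute run. It loads a single core; run one copy per core for whole-machine load.

import math
import time

DURATION_S = 20 * 60     # total run time: 20 minutes
PERIOD_S = 300           # one slow oscillation every 5 minutes
SLICE_S = 1.0            # duty-cycle granularity

start = time.time()
while (now := time.time()) - start < DURATION_S:
    t = now - start
    target = 0.5 + 0.4 * math.sin(2 * math.pi * t / PERIOD_S)  # ~10%..90% utilization
    busy_until = now + SLICE_S * target
    while time.time() < busy_until:      # busy phase
        pass
    time.sleep(SLICE_S * (1 - target))   # idle phase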
I suggest a moving average. Select a window size and use it to average over. You'll need to decide what type of patterns you want to identify since the wider the window, the more smoothing you get and the fewer "features" you'll see. And CPU activity is very bursty. For example, if you are trying to identify cache bottlenecks, you'll want a small window, probably in the 10ms to 100ms range. If instead you want to correlate to longer term features, such as energy or load, you'll want a larger window, perhaps 10sec to minutes.
It looks like you are using OS-provided CPU usage and not hardware registers. This means that the OS is already doing some smoothing, and it may also be estimating some performance values. Try to find documentation on this if you are integrating over a smaller window. A word of warning: this level of information can be hard to find, and you may have to do a lot of digging. Depending on your familiarity with kernel code, it may be easier to look at the code itself.
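To illustrate the moving-average idea on the sample trace from the question (a hedged sketch; the window size of 3 is an arbitrary choice, and a wider window smooths more and hides more short-lived features):

# Simple moving average over the % CPU trace at t = 0..10 from the question.
cpu = [21.1, 17, 18, 41, 42, 60, 62, 62, 61, 50, 49]
window = 3

smoothed = [
    sum(cpu[i:i + window]) / window
    for i in range(len(cpu) - window + 1)
]
print(smoothed)  # one value per window position; the first covers t = 0..2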

customer segmentation in retail [closed]

I have a large sales database of a 'home and construction' retailer.
I need to know who the electricians, plumbers, painters, etc. among the store's customers are.
My first approach was to select the articles related to a specialty (wire [article] is related to electrician [specialty], for example) and then, based on customers' purchases, work out who those customers are.
But this is a lot of work.
My second approach is to do a cluster segmentation first, and then discover which clusters belong to which specialty (this is a lot better, because I would also be able to discover new segments).
But how can I do that? What type of clustering should I use? K-means, fuzzy clustering? What variables should I feed into that model? Should I use PCA to figure out how many clusters to look for?
The header of my data (simplified):
customer_id | transaction_id | transaction_date | item_article_id | item_group_id | item_category_id | item_qty | sales_amt
Any help would be appreciated
(Sorry for my English.)
You want to identify classes of customers based on what they buy (I presume this is for marketing reasons). This calls for a clustering approach. I will talk you through the entire setup.
The clustering space
Let us first consider what exactly you are clustering: either orders or customers. In either case, the way you characterize the items and the distances between them is the same. I will discuss the basic case for orders first, and then explain the considerations that apply to clustering by customers instead.
For your purpose, an order is characterized by what articles were purchased, and possibly also how many of them. In terms of a space, this means that you have a dimension for each type of article (item_article_id), for example the "wire" dimension. If all you care about is whether an article is bought or not, each item has a coordinate of either 0 or 1 in each dimension. If some order includes wire but not pipe, then it has a value of 1 on the "wire" dimension and 0 on the "pipe" dimension.
However, there is something to say for caring about the quantities. Perhaps plumbers buy lots of glue while electricians buy only small amounts. In that case, you can set the coordinate in each dimension to the quantity of the corresponding article (presumably item_qty). So suppose you have three articles, wire, pipe and glue, then an order described by the vector (2, 3, 0) includes 2 wire, 3 pipe and 0 glue, while an order described by the vector (0, 1, 4) includes 0 wire, 1 pipe and 4 glue.
If there is a large spread in the quantities of a given article, i.e. if some orders include orders of magnitude more of an article than other orders do, then it may be helpful to work with a log scale. Suppose you have these four orders:
2 wire, 2 pipe, 1 glue
3 wire, 2 pipe, 0 glue
0 wire, 100 pipe, 1 glue
0 wire, 300 pipe, 3 glue
The former two orders look like they may belong to electricians while the latter two look like they belong to plumbers. However, if you work with a linear scale, order 3 will turn out to be closer to orders 1 and 2 than to order 4. We fix that by using a log scale for the vectors that encode these orders (I use the base 10 logarithm here, but it does not matter which base you take because they differ only by a constant factor):
(0.30, 0.30, 0)
(0.48, 0.30, -2)
(-2, 2, 0)
(-2, 2.48, 0.48)
Now order 3 is closest to order 4, as we would expect. Note that I have used -2 as a special value to indicate the absence of an article, because the logarithm of 0 is not defined (log(x) tends to negative infinity as x tends to 0). -2 means that we pretend that the order included 1/100th of the article; you could make the special value more or less extreme, depending on how much weight you want to give to the fact that an article was not included.
The input to your clustering algorithm (regardless of which algorithm you take, see below) will be a position matrix with one row for each item (order or customer), one column for each dimension (article), and either the presence (0/1), amount, or logarithm of the amount in each cell, depending on which you choose based on the discussion above. If you cluster by customers, you can simply sum the amounts from all orders that belong to that customer before you calculate what goes into each cell of your position matrix (if you use the log scale, sum the amounts before taking the logarithm).
Clustering by orders rather than by customers gives you more detail, but also more noise. Customers may be consistent within an order but not between them; perhaps a customer sometimes behaves like a plumber and sometimes like an electrician. This is a pattern that you will only find if you cluster by orders. You will then find how often each customer belongs to each cluster; perhaps 70% of somebody's orders belong to the electrician type and 30% belong to the plumber type. On the other hand, a plumber may only buy pipe in one order and then only buy glue in the next order. Only if you cluster by customers and sum the amounts of their orders, you get a balanced view of what each customer needs on average.
From here on I will refer to your position matrix by the name my.matrix.
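As an illustration of how such a position matrix could be assembled from the transaction table in the question (a hedged sketch in Python/pandas; the column names come from the question, the file name is hypothetical, and the R calls below operate on the analogous matrix):

import numpy as np
import pandas as pd

sales = pd.read_csv("sales.csv")   # hypothetical file with the question's columns

# One row per customer, one column per article, quantities summed over all orders.
qty = (sales
       .groupby(["customer_id", "item_article_id"])["item_qty"]
       .sum()
       .unstack(fill_value=0))

# Log scale, with a floor value standing in for "article never bought"
# (the log of 0 is undefined), as discussed above.
FLOOR = -2
with np.errstate(divide="ignore"):
    logged = np.log10(qty.to_numpy(dtype=float))      # -inf where the quantity is 0
my_matrix = np.where(np.isfinite(logged), logged, FLOOR)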
The clustering algorithm
If you want to be able to discover new customer types, you probably want to let the data speak for themselves as much as possible. A good old-fashioned hierarchical clustering with complete linkage (CLINK) may be an appropriate choice in this case. In R, you simply do hclust(dist(my.matrix)) (this will use the Euclidean distance measure, which is probably good enough in your case). It will join closely neighbouring items or clusters together until all items are categorized in a hierarchical tree. You can treat any branch of the tree as a cluster, observe typical article amounts for that branch and decide whether that branch represents a customer segment by itself, should be split into sub-branches, or joined with a sibling branch instead. The advantage is that you find the "full story" of which items and clusters of items are most similar to each other and how much. The disadvantage is that the outcome of the algorithm does not tell you where to draw the borders between your customer segments; you can cut up the clustering tree in many ways, so it's up to your interpretation how you want to identify your customer types.
On the other hand, if you are comfortable fixing the number of clusters (k) beforehand, k-means is a very robust way to get just any segmentation of your customers in k distinct types. In R, you would do kmeans(my.matrix, k). For marketing purposes, it may be sufficient to have (say) 5 different profiles of customers that you make custom advertisement for, rather than treating all customers the same. With k-means you don't explore all of the diversity that is present in your data, but you might not need to do so anyway.
If you don't want to fix the number of clusters beforehand, but you also don't want to manually decide where to draw the borders between the segments afterwards, there is a third possibility. You start with the k-means algorithm, where you let it generate an amount of cluster centers that is much larger than the number of clusters that you hope to end up with (for example, if you hope to end up with somewhere about 10 clusters, let the k-means algorithm look for 200 clusters). Then, use the mean shift algorithm to further cluster the resulting centers. You will end up with a smaller number of compact clusters. The approach is explained in more detail by James Li over here. You can use the mean shift algorithm in R with the ms function from the LPCM package, see this documentation.
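A hedged sketch of that two-stage idea (my own illustration using scikit-learn rather than the R packages mentioned above; the 200 centres, the random stand-in matrix and the other numbers are arbitrary placeholders):

import numpy as np
from sklearn.cluster import KMeans, MeanShift

# Stand-in for the position matrix discussed above: 1000 customers x 50 articles.
rng = np.random.default_rng(0)
my_matrix = rng.random((1000, 50))

# Stage 1: deliberately over-cluster with k-means.
coarse = KMeans(n_clusters=200, n_init=10, random_state=0).fit(my_matrix)

# Stage 2: merge the 200 centres with mean shift (bandwidth estimated automatically).
merged = MeanShift().fit(coarse.cluster_centers_)

final_labels = merged.labels_[coarse.labels_]   # each customer's cluster after merging
print(len(merged.cluster_centers_), "final clusters")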
About using PCA
PCA will not tell you how many clusters you need. PCA answers a different question: which variables seem to represent a common underlying (hidden) factor. In a sense, it is a way to cluster variables, i.e. properties of entities, not to cluster the entities themselves. The number of principal components (common underlying factors) is not indicative of the number of clusters needed. PCA can still be interesting if you want to learn something about the predictive value of each article about a customer's interests.
Sources
Michael J. Crawley, 2005. Statistics. An Introduction using R.
Gerry P. Quinn and Michael J. Keough, 2002. Experimental Design and Data Analysis for Biologists.
Wikipedia: hierarchical clustering, k-means, mean shift, PCA
