Mondrian 4: Creating Measure Groups corresponding to Virtual Cubes - olap

I am familiar with Mondrian 3, where I use Virtual Cubes when I need to combine measures from two or more regular cubes.
I have two cubes, Sales and Warehouse:
They share three dimensions: Product, Time and Store.
Each cube also has one dimension of its own (Customer and Warehouse Info, respectively).
There is one calculated member, Profit Per Unit Shipped, which combines measures from both cubes.
Note that ignoreUnrelatedDimensions="true" is set for the Sales cube, so measures from the Sales cube will have non-joining dimension members pushed up to the top-level member.
Mondrian 3 schema:
<VirtualCube name="Warehouse and Sales">
    <CubeUsages>
        <CubeUsage cubeName="Sales" ignoreUnrelatedDimensions="true"/>
        <CubeUsage cubeName="Warehouse"/>
    </CubeUsages>
    <VirtualCubeDimension cubeName="Sales" name="Customer"/>
    <VirtualCubeDimension cubeName="Warehouse" name="Warehouse Info"/>
    <!-- shared dimensions -->
    <VirtualCubeDimension name="Product"/>
    <VirtualCubeDimension name="Store"/>
    <VirtualCubeDimension name="Time"/>
    <VirtualCubeMeasure cubeName="Sales" name="[Measures].[Sales]"/>
    <VirtualCubeMeasure cubeName="Sales" name="[Measures].[Unit Sales]"/>
    <VirtualCubeMeasure cubeName="Sales" name="[Measures].[Profit]"/>
    <VirtualCubeMeasure cubeName="Warehouse" name="[Measures].[Units Ordered]"/>
    <VirtualCubeMeasure cubeName="Warehouse" name="[Measures].[Units Shipped]"/>
    <CalculatedMember name="Profit Per Unit Shipped" dimension="Measures">
        <Formula>[Measures].[Profit] / [Measures].[Units Shipped]</Formula>
    </CalculatedMember>
</VirtualCube>
I know I need to use Measure Groups instead of Virtual Cubes in a Mondrian 4 schema definition. How can I create a Mondrian 4 schema with Measure Groups that corresponds to the Mondrian 3 schema above?
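As far as I understand the Mondrian 4 schema, the Virtual Cube becomes an ordinary cube with one measure group per fact table, and each measure group declares through DimensionLinks which dimensions join to it (ForeignKeyLink) and which do not apply (NoLink, which takes over the role of ignoreUnrelatedDimensions). A rough sketch; the element names follow the Mondrian 4 schema documentation as I understand it, while the fact table and foreign-key column names are placeholders, not taken from the question:

<Cube name="Warehouse and Sales">
    <Dimensions>
        <!-- Product, Store, Time, Customer and Warehouse Info are declared here
             (or pulled in from shared definitions) -->
    </Dimensions>
    <MeasureGroups>
        <MeasureGroup name="Sales" table="sales_fact">
            <Measures>
                <Measure name="Sales" column="store_sales" aggregator="sum"/>
                <Measure name="Unit Sales" column="unit_sales" aggregator="sum"/>
                <Measure name="Profit" column="profit" aggregator="sum"/>
            </Measures>
            <DimensionLinks>
                <ForeignKeyLink dimension="Product" foreignKeyColumn="product_id"/>
                <ForeignKeyLink dimension="Store" foreignKeyColumn="store_id"/>
                <ForeignKeyLink dimension="Time" foreignKeyColumn="time_id"/>
                <ForeignKeyLink dimension="Customer" foreignKeyColumn="customer_id"/>
                <!-- Sales rows have no warehouse -->
                <NoLink dimension="Warehouse Info"/>
            </DimensionLinks>
        </MeasureGroup>
        <MeasureGroup name="Warehouse" table="inventory_fact">
            <Measures>
                <Measure name="Units Ordered" column="units_ordered" aggregator="sum"/>
                <Measure name="Units Shipped" column="units_shipped" aggregator="sum"/>
            </Measures>
            <DimensionLinks>
                <ForeignKeyLink dimension="Product" foreignKeyColumn="product_id"/>
                <ForeignKeyLink dimension="Store" foreignKeyColumn="store_id"/>
                <ForeignKeyLink dimension="Time" foreignKeyColumn="time_id"/>
                <ForeignKeyLink dimension="Warehouse Info" foreignKeyColumn="warehouse_id"/>
                <!-- Warehouse rows have no customer -->
                <NoLink dimension="Customer"/>
            </DimensionLinks>
        </MeasureGroup>
    </MeasureGroups>
    <CalculatedMembers>
        <CalculatedMember name="Profit Per Unit Shipped" dimension="Measures">
            <Formula>[Measures].[Profit] / [Measures].[Units Shipped]</Formula>
        </CalculatedMember>
    </CalculatedMembers>
</Cube>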

Related

Is there a kd-tree with a find that returns the nearest point in two dimensions, subject to a relative range limit on a third dimension?

Given a kd-tree defined over 3 dimensions, how do I find the Pythagorean-nearest point in just 2 of those dimensions, subject to the constraint of being within a certain range in dimension 3?
I actually wrote it and it works; I'm asking here to see whether other implementations exist that might be better.
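To pin down the query semantics, here is a brute-force sketch in base R (not a kd-tree, and the function and argument names are invented for illustration): filter on dimension 3 first, then take the 2-D Euclidean nearest among the survivors. A kd-tree implementation would have to prune its way to the same answer.

# points: an n x 3 matrix; query: a length-3 vector; z_tol: allowed range in dimension 3
nearest_2d_within_z <- function(points, query, z_tol) {
  ok <- abs(points[, 3] - query[3]) <= z_tol                   # range limit on dimension 3
  if (!any(ok)) return(NULL)                                   # nothing satisfies the constraint
  cand <- points[ok, , drop = FALSE]
  d2 <- (cand[, 1] - query[1])^2 + (cand[, 2] - query[2])^2    # squared 2-D distance
  cand[which.min(d2), ]
}

set.seed(42)
pts <- matrix(runif(300), ncol = 3)
nearest_2d_within_z(pts, query = c(0.5, 0.5, 0.5), z_tol = 0.1)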

Knapsack with non-linear constraints & step function including item dependencies

I am trying to solve an optimization problem that looks similar to a knapsack problem. The setting is the following:
I have a pool of ~80,000 players from which I want to build the cheapest squad of exactly 11 players. Each player has multiple attributes: the main position he plays in, nation, club, league and rating.
The players not only need to be selected but also assigned to a position in the formation.
This leads to the following problem:
The first constraint is a minimum rating of the squad, which can simply be formulated as a linear constraint. The second and third constraints make sure that exactly one player is selected for each position and that each player can be selected at most once.
There are several other linear constraints that can occur, such as a minimum number of players from one nation or at most three players from a specific club, etc.
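For reference, the purely linear part (one player per slot, each player at most once, a squad rating floor, minimum cost) can be written as a small 0/1 program. Below is a toy sketch in R using the lpSolve package; the player prices, ratings, number of slots and rating threshold are all invented for illustration, and the real model is of course far larger and, in my case, written in MiniZinc.

library(lpSolve)

n_players <- 6; n_slots <- 2                       # toy sizes; the real problem is ~80,000 x 11
cost   <- c(100, 80, 120, 90, 70, 110)             # price per player (made up)
rating <- c(84, 82, 88, 79, 75, 86)                # rating per player (made up)
min_avg_rating <- 80

nv  <- n_players * n_slots                         # x[p, s] = 1 if player p fills slot s
idx <- function(p, s) (s - 1) * n_players + p      # column-wise flattening of x

# exactly one player per slot
slot_con   <- t(sapply(1:n_slots,   function(s) { v <- numeric(nv); v[idx(1:n_players, s)] <- 1; v }))
# each player used at most once
player_con <- t(sapply(1:n_players, function(p) { v <- numeric(nv); v[idx(p, 1:n_slots)] <- 1; v }))
# rating floor: total rating >= number of slots * minimum average rating
rating_con <- rep(rating, times = n_slots)

res <- lp(direction    = "min",
          objective.in = rep(cost, times = n_slots),
          const.mat    = rbind(slot_con, player_con, rating_con),
          const.dir    = c(rep("=", n_slots), rep("<=", n_players), ">="),
          const.rhs    = c(rep(1, n_slots), rep(1, n_players), n_slots * min_avg_rating),
          all.bin      = TRUE)

matrix(res$solution, n_players, n_slots)           # 0/1 assignment of players to slots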
The chemistry of a squad is a non-linear constraint with a step function.
A player's individual chemistry is the product of his position bonus and his link bonus.
The position bonus is determined by the player's main position and where in the formation he is placed. A central defender placed in that position gets 3 points; used as a striker, he gets 0 points. The bonuses can be seen in the next table.
This part of the constraint can still be formulated linearly. The link bonus is the non-linear component. In the formation graph, each edge between two adjacent players has a weight between 0 and 3: weight 1 if they are from the same nation, league or club, weight 2 if they share two of those attributes, and weight 3 if they share all three. The link bonus for a specific position is the average weight of all its edges, multiplied by a factor of 3.
This bonus is plugged into a step function, which can be seen in the next figure (mapping values between [0-1] to 0.9, etc.). The link bonus is multiplied by the position bonus and capped at 10. The team chemistry is defined as the sum of the individual player chemistries.
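To make the link-bonus part concrete, here is a tiny R sketch of the edge-weight calculation only; the player representation is invented, and the step function and the cap at 10 are left out because they come from the figure, which is not reproduced here.

# Edge weight = number of attributes (nation, league, club) two adjacent players share;
# link bonus of a position = average edge weight at its node, times 3.
link_bonus <- function(player, neighbours) {
  edge_weight <- sapply(neighbours, function(nb)
    (player$nation == nb$nation) + (player$league == nb$league) + (player$club == nb$club))
  mean(edge_weight) * 3
}

p  <- list(nation = "FR", league = "L1", club = "PSG")
n1 <- list(nation = "FR", league = "PL", club = "MCI")   # shares nation only      -> weight 1
n2 <- list(nation = "BR", league = "L1", club = "PSG")   # shares league and club  -> weight 2
link_bonus(p, list(n1, n2))                               # (1 + 2) / 2 * 3 = 4.5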
I implemented it as described in MiniZinc and solved it with the osicbc solver, but even for a player pool of ~100 players this is not really feasible to compute, depending on the additional constraints.
Now I am looking for an implementation that can approximate the solution. I was thinking about simulated annealing or a genetic algorithm. However, due to the chemistry constraint these approaches produce a lot of invalid solutions and end up wandering around in the dark.
Does anyone have an approach that might be applicable to my problem?

How to divide n destinations into two groups and minimize the sum of the TSP distances of the two groups?

I have run into a very practical problem in the field of robotics. As I have an EE background and am not familiar with algorithms, I am seeking help here.
There are n destinations, and the destinations are to be divided into two groups (group A and group B). There are also two robots, robot A and robot B. Each destination in group A must be visited by robot A at least once, and each destination in group B must be visited by robot B at least once. All the information is given: weights, directions, etc.
Questions:
How do I compute the division such that the total distance traveled by the two robots is minimized?
How do I compute the division such that the time at which both robots finish visiting all their destinations is minimized?
My recommendation is to look at auction-based techniques for task allocation among robots. The idea is that agents bid in a market to add destinations to their plans. Agents that can add a new destination to their plan at lower cost (shorter distance traveled) are awarded the task. Faster-to-compute bids (using a minimum-spanning-tree heuristic, for example, instead of solving the harder TSP exactly) tend to yield higher-cost allocations, but they find reasonably good solutions for large numbers of robots and destinations.
A good starting point is the reference:
Dias, M. Bernardine, et al. "Market-based multirobot coordination: A survey and analysis." Proceedings of the IEEE 94.7 (2006): 1257-1270.
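To give a flavour of the idea, here is a deliberately crude R sketch of a sequential auction: destinations are put up one at a time and awarded to the robot whose bid is lowest, where the bid is simply the extra distance of appending the destination to the end of that robot's current route. A real system would bid with a cheapest-insertion or MST-based estimate as described in the survey; the coordinates and depots below are made up.

# Toy sequential single-item auction for two robots on Euclidean coordinates.
auction_allocate <- function(dest, depot_a, depot_b) {
  routes <- list(A = matrix(depot_a, ncol = 2), B = matrix(depot_b, ncol = 2))
  for (j in seq_len(nrow(dest))) {
    p <- dest[j, ]
    # each robot bids the extra distance of appending p to the end of its route
    bids <- sapply(routes, function(r) sqrt(sum((r[nrow(r), ] - p)^2)))
    winner <- names(which.min(bids))           # the lowest bid wins the destination
    routes[[winner]] <- rbind(routes[[winner]], p)
  }
  routes                                       # each robot's visiting order, depot first
}

set.seed(1)
dest <- matrix(runif(20), ncol = 2)            # 10 random destinations in the unit square
auction_allocate(dest, depot_a = c(0, 0), depot_b = c(1, 1))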

Error when Importing XML Document into R

I am trying to import an XML document and convert it to a dataframe in R. Usually the following code works fine:
library(XML)

xmlfile <- xmlTreeParse(file.choose())    # parse the chosen XML file
topxml  <- xmlRoot(xmlfile)               # root node (<records>)
topxml2 <- xmlSApply(topxml, function(x) xmlSApply(x, xmlValue))
psycinfo <- data.frame(t(topxml2), row.names = NULL, stringsAsFactors = FALSE)
However, when I try this I get a data frame with one row and 22,570 columns. 22,570 is the number of rows I would ideally want, so that each record gets its own row with multiple columns.
I've attached a snippet of what my XML data looks like for the first two records, which should be on separate rows.
<records>
<rec resultID="1">
<header shortDbName="psyh" longDbName="PsycINFO" uiTerm="2016-10230-001">
<controlInfo>
<bkinfo>
<btl>Reducing conservativeness of stabilization conditions for switched ts fuzzy systems.</btl>
<aug />
</bkinfo>
<chapinfo />
<revinfo />
<jinfo>
<jtl>Neurocomputing: An International Journal</jtl>
<issn type="Print">09252312</issn>
</jinfo>
<pubinfo>
<dt year="2016" month="02" day="16">20160216</dt>
</pubinfo>
<artinfo>
<ui type="doi">10.1016/j.neucom.2016.01.067</ui>
<tig>
<atl>Reducing conservativeness of stabilization conditions for switched ts fuzzy systems.</atl>
</tig>
<aug>
<au>Jaballi, Ahmed</au>
<au>Hajjaji, Ahmed El</au>
<au>Sakly, Anis</au>
</aug>
<sug>
<subj type="unclass">No terms assigned</subj>
</sug>
<ab>In this paper, less conservative sufficient conditions for the existence of switching laws for stabilizing switched TS fuzzy systems via a fuzzy Lyapunov function (FLF) and estimates the basin of attraction are proposed. The conditions are found by exploring properties of the membership functions and are formulated in terms of linear matrix inequalities (LMIs), which can be solved very efficiently using the convex optimization techniques. Finally, the effectiveness and the reduced conservatism of the proposed results are shown through two numerical examples. (PsycINFO Database Record (c) 2016 APA, all rights reserved)</ab>
<pubtype>Journal</pubtype>
<pubtype>Peer Reviewed Journal</pubtype>
</artinfo>
<language>English</language>
</controlInfo>
<displayInfo>
<pLink>
<url>http://search.ebscohost.com/login.aspx?direct=true&db=psyh&AN=2016-10230-001&site=ehost-live&scope=site</url>
</pLink>
</displayInfo>
</header>
</rec>
<rec resultID="2">
<header shortDbName="psyh" longDbName="PsycINFO" uiTerm="2016-08643-001">
<controlInfo>
<bkinfo>
<btl>Self–other relations in biodiversity conservation in the community: Representational processes and adjustment to new actions.</btl>
<aug />
</bkinfo>
<chapinfo />
<revinfo />
<jinfo>
<jtl>Journal of Community & Applied Social Psychology</jtl>
<issn type="Print">10529284</issn>
<issn type="Electronic">10991298</issn>
</jinfo>
<pubinfo>
<dt year="2016" month="02" day="15">20160215</dt>
</pubinfo>
<artinfo>
<ui type="doi">10.1002/casp.2267</ui>
<tig>
<atl>Self–other relations in biodiversity conservation in the community: Representational processes and adjustment to new actions.</atl>
</tig>
<aug>
<au>Mouro, Carla</au>
<au>Castro, Paula</au>
</aug>
<sug>
<subj type="unclass">No terms assigned</subj>
</sug>
<ab>This research explores the simultaneous role of two Self–Other relations in the elaboration of representations at the micro- and ontogenetic levels, assuming that it can result in acceptance and/or resistance to new laws. Drawing on the Theory of Social Representations, it concretely looks at how individuals elaborate new representations relevant for biodiversity conservation in the context of their relations with their local community (an interactional Other) and with the legal/reified sphere (an institutional Other). This is explored in two studies in Portuguese Natura 2000 sites where a conservation project calls residents to protect an at-risk species. Study 1 shows that (i) agreement with the institutional Other (the laws) and meta-representations of the interactional Other (the community) as approving of conservation independently help explain (at the ontogenetic level) internalisation of conservation goals and willingness to act; (ii) the same meta-representations operating at the micro-genetic level attenuate the negative relation between ambivalence and willingness to act. Study 2 shows that a meta-representation of the interactional Other as showing no clear position regarding conservation increases ambivalence. Findings demonstrate the necessarily social nature of representational processes and the importance of considering them at more than one level for understanding responses to new policy/legal proposals. Copyright © 2016 John Wiley & Sons, Ltd. (PsycINFO Database Record (c) 2016 APA, all rights reserved)</ab>
<pubtype>Journal</pubtype>
<pubtype>Peer Reviewed Journal</pubtype>
</artinfo>
<language>English</language>
</controlInfo>
<displayInfo>
<pLink>
<url>http://search.ebscohost.com/login.aspx?direct=true&db=psyh&AN=2016-08643-001&site=ehost-live&scope=site</url>
</pLink>
</displayInfo>
</header>
</rec>
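One way to get one row per <rec> would be to loop over the record nodes with XPath instead of applying xmlSApply to the whole root. A sketch along those lines (still using the XML package; it assumes each of the chosen fields occurs exactly once per record, and the particular fields pulled out are just an example):

library(XML)

doc  <- xmlParse(file.choose())
recs <- getNodeSet(doc, "//rec")                       # one node per record

psycinfo <- do.call(rbind, lapply(recs, function(r) {
  data.frame(
    resultID = xmlGetAttr(r, "resultID"),
    title    = xpathSApply(r, ".//atl", xmlValue),
    journal  = xpathSApply(r, ".//jtl", xmlValue),
    doi      = xpathSApply(r, ".//ui[@type='doi']", xmlValue),
    year     = xpathSApply(r, ".//dt", xmlGetAttr, "year"),
    abstract = xpathSApply(r, ".//ab", xmlValue),
    stringsAsFactors = FALSE
  )
}))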

customer segmentation in retail [closed]

I have a large sales database from a 'home and construction' retailer.
And I need to know who the electricians, plumbers, painters, etc. among the customers are.
My first approach was to select the articles related to a specialty (wires [article] are related to an electrician [specialty], for example) and then, based on their purchases, determine which customers belong to that specialty.
But this is a lot of work.
My second approach is to do a cluster segmentation first, and then discover which cluster belongs to which specialty (this is much better, because I would also be able to discover new segments).
But how can I do that? What type of clustering should I use: k-means, fuzzy? What variables should I feed into the model? Should I use PCA to figure out how many clusters to look for?
The header of my data (simplified):
customer_id | transaction_id | transaction_date | item_article_id | item_group_id | item_category_id | item_qty | sales_amt
Any help would be appreciated.
(Sorry for my English.)
You want to identify classes of customers based on what they buy (I presume this is for marketing reasons). This calls for a clustering approach. I will talk you through the entire setup.
The clustering space
Let us first consider what exactly you are clustering: either orders or customers. In either case, the way you characterize the items and the distances between them is the same. I will discuss the basic case for orders first, and then explain the considerations that apply to clustering by customers instead.
For your purpose, an order is characterized by what articles were purchased, and possibly also how many of them. In terms of a space, this means that you have a dimension for each type of article (item_article_id), for example the "wire" dimension. If all you care about is whether an article is bought or not, each item has a coordinate of either 0 or 1 in each dimension. If some order includes wire but not pipe, then it has a value of 1 on the "wire" dimension and 0 on the "pipe" dimension.
However, there is something to say for caring about the quantities. Perhaps plumbers buy lots of glue while electricians buy only small amounts. In that case, you can set the coordinate in each dimension to the quantity of the corresponding article (presumably item_qty). So suppose you have three articles, wire, pipe and glue, then an order described by the vector (2, 3, 0) includes 2 wire, 3 pipe and 0 glue, while an order described by the vector (0, 1, 4) includes 0 wire, 1 pipe and 4 glue.
If there is a large spread in the quantities for a given article, i.e. if some orders include orders of magnitude more of some article than other orders do, then it may be helpful to work with a log scale. Suppose you have these four orders:
2 wire, 2 pipe, 1 glue
3 wire, 2 pipe, 0 glue
0 wire, 100 pipe, 1 glue
0 wire, 300 pipe, 3 glue
The former two orders look like they may belong to electricians while the latter two look like they belong to plumbers. However, if you work with a linear scale, order 3 will turn out to be closer to orders 1 and 2 than to order 4. We fix that by using a log scale for the vectors that encode these orders (I use the base 10 logarithm here, but it does not matter which base you take because they differ only by a constant factor):
(0.30, 0.30, 0)
(0.48, 0.30, -2)
(-2, 2, 0)
(-2, 2.48, 0.48)
Now order 3 is closest to order 4, as we would expect. Note that I have used -2 as a special value to indicate the absence of an article, because the logarithm of 0 is not defined (log(x) tends to negative infinity as x tends to 0). -2 means that we pretend that the order included 1/100th of the article; you could make the special value more or less extreme, depending on how much weight you want to give to the fact that an article was not included.
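In R, computing those distances directly confirms the effect; the value -2 marks absent articles, following the convention above:

orders <- rbind(c(2, 2, 1), c(3, 2, 0), c(0, 100, 1), c(0, 300, 3))   # wire, pipe, glue
log.orders <- ifelse(orders > 0, log10(orders), -2)                   # -2 stands in for "absent"
round(dist(log.orders), 2)   # order 3 is now much closer to order 4 than to orders 1 and 2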
The input to your clustering algorithm (regardless of which algorithm you take, see below) will be a position matrix with one row for each item (order or customer), one column for each dimension (article), and either the presence (0/1), amount, or logarithm of the amount in each cell, depending on which you choose based on the discussion above. If you cluster by customers, you can simply sum the amounts from all orders that belong to that customer before you calculate what goes into each cell of your position matrix (if you use the log scale, sum the amounts before taking the logarithm).
Clustering by orders rather than by customers gives you more detail, but also more noise. Customers may be consistent within an order but not between them; perhaps a customer sometimes behaves like a plumber and sometimes like an electrician. This is a pattern that you will only find if you cluster by orders. You will then find how often each customer belongs to each cluster; perhaps 70% of somebody's orders belong to the electrician type and 30% belong to the plumber type. On the other hand, a plumber may only buy pipe in one order and then only buy glue in the next order. Only if you cluster by customers and sum the amounts of their orders, you get a balanced view of what each customer needs on average.
From here on I will refer to your position matrix by the name my.matrix.
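For example, assuming your transactions sit in a data frame called sales with the columns shown in the question, a customer-by-article position matrix could be built like this (to cluster by orders instead, group by transaction_id rather than customer_id):

# total quantity of each article bought by each customer (0 where never bought)
my.matrix <- tapply(sales$item_qty,
                    list(sales$customer_id, sales$item_article_id),
                    sum, default = 0)

# the variants discussed above:
# my.matrix <- (my.matrix > 0) * 1                            # presence/absence only
# my.matrix <- ifelse(my.matrix > 0, log10(my.matrix), -2)    # log scale, -2 for absent articles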
The clustering algorithm
If you want to be able to discover new customer types, you probably want to let the data speak for themselves as much as possible. A good old-fashioned hierarchical clustering with complete linkage (CLINK) may be an appropriate choice in this case. In R, you simply do hclust(dist(my.matrix)) (this will use the Euclidean distance measure, which is probably good enough in your case). It will join closely neighbouring items or clusters together until all items are categorized in a hierarchical tree. You can treat any branch of the tree as a cluster, observe typical article amounts for that branch and decide whether that branch represents a customer segment by itself, should be split into sub-branches, or joined with a sibling branch instead. The advantage is that you find the "full story" of which items and clusters of items are most similar to each other and how much. The disadvantage is that the outcome of the algorithm does not tell you where to draw the borders between your customer segments; you can cut up the clustering tree in many ways, so it's up to your interpretation how you want to identify your customer types.
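A sketch of that workflow; the cut into 8 groups below is arbitrary and only meant as a starting point for inspection:

hc <- hclust(dist(my.matrix), method = "complete")      # complete linkage on Euclidean distances
plot(hc)                                                # dendrogram: decide visually where to cut
segment <- cutree(hc, k = 8)                            # cut the tree into 8 candidate segments
# typical article quantities per segment:
aggregate(as.data.frame(my.matrix), by = list(segment = segment), FUN = mean)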
On the other hand, if you are comfortable fixing the number of clusters (k) beforehand, k-means is a very robust way to get a segmentation of your customers into k distinct types. In R, you would do kmeans(my.matrix, k). For marketing purposes, it may be sufficient to have (say) 5 different profiles of customers that you make custom advertisements for, rather than treating all customers the same. With k-means you don't explore all of the diversity that is present in your data, but you might not need to do so anyway.
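For example, for five profiles:

set.seed(123)                                # k-means depends on its random starting centers
km <- kmeans(my.matrix, centers = 5, nstart = 25)
km$size                                      # number of customers in each profile
round(km$centers, 2)                         # typical article quantities per profile
head(km$cluster)                             # profile assigned to each customer (row of my.matrix)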
If you don't want to fix the number of clusters beforehand, but you also don't want to manually decide where to draw the borders between the segments afterwards, there is a third possibility. You start with the k-means algorithm and let it generate a number of cluster centers that is much larger than the number of clusters you hope to end up with (for example, if you hope to end up with somewhere around 10 clusters, let the k-means algorithm look for 200 clusters). Then use the mean shift algorithm to further cluster the resulting centers. You will end up with a smaller number of compact clusters. The approach is explained in more detail by James Li over here. You can use the mean shift algorithm in R with the ms function from the LPCM package, see this documentation.
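A rough sketch of that two-stage idea; per its documentation the ms function takes the data and a bandwidth h that you have to tune, and the values 200 and 0.1 below are arbitrary:

library(LPCM)
km  <- kmeans(my.matrix, centers = 200, nstart = 5)   # deliberately over-segment first
fit <- ms(km$centers, h = 0.1)                        # mean shift merges nearby centers
str(fit)                                              # inspect which compact clusters remain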
About using PCA
PCA will not tell you how many clusters you need. PCA answers a different question: which variables seem to represent a common underlying (hidden) factor. In a sense, it is a way to cluster variables, i.e. properties of entities, not to cluster the entities themselves. The number of principal components (common underlying factors) is not indicative of the number of clusters needed. PCA can still be interesting if you want to learn something about the predictive value of each article about a customer's interests.
Sources
Michael J. Crawley, 2005. Statistics. An Introduction using R.
Gerry P. Quinn and Michael J. Keough, 2002. Experimental Design and Data Analysis for Biologists.
Wikipedia: hierarchical clustering, k-means, mean shift, PCA
