I'm building a simple system for profiling people. I'm currently using Neo4j to build simple relations between users. For example, I have a simple tuple:
mike met sara
But how could I integrate time? For example
mike met sara 2 days ago OR mike will meet sara in 3 days
The main reason is that the relation can happen multiple times at different times. My goal is to be able to answer questions such as:
has mike met sara in last week?
are mike and sara dating (dating = they meet at least 5 times a week)?
what is the longest period mike and sara did not meet?
does mike have personal problems? (We can introduce "mike met bill", where sara and bill both have the personality attribute "helping people". We can then presume that if mike didn't meet sara or bill in the last year but has had X meetings in the last week, something is wrong with him.)
What is the best way to get these answers? Is Neo4j the right way to go?
I think what you want to model is events in time. Those events (e.g. a meeting) are nodes that are connected to the participants, places, additional information, etc.
Then you can choose to link the events in an ordered list that represents their chronological order, i.e. a timeline.
For fast access to sub-parts of the timeline, you could create a time-tree (year->month->day[->hour]->event).
See this for a concrete example: http://docs.neo4j.org/chunked/milestone/cypher-cookbook-path-tree.html
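To make the event idea concrete, here is a minimal in-memory sketch in Python rather than Neo4j. It treats every meeting as its own dated event and answers two of the questions from the post. The sample dates and the helper names (`timeline`, `met_within`, `longest_gap`) are assumptions made up for illustration; in a graph database the same logic would be expressed as queries over event nodes.

```python
from datetime import date, timedelta

# Hypothetical meeting events: each event links two participants to a date,
# mirroring the "event node connected to participants" idea.
meetings = [
    ("mike", "sara", date(2024, 1, 1)),
    ("mike", "sara", date(2024, 1, 3)),
    ("mike", "bill", date(2024, 1, 10)),
    ("mike", "sara", date(2024, 1, 20)),
]


def timeline(events, a, b):
    """Return the chronologically ordered dates on which a and b met."""
    pair = {a, b}
    return sorted(day for x, y, day in events if {x, y} == pair)


def met_within(events, a, b, today, days):
    """True if a and b met in the `days` days up to and including today."""
    start = today - timedelta(days=days)
    return any(start <= day <= today for day in timeline(events, a, b))


def longest_gap(events, a, b):
    """Longest number of days between two consecutive meetings of a and b."""
    days = timeline(events, a, b)
    gaps = ((later - earlier).days for earlier, later in zip(days, days[1:]))
    return max(gaps, default=0)
```

Because each meeting is a separate record with its own date, the same pair can meet any number of times, which is the property the plain "mike met sara" relation lacks.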
Good evening everyone,
I am using OBIEE and I am trying to extract a file containing some candidates' information to keep in our records, as my organization will need to delete most data soon.
I have data related to recruiting that people put in their applications for job vacancies.
I am trying to have a single row per candidate per application (i.e. if a candidate applied to 2 different jobs, it will count as 2 rows), and insert the highest education, the related institution, their most recent job title, and the most recent employer name.
I have these facts:
ID,
degree_type,
institution,
job title,
employer.
and they all have the starting date and the graduation date.
When I extract the report, I get something like this:
ID   degree_type  institution  job_title    employer
001  Doctorate    Univ. A      eater        google
001  Master's     Univ. B      sleeper      samsung
001  Other        Univ. A      jumper       apple
002  Bachelor's   Univ. C      clapper      nutella
002  Master's     Univ. D      somethinger  fujitsu
002  Doctorate    Univ. A      somethinger  fujitsu
003  Other        Univ. E      eater        EU
003  Doctorate    Univ. Z      spy          UN
As you can see, each person might or might not have different levels of education, and when I extract this analysis, I have one ID with multiple rows, as many as every degree and every job experience, sorted by chronological order.
This creates some readability issues. Besides, we only want the highest degree and the most recent job.
So something like this.
ID   degree_type  institution  job_title    employer
001  Doctorate    Univ. A      eater        google
002  Doctorate    Univ. A      somethinger  fujitsu
003  Doctorate    Univ. Z      eater        EU
Instead, when I try to apply filters or steps, I can only manage to obtain a result based on either
A) the most recent degree and the most recent employer, or
B) each degree and each work experience that was carried out in the same time period of the degree.
Option A does not work for multiple reasons, e.g., if someone got a certification after a PhD, I will have a person with "other" whereas they should have "doctorate".
Option B is not useful at the moment, as we only want one row. Besides, if I worked after getting a degree, that work experience would not appear, as it only shows the work carried out during the studies.
I am new to OBIEE, and I am not familiar with SQL. I usually use R, and for completely different purposes.
If I could assign a value to each degree and then filter by the highest (eg., IF there is a doctorate, THEN show it and STOP. ELSE show master's. IF not master's and doctorate, THEN show bachelor's and STOP.) And then add the work experience by date, that would be great.
Is there a way to do this?
Thank you so much! And apologies if it does not make any sense.
PS: I already saw the reply to "How To Get Highest Education Using MySQL?",
but that person has multiple columns, one for each degree, whereas I have them all together in one column.
I am assuming that OBIEE is just a DB and you can use SQL to get the info.
I also assume that the ID column you provide represents unique ID per Employee.
Your task requires intermediate, if not advanced, SQL techniques to solve. Here are the steps.
Codify the sort order of the degree level: in 3NF (third normal form) you would add a reference table that stores one row per degree. It would have a degree_name varchar column (the primary key) holding the values you list in your post, plus a degree_sort integer column that orders the degrees the way you want. You would join to this table on the varchar value and return the degree_sort value.
Handling ties: another complexity is the possibility of an employee having multiple jobs at the doctorate (I presume that is the highest) education level. You would need a "start_date" or some other data point to break ties.
Here's a Stack Overflow post that explains an analogous scenario, getting the record that represents the latest revision of a document (revision is your degree level, document is your employee ID):
https://stackoverflow.com/a/38854846/1279373
Your partition clause would be:
PARTITION BY id ORDER BY degree_sort DESC, start_date DESC
Note: the WHERE clause (see the SQL in the linked answer) handles "return only the rows with rank 1"; use ASC (ascending) or DESC (descending) in the ORDER BY clause to rank "low to high" or "high to low".
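Since the poster is more at home outside SQL, the same "rank within each ID, keep rank 1" logic can be sketched in plain Python. This is only an illustration of the technique, not OBIEE syntax. The `DEGREE_SORT` ordering is an assumed ranking that plays the role of the degree_sort reference table, the start dates are invented (the post does not show them), and `top_row_per_id` is a hypothetical helper name.

```python
from datetime import date

# Assumed ranking, playing the role of the degree_sort reference table.
DEGREE_SORT = {"Other": 0, "Bachelor's": 1, "Master's": 2, "Doctorate": 3}

# Hypothetical rows: (id, degree_type, institution, start_date).
# The dates are invented for illustration only.
rows = [
    ("001", "Doctorate", "Univ. A", date(2010, 9, 1)),
    ("001", "Master's", "Univ. B", date(2008, 9, 1)),
    ("001", "Other", "Univ. A", date(2015, 1, 1)),
    ("002", "Bachelor's", "Univ. C", date(2005, 9, 1)),
    ("002", "Doctorate", "Univ. A", date(2012, 9, 1)),
]


def top_row_per_id(records):
    """Keep, for each id, the row ranked first by degree_sort, then start_date."""
    best = {}
    for rec in records:
        rec_id, degree, _, start = rec
        key = (DEGREE_SORT[degree], start)  # same order as the ORDER BY clause
        if rec_id not in best or key > best[rec_id][0]:
            best[rec_id] = (key, rec)
    return {rec_id: rec for rec_id, (_, rec) in best.items()}
```

Candidate 001 keeps the Doctorate row even though the "Other" certificate is more recent, which is the behaviour option A in the question could not deliver.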
Apologies if this is a duplication - I'm a total newbie to Drupal and there's a fair chance I've read the answer to my question and just not realized it.
I have a simple vocabulary with 2 tiers. Structure is:
Canton
-Town
So looks like:
Vaud
-Vevey
-Montreux
Valais
-Sion
-Brig
I am trying to build a form with 2 separate fields as drop-down lists, 1 for canton and 1 for town, populated from the vocabulary, where list 1 has only the cantons (1st tier of the vocabulary) and list 2 has only the towns (2nd tier) related to the selected canton.
I can't figure out how to do it. After 2 days of searching, I am coming here for help.
Any guidance would be much appreciated.
I have a dataset with a list of customers and their product preferences. Basically, it is a simple CSV with a column called "CUSTOMER" and 5 other columns called "PRODUCT_WANTED_A", "PRODUCT_WANTED_B" and so on.
I asked these customers if they were interested to know more about a particular product, and answers could be simply YES or NO (1 or 0 in the dataset). The dataset can be downloaded here. Obviously, there will be customers with many different interests, based on the mix of their YES or NO in these 5 columns.
My goal is to understand which customers are similar to others in such interests. This will help me manage an agenda of product presentations and, in each meeting, I would like to understand the best grouping for it. I started with a hierarchical plot like this:
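customer_list <- read.csv("customers_products_wanted.csv", sep = ",", header = TRUE)
# Cluster on the 0/1 product columns only; CUSTOMER is just a label
product_cols <- customer_list[, names(customer_list) != "CUSTOMER"]
customer.hclust <- hclust(dist(product_cols))
plot(customer.hclust, labels = customer_list$CUSTOMER)
# rect.hclust() is part of the stats package, so no library() call is needed
rect.hclust(customer.hclust, k = 5)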
This is the plot I got, asking for 5 clusters:
Tried the same, but with 10 clusters:
Question 1: I know it's always hard to tell, but looking at the charts and dataset, what would be your 'cut' to group customers? 5? 10?
I was reviewing the results, and in the same group, I had CUSTOMER112 with 1,0,1,0,1 as their preferences together with CUSTOMER 110 (1,1,1,1,1), CUSTOMER106 (1,1,1,1,0) and so on. The "distance" can be right, but in a given group I have customers with some relevant differences in their preferences.
Question 2: I don't know if it's a case of total ignorance about clustering, the code I used or even the dataset. Based on your experience, what would be your approach for the best clustering in this case?
Any comments will be highly appreciated. As you can see, I made some effort, but I am still in doubt.
Thanks a lot!
Ricardo
All the answers were helpful, but it was Ben's video recommendation and Samuel Tan's advice on breaking the customers into grids that led me to a good way to handle it.
The video gave me a lot of insights about "noisy" variables in hierarchical clustering, and the grid recommendation helped me think on what the data is really trying to tell me.
That said, a basic data cleaning process eliminated all customers with no interest in any product (this is obvious, but I didn't pay attention to it at first). Then I ignored customers with a single specific interest (one product), because these customers wouldn't need to attend the workshop series I'm planning (they just want to hear about one product).
Evaluating all the others, interested in more than one product, I realized the product mix could point me to a better classification. From there, I grouped customers into 3 clusters: integration opportunities (2 or 3 products), convergence opportunities (4 products) and transformation opportunities (all products).
Now it's clear to me which customers I should focus on for my workshops, and plan my post-workshop sales campaigns leveraging materials that target each customer group (integration, convergence, transformation).
Thanks for all the advice!
Ricardo
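The grouping rule described above can be written down as a short sketch. It assumes five 0/1 product columns per customer, as in the question; the function name `classify` is made up for illustration, and the three sample customers are the ones quoted in the question.

```python
def classify(preferences):
    """Map a customer's 0/1 product answers to a workshop group (or None)."""
    wanted = sum(preferences)
    if wanted <= 1:
        return None  # no interest, or a single product: left out of the workshops
    if wanted <= 3:
        return "integration"      # 2 or 3 products
    if wanted == 4:
        return "convergence"      # 4 products
    return "transformation"       # all 5 products


# Preferences quoted in the question.
customers = {
    "CUSTOMER112": [1, 0, 1, 0, 1],
    "CUSTOMER110": [1, 1, 1, 1, 1],
    "CUSTOMER106": [1, 1, 1, 1, 0],
}
groups = {name: classify(prefs) for name, prefs in customers.items()}
```

With this rule, the three customers that hierarchical clustering had placed in the same group end up in three different groups, because only the number of products wanted matters, not which ones.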
I have a table of Premiership Football teams, and another table of their squad. And I have recorded different stats per player per team. So I have two tables:
TEAM
and
PLAYER
#  TEAM
1  Man U
2  Liverpool
3  Tottenham

TEAM #  PLAYER    SCORED  ASSIST  ETC
2       Gerrard   4       5
3       Soldado   2       7
2       Sterling  2       3
The TEAM table has the individual players from each team in a subtable for each team.
From this I have created a report from the PLAYER table to show each player's name, team, stats, etc. What I need, though, is to number the records relative to the number of records in that team, in the report itself, almost like page numbers.
For instance, say Liverpool have 20 players: I would like it to show Gerrard 9/20, Sterling 18/20. It is one record per page. At the moment, all I can get is their record number among all the players in the database, like 9/500.
What is the best way of numbering records per subgroup (each team)? I thought it would be simpler, but it doesn't look like it.
It was really simple. I just had to put a
=Count(*)
in the group header and name it "groupcount". Then add another textbox and have
=[GroupCount]
in the record detail part of the report. Then it gave me the total number of records in each group displayed on each record.
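For readers who want to see the arithmetic outside the report designer, here is a small Python sketch of the same idea: a per-team total (what =Count(*) in the group header produces) combined with a running position inside the team. The player rows come from the sample tables in the question; the function name `number_within_team` is hypothetical, and this is not Access code.

```python
from collections import Counter

# Player rows taken from the question's sample tables: (team, player).
players = [
    ("Liverpool", "Gerrard"),
    ("Tottenham", "Soldado"),
    ("Liverpool", "Sterling"),
]


def number_within_team(rows):
    """Label each player with position/total inside their own team."""
    totals = Counter(team for team, _ in rows)  # like =Count(*) per group
    seen = Counter()
    labels = {}
    for team, player in rows:
        seen[team] += 1  # running position inside the group
        labels[player] = f"{seen[team]}/{totals[team]}"
    return labels
```

The position depends on the order in which the rows are read, so the rows would need to be sorted the same way as the report for the numbers to match.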
I'm developing a small affiliate structure to understand the concept of graph databases better, as well as to learn Neo4j and see what it can offer me. I've been working with RDBMSs for years now, and Cypher is pretty rough for me. I'm trying to build a very simple affiliate system:
Affiliate Joe has referred Mary, Bob and Mark. So I create all their nodes and create the "referred" relationship. Now Mary refers Julie, Jessica and Joan. Bob refers Billy and Baxter. Mark refers Michael and Marx. And their referrals keep referring people.
For each level of referrals below Joe's original referrals, Joe earns a "generation". His first generation is Mary, Bob and Mark. His second generation is Julie, Jessica, Joan, Billy, Baxter, Michael and Marx.
Now, with a Cypher query, how can I discover his generations and, of course, how many people are in each? Their place in the tree? How can I know who is in his 3rd or 4th generation?
My mind is twisting here, hope you guys can help.
Vinny,
look at http://tinyurl.com/7vryzwz. Is this what you are looking for? Basically:
START referrer=node(1)
MATCH path=referrer-[:referred*1..]->refferee
RETURN referrer,refferee, length(path) as generation
ORDER BY length(path) asc
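The query treats the length of the path from the referrer to each referee as the generation number. As a rough illustration of that idea outside Cypher, here is a breadth-first sketch in Python over the referral tree from the question. The `generations` helper is a made-up name, and the sketch assumes each person is referred by only one other person.

```python
from collections import deque

# Referral tree from the question: referrer -> people they referred.
referred = {
    "Joe": ["Mary", "Bob", "Mark"],
    "Mary": ["Julie", "Jessica", "Joan"],
    "Bob": ["Billy", "Baxter"],
    "Mark": ["Michael", "Marx"],
}


def generations(graph, root):
    """Breadth-first walk returning {person: generation} below root."""
    depth = {}
    queue = deque([(root, 0)])
    while queue:
        person, level = queue.popleft()
        for child in graph.get(person, []):
            if child not in depth and child != root:
                depth[child] = level + 1  # path length from root = generation
                queue.append((child, level + 1))
    return depth


by_person = generations(referred, "Joe")
second_generation = sorted(p for p, g in by_person.items() if g == 2)
```

Counting the entries per generation value gives the size of each generation, which corresponds to grouping the query result by length(path).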