Neo4j: Merge Duplicate Nodes - graph

I made some wrong moves in Neo4j, and now we have a graph with duplicate nodes. Among the duplicate pairs, the full property set belongs to the first of the pair, and the relationships all belong to the second in the pair. The index is the node_auto_index.
Nodes:
Id  Name  Age  From         Profession
1   Bob   23   Canada       Doctor
2   Amy   45   Switzerland  Lawyer
3   Sam   09   US
4   Bob
5   Amy
6   Sam
Relationships:
Id  Start  End  Type
1   4      6    Family
2   5      6    Family
3   4      5    Divorced
I am trying to avoid redoing the whole batch import. Is there a way to merge the nodes in Cypher based on the "Name" string property, while keeping all of the properties and the relationships?
Thank you!

Okay, I think I figured it out:
START first=node(*), second=node(*)
WHERE has(first.Name) and has(second.Name) and has(second.Age) and NOT(has(first.Age))
WITH first, second
WHERE first.Name= second.Name
SET first=second
The query is still processing, but is there a more efficient way of doing this?

You create a cross product between the two sets here, so that will be expensive. It is better to do an index lookup on Name. If you keep the node(*) scan, you will at least have to process it in pages:
START first=node(*), second=node(*)
WHERE has(first.Name) and has(second.Name) and has(second.Age) and NOT(has(first.Age))
WITH first, second
SKIP 20000 LIMIT 20000
WHERE first.Name= second.Name
SET first=second
With the index lookup you can group the nodes by name instead. You probably have to paginate this processing as well.
START n=node:node_auto_index("Name:*")
WITH n.Name AS name, collect(n) AS nodes
SKIP 20000 LIMIT 20000
WHERE length(nodes) = 2
WITH head(filter(x in nodes : not(has(x.Age)))) as first, head(filter(x in nodes : has(x.Age))) as second
SET first=second
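To illustrate why grouping by name avoids the cross product, here is a minimal Python sketch of the same idea outside the database. It is not Neo4j code: the `nodes` dictionary just mirrors the sample table from the question, and `merge_by_name` is a hypothetical helper name. Each node is visited once, bucketed by `Name`, and the properties of the complete node are copied onto the bare node that holds the relationships.

```python
from collections import defaultdict

# Sample nodes from the question: ids 1-3 carry the properties,
# ids 4-6 are the bare duplicates that carry the relationships.
nodes = {
    1: {"Name": "Bob", "Age": "23", "From": "Canada", "Profession": "Doctor"},
    2: {"Name": "Amy", "Age": "45", "From": "Switzerland", "Profession": "Lawyer"},
    3: {"Name": "Sam", "Age": "09", "From": "US"},
    4: {"Name": "Bob"},
    5: {"Name": "Amy"},
    6: {"Name": "Sam"},
}


def merge_by_name(nodes):
    """Return {bare_node_id: copied_properties} for every duplicate pair."""
    # One pass to bucket node ids by name (no pairwise comparison needed).
    by_name = defaultdict(list)
    for node_id, props in nodes.items():
        by_name[props["Name"]].append(node_id)

    merged = {}
    for ids in by_name.values():
        if len(ids) != 2:
            continue  # only handle clean pairs, like length(nodes) = 2
        full = [i for i in ids if "Age" in nodes[i]]
        bare = [i for i in ids if "Age" not in nodes[i]]
        if len(full) == 1 and len(bare) == 1:
            # Same effect as SET first=second: the bare node gets all properties.
            merged[bare[0]] = dict(nodes[full[0]])
    return merged


merged = merge_by_name(nodes)
```

The grouping step touches each node once, whereas comparing every node with every other node grows quadratically with the number of nodes.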

Related

R: optimal sorting/allocation/distribution of items

I'm hoping someone may be able to help with a problem I have - trying to solve using R.
Individuals can submit requests for items. The minimum number of requests per person is one. There is a recommended maximum of five, but people can submit more in exceptional circumstances. Each item can only be allocated to one individual.
Each item has a 'desirability'/quality score ranging from 10 (high quality) down to 0 (low quality). The idea is to allocate items, in line with requests, such that as many high quality items as possible are allocated. It is less important that individuals have an equitable spread of requests met.
Everyone has to have at least one request met. Next priority is to look at whether we can get anyone who is over the recommended limit within it by allocating requests to others. After that the priority is to look at where the item would rank in each individual's request list based on quality score, and allocate to the person where it would rank highest (eg, if it would be first in someone's list and third in another's, give it to the former).
Effectively I'd need a sorting algorithm of some kind that:
- Identifies where an item has been requested more than once.
- Checks all the requests of everyone making said request.
- If that request is the only one a person has made, gives it to them (if this scenario applies to more than one person, it should be flagged in some way).
- If all requesters have made more than one request, checks whether any have made more than five requests; if they have, it can be taken off them.
- If all are within the recommended limit, sees where the request would rank (based on quality score) and gives it to the person in whose list it would rank highest.
The process needs to check that the above step isn't happening to people so many times that it leaves them without any requests, so it effectively has to check one item at a time.
Does anyone have any ideas about how to approach this? I can think of all kinds of ways I could arrange the data to make it easy to identify and see where this needs to happen, but not how to automate the process itself. Thanks in advance for any help.
The data (at least the bits needed for this process) looks like the below:
Item ID Person ID Item Score
1 AAG 9
1 AAK 8
2 AAAX 8
2 AN 8
2 AAAK 8
3 Z 8
3 K 8
4 AAC 7
4 AR 5
5 W 10
5 V 9
6 AAAM 7
6 AAAL 7
7 AAAAN 5
7 AAAAO 5
8 AB 9
8 D 9
9 AAAAK 6
9 AAAAC 6
10 A 3
10 AY 3
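The decision rule described above can be sketched for a single contested item. This is a Python sketch rather than R, and it rests on assumptions: `requests` is a hypothetical structure mapping each person to their open requests and scores, `resolve` and `rank_in_list` are made-up helper names, and items 11 to 13 in the examples are invented so that the people from the sample data have more than one request.

```python
RECOMMENDED_MAX = 5  # recommended maximum number of requests per person


def rank_in_list(person_requests, item):
    """1-based rank of item within one person's requests, best score first."""
    score = person_requests[item]
    return 1 + sum(1 for s in person_requests.values() if s > score)


def resolve(item, requests):
    """Decide who gets one contested item.

    requests maps person -> {item: score} for requests that are still open.
    Returns (winner, flagged); winner is None when the case needs review.
    """
    requesters = [p for p, items in requests.items() if item in items]
    if not requesters:
        return None, False
    # Rule 1: someone whose only request this is gets it;
    # several such people means the item is flagged.
    sole = [p for p in requesters if len(requests[p]) == 1]
    if len(sole) == 1:
        return sole[0], False
    if len(sole) > 1:
        return None, True
    # Rule 2: prefer people within the recommended limit, if there are any.
    within = [p for p in requesters if len(requests[p]) <= RECOMMENDED_MAX]
    candidates = within or requesters
    # Rule 3: give it to the person in whose list it ranks highest.
    ranks = {p: rank_in_list(requests[p], item) for p in candidates}
    best = min(ranks.values())
    winners = [p for p in candidates if ranks[p] == best]
    if len(winners) == 1:
        return winners[0], False
    return None, True  # tie on rank: flag for manual review


# Item 5 is first in W's list but second in V's (items 11 and 12 are invented).
by_rank = resolve(5, {"W": {5: 10, 11: 4}, "V": {5: 9, 12: 10}})
# Item 10 is A's only request (item 13 is invented).
by_sole = resolve(10, {"A": {10: 3}, "AY": {10: 3, 13: 5}})
```

A caller would remove the item from the losers' lists after each decision and then move on to the next contested item, which matches the requirement to check one item at a time.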

Identifying, reviewing, and deduplicating records in R

I'm looking to identify duplicate records in my data set based on multiple columns, review the records, and keep the ones with the most complete data in R. I would like to keep the row(s) associated with each name that have the maximum number of data points populated. In the case of date columns, I would also like to treat invalid dates as missing. My data looks like this:
df<-data.frame(Record=c(1,2,3,4,5),
First=c("Ed","Sue","Ed","Sue","Ed"),
Last=c("Bee","Cord","Bee","Cord","Bee"),
Address=c(123,NA,NA,456,789),
DOB=c("12/6/1995","0056/12/5",NA,"12/5/1956","10/4/1980"))
Record  First  Last  Address  DOB
1       Ed     Bee   123      12/6/1995
2       Sue    Cord  NA       0056/12/5
3       Ed     Bee   NA       NA
4       Sue    Cord  456      12/5/1956
5       Ed     Bee   789      10/4/1980
So in this case I would keep records 1, 4, and 5. There are approximately 85000 records and 130 variables, so if there is a way to do this systematically, I'd appreciate the help. Also, I'm a total R newbie (as if you couldn't tell), so any explanation is also appreciated. Thanks!
#Add a new column to the data frame containing the number of NA values in each row.
df$nMissing <- rowSums(is.na(df))
#Using ave, find the minimum nMissing for each name and keep the rows that match it
deduped_df <- df[df$nMissing == ave(df$nMissing, paste(df$First, df$Last), FUN = min), ]
#If you like, remove the nMissing column
df$nMissing <- deduped_df$nMissing <- NULL
deduped_df
Record First Last Address DOB
1 1 Ed Bee 123 12/6/1995
4 4 Sue Cord 456 12/5/1956
5 5 Ed Bee 789 10/4/1980
Edit: Per your comment, if you also want to treat invalid DOBs as missing, convert the column to Date format before counting the missing values. Any value that does not match the format becomes NA (missing data).
df$DOB<-as.Date(df$DOB,format="%m/%d/%Y")
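For comparison, the same logic can be written out step by step. The sketch below uses Python rather than R, so it illustrates the approach and is not a replacement for the answer's code. `records` copies the sample data from the question, and `valid_date` and `dedupe` are hypothetical helper names. Invalid dates are turned into missing values first, then each name keeps the rows with the fewest missing fields.

```python
from datetime import datetime

# Sample data from the question; None stands in for NA.
records = [
    {"Record": 1, "First": "Ed", "Last": "Bee", "Address": 123, "DOB": "12/6/1995"},
    {"Record": 2, "First": "Sue", "Last": "Cord", "Address": None, "DOB": "0056/12/5"},
    {"Record": 3, "First": "Ed", "Last": "Bee", "Address": None, "DOB": None},
    {"Record": 4, "First": "Sue", "Last": "Cord", "Address": 456, "DOB": "12/5/1956"},
    {"Record": 5, "First": "Ed", "Last": "Bee", "Address": 789, "DOB": "10/4/1980"},
]


def valid_date(text, fmt="%m/%d/%Y"):
    """Parse a date string; return None if it is missing or does not match fmt."""
    if text is None:
        return None
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None


def dedupe(records):
    """Keep, for each (First, Last), the rows with the fewest missing fields."""
    # Treat invalid dates as missing before counting.
    cleaned = [dict(r, DOB=valid_date(r["DOB"])) for r in records]
    missing = [sum(v is None for v in r.values()) for r in cleaned]
    # Smallest missing count seen for each name.
    best = {}
    for r, n in zip(cleaned, missing):
        key = (r["First"], r["Last"])
        best[key] = min(n, best.get(key, n))
    return [r for r, n in zip(cleaned, missing)
            if n == best[(r["First"], r["Last"])]]


kept = [r["Record"] for r in dedupe(records)]
```

Rows that tie on the minimum are all kept, which is why both records 1 and 5 survive for Ed Bee.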

How to retrieve movies' genres from wikidata using R

I would like to retrieve information from wikidata and store it in a dataframe. For the sake of simplicity I am going to assume that I want to get the genre of the following movies and then filter those that belong to sci-fi:
movies = c("Star Wars Episode IV: A New Hope", "Interstellar",
"Happythankyoumoreplease")
I know there is a package called WikidataR. If I am not mistaken, according to its vignettes there are two commands that may be useful: find_item and find_property let you retrieve a set of Wikidata items or properties whose aliases or descriptions match a particular search term. They looked like a good fit, so I thought of doing something like
for (i in movies) {
info = find_item(i)
}
This is what I get from each item:
> find_item("Interstellar")
Wikidata item search
Number of results: 10
Results:
1 Interstellar (Q13417189) - 2014 US science fiction film
2 Interstellar (Q6057099)
3 interstellar medium (Q41872) - matter and fields (radiation) that exist in the space between the star systems in a galaxy;includes gas in ionic, atomic or molecular form, dust and cosmic rays. It fills interstellar space and blends smoothly into the surrounding intergalactic space
4 space colonization (Q686876) - concept of permanent human habitation outside of Earth
5 rogue planet (Q167910) - planetary-mass object that orbits the galaxy directly
6 interstellar cloud (Q1054444) - accumulation of gas, plasma and dust in a galaxy
7 interstellar travel (Q834826) - term used for hypothetical manned or unmanned travel between stars
8 Interstellar Boundary Explorer (Q835898)
9 starship (Q2003852) - spacecraft designed for interstellar travel
10 interstellar object (Q2441216) - astronomical object in interstellar space, such as a comet
>
Unfortunately, the information that I get from find_item (see above) has two problems:
- it is not a dataframe with all the Wikidata information for the item I am searching for, but a list of what seems to be metadata (Wikidata's id, link...).
- it does not have the information I need (the Wikidata properties of each particular item).
Similarly, find_property provides metadata of a certain property. find_property("genre") retrieves the following information:
> find_property("genre")
Wikidata property search
Number of results: 4
Results:
1 genre (P136) - a creative work's genre or an artist's field of work (P101). Use main subject (P921) to relate creative works to their topic
2 radio format (P415) - describes the overall content broadcast on a radio station
3 sex or gender (P21) - sexual identity of subject: male (Q6581097), female (Q6581072), intersex (Q1097630), transgender female (Q1052281), transgender male (Q2449503). Animals: male animal (Q44148), female animal (Q43445). Groups of same gender use "subclass of" (P279)
4 gender of a scientific name of a genus (P2433) - determines the correct form of some names of species and subdivisions of species, also subdivisions of a genus
This has similar problems:
- it is not a dataframe
- it just stores metadata about the property
I can't find any way to link each property with each item in the movies vector.
Is there any way to end up with a dataframe containing the genres of those movies (or a dataframe with all of Wikidata's information, which I would then manipulate to filter or select the data I want)?
These are just lists. You can get a picture of the structure with str(find_item("Interstellar")), for example.
Then you can go through each element of the list and pick the fields that you need. For example, to get the title and the label:
a <- find_item("Interstellar")
b <- Reduce(rbind,lapply(a, function(x) cbind(x$title,x$label)))
data.frame(b)
## X1 X2
## 1 Q13417189 Interstellar
## 2 Q6057099 Interstellar
## 3 Q41872 interstellar medium
## 4 Q686876 space colonization
## 5 Q167910 rogue planet
## 6 Q1054444 interstellar cloud
## 7 Q834826 interstellar travel
## 8 Q835898 Interstellar Boundary Explorer
## 9 Q2003852 starship
## 10 Q2441216 interstellar object
This works easily for regular data. If some element is missing, you will have to handle it; for example, some items don't have a description. You can get around that with the following:
Reduce("rbind",lapply(a,
function(x) cbind(x$title,
x$label,
ifelse(length(x$description)==0,NA,x$description))))
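The pattern above (pick fields from each list element and substitute a missing value where a field is absent) is not specific to R. Here is a small Python sketch of it. The `results` list is hand-written to resemble three entries from the search output shown earlier and does not come from a real API call, and `results_to_rows` is a hypothetical helper name.

```python
# Hand-written stand-ins for three search results; two have no description.
results = [
    {"title": "Q13417189", "label": "Interstellar",
     "description": "2014 US science fiction film"},
    {"title": "Q6057099", "label": "Interstellar"},
    {"title": "Q835898", "label": "Interstellar Boundary Explorer"},
]


def results_to_rows(results):
    """One (title, label, description) tuple per result; None where a field is absent."""
    return [(r.get("title"), r.get("label"), r.get("description"))
            for r in results]


rows = results_to_rows(results)
```

Because every row now has the same three fields, the rows can be loaded into a table structure without special-casing the incomplete entries.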

JavaFX TableView every row not a single object

I am coming from Swing where in a JTable, I can just set the column and the row to a value. In a JavaFX TableView, I have to make each row represent an object. I am trying to represent a schedule for a race track. I have round and race number, and then whoever is in each lane.
Round | Race | Lane 1 | Lane 2
1     | 1    | Bob    | Joe
1     | 2    | Tom    | Sam
2     | 1    | Sam    | Joe
2     | 2    | Bob    | Tom
Each object in a lane (Bob, Tom, ...) is a Car object. It has various fields, but what is shown in the table should be whatever toString() returns, in this case the driver's name. I have an array of Round objects, each Round has an array of Races, and each Race has an array of Cars, one per lane. I need a way to represent this data structure in a TableView as shown above. Note that the number of lanes, races in a round, and total rounds can be changed by the user at runtime.
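One way to think about the structure, independent of the UI toolkit, is to flatten it so that each race becomes one row and the lane columns are generated from the data. The Python sketch below shows only that flattening step. It is not JavaFX code: the nested `schedule` list stands in for the Round/Race/Car arrays (with plain strings in place of Car objects), and `schedule_to_rows` is a hypothetical helper name.

```python
# schedule[round][race] is the list of cars by lane; strings stand in for Car.
schedule = [
    [["Bob", "Joe"], ["Tom", "Sam"]],  # round 1
    [["Sam", "Joe"], ["Bob", "Tom"]],  # round 2
]


def schedule_to_rows(schedule):
    """Flatten rounds -> races -> lanes into a header and one row per race."""
    # The lane columns depend on the data, so they are computed at runtime.
    lane_count = max((len(race) for rnd in schedule for race in rnd), default=0)
    header = ["Round", "Race"] + [f"Lane {i}" for i in range(1, lane_count + 1)]
    rows = []
    for round_no, races in enumerate(schedule, start=1):
        for race_no, cars in enumerate(races, start=1):
            names = [str(car) for car in cars]  # same role as toString()
            names += [""] * (lane_count - len(names))  # pad empty lanes
            rows.append([round_no, race_no, *names])
    return header, rows


header, rows = schedule_to_rows(schedule)
```

With this shape, a row object represents a race, and the set of columns can be rebuilt whenever the user changes the number of lanes.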

How to select product with multiple categories

As the title says, I have an Access database structured like this:
Category table:
categoryid  categoryname
1           one
2           two
3           three
Product table:
productid  productname  categories
1          one          1,2,3
2          two          3
3          three        1,2
When I have categoryid 1, I don't know how to select the products that belong to that category, because a product can have multiple categories. When I use the In operator, I get an error:
Select * from product where categories In (categoryid)
This fails because it cannot compare a collection with a single value.
I'm stuck here. Please help me! Thanks.
First of all, your tables are not normalized. Look at the Categories column in Product Table. Each cell should have only one value. By allowing multiple values, you risk various problems including update/insert anomalies and what you are seeing now. You also make it very difficult to do selects and other operations. Instead, think about normalizing your tables with this example:
Category
categoryid categoryname
1 one
2 two
3 three
Product
ProductId ProductName
4 prod1
5 prod2
6 prod3
Category_Prod
CategoryId ProductId
1 3
1 4
2 3
The third table acts as a way to resolve the many-to-many relationship. If you have any questions on how to do this or how to use it, let me know.
This is a classic many-to-many relationship. You need a [ProductCategory] table to associate a given Product with multiple Categories:
productid categoryid
1 1
1 2
1 3
2 3
3 1
3 2
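To show how the junction table answers the original question, here is a minimal sketch using Python's built-in sqlite3 module. It assumes SQLite rather than Access, so column types and syntax may differ slightly in Access. The table contents are copied from the question and from the ProductCategory rows above, and `products_in_category` is a hypothetical helper name.

```python
import sqlite3

# In-memory SQLite database standing in for the Access database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Category (categoryid INTEGER PRIMARY KEY, categoryname TEXT);
CREATE TABLE Product (productid INTEGER PRIMARY KEY, productname TEXT);
CREATE TABLE ProductCategory (
    productid INTEGER REFERENCES Product(productid),
    categoryid INTEGER REFERENCES Category(categoryid),
    PRIMARY KEY (productid, categoryid)
);
""")
conn.executemany("INSERT INTO Category VALUES (?, ?)",
                 [(1, "one"), (2, "two"), (3, "three")])
conn.executemany("INSERT INTO Product VALUES (?, ?)",
                 [(1, "one"), (2, "two"), (3, "three")])
# One row per (product, category) pair replaces the "1,2,3" strings.
conn.executemany("INSERT INTO ProductCategory VALUES (?, ?)",
                 [(1, 1), (1, 2), (1, 3), (2, 3), (3, 1), (3, 2)])


def products_in_category(conn, categoryid):
    """Return (productid, productname) for every product in one category."""
    return conn.execute(
        "SELECT p.productid, p.productname "
        "FROM Product AS p "
        "JOIN ProductCategory AS pc ON pc.productid = p.productid "
        "WHERE pc.categoryid = ? "
        "ORDER BY p.productid",
        (categoryid,),
    ).fetchall()


result = products_in_category(conn, 1)
```

With one value per cell, the lookup becomes an ordinary join with an equality test, so no comparison between a list and a single value is needed.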
