Let's say that I have the following data frame with a user id and location as the two columns. A user id can have multiple locations. I'm interested in finding the count of each possible location sequence based on the user id.
So if my data looked like this:
places = data.frame(user_id=c(1,1,2,3,3,3,4,4,5,5,5,5),
location=c("home","school","work","home","school","work",
"lunch","airport","gym","breakfast","work","home"))
places
I want to find the following:
freq = data.frame(location_path=c("home - school", "work", "home - school - work",
"lunch - airport", "gym - breakfast - work - home"),
count=c(1,1,1,1,1))
freq
This second data frame tells me that the 'home' and 'school' sequence occurred once in that order. Likewise, the home, school, and work sequence occurred only once.
Of course there could be instances where the pairing occurs multiple times. In the following case the home, school, and work pairing would have a count of 2.
places = data.frame(user_id=c(1,1,2,3,3,3,4,4,5,5,5,5,6,6,6),
location=c("home","school","work","home","school","work",
"lunch","airport","gym","breakfast","work","home",
"home","school","work"))
places
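One way this could be approached in base R, as a minimal sketch: collapse each user's locations into a single path string, then tabulate the strings. It assumes the rows are already in visit order within each user; the column names location_path and count mirror the desired output above, and paths is just an illustrative intermediate name.

```r
places <- data.frame(
  user_id  = c(1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 5, 5, 6, 6, 6),
  location = c("home", "school", "work", "home", "school", "work",
               "lunch", "airport", "gym", "breakfast", "work", "home",
               "home", "school", "work"),
  stringsAsFactors = FALSE
)

# Collapse each user's locations (in row order) into one path string
paths <- tapply(places$location, places$user_id, paste, collapse = " - ")

# Count how many users share each distinct path
freq <- as.data.frame(table(location_path = as.vector(paths)),
                      responseName = "count",
                      stringsAsFactors = FALSE)
freq
```

If the rows are not guaranteed to be in visit order, the data would need sorting by a timestamp or sequence column first, which this example does not have.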
I am looking for advice on the best way to handle data going into Neo4j.
I have a set of structured data in CSV format that relates to journeys. The fields are:
"JourneyID" - unique ref# / primary key, e.g. 1234
"StartID" - ref#, this is a station, e.g. Station1
"EndID" - ref#, this is a station, e.g. Station1 (start and end can be the same)
"Time" - integer, e.g. 24
Assume I have 100 journeys/rows of data, showing journeys between 10 different stations.
I can see and work with this data in SQL or Excel. I want to work with this in Neo4j.
This is what I currently have:
StartID with JourneyID as a label
EndID with Journey ID as a label
This means that each row from the CSV for a station is its own node. I then created a relationship between Start and End using the JourneyID (primary key).
The effect was just 100 nodes connected to 100 nodes, e.g. a connection from Station1 to Station2, Station1 to Station3, and Station1 to Station4. It didn't show the relationship between a starting Station1 and ending Station1, 2 and 3, which is what I want to show.
How best do I model this data so that the graph sees 10 unique StartIDs connecting to the different EndIDs, showing the relationships between them?
Thanks in advance
(new to Graphs!)
This sample query, which uses MERGE to avoid creating duplicate nodes and relationships, should help you get started:
LOAD CSV WITH HEADERS FROM 'file:///input.csv' AS row
MERGE (start:Station {id: row.StartID})
MERGE (end:Station {id: row.EndID})
// LOAD CSV reads every field as a string, so Time is converted to an integer
MERGE (j:Journey {id: row.JourneyID})
ON CREATE SET j.time = toInteger(row.Time)
MERGE (j)-[:FROM]->(start)
MERGE (j)-[:TO]->(end)
I don't think you want a Journey to be a node. Instead, you want the Journey ID to be an attribute of the edge:
LOAD CSV WITH HEADERS FROM 'file:///input.csv' AS row
MERGE (start:Station {id: row.StartID})
MERGE (end:Station {id: row.EndID})
MERGE (start)-[:JOURNEY {id:row.JourneyID}]->(end)
That more intuitively describes the data, and you could even extend this to different relationship types, if you can describe Journeys in more detail.
Edit:
This is to answer your question, but I can't speak as to how this scales up. I think it depends on the types of queries you plan to make.
Thank you to everyone who commented already! I've edited my post with better code and hopefully some clarity on what I'm trying to do. (I appreciate all of the feedback - this is my first time asking a question here!)
I have a very similar question to this one here (Random Pairings that don't Repeat) but am trying to come up with a function or piece of code that I can run in R to create the pairings. Essentially, I have a pool of employees and want to come up with a way to randomly generate pairs of employees to meet every month, with no pairs repeating in future months (i.e. future runs of the function). I will need to maintain the history of previous pairings. The catch is that each employee is assigned to a working location, and I only want matches between employees from different locations.
I've gone through a number of previous queries on randomly sampling data sets in R and am comfortable with generating a random pair from my data, or pulling out an existing working group, but it's the "generating a pair that ALWAYS comes from a different group" that's tripping me up, especially since the different groups/locations have different numbers of employees so it's hard to sort the groups evenly.
Here's my dummy data, which currently has 10 "employees". The actual data set currently has over 100 employees with more being added to the pool each month:
ID <- (1:10)
Name <- c("Sansa", "Arya", "Hodor", "Jamie", "Cersei", "Tyrion", "Jon", "Sam", "Dany", "Drogo")
Email <- c("a#a.com","b#b.com", "c#c.com", "d#d.com", "e#e.com", "f#f.com", "g#g.com", "h#h.com", "i#i.com", "j#j.com")
Location <- c("Winterfell", "Winterfell", "Winterfell", "Kings Landing", "Kings Landing", "Kings Landing",
"The Wall", "The Wall", "Essos", "Essos")
df <- data.frame(ID, Name, Email, Location)
Basically, I want to write something that would say that Sansa could be randomly paired with anyone who is not Arya or Hodor, because they're in the same location (i.e., anyone whose location is NOT Winterfell), and that Jon could be paired with anyone but Sam. If Jon and Arya were paired up once, I would like them to not be paired up again going forward, so at that point Jon could be paired with anyone but Sam or Arya. I hope I'm making sense.
I was able to run the combn function on the ID column to generate groups of two, but it doesn't account for the historical pairings that we're trying to avoid.
Is there a way to do this in R? I've tried this one (Using R, Randomly Assigning Students Into Groups Of 4) but it wasn't quite right for my needs, again because of the historical pairings.
Thank you for any help you can provide!
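One possible approach, sketched under some assumptions: list every candidate pair with combn, drop pairs from the same location and pairs already in the history, then shuffle the remainder and greedily accept pairs whose members are both still free. The helper make_pairs and the history data frame (columns a and b, with a < b) are hypothetical names, not from any package. A greedy pass does not guarantee that everyone is matched each month, so some employees may be left over.

```r
df <- data.frame(
  ID = 1:10,
  Location = c("Winterfell", "Winterfell", "Winterfell",
               "Kings Landing", "Kings Landing", "Kings Landing",
               "The Wall", "The Wall", "Essos", "Essos"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: draw one round of cross-location pairs,
# skipping any pair already recorded in `history` (columns a, b; a < b)
make_pairs <- function(df, history) {
  m <- t(combn(df$ID, 2))
  cand <- data.frame(a = pmin(m[, 1], m[, 2]), b = pmax(m[, 1], m[, 2]))

  # Keep only pairs whose members sit in different locations
  loc_a <- df$Location[match(cand$a, df$ID)]
  loc_b <- df$Location[match(cand$b, df$ID)]
  cand <- cand[loc_a != loc_b, ]

  # Drop pairs that have already met
  cand <- cand[!(paste(cand$a, cand$b) %in% paste(history$a, history$b)), ]

  # Shuffle, then greedily accept pairs whose members are both unused
  cand <- cand[sample(nrow(cand)), ]
  used <- c()
  keep <- logical(nrow(cand))
  for (i in seq_len(nrow(cand))) {
    if (!(cand$a[i] %in% used) && !(cand$b[i] %in% used)) {
      keep[i] <- TRUE
      used <- c(used, cand$a[i], cand$b[i])
    }
  }
  cand[keep, ]
}

set.seed(42)
history <- data.frame(a = integer(0), b = integer(0))
round1 <- make_pairs(df, history)
history <- rbind(history, round1)   # remember this month's pairs
round2 <- make_pairs(df, history)   # next month avoids them
```

Saving history to a file between runs (for example with write.csv) would carry the record of past pairings from month to month.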
This is an unusual and difficult question that has perplexed me for a number of days, and I hope I explain it correctly. I have two databases, i.e. data frames in R. The first is approx. 90,000 rows and is a record of every racehorse in the UK. It contains numerous fields, most importantly the NAME of each horse and its SIRE, with one record per horse. The second contains over one million rows and is a history of every race a horse has taken part in over the last ten years, i.e. races it has run or, as I call them, 'appearances'. It contains NAME, DATE, TRACK etc., with one record per appearance.
What I am attempting to do is to write a few lines of code - not a loop - that will provide me with a total number of every appearance made by the siblings of a particular horse i.e. one grand total. The first step is easy - finding the siblings i.e. horses with a common sire - and you can see it below (N.B FindSire is my own function which does what it says and finds the sire of a horse by referencing the same dataframe. I have simplified the code somewhat for clarity)
TestHorse <- "Save The Bees"
Siblings <- which(FindSire(TestHorse) == Horses$Sire)
Sibsname <- Horses[Siblings, 1]
This produces Sibsname, which is 636 names long (snippet below), although the average horse will only have 50 or so siblings. I could construct a loop that searches the second 'appearances' data frame, matches each sibling name individually, and then totals the appearances of all the siblings combined. However, I would like to know if I could avoid a loop, and the time associated with it, and write a few lines of code to achieve the same end: search for all 636 horses in the appearances database, calculate the number of times each appears, and total all these appearances. To put it another way, how many races have the siblings of "save the bees" taken part in? Thanks in advance.
[1] "abdication " "aberdonian " "acclamatory " "accolation " ..... to [636]
Using dplyr, calling your "first database" horses and your "second database" races:
library(dplyr)
test_horse = "Save The Bees"
select(horses, Name, Sire) %>%
filter(Sire == Sire[Name == tolower(test_horse)]) %>%
inner_join(races, c("Name" = "SELECTION_NAME")) %>%
summarize(horse = test_horse, sibling_group_races = n())
I am making the assumption that you want the number of appearances of the sibling group to include the appearances of the test horse - to omit them instead add , Name != tolower(test_horse) to the filter() command.
As you haven't shared data reproducibly, I cannot test the code. If you have additional problems I will not be able to help you solve them unless you share data reproducibly. ycw's comment has a helpful link for doing that. I would encourage you to edit your question to include either (a) code to simulate a small sample of data, or (b) the output of dput() on a small sample of your data, to share a few rows in a copy/pasteable format.
The code above will do for querying one horse at a time - if you intend to use it frequently it would be much simpler to just create a table where each row represents a sibling group and contains the number of races. Then you could just reference the table instead of calculating on the fly every time. That would look like this:
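# inner_join keeps only horses with at least one recorded race, so n() counts
# real appearances (a left_join would count each unraced horse as one row)
sibling_appearances =
inner_join(horses, races, by = c("Name" = "SELECTION_NAME")) %>%
group_by(Sire) %>%
summarize(offspring_appearances = n())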
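For a quick check without dplyr, the same loop-free count could be sketched in base R with %in%. The toy horses and races data frames below are made up for illustration. Only the column names Name, Sire and SELECTION_NAME come from the answer above, and this version excludes the test horse's own appearances.

```r
# Made-up toy data standing in for the two real data frames
horses <- data.frame(
  Name = c("save the bees", "abdication", "aberdonian", "other horse"),
  Sire = c("sire one", "sire one", "sire one", "sire two"),
  stringsAsFactors = FALSE
)
races <- data.frame(
  SELECTION_NAME = c("abdication", "abdication", "aberdonian",
                     "other horse", "save the bees"),
  stringsAsFactors = FALSE
)

test_horse <- "save the bees"

# Siblings: horses sharing the test horse's sire, excluding the horse itself
sire <- horses$Sire[horses$Name == test_horse]
sibs <- horses$Name[horses$Sire == sire & horses$Name != test_horse]

# Vectorised membership test instead of a loop over sibling names
is_sib_race <- races$SELECTION_NAME %in% sibs
per_horse <- table(races$SELECTION_NAME[is_sib_race])  # appearances per sibling
total <- sum(is_sib_race)                              # grand total
```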
I have a list of book titles and authors, and I am using the Google Books API to access additional information about the books (e.g. complete title, ISBNs, etc.). Ultimately, I want to copy the information from Google into my original list only if the authors field of the first item returned by Google includes the author name in my original list.
My question is about whether there is a simple way to convert the result of the query (which is a character object) into a table or dataframe based on patterns in the google result. Below is an example.
library(RCurl)
result<-getURL("https://www.googleapis.com/books/v1/volumes?q=fellowship%20of%20the%20ring%20tolkien&startIndex=0",ssl.verifyhost=F,ssl.verifypeer=F,followlocation=T)
print(result)
This leads to this result:
[1] "{\n \"kind\": \"books#volumes\",\n \"totalItems\": 717,\n \"items\": [\n {\n \"kind\": \"books#volume\",\n \"id\": \"aWZzLPhY4o0C\",\n \"etag\": \"UKfRIR+5nhY\",\n \"selfLink\": \"https://www.googleapis.com/books/v1/volumes/aWZzLPhY4o0C\",\n \"volumeInfo\": {\n \"title\": \"The Fellowship of the Ring\",\n \"subtitle\": \"Being the First Part of The Lord of the Rings\",\n \"authors\": [\n \"J.R.R. Tolkien\"\n ],\n \"publisher\": \"Houghton Mifflin Harcourt\",\n \"publishedDate\": \"2012-02-15\",\n \"description\": \"The first volume in J.R.R. Tolkien's epic adventure THE LORD OF THE RINGS One Ring to rule them all, One Ring to find them, One Ring to bring them all and in the darkness bind them In ancient times the Rings of Power were crafted by the Elven-smiths, and Sauron, the Dark Lord, forged the One Ring, filling it with his own power so that he could rule all others. But the One Ring was taken from him, and though he sought it throughout Mid...
I would like to convert the resulting character object to a list, table, or data frame. For the most part, the result follows this pattern:
column names are enclosed in " ", preceded on the left by a line return \n, and followed by ":" on the right
row values are enclosed in " ", preceded on the left by ": ", and followed by ",\n" on the right
But some fields, like the ISBNs, don't follow the pattern exactly.
for example, I'd like result.df to look like:
kind title subtitle authors publisher publishedDate description ISBN_13 ISBN_10
"books#volume" "The Fellowship of the Ring" "Being the First Part of The Lord of the Rings"
"J.R.R. Tolkien" "Houghton Mifflin Harcourt" "2012-02-15" "The first volume in J.R.R. Tolkien's epic adventure THE LORD OF THE RINGS One Ring to rule them all, One Ring to find them, One Ring to bring them all and in the darkness bind them In ancient times the Rings of Power were crafted by the Elven-smiths, and Sauron, the Dark Lord, forged the One Ring, filling it with his own power so that he could rule all others. But the One Ring was taken from him, and though he sought it throughout Middle-earth, it remained lost to him. After many ages it fell into the hands of Bilbo Baggins, as told in The Hobbit. In a sleepy village in the Shire, young Frodo Baggins finds himself faced with an immense task, as his elderly cousin Bilbo entrusts the Ring to his care. Frodo must leave his home and make a perilous journey across Middle-earth to the Cracks of Doom, there to destroy the Ring and foil the Dark Lord in his evil purpose. “A unique, wholly realized other world, evoked from deep in the well of Time, massively detailed, absorbingly entertaining, profound in meaning.” – New York Times" "9780547952017" "0547952015"
Ultimately, I want to be able to copy values from the new list/table/dataframe to another dataframe, if certain values match (e.g., the authors value includes a match to a value in another dataframe), similar to this excerpt from a loop:
if(grepl(books$auth1last[i],result.df$authors[1])==TRUE){
books$isbn13[i] = result.df$isbn13[1]
}else{
books$isbn13[i] = NA}
Is there an elegant way to convert the character object into something more like an organized list/table/df with just a few lines, or will I have to extract each column name and value with a separate line using something like rm_between? Thanks!
You can convert the returned JSON string into a list using the jsonlite package. You just need to remove the line breaks for it to work.
Example:
library(RCurl)
result <- getURL("https://www.googleapis.com/books/v1/volumes?q=fellowship%20of%20the%20ring%20tolkien&startIndex=0",ssl.verifyhost=F,ssl.verifypeer=F,followlocation=T)
result_no_breaks <- gsub("\\n", " ",result)
json_list <- jsonlite::fromJSON(result_no_breaks)
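Once the JSON is parsed, the fields of the first item can be picked out by name. The sketch below uses a trimmed, hand-written stand-in for the API response so that it runs offline. The title, author and ISBN values are copied from the question, and the nested industryIdentifiers layout (type and identifier) is assumed to match what the API returns. Passing simplifyVector = FALSE keeps everything as plain nested lists, which makes the indexing predictable.

```r
library(jsonlite)

# Trimmed stand-in for the API response (not the full result)
result <- '{
  "kind": "books#volumes",
  "items": [
    {
      "kind": "books#volume",
      "volumeInfo": {
        "title": "The Fellowship of the Ring",
        "authors": ["J.R.R. Tolkien"],
        "industryIdentifiers": [
          {"type": "ISBN_13", "identifier": "9780547952017"},
          {"type": "ISBN_10", "identifier": "0547952015"}
        ]
      }
    }
  ]
}'

json_list <- fromJSON(result, simplifyVector = FALSE)
info <- json_list$items[[1]]$volumeInfo   # first item returned

# Build a named vector of identifiers, keyed by their type
ids <- info$industryIdentifiers
isbn <- setNames(vapply(ids, function(x) x$identifier, character(1)),
                 vapply(ids, function(x) x$type, character(1)))

result.df <- data.frame(
  title   = info$title,
  authors = paste(unlist(info$authors), collapse = "; "),
  ISBN_13 = unname(isbn["ISBN_13"]),
  ISBN_10 = unname(isbn["ISBN_10"]),
  stringsAsFactors = FALSE
)
```

With this in place, a check such as grepl("Tolkien", result.df$authors[1]) could drive the copy step described in the question.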
I am developing a report for a University. They require around 15 different groupings of their choice (e.g. Campus, Faculty, Course, School, Program, Major, Minor, Nationality, Mode of Study... and the list goes on).
They require the Headcount and EFTS (Equivalent Full Time Student) for each of these groupings.
They may require a single grouping, a selection in any order of groupings, or all groupings.
A solution was provided here: https://www.ibm.com/developerworks/community/forums/html/topic?id=77777777-0000-0000-0000-000014834406#77777777-0000-0000-0000-000014837873
The solution suggests using conditional blocks and multiple lists. However, that would mean I would have 50+ lists, one for every possible combination of my groups (e.g. Campus + Faculty, Campus + School, School + Faculty, Faculty + Campus ....)
Is there any way for users to dynamically select the order of their groupings, and which groups to exclude/include? Thanks