I am looking for advice on the best way to handle data going into Neo4j.
I have a set of structured data, CSV format which relates to journeys. The data is:
"JourneyID" - unique ref#/ Primary Key e.g 1234
"StartID" - ref# , this is a station e.g Station1
"EndIID" - ref# this is a station, e.g Station1 (start and end can be the same)
"Time" – integer e.g. 24
Assume I have 100 journeys/rows of data, showing journeys between 10 different stations.
I can see and work with this data in SQL or Excel. I want to work with this in Neo4j.
This is what I currently have:
StartID with JourneyID as a label
EndID with JourneyID as a label
This means that each row from the CSV for a station is its own node. I then created a relationship between Start and End using the JourneyID (primary key).
The effect was just 100 nodes connected to 100 nodes, e.g. a connection from Station1 to Station2, Station1 to Station3, and Station1 to Station4. It didn't show the relationship between starting Station1 and ending Stations 1, 2 and 3 – which is what I want to show.
How best do I model this data so that the graph sees 10 unique StartIDs connecting to the different EndIDs, showing the relationships between them?
Thanks in advance
(new to Graphs!)
This sample query, which uses MERGE to avoid creating duplicate nodes and relationships, should help you get started:
LOAD CSV WITH HEADERS FROM 'file:///input.csv' AS row
MERGE (start:Station {id: row.StartID})
MERGE (end:Station {id: row.EndID})
MERGE (j:Journey {id: row.JourneyID})
  ON CREATE SET j.time = toInteger(row.Time)
MERGE (j)-[:FROM]->(start)
MERGE (j)-[:TO]->(end)
I don't think you want a Journey to be a node; you want the JourneyID to be an attribute of the edge:
LOAD CSV WITH HEADERS FROM 'file:///input.csv' AS row
MERGE (start:Station {id: row.StartID})
MERGE (end:Station {id: row.EndID})
MERGE (start)-[:JOURNEY {id: row.JourneyID}]->(end)
That more intuitively describes the data, and you could even extend this to different relationship types, if you can describe Journeys in more detail.
Edit:
This is to answer your question, but I can't speak as to how this scales up. I think it depends on the types of queries you plan to make.
In R Shiny, I have a data table in wide form; let's call this Representation A of the data:
Rank   Person   School
1      Sean     Boston
2      Alicia
I'm using DT::dataTableOutput to present this table. I want to instead present this table in long form (Rank identifies the observations):
Rank   Variable   Value of Variable
1      Person     Sean
1      School     Boston
2      Person     Alicia
2      School
I will then also style this table slightly differently. I will:
Only print the first occurrence of a value of rank
The table then becomes:
Rank   Variable   Value of Variable
1      Person     Sean
       School     Boston
2      Person     Alicia
       School
I will also:
Drop empty rows
So the final table, which we will call Representation B of the data, becomes:
Rank   Variable   Value of Variable
1      Person     Sean
       School     Boston
2      Person     Alicia
When presenting the table in format B, I still want to keep as much as possible of the functionality that DT::dataTableOutput supplies for form A, e.g. the ability to search and sort (without separating rows that belong together). Here, by sorting I don't mean the order in which variables are presented within a given rank, but the order in which the "rank groups" are presented. For example, sorting on Person should yield the following, since Alicia comes before Sean lexicographically:
Rank   Variable   Value of Variable
2      Person     Alicia
1      Person     Sean
       School     Boston
What do you think is the easiest way to implement this?
Currently, I consider two different ways. In both plans, instead of having the standard sorting buttons provided by DT::dataTableOutput, I will link e.g. a radio button which will allow the user to choose which variable the table should be sorted on (i.e. Rank, Person, or School).
My first option for implementation: When the radio button is pressed, I will (without the user seeing it) transform the table to representation A, sort it there, then transform it to the properly sorted version of representation B.
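A minimal sketch of this first option, assuming data.table's dcast/melt, long-form columns named Rank, Variable, and Value, and a sort_var string coming from the radio button (all of these names are placeholders for the real ones):

library(data.table)

# Long (representation-B-like) data -> wide (A) -> sort -> back to long,
# keeping the new group order. Column names Rank / Variable / Value are assumed.
sort_long_table <- function(long_dt, sort_var) {
  wide <- dcast(long_dt, Rank ~ Variable, value.var = "Value")  # representation A
  wide <- wide[order(wide[[sort_var]])]                         # sort on the chosen variable
  wide[, grp := .I]                                             # remember the sorted group order
  out <- melt(wide, id.vars = c("Rank", "grp"),
              variable.name = "Variable", value.name = "Value")
  out <- out[order(grp, Variable)][, grp := NULL]               # keep each rank's rows together
  out[]
}

From there, blanking repeated ranks and dropping empty rows (the B styling) can be applied before handing the result to DT.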
My second option for implementation: I could "attach" representation A to representation B (without showing the former to the user) so that each row of the underlying data contains the full information for that rank:
Rank   Variable   Value of Variable   Rank_Long   Person_Long   School_Long
1      Person     Sean                1           Sean          Boston
       School     Boston              1           Sean          Boston
2      Person     Alicia              2           Alicia
If I want to obtain the lexicographically sorted B representation, I order the table above by (Person_Long, Rank_Long, Variable). The extra Variable in the ordering is added to get the rows presented to the user (Rank and Variable) in the right order (Person above School).
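For the second option, the reordering itself is just a multi-column order() call on the augmented table (b_augmented is a placeholder name for the table sketched above):

# Sort rank groups by Person, keeping Person before School within each rank.
# b_augmented and its *_Long columns are the hypothetical augmented table above.
b_sorted <- b_augmented[order(b_augmented$Person_Long,
                              b_augmented$Rank_Long,
                              b_augmented$Variable), ]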
In practice, I will have around five to ten variables rather than only two (and each of these ten should be allowed to sort on), and I will be running Shiny Server on an AWS server which will be reached through an iframe on a website.
Pros and cons of the first and second options for implementation:
First option: Fast if the dcast and melt functions are fast enough and if melt can be set to preserve the sorting when transforming to long. Should provide faster initial load time of the table for the user since the table won't have as many columns as the second option.
Second option: No reshape needed, but more data sent to the user at initial load.
Which of my two options do you think is the best, and do you have suggestions about other implementations I haven't considered?
I concatenated my columns of interest using paste with properly placed <br> tags for line breaks. Using DT::datatable(..., escape = FALSE) I could then obtain the table in the form I wanted.
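Roughly, a sketch of that idea (df_wide with columns Rank, Person, and School stands in for the real wide data; the exact columns pasted together will differ):

library(DT)

# Collapse each rank's variables and values into single cells separated by
# <br> tags, then render with escape = FALSE so the HTML line breaks survive.
df_display <- data.frame(
  Rank     = df_wide$Rank,
  Variable = "Person<br>School",
  Value    = paste(df_wide$Person, df_wide$School, sep = "<br>"),
  stringsAsFactors = FALSE
)

datatable(df_display, escape = FALSE, rownames = FALSE)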
I defined a sliderInput (outside of the table) whose input I set to trigger events for sorting. The table reloads (thus resetting the search) when the slider obtains a new input; I will try to fix that (with some filter or proxy solution, or hopefully I'll find a simple way).
My task is to get total inbound leads for a group of customers, leads by month for the same group of customers and conversion rate of those leads.
The dataset I'm pulling from is 20 million records, so I can't query the whole thing. I have successfully done the first step (getting the total lead count for each org) with this:
inbound_leads <- domo_get_query('6d969e8b-fe3e-46ca-9ba2-21106452eee2',
                                auto_limit = TRUE,
                                query = "select org_id,
                                                COUNT(*)
                                         from table
                                         GROUP BY org_id
                                         ORDER BY org_id")
DOMO is the BI tool I'm pulling from, and domo_get_query is an internal function from a custom library my company built. It takes a query argument, which is a MySQL query, and various others which aren't important right now.
sample data looks like this:
org_id   inserted_at   lead_converted_at
1        10/17/2021    2021-01-27T03:39:03
2        10/18/2021    2021-01-28T03:39:03
1        10/17/2021    2021-01-28T03:39:03
3        10/19/2021    2021-01-29T03:39:03
2        10/18/2021    2021-01-29T03:39:03
I have looked through many aggregation tutorials online, but none of them seem to cover how to get data that is needed pre-aggregation (such as the number of leads per month per org) from a dataset that has to be aggregated just to be accessed in the first place. That information isn't available once the aggregation has occurred, because in the sample above the aggregation removes the ability to see more than one instance of org_id 1, for example. Maybe I just don't understand this well enough to know the right questions to ask. Any direction is appreciated.
If you're unable to fit your data in memory, you have a few options. You could process the data in batches (e.g. one year at a time) so that it fits in memory. You could use a package like chunked to help.
But in this case I would bet the easiest way to handle your problem is to solve it entirely in your SQL query. To get leads by month, you'll need to truncate your date column and group by org_id, month.
To get conversion rate for leads in those months, you could add a column (in addition to your count column) that is something like:
sum(case when conversion_date is not null then 1 else 0 end) as convert_count
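Putting that together with the domo_get_query pattern from the question (the query ID is reused from above; DATE_FORMAT and the lead_converted_at column name are assumptions based on MySQL syntax and the sample data), something along these lines should return one row per org per month, ready for a conversion-rate calculation in R:

monthly_leads <- domo_get_query('6d969e8b-fe3e-46ca-9ba2-21106452eee2',
                                auto_limit = TRUE,
                                query = "select org_id,
                                                DATE_FORMAT(inserted_at, '%Y-%m') as lead_month,
                                                COUNT(*) as lead_count,
                                                sum(case when lead_converted_at is not null
                                                         then 1 else 0 end) as convert_count
                                         from table
                                         group by org_id, lead_month
                                         order by org_id, lead_month")

# Assuming the columns come back with the aliases above:
monthly_leads$conversion_rate <- monthly_leads$convert_count / monthly_leads$lead_count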
Background: I am using R along with some packages to pull JSON data from a ticketing system. I'm pulling all the users and want to build a reporting structure.
I have a data set that contains employees and their managers. The columns are named as such ("Employee" and "Manager"). I am trying to build a tree of a reporting structure that goes up to the root. We are in an IT organization, but I am pulling all employee data, so this would look something like:
Company -> Business Unit -> Executive -> Director -> Group Manager -> Manager -> Employee
That's the basic idea. Some areas have a tree structure that is small, others it's multiple levels. Basically, what I am trying to do is get a tree, or reporting structure I can reference, so I can determine for an employee, who their director is. This could be 1 level removed or up to 5 or 6 levels removed.
I came across data.tree, but so far, as I look at it, I have to provide a pathString that defines the structure. Since I only have the two columns, what I'd like to do is throw this data frame into a function and have it traverse the list: as it finds an employee, put them under that manager; and when it finds that manager listed as an employee, nest them under their own manager, along with anything nested under them.
I haven't been able to figure out how to make data.tree do this without defining the pathString, but in doing so, I can only build the pathString on what I know for each row - the employee and their manager. The result is a tree that only has 2 levels and directors aren't connected to their Group Manager and Group Managers aren't connected to their managers and so forth.
I thought about writing some logic/loops to go through and do this, but there must be an easier way or a package that I can use to do this. Maybe I am not defining the pathString correctly....
Ultimately, what I'd like the end result to be is a data frame with columns that look like:
Employee, Manager1, Manager2, Manager3, ManagerX, ...
Of course some rows will only have entries in columns 1 and 2, but others could go up many levels. Once I have this, I can look up devices in our configuration management system, find the owner and aggregate those counts under the appropriate director.
Any help would be appreciated. I cannot post the data, as it is confidential in nature, but it simply contains the employees and their managers. I just need to connect all the dots... Thanks!
The data.tree package has the FromDataFrameNetwork function for just this scenario:
library(data.tree)
DataForTree <- data.frame(manager = c("CEO","sally","sally","sue","mary", "mary"),
employee = c("sally","sue","paul","mary","greg", "don"),
stringsAsFactors = FALSE)
tree <- FromDataFrameNetwork(DataForTree)
print(tree)
Results in:
1 CEO
2  °--sally
3      ¦--sue
4      ¦   °--mary
5      ¦       ¦--greg
6      ¦       °--don
7      °--paul
The hR package is specifically designed to address the needs of data analysis on people/employee data, albeit it is minimal at this point. The hierarchy function can produce a wide data frame as you would like; this helps with joining in other data and continuing an analysis.
library(hR)
ee = c("Dale#hR.com","Bob#hR.com","Julie#hR.com","Andrea#hR.com")
supv = c("Julie#hR.com","Julie#hR.com","Andrea#hR.com","Susan#hR.com")
hierarchy(ee,supv,format="wide")
Employee Supv1 Supv2 Supv3
1 Dale#hR.com Susan#hR.com Andrea#hR.com Julie#hR.com
2 Bob#hR.com Susan#hR.com Andrea#hR.com Julie#hR.com
3 Julie#hR.com Susan#hR.com Andrea#hR.com <NA>
4 Andrea#hR.com Susan#hR.com <NA> <NA>
Issue:
I have about 9,000 rows of data that show XYZ data for about 10 tags.
All tags ping at different rates. Some at 1 second, some at 7 seconds, some every few hours, and so on.
I want to show how the tagged items move through a factory. Preferably with an animated GIF...
Desired data set:
For each of the 10 tags, on a one-minute interval, where is it in the factory? Some tags will have multiple records per minute. Some records will have hours since the last ping.
Is there a way to do this in R? I can brute force it with Excel (shudder), but I'm hoping someone can set me on the right path, even if it's just the proper search term to send me on my journey. Thanks!
Without an example data set to look at, there is a limit to what kind of answer I can provide.
However, the first thing you want to do is get all of your observations onto a unified timeline. That is, each row should be a timestamp and each column should be the x, y, or z coordinate for one of your tags. Without seeing the data, I can't help you do this, but here are the types of things you need to think about. If tag A has measurements at timestamps 1, 2, 3, 4, 5, and 6 and tag B has measurements at timestamps 1 and 6, what value should tag B have at timestamps 2, 3, 4, and 5? You could simply copy the value from timestamp 1, or you could do some type of interpolation between timestamp 1 and timestamp 6. If you don't know what I am talking about here, let me know.
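To make that concrete, here is a rough base-R sketch of building a one-minute unified timeline, assuming a data frame pings with columns tag, time (POSIXct), x, y, z (names are placeholders) and at least two pings per tag:

# One-minute grid spanning the whole observation window.
grid <- seq(min(pings$time), max(pings$time), by = "1 min")
g    <- as.numeric(grid)

# For each tag, carry the last known coordinate forward onto the grid
# (method = "constant"); use method = "linear" to interpolate instead.
per_tag <- lapply(split(pings, pings$tag), function(d) {
  t <- as.numeric(d$time)
  data.frame(tag  = d$tag[1],
             time = grid,
             x = approx(t, d$x, xout = g, method = "constant", rule = 2)$y,
             y = approx(t, d$y, xout = g, method = "constant", rule = 2)$y,
             z = approx(t, d$z, xout = g, method = "constant", rule = 2)$y)
})

positions <- do.call(rbind, per_tag)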
Second, there could be a package to make a GIF in R. However, I have taken advantage of the R function Sys.sleep. Here is what I would suggest: once you have your unified timeline for your dataset, write a for loop over the rows of your dataset. At the top of each iteration, call Sys.sleep(0.25). This will make R "sleep" for 0.25 seconds. Then plot the current row of the dataset for all of your tags. This creates a really cool video that you could record with some other tool.
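A sketch of that loop, using the positions frame built above (the plotting details and the 0.25 second pause are placeholders; packages such as animation or gifski can capture a loop like this into an actual GIF file):

times <- sort(unique(positions$time))

for (i in seq_along(times)) {
  frame <- positions[positions$time == times[i], ]
  plot(frame$x, frame$y,
       xlim = range(positions$x), ylim = range(positions$y),
       pch = 19, col = as.integer(factor(frame$tag, levels = unique(positions$tag))),
       xlab = "x", ylab = "y", main = format(times[i]))
  Sys.sleep(0.25)  # pause so each frame stays on screen briefly
}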
I have imported data relating to about 70 human subjects from three data tables and have merged them into one data frame in R. Some of the 100 fields are straightforward, such as date.birth, number.surgeries.lifetime and number.surgeries.12months. Other fields, such as "comments", may contain no value, one sentence, or even several sentences.
Some human subjects have an anomaly, meaning that something is missing or not quite right, and for those people I have to manually investigate what's up. When I open the data frame as a data frame, or even as a table in fix(), it is difficult to read. I have to scroll from left to right and then expand some columns by a ridiculous amount to read just one comment.
It would be much better if I could subset the 5 patients I need to explore and report the data as free-flowing text. I thought I could do that by exporting to a CSV, but it's difficult to see which fields are what. For example: 2001-01-05, 12, 4, had testing done while still living in Los Angeles. That was easy; imagine what happens when there are 100 fields, many are numbers, many are dates, and there are several different comment fields.
A better way would be to output a report such as this:
date.birth:2001-01-05, number.surgeries.lifetime:12, number.surgeries.12months:4, comments:will come talk to us on Monday
Each one of the 5 records would follow that format.
field name 1:field 1 value record 1, field name 2:field 2 value record 1...
skip a line (or something easy to see)
field name 1:field 1 value record 2, field name 2:field 2 value record 2
How can I do it?
How about this?
set.seed(1)
age <- abs(rnorm(10, 40, 20))
patient.key <- 101:110
date.birth <- as.Date("2011-02-02") - age * 365
number.surgeries.12months <- rnbinom(10, 1, .5)
number.surgeries.lifetime <- trunc(number.surgeries.12months * (1 + age/10))
comments <- "comments text here"
data <- data.frame(patient.key,
date.birth,
number.surgeries.12months,
number.surgeries.lifetime,
comments)
Subset the data by the patients and fields you are interested in:
selected.patients <- c(105, 109)
selected.fields <- c("patient.key", "number.surgeries.lifetime", "comments")
subdata <- subset(data[ , selected.fields], patient.key %in% selected.patients)
Format the result for printing.
# paste the column name next to each data field
taggeddata <- apply(subdata, 1,
                    function(row) paste(colnames(subdata), row, sep = ":"))
# paste all the data fields into one line of text
textdata <- apply(taggeddata, 2,
function(rec) do.call("paste", as.list(rec)))
# write to a file or to screen
writeLines(textdata)
Though I risk repeating myself, I'll make yet another case for the RMySQL package. You will be able to edit your database with your favorite SQL client (I recommend SequelPro): use SELECT statements / filtering and then edit the result. For example
SELECT patientid, patientname, inability FROM patients LIMIT 5
could display only your needed fields. With a nice SQL client you could edit the result directly and store it back to the database. Afterwards, you can just reload the database into R. I know a lot of folks would argue that your dataset is too small for such overhead, but I'd still prefer the editing capabilities of most SQL editors over R's. The same applies to joining tables if it gets trickier. Plus, you might be interested in writing views ("tables" that are updated on access), which will be treated like tables in R.
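A minimal sketch of that round trip (connection details are placeholders; the table and column names are just the ones from the example query):

library(RMySQL)

# Connect to the database (credentials/host are placeholders)
con <- dbConnect(MySQL(), dbname = "studydb", host = "localhost",
                 user = "me", password = "secret")

# Pull just the fields you want to inspect; heavier editing can be done in an
# SQL client such as SequelPro and saved back to the database there.
peek <- dbGetQuery(con, "SELECT patientid, patientname, inability FROM patients LIMIT 5")

# Afterwards, reload the (possibly edited) table into R as a data frame.
patients <- dbReadTable(con, "patients")

dbDisconnect(con)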
Check out library(reshape). I think if you start by melt()-ing your data, your feet will be on the path to your desired outcome. Let us know if that looks like it will help and how it goes from there.
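A small sketch of that idea against the data frame from the earlier answer (reshape2, the successor to reshape, is shown here; note that melt() coerces mixed column types to a common type in the value column):

library(reshape2)

# Melt the patient data so each field becomes its own row.
longdata <- melt(data, id.vars = "patient.key")

# One field per row is much easier to scan for a handful of patients:
subset(longdata, patient.key %in% c(105, 109))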