R - Employee Reporting Structure

Background: I am using R along with some packages to pull JSON data from a ticketing system. I'm pulling all the users and want to build a reporting structure.
I have a data set that contains employees and their managers. The columns are named as such ("Employee" and "Manager"). I am trying to build a tree of a reporting structure that goes up to the root. We are in an IT organization, but I am pulling all employee data, so this would look something like:
Company -> Business Unit -> Executive -> Director -> Group Manager -> Manager -> Employee
That's the basic idea. Some areas have a small tree structure; in others it runs to multiple levels. Basically, what I am trying to do is get a tree, or reporting structure, I can reference, so that for any employee I can determine who their director is. This could be 1 level removed or up to 5 or 6 levels removed.
I came across data.tree, but as far as I can tell, I have to provide a pathString that defines the structure. Since I only have the two columns, what I'd like to do is throw this data frame into a function and have it traverse the list: when it finds an employee, put them under their manager, and when it finds that manager listed as an employee, nest them under their own manager, along with anything nested under them.
I haven't been able to figure out how to make data.tree do this without defining the pathString, and I can only build the pathString from what I know for each row - the employee and their manager. The result is a tree that only has 2 levels: directors aren't connected to their Group Managers, Group Managers aren't connected to their managers, and so forth.
I thought about writing some logic/loops to go through and do this, but there must be an easier way or a package that I can use to do this. Maybe I am not defining the pathString correctly....
Ultimately, what I'd like the end result to be is a data frame with columns that look like:
Employee, Manager1, Manager2, Manager3, ManagerX, ...
Of course some rows will only have entries in columns 1 and 2, but others could go up many levels. Once I have this, I can look up devices in our configuration management system, find the owner and aggregate those counts under the appropriate director.
Any help would be appreciated. I cannot post the data, as it is confidential in nature, but it simply contains employees and their managers. I just need to connect all the dots... Thanks!

The data.tree package has the FromDataFrameNetwork function for just this scenario:
library(data.tree)
DataForTree <- data.frame(manager  = c("CEO", "sally", "sally", "sue", "mary", "mary"),
                          employee = c("sally", "sue", "paul", "mary", "greg", "don"),
                          stringsAsFactors = FALSE)
tree <- FromDataFrameNetwork(DataForTree)
print(tree)
Results in:
1 CEO
2  °--sally
3      ¦--sue
4      ¦   °--mary
5      ¦       ¦--greg
6      ¦       °--don
7      °--paul
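From there, you can walk each node's ancestry to get the wide Employee/Manager1/Manager2/... layout described in the question. Here is a minimal sketch (column names are illustrative) built on data.tree's path property, which gives the chain of names from the root down to a node:
# node$path is the chain of names from the root down to each node;
# reversing it puts the employee first, then the managers up the chain.
paths <- tree$Get(function(node) node$path, simplify = FALSE)
depth <- max(lengths(paths))
wide  <- as.data.frame(
  do.call(rbind, lapply(paths, function(p) {
    c(rev(p), rep(NA, depth - length(p)))  # pad shorter chains with NA
  })),
  stringsAsFactors = FALSE
)
names(wide) <- c("Employee", paste0("Manager", seq_len(depth - 1)))
Each row is then an employee followed by their full management chain; the root itself comes out as a row that is all NA past the first column, which you can drop.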

The hR package is specifically designed for data analysis of people/employee data, albeit minimal at this point. Its hierarchy function can produce a wide data frame like the one you describe; this helps with joining in other data and continuing an analysis.
library(hR)
ee = c("Dale@hR.com","Bob@hR.com","Julie@hR.com","Andrea@hR.com")
supv = c("Julie@hR.com","Julie@hR.com","Andrea@hR.com","Susan@hR.com")
hierarchy(ee,supv,format="wide")
       Employee        Supv1         Supv2        Supv3
1   Dale@hR.com Susan@hR.com Andrea@hR.com Julie@hR.com
2    Bob@hR.com Susan@hR.com Andrea@hR.com Julie@hR.com
3  Julie@hR.com Susan@hR.com Andrea@hR.com         <NA>
4 Andrea@hR.com Susan@hR.com          <NA>         <NA>

Related

How to access unaggregated results when aggregation is needed due to dataset size in R

My task is to get total inbound leads for a group of customers, leads by month for the same group of customers, and the conversion rate of those leads.
The dataset I'm pulling from is 20 million records, so I can't query the whole thing. I have successfully done the first step (getting the total lead count for each org) with this:
inbound_leads <- domo_get_query('6d969e8b-fe3e-46ca-9ba2-21106452eee2',
                                auto_limit = TRUE,
                                query = "select org_id,
                                                COUNT(*)
                                         from table
                                         GROUP BY org_id
                                         ORDER BY org_id")
Domo is the BI tool I'm pulling from, and domo_get_query is an internal function from a custom library my company built. It takes a query argument (a MySQL query) and various others which aren't important right now.
Sample data looks like this:
org_id, inserted_at, lead_converted_at
1 10/17/2021 2021-01-27T03:39:03
2 10/18/2021 2021-01-28T03:39:03
1 10/17/2021 2021-01-28T03:39:03
3 10/19/2021 2021-01-29T03:39:03
2 10/18/2021 2021-01-29T03:39:03
I have looked through many aggregation tutorials online, but none of them cover how to get data that is only visible pre-aggregation (such as the number of leads per month per org - in the sample above, the aggregation removes the ability to see more than one row for org_id 1, for example) from a dataset that must be aggregated just to be queried in the first place. Maybe I just don't understand this enough to know the right questions to ask. Any direction appreciated.
If you're unable to fit your data in memory, you have a few options. You could process the data in batches (e.g. one year at a time) so that it fits in memory. You could use a package like chunked to help.
But in this case I would bet the easiest way to handle your problem is to solve it entirely in your SQL query. To get leads by month, you'll need to truncate your date column and group by org_id, month.
To get conversion rate for leads in those months, you could add a column (in addition to your count column) that is something like:
sum(case when conversion_date is not null then 1 else 0 end) as convert_count
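Putting the pieces together, a sketch of the monthly query might look like the following. It reuses the domo_get_query call from the question, and it assumes MySQL's DATE_FORMAT is available and that inserted_at and lead_converted_at (the column names from the sample data) are what the table actually uses:
monthly_leads <- domo_get_query('6d969e8b-fe3e-46ca-9ba2-21106452eee2',
                                auto_limit = TRUE,
                                query = "select org_id,
                                                DATE_FORMAT(inserted_at, '%Y-%m') as lead_month,
                                                COUNT(*) as lead_count,
                                                SUM(case when lead_converted_at is not null
                                                         then 1 else 0 end) as convert_count
                                         from table
                                         GROUP BY org_id, lead_month
                                         ORDER BY org_id, lead_month")
# The aggregated result is small, so the conversion rate can be computed in R:
monthly_leads$conversion_rate <- monthly_leads$convert_count / monthly_leads$lead_count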

R biomaRt package: obtaining all values in linked databases

A bioinformatics programming question. In R, I have a classic speciesA-to-speciesB gene symbol conversion, in this example from mouse to human, which I'm performing using biomaRt, and specifically the getLDS function.
x <- c("Lbp","Ndufv3","Ggt1")
require(biomaRt)
convert <- function(x){
  human <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
  mouse <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
  newgenes <- getLDS(
    attributes = "mgi_symbol",
    filters = "mgi_symbol",
    values = x,
    mart = mouse,
    attributesL = "hgnc_symbol",
    martL = human,
    uniqueRows = TRUE
  )
  humanx <- unique(newgenes)
  return(humanx)
}
conversion <- convert(x)
However, I would like to obtain ALL ids present in the linked databases: in other words, all mouse/human pairs (in this example). Something to tell the values parameter of getLDS to retrieve all ids, not just those specified in the x variable. I am talking about a full map, tens of thousands of lines long, specifying all orthologous relationships between symbols in the two databases.
Any ideas or workarounds? Thanks a lot!
I believe a workaround could be retrieving all IDs from the Biomart database itself, here: https://www.ensembl.org/biomart/martview/
Click on choose database -> Ensembl Genes
Choose dataset -> your selected species (e.g. Mouse genes)
Click on Results -> Check "Unique results only" -> Go
Profit
The list retrieved there currently has 53605 ids, which is, I believe, what you need.
Enjoy!
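If you'd rather stay in R, a possible workaround is to query the mouse mart directly for its human-homolog attributes with getBM; with no values filter, it returns the mapping for every gene. The homolog attribute name below is an assumption on my part, so verify it with listAttributes(mouse, page = "homologs"):
require(biomaRt)
mouse <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
# No filters/values: getBM returns the requested columns for all genes.
allpairs <- getBM(
  attributes = c("mgi_symbol", "hsapiens_homolog_associated_gene_name"),
  mart = mouse
)
# Drop rows where no human homolog is recorded.
allpairs <- allpairs[allpairs$hsapiens_homolog_associated_gene_name != "", ]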

How to ingest data effectively in Neo4j

I am looking for advice on the best way to handle data going into Neo4j.
I have a set of structured data, CSV format which relates to journeys. The data is:
"JourneyID" - unique ref#/ Primary Key e.g 1234
"StartID" - ref# , this is a station e.g Station1
"EndIID" - ref# this is a station, e.g Station1 (start and end can be the same)
"Time" – integer e.g. 24
Assume I have 100 journeys/rows of data, showing journeys between 10 different stations.
I can see and work with this data in SQL or Excel. I want to work with this in Neo4j.
This is what I currently have:
StartID with JourneyID as a label
EndID with JourneyID as a label
This means that each row from the CSV for a station is its own node. I then created a relationship between Start and End using the JourneyID (primary key).
The effect was just 100 nodes connected to 100 nodes, e.g. a connection from Station1 to Station2, Station1 to Station3, and Station1 to Station4. It didn't show the relationships between starting Station1 and ending Stations 1, 2 and 3 - which is what I want to show.
How best do I model this data so that graph sees 10 unique StartID, connecting to the different EndIDs – showing the relationships between them?
Thanks in advance
(new to Graphs!)
This sample query, which uses MERGE to avoid creating duplicate nodes and relationships, should help you get started:
LOAD CSV WITH HEADERS FROM 'file:///input.csv' AS row
MERGE (start:Station {id: row.StartID})
MERGE (end:Station {id: row.EndID})
MERGE (j:Journey {id: row.JourneyID})
ON CREATE SET j.time = row.Time
MERGE (j)-[:FROM]->(start)
MERGE (j)-[:TO]->(end)
I don't think you want a Journey to be a node; you want the JourneyID to be an attribute of the edge:
LOAD CSV WITH HEADERS FROM 'file:///input.csv' AS row
MERGE (start:Station {id: row.StartID})
MERGE (end:Station {id: row.EndID})
MERGE (start)-[:JOURNEY {id:row.JourneyID}]->(end)
That describes the data more intuitively, and you could even extend this to different relationship types if you can describe journeys in more detail.
Edit:
This answers your question, but I can't speak to how it scales up. I think it depends on the types of queries you plan to make.

Is it possible to aggregate data with varying nesting depth in Grafana?

I have data in Grafana with different nesting depths. It looks like this (the nesting depth differs depending on the message type):
foo.<host>.type.<type-id>
foo.<host>.type.<type-id>.<subtype-id>
foo.<host>.type.<type-id>.<subtype-id>.<more-nesting>
...
The <host> field can be the IP of the server sending the data, and <type-id> is the type of message that it handled. There are quite a lot of message types, but for the visualization I am only interested in the first level of <type-id>, aggregated over all hosts.
For example, if I have this data:
foo.ip1.type.type1 = 3
foo.ip1.type.type2.subtype1 = 5
foo.ip2.type.type1 = 4
foo.ip2.type.type2.subtype1 = 9
foo.ip2.type.type2.subtype2 = 13
I would rather see it like this:
foo.*.type.type1 = 7 (3+4)
foo.*.type.type2 = 27 (5+9+13)
Then it would be easier to produce a graph where you can see which types of messages are most frequent.
I have not found a way to express that in Grafana. The only option that I see is to create a graph by manually creating queries for each message type. If there were only a handful of types that would be OK, but in my example, the number of types is quite high and even worse, they can change over time. When new message types are added, I would like to see them without having to change the graph.
Does Grafana support aggregating the data in such a way? Can it visualize the data aggregated by one node while summing up everything that comes after that node (like the --max-depth option of the Unix du command)?
I am not very experienced with Grafana, but I am starting to believe this functionality is not supported. I am not sure whether Grafana allows preprocessing the data, but if the data could be transformed to
foo.ip1.type.type1 = 3
foo.ip1.type.type2_subtype1 = 5
foo.ip2.type.type1 = 4
foo.ip2.type.type2_subtype1 = 9
foo.ip2.type.type2_subtype2 = 13
it would also be a valid workaround, as the number of subtypes is very low in my data (often there is even only one subtype).
I think the groupByNode function might be useful to you, with something like:
groupByNode(foo.*.type.*.*,3,"sumSeries")
This groups all matched series by the <type-id> node (index 3, counting from zero) and sums them. You'll need to repeat this for each nesting depth (foo.*.type.*, foo.*.type.*.*, and so on). Hope that helps.
More information is available here:
http://graphite.readthedocs.io/en/latest/functions.html#graphite.render.functions.groupByNode
If you want to do it the way you alluded to in your example (flattening subtypes into names like type2_subtype1), you could use aliasSub.

HBase key-value (NoSQL) to Hive table (SQL)

I have some tables in Hive that I need to join together. Since I need to do some work on each of them (normalize the key, remove outliers, ...), and as I add more and more tables, this chaining process has turned out to be a big mess.
It is so easy to lose track of where you are, and the query is getting out of control.
However, I have a pretty clear idea of what the final table should look like, and each column is fairly independent of the other tables.
For example:
table_class1
name id score
Alex 1 90
Chad 3 50
...
table_class2
name id score
Alexandar 1 50
Benjamin 2 100
...
In the end, I really want something that looks like:
name id class1 class2 ...
alex 1 90 50
ben 2 100 NA
chad 3 50 NA
I know it could be a left outer join, but I am really having a hard time creating a separate table for each of them after the normalization and then left outer joining each of them on the union of the keys...
I am thinking about using NoSQL (HBase) to dump the processed data into a long key-value format, like:
(source, key, variable, value)
(table_class1, (alex, 1), class1, 90)
(table_class1, (chad, 3), class1, 50)
(table_class2, (alex, 1), class2, 50)
(table_class2, (benjamin, 2), class2, 100)
...
In the end, I want to use something like melt and cast from the R reshape package to bring that data back into a table.
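For reference, this is what that cast step looks like on a toy version of the long data above, using dcast from reshape2 (the successor to reshape); illustrative only, not at HBase scale:
library(reshape2)
# Toy version of the long (key, variable, value) layout described above.
long <- data.frame(name     = c("alex", "chad", "alex", "benjamin"),
                   id       = c(1, 3, 1, 2),
                   variable = c("class1", "class1", "class2", "class2"),
                   value    = c(90, 50, 50, 100))
dcast(long, name + id ~ variable, value.var = "value")
#       name id class1 class2
# 1     alex  1     90     50
# 2 benjamin  2     NA    100
# 3     chad  3     50     NA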
This is a big data project, and there will be hundreds of millions of key-value pairs in HBase.
(1) I don't know if this is a legitimate approach.
(2) If so, is there any big data tool to pivot a long HBase table into a wide Hive table?
Honestly, I would love to help more, but I am not clear about what you're trying to achieve (maybe because I've never used R). Please elaborate and I'll try to improve my answer if necessary.
What do you need HBase for? You can store your processed data in new tables and work with them, and you can even CREATE VIEW to simplify the query if it's too large; maybe that's what you're looking for (HIVE manual). Unless you have a good reason for using HBase, I'd stick to Hive to avoid additional complexity. Don't get me wrong, there are a lot of valid reasons for using HBase.
About your second question: you can define and use HBase tables as Hive tables, and you can even CREATE them and INSERT ... SELECT into them, all inside Hive. Is that what you're looking for? See the HBase/Hive integration doc.
One last thing, in case you don't know: you can create custom functions (UDFs) in Hive very easily to help you with the tedious normalization process; take a look at this.
