Analyzing Path Data in R

I have data representing the paths people take across a fixed set of points (discrete, e.g., nodes and edges). So far I have been using igraph.
I haven't found a good way yet (in igraph or another package) to create canonical paths summarizing what significant sub-groups of respondents are doing.
A canonical path can be operationalized in any reasonable way and is just meant to represent a typical path or sub-path for a significant portion of the population.
Does there already exist a function to create these within igraph or another package?

One option: represent each person's movement as a directed edge. Create an aggregate graph such that each edge has a weight corresponding to the number of times that edge occurred. Those edges with large weights will be "typical" 1-paths.
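For instance, a minimal igraph sketch of this aggregation (the step data is made up for illustration):
library(igraph)
# One row per observed step (from, to), pooled across all respondents
steps <- data.frame(from = c("a1", "a1", "a3", "a3", "a3"),
                    to   = c("a2", "a2", "a5", "a5", "a4"))
# Count how often each directed edge occurs
agg <- aggregate(list(weight = rep(1, nrow(steps))), by = steps, FUN = sum)
g <- graph_from_data_frame(agg, directed = TRUE)
E(g)$weight  # edges with large weights are the "typical" 1-paths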
Of course, it gets more interesting to find common k-paths or explore how paths vary among individuals. The naive approach for 2-paths would be to create N additional nodes corresponding to the original nodes when visited in the middle of a 2-path. For example, if you have nodes a_1, ..., a_N you would create nodes b_1, ..., b_N. The aggregate network might have an edge (a_3, b_5, 10) and an edge (b_5, a_7, 10); this would represent the 2-path (a_3, b_5, a_7) occurring 10 times. The task you're interested in then corresponds to finding those 2-paths with large weights.
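Extending the sketch above, the b-node construction could look like this (again with made-up data):
# Each row is one observed 2-path (from, mid, to)
tp <- data.frame(from = c("a3", "a3", "a1"),
                 mid  = c("a5", "a5", "a5"),
                 to   = c("a7", "a7", "a7"))
# Duplicate middle nodes as b_* and split each 2-path into two edges
halves <- rbind(data.frame(from = tp$from, to = paste0("b_", tp$mid)),
                data.frame(from = paste0("b_", tp$mid), to = tp$to))
agg2 <- aggregate(list(weight = rep(1, nrow(halves))), by = halves, FUN = sum)
g2 <- graph_from_data_frame(agg2, directed = TRUE)  # heavy a_i -> b_j -> a_k chains are common 2-paths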
Both the igraph and network packages would suffice for this sort of analysis.
If you have some bound on k (i.e., only 6-paths occur in your dataset), I might also suggest enumerating all the paths that are taken and computing a histogram of each unique path. I don't know of any functions that do this automagically for you.
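If each respondent's full path is encoded as a single string, the histogram is a one-liner; a toy sketch with made-up paths:
# One string per respondent encoding the full path taken
paths <- c("a1>a2>a5", "a1>a2>a5", "a3>a5>a7", "a1>a2>a4")
sort(table(paths), decreasing = TRUE)  # the most frequent paths are canonical candidates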

R: calculating single shortest path between two vertices

Currently, I am working on a project that involves NYC Taxi data, in which I am given where a person is picked up and dropped off on a network.
I am working with an ESRI shapefile, which I can load into R as an igraph object with the shp2graph package; I need to utilize Dijkstra's algorithm (or a similar shortest-path algorithm) to find the single shortest path between two given vertices. I thought that the get.shortest.paths() method of the igraph package would be my solution, but to my surprise, this calculates all shortest paths from a vertex to all others in a network.
To me, this seems like overkill, because I need only one single path between two specified nodes. I did some poking around online and in the igraph documentation, but all I can find are functions for calculating shortest paths from a given vertex to all others.
Due to how computationally expensive it would be to calculate every single shortest path from a vertex, and then just select one from the behemoth of a list, I'm looking for a way to utilize Dijkstra's algorithm between two specified vertices in a graph. Is there a way to do this in the igraph package, or if not, is there a good way to do this with a different package in R?
EDIT: In the end, I am hoping to look for a function that will take in the graph object and the ID of two vertices I wish to find the shortest path between, then return a list of paths/edges (or IDs) along that shortest path. This would help me to inspect each individual street along the shortest path between the two vertices.
EDIT: As an example of how I am currently using the function:
path <- get.shortest.paths(NYCgraph, from=32, mode="out"). Something I would hope to find is path <- shortestPathFunction(NYCgraph, from=32, to=37) to calculate a single shortest path between vertex ID 32 and vertex ID 37 (two arbitrary street intersections in the network).
I found my issue, which occurred before I called get.shortest.paths(). For those who are curious about how to read in an ESRI shapefile and find a single shortest path between two points (which was my dilemma):
library(rgdal)      # readOGR()
library(shp2graph)  # readshpnw(), nel2igraph()
library(igraph)     # get.shortest.paths(), get.edge.ids()
myShapefile <- readOGR(dsn=".", layer="MyShapefileName") # i.e. "MyShapefileName.shp"
shpData <- readshpnw(myShapefile, ELComputed=TRUE)
igraphShpObject <- nel2igraph(shpData[[2]], shpData[[3]], weight=shpData[[4]])
testPath <- get.shortest.paths(igraphShpObject, from=42, to=52) # arbitrary nodes
testPath[1] # print the node IDs to the console
Furthermore, if one was interested in getting the ID of the edge connecting two nodes (perhaps from nodes in the testPath):
get.edge.ids(igraphShpObject, c(42,45)) # arbitrary nodes 42 and 45
This indexing is the same as the indexing in shpData; for example, if you want to get the length of edge ID x, as found in get.edge.ids(), you may type shpData[[4]][x].
I hope these tidbits may be helpful to somebody in the future encountering the same problems! This method utilizes the shp2graph, rgdal, and igraph packages in R.
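One further note: in recent igraph versions this functionality is exposed as shortest_paths(), which can also return the edges along the path directly, exactly what the EDIT above asks for. A minimal sketch using the object built above:
p <- shortest_paths(igraphShpObject, from = 42, to = 52, output = "both")
p$vpath[[1]]  # vertex IDs along the shortest path
p$epath[[1]]  # edge IDs along the path, handy for inspecting individual streets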

What data structures allow for efficient lookup in nested intervals?

I’m looking for a data structure that would help me find the smallest interval (the (low, high) pair) that encloses a given point. Intervals may be properly nested. For example:
Looking for point 3 in (2,7), (2,3), (4,5), (8,12), (9,10) should yield (2,3).
During the construction of the data structure, intervals are added in no particular order and, specifically, not according to their nesting. Is there a good way to map this problem to a search tree data structure?
A segment tree should do the job. In each node of the segment tree, keep a reference to the shortest interval that covers that node. When processing a query for a given point, walk from the root to the leaf containing the point and return the shortest of the intervals referenced along the way.
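A compact R sketch of this idea (all names are mine, and it assumes query points lie within the overall range of the endpoints):
build_tree <- function(intervals) {
  # Elementary units over the sorted endpoints: odd units are the endpoints
  # themselves, even units are the open gaps between consecutive endpoints.
  pts <- sort(unique(c(intervals$low, intervals$high)))
  n <- 2 * length(pts) - 1
  best <- vector("list", 4 * n)  # per node: shortest interval assigned to it
  insert <- function(node, lo, hi, ql, qh, iv) {
    if (qh < lo || hi < ql) return(NULL)
    if (ql <= lo && hi <= qh) {  # node is canonical for this interval
      cur <- best[[node]]
      if (is.null(cur) || diff(iv) < diff(cur)) best[[node]] <<- iv
      return(NULL)
    }
    mid <- (lo + hi) %/% 2
    insert(2 * node, lo, mid, ql, qh, iv)
    insert(2 * node + 1, mid + 1, hi, ql, qh, iv)
  }
  for (i in seq_len(nrow(intervals))) {
    iv <- c(intervals$low[i], intervals$high[i])
    insert(1, 1, n, 2 * match(iv[1], pts) - 1, 2 * match(iv[2], pts) - 1, iv)
  }
  list(pts = pts, n = n, best = best)
}
query <- function(tree, x) {
  r <- findInterval(x, tree$pts)
  unit <- if (x == tree$pts[r]) 2 * r - 1 else 2 * r  # endpoint unit vs gap unit
  node <- 1; lo <- 1; hi <- tree$n; ans <- NULL
  repeat {  # walk root to leaf, keeping the shortest interval seen on the path
    cur <- tree$best[[node]]
    if (!is.null(cur) && (is.null(ans) || diff(cur) < diff(ans))) ans <- cur
    if (lo == hi) break
    mid <- (lo + hi) %/% 2
    if (unit <= mid) {
      node <- 2 * node; hi <- mid
    } else {
      node <- 2 * node + 1; lo <- mid + 1
    }
  }
  ans
}
ivs <- data.frame(low = c(2, 2, 4, 8, 9), high = c(7, 3, 5, 12, 10))
query(build_tree(ivs), 3)  # c(2, 3), matching the example above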

Multiple events in TraMineR

I'm trying to analyse multiple sequences with TraMineR at once. I've had a look at seqdef but I'm struggling to understand how I'd create a TraMineR dataset when I'm dealing with multiple variables. I guess I'm working with something similar to the dataset used by Aassve et al. (as mentioned in the tutorial), whereby each wave has information about several states (e.g. children, marriage, employment). All my variables are binary. Here's an example of a dataset with three waves (D,W2,W3) and three variables.
D<-data.frame(ID=c(1:4),A1=c(1,1,1,0),B1=c(0,1,0,1),C1=c(0,0,0,1))
W2<-data.frame(A2=c(0,1,1,0),B2=c(1,1,0,1),C2=c(0,1,0,1))
W3<-data.frame(A3=c(0,1,1,0),B3=c(1,1,0,1),C3=c(0,1,0,1))
L<-data.frame(D,W2,W3)
I may be wrong, but the material I found deals with the data management and analysis of one variable at a time only (e.g. employment status across several waves). My dataset is much larger than the above, so I can't really input these manually as shown on page 48 of the tutorial. Has anyone dealt with this type of data using TraMineR (or a similar package)?
1) How would you feed the data above to TraMineR?
2) How would you compute the substitution costs and then cluster them?
Many thanks
When using sequence analysis, we are interested in the evolution of one variable (for instance, a sequence of one variable across several waves). You then have several options for analyzing several variables:
Create one sequence per variable and then analyze the links between the clusters of sequences. In my opinion, this is the best way to go if your variables measure different concepts (for instance, family and employment).
Create a new variable for each wave that is the interaction of the different variables of that wave, using the interaction function. For instance, for wave one, use L$IntVar1 <- interaction(L$A1, L$B1, L$C1, drop=TRUE) (use drop=TRUE to remove unused combinations of answers). Then analyze the sequence of this newly created variable. In my opinion, this is the preferred way if your variables are different dimensions of the same concept. For instance, marriage, children and union are all related to family life.
Create one sequence object per variable and then use seqdistmc to compute the distances (multichannel sequence analysis). Depending on how you set the substitution costs (see below), this is equivalent to the previous method.
If you use the second strategy, you could use the following substitution costs. You can count the differences between the original variables to set the substitution costs. For instance, between the states "Married, Child" and "Not married, Child", you could set the substitution cost to 1, because there is a difference only on the "marriage" variable. Similarly, you would set the substitution cost between the states "Married, Child" and "Not married, No Child" to 2, because all of your variables differ. Finally, you set the indel cost to half the maximum substitution cost. This is the strategy used by seqdistmc.
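As an illustration, here is a sketch of that second strategy on the toy data L above; the Hamming-style cost matrix below implements the counting rule just described (it assumes every composite state has the same number of components):
library(TraMineR)
# Interaction states per wave; levels look like "1.0.0"
L$S1 <- interaction(L$A1, L$B1, L$C1, drop = TRUE)
L$S2 <- interaction(L$A2, L$B2, L$C2, drop = TRUE)
L$S3 <- interaction(L$A3, L$B3, L$C3, drop = TRUE)
seqs <- seqdef(L, var = c("S1", "S2", "S3"))
# Substitution cost between two composite states = number of differing components
alph <- alphabet(seqs)
parts <- strsplit(alph, ".", fixed = TRUE)
sm <- outer(seq_along(alph), seq_along(alph),
            Vectorize(function(i, j) sum(parts[[i]] != parts[[j]])))
dimnames(sm) <- list(alph, alph)
d <- seqdist(seqs, method = "OM", sm = sm, indel = max(sm) / 2)  # indel = half the max cost
The distance matrix d can then be clustered, e.g. with agnes(d, diss = TRUE, method = "ward") from the cluster package, as in the answer below.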
Hope this helps.
Biemann and Datta (2013) discuss multidimensional sequence analysis, which means creating multiple sequences for the same "individuals".
I used the following approach to do so:
1) define 3 dimensional sequences
library(TraMineR)
comp.seq <- seqdef(comp,NULL,states=comp.scodes,labels=comp.labels, alphabet=comp.alphabet,missing="Z")
titles.seq <- seqdef(titles,NULL,states=titles.scodes,labels=titles.labels, alphabet=titles.alphabet,missing="Z")
member.seq <- seqdef(member,NULL,states=member.scodes,labels=member.labels, alphabet=member.alphabet,missing="Z")
2) Compute the multi channel (multi dimension) distance
mcdist <- seqdistmc(channels=list(comp.seq,member.seq,titles.seq),method="OM",sm=list("TRATE","TRATE","TRATE"),with.missing=TRUE)
3) cluster it with Ward's method:
library(cluster)
clusterward<- agnes(mcdist,diss=TRUE,method="ward")
plot(clusterward,which.plots=2)
Never mind parameters like "missing" or "left", etc., but I hope the brief code sample helps.

Movement data analysis in R: flights and temporal subsampling

I want to analyse angles in movement of animals. I have tracking data that has 10 recordings per second. The data per recording consists of the position (x,y) of the animal, the angle and distance relative to the previous recording and furthermore includes speed and acceleration.
I want to analyse the speed an animal has while making a particular angle. However, since the temporal resolution of my data is so high, each turn consists of a number of minute angles.
I figured there are two possible ways to work around this problem; I do not know how to achieve either of them in R, and help would be greatly appreciated.
The first: reducing my temporal resolution by a certain factor. However, this brings the disadvantage of losing possibly important parts of the data. Despite this, how would I be able to automatically subsample, for example, every 3rd or 10th recording of my data set?
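For the subsampling part, plain indexing is enough; a one-line sketch, assuming the recordings sit in a data frame track ordered by time:
track_sub <- track[seq(1, nrow(track), by = 3), ]  # keep every 3rd recording (by = 10 for every 10th)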
The second: converting straight movement into so-called 'flights', i.e. rule-based aggregation of steps in approximately the same direction, separated by acute turns (see the figure). A flight between two points ends when the perpendicular distance from the main direction of that flight is larger than x, a value that can be set arbitrarily. Does anyone have an idea how to do that with the x,y coordinate positional data that I have?
It sounds like there are three potential things you might want help with: the algorithm, the math, or R syntax.
The algorithm you need may depend on the specifics of your data. For example, how much data do you have? What format is it in? Is it in 2D or 3D? One possibility is to iterate through your data set. With each new point, you need to check all the previous points to see if they fall within your desired column. If the data set is large, however, this might be really slow. Worst case scenario, all the data points are in a single flight segment, meaning you would check the first point the same number of times as you have data points, the second point one time fewer, etc. That means n + (n-1) + (n-2) + ... + 1 = n(n+1)/2 operations. That's O(n^2); the running time could grow quadratically with the size of your data set. Hence, you may need something more sophisticated.
The math to check whether a point is within your desired column of width x is pretty straightforward, although maybe more sophisticated math could help inform a better algorithm. One approach would be to use vector arithmetic. To take an example, suppose you have points A, B, and C. Your goal is to see if B falls in a column of width x around the vector from A to C. To do this, find a vector v orthogonal to the vector from A to C, then check whether the magnitude of the scalar projection of the vector from A to B onto v is less than x. There is lots of literature available for help with this sort of thing, here is one example.
I think this is where I might start (with a boolean function for an individual point), since it seems like an R function to determine this would be convenient. Then write another function that takes a set of points, calculates the vector v, and calls the first function for each point in the set. Then run some data and see how long it takes.
I'm afraid I won't be of much help with R syntax, although it is on my list of things I'd like to learn. I checked out the manual for R last night and it had plenty of useful examples. I believe this is very doable, even for an R novice like myself. It might be kind of slow if you have a big data set. However, with something that works, it might also be easier to acquire help from people with more knowledge and experience to optimize it.
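For what it's worth, the point-in-column test described above takes only a few lines of R; a sketch (the function name and the inputs are illustrative):
# Is point B within perpendicular distance x of the segment from A to C?
in_column <- function(A, B, C, x) {
  d <- C - A
  v <- c(-d[2], d[1]) / sqrt(sum(d^2))  # unit vector orthogonal to A -> C
  abs(sum((B - A) * v)) <= x            # |scalar projection of A -> B onto v|
}
in_column(c(0, 0), c(2, 0.5), c(5, 0), x = 1)  # TRUE: B deviates by only 0.5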
Two quick clarifying points in case they are helpful:
The above suggestion is just to start with the data for a single animal, so when I talk about growth of data I'm talking about the average data sample size for a single animal. If that is slow, you'll probably need to fix that first. Then you'll need to potentially analyze/optimize an algorithm for processing multiple animals afterwards.
I'm implicitly assuming that the definition of flight segment is the largest subset of contiguous data points where no "sub" flight segment violates the column rule. That is to say, I think I could come up with an example where a set of points satisfies your rule of falling within a column of width x around the vector to the last point, but if you looked at the column of width x around the vector to the second to last point, one point wouldn't meet the criteria anymore. Depending on how you define the flight segment then (e.g. if you want it to be the largest possible set of points that meet your condition and don't care about what happens inside), you may need something different (e.g. work backwards instead of forwards).

The approach to calculating 'similar' objects based on certain weighted criteria

I have a site that has multiple Project objects. Each project has (for example):
multiple tags
multiple categories
a size
multiple types
etc.
I would like to write a method to grab all 'similar' projects based on the above criteria. I can easily retrieve similar projects for each of the above individually (i.e. projects of a similar size, or projects that share a category, etc.), but I would like it to be more intelligent than just choosing projects that either have all of the above in common, or that have at least one of the above in common.
Ideally, I would like to weight each of the criteria: a project that has a tag in common is less 'similar' than a project that is close in size, etc. A project that has two tags in common is more similar than a project that has one tag in common, and so on.
What approach (practically and mathematically) can I take to do this?
The common way to handle this (in machine learning at least) is to create a metric that measures similarity. A Jaccard metric seems like a good match here, given that you have types, categories, tags, etc., which are not really numbers.
Once you have a metric, you can speed up searching for similar items by using a k-d tree, vantage-point tree, or another metric tree structure, provided your metric obeys the triangle inequality (d(a,b) <= d(a,c) + d(c,b)).
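For example, a Jaccard distance over tag sets could look like this (the tags are made up):
# Jaccard distance: 1 - |A intersect B| / |A union B|
jaccard_dist <- function(a, b) 1 - length(intersect(a, b)) / length(union(a, b))
jaccard_dist(c("r", "graphs"), c("r", "web"))  # 2/3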
The problem is that there are obviously infinitely many ways of solving this.
First of all, define a similarity measure for each of your attributes (tag similarity, category similarity, description similarity, ...)
Then try to normalize all these similarities to a common scale, e.g. 0 to 1 with 1 meaning identical, and make sure the values have similar distributions.
Next, assign each feature a weight. E.g. tag similarity is more important than description similarity.
Finally, compute a combined similarity as weighted sum of the individual similarities.
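Put together, the recipe might look like the sketch below; the weights, attributes, and size normalization are all illustrative choices:
jaccard_sim <- function(a, b) length(intersect(a, b)) / length(union(a, b))
project_sim <- function(p, q, w = c(tags = 3, cats = 2, size = 1)) {
  s <- c(tags = jaccard_sim(p$tags, q$tags),                     # already in [0, 1]
         cats = jaccard_sim(p$cats, q$cats),
         size = 1 - abs(p$size - q$size) / max(p$size, q$size))  # size gap mapped to [0, 1]
  sum(s * w) / sum(w)  # weighted sum, normalized back to [0, 1]
}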
There is an infinite number of variations: you can obviously assign arbitrary weights, you already have various choices for the single-attribute similarities, and there are countless ways of normalizing the individual values. And so on.
There are methods for learning the weights; see ensemble methods. However, to learn the weights you need user input on what is a good result and what is not. Do you have such training data?
Start with a value of 100 in each category.
Apply penalties. Like, -1 for each kB difference in size, or -2 for each tag not found in the other project. You end up with a value of 0..100 in each category.
Multiply each category's value by the "weight" of the category (e.g., similarity in size is multiplied by 1, similarity in tags by 3, similarity in types by 2).
Add up the weighted values.
Divide by the sum of weight factors (in my example, 1 + 3 + 2 = 6) to get an overall similarity of 0..100.
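A direct transcription of this scheme (field names and penalty rates are illustrative):
score <- function(p, q, w = c(size = 1, tags = 3, types = 2)) {
  s <- c(size  = max(0, 100 - abs(p$size_kb - q$size_kb)),            # -1 per kB of difference
         tags  = max(0, 100 - 2 * length(setdiff(p$tags,  q$tags))),  # -2 per tag not shared
         types = max(0, 100 - 2 * length(setdiff(p$types, q$types))))
  sum(s * w) / sum(w)  # overall similarity in 0..100
}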
Whether you can reduce the comparison of projects below the initial O(n^2) (i.e. comparing each project with every other) depends heavily on context. It might be the real crux of your software, or it might not be necessary at all if n is low.
