Plot movement over time in (preferably) Google Maps

I have a spreadsheet with columns for person, date, event, place name, latitude, and longitude. This is the result of many years of genealogical research that shows the birth, marriage, and death locations for several hundred of my direct ancestors as they migrated across the world and finally converged in South Africa for the last few generations.
I'd very much like to create an animation or video showing their movements over time, preferably with a marker flashing at each location and then fading away, with or without lines linking the markers for the duration of a person's life. Nine generations back this would show 512 births happening at roughly the same time, then those people converging on 256 places as couples married; between those 256 marriages and the original 512 deaths, the 256 births of the next generation would flash on, and so on, finally converging on just my birth. I believe such an animation would be an excellent way to make a vast family tree accessible visually, and other genealogical researchers would probably also enjoy doing this. The ability to automatically zoom in on the bounding box of the locations at any given time would be needed to show movements within a smaller geographic area, but first and foremost I simply want to plot points over time.
Does anyone know of a free or commercial tool that would let me do this? I have explored this in most genealogical software packages, but they offer only limited tools that show one person or one couple at a time, so I suspect I'm going to have to plug the data into a generic 'plot movement over time' tool in a good map service.
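For what it's worth, here is a minimal sketch of the kind of 'plot points over time' I have in mind, in R with ggplot2 and gganimate (the file name and column names are placeholders for my spreadsheet, and the basemap needs the maps package installed):
library(ggplot2)
library(gganimate)
# placeholder file with columns: person, date, event, place, lat, lon
events <- read.csv("ancestors.csv")
events$year <- as.numeric(format(as.Date(events$date), "%Y"))
p <- ggplot(events, aes(x = lon, y = lat)) +
  borders("world", colour = "grey70", fill = "grey90") +  # simple world basemap
  geom_point(colour = "red", size = 2) +
  coord_quickmap() +
  transition_time(year) +            # one frame per year
  shadow_wake(wake_length = 0.05) +  # markers fade away after flashing on
  labs(title = "Year: {round(frame_time)}")
animate(p, nframes = 300, fps = 10)
# anim_save("ancestors.gif")
It does not do the automatic bounding-box zoom, but it shows the basic 'flash and fade' effect I'm after.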

I have used GraphXR to plot family tree members linked on one of its several maps, with the edges being either a birth, marriage, or death date. The data is queried from Neo4j, which integrates seamlessly with GraphXR.
I'm now working on a Neo4j plugin for genealogy and collaborating with the GraphXR developers to make such visualizations easier for end users.
It's not exactly what you are looking for, but it may be helpful.
http://gfg.md/blogpost/7

Related

What algorithm can extrapolate traffic data on a graph used for routing (OSM)?

I'm planning to use one of the popular routing projects OSRM, GraphHopper, or Valhalla to find the fastest route, including historical traffic data.
Question: I don't have traffic data for all graph edges representing roads (only for a subset), and the missing data has to be extrapolated. What mathematical tools (or ready-made solutions) can I use to extrapolate/fill/guess the missing traffic data, given the following assumptions:
Drivers perform ideal arbitrage: taking a route without (or partially without) traffic data shouldn't give an advantage.
The routing queries will be limited to an area of a typical European city, say 25 km x 25 km, which results in a fairly small graph.
The solution can be either geo-agnostic (relying only on nodes and edge weights) or spatially aware, i.e. taking into account the physical proximity (or direction!) of graph edges.
Any heuristic can take advantage of the fact that the routing happens during morning or evening rush hours, when traffic differs significantly depending on direction.
Thank you for your help!
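To make the question more concrete, the simplest geo-agnostic baseline I can think of is to repeatedly fill each unknown edge with the mean of the known weights on adjacent edges (edges sharing a node). A rough R sketch with a made-up toy edge list, just to illustrate the kind of answer I'm after:
edges <- data.frame(
  from  = c("A", "B", "C", "C", "D"),
  to    = c("B", "C", "D", "E", "E"),
  speed = c(30,  NA,  45,  NA,  50)   # rush-hour speed; NA = no traffic data
)
for (iter in 1:10) {
  for (i in which(is.na(edges$speed))) {
    # neighbours = edges sharing an endpoint with edge i
    nb <- edges$from %in% c(edges$from[i], edges$to[i]) |
          edges$to   %in% c(edges$from[i], edges$to[i])
    nb[i] <- FALSE
    known <- edges$speed[nb & !is.na(edges$speed)]
    if (length(known) > 0) edges$speed[i] <- mean(known)
  }
}
edges   # missing speeds are now filled with neighbourhood means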

Recommendation systems - converting transaction counts to star ratings

I'm doing some exploratory work on recommendation systems and have been reading about collaborative filtering techniques involving user-based, item-based, and SVD algorithms. I am also trying out R's recommenderlab package.
One apparent assumption in the literature is that the user data has labelled items based on a rating scale, e.g. between 1 and 5 stars. I'm looking at problems where the user data does not have ratings but rather just transactions. For example, if I want to recommend restaurants to a user, the only data I have is how often he has visited other restaurants.
How can I convert these "transaction" counts into ratings that can be used by recommendation algorithms that expect a fixed-scale rating? One approach I thought of is simple binning:
0 stars = 0-1 visits
1 star = 2-3 visits
...
5 stars = 10+ visits
However, that doesn't seem like it would work well. For example, if someone visited a restaurant only once, he may still really love it.
Any help would be appreciated.
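For concreteness, the binning above is trivial to express, e.g. in R (the intermediate break points are just one way to fill in the "..." of my scale, and the visit counts are made up):
visits <- c(0, 1, 2, 3, 5, 7, 9, 12, 20)            # made-up visit counts
stars  <- cut(visits,
              breaks = c(-Inf, 1, 3, 5, 7, 9, Inf), # 0-1, 2-3, 4-5, 6-7, 8-9, 10+
              labels = 0:5)
data.frame(visits, stars)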
I would try different approaches. As you said, a single visit may still mean that the user loves the restaurant, but you don't know for sure. Your goal is not to optimize for one single user but for all users. So you can split your data into training and test sets, train on the training data with different scales, and evaluate on the test data.
The different scales could be:
a binary scale (0: never visited, 1: visited). This is mostly used in online shops (bought or not) and would cover your concern about the one-time visit.
your proposed scale, or other ranges for the 5 stars. You could also use more than 5 stars. I would probably not group 0 and 1 visits together.
The approach with the best accuracy should be chosen.
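A rough sketch of that comparison with recommenderlab (the visit matrix is simulated and the cut-offs are arbitrary; the point is just the evaluationScheme / Recommender / calcPredictionAccuracy workflow, repeated once per candidate scale):
library(recommenderlab)
set.seed(1)
# simulated user x restaurant visit counts, only to demonstrate the workflow
visits <- matrix(rpois(100 * 30, lambda = 2), nrow = 100,
                 dimnames = list(paste0("u", 1:100), paste0("r", 1:30)))
# one candidate scale: the 0-5 star binning from the question
stars <- matrix(as.numeric(cut(visits, c(-Inf, 1, 3, 5, 7, 9, Inf))) - 1,
                nrow = nrow(visits), dimnames = dimnames(visits))
r <- as(stars, "realRatingMatrix")
# hold out part of the data, train a recommender, measure prediction accuracy
scheme <- evaluationScheme(r, method = "split", train = 0.8,
                           given = 3, goodRating = 3)
rec    <- Recommender(getData(scheme, "train"), method = "UBCF")
pred   <- predict(rec, getData(scheme, "known"), type = "ratings")
calcPredictionAccuracy(pred, getData(scheme, "unknown"))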
Here's an idea: restaurants the user has visited zero or one times tell you nothing about what they like. Restaurants they have visited many times tell you lots. Why not just look for restaurants similar to those the customer most regularly frequents? In this way, you're using positive information (what they like) but none of the negative since you don't have access to it anyway.
If you absolutely had to infer some continuous measure, I think it would only be sensible to look at the propensity for another visit given past behaviour. This would start with the prior probability of choosing that restaurant (background frequency, or just uniform over restaurants) with a likelihood term related to the number of visits to that restaurant. In this way the more a user visits a restaurant the more likely they are to visit again.
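As a rough sketch of that idea (uniform prior over restaurants and an arbitrary pseudo-count strength, both of which are my own choices here):
visits <- c(sushi = 7, pizzeria = 1, burgers = 0, tapas = 3)  # made-up counts
alpha  <- 2                                   # pseudo-count strength (arbitrary)
prior  <- rep(1 / length(visits), length(visits))  # uniform prior over restaurants
# propensity for another visit: prior pulled towards the observed visit frequencies
propensity <- (visits + alpha * prior) / (sum(visits) + alpha)
round(propensity, 3)
# the more a user visits a restaurant, the higher the estimated probability
# that the next visit goes there; unvisited places keep a small prior mass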

Any free mapping service to display and filter 250000+ datapoints?

I participated in a hackathon in my city, where the traffic department made public a dataset of more than 250 thousand traffic-accident datapoints, each containing latitude, longitude, type of accident, vehicles involved, etc.
I made a test displaying the data using the Google Maps API and Google Fusion Tables, but the usage limits were quickly reached with just the first two of the 13 years of records.
The data for two years can be displayed and filtered here.
So my question is:
Which free online services could I use in order to interactively display and filter 250 thousand such datapoints as map layers?
It is important that the service be free, because we are volunteering our time for the non-profit public good. Our City Hall is currently implementing an API, but it is not ready yet, and it would be useful to show them some well-accepted use cases to build political pressure for further API development on THEIR server (especially remote querying of a database instead of crawling a bunch of .csv files, as it is now...).
An alternative would be to put everything on GitHub and load the whole dataset client-side to be manipulated with D3.js, for example, but that seems very inefficient both for the client/user and for the server.
Thanks for reading, and feel free to re-tag if needed.
You need Google Maps API for Business to achieve what you want, but it costs a lot of money.
However, in some cases you can get this Business licence for free if you work for a non-profit organization. I couldn't find the exact eligibility rules for the free licence; I tried googling them but didn't find anything. I only found this link; take a look to see if it answers your problem.
You should be able to do that with Google Fusion Tables. The limit is 100,000 points per table, but you can overlay 5 layers onto a single map so in effect you can reach 500,000 points. I implemented the website below and have run it with over 200,000 points.
http://www.skyscan.co.uk/mapsearch.html
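If the 13 years of records sit in one big CSV, splitting them into sub-100,000-row files (one per Fusion Table layer) is straightforward, e.g. in R (file and column names are placeholders):
accidents <- read.csv("accidents_13_years.csv")     # placeholder file name
chunk_size <- 100000                                # Fusion Tables per-table limit
chunks <- split(accidents, ceiling(seq_len(nrow(accidents)) / chunk_size))
for (i in seq_along(chunks)) {
  write.csv(chunks[[i]], sprintf("accidents_layer_%d.csv", i), row.names = FALSE)
}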

Get Annual Financial Data for a Stock for many years in R

Suppose I want to regress Gross Profit on Total Revenue in R. I need data for this, and the more, the better.
There is a package on CRAN that I find very useful, quantmod, which does what I need.
library(quantmod)
# download the financial statements; this assigns them to AMD.f
getFinancials("AMD", src = "google")
# to get the row names of the annual income statement: rownames(AMD.f$IS$A)
Total.Revenue <- AMD.f$IS$A["Revenue", ]
Gross.Profit  <- AMD.f$IS$A["Gross Profit", ]
# finally:
reg1 <- lm(Gross.Profit ~ Total.Revenue)
The biggest issue I have is that this package only gets me data for 4 years (4 observations, and who runs a regression with only 4 observations???). Is there any other way (maybe other packages) to get data for MORE than 4 years?
I agree that this is not an R programming question, but I'm going to make a few comments anyway before this question is (likely) closed.
It boils down to this: getting reliable fundamental data across sectors and markets is difficult enough even if you have money to spend. If you are looking at the US then there are a number of options, but all the major (read 'relatively reliable') providers require thousands of dollars per month - FactSet, Bloomberg, Datastream and so on. For what it's worth, for working with fundamental data I prefer and use FactSet.
Generally speaking, because the Excel tools offered by each provider are more mature, I have found it easier to populate spreadsheets with the data and then read the data into R. Then again, I typically deal with the fundamentals of a few dozen companies at most, because once you move out of the domain of your "known" companies the time it takes to check anomalies increases exponentially.
There are numerous potential "gotchas". The most obvious is that definitions vary from sector to sector: "sales" for an industrial company is very different from "sales" for a bank, for example. Another problem is changes in definitions. Pretty much every year some accounting regulation changes and breaks your data series: last year minority interests were reported in one place, but this year the item has moved to another position in the P&L, and so on.
Another problem is companies themselves changing. How does one deal with mergers, acquisitions and spin-offs, for example? This sort of thing can make measuring organic sales growth next to impossible. Yet another point to bear in mind is that if you're dealing with operating or net profit, you have to consider exceptionals and whether to adjust for them.
Dealing with companies outside the US adds a whole bunch of further problems. Of course, the major data providers try to standardise globally (FactSet Fundamentals, for example), but this just adds another layer of abstraction, and typically it is hard to check how the data has been manipulated.
In short, getting the data is onerous and I know of no reliable free sources. Unless you're dealing with the simplest items for a very homogenous group of companies, this is a can of worms even if you do have the data.
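As mentioned above, my own workflow is to populate a spreadsheet and then read it into R; for that last step, something like this is all it takes (file and column names are placeholders; readxl is just one of several packages that can read Excel files):
library(readxl)                               # one option for reading Excel files
fin <- read_excel("amd_fundamentals.xlsx")    # placeholder export, one row per year
reg <- lm(GrossProfit ~ TotalRevenue, data = fin)
summary(reg)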

How do Alexa and Google Analytics track demographics?

How are services like Alexa and Google Analytics capable of tracking visitors' age, gender, college education, and so forth?
http://www.alexa.com/siteinfo/stackoverflow.com
Alexa definitely gets its traffic info from its toolbar users. Since that is a relatively small and self-selecting group of people, this inevitably leads to a biased sample (which is why Alexa traffic doesn't match measured traffic on the sites I run). Even with the best statistical techniques for reducing bias, you can never get rid of it entirely when the sampling distribution is not uniform.
It's unclear how Google does it, although it might involve tracking cookies.
A project I have been working on recently has bearing on this question.
Another way to do this (that also has biases, but different ones) would be to use an IP to location service to find the approximate latitude and longitude of each visitor to your site. Then use my project (full disclosure: I run that site and it is commercial):
http://askgeo.com
to get demographic information for that location. AskGeo provides demographic information at several geographic levels (state, county, county subdivision, city, ZIP code, census tract (a few thousand people), and census block group (about a thousand people)). You'd presumably want to use the lowest level (i.e., census block group) for a given latitude and longitude.
The site returns a huge number of demographic variables. The idea would be to use soft counts from the demographic variables provided at the block-group level. For example, if you are trying to track the age distribution of your users, you'd take the age ranges provided in the AskGeo response and, for each visitor, add to each range a fractional soft count equal to the percentage of that block group's population in the corresponding age range. For example, take my neighborhood in San Francisco. It has the following age distribution:
CensusAgePercent0To4: 7.3%
CensusAgePercent5To9: 3.5%
CensusAgePercent10To14: 3.2%
... (skipping a bit, as you probably get the idea) ...
CensusAgePercentOver85: 1.5%
If you got an IP address that you tracked to that census block group, you'd add each of those percentages (as a fraction from 0 to 1) to your (soft) counters for those age ranges. (A soft counter is just a counter that allows for non-integer counts.)
You could do the same with race, gender, income level, house values, etc.
This method also has biases, for sure, since it assumes that all the people in a given block group are equally likely to visit your site. But it is something that you can do on your own site, not just Google and Alexa, and it would still give you a relative sense of who is visiting your site if your soft counts in a given category are higher than the national average in that category.
It is also possible that a more sophisticated technique than simple direct counts could lead to a much richer result.
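A sketch of the soft-counting idea (the table of block-group age fractions below is made up; in practice each row would come from the AskGeo response for a visitor's estimated location):
age_ranges <- c("0-4", "5-9", "10-14", "15-19", "20-64", "65+")
# made-up block-group age distributions (each row sums to 1)
block_groups <- rbind(
  bg_sf_001 = c(0.07, 0.04, 0.03, 0.04, 0.70, 0.12),
  bg_sf_002 = c(0.05, 0.05, 0.05, 0.06, 0.62, 0.17)
)
colnames(block_groups) <- age_ranges
# block group that each visitor's IP address resolved to (also made up)
visitors <- c("bg_sf_001", "bg_sf_001", "bg_sf_002")
# each visitor contributes fractional (soft) counts to every age range
soft_counts <- colSums(block_groups[visitors, , drop = FALSE])
round(soft_counts, 2)   # estimated age distribution of the three visitors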
I did some research, and apparently these demographics are tracked the same way TV audience demographics are tracked. There are people who browse with Alexa's toolbar installed, which keeps track of the sites they visit. These people willingly (?) supply information like age, gender, etc., and Alexa extrapolates the general demographics from this sample. This of course leaves room for bias, but that's a problem with statistics.
Alexa gets its information from browser toolbars that you install on purpose or as part of a bundle with some other software.
The toolbar asks questions to learn demographic parameters and also tracks the sites you visit. If you know that 80% of a site's visitors are women and a new visitor goes to that site, you can assume there is a high probability that this person is a woman. If you know many of the sites a person visits, you can guess a lot.
But as http://netberry.co.uk/alexa-rank-explained.htm explains, you can only rely on the information for the Alexa Top 100,000 sites, because only then does Alexa have enough data from the relatively small number of users visiting them. They say "millions", but it's a small share of the total.
