I am working with the RecordLinkage library in R.
I have a data frame with id, name, phone and mail columns.
My code looks like this:
ids = data$id
pairs = compare.dedup(data, identity=ids, blockfld=list(2,3,4))
The problem is that the ids in my result output are not the same as the ids in my data.
For example, if I had this data:
id Name Phone Mail
233 Nathali 2222 nathali#dd.com
435 Nathali 2222
553 Jean 3444 jean#dd.com
In my result output I will have something like
id1 id2
1 2
Instead of
id1 id2
233 435
I want to know if there is a way to keep the ids instead of the row index, or if someone could explain the identity parameter to me.
Thanks
The identity vector tells the getPairs method which of the input records belong to the same entity. It actually holds the information that you usually want to gain from record linkage, i.e. you have a couple of records and do not know in advance which of them belong together. However, when you use a training set to calibrate a method, or you want to evaluate the accuracy of record linkage methods (the package was mainly written for this purpose), you start with an already deduplicated or linked data set.
In your example, the first two rows (ids 233, 435) obviously refer to the same person and the third row to a different one. A meaningful identity vector would therefore be:
c(1,1,2)
But it could also be:
c(42,42,128)
Just make sure that the identity vector has identical values exactly at those positions where the corresponding table rows hold matching records (vector index = row index).
About your question on how to display the ids in the result: You can get the full record pairs, including all data fields, with (see the documentation for more details):
getPairs(pairs)
There might be better ways to get hold of the original ids, depending on how you further process the record pairs (e.g. running a classification algorithm). Extend your example if you need more advice on this.
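If it helps, here is a minimal sketch of the whole workflow (the data frame just mirrors the example above, and the identity values 1, 1, 2 are arbitrary labels):
library(RecordLinkage)

# toy data mirroring the example in the question
data <- data.frame(
  id    = c(233, 435, 553),
  Name  = c("Nathali", "Nathali", "Jean"),
  Phone = c("2222", "2222", "3444"),
  Mail  = c("nathali#dd.com", "", "jean#dd.com"),
  stringsAsFactors = FALSE
)

# same value = same real-world entity: rows 1 and 2 are the same person
identity.vec <- c(1, 1, 2)

pairs <- compare.dedup(data, identity = identity.vec, blockfld = list(2, 3, 4))

# getPairs() returns the full records of each pair, so the original id
# column (233, 435, ...) is part of the output
getPairs(pairs)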
p.s.: I am one of the package authors. I have only very recently become aware that people ask questions about the package on Stack Overflow, so please excuse that a couple of questions have been around unanswered for a long time. I will look for a way to get notified on new questions posted here, but I would also like to mention that people can contact us directly via one of the email addresses listed in the package information.
You have to replace the index column with your identity column.
I have a production model where the orders (an agent population) run through different stations. For each order, the stations used and their sequence can be different. There are 12 different combinations of these stations. One random possibility should be assigned to each order.
How can I store and assign these possibilities in my AnyLogic model? Which datatype would fit best?
What I already tried was to use the Excel interface, but as I later want to combine different possibilities into a longer list (about 50 possibilities combined with each other), Excel does not seem to be the best way to do it.
I’m sure this is not a super hard problem, but I couldn’t find anything about it. Thanks in advance!
Hopefully I understand your question, so here it goes. The following is the model I propose:
Here the example has 3 stations (services). You put all the Enter blocks in a collection called enterBlocks and all the names of the Enter blocks in a collection called enterNames. So if you use Excel, you can keep the Enter block names in your Excel file and initialize your enterNames collection at the beginning of the model by reading the Excel. Each agent will probably have a different collection, so the collection should live inside the agent, but here I'm just simplifying.
Then you use a counter (initial value 0) and a function called getNextService that is called in each one of the 4 exit blocks. This function chooses the next station to use:
if (counter >= enterBlocks.size()) // the agent is done with all its stations
    end.take(agent);               // send the agent to the exit
else {
    // find the Enter block whose name matches the next entry in enterNames
    Enter enter = findFirst(enterBlocks, e -> e.getName().equals(enterNames.get(counter)));
    enter.take(agent); // send the agent to the correct station
    counter++;         // advance to the next station in the sequence
}
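To assign one of the 12 possible sequences at random to an order, you could initialize enterNames when the agent starts, along these lines (a rough sketch; the sequence contents and block names are made up, adapt them to your model):
// define the possible station sequences (only 2 of the 12 shown here as an example)
List<List<String>> possibleSequences = new ArrayList<>();
possibleSequences.add(Arrays.asList("enterStation1", "enterStation3"));
possibleSequences.add(Arrays.asList("enterStation2", "enterStation1", "enterStation3"));

// pick one sequence at random (uniform_discr is AnyLogic's discrete uniform function)
// and use it as this agent's enterNames collection
enterNames.addAll(possibleSequences.get(uniform_discr(0, possibleSequences.size() - 1)));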
I'm trying to figure out how to run a filtered query with Geofire.
Suppose I have restaurants with different categories, and I want to add a category filter to my query. How do I go about this?
The one way I have now is to query the keys with Geofire, loop through each key to fetch the restaurant, and insert the matching restaurants into an array.
This seems so inefficient. Is there any other way to go about it?
Ideally I will have the filtered results, and only load each item when they're about to be shown.
Cheers!
Firebase queries can only filter by one condition. Geofire already does quite some "magic" to allow it to filter on both longitude and latitude. Adding another property to that equation might be possible, but is well beyond what Geofire handles by default. See GeoFire: How to add extra conditions within the query?
If you only ever want to access one category at a time, you can put the restaurants in a top-level node per category and point Geofire to one category.
/category1
    item1
        g: "pns0h0mf2u"
        l: [-53.435719, 140.808716]
    item2
        g: "u417k3dwub"
        l: [56.83069, 1.94822]
/category2
    item3
        g: "8m3rz3s480"
        l: [30.902225, -166.66809]
/items
    item1: ...
    item2: ...
    item3: ...
In the above example, we have two categories: category1 with 2 items and category2 with just 1 item. For each item, we see the data that Geofire uses: a geohash (g) and the location (l). We also keep a single list with the other properties of these 3 items.
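A minimal sketch of how that could look with the JavaScript GeoFire library (node names from the example above; the center and radius values are placeholders):
var geoFire = new GeoFire(firebase.database().ref("category1"));

var geoQuery = geoFire.query({
  center: [37.78, -122.41], // example center
  radius: 10                // in kilometers
});

geoQuery.on("key_entered", function(key, location, distance) {
  // key is e.g. "item1"; the remaining properties live under /items
  firebase.database().ref("items").child(key).once("value", function(snap) {
    console.log(key, snap.val());
  });
});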
But more commonly, you simply do the extra filtering in client-side code. If you're worried about the performance of that: measure it, share the code, JSON data and measurements.
This is an old question, but I've seen it in a few places on the web, so I thought I might share one trick I've used.
The Problem
If you have a large collection in your database, maybe containing hundreds of thousands of keys, for example, it might not be feasible to grab them all. If you're trying to filter results based on location in addition to other criteria, you're stuck with something like:
Execute the location query
Loop through each returned geofire key and grab the corresponding data in the database
Check each returned piece of data to see if it matches the other criteria
Unfortunately, that's a lot of network requests, which is quite slow.
More concretely, let's say we want to get all users within e.g. 100 miles of a particular location that are male and between ages 20 and 25. If there are 10,000 users within 100 miles, that means 10,000 network requests to grab the user data and compare their gender and age.
The Workaround:
You can store the data you need for your comparisons in the geofire key itself, separated by a delimiter. Then, you can just split the keys returned by the geofire query to get access to the data. You still have to filter through them, but it's much faster than sending hundreds or thousands of requests.
For instance, you could use the format:
UserID*gender*age, which might look something like facebook:1234567*male*24. The important points are
Separate data points by a delimiter
Use a valid character for the delimiter -- "It can include any unicode characters except for . $ # [ ] / and ASCII control characters 0-31 and 127."
Use a character that is not going to be found elsewhere in your database - I used *, but that might not work for you. Do not use any characters from -0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz, since those are fair game for keys generated by Firebase's push()
Choose a consistent order for the data - in this case, UserID first, then gender, then age.
You can store up to 768 bytes of data in Firebase keys, which goes a long way.
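As a rough sketch (the property names and the geoQuery variable are just examples), composing and splitting such keys could look like this:
// build the key when you write the location to geofire
var key = ["facebook:1234567", "male", "24"].join("*"); // "facebook:1234567*male*24"

// split the keys returned by the geofire query and filter locally,
// without an extra network request per key
geoQuery.on("key_entered", function(key, location, distance) {
  var parts  = key.split("*");
  var userId = parts[0], gender = parts[1], age = Number(parts[2]);
  if (gender === "male" && age >= 20 && age <= 25) {
    // match: only now fetch this user's full profile
  }
});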
Hope this helps!
I'm fairly new to Tableau, and I'm struggling to build some routines that could be easily implemented in Excel (though it would take forever for big sets of data).
So here is the deal, consider a dataset with the following fields:
int [id_order] -> id of the sales order (deepest level, there are only unique entries of id_order)
int [id_client] -> as I want to know who bought it
date [purchase_date] -> when the customer bought the product
What I want to know is, for each order, when was the last time (if ever) that the client bought something. In other words, what is the highest purchase_date for that user that is smaller than the current purchase_date.
In Excel, the approach is simple (but, again, not efficient):
{=MAX(IF(id_client=B1,IF(purchase_date<C1,purchase_date)))}
Is there a way to do this kind of calculation in Tableau?
You can do this in Tableau using table calculations. They take a little time to understand how to use well, but are very powerful and flexible. I posted a sample Tableau workbook for a similar question in an answer for SO question Find first time a condition is met
Your situation is similar, but with the extra complication that you want to repeat the analysis for each client id, so you might want to try a recursive approach using the Previous_Value() function instead of the approach used in that example - though I'm not certain that previous_value() will fit your situation.
Still, it might be helpful to download the example workbook I mentioned to get an idea how table calculations can address similar problems.
Just to record the solution, in case someone has the same question.
So, basically, the solution I found uses a table calculation, which is not evaluated until it's placed on a sheet (and is only evaluated in the context of that sheet). That's a little bit limiting, so what I do is create a sheet with all the fields I need (plus whatever is necessary for the table calculation), then export the data (to .mdb) and connect to this new file.
So, for my example, the right table calculation is (let's name it last_order_date):
LOOKUP(MAX([purchase_date]),-1)
Explanations: the MAX() is necessary because LOOKUP (like all table calculations) does not work with data directly, only with aggregations. You can use SUM, AVG, MAX, ATTR, whatever suits you. As in my case there will be only one corresponding row, any of these functions will do just fine and return the same value.
The -1 indicates that I'm looking for the element immediately before the current entry (of the table, as you define it). If it were FIRST(), it would go for the first entry of the table, and LAST() would go for the last.
Now, I have to put it on a sheet. So I'll bring the fields id_client, id_order, purchase_date and last_order_date.
Then I have to define the parameters of my table calculation last_order_date (Edit Table Calculation). I go to Compute Using and choose Advanced. There I set Partitioning to id_client and Addressing to all the rest. What does that do? It means Tableau will create a temporary partition for each id_client, and the table calculation will use those partitions as its scope.
Additionally, I sort by the field purchase_date, aggregated with MAX (again because of the aggregation issue), ascending, to guarantee my entries are in chronological order.
Now, what does the calculation do? For each entry it looks at the partition of that id_client and finds the purchase_date immediately before the current entry (the one being assessed), which is exactly what I need.
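Putting the pieces together, the calculated field and its settings look roughly like this (the comments just restate the Compute Using choices and are not part of the formula):
// last_order_date (table calculation)
// Compute Using -> Advanced:
//   Partitioning: id_client
//   Addressing:   everything else (id_order, purchase_date)
//   Sort:         MAX(purchase_date), ascending
LOOKUP(MAX([purchase_date]), -1)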
To avoid spending Tableau processing time on the visualization, I often put all the fields on Detail (and leave nothing else on screen) and use a bar chart (it's good because it allows me to see the data). Then I export it to .mdb and connect to it again. Unfortunately, Tableau doesn't directly export to .tde.
I am building an application in ASP.NET, C#, MVC3 and SQL Server 2008.
A form is presented to a user to fill out (name, email, address, etc).
I would like to allow the admin of the application to add extra, dynamic questions to this form.
The number of extra questions and the type of data returned will vary.
For instance, the admin could add 0, 1 or more of the following types of questions:
Have you a full, clean driving licence?
Rate your driving skills from 1 to 5.
Describe the last time you went on a long journey?
etc ...
Note, that the answers provided could be binary (Q.1), integer (Q.2) or free text (Q.3).
What is the best way of storing random data like this in MS SQL?
Any help would be greatly appreciated.
Thanks in advance.
I would create a table with the following columns and store the name of the variable along with its value in the appropriate column, leaving all other value columns null.
id: int (primary key)
name: varchar(100)
value_bool: bit (nullable)
value_int: int (nullable)
value_text: varchar(100) (nullable)
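For illustration, a minimal sketch of that table in T-SQL (the table and column names are just examples; you would probably also want a reference to the form submission each answer belongs to):
CREATE TABLE ExtraAnswer
(
    id         INT IDENTITY(1,1) PRIMARY KEY,
    name       VARCHAR(100) NOT NULL, -- the question/variable name
    value_bool BIT          NULL,     -- yes/no answers
    value_int  INT          NULL,     -- numeric answers
    value_text VARCHAR(100) NULL      -- free-text answers
);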
Unless space is an issue, I would use VARCHAR(MAX). It gives you up to 8,000 characters and stores numbers and text.
edit: Actually, as Aaron points out below, that will give you 2 billion characters (enough for a book). You might go with VARCHAR(8000) or the like then, which does give you up to 8,000 characters. Since it is VARCHAR, it will not take up empty space (so a 0 or 1 will not take up 8,000 characters' worth of space, only 1).
I have two different databases that are not connected in any way. In fact, one is a public school database and one is a HUD (housing) database. By law they are not allowed to share names and other specific identifying information; birthdates and addresses are okay, along with zip codes and other more general ids. The users need to be able to query the other database to get non-specific information, so it would appear that they need to share the same unique id. I was considering such things as using birthdates and perhaps initials of the name, or perhaps the last 4 digits of the SSN along with the birthdate. The client was thinking of global positioning data, but I'm concerned about apartments next to one another or families moving. Any ideas?
First you need to determine what your measure of uniqueness will be. If there are two people in either database who share the same value for that measure of uniqueness, you need to change your strategy. After that, put a constraint on both databases enforcing that these properties (birthdate, SSN) are what make a Person record unique.
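As a rough sketch only (SQL Server syntax; the table and column names are made up, and the exact fields depend on what you settle on as the measure of uniqueness), both sides could derive the same pseudonymous key like this:
-- derive a shared key from birthdate + last 4 digits of the SSN,
-- hashed so neither side exposes the raw values
SELECT CONVERT(VARCHAR(64),
               HASHBYTES('SHA2_256',
                         CONCAT(CONVERT(VARCHAR(10), birth_date, 112), '|', ssn_last4)),
               2) AS match_key
FROM Person;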