I am looking to run a model that randomly matches 2 datasets (person and vacancy) based on matching characteristics of both datasets.
A person can have a role type, a location and other attributes, and the vacancies are looking for these characteristics.
The current methodology uses a for loop to work through the vacancies, subset the person table based on the matching characteristics, and randomly pick a person.
Rough outline of current code:
for (i in 1:nrow(Vacancy)) {
  individual_vacancy <- Vacancy[i, ]
  # subset to people who match this vacancy's characteristics and are still available
  available_person <- person[...matching conditions from individual_vacancy... & person$Available == 1, ]
  sampled_id <- sample(available_person$personid, 1)
  Vacancy$personid[i] <- sampled_id
  person$Available[person$personid == sampled_id] <- 0
}
This is very slow and computationally expensive due to the size of the datasets and, as far as I can tell, the looping and writing back to the original datasets.
Are there any methodologies for this kind of problem, or R packages I could take advantage of?
Edit: this is a 1:1 matching problem, which is where the difficulty arises, i.e. one person must be allocated to exactly one vacancy. The Available flag update ensures this at the moment.
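For reference, a minimal sketch of a vectorized alternative, assuming the matching conditions are simple equality on a couple of columns (here hypothetically role and location) and using data.table. Each vacancy and each available person gets a random rank within its characteristic group, and the two tables are joined on the group plus the rank, which gives a random 1:1 allocation without looping or flag updates:

library(data.table)
setDT(person)
setDT(Vacancy)

# random rank of each vacancy within its characteristic group
Vacancy[, slot := sample(.N), by = .(role, location)]

# random rank of each available person within the same groups
person[Available == 1, slot := sample(.N), by = .(role, location)]

# join on the characteristics plus the random rank:
# each vacancy gets at most one person, and each person is used at most once
matched <- person[Available == 1][Vacancy, on = .(role, location, slot)]

If a group contains more vacancies than available persons, the surplus vacancies come back with NA, which mirrors the loop running out of candidates.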
My question is somewhat similar to this one, but I want to ask whether the column order matters or not. I have some time series data. For each cycle I computed some features (let's call them var1, var2, ...). I now train the model using the following column order, which of course will be consistent for the test set.
X_train = data[['var1', 'var2', 'var3', 'var4']]
After watching this video I've concluded that the order in which the columns appear is significant, i.e. that swapping var1 and var3 as:
X_train = data[['var3', 'var2', 'var1', 'var4']]
I would get a different loss function.
If the above is true, then how does one figure out the correct feature order to minimize the loss function, especially when the number of features could be in the dozens?
I previously worked on a project where we examined some sociological data. I did the descriptive statistics and after several months, I was asked to make some graphs from the stats.
I made the graphs, but something seemed odd, and when I compared the graphs to the numbers in the report, I noticed that they were different. Upon investigating further, I noticed that my cleaning code (which removed participants with duplicate IDs) now results in more rows, i.e. more participants with unique IDs than previously. I now have 730 participants, whereas previously there were 702. I don't know if this was due to updates of some packages, and unfortunately I cannot post the actual data here because it is confidential, but I am trying to find out who these 28 participants are and what happened in the data.
Therefore, I would like to know if there is a method that allows the user to filter the cases so that the mean of some variables is a set number. Ideally it would be something like this, but of course I know that it's not going to work in this form:
iris %>%
  filter_if(mean(.$Petal.Length) == 1.3)
I know that this was an incorrect attempt but I don't know any other way that I would try this, so I am looking for help and suggestions.
I'm not convinced this is a tractable problem, but you may get somewhere by doing the following.
Firstly, work out what the sum of the variable was in your original analysis, and what it is now:
old_sum <- 702 * old_mean
new_sum <- 730 * new_mean
Now work out what the sum of the variable in the extra 28 cases would be:
extra_sum <- new_sum - old_sum
This allows you to work out the relative proportions of the sum of the variable from the old cases and from the extra cases. Put these proportions in a vector:
contributions <- c(extra_sum/new_sum, old_sum/new_sum)
Now, using the functions described in my answer to this question, you can find the optimal solution to partitioning your variable to match these two proportions. The rows which end up in the "extra" partition are likely to be the new ones. Even if they aren't the new ones, you will be left with a sample that has a mean that differs from your original by less than one part in a million.
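If the partitioning functions are not to hand, a cruder alternative is a randomized search for the 28 rows whose sum of the variable is closest to extra_sum. This is only a sketch; new_data and value are hypothetical names for your current data frame and the variable in question:

set.seed(1)
best_rows <- NULL
best_gap <- Inf
for (iter in 1:100000) {
  candidate <- sample(nrow(new_data), 28)
  gap <- abs(sum(new_data$value[candidate]) - extra_sum)
  if (gap < best_gap) {
    best_gap <- gap
    best_rows <- candidate
  }
}
new_data[best_rows, ]  # the 28 cases whose removal best reproduces the old sum

As with the partition approach, the rows this flags are only candidates for the extra participants and still need checking against the original IDs if those are recoverable.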
Say that I have a data frame in R with 500 player records and the following columns:
PlayerID
TotalRuns
RunRate
AutionCost
Now, out of the 500 players, I want my code to give me multiple combinations of 3 players that satisfy the following criteria (something like a Moneyball problem):
The sum of auction cost of all the 3 players shouldn't exceed X
They should have a minimum of Y TotalRuns
Their RunRate must be higher than the average run rate of all the players.
Kindly help with this. Thank you.
So there are choose(500, 3) = 20,708,500 ways to choose 3 players. It's not impossible to generate all these combinations (combn might do it for you), but I couldn't be bothered waiting to find out. If you do this with player IDs and then test your three conditions, that would be one way to solve your problem.
An alternative would be a Monte Carlo method: select three players that initially satisfy your conditions, then randomly select another player who doesn't belong to the current trio; if he satisfies the conditions, save the combination and repeat. If you're optimizing (it's not clear, but your question has optimization in the tag), then the new player has to produce a trio that's better than the last, so if he doesn't improve your objective function (whatever it might be), you don't accept the trade.
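For what it's worth, here is a rough sketch of that Monte Carlo idea. It assumes a data frame players with the columns from the question and hypothetical thresholds X (the budget) and Y (the minimum TotalRuns, taken here as the trio's combined total); it saves every valid trio it stumbles on rather than optimizing an objective:

avg_rate <- mean(players$RunRate)

ok <- function(idx) {
  trio <- players[idx, ]
  sum(trio$AutionCost) <= X &&
    sum(trio$TotalRuns) >= Y &&
    all(trio$RunRate > avg_rate)
}

# find any valid starting trio by rejection sampling
repeat {
  current <- sample(nrow(players), 3)
  if (ok(current)) break
}

solutions <- list(players$PlayerID[current])
for (step in 1:10000) {
  candidate <- current
  # swap one member for a player outside the current trio
  candidate[sample(3, 1)] <- sample(setdiff(seq_len(nrow(players)), current), 1)
  if (ok(candidate)) {
    current <- candidate
    solutions[[length(solutions) + 1]] <- players$PlayerID[current]
  }
}

If you are optimizing, you would additionally require the candidate trio to improve your objective function before accepting the swap.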
choose(500,3)
This shows there are almost 21,000,000 combinations of 3 players drawn from a pool of 500, which means a complete analysis of the entire search space ought to be doable in a reasonable time on a modern machine.
You can generate the indices of these combinations using iterpc() and getnext() from the iterpc package, as in:
# library(iterpc) # uncomment if not loaded
I <- iterpc(500, 3)  # iterator over all 3-player index combinations
getnext(I)           # returns the next combination, e.g. 1 2 3
You can also drastically cut the search space in a number of ways by setting up initial filtering criteria and/or by taking the first solution (while loop with condition = meeting criterion). Or, you can get and rank order all of them (loop through all combinations) or some intermediate where you get n solutions. And preprocessing can help reduce the search space. For example, ordering salaries in ascending order first will give you the cheapest salary solution first. Ordering the file by descending runs will give you the highest runs solutions first.
NOTE: while this works fine, I see that iterpc is now superseded by the arrangements package, where the relevant iterator is icombinations(); getnext() is still the access method for stepping through the iterator.
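Not the iterator itself, but a minimal sketch of the filter-first-then-enumerate idea from the paragraph above, using base combn on the reduced pool. players, X and Y are hypothetical names for the data frame and the two thresholds, and Y is taken as the trio's combined TotalRuns:

avg_rate <- mean(players$RunRate)

# only players above the average run rate can appear in a valid trio
pool <- players[players$RunRate > avg_rate, ]

# enumerate every trio from the reduced pool and keep those meeting the criteria
trios <- combn(nrow(pool), 3, simplify = FALSE)
valid <- Filter(function(idx) {
  sum(pool$AutionCost[idx]) <= X && sum(pool$TotalRuns[idx]) >= Y
}, trios)

# player IDs of the qualifying combinations
lapply(valid, function(idx) pool$PlayerID[idx])

With the pool cut to around 100 players, this is only on the order of 10^5 combinations, which enumerates in seconds.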
Thanks, I used a combination of both John's and James's answers.
Filtered out all the players who don't satisfy the criteria, which boiled it down to only 90+ players.
Then I picked players at random until all the variations were exhausted.
Finally, I computed combined metrics for each variation (set) of players to arrive at the optimized set.
The code is a bit messy, so I won't post it here.
I am trying to perform record linkage on 2 datasets containing company names. While reclin does a very good job indeed, the linked data will need some manual cleaning, and because I will most likely have to clean about 3000 rows in a day or two, it would be great to keep the weights generated in the reclin process shown below:
CH_ecorda_to_Patstat_left <- pair_blocking(companies_x, companies_y) %>%
  compare_pairs(by = "nameor", default_comparator = jaro_winkler()) %>%
  score_problink() %>%
  select_n_to_m() %>%
  link(all_x = TRUE, all_y = FALSE)
I know these weights are still kept up until I use the link() function. I would like to keep the weights from comparing the variable "nameor" so I can use them to order my data in ascending order, from smallest weight to biggest, to find mistakes in the attempted match more quickly.
For context: I need to find out how many companies_x have handed in patents in the patent database companies_y. I don't need to know how often they handed them in, just whether there are any at all. So I need matches of x to y; however, I don't know the true number of matches, and not every companies_x company will have a match, so some manual cleaning will be necessary, as n_to_m forces a match for each entry even if there should be none.
Try doing something like this:
# estimate the linkage model with EM from the compared pairs
weight <- problink_em(paired)
# score the pairs with that model; the scores are stored in a weight column
paired <- score_problink(paired, weight)
You'll have the result stored as weight now.
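To then review the matches from weakest to strongest, one possibility is to stop before link() and build the linked table by hand. This is a sketch only, assuming the pairs object keeps reclin's usual x and y row indices, the weight column added by score_problink(), and the logical select column added by select_n_to_m():

library(reclin)  # plus a pipe operator such as magrittr's %>%

pairs <- pair_blocking(companies_x, companies_y) %>%
  compare_pairs(by = "nameor", default_comparator = jaro_winkler()) %>%
  score_problink() %>%
  select_n_to_m()

# keep only the selected pairs and attach the original records plus the weight
selected <- pairs[pairs$select, ]
manual_link <- cbind(companies_x[selected$x, ],
                     companies_y[selected$y, ],
                     weight = selected$weight)

# smallest weights first: the most doubtful matches to clean manually
manual_link <- manual_link[order(manual_link$weight), ]

This only covers the selected pairs; the unmatched companies_x rows that link(all_x = TRUE) would have added still have to be appended separately.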
I am using Google Dataflow + Scio to do a cross join of a dataset with itself to find the top K most similar records using cosine similarity. The dataset has around 200k records and its total size is ~300 MB.
I am joining the dataset with itself by passing it as a side input and setting the workerCacheMB to 500 MB.
Each record in the dataset is a tuple that looks like this: (String, Set[Integer]). The first element in the tuple is the URI and the second element is a set of entity indexes.
Most records in the dataset have under 500 entity indexes. However, there are about 7000 records which have over 10k entities and the maximum one has 171k entities.
I have some hot keys, and hence the worker utilization is very uneven: after scaling up to 80 nodes and then back down to 1 node, the job had already processed about 90% of the records. I assume the hot keys got stuck on that last node and it took the rest of the time to process all of them.
I tried the --experiments=shuffle_mode=service option. Though it gave an improvement, the problem persists. I was thinking about ways to use the sharded hot-key join mentioned here; however, since I need to find similarity, I don't think I can afford to split the hot entities and rejoin them.
I was wondering if there is a way to solve it or if I basically have to live with this.
Obviously, this is a crude way to find sims. However, I am interested in finding a solution to the data engineering part of the problem, while letting ML engineers iterate on the similarity-finding algorithms.
The stripped down version of the code looks as follows:
private def findSimilarities(e1: Set[Integer], e2: Set[Integer]): Float = {
  val common = e1.intersect(e2)
  val cosine = common.size.toFloat / (e1.size + e2.size).toFloat
  cosine
}

val topN = sortedReverseTake[ElementSims](250)(by(_.getScore))

elements
  .withSideInputs(elementsSI)
  .flatMap { case (e1, si) =>
    val fromUri = e1._1.toString
    val fromEntities = e1._2
    val sideInput: List[(String, Set[Integer])] = si(elementsSI)
    val sims: List[ElementSims] = findSimilarities(fromUri, fromEntities, sideInput)
    topN(sims)
  }
  .toSCollection
  .saveAsAvroFile(outputGCS)