How to randomly pick a number of combinations from all the combinations efficiently? - r

I know the function combn can generate all possible combinations. However, if the total number of members is large, this is really time- and memory-consuming.
My goal is to randomly pick combinations from all the possible combinations. For example, I want 5000 distinct triples of members from a pool of 3000 members. I think I don't need to generate all possible combinations and choose 5000 from them, but it seems that R doesn't have a ready-to-use function for this. How should I deal with this problem?

This is not exactly what you need but perhaps it can get you started:
library(data.table)  # to make the table easier
members <- 1:3000
X <- data.table(RUN = 1:5000)
X <- X[, as.list(sample(members, 3)), by = RUN]
This will create 3 new columns, each randomly sampled from the members vector. Think of them as the IDs of each member.
I would then check how many of the rows are duplicated using:
X[duplicated(X, by = c('V1', 'V2', 'V3'))]
Is this helping you at all?
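One caveat: duplicated() on the raw columns treats (1, 2, 3) and (3, 2, 1) as different rows even though they are the same combination. A minimal base-R sketch of an alternative (my own assumption-laden addition, not part of the answer above): sort each sampled triple so duplicates become detectable, then top up until 5000 distinct combinations remain.
# Canonicalize each triple by sorting; with choose(3000, 3) ~ 4.5e9
# possible triples, collisions are vanishingly rare anyway.
members <- 1:3000
n_wanted <- 5000
combos <- unique(t(replicate(n_wanted, sort(sample(members, 3)))))
while (nrow(combos) < n_wanted) {
  extra <- t(replicate(100, sort(sample(members, 3))))
  combos <- unique(rbind(combos, extra))
}
combos <- combos[seq_len(n_wanted), ]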


Grouping and transposing data in R

It is hard to explain this without just showing what I have, where I am, and what I need in terms of data structure:
What structure I had: (table not reproduced here)
Where I have got to with my transformation efforts: (table not reproduced here)
What I need to end up with: (table not reproduced here)
Notes:
I've not given actual names for anything as the data is classed as sensitive, but:
Metrics are things that can be measured, for example the number of permanent or full-time jobs. The number of metrics is larger than presented in the test data (and in the example structure above).
Each metric has many years of data (while writing the code I have restricted myself to just 3 years; the illustration of the structure is based on this test data). The number of years captured will change over time; generally it will increase.
The number of policies will fluctuate. I've just labelled them policy 1, policy 2, etc. for sensitivity reasons, and limited their number whilst testing the code to make it easier to check the outputs.
The source data comes from a workbook of surveys with a tab for each policy. The initial import creates a list of tibbles, each consisting of a row per metric and 4 columns (the metric names, the values for 2024, the values for 2030, and the values for 2035). I converted this to a data frame, created a vector to be a column header, and used cbind() to put this on top, which gives the "What structure I had" data.
To get to the "Where I have got to with my transformation efforts" version of the table, I removed all the metric columns, created another vector of metrics, and used rbind() to put this in as the first column.
The idea in my head was to group the data by policy to get a vector for each metric, then transpose this so that the metric becomes the column and the grouped data becomes the row, then expand the data so the metrics are repeated for each year. A friend of mine who does coding (but has never used R) has suggested that loops might be a better way forward. Again, I am not sure of the best approach, so I welcome advice. On Reddit someone suggested using pivot_wider/pivot_longer, but this appears to be a summarise tool, and I am not trying to summarise the data, rather transform its structure.
Any suggestions on approaches or possible tools/functions to use would be gratefully received. I am learning R whilst trying to pull this data together to create a database that can be used for analysis, so, if my approach sounds weird, feel free to suggest alternatives. Thanks
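For what it's worth, pivot_longer()/pivot_wider() reshape rather than summarise, so they may well fit here. A hedged sketch with invented column names (policy, metric, and year columns 2024/2030/2035), since the real structure isn't shown above:
library(tidyr)
library(dplyr)
# Hypothetical input: one row per policy/metric, one column per year
had <- tibble::tribble(
  ~policy,    ~metric,    ~`2024`, ~`2030`, ~`2035`,
  "policy 1", "metric A",       1,       2,       3,
  "policy 1", "metric B",       4,       5,       6,
  "policy 2", "metric A",       7,       8,       9
)
# Lengthen the year columns, then widen the metrics into columns:
# one row per policy and year, one column per metric.
want <- had %>%
  pivot_longer(c(`2024`, `2030`, `2035`),
               names_to = "year", values_to = "value") %>%
  pivot_wider(names_from = metric, values_from = value)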

R - select cases so that the mean of a variable is some given number

I previously worked on a project where we examined some sociological data. I did the descriptive statistics and after several months, I was asked to make some graphs from the stats.
I made the graphs, but something seemed odd, and when I compared the graphs to the numbers in the report, I noticed that they are different. Upon investigating further, I noticed that my cleaning code (which removed participants with duplicate IDs) now results in more rows, i.e. more participants with unique IDs than previously. I now have 730 participants, whereas previously there were 702. I don't know if this was due to updates of some packages, and unfortunately I cannot post the actual data here because it is confidential, but I am trying to find out who these 28 participants are and what happened in the data.
Therefore, I would like to know if there is a method that allows the user to filter the cases so that the mean of some variable is a set number. Ideally it would be something like this, though of course I know it's not going to work in this form:
iris %>%
  filter_if(mean(.$Petal.Length) == 1.3)
I know that this was an incorrect attempt but I don't know any other way that I would try this, so I am looking for help and suggestions.
I'm not convinced this is a tractable problem, but you may get somewhere by doing the following.
Firstly, work out what the sum of the variable was in your original analysis, and what it is now:
old_sum <- 702 * old_mean  # old_mean: the mean reported in the original analysis
new_sum <- 730 * new_mean  # new_mean: the mean in the current data
Now work out what the sum of the variable in the extra 28 cases would be:
extra_sum <- new_sum - old_sum
This allows you to work out the relative proportions of the sum of the variable from the old cases and from the extra cases. Put these proportions in a vector:
contributions <- c(extra_sum/new_sum, old_sum/new_sum)
Now, using the functions described in my answer to this question, you can find the optimal solution for partitioning your variable to match these two proportions. The rows that end up in the "extra" partition are likely to be the new ones. Even if they aren't, you will be left with a sample whose mean differs from your original by less than one part in a million.
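If those functions aren't to hand, here is a crude hedged sketch of the same idea: randomly search for the 28 rows whose sum is closest to extra_sum. The data frame name df and column name var are invented for illustration.
set.seed(1)                    # for reproducibility
best_err <- Inf
best_idx <- NULL
for (i in 1:100000) {
  idx <- sample(nrow(df), 28)  # candidate set of "extra" rows
  err <- abs(sum(df$var[idx]) - extra_sum)
  if (err < best_err) {
    best_err <- err
    best_idx <- idx
  }
}
df[best_idx, ]                 # best guess at the 28 added participants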

Need to get combination of records from Data Frame in R that satisfies a specific target in R

Let me say that I have the data frame below in R, with 500 player records and the following columns:
PlayerID
TotalRuns
RunRate
AutionCost
Now, out of the 500 players, I want my code to give me multiple combinations of 3 players that satisfy the following criteria. Something like a Moneyball problem.
The sum of auction cost of all the 3 players shouldn't exceed X
They should have a minimum of Y TotalRuns
Their RunRate must be higher than the average run rate of all the players.
Kindly help with this. Thank you.
So there are choose(500, 3) = 20,708,500 ways to choose 3 players. It's not impossible to generate all these combinations; combn might do it for you, but I couldn't be bothered waiting to find out. If you do this with player IDs and then test your three conditions, that would be one way to solve your problem. An alternative would be a Monte Carlo method: select three players that initially satisfy your conditions, then randomly select another player who doesn't belong to the current trio; if the resulting trio satisfies the conditions, save the combination and repeat. If you're optimizing (it's not clear, but your question has optimization in the tag), then the new player has to produce a trio that's better than the last, so if he doesn't improve your objective function (whatever it might be), you don't accept the trade.
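A hedged sketch of that Monte Carlo swap, assuming the data frame is named df and that the limits X and Y are already defined (these names are placeholders; only the column names come from the question):
avg_rr <- mean(df$RunRate)       # average run rate across all players
meets <- function(rows) {
  trio <- df[rows, ]
  sum(trio$AutionCost) <= X &&   # column name spelled as in the question
    sum(trio$TotalRuns) >= Y &&
    all(trio$RunRate > avg_rr)
}
repeat {                         # find a starting trio that qualifies
  trio <- sample(nrow(df), 3)
  if (meets(trio)) break
}
saved <- list(df$PlayerID[trio])
for (i in 1:10000) {
  cand <- trio                   # swap one member for a random outsider
  cand[sample(3, 1)] <- sample(setdiff(seq_len(nrow(df)), trio), 1)
  if (meets(cand)) {             # keep every qualifying trio as we go
    trio <- cand
    saved[[length(saved) + 1]] <- df$PlayerID[trio]
  }
}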
choose(500,3)
shows there are almost 21,000,000 combinations of 3 players drawn from a pool of 500, which means a complete sweep of the entire search space ought to be doable in reasonable time on a modern machine.
You can generate the indices of these combinations using iterpc() and getnext() from the iterpc package. As in:
library(iterpc)
I <- iterpc(500, 3)  # iterator over all 3-combinations of the indices 1:500
getnext(I)           # returns the first combination: 1 2 3
You can also drastically cut the search space in a number of ways: set up initial filtering criteria, stop at the first solution (a while loop whose condition is meeting the criteria), get and rank-order all of them (loop through every combination), or something intermediate where you collect n solutions. Preprocessing can also reduce the search space. For example, ordering by auction cost ascending will give you the cheapest solution first, and ordering by runs descending will give you the highest-runs solutions first.
NOTE: While this works fine, I see iterpc is now superseded by the arrangements package, where the relevant iterator is icombinations(). getnext() is still the method for fetching successive combinations from the iterator.
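A hedged sketch of the arrangements equivalent (per my reading of that package's docs; do check them):
library(arrangements)
it <- icombinations(500, 3)  # iterator over all 3-combinations of 1:500
it$getnext()                 # first combination: 1 2 3
it$getnext(d = 2)            # fetch the next two combinations at once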
Thanks, I used a combination of both John's and James's answers.
First, I filtered out all the players who don't satisfy the criteria, which boiled the pool down to only 90+ players.
Then I picked players at random until all the variations were exhausted.
Finally, I computed combined metrics for each variation (set) of players to arrive at the optimized set.
The code is a bit messy, so I won't post it here.

How to sort .csv files in R

I have one .csv file which I have imported into R. It contains a column with locations; some locations are repeated depending on how many times that location has been surveyed. I have another column with the total number of plastic items.
I would like to add together the number of plastic items for locations that appear more than once, and create a separate column with the total number of plastic items and another column with the number of times each location appeared.
I am unsure how to do this, any help will be much appreciated.
Using dplyr:
library(dplyr)
data %>%
  group_by(location) %>%
  mutate(TOTlocation = n(), TOTitems = sum(items))
And here's a base solution that does pretty much the same thing:
data[c("TOTloc","TOTitem")]<-t(sapply(data$location, function(x)
c(TOTloc=sum(data$location==x),
TOTitem=sum(data$items[data$location==x]))))
Note that in neither case do you need to sort anything. In dplyr, group_by has each action done only on the part of the data set that belongs to the group determined by the contents of a certain column. In my base solution, I loop over the locations with sapply and recalculate TOTloc and TOTitem for every row, which may not be very efficient. A better solution would probably use split, but for some reason I couldn't make it work with my made-up dataset, so maybe someone else can suggest how best to do that.
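For completeness, a hedged sketch of the split() approach alluded to above, using the same made-up column names:
# Compute the totals once per location, then merge them back onto the rows
by_loc <- split(data$items, data$location)
totals <- data.frame(
  location = names(by_loc),
  TOTloc   = lengths(by_loc),             # times each location appears
  TOTitem  = vapply(by_loc, sum, numeric(1))  # total items per location
)
data <- merge(data, totals, by = "location")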

Multiple random selection in R

I'm aware that similar questions have been asked before, but I haven't found an answer to exactly what I need. It seems like a simple solution I'm missing.
I have a sample of approximately 20,000 participants and would like to randomly select 2500 from this sample to receive gift cards, and another unique 2500 (who aren't in the first group) to receive a cash allowance. Participants shouldn't be repeated or duplicated in any way. Participants are identified by unique IDs.
I create indices for each row representing a participant (this step could be avoided, I believe).
Npool <- 1:nrow(pool_20K)
giftcards <- sample(Npool, 2500)
-- how do I create the cash allowance group so they are unique participants and do not include the ones selected for giftcards?
Afterwards, I would combine the indices with the data:
giftcards_ids <- pool_20K[giftcards, ]
Any insight? I feel like I'm complicating a fairly simple problem.
Thanks in advance!
Shuffle the entire thing and then select subsets:
shuffled.indices <- sample(nrow(pool_20K))  # a random permutation of all row numbers
giftcards <- shuffled.indices[1:2500]
cash <- shuffled.indices[2501:5000]
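A hedged alternative sketch that keeps the two-step shape of the question, using setdiff() to exclude the first group before drawing the second:
giftcards <- sample(nrow(pool_20K), 2500)
cash <- sample(setdiff(seq_len(nrow(pool_20K)), giftcards), 2500)
giftcards_ids <- pool_20K[giftcards, ]  # map indices back to rows, as in the question
cash_ids <- pool_20K[cash, ]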
