I'm aware that similar questions have been asked before, but I haven't found an answer to exactly what I need. It seems like a simple solution I'm missing.
I have a sample of approximately 20,000 participants and would like to randomly select 2500 of them to receive gift cards, and another distinct 2500 (who aren't in the first group) to receive a cash allowance. No participant should appear in both groups. Participants are identified by unique IDs.
I create indices for the rows representing the participants (this step could probably be avoided):
Npool <- 1:nrow(pool_20K)
giftcards <- sample(Npool, 2500)
How do I create the cash allowance group so that it contains only unique participants and excludes anyone already selected for gift cards?
Afterwards, I would combine the indices with the data:
giftcards_ids <- pool_20K[giftcards, ]
Any insight? I feel like I'm complicating a fairly simple problem.
Thanks in advance!!
Shuffle the entire thing and then select subsets:
shuffled.indices <- sample(nrow(pool_20K))
giftcards <- shuffled.indices[1:2500]
cash <- shuffled.indices[2501:5000]
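To pull the actual participant records, index the original data frame with each set of indices (a minimal sketch, assuming pool_20K is the data frame from the question):
giftcards_ids <- pool_20K[giftcards, ]
cash_ids <- pool_20K[cash, ]
# sanity check: the two groups share no participants
stopifnot(length(intersect(giftcards, cash)) == 0)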
I'm very new to R, so forgive any omissions or errors.
I have a data frame that contains a series of events (called 'incidents') represented by a column named 'INCIDENT_NUM'. These are strings (e.g. 201611111), and there are multiple rows per incident when multiple employees were involved. Employees are represented in their own string column ('EMPL_NO'), and an employee can appear multiple times if they're involved in multiple incidents.
So, the data I have looks like:
INCIDENT_NUM   EMPL_NO
201611111      EID0012
201611111      EID0013
201611112      EID0012
201611112      EID0013
201611112      EID0011
What I'm aiming to do is see which employees are connected to one another and by how many incidents they're co-involved in. Looking at tutorials for network analysis, folks have data that looks like this, which is what I ultimately want:
From      To        Weight
EID0012   EID0013   2
EID0011   EID0012   1
Is there any easy process for this? My data has thousands of rows, so doing this by hand doesn't feel feasible.
Thanks in advance!!!
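One possible approach, as a sketch in base R (assuming the data frame is named df with the columns INCIDENT_NUM and EMPL_NO described above): merge the table with itself on the incident number, keep each unordered pair of distinct employees once, and count how many incidents each pair shares.
# join the table to itself on incident to list co-involved employee pairs
pairs <- merge(df, df, by = "INCIDENT_NUM")
# keep each unordered pair once; this also drops self-pairs
pairs <- pairs[pairs$EMPL_NO.x < pairs$EMPL_NO.y, ]
# count shared incidents per pair to get the edge weights
edges <- aggregate(INCIDENT_NUM ~ EMPL_NO.x + EMPL_NO.y, data = pairs, FUN = length)
names(edges) <- c("From", "To", "Weight")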
I previously worked on a project where we examined some sociological data. I did the descriptive statistics and after several months, I was asked to make some graphs from the stats.
I made the graphs, but something seemed odd, and when I compared the graphs to the numbers in the report, I noticed that they were different. Upon investigating further, I found that my cleaning code (which removes participants with duplicate IDs) now results in more rows, i.e. more participants with unique IDs than previously: I now have 730 participants, whereas previously there were 702. I don't know if this is due to package updates, and unfortunately I cannot post the actual data here because it is confidential, but I am trying to find out who these 28 participants are and what happened in the data.
Therefore, I would like to know if there is a method that allows the user to filter the cases so that the mean of some variables is a set number. Ideally it would be something like this, but of course I know that it's not going to work in this form:
iris %>%
filter_if(mean(.$Petal.Length) == 1.3)
I know this attempt is incorrect, but I don't know any other way to try it, so I'm looking for help and suggestions.
I'm not convinced this is a tractable problem, but you may get somewhere by doing the following.
Firstly, work out what the sum of the variable was in your original analysis, and what it is now:
old_sum <- 702 * old_mean
new_sum <- 730 * new_mean
Now work out what the sum of the variable in the extra 28 cases would be:
extra_sum <- new_sum - old_sum
This allows you to work out the relative proportions of the sum of the variable from the old cases and from the extra cases. Put these proportions in a vector:
contributions <- c(extra_sum/new_sum, old_sum/new_sum)
Now, using the functions described in my answer to this question, you can find the optimal solution to partitioning your variable to match these two proportions. The rows which end up in the "extra" partition are likely to be the new ones. Even if they aren't the new ones, you will be left with a sample that has a mean that differs from your original by less than one part in a million.
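To make the arithmetic concrete, here is the same calculation with made-up means (hypothetical values, purely for illustration):
old_mean <- 4.20   # hypothetical
new_mean <- 4.35   # hypothetical
old_sum <- 702 * old_mean        # 2948.4
new_sum <- 730 * new_mean        # 3175.5
extra_sum <- new_sum - old_sum   # 227.1
contributions <- c(extra_sum / new_sum, old_sum / new_sum)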
Say I have a data frame in R with 500 player records and the following columns:
PlayerID
TotalRuns
RunRate
AutionCost
Now, out of the 500 players, I want my code to give me multiple combinations of 3 players that satisfy the following criteria. Something like a Moneyball problem.
The sum of the auction costs of the 3 players shouldn't exceed X.
They should have a minimum of Y TotalRuns.
Their RunRate must be higher than the average run rate of all the players.
Kindly help with this. Thank you.
So there are choose(500, 3) ways to choose 3 players, which is 20,708,500. It's not impossible to generate all these combinations; combn might do it for you, but I couldn't be bothered waiting to find out. If you do this with player IDs and then test your three conditions, that would be one way to solve your problem. An alternative is a Monte Carlo method: select three players that initially satisfy your conditions, then randomly select another player who doesn't belong to the current trio; if he satisfies the conditions, save the combination and repeat. If you're optimizing (it's not clear, but your question has optimization in the tags), then the new player has to produce a trio that's better than the last, so if he doesn't improve your objective function (whatever it might be), you don't accept the trade.
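A rough sketch of that swap-based search (assuming a data frame players with the columns listed in the question and thresholds X and Y already defined; the objective, total runs, is my assumption, since the question doesn't name one):
# check the three conditions for a candidate trio
meets_criteria <- function(trio, X, Y, avg_rate) {
  sum(trio$AutionCost) <= X &&
    sum(trio$TotalRuns) >= Y &&
    all(trio$RunRate > avg_rate)
}
avg_rate <- mean(players$RunRate)
# start from a random trio that satisfies the conditions
repeat {
  idx <- sample(nrow(players), 3)
  if (meets_criteria(players[idx, ], X, Y, avg_rate)) break
}
# swap in a random outsider, keeping the swap only if it improves the objective
for (i in 1:10000) {
  cand <- idx
  cand[sample(3, 1)] <- sample(setdiff(seq_len(nrow(players)), idx), 1)
  if (meets_criteria(players[cand, ], X, Y, avg_rate) &&
      sum(players[cand, "TotalRuns"]) > sum(players[idx, "TotalRuns"])) {
    idx <- cand
  }
}
players[idx, ]   # the best trio found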
choose(500,3)
Shows there are almost 21,000,000 combinations of 3 players drawn from a pool of 500, which means a complete analysis of the entire search space ought to be doable in a reasonable time on a modern machine.
You can generate the indices of these combinations using iterpc() and getnext() from the iterpc package. As in:
# library(iterpc) # uncomment if not loaded
I <- iterpc(5, 3)   # small demo: 3-combinations of 1:5; use iterpc(500, 3) for the full problem
getnext(I)
You can also drastically cut the search space in a number of ways: set up initial filtering criteria, or take the first solution found (a while loop whose condition is meeting the criteria). Alternatively, you can rank-order all of the combinations (loop through everything), or do something in between and collect n solutions. Preprocessing can also help reduce the search space. For example, sorting by auction cost in ascending order will give you the cheapest solutions first, and sorting by runs in descending order will give you the highest-run solutions first.
NOTE: While this works fine, I see iterpc is now superseded by the arrangements package, where the relevant iterator is icombinations(). getnext() is still the method for advancing the iterator.
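For example (a quick sketch, assuming arrangements is installed):
library(arrangements)
ic <- icombinations(5, 3)   # iterator over the 3-combinations of 1:5
ic$getnext()                # returns the first combination, 1 2 3
ic$getnext()                # advances to the next one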
Thanks, I used a combination of both John's and James's answers.
I filtered out all the players who didn't satisfy the criteria, which narrowed the pool down to roughly 90 players.
Then I picked players at random until all the variations were exhausted.
Finally, I computed combined metrics for each variation (set) of players to arrive at the optimized set.
The code is a bit messy, so I won't post it here.
Thank you to everyone who commented already! I've edited my post with better code and hopefully some clarity on what I'm trying to do. (I appreciate all of the feedback - this is my first time asking a question here!)
I have a very similar question to this one here (Random Pairings that don't Repeat) but am trying to come up with a function or piece of code that I can run in R to create the pairings. Essentially, I have a pool of employees and want a way to randomly generate pairs of employees to meet every month, with no pair repeating in future months/runs of the function. (I will need to maintain the history of previous pairings.) The catch is that each employee is assigned to a working location, and I only want matches between employees from different locations.
I've gone through a number of previous queries on randomly sampling data sets in R and am comfortable with generating a random pair from my data, or pulling out an existing working group, but it's the "generating a pair that ALWAYS comes from a different group" that's tripping me up, especially since the different groups/locations have different numbers of employees so it's hard to sort the groups evenly.
Here's my dummy data, which currently has 10 "employees". The actual data set currently has over 100 employees with more being added to the pool each month:
ID <- (1:10)
Name <- c("Sansa", "Arya", "Hodor", "Jamie", "Cersei", "Tyrion", "Jon", "Sam", "Dany", "Drogo")
Email <- c("a#a.com","b#b.com", "c#c.com", "d#d.com", "e#e.com", "f#f.com", "g#g.com", "h#h.com", "i#i.com", "j#j.com")
Location <- c("Winterfell", "Winterfell", "Winterfell", "Kings Landing", "Kings Landing", "Kings Landing",
"The Wall", "The Wall", "Essos", "Essos")
df <- data.frame(ID, Name, Email, Location)
Basically, I want to write something that would say that Sansa could be randomly paired with anyone who is not Arya or Hodor, because they're in the same location, and that Jon could be paired with anyone but Sam (i.e., anyone whose location is NOT Winterfell.) If Jon and Arya were paired up once, I would like them to not be paired up again going forward, so at that point Jon could be paired with anyone but Sam or Arya. I hope I'm making sense.
I was able to run the combn function on the ID column to generate groups of two, but it doesn't account for the historical pairings that we're trying to avoid.
Is there a way to do this in R? I've tried this one (Using R, Randomly Assigning Students Into Groups Of 4) but it wasn't quite right for my needs, again because of the historical pairings.
Thank you for any help you can provide!
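One way to start, as a sketch using the dummy data above (the history data frame, with columns ID1 and ID2 recording past pairings, is hypothetical):
# all possible pairs of employee IDs (combn guarantees ID1 < ID2)
all_pairs <- as.data.frame(t(combn(df$ID, 2)))
names(all_pairs) <- c("ID1", "ID2")
# keep only cross-location pairs (indexing by ID works because ID is 1:10)
same_loc <- df$Location[all_pairs$ID1] == df$Location[all_pairs$ID2]
all_pairs <- all_pairs[!same_loc, ]
# drop pairs that have already met, assuming history also stores ID1 < ID2
used <- paste(history$ID1, history$ID2)
all_pairs <- all_pairs[!paste(all_pairs$ID1, all_pairs$ID2) %in% used, ]
# draw one random eligible pair from what remains
all_pairs[sample(nrow(all_pairs), 1), ]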
Probably a pretty basic question, and hopefully one not repeated elsewhere. I’m looking at some ISSP survey data in R, and I made a separate data frame for respondents who answered “Government agencies” on one of the questions:
gov.child <- data[data$V33 == "Government agencies", ]
Then I used the table function to see how many total respondents answered that way in each country (C_ALPHAN is the variable name for country):
table(gov.child$C_ALPHAN)
Then I made a matrix of this table:
gov.child.matrix <- as.matrix(table(gov.child$C_ALPHAN))
So I now have a matrix with the two-letter country code (the C_ALPHAN code) as row names and a single column holding the number of people who answered “Government agencies.” But I want to know what percentage of respondents in those countries answered that way, so I need to divide this count by the total number of respondents for each country.
Is there some way (a function, maybe?) to add a new column and tell R that, for each row, it should divide the number in column two by the total number of rows in the original data set that match the country code in column one (i.e., the n for that country)? Or should I just manually make a vector with the n for each country, which is available on the ISSP website, and add it to the matrix? I'm loath to do that because of the possibility of a data entry error, but maybe that's the best way.
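One way to avoid manual entry (a sketch, assuming data is the full data frame from which gov.child was subset): tabulate the full data by country and divide.
total_by_country <- table(data$C_ALPHAN)
gov_by_country <- table(gov.child$C_ALPHAN)
# percentage of each country's respondents who answered "Government agencies"
pct <- 100 * gov_by_country / total_by_country[names(gov_by_country)]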