I am trying to create pen-pal pairs in R. The problem is that I can't figure out how to loop it so that, once I pair one person, that person and their partner are eliminated from the pool and the loop continues until everyone has a pair.
I have already rated the criteria used to pair them and computed, for every person, a score for how well they would pair with each other person. I then added the two scores for each pair together to get a sense of how good the pair is overall (not perfect, but good enough for these purposes). I then found each person's ideal match and ordered these matches from most picky person to least picky person (basically, from the lowest best-paired score to the highest). I also found their 2nd-8th best matches (there will probably be about 300 people in the data).
A test of the best-matches is below:
indexed_fake apply.fin_fake..1..max. X1 X2 X3 X4 X5 X6 X7 X8
14 14 151 3 9 8 4 10 12 2 6
4 4 177 9 5 8 7 11 3 10 12
9 9 177 4 11 3 6 10 7 12 5
5 5 179 7 4 11 3 12 10 8 5
10 10 179 12 10 2 9 3 5 6 4
13 13 182 8 1 12 11 10 5 3 2
1 1 185 7 1 3 8 6 13 2 11
7 7 185 1 12 5 7 4 6 9 11
3 3 187 12 3 8 5 9 1 2 10
8 8 190 8 12 13 3 4 11 1 6
2 2 191 6 12 11 10 3 4 5 1
6 6 191 2 11 7 1 6 9 10 8
11 11 193 12 6 9 5 2 8 11 4
12 12 193 11 3 8 7 12 10 2 5
Columns X1-X8 are the 8 best matches for the person listed in the first columns. With this example, every person would ideally be paired with someone in their top 8, maximizing the pair compatibility as another user mentioned. Every person would get exactly one partner.
Any help is appreciated!
This is not a specific answer, but it's easier to write in this space. You have a classic assignment optimization problem. These problems can be solved using packages in R. You have to assign preference weights to your feasible pairings. So, for example, 14-3 could be assigned 8 points, 14-9 7 points, 14-8 6 points, and so on down to 14-6 with 1 point. Note that 3-14 would be assigned no points because, while 14 likes 3, 3 does not like 14. The preference score for any x-y, y-x pairing could be the weight for the x-y preference plus the weight of the y-x preference.
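A minimal sketch of that weighting in base R, assuming a hypothetical matrix `top` in which row i lists person i's best matches, best first (a toy with 4 people and 2 choices each, not the data from the question):

```r
# Hypothetical toy input: row i holds person i's ranked matches, best first.
top <- rbind(c(2, 3),
             c(1, 4),
             c(4, 1),
             c(3, 2))
n <- nrow(top)
k <- ncol(top)

# One-directional weights: the best match gets k points, the next k - 1, ...
w <- matrix(0, n, n)
for (i in seq_len(n)) {
  w[i, top[i, ]] <- k:1
}

# Pair score = weight of x-y plus weight of y-x, so the matrix is symmetric.
pair_score <- w + t(w)
diag(pair_score) <- 0  # nobody is paired with themselves
pair_score
```

With the real data, `top` would be the X1-X8 columns and `k` would be 8.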
The optimization model would choose the weighted pairs to maximize the total satisfaction among all of the pairings.
If you have 300 people I can't think of an alternative algorithm that could be simply implemented.
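If an approximate answer is good enough, the elimination loop described in the question can be sketched as a greedy heuristic. To be clear, this is not the optimization model described above: it takes the best remaining pair each time and can miss the best overall total. The function name `greedy_pairs` and the toy `score` matrix are made up for illustration.

```r
# Greedy pairing: repeatedly take the highest-scoring remaining pair,
# then remove both people from the pool.
greedy_pairs <- function(score) {
  s <- score
  s[lower.tri(s, diag = TRUE)] <- -Inf  # look at each pair once, no self-pairs
  pairs <- NULL
  while (any(is.finite(s))) {
    best <- which(s == max(s), arr.ind = TRUE)[1, ]  # first best pair
    pairs <- rbind(pairs,
                   data.frame(a = best[["row"]], b = best[["col"]],
                              score = s[best[["row"]], best[["col"]]]))
    s[best, ] <- -Inf  # eliminate both people (as rows) ...
    s[, best] <- -Inf  # ... and as columns
  }
  pairs
}

# Hypothetical symmetric pair scores for 4 people.
score <- matrix(c(0, 4, 2, 0,
                  4, 0, 0, 2,
                  2, 0, 0, 4,
                  0, 2, 4, 0), nrow = 4, byrow = TRUE)
result <- greedy_pairs(score)
result
```

With an odd number of people, one person is left unpaired when the loop ends.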
I am wrangling with a huge dataset and my R skills are very new. I am really trying to understand the terminology and processes but finding it a struggle as the R-documentation often makes no sense to me. So apologies if this is a dumb question.
I have data for plant species at different sites with different percentages of ground-cover. I want to create a new column PROP-COVER which gives each species' cover as a percentage of the total cover of all species at a particular site. This is slightly different from calculating percentage cover by site area, as it disregards bare ground with no vegetation. This is an easy calculation with just one site, but I have over a hundred sites and need to perform the calculation on species ground-cover grouped by site. The desired column output is PROP-COVER.
SPECIES SITE COVER PROP-COVER(%)
1 1 10 7.7
2 1 20 15.4
3 1 10 7.7
4 1 20 15.4
5 1 30 23.1
6 1 40 30.8
2 2 20 22.2
3 2 50
5 2 10
6 2 10
1 3 5
2 3 25
3 3 40
5 3 10
I have looked at for loops and repeat but I can't see where the arguments should go. Every attempt I make returns a NULL.
Below is an example of something I tried which I am sure is totally wide of the mark, but I just can't work out where to begin with or know if it is even possible.
a<- for (i in data1$COVER) {
sum(data1$COVER[data1$SITE=="i"],na.rm = TRUE)
}
a
NULL
I have a major brain-blockage when it comes to how 'for' loops etc work, no amount of reading about it seems to help, but perhaps what I am trying to do isn't possible? :(
Many thanks for looking.
In Base R:
merge(df, prop.table(xtabs(COVER~SPECIES+SITE, df), 2)*100)
SPECIES SITE COVER Freq
1 1 1 10 7.692308
2 1 3 5 6.250000
3 2 1 20 15.384615
4 2 2 20 22.222222
5 2 3 25 31.250000
6 3 1 10 7.692308
7 3 2 50 55.555556
8 3 3 40 50.000000
9 4 1 20 15.384615
10 5 1 30 23.076923
11 5 2 10 11.111111
12 5 3 10 12.500000
13 6 1 40 30.769231
14 6 2 10 11.111111
In tidyverse you can do:
library(dplyr)
df %>%
group_by(SITE) %>%
mutate(n = proportions(COVER) * 100)
# A tibble: 14 x 4
# Groups: SITE [3]
SPECIES SITE COVER n
<int> <int> <int> <dbl>
1 1 1 10 7.69
2 2 1 20 15.4
3 3 1 10 7.69
4 4 1 20 15.4
5 5 1 30 23.1
6 6 1 40 30.8
7 2 2 20 22.2
8 3 2 50 55.6
9 5 2 10 11.1
10 6 2 10 11.1
11 1 3 5 6.25
12 2 3 25 31.2
13 3 3 40 50
14 5 3 10 12.5
The code could also be written as n = COVER/sum(COVER) * 100 or even n = prop.table(COVER) * 100.
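On the loop attempt in the question: a `for` loop in R always returns NULL, so assigning the loop itself to `a` cannot work, and `"i"` in quotes is the literal string, not the loop variable. A rough sketch of the same per-site calculation in base R, first with `ave()` and then with an explicit loop that assigns inside the loop body (the column names `PROP_COVER` and `PROP_LOOP` are just illustrative):

```r
# Data from the question.
data1 <- data.frame(
  SPECIES = c(1, 2, 3, 4, 5, 6, 2, 3, 5, 6, 1, 2, 3, 5),
  SITE    = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  COVER   = c(10, 20, 10, 20, 30, 40, 20, 50, 10, 10, 5, 25, 40, 10)
)

# ave() applies sum() within each SITE and returns one total per row.
site_total <- ave(data1$COVER, data1$SITE, FUN = sum)
data1$PROP_COVER <- 100 * data1$COVER / site_total

# The same thing with a loop: iterate over sites and store results inside it.
data1$PROP_LOOP <- NA_real_
for (s in unique(data1$SITE)) {
  rows <- data1$SITE == s
  data1$PROP_LOOP[rows] <- 100 * data1$COVER[rows] / sum(data1$COVER[rows])
}
data1
```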
I have a dataset with the reports from a local shop, where each line has a client's ID, date of purchase and total value per purchase.
I want to create a new plot where for each client ID I have all the purchases in the last month or even just sample purchases in a range of dates I choose.
The main problem is that certain customers might buy once a month, while others can come daily - so the number of observations per period of time can vary.
I have tried subsetting my dataset to a specific range of time, but either I choose a specific date - and then I only get a small % of all customers, or I choose a range and get multiple observations for certain customers.
(In this case - I wouldn't mind getting the earliest observation)
An important note: I know how to create a for loop to solve this problem, but since the dataset is over 4 million observations it isn't practical since it would take an extremely long time to run.
A basic example of what the dataset looks like:
ID Date Sum
1 1 1 234
2 1 2 45
3 1 3 1
4 2 4 223
5 3 5 546
6 4 6 12
7 2 1 20
8 4 3 30
9 6 2 3
10 3 5 45
11 7 6 456
12 3 7 65
13 8 8 234
14 1 9 45
15 3 2 1
16 4 3 223
17 6 6 546
18 3 4 12
19 8 7 20
20 9 5 30
21 11 6 3
22 12 6 45
23 14 9 456
24 15 10 65
....
And the new data set would look something like this:
ID 1Date 1Sum 2Date 2Sum 3Date 3Sum
1 1 234 2 45 3 1
2 1 20 4 223 NA NA
3 2 1 5 546 5 45
...
Thanks for your help!
I think you can do this with a bit of help from dplyr and tidyr:
library(dplyr)
library(tidyr)
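dd %>%
  arrange(ID, Date) %>%           # earliest purchase first, as in the desired output
  group_by(ID) %>%
  mutate(seq = row_number()) %>%  # number each client's purchases 1, 2, 3, ...
  pivot_wider(id_cols = "ID", names_from = "seq", values_from = c("Date", "Sum"))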
Where dd is your sample data frame above.
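If only the earliest observation per client within a chosen date range is needed, as the question mentions, one loop-free possibility in base R is to filter, sort, and keep the first row per ID. This is only a sketch using the first ten rows of the sample data; the range 2 to 5 is an arbitrary example.

```r
# First ten rows of the sample data from the question.
dd <- data.frame(
  ID   = c(1, 1, 1, 2, 3, 4, 2, 4, 6, 3),
  Date = c(1, 2, 3, 4, 5, 6, 1, 3, 2, 5),
  Sum  = c(234, 45, 1, 223, 546, 12, 20, 30, 3, 45)
)

# Keep purchases inside the chosen range (2 to 5 is an arbitrary example).
in_range <- dd[dd$Date >= 2 & dd$Date <= 5, ]

# Sort by client and date, then keep each client's first (earliest) row.
in_range <- in_range[order(in_range$ID, in_range$Date), ]
earliest <- in_range[!duplicated(in_range$ID), ]
earliest
```

All three steps are vectorised, so this should scale to millions of rows far better than an explicit loop.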
Hi all: I am currently working on a dataset that is close to the Oxboys data in structure. What I am aspiring to achieve is a heatmap that shows, for all 26 boys, whether the percentage increase in their height between consecutive occasions (9 occasions in total) was the same as the group average, higher than average, or lower (Amber, Green, Red respectively). So, 26 rows & 8 columns with R-A-G in each intersecting cell. This is what I believe I need to do:
create a vector with the actual percentage increase in heights (the % increase on the 2nd Occasion vis-a-vis the first, and so on)
calculate the average for each Occasion increase
write this into a matrix
use ggheat to create a heatmap
I need direction, advice, resources that I can look up to initiate this.
many thanks..
here's first 18 rows of the data
Subject numbers refer to students // Occasion is the progressive time stamp at which height measurements were taken
> head(ox_b, 18)
Subject height Occasion
1 1 140.5 1
2 1 143.4 2
3 1 144.8 3
4 1 147.1 4
5 1 147.7 5
6 1 150.2 6
7 1 151.7 7
8 1 153.3 8
9 1 155.8 9
10 2 136.9 1
11 2 139.1 2
12 2 140.1 3
13 2 142.6 4
14 2 143.2 5
15 2 144.0 6
16 2 145.8 7
17 2 146.8 8
18 2 148.3 9
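The steps listed above could be sketched in base R roughly as follows, using only the 18 rows shown (two subjects). The tolerance `tol` that decides what counts as "the same as average" is a made-up value, not something from the question.

```r
# The 18 rows shown above.
ox_b <- data.frame(
  Subject  = rep(1:2, each = 9),
  height   = c(140.5, 143.4, 144.8, 147.1, 147.7, 150.2, 151.7, 153.3, 155.8,
               136.9, 139.1, 140.1, 142.6, 143.2, 144.0, 145.8, 146.8, 148.3),
  Occasion = rep(1:9, times = 2)
)

# Reshape to one row per subject and one column per occasion.
ht <- tapply(ox_b$height, list(ox_b$Subject, ox_b$Occasion), mean)

# Step 1: % increase from each occasion to the next (8 columns).
pct <- 100 * (ht[, -1] - ht[, -ncol(ht)]) / ht[, -ncol(ht)]

# Step 2: group average increase for each occasion.
avg <- colMeans(pct)

# Step 3: R-A-G matrix. tol is a hypothetical band for "same as average".
tol <- 0.05
dev <- sweep(pct, 2, avg)  # each subject's deviation from the group average
rag <- ifelse(abs(dev) <= tol, "A", ifelse(dev > 0, "G", "R"))
rag
```

The resulting character matrix can then be converted to long format and drawn as coloured tiles, for example with `geom_tile()` from ggplot2.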
I have the following data with the ID of subjects.
V1
1 2
2 2
3 2
4 2
5 2
6 2
7 2
8 2
9 2
10 2
11 2
12 2
13 2
14 2
15 2
16 4
17 4
18 4
19 4
20 4
21 4
22 4
23 4
24 4
I want to subset all the rows of the data where V1 == 4. This way I can see which observations relate to subject 4.
For example, the correct output would be
16 4
17 4
18 4
19 4
20 4
21 4
22 4
23 4
24 4
However, the output I'm given after subsetting does not show the correct row numbers. It simply gives me:
V1
1 4
2 4
3 4
4 4
5 4
6 4
7 4
8 4
I'm unable to tell which observations relate to subject 4, as observations 1:8 are for subject 2.
I've tried the usual methods, such as
condition<- df == 4
df[condition]
How can I subset the data so that I'm given back a dataset that shows the correct row numbers for subject 4?
You can also use the subset function:
subset(df,df$V1==4)
I've managed to find a solution since posting.
newdf <- subset(df, V1 == 4)
However, I'm still very interested in other solutions to this problem, so please post if you're aware of another method.
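Another option is plain bracket indexing. Indexing with a logical matrix, as in `df[condition]`, returns a bare vector, which is why the row numbers get lost. Indexing rows and adding `drop = FALSE` keeps the data frame and its original row names. A small sketch, with the data rebuilt from the question:

```r
# Rebuild the example: subject 2 in rows 1-15, subject 4 in rows 16-24.
df <- data.frame(V1 = rep(c(2, 4), times = c(15, 9)))

# Row indexing; drop = FALSE stops a one-column result collapsing to a vector.
res <- df[df$V1 == 4, , drop = FALSE]
res

# If only the positions are needed, which() returns them directly.
idx <- which(df$V1 == 4)
idx
```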
I printed out the summary of a column variable as follows:
Please see below the summary table printed out from R:
I would like to turn it into a data.frame. However, there are too many subject names to list them all. Also, the term "OTHER" with number 31 means that there are 319 subjects which appear only 1 time in the original data.frame.
So, the new data.frame I hope to produce would look like below:
Here is one possible solution.
Table<-table(rpois(100,5))
as.data.frame(Table)
Var1 Freq
1 1 2
2 2 11
3 3 9
4 4 18
5 5 13
6 6 20
7 7 14
8 8 8
9 9 3
10 10 1
11 11 1
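The desired layout isn't shown in the question, so the following is only a guess at it: one row per subject that appears more than once, plus a single "OTHER" row counting the subjects that appear exactly once. The `subjects` vector is made-up example data.

```r
# Made-up example data standing in for the subject column.
subjects <- c("math", "math", "math", "art", "art", "bio", "chem", "law")

# Frequency of each subject, most frequent first.
counts <- sort(table(subjects), decreasing = TRUE)

# Keep subjects seen more than once; lump the singletons into one OTHER row.
keep <- counts[counts > 1]
out <- data.frame(
  Subject = c(names(keep), "OTHER"),
  Freq    = c(as.vector(keep), sum(counts == 1))
)
out
```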