creating heatmap of the Oxboys dataset (from nlme) - r

Hi all: I am currently working on a dataset that is close to the Oxboys data in structure. What I am aspiring to achieve is a heatmap that shows for all 26 boys whether the percentage increase in their heights across the 9 occasions were the same as group average, higher than average or lower (Amber, Green, Red respectively). So, 26 rows & 8 columns with R-A-G in each intersecting cell. This is what I believe I need to do;
create a vector with actual percentage increase in heights (what was the % increase on 2nd Occasion vis-a-vis first and so on
calculate the average for each Occasion increase
write this into a matrix
use ggheat to create a heatmap
I need direction, advice, resources that I can look up to initiate this.
many thanks..
here's first 18 rows of the data
Subject numbers are reference to students // Occassion is the progressive time stamps when height measurements were taken
> head(ox_b, 18)
Subject height Occasion
1 1 140.5 1
2 1 143.4 2
3 1 144.8 3
4 1 147.1 4
5 1 147.7 5
6 1 150.2 6
7 1 151.7 7
8 1 153.3 8
9 1 155.8 9
10 2 136.9 1
11 2 139.1 2
12 2 140.1 3
13 2 142.6 4
14 2 143.2 5
15 2 144.0 6
16 2 145.8 7
17 2 146.8 8
18 2 148.3 9

Related

Creating a percentage column based on the sums of a column grouped by a different column? [duplicate]

This question already has answers here:
Summarizing by subgroup percentage in R
(2 answers)
Closed 9 months ago.
I am wrangling with a huge dataset and my R skills are very new. I am really trying to understand the terminology and processes but finding it a struggle as the R-documentation often makes no sense to me. So apologies if this is a dumb question.
I have data for plant species at different sites with different percentages of ground-cover. I want to create a new column PROP-COVER which gives the proportion of each species' cover as a percentage of the total cover of all species in a particular site. This is slightly different to calculating percentage cover by site area as it is disregards bare ground with no vegetation. This is an easy calculation with just one site, but I have over a hundred sites and need to perform the calculation on species ground-cover grouped by site. The desired column output is PROP-COVER.
SPECIES SITE COVER PROP-COVER(%)
1 1 10 7.7
2 1 20 15.4
3 1 10 7.7
4 1 20 15.4
5 1 30 23.1
6 1 40 30.8
2 2 20 22.2
3 2 50
5 2 10
6 2 10
1 3 5
2 3 25
3 3 40
5 3 10
I have looked at for loops and repeat but I can't see where the arguments should go. Every attempt I make returns a NULL.
Below is an example of something I tried which I am sure is totally wide of the mark, but I just can't work out where to begin with or know if it is even possible.
a<- for (i in data1$COVER) {
sum(data1$COVER[data1$SITE=="i"],na.rm = TRUE)
}
a
NULL
I have a major brain-blockage when it comes to how 'for' loops etc work, no amount of reading about it seems to help, but perhaps what I am trying to do isn't possible? :(
Many thanks for looking.
In Base R:
merge(df, prop.table(xtabs(COVER~SPECIES+SITE, df), 2)*100)
SPECIES SITE COVER Freq
1 1 1 10 7.692308
2 1 3 5 6.250000
3 2 1 20 15.384615
4 2 2 20 22.222222
5 2 3 25 31.250000
6 3 1 10 7.692308
7 3 2 50 55.555556
8 3 3 40 50.000000
9 4 1 20 15.384615
10 5 1 30 23.076923
11 5 2 10 11.111111
12 5 3 10 12.500000
13 6 1 40 30.769231
14 6 2 10 11.111111
In tidyverse you can do:
df %>%
group_by(SITE) %>%
mutate(n = proportions(COVER) * 100)
# A tibble: 14 x 4
# Groups: SITE [3]
SPECIES SITE COVER n
<int> <int> <int> <dbl>
1 1 1 10 7.69
2 2 1 20 15.4
3 3 1 10 7.69
4 4 1 20 15.4
5 5 1 30 23.1
6 6 1 40 30.8
7 2 2 20 22.2
8 3 2 50 55.6
9 5 2 10 11.1
10 6 2 10 11.1
11 1 3 5 6.25
12 2 3 25 31.2
13 3 3 40 50
14 5 3 10 12.5
The code could also be written as n = COVER/sum(COVER) or even n = prop.table(COVER)

Pairing People in R with Loop

I am trying to create pen-pal pairs in R. The problem is that I can't figure out how to loop it so that once I pair one person that person and their pair are eliminated from the pool and the loop continues until everyone has a pair.
I have already rated the criteria to pair them and found a score for every person for how well they would pair for the other person. I think added every pair score together to get a sense of how good the pair is overall (not perfect, but good enough for these purposes). I have found each person's ideal match then and ordered these matches by most picky person to least picky person (basically person with the lowest best-paired score to highest best-paired score). I also found their 2nd-8th best match (there will probably be about 300 people in the data).
A test of the best-matches is below:
indexed_fake apply.fin_fake..1..max. X1 X2 X3 X4 X5 X6 X7 X8
14 14 151 3 9 8 4 10 12 2 6
4 4 177 9 5 8 7 11 3 10 12
9 9 177 4 11 3 6 10 7 12 5
5 5 179 7 4 11 3 12 10 8 5
10 10 179 12 10 2 9 3 5 6 4
13 13 182 8 1 12 11 10 5 3 2
1 1 185 7 1 3 8 6 13 2 11
7 7 185 1 12 5 7 4 6 9 11
3 3 187 12 3 8 5 9 1 2 10
8 8 190 8 12 13 3 4 11 1 6
2 2 191 6 12 11 10 3 4 5 1
6 6 191 2 11 7 1 6 9 10 8
11 11 193 12 6 9 5 2 8 11 4
12 12 193 11 3 8 7 12 10 2 5
Columns X1-X8 are the 8 best pairs for the people listed in the first columns. With this example every person would ideally get paired with someone in their top 8, ideally maximizing the pair compatibility as another user mentioned. Every person would get one pair.
Any help is appreciated!
This is not a specific answer. But it's easier to write in this space. You have a classic assignment optimization problem. These problems can be solved using packages in R. You have to assign preference weights to your feasible pairings. So for example 14-3 could be assigned 8 points, 14-9; 7 points, 14-8; 6 points...14-6; 1 point. Note that 3-14 would be assigned no points because while 14 likes 3, 3 does not like 14. The preference score for any x-y, y-x pairing could be the weight for the x-y preference plus the weight of the y-x preference.
The optimization model would choose the weighted pairs to maximize the total satisfaction among all of the the pairings.
If you have 300 people I can't think of an alternative algorithm that could be simply implemented.

Creating an summary dataset with multiple objects and multiple observations per object

I have a dataset with the reports from a local shop, where each line has a client's ID, date of purchase and total value per purchase.
I want to create a new plot where for each client ID I have all the purchases in the last month or even just sample purchases in a range of dates I choose.
The main problem is that certain customers might buy once a month, while others can come daily - so the number of observations per period of time can vary.
I have tried subsetting my dataset to a specific range of time, but either I choose a specific date - and then I only get a small % of all customers, or I choose a range and get multiple observations for certain customers.
(In this case - I wouldn't mind getting the earliest observation)
An important note: I know how to create a for loop to solve this problem, but since the dataset is over 4 million observations it isn't practical since it would take an extremely long time to run.
A basic example of what the dataset looks like:
ID Date Sum
1 1 1 234
2 1 2 45
3 1 3 1
4 2 4 223
5 3 5 546
6 4 6 12
7 2 1 20
8 4 3 30
9 6 2 3
10 3 5 45
11 7 6 456
12 3 7 65
13 8 8 234
14 1 9 45
15 3 2 1
16 4 3 223
17 6 6 546
18 3 4 12
19 8 7 20
20 9 5 30
21 11 6 3
22 12 6 45
23 14 9 456
24 15 10 65
....
And the new data set would look something like this:
ID 1Date 1Sum 2Date 2Sum 3Date 3Sum
1 1 234 2 45 3 1
2 1 20 4 223 NA NA
3 2 1 5 546 5 45
...
Thanks for your help!
I think you can do this with a bit if help from dplyr and tidyr
library(dplyr)
library(tidyr)
dd %>% group_by(ID) %>% mutate(seq=1:n()) %>%
pivot_wider("ID", names_from="seq", values_from = c("Date","Sum"))
Where dd is your sample data frame above.

Coloring dgCMatrix image by factor in R

I am trying to color a sparse matrix image according to a grouping factor. I know the solution is related to matrix coloring in the lattice package but I have troubles to handle it.
I have a list of hits on an app list. Every hit is related to a user and a app at a specific time.
- On the y axis are users sorted by first install of the app
Every user then has a new line for his pages hits
- On the x axis is the time
Points are hits
Here is a preview of the data:
library(Matrix)
indexUser indexInstall time
1 1 1 3
2 1 1 17
3 1 1 19
4 1 1 32
5 1 1 81
6 1 1 86
7 1 1 124
8 1 1 231
9 1 1 233
10 1 2 249
11 2 3 4
12 2 3 6
13 2 3 7
14 2 3 15
15 2 3 25
16 2 3 32
17 2 3 45
18 2 3 74
19 2 3 75
20 3 4 36
21 3 4 37
22 3 4 113
23 4 5 69
24 4 5 70
25 4 5 71
I then create a sparse matrix as the full dataset is way larger than that (10000+ x 1000)
sM <- sparseMatrix(i=dat$indexInstall, j=dat$time, x=1)
And show an image of it:
image(sM)
I want to color every lines according to the indexUser column. For example to plot user 1 in blue and all others un red
Thanks in advance

How to join scatterplots together using ggplot in R?

I have been trying to join scatterplots together in one table of data:
The data is:
row Var1 Var2 Freq
1 0 Good 17
2 1 Good 479
3 2 Good 455
4 3 Good 273
5 4 Good 155
6 5 Good 9
7 0 Average 81
8 1 Average 2449
9 2 Average 4627
10 3 Average 3261
11 4 Average 3142
12 5 Average 110
13 0 Bad 74
14 1 Bad 1472
15 2 Bad 3881
16 3 Bad 3399
17 4 Bad 5431
18 5 Bad 188
Joining together in the following format is a problem since I can't join lines and get a chart as attached. I want the different scatter plots linked together
(edit)The code for the plot is below:
ggplot(dfs, aes(Var1, Freq, colour=Var2))+geom_line()+geom_point()
The result should look something like (that I have successfully joined - the same data but using another Var1 as the x axis):

Resources