All possible unique pair combinations of gamete positions - r

I have some gamete data in the following format:
Ind Letter Place Position
1 A 19 23
2 B 19 23
3 B 19 23
4 B 19 23
1 B 19 34
2 A 19 34
3 B 19 34
4 B 19 34
1 C 19 52
2 T 19 52
3 C 19 52
4 T 19 52
1 T 33 15
2 T 33 15
3 T 33 15
4 C 33 15
1 C 33 26
2 T 33 26
3 T 33 26
4 C 33 26
dput of data:
structure(list(Ind = c(1L,2L,3L,4L,1L,2L,3L,4L,1L,2L,3L,4L,1L,2L,3L,4L,1L,2L,3L,4L),
Letter = structure(c(1L,2L,2L,2L,2L,1L,2L,2L,3L,4L,3L,4L,4L,4L,4L,3L,3L,4L,4L,3L),
.Label = c("A","B","C","T"), class="factor"),
Place = c(19L,19L,19L,19L,19L,19L,19L,19L,19L,19L,19L,19L,33L,33L,33L,33L,33L,33L,33L,33L),
Position = c(23L,23L,23L,23L,34L,34L,34L,34L,52L,52L,52L,52L,15L,15L,15L,15L,26L,26L,26L,26L)),
.Names = c("Ind","Letter","Place","Position"),
class="data.frame", row.names = c(NA,-20L))
I need to pair and combine them, so I get all possible unique combinations with reference to Position within a pair. I have another data-file, that contains information on the pairs, and they are paired with reference to Place. So in this file I may see, that Place 19+Place 33 is a pair, and I want the following result:
Ind Letter Place Position Ind Letter Place Position
1 A 19 23 1 T 33 15
2 B 19 23 2 T 33 15
3 B 19 23 3 T 33 15
4 B 19 23 4 C 33 15
1 A 19 23 1 C 33 26
2 B 19 23 2 T 33 26
3 B 19 23 3 T 33 26
4 B 19 23 4 C 33 26
1 B 19 34 1 T 33 15
2 A 19 34 2 T 33 15
3 B 19 34 3 T 33 15
4 B 19 34 4 C 33 15
1 B 19 34 1 C 33 26
2 A 19 34 2 T 33 26
3 B 19 34 3 T 33 26
4 B 19 34 4 C 33 26
1 C 19 52 1 T 33 15
2 T 19 52 2 T 33 15
3 C 19 52 3 T 33 15
4 T 19 52 4 C 33 15
1 C 19 52 1 C 33 26
2 T 19 52 2 T 33 26
3 C 19 52 3 T 33 26
4 T 19 52 4 C 33 26
In this case unique means that A1:A2 is equal to A2:A1.
The reason I want to do this, is because I want to do a Four-Gamete-Test on the pairs, to the see if all possible combinations of Letter is existent. So e.g. for the last combined pair above, we have the letter-pairs CC, TT, CT, TC, so this combined pair will pass the FGT.
I have tried to do the combining with expand.grid, as it seems this is quite close to what I want. However, when I require all combination of data$Position, I lose the information for Ind, Letter, and Place. Also the output includes non-unique pairs.
Can anyone point me to a tool, that is closer to what I want? Or give me some guidelines on how to modify expand.grid, to get what I need.
Should you be aware of a tool, that actually does the Four-Gamete-Test, or something similar, then that would of course also be interesting for me to look at.

You can use expand.grid but not directly on the Position column. The idea is to find all combinations of the "quartets" (unique Positions):
pair <- c(19, 33)
df1 <- df1[df1$Place %in% pair, ]
split1 <- split( df1, df1$Position)
vec1 <- unique(df1$Position[df1$Place == pair[1]])
vec2 <- unique(df1$Position[df1$Place == pair[2]])
combin_num <- expand.grid(vec2, vec1)[,2:1]
do.call(
rbind,
lapply(seq_len(nrow(combin_num)), function(i){
cbind( split1[[as.character(combin_num[i,1])]],
split1[[as.character(combin_num[i,2])]] )
})
)[,]
Result:
# Ind Letter Place Position Ind.1 Letter.1 Place.1 Position.1
# 1 1 A 19 23 1 T 33 15
# 2 2 B 19 23 2 T 33 15
# 3 3 B 19 23 3 T 33 15
# 4 4 B 19 23 4 C 33 15
# 5 1 A 19 23 1 C 33 26
# 6 2 B 19 23 2 T 33 26
# 7 3 B 19 23 3 T 33 26
# 8 4 B 19 23 4 C 33 26
# 51 1 B 19 34 1 T 33 15
# 61 2 A 19 34 2 T 33 15
# 71 3 B 19 34 3 T 33 15
# 81 4 B 19 34 4 C 33 15
# 52 1 B 19 34 1 C 33 26
# 62 2 A 19 34 2 T 33 26
# 72 3 B 19 34 3 T 33 26
# 82 4 B 19 34 4 C 33 26
# 9 1 C 19 52 1 T 33 15
# 10 2 T 19 52 2 T 33 15
# 11 3 C 19 52 3 T 33 15
# 12 4 T 19 52 4 C 33 15
# 91 1 C 19 52 1 C 33 26
# 101 2 T 19 52 2 T 33 26
# 111 3 C 19 52 3 T 33 26
# 121 4 T 19 52 4 C 33 26

Related

Transpose from long to wide with pair groups in r

I have descriptive statistics for four groups. My sample dataset is:
df <- data.frame(
Grade = c(3,3,3,3,4,4,4,4),
group = c("none","G1","G2","both","none","G1","G2","both"),
mean=c(10,12,13,12,11,18,19,20),
sd=c(22,12,22,12,11,13,14,15),
N=c(35,33,34,32,43,45,46,47))
> df
Grade group mean sd N
1 3 none 10 22 35
2 3 G1 12 12 33
3 3 G2 13 22 34
4 3 both 12 12 32
5 4 none 11 11 43
6 4 G1 18 13 45
7 4 G2 19 14 46
8 4 both 20 15 47
I would like to compare groups as pairs and need the descriptive information side by side for each pair.
Here is what I would like to have:
So, each grade has 6 pairs of groups.
Does anyone have any idea on this?
Thanks!
1) sqldf We can join df to itself on the indicated condition. Note that we escaped group since group is an sql keyword.
library(sqldf)
sqldf('select
a.Grade,
a.[group] Group1, b.[group] Group2,
a.mean mean1, b.mean mean2,
a.sd sd1, b.sd sd2,
a.N n1, b.N n2
from df a
join df b on a.Grade = b.Grade and a.[group] > b.[group]')
giving:
Grade Group1 Group2 mean1 mean2 sd1 sd2 n1 n2
1 3 none G1 10 12 22 12 35 33
2 3 none G2 10 13 22 22 35 34
3 3 none both 10 12 22 12 35 32
4 3 G2 G1 13 12 22 12 34 33
5 3 both G1 12 12 12 12 32 33
6 3 both G2 12 13 12 22 32 34
7 4 none G1 11 18 11 13 43 45
8 4 none G2 11 19 11 14 43 46
9 4 none both 11 20 11 15 43 47
10 4 G2 G1 19 18 14 13 46 45
11 4 both G1 20 18 15 13 47 45
12 4 both G2 20 19 15 14 47 46
2) base R We can perform a merge on part of the condition and then subset it for the remainder. The names are slightly different so you will need to change them if that is important.
subset(merge(df, df, by = "Grade"), group.x > group.y)
giving:
Grade group.x mean.x sd.x N.x group.y mean.y sd.y N.y
2 3 none 10 22 35 G1 12 12 33
3 3 none 10 22 35 G2 13 22 34
4 3 none 10 22 35 both 12 12 32
8 3 G1 12 12 33 both 12 12 32
10 3 G2 13 22 34 G1 12 12 33
12 3 G2 13 22 34 both 12 12 32
18 4 none 11 11 43 G1 18 13 45
19 4 none 11 11 43 G2 19 14 46
20 4 none 11 11 43 both 20 15 47
24 4 G1 18 13 45 both 20 15 47
26 4 G2 19 14 46 G1 18 13 45
28 4 G2 19 14 46 both 20 15 47

How to add a column after the last column containing a certain string?

I have a dataset like this
df <- data.frame(abc_1 = 1:10, abc_2 = 11:20, abc_3 = 21:30, somevar = 31:40)
head(df)
abc_1 abc_2 abc_3 somevar
1 1 11 21 31
2 2 12 22 32
3 3 13 23 33
4 4 14 24 34
5 5 15 25 35
6 6 16 26 36
I would like to insert a new column after abc_3 (in my case the row sums of abc_1, abc_2, abc_3). Since (a) the dataset is huge, (b) I might decide to manipulate the dataset before adding the column (i.e. messing up the column indices), and since (c) I would like to do this for many variables that contain some different string, I am looking to use a way of doing this without referencing the column index but rather by matching the string abc.
I have found add_column in the tibble package but it only allows adding the column by its index like so
library(tibble)
add_column(df, abc_sum = rowSum(abc_1, abc_2, abc_3), .after = 3)
What I want is something like this:
abc_1 abc_2 abc_3 abc_sum somevar
1 1 11 21 33 31
2 2 12 22 36 32
3 3 13 23 39 33
4 4 14 24 42 34
5 5 15 25 45 35
6 6 16 26 48 36
I am looking to replace the .after = 3 with an expression that inserts it after abc_3 by matching the string abc_3.
You can do:
add_column(df,
abc_sum = rowSums(df[startsWith(names(df), "abc")]),
.after = "abc_3")
abc_1 abc_2 abc_3 abc_sum somevar
1 1 11 21 33 31
2 2 12 22 36 32
3 3 13 23 39 33
4 4 14 24 42 34
5 5 15 25 45 35
6 6 16 26 48 36
7 7 17 27 51 37
8 8 18 28 54 38
9 9 19 29 57 39
10 10 20 30 60 40
We can use reduce
library(dplyr)
library(purrr)
df %>%
mutate(abc_sum = select(., starts_with('abc')) %>%
reduce(`+`)) %>%
select(starts_with('abc'), everything())
# abc_1 abc_2 abc_3 abc_sum somevar
#1 1 11 21 33 31
#2 2 12 22 36 32
#3 3 13 23 39 33
#4 4 14 24 42 34
#5 5 15 25 45 35
#6 6 16 26 48 36
#7 7 17 27 51 37
#8 8 18 28 54 38
#9 9 19 29 57 39
#10 10 20 30 60 40

Subtract and find the difference of a value or volume

I have a volume measurements of brain parts (optic lobe, olfactory lobe, auditory cortex, etc), all the parts will add up to total brain volume. As shown in the example dataframe here.
a b c d e total
1 2 3 4 5 15
2 3 4 5 6 20
4 6 7 8 9 34
7 8 10 10 15 50
I would like to find the find the difference of brain volume if I subtract one components out of total volume.
So I was wondering how to go about it in R, without having to create a new column for every brain part.
For example: (total - a = 14, total - b =13, and so on for other components).
total-a total-b total-c total-d total-e
14 13 12 11 10
18 17 16 15 14
30 28 27 26 25
43 42 40 40 35
You can do
dat[, "total"] - dat[1:5]
# a b c d e
#1 14 13 12 11 10
#2 18 17 16 15 14
#3 30 28 27 26 25
#4 43 42 40 40 35
If you want also the column names, then one tidyverse possibility could be:
df %>%
gather(var, val, -total) %>%
mutate(var = paste0("total-", var),
val = total - val) %>%
spread(var, val)
total total-a total-b total-c total-d total-e
1 15 14 13 12 11 10
2 20 18 17 16 15 14
3 34 30 28 27 26 25
4 50 43 42 40 40 35
If you do not care about the column names, then with just dplyr you can do:
df %>%
mutate_at(vars(-matches("(total)")), list(~ total - .))
a b c d e total
1 14 13 12 11 10 15
2 18 17 16 15 14 20
3 30 28 27 26 25 34
4 43 42 40 40 35 50
Or without column names with just base R:
df[, grepl("total", names(df))] - df[, !grepl("total", names(df))]
a b c d e
1 14 13 12 11 10
2 18 17 16 15 14
3 30 28 27 26 25
4 43 42 40 40 35

Creating an index for each subject in R

I'm working with some data on repeated measures of subjects over time. The data is in this format:
Subject <- as.factor(c(rep("A", 20), rep("B", 35), rep("C", 13)))
variable.A <- rnorm(mean = 300, sd = 50, n = Subject)
dat <- data.frame(Subject, variable.A)
dat
Subject variable.A
1 A 334.6567
2 A 353.0988
3 A 244.0863
4 A 284.8918
5 A 302.6442
6 A 298.3162
7 A 271.4864
8 A 268.6848
9 A 262.3761
10 A 341.4224
11 A 190.4823
12 A 297.1981
13 A 319.8346
14 A 343.9855
15 A 332.5318
16 A 221.9502
17 A 412.9172
18 A 283.4206
19 A 310.9847
20 A 276.5423
21 B 181.5418
22 B 340.5812
23 B 348.5162
24 B 364.6962
25 B 312.2508
26 B 278.9855
27 B 242.8810
28 B 272.9585
29 B 239.2776
30 B 254.9140
31 B 253.8940
32 B 330.1918
33 B 300.7302
34 B 237.6511
35 B 314.4919
36 B 239.6195
37 B 282.7955
38 B 260.0943
39 B 396.5310
40 B 325.5422
41 B 374.8063
42 B 363.1897
43 B 258.0310
44 B 358.8605
45 B 251.8775
46 B 299.6995
47 B 303.4766
48 B 359.8955
49 B 299.7089
50 B 289.3128
51 B 401.7680
52 B 276.8078
53 B 441.4852
54 B 232.6222
55 B 305.1977
56 C 298.4580
57 C 210.5164
58 C 272.0228
59 C 282.0540
60 C 207.8797
61 C 263.3859
62 C 324.4417
63 C 273.5904
64 C 348.4389
65 C 174.2979
66 C 363.4353
67 C 260.8548
68 C 306.1833
I've used the seq_along() function and the dplyr package to create an index of each observation for every subject:
dat <- as.data.frame(dat %>%
group_by(Subject) %>%
mutate(index = seq_along(Subject)))
Subject variable.A index
1 A 334.6567 1
2 A 353.0988 2
3 A 244.0863 3
4 A 284.8918 4
5 A 302.6442 5
6 A 298.3162 6
7 A 271.4864 7
8 A 268.6848 8
9 A 262.3761 9
10 A 341.4224 10
11 A 190.4823 11
12 A 297.1981 12
13 A 319.8346 13
14 A 343.9855 14
15 A 332.5318 15
16 A 221.9502 16
17 A 412.9172 17
18 A 283.4206 18
19 A 310.9847 19
20 A 276.5423 20
21 B 181.5418 1
22 B 340.5812 2
23 B 348.5162 3
24 B 364.6962 4
25 B 312.2508 5
26 B 278.9855 6
27 B 242.8810 7
28 B 272.9585 8
29 B 239.2776 9
30 B 254.9140 10
31 B 253.8940 11
32 B 330.1918 12
33 B 300.7302 13
34 B 237.6511 14
35 B 314.4919 15
36 B 239.6195 16
37 B 282.7955 17
38 B 260.0943 18
39 B 396.5310 19
40 B 325.5422 20
41 B 374.8063 21
42 B 363.1897 22
43 B 258.0310 23
44 B 358.8605 24
45 B 251.8775 25
46 B 299.6995 26
47 B 303.4766 27
48 B 359.8955 28
49 B 299.7089 29
50 B 289.3128 30
51 B 401.7680 31
52 B 276.8078 32
53 B 441.4852 33
54 B 232.6222 34
55 B 305.1977 35
56 C 298.4580 1
57 C 210.5164 2
58 C 272.0228 3
59 C 282.0540 4
60 C 207.8797 5
61 C 263.3859 6
62 C 324.4417 7
63 C 273.5904 8
64 C 348.4389 9
65 C 174.2979 10
66 C 363.4353 11
67 C 260.8548 12
68 C 306.1833 13
What I'm now looking to do is set up an analysis that looks at every 10 observations, so I'd like to create another column that basically gives me a number for every 10 observations. For example, Subject A would have a sequence of ten "1's" followed by a sequence of ten "2's" (IE, two groupings of 10). I've tried to use the rep() function but the issue I'm running into is that the other subjects don't have a number of observations that is divisible by 10.
Is there a way for the rep() function to just assign the grouping the next number, even if it doesn't have 10 total observations? For example, Subject B would have ten "1's", ten "2's" and then five "3's" (representing that his last group of observations)?
You can use modular division %/% to generate the ids:
dat %>%
group_by(Subject) %>%
mutate(chunk_id = (seq_along(Subject) - 1) %/% 10 + 1) -> dat1
table(dat1$Subject, dat1$chunk_id)
# 1 2 3 4
# A 10 10 0 0
# B 10 10 10 5
# C 10 3 0 0
For a plain vanilla base R solution, you also could try this:
dat$newcol <- 1
dat$index <- ave(dat$newcol, dat$Subject, FUN = cumsum)
dat$chunk_id <- (dat$index - 1) %/% 10 + 1
which, when you run the table command as above gives you
table(dat$Subject, dat$chunk_id)
1 2 3 4
A 10 10 0 0
B 10 10 10 5
C 10 3 0 0
If you don't want the extra 'newcol' column, just use 'NULL' to get rid of it:
dat$newcol <- NULL

Add data frames row wise with [d]plyr

I have two data frames
df1
# a b
# 1 10 20
# 2 11 21
# 3 12 22
# 4 13 23
# 5 14 24
# 6 15 25
df2
# a b
# 1 4 8
I want the following output:
df3
# a b
# 1 14 28
# 2 15 29
# 3 16 30
# 4 17 31
# 5 18 32
# 6 19 33
i.e. add df2 to each row of df1.
Is there a way to get the desired output using plyr (mdplyr??) or dplyr?
I see no reason for "dplyr" for something like this. In base R you could just do:
df1 + unclass(df2)
# a b
# 1 14 28
# 2 15 29
# 3 16 30
# 4 17 31
# 5 18 32
# 6 19 33
Which is the same as df1 + list(4, 8).
One liner with dplyr.
mutate_each(df1, funs(.+ df2$.), a:b)
# a b
#1 14 28
#2 15 29
#3 16 30
#4 17 31
#5 18 32
#6 19 33
A base R solution using sweet function sweep:
sweep(df1, 2, unlist(df2), '+')
# a b
#1 14 28
#2 15 29
#3 16 30
#4 17 31
#5 18 32
#6 19 33

Resources